官术网_书友最值得收藏!

Reshaping data

Reshaping data is similar to creating a contingency table, which enables the user to aggregate data of specific values. The reshape2 package is designed for this specific purpose. Here, we introduce how to use the reshape2 package to transform our dataset from long to wide format with the dcast function. We also cover how to transform it from wide format back to long format with the melt function.

Getting ready

Refer to the Merging data recipe and merge employees and salaries into employees_salary.

How to do it…

Perform the following steps to reshape data:

  1. First, we can use the dcast function to transform data from long to wide:
    > wide_salaries <- dcast(salaries, emp_no ~ year(ymd(from_date)), value.var="salary")
    > wide_salaries[1:3, 1:7]
     emp_no 1985 1986 1987 1988 1989 1990
    1 10001 NA 60117 62102 66074 66596 66961
    2 10002 NA NA NA NA NA NA
    3 10003 NA NA NA NA NA NA
    
  2. We can also transform the data by keeping emp_no and the formatted name string as two of the attributes. Then, we can shape the year of salary payment as the column name and shape salary as its value:
    > wide_employees_salary <- dcast(employees_salary, emp_no + paste(first_name, last_name) ~ year(ymd(from_date)), value.var="salary", variable.name="condition")
    > wide_employees_salary[1:3,1:7]
     emp_no paste(first_name, last_name) 1985 1986 1987 1988 1989
    1 10001 Georgi Facello NA 60117 62102 66074 66596
    2 10002 Bezalel Simmel NA NA NA NA NA
    3 10003 Parto Bamford NA NA NA NA NA
    
  3. On the other hand, we can transform the data from wide back to long format using the melt function:
    > long_salaries <- melt(wide_salaries, id.vars=c("emp_no"))
    > head(long_salaries)
     emp_no variable value
    1 10001 1985 NA
    2 10002 1985 NA
    3 10003 1985 NA
    4 10004 1985 NA
    5 10005 1985 NA
    6 10006 1985 NA
    
  4. To remove data with missing values in long_salaries, we can use na.omit to remove this data:
    > head(na.omit(long_salaries))
     emp_no variable value
    9 10009 1985 60929
    13 10013 1985 40000
    48 10048 1985 40000
    64 10064 1985 40000
    70 10070 1985 55999
    98 10098 1985 40000
    

How it works…

In this recipe, we demonstrated how to reshape data with the reshape2 package. First, we used the dcast function to transform data from the long to wide format. Using this function, we specified that we wanted the data to transform as salaries in the first parameter. Then, we specified the shaping formula by setting emp_no as row and the year of salary paid as the column. Finally, we set salary as presented value in wide format layout.

We can also shape the data with multiple columns; all that we need to do is add another column information on the left-hand side of the formula with a + operator. Thus, in our second case, we set the shaping formula as emp_no + paste(first_name, last_name) ~ year(ymd(from_date)) on the employees_salary merged data. Then, we saw that the output data was presented with emp_no and the formatted name on the left-hand side, and the year of salary paid on the upper side and the presented salary as the value.

Besides shaping the data from long format to wide format, we can also reshape the data back to long format using the melt function. Therefore, we can transform the wide_salaries data back to long format using emp_no as the basis. As there are many missing values (presented as NA), we can use the na.omit function to remove these records.

There's more…

In addition to the dcast and melt functions in the plyr package, we can use stack and unstack to ungroup and group values:

  1. We can group data by value using the unstack function:
    > un_salaries <- unstack(long_salaries[,c(3,1)])
    > head(un_salaries, 3)
    $`10001`
     [1] 60117 62102 66074 66596 66961 71046 74333 75286 75994 76884 80013
    [12] 81025 81097 84917 85112 85097 88958
    
    $`10002`
    [1] 65828 65909 67534 69366 71963 72527
    
    $`10003`
    [1] 40006 43616 43466 43636 43478 43699 43311
    
  2. In contrast, we can concatenate multiple data frames or lists using the stack function:
    > stack_salaries <- stack(un_salaries )
    > head(stack_salaries)
     values ind
    1 60117 10001
    2 62102 10001
    3 66074 10001
    4 66596 10001
    5 66961 10001
    6 71046 10001
    
主站蜘蛛池模板: 金华市| 云安县| 武汉市| 平果县| 九寨沟县| 哈巴河县| 姜堰市| 广德县| 清河县| 阿荣旗| 图们市| 马公市| 屏山县| 锦屏县| 义马市| 漠河县| 凤冈县| 南昌市| 西贡区| 长顺县| 灌南县| 邮箱| 区。| 奉节县| 襄汾县| 普兰县| 泸州市| 革吉县| 武山县| 河津市| 庐江县| 柯坪县| 星座| 和田县| 元朗区| 佛坪县| 云南省| 革吉县| 安多县| 湛江市| 高密市|