965棋牌官网

書(shū)名： R for Data Science Cookbook
作者名： Yu Wei Chiu (David Chiu)
本章字?jǐn)?shù)： 659字
更新時(shí)間： 2021-07-14 10:51:30

Reshaping data

Reshaping data is similar to creating a contingency table, which enables the user to aggregate data of specific values. The reshape2 package is designed for this specific purpose. Here, we introduce how to use the reshape2 package to transform our dataset from long to wide format with the dcast function. We also cover how to transform it from wide format back to long format with the melt function.

Getting ready

Refer to the Merging data recipe and merge employees and salaries into employees_salary.

How to do it…

Perform the following steps to reshape data:

First, we can use the dcast function to transform data from long to wide:

> wide_salaries <- dcast(salaries, emp_no ~ year(ymd(from_date)), value.var="salary")
> wide_salaries[1:3, 1:7]
 emp_no 1985 1986 1987 1988 1989 1990
1 10001 NA 60117 62102 66074 66596 66961
2 10002 NA NA NA NA NA NA
3 10003 NA NA NA NA NA NA

We can also transform the data by keeping emp_no and the formatted name string as two of the attributes. Then, we can shape the year of salary payment as the column name and shape salary as its value:

> wide_employees_salary <- dcast(employees_salary, emp_no + paste(first_name, last_name) ~ year(ymd(from_date)), value.var="salary", variable.name="condition")
> wide_employees_salary[1:3,1:7]
 emp_no paste(first_name, last_name) 1985 1986 1987 1988 1989
1 10001 Georgi Facello NA 60117 62102 66074 66596
2 10002 Bezalel Simmel NA NA NA NA NA
3 10003 Parto Bamford NA NA NA NA NA

On the other hand, we can transform the data from wide back to long format using the melt function:

> long_salaries <- melt(wide_salaries, id.vars=c("emp_no"))
> head(long_salaries)
 emp_no variable value
1 10001 1985 NA
2 10002 1985 NA
3 10003 1985 NA
4 10004 1985 NA
5 10005 1985 NA
6 10006 1985 NA

To remove data with missing values in long_salaries, we can use na.omit to remove this data:

> head(na.omit(long_salaries))
 emp_no variable value
9 10009 1985 60929
13 10013 1985 40000
48 10048 1985 40000
64 10064 1985 40000
70 10070 1985 55999
98 10098 1985 40000

How it works…

In this recipe, we demonstrated how to reshape data with the reshape2 package. First, we used the dcast function to transform data from the long to wide format. Using this function, we specified that we wanted the data to transform as salaries in the first parameter. Then, we specified the shaping formula by setting emp_no as row and the year of salary paid as the column. Finally, we set salary as presented value in wide format layout.

We can also shape the data with multiple columns; all that we need to do is add another column information on the left-hand side of the formula with a + operator. Thus, in our second case, we set the shaping formula as emp_no + paste(first_name, last_name) ~ year(ymd(from_date)) on the employees_salary merged data. Then, we saw that the output data was presented with emp_no and the formatted name on the left-hand side, and the year of salary paid on the upper side and the presented salary as the value.

Besides shaping the data from long format to wide format, we can also reshape the data back to long format using the melt function. Therefore, we can transform the wide_salaries data back to long format using emp_no as the basis. As there are many missing values (presented as NA), we can use the na.omit function to remove these records.

There's more…

In addition to the dcast and melt functions in the plyr package, we can use stack and unstack to ungroup and group values:

We can group data by value using the unstack function:

> un_salaries <- unstack(long_salaries[,c(3,1)])
> head(un_salaries, 3)
$`10001`
 [1] 60117 62102 66074 66596 66961 71046 74333 75286 75994 76884 80013
[12] 81025 81097 84917 85112 85097 88958

$`10002`
[1] 65828 65909 67534 69366 71963 72527

$`10003`
[1] 40006 43616 43466 43636 43478 43699 43311

In contrast, we can concatenate multiple data frames or lists using the stack function:

> stack_salaries <- stack(un_salaries )
> head(stack_salaries)
 values ind
1 60117 10001
2 62102 10001
3 66074 10001
4 66596 10001
5 66961 10001
6 71046 10001

官术网_书友最值得收藏!

R for Data Science Cookbook

Reshaping data

Getting ready

How to do it…

How it works…

There's more…