官术网_书友最值得收藏!

Merging data

Merging data enables us to understand how different data sources relate to each other. The merge operation in R is similar to the join operation in a database, which combines fields from two datasets using values that are common to each.

Getting ready

Refer to the Converting data types recipe and convert each attribute of imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.

How to do it…

Perform the following steps to merge salaries and employees:

  1. As employees and salaries are common in emp_no, we can merge these two datasets using emp_no as the join key:
    > employees_salary <- merge(employees, salaries, by="emp_no")
    > head(employees_salary,3)
     emp_no birth_date first_name last_name salary from_date to_date
    1 10001 1953-09-02 Georgi Facello 60117 1986-06-26 1987-06-26
    2 10001 1953-09-02 Georgi Facello 62102 1987-06-26 1988-06-25
    3 10001 1953-09-02 Georgi Facello 66596 1989-06-25 1990-06-25
    
  2. Or, we can assign NULL to the attribute that we want to drop:
    > merge(employees, salaries, by="emp_no", all.x =TRUE)
    
  3. In addition to the merge function, we can install and load the plyr package to manipulate data:
    > install.packages("plyr")
    > library(plyr)
    
  4. Besides the standard merge function, we can use the join function in plyr to merge data:
    > join(employees, salaries, by="emp_no")
    

How it works…

Similarly to data tables in a database, we sometimes need to combine two datasets for the purpose of correlating data. In R, we can simply combine two different data frames with common values using the merge function.

In the merge function, we use both salaries and employees as our input data frame. For the by parameter, we can specify emp_no as the key to join these two tables. We will then see that the data with the same emp_no value has now merged into a new data frame. However, sometimes we want to perform either a left join or a right join for the purpose of preserving every data value from either employees or salaries. To perform the left join, we can set all.x to TRUE. Then, we can find every row from the employees dataset preserved in the merged dataset. On the other hand, if one wants to preserve all rows from the salaries dataset, we can set all.y to TRUE.

In addition to using the built-in merge function, we can install the plyr package to merge datasets. The usage of join is very similar to merge; we only have to specify the data to join and the columns with the common values within the by parameter.

There's more…

In the plyr package, we can use the join_all function to join recursive datasets within a list. Here, we can use join_all to join the employees and salaries datasets by emp_no:

> join_all(list(employees, salaries), "emp_no")
主站蜘蛛池模板: 庐江县| 永胜县| 怀化市| 饶阳县| 翁源县| 榆树市| 普定县| 竹溪县| 图片| 潍坊市| 玛曲县| 洛隆县| 黄大仙区| 铜川市| 浙江省| 阿尔山市| 湖北省| 天等县| 湄潭县| 崇阳县| 特克斯县| 双桥区| 新津县| 伊川县| 大同县| 富蕴县| 繁昌县| 庆安县| 平度市| 武邑县| 丽水市| 公主岭市| 乐安县| 桑植县| 榆社县| 松潘县| 阜阳市| 长岛县| 高安市| 普宁市| 即墨市|