官术网_书友最值得收藏!

Filtering data

Data filtering is the most common requirement for users who want to analyze partial data of interest rather than the whole dataset. In database operations, we can use a SQL command with a where clause to subset the data. In R, we can simply use the square bracket to perform filtering.

Getting ready

Refer to the Converting data types recipe and convert each attribute of imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.

How to do it…

Perform the following steps to filter data:

  1. First, use head and tail to subset the first three rows and last three rows from the employees dataset:
    > head(employees, 3)
     emp_no birth_date first_name last_name gender hire_date
    1 10001 1953-09-02 Georgi Facello M 1986-06-26
    2 10002 1964-06-02 Bezalel Simmel F 1985-11-21
    3 10003 1959-12-03 Parto Bamford M 1986-08-28
    
    > tail(employees, 3)
     emp_no birth_date first_name last_name gender hire_date
    8 10008 1958-02-19 Saniya Kalloufi M 1994-09-15
    9 10009 1952-04-19 Sumant Peac F 1985-02-18
    10 10010 1963-06-01 Duangkaew Piveteau F 1989-08-24
    
  2. You can also use the square bracket to subset the first three rows of the data with a given sequence from 1 to 3:
    > employees[1:3,]
     emp_no birth_date first_name last_name gender hire_date
    1 10001 1953-09-02 Georgi Facello M 1986-06-26
    2 10002 1964-06-02 Bezalel Simmel F 1985-11-21
    3 10003 1959-12-03 Parto Bamford M 1986-08-28
    
  3. Then, you can also specify the sequence of columns that you should select:
    > employees[1:3, 2:4]
     birth_date first_name last_name
    1 1953-09-02 Georgi Facello
    2 1964-06-02 Bezalel Simmel
    3 1959-12-03 Parto Bamford
    
  4. Besides subsetting a sequence of columns and rows from the dataset, you can specify certain rows and columns to subset with index vectors:
    > employees[c(2,5), c(1,3)]
     emp_no first_name
    2 10002 Bezalel
    5 10005 Kyoichi
    
  5. If you know the name of the column, you can also select columns with a given name vector:
    > employees[1:3, c("first_name","last_name")]
     first_name last_name
    1 Georgi Facello
    2 Bezalel Simmel
    3 Parto Bamford
    
  6. On the other hand, you can exclude columns with negative index:
    > employees[1:3,-6]
     emp_no birth_date first_name last_name gender
    1 10001 1953-09-02 Georgi Facello M
    2 10002 1964-06-02 Bezalel Simmel F
    3 10003 1959-12-03 Parto Bamford M
    
  7. You can also exclude some attributes using the in and ! operators:
    > employees[1:3, !names(employees) %in% c("last_name", "first_name")]
     emp_no birth_date gender hire_date
    1 10001 1953-09-02 M 1986-06-26
    2 10002 1964-06-02 F 1985-11-21
    3 10003 1959-12-03 M 1986-08-28
    
  8. Furthermore, you can set the equal condition to subset data:
    > employees[employees$gender == 'M',]
     emp_no birth_date first_name last_name gender hire_date
    1 10001 1953-09-02 Georgi Facello M 1986-06-26
    3 10003 1959-12-03 Parto Bamford M 1986-08-28
    4 10004 1954-05-01 Chirstian Koblick M 1986-12-01
    5 10005 1955-01-21 Kyoichi Maliniak M 1989-09-12
    8 10008 1958-02-19 Saniya Kalloufi M 1994-09-15
    
  9. You can use a comparison operator to subset data:
    > salaries[salaries$salary >= 60000 & salaries$salary < 70000,]
    
  10. Additionally, the substr function can extract partial records:
    > employees[substr(employees$first_name,0,2)=="Ge",]
     emp_no birth_date first_name last_name gender hire_date
    1 10001 1953-09-02 Georgi Facello M 1986-06-26
    
  11. The regular expression is another useful and powerful tool for a user to subset data that they are interested in:
    > employees[grep('[aeiou]$', employees$first_name),]
     emp_no birth_date first_name last_name gender hire_date
    1 10001 1953-09-02 Georgi Facello M 1986-06-26
    3 10003 1959-12-03 Parto Bamford M 1986-08-28
    5 10005 1955-01-21 Kyoichi Maliniak M 1989-09-12
    6 10006 1953-04-20 Anneke Preusig F 1989-06-02
    8 10008 1958-02-19 Saniya Kalloufi M 1994-09-15
    

How it works…

In this recipe, we demonstrated how to filter data with R. In the first case, we used the head and tail functions to examine the first few rows. By default, the head and tail functions will return the first six rows and last six rows of the dataset. We can still specify the number of records to subset in the second input parameter of the function.

Besides using the head and tail functions, we can use a square bracket to subset data. Using the square bracket, the value on the left-hand side of the comma assigns the rows to subset, and the value on the right-hand side of the comma indicates the columns to select. In the second step, we demonstrated that we can subset the first three records from the dataset by assigning a sequence of 1 to 3 on the left-hand side of the comma. If we do not specify anything on the right-hand side of the comma, this means that we will select all variables in our subset. Otherwise, we can specify the columns to select on the right-hand side of the comma. Similarly to step 3, we can select the second to fourth column by specifying a sequence on the right-hand side of the comma; select columns with a given c(3,5) index vector. Furthermore, we can select data with a given c("first_name","last_name") attribute name vector.

In addition to selecting the variables that we require, we can exclude the columns that we do not need with a negative index. Thus, we can place -6 on the right-hand side of the comma to exclude the sixth column from the dataset. We can also use both the ! and in operators to exclude data with a certain column name. In the seventh case, we can exclude attributes with the name as first_name and last_name.

Furthermore, we can filter data similar to SQL operation with a given condition. Here, as we need to use conditions to filter records from the data, we should place the criteria on the left-hand side of the comma. Thus, in cases 8 to 10, we demonstrate that we can filter male employee data with an equality condition, extract salary data ranging between 60,000 and 70,000, and retrieve employees with the first two letters matching Ge with the substr function. Finally, we can also employ the grep function along with a regular expression to subset employees with a vowel as the final letter of their first name.

There's more…

Besides using square brackets, we can also use the subset function to subset data:

  1. We can select first_name and last_name of the first three rows of the employees data:
    > subset(employees, rownames(employees) %in% 1:3, select=c("first_name","last_name"))
     first_name last_name
    1 Georgi Facello
    2 Bezalel Simmel
    3 Parto Bamford
    
  2. We can also set the condition to filter employee data by gender:
    >subset(employees, employees$gender == 'M')
     emp_no birth_date first_name last_name gender hire_date
    1 10001 1953-09-02 Georgi Facello M 1986-06-26
    3 10003 1959-12-03 Parto Bamford M 1986-08-28
    4 10004 1954-05-01 Chirstian Koblick M 1986-12-01
    5 10005 1955-01-21 Kyoichi Maliniak M 1989-09-12
    8 10008 1958-02-19 Saniya Kalloufi M 1994-09-15
    
主站蜘蛛池模板: 东至县| 边坝县| 五莲县| 竹北市| 射阳县| 五大连池市| 元谋县| 石渠县| 盐亭县| 万山特区| 砚山县| 邵武市| 衡水市| 彰化市| 灵寿县| 和硕县| 宁明县| 焦作市| 安平县| 西峡县| 垣曲县| 江永县| 玛多县| 湖州市| 楚雄市| 沧源| 长垣县| 台江县| 夏邑县| 和龙市| 岳普湖县| 长宁区| 临邑县| 竹山县| 海宁市| 荥阳市| 深圳市| 夏邑县| 休宁县| 崇礼县| 晋中市|