Converting data types

If we do not specify a data type during the import phase, R will automatically assign a type to each column of the imported dataset. However, if the assigned type differs from the actual type, further data manipulation becomes difficult. Thus, data type conversion is an essential step during the preprocessing phase.

Getting ready

Complete the previous recipe and import both employees.csv and salaries.csv into an R session. You must also specify column names for these two datasets to be able to perform the following steps.

How to do it…

Perform the following steps to convert the data type:

  1. First, examine the data type of each attribute using the class function:
    > class(employees$birth_date)
    [1] "factor"
    
  2. You can also examine types of all attributes using the str function:
    > str(employees)
    
    'data.frame': 10 obs. of 6 variables:
     $ emp_no : int 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010
     $ birth_date: Factor w/ 10 levels "1952-04-19","1953-04-20",..: 3 10 8 4 5 2 6 7 1 9
     $ first_name: Factor w/ 10 levels "Anneke","Bezalel",..: 5 2 7 3 6 1 10 8 9 4
     $ last_name : Factor w/ 10 levels "Bamford","Facello",..: 2 9 1 4 5 8 10 3 6 7
     $ gender : Factor w/ 2 levels "F","M": 2 1 2 2 2 1 1 2 1 1
     $ hire_date : Factor w/ 10 levels "1985-02-18","1985-11-21",..: 3 2 4 5 9 7 6 10 1 8
    
  3. Then, you need to convert both birth_date and hire_date to the date format:
    > employees$birth_date <- as.Date(employees$birth_date)
    > employees$hire_date <- as.Date(employees$hire_date)
    
  4. You also need to convert both first_name and last_name into character type:
    > employees$first_name <- as.character(employees$first_name)
    > employees$last_name <- as.character(employees$last_name)
    
  5. Again, you can use str to examine the dataset:
    > str(employees)
    
    'data.frame': 10 obs. of 6 variables:
     $ emp_no : int 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010
     $ birth_date: Date, format: "1953-09-02" ...
     $ first_name: chr "Georgi" "Bezalel" "Parto" "Chirstian" ...
     $ last_name : chr "Facello" "Simmel" "Bamford" "Koblick" ...
     $ gender : Factor w/ 2 levels "F","M": 2 1 2 2 2 1 1 2 1 1
     $ hire_date : Date, format: "1986-06-26" ...
    
  6. Furthermore, you can convert from_date and to_date within salaries to the date type:
    > salaries$from_date <- as.Date(salaries$from_date)
    > salaries$to_date <- as.Date(salaries$to_date)
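
When several columns need the same conversion, the repeated assignments in the steps above can be collapsed into one call. The following is only a sketch under stated assumptions: emp_demo is a hypothetical two-row stand-in for the imported data (values taken from the outputs shown in this recipe), and stringsAsFactors = TRUE is set explicitly so that the date columns start out as factors, as they do in the steps above.

```r
# Hypothetical stand-in for the imported employees data (two rows only)
emp_demo <- data.frame(
  emp_no     = c(10001L, 10009L),
  birth_date = c("1953-09-02", "1952-04-19"),
  hire_date  = c("1986-06-26", "1985-02-18"),
  stringsAsFactors = TRUE   # make the date columns factors, as in this recipe
)

# Names of the columns that should become Date
date_cols <- c("birth_date", "hire_date")

# Apply as.Date to each selected column and write the results back
emp_demo[date_cols] <- lapply(emp_demo[date_cols], as.Date)

# Check the resulting types
col_types <- sapply(emp_demo, class)
print(col_types)
```

The same pattern should work with as.character for the name columns; only the vector of column names and the conversion function change.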
    

How it works…

In this recipe, we demonstrated how to convert the data type of each attribute within the dataset. Before converting any attribute, you must first examine its current type. To identify the data type of a single selected attribute, you can use the class function. To inspect the data types of all attributes at once, you can use the str function.
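
As a minimal sketch of this inspection step, the following builds a small stand-in data frame and checks its types. The name type_demo and its three rows are assumptions for illustration (values taken from the outputs above), and sapply(type_demo, class) is shown as one possible way to collect all column types into a vector; it is not used in the recipe itself.

```r
# Hypothetical stand-in for the imported employees data (three rows only)
type_demo <- data.frame(
  emp_no     = c(10001L, 10006L, 10009L),
  birth_date = c("1953-09-02", "1953-04-20", "1952-04-19"),
  gender     = c("M", "F", "F"),
  stringsAsFactors = TRUE   # make the text columns factors, as in this recipe
)

# Type of a single attribute
single_type <- class(type_demo$birth_date)
print(single_type)

# Types of all attributes, collected into a named character vector
all_types <- sapply(type_demo, class)
print(all_types)

# Compact overview of structure and types
str(type_demo)
```

Note that since R 4.0.0 the default of stringsAsFactors is FALSE, so on a recent installation text columns are read as character unless factors are requested explicitly.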

From the output of applying the str function to the employees data frame, we can see that both birth_date and hire_date are of factor type. A factor cannot be used in date arithmetic, so if we need to calculate an employee's age from the birth_date attribute, we must first convert it to date format. Thus, we change both birth_date and hire_date to date format using the as.Date function.
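
To illustrate why the date type matters, here is a rough sketch of an age calculation. The birth dates are taken from the factor levels shown earlier; ref_date is a hypothetical fixed reference date chosen so that the result is reproducible, and dividing by 365.25 is only an approximation of a year, not an exact calendar computation.

```r
# Birth dates taken from the str output above, stored as a factor
birth_date <- factor(c("1953-09-02", "1953-04-20", "1952-04-19"))

# Convert factor -> Date; the labels are in the default year-month-day format
birth_date <- as.Date(birth_date)
print(class(birth_date))

# Hypothetical fixed reference date (an assumption for this sketch)
ref_date <- as.Date("2020-01-01")

# Subtracting two dates gives a difference in days
age_days <- as.numeric(ref_date - birth_date)

# Approximate age in whole years
age_years <- floor(age_days / 365.25)
print(age_years)
```

In practice you would likely replace ref_date with Sys.Date() to compute ages as of today.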

Also, because the factor type restricts an attribute to a fixed set of values (its levels), we cannot freely add a record to the dataset. A new employee's first and last names are unlikely to match names already in the dataset, so we need to convert last_name and first_name to the character type. We can then proceed to append a new record to the employees dataset in the next recipe. Finally, we should also convert from_date and to_date of the salaries dataset to the date type, so that we can perform date calculations in the next recipe.
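
The restriction imposed by factor levels can be seen in a short sketch. The first names come from the outputs above, while "Mary" is a hypothetical new value used only for illustration. The example uses direct element assignment; how a new row behaves when appended depends on the function used to append it.

```r
# First names taken from the outputs above, stored as a factor
first_name <- factor(c("Georgi", "Bezalel", "Parto"))

# Assigning a value that is not an existing level produces NA (R also
# emits a warning, which is suppressed here to keep the output clean)
name_factor <- first_name
suppressWarnings(name_factor[2] <- "Mary")   # "Mary" is a hypothetical name
print(is.na(name_factor[2]))

# After conversion to character, any string can be stored
name_chr <- as.character(first_name)
name_chr[2] <- "Mary"
print(name_chr)
```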

There's more…

Besides using an as function (such as as.Date or as.character) to convert a data type after import, you can specify the data types during the data import phase. Using the read.csv function as an example, you can pass the type of each column in the colClasses argument. If you want R to select the data type of a column automatically (for example, to read emp_no as integer type), simply specify NA at that position within colClasses:

> employees <- read.csv('~/Desktop/employees.csv', colClasses = c(NA, "Date", "character", "character", "factor", "Date"), header = FALSE)
> str(employees)
'data.frame': 10 obs. of 6 variables:
 $ V1: int 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010
 $ V2: Date, format: "1953-09-02" ...
 $ V3: chr "Georgi" "Bezalel" "Parto" "Chirstian" ...
 $ V4: chr "Facello" "Simmel" "Bamford" "Koblick" ...
 $ V5: Factor w/ 2 levels "F","M": 2 1 2 2 2 1 1 2 1 1
 $ V6: Date, format: "1986-06-26" ...

With colClasses specified in this way, the six columns (emp_no, birth_date, first_name, last_name, gender, and hire_date, shown here as V1 to V6 because no column names were supplied) are read as integer, date, character, character, factor, and date types, respectively.
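
The following sketch runs the same idea without depending on a file on disk. It is an illustration under assumptions: csv_text is a hypothetical one-row stand-in for employees.csv (the row is rebuilt from the outputs above), the text argument of read.csv is used in place of a file path, and col.names is added so that the columns carry meaningful names instead of V1 to V6.

```r
# Inline CSV standing in for employees.csv (a single row, for illustration)
csv_text <- "10001,1953-09-02,Georgi,Facello,M,1986-06-26"

# Read the data, naming the columns and fixing their types at import time
employees_demo <- read.csv(
  text       = csv_text,
  header     = FALSE,
  col.names  = c("emp_no", "birth_date", "first_name",
                 "last_name", "gender", "hire_date"),
  colClasses = c(NA, "Date", "character", "character", "factor", "Date")
)

# NA lets R pick the type for emp_no; the rest follow colClasses
import_types <- sapply(employees_demo, class)
print(import_types)
```

Setting the types at import time avoids a separate conversion pass, at the cost of having to know the column layout in advance.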
