
Scanning text files

In previous recipes, we introduced how to use read.table and read.csv to load data into an R session. However, read.table and read.csv are designed for regular, rectangular data in which every record has the same number of columns, and they can be slow on large files. For more flexible data processing, this recipe demonstrates how to use the scan function to read data from a file.

Getting ready

In this recipe, you need to have completed the previous recipes and have snp500.csv downloaded in the current directory.

How to do it…

Please perform the following steps to scan data from the CSV file:

  1. First, you can use the scan function to read data from snp500.csv:
    > stock_data3 <- scan('snp500.csv', sep=',', what=list(Date = '', Open = 0, High = 0, Low = 0, Close = 0, Volume = 0, Adj_Close = 0), skip=1, fill=TRUE)
    Read 16481 records
    
  2. You can then examine loaded data with mode and str:
    > mode(stock_data3)
    [1] "list"
    > str(stock_data3)
    List of 7
     $ Date : chr [1:16481] "2015-07-02" "2015-07-01" "2015-06-30" "2015-06-29" ...
     $ Open : num [1:16481] 2078 2067 2061 2099 2103 ...
     $ High : num [1:16481] 2085 2083 2074 2099 2109 ...
     $ Low : num [1:16481] 2071 2067 2056 2057 2095 ...
     $ Close : num [1:16481] 2077 2077 2063 2058 2102 ...
     $ Volume : num [1:16481] 3.00e+09 3.73e+09 4.08e+09 3.68e+09 5.03e+09 ...
     $ Adj_Close: num [1:16481] 2077 2077 2063 2058 2102 ...
    

How it works…

Compared with read.csv and read.table, the scan function is more flexible and often more efficient at reading data. Here, we specify the name and data type of each field as a list passed to the what parameter. Each element of the list is a template value whose type determines how the field is read. In this case, the first field is of character type and the rest are numeric, so we use an empty string ('' or "") for the Date column and 0 for the other fields. Then, because we need to skip the header row and pad any line that has fewer fields than the number of columns with empty fields, we set skip to 1 and fill to TRUE.
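
The effect of what, skip, and fill can be checked on a tiny inline sample. The following is a minimal sketch rather than the recipe's data: the three sample lines are made up, and scan reads them through its text argument instead of from a file. The last line is deliberately one field short.

```r
# Made-up sample: a header line, a complete record,
# and a record that is missing its last field
csv_lines <- c("Date,Open,Close",
               "2015-07-02,2078.03,2076.78",
               "2015-07-01,2067.00")

sample_data <- scan(text = csv_lines, sep = ",",
                    what = list(Date = "", Open = 0, Close = 0),
                    skip = 1,      # skip the header line
                    fill = TRUE,   # pad short lines with empty fields
                    quiet = TRUE)  # suppress the "Read N records" message

sample_data$Date    # character vector, because the template value is ""
sample_data$Close   # numeric vector; the padded field is read as NA
```

Without fill = TRUE, scan would not treat the end of the short line as the end of the record, so the field counts would no longer line up with the template.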

At this point, we can examine the data with some built-in functions. Here, we use mode to obtain the storage mode of the object and str to display the structure of the data.
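
Because scan returns a plain list rather than a data frame, a common next step is to convert the result. The following is a minimal sketch under stated assumptions: the list scanned is a small made-up stand-in for stock_data3, and the name stock_df is only illustrative.

```r
# Made-up stand-in for the list that scan returns
scanned <- list(Date  = c("2015-07-02", "2015-07-01"),
                Open  = c(2078.03, 2067.00),
                Close = c(2076.78, 2077.42))

mode(scanned)   # "list"

# Components of equal length become the columns of a data frame
stock_df <- as.data.frame(scanned, stringsAsFactors = FALSE)

# Parse the character dates so that they can be sorted and compared
stock_df$Date <- as.Date(stock_df$Date)

str(stock_df)
```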

There's more…

On some occasions, the data is stored in fixed-width columns rather than separated by a delimiter. To read such a file, you can specify the width of each column with the read.fwf function:

  1. First, you can use download.file to download weather.op from the author's GitHub page:
    > download.file("https://github.com/ywchiu/rcookbook/raw/master/chapter2/weather.op", "weather.op")
    
  2. You can then examine the data with the file editor:

    Figure 5: Using the file editor to examine the file

  3. Read the data by specifying the width of each column in widths and the column names in col.names, and skip the first row by setting skip to 1:
    > weather <- read.fwf("weather.op", widths = c(6,6,10,11,9,8), col.names = c("STN","WBAN","YEARMODA","TEMP","MAX","MIN"), skip=1)
    
  4. Lastly, you can examine the data using the head and names functions:
    > head(weather)
     STN WBAN YEARMODA TEMP MAX MIN
    1 8403 99999 20140101 85.8 24 102.7* 69.3*
    2 8403 99999 20140102 86.3 24 102.9* 71.1*
    3 8403 99999 20140103 85.9 24 101.1* 72.0*
    4 8403 99999 20140104 85.6 24 102.7* 70.5*
    5 8403 99999 20140105 84.8 23 102.0* 66.6*
    6 8403 99999 20140106 86.8 23 102.0* 70.9*
    
    > names(weather)
    [1] "STN" "WBAN" "YEARMODA" "TEMP" "MAX" 
    [6] "MIN" 
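
To see how widths maps onto the characters of each line, the same call can be tried on a small file written on the fly. This is a minimal sketch, assuming a made-up three-column layout (4, 8, and 5 characters wide) and not the real layout of weather.op:

```r
# Made-up fixed-width sample: columns are 4, 8, and 5 characters wide
fwf_lines <- c("STN YEARMODA TEMP",
               "840320140101 85.8",
               "840320140102 86.3")

# read.fwf reads from a file, so write the sample to a temporary one
tmp <- tempfile(fileext = ".txt")
writeLines(fwf_lines, tmp)

sample_fwf <- read.fwf(tmp, widths = c(4, 8, 5),
                       col.names = c("STN", "YEARMODA", "TEMP"),
                       skip = 1)   # skip the header line

unlink(tmp)   # remove the temporary file
str(sample_fwf)
```

Each record is cut purely by character position, so a width that is off by one shifts every later column. Comparing the widths against the file in an editor, as in step 2, is a useful check.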
    