官术网_书友最值得收藏!

Rearranging data

Sometimes, we do not want to filter any part of the data (neither the rows, nor the columns), but the data is simply not in the most useful order due to convenience or performance issues, as we have seen, for instance, in Chapter 3, Filtering and Summarizing Data.

Besides the base sort and order functions, or providing the order of variables passed to the [ operator, we can also use some SQL-like solutions with the sqldf package, or query the data in the right format directly from the database. And the previously mentioned dplyr package also provides an effective method for ordering data. Let's sort the hflights data, based on the actual elapsed time for each of the quarter million flights:

> str(arrange(hflights, ActualElapsedTime))
'data.frame': 227496 obs. of 21 variables:
 $ Year : int 2011 2011 2011 2011 2011 2011 ...
 $ Month : int 7 7 8 9 1 4 5 6 7 8 ...
 $ DayofMonth : int 24 25 13 21 3 29 9 21 8 2 ...
 $ DayOfWeek : int 7 1 6 3 1 5 1 2 5 2 ...
 $ DepTime : int 2005 2302 1607 1546 1951 2035 ...
 $ ArrTime : int 2039 2336 1641 1620 2026 2110 ...
 $ UniqueCarrier : chr "WN" "XE" "WN" "WN" ...
 $ FlightNum : int 1493 2408 912 2363 2814 2418 ...
 $ TailNum : chr "N385SW" "N12540" "N370SW" "N524SW" ...
 $ ActualElapsedTime: int 34 34 34 34 35 35 35 35 35 35 ...
 $ AirTime : int 26 26 26 26 23 23 27 26 25 25 ...
 $ ArrDelay : int 9 -8 -4 15 -19 20 35 -15 86 -9 ...
 $ DepDelay : int 20 2 7 26 -4 35 45 -8 96 1 ...
 $ Origin : chr "HOU" "IAH" "HOU" "HOU" ...
 $ Dest : chr "AUS" "AUS" "AUS" "AUS" ...
 $ Distance : int 148 140 148 148 127 127 148 ...
 $ TaxiIn : int 3 3 4 3 4 4 5 3 5 4 ...
 $ TaxiOut : int 5 5 4 5 8 8 3 6 5 6 ...
 $ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
 $ CancellationCode : chr "" "" "" "" ...
 $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...

Well, it's pretty straightforward that flights departing to Austin are among the first few records shown. For improved readability, the above three R expressions can be called in a much nicer way with the pipe operator from the automatically imported magrittr package, which provides a simple way to pass an R object as the first argument of the subsequent R expression:

> hflights %>% arrange(ActualElapsedTime) %>% str

So, instead of nesting R functions, we can now start our R command with the core object and pass the results of each evaluated R expression to the next one in the chain. In most cases, this makes the code more convenient to read. Although most hardcore R programmers have already gotten used to reading the nested function calls from inside-out, believe me, it's pretty easy to get used to this nifty feature! Do not let me confuse you with the inspiring painting of René Magritte, which became the slogan, "This is not a pipe," and a symbol of the magrittr package:

Rearranging data

There is no limit to the number of chainable R expressions and objects one can have. For example, let's also filter a few cases and variables to see how easy it is to follow the data restructuring steps with dplyr:

> hflights %>%
+ arrange(ActualElapsedTime) %>%
+ select(ActualElapsedTime, Dest) %>%
+ subset(Dest != 'AUS') %>%
+ head %>%
+ str
'data.frame': 6 obs. of 2 variables:
 $ ActualElapsedTime: int 35 35 36 36 37 37
 $ Dest : chr "LCH" "LCH" "LCH" "LCH" ...

So, now we have filtered the original dataset a few times to see the closest airport after Austin, and the code is indeed easy to read and understand. This is a nice and efficient way to filter data, although some prefer to use nifty one-liners with the data.table package:

> str(head(data.table(hflights, key = 'ActualElapsedTime')[Dest !=
+ 'AUS', c('ActualElapsedTime', 'Dest'), with = FALSE]))
Classes 'data.table' and 'data.frame': 6 obs. of 2 variables:
 $ ActualElapsedTime: int NA NA NA NA NA NA
 $ Dest : chr "MIA" "DFW" "MIA" "SEA" ...
 - attr(*, "sorted")= chr "ActualElapsedTime"
 - attr(*, ".internal.selfref")=<externalptr>

Almost perfect! The only problem is that we got different results due to the missing values, which were ordered at the beginning of the dataset while we defined the data.table object to be indexed by ActualElapsedTime. To overcome this issue, let's drop the NA values, and instead of specifying the column names as strings along with forcing the with parameter to be FALSE, let's pass a list of column names:

> str(head(na.omit(
+ data.table(hflights, key = 'ActualElapsedTime'))[Dest != 'AUS',
+ list(ActualElapsedTime, Dest)]))
Classes 'data.table' and 'data.frame': 6 obs. of 2 variables:
 $ ActualElapsedTime: int 35 35 36 36 37 37
 $ Dest : chr "LCH" "LCH" "LCH" "LCH" ...
 - attr(*, "sorted")= chr "ActualElapsedTime"
 - attr(*, ".internal.selfref")=<externalptr>

This is exactly the same results as we have seen before. Please note that in this example, we have omitted the NA values after transforming data.frame to data.table, indexed by the ActualElapsedTime variable, which is a lot faster compared to calling na.omit on hflights first and then evaluating all the other R expressions:

> system.time(str(head(data.table(na.omit(hflights),
+ key = 'ActualElapsedTime')[Dest != 'AUS',
+ c('ActualElapsedTime', 'Dest'), with = FALSE])))
 user system elapsed 
 0.374 0.017 0.390 
> system.time(str(head(na.omit(data.table(hflights,
+ key = 'ActualElapsedTime'))[Dest != 'AUS',
+ c('ActualElapsedTime', 'Dest'), with = FALSE])))
 user system elapsed 
 0.22 0.00 0.22
主站蜘蛛池模板: 嘉兴市| 平陆县| 漳州市| 营口市| 和田县| 余庆县| 罗甸县| 长垣县| 承德市| 江达县| 涡阳县| 定兴县| 鲁山县| 娄烦县| 益阳市| 汉沽区| 泰安市| 基隆市| 深州市| 吴忠市| 余干县| 竹山县| 延川县| 将乐县| 图木舒克市| 崇义县| 南川市| 龙陵县| 买车| 灵武市| 漠河县| 濉溪县| 灵宝市| 上饶县| 高台县| 阳高县| 晋州市| 柳林县| 宁波市| 忻州市| 澳门|