手机上玩不朽情缘五滴血

書名： Mastering Data Analysis with R
作者名： Gergely Daróczi
本章字數： 954字
更新時間： 2021-07-09 21:58:54

Rearranging data

Sometimes, we do not want to filter any part of the data (neither the rows, nor the columns), but the data is simply not in the most useful order due to convenience or performance issues, as we have seen, for instance, in Chapter 3, Filtering and Summarizing Data.

Besides the base sort and order functions, or providing the order of variables passed to the [ operator, we can also use some SQL-like solutions with the sqldf package, or query the data in the right format directly from the database. And the previously mentioned dplyr package also provides an effective method for ordering data. Let's sort the hflights data, based on the actual elapsed time for each of the quarter million flights:

> str(arrange(hflights, ActualElapsedTime))
'data.frame': 227496 obs. of 21 variables:
 $ Year : int 2011 2011 2011 2011 2011 2011 ...
 $ Month : int 7 7 8 9 1 4 5 6 7 8 ...
 $ DayofMonth : int 24 25 13 21 3 29 9 21 8 2 ...
 $ DayOfWeek : int 7 1 6 3 1 5 1 2 5 2 ...
 $ DepTime : int 2005 2302 1607 1546 1951 2035 ...
 $ ArrTime : int 2039 2336 1641 1620 2026 2110 ...
 $ UniqueCarrier : chr "WN" "XE" "WN" "WN" ...
 $ FlightNum : int 1493 2408 912 2363 2814 2418 ...
 $ TailNum : chr "N385SW" "N12540" "N370SW" "N524SW" ...
 $ ActualElapsedTime: int 34 34 34 34 35 35 35 35 35 35 ...
 $ AirTime : int 26 26 26 26 23 23 27 26 25 25 ...
 $ ArrDelay : int 9 -8 -4 15 -19 20 35 -15 86 -9 ...
 $ DepDelay : int 20 2 7 26 -4 35 45 -8 96 1 ...
 $ Origin : chr "HOU" "IAH" "HOU" "HOU" ...
 $ Dest : chr "AUS" "AUS" "AUS" "AUS" ...
 $ Distance : int 148 140 148 148 127 127 148 ...
 $ TaxiIn : int 3 3 4 3 4 4 5 3 5 4 ...
 $ TaxiOut : int 5 5 4 5 8 8 3 6 5 6 ...
 $ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
 $ CancellationCode : chr "" "" "" "" ...
 $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...

Well, it's pretty straightforward that flights departing to Austin are among the first few records shown. For improved readability, the above three R expressions can be called in a much nicer way with the pipe operator from the automatically imported magrittr package, which provides a simple way to pass an R object as the first argument of the subsequent R expression:

> hflights %>% arrange(ActualElapsedTime) %>% str

So, instead of nesting R functions, we can now start our R command with the core object and pass the results of each evaluated R expression to the next one in the chain. In most cases, this makes the code more convenient to read. Although most hardcore R programmers have already gotten used to reading the nested function calls from inside-out, believe me, it's pretty easy to get used to this nifty feature! Do not let me confuse you with the inspiring painting of René Magritte, which became the slogan, "This is not a pipe," and a symbol of the magrittr package:

There is no limit to the number of chainable R expressions and objects one can have. For example, let's also filter a few cases and variables to see how easy it is to follow the data restructuring steps with dplyr:

> hflights %>%
+ arrange(ActualElapsedTime) %>%
+ select(ActualElapsedTime, Dest) %>%
+ subset(Dest != 'AUS') %>%
+ head %>%
+ str
'data.frame': 6 obs. of 2 variables:
 $ ActualElapsedTime: int 35 35 36 36 37 37
 $ Dest : chr "LCH" "LCH" "LCH" "LCH" ...

So, now we have filtered the original dataset a few times to see the closest airport after Austin, and the code is indeed easy to read and understand. This is a nice and efficient way to filter data, although some prefer to use nifty one-liners with the data.table package:

> str(head(data.table(hflights, key = 'ActualElapsedTime')[Dest !=
+ 'AUS', c('ActualElapsedTime', 'Dest'), with = FALSE]))
Classes 'data.table' and 'data.frame': 6 obs. of 2 variables:
 $ ActualElapsedTime: int NA NA NA NA NA NA
 $ Dest : chr "MIA" "DFW" "MIA" "SEA" ...
 - attr(*, "sorted")= chr "ActualElapsedTime"
 - attr(*, ".internal.selfref")=<externalptr>

Almost perfect! The only problem is that we got different results due to the missing values, which were ordered at the beginning of the dataset while we defined the data.table object to be indexed by ActualElapsedTime. To overcome this issue, let's drop the NA values, and instead of specifying the column names as strings along with forcing the with parameter to be FALSE, let's pass a list of column names:

> str(head(na.omit(
+ data.table(hflights, key = 'ActualElapsedTime'))[Dest != 'AUS',
+ list(ActualElapsedTime, Dest)]))
Classes 'data.table' and 'data.frame': 6 obs. of 2 variables:
 $ ActualElapsedTime: int 35 35 36 36 37 37
 $ Dest : chr "LCH" "LCH" "LCH" "LCH" ...
 - attr(*, "sorted")= chr "ActualElapsedTime"
 - attr(*, ".internal.selfref")=<externalptr>

This is exactly the same results as we have seen before. Please note that in this example, we have omitted the NA values after transforming data.frame to data.table, indexed by the ActualElapsedTime variable, which is a lot faster compared to calling na.omit on hflights first and then evaluating all the other R expressions:

> system.time(str(head(data.table(na.omit(hflights),
+ key = 'ActualElapsedTime')[Dest != 'AUS',
+ c('ActualElapsedTime', 'Dest'), with = FALSE])))
 user system elapsed 
 0.374 0.017 0.390 
> system.time(str(head(na.omit(data.table(hflights,
+ key = 'ActualElapsedTime'))[Dest != 'AUS',
+ c('ActualElapsedTime', 'Dest'), with = FALSE])))
 user system elapsed 
 0.22 0.00 0.22

官术网_书友最值得收藏!

Mastering Data Analysis with R

Rearranging data