
Running benchmarks

As discussed in the previous chapters, the microbenchmark package lets us run any number of different functions a specified number of times on the same machine, yielding reproducible results on their performance.
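
As a quick refresher on the interface, microbenchmark evaluates each expression the given number of times and collects the distribution of the timings. The following minimal sketch uses made-up toy expressions, not the preceding examples:

> library(microbenchmark)
> microbenchmark(sqrt(1:1e5), (1:1e5)^0.5, times = 10)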

To this end, we first have to define the functions that we want to benchmark; these were compiled from the preceding examples:

> AGGR1 <- function() aggregate(hflights$Diverted,
+ by = list(hflights$DayOfWeek), FUN = mean)
> AGGR2 <- function() with(hflights, aggregate(Diverted,
+ by = list(DayOfWeek), FUN = mean))
> AGGR3 <- function() aggregate(Diverted ~ DayOfWeek,
+ data = hflights, FUN = mean)
> TAPPLY <- function() tapply(X = hflights$Diverted, 
+ INDEX = hflights$DayOfWeek, FUN = mean)
> PLYR1 <- function() ddply(hflights, .(DayOfWeek),
+ function(x) mean(x$Diverted))
> PLYR2 <- function() ddply(hflights, .(DayOfWeek), summarise,
+ Diverted = mean(Diverted))
> DPLYR <- function() dplyr::summarise(hflights_DayOfWeek,
+ mean(Diverted))

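These definitions rely on packages and objects introduced in the preceding examples. If you are starting from a fresh R session, a minimal setup sketch might look like the following (assuming the hflights package is installed; recreating the keyed hflights_dt object this way is an assumption about the earlier steps):

> library(hflights)    # provides the hflights dataset
> library(plyr)        # ddply
> library(dplyr)       # group_by and summarise
> library(data.table)
> hflights_DayOfWeek <- group_by(hflights, DayOfWeek)
> hflights_dt <- data.table(hflights, key = 'DayOfWeek')
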
However, as mentioned before, the summarise function in dplyr needs the data to be grouped beforehand, and this restructuring also takes time. To be fair, let's define another function that includes the creation of the grouped data structure along with the actual aggregation:

> DPLYR_ALL <- function() {
+ hflights_DayOfWeek <- group_by(hflights, DayOfWeek)
+ dplyr::summarise(hflights_DayOfWeek, mean(Diverted))
+ }

Similarly, benchmarking data.table also requires some additional variables in the test environment; as hflights_dt is already sorted by DayOfWeek, let's create a new, unkeyed data.table object for benchmarking:

> hflights_dt_nokey <- data.table(hflights)

Further, it probably makes sense to verify that it has no keys:

> key(hflights_dt_nokey)
NULL
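
If you prefer a logical answer, the haskey function from data.table does the same check:

> haskey(hflights_dt_nokey)
[1] FALSE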

Okay, now we can define the data.table test cases, along with a function that also includes setting (and then removing) the key on the fly, just to be fair with dplyr's extra grouping step:

> DT <- function() hflights_dt_nokey[, mean(Diverted),
+ by = DayOfWeek]
> DT_KEY <- function() hflights_dt[, mean(Diverted),
+ by = DayOfWeek]
> DT_ALL <- function() {
+ setkey(hflights_dt_nokey, 'DayOfWeek')
+ hflights_dt_nokey[, mean(Diverted), by = DayOfWeek]
+ setkey(hflights_dt_nokey, NULL)
+ }

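Before timing anything, it makes sense to verify that the competing implementations return the very same values. A minimal sanity-check sketch, comparing, for example, the tapply and the keyed data.table results (assuming the functions defined above):

> stopifnot(isTRUE(all.equal(as.numeric(TAPPLY()),
+ DT_KEY()[order(DayOfWeek)]$V1)))
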
Now that we have all the described implementations ready for testing, let's load the microbenchmark package to do its job:

> library(microbenchmark)
> res <- microbenchmark(AGGR1(), AGGR2(), AGGR3(), TAPPLY(), PLYR1(),
+ PLYR2(), DPLYR(), DPLYR_ALL(), DT(), DT_KEY(), DT_ALL(),
+ times = 10)
> print(res, digits = 3)
Unit: milliseconds
        expr     min      lq  median      uq     max neval
     AGGR1() 2279.82 2348.14 2462.02 2597.70 2719.88    10
     AGGR2() 2278.15 2465.09 2528.55 2796.35 2996.98    10
     AGGR3() 2358.71 2528.23 2726.66 2879.89 3177.63    10
    TAPPLY()   19.90   21.05   23.56   29.65   33.88    10
     PLYR1()   56.93   59.16   70.73   82.41  155.88    10
     PLYR2()   58.31   65.71   76.51   98.92  103.48    10
     DPLYR()    1.18    1.21    1.30    1.74    1.84    10
 DPLYR_ALL()    7.40    7.65    7.93    8.25   14.51    10
        DT()    5.45    5.73    5.99    7.75    9.00    10
    DT_KEY()    5.22    5.45    5.63    6.26   13.64    10
    DT_ALL()   31.31   33.26   35.19   38.34   42.83    10

The results are pretty spectacular: from more than 2,000 milliseconds, we managed to get down to a bit more than 1 millisecond for the very same result. The spread can be demonstrated easily on a violin plot with a logarithmic scale:

> autoplot(res)
[Figure: violin plot of the benchmark timings on a logarithmic scale]
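
Note that autoplot is a generic from the ggplot2 package, which thus has to be loaded for the preceding call to dispatch; the microbenchmark method uses a log scale by default, which can also be requested explicitly via its log argument:

> library(ggplot2)
> autoplot(res, log = TRUE)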

Therefore, dplyr seems to be the most efficient solution, although if we also take the extra step of grouping the data.frame into account, the otherwise clear advantage becomes rather unconvincing. As a matter of fact, if we already have a data.table object and can thus save the cost of transforming a traditional data.frame into a data.table, then data.table performs better than dplyr. However, I am pretty sure that you will not really notice the time difference between the two high-performance solutions; both do a very good job even with larger datasets.

It's worth mentioning that dplyr can work with data.table objects as well; therefore, to make sure you are not locked into either package, it's definitely worth using both if needed. The following is a proof-of-concept example:

> dplyr::summarise(group_by(hflights_dt, DayOfWeek), mean(Diverted))
Source: local data table [7 x 2]

  DayOfWeek mean(Diverted)
1         1    0.002997672
2         2    0.002559323
3         3    0.003226211
4         4    0.003065727
5         5    0.002687865
6         6    0.002823121
7         7    0.002589057

Okay, so now we can be pretty confident about using either data.table or dplyr for computing group averages in the future. However, what about more complex operations?
