
Performance issues

Over 40% of the R code base is written in C, and slightly more than 20% is still in Fortran (the rest is in C++, Java, and R), which makes some common computational tasks very costly. Microsoft (and, before it, Revolution Analytics) rewrote some of the most frequently used functions from old Fortran into C/C++ to address these performance issues.

Many package authors did something very similar. For example, Matt Dowle, the main author of the data.table R package, made several performance improvements to speed up the most common data wrangling steps.

When the same operation is run on the same dataset with different packages, such as dplyr, plyr, data.table, and sqldf, the results are identical but the execution times differ considerably.

The following R sample creates an object of roughly 80 MiB and applies a simple grouping function to show how much the computation time differs. The dplyr and data.table packages stand out, running more than 25 times faster than plyr and sqldf. data.table in particular is extremely efficient, mainly because of Matt Dowle's strong drive to optimize the package's code for performance:

set.seed(6546) 
nobs <- 1e+07 
df <- data.frame("group" = as.factor(sample(1:1e+05, nobs, replace = TRUE)), "variable" = rpois(nobs, 100)) 
 
# Calculate mean of variable within each group using plyr - ddply  
library(plyr) 
system.time(grpmean <- ddply( 
  df,  
  .(group),  
  summarize,  
  grpmean = mean(variable))) 
 
 
# Calculate mean of variable within each group using dplyr
detach("package:plyr", unload=TRUE) 
library(dplyr) 
 
system.time( 
  grpmean2 <- df %>%  
              group_by(group) %>% 
              summarise(group_mean = mean(variable))) 
 
# Calculate mean of variable within each group using data.table
library(data.table)
system.time(
  grpmean3 <- data.table(df)[
    # i is left empty, so all rows are used
    , mean(variable)
    , by = group])
 
# Calculate mean of variable within each group using sqldf
library(sqldf) 
system.time(grpmean4 <- sqldf("SELECT avg(variable), [group] from df GROUP BY [group]")) 
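
Before comparing timings, it can be worth confirming that the approaches really do return the same group means. The following is a minimal sketch of such a check, assuming dplyr and data.table are installed. The smaller dataset and the names small_df, res_dplyr, and res_dt are illustrative and not part of the original sample:

```r
library(dplyr)
library(data.table)

set.seed(6546)
nobs <- 1e+05  # assumption: far smaller than the original sample, so the check runs quickly
small_df <- data.frame(
  group = as.factor(sample(1:1e+03, nobs, replace = TRUE)),
  variable = rpois(nobs, 100)
)

# Group means with dplyr; rows come back ordered by factor level
res_dplyr <- small_df %>%
  group_by(group) %>%
  summarise(group_mean = mean(variable))

# Group means with data.table; rows come back in order of first appearance,
# so sort by group before comparing
res_dt <- data.table(small_df)[, .(group_mean = mean(variable)), by = group]
res_dt <- res_dt[order(group)]

# Returns TRUE when both packages agree up to floating-point tolerance
all.equal(res_dplyr$group_mean, res_dt$group_mean)
```

Sorting matters here because the two packages may return the groups in a different order, which would make an element-wise comparison misleading even when the computed means agree.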

The Microsoft RevoScaleR package is optimized as well, and it can surpass all of these packages both in performance and in its ability to handle large datasets. This illustrates how Microsoft has made R ready for large datasets and addressed the performance issues.
