
High performance computing

First, it is important to measure which lines of code take the most computation time. You can then try to reduce the processing time of those individual calculations. In R, this can often be done by vectorization or, often better, by rewriting individual pieces of code in a compiled language, such as C, C++, or Fortran.
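As a minimal sketch of what vectorization means (the exact timings will vary by machine), compare an interpreted R loop with the equivalent vectorized call, whose loop over elements runs in compiled code:

x <- rnorm(1000000)
## interpreted loop, element by element
system.time({
  s <- 0
  for(xi in x) s <- s + xi^2
  s
})
## vectorized: the looping happens in C
system.time(sum(x^2))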

In addition, some calculations can be parallelized and accelerated through parallel computing.

Profiling to detect computationally slow functions in code

Take an example where you have written code for your data analysis, but it runs (too) slowly. Most likely, not all of your lines of code are slow, and only a few lines need improvement in terms of computation time. In this instance, it is very important to know exactly which step in the code takes the most computation time.

The easiest way to find this out is to work with the R function system.time. We will compare two models:

data(Cars93, package = "MASS")
set.seed(123)
system.time(lm(Price ~ Horsepower + Weight + Type + Origin, data=Cars93))
## user system elapsed
## 0.003 0.000 0.002
library("robustbase")
system.time(lmrob(Price ~ Horsepower + Weight + Type + Origin, data=Cars93))
## user system elapsed
## 0.022 0.000 0.023

The user time is the CPU time charged for the call and evaluation of the code. The elapsed time is the real (wall-clock) time that has passed, and this is usually the most interesting number. proc.time is another simple function, often used inside functions:

ptm <- proc.time()
lmrob(Price ~ Horsepower + Weight + Type + Origin, data=Cars93)
##
## Call:
## robustbase::lmrob(formula = Price ~ Horsepower + Weight + Type + Origin, data = Cars93)
## \--> method = "MM"
## Coefficients:
## (Intercept) Horsepower Weight TypeLarge TypeMidsize
## -2.72414 0.10660 0.00141 0.18398 3.05846
## TypeSmall TypeSporty TypeVan Originnon-USA
## -1.29751 0.68596 -0.36019 1.88560
proc.time() - ptm
## user system elapsed
## 0.025 0.000 0.027

To get a more reliable answer about the computational speed of the two methods, we should replicate the experiment. We can see that lmrob takes roughly ten times longer than lm:

s1 <- system.time(replicate(100, lm(Price ~ Horsepower + Weight + Type + Origin, data=Cars93)))[3]
s2 <- system.time(replicate(100, lmrob(Price ~ Horsepower + Weight + Type + Origin, data=Cars93)))[3]
(s2 - s1)/s1
## elapsed
## 10.27
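For more reliable timings of short-running code, the microbenchmark package (which we will use again later in this section) repeats each expression many times and reports the distribution of run times. A minimal sketch of the same comparison:

library(microbenchmark)
microbenchmark(
  LM = lm(Price ~ Horsepower + Weight + Type + Origin, data = Cars93),
  LMROB = lmrob(Price ~ Horsepower + Weight + Type + Origin, data = Cars93),
  times = 100L)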

However, such timings do not tell us which part of the code makes a function slow. For this, we use R's built-in profiler via Rprof:

Rprof("Prestige.lm.out")
invisible(replicate(100,
 lm(Price ~ Horsepower + Weight + Type + Origin, data=Cars93)))
Rprof(NULL)
summaryRprof("Prestige.lm.out")$by.self
## self.time self.pct total.time total.pct
## ".External2" 0.04 22.22 0.04 22.22
## ".External" 0.02 11.11 0.02 11.11
## "[[.data.frame" 0.02 11.11 0.02 11.11
## "[[<-.data.frame" 0.02 11.11 0.02 11.11
## "as.list" 0.02 11.11 0.02 11.11
## "lm.fit" 0.02 11.11 0.02 11.11
## "match" 0.02 11.11 0.02 11.11
## "vapply" 0.02 11.11 0.02 11.11

We can see which function calls relate to the slowest part of the code.

A more detailed output is reported by the following code. However, since the output is quite long, we have redacted it for the book version (it is available when running the code bundle that accompanies this book):

require(profr)
## Loading required package: profr
parse_rprof("Prestige.lm.out")

Note

Plot methods are implemented in the profr package to visualize the profiling results.
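For example (a minimal sketch, using the plot method that profr provides for parsed profiles):

prof <- parse_rprof("Prestige.lm.out")
plot(prof)  ## call stack over time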

Further benchmarking

Finally, we will show a data manipulation example using several different packages. This should show the efficiency of data.table and dplyr.

To run the following code snippets, the packages below must be loaded first, since they provide the functionality used:

library(microbenchmark); library(plyr); library(dplyr);
library(data.table); library(Hmisc)

The task is to calculate the groupwise (Type, Origin) means of Horsepower, for example, with dplyr:

data(Cars93, package = "MASS")
Cars93 %>% group_by(Type, Origin) %>% summarise(mean = mean(Horsepower))
## Source: local data frame [11 x 3]
## Groups: Type [?]
##
## Type Origin mean
## (fctr) (fctr) (dbl)
## 1 Compact USA 117.42857
## 2 Compact non-USA 141.55556
## 3 Large USA 179.45455
## 4 Midsize USA 153.50000
## 5 Midsize non-USA 189.41667
## 6 Small USA 89.42857
## 7 Small non-USA 91.78571
## 8 Sporty USA 166.50000
## 9 Sporty non-USA 151.66667
## 10 Van USA 158.40000
## 11 Van non-USA 138.25000

First, we calculate the same with base R, where we also write a for loop for calculating the mean. We keep this deliberately naive ("dirty") for benchmarking purposes:

meanFor <- function(x){
  sum <- 0
  for(i in 1:length(x)) sum <- sum + x[i]
  sum / length(x)
}

## groupwise statistics
myfun1 <- function(x, gr1, gr2, num){
  x[,gr1] <- as.factor(x[,gr1])
  x[,gr2] <- as.factor(x[,gr2])
  l1 <- length(levels(x[,gr1]))
  l2 <- length(levels(x[,gr2]))
  gr <- numeric(l1*l2)
  c1 <- c2 <- character(l1*l2)
  ii <- 0
  for(i in levels(x[,gr1])){
    for(j in levels(x[,gr2])){
      ii <- ii + 1
      c1[ii] <- i
      c2[ii] <- j
      vec <- x[x[,gr2] == j & x[,gr1] == i, num]
      if(length(vec) > 0) gr[ii] <- meanFor(vec)
    }
  }
  df <- data.frame(cbind(c1, c2))
  df <- cbind(df, gr)
  colnames(df) <- c(gr1, gr2, paste("mean(", num, ")"))
  df
}

## groupwise using mean()
## note: calling mean.default directly would be faster
myfun2 <- function(x, gr1, gr2, num){
  x[,gr1] <- as.factor(x[,gr1])
  x[,gr2] <- as.factor(x[,gr2])
  l1 <- length(levels(x[,gr1]))
  l2 <- length(levels(x[,gr2]))
  gr <- numeric(l1*l2)
  c1 <- c2 <- character(l1*l2)
  ii <- 0
  for(i in levels(x[,gr1])){
    for(j in levels(x[,gr2])){
      ii <- ii + 1
      c1[ii] <- i
      c2[ii] <- j
      gr[ii] <- mean(x[x[,gr2] == j & x[,gr1] == i, num])
    }
  }
  df <- data.frame(cbind(c1, c2))
  df <- cbind(df, gr)
  colnames(df) <- c(gr1, gr2, paste("mean(", num, ")"))
  df
}
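Before benchmarking, it is worth a quick sanity check that the hand-rolled functions reproduce the dplyr result from above (up to ordering and formatting):

res <- myfun2(x = Cars93, gr1 = "Type", gr2 = "Origin", num = "Horsepower")
head(res)  ## should match the groupwise means shown above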

For data.table, we will create a data table:

Cars93dt <- data.table(Cars93)

We now run the benchmark using microbenchmark and plot the results. For the plot syntax, refer to the section on visualization:

op <- microbenchmark(
  ## pure for loops
  MYFUN1 = myfun1(x=Cars93, gr1="Type", gr2="Origin",
                  num="Horsepower"),
  ## pure for loops but using mean
  MYFUN2 = myfun2(x=Cars93, gr1="Type", gr2="Origin",
                  num="Horsepower"),
  ## plyr
  PLYR = ddply(Cars93, .(Type, Origin), summarise,
               output = mean(Horsepower)),
  ## base R's aggregate and by
  AGGR = aggregate(Horsepower ~ Type + Origin, Cars93, mean),
  BY = by(Cars93$Horsepower,
          list(Cars93$Type, Cars93$Origin), mean),
  ## Hmisc's summarize
  SUMMARIZE = summarize(Cars93$Horsepower,
                        llist(Cars93$Type, Cars93$Origin), mean),
  ## base R's tapply
  TAPPLY = tapply(Cars93$Horsepower,
                  interaction(Cars93$Type, Cars93$Origin), mean),
  ## dplyr
  DPLYR = summarise(group_by(Cars93, Type, Origin),
                    mean(Horsepower)),
  ## data.table
  DATATABLE = Cars93dt[, aggGroup1.2 := mean(Horsepower),
                       by = list(Type, Origin)],
  times = 1000L)

The output can now be visualized, as shown in the following screenshot:

m <- reshape2::melt(op, id="expr")
ggplot(m, aes(x=expr, y=value)) +
 geom_boxplot() +
 coord_trans(y = "log10") +
 xlab(NULL) + ylab("computation time") +
 theme(axis.text.x = element_text(angle=45))

We can see that dplyr and data.table are no faster than the other approaches on this small data set; even the dirty for loops are almost as fast.

But we get a very different picture with large data sets:

library(laeken); data(eusilc)
eusilc <- do.call(rbind,
 list(eusilc,eusilc,eusilc,eusilc,eusilc,eusilc,eusilc))
eusilc <- do.call(rbind,
 list(eusilc,eusilc,eusilc,eusilc,eusilc,eusilc,eusilc))
dim(eusilc)
## [1] 726523 28
eusilcdt <- data.table(eusilc)
setkeyv(eusilcdt, c('hsize','db040'))

op <- microbenchmark(
  MYFUN1 = myfun1(x=eusilc, gr1="hsize", gr2="db040",
                  num="eqIncome"),
  MYFUN2 = myfun2(x=eusilc, gr1="hsize", gr2="db040",
                  num="eqIncome"),
  PLYR = ddply(eusilc, .(hsize, db040), summarise,
               output = mean(eqIncome)),
  AGGR = aggregate(eqIncome ~ hsize + db040, eusilc, mean),
  BY = by(eusilc$eqIncome, list(eusilc$hsize, eusilc$db040), mean),
  SUMMARIZE = summarize(eusilc$eqIncome,
                        llist(eusilc$hsize, eusilc$db040), mean),
  TAPPLY = tapply(eusilc$eqIncome,
                  interaction(eusilc$hsize, eusilc$db040), mean),
  DPLYR = summarise(group_by(eusilc, hsize, db040),
                    mean(eqIncome)),
  DATATABLE = eusilcdt[, mean(eqIncome), by = .(hsize, db040)],
  times = 10)

Again, we plot the results, as shown in the following screenshot:

m <- reshape2::melt(op, id="expr")
ggplot(m, aes(x=expr, y=value)) +
 geom_boxplot() + 
 coord_trans(y = "log10") +
 xlab(NULL) + ylab("computation time") +
 theme(axis.text.x = element_text(angle=45, vjust=1))

We can now observe that data.table and dplyr are much faster than the other methods (note that the plot uses a log scale!).

We can further profile the functions. For aggregate, for example, we see that the calls to gsub and prettyNum take the most computation time:

Rprof("aggr.out")
a <- aggregate(eqIncome ~ hsize + db040, eusilc, mean)
Rprof(NULL)
summaryRprof("aggr.out")$by.self
## self.time self.pct total.time total.pct
## "gsub" 0.52 48.15 0.68 62.96
## "prettyNum" 0.16 14.81 0.16 14.81
## "<Anonymous>" 0.10 9.26 1.08 100.00
## "na.omit.data.frame" 0.06 5.56 0.12 11.11
## "match" 0.06 5.56 0.08 7.41
## "anyDuplicated.default" 0.06 5.56 0.06 5.56
## "[.data.frame" 0.04 3.70 0.18 16.67
## "NextMethod" 0.04 3.70 0.04 3.70
## "split.default" 0.02 1.85 
 0.04 3.70
## "unique.default" 0.02 1.85 0.02 1.85

Parallel computing

Parallel computing is helpful, especially for simulation tasks, since most simulation runs can be done independently of each other. Several functions and packages for parallelization are available in R. For this example, we will show only one simple package, the package snow (Tierney et al., 2015). On Linux and OS X, one can also use the fork-based functions of the package parallel, while foreach works on all platforms.
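As a minimal sketch (assuming a Unix-like system, since forking is not available on Windows), the function mclapply from the package parallel distributes independent tasks over cores:

library(parallel)
## four independent simulation runs, distributed over two cores
res <- mclapply(1:4, function(i) mean(rnorm(10^6)), mc.cores = 2)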

Again, the Cars93 data is used. We want to estimate the correlation coefficient between Price and Horsepower using a robust method (the minimum covariance determinant, MCD), and in addition we want to estimate the corresponding confidence interval with a bootstrap. For the theory of the bootstrap, refer to Chapter 3, The Discrepancy Between Pencil-Driven Theory and Data-Driven Computational Solutions. Basically, we draw bootstrap samples (using sample()) and calculate the robust correlation for each bootstrap sample. From the R bootstrap results, we take certain quantiles, which then determine the confidence interval:

R <- 10000
library(robustbase)
covMcd(Cars93[, c("Price", "Horsepower")], cor = TRUE)$cor[1,2]
## [1] 0.8447
## confidence interval:
n <- nrow(Cars93)
f <- function(R, ...){
  replicate(R, covMcd(Cars93[sample(1:n, replace = TRUE),
                             c("Price", "Horsepower")], cor = TRUE)$cor[1,2])
}
system.time(ci <- f(R))
## user system elapsed 
## 79.056 0.265 79.597
quantile(ci, c(0.025, 0.975))
## 2.5% 97.5%
## 0.7690 0.9504

The aim is now to parallelize this calculation. We will load the package snow and make a cluster with three worker nodes. Note that you can use more workers if you have more CPUs on your machine; as a maximum, you should use the number of CPUs you have minus one:

library("snow")
cl <- makeCluster(3, type="SOCK")

We now need to make the package robustbase, the data, the function f, and the object n available on all nodes:

clusterEvalQ(cl, library("robustbase"))
clusterEvalQ(cl, data(Cars93, package = "MASS"))
clusterExport(cl, "f")
clusterExport(cl, "n")

We can also set up a random seed (a parallel random number stream) for the cluster:

clusterSetupRNG(cl, seed=123)
## Loading required namespace: rlecuyer
## [1] "RNGstream"

With clusterCall, we run f on all nodes in parallel, each node computing a third of the bootstrap replicates:

system.time(ci_boot <-
 clusterCall(cl, f, R = round(R / 3)))
## user system elapsed
## 0.001 0.000 38.655
quantile(unlist(ci_boot), c(0.025, 0.975))
## 2.5% 97.5%
## 0.7715 0.9512

We see that we are approximately twice as fast, and we could be even faster if we had more CPUs available.

Finally, we want to stop the cluster:

stopCluster(cl)

Interfaces to C++

Interfaces to C++ are recommended in order to make certain loops faster. We will show a very simple example that calculates the weighted mean. It should highlight the possibilities and give readers a first contact with the package Rcpp (Eddelbuettel and Francois, 2011; Eddelbuettel, 2013), which dramatically simplifies the use of C++ code compared with R's .Call function.

Two great communicators of R, Hadley Wickham and Romain François, used this example in their tutorials.

We want to compare the runtime of a pure interpreted R implementation and an Rcpp implementation for calculating the weighted mean of a vector.

A naive R function, using only interpreted R code, could look like this:

wmeanR <- function(x, w) {
  total <- 0
  total_w <- 0
  for (i in seq_along(x)) {
    total <- total + x[i] * w[i]
    total_w <- total_w + w[i]
  }
  total / total_w
}
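A quick check on a small vector: the weighted mean of 1, 2, 3 with weights 1, 1, 2 is (1 + 2 + 6) / 4 = 2.25:

wmeanR(c(1, 2, 3), c(1, 1, 2))
## [1] 2.25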

There is also a function called weighted.mean available in the base installation of R, and weightedMean in package laeken (Alfons and Templ, 2013).

Let us also define the Rcpp function. The function cppFunction compiles and links a shared library, then internally defines an R function that uses .Call:

library("Rcpp")
## from 
## http://blog.revolutionanalytics.com/2013/07/deepen-your-r-experience-with-rcpp.html 
cppFunction('
  double wmean(NumericVector x, NumericVector w) {
    int n = x.size();
    double total = 0, total_w = 0;
    for(int i = 0; i < n; ++i) {
      total += x[i] * w[i];
      total_w += w[i];
    }
    return total / total_w;
  }
')
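Before benchmarking, we can verify that the compiled function agrees with the pure R version on the small example from above:

wmean(c(1, 2, 3), c(1, 1, 2))  ## 2.25, identical to wmeanR()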

Now, let's compare the methods:

x <- rnorm(100000000)
w <- rnorm(100000000)
library("laeken")
op <- microbenchmark(
 naiveR = wmeanR(x, w),
 weighted.mean = weighted.mean(x, w),
 weighedMean = weightedMean(x, w),
 Rcpp.wmean = wmean(x, w), 
 times = 1
)
## Warning in weightedMean(x, w): negative weights
op
## Unit: milliseconds
## expr min lq mean median uq max neval
## naiveR 92561 92561 92561 92561 92561 92561 1
## weighted.mean 5628 5628 5628 5628 5628 5628 1
## weighedMean 4007 4007 4007 4007 4007 4007 1
## Rcpp.wmean 125 125 125 125 125 125 1

Again, we draw a plot to visualize the results, as shown in the following screenshot:

m <- reshape2::melt(op, id="expr")
ggplot(m, aes(x=expr, y=value)) +
  geom_boxplot() +
  coord_trans(y = "log10") +
  xlab(NULL) + ylab("computation time") +
  theme(axis.text.x = element_text(angle=45, vjust=1))