百人牛牛玩法说明

書名： Simulation for Data Science with R
作者名： Matthias Templ
本章字數： 1243字
更新時間： 2021-07-14 11:17:07

Data manipulation in R

For students working with perfectly prepared data from various R packages on relatively small scale problems, data manipulation is not the big issue. However, in the daily practice of a data scientist, most of the time working on data analysis does not involve applying a suitable function to an already perfectly prepared piece of data. The majority of work is done on data manipulation, in order to collect data from several sources, shape the data into a suitable format, and extract the relevant information. Thus, data manipulation is the core work, and data scientists and statisticians should possess strong data manipulation skills.

Whenever you work with data frames, the package dplyr provides user-friendly and computationally efficient code. One package that supports even more efficient data manipulation is the data.table package (Dowle et al., 2015). However, since both packages have their advantages, we report both. Also, data.table works with two dimensional data objects.

In any case, when working, for example, with arrays of dimensions larger than 2, apply is still the only choice. One example is phonological data, where for each phonological station, measurements for different species are given over time. Such data sets are typically stored as an array, in this case in a four-dimensional array.

Apply and friends with basic R

Using the apply family one can manipulate slices of data from matrices, arrays, lists, and data frames in a repetitive manner. These functions allow you to cross the data over certain dimensions and avoid the explicit use of for loops. Repetitively, functions are applied over all elements of a dimension.

Let's consider again the Cars93 data set. The data set consists of rows (first dimension) and columns (second dimension). To apply a function over the second dimension of this data set, for example, to calculate the number of missing values over the columns, the call looks like the following:

## function to be applied afterwards
func <- function(x){ 
 return(sum(is.na(x)))
}
## apply func on all columns (second dimension (2))
## and store it to na
na <- apply(X = Cars93, MARGIN = 2, FUN = func)
## show those with missings
na[ na > 0 ]
## Rear.seat.room Luggage.room
## 2 11

Here, X must be an array (note that a matrix is a 2-dimensional array), MARGIN is the dimension where FUN is applied.

Note

This apply call is basically the same as the for loop.

p <- ncol(Cars93)
na_for <- numeric(p)
for(i in 1:p){
 na_for[i] <- func(Cars93[, i])
}

identical(as.numeric(na), na_for)
## [1] TRUE

While one can always use for loops, the code from apply is much shorter.

When a given function should be applied on a list (note that a data.frame is also internally treated as a list), lapply might be your friend. The output returned is also a list that has the same number of elements as the object passed to it.

In a previous code call, we assigned a list output to an object called m:

m <- robCompositions::missPatterns(Cars93)
class(m)
## [1] "list"

We apply a function, for example, length(), to all elements of a list by using lapply to access the length of each list element:

lapply(m, length)
## $groups
## [1] 3
##
## $cn
## [1] 3
##
## $tabcomb
## [1] 81
##
## $tabcombPlus
## [1] 28
##
## $rsum
## [1] 93

sapply works basically as lapply, but simplifies the output if possible. For example:

s <- sapply(m, length)
is no longer a list (as for lapply) but a vector of integers.
s
## groups cn tabcomb tabcombPlus rsum
## 3 3 81 28 93
class(s)
## [1] "integer"

The function aggregate is similar to apply. Its difference lies in the function argument, which allows it to subset the data set and apply a function on these subsets. Let us have a look at the function arguments:

args(aggregate)
## function (x, ...)
## NULL

Since we see only the arguments from the generic, but we want to apply the functions to the Cars93 data frame, we may look to see if there is a method for data frames implemented:

methods(aggregate)
## [1] aggregate.cv* aggregate.cvSelect* aggregate.cvTuning*
## [4] aggregate.data.frame aggregate.default* aggregate.formula*
## [7] aggregate.Spatial* aggregate.ts aggregate.zoo*
## see '?methods' for accessing help and source code
args(aggregate.data.frame)
## function (x, by, FUN, ..., simplify = TRUE)
## NULL

To make group-wise statistics, this function can now be applied on our example data, for example, to calculate the median Horsepower and Weight of cars for each cylinder class (Cylinders):

aggregate(Cars93[, c("Horsepower", "Weight")], by = list(Cars93$Cylinders), median)
## Group.1 Horsepower Weight
## 1 3 70.0 1965
## 2 4 110.0 2705
## 3 5 138.5 3602
## 4 6 170.0 3515
## 5 8 210.0 3935
## 6 rotary 255.0 2895

Similar functions to aggregate are by (another print output), summarize from package Hmisc (Harrell Jr, 2016), and summarize and group_by from package dplyr, discussed as follows.

Basic data manipulation with the dplyr package

The base R's functionality on data manipulation is nice, but in some situations the dplyr package is more intuitive and, more importantly, much faster than the base R data manipulation functions. Since this book is about data simulation, and it also discusses computer-intense methods, computational speed is very important, especially for larger simulations.

The package dplyr offers functions for:

Filtering of observations
Variable selection
Recoding
Grouping
Aggregation (in groups)
Note
Other useful packages such as reshape2, stringr, or lubridate are not covered in this book.
data.table is discussed later

Additional packages, such as the dplyr package, sometimes make life easier, and as previously mentioned, the calculation time can be considerable faster.

Some of the steps in data management can be abstracted. Such tasks include: selection of rows or columns — ordering of data — recoding, grouping, and aggregation.

Here are some further reasons for an additional package such as dplyr:

Only a few important keywords to remember
Consistency
Works with different inputs
data.frame, data.tables, sqlite
Simple (but new) syntax
Less code, less error
From now on in this section the following applies (since this is the dplyr language): a column corresponds to a variable and a line corresponds to a observation

First, the package must be loaded (and once to be installed):

library("dplyr")

Note

Some vignettes (short instructions) are available, see help(pa = "dplyr").

dplyr – creating a local data frame

A local data frame can be created using tbl_df().

Why do we need this? Because it offers more efficient print outputs and no chance of accidentally printing huge data sets, which can lead to memory problems or long waiting time.

Remember, Cars93 is a data.frame:

class (Cars93)
## [1] "data.frame"

We then convert to a local data frame for dplyr and look at the new print output that is done by dplyr:

Cars93 <- tbl_df(Cars93)
class(Cars93)
## [1] "tbl_df" "tbl" "data.frame"

## print(Cars93) # output suppressed

dplyr – selecting lines

Using the function slice(), one can select rows according to their line number:

slice(Cars93, 1) # first line, output suppressed

You can also select multiple rows at once.

Note that c() creates a vector from the input numbers, and function n() returns the number of observations (lines). We will select the 1,4,10,15 and the last line of the data:

slice (Cars93, c(1,4,10,15, n ()))
## Source: local data frame [5 x 27]
##
## Manufacturer Model Type Min.Price Price Max.Price MPG.city
## (fctr) (fctr) (fctr) (dbl) (dbl) (dbl) (int)
## 1 Acura Integra Small 12.9 15.9 18.8 25
## 2 Audi 100 Midsize 30.8 37.7 44.6 19
## 3 Cadillac DeVille Large 33.0 34.7 36.3 16
## 4 Chevrolet Lumina Midsize 13.4 15.9 18.4 21
## 5 Volvo 850 Midsize 24.8 26.7 28.5 20
## Variables not shown: MPG.highway (int), AirBags (fctr), DriveTrain (fctr),
## Cylinders (fctr), EngineSize (dbl), Horsepower (int), RPM (int),
## Rev.per.mile (int), Man.trans.avail (fctr), Fuel.tank.capacity (dbl),
## Passengers (int), Length (int), Wheelbase (int), Width (int),
## Turn.circle (int), Rear.seat.room (dbl), Luggage.room (int), Weight
## (int), Origin (fctr), Make (fctr)

The function filter() can select rows that satisfy a condition.

Example, all observations where variable Manufacturer == is Audi when at the same time the value of variable Min.Price is > 25:

filter(Cars93, Manufacturer == "Audi" & Min.Price > 25)
## Source: local data frame [2 x 27]
##
## Manufacturer Model Type Min.Price Price Max.Price MPG.city
## (fctr) (fctr) (fctr) (dbl) (dbl) (dbl) (int)
## 1 Audi 90 Compact 25.9 29.1 32.3 20
## 2 Audi 100 Midsize 30.8 37.7 44.6 19
## Variables not shown: MPG.highway (int), AirBags (fctr), DriveTrain (fctr),
## Cylinders (fctr), EngineSize (dbl), Horsepower (int), RPM (int),
## Rev.per.mile (int), Man.trans.avail (fctr), Fuel.tank.capacity (dbl),
## Passengers (int), Length (int), Wheelbase (int), Width (int),
## Turn.circle (int), Rear.seat.room (dbl), Luggage.room (int),
 Weight
## (int), Origin (fctr), Make (fctr)

dplyr – order

With arrange() you can sort the data by one or more variables. By default it is sorted in ascending order, with desc() descending:

Cars93 <- arrange (Cars93, Price)
Cars93 ## output suppressed

You can also sort by multiple variables:

head(arrange(Cars93, desc (MPG.city), Max.Price), 7)
## Source: local data frame [7 x 27]
##
## Manufacturer Model Type Min.Price Price Max.Price MPG.city
## (fctr) (fctr) (fctr) (dbl) (dbl) (dbl) (int)
## 1 Geo Metro Small 6.7 8.4 10.0 46
## 2 Honda Civic Small 8.4 12.1 15.8 42
## 3 Suzuki Swift Small 7.3 8.6 10.0 39
## 4 Subaru Justy Small 7.3 8.4 9.5 33
## 5 Toyota Tercel Small 7.8 9.8 11.8 32
## 6 Ford Festiva Small 6.9 7.4 7.9 31
## 7 Pontiac LeMans Small 8.2 9.0 9.9 31
## Variables not shown: MPG.highway (int), AirBags (fctr), DriveTrain (fctr),
## Cylinders (fctr), EngineSize (dbl), Horsepower (int), RPM (int),
## Rev.per.mile (int), Man.trans.avail (fctr), Fuel.tank.capacity (dbl),
## Passengers (int), Length (int), Wheelbase (int), Width (int),
## Turn.circle (int), Rear.seat.room (dbl), Luggage.room (int)
, Weight
## (int), Origin (fctr), Make (fctr)

dplyr – selecting columns

Function select() allows you to select variables from the data set:

head (select (Cars93, Manufacturer, Price), 3)
## Source: local data frame [3 x 2]
##
## Manufacturer Price
## (fctr) (dbl)
## 1 Ford 7.4
## 2 Hyundai 8.0
## 3 Mazda 8.3

For a sequence of variables, the operator : can be used:

head (select (Cars93, Manufacturer:Price), 3)
## Source: local data frame [3 x 5]
##
## Manufacturer Model Type Min.Price Price
## (fctr) (fctr) (fctr) (dbl) (dbl)
## 1 Ford Festiva Small 6.9 7.4
## 2 Hyundai Excel Small 6.8 8.0
## 3 Mazda 323 Small 7.4 8.3

Negative indexing is possible, while all variables with the letter prefix minus ( - ) are excluded:

select (Cars93, -Min.Price, -Max.Price) # output suppressed

Some functions are useful within select ():

starts_with()
ends_with()
contains()
matches()
num_range()**

For example:

head (select (Cars93, starts_with ("Man")), 3)
## Source: local data frame [3 x 2]
##
## Manufacturer Man.trans.avail
## (fctr) (fctr)
## 1 Ford Yes
## 2 Hyundai Yes
## 3 Mazda Yes
head (select (Cars93, contains ("Price")), 3)
## Source: local data frame [3 x 3]
##
## Min.Price Price Max.Price
## (dbl) (dbl) (dbl)
## 1 6.9 7.4 7.9
## 2 6.8 8.0 9.2
## 3 7.4 8.3 9.1

Both select() and rename() can be used to rename variables by simply using a new = old syntax. select() returns only the specified variables:

head (select (Cars93, myPrize = Price, Min.Price))
## Source: local data frame [6 x 2]
##
## myPrize Min.Price
## (dbl) (dbl)
## 1 7.4 6.9
## 2 8.0 6.8
## 3 8.3 7.4
## 4 8.4 
6.7
## 5 8.4 7.3
## 6 8.6 7.3

dplyr – uniqueness

Functionality distinct() can be used to keep only unique rows:

Cars93_1 <- select(Cars93, Manufacturer, EngineSize)
dim (Cars93_1)
## [1] 93 2
Cars93_1 <- distinct(Cars93_1)
dim (Cars93_1)
## [1] 79 2

By default, all variables are used to assess whether a row multiple occurs in the data set:

dim(Cars93)
## [1] 93 27
dim( distinct (Cars93, Manufacturer) )
## [1] 32 27
# based on two variables:
dim(distinct(Cars93, Manufacturer, EngineSize))
## [1] 79 27
# based on two variables, second is rounded:
dim(distinct(Cars93
, Manufacturer, rr=round(EngineSize)))
## [1] 57 28

dplyr – creating variables

With function mutate() one can add new variables and retains the old variables:

m <- mutate(Cars93, is_ford = Manufacturer == "Ford")
m[1:3, c(1,28)]
## Source: local data frame [3 x 2]
##
## Manufacturer is_ford
## (fctr) (lgl)
## 1 Ford TRUE
## 2 Hyundai FALSE
## 3 Mazda FALSE

Function transmute() retains only the listed variables, in this case it looks almost the same as the previous example. We do not show an output here:

transmute(Cars93, is_ford = Manufacturer == "Ford", Manufacturer)

Newly created variables can be used again in the same statement:

head (transmute(Cars93, Manufacturer, is_ford = Manufacturer == "Ford", num_ford = ifelse (is_ford, -1, 1)), 3)
## Source: local data frame [3 x 3]
##
## Manufacturer is_ford num_ford
## (fctr) (lgl) (dbl)
## 1 Ford TRUE -1
## 2 Hyundai
 FALSE 1
## 3 Mazda FALSE 1

dplyr – grouping and aggregates

One often wants to perform calculations in groups. Previously, we saw examples using the apply family of the base R package. For data frames, the dplyr package supports grouping. The syntax of the package dplyr is much more elegant than using the base R packages for grouping and aggregation. dplyr supports grouping with the function group_by(), which creates the subsets, and summarize(), which is used to calculate statistics that must provide exactly one number. Package dplyr provides additional, useful aggregation statistics such as the first_value(x), last_value(x), nth_value(x) of a variable.

Let us show grouping by using the variable Manufacturer and by calculating the group size, the minimum of the variable Prize and the maximum of the variable Prize:

by_type <- group_by (Cars93, Type)
summarize (by_type,
 count = n(),min_es = min(EngineSize),
 max_es = max(EngineSize)
)
## Source: local data frame [6 x 4]
##
## Type count min_es max_es
## (fctr) (int) (dbl) (dbl)
## 1 Compact 16 2.0 3.0
## 2 Large 11 3.3 5.7
## 3 Midsize 22 2.0 4.6
## 4 Small 21 1.0 2.2
## 5 Sporty 14 1.3 5.7
## 6 Van 9 2.4 4.3

Via group_by () functions are applied on defined groups. Note that dplyr supports the pipeline syntax from R package magrittr (Bache and Wickham, 2015). For the preceding call, one can also write:

Cars93 %>%
 group_by(Type) %>%
 summarize(count = n(), min_es = min(EngineSize), max_es = max(EngineSize) )
## output suppressed since equally to previous output

The pipeline operator can be interpreted for the preceding example: first the Cars93 data set is chosen and then group_by is applied to this data set. summarize is then applied to the result of group_by. So in general, it makes it possible to provide commands like in a pipe together. The output of the previous is first input into the following command. The commands are performed from left to right (in the direction of the arrow).

Note

arrange() and select() are independent of grouping.

Let's take another example, in order to report the first two observations per group:

by_type <- group_by(Cars93, Type)
slice (by_type, 1: 2)
## Source: local data frame [12 x 27]
## Groups: Type [6]
##
## Manufacturer Model Type Min.Price Price Max.Price MPG.city
## (fctr) (fctr) (fctr) (dbl) (dbl) (dbl) (int)
## 1 Pontiac Sunbird Compact 9.4 11.1 12.8 23
## 2 Ford Tempo Compact 10.4 11.3 12.2 22
## 3 Chrylser Concorde Large 18.4 18.4 18.4 20
## 4 Chevrolet Caprice Large 18.0 18.8 19.6 17
## 5 Hyundai Sonata Midsize 12.4 13.9 15.3 20
## 6 Mercury Cougar Midsize 14.9 14.9 14.9 19
## 7 Ford Festiva Small 6.9 7.4 7.9 31
## 8 Hyundai Excel Small 6.8 8.0 9.2 29
## 9 Hyundai Scoupe Sporty 9.1 10.0 11.0 26
## 10 Geo Storm Sporty 11.5 12.5 13.5 30
## 11 Chevrolet Lumina_APV Van 14.7 16.3 18.0 18
## 12 Chevrolet Astro Van 14.7 16.6 18.6 15
## Variables not shown: MPG.highway (int), AirBags (fctr), DriveTrain (fctr),
## Cylinders (fctr), EngineSize (dbl), Horsepower (int), RPM (int),
## Rev.per.mile (int), Man.trans.avail (fctr), Fuel.tank.capacity (dbl),
## Passengers (int), Length (int), Wheelbase (int), Width (int),
## Turn.circle (int), Rear.seat.room (dbl), Luggage.room (int), Weight
## (int), Origin (fctr), Make (fctr)

We have shown by example that dplyr provides a simple syntax. Again, the operator %>% syntax makes it even more easily readable:

## output suppressed since the same as above
Cars93 %>% group_by(Type) %>% slice(1:2)

Let's take another example. We want to compute a new variable EngineSize as the square of EngineSize, and for each group we want to compute the minimum of the new variable. In addition, the results should be sorted in descending order:

Cars93 %>% mutate(ES2 = EngineSize^2) %>% group_by(Type) %>%
summarize(min.ES2 = min(ES2)) %>% arrange(desc(min.ES2))
## Source: local data frame [6 x 2]
##
## Type min.ES2
## (fctr) (dbl)
## 1 Large 10.89
## 2 Van 5.76
## 3 Compact 4.00
## 4
 Midsize 4.00
## 5 Sporty 1.69
## 6 Small 1.00

dplyr – window functions

summarize() works for functions that return one single value. To make more complex aggregations, window functions can be used.

There are different types of window functions:

Ranking/ordering: row_number(), min_rank(), percent_rank(), and so on
Offsets: lag(), lead()
Cumulative functions: cumsum(), cummin(), cummax(), cummean(), and so on

Let's perform a simple example. Calculate the cumulative sum and average value within each group of Type:

Cars93 %>%
 group_by(Type) %>%
 arrange(Type) %>%
 select(Manufacturer:Price) %>%
 mutate(cmean = cummean(Price), csum = cumsum(Price))
## Source: local data frame [93 x 7]
## Groups: Type [6]
##
## Manufacturer Model Type Min.Price Price cmean csum
## (fctr) (fctr) (fctr) (dbl) (dbl) (dbl) (dbl)
## 1 Pontiac Sunbird Compact 9.4 11.1 11.10 11.1
## 2 Ford Tempo Compact 10.4 11.3 11.20 22.4
## 3 Chevrolet Corsica Compact 11.4 11.4 11.27 33.8
## 4 Dodge Spirit Compact 11.9 13.3 11.77 47.1
## 5 Chevrolet Cavalier Compact 8.5 13.4 12.10 60.5
## 6 Oldsmobile Achieva Compact 13.0 13.5 12.33 74.0
## 7 Nissan Altima Compact 13.0 15.7 12.81 89.7
## 8 Chrysler LeBaron Compact 14.5 15.8 13.19 105.5
## 9 Mazda 626 Compact 14.3 16.5 13.56 122.0
## 10 Honda Accord Compact 13.8 17.5 13.95 139.5
## .. ... ... ... ... ... ... ...

Data manipulation with the data.table package

The package data.table is not included in the base R installation and must be installed once. It allows very efficient aggregation of large data sets (for example, data with several gigabytes of memory), efficient merging (join) of several objects, adding and deletion of variables, and efficient importing of data sets (fread()). The syntax is easy to learn but it is different to the syntax of base R.

Let us first convert a data.frame to a data.table using function data.table. We again use the Cars93 data, and print the data table — the print output differs from base R and also from dyplyr. Note that each data.table is also a data.frame and both can be accessed as a list:

require(data.table)
Cars93 <- data.table(Cars93)
Cars93 ## print output suppressed

The utility function tables() lists all data.table objects in the memory and gives information on the dimension and needed memory for each data table:

tables()
## NAME NROW NCOL MB
## [1,] Cars93 93 27 1
## COLS
## [1,] Manufacturer,Model,Type,Min.Price,Price,Max.Price,MPG.city,MPG.highway,AirBags,D
## KEY
## [1,]
## Total: 1MB

data.table – variable construction

Using the $ operator, new variables can be constructed. As an example, we will create a new variable where the values are TRUE if the manufacturer is Ford:

Cars93$tmp1 <- Cars93[, j = Manufacturer == "Ford"]
We can modify a variable by the :=-syntax directly (a very nice feature!)
Cars93[, tmp2 := rnorm(nrow(Cars93))]

Note that these modifications are done by-reference — no copy of the data is needed internally.

To delete variables, one of the two following possibilities can be used:

Cars93[, tmp1:=NULL]
Cars93$tmp2 <- NULL

data.table – indexing or subsetting

The indexing is done differently than in base R. Two parameters are used, i: for the rows of the data.table and j: for the columns of the data.table.

We use [] as an indexing operator, but it works slightly differently. j is an expression in the scope of the actual object. Using with=FALSE: j is evaluated as a vector of names or numbers.

Let us extract rows. We suppress the output in the following code listing to avoid filling the book with output from the Cars93 data:

Cars93[i = 2] # second row, all columns
Cars93[i = c(1,5)] # first and fifth row, all columns
Cars93[i = -c(1:5)] # exclude the first five rows

Now let us extract columns. See the details on the following code listing:

Cars93[j = 3] # this does not work since 3 evaluates to 3
## [1] 3
Cars93[j = "Price"] # extract "Price" does not work since "Price" evaluates to "Price"
## [1] "Price"
Cars93[j = Price] # this works, since variable Price exists in the scope of Cars93
## [1] 7.4 8.0 8.3 8.4 8.4 8.6 9.0 9.1 9.2 9.8 10.0 10.0 10.1 10.3
## [15] 10.9 11.1 11.1 11.3 11.3 11.4 11.6 11.8 12.1 12.2 12.5 13.3 13.4 13.5
## [29] 13.9 14.0 14.1 14.4 14.9 15.1 15.6 15.7 15.7 15.8 15.9 15.9 15.9 16.3
## [43] 16.3 16.5 16.6 17.5 17.7 18.2 18.4 18.4 18.5 18.8 19.0 19.1 19.1 19.3
## [57] 19.5 19.5 19.7 19.8 19.9 20.0 20.2 20.7 20.8 20.9 21.5 22.7 22.7 23.3
## [71] 23.7 24.4 25.8 26.1 26.3 26.7 28.0 28.7 29.1 29.5 30.0 31.9 32.5 33.9
## [85] 34.3 34.7 35.2 36.1 37.7 38.0 40.1 47.9 61.9
Cars93[i=1:3, j = "Price", with = FALSE] # also works
## Price
## 1: 7.4
## 2: 8.0
## 3: 8.3

Indexing can also be done more sophisticatedly. For example, if we wanted to extract the first three rows, extract all variables, calculate a new variable that is the price range, or calculate the mean price, we could do the following:

Cars93[1:3, .(Price, Horsepower, Diff.Price = Max.Price - Min.Price, Mean.Price = mean(Price))]
## Price Horsepower Diff.Price Mean.Price
## 1: 7.4 63 1.0 7.9
## 2: 8.0 81 2.4 7.9
## 3: 8.3 82 1.7 7.9

Note

.() is short for list().

data.table – keys

data.table objects can be grouped according to a key. Based on such a key, calculations are very efficient. By using setkey(), a key can be set for a data.table:

setkey(Cars93, Type) # equally: setkeyv(dt, "x")

More than one key can be defined for a data.table. Now sorting is done automatically regarding this key. Actual key variables can be displayed with key():

key(Cars93)
## [1] "Type"

data.table – fast subsetting

By using keys, we can increase the performance of subsetting:

setkey(Cars93, Type)
Cars93["Van"] # all rows with Type == "Van" (output suppressed)

For more than two keys — for example, to extract all observations with Type equal to Van, DriveTrain equals 4WD and Origin equals non-USA — we can apply the following:

setkey(Cars93, Type, DriveTrain, Origin)
Cars93[.("Van", "4WD", "non-USA")]
## Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway
## 1: Mazda MPV Van 16.6 19.1 21.7 18 24
## 2: Toyota Previa Van 18.9 22.7 26.6 18 22
## AirBags DriveTrain Cylinders EngineSize Horsepower RPM
## 1: None 4WD 6 3.0 155 5000
## 2: Driver only 4WD 4 2.4 138 5000
## Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length
## 1: 2240 No 19.6 7 190
## 2: 2515 Yes 19.8 7 187
## Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight Origin
## 1: 110 72 39 27.5 NA 3735 non-USA
## 2: 113 71 41 35.0 NA 3785 non-USA
## Make
## 1: Mazda MPV
## 2: Toyota Previa

Let's compare efficiency on a data set with characters. We use the microbenchmark package (Mersmann 2015) for this purpose. We see that data.table is more than 60 times faster than base R, and in this case dplyr is the slowest:

require(microbenchmark)
N <- 1000000
dat<- data.table(
 x=sample(LETTERS[1:20], N, replace=TRUE),
 y=sample(letters[1:5], N, replace=TRUE))
head(dat, 3)
## x y
## 1: M a
## 2: B a
## 3: I e
setkey(dat, x,y)

microbenchmark(
 data.table = dat[list(c("B", "D"), c("b", "d"))],
 dplyr = dat %>% slice(x %in% c("B", "D") & y %in% c("b", "d")),
 baseR = dat[x %in% c("B", "D") & y %in% c("b", "d")]
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## data.table 1.13 1.276 1.571 1.351 1.487 7.382 100
## dplyr 70.27 80.049 96.965 83.216 88.785 267.583 100
## baseR 69.65 76.685 95.777 82.128 87.722 281.426 100

data.table – calculations in groups

We can do calculations in groups by using by. In the following example, we will calculate the arithmetic mean price, the interquartile price range, and the median price:

Cars93[, .(mean = mean(Price), IQR = IQR(Price), median = median(Price)), by = Type]
## Type mean IQR median
## 1: Compact 18.21 7.30 16.15
## 2: Large 24.30 6.95 20.90
## 3: Midsize 27.22 17.42 26.20
## 4: Small 10.17 2.70 10.00
## 5: Sporty 19.39 8.25 16.80
## 6: Van 19.10 0.70 19.10

There is further functionality in the data.table package. See .SD to apply functions to more than one variable, .N for the number of elements in each group, and merge to efficiently join data sets.

官术网_书友最值得收藏!

Simulation for Data Science with R

Data manipulation in R

Apply and friends with basic R

Note

Basic data manipulation with the dplyr package

Note

Note

dplyr – creating a local data frame

dplyr – selecting lines

dplyr – order

dplyr – selecting columns

dplyr – uniqueness

dplyr – creating variables

dplyr – grouping and aggregates

Note

dplyr – window functions

Data manipulation with the data.table package

data.table – variable construction

data.table – indexing or subsetting

Note

data.table – keys

data.table – fast subsetting

data.table – calculations in groups