- Simulation for Data Science with R
- Matthias Templ
- 1243字
- 2021-07-14 11:17:07
Data manipulation in R
For students working with perfectly prepared data from various R packages on relatively small scale problems, data manipulation is not the big issue. However, in the daily practice of a data scientist, most of the time working on data analysis does not involve applying a suitable function to an already perfectly prepared piece of data. The majority of work is done on data manipulation, in order to collect data from several sources, shape the data into a suitable format, and extract the relevant information. Thus, data manipulation is the core work, and data scientists and statisticians should possess strong data manipulation skills.
Whenever you work with data frames, the package dplyr
provides user-friendly and computationally efficient code. One package that supports even more efficient data manipulation is the data.table
package (Dowle et al., 2015). However, since both packages have their advantages, we report both. Also, data.table
works with two dimensional data objects.
In any case, when working, for example, with arrays of dimensions larger than 2, apply
is still the only choice. One example is phonological data, where for each phonological station, measurements for different species are given over time. Such data sets are typically stored as an array, in this case in a four-dimensional array.
Apply and friends with basic R
Using the apply
family one can manipulate slices of data from matrices, arrays, lists, and data frames in a repetitive manner. These functions allow you to cross the data over certain dimensions and avoid the explicit use of for
loops. Repetitively, functions are applied over all elements of a dimension.
Let's consider again the Cars93
data set. The data set consists of rows (first dimension) and columns (second dimension). To apply a function over the second dimension of this data set, for example, to calculate the number of missing values over the columns, the call looks like the following:
## function to be applied afterwards func <- function(x){ return(sum(is.na(x))) } ## apply func on all columns (second dimension (2)) ## and store it to na na <- apply(X = Cars93, MARGIN = 2, FUN = func) ## show those with missings na[ na > 0 ] ## Rear.seat.room Luggage.room ## 2 11
Here, X
must be an array (note that a matrix is a 2-dimensional array), MARGIN
is the dimension where FUN
is applied.
Note
This apply
call is basically the same as the for
loop.
p <- ncol(Cars93) na_for <- numeric(p) for(i in 1:p){ na_for[i] <- func(Cars93[, i]) } identical(as.numeric(na), na_for) ## [1] TRUE
While one can always use for
loops, the code from apply is much shorter.
When a given function should be applied on a list (note that a data.frame
is also internally treated as a list), lapply
might be your friend. The output returned is also a list that has the same number of elements as the object passed to it.
In a previous code call, we assigned a list output to an object called m
:
m <- robCompositions::missPatterns(Cars93) class(m) ## [1] "list"
We apply a function, for example, length()
, to all elements of a list by using lapply
to access the length of each list element:
lapply(m, length) ## $groups ## [1] 3 ## ## $cn ## [1] 3 ## ## $tabcomb ## [1] 81 ## ## $tabcombPlus ## [1] 28 ## ## $rsum ## [1] 93
sapply
works basically as lapply
, but simplifies the output if possible. For example:
s <- sapply(m, length) is no longer a list (as for lapply) but a vector of integers. s ## groups cn tabcomb tabcombPlus rsum ## 3 3 81 28 93 class(s) ## [1] "integer"
The function aggregate
is similar to apply
. Its difference lies in the function argument, which allows it to subset the data set and apply a function on these subsets. Let us have a look at the function arguments:
args(aggregate) ## function (x, ...) ## NULL
Since we see only the arguments from the generic, but we want to apply the functions to the Cars93
data frame, we may look to see if there is a method for data frames implemented:
methods(aggregate) ## [1] aggregate.cv* aggregate.cvSelect* aggregate.cvTuning* ## [4] aggregate.data.frame aggregate.default* aggregate.formula* ## [7] aggregate.Spatial* aggregate.ts aggregate.zoo* ## see '?methods' for accessing help and source code args(aggregate.data.frame) ## function (x, by, FUN, ..., simplify = TRUE) ## NULL
To make group-wise statistics, this function can now be applied on our example data, for example, to calculate the median Horsepower
and Weight
of cars for each cylinder class (Cylinders
):
aggregate(Cars93[, c("Horsepower", "Weight")], by = list(Cars93$Cylinders), median) ## Group.1 Horsepower Weight ## 1 3 70.0 1965 ## 2 4 110.0 2705 ## 3 5 138.5 3602 ## 4 6 170.0 3515 ## 5 8 210.0 3935 ## 6 rotary 255.0 2895
Similar functions to aggregate are by
(another print output), summarize
from package Hmisc
(Harrell Jr, 2016), and summarize
and group_by
from package dplyr
, discussed as follows.
Basic data manipulation with the dplyr package
The base R's functionality on data manipulation is nice, but in some situations the dplyr
package is more intuitive and, more importantly, much faster than the base R data manipulation functions. Since this book is about data simulation, and it also discusses computer-intense methods, computational speed is very important, especially for larger simulations.
The package dplyr
offers functions for:
- Filtering of observations
- Variable selection
- Recoding
- Grouping
- Aggregation (in groups)
Note
Other useful packages such as
reshape2
,stringr
, orlubridate
are not covered in this book. data.table
is discussed later
Additional packages, such as the dplyr
package, sometimes make life easier, and as previously mentioned, the calculation time can be considerable faster.
Some of the steps in data management can be abstracted. Such tasks include: selection of rows or columns — ordering of data — recoding, grouping, and aggregation.
Here are some further reasons for an additional package such as dplyr
:
- Only a few important keywords to remember
- Consistency
- Works with different inputs
data.frame
,data.tables
,sqlite
- Simple (but new) syntax
- Less code, less error
- From now on in this section the following applies (since this is the
dplyr
language): a column corresponds to a variable and a line corresponds to a observation
First, the package must be loaded (and once to be installed):
library("dplyr")
Note
Some vignettes (short instructions) are available, see help(pa = "dplyr")
.
dplyr – creating a local data frame
A local data frame can be created using tbl_df()
.
Why do we need this? Because it offers more efficient print outputs and no chance of accidentally printing huge data sets, which can lead to memory problems or long waiting time.
Remember, Cars93
is a data.frame:
class (Cars93) ## [1] "data.frame"
We then convert to a local data frame for dplyr
and look at the new print output that is done by dplyr
:
Cars93 <- tbl_df(Cars93) class(Cars93) ## [1] "tbl_df" "tbl" "data.frame" ## print(Cars93) # output suppressed
dplyr – selecting lines
Using the function slice(),
one can select rows according to their line number:
slice(Cars93, 1) # first line, output suppressed
You can also select multiple rows at once.
Note that c()
creates a vector from the input numbers, and function n()
returns the number of observations (lines). We will select the 1,4,10,15
and the last line of the data:
slice (Cars93, c(1,4,10,15, n ())) ## Source: local data frame [5 x 27] ## ## Manufacturer Model Type Min.Price Price Max.Price MPG.city ## (fctr) (fctr) (fctr) (dbl) (dbl) (dbl) (int) ## 1 Acura Integra Small 12.9 15.9 18.8 25 ## 2 Audi 100 Midsize 30.8 37.7 44.6 19 ## 3 Cadillac DeVille Large 33.0 34.7 36.3 16 ## 4 Chevrolet Lumina Midsize 13.4 15.9 18.4 21 ## 5 Volvo 850 Midsize 24.8 26.7 28.5 20 ## Variables not shown: MPG.highway (int), AirBags (fctr), DriveTrain (fctr), ## Cylinders (fctr), EngineSize (dbl), Horsepower (int), RPM (int), ## Rev.per.mile (int), Man.trans.avail (fctr), Fuel.tank.capacity (dbl), ## Passengers (int), Length (int), Wheelbase (int), Width (int), ## Turn.circle (int), Rear.seat.room (dbl), Luggage.room (int), Weight ## (int), Origin (fctr), Make (fctr)
The function filter()
can select rows that satisfy a condition.
Example, all observations where variable Manufacturer == is Audi
when at the same time the value of variable Min.Price
is > 25
:
filter(Cars93, Manufacturer == "Audi" & Min.Price > 25) ## Source: local data frame [2 x 27] ## ## Manufacturer Model Type Min.Price Price Max.Price MPG.city ## (fctr) (fctr) (fctr) (dbl) (dbl) (dbl) (int) ## 1 Audi 90 Compact 25.9 29.1 32.3 20 ## 2 Audi 100 Midsize 30.8 37.7 44.6 19 ## Variables not shown: MPG.highway (int), AirBags (fctr), DriveTrain (fctr), ## Cylinders (fctr), EngineSize (dbl), Horsepower (int), RPM (int), ## Rev.per.mile (int), Man.trans.avail (fctr), Fuel.tank.capacity (dbl), ## Passengers (int), Length (int), Wheelbase (int), Width (int), ## Turn.circle (int), Rear.seat.room (dbl), Luggage.room (int), Weight ## (int), Origin (fctr), Make (fctr)
dplyr – order
With arrange()
you can sort the data by one or more variables. By default it is sorted in ascending order, with desc()
descending:
Cars93 <- arrange (Cars93, Price) Cars93 ## output suppressed
You can also sort by multiple variables:
head(arrange(Cars93, desc (MPG.city), Max.Price), 7) ## Source: local data frame [7 x 27] ## ## Manufacturer Model Type Min.Price Price Max.Price MPG.city ## (fctr) (fctr) (fctr) (dbl) (dbl) (dbl) (int) ## 1 Geo Metro Small 6.7 8.4 10.0 46 ## 2 Honda Civic Small 8.4 12.1 15.8 42 ## 3 Suzuki Swift Small 7.3 8.6 10.0 39 ## 4 Subaru Justy Small 7.3 8.4 9.5 33 ## 5 Toyota Tercel Small 7.8 9.8 11.8 32 ## 6 Ford Festiva Small 6.9 7.4 7.9 31 ## 7 Pontiac LeMans Small 8.2 9.0 9.9 31 ## Variables not shown: MPG.highway (int), AirBags (fctr), DriveTrain (fctr), ## Cylinders (fctr), EngineSize (dbl), Horsepower (int), RPM (int), ## Rev.per.mile (int), Man.trans.avail (fctr), Fuel.tank.capacity (dbl), ## Passengers (int), Length (int), Wheelbase (int), Width (int), ## Turn.circle (int), Rear.seat.room (dbl), Luggage.room (int) , Weight ## (int), Origin (fctr), Make (fctr)
dplyr – selecting columns
Function select()
allows you to select variables from the data set:
head (select (Cars93, Manufacturer, Price), 3) ## Source: local data frame [3 x 2] ## ## Manufacturer Price ## (fctr) (dbl) ## 1 Ford 7.4 ## 2 Hyundai 8.0 ## 3 Mazda 8.3
For a sequence of variables, the operator :
can be used:
head (select (Cars93, Manufacturer:Price), 3) ## Source: local data frame [3 x 5] ## ## Manufacturer Model Type Min.Price Price ## (fctr) (fctr) (fctr) (dbl) (dbl) ## 1 Ford Festiva Small 6.9 7.4 ## 2 Hyundai Excel Small 6.8 8.0 ## 3 Mazda 323 Small 7.4 8.3
Negative indexing is possible, while all variables with the letter prefix minus ( -
) are excluded:
select (Cars93, -Min.Price, -Max.Price) # output suppressed
Some functions are useful within select ()
:
starts_with()
ends_with()
contains()
matches()
num_range()**
For example:
head (select (Cars93, starts_with ("Man")), 3) ## Source: local data frame [3 x 2] ## ## Manufacturer Man.trans.avail ## (fctr) (fctr) ## 1 Ford Yes ## 2 Hyundai Yes ## 3 Mazda Yes head (select (Cars93, contains ("Price")), 3) ## Source: local data frame [3 x 3] ## ## Min.Price Price Max.Price ## (dbl) (dbl) (dbl) ## 1 6.9 7.4 7.9 ## 2 6.8 8.0 9.2 ## 3 7.4 8.3 9.1
Both select()
and rename()
can be used to rename variables by simply using a new = old
syntax. select()
returns only the specified variables:
head (select (Cars93, myPrize = Price, Min.Price)) ## Source: local data frame [6 x 2] ## ## myPrize Min.Price ## (dbl) (dbl) ## 1 7.4 6.9 ## 2 8.0 6.8 ## 3 8.3 7.4 ## 4 8.4 6.7 ## 5 8.4 7.3 ## 6 8.6 7.3
dplyr – uniqueness
Functionality distinct()
can be used to keep only unique rows:
Cars93_1 <- select(Cars93, Manufacturer, EngineSize) dim (Cars93_1) ## [1] 93 2 Cars93_1 <- distinct(Cars93_1) dim (Cars93_1) ## [1] 79 2
By default, all variables are used to assess whether a row multiple occurs in the data set:
dim(Cars93) ## [1] 93 27 dim( distinct (Cars93, Manufacturer) ) ## [1] 32 27 # based on two variables: dim(distinct(Cars93, Manufacturer, EngineSize)) ## [1] 79 27 # based on two variables, second is rounded: dim(distinct(Cars93 , Manufacturer, rr=round(EngineSize))) ## [1] 57 28
dplyr – creating variables
With function mutate()
one can add new variables and retains the old variables:
m <- mutate(Cars93, is_ford = Manufacturer == "Ford") m[1:3, c(1,28)] ## Source: local data frame [3 x 2] ## ## Manufacturer is_ford ## (fctr) (lgl) ## 1 Ford TRUE ## 2 Hyundai FALSE ## 3 Mazda FALSE
Function transmute()
retains only the listed variables, in this case it looks almost the same as the previous example. We do not show an output here:
transmute(Cars93, is_ford = Manufacturer == "Ford", Manufacturer)
Newly created variables can be used again in the same statement:
head (transmute(Cars93, Manufacturer, is_ford = Manufacturer == "Ford", num_ford = ifelse (is_ford, -1, 1)), 3) ## Source: local data frame [3 x 3] ## ## Manufacturer is_ford num_ford ## (fctr) (lgl) (dbl) ## 1 Ford TRUE -1 ## 2 Hyundai FALSE 1 ## 3 Mazda FALSE 1
dplyr – grouping and aggregates
One often wants to perform calculations in groups. Previously, we saw examples using the apply
family of the base R package. For data frames, the dplyr
package supports grouping. The syntax of the package dplyr
is much more elegant than using the base R packages for grouping and aggregation. dplyr
supports grouping with the function group_by()
, which creates the subsets, and summarize()
, which is used to calculate statistics that must provide exactly one number. Package dplyr
provides additional, useful aggregation statistics such as the first_value(x)
, last_value(x)
, nth_value(x)
of a variable.
Let us show grouping by using the variable Manufacturer
and by calculating the group size, the minimum of the variable Prize
and the maximum of the variable Prize
:
by_type <- group_by (Cars93, Type) summarize (by_type, count = n(),min_es = min(EngineSize), max_es = max(EngineSize) ) ## Source: local data frame [6 x 4] ## ## Type count min_es max_es ## (fctr) (int) (dbl) (dbl) ## 1 Compact 16 2.0 3.0 ## 2 Large 11 3.3 5.7 ## 3 Midsize 22 2.0 4.6 ## 4 Small 21 1.0 2.2 ## 5 Sporty 14 1.3 5.7 ## 6 Van 9 2.4 4.3
Via group_by ()
functions are applied on defined groups. Note that dplyr
supports the pipeline syntax from R package magrittr
(Bache and Wickham, 2015). For the preceding call, one can also write:
Cars93 %>% group_by(Type) %>% summarize(count = n(), min_es = min(EngineSize), max_es = max(EngineSize) ) ## output suppressed since equally to previous output
The pipeline operator can be interpreted for the preceding example: first the Cars93
data set is chosen and then group_by
is applied to this data set. summarize
is then applied to the result of group_by
. So in general, it makes it possible to provide commands like in a pipe together. The output of the previous is first input into the following command. The commands are performed from left to right (in the direction of the arrow).
Note
arrange()
and select()
are independent of grouping.
Let's take another example, in order to report the first two observations per group:
by_type <- group_by(Cars93, Type) slice (by_type, 1: 2) ## Source: local data frame [12 x 27] ## Groups: Type [6] ## ## Manufacturer Model Type Min.Price Price Max.Price MPG.city ## (fctr) (fctr) (fctr) (dbl) (dbl) (dbl) (int) ## 1 Pontiac Sunbird Compact 9.4 11.1 12.8 23 ## 2 Ford Tempo Compact 10.4 11.3 12.2 22 ## 3 Chrylser Concorde Large 18.4 18.4 18.4 20 ## 4 Chevrolet Caprice Large 18.0 18.8 19.6 17 ## 5 Hyundai Sonata Midsize 12.4 13.9 15.3 20 ## 6 Mercury Cougar Midsize 14.9 14.9 14.9 19 ## 7 Ford Festiva Small 6.9 7.4 7.9 31 ## 8 Hyundai Excel Small 6.8 8.0 9.2 29 ## 9 Hyundai Scoupe Sporty 9.1 10.0 11.0 26 ## 10 Geo Storm Sporty 11.5 12.5 13.5 30 ## 11 Chevrolet Lumina_APV Van 14.7 16.3 18.0 18 ## 12 Chevrolet Astro Van 14.7 16.6 18.6 15 ## Variables not shown: MPG.highway (int), AirBags (fctr), DriveTrain (fctr), ## Cylinders (fctr), EngineSize (dbl), Horsepower (int), RPM (int), ## Rev.per.mile (int), Man.trans.avail (fctr), Fuel.tank.capacity (dbl), ## Passengers (int), Length (int), Wheelbase (int), Width (int), ## Turn.circle (int), Rear.seat.room (dbl), Luggage.room (int), Weight ## (int), Origin (fctr), Make (fctr)
We have shown by example that dplyr
provides a simple syntax. Again, the operator %>%
syntax makes it even more easily readable:
## output suppressed since the same as above Cars93 %>% group_by(Type) %>% slice(1:2)
Let's take another example. We want to compute a new variable EngineSize
as the square of EngineSize
, and for each group we want to compute the minimum of the new variable. In addition, the results should be sorted in descending order:
Cars93 %>% mutate(ES2 = EngineSize^2) %>% group_by(Type) %>% summarize(min.ES2 = min(ES2)) %>% arrange(desc(min.ES2)) ## Source: local data frame [6 x 2] ## ## Type min.ES2 ## (fctr) (dbl) ## 1 Large 10.89 ## 2 Van 5.76 ## 3 Compact 4.00 ## 4 Midsize 4.00 ## 5 Sporty 1.69 ## 6 Small 1.00
dplyr – window functions
summarize()
works for functions that return one single value. To make more complex aggregations, window functions can be used.
There are different types of window functions:
- Ranking/ordering:
row_number()
,min_rank()
,percent_rank()
, and so on - Offsets:
lag()
,lead()
- Cumulative functions:
cumsum()
,cummin()
,cummax()
,cummean()
, and so on
Let's perform a simple example. Calculate the cumulative sum and average value within each group of Type
:
Cars93 %>% group_by(Type) %>% arrange(Type) %>% select(Manufacturer:Price) %>% mutate(cmean = cummean(Price), csum = cumsum(Price)) ## Source: local data frame [93 x 7] ## Groups: Type [6] ## ## Manufacturer Model Type Min.Price Price cmean csum ## (fctr) (fctr) (fctr) (dbl) (dbl) (dbl) (dbl) ## 1 Pontiac Sunbird Compact 9.4 11.1 11.10 11.1 ## 2 Ford Tempo Compact 10.4 11.3 11.20 22.4 ## 3 Chevrolet Corsica Compact 11.4 11.4 11.27 33.8 ## 4 Dodge Spirit Compact 11.9 13.3 11.77 47.1 ## 5 Chevrolet Cavalier Compact 8.5 13.4 12.10 60.5 ## 6 Oldsmobile Achieva Compact 13.0 13.5 12.33 74.0 ## 7 Nissan Altima Compact 13.0 15.7 12.81 89.7 ## 8 Chrysler LeBaron Compact 14.5 15.8 13.19 105.5 ## 9 Mazda 626 Compact 14.3 16.5 13.56 122.0 ## 10 Honda Accord Compact 13.8 17.5 13.95 139.5 ## .. ... ... ... ... ... ... ...
Data manipulation with the data.table package
The package data.table
is not included in the base R installation and must be installed once. It allows very efficient aggregation of large data sets (for example, data with several gigabytes of memory), efficient merging (join
) of several objects, adding and deletion of variables, and efficient importing of data sets (fread()
). The syntax is easy to learn but it is different to the syntax of base R.
Let us first convert a data.frame
to a data.table
using function data.table
. We again use the Cars93
data, and print the data table — the print output differs from base R and also from dyplyr
. Note that each data.table
is also a data.frame
and both can be accessed as a list:
require(data.table) Cars93 <- data.table(Cars93) Cars93 ## print output suppressed
The utility function tables()
lists all data.table
objects in the memory and gives information on the dimension and needed memory for each data table:
tables() ## NAME NROW NCOL MB ## [1,] Cars93 93 27 1 ## COLS ## [1,] Manufacturer,Model,Type,Min.Price,Price,Max.Price,MPG.city,MPG.highway,AirBags,D ## KEY ## [1,] ## Total: 1MB
data.table – variable construction
Using the $
operator, new variables can be constructed. As an example, we will create a new variable where the values are TRUE
if the manufacturer is Ford
:
Cars93$tmp1 <- Cars93[, j = Manufacturer == "Ford"] We can modify a variable by the :=-syntax directly (a very nice feature!) Cars93[, tmp2 := rnorm(nrow(Cars93))]
Note that these modifications are done by-reference — no copy of the data is needed internally.
To delete variables, one of the two following possibilities can be used:
Cars93[, tmp1:=NULL] Cars93$tmp2 <- NULL
data.table – indexing or subsetting
The indexing is done differently than in base R. Two parameters are used, i:
for the rows of the data.table
and j:
for the columns of the data.table
.
We use []
as an indexing operator, but it works slightly differently. j
is an expression in the scope of the actual object. Using with=FALSE: j
is evaluated as a vector of names or numbers.
Let us extract rows. We suppress the output in the following code listing to avoid filling the book with output from the Cars93
data:
Cars93[i = 2] # second row, all columns Cars93[i = c(1,5)] # first and fifth row, all columns Cars93[i = -c(1:5)] # exclude the first five rows
Now let us extract columns. See the details on the following code listing:
Cars93[j = 3] # this does not work since 3 evaluates to 3 ## [1] 3 Cars93[j = "Price"] # extract "Price" does not work since "Price" evaluates to "Price" ## [1] "Price" Cars93[j = Price] # this works, since variable Price exists in the scope of Cars93 ## [1] 7.4 8.0 8.3 8.4 8.4 8.6 9.0 9.1 9.2 9.8 10.0 10.0 10.1 10.3 ## [15] 10.9 11.1 11.1 11.3 11.3 11.4 11.6 11.8 12.1 12.2 12.5 13.3 13.4 13.5 ## [29] 13.9 14.0 14.1 14.4 14.9 15.1 15.6 15.7 15.7 15.8 15.9 15.9 15.9 16.3 ## [43] 16.3 16.5 16.6 17.5 17.7 18.2 18.4 18.4 18.5 18.8 19.0 19.1 19.1 19.3 ## [57] 19.5 19.5 19.7 19.8 19.9 20.0 20.2 20.7 20.8 20.9 21.5 22.7 22.7 23.3 ## [71] 23.7 24.4 25.8 26.1 26.3 26.7 28.0 28.7 29.1 29.5 30.0 31.9 32.5 33.9 ## [85] 34.3 34.7 35.2 36.1 37.7 38.0 40.1 47.9 61.9 Cars93[i=1:3, j = "Price", with = FALSE] # also works ## Price ## 1: 7.4 ## 2: 8.0 ## 3: 8.3
Indexing can also be done more sophisticatedly. For example, if we wanted to extract the first three rows, extract all variables, calculate a new variable that is the price range, or calculate the mean price, we could do the following:
Cars93[1:3, .(Price, Horsepower, Diff.Price = Max.Price - Min.Price, Mean.Price = mean(Price))] ## Price Horsepower Diff.Price Mean.Price ## 1: 7.4 63 1.0 7.9 ## 2: 8.0 81 2.4 7.9 ## 3: 8.3 82 1.7 7.9
Note
.()
is short for list()
.
data.table – keys
data.table
objects can be grouped according to a key. Based on such a key, calculations are very efficient. By using setkey()
, a key can be set for a data.table
:
setkey(Cars93, Type) # equally: setkeyv(dt, "x")
More than one key can be defined for a data.table
. Now sorting is done automatically regarding this key. Actual key variables can be displayed with key()
:
key(Cars93) ## [1] "Type"
data.table – fast subsetting
By using keys, we can increase the performance of subsetting:
setkey(Cars93, Type) Cars93["Van"] # all rows with Type == "Van" (output suppressed)
For more than two keys — for example, to extract all observations with Type
equal to Van
, DriveTrain
equals 4WD
and Origin
equals non-USA
— we can apply the following:
setkey(Cars93, Type, DriveTrain, Origin) Cars93[.("Van", "4WD", "non-USA")] ## Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway ## 1: Mazda MPV Van 16.6 19.1 21.7 18 24 ## 2: Toyota Previa Van 18.9 22.7 26.6 18 22 ## AirBags DriveTrain Cylinders EngineSize Horsepower RPM ## 1: None 4WD 6 3.0 155 5000 ## 2: Driver only 4WD 4 2.4 138 5000 ## Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length ## 1: 2240 No 19.6 7 190 ## 2: 2515 Yes 19.8 7 187 ## Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight Origin ## 1: 110 72 39 27.5 NA 3735 non-USA ## 2: 113 71 41 35.0 NA 3785 non-USA ## Make ## 1: Mazda MPV ## 2: Toyota Previa
Let's compare efficiency on a data set with characters. We use the microbenchmark
package (Mersmann 2015) for this purpose. We see that data.table
is more than 60 times faster than base R, and in this case dplyr
is the slowest:
require(microbenchmark) N <- 1000000 dat<- data.table( x=sample(LETTERS[1:20], N, replace=TRUE), y=sample(letters[1:5], N, replace=TRUE)) head(dat, 3) ## x y ## 1: M a ## 2: B a ## 3: I e setkey(dat, x,y) microbenchmark( data.table = dat[list(c("B", "D"), c("b", "d"))], dplyr = dat %>% slice(x %in% c("B", "D") & y %in% c("b", "d")), baseR = dat[x %in% c("B", "D") & y %in% c("b", "d")] ) ## Unit: milliseconds ## expr min lq mean median uq max neval ## data.table 1.13 1.276 1.571 1.351 1.487 7.382 100 ## dplyr 70.27 80.049 96.965 83.216 88.785 267.583 100 ## baseR 69.65 76.685 95.777 82.128 87.722 281.426 100
data.table – calculations in groups
We can do calculations in groups by using by
. In the following example, we will calculate the arithmetic mean price, the interquartile price range, and the median price:
Cars93[, .(mean = mean(Price), IQR = IQR(Price), median = median(Price)), by = Type] ## Type mean IQR median ## 1: Compact 18.21 7.30 16.15 ## 2: Large 24.30 6.95 20.90 ## 3: Midsize 27.22 17.42 26.20 ## 4: Small 10.17 2.70 10.00 ## 5: Sporty 19.39 8.25 16.80 ## 6: Van 19.10 0.70 19.10
There is further functionality in the data.table
package. See .SD
to apply functions to more than one variable, .N
for the number of elements in each group, and merge to efficiently join data sets.
- Node.js+Webpack開發實戰
- 從零開始:數字圖像處理的編程基礎與應用
- Java系統分析與架構設計
- 零起步玩轉掌控板與Mind+
- QGIS:Becoming a GIS Power User
- 零基礎學C語言第2版
- IoT Projects with Bluetooth Low Energy
- Android智能手機APP界面設計實戰教程
- Learning Ionic(Second Edition)
- C++從零開始學(視頻教學版)(第2版)
- Visual FoxPro數據庫程序設計
- Selenium自動化測試實戰:基于Python
- Prezi Cookbook
- Learning Network Programming with Java
- Instant Buildroot