Book: Hands-On Data Science with R
Authors: Vitor Bianchi Lanzetta, Nataraj Dasgupta, Ricardo Anjoleto Farias
Updated: 2021-06-10 19:12:36

Using tibble and dplyr for data manipulation

The tibble is a relatively recent development. It is essentially a more user-friendly version of a data.frame. For example, when you print a large data.frame in R, it prints as many rows as it can until it reaches the max.print limit, at which point you'll get a message like the following:

[ reached getOption("max.print") -- omitted 99000 rows ]

A tibble, on the other hand, prints only the first 10 rows by default and shows only as many columns as fit the width of your console.
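
The difference is easy to see with a small sketch. The object names df and tb and the 100,000-row size are illustrative assumptions, not part of the original example; only the tibble package is needed here:

```r
library(tibble)

# Hypothetical example data: 100,000 rows and two columns
df <- data.frame(id = 1:100000, value = sqrt(1:100000))

# Convert the data.frame to a tibble; the data itself is unchanged
tb <- as_tibble(df)

# Printing df would fill the console up to getOption("max.print");
# printing tb shows the dimensions, the column types, and the first 10 rows
print(tb)

# Ask for a different number of rows explicitly when needed
print(tb, n = 5)
```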

To use tibble, and other related functionalities, install the tidyverse package as follows:

install.packages("tidyverse") 
library("tidyverse")

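Loading tidyverse prints a startup message that lists the attached core packages (including tibble and dplyr) and any function-name conflicts, such as dplyr::filter() masking stats::filter().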

Let us create a tibble from the state data that we have used thus far:

tstate <- as_tibble(state.x77) 
tstate$Region <- state.region
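
One detail worth knowing: state.x77 is a matrix whose row names hold the state names, and as_tibble drops row names by default. The following is a minimal sketch of one way to keep them, using the built-in state.name vector (which is in the same order as the rows of state.x77). The name tstate_named is a hypothetical copy, so that tstate itself stays exactly as defined above:

```r
library(tibble)

# Same construction as in the text
tstate <- as_tibble(state.x77)
tstate$Region <- state.region

# Hypothetical copy that also carries the state names as a regular column
tstate_named <- tstate
tstate_named$State <- state.name

# Compact overview: one line per column with its type and first values
glimpse(tstate_named)
```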

Before getting into the details of dplyr, it helps to become familiar with a commonly used notation in R called the pipe, which is written as %>%. The operator comes from the magrittr package and is available as soon as dplyr (or tidyverse) is loaded.

Pipes allow the developer to pass the output of one function as the input to the next function, step by step. For instance, suppose we wanted to find the Region with the highest mean income in our state dataset.

One way to do this would be to aggregate by Region and then find the Region corresponding to the highest value, as follows:

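# Mean of every numeric column per region (column 9, Region, is excluded from the data)
step1 <- aggregate(tstate[, -9], by = list(Region = tstate$Region), FUN = mean, na.rm = TRUE)
step1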

With the regional means stored in step1, keep only the row whose Income equals the maximum:

step2 <- step1[step1$Income == max(step1$Income), ]
step2

This can, however, be greatly simplified using the %>% pipe operator, as follows:

tstate %>% group_by(Region) %>% summarise(Income = mean(Income)) %>% filter(Income == max(Income)) 
 
# # A tibble: 1 x 2
#   Region   Income
#   <fctr>    <dbl>
# 1 West   4702.615
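
To make the role of the pipe concrete, the sketch below writes the same computation as nested function calls and checks that both forms agree. It assumes tstate is built as shown earlier; the names piped and nested are illustrative:

```r
library(dplyr)
library(tibble)

tstate <- as_tibble(state.x77)
tstate$Region <- state.region

# Piped form: read left to right, one step per line
piped <- tstate %>%
  group_by(Region) %>%
  summarise(Income = mean(Income)) %>%
  filter(Income == max(Income))

# Equivalent nested form: x %>% f(y) is the same as f(x, y),
# so the calls have to be read from the inside out
nested <- filter(
  summarise(group_by(tstate, Region), Income = mean(Income)),
  Income == max(Income)
)

# Both forms return the same one-row tibble
all.equal(piped, nested)
```

The nested version is harder to follow because the first operation applied (group_by) is the innermost call, which is the readability problem the pipe is meant to address.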

It is also possible to summarize all of the column values at once using summarise_all and find the row corresponding to the max income, as in the prior example:

# funs() is deprecated since dplyr 0.8.0; pass the function directly
tstate %>% group_by(Region) %>% summarise_all(mean) %>% filter(Income == max(Income))

The result is a single-row tibble for the West region that contains the mean of every numeric column.
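
In dplyr 1.0.0 and later, scoped verbs such as summarise_all are superseded by across(). The following is a hedged sketch of the same query in that style, assuming a recent dplyr is installed and tstate is built as shown earlier. The name top_region is illustrative, and slice_max() is used here as an alternative to the filter step:

```r
library(dplyr)
library(tibble)

tstate <- as_tibble(state.x77)
tstate$Region <- state.region

# Mean of every non-grouping column per Region, then keep the row
# with the largest mean Income
top_region <- tstate %>%
  group_by(Region) %>%
  summarise(across(everything(), mean)) %>%
  slice_max(Income, n = 1)

top_region
```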
