
Using tibble and dplyr for data manipulation

tibble is a relatively recent addition to R. It is essentially a more user-friendly version of the data frame. For example, when you print a data.frame in R, it will attempt to print every row until it reaches the max.print limit, at which point you'll get a message such as the following:

 [ reached getOption("max.print") -- omitted 99000 rows ]

A tibble, on the other hand, shows only the first 10 rows by default and limits the columns it displays to those that fit the width of your console.
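The difference in printing behavior can be seen with a small sketch. The data frame big_df and its columns are hypothetical and exist only for illustration; the sketch assumes the tibble package is installed.

```r
library(tibble)

# Hypothetical data: 1,000 rows, two columns
big_df <- data.frame(id = 1:1000, value = rnorm(1000))
big_tb <- as_tibble(big_df)

# Printing the tibble shows a short header, the column types,
# and only the first rows, followed by a note about the rest
print(big_tb)

# The n argument of print() controls how many rows are shown
print(big_tb, n = 20)

# The tibble is still a data frame, so data frame functions accept it
is.data.frame(big_tb)
```

Printing big_df itself would write all 1,000 rows to the console, which is the behavior the tibble avoids.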

To use tibble, and other related functionalities, install the tidyverse package as follows:

install.packages("tidyverse") 
library("tidyverse") 

Loading the package with library("tidyverse") prints a startup message that lists the attached core packages and any function name conflicts.

Let us create a tibble from the state data that we have used thus far:

tstate <- as_tibble(state.x77) 
tstate$Region <- state.region 
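Two details of this conversion are worth keeping in mind: a tibble does not carry row names, so the state names stored as row names of state.x77 are dropped, and column names containing spaces (such as Life Exp) are kept as they are. The following is a minimal sketch of one way to handle both; the names tstate_named and State are assumptions introduced here for illustration.

```r
library(tibble)

# Move the row names (state names) into a regular column first,
# then convert to a tibble; "State" is a name chosen for this sketch
tstate_named <- as_tibble(
  rownames_to_column(as.data.frame(state.x77), var = "State")
)
tstate_named$Region <- state.region

# Column names with spaces need backticks when referenced directly
head(tstate_named$`Life Exp`)
names(tstate_named)
```

The rest of this section continues with tstate as defined above, since the state names are not needed for the regional summaries.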

Before getting into the details of dplyr, it helps to become familiar with a commonly used notation in R called the pipe, which is written as %>%. The operator comes from the magrittr package and is made available when dplyr or tidyverse is loaded.

Pipes allow the developer to pass the output of one function as the input to the next function, and so on along a chain. For instance, suppose we wanted to find the region with the highest mean income in our state dataset.

One way to do this would be to aggregate by Region and then find the region corresponding to the highest value, as follows:

step1 <- aggregate(tstate[, -9], by = list(tstate$Region), mean, na.rm = TRUE) 
step1 

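This returns one row per region with the mean of each column. The region with the highest mean income can then be selected as follows: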

step2 <- step1[step1$Income==max(step1$Income),] 
step2 

This can, however, be greatly simplified using the %>% pipe operator, as follows:

tstate %>% group_by(Region) %>% summarise(Income = mean(Income)) %>% filter(Income == max(Income)) 
 
# # A tibble: 1 x 2 
# Region   Income 
# <fctr>    <dbl> 
#   1   West 4702.615 
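To make the role of the pipe concrete, the chain above can be compared with the same calls written as nested functions. This is only an illustrative sketch, assuming dplyr and tibble are installed; the names nested and piped are introduced here and are not part of the original example.

```r
library(dplyr)
library(tibble)

tstate <- as_tibble(state.x77)
tstate$Region <- state.region

# Nested form: the calls must be read from the inside out
nested <- filter(
  summarise(group_by(tstate, Region), Income = mean(Income)),
  Income == max(Income)
)

# Piped form: the same calls, read left to right
piped <- tstate %>%
  group_by(Region) %>%
  summarise(Income = mean(Income)) %>%
  filter(Income == max(Income))

# Both forms produce the same one-row result
all.equal(nested, piped)
```

Each %>% places the result on its left as the first argument of the function on its right, which is why x %>% f(y) is equivalent to f(x, y).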

It is also possible to summarize all of the column values at once using summarise_all and find the row corresponding to the max income, as in the prior example:

# funs() is deprecated in recent dplyr releases; passing mean directly has the same effect 
tstate %>% group_by(Region) %>% summarise_all(mean) %>% filter(Income == max(Income)) 

The result is a single row for the West region, containing the mean of every column.
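In dplyr 1.0.0 and later, the same summary can also be written with across(), which supersedes the scoped verbs such as summarise_all. The following is a minimal sketch, assuming dplyr 1.0.0 or newer is installed; the name west_means is introduced here for illustration only.

```r
library(dplyr)
library(tibble)

tstate <- as_tibble(state.x77)
tstate$Region <- state.region

# across(everything(), mean) applies mean to every non-grouping column
west_means <- tstate %>%
  group_by(Region) %>%
  summarise(across(everything(), mean)) %>%
  filter(Income == max(Income))

west_means
```

The grouping column Region is left out of everything() automatically, so only the eight numeric columns are averaged.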
