Gapminder

Now that we've looked at tidying data, let's have a quick look at using dplyr and ggplot2 to filter, process, and plot some data. In this section, and throughout this book, we're going to be using the Gapminder data that was made famous by Hans Rosling and the Gapminder Foundation. An excerpt of this data is available from the gapminder package, as assembled by Jenny Bryan, and it can be installed and loaded very simply using install.packages("gapminder"); library(gapminder). As the package description indicates, it includes, for each of its 142 countries, the values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007.
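As a quick check on the description above, the following sketch loads the package and inspects the data. It assumes that gapminder is already installed (the install.packages() call is commented out); the variable names nCountries and yearRange are introduced here purely for illustration.

```r
# install.packages("gapminder")  # run once if the package is not installed
library(gapminder)

# Show the column names and types of the dataset
str(gapminder)

# Count the distinct countries and find the first and last year covered
nCountries = length(unique(gapminder$country))
yearRange = range(gapminder$year)

nCountries
yearRange

# List the individual years to confirm the five-year spacing
sort(unique(gapminder$year))
```

Inspecting the years in this way is useful before filtering, because it shows which values a condition such as year > 1960 will actually keep.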

In order to prepare the data for plotting, we will make use of dplyr, as shown in the following code:

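library(dplyr)
library(gapminder)
# Mean life expectancy for each continent and year, using years after 1960
groupedData = gapminder %>%
  filter(year > 1960) %>%            # keep only rows after 1960
  group_by(continent, year) %>%      # one group per continent and year
  summarise(meanLife = mean(lifeExp))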

This single block of code, executed as one chained expression, produces a dataframe suitable for plotting, and uses chaining to keep the code simple. Three separate data operations, filter(), group_by(), and summarise(), are all used, with the result of each being passed to the next using the %>% operator. The three instructions carry out the following tasks:

  • filter(): This is similar to subset(). This operation keeps only rows that meet certain requirements; in this case, years after 1960, which in this dataset means 1962 onward.
  • group_by(): This allows operations to be carried out on subsets of data points; in this case, each continent for each of the years within the dataset.
  • summarise(): This carries out summary functions, such as sum and mean, on several data points; in this case, the mean life expectancy within each continent and available year.
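To make the role of each instruction easier to see, here is a minimal sketch that runs the same three steps one at a time, without the %>% operator. The intermediate names recent, grouped, and stepwise are hypothetical and are used only for this illustration.

```r
library(dplyr)
library(gapminder)

# Step 1: keep only the rows recorded after 1960
recent = filter(gapminder, year > 1960)

# Step 2: mark the groups; the rows themselves are unchanged at this point
grouped = group_by(recent, continent, year)

# Step 3: collapse each continent and year group into a single row
stepwise = summarise(grouped, meanLife = mean(lifeExp))

# Compare the row counts before and after each step
nrow(gapminder)
nrow(recent)
nrow(grouped)
nrow(stepwise)
```

Note that group_by() does not change the number of rows; it only records the grouping, which summarise() then uses to produce one row per group.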

So, to summarize, the preceding code filters the data to select only years after 1960, groups it by continent and year, and finds the mean life expectancy for each continent in each year. Printing groupedData at the console displays the result.

The output is a tibble, which has a nice print method that shows only the first few rows (ten by default). Tibbles are very similar to dataframes, and are often produced by default instead of dataframes within the tidyverse. They do differ in some useful ways, but they are fairly interchangeable with dataframes for our purposes, so we will not get sidetracked by the differences here.
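The relationship between tibbles and dataframes can be checked directly. The following sketch rebuilds groupedData so that it runs on its own, then inspects its class and converts it to a plain dataframe; the name plainData is hypothetical.

```r
library(dplyr)
library(gapminder)

groupedData = gapminder %>%
  filter(year > 1960) %>%
  group_by(continent, year) %>%
  summarise(meanLife = mean(lifeExp))

# A tibble is still a dataframe: its class vector includes
# both "tbl_df" and "data.frame"
class(groupedData)
is.data.frame(groupedData)   # TRUE

# Convert to a plain dataframe if a function requires one
plainData = as.data.frame(groupedData)
class(plainData)

# head() limits the printed rows, which a plain dataframe does not do itself
head(plainData)
```

Because a tibble inherits from data.frame, most functions that expect a dataframe accept it without any conversion.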

Now that we have mentioned tibbles, you can see that the dataframe is a nice summary of the mean life expectancy by year and continent.
