官术网_书友最值得收藏!

Data aggregation

As a final topic, we will look at ways to get a condensed view of data with aggregations. Pandas comes with a lot of aggregation functions built-in. We already saw the describe function in Chapter 3, Data Analysis with Pandas. This works on parts of the data as well. We start with some artificial data again, containing measurements about the number of sunshine hours per city and date:

>>> df.head()
 country city date hours
0 Germany Hamburg 2015-06-01 8
1 Germany Hamburg 2015-06-02 10
2 Germany Hamburg 2015-06-03 9
3 Germany Hamburg 2015-06-04 7
4 Germany Hamburg 2015-06-05 3

To view a summary per city, we use the describe function on the grouped data set:

>>> df.groupby("city").describe()
 hours
city
Berlin count 10.000000
 mean 6.000000
 std 3.741657
 min 0.000000
 25% 4.000000
 50% 6.000000
 75% 9.750000
 max 10.000000
Birmingham count 10.000000
 mean 5.100000
 std 2.078995
 min 2.000000
 25% 4.000000
 50% 5.500000
 75% 6.750000
 max 8.000000

On certain data sets, it can be useful to group by more than one attribute. We can get an overview about the sunny hours per country and date by passing in two column names:

>>> df.groupby(["country", "date"]).describe()
 hours country date
France 2015-06-01 count 5.000000
 mean 6.200000
 std 1.095445
 min 5.000000
 25% 5.000000
 50% 7.000000
 75% 7.000000
 max 7.000000
 2015-06-02 count 5.000000
 mean 3.600000
 std 3.577709
 min 0.000000
 25% 0.000000
 50% 4.000000
 75% 6.000000
 max 8.000000
UK 2015-06-07 std 3.872983
 min 0.000000
 25% 2.000000
 50% 6.000000
 75% 8.000000
 max 9.000000

We can compute single statistics as well:

>>> df.groupby("city").mean()
 hours
city
Berlin 6.0
Birmingham 5.1
Bordeax 4.7
Edinburgh 7.5
Frankfurt 5.8
Glasgow 4.8
Hamburg 5.5
Leipzig 5.0
London 4.8
Lyon 5.0
Manchester 5.2
Marseille 6.2
Munich 6.6
Nice 3.9
Paris 6.3

Finally, we can define any function to be applied on the groups with the agg method. The above could have been written in terms of agg like this:

>>> df.groupby("city").agg(np.mean)
hours
city
Berlin 6.0
Birmingham 5.1
Bordeax 4.7
Edinburgh 7.5
Frankfurt 5.8
Glasgow 4.8
...

But arbitrary functions are possible. As a last example, we define a custom function, which takes an input of a series object and computes the difference between the smallest and the largest element:

>>> df.groupby("city").agg(lambda s: abs(min(s) - max(s)))
 hours
city
Berlin 10
Birmingham 6
Bordeax 10
Edinburgh 8
Frankfurt 9
Glasgow 10
Hamburg 10
Leipzig 9
London 10
Lyon 8
Manchester 10
Marseille 10
Munich 9
Nice 10
Paris 9
主站蜘蛛池模板: 贡山| 紫阳县| 岐山县| 扬州市| 正镶白旗| 陈巴尔虎旗| 金川县| 那坡县| 焦作市| 德保县| 喀喇| 普兰店市| 凤冈县| 清河县| 辽阳县| 平遥县| 山东省| 白银市| 大丰市| 宾阳县| 彭泽县| 旌德县| 丘北县| 浮梁县| 昭通市| 宁河县| 永泰县| 东丰县| 宁乡县| 上蔡县| 巴林左旗| 锡林浩特市| 淮南市| 中超| 石家庄市| 鄂尔多斯市| 景泰县| 青铜峡市| 安国市| 库车县| 武鸣县|