官术网_书友最值得收藏!

Aggregation and Grouping

After getting the dataset, our analyst may have to answer a few questions. For example, we know the value of the radionuclide concentration per city, but an analyst may be asked to answer: which state, on average, has the highest radionuclide concentration?

To answer the questions posed, we need to group the data somehow and calculate an aggregation on it. But before we go into grouping data, we have to prepare the dataset so that we can manipulate it in an efficient manner. Getting the right types in a pandas DataFrame can be a huge boost for performance and can be leveraged to enforce data consistency— it makes sure that numeric data really is numeric and allows us to execute operations that we want to use to get the answers.

GroupBy allows us to get a more general view of a feature, arranging data given a GroupBy key and an aggregation operation. In pandas, this operation is done with the GroupBy method, over a selected column, such as State. Note the aggregation operation after the GroupBy method. Some examples of the operations that can be applied are as follows:

  • mean
  • median
  • std (standard deviation)
  • mad (mean absolute deviation)
  • sum
  • count
  • abs

    Note

    Several statistics, such as mean and standard deviation, only make sense with numeric data.

After applying GroupBy, a specific column can be selected and the aggregation operation can be applied to it, or all the remaining columns can be aggregated by the same function. Like SQL, GroupBy can be applied to more than one column at a time, and more than one aggregation operation can be applied to selected columns, one operation per column.

The GroupBy command in Pandas has some options, such as as_index, which can override the standard of transforming grouping key's columns to indexes and leaving them as normal columns. This is helpful when a new index will be created after the GroupBy operation, for example.

Aggregation operations can be done over several columns and different statistical methods at the same time with the agg method, passing a dictionary with the name of the column as the key and a list of statistical operations as values.

Exercise 6: Aggregation and Grouping Data

Remember that we have to answer the question of which state has, on average, the highest radionuclide concentration. As there are several cities per state, we have to combine the values of all cities in one state and calculate the average. This is one of the applications of GroupBy: calculating the average values of one variable as per a grouping. We can answer the question using GroupBy:

  1. Import the required libraries:

    import numpy as np

    import pandas as pd

    import matplotlib.pyplot as plt

    import seaborn as sns

  2. Load the datasets from the https://opendata.socrata.com/:

    df = pd.read_csv('RadNet_Laboratory_Analysis.csv')

  3. Group the DataFrame using the State column.

    df.groupby('State')

  4. Select the radionuclide Cs-134 and calculate the average value per group:

    df.groupby('State')['Cs-134'].head()

  5. Do the same for all columns, grouping per state and applying directly the mean function:

    df.groupby('State').mean().head()

  6. Now, group by more than one column, using a list of grouping columns.
  7. Aggregate using several aggregation operations per column with the agg method. Use the State and Location columns:

    df.groupby(['State', 'Location']).agg({'Cs-134':['mean', 'std'], 'Te-129':['min', 'max']})

NumPy on Pandas

NumPy functions can be applied to DataFrames directly or through the apply and applymap methods. Other NumPy functions, such as np.where, also work with DataFrames.

主站蜘蛛池模板: 武安市| 竹溪县| 绥化市| 衡山县| 南昌市| 黄平县| 庄河市| 石城县| 普安县| 舟山市| 资中县| 洞头县| 长阳| 方山县| 哈巴河县| 巴中市| 淮南市| 城口县| 讷河市| 措美县| 苏尼特左旗| 安宁市| 遵义县| 瑞丽市| 阿瓦提县| 肇东市| 观塘区| 靖安县| 壤塘县| 浦城县| 陕西省| 浦城县| 南郑县| 定南县| 和硕县| 广德县| 巴马| 古交市| 岚皋县| 仁化县| 青川县|