
Evaluating relations between variables with ANOVA

Analysis of variance (ANOVA) is a statistical data analysis method invented by the statistician Ronald Fisher. The method partitions the values of a continuous variable according to the values of one or more corresponding categorical variables, and then analyzes the variance within and between the resulting groups. ANOVA is a form of linear modeling. If we model with one categorical variable, we speak of one-way ANOVA. In this recipe, we will use two categorical variables, so we have two-way ANOVA. In two-way ANOVA, we create a contingency table, that is, a table containing counts for all combinations of the two categorical variables (we will see a contingency table example soon). The linear model is then given by the equation:

$$\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$$

This is an additive model where μij is the mean of the continuous variable corresponding to one cell of the contingency table, μ is the mean for the whole data set, αi is the contribution of the first categorical variable, βj is the contribution of the second categorical variable, and γij is a cross-term (the interaction between the two categorical variables). We will apply this model to weather data.
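
To make the terms of the model concrete, here is a minimal sketch that splits a small table of cell means into μ, αi, βj, and the cross-term. The numbers are made up for illustration and are not taken from the weather data. The sketch assumes equally weighted cells and effects that sum to zero, which is one common convention but not the only one.

```python
# Hypothetical 2x3 table of cell means
# (rows: levels of the first factor, columns: levels of the second factor).
cell_means = [
    [4.0, 5.0, 6.0],
    [6.0, 9.0, 12.0],
]

n_rows = len(cell_means)
n_cols = len(cell_means[0])

# mu: unweighted mean over all cells.
mu = sum(sum(row) for row in cell_means) / (n_rows * n_cols)

# alpha_i: row mean minus the overall mean.
alpha = [sum(row) / n_cols - mu for row in cell_means]

# beta_j: column mean minus the overall mean.
beta = [
    sum(cell_means[i][j] for i in range(n_rows)) / n_rows - mu
    for j in range(n_cols)
]

# gamma_ij: whatever is left after removing mu, alpha_i and beta_j.
gamma = [
    [cell_means[i][j] - mu - alpha[i] - beta[j] for j in range(n_cols)]
    for i in range(n_rows)
]

print('mu    =', mu)
print('alpha =', alpha)
print('beta  =', beta)
print('gamma =', gamma)
```

If every entry of the cross-term were zero, the two factors would contribute independently and the cell means would be fully described by μ, αi, and βj alone.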

How to do it...

The following steps apply two-way ANOVA with wind speed as the continuous variable, rain as a binary variable, and wind direction as a categorical variable:

  1. The imports are as follows:
    from statsmodels.formula.api import ols
    import dautil as dl
    from statsmodels.stats.anova import anova_lm
    import seaborn as sns
    import matplotlib.pyplot as plt
    from IPython.display import HTML
  2. Load the data and fit the model with statsmodels:
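    # Load the weather data and drop rows with missing values
    df = dl.data.Weather.load().dropna()
    # Turn RAIN into a binary variable (True when RAIN > 0)
    df['RAIN'] = df['RAIN'] > 0
    # '+' specifies main effects only; no cross-term is fitted here
    formula = 'WIND_SPEED ~ C(RAIN) + C(WIND_DIR)'
    lm = ols(formula, df).fit()
    # Collect the ANOVA table in an HTML report
    hb = dl.HTMLBuilder()
    hb.h1('ANOVA Applied to Weather Data')
    hb.h2('ANOVA results')
    hb.add_df(anova_lm(lm), index=True)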
  3. Display a truncated contingency table and visualize the data with Seaborn:
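    # Categorize the wind direction with the dautil helper
    # (the model in step 2 was fit before this, on WIND_DIR as loaded)
    df['WIND_DIR'] = dl.data.Weather.categorize_wind_dir(df)
    hb.h2('Truncated Contingency table')
    # Count observations per (RAIN, WIND_DIR) combination; show the first 3 rows
    counts = df.groupby([df['RAIN'], df['WIND_DIR']]).count()
    hb.add_df(counts.head(3), index=True)

    # Plot the mean wind speed per wind direction, split by rain
    sns.pointplot(y='WIND_SPEED', x='WIND_DIR',
                  hue='RAIN', data=df[['WIND_SPEED', 'RAIN', 'WIND_DIR']])
    HTML(hb.html)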
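
The recipe's formula joins the two factors with `+`, which fits the main effects only. As a hedged sketch of how the cross-term could be included, the following example uses `*` in the formula, which expands to both main effects plus their interaction. It runs on synthetic data rather than the weather data. The column names, effect sizes, and random seed are assumptions made for illustration, and `pd.crosstab` is used as one possible way to build the contingency table.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Synthetic, balanced data: 30 observations per (rain, wind_dir) cell.
rng = np.random.default_rng(42)
rows = []
for rain in (False, True):
    for k, wind_dir in enumerate(['N', 'E', 'S', 'W']):
        # Assumed effects: rain adds 1.5, each direction adds 0.5 * k.
        speeds = 3.0 + 1.5 * rain + 0.5 * k + rng.normal(0.0, 1.0, size=30)
        rows.extend(
            {'rain': rain, 'wind_dir': wind_dir, 'wind_speed': s}
            for s in speeds
        )
df = pd.DataFrame(rows)

# Contingency table: counts for every combination of the two factors.
counts = pd.crosstab(df['rain'], df['wind_dir'])

# Main effects only ('+'), mirroring the structure of the recipe's formula.
additive = anova_lm(ols('wind_speed ~ C(rain) + C(wind_dir)', data=df).fit())

# Main effects plus the cross-term ('*' expands to a + b + a:b).
full = anova_lm(ols('wind_speed ~ C(rain) * C(wind_dir)', data=df).fit())

print(counts)
print(additive)
print(full)
```

The second ANOVA table gains a `C(rain):C(wind_dir)` row. Since the synthetic data is generated without an interaction, that row should typically show a small F-value, while the rows for the main effects remain clearly significant.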

Refer to the following screenshot for the end result (see the anova.ipynb file in this book's code bundle):

See also
