
Correlating a binary and a continuous variable with the point-biserial correlation

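The point-biserial correlation correlates a binary variable Y with a continuous variable X. The coefficient is calculated as follows:

$$ r_{pb} = \frac{M_1 - M_0}{s_n} \sqrt{\frac{n_1 n_0}{n^2}} \qquad (3.21) $$

The subscripts in (3.21) correspond to the two groups of the binary variable. M1 is the mean of X for the values in group 1 of Y, and M0 is the mean of X for the values in group 0 of Y. Likewise, n1 and n0 are the numbers of observations in the two groups, n is the total number of observations, and s_n is the standard deviation of X over all n observations (dividing by n).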
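
The definition can be checked numerically. The following is a minimal sketch on made-up data (the arrays x and y and the random seed are assumptions for illustration, not the weather data of this recipe). It computes the coefficient from the two group means and compares it with scipy.stats.pointbiserialr() and with the ordinary Pearson correlation of the 0/1-coded variable, which should agree:

```python
import numpy as np
from scipy import stats

# Made-up data for illustration only: y is binary, x is continuous.
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=200).astype(bool)
x = rng.normal(loc=10.0, scale=3.0, size=200) + 2.0 * y

m1 = x[y].mean()       # mean of X where Y is 1
m0 = x[~y].mean()      # mean of X where Y is 0
n1, n0, n = int(y.sum()), int((~y).sum()), len(y)
s_n = x.std()          # standard deviation over all n values (ddof=0)

# Coefficient from the group means, as in the definition.
r_manual = (m1 - m0) / s_n * np.sqrt(n1 * n0 / n ** 2)

# SciPy's implementation, and Pearson's r with Y coded as 0/1.
r_scipy = stats.pointbiserialr(y, x)[0]
r_pearson = stats.pearsonr(y.astype(float), x)[0]

print(r_manual, r_scipy, r_pearson)
```

The three printed values should match up to floating-point error, which is why the point-biserial coefficient is often described as a special case of the Pearson correlation.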

In this recipe, the binary variable we will use is rain or no rain. We will correlate this variable with temperature.

How to do it...

We will calculate the correlation with the scipy.stats.pointbiserialr() function. We will also compute the rolling correlation over a 2-year window, using the np.roll() function to shift the arrays. The steps are as follows:

  1. The imports are as follows:
    import dautil as dl
    from scipy import stats
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    from IPython.display import HTML
  2. Load the data and correlate the two relevant arrays:
    df = dl.data.Weather.load().dropna()
    df['RAIN'] = df['RAIN'] > 0
    
    stats_corr = stats.pointbiserialr(df['RAIN'].values, df['TEMP'].values)
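  3. Compute the 2-year rolling correlation as follows:
    N = 2 * 365
    corrs = []
    
    for i in range(len(df.index) - N):
        # A negative shift moves element i to the front, so [:N] selects
        # the window values[i:i + N]; a positive shift would wrap around.
        x = np.roll(df['RAIN'].values, -i)[:N]
        y = np.roll(df['TEMP'].values, -i)[:N]
        corrs.append(stats.pointbiserialr(x, y)[0])
    
    # Average the rolling values per year. 'A' is the annual frequency
    # alias; recent pandas versions spell it 'YE'.
    corrs = pd.DataFrame(corrs,
                         index=df.index[N:],
                         columns=['Correlation']).resample('A').mean()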
  4. Plot the results with the following code:
    plt.plot(corrs.index.values, corrs.values)
    plt.hlines(stats_corr[0], corrs.index.values[0], corrs.index.values[-1],
               label='Correlation using the whole data set')
    plt.title('Rolling Point-biserial Correlation of Rain and Temperature with a 2 Year Window')
    plt.xlabel('Year')
    plt.ylabel('Correlation')
    plt.legend(loc='best')
    HTML(dl.report.HTMLBuilder().watermark())
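
As an alternative to the explicit loop, the rolling coefficient can also be sketched with the rolling windows built into pandas, relying on the fact that the point-biserial correlation equals the Pearson correlation when the binary variable is coded as 0/1. The example below uses a synthetic stand-in for the weather data (the demo frame, its date range, and the random seed are assumptions for illustration) and cross-checks the last window against scipy.stats.pointbiserialr():

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the weather data (assumption: daily rows with
# a boolean RAIN column and a continuous TEMP column).
rng = np.random.default_rng(0)
idx = pd.date_range('2000-01-01', periods=5 * 365, freq='D')
days = np.arange(len(idx))
temp = 10 + 8 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 2, len(idx))
rain = rng.random(len(idx)) < 0.4
demo = pd.DataFrame({'RAIN': rain, 'TEMP': temp}, index=idx)

N = 2 * 365

# Rolling Pearson correlation with RAIN coded as 0.0/1.0; the first
# N - 1 entries are NaN because the window is not yet full.
rolling = demo['RAIN'].astype(float).rolling(N).corr(demo['TEMP'])

# Cross-check the last full window against SciPy.
last = demo.iloc[-N:]
expected = stats.pointbiserialr(last['RAIN'].values, last['TEMP'].values)[0]

# np.roll with a negative shift yields the same forward window as slicing.
a = np.arange(10)
window = np.roll(a, -3)[:4]   # same elements as a[3:7]

print(rolling.iloc[-1], expected)
```

Here each rolling value is labeled with the last date of its window, whereas the loop in step 3 labels a window with the date that follows it, so the two series are offset by one row.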

Refer to the following screenshot for the end result (see correlating_pointbiserial.ipynb file in this book's code bundle):

See also

  • The relevant SciPy documentation for scipy.stats.pointbiserialr() (retrieved 2015).