官术网_书友最值得收藏!

Estimating kernel density

Often, we have an idea about the kind of distribution that is appropriate for our data. If that is not the case, we can apply a procedure called kernel density estimation. This method doesn't make any assumptions and is nonparametric. We basically smooth the data in an attempt to get a handle on the probability density. To smooth data, we can use various functions. These functions are called kernel functions in this context. The following equation defines the estimator:

In the preceding formula, K is the kernel function, a function with properties similar to a PDF. The bandwidth h parameter controls the smoothing process and can be kept fixed or varied. Some libraries use rules of thumb to calculate h, while others let you specify its value. SciPy, statsmodels, scikit-learn, and Seaborn implement kernel density estimation using different algorithms.

How to do it...

In this recipe, we will estimate bivariate kernel density using weather data:

  1. The imports are as follows:
    import seaborn as sns
    import matplotlib.pyplot as plt
    import dautil as dl
    from dautil.stats import zscores
    import statsmodels.api as sm
    from sklearn.neighbors import KernelDensity
    import numpy as np
    from scipy import stats
    from IPython.html import widgets
    from IPython.core.display import display
    from IPython.display import HTML
  2. Define the following function to plot the estimated kernel density:
    def plot(ax, a, b, c, xlabel, ylabel):
        dl.plotting.scatter_with_bar(ax, 'Kernel Density', a.values, b.values, c=c, cmap='Blues')
        ax.set_xlabel(xlabel)
        ax.set_ylabel(ylabel)
  3. In the following notebook cell, load the data and define widgets for the selection of weather variables:
    df = dl.data.Weather.load().resample('M').dropna()
    columns = [str(c) for c in df.columns.values]
    var1 = widgets.Dropdown(options=columns, selected_label='RAIN')
    display(var1)
    var2 = widgets.Dropdown(options=columns, selected_label='TEMP')
    display(var2)
  4. In the next notebook cell, define variables using the values of the widgets we created:
    x = df[var1.value]
    xlabel = dl.data.Weather.get_header(var1.value)
    y = df[var2.value]
    ylabel = dl.data.Weather.get_header(var2.value)
    X = [x, y]
  5. The next notebook cell does the heavy lifting with the most important lines highlighted:
    # need to use zscores to avoid errors
    Z = [zscores(x), zscores(y)]
    kde = stats.gaussian_kde(Z)
    
    _, [[sp_ax, sm_ax], [sk_ax, sns_ax]] = plt.subplots(2, 2)
    plot(sp_ax, x, y, kde.pdf(Z), xlabel, ylabel)
    sp_ax.set_title('SciPy')
    
    sm_kde = sm.nonparametric.KDEMultivariate(data=X, var_type='cc',
                                              bw='normal_reference')
    sm_ax.set_title('statsmodels')
    plot(sm_ax, x, y, sm_kde.pdf(X), xlabel, ylabel)
    
    XT = np.array(X).T
    sk_kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(XT)
    sk_ax.set_title('Scikit Learn')
    plot(sk_ax, x, y, sk_kde.score_samples(XT), xlabel, ylabel)
    
    sns_ax.set_title('Seaborn')
    sns.kdeplot(x, y, ax=sns_ax)
    sns.rugplot(x, color="b", ax=sns_ax)
    sns.rugplot(y, vertical=True, ax=sns_ax)
    sns_ax.set_xlabel(xlabel)
    sns_ax.set_ylabel(ylabel)
    
    plt.tight_layout()
    HTML(dl.report.HTMLBuilder().watermark())

Refer to the following screenshot for the end result (see the kernel_density_estimation.ipynb file in this book's code bundle):

See also

主站蜘蛛池模板: 绥宁县| 航空| 慈利县| 于都县| 长顺县| 吉木萨尔县| 英超| 台湾省| 奉节县| 紫金县| 个旧市| 桓台县| 大冶市| 松原市| 漯河市| 尉犁县| 庐江县| 疏附县| 民权县| 保康县| 柘城县| 类乌齐县| 沂水县| 介休市| 邹城市| 高阳县| 乌兰浩特市| 平原县| 大厂| 从江县| 手游| 朝阳市| 高台县| 伊通| 瓦房店市| 梅河口市| 手机| 宁远县| 阜南县| 特克斯县| 东宁县|