Finding regression intervals
It's not always guaranteed that we have accurate models. Sometimes, our data is inherently noisy and we cannot model it well using a regressor. In these cases, it is important to be able to quantify how certain we are in our estimations. Usually, regressors make point predictions. These are the expected values (typically the mean) of the target (y) at each value of x. A Bayesian ridge regressor is capable of returning the expected values as usual, yet it also returns the standard deviation of the target (y) at each value of x.
To demonstrate how this works, let's create a noisy dataset, where y = x + noise, and the noise is drawn from a normal distribution:
import numpy as np
import pandas as pd

df_noisy = pd.DataFrame(
    {
        # Random integers between 0 and 30, inclusive
        'x': np.random.randint(0, 31, size=150),
        # Gaussian noise with a standard deviation of 5
        'noise': np.random.normal(loc=0.0, scale=5.0, size=150)
    }
)
df_noisy['y'] = df_noisy['x'] + df_noisy['noise']
Then, we can plot it in the form of a scatter plot:
df_noisy.plot(
    kind='scatter', x='x', y='y'
)
Plotting the resulting data frame will give us the following plot:

Now, let's train two regressors on the same data—LinearRegression and BayesianRidge. I will stick to the default values for the Bayesian ridge hyperparameters here:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import BayesianRidge

lr = LinearRegression()
br = BayesianRidge()

# The linear regressor makes point predictions only
lr.fit(df_noisy[['x']], df_noisy['y'])
df_noisy['y_lr_pred'] = lr.predict(df_noisy[['x']])

# With return_std=True, the Bayesian ridge regressor also returns
# the standard deviation of the target at each point
br.fit(df_noisy[['x']], df_noisy['y'])
df_noisy['y_br_pred'], df_noisy['y_br_std'] = br.predict(
    df_noisy[['x']], return_std=True
)
Notice how the Bayesian ridge regressor returns two values when predicting: the expected value of the target and its standard deviation, since we passed return_std=True.
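To make this concrete, here is a minimal sketch that queries the fitted br model at a single point (the value x = 15 is an arbitrary choice) and prints both returned values:
# Ask the fitted model for the expected value of y, and its standard
# deviation, at a single arbitrary point (x = 15)
y_mean, y_std = br.predict(pd.DataFrame({'x': [15]}), return_std=True)
print(f'Expected y at x=15: {y_mean[0]:.2f} +/- {y_std[0]:.2f} (1 std dev)')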
The predictions made by the two models are very similar. Nevertheless, we can use the returned standard deviation to calculate a range around the predictions that we expect most of the future data to fall into. The following code snippet creates plots for the two models and their predictions:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 3, figsize=(16, 6), sharex=True, sharey=True)

# Sort by x once; fill_between expects the x values in order
df_sorted = df_noisy.sort_values('x')

# We plot the data 3 times
df_sorted.plot(
    title='Data', kind='scatter', x='x', y='y', ax=axs[0]
)
df_sorted.plot(
    kind='scatter', x='x', y='y', ax=axs[1], marker='o', alpha=0.25
)
df_sorted.plot(
    kind='scatter', x='x', y='y', ax=axs[2], marker='o', alpha=0.25
)

# Here we plot the Linear Regression predictions
df_sorted.plot(
    title='LinearRegression', kind='scatter', x='x', y='y_lr_pred',
    ax=axs[1], marker='o', color='k', label='Predictions'
)

# Here we plot the Bayesian Ridge predictions
df_sorted.plot(
    title='BayesianRidge', kind='scatter', x='x', y='y_br_pred',
    ax=axs[2], marker='o', color='k', label='Predictions'
)

# Here we plot the range around the expected values
# We multiply by 1.96 for a 95% Confidence Interval
axs[2].fill_between(
    df_sorted['x'],
    df_sorted['y_br_pred'] - 1.96 * df_sorted['y_br_std'],
    df_sorted['y_br_pred'] + 1.96 * df_sorted['y_br_std'],
    color="k", alpha=0.2, label="Predictions +/- 1.96 * Std Dev"
)

plt.show()
Running the preceding code gives us the following graphs. In the BayesianRidge case, the shaded area shows where we expect 95% of our targets to fall, since about 95% of a normal distribution's mass lies within 1.96 standard deviations of its mean:

Regression intervals are handy when we want to quantify our uncertainties. In Chapter 8, Ensembles – When One Model Is Not Enough, we will revisit regression intervals.
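Before moving on, one quick sanity check is worth a look. The following sketch reuses the df_noisy data frame from above and measures how many of the training targets actually fall inside the 95% interval; for a well-calibrated interval, the fraction should be close to 95%:
# Fraction of the training targets inside the 95% interval
lower = df_noisy['y_br_pred'] - 1.96 * df_noisy['y_br_std']
upper = df_noisy['y_br_pred'] + 1.96 * df_noisy['y_br_std']
coverage = ((df_noisy['y'] >= lower) & (df_noisy['y'] <= upper)).mean()
print(f'Empirical coverage: {coverage:.0%}')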