Evaluating the model

We have used a learning algorithm to estimate a model's parameters from training data. How can we assess whether our model is a good representation of the real relationship? Let's assume that you have found another page in your pizza journal. We will use this page's entries as a test set to measure the performance of our model. We have added a fourth column; it contains the prices predicted by our model.

Several measures can be used to assess our model's predictive capability. We will evaluate our pizza price predictor using a measure called R-squared. Also known as the coefficient of determination, R-squared measures how closely the data fit a regression line. There are several methods for calculating R-squared. In the case of simple linear regression, R-squared is equal to the square of the Pearson product-moment correlation coefficient (PPMCC), or Pearson's r. Using this method, R-squared must lie between zero and one, inclusive. This method is intuitive; if R-squared describes the proportion of variance in the response variable that is explained by the model, it cannot be greater than one or less than zero. Other methods, including the method used by scikit-learn, do not calculate R-squared as the square of Pearson's r. Using these methods, R-squared can be negative if the model performs extremely poorly, that is, if it predicts worse than always guessing the mean of the observed values.

It is important to note the limitations of performance metrics. R-squared in particular is sensitive to outliers, and it can spuriously increase when features are added to the model.
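To make the difference between the two definitions concrete, here is a minimal sketch that uses only NumPy. The helper names `pearson_r_squared` and `residual_r_squared` are illustrative and do not belong to any library. The observed values are the test prices used later in this section, and the "predictions" are deliberately bad, made-up values that move in the opposite direction to the observations. The second helper follows the residual-based definition that scikit-learn documents for its R-squared score: one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np


def pearson_r_squared(y_true, y_pred):
    """Square of Pearson's r between observations and predictions."""
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return r ** 2


def residual_r_squared(y_true, y_pred):
    """One minus (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Observed test prices from this section's example
y_true = np.array([11, 8.5, 15, 18, 11])

# Hypothetical predictions, perfectly anti-correlated with the observations
y_bad = 30 - y_true

r2_pearson = pearson_r_squared(y_true, y_bad)
r2_residual = residual_r_squared(y_true, y_bad)

print(f"Squared Pearson's r:      {r2_pearson:.4f}")
print(f"Residual-based R-squared: {r2_residual:.4f}")
```

Because the made-up predictions are a perfect linear function of the observations, the squared correlation is one even though every prediction is far from its target. The residual-based score penalizes those errors and comes out negative.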

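We will follow the method used by scikit-learn to calculate R-squared for our pizza price predictor. First, we must measure the total sum of squares, where $y_i$ is the observed value of the response variable for the $i$th test instance, $\bar{y}$ is the mean of the observed values of the response variable, and $n$ is the number of test instances:

$$SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

Next, we must find the residual sum of squares (RSS). Recall that this is also our cost function. Here, $\hat{y}_i$ denotes the value that the model predicts for the $i$th test instance:

$$SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Finally, we can find R-squared using the following:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$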
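The three steps above can be sketched by hand with NumPy, using the training and test values from this section's example. This is only an illustration: `np.polyfit` with `deg=1` stands in for the fitted model as an ordinary least squares line, and the variable names `ss_tot`, `ss_res`, and `r_squared` are chosen here for readability.

```python
import numpy as np

# Training and test data from this section's example
x_train = np.array([6, 8, 10, 14, 18], dtype=float)
y_train = np.array([7, 9, 13, 17.5, 18])
x_test = np.array([8, 9, 11, 16, 12], dtype=float)
y_test = np.array([11, 8.5, 15, 18, 11])

# Fit a least squares line to the training data (stand-in for the model)
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# Predicted values for the test instances
y_pred = slope * x_test + intercept

# Total sum of squares: squared deviations of the observations from their mean
ss_tot = np.sum((y_test - y_test.mean()) ** 2)

# Residual sum of squares: squared differences between observations and predictions
ss_res = np.sum((y_test - y_pred) ** 2)

# R-squared as one minus the ratio of the two sums
r_squared = 1 - ss_res / ss_tot

print(f"SS_tot = {ss_tot:.4f}")
print(f"SS_res = {ss_res:.4f}")
print(f"R-squared = {r_squared:.4f}")
```

With these values, the total sum of squares should come out at 56.8, the residual sum of squares at roughly 19.198, and R-squared at roughly 0.662.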

The R-squared score of 0.662 indicates that a large proportion of the variance in the test instances' prices is explained by the model. Now let's confirm our calculation using scikit-learn. The score method of LinearRegression returns the model's R-squared value, as seen in the following example:

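# In[1]:
import numpy as np
from sklearn.linear_model import LinearRegression

# Training instances: one explanatory variable per row, with the observed prices
X_train = np.array([6, 8, 10, 14, 18]).reshape(-1, 1)
y_train = [7, 9, 13, 17.5, 18]

# Test instances held out from training
X_test = np.array([8, 9, 11, 16, 12]).reshape(-1, 1)
y_test = [11, 8.5, 15, 18, 11]

model = LinearRegression()
model.fit(X_train, y_train)

# score() returns R-squared on the data passed to it, here the test set
r_squared = model.score(X_test, y_test)
print(f'{r_squared:.4f}')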

# Out[1]:
0.6620
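As a further cross-check, the value returned by `score` can be compared with `r2_score` from `sklearn.metrics`, applied to the model's own predictions. The sketch below assumes a scikit-learn installation that provides both; the names `score_from_model` and `score_from_metric` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Same training and test data as in the example above
X_train = np.array([6, 8, 10, 14, 18]).reshape(-1, 1)
y_train = [7, 9, 13, 17.5, 18]
X_test = np.array([8, 9, 11, 16, 12]).reshape(-1, 1)
y_test = [11, 8.5, 15, 18, 11]

model = LinearRegression().fit(X_train, y_train)

# R-squared as reported by the estimator
score_from_model = model.score(X_test, y_test)

# R-squared computed from the observed and predicted test values
score_from_metric = r2_score(y_test, model.predict(X_test))

print(np.isclose(score_from_model, score_from_metric))
```

The two numbers should agree to within floating-point error, which makes `r2_score` a convenient option when only the predictions, and not the fitted estimator, are at hand.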