
Implementing linear regression through scikit-learn

As we did in the previous chapter, we will show you how to quickly use scikit-learn to train a linear model straight from a SageMaker notebook instance. First, you must create the notebook instance (choosing conda_python3 as the kernel).

  1. We will start by loading the training data into a pandas dataframe:

import pandas as pd

# SRC_PATH is assumed to have been defined earlier as the data location
housing_df = pd.read_csv(SRC_PATH + 'train.csv')
housing_df.head()

The preceding code displays the first few rows of the dataframe.

  2. The last column, medv, stands for median value and represents the variable that we're trying to predict (the dependent variable) based on the values of the remaining columns (the independent variables).

As usual, we will split the dataset for training and testing. Note that we define the label and feature columns up front so that they are available to the remaining steps:

from sklearn.model_selection import train_test_split

label = 'medv'
training_features = ['crim', 'zn', 'indus', 'chas', 'nox',
                     'rm', 'age', 'dis', 'tax', 'ptratio', 'lstat']

# Reorder the columns so that the label comes first, then the features
housing_df_reordered = housing_df[[label] + training_features]

# Hold out 20% of the rows for testing
training_df, test_df = train_test_split(housing_df_reordered,
                                        test_size=0.2)

  3. Once we have these datasets, we will proceed to construct a linear regressor:

from sklearn.linear_model import LinearRegression

# Construct the estimator and fit it on the training features and label
regression = LinearRegression()
model = regression.fit(training_df[training_features],
                       training_df['medv'])

We start by constructing an estimator (in this case, linear regression) and fit the model by providing the matrix of training values, training_df[training_features], and the labels, training_df['medv'].

  4. After fitting the model, we can use it to get predictions for every row in our testing dataset. We do this by appending a new column to our existing testing dataframe:
test_df['predicted_medv'] = model.predict(test_df[training_features])
test_df.head()

The preceding code displays the first few rows of the test dataframe, now including the predicted_medv column.
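Beyond inspecting the head of the dataframe, you might also want to quantify the per-row error. The following is a small optional sketch (the abs_error column name is our own choice, not from the book):

# Absolute difference between actual and predicted values for each row
test_df['abs_error'] = (test_df['medv'] - test_df['predicted_medv']).abs()
test_df[['medv', 'predicted_medv', 'abs_error']].head()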

  5. It's always useful to check our predictions graphically. One way to do this is by plotting the predicted versus actual values as a scatterplot:

test_df[['medv', 'predicted_medv']].plot(kind='scatter',
                                         x='medv',
                                         y='predicted_medv')

The preceding code displays the resulting scatterplot.

Note how the values are located mostly on the diagonal. This is a good sign, as a perfect regressor would yield all data points exactly on the diagonal (every predicted value would be exactly the same as the actual value).
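To make the diagonal easier to judge, you could overlay the identity line on the same scatterplot. The following is a minimal sketch using matplotlib (our own addition, not from the book); the styling choices are arbitrary:

import matplotlib.pyplot as plt

# Scatter predicted versus actual values
ax = test_df.plot(kind='scatter', x='medv', y='predicted_medv')

# Overlay the identity line y = x: a perfect regressor would place
# every point exactly on this diagonal
lims = [test_df['medv'].min(), test_df['medv'].max()]
ax.plot(lims, lims, linestyle='--', color='gray')
plt.show()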

  6. In addition to this graphical verification, we can obtain an evaluation metric that tells us how good our model is at predicting values. In this example, we use the R-squared evaluation metric, explained in the previous section, which is available in scikit-learn.

Let's look at the following code block:

from sklearn.metrics import r2_score

r2_score(test_df['medv'], test_df['predicted_medv'])

0.695

A value near 0.7 is decent. If you want to develop an intuition for what a good correlation looks like, we recommend playing this game: http://guessthecorrelation.com/.
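For intuition, R-squared can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The following quick sketch (our own, not from the book) should reproduce the r2_score result:

y_true = test_df['medv']
y_pred = test_df['predicted_medv']

# R^2 = 1 - SS_res / SS_tot
ss_res = ((y_true - y_pred) ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
print(1 - ss_res / ss_tot)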

Our linear model will create a predicted price by multiplying the value of each feature by a coefficient and adding up all these values, plus an independent term, or intercept.

We can find the values of these coefficients and intercept by accessing the data members in the model instance variable:

model.coef_

array([-7.15121101e-02,  3.78566895e-02, -4.47104045e-02,  5.06817970e+00,
       -1.44690998e+01,  3.98249374e+00, -5.88738235e-03, -1.73656446e+00,
        1.01325463e-03, -6.18943939e-01, -6.55278930e-01])

model.intercept_

32.20
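To connect these numbers back to the formula described previously, we can reproduce a single prediction by hand. Here is a minimal sketch (the choice of the first test row is arbitrary):

import numpy as np

# coef · x + intercept for the first test row
row = test_df[training_features].iloc[0]
manual_prediction = np.dot(model.coef_, row) + model.intercept_

# Should agree with the model's own prediction for the same row
print(manual_prediction)
print(model.predict(test_df[training_features].iloc[[0]])[0])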

It is usually very convenient to examine the coefficients of the different variables as they can be indicative of the relative importance of the features in terms of their independent predictive ability. 
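One simple way to do this inspection is to pair each coefficient with its feature name, as in the following short sketch (our own; keep in mind that coefficients are only directly comparable when the features are on similar scales):

# Pair feature names with learned coefficients, largest magnitude first
for feature, coef in sorted(zip(training_features, model.coef_),
                            key=lambda pair: abs(pair[1]),
                            reverse=True):
    print(f'{feature:>8}: {coef: .4f}')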

Many linear regression implementations, such as the one in Spark, will do some degree of preprocessing by default (for example, scaling the variables to prevent features with large values from introducing bias); in scikit-learn, such preprocessing is applied explicitly. These libraries also support regularization parameters and provide options for choosing the optimizer that's used to efficiently search for the coefficients that minimize the loss function (and therefore maximize the R2 score).
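As an illustration of these options in scikit-learn, the following sketch (our own, not from the book) combines explicit feature scaling with a ridge (L2-regularized) regressor; the alpha value here is an arbitrary choice for demonstration:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Standardize features to zero mean / unit variance, then fit a
# ridge regression with regularization strength alpha
ridge_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge_model.fit(training_df[training_features], training_df['medv'])

# score() returns the R2 on the held-out test set
print(ridge_model.score(test_df[training_features], test_df['medv']))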
