
Data analysis – supervised machine learning

The purpose of this analysis is to predict who survived. The outcome is either survived or not survived, which makes this a binary classification problem: there are only two possible classes.

There are lots of learning algorithms that we can use for binary classification problems. Logistic regression is one of them. As explained by Wikipedia:


In statistics, logistic regression or logit regression is a type of regression analysis used for predicting the outcome of a categorical dependent variable (a dependent variable that can take on a limited number of values, whose magnitudes are not meaningful but whose ordering of magnitudes may or may not be meaningful) based on one or more predictor variables. That is, it is used in estimating empirical values of the parameters in a qualitative response model. The probabilities describing the possible outcomes of a single trial are modeled, as a function of the explanatory (predictor) variables, using a logistic function. Frequently (and subsequently in this article) "logistic regression" is used to refer specifically to the problem in which the dependent variable is binary—that is, the number of available categories is two—and problems with more than two categories are referred to as multinomial logistic regression or, if the multiple categories are ordered, as ordered logistic regression. Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables, which are usually (but not necessarily) continuous, by using probability scores as the predicted values of the dependent variable.[1] As such it treats the same set of problems as does probit regression using similar techniques.
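
To make the logistic function mentioned in that definition concrete, here is a minimal sketch (the coefficients are hypothetical, not taken from our model) showing how a linear combination of predictors is mapped to a probability between 0 and 1:

import numpy as np

def logistic(z):
    # the logistic (sigmoid) function squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical toy coefficients: an intercept plus a single Age effect
intercept, age_coef, age = 2.0, -0.04, 30
print(logistic(intercept + age_coef * age))  # a probability, interpreted as P(Survived = 1)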

In order to use logistic regression, we need to create a formula that tells our model the type of features/inputs we're giving it:

# required imports (assumed not already loaded earlier in the chapter)
import statsmodels.api as sm
from patsy import dmatrices

# model formula
# the ~ separates the target (Survived) from the predictor features,
# and C() tells our regression that those variables are categorical.
# Ref: http://patsy.readthedocs.org/en/latest/formulas.html
formula = 'Survived ~ C(Pclass) + C(Sex) + Age + SibSp + C(Embarked)'
# create a results dictionary to hold our regression results for easy analysis later
results = {}
# create a regression-friendly design matrix using patsy's dmatrices function
y, x = dmatrices(formula, data=titanic_data, return_type='dataframe')
# instantiate our model
model = sm.Logit(y, x)
# fit our model to the training data
res = model.fit()
# save the result for outputting predictions later
results['Logit'] = [res, formula]
res.summary()
Output:
Optimization terminated successfully.
Current function value: 0.444388
Iterations 6

Figure 11: Logistic regression results
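
The coefficients in the summary are reported on the log-odds scale. A minimal sketch for reading them more intuitively (assuming the fitted res object from the code above) is to exponentiate them into odds ratios:

import numpy as np

# exponentiate the log-odds coefficients to obtain odds ratios;
# values above 1 increase the odds of survival, values below 1 decrease them
odds_ratios = np.exp(res.params)
print(odds_ratios)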

Now, let's plot our model's predictions against the actual values, along with the residuals, which are the differences between the actual and predicted values of the target variable:

# Plot predictions vs. actual values
import matplotlib.pyplot as plt

plt.figure(figsize=(18, 4))
ax1 = plt.subplot(121, facecolor="#DBDBDB")
# generate predicted survival probabilities from our fitted model
ypred = res.predict(x)
# blue dots are the fitted/predicted probabilities, magenta dots are the actual values
plt.plot(x.index, ypred, 'bo', x.index, y, 'mo', alpha=.25)
plt.grid(color='white', linestyle='dashed')
plt.title('Logit predictions (blue) vs. actual values (magenta)')
# Deviance residuals of the fitted model
ax2 = plt.subplot(122, facecolor="#DBDBDB")
plt.plot(res.resid_dev, 'r-')
plt.grid(color='white', linestyle='dashed')
ax2.set_xlim(-1, len(res.resid_dev))
plt.title('Logit Residuals')
Figure 12: Understanding the logit regression model
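
Note that the model outputs probabilities rather than hard class labels. As a minimal sketch (reusing the ypred and y objects from the plotting code above), you can threshold the probabilities at 0.5 and measure how often the predicted label matches the actual one on the training data:

import numpy as np

# turn predicted probabilities into hard 0/1 labels using a 0.5 threshold
predicted_labels = (np.asarray(ypred) > 0.5).astype(int)
actual_labels = np.asarray(y['Survived']).astype(int)
# fraction of training rows where the predicted label matches the actual one
training_accuracy = np.mean(predicted_labels == actual_labels)
print('Training accuracy: {:.3f}'.format(training_accuracy))

Keep in mind that this is accuracy on the training data, so it will be optimistic compared to performance on unseen data.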

We have now built our logistic regression model, having previously analyzed and explored the dataset. The preceding example shows you the general pipeline for building a machine learning solution.

Practitioners often fall into technical pitfalls because they lack experience with the underlying concepts of machine learning. For example, someone might get an accuracy of 99% on the test set and deploy the model without investigating the distribution of classes in the data (such as how many samples are negative and how many are positive); if 99% of the samples belong to one class, a model that always predicts that class achieves the same accuracy while learning nothing.
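
A quick way to avoid that pitfall is to check the class distribution and compare the model against a trivial majority-class baseline, as in this minimal sketch (assuming the titanic_data DataFrame used earlier):

# proportion of each class in the target column
class_proportions = titanic_data['Survived'].value_counts(normalize=True)
print(class_proportions)

# accuracy of a trivial model that always predicts the majority class;
# any useful classifier should clearly beat this baseline
print('Majority-class baseline accuracy: {:.3f}'.format(class_proportions.max()))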

The next section highlights some of these concepts and differentiates between the different kinds of errors you need to be aware of, and which ones you should really care about.
