
Applying random forests

Random forests in scikit-learn use the Estimator interface, allowing us to use almost exactly the same code as before to perform cross-validation:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf = RandomForestClassifier(random_state=14)
scores = cross_val_score(clf, X_teams, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

This results in an immediate accuracy of 65.3 percent, an improvement of 2.5 points just from swapping the classifier.

Random forests, which use subsets of the features, should be able to learn more effectively with more features than normal decision trees can. We can test this by throwing more features at the algorithm and seeing how the accuracy changes:

X_all = np.hstack([X_lastwinner, X_teams])
clf = RandomForestClassifier(random_state=14)
scores = cross_val_score(clf, X_all, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

This results in 63.3 percent, a drop in performance! One cause is that the randomness inherent in random forests chooses only a subset of the features to use, rather than all of them. Further, there are many more features in X_teams than in X_lastwinner, and these extra features carry less relevant information, diluting what the trees can learn from. That said, don't get too excited by small changes in percentages, either up or down. Changing the random state value will have more of an impact on the accuracy than the slight difference between these feature sets that we just observed. Instead, you should run many tests with different random states to get a good sense of the mean and spread of the accuracy values.
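As a minimal sketch of such a test (the 20 random states here are an arbitrary choice, not a value from this chapter), we can repeat the cross-validation in a loop and summarize the results:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Repeat the same cross-validation with many different random states
accuracies = []
for random_state in range(20):
    clf = RandomForestClassifier(random_state=random_state)
    scores = cross_val_score(clf, X_all, y_true, scoring='accuracy')
    accuracies.append(np.mean(scores))

# The spread tells us how much of any single difference is just noise
print("Mean accuracy: {0:.1f}%".format(np.mean(accuracies) * 100))
print("Standard deviation: {0:.1f}%".format(np.std(accuracies) * 100))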

We can also try some other parameters, using the GridSearchCV class we introduced in Chapter 2, Classifying using scikit-learn Estimators:

from sklearn.model_selection import GridSearchCV

parameter_space = {
    "max_features": [2, 10, 'sqrt'],
    "n_estimators": [100, 200],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [2, 4, 6],
}

clf = RandomForestClassifier(random_state=14)
grid = GridSearchCV(clf, parameter_space)
grid.fit(X_all, y_true)
print("Accuracy: {0:.1f}%".format(grid.best_score_ * 100))

This results in a much better accuracy of 67.4 percent!

If we want to see the parameters that were used, we can print out the best model found by the grid search. The code is as follows:

print(grid.best_estimator_)

The result shows the parameters that were used in the best scoring model:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
max_depth=None, max_features=2, max_leaf_nodes=None,
min_samples_leaf=2, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
oob_score=False, random_state=14, verbose=0, warm_start=False)
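
If you only need the winning parameter values, rather than the full model, GridSearchCV also exposes them directly through its best_params_ attribute:

# Print just the parameters chosen by the grid search
print(grid.best_params_)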