- Building Machine Learning Systems with Python
- Luis Pedro Coelho, Willi Richert, Matthieu Brucher
Which classifier to use
So far, we have looked at two classical classifiers, namely the decision tree and the nearest neighbor classifier. Scikit-learn supports many more, but it does not support everything that has ever been proposed in academic literature. Thus, one may be left wondering: which one should I use? Is it even important to learn about all of them?
In many cases, knowledge of your dataset may help you decide which classifier has a structure that best matches your problem. However, there is a very good study by Manuel Fernández-Delgado and his colleagues titled, Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? This is a very readable, very practically-oriented study, where the authors conclude that there is actually one classifier which is very likely to be the best (or close to the best) for a majority of problems, namely random forests.
What is a random forest? As the name suggests, a forest is a collection of trees; in this case, a collection of decision trees. How do we obtain many trees from a single dataset? If you call the methods we used before several times, you will get exactly the same tree every time. The trick is to call the method several times on different random variations of the dataset: each time, we take a fraction of the data points and a fraction of the features, so each time a different tree is built. At classification time, all the trees vote and a final decision is reached. There are many parameters that determine the minor details, but only one is really important, namely the number of trees that you use. In general, the more trees you build, the more memory is required, but your classification accuracy will also increase (up to a plateau of optimal performance). The default in scikit-learn is 10 trees (newer versions default to 100). Unless your dataset is so large that memory usage becomes problematic, increasing this value is often advantageous:
from sklearn import ensemble, model_selection
import numpy as np

rf = ensemble.RandomForestClassifier(n_estimators=100)
predict = model_selection.cross_val_predict(rf, features, target)
print("RF accuracy: {:.1%}".format(np.mean(predict == target)))
On this dataset, the result is about 86 percent (it may be slightly different when you run it, because building the forest is a randomized process).
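The voting scheme described above can also be written out by hand, which makes the mechanism concrete. The following is only a rough sketch, not what scikit-learn does internally (scikit-learn re-randomizes the feature choice at every split, and this sketch measures accuracy on the training data rather than with cross-validation). It assumes the same features and target arrays used above:

import numpy as np
from sklearn import tree

rng = np.random.RandomState(0)
n_trees = 100
n_samples, n_features = features.shape
n_sub = max(1, int(np.sqrt(n_features)))  # how many features each tree sees

trees = []
subsets = []
for _ in range(n_trees):
    rows = rng.randint(0, n_samples, n_samples)          # random sample of the data points (with replacement)
    cols = rng.choice(n_features, n_sub, replace=False)  # random subset of the features
    t = tree.DecisionTreeClassifier()
    t.fit(features[rows][:, cols], target[rows])
    trees.append(t)
    subsets.append(cols)

# At classification time, every tree votes and the majority wins
classes = np.unique(target)
votes = np.zeros((len(classes), n_samples), dtype=int)
for t, cols in zip(trees, subsets):
    pred = t.predict(features[:, cols])
    for ci, c in enumerate(classes):
        votes[ci] += (pred == c)
prediction = classes[votes.argmax(axis=0)]
print("Hand-rolled forest accuracy (training data): {:.1%}".format(
    np.mean(prediction == target)))

Each individual tree sees only part of the data and part of the features, so on its own it is a weak, high-variance classifier; averaging the votes of many such trees is what gives the forest its robustness.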
Another big advantage of random forests is that, since they are based on decision trees, they ultimately only perform binary decisions based on feature thresholds. As a result, they are invariant to scaling the features up or down.
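This invariance is easy to check directly. The snippet below is a small sketch (again assuming the features and target arrays from above, and fixing random_state so the two runs build the same trees): multiplying every feature by a constant moves the thresholds inside the trees, but not the decisions, so the cross-validated predictions should agree essentially everywhere:

import numpy as np
from sklearn import ensemble, model_selection

rf = ensemble.RandomForestClassifier(n_estimators=100, random_state=42)

# Predictions on the original features
pred_raw = model_selection.cross_val_predict(rf, features, target)

# Predictions after multiplying every feature by 1000
pred_scaled = model_selection.cross_val_predict(rf, features * 1000.0, target)

print("Fraction of identical predictions: {:.1%}".format(
    np.mean(pred_raw == pred_scaled)))

This should print 100 percent, or within a whisker of it. Classifiers such as nearest neighbors, by contrast, change their answers when features are rescaled, which is why they usually require a normalization step first.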