- Statistics for Machine Learning
- Pratap Dangeti
Random forest
The random forest (RF) is a very powerful technique used frequently in the data science field for solving problems across industries, and it is often a silver bullet for winning competitions such as Kaggle. We will cover the concepts in depth in the next chapter; here we restrict ourselves to the bare necessities. A random forest is an ensemble of decision trees. As we know, logistic regression is a high-bias, low-variance technique; decision trees, on the other hand, have high variance and low bias, which makes them unstable. By averaging many decision trees, we minimize the variance component of the model, bringing it close to an ideal model.
RF samples both the observations and the variables of the training data to develop independent decision trees, then takes a majority vote for classification problems and an average for regression problems. In contrast, bagging samples only the observations at random and selects all columns; this has the deficiency that the most significant variables end up at the root of every decision tree, making the trees dependent on each other, for which accuracy is penalized.
The following are a few rules of thumb for drawing sub-samples of observations and variables in a random forest; nonetheless, any of these parameters can be tuned to improve results further. Each tree is developed on sampled data drawn from the training data, as shown in the sketch after this list:
- Sample about 2/3 of the observations in the training data for each individual tree
- Select sqrt(p) columns for a classification problem, where p is the total number of columns in the training data
- Select p/3 columns for a regression problem
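These rules of thumb map directly onto the knobs of a typical implementation. Below is a minimal sketch (not from the book) using scikit-learn's RandomForestClassifier; the synthetic dataset and all parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic data standing in for a real training set
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(
    n_estimators=200,      # number of independently grown trees
    max_features="sqrt",   # sqrt(p) columns tried at each split (classification rule)
    bootstrap=True,        # bootstrap sampling leaves ~2/3 unique observations per tree
    random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```

For regression, RandomForestRegressor with max_features=1/3 would approximate the p/3 rule in the same way.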
In the following diagram, two samples are shown in blue and pink. In the bagging scenario, a few observations but all columns are selected, whereas in random forest a few observations and a few columns are selected, creating uncorrelated individual trees.

The following diagram gives an idea of how the RF classifier works. Each tree is grown separately, and the depth of each tree varies according to the sample it was trained on, but in the end, voting is performed to determine the final class.

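To make the voting step concrete, the rough sketch below (continuing the earlier example, so it assumes the fitted rf model) collects one vote per tree for a single observation and takes the most frequent class. Note that scikit-learn's own predict() averages the trees' predicted probabilities rather than hard-voting, though the two usually agree.

```python
import numpy as np

# Each fitted tree casts a vote for the first test observation;
# the most frequent class across trees wins.
per_tree_votes = np.array(
    [tree.predict(X_test[:1])[0] for tree in rf.estimators_])
counts = np.bincount(per_tree_votes.astype(int))
print("Class votes:", counts)
print("Majority-vote class:", counts.argmax())
```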
Because it is an ensemble of decision trees, RF sacrifices interpretability and cannot determine the statistical significance of each variable; only variable importance can be provided instead. The following graph shows a sample variable-importance plot based on the mean decrease in Gini:

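Since the book's importance plot is not reproduced here, the following sketch shows how such a mean-decrease-in-Gini chart could be generated; it assumes the rf model fitted earlier and matplotlib. scikit-learn exposes these impurity-based scores as feature_importances_.

```python
import numpy as np
import matplotlib.pyplot as plt

importances = rf.feature_importances_       # impurity-based (mean decrease in Gini)
order = np.argsort(importances)[::-1]       # most important variables first

plt.bar(range(len(importances)), importances[order])
plt.xlabel("Variable (ranked)")
plt.ylabel("Mean decrease in Gini")
plt.title("Random forest variable importance")
plt.show()
```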