
Classification trees

Classification trees operate under the same principle as regression trees, except that the splits aren't determined by the RSS but by an error measure. That measure isn't what you might expect, namely the number of misclassified observations divided by the total number of observations. As it turns out, when it comes to tree-splitting, the misclassification rate, by itself, can lead to a situation where a further split gains information without improving that rate at all. Let's look at an example.

Suppose we have a node, let's call it N0, with seven observations labeled No and three observations labeled Yes. We can say that the misclassification rate is 30%. With this in mind, let's calculate a common alternative error measure called the Gini index. For a node with class proportions p₁ and p₂, the Gini index of a single node is as follows:

Gini = 1 - (p₁)² - (p₂)²

Then, for N0, the Gini index is 1 - (0.7)² - (0.3)², which is equal to 0.42, versus the misclassification rate of 30%.
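To make the arithmetic concrete, here is a minimal R sketch that computes the Gini index of a node from its class counts; the helper name gini_node is my own for illustration and isn't part of any package:

    # Gini index of a single node: 1 minus the sum of squared class proportions
    gini_node <- function(counts) {
      p <- counts / sum(counts)
      1 - sum(p^2)
    }

    gini_node(c(7, 3))  # N0: seven No, three Yes -> 0.42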

Taking this example further, we'll now split N0 into node N1, with three of the No observations and none of the Yes observations, and node N2, which has the remaining four No observations and all three Yes observations. The overall misclassification rate for this branch of the tree is still 30%, but look at how the overall Gini index has improved (the calculation is reproduced in the sketch after this list):

  • Gini(N1) = 1 - (3/3)² - (0/3)² = 0
  • Gini(N2) = 1 - (4/7)² - (3/7)² = 0.49
  • New Gini index = (proportion of N1 × Gini(N1)) + (proportion of N2 × Gini(N2)), which is equal to (0.3 × 0) + (0.7 × 0.49), or 0.343
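Continuing the sketch from above (gini_node is the assumed helper defined earlier, not a library function), the weighted Gini index of the split can be checked like this:

    n1 <- c(3, 0)  # N1: three No, zero Yes
    n2 <- c(4, 3)  # N2: four No, three Yes
    w1 <- sum(n1) / (sum(n1) + sum(n2))  # proportion of observations in N1 (0.3)
    w2 <- sum(n2) / (sum(n1) + sum(n2))  # proportion of observations in N2 (0.7)
    w1 * gini_node(n1) + w2 * gini_node(n2)  # weighted Gini -> 0.343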

By splitting on a surrogate error measure, we actually improved the impurity of our model, reducing it from 0.42 to 0.343, whereas the misclassification rate didn't change. This is the methodology that's used by the rpart package, which we'll be using in this chapter.
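As a preview, here is a minimal sketch of fitting a classification tree with rpart, using the built-in iris data purely as a placeholder; parms = list(split = "gini") makes the Gini criterion explicit, although it is already rpart's default for classification:

    library(rpart)

    # Fit a classification tree; splits are chosen by Gini impurity
    fit <- rpart(Species ~ ., data = iris, method = "class",
                 parms = list(split = "gini"))
    print(fit)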
