
Comparing the entropy differences (information gain)

To know which variable to choose for the first split, we calculate the information gain, G, when going from the original data to the corresponding subsets, as the difference between the entropy values:

G(f1,f2) = S(f1) - S(f1,f2)

Here, S(f1) is the entropy of the target variable and S(f1,f2) is the entropy of the target variable with respect to feature f2, that is, the weighted average of the target's entropy over the subsets defined by the feature's values. These entropy values were calculated in the previous subsections, so we use them here:

  • If we choose Outlook as the first variable to split the tree, the information gain is as follows:

G(Train outside,Outlook) = S(Train outside) - S(Train outside,Outlook)
                         = 0.94 - 0.693 = 0.247

  • If we choose Temperature, the information gain is as follows:

G(Train outside,Temperature) = S(Train outside) - S(Train outside,Temperature)
                             = 0.94 - 0.911 = 0.029

  • If we choose Humidity, the information gain is as follows:

G(Train outside,Humidity) = S(Train outside) - S(Train outside,Humidity)
                          = 0.94 - 0.788 = 0.152

  • Finally, choosing Windy gives the following information gain:

G(Train outside,Windy) = S(Train outside) - S(Train outside,Windy)
                       = 0.94 - 0.892 = 0.048

All these calculations are easily performed in a worksheet using Excel formulas.
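For readers who prefer to check the numbers in code, the following minimal Python sketch reproduces the same calculation. It assumes the standard 14-row weather dataset used in the previous subsections; the entropy helpers themselves are generic:

```python
import math
from collections import Counter

# The 14-row weather dataset assumed from the previous subsections
# (columns: Outlook, Temperature, Humidity, Windy, Train outside).
rows = [
    ("Sunny",    "Hot",  "High",   False, "No"),
    ("Sunny",    "Hot",  "High",   True,  "No"),
    ("Overcast", "Hot",  "High",   False, "Yes"),
    ("Rainy",    "Mild", "High",   False, "Yes"),
    ("Rainy",    "Cool", "Normal", False, "Yes"),
    ("Rainy",    "Cool", "Normal", True,  "No"),
    ("Overcast", "Cool", "Normal", True,  "Yes"),
    ("Sunny",    "Mild", "High",   False, "No"),
    ("Sunny",    "Cool", "Normal", False, "Yes"),
    ("Rainy",    "Mild", "Normal", False, "Yes"),
    ("Sunny",    "Mild", "Normal", True,  "Yes"),
    ("Overcast", "Mild", "High",   True,  "Yes"),
    ("Overcast", "Hot",  "Normal", False, "Yes"),
    ("Rainy",    "Mild", "High",   True,  "No"),
]
features = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Windy": 3}
TARGET = 4  # index of the Train outside column

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(rows, col):
    """Weighted average entropy of the target after splitting on a column."""
    n = len(rows)
    total = 0.0
    for value, count in Counter(r[col] for r in rows).items():
        subset = [r[TARGET] for r in rows if r[col] == value]
        total += count / n * entropy(subset)
    return total

s_target = entropy([r[TARGET] for r in rows])
print(f"S(Train outside) = {s_target:.3f}")           # 0.940
for name, col in features.items():
    gain = s_target - conditional_entropy(rows, col)
    print(f"G(Train outside,{name}) = {gain:.3f}")
```

Running it prints S(Train outside) = 0.940 and the four gains listed above, confirming that Outlook gives the largest one.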

The variable to choose for the first split of the tree is the one showing the largest information gain, that is, Outlook. If we do this, we will notice that one of the resulting subsets (Overcast) has zero entropy, so we don't need to split it further.
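As a quick check of this claim, we can reuse the helpers from the sketch above to compute the entropy of each subset produced by splitting on Outlook:

```python
# Entropy of each subset after splitting on Outlook (column 0).
for value in ("Sunny", "Overcast", "Rainy"):
    subset = [r[TARGET] for r in rows if r[0] == value]
    print(f"S({value}) = {entropy(subset):.3f}")
# Sunny and Rainy still need splitting (0.971 each);
# Overcast is pure (0.000), so it becomes a leaf.
```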

To continue building the tree following a similar procedure, the steps to take are as follows (the code sketch after this list wraps the same logic in a recursive function):

  1. Calculate S(Sunny), S(Sunny,Temperature), S(Sunny,Humidity), and S(Sunny,Windy).
  2. Calculate G(Sunny,Temperature), G(Sunny,Humidity), and G(Sunny,Windy).
  3. The largest value tells us which feature to use to split the Sunny subset.
  4. Calculate G(Rainy,Temperature), G(Rainy,Humidity), and G(Rainy,Windy) in the same way, using S(Rainy), S(Rainy,Temperature), S(Rainy,Humidity), and S(Rainy,Windy).
  5. The largest value tells us which feature to use to split the Rainy subset.
  6. Continue iterating until every subset has zero entropy or there are no features left to use.
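Putting these steps together, here is a minimal sketch of the full recursion (a bare-bones ID3, reusing rows, features, TARGET, entropy, and conditional_entropy from the first sketch); steps 3, 5, and 6 above correspond to the max() choice and the two stopping conditions:

```python
def build_tree(rows, available):
    """Bare-bones ID3: split on the highest-gain feature, then recurse."""
    labels = [r[TARGET] for r in rows]
    # Stop when the subset is pure (zero entropy) or no features remain.
    if entropy(labels) == 0 or not available:
        return Counter(labels).most_common(1)[0][0]
    s = entropy(labels)
    # Pick the feature with the largest information gain.
    best = max(available, key=lambda f: s - conditional_entropy(rows, features[f]))
    col = features[best]
    rest = [f for f in available if f != best]
    return {best: {value: build_tree([r for r in rows if r[col] == value], rest)
                   for value in set(r[col] for r in rows)}}

print(build_tree(rows, list(features)))
# The root splits on Outlook; the Overcast branch is a pure 'Yes' leaf.
```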

As we will see later in this book, trees are never built by hand, but it is important to understand how they work and which calculations are involved. Using Excel, it is easy to follow the full process step by step. Following the same principle, we will work through an unsupervised learning example in the next section.
