Our first model – decision tree

Our first attempt at classifying the Higgs-Boson against background noise will use a decision tree algorithm. We purposely skip the intuition behind this algorithm, as it is already well documented, with plenty of supporting literature for the reader to consume (http://www.saedsayad.com/decision_tree.htm, http://spark.apache.org/docs/latest/mllib-decision-tree.html). Instead, we will focus on the hyper-parameters and on how to interpret the model's efficacy with respect to certain criteria and error measures. Let's start with the basic parameters:

val numClasses = 2 
val categoricalFeaturesInfo = Map[Int, Int]() 
val impurity = "gini" 
val maxDepth = 5 
val maxBins = 10 

With these values, we are explicitly telling Spark that we wish to build a decision tree classifier that distinguishes between two classes. Let's take a closer look at some of the hyper-parameters for our decision tree and see what they mean:

  • numClasses: How many classes are we trying to classify? In this example, we wish to distinguish between the Higgs-Boson particle and background noise, and thus there are two classes.
  • categoricalFeaturesInfo: A specification declaring which features are categorical and should not be treated as numbers (ZIP code is a popular example). It is a map from feature index to the number of categories for that feature; an empty map means every feature is treated as continuous. There are no categorical features in this dataset that we need to worry about.
  • impurity: A measure of the homogeneity of the labels at the node. Currently in Spark, there are two measures of impurity for classification, Gini and entropy, and one for regression, variance.
  • maxDepth: A stopping criterion that limits the depth of constructed trees. Generally, deeper trees fit the training data more closely but run the risk of overfitting.
  • maxBins: Number of bins (think "values") used to discretize continuous features when the tree considers splits. Generally, increasing the number of bins allows the tree to consider more candidate split values but also increases computation time.
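
To make the two classification impurity measures a little more concrete, here is a minimal sketch in plain Scala of how Gini and entropy can be computed from the labels that land in a single node. It does not call Spark; the object name ImpuritySketch and its helper functions are hypothetical names introduced for illustration only, and the entropy here uses a base-2 logarithm.

```scala
// Illustrative sketch only: ImpuritySketch and its methods are hypothetical
// names, not part of Spark's API.
object ImpuritySketch {

  // Fraction of the node's labels that belong to each distinct class.
  def proportions(labels: Seq[Int]): Seq[Double] = {
    require(labels.nonEmpty, "a node must contain at least one label")
    val n = labels.size.toDouble
    labels.groupBy(identity).values.map(_.size / n).toSeq
  }

  // Gini impurity: 1 - sum(p_i^2). Zero when every label is the same.
  def gini(labels: Seq[Int]): Double =
    1.0 - proportions(labels).map(p => p * p).sum

  // Entropy: -sum(p_i * log2(p_i)). Zero when every label is the same.
  def entropy(labels: Seq[Int]): Double =
    -proportions(labels).map(p => p * math.log(p) / math.log(2)).sum
}
```

For a two-class node split evenly between signal and background, such as Seq(1, 1, 0, 0), both measures reach their two-class maximum (0.5 for Gini, 1.0 for entropy); for a node holding a single class, both are zero. A split is attractive when it produces child nodes with lower impurity than the parent.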