
Next model – tree ensembles

Random Forest (RF) and Gradient Boosted Machine (GBM) (also referred to as Gradient Boosted Trees) are two examples of ensemble tree-based models currently available in MLlib; you can think of an ensemble as an uber-model that represents a collection of base models. The best way to understand what an ensemble is doing behind the scenes is to consider a simple analogy:

"Suppose that you are the head coach of a famous soccer club and you have heard rumors of an incredible athlete from Brazil and it may be advantageous to sign this young athlete to your club before the other teams do; but your schedule is incredibly busy and instead, you send 10 of your assistant coaches to assess the player. Each one of your assistant coaches grades the player based on his/her coaching philosophy - maybe one coach wants to measure how fast the player can run 40 yards while another coach thinks height and arm-reach are important. Regardless of how each coach defines "athlete potential" you, as head coach, just want to know if you should sign the player to a contract now or wait. And so it goes that your coaches fly down to Brazil and each coach makes an assessment; upon arrival you go up to each of your coaches and ask "should we draft this player now or wait?" and, based on a simple rule like majority vote, you can make your decision. This an example of what an ensemble is doing behind the scenes with respect to a classification task."

You can think of each coach as a decision tree and, therefore, you will have an ensemble of 10 trees (for 10 coaches). How each coach assesses the player is highly specific, and the same holds true for our trees: for each of the 10 trees created, features are selected randomly at each node (hence the random in Random Forest; forest because there are many trees!). The reason for introducing this randomness across the base models is to prevent over-fitting the data. While RF and GBM are both tree-based ensembles, the manner in which they go about training is slightly different and deserves mention.

GBMs must be trained one tree at a time in order to minimize a loss function (for example, log-loss, squared error, and so on), and they usually take longer to train than an RF, which can build its trees in parallel.

However, when training a GBM, it is recommended to use shallow trees, which in turn lends itself to faster training of each individual tree.
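To make this training difference concrete, here is a minimal sketch using Spark's DataFrame-based ML API (org.apache.spark.ml), where RandomForestClassifier plays the role of the RF and GBTClassifier that of the GBM; the toy dataset and parameter values are purely illustrative and not taken from this chapter's examples:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.{GBTClassifier, RandomForestClassifier}
import org.apache.spark.ml.linalg.Vectors

object EnsembleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tree-ensembles-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy binary-classification data: (label, features) -- purely illustrative.
    val training = Seq(
      (0.0, Vectors.dense(1.0, 0.5)),
      (1.0, Vectors.dense(0.2, 3.1)),
      (0.0, Vectors.dense(1.2, 0.4)),
      (1.0, Vectors.dense(0.1, 2.8))
    ).toDF("label", "features")

    // Random Forest: independent trees (our "10 coaches") that can be grown in parallel.
    val rf = new RandomForestClassifier()
      .setNumTrees(10)
      .setMaxDepth(5)
    val rfModel = rf.fit(training)

    // GBM (GBTClassifier in Spark ML): trees are added one at a time to reduce a loss,
    // so training is sequential; shallow trees keep each iteration cheap.
    val gbt = new GBTClassifier()
      .setMaxIter(10)    // number of boosting iterations (trees)
      .setMaxDepth(3)    // shallow trees
      .setStepSize(0.1)  // learning rate
    val gbtModel = gbt.fit(training)

    println(s"RF trees: ${rfModel.trees.length}, GBT trees: ${gbtModel.trees.length}")
    spark.stop()
  }
}
```

The forest's trees are mutually independent, which is what allows them to be built in parallel, whereas each boosting iteration of the GBT model depends on the errors of the trees that came before it.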

  • RFs generally do not overfit the data as readily as a GBM; that is, we can add more trees to our forest and be less prone to over-fitting than if we added more trees (boosting iterations) to our GBM.
  • Hyper-parameter tuning for an RF is much simpler than for a GBM. In their paper, Influence of Hyperparameters on Random Forest Accuracy, Bernard et al. show via experimentation that the number K of random features to select at each node is a key influencer with respect to model accuracy (https://hal.archives-ouvertes.fr/hal-00436358/document). Conversely, a GBM has many more hyper-parameters that must be considered, such as the loss function, learning rate, number of iterations, and so on (contrasted in the sketch after this list).
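As a rough illustration of the difference in tuning surface, the following sketch (again assuming the org.apache.spark.ml classifiers; the parameter values are arbitrary) contrasts the single most influential RF knob with the set of GBM knobs mentioned above:

```scala
import org.apache.spark.ml.classification.{GBTClassifier, RandomForestClassifier}

// RF: the key knob is how many features (K) are sampled at each split.
val rf = new RandomForestClassifier()
  .setFeatureSubsetStrategy("sqrt")  // K ~ sqrt(total number of features) per node
  .setNumTrees(50)

// GBM: several interacting knobs usually have to be tuned together.
val gbt = new GBTClassifier()
  .setLossType("logistic")  // loss function for classification
  .setStepSize(0.1)         // learning rate
  .setMaxIter(100)          // number of boosting iterations
  .setMaxDepth(3)           // depth of each (shallow) tree
```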

As with most "which is better?" questions in data science, choosing between an RF and a GBM is open-ended and highly task- and dataset-dependent.
