官术网_书友最值得收藏!

Building the simplest predictive analytics algorithm

Predictive analytics can be very simple. We introduce a very simple example of a predictive model in the context of binary classification based on a simple threshold.

Imagine that a truck transporting small oranges and large grapefruits runs off the road; all the boxes of fruits open up, and all the fruits end up mixed together. Equipped with a simple weighing scale and a way to roll the fruits out of the truck, you want to be able to separate them automatically based on their weights. You have some information on the average weights of small oranges (96g) and large grapefruits (166g).

According to the USDA, the average weight of a medium-sized orange is 131 grams, while a larger orange weighs approximately 184 grams, and a smaller one around 96 grams. 

  • Large grapefruit (approx 4-1/2'' dia) 166g
  • Medium grapefruit (approx 4'' dia) 128g
  • Small grapefruit (approx 3-1/2'' dia) 100g

Your predictive model is the following:

  • You arbitrarily set a threshold of 130g
  • You weigh each fruit
  • If the fruit weighs more than 130g, it's a grapefruit; otherwise it's an orange

There! You have a robust reliable, predictive model that can be applied to all your mixed up fruits to separate them. Note that in this case, you've set the threshold with an educated guess. There was no machine learning involved.

In machine learning, the models learn by themselves. Instead of setting the threshold yourself, you let your program evolve and calculate the weight separation threshold of fruits by itself.

For that, you would set aside a certain number of oranges and grapefruits. This is called the training dataset. It's important that this training dataset has roughly the same number of oranges and grapefruits.

And you let the machine decide the threshold value by itself. A possible algorithm could be along these lines:

  1. Set original weight estimation at w_0 = 100g to initialize and a counter k = 0
  2. For each new fruit in the training dataset, adjust the weight estimation according to the following:
        For each new fruit_weight:
w(k+1) = (k*w(k) + fruit_weight)/ (k+1)
k = k+1

Assuming that your training dataset is representative of all the remaining fruits and that you have enough fruits, the threshold would converge under certain conditions to the best average between all the fruit weights. A value which you use to separate all the other fruits depending on whether they weight more or less than the threshold you estimated. The following plot shows the convergence of this crude algorithm to estimate the average weight of the fruits:

This problem is a typical binary classification model. If we had not two but three types of fruits (lemons, oranges, and grapefruit), we would have a multiclass classification problem.

In this example, we only have one predictor: the weight of the fruit. We could add another predictor such as the diameter. This would result in what is called a multivariate classification problem.

In practice, machine learning uses more complex algorithms such as the SGD, the linear algorithm used by Amazon ML. Other classic prediction algorithms include Support Vector Machines, Bayes classifiers, Random forests and so on. Each algorithm has its strength and set of assumptions on the dataset.

主站蜘蛛池模板: 祁阳县| 井研县| 沙湾县| 富锦市| 宿州市| 临武县| 北流市| 荣成市| 潍坊市| 邵阳县| 无为县| 阿勒泰市| 平罗县| 宁津县| 内丘县| 哈密市| 阳春市| 旌德县| 清镇市| 克什克腾旗| 六盘水市| 平潭县| 改则县| 金寨县| 盐边县| 正镶白旗| 扶沟县| 平果县| 泸州市| 革吉县| 外汇| 耿马| 通化市| 彭泽县| 珲春市| 焦作市| 凤冈县| 潞西市| 英吉沙县| 天津市| 景东|