- Effective Amazon Machine Learning
- Alexis Perrier
Building the simplest predictive analytics algorithm
Predictive analytics can be very simple. As an introduction, we build a simple predictive model for binary classification based on a single threshold.
Imagine that a truck transporting small oranges and large grapefruits runs off the road; all the boxes of fruits open up, and all the fruits end up mixed together. Equipped with a simple weighing scale and a way to roll the fruits out of the truck, you want to be able to separate them automatically based on their weights. You have some information on the average weights of small oranges (96g) and large grapefruits (166g).
According to the USDA, average orange weights are:
- Large orange 184g
- Medium orange 131g
- Small orange 96g
- Large grapefruit (approx 4-1/2'' dia) 166g
- Medium grapefruit (approx 4'' dia) 128g
- Small grapefruit (approx 3-1/2'' dia) 100g
Your predictive model is the following:
- You arbitrarily set a threshold of 130g
- You weigh each fruit
- If the fruit weighs more than 130g, it's a grapefruit; otherwise it's an orange
There! You have a robust, reliable predictive model that can be applied to all your mixed-up fruits to separate them. Note that in this case you set the threshold yourself with an educated guess; no machine learning was involved.
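The three-step rule above can be sketched as a tiny function (the 130g threshold is the educated guess from the text):

```python
def classify(weight_g, threshold_g=130.0):
    """Label a fruit by its weight: above the threshold it's a grapefruit."""
    return "grapefruit" if weight_g > threshold_g else "orange"

print(classify(96))   # a small orange -> "orange"
print(classify(166))  # a large grapefruit -> "grapefruit"
```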
In machine learning, models learn from data. Instead of setting the threshold yourself, you let your program estimate the weight-separation threshold on its own.
For that, you would set aside a certain number of oranges and grapefruits. This is called the training dataset. It's important that this training dataset has roughly the same number of oranges and grapefruits.
And you let the machine decide the threshold value by itself. A possible algorithm could be along these lines:
- Initialize the weight estimate at w_0 = 100g and a counter at k = 0
- For each fruit_weight in the training dataset, update the estimate and the counter:
    w_{k+1} = (k * w_k + fruit_weight) / (k + 1)
    k = k + 1
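A minimal sketch of the running-average update above, using made-up training weights drawn around the text's averages (96g oranges, 166g grapefruits); the sample sizes and standard deviation are illustrative:

```python
import random

random.seed(0)
# Hypothetical balanced training set: equal numbers of oranges and grapefruits.
weights = [random.gauss(96, 10) for _ in range(50)] + \
          [random.gauss(166, 10) for _ in range(50)]
random.shuffle(weights)

w = 100.0  # initial estimate w_0
k = 0
for fruit_weight in weights:
    w = (k * w + fruit_weight) / (k + 1)  # running-mean update
    k += 1

print(round(w, 1))  # settles near the midpoint of the two averages (~131g)
```

Because the dataset is balanced, the running mean ends up roughly halfway between the two class averages, which is exactly what we want in a threshold.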
Assuming that your training dataset is representative of the remaining fruits and large enough, the estimate converges, under certain conditions, to the average of all the fruit weights; you then use this value as the threshold to separate the other fruits, depending on whether they weigh more or less than it. The following plot shows the convergence of this crude algorithm as it estimates the average fruit weight:

This problem is a typical binary classification problem. If we had not two but three types of fruits (lemons, oranges, and grapefruits), we would have a multiclass classification problem.
In this example, we only have one predictor: the weight of the fruit. We could add another predictor such as the diameter. This would result in what is called a multivariate classification problem.
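With two predictors, each fruit becomes a point (weight, diameter) and the single threshold becomes a boundary in the plane. One simple multivariate approach, assuming we know each class's average point, is to assign a fruit to the nearest class centroid; the centroid values below are illustrative:

```python
import math

# Illustrative class centroids: (weight in g, diameter in cm)
centroids = {
    "orange":     (96.0, 7.0),
    "grapefruit": (166.0, 11.0),
}

def classify(weight_g, diameter_cm):
    """Assign the fruit to the class whose centroid is nearest (Euclidean)."""
    return min(centroids,
               key=lambda c: math.dist((weight_g, diameter_cm), centroids[c]))

print(classify(100, 7.5))   # near the orange centroid -> "orange"
print(classify(150, 10.0))  # near the grapefruit centroid -> "grapefruit"
```

In practice the two features would be rescaled first, since grams and centimeters live on very different scales.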
In practice, machine learning uses more complex algorithms, such as stochastic gradient descent (SGD), the method Amazon ML uses to train its linear models. Other classic prediction algorithms include support vector machines, Bayes classifiers, random forests, and so on. Each algorithm has its strengths and its own set of assumptions about the dataset.
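As a rough illustration of the SGD idea (not Amazon ML's actual implementation), here is a tiny stochastic gradient descent loop fitting a logistic model to the one-feature fruit problem; the data, feature scaling, and learning rate are all made up:

```python
import math, random

random.seed(1)
# Made-up training data: (weight_g, label) with 1 = grapefruit, 0 = orange.
data = [(random.gauss(96, 10), 0) for _ in range(200)] + \
       [(random.gauss(166, 10), 1) for _ in range(200)]

a, b = 0.0, 0.0   # model: p(grapefruit) = sigmoid(a * x + b)
lr = 0.01         # learning rate (illustrative)

for _ in range(20):                          # a few passes over the data
    random.shuffle(data)
    for weight, y in data:
        x = (weight - 130) / 40              # crude feature scaling
        p = 1 / (1 + math.exp(-(a * x + b))) # predicted probability
        a -= lr * (p - y) * x                # gradient step for one example
        b -= lr * (p - y)

def predict(weight_g):
    x = (weight_g - 130) / 40
    return "grapefruit" if a * x + b > 0 else "orange"

print(predict(96), predict(166))
```

Each example nudges the parameters a small step in the direction that reduces its own error; over many passes the boundary settles between the two classes.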