
Studying machine learning models in practice

We have already seen a very simple example and used it to explain some basic concepts. We restricted ourselves to a very small dataset, both for clarity and to start our journey towards mastering machine learning with an easy task. In the next chapter, we are going to explore more complex models. Before that, there are some general considerations that we need to be aware of when working with machine learning models to solve real problems:

  • The amount of data is usually very large. In fact, a larger dataset helps to get a more accurate model and a more reliable prediction. Extremely large datasets, usually called big data, can present storage and manipulation challenges.
  • Real-world data is almost never clean and ready to use, so data cleansing is extremely important and takes a lot of time.
  • The number of features required to correctly represent a real-life problem is often large. The feature engineering techniques previously mentioned are impossible to perform by hand, so automatic methods must be devised and applied.
  • It is far more important to assess the predictive power of a combination of input features than the significance of each individual one. Some simple examples of how to select features are given in Chapter 5, Correlations and the Importance of Variables.
  • It is very unlikely that we will get a very good result with the first model that we apply. Testing and evaluating many different machine learning models implies repeating the same steps several times and usually requires automation as well.
  • The dataset should be large enough to use a percentage of the data for training purposes (usually 80%) and the rest for testing. Evaluating the accuracy of a model only on the training data is misleading. A model can be very accurate at explaining and predicting the training dataset, yet fail to generalize and deliver wrong results when presented with new, previously unseen data values.
  • Training and test data should be selected, usually at random, from the same full dataset. Trying to make a prediction based on input that lies far away from the training range is unlikely to give good results.
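
The last two points can be illustrated with a minimal Python sketch. It is not taken from the examples in this book: the synthetic data, the function names, and the choice of a one-nearest-neighbor model are all assumptions made for illustration. The model simply memorizes the training set, so its error on the training data is zero, while its error on the randomly held-out 20% is not.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def predict_1nn(train, x):
    """Return the target of the training row whose feature is closest to x."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]

def mean_abs_error(train, rows):
    """Average absolute difference between predicted and actual targets."""
    return sum(abs(predict_1nn(train, x) - y) for x, y in rows) / len(rows)

# Hypothetical noisy data: y = 2x plus Gaussian noise (illustration only).
noise = random.Random(42)
data = [(x, 2 * x + noise.gauss(0, 1)) for x in range(100)]

train, test = train_test_split(data)          # random 80/20 split
train_error = mean_abs_error(train, train)    # model sees its own training rows
test_error = mean_abs_error(train, test)      # model sees previously unseen rows

print(f"rows: {len(train)} train, {len(test)} test")
print(f"training error: {train_error:.2f}, test error: {test_error:.2f}")
```

Judged on the training data alone, this model looks perfect. Only the test error gives an honest estimate of how it behaves on new values.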

Supervised machine learning models are usually trained using a fraction of the input data and tested on the remaining part. The model can then be used to predict the outcome when fed with new and unknown feature values, as shown in the following diagram:

A typical supervised machine learning project includes the following steps:

  1. Obtaining the data and merging different data sources (there is more on this in Chapter 3, Importing Data into Excel from Different Data Sources)
  2. Cleansing the data (you can refer to Chapter 4, Data Cleansing and Preliminary Data Analysis)
  3. Preliminary analysis and feature engineering (you can refer to Chapter 5, Correlations and the Importance of Variables)
  4. Trying different models, and different parameters for each of them, training each one on a percentage of the full dataset and testing it on the rest
  5. Deploying the model so that it can be used in a continuous analysis flow and not only in small, isolated tests
  6. Predicting values for new input data
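
As a rough orientation, the steps above can be sketched in a few lines of Python. This is only an assumed toy workflow, not the procedure used in the following chapters: the two data sources, the two candidate models (a constant mean and a least-squares line), and all names are hypothetical, and deployment (step 5) is left out.

```python
import random

# Steps 1-2: merge two hypothetical sources and drop rows with missing values.
source_a = [(1.0, 2.1), (2.0, 3.9), (3.0, None), (4.0, 8.2)]
source_b = [(5.0, 9.8), (6.0, 12.1), (7.0, 14.0),
            (8.0, 15.9), (9.0, 18.1), (10.0, 20.2)]
rows = [(x, y) for x, y in source_a + source_b
        if x is not None and y is not None]

# Step 4a: random split, roughly 80% for training and the rest for testing.
rng = random.Random(0)
rng.shuffle(rows)
cut = int(len(rows) * 0.8)
train, test = rows[:cut], rows[cut:]

def fit_mean(train_rows):
    """Baseline model: always predict the mean target of the training rows."""
    mean_y = sum(y for _, y in train_rows) / len(train_rows)
    return lambda x: mean_y

def fit_line(train_rows):
    """Least-squares straight line through the training rows."""
    n = len(train_rows)
    mean_x = sum(x for x, _ in train_rows) / n
    mean_y = sum(y for _, y in train_rows) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in train_rows)
             / sum((x - mean_x) ** 2 for x, _ in train_rows))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def mean_abs_error(model, eval_rows):
    """Average absolute prediction error of a fitted model on eval_rows."""
    return sum(abs(model(x) - y) for x, y in eval_rows) / len(eval_rows)

# Step 4b: train every candidate on the training rows, score it on the test rows.
candidates = {"mean": fit_mean, "line": fit_line}
fitted = {name: fit(train) for name, fit in candidates.items()}
scores = {name: mean_abs_error(model, test) for name, model in fitted.items()}
best_name = min(scores, key=scores.get)

# Step 6: predict a value for a new input inside the training range.
new_value = fitted[best_name](5.5)
print(best_name, scores, round(new_value, 2))
```

The same loop over candidate models is what usually gets automated in practice, since the split, fit, and score steps are repeated for every model and parameter choice.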

This procedure will become clear in the examples shown in the next chapter.
