
Applied machine learning workflow

This book's emphasis is on applied machine learning. We want to provide you with the practical skills needed to get learning algorithms to work in different settings. Rather than dwelling on the math and theory of machine learning, we will spend more time on the practical, hands-on skills (and dirty tricks) needed to make these algorithms work well in an application. We will focus on supervised and unsupervised machine learning and cover the essential data science steps that make up the applied machine learning workflow.

A typical applied machine learning workflow consists of answering a series of questions, which can be summarized in the following steps:

  1. Data and problem definition: The first step is to ask interesting questions, such as: What is the problem you are trying to solve? Why is it important? Which format of result answers your question? Is this a simple yes/no answer? Do you need to pick one of the available options?
  2. Data collection: Once you have a problem to tackle, you will need the data. Ask yourself what kind of data will help you answer the question. Can you get the data from the available sources? Will you have to combine multiple sources? Do you have to generate the data? Are there any sampling biases? How much data will be required?
  3. Data preprocessing: The first data preprocessing task is data cleaning. Examples include filling in missing values, smoothing noisy data, removing outliers, and resolving inconsistencies. This is usually followed by integrating multiple data sources and transforming the data: scaling it to a specific range (normalization), grouping values into bins (discretization), and reducing the number of dimensions.
  4. Data analysis and modelling: Data analysis and modelling includes unsupervised and supervised machine learning, statistical inference, and prediction. A wide variety of machine learning algorithms are available, including k-nearest neighbors, Naive Bayes classifier, decision trees, Support Vector Machines (SVMs), logistic regression, k-means, and so on. The method to be deployed depends on the problem definition, as discussed in the first step, and the type of collected data. The final product of this step is a model inferred from the data.
  5. Evaluation: The last step is devoted to model assessment. The main issue facing models built with machine learning is how well they capture the underlying data. If a model is too specific, that is, it overfits the data used for training, it will quite possibly perform poorly on new data. Conversely, a model can be too generic, meaning that it underfits the training data. For example, when asked how the weather is in California, it always answers sunny, which is indeed correct most of the time. However, such a model is not really useful for making valid predictions. The goal of this step is to evaluate the model correctly and make sure it will work on new data as well. Evaluation methods include separate training and test sets, cross-validation, and leave-one-out cross-validation.
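To make steps 3 and 5 more concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not the method used later in the book: the temperature readings and weather labels are made up, and the helper names (min_max_normalize, equal_width_bins, k_fold_indices, majority_baseline_accuracy) are hypothetical. It shows normalization to a fixed range, discretization into equal-width bins, and k-fold cross-validation of a baseline that always predicts the most common label, much like the "always sunny" model described above.

```python
from collections import Counter


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1] (step 3: normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Map each value to a bin index in 0..n_bins-1 (step 3: discretization)."""
    normalized = min_max_normalize(values)
    # The maximum value would fall into bin n_bins, so clamp it to the last bin.
    return [min(int(x * n_bins), n_bins - 1) for x in normalized]


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test


def majority_baseline_accuracy(labels, k):
    """Mean cross-validated accuracy of a model that always predicts
    the most common label seen in the training folds (step 5)."""
    accuracies = []
    for train, test in k_fold_indices(len(labels), k):
        majority = Counter(labels[i] for i in train).most_common(1)[0][0]
        correct = sum(1 for i in test if labels[i] == majority)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


# Hypothetical data: 10 daily temperatures and weather labels (8 sunny, 2 rainy).
temperatures = [18.0, 21.0, 24.0, 30.0, 27.0, 19.5, 22.5, 25.5, 28.5, 20.0]
weather = ["sunny", "sunny", "sunny", "sunny", "rainy",
           "sunny", "sunny", "sunny", "sunny", "rainy"]

normalized = min_max_normalize(temperatures)
bins = equal_width_bins(temperatures, 3)
accuracy = majority_baseline_accuracy(weather, k=5)

print(normalized)
print(bins)
print(accuracy)
```

On this made-up data the baseline reaches a cross-validated accuracy of 0.8 without ever predicting a rainy day, which is why accuracy alone can make an underfitting model look better than it is.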

We will take a closer look at each of these steps in the following sections. We will try to understand the types of questions we must answer during the applied machine learning workflow, and look at the accompanying concepts of data analysis and evaluation.
