
ML project pipeline

Most of the available material on ML projects, whether books, blogs, or tutorials, explains the mechanics of machine learning in terms of splitting the available dataset into training, validation, and test datasets. Models are built on the training dataset and improved iteratively through hyperparameter tuning against the validation dataset. Once a model has been improved to an acceptable level, it is evaluated on unseen test data and the results are reported. Most public material ends at this point.
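The train, validation, and test workflow described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a recommended modelling approach: the toy data, the 60/20/20 split, the nearest-neighbour classifier, and the helper names `split_dataset`, `knn_predict`, and `accuracy` are all assumptions made for the example.

```python
import random

def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]
    return train, val, test

def knn_predict(train, x, k):
    """Predict a 0/1 label for x by majority vote among the k nearest training points."""
    neighbours = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if votes * 2 > k else 0

def accuracy(train, data, k):
    """Fraction of rows in data that the k-NN model built on train labels correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in data)
    return correct / len(data)

# Toy data (hypothetical): one feature, label is 1 when the feature exceeds 5.
rng = random.Random(42)
data = [(x, int(x > 5)) for x in (rng.uniform(0, 10) for _ in range(200))]

train, val, test = split_dataset(data)

# Hyperparameter tuning: choose k using the validation set only.
best_k = max([1, 3, 5, 7], key=lambda k: accuracy(train, val, k))

# Final, one-time evaluation on the untouched test set.
test_accuracy = accuracy(train, test, best_k)
print(f"best k = {best_k}, test accuracy = {test_accuracy:.2f}")
```

The test set is used only once, after `best_k` has been chosen, so the reported accuracy is not influenced by the tuning step.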

In reality, ML projects in a business setting go beyond this step. If one stops at testing and reporting the performance of a built model, the model serves no real purpose, because it is never used to make predictions on data that arrives in the future. The point of building a model is to deploy it in production and generate predictions on new data so that the business can take appropriate action.

In a nutshell, the model needs to be saved and reused. This also means that any new data on which predictions are made must be preprocessed in the same way as the training data, so that the new data has the same number and types of columns as the training data. This productionalization of models built in the lab is largely ignored when ML is taught. This section covers an end-to-end pipeline, from data preprocessing through building models in the lab to productionalizing them.
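One way to picture "saved and reused" is to store the preprocessing parameters alongside the model, so that new data passes through exactly the same transformation as the training data. The sketch below uses only the Python standard library and assumes a trivial linear scorer with made-up weights as a stand-in for a trained model; the column names, the JSON file layout, and the helpers `fit_preprocessor`, `transform`, and `predict` are hypothetical.

```python
import json
import os
import statistics
import tempfile

def fit_preprocessor(rows, columns):
    """Learn per-column mean and standard deviation from the training data."""
    return {
        "columns": columns,
        "mean": {c: statistics.mean([r[c] for r in rows]) for c in columns},
        # Fall back to 1.0 for a constant column to avoid division by zero.
        "std": {c: statistics.pstdev([r[c] for r in rows]) or 1.0 for c in columns},
    }

def transform(record, prep):
    """Apply the stored training-time preprocessing to one new record."""
    missing = [c for c in prep["columns"] if c not in record]
    if missing:
        raise ValueError(f"new data is missing columns: {missing}")
    # Same columns, same order, same scaling as during training.
    return [(float(record[c]) - prep["mean"][c]) / prep["std"][c]
            for c in prep["columns"]]

def predict(record, artifact):
    """Score one record with the saved preprocessor and model parameters."""
    features = transform(record, artifact["preprocessor"])
    return sum(w * x for w, x in zip(artifact["weights"], features)) + artifact["bias"]

# In the lab: fit preprocessing on training data and save it with the model.
training_rows = [
    {"age": 25, "income": 30000},
    {"age": 35, "income": 50000},
    {"age": 45, "income": 70000},
]
prep = fit_preprocessor(training_rows, ["age", "income"])
# Stand-in for a trained model: these weights are made up for illustration.
artifact = {"preprocessor": prep, "weights": [0.5, 1.5], "bias": 0.1}

path = os.path.join(tempfile.mkdtemp(), "model.json")
with open(path, "w") as f:
    json.dump(artifact, f)

# Later, in production: reload the artifact and score new data.
with open(path) as f:
    loaded = json.load(f)

new_record = {"income": 50000, "age": 35, "extra": "ignored"}
score = predict(new_record, loaded)
print(score)
```

Because the column list and scaling parameters travel with the model, a new record with missing columns fails loudly instead of being silently scored on mismatched features.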

An ML pipeline describes the entire process, from raw data acquisition to post-processing of the prediction results on unseen data so that they are available for the business to act on. A pipeline may be depicted at a generalized level or described at a very granular level. This section describes a generic pipeline that may be applied to any ML project. Figure 1.8 shows the various components of the ML project pipeline, otherwise known as the cross-industry standard process for data mining (CRISP-DM).
