General machine learning rule of thumb

The general machine learning rule of thumb is that the more data there is, the better the predictive model. Adding more features does not help in the same way, however: extra features often add noise, and performance can degrade drastically, especially if the dataset is high-dimensional. The learning process requires input data that can be split into three sets (or is already provided as such):

  • A training set is the knowledge base, coming from historical or live data, that is used to fit the parameters of the ML algorithm. During the training phase, the ML model uses the training set to find the optimal weights of the network by minimizing the training error as measured by the objective function. Here, backpropagation or another optimization algorithm is used to train the model, but all the hyperparameters must be set before the learning process starts.
  • A validation set is a set of examples used to tune the hyperparameters of an ML model. It is used to check that the model is trained well and generalizes rather than overfitting the training set. Some ML practitioners also refer to it as a development set or dev set.
  • A test set is used for evaluating the performance of the trained model on unseen data. This step is sometimes also referred to as model inferencing. Once the final model has been assessed on the test set (that is, when we're fully satisfied with the model's performance), we should not tune the model any further; the trained model can then be deployed in a production-ready environment.
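The three roles above can be sketched with a deliberately tiny model. This is a minimal illustration, not a method from this book: it assumes a one-parameter linear model y ≈ w * x, synthetic data, and an L2 penalty `lam` as the only hyperparameter. The names `fit_weight`, `mse`, and `candidates` are hypothetical.

```python
import random


def fit_weight(train, lam):
    """Fit w for y ~ w * x by minimizing squared training error plus lam * w**2."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + lam)  # closed-form minimizer of the penalized objective


def mse(w, examples):
    """Mean squared error of the model y = w * x on a set of (x, y) pairs."""
    return sum((y - w * x) ** 2 for x, y in examples) / len(examples)


# Synthetic data (an assumption for illustration): y = 3x plus Gaussian noise.
rng = random.Random(0)
points = []
for _ in range(100):
    x = rng.uniform(-5, 5)
    points.append((x, 3.0 * x + rng.gauss(0, 1)))

train, val, test = points[:70], points[70:80], points[80:]

# Hyperparameter candidates are fixed before training starts.
candidates = [0.0, 0.1, 1.0, 10.0]

# The training set fits the parameter w; the validation set picks the hyperparameter.
best_lam = min(candidates, key=lambda lam: mse(fit_weight(train, lam), val))
w = fit_weight(train, best_lam)

# The test set is touched once, only to report the final error on unseen data.
test_error = mse(w, test)
print(f"lam={best_lam}, w={w:.3f}, test MSE={test_error:.3f}")
```

The point of the sketch is the separation of concerns: the test examples never influence `w` or `best_lam`, so the reported test error is an estimate of performance on unseen data.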

A common practice is splitting the input data (after the necessary pre-processing and feature engineering) into 70% for training, 10% for validation, and 20% for testing, but the right proportions really depend on the use case. Sometimes, we also need to up-sample or down-sample the data, depending on the availability and quality of the datasets.
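A shuffled three-way split might look like the following sketch. The function name `split_dataset`, its parameters, and the default fractions are assumptions chosen for illustration; whatever is left after the training and validation shares goes to the test set.

```python
import random


def split_dataset(examples, train_frac=0.7, val_frac=0.1, seed=42):
    """Shuffle examples and split them into (train, validation, test) lists.

    The fractions are illustrative defaults; the test set receives the remainder.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for the test set")

    shuffled = list(examples)              # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


data = list(range(100))
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))
```

Shuffling before slicing matters when the raw data is ordered (for example, by date or by class). For imbalanced classes, a stratified split or the up-sampling and down-sampling mentioned above would be needed on top of this.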

How this rule of thumb is applied, and how the data is split, can differ across machine learning tasks, as we will cover in the next section. Before that, however, let's take a quick look at a few common phenomena in machine learning.