
Training data – development data – test data

This is one of the most important steps in building a model, and it raises a lot of debate about whether we really need all three sets (train, dev, and test) and, if so, how the data should be divided among them. Let's look at these concepts.

Once we have sufficient data to start modelling, the first thing to do is partition it into three segments: the training set, the development set, and the test set.
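As a minimal sketch of such a partition, the snippet below shuffles the data once and cuts it into three disjoint parts. The helper name `train_dev_test_split` and the 60/20/20 proportions are assumptions made for illustration; the right breakup depends on how much data is available.

```python
import numpy as np

def train_dev_test_split(X, y, dev_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the rows once, then cut them into train, dev, and test parts.

    The fractions are illustrative defaults, not a recommendation.
    """
    X, y = np.asarray(X), np.asarray(y)
    n = len(X)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)          # random order of row indices

    n_test = int(round(n * test_frac))
    n_dev = int(round(n * dev_frac))

    test_idx = idx[:n_test]
    dev_idx = idx[n_test:n_test + n_dev]
    train_idx = idx[n_test + n_dev:]  # everything that is left

    return ((X[train_idx], y[train_idx]),
            (X[dev_idx], y[dev_idx]),
            (X[test_idx], y[test_idx]))

# Toy data: 100 rows with one feature each.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

train, dev, test = train_dev_test_split(X, y)
print(len(train[0]), len(dev[0]), len(test[0]))  # 60 20 20
```

Shuffling before cutting matters when the rows are stored in some meaningful order (by date or by class, say), because otherwise the three sets would not resemble each other.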

Let's examine the goal of having these three sets:

  1. Training set: The training set is used to train the model. When we apply an algorithm, we fit its parameters to the training set. In a neural network, for example, this is where the weights are learned.

Let's say in one scenario that we are trying to fit polynomials of various degrees:

    • f(x) = a + bx → 1st degree polynomial
    • f(x) = a + bx + cx^2 → 2nd degree polynomial
    • f(x) = a + bx + cx^2 + dx^3 → 3rd degree polynomial

After fitting the models, we calculate the training error for each of them.

We cannot judge how good a model is from its training error alone. A model chosen that way is biased toward the data it was fitted on and might not perform well on unseen data. To counter that, we turn to the development set.
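The polynomial scenario can be sketched with NumPy as follows. The synthetic data (a noisy quadratic), the noise level, and the helper name `mse` are assumptions chosen for illustration. With a least-squares fit, the training error cannot increase as the degree goes up, which is one reason it is a poor guide for choosing between the models.

```python
import numpy as np

# Synthetic training data: a quadratic trend plus Gaussian noise (assumed).
rng = np.random.default_rng(0)
x_train = np.linspace(-1.0, 1.0, 30)
y_train = 1.0 + 2.0 * x_train - 1.5 * x_train**2 + rng.normal(0.0, 0.2, x_train.size)

def mse(coeffs, x, y):
    """Mean squared error of the polynomial `coeffs` on the points (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_errors = {}
for degree in (1, 2, 3):
    # Least-squares fit of a polynomial of the given degree.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_errors[degree] = mse(coeffs, x_train, y_train)
    print(f"degree {degree}: training MSE = {train_errors[degree]:.4f}")
```

Each higher-degree polynomial contains the lower-degree ones as special cases, so the most flexible model always looks at least as good on the data it was fitted to.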

  2. Development set: This is also called the holdout set or validation set. Its goal is to tune the model that we obtained from the training set, and it is also part of assessing how well the model performs. Based on that performance, we take steps such as adjusting the learning rate, reducing overfitting, and selecting the best model of the lot. Here again, we calculate the development set error for each model and see which one gives the least error. The model with the least error at this stage still needs tuning to minimize overfitting. Once we are convinced which model is best, we choose it and head toward the test set.
  3. Test set: The test set is primarily used to assess the selected model. At this stage, we calculate the model's accuracy, and if it does not deviate too much from the training and development accuracy, we send the model for deployment.
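Putting the three sets together, the workflow described above might look like the sketch below. It fits the three polynomial models on the training set, picks the degree with the lowest development set error, and only then evaluates that one model on the test set. The synthetic data, the 180/60/60 split, and the use of mean squared error in place of accuracy (this is a regression example) are all assumptions for illustration.

```python
import numpy as np

# Synthetic data: a noisy quadratic, drawn in random order (assumed).
rng = np.random.default_rng(42)
x = rng.uniform(-1.0, 1.0, 300)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0.0, 0.2, x.size)

# The points are already in random order, so plain slicing gives a random split.
x_train, y_train = x[:180], y[:180]
x_dev, y_dev = x[180:240], y[180:240]
x_test, y_test = x[240:], y[240:]

def mse(coeffs, x, y):
    """Mean squared error of the polynomial `coeffs` on the points (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Step 1: fit every candidate model on the training set only.
fits = {d: np.polyfit(x_train, y_train, d) for d in (1, 2, 3)}

# Step 2: compare the candidates on the development set.
dev_errors = {d: mse(c, x_dev, y_dev) for d, c in fits.items()}
best_degree = min(dev_errors, key=dev_errors.get)

# Step 3: touch the test set once, for the chosen model only.
train_error = mse(fits[best_degree], x_train, y_train)
test_error = mse(fits[best_degree], x_test, y_test)

print("dev MSE by degree:", {d: round(e, 4) for d, e in dev_errors.items()})
print("chosen degree:", best_degree)
print(f"train MSE = {train_error:.4f}, dev MSE = {dev_errors[best_degree]:.4f}, "
      f"test MSE = {test_error:.4f}")
```

If the three errors printed on the last line are close to one another, that is the situation in which the text suggests the model is ready for deployment; a test error far above the other two would point to overfitting.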