
Knowledge base/dataset

As we mentioned earlier, we need a historical base of data that will be used to teach the learning algorithm about the task that it's supposed to do later. But we also need another dataset for testing its ability to perform the task after the learning process. So to sum up, we need two types of datasets during the learning process:

  1. The first one is the knowledge base, where we have the input data along with its correct labels, such as the fish images and their corresponding labels (opah or tuna). This data will be fed to the learning algorithm so that it can learn from it and try to discover the patterns/trends that will help later on for classifying unlabeled images.
  2. The second one is mainly for testing the ability of the model to apply what it learned from the knowledge base to unlabeled images or unseen data in general, and for checking whether it's working well.

As you can see, all of the data we have at hand comes with the correct output associated with it, and that is what we will use as a knowledge base for our learning method. What we are missing is data without any correct output associated with it (the data that we are going to apply the model to), so we will need to create such a set ourselves.

While performing data science, we'll be doing the following:

  • Training phase: We present the data from our knowledge base and train our learning method/model by feeding it the input data along with its correct output.
  • Validation/test phase: In this phase, we measure how well the trained model is doing. We use different performance metrics depending on the type of model (the R-square score for regression, classification error for classifiers, recall and precision for IR models, and so on).
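
As a quick illustration of these metrics, here is a minimal sketch using scikit-learn (the library choice and the toy values are assumptions for illustration, not code from this chapter):

    import numpy as np
    from sklearn.metrics import r2_score, accuracy_score, precision_score, recall_score

    # Hypothetical model outputs versus the true answers.
    y_true_reg = np.array([3.0, 2.5, 4.1])   # regression targets
    y_pred_reg = np.array([2.8, 2.6, 3.9])   # model's predictions

    y_true_cls = np.array([1, 0, 1, 1, 0])   # class labels (say, 1 = opah, 0 = tuna)
    y_pred_cls = np.array([1, 0, 0, 1, 0])   # model's predicted labels

    print("R-square:", r2_score(y_true_reg, y_pred_reg))                    # regression
    print("Classification error:", 1 - accuracy_score(y_true_cls, y_pred_cls))
    print("Precision:", precision_score(y_true_cls, y_pred_cls))            # IR-style
    print("Recall:", recall_score(y_true_cls, y_pred_cls))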

The validation/test phase is usually split into two steps:

  1. In the first step, we use different learning methods/models and choose the best performing one based on our validation data (validation step)
  2. Then we measure and report the accuracy of the selected model based on the test set (test step)
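
The following sketch makes the two steps concrete (scikit-learn, the synthetic data, and the two candidate models are assumptions for illustration): the candidate models are fitted on the train set, the best performer on the validation set is selected, and only that one is scored on the test set:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    # A synthetic stand-in for a labeled knowledge base.
    X, y = make_classification(n_samples=1000, random_state=42)
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.7, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

    # Validation step: fit each candidate and keep the best performer on the validation set.
    candidates = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=42)]
    best_model = max(candidates, key=lambda m: m.fit(X_train, y_train).score(X_val, y_val))

    # Test step: measure and report the selected model's accuracy on the held-out test set.
    print("Test accuracy:", best_model.score(X_test, y_test))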

Now, let's see how we can get the data that we are going to apply the model to in order to check how well trained it is.

Since we don't have any samples without the correct output, we can create such a set from the original samples that we will be using. So, we can split our data samples into three different sets (as shown in Figure 1.9):

  • Train set: This will be used as a knowledge base for our model. Usually, this will be 70% of the original data samples.
  • Validation set: This will be used to choose the best performing model among a set of models. Usually, this will be 10% of the original data samples.
  • Test set: This will be used to measure and report the accuracy of the selected model. Usually, it will be as big as the validation set.
Figure 1.9: Splitting data into train, validation, and test sets
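
A minimal NumPy sketch of such a three-way split might look like the following (the synthetic data is an assumption, and the exact proportions are a judgment call; here the 30% left over after the train split is divided evenly, so that the test set ends up as big as the validation set):

    import numpy as np

    # Synthetic stand-in for the original data samples.
    rng = np.random.default_rng(42)
    data = rng.normal(size=(1000, 10))
    labels = rng.integers(0, 2, size=1000)

    # Shuffle once, then slice into train/validation/test sets.
    indices = rng.permutation(len(data))
    n_train = int(0.7 * len(data))        # 70% train
    n_val = (len(data) - n_train) // 2    # remainder split evenly

    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]

    X_train, y_train = data[train_idx], labels[train_idx]
    X_val, y_val = data[val_idx], labels[val_idx]
    X_test, y_test = data[test_idx], labels[test_idx]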

If you are using only one learning method, you can drop the validation set and re-split your data into train and test sets only. Usually, data scientists use 75/25 as percentages, or 70/30, as in the sketch below.
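
A minimal sketch of this two-way split with scikit-learn (again, the data here is a synthetic assumption):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.rand(1000, 10)         # stand-in feature matrix
    y = np.random.randint(0, 2, 1000)    # stand-in labels

    # 75/25 train/test split; pass test_size=0.3 for a 70/30 split instead.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)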
