Supervised learning

Supervised learning is the simplest and best-known machine learning task. It relies on a set of pre-defined examples in which the category that each input should belong to is already known. Figure 2 shows a typical workflow of supervised learning.

An actor (for example, an ML practitioner, data scientist, data engineer, ML engineer, and so on) performs extract, transform, load (ETL) and the necessary feature engineering (including feature extraction, selection, and so on) to obtain appropriate data with features and labels. The actor then does the following:

  1. Splits the data into training, validation (development), and test sets
  2. Uses the training set to train an ML model
  3. Uses the validation set to check the trained model for overfitting and to tune regularization
  4. Evaluates the model's performance on the test set (that is, unseen data)
  5. If the performance is not satisfactory, performs additional tuning based on hyperparameter optimization to get the best model
  6. Finally, deploys the best model in a production-ready environment
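The steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the workflow of any particular library: the synthetic one-feature dataset, the k-nearest-neighbours classifier, the candidate values of k, and all function names (make_data, split, knn_predict, accuracy) are hypothetical choices made for the example. Deployment (the last step) is left out.

```python
import random

def make_data(n, seed=0):
    """Synthetic two-class data (an assumption for this sketch).
    Each sample is (feature, label); class 0 is centred at 0.0, class 1 at 2.0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        feature = rng.gauss(2.0 * label, 1.0)
        data.append((feature, label))
    return data

def split(data, train=0.5, val=0.2, seed=0):
    """Step 1: shuffle, then cut into training, validation, and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def knn_predict(train_set, x, k):
    """Step 2: the 'model' is the stored training set; predict by majority vote
    among the k training samples whose feature is closest to x."""
    neighbours = sorted(train_set, key=lambda sample: abs(sample[0] - x))[:k]
    votes_for_one = sum(label for _, label in neighbours)
    return 1 if 2 * votes_for_one > k else 0

def accuracy(train_set, eval_set, k):
    """Fraction of eval_set samples that are classified correctly."""
    correct = sum(knn_predict(train_set, x, k) == y for x, y in eval_set)
    return correct / len(eval_set)

data = make_data(400)
train_set, val_set, test_set = split(data)

# Steps 3 and 5: pick the hyperparameter k that scores best on the validation set.
candidate_ks = [1, 3, 5, 7, 9]
best_k = max(candidate_ks, key=lambda k: accuracy(train_set, val_set, k))

# Step 4: report performance once, on the unseen test set.
test_acc = accuracy(train_set, test_set, best_k)
print("best k:", best_k, "test accuracy:", test_acc)
```

The test set is touched only once, after k has been chosen on the validation set, so the reported accuracy is not biased by the tuning step.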

Figure 2: Supervised learning in action

In the overall life cycle, there might be many actors involved (for example, a data engineer, data scientist, or ML engineer) to perform each step independently or collaboratively.

The supervised learning context includes classification and regression tasks; classification is used to predict which class a data point is part of (discrete value), while regression is used to predict continuous values. In other words, a classification task is used to predict the label of the class attribute, while a regression task is used to make a numeric prediction of the class attribute.
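To make the distinction concrete, the following sketch fits one regressor and one classifier on the same input feature. The toy data (hours studied, exam score, pass/fail label) and the two models (a least-squares line and a nearest-centroid classifier) are assumptions chosen for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (regression)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def fit_centroids(xs, labels):
    """Mean feature value per class (nearest-centroid classification)."""
    grouped = {}
    for x, label in zip(xs, labels):
        grouped.setdefault(label, []).append(x)
    return {label: sum(values) / len(values) for label, values in grouped.items()}

def classify(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

# Hypothetical toy data: hours studied, exam score, and pass/fail outcome.
hours = [1, 2, 3, 4, 5, 6]
scores = [54, 58, 62, 66, 70, 74]                          # continuous target
outcomes = ["fail", "fail", "fail", "pass", "pass", "pass"]  # discrete target

slope, intercept = fit_line(hours, scores)
centroids = fit_centroids(hours, outcomes)

new_hours = 4.5
predicted_score = slope * new_hours + intercept  # regression: a number
predicted_label = classify(centroids, new_hours) # classification: a class
print(predicted_score, predicted_label)
```

The regressor returns a value on a continuous scale, while the classifier can only return one of the labels it saw during training.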

In the context of supervised learning, unbalanced data refers to classification problems in which the classes have unequal numbers of instances. For example, in a classification task with only two classes, balanced data would mean that 50% of the pre-classified examples belong to each class.
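A quick way to check how balanced a labeled dataset is would be to count the labels. The sketch below is a minimal example; the helper names (class_distribution, imbalance_ratio) and the "spam"/"ham" labels are hypothetical.

```python
from collections import Counter

def class_distribution(labels):
    """Share of the dataset that belongs to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest.
    A value of 1.0 means the classes are perfectly balanced."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical two-class label column with a 60/40 split.
labels = ["spam"] * 60 + ["ham"] * 40

distribution = class_distribution(labels)
ratio = imbalance_ratio(labels)
balanced_ratio = imbalance_ratio(["spam"] * 50 + ["ham"] * 50)
print(distribution, ratio, balanced_ratio)
```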

If the input dataset is only slightly unbalanced (for example, 60% of the data points in one class and 40% in the other), the learning process requires the input dataset to be split randomly into three sets: 50% for the training set, 20% for the validation set, and the remaining 30% for the test set.
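The text describes a purely random split. As a variation that is not stated in the text, the sketch below shuffles and splits each class separately (a stratified split), so that the 60/40 class ratio is kept exactly in each of the three sets rather than only approximately. The function name stratified_split, the seed, and the toy samples are assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_split(samples, fractions=(0.5, 0.2, 0.3), seed=42):
    """Split (features, label) pairs into training, validation, and test sets,
    shuffling and cutting each class separately so class shares are preserved.
    The third fraction is implied: the test set receives whatever remains."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample in samples:
        by_class[sample[1]].append(sample)

    train, val, test = [], [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_train = round(len(members) * fractions[0])
        n_val = round(len(members) * fractions[1])
        train += members[:n_train]
        val += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]
    return train, val, test

def share_of(label, subset):
    """Fraction of the subset that carries the given label."""
    return sum(1 for _, y in subset if y == label) / len(subset)

# Hypothetical slightly unbalanced dataset: 60% class "A", 40% class "B".
samples = [(i, "A") for i in range(600)] + [(i, "B") for i in range(400)]

train, val, test = stratified_split(samples)
print(len(train), len(val), len(test))
print(share_of("A", train), share_of("A", val), share_of("A", test))
```

With a plain random split the class shares in each set would only be close to 60/40; for a slightly unbalanced dataset that is usually acceptable, which is why the simpler approach in the text works.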
