
Supervised learning algorithms

In a supervised scenario, the task of the model is to find the correct label of a sample, assuming the presence of a correctly labeled training set and the possibility of comparing the estimated value with the correct one. The term supervised is derived from the idea of an external teaching agent that provides precise and immediate feedback after each prediction. The model can use such feedback as a measure of the error and, consequently, perform the corrections needed to reduce it.

More formally, if we assume a data generating process pdata(x, y), the dataset is obtained as:

X = \{(\bar{x}_i, y_i)\}_{i=1}^{N}, \quad \text{with} \quad (\bar{x}_i, y_i) \sim p_{data}(\bar{x}, y)

As discussed in the previous section, all samples must be independent and identically distributed (IID) values uniformly sampled from the data generating process. In particular, all classes must represent the actual distribution (for example, if p(y=0) = 0.4 and p(y=1) = 0.6, the proportions should be 40% and 60%, respectively). However, in order to avoid biases, when the discrepancies between classes are not very large, a reasonable choice is a perfectly uniform sampling that has the same number of representatives for y=1, 2, ..., M.
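As a quick illustration, the following snippet is a minimal sketch of such a perfectly uniform sampling (the synthetic dataset, the label proportions, and the per-class budget of 300 samples are assumptions made only for this example):

import numpy as np

# Minimal sketch of uniform (balanced) class sampling; the synthetic dataset,
# the label proportions, and the per-class budget are assumptions.
rng = np.random.default_rng(1000)

X = rng.normal(size=(1000, 2))
Y = (rng.uniform(size=1000) < 0.6).astype(int)   # p(y=0)=0.4, p(y=1)=0.6

n_per_class = 300
balanced_idx = np.concatenate([
    rng.choice(np.where(Y == c)[0], size=n_per_class, replace=False)
    for c in np.unique(Y)
])

X_b, Y_b = X[balanced_idx], Y[balanced_idx]
print(np.bincount(Y_b))   # -> [300 300]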

A generic classifier can be modeled in two ways:

  • A parametrized function that outputs the predicted class
  • A parametrized probability distribution that outputs the class probability for every input sample

In the first case, we have:

\tilde{y} = f(\bar{x}; \bar{\theta})

Considering the whole dataset X, it's possible to compute a global cost function L:

L(\bar{\theta}) = \sum_{i=1}^{N} \ell\left(f(\bar{x}_i; \bar{\theta}), y_i\right)

As L depends only on the parameter vector (xi and yi are constants), a generic algorithm must find the optimal parameter vector that minimizes the cost function. For example, in a regression problem (where the labels are continuous), the error measure can be the squared error between the actual value and the prediction:

L(\bar{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - f(\bar{x}_i; \bar{\theta})\right)^2
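As a minimal sketch (the synthetic dataset and the initial parameter vector are assumptions introduced only for illustration), the squared-error cost of a simple linear model can be computed as follows:

import numpy as np

# Minimal sketch of the squared-error cost for a linear model; the synthetic
# dataset and the initial parameter vector are assumptions.
rng = np.random.default_rng(1000)

X = rng.uniform(-2.0, 2.0, size=100)
Y = 1.5 * X + 0.75 + rng.normal(0.0, 0.1, size=100)

theta = np.array([1.0, 0.0])   # [slope, intercept]

def predict(x, theta):
    return theta[0] * x + theta[1]

def cost(theta, X, Y):
    # Average squared error between the actual values and the predictions
    return np.mean((Y - predict(X, theta)) ** 2)

print(cost(theta, X, Y))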

Such a cost function can be optimized in different ways (peculiar to specific algorithms), but a very common strategy (above all, in deep learning) is to employ the Stochastic Gradient Descent (SGD) algorithm. It consists of iterating the following two steps (a minimal sketch in code follows the list):

  • Computing the gradient ∇L (with respect to the parameter vector) with a small batch of samples xi ∈ X
  • Updating the weights and moving the parameters in the opposite direction of the gradient, -∇L (remember that the gradient is always directed towards a maximum)
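The following is a minimal sketch of this procedure applied to the squared-error cost of a linear model (the learning rate, batch size, and number of epochs are arbitrary assumptions):

import numpy as np

# Minimal SGD sketch for the squared-error cost of a linear model; the
# learning rate, batch size, and number of epochs are arbitrary assumptions.
rng = np.random.default_rng(1000)

X = rng.uniform(-2.0, 2.0, size=200)
Y = 1.5 * X + 0.75 + rng.normal(0.0, 0.1, size=200)

theta = np.zeros(2)    # [slope, intercept]
eta = 0.05             # learning rate
batch_size = 16

for epoch in range(200):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch], Y[batch]
        error = (theta[0] * xb + theta[1]) - yb
        # Gradient of the cost with respect to [slope, intercept]
        grad = np.array([np.mean(2.0 * error * xb), np.mean(2.0 * error)])
        # Move the parameters in the opposite direction of the gradient
        theta -= eta * grad

print(theta)    # close to [1.5, 0.75]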

Instead, when the classifier is probabilistic, it should be represented as a parametrized conditional probability distribution:

p(y \mid \bar{x}; \bar{\theta})

In other words, the classifier will now output the probability of a label y given an input vector. The goal is now to find the optimal parameter set that obtains:

p(y \mid \bar{x}; \bar{\theta}) \approx p_{data}(y \mid \bar{x})

In the preceding formula, we have expressed pdata as a conditional distribution. The optimization can be obtained using a probabilistic distance metric, such as the Kullback-Leibler divergence DKL (which is always non-negative, DKL ≥ 0, with DKL = 0 only when the two distributions are identical):

D_{KL}(p_{data} \,\|\, p) = E_{(\bar{x}, y) \sim p_{data}}\left[\log \frac{p_{data}(y \mid \bar{x})}{p(y \mid \bar{x}; \bar{\theta})}\right]

With a few simple manipulations, we obtain:

D_{KL}(p_{data} \,\|\, p) = E_{p_{data}}\left[\log p_{data}(y \mid \bar{x})\right] - E_{p_{data}}\left[\log p(y \mid \bar{x}; \bar{\theta})\right] = -H(p_{data}) + H(p_{data}, p)

Therefore, the resulting cost function corresponds to the cross-entropy between p and pdata, up to a constant value (the entropy of the data generating process, which doesn't depend on the parameters). The training strategy is now based on representing the labels using one-hot encoding (for example, if there are two labels, 0 → (0, 1) and 1 → (1, 0), so that the sum of all elements is always equal to 1) and on using an intrinsic probabilistic output (such as in a logistic regression) or a softmax filter, which transforms M values into a probability distribution.
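The following is a minimal sketch of a softmax filter and of the cross-entropy computed against a one-hot encoded label (the raw scores and the label are assumptions chosen only for illustration):

import numpy as np

# Minimal sketch of the softmax filter and of the cross-entropy against a
# one-hot encoded label; the raw scores and the label are assumptions.
def softmax(z):
    e = np.exp(z - np.max(z))   # shifted for numerical stability
    return e / np.sum(e)

scores = np.array([2.0, 0.5, -1.0])   # M = 3 raw model outputs
p = softmax(scores)                   # sums to 1

y_true = np.array([1.0, 0.0, 0.0])    # one-hot encoding of the correct class

cross_entropy = -np.sum(y_true * np.log(p))
print(p, cross_entropy)

Since a one-hot label has zero entropy, minimizing this cross-entropy coincides with minimizing the Kullback-Leibler divergence discussed above.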

In both cases, it's clear that the presence of a hidden teacher provides a consistent measure of the error, allowing the model to correct the parameters accordingly. In particular, the second approach is extremely helpful for our purposes, so I recommend studying it further if it's not well known (all the main definitions can also be found in Bonaccorso G., Machine Learning Algorithms, Second Edition, Packt, 2018).

We can now discuss a very basic example of supervised learning, a linear regression model that can be employed to predict the evolution of a simple time series. 
