Semi-supervised learning algorithms

A semi-supervised scenario can be regarded as a standard supervised one that borrows some techniques from unsupervised learning. A very common problem arises when large unlabeled datasets are easy to obtain but the cost of labeling is very high. In such cases, it's reasonable to label only a fraction of the samples and to propagate the labels to all unlabeled ones whose distance from a labeled sample is below a predefined threshold. If the dataset has been drawn from a single data generating process and the labeled samples are uniformly distributed, a semi-supervised algorithm can achieve an accuracy comparable with a supervised one. These algorithms are not discussed in detail in this book; however, it's helpful to briefly introduce two very important models:

  • Label propagation
  • Semi-supervised Support Vector Machines

The first model, label propagation, aims to propagate the labels of a few samples to a larger population. This goal is achieved by considering a graph where each vertex represents a sample and every edge is weighted using a distance function. Through an iterative procedure, all labeled samples send a fraction of their label values to their neighbors, and the process is repeated until the labels stop changing. This system has a stable point (that is, a configuration that cannot evolve anymore), and the algorithm can usually reach it in a limited number of iterations.
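The iterative procedure described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a reference implementation: the names `rbf_weight` and `propagate_labels`, the Gaussian edge weight, the `gamma` parameter, the use of `-1` to mark unlabeled samples, and the toy dataset are all assumptions introduced here. Labeled samples are kept fixed ("clamped") at every step, and the loop stops when the label distributions no longer change by more than a small tolerance.

```python
import math


def rbf_weight(a, b, gamma=1.0):
    """Edge weight that decays with the squared Euclidean distance."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


def propagate_labels(X, y, n_classes, gamma=1.0, max_iter=1000, tol=1e-6):
    """Return one class index per sample; y uses -1 for unlabeled samples."""
    n = len(X)
    weights = [[rbf_weight(X[i], X[j], gamma) for j in range(n)] for i in range(n)]
    # Row-normalize so each sample receives a weighted average of its neighbors.
    T = [[w / sum(row) for w in row] for row in weights]

    # Labeled samples start one-hot; unlabeled ones start uniform.
    F = [
        [1.0 if c == y[i] else 0.0 for c in range(n_classes)]
        if y[i] >= 0
        else [1.0 / n_classes] * n_classes
        for i in range(n)
    ]

    for _ in range(max_iter):
        new_F = []
        for i in range(n):
            if y[i] >= 0:
                new_F.append(F[i])  # clamp the known labels
            else:
                new_F.append(
                    [sum(T[i][j] * F[j][c] for j in range(n)) for c in range(n_classes)]
                )
        change = max(abs(a - b) for new, old in zip(new_F, F) for a, b in zip(new, old))
        F = new_F
        if change < tol:  # the labels have stopped changing
            break

    # Assign each sample to its most probable class.
    return [max(range(n_classes), key=lambda c: F[i][c]) for i in range(n)]


# Toy dataset (hypothetical): two well-separated groups, one labeled sample each.
X = [[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]]
y = [0, -1, -1, 1, -1, -1]
predicted = propagate_labels(X, y, n_classes=2)
print(predicted)  # [0, 0, 0, 1, 1, 1]
```

On real data, a library implementation is normally preferable to this dense, quadratic-time sketch; the point here is only to show the clamping and the repeated weighted averaging.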

Label propagation is extremely helpful in all those contexts where some samples can be labeled according to a similarity measure. For example, an online store could have a large base of customers, but only 10% have disclosed their gender. If the feature vectors are rich enough to represent the common behavior of male and female users, it's possible to employ the label propagation algorithm to guess the gender of customers who haven't disclosed it. Of course, it's important to remember that all the assignments are based on the assumption that similar samples have the same label. This can be true in many situations, but it can also be misleading when the complexity of the feature vectors increases.

Another important family of semi-supervised algorithms is based on the extension of the standard Support Vector Machine (SVM) to datasets containing unlabeled samples. In this case, we don't want to propagate existing labels, but rather the classification criterion. In other words, we want to train the classifier using the labeled dataset and extend the discriminative rule to the unlabeled samples as well.

Contrary to the standard procedure, which can only evaluate unlabeled samples, a semi-supervised SVM uses them to correct the separating hyperplane. The assumption is again based on similarity: if A has label 1 and the unlabeled sample B has d(A, B) < ε (where ε is a predefined threshold), it's reasonable to assume that the label of B is also 1. In this way, the classifier can achieve high accuracy on the whole dataset even if only a subset has been manually labeled. Similar to label propagation, these kinds of models are reliable only when the structure of the dataset is not extremely complex and, in particular, when the similarity assumption holds. Unfortunately, there are cases where it's extremely difficult to find a suitable distance metric, so samples that appear similar are actually dissimilar, and vice versa.
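The similarity assumption d(A, B) < ε can be made concrete with a short sketch. Note that this is not a semi-supervised SVM: it only illustrates the thresholding rule that such models rely on. The function name `pseudo_label`, the Euclidean distance, the value of `epsilon`, and the sample points are assumptions chosen for illustration. An unlabeled sample receives the label of its nearest labeled sample only when that sample is closer than ε; otherwise it stays unlabeled (`None`).

```python
import math


def pseudo_label(labeled, unlabeled, epsilon):
    """Assign each unlabeled point the label of its nearest labeled point,
    but only if that point lies within distance epsilon; otherwise None.

    labeled: list of (point, label) pairs; unlabeled: list of points.
    """
    assigned = []
    for b in unlabeled:
        nearest_point, nearest_label = min(
            labeled, key=lambda pair: math.dist(pair[0], b)
        )
        if math.dist(nearest_point, b) < epsilon:
            assigned.append(nearest_label)
        else:
            assigned.append(None)  # too far from any labeled sample
    return assigned


# Hypothetical data: two labeled samples and three unlabeled ones.
labeled = [((0.0, 0.0), 1), ((4.0, 4.0), 0)]
unlabeled = [(0.1, 0.2), (3.9, 4.1), (2.0, 2.0)]
assigned = pseudo_label(labeled, unlabeled, epsilon=0.5)
print(assigned)  # [1, 0, None]
```

The third point shows the limitation mentioned above: when no labeled sample is close enough under the chosen metric, the rule gives no information, and a poorly chosen metric can just as easily produce confident but wrong assignments.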
