
Label propagation in Scikit-Learn

Scikit-Learn implements a slightly different algorithm, proposed by Zhu and Ghahramani (in the aforementioned paper), where the affinity matrix W can be computed with both methods (KNN and RBF), but it is then normalized to become a probability transition matrix:

P = D^-1 W

where D is the diagonal degree matrix whose element Dii is the sum of the i-th row of W, so that each row of P sums up to 1.

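As a quick illustration, the row normalization that turns an affinity matrix W into a transition matrix P can be sketched with NumPy (the matrix values below are invented for the example):

```python
import numpy as np

# Toy affinity matrix (values are illustrative; in practice W comes
# from a KNN or RBF kernel computation)
W = np.array([[1.0, 0.5, 0.1],
              [0.5, 1.0, 0.2],
              [0.1, 0.2, 1.0]])

# P = D^-1 W: divide each row by its sum, so that every row of P
# becomes a valid probability distribution
P = W / W.sum(axis=1, keepdims=True)

print(P.sum(axis=1))  # each row now sums (numerically) to 1
```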
The algorithm operates like a Markov random walk, with the following sequence (assuming that there are Q different labels):

  1. Define a matrix YM = [P(label=y0), P(label=y1), ..., P(label=yQ-1)], where P(label=yi) is the probability of the label yi, and each row is normalized so that all the elements sum up to 1
  2. Define Y(0) = YM
  3. Iterate the following steps until convergence:
     - Y(t+1) = P Y(t)
     - Renormalize each row of Y(t+1) so that its elements sum up to 1
     - Reset the rows of Y(t+1) corresponding to labeled samples to their original label distributions

The first update performs a label propagation step. As we're working with probabilities, it's necessary (second step) to renormalize the rows so that their elements sum up to 1. The last update resets the original labels for all labeled samples. In this case, it means imposing P(label=yi) = 1 for the true label, and setting all the others to zero. The proof of convergence is very similar to the one for the label propagation algorithm, and can be found in Learning from Labeled and Unlabeled Data with Label Propagation, Zhu X., Ghahramani Z., CMU-CALD-02-107. The most important result is that the solution can also be obtained in closed form (without any iteration) through this formula:

Yu = (I - Puu)^-1 Pul Yl

The first term, (I - Puu)^-1, is the sum of a generalized geometric series, where Puu is the unlabeled-unlabeled block of the transition matrix P. Pul, instead, is the unlabeled-labeled block of the same matrix.
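The closed-form solution described above can be sketched with NumPy on a small synthetic dataset (the dataset sizes, the gamma value, and the variable names below are illustrative assumptions, not part of Scikit-Learn's implementation):

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy dataset: 20 two-dimensional points, of which only the first 6 are labeled
X = rng.randn(20, 2)
Y = -np.ones(20, dtype=int)
Y[:3] = 0
Y[3:6] = 1

# RBF affinity matrix W and its row-normalized transition matrix P
gamma = 1.0
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
W = np.exp(-gamma * sq_dists)
P = W / W.sum(axis=1, keepdims=True)

labeled = Y != -1
unlabeled = ~labeled

# Blocks of P used by the closed-form solution
Puu = P[np.ix_(unlabeled, unlabeled)]
Pul = P[np.ix_(unlabeled, labeled)]

# One-hot label matrix for the labeled samples
Yl = np.eye(2)[Y[labeled]]

# Yu = (I - Puu)^-1 Pul Yl, computed with a linear solve
# instead of an explicit matrix inversion
Yu = np.linalg.solve(np.eye(Puu.shape[0]) - Puu, Pul @ Yl)

# Each row of Yu is a probability distribution over the two labels,
# so the predicted label is the argmax of each row
predictions = np.argmax(Yu, axis=1)
```

Since the rows of P sum up to 1, the rows of Yu also sum up to 1, so no final renormalization is needed in the closed-form case.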

For our Python example, we need to build the dataset differently, because Scikit-Learn considers a sample unlabeled if y=-1:

from sklearn.datasets import make_classification

nb_samples = 1000
nb_unlabeled = 750

X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0, random_state=100)
# Mark the last nb_unlabeled samples as unlabeled (y = -1)
Y[nb_samples - nb_unlabeled:nb_samples] = -1

We can now train a LabelPropagation instance with an RBF kernel and gamma=10.0:

from sklearn.semi_supervised import LabelPropagation

lp = LabelPropagation(kernel='rbf', gamma=10.0)
lp.fit(X, Y)

Y_final = lp.predict(X)
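Beyond predict, the fitted LabelPropagation instance also exposes the transduced labels (transduction_) and the per-class probabilities (predict_proba). A possible way to inspect the result is shown below; the accuracy computation assumes we keep a copy of the ground-truth labels before removing them, which is of course only feasible in a synthetic experiment:

```python
import numpy as np

from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelPropagation

nb_samples = 1000
nb_unlabeled = 750

X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0, random_state=100)

# Keep a copy of the ground truth before removing the labels
Y_true = Y.copy()
Y[nb_samples - nb_unlabeled:nb_samples] = -1

lp = LabelPropagation(kernel='rbf', gamma=10.0)
lp.fit(X, Y)

# transduction_ contains the labels assigned to the training samples,
# while predict_proba returns the per-class probability distributions
Y_final = lp.transduction_
probabilities = lp.predict_proba(X)

# Fraction of originally unlabeled samples whose propagated label
# matches the ground truth
accuracy = np.mean(Y_final[nb_samples - nb_unlabeled:] == Y_true[nb_samples - nb_unlabeled:])
```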

The result is shown in the following double plot:

Original dataset (left). Dataset after a Scikit-Learn label propagation (right)

As expected, the propagation converged to a solution that respects both the smoothness and the clustering assumptions.
