
Example of label spreading

We can test this algorithm using the Scikit-Learn implementation. Let's start by creating a very dense dataset:

from sklearn.datasets import make_classification

nb_samples = 5000
nb_unlabeled = 1000

X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0, random_state=100)
Y[nb_samples - nb_unlabeled:nb_samples] = -1

We can train a LabelSpreading instance with a clamping factor alpha=0.2: we want to preserve 80% of the original labels while, at the same time, obtaining a smooth solution:

from sklearn.semi_supervised import LabelSpreading

ls = LabelSpreading(kernel='rbf', gamma=10.0, alpha=0.2)
ls.fit(X, Y)

Y_final = ls.predict(X)

The result is shown, as usual, together with the original dataset:

Original dataset (left). Dataset after a complete label spreading (right)

As it's possible to see in the first figure (left), in the central part of the cluster (x ∈ [-1, 0]), there's an area of circle dots. Using hard clamping, this region would remain unchanged, violating both the smoothness and clustering assumptions. Setting α > 0 makes it possible to avoid this problem. Of course, the choice of α is strictly correlated with each single problem. If we know that the original labels are absolutely correct, allowing the algorithm to change them can be counterproductive. In this case, for example, it would be better to preprocess the dataset, filtering out all the samples that violate the semi-supervised assumptions. If, instead, we are not sure that all samples are drawn from the same p_data, and spurious elements may be present, a higher α value can smooth the dataset without any further operation.
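Since α > 0 allows the algorithm to overwrite some of the original labels, it can be useful to quantify that effect directly. The following sketch (a hypothetical check, not part of the original example, using a smaller dataset for speed) counts what fraction of the originally labeled samples were reassigned after spreading:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelSpreading

# Smaller dataset than in the text, purely to keep the check fast
nb_samples = 1000
nb_unlabeled = 200

X, Y = make_classification(n_samples=nb_samples, n_features=2,
                           n_informative=2, n_redundant=0,
                           random_state=100)
Y[-nb_unlabeled:] = -1  # mark the last samples as unlabeled

ls = LabelSpreading(kernel='rbf', gamma=10.0, alpha=0.2)
ls.fit(X, Y)
Y_final = ls.predict(X)

# Fraction of originally labeled samples whose label was rewritten:
# with soft clamping (alpha=0.2), this can be non-zero
labeled = Y != -1
changed = np.mean(Y_final[labeled] != Y[labeled])
print(f"Relabeled fraction among labeled samples: {changed:.3f}")
```

Even a small non-zero fraction confirms that soft clamping is active; with α = 0 (pure hard clamping in label propagation) this fraction would be exactly zero.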
