
Homogeneity score

The homogeneity score is complementary to the completeness score and is based on the assumption that a cluster must contain only samples having the same true label. It is defined as:

h = 1 − H(Ytrue|Ypred) / H(Ytrue)

Analogously to the completeness score, when H(Ytrue|Ypred) → H(Ytrue), the assignments have no impact on the conditional entropy. Hence, the uncertainty is not reduced after the clustering (for example, every cluster contains samples belonging to all classes) and h → 0. Conversely, when H(Ytrue|Ypred) → 0, h → 1, because knowledge of the predictions has reduced the uncertainty about the true assignments and the clusters contain almost exclusively samples with the same label. It's important to remember that this score alone is not enough, because it doesn't guarantee that a single cluster contains all samples xi ∈ X with the same true label (a class can be split across several perfectly homogeneous clusters). That's why the homogeneity score is always evaluated together with the completeness score.
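To make the two limiting cases concrete, the following is a minimal sketch that computes h directly from the entropies, using only the Python standard library. The helper names (entropy, conditional_entropy, homogeneity) and the toy label lists are illustrative assumptions, not part of the original example, and returning 1.0 when H(Ytrue) = 0 is a convention chosen here for the degenerate single-class case:

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(y_true, y_pred):
    """H(Ytrue|Ypred): entropy of the true labels inside each cluster,
    weighted by the relative size of the cluster."""
    n = len(y_true)
    clusters = {}
    for t, p in zip(y_true, y_pred):
        clusters.setdefault(p, []).append(t)
    return sum((len(m) / n) * entropy(m) for m in clusters.values())


def homogeneity(y_true, y_pred):
    """h = 1 - H(Ytrue|Ypred) / H(Ytrue)."""
    h_true = entropy(y_true)
    if h_true == 0.0:
        # Assumed convention: with a single true class, every cluster is pure
        return 1.0
    return 1.0 - conditional_entropy(y_true, y_pred) / h_true


# Hypothetical toy labels (not taken from the dataset used in this section)
y_true = [0, 0, 1, 1]
y_pure = [0, 1, 2, 2]    # every cluster is pure, but class 0 is split in two
y_mixed = [0, 1, 0, 1]   # every cluster contains both classes

print('Pure clusters:  h = {:.3f}'.format(homogeneity(y_true, y_pure)))
print('Mixed clusters: h = {:.3f}'.format(homogeneity(y_true, y_mixed)))
```

Up to floating-point rounding, the first value should be 1 and the second 0. Note that y_pure reaches h = 1 even though class 0 is spread over two clusters, which is exactly the situation that the completeness score is meant to detect. On the same inputs, homogeneity_score from sklearn.metrics should return matching values.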

For the Breast Cancer Wisconsin dataset and K=2, we obtain the following:

from sklearn.metrics import homogeneity_score

print('Homogeneity: {}'.format(homogeneity_score(kmdff['diagnosis'], kmdff['prediction'])))

The corresponding output is as follows:

Homogeneity: 0.42229071246999117

This value (in particular, for K=2) confirms our initial analysis. At least one cluster (the one containing the majority of benign samples) is not completely homogeneous, because it contains samples belonging to both classes. However, as the value is not very close to 0, the assignments are at least partially correct. Considering both values, h and c, we can deduce that K-means is not performing extremely well (probably because of non-convexity), but, at the same time, it is able to correctly separate all those samples whose nearest cluster distance is above a specific threshold. Since the ground truth is available, we cannot readily accept the K-means result, and we should look for another algorithm that is able to yield both h and c → 1.
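As a rough way to inspect h and c side by side, the following sketch evaluates both scores for a few values of K. It assumes the copy of the dataset bundled with scikit-learn (load_breast_cancer) rather than the kmdff DataFrame used above, and the values of K, n_init, and random_state are arbitrary choices made for illustration, so the printed numbers are not expected to match the output shown earlier:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import completeness_score, homogeneity_score

# Assumption: scikit-learn's bundled copy of the dataset, used without any
# preprocessing; this is not the kmdff DataFrame built earlier in the text
X, y = load_breast_cancer(return_X_y=True)

scores = {}
for k in (2, 4, 8):
    # n_init and random_state are illustrative choices
    km = KMeans(n_clusters=k, n_init=10, random_state=1000)
    y_pred = km.fit_predict(X)

    h = homogeneity_score(y, y_pred)
    c = completeness_score(y, y_pred)
    scores[k] = (h, c)

    print('K={}: homogeneity={:.3f}, completeness={:.3f}'.format(k, h, c))
```

Increasing K tends to produce smaller and purer clusters, which typically helps h while penalizing c, so a solution is only convincing when both scores are high at the same time.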
