
Completeness score

This measure (together with all the other ones discussed from now on) is based on knowledge of the ground truth. Before introducing the index, it's helpful to define some common values. If we denote with Ytrue the set containing the true assignments and with Ypred the set of predictions (both containing M values and K clusters), we can estimate the following probabilities:

$$p_{true}(k) = \frac{n_{true}(k)}{M} \qquad p_{pred}(k) = \frac{n_{pred}(k)}{M}$$
In the previous formulas, ntrue/pred(k) represents the number of true/predicted samples belonging to the cluster k ∈ K. At this point, we can compute the entropies of Ytrue and Ypred:

$$H(Y_{true}) = -\sum_{k=1}^{K} \frac{n_{true}(k)}{M} \log\left(\frac{n_{true}(k)}{M}\right)$$

$$H(Y_{pred}) = -\sum_{k=1}^{K} \frac{n_{pred}(k)}{M} \log\left(\frac{n_{pred}(k)}{M}\right)$$
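These quantities are easy to check numerically. The following is a minimal sketch (the entropy() helper is ours, not part of scikit-learn) that computes H(Y) from an array of assignments:

import numpy as np

def entropy(labels):
    # n(k): number of samples assigned to each of the K labels
    _, counts = np.unique(labels, return_counts=True)
    # p(k) = n(k) / M
    p = counts / counts.sum()
    # H(Y) = -sum_k p(k) log p(k)
    return -np.sum(p * np.log(p))

# A uniform assignment maximizes the entropy: log(2) ~ 0.693
print(entropy([0, 0, 1, 1]))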
Considering the definition of entropy, H(·) is maximized by a uniform distribution, which, in turn, corresponds to the maximum uncertainty of every assignment. For our purposes, it's also necessary to introduce the conditional entropies (representing the uncertainty of a distribution given the knowledge of another one) of Ytrue given Ypred and the other way around:

$$H(Y_{true}|Y_{pred}) = -\sum_{i=1}^{K} \sum_{j=1}^{K} \frac{n(i,j)}{M} \log\left(\frac{n(i,j)}{n_{pred}(j)}\right)$$

$$H(Y_{pred}|Y_{true}) = -\sum_{i=1}^{K} \sum_{j=1}^{K} \frac{n(i,j)}{M} \log\left(\frac{n(i,j)}{n_{true}(j)}\right)$$
The function n(i, j) represents, in the first case, the number of samples with true label i assigned to Kj and, in the second case, the number of samples with true label j assigned to Ki.
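The values n(i, j) are exactly the entries of the contingency matrix between the two labelings, so both conditional entropies can be sketched in a few lines of NumPy (the conditional_entropy() helper is ours; scikit-learn performs an equivalent computation internally):

import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def conditional_entropy(y_a, y_b):
    # n(i, j): number of samples with label i in y_a and label j in y_b
    n_ij = contingency_matrix(y_a, y_b).astype(np.float64)
    M = n_ij.sum()
    # Column totals n_b(j)
    n_b = n_ij.sum(axis=0)
    # H(Y_a | Y_b) = -sum_ij (n(i,j)/M) log(n(i,j)/n_b(j))
    with np.errstate(divide='ignore', invalid='ignore'):
        log_term = np.where(n_ij > 0.0, np.log(n_ij / n_b), 0.0)
    return -np.sum((n_ij / M) * log_term)

Swapping the arguments yields the other conditional entropy; for example, conditional_entropy(Y_pred, Y_true) computes H(Ypred|Ytrue).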

The completeness score is defined as:

$$c = 1 - \frac{H(Y_{pred}|Y_{true})}{H(Y_{pred})}$$
It's straightforward to understand that when H(Ypred|Ytrue) → 0, the knowledge of Ytrue reduces the uncertainty of the predictions and, therefore, c → 1. This is equivalent to saying that all samples with the same true label are assigned to the same cluster. Conversely, when H(Ypred|Ytrue) → H(Ypred), it means the ground truth doesn't provide any information that reduces the uncertainty of the predictions and c → 0.
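To make the definition concrete, consider the following toy example (the labels are invented for illustration): every cluster contains samples of a single class, but the first class is split across two clusters, so the completeness drops below 1:

import numpy as np
from sklearn.metrics import completeness_score

# Class 0 is split across clusters 0 and 1; class 1 maps entirely to cluster 2
Y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
Y_pred = np.array([0, 0, 1, 1, 2, 2, 2, 2])

# H(Ypred|Ytrue) = 0.5 log(2) and H(Ypred) = 1.5 log(2), hence c = 1 - 1/3
print(completeness_score(Y_true, Y_pred))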

Of course, a good clustering is characterized by c → 1. In the case of the Breast Cancer Wisconsin dataset, the completeness score, computed using the scikit-learn function completeness_score() (which also works with textual labels) and K=2 (the only configuration associated with the ground truth), is as follows:

import pandas as pd

from sklearn.cluster import KMeans
from sklearn.metrics import completeness_score

# cdf (the scaled feature DataFrame) and dff (the DataFrame containing the
# diagnosis column) are the objects created in the previous snippets
km = KMeans(n_clusters=2, max_iter=1000, random_state=1000)
Y_pred = km.fit_predict(cdf)

df_km = pd.DataFrame(Y_pred, columns=['prediction'], index=cdf.index)
kmdff = pd.concat([dff, df_km], axis=1)

print('Completeness: {}'.format(completeness_score(kmdff['diagnosis'], kmdff['prediction'])))

The output of the previous snippet is as follows:

Completeness: 0.5168089972809706

This result confirms that, for K=2, K-means is not perfectly able to separate the clusters because, as we have seen, some malignant samples are wrongly assigned to the cluster containing the vast majority of benign samples. However, as c is not extremely small, we can be sure that most of the samples of both classes have been assigned to different clusters. The reader is invited to check this value using other methods (discussed in Chapter 3, Advanced Clustering) and to provide a brief explanation of the different results.
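As a possible starting point for this exercise, the following minimal sketch (which reuses the cdf and dff objects from the previous snippets and anticipates one of the algorithms of Chapter 3, Advanced Clustering) computes the same score with Ward-linkage agglomerative clustering:

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import completeness_score

# Ward-linkage agglomerative clustering on the same scaled dataset
ac = AgglomerativeClustering(n_clusters=2)
Y_pred_ac = ac.fit_predict(cdf)

print('Completeness (Agglomerative): {}'.format(completeness_score(dff['diagnosis'], Y_pred_ac)))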
