- Hands-On Unsupervised Learning with Python
- Giuseppe Bonaccorso
Completeness score
This measure (together with all the other ones discussed from now on) is based on knowledge of the ground truth. Before introducing the index, it's helpful to define some common values. If we denote with Ytrue the set containing the true assignments and with Ypred the set of predictions (both containing M values and K clusters), we can estimate the following probabilities:
$$P_{true}(k) = \frac{n_{true}(k)}{M} \quad \text{and} \quad P_{pred}(k) = \frac{n_{pred}(k)}{M}$$
In the previous formulas, ntrue/pred(k) represents the number of true/predicted samples belonging to cluster k ∈ K. At this point, we can compute the entropies of Ytrue and Ypred:
$$H(Y_{true}) = -\sum_{k=1}^{K} \frac{n_{true}(k)}{M} \log \frac{n_{true}(k)}{M} \quad \text{and} \quad H(Y_{pred}) = -\sum_{k=1}^{K} \frac{n_{pred}(k)}{M} \log \frac{n_{pred}(k)}{M}$$
Considering the definition of entropy, H(·) is maximized by a uniform distribution, which, in turn, corresponds to the maximum uncertainty of every assignment (for example, with K=2, the entropy is largest when each cluster contains half of the samples). For our purposes, it's also necessary to introduce the conditional entropies (representing the uncertainty of a distribution given the knowledge of another one) of Ytrue given Ypred and the other way around:
$$H(Y_{true}|Y_{pred}) = -\sum_{i}\sum_{j} \frac{n(i,j)}{M} \log \frac{n(i,j)}{n_{pred}(j)} \quad \text{and} \quad H(Y_{pred}|Y_{true}) = -\sum_{i}\sum_{j} \frac{n(i,j)}{M} \log \frac{n(i,j)}{n_{true}(j)}$$
The function n(i, j) represents, in the first case, the number of samples with true label i assigned to Kj and, in the second case, the number of samples with true label j assigned to Ki.
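To make the notation concrete, here is a minimal sketch of the contingency matrix n(i, j); the arrays Y_true and Y_pred are hypothetical labels invented purely for illustration:

import numpy as np

# Hypothetical ground truth and predictions (M = 6 samples, 2 clusters)
Y_true = np.array([0, 0, 0, 1, 1, 1])
Y_pred = np.array([0, 0, 1, 1, 1, 1])

# n[i, j] = number of samples with true label i assigned to cluster j
n = np.array([[np.sum((Y_true == i) & (Y_pred == j))
               for j in np.unique(Y_pred)]
              for i in np.unique(Y_true)])
print(n)
# [[2 1]
#  [0 3]]

The row marginals of this matrix yield ntrue(i) and the column marginals yield npred(j), which appear as the normalizing factors in the conditional entropies above.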
The completeness score is defined as:
$$c = 1 - \frac{H(Y_{pred}|Y_{true})}{H(Y_{pred})}$$
It's straightforward to understand that when H(Ypred|Ytrue) → 0, the knowledge of Ytrue removes almost all uncertainty about the predictions and, therefore, c → 1. This is equivalent to saying that all samples with the same true label are assigned to the same cluster. Conversely, when H(Ypred|Ytrue) → H(Ypred), the ground truth provides no information that reduces the uncertainty of the predictions, and c → 0.
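To see how the definition behaves numerically, the following sketch (reusing the same hypothetical labels as in the previous snippet) computes H(Ypred), H(Ypred|Ytrue), and c directly from the contingency matrix, and cross-checks the result against scikit-learn's completeness_score():

import numpy as np
from sklearn.metrics import completeness_score

# Same hypothetical labels as in the previous sketch
Y_true = np.array([0, 0, 0, 1, 1, 1])
Y_pred = np.array([0, 0, 1, 1, 1, 1])
M = len(Y_true)

# n[i, j] = number of samples with true label i assigned to cluster j
n = np.array([[np.sum((Y_true == i) & (Y_pred == j))
               for j in np.unique(Y_pred)]
              for i in np.unique(Y_true)])

# H(Y_pred) from the column marginals n_pred(k)
p_pred = n.sum(axis=0) / M
H_pred = -np.sum(p_pred * np.log(p_pred))

# H(Y_pred|Y_true): sum over non-empty cells, normalized by the
# row marginals n_true(i) (i.e., the size of each true class)
n_true = n.sum(axis=1)
H_pred_given_true = -sum((n[i, j] / M) * np.log(n[i, j] / n_true[i])
                         for i in range(n.shape[0])
                         for j in range(n.shape[1]) if n[i, j] > 0)

c = 1.0 - H_pred_given_true / H_pred
print('Manual completeness: {:.4f}'.format(c))
print('scikit-learn: {:.4f}'.format(completeness_score(Y_true, Y_pred)))

Both values are about 0.5 for this toy example: one true class is split across two clusters, so knowing the true label still leaves residual uncertainty about the predicted cluster.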
Of course, a good clustering is characterized by c → 1. In the case of the Breast Cancer Wisconsin dataset, the completeness score, computed using the scikit-learn function completeness_score() (which also works with textual labels) and K=2 (the only configuration associated with the ground truth), is as follows:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import completeness_score

# cdf (the preprocessed feature DataFrame) and dff (the DataFrame containing
# the 'diagnosis' ground-truth column) are defined earlier in the chapter
km = KMeans(n_clusters=2, max_iter=1000, random_state=1000)
Y_pred = km.fit_predict(cdf)

# Attach the predicted labels to the ground truth for the comparison
df_km = pd.DataFrame(Y_pred, columns=['prediction'], index=cdf.index)
kmdff = pd.concat([dff, df_km], axis=1)

print('Completeness: {}'.format(completeness_score(kmdff['diagnosis'], kmdff['prediction'])))
The output of the previous snippet is as follows:
Completeness: 0.5168089972809706
This result confirms that, for K=2, K-means is not able to perfectly separate the two clusters because, as we have seen, some malignant samples are wrongly assigned to the cluster containing the vast majority of benign samples. However, as c is not extremely small, we can be sure that most of the samples belonging to each class have been assigned to different clusters. The reader is invited to check this value using other methods (discussed in Chapter 3, Advanced Clustering) and to provide a brief explanation of the different results.