
Quantifying separations – k-means clustering and the silhouette score

The hardest pair of classes to separate in this dataset is versicolor and virginica. The violins for these two classes show that the two techniques produce different results. Using the setosa distribution as a reference in both plots, the LDA versicolor distribution is tighter (that is, its violin is wider and shorter) than the PCA one, so its interquartile range sits further from the interquartile range of the virginica distribution. If this visual analysis is not rigorous enough for you, we can quantify the difference by running a clustering algorithm on the data. Let's use the k-means clustering algorithm to group the data, and then use a quantitative metric called the silhouette coefficient to score the resulting clusters. The coefficient ranges from -1 to 1, and a higher score means tighter, better-separated clusters. Since the k-means algorithm is very straightforward and the quality of the grouping is directly related to the quality of the input data, tighter clusters are evidence that the input features separate the classes better:

# cluster with k-means and check silhouette score
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# initialize k-means algo object
kmns = KMeans(n_clusters=3, random_state=42)

# fit algo to pca and find silhouette score
out_kms_pca = kmns.fit_predict(out_pca)
silhouette = silhouette_score(out_pca, out_kms_pca)
print("PCA silhouette score = %.3f" % silhouette)

# fit algo to lda and find silhouette score
out_kms_lda = kmns.fit_predict(out_lda)
silhouette = silhouette_score(out_lda, out_kms_lda)
print("LDA silhouette score = %.3f" % silhouette)

The following output shows that the LDA classes are better separated: 

PCA silhouette score = 0.598
LDA silhouette score = 0.656

This makes sense because LDA is a supervised technique: unlike PCA, it was given additional information, namely the class labels to be separated.
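
To make the metric less of a black box, here is a minimal sketch that computes the silhouette coefficient directly from its definition and compares it with scikit-learn's `silhouette_score`. For each sample, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to the members of any other cluster, and the sample's score is `(b - a) / max(a, b)`. The overall score is the mean over all samples. Several things here are assumptions rather than the original setup: the helper `manual_silhouette` is a hypothetical name, the data is loaded with `load_iris`, and `out_pca` and `out_lda` are rebuilt as plain two-component projections. The preprocessing used earlier may differ, so the printed scores need not match the output above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import pairwise_distances, silhouette_score


def manual_silhouette(X, labels):
    """Mean silhouette coefficient, computed directly from its definition."""
    dist = pairwise_distances(X)  # Euclidean distances between all samples
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # convention: a sample alone in its cluster scores 0
        # a: mean distance to the other members of the sample's own cluster
        # (the distance of the sample to itself is 0, hence n_own - 1)
        a = dist[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


# Assumption: stand-ins for the out_pca and out_lda arrays used above
iris = load_iris()
out_pca = PCA(n_components=2).fit_transform(iris.data)
out_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(
    iris.data, iris.target
)

# n_init is set explicitly so the behavior does not depend on the library version
kmns = KMeans(n_clusters=3, n_init=10, random_state=42)
results = {}
for name, features in [("PCA", out_pca), ("LDA", out_lda)]:
    labels = kmns.fit_predict(features)
    results[name] = (
        silhouette_score(features, labels),
        manual_silhouette(features, labels),
    )
    print("%s silhouette score: sklearn = %.3f, manual = %.3f" % (name, *results[name]))
```

The two columns should agree for each projection, which confirms that the library score is simply the average of the per-sample `(b - a) / max(a, b)` values.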
