K-means++

Finding the optimal initial configuration is equivalent to minimizing the inertia; however, Arthur and Vassilvitskii (in K-means++: The Advantages of Careful Seeding, Arthur D., Vassilvitskii S., Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2007) proposed an alternative initialization method (called K-means++), which can dramatically improve the convergence speed by choosing initial centroids with a higher probability of being close to the final ones. The complete proof is quite complex and can be found in the aforementioned paper; in this context, we directly provide the final results and some important consequences.

Let's consider the function D(·) defined as:

D(x) = \min_{i=1,\ldots,t} \lVert x - \mu_i \rVert \quad \forall x \in X

where μ1, ..., μt are the centroids already selected. D(·) therefore represents the shortest distance between a sample x ∈ X and a centroid already selected. Once the function has been computed, it's possible to determine a probability distribution G(x) as follows:

G(x) = \frac{D(x)^2}{\sum_{x' \in X} D(x')^2}
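As a minimal sketch of these two definitions (assuming X is a NumPy array of shape (n, d) and centroids is a (t, d) array holding the centroids already selected; the name proposal_distribution is our own, introduced only for illustration):

```python
import numpy as np

def proposal_distribution(X, centroids):
    # D(x): distance between each sample and its closest selected centroid
    distances = np.linalg.norm(X[:, np.newaxis, :] - centroids[np.newaxis, :, :], axis=2)
    D = distances.min(axis=1)
    # G(x): squared distances normalized into a probability distribution
    return D ** 2 / np.sum(D ** 2)
```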

The first centroid μ1 is sampled from a uniform distribution. At this point, it's possible to compute D(·) for all samples x ∈ X and, therefore, the distribution G(x). It's straightforward to see that, when sampling from G(x), the probability of selecting a point in a dense region far from the existing centroids is much larger than with uniform sampling, while the probability of picking a point in a sparse separating region is much smaller. Hence, we continue by sampling μ2 from G(x). The procedure is repeated until all K centroids have been determined. Of course, as this is a probabilistic approach, we have no guarantee that the final configuration is optimal. However, K-means++ is O(log K)-competitive. In fact, if Sopt is the theoretical optimal value for S, the authors proved that the following inequality holds:

E[S] \le 8(\ln K + 2) \, S_{opt}
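The whole seeding procedure can be summarized with a short sketch (a simplified illustration, not scikit-learn's actual implementation; the helper name kmeans_plus_plus_init and the use of a seeded NumPy random generator are our own assumptions):

```python
import numpy as np

def kmeans_plus_plus_init(X, K, seed=1000):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # The first centroid is sampled from a uniform distribution over X
    centroids = [X[rng.integers(n)]]
    for _ in range(K - 1):
        # D(x): shortest distance between each sample and a selected centroid
        C = np.asarray(centroids)
        D = np.linalg.norm(X[:, np.newaxis, :] - C[np.newaxis, :, :], axis=2).min(axis=1)
        # G(x): sampling probabilities proportional to the squared distances
        G = D ** 2 / np.sum(D ** 2)
        # The next centroid is sampled from G(x)
        centroids.append(X[rng.choice(n, p=G)])
    return np.asarray(centroids)
```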

As S is decreased by a better initial choice, the previous formula sets an upper bound for the expected value E[S] that grows roughly proportionally to log K. For example, the bound yields E[S] ≤ 8(ln 10 + 2)·Sopt ≈ 34.4·Sopt for K=10, and about 24.8·Sopt for K=3. This result reveals two important elements. The first one is that K-means++ performs better when K is not extremely large; the second, probably also the most important, is that a single K-means++ initialization is not enough to guarantee an optimal configuration. For this reason, common implementations (for example, scikit-learn) perform a variable number of initializations and select the one whose final inertia is the smallest.
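For example, with scikit-learn the number of K-means++ initializations is controlled by the n_init parameter (the synthetic dataset below is only an assumption, used to make the snippet self-contained):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic dataset with 3 well-separated blobs
X, _ = make_blobs(n_samples=1000, centers=3, random_state=1000)

# n_init=10 performs 10 independent K-means++ initializations and keeps
# the run whose final inertia is the smallest
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=1000)
km.fit(X)
print(km.inertia_)
```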
