
Cluster analysis

Cluster analysis (normally just called clustering) is an example of a task where we want to find common features among large sets of samples. In this case, we always suppose the existence of a data generating process $p_{data}$, and we define the dataset $X$ as a set of $N$ samples drawn from it:

$$X = \{x_1, x_2, \ldots, x_N\} \quad \text{where} \quad x_i \in \mathbb{R}^m$$

A clustering algorithm is based on the implicit assumption that samples can be grouped according to their similarities. In particular, given two vectors, a similarity function is defined as the reciprocal or inverse of a metric function. For example, if we are working in a Euclidean space, we have:

$$d(a, b) = \lVert a - b \rVert_2 \quad \text{and} \quad s(a, b) = \frac{1}{d(a, b) + \epsilon}$$
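As a minimal sketch of this idea, the following Python snippet computes an inverse-distance similarity in a Euclidean space, assuming the form s(a, b) = 1 / (d(a, b) + ε). The function names `euclidean_distance` and `similarity`, and the default value of `eps`, are illustrative choices and not part of the original text.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity(a, b, eps=1e-8):
    """Inverse-distance similarity; eps (assumed value) keeps the denominator nonzero."""
    return 1.0 / (euclidean_distance(a, b) + eps)


a = (0.0, 0.0)
b = (3.0, 4.0)  # distance 5 from a
c = (1.0, 0.0)  # distance 1 from a

# The closer vector is the more similar one.
print(similarity(a, c) > similarity(a, b))  # True
```

Because of the small constant in the denominator, the similarity of a vector with itself is large but finite instead of raising a division error.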

The constant $\epsilon$ in the similarity function has been introduced to avoid division by zero. It follows directly that $d(a, c) < d(a, b) \Rightarrow s(a, c) > s(a, b)$. Therefore, given a representative $\mu_j$ of each cluster, we can create the set of assigned vectors $C_j$ using the rule:

$$C_j = \{x_i \in X : d(x_i, \mu_j) < d(x_i, \mu_k) \ \ \forall k \neq j\}$$

In other words, a cluster contains all those elements whose distance from its representative is smaller than their distance from any other representative. Equivalently, a cluster contains the samples whose similarity with its representative is maximal compared to all other representatives. Moreover, after the assignment, a sample gains the right to share its features with the other members of the same cluster.
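The assignment rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the representatives are given in advance (the text does not say how they are obtained), the metric is Euclidean, and ties are broken in favor of the representative with the lowest index. The name `assign_to_clusters` and the sample values are hypothetical.

```python
import math


def assign_to_clusters(samples, representatives):
    """Map each representative index to the samples that are closest to it."""
    clusters = {j: [] for j in range(len(representatives))}
    for x in samples:
        # Index of the representative at minimum Euclidean distance from x
        # (min returns the first index in case of a tie).
        nearest = min(
            range(len(representatives)),
            key=lambda j: math.dist(x, representatives[j]),
        )
        clusters[nearest].append(x)
    return clusters


samples = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (4.9, 5.0)]
representatives = [(0.0, 0.0), (5.0, 5.0)]  # assumed to be known

clusters = assign_to_clusters(samples, representatives)
for j, members in clusters.items():
    print(j, members)
```

Minimizing the distance here is the same as maximizing the inverse-distance similarity, so either criterion yields the same assignment.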

In fact, one of the most important applications of cluster analysis is increasing the homogeneity of samples that are recognized as similar. For example, a recommendation engine could be based on the clustering of user vectors (containing information about their interests and purchased products). Once the groups have been defined, all the elements belonging to the same cluster are considered similar, hence we are implicitly authorized to share their differences. If user A has bought product P and rated it positively, we can suggest this item to user B, who belongs to the same cluster but hasn't bought it, and the other way around. The process can appear arbitrary, but it turns out to be extremely effective when the number of elements is large and the feature vectors contain many discriminative elements (for example, ratings).
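The sharing step of such a recommendation engine might look like the following sketch. It assumes the clusters have already been computed and that ratings are stored as nested dictionaries. The function `suggest_items`, the `min_rating` threshold used to decide what counts as a positive rating, and the sample data are all hypothetical.

```python
def suggest_items(user, cluster_members, ratings, min_rating=4):
    """Suggest items rated positively by cluster mates that the user does not own."""
    owned = set(ratings.get(user, {}))
    suggestions = set()
    for other in cluster_members:
        if other == user:
            continue
        for item, rating in ratings.get(other, {}).items():
            # Share only positively rated items that the user has not bought.
            if rating >= min_rating and item not in owned:
                suggestions.add(item)
    return sorted(suggestions)


# Hypothetical data: users A and B were assigned to the same cluster.
ratings = {
    "A": {"P": 5, "Q": 2},
    "B": {"R": 4},
}
cluster = ["A", "B"]

print(suggest_items("B", cluster, ratings))  # ['P']
print(suggest_items("A", cluster, ratings))  # ['R']
```

Item Q is not suggested to B because A rated it below the assumed threshold, which reflects the idea that only positively rated products are shared within a cluster.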
