
Classification with k-Nearest Neighbors (kNN)

k-Nearest Neighbors (kNN) is a well-known classification method. It is based on finding similarities between data points, or what we call feature similarity. The algorithm is simple and is widely used to solve many classification problems, such as recommendation systems, anomaly detection, credit rating, and so on. However, it requires a large amount of memory, because it stores the entire training set. Since it is a supervised learning model, it must be fed labeled data whose outputs are known; we only need to map the function that relates the inputs to the outputs. The kNN algorithm is non-parametric. Data is represented as feature vectors; you can see it as a mathematical representation:

$$\mathbf{x} = (x_1, x_2, \ldots, x_n)$$

The classification is done like a vote; to know the class of a selected data point, you must first compute the distance between the selected item and every training item. But how can we calculate these distances?

Generally, we have two major methods for calculating them. We can use the Euclidean distance:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Or, we can use the cosine similarity:

$$\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|}$$
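To make the two measures concrete, here is a minimal sketch in Python using NumPy (not code from this book; the vectors a and b are made-up examples):

```python
import numpy as np

def euclidean_distance(x, y):
    # Square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def cosine_similarity(x, y):
    # Dot product divided by the product of the vector norms
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(euclidean_distance(a, b))  # 3.7416... (square root of 14)
print(cosine_similarity(a, b))   # 1.0: the vectors point in the same direction
```

Note that Euclidean distance is sensitive to vector magnitude, while cosine similarity only compares direction; which one fits better depends on how your features are scaled.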

The second step is choosing the k nearest neighbors (k can be picked arbitrarily). Finally, we conduct a majority vote: the data point is assigned to the class that occurs most frequently among its k nearest neighbors, in other words, the class with the largest probability.
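Putting the three steps together (computing distances, selecting the k nearest neighbors, and voting), here is a minimal from-scratch sketch in Python with NumPy. It is an illustration under stated assumptions, not the book's own code; the training data and the "benign"/"malicious" labels are hypothetical:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by a majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from x_new to every training item
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Step 2: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote; the most common class among the k neighbors wins
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated classes
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0]])
y_train = np.array(["benign", "benign", "malicious", "malicious"])
print(knn_predict(X_train, y_train, np.array([1.2, 1.1])))  # -> benign
```

In practice, you would typically reach for a ready-made implementation such as scikit-learn's KNeighborsClassifier, which follows the same logic.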
