
Reasoning in high-dimensional spaces

Working with high-dimensional feature spaces requires special mental precautions, because the intuition we have developed in three-dimensional space starts to fail. For example, let's look at one peculiar property of n-dimensional spaces, known as the n-ball volume problem. An n-ball is simply a ball in n-dimensional Euclidean space. If we plot the volume of the unit n-ball (y axis) as a function of the number of dimensions (x axis), we see the following graph:

Figure 3.9: Volume of n-ball in n-dimensional space

Note that the volume rises at first, reaches its peak at five dimensions, and then starts decreasing. What does this mean for our models? For KNN specifically, it means that beyond five features, the more features you have, the larger the radius of the sphere centered on the point you are trying to classify must be to cover its k nearest neighbors.
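We can verify this peak numerically. The sketch below (an illustration, not code from the book) computes the unit n-ball volume using the closed-form expression V_n = π^(n/2) / Γ(n/2 + 1) and finds the dimension where it is largest:

```python
import math

def n_ball_volume(n, r=1.0):
    """Volume of an n-ball of radius r: V_n(r) = pi^(n/2) / Gamma(n/2 + 1) * r^n."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * r ** n

# Volumes of the unit ball for dimensions 1..15
volumes = {n: n_ball_volume(n) for n in range(1, 16)}

# The maximum is reached at n = 5, matching the graph above
peak_dimension = max(volumes, key=volumes.get)
```

Running this confirms that `peak_dimension` is 5, and that the volume keeps shrinking toward zero as the number of dimensions grows further.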

The counter-intuitive phenomena that arise in high-dimensional spaces are colloquially known as the curse of dimensionality. The term covers a wide range of effects that cannot be observed in the three-dimensional space we are used to dealing with. Pedro Domingos, in his paper A Few Useful Things to Know about Machine Learning, provides some examples:

"In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean, but in an increasingly distant shell around it; and most of the volume of a high-dimensional orange is in the skin, not the pulp. If a constant number of examples is distributed uniformly in a high-dimensional hypercube, beyond some dimensionality most examples are closer to a face of the hypercube than to their nearest neighbor. And if we approximate a hypersphere by inscribing it in a hypercube, in high dimensions almost all the volume of the hypercube is outside the hypersphere. This is bad news for machine learning, where shapes of one type are often approximated by shapes of another."
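The last observation in the quote is easy to check with a few lines of code. This sketch (my own illustration, not from the source) computes what fraction of a unit hypercube's volume is taken up by its inscribed n-ball of radius 1/2:

```python
import math

def inscribed_ball_fraction(n):
    """Fraction of the unit hypercube's volume occupied by the inscribed
    n-ball of radius 1/2: V_n(1/2) / 1, where
    V_n(r) = pi^(n/2) / Gamma(n/2 + 1) * r^n."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * 0.5 ** n

# n = 2: pi/4, about 78.5% of the square is inside the circle
# n = 10: roughly 0.25% -- almost all of the cube lies outside the ball
fractions = {n: inscribed_ball_fraction(n) for n in (2, 3, 5, 10)}
```

By ten dimensions, the inscribed ball occupies well under one percent of the cube, which is exactly the approximation problem Domingos describes.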

Speaking specifically of KNN, it treats all dimensions as equally important. This creates problems when some of the features are irrelevant, especially in high dimensions, because the noise introduced by those irrelevant features drowns out the signal contained in the good ones. In our example, we sidestepped the multidimensionality problem by taking into account only the magnitude of each three-dimensional vector in our motion signals.
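That dimensionality-reduction step can be sketched as follows. This is a minimal illustration with made-up sample values, assuming the motion signal arrives as (x, y, z) triples:

```python
import math

def magnitudes(samples):
    """Collapse each (x, y, z) motion sample to its Euclidean magnitude,
    turning a 3-D feature into a single rotation-invariant scalar."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]

# Hypothetical accelerometer readings
signal = [(3.0, 4.0, 0.0), (1.0, 2.0, 2.0)]
features = magnitudes(signal)  # [5.0, 3.0]
```

Besides shrinking the feature space, the magnitude has the convenient property of being independent of the device's orientation, so KNN distances are computed over one meaningful dimension per sample instead of three partially redundant ones.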
