
Summarizing large data using principal component analysis

Suppose that you would like to build a predictor for an individual's expected net worth at age 45. There is a huge number of variables to consider: IQ, current net worth, marital status, height, geographical location, health, education, career state, age, and many others you might come up with, such as number of LinkedIn connections or SAT scores.

Having so many features causes several problems. First, the sheer amount of data incurs high storage costs and long computation times for your algorithm. Second, the larger the feature space, the more data the model needs to be accurate; with too few samples relative to the number of features, it becomes harder to distinguish the signal from the noise. For these reasons, when dealing with high-dimensional data such as this, we often employ dimensionality reduction techniques, such as principal component analysis (PCA). More information on the topic can be found at https://en.wikipedia.org/wiki/Principal_component_analysis.

PCA allows us to take our features and return a smaller number of new features, formed from our original ones, that capture as much of the variance in the data as possible. In addition, since the new features are linear combinations of the old features, the original values are no longer directly visible, which helps us anonymize our data. This is very handy when working with financial information, for example.
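
To make the idea of "new features as linear combinations of the old ones" concrete, here is a minimal sketch of PCA written directly with NumPy's singular value decomposition. The function name `pca_reduce` and the synthetic dataset are assumptions made for illustration only; they are not taken from the text, and in practice a library implementation may be preferable.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components.

    Hypothetical helper for illustration; not part of any library.
    """
    # Center each feature so the components describe variance around the mean.
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions,
    # ordered from most to least variance explained.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Fraction of the total variance captured by each component.
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    # Each new feature is a linear combination of the original features.
    X_reduced = X_centered @ components.T
    return X_reduced, components, explained_ratio[:n_components]


# Synthetic stand-in data (an assumption): 200 samples with 5 features
# that are driven by only 2 hidden factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

X_reduced, components, ratio = pca_reduce(X, n_components=2)
print(X_reduced.shape)  # (200, 2)
print(ratio.sum())      # share of total variance kept by the 2 new features
```

Because the five synthetic features are generated from two underlying factors, two principal components retain almost all of the variance. Real data is rarely this clean, so the explained-variance ratio is typically inspected before choosing how many components to keep.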
