
Avoiding overfitting with feature selection and dimensionality reduction

We typically represent data as a grid of numbers (a matrix). Each column represents a variable, which we call a feature in machine learning. In supervised learning, one of the variables is actually not a feature but the label that we're trying to predict, and each row is an example that we can use for training or testing.

The number of features corresponds to the dimensionality of the data. Our machine learning approach depends on the number of dimensions versus the number of examples. For instance, text and image data are very high dimensional, while stock market data has relatively fewer dimensions.
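As a minimal illustration, here is how such a grid of numbers might look in code; the values and feature names are made up for the example:

import numpy as np

# Each row is an example, each column a feature
# (say, temperature, humidity, and wind speed - hypothetical features)
X = np.array([[30.0, 0.4, 12.0],
              [22.5, 0.7,  5.0],
              [27.0, 0.6,  8.0],
              [18.0, 0.9,  3.0]])

# In supervised learning, the label is kept separately from the feature matrix
y = np.array([1, 0, 1, 0])

print(X.shape)  # (4, 3): 4 examples in 3 dimensions
print(y.shape)  # (4,)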

Fitting high-dimensional data is computationally expensive and prone to overfitting due to the high model complexity. Data in higher dimensions is also impossible to visualize directly, which rules out simple diagnostic methods.

Not all of the features are useful; some may only add randomness to our results. It's therefore often important to perform careful feature selection. Feature selection is the process of picking a subset of significant features to build a better model. In practice, not every feature in a dataset carries information useful for discriminating samples; some features are either redundant or irrelevant and can be discarded with little loss.
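As a quick sketch of what this looks like in practice, the following uses scikit-learn's SelectKBest to keep only the features that score highest against the label; the dataset and the choice of k = 2 are just illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Keep the 2 features with the highest ANOVA F-scores against the label
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape, '->', X_selected.shape)      # (150, 4) -> (150, 2)
print(selector.get_support(indices=True))   # indices of the selected features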

In principle, feature selection boils down to multiple binary decisions about whether to include each feature or not. For n features, we get 2^n possible feature sets, which becomes a very large number as the number of features grows. For example, for 10 features, we have 1,024 possible feature sets (for instance, if we're deciding what clothes to wear, the features can be temperature, rain, the weather forecast, where we're going, and so on). At a certain point, brute-force evaluation becomes infeasible. We'll discuss better methods in Chapter 6, Predicting Online Ads Click-Through with Tree-Based Algorithms. Basically, we have two options: we either start with all of the features and remove features iteratively, or we start with a minimum set of features and add features iteratively. We then take the best feature set from each iteration and compare them.
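To make the "start with all features and remove them iteratively" option concrete, here is a small sketch using scikit-learn's recursive feature elimination (RFE); the choice of estimator and the target of 10 features are assumptions made for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)  # 569 samples, 30 features

# Repeatedly fit the model and drop the least important feature
# until only 10 features remain
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=10)
rfe.fit(X, y)

print(rfe.support_)                            # mask of the retained features
print(X.shape, '->', rfe.transform(X).shape)   # (569, 30) -> (569, 10)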

We'll explore how to perform feature selection mainly in Chapter 7, Predicting Online Ads Click-Through with Logistic Regression.

Another common approach to reducing dimensionality is to transform high-dimensional data into a lower-dimensional space. This is called dimensionality reduction or feature projection. The transformation inevitably loses some information, but we can keep the loss to a minimum.

We'll talk about and implement dimensionality reduction in Chapter 2, Exploring the 20 Newsgroups Dataset with Text Analysis Techniques; Chapter 3, Mining the 20 Newsgroups Dataset with Clustering and Topic Modeling Algorithms; and Chapter 10, Machine Learning Best Practices.
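For a taste of what is to come, here is a minimal dimensionality reduction sketch using principal component analysis (PCA); projecting the 64-dimensional digits data onto 2 components is purely an illustrative choice:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 1,797 images, 64 pixel features each

# Project onto the 2 directions that preserve the most variance
pca = PCA(n_components=2)
X_projected = pca.fit_transform(X)

print(X.shape, '->', X_projected.shape)     # (1797, 64) -> (1797, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained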
