- Machine Learning Algorithms
- Giuseppe Bonaccorso
- 2021-07-02 18:53:30
Managing missing features
A dataset sometimes contains missing features. There are a few options for handling them:
- Removing the whole line
- Creating a sub-model to predict those features
- Using an automatic strategy to impute them according to the other known values
The first option is the most drastic one and should be considered only when the dataset is quite large, the number of missing features is high, and any prediction could be risky. The second option is much more difficult because it's necessary to determine a supervised strategy to train a model for each feature and, finally, to predict their values. Considering all pros and cons, the third option is likely to be the best choice. scikit-learn offers the class Imputer, which is responsible for filling the holes using a strategy based on the mean (default choice), the median, or the frequency (each column's most frequent entry is used for all of its missing ones). In scikit-learn 0.20 and later, the same functionality is provided by the class SimpleImputer in sklearn.impute; Imputer itself was removed in version 0.22.
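As a rough illustration of the first two options, the sketch below drops incomplete rows with a NumPy mask and then fills one column with a small regression sub-model. The toy array, the name target_col, and the choice of LinearRegression as the sub-model are assumptions made for this example, not something prescribed by the text; the sketch also assumes that the remaining columns contain no missing values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: only column 1 has a missing entry (row 0)
data = np.array([[1.0, np.nan, 2.0],
                 [2.0, 3.0, 4.0],
                 [-1.0, 4.0, 2.0],
                 [0.0, 3.5, 1.0]])

# Option 1: remove every row that contains at least one NaN
complete_rows = data[~np.isnan(data).any(axis=1)]

# Option 2: train a sub-model on the rows where the target column is known
target_col = 1
known = ~np.isnan(data[:, target_col])
other_cols = [c for c in range(data.shape[1]) if c != target_col]

model = LinearRegression()
model.fit(data[known][:, other_cols], data[known, target_col])

# Predict the missing entries from the other (fully known) columns
filled = data.copy()
filled[~known, target_col] = model.predict(data[~known][:, other_cols])
```

With more than one incomplete column, a separate model would be needed for each of them, which is why this option quickly becomes expensive.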
The following snippet shows an example using the three strategies. The default placeholder for a missing feature entry is NaN, but it's possible to use a different one through the parameter missing_values:
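>>> import numpy as np
>>> # Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement
>>> from sklearn.impute import SimpleImputer
>>> data = np.array([[1, np.nan, 2], [2, 3, np.nan], [-1, 4, 2]])
>>> # Replace each NaN with the mean of its column
>>> imp = SimpleImputer(strategy='mean')
>>> imp.fit_transform(data)
array([[ 1. ,  3.5,  2. ],
       [ 2. ,  3. ,  2. ],
       [-1. ,  4. ,  2. ]])
>>> # Replace each NaN with the median of its column
>>> imp = SimpleImputer(strategy='median')
>>> imp.fit_transform(data)
array([[ 1. ,  3.5,  2. ],
       [ 2. ,  3. ,  2. ],
       [-1. ,  4. ,  2. ]])
>>> # Replace each NaN with the most frequent value of its column (smallest on ties)
>>> imp = SimpleImputer(strategy='most_frequent')
>>> imp.fit_transform(data)
array([[ 1.,  3.,  2.],
       [ 2.,  3.,  2.],
       [-1.,  4.,  2.]])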
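The missing_values parameter mentioned above can be sketched as follows. This example assumes a recent scikit-learn release, where the imputer class is SimpleImputer in sklearn.impute, and it uses -999.0 as a purely hypothetical placeholder chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical convention: -999.0 marks a missing entry instead of NaN
PLACEHOLDER = -999.0

data = np.array([[1.0, PLACEHOLDER, 2.0],
                 [2.0, 3.0, PLACEHOLDER],
                 [-1.0, 4.0, 2.0]])

# Entries equal to the placeholder are ignored when computing column means
imp = SimpleImputer(missing_values=PLACEHOLDER, strategy='mean')
imputed = imp.fit_transform(data)

# Per-column fill values learned during fit
column_means = imp.statistics_
```

Here the placeholder entries are excluded from the column statistics, so the filled values match the NaN-based example with the mean strategy.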