  • Machine Learning Algorithms
  • Giuseppe Bonaccorso
  • 273 words
  • 2021-07-02 18:53:30

Managing missing features

A dataset sometimes contains missing features. There are a few options for handling them:

  • Removing the whole row (sample)
  • Creating a sub-model to predict those features
  • Using an automatic strategy to impute them according to the other known values

The first option is the most drastic one and should be considered only when the dataset is quite large, the number of missing features is high, and any prediction could be risky. The second option is much more difficult because a supervised model has to be trained for each feature with missing entries and then used to predict the missing values. Considering all pros and cons, the third option is likely to be the best choice. scikit-learn offers the class Imputer (superseded by SimpleImputer in sklearn.impute since version 0.20 and removed from sklearn.preprocessing in version 0.22), which fills the holes using a strategy based on the mean (the default choice), the median, or the frequency (the most frequent entry replaces all the missing ones). By default, each statistic is computed per column.
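As a point of comparison, the first two options can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the toy array X, the choice of LinearRegression as the sub-model, and all variable names are hypothetical and do not come from the original text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: column 1 is exactly 2 * column 0, with one missing entry
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan],
              [4.0, 8.0]])

# Option 1: drop every row that contains at least one NaN
complete = ~np.isnan(X).any(axis=1)
X_dropped = X[complete]

# Option 2: train a sub-model on the complete rows (column 0 -> column 1),
# then use it to predict the missing entries of column 1
model = LinearRegression().fit(X[complete, :1], X[complete, 1])

missing = np.isnan(X[:, 1])
X_filled = X.copy()
X_filled[missing, 1] = model.predict(X[missing, :1])
```

Dropping rows loses the information carried by the known entries of the discarded samples, while the sub-model approach requires one such model for every feature that can be missing, which is why it quickly becomes laborious.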

The following snippet shows an example using the three strategies. The default placeholder for a missing feature entry is NaN; however, a different one can be set through the parameter missing_values:

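>>> import numpy as np
>>> # Imputer is available in sklearn.preprocessing up to scikit-learn 0.21;
>>> # later releases provide sklearn.impute.SimpleImputer instead
>>> from sklearn.preprocessing import Imputer

>>> data = np.array([[1, np.nan, 2], [2, 3, np.nan], [-1, 4, 2]])

>>> # Replace each NaN with the mean of the known values in its column
>>> imp = Imputer(strategy='mean')
>>> imp.fit_transform(data)
array([[ 1. ,  3.5,  2. ],
       [ 2. ,  3. ,  2. ],
       [-1. ,  4. ,  2. ]])

>>> # Median: same result as the mean here, because each incomplete
>>> # column has only two known values
>>> imp = Imputer(strategy='median')
>>> imp.fit_transform(data)
array([[ 1. ,  3.5,  2. ],
       [ 2. ,  3. ,  2. ],
       [-1. ,  4. ,  2. ]])

>>> # Most frequent value per column (3 and 4 are tied in the second
>>> # column; the output shows that 3 is chosen)
>>> imp = Imputer(strategy='most_frequent')
>>> imp.fit_transform(data)
array([[ 1.,  3.,  2.],
       [ 2.,  3.,  2.],
       [-1.,  4.,  2.]])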
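In scikit-learn 0.22 and later, Imputer is no longer available and SimpleImputer from sklearn.impute takes its place. The following sketch repeats the mean strategy with that class and also illustrates the missing_values parameter with a custom placeholder. The placeholder value -999 and the variable names are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same values as above, but missing entries are marked with -999
# (an arbitrary placeholder chosen for this example) instead of NaN
data = np.array([[1, -999, 2],
                 [2, 3, -999],
                 [-1, 4, 2]], dtype=float)

# Tell the imputer which value marks a missing entry
imp = SimpleImputer(missing_values=-999, strategy='mean')
filled = imp.fit_transform(data)

# statistics_ stores the per-column fill values learned during fit
column_means = imp.statistics_
```

The strategies 'median' and 'most_frequent' are accepted in the same way. Because the fill values are stored in the fitted object, the same imputer can be applied to new samples with transform without recomputing the statistics.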