
Missing values

Data aggregation, extraction, and consolidation are often imperfect and sometimes result in missing values. There are several common strategies for dealing with missing values in datasets:

  • Removing all the rows with missing values from the dataset. This is simple to apply, but you may end up throwing away a large amount of information that would have been valuable to your model.
  • Using models that are, by nature, not affected by missing values, such as decision tree-based models (random forests, boosted trees). Unfortunately, the linear regression model, and by extension the SGD algorithm, does not work with missing values (http://facweb.cs.depaul.edu/sjost/csc423/documents/missing_values.pdf).
  • Imputing the missing data with replacement values; for example, replacing missing values with the median, the average, or the harmonic mean of all the existing values, or using clustering or linear regression to predict the missing values. It can also be useful to record in the dataset that these values were originally missing.
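As a minimal sketch of the first and third strategies, the following uses pandas on a small made-up table. The column names `age` and `fare` and all the values are assumptions chosen for illustration; they are not taken from a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two numeric columns, each with gaps.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, np.nan],
    "fare": [7.25, 71.28, np.nan, 26.55, 8.05],
})

# Strategy 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Strategy 3: replace each missing value with the median of its own column.
# df.median() skips NaN values by default.
imputed = df.fillna(df.median())

print(f"rows before: {len(df)}, rows after dropping: {len(dropped)}")
print(imputed)
```

Dropping rows keeps only the fully populated records, while median imputation keeps every row at the cost of inventing plausible values. Swapping `df.median()` for `df.mean()` gives average-based imputation instead.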

In the end, the right strategy depends on the type of missing data and, of course, on the context. Replacing missing blood pressure readings in a patient's medical record with some average may not be acceptable in a healthcare context, whereas replacing missing age values with the average age in the Titanic dataset is perfectly acceptable in a data science competition.

However, Amazon ML's documentation is not 100% clear on the strategy used to deal with missing values:

If the target attribute is present in the record, but a value for another numeric attribute is missing, then Amazon ML overlooks the missing value. In this case, Amazon ML creates a substitute attribute and sets it to 1 to indicate that this attribute is missing.

In other words, when a value is missing, a new column is created with a Boolean flag indicating that the value was originally missing. It is not clear, however, whether the whole row (sample) is discarded or only the missing cell is ignored. The documentation makes no mention of any type of imputation.
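To make the idea of a substitute attribute concrete, here is a small pandas sketch of a missing-value flag. It is not Amazon ML's implementation, which is not public; the column names `age`, `survived`, and `age_missing` are hypothetical and only illustrate what such a flag column could look like.

```python
import numpy as np
import pandas as pd

# Hypothetical records: the target (survived) is present, age has one gap.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "survived": [0, 1, 1],
})

# Substitute attribute: 1 where the original value is missing, 0 otherwise.
df["age_missing"] = df["age"].isna().astype(int)

print(df)
```

The flag preserves the fact that a value was absent, regardless of what is later done with the missing cell itself.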
