Cleaning techniques

Typically, the data cleansing process revolves around identifying outliers: data points that stand out because they do not follow the pattern in the data that the data scientist sees or is interested in.

Data scientists use various techniques to identify outliers in the data. One approach is to plot the data points and then visually inspect the resulting plot for points that lie far outside the general distribution. Another is to programmatically eliminate all points that fall outside the data scientist's mathematical control limits (limits based upon the objective or intention of the statistical project).
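The programmatic approach can be sketched in a few lines. This is only an illustration, assuming the control limits are set at the mean plus or minus `k` sample standard deviations; the function name `remove_outliers`, the parameter `k`, and the sample readings are hypothetical, and a real project would choose limits that suit its own objective.

```python
from statistics import mean, stdev


def remove_outliers(values, k=3.0):
    """Split values into those inside and outside mean +/- k standard deviations.

    The limits used here are an assumption for illustration only.
    """
    center = mean(values)
    spread = stdev(values)  # sample standard deviation
    lower, upper = center - k * spread, center + k * spread
    kept = [v for v in values if lower <= v <= upper]
    dropped = [v for v in values if not (lower <= v <= upper)]
    return kept, dropped


# Hypothetical readings with one value far from the rest.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 55.0]
kept, dropped = remove_outliers(readings, k=2.0)
print("kept:", kept)
print("dropped:", dropped)
```

Note that a single extreme value inflates both the mean and the standard deviation, so in small samples a wide limit such as `k=3.0` may fail to flag it; that is why the example uses `k=2.0`.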

Other data cleaning techniques include:

  • Validity checking: This involves applying rules to the data to determine whether it is valid. The rules can be global, as when a data scientist performs a uniqueness check on keys that must be unique within the data pool (for example, social security numbers cannot be duplicated), or case level, as when a combination of field values must be a certain value. The validation may be strict (such as removing records with missing values) or fuzzy (such as correcting values that partially match existing, known values).
  • Enhancement: This is a technique where data is made complete by adding related information. The additional information can be calculated from existing values within the data file, or it can be added from another source. This information could be used for reference, comparison, or contrast, or to show tendencies.
  • Harmonization: With data harmonization, data values are converted, or mapped, to other, more desirable values.
  • Standardization: This involves changing a reference dataset to a new standard, for example, by adopting standard codes.
  • Domain expertise: This involves removing or modifying data values in a data file based upon the data scientist's prior experience or best judgment.
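As a compact illustration of how some of these techniques can look in code, the sketch below applies a strict validity check (dropping records with missing values), a global uniqueness check, a harmonization mapping, and a simple enhancement (a calculated field). The records, the field names, the `STATE_CODES` mapping, and the `clean` function are all hypothetical and exist only for this example.

```python
# Hypothetical records; field names and values are made up for illustration.
records = [
    {"id": "A1", "state": "Calif.", "price": 10.0, "qty": 3},
    {"id": "A2", "state": "CA", "price": 4.5, "qty": None},
    {"id": "A1", "state": "california", "price": 10.0, "qty": 3},
    {"id": "A3", "state": "N.Y.", "price": 2.0, "qty": 5},
]

# Harmonization table: maps known spellings to one preferred code.
STATE_CODES = {
    "calif.": "CA", "california": "CA", "ca": "CA",
    "n.y.": "NY", "ny": "NY",
}


def clean(rows):
    """Return cleaned copies of rows; the input list is not modified."""
    seen_ids = set()
    cleaned_rows = []
    for row in rows:
        # Strict validity check: drop records with any missing value.
        if any(value is None for value in row.values()):
            continue
        # Global validity check: the id must be unique; keep the first one seen.
        if row["id"] in seen_ids:
            continue
        seen_ids.add(row["id"])

        fixed = dict(row)
        # Harmonization: map variant spellings to the preferred code,
        # leaving unrecognized values unchanged.
        key = row["state"].strip().lower()
        fixed["state"] = STATE_CODES.get(key, row["state"])
        # Enhancement: add a field calculated from existing values.
        fixed["total"] = row["price"] * row["qty"]
        cleaned_rows.append(fixed)
    return cleaned_rows


cleaned = clean(records)
for row in cleaned:
    print(row)
```

Here the record with a missing `qty` and the repeated `A1` record are removed, and the two remaining records carry a harmonized `state` code and a calculated `total`. Keeping the first occurrence of a duplicated key is a design choice made for this sketch; a real project might instead flag duplicates for review.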

We will go through an example of each of these techniques in the next sections of this chapter.