Contextual data issues

Many of the previously mentioned data issues can be automatically detected and even corrected. They may have been caused by user entry errors, by corruption in transmission or storage, or by different definitions or understandings of similar entities in different data sources. In data science, however, there is more to think about: some issues only become visible when the data is considered in context.

During data cleaning, a data scientist will attempt to identify patterns within the data, based on a hypothesis or assumption about the context of the data and its intended purpose. In other words, any data that the data scientist determines to be either obviously inconsistent with the assumption or objective of the data, or obviously inaccurate, will then be addressed. This process relies on the data scientist's judgment and their ability to determine which points are valid and which are not.
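As a rough illustration of this kind of judgment call, the following Python sketch flags records rather than deleting them, and records the reason for each flag. Everything in it is hypothetical: the `age` field, the assumed study range of 18 to 65, the plausibility bounds of 0 to 120, and the names `Flagged` and `split_by_assumption` are invented for the example and do not come from the text.

```python
from dataclasses import dataclass


@dataclass
class Flagged:
    """A record set aside during cleaning, with the reason it was flagged."""
    record: dict
    reason: str


def split_by_assumption(records, low=18, high=65):
    """Separate records that fit the working assumption from those that do not.

    Assumption (hypothetical): the analysis concerns people aged low..high.
    Ages outside 0..120 are treated as obviously inaccurate.
    """
    kept, flagged = [], []
    for rec in records:
        age = rec.get("age")
        if age is None:
            flagged.append(Flagged(rec, "missing age"))
        elif age < 0 or age > 120:
            flagged.append(Flagged(rec, "obviously inaccurate"))
        elif not (low <= age <= high):
            # Possibly valid data that simply falls outside the assumed context.
            flagged.append(Flagged(rec, "outside assumed context"))
        else:
            kept.append(rec)
    return kept, flagged


# Made-up sample records, for illustration only.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -5},
    {"id": 3, "age": 72},
    {"id": 4, "age": 51},
]
kept, flagged = split_by_assumption(records)
```

Keeping the flagged records together with a reason, instead of dropping them, makes it possible to review the decision later if the assumption turns out to be too narrow.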

When relying on human judgment, there is always a chance that valid data points, not sufficiently accounted for in the data scientist's hypothesis or assumption, are overlooked or incorrectly addressed. It is therefore common practice to maintain appropriately labeled versions of your cleansed data, so that earlier cleaning decisions can be revisited.
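One possible way to keep labeled versions is sketched below in Python, using only the standard library. This is a minimal in-memory illustration under stated assumptions, not a recommended tool: the class `VersionedDataset`, its methods, the version labels, and the sample rows are all hypothetical.

```python
import copy
from datetime import datetime, timezone


class VersionedDataset:
    """Keeps every cleaning stage under its own label (hypothetical helper)."""

    def __init__(self, raw_rows):
        self._versions = {}
        self.save("raw", raw_rows, note="original data, untouched")

    def save(self, label, rows, note):
        """Store a deep copy of rows under a unique label with a short note."""
        if label in self._versions:
            raise ValueError(f"version {label!r} already exists")
        self._versions[label] = {
            "rows": copy.deepcopy(rows),
            "note": note,
            "saved_at": datetime.now(timezone.utc).isoformat(),
        }

    def get(self, label):
        """Return a copy so callers cannot alter a stored version."""
        return copy.deepcopy(self._versions[label]["rows"])

    def labels(self):
        """List labels in the order the versions were saved."""
        return list(self._versions)


# Made-up sample rows, for illustration only.
raw = [{"id": 1, "age": 34}, {"id": 2, "age": -5}, {"id": 3, "age": 72}]
store = VersionedDataset(raw)

# One cleaning step: drop ages outside an assumed plausible range of 0..120.
cleaned = [r for r in store.get("raw") if 0 <= r["age"] <= 120]
store.save(
    "v1_drop_impossible_ages",
    cleaned,
    note="removed ages outside 0..120 (assumed plausibility bounds)",
)
```

Because the raw rows stay available under their own label, a point that was removed under one assumption can be recovered if that assumption is later revised.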