
Data understanding

After enduring the all-important pain of the first step, you can now get busy with the data. The tasks in this process consist of the following:

  1. Collecting the data.
  2. Describing the data.
  3. Exploring the data.
  4. Verifying the data quality.

This step is the classic case of Extract, Transform, Load (ETL). There are some considerations here. You need to make an initial determination of whether the available data is adequate to meet your analytical needs. As you explore the data, visually and otherwise, determine whether the variables are sparse and identify the extent to which data is missing. What you find may drive the learning method that you use and determine whether imputing the missing data is necessary and feasible.
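As a rough illustration of checking the extent of missing data, the following minimal Python sketch computes the fraction of missing values per variable. The records, the column names, and the `missing_fraction` helper are all hypothetical, and `None` is assumed to be the marker for a missing value; real data may encode missingness differently (empty strings, sentinel codes, and so on).

```python
from typing import Any, Dict, List, Optional

# Hypothetical toy records; None is assumed to mark a missing value.
records: List[Dict[str, Optional[Any]]] = [
    {"age": 34, "income": 52000, "region": "north"},
    {"age": None, "income": 61000, "region": "south"},
    {"age": 45, "income": None, "region": None},
    {"age": 29, "income": None, "region": "east"},
]


def missing_fraction(rows: List[Dict[str, Optional[Any]]]) -> Dict[str, float]:
    """Return the fraction of missing (None or absent) values per column."""
    if not rows:
        return {}
    # Collect every column name that appears in any record.
    columns = sorted({key for row in rows for key in row})
    n = len(rows)
    return {
        col: sum(1 for row in rows if row.get(col) is None) / n
        for col in columns
    }


fractions = missing_fraction(records)
for col, frac in fractions.items():
    print(f"{col}: {frac:.0%} missing")
```

A variable with a high missing fraction is a candidate for imputation, for exclusion, or for a learning method that tolerates missing values; which of these applies depends on the problem at hand.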

Verifying the data quality is critical. Take the time to understand who collects the data, how it is collected, and even why it is collected. You are likely to stumble upon incomplete data collection, cases where unintended IT issues led to errors in the data, or planned changes in the business rules. This is especially important with time series, where the business rules for classifying the data often change over time. Finally, it is a good idea to begin documenting any code at this step. As a part of the documentation process, if a data dictionary is not available, save yourself potential heartache and make one.
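If no data dictionary exists, a starter version can be generated from the data itself and then completed by hand. The sketch below is one possible approach, not a prescribed method: the `build_data_dictionary` helper, the sample rows, and the chosen fields (observed types, non-missing count, an example value, an empty description) are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional


def build_data_dictionary(
    rows: List[Dict[str, Optional[Any]]]
) -> List[Dict[str, Any]]:
    """Infer a starter data dictionary with one entry per column."""
    columns = sorted({key for row in rows for key in row})
    dictionary = []
    for col in columns:
        # Keep only the values that are actually present.
        values = [row.get(col) for row in rows if row.get(col) is not None]
        dictionary.append({
            "column": col,
            "types": sorted({type(v).__name__ for v in values}),
            "non_missing": len(values),
            "example": values[0] if values else None,
            "description": "",  # to be filled in by hand
        })
    return dictionary


# Hypothetical sample rows, used only to demonstrate the helper.
sample_rows = [
    {"customer_id": 1, "signup_date": "2015-01-03", "spend": 12.5},
    {"customer_id": 2, "signup_date": None, "spend": 40.0},
]

data_dictionary = build_data_dictionary(sample_rows)
for entry in data_dictionary:
    print(entry)
```

The generated entries only capture what can be observed in the data. The description, the source system, and any business rules that changed over time still have to be recorded by someone who knows how the data was collected.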
