- Mastering Machine Learning with R(Second Edition)
- Cory Lesmeister
Data understanding
After enduring the all-important pain of the first step, you can now get busy with the data. The tasks in this process consist of the following:
- Collecting the data.
- Describing the data.
- Exploring the data.
- Verifying the data quality.
This step is the classic case of Extract, Transform, Load (ETL), and there are several considerations. First, make an initial determination of whether the available data is adequate to meet your analytical needs. As you explore the data, visually and otherwise, determine whether the variables are sparse and identify the extent to which data may be missing. This may drive the learning method that you use and/or determine whether imputation of the missing data is necessary and feasible.
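As a minimal sketch of that exploration in base R (the data frame `df` and its columns are hypothetical, purely for illustration), you might quantify missingness per variable before deciding on a learning method or an imputation strategy:

```r
# Hypothetical data frame with some missing values
df <- data.frame(
  age    = c(25, NA, 41, 33, NA),
  income = c(52000, 61000, NA, 45000, 58000),
  region = c("N", "S", "S", NA, "N"),
  stringsAsFactors = FALSE
)

# Proportion of missing values per variable
miss_rate <- colMeans(is.na(df))
print(miss_rate)

# Flag variables whose missingness may make imputation questionable;
# the 30% threshold here is an arbitrary, assumed cutoff
high_miss <- names(miss_rate[miss_rate > 0.3])
print(high_miss)
```

A quick check like this often decides the question in the text: a variable missing a few percent of its values is a candidate for imputation, while one that is mostly empty may need to be dropped or handled by a method tolerant of sparsity.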
Verifying the data quality is critical. Take the time to understand who collects the data, how it is collected, and even why it is collected. You may well stumble upon incomplete data collection, cases where unintended IT issues led to errors in the data, or planned changes in the business rules. This is especially important with time series, where the business rules governing how the data is classified often change over time. Finally, it is a good idea to begin documenting your code at this step. As part of the documentation process, if a data dictionary is not available, save yourself potential heartache and make one.
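If no data dictionary exists, a starting point can be generated directly from the data itself. The sketch below (again using a hypothetical data frame `df`) builds a minimal dictionary in base R with each variable's name, class, missing-value count, and an example value; descriptions of business meaning would still have to be filled in by hand:

```r
# Hypothetical data frame to document
df <- data.frame(
  age    = c(25, NA, 41),
  region = c("N", "S", "S"),
  stringsAsFactors = FALSE
)

# Assemble a minimal data dictionary: one row per variable
data_dict <- data.frame(
  variable  = names(df),
  class     = sapply(df, class),
  n_missing = sapply(df, function(x) sum(is.na(x))),
  example   = sapply(df, function(x) as.character(x[which(!is.na(x))[1]])),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(data_dict)
```

Even this bare-bones table forces the questions the text raises: what each field means, where it comes from, and whether its missingness is expected or a symptom of a collection problem.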