- Python Machine Learning By Example
- Yuxi (Hayden) Liu
Preprocessing, exploration, and feature engineering
Data mining, a buzzword in the 1990s, is the predecessor of data science (the science of data). One of the methodologies popular in the data mining community is called the Cross-Industry Standard Process for Data Mining (CRISP-DM). CRISP-DM was created in 1996 and is still used today. I'm not endorsing CRISP-DM; however, I do like its general framework.
CRISP-DM consists of the following phases, which aren't mutually exclusive and can occur in parallel:
- Business understanding: This phase is often taken care of by specialized domain experts. Usually, we have a business person formulate a business problem, such as selling more units of a certain product.
- Data understanding: This is also a phase that may require input from domain experts; however, a technical specialist often needs to get involved more than in the business understanding phase. The domain expert may be proficient with spreadsheet programs but have trouble with complicated data. In this book, this phase is usually referred to as exploration.
- Data preparation: This is also a phase where a domain expert with only Microsoft Excel knowledge may not be able to help you. This is the phase where we create our training and test datasets. In this book, this phase is usually referred to as preprocessing.
- Modeling: This is the phase most people associate with machine learning. In this phase, we formulate a model and fit our data.
- Evaluation: In this phase, we evaluate how well the model fits the data to check whether we were able to solve our business problem.
- Deployment: This phase usually involves setting up the system in a production environment (it's considered good practice to have a separate production system). Typically, this is done by a specialized team.
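As a small illustration of the data preparation phase described above, here is a minimal sketch of splitting a dataset into training and test sets. The feature matrix and labels are synthetic placeholders, not data from the book:

```python
# A minimal sketch of the data preparation phase: creating training and
# test sets. X and y below are invented toy data for illustration only.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.array([0, 1] * 5)          # binary labels

# Hold out 30% of the samples for evaluation; fix the seed for repeatability
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```

Fixing `random_state` makes the split reproducible, which matters when you later compare models in the evaluation phase.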
When we learn, we require high-quality learning material. We can't learn from gibberish, so we automatically ignore anything that doesn't make sense. A machine learning system isn't able to recognize gibberish, so we need to help it by cleaning the input data. It's often claimed that cleaning the data forms a large part of machine learning. Sometimes cleaning is already done for us, but you shouldn't count on it.
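One common cleaning step is handling missing values. The following sketch uses pandas with an invented two-column DataFrame; median imputation is just one of several reasonable strategies:

```python
# A hedged sketch of basic cleaning: imputing missing numeric values.
# The DataFrame and column names are invented for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, np.nan, 40, 31],
    'income': [50000, 62000, np.nan, 58000],
})

# Fill numeric gaps with each column's median rather than dropping rows
df_clean = df.fillna(df.median())
print(df_clean.isna().sum().sum())  # 0 missing values remain
```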
To decide how to clean the data, we need to be familiar with the data. There are some projects that try to automatically explore the data and do something intelligent, such as produce a report. For now, unfortunately, we don't have a solid solution, so you need to do some manual work.
We can do two things, which aren't mutually exclusive: first, scan the data and second, visualize the data. This also depends on the type of data we're dealing with—whether we have a grid of numbers, images, audio, text, or something else. In the end, a grid of numbers is the most convenient form, and we'll always work toward having numerical features. Let's pretend that we have a table of numbers in the rest of this section.
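Scanning a table of numbers can start with a few one-liners in pandas; the toy DataFrame below stands in for a real dataset:

```python
# A first scan of a numeric table with pandas. The columns are invented
# placeholders for whatever grid of numbers you are working with.
import pandas as pd

df = pd.DataFrame({
    'temperature': [21.5, 19.0, 23.4, 22.1],
    'humidity': [0.45, 0.50, 0.39, 0.41],
})

print(df.head())      # eyeball the first rows
print(df.describe())  # summary statistics: mean, std, min/max, quartiles
print(df.dtypes)      # confirm every column is numeric
```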
We want to know whether features have missing values, how the values are distributed, and what type of features we have. Values can approximately follow a normal distribution, a binomial distribution, a Poisson distribution, or another distribution altogether. Features can be binary: either yes or no, positive or negative, and so on. They can also be categorical: pertaining to a category, for instance, continents (Africa, Asia, Europe, Latin America, North America, and so on). Categorical variables can also be ordered, for instance, high, medium, and low. Features can also be quantitative, for example, temperature in degrees or price in dollars.
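The checks above can be sketched with pandas, which distinguishes unordered categoricals, ordered categoricals, binary, and quantitative columns; the mixed-type DataFrame here is invented for illustration:

```python
# A sketch of inspecting missing values and feature types.
# All column names and values are invented examples.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'continent': pd.Categorical(['Asia', 'Europe', 'Africa', 'Asia']),
    'rating': pd.Categorical(['low', 'high', 'medium', 'low'],
                             categories=['low', 'medium', 'high'],
                             ordered=True),         # ordered categorical
    'price': [9.99, np.nan, 14.50, 7.25],           # quantitative, one gap
    'in_stock': [True, False, True, True],          # binary
})

print(df.isna().sum())           # missing values per column
print(df.dtypes)                 # feature types
print(df['rating'].cat.ordered)  # True: the categories have an order
```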
Feature engineering is the process of creating or improving features. It's more of a dark art than a science. Features are often created based on common sense, domain knowledge, or prior experience. There are certain common techniques for feature creation; however, there's no guarantee that creating new features will improve your results. We're sometimes able to use the clusters found by unsupervised learning as extra features. Deep neural networks are often able to derive features automatically. We'll briefly look at several techniques such as polynomial features, power transformations, and binning, as appetizers in this chapter.
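The three techniques just named can each be sketched in a few lines with scikit-learn; the single-feature toy array here is invented, and the parameter choices (degree, transform method, bin strategy) are illustrative assumptions, not recommendations:

```python
# Appetizer sketches of polynomial features, power transformation, and
# binning. The toy one-column array X is invented for illustration.
import numpy as np
from sklearn.preprocessing import (KBinsDiscretizer, PolynomialFeatures,
                                   PowerTransformer)

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Polynomial features: expand x into [1, x, x^2]
poly = PolynomialFeatures(degree=2).fit_transform(X)

# Power transformation: reshape the feature toward a Gaussian
power = PowerTransformer(method='yeo-johnson').fit_transform(X)

# Binning: discretize the continuous feature into 2 equal-width bins
bins = KBinsDiscretizer(n_bins=2, encode='ordinal',
                        strategy='uniform').fit_transform(X)

print(poly.shape)    # (4, 3): bias, x, x^2
print(bins.ravel())  # [0. 0. 1. 1.]
```

Each transformer follows the same `fit_transform` pattern, so they are easy to chain in a scikit-learn `Pipeline` later.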