- Feature Engineering Made Easy
- Sinan Ozdemir, Divya Susarla
Feature construction – can we build it?
While in previous chapters we focused heavily on removing features that were not helping us with our machine learning pipelines, this chapter will look at techniques in creating brand new features and placing them correctly within our dataset. These new features will ideally hold new information and generate new patterns that ML pipelines will be able to exploit and use to increase performance.
These created features can come from many places. Oftentimes, we will create new features out of the existing features given to us, by applying transformations to them and placing the resulting vectors alongside their original counterparts. We will also look at adding new features from separate, third-party systems. As an example, if we are attempting to cluster people based on shopping behaviors, then we might benefit from adding in census data that is separate from the corporation and its purchasing data. However, this will present a few problems:
- If the census is aware of 1,700 Jon Does and the corporation only knows of 13, how do we know which of the 1,700 people match up to the 13? This is called entity matching.
- The census data would be quite large, and entity matching would take a very long time.
These problems and more make for a fairly difficult procedure, but one that oftentimes yields a much denser, data-rich dataset.
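The entity-matching problem described above can be sketched with a toy fuzzy-name matcher. The names, the `normalize` helper, and the 0.85 similarity threshold are all illustrative assumptions, not part of the book's text; the standard library's `difflib.SequenceMatcher` stands in for a real entity-resolution system:

```python
from difflib import SequenceMatcher

# Hypothetical toy records: many census entries, few corporate customers.
census_names = ["Jon Doe", "John Doe", "Jane Doe", "Jon Doh"]
corporate_names = ["jon doe", "J. Doe"]

def normalize(name):
    """Lowercase and strip punctuation so trivially different spellings align."""
    return "".join(c for c in name.lower() if c.isalnum() or c.isspace()).strip()

def best_match(name, candidates, threshold=0.85):
    """Return the candidate most similar to name, or None if all fall below threshold."""
    scored = [
        (SequenceMatcher(None, normalize(name), normalize(c)).ratio(), c)
        for c in candidates
    ]
    score, match = max(scored)
    return match if score >= threshold else None

# Map each corporate record to its most likely census counterpart (or None).
matches = {name: best_match(name, census_names) for name in corporate_names}
print(matches)
```

Comparing every corporate record against every census record is quadratic, which is exactly why the second bullet above warns that entity matching over a full census would take a very long time; real systems use blocking or indexing to prune the candidate pairs first.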
In this chapter, we will take some time to talk about the manual creation of features from highly unstructured data. Two big examples are text and images. These pieces of data by themselves are incomprehensible to machine learning and artificial intelligence pipelines, so it is up to us to manually create features that represent the images or pieces of text. As a simple example, imagine that we are building the basics of a self-driving car. To start, we want a model that can take in an image of what the car sees in front of it and decide whether or not it should stop. The raw image is not good enough, because a machine learning algorithm would have no idea what to do with it. We have to manually construct features out of it. Given this raw image, we can split it up in a few ways:
- We could consider the color intensity of each pixel and consider each pixel an attribute:
- For example, if the camera of the car produces images of 2,048 x 1,536 pixels, we would have 3,145,728 columns
- We could consider each row of pixels as an attribute, with the average intensity of that row as its value:
- In this case, there would only be 1,536 columns
- We could project this image into a space where the features represent objects within the image. This is the hardest of the three and would look something like this:

Here, each feature is an object that may or may not be present in the image, and its value is the number of times that object appears. If a model were given this information, it could reasonably decide that it should stop!
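The three representations above can be sketched in a few lines of NumPy. The frame dimensions match the 2,048 x 1,536 camera from the text; the random pixel values and the hard-coded object counts are stand-ins (real object counts would come from a separate detection model):

```python
import numpy as np

# Hypothetical grayscale frame from the car's camera (height x width).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(1536, 2048), dtype=np.uint8)

# Representation 1: every pixel intensity becomes its own feature column.
per_pixel = image.reshape(-1)      # 1,536 * 2,048 = 3,145,728 features

# Representation 2: one feature per row of pixels, valued at that row's mean.
row_means = image.mean(axis=1)     # only 1,536 features

# Representation 3: project into object space; each feature counts an object.
# These counts are illustrative placeholders for a detector's output.
object_counts = {"stop_sign": 1, "pedestrian": 2, "traffic_light": 0}

print(per_pixel.shape, row_means.shape, object_counts)
```

The sketch makes the trade-off concrete: the per-pixel representation is enormous but loses nothing, the row-average representation is tiny but discards most spatial information, and the object representation is the most compact and informative but requires the hardest feature-construction work.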