- Go Machine Learning Projects
- Xuanyi Chew
- 320 words
- 2021-06-10 18:46:34
Encoding categorical data
The trick to encoding categorical data is to expand each categorical variable into multiple columns, each holding a 1 or 0 that represents whether a given category applies. This of course comes with some caveats and subtle issues that must be navigated with care. For the rest of this subsection, I shall use a real categorical variable to explain further.
Consider the LandSlope variable. There are three possible values for LandSlope:
- Gtl
- Mod
- Sev
This is one possible encoding scheme (this is commonly known as one-hot encoding):
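A minimal sketch of that scheme in Go, with one column per LandSlope value. The names `landSlopeLevels` and `oneHot` are hypothetical and chosen only for illustration; they are not taken from the book's code.

```go
package main

import "fmt"

// landSlopeLevels lists the three LandSlope values, in column order.
// (Hypothetical name, used only for this illustration.)
var landSlopeLevels = []string{"Gtl", "Mod", "Sev"}

// oneHot returns one column per level, with a 1 in the column that
// matches value and 0 everywhere else.
func oneHot(value string, levels []string) []float64 {
	row := make([]float64, len(levels))
	for i, level := range levels {
		if value == level {
			row[i] = 1
		}
	}
	return row
}

func main() {
	for _, v := range landSlopeLevels {
		fmt.Println(v, oneHot(v, landSlopeLevels))
	}
	// Prints:
	// Gtl [1 0 0]
	// Mod [0 1 0]
	// Sev [0 0 1]
}
```

Every row contains exactly one 1, so the three columns always sum to 1.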

This would be a terrible encoding scheme. To understand why, we must first understand linear regression by means of ordinary least squares (OLS). Without going into too much detail, the meat of OLS-based linear regression is the following formula (which I am so in love with that I have had multiple T-shirts with the formula printed on):
$$\hat{\beta} = (X^{T}X)^{-1}X^{T}y$$
Here, $X$ is an (m x n) matrix and $y$ is an (m x 1) vector. The multiplications, therefore, are not straightforward multiplications; they are matrix multiplications. When one-hot encoding is used for linear regression, the columns for each variable always sum to 1, so they are linearly dependent on an intercept column (or on the one-hot columns of any other variable). The resulting matrix $X^{T}X$ will therefore typically be singular; in other words, its determinant is 0. The problem with singular matrices is that they cannot be inverted.
So, instead, we have this encoding scheme:
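A minimal sketch of such a scheme in Go, using two columns instead of three. Treating Gtl as the all-zeros baseline is an assumption made for this illustration, as is the function name `encodeLandSlope`.

```go
package main

import "fmt"

// encodeLandSlope expands a LandSlope value into two columns: [isMod, isSev].
// Gtl is assumed to be the baseline, so it (and any unrecognised value)
// is left as the zero value [0 0].
func encodeLandSlope(value string) [2]float64 {
	var row [2]float64 // the zero value of the array is {0, 0}
	switch value {
	case "Mod":
		row[0] = 1
	case "Sev":
		row[1] = 1
	}
	return row
}

func main() {
	for _, v := range []string{"Gtl", "Mod", "Sev", "Unknown"} {
		fmt.Println(v, encodeLandSlope(v))
	}
	// Prints:
	// Gtl [0 0]
	// Mod [1 0]
	// Sev [0 1]
	// Unknown [0 0]
}
```

Because the columns no longer sum to 1 in every row, they are not linearly dependent on an intercept column. A value that was never seen during training also falls back to the all-zeros row without any special handling.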

Here, we see the Go proverb "make the zero value useful" applied in a data science context. Indeed, clever encoding of categorical variables will yield slightly better results when dealing with previously unseen data.
The topic is far too wide to broach here, but if you have categorical data that can be partially ordered, then when exposed to unseen data, simply encode the unseen data to the closest ordered variable value, and the results will be slightly better than encoding to the zero value or using random encoding. We will cover more of this in the later parts of this chapter.