- Mastering Machine Learning with R
- Cory Lesmeister
Zero and near-zero variance features
Before moving on to dataset treatment, it's an easy task to eliminate features that have either a single unique value (zero variance) or a very high ratio of the most common value to the next most common value combined with few unique values (near-zero variance). To do this, we'll lean on the caret package and the nearZeroVar() function. We get started by creating a dataframe, using the function's defaults except for saveMetrics = TRUE. We need to make that specification to return the metrics dataframe rather than just the positions of the offending columns:
feature_variance <- caret::nearZeroVar(gettysburg, saveMetrics = TRUE)
The output is quite interesting, so let's peek at the first six rows of what we produced:
head(feature_variance)
The output of the preceding code is as follows:
freqRatio percentUnique zeroVar nzv
type 3.186047 0.5110733 FALSE FALSE
state 1.094118 5.1107325 FALSE FALSE
regiment_or_battery 1.105263 46.8483816 FALSE FALSE
brigade 1.111111 21.1243612 FALSE FALSE
division 1.423077 6.4735945 FALSE FALSE
corps 1.080000 2.3850085 FALSE FALSE
The two key columns are zeroVar and nzv. They indicate whether that feature has zero variance or near-zero variance; TRUE indicates yes and FALSE, not so surprisingly, indicates no. The other two columns need defining:
- freqRatio: This is the ratio of the frequency of the most common value to that of the second most common value.
- percentUnique: This is the number of unique values divided by the total number of samples, multiplied by 100.
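To make those two definitions concrete, here's a minimal sketch that computes both quantities by hand. The vector x is a toy stand-in, since the gettysburg data itself isn't reproduced here:

```r
# Toy vector standing in for a single feature column
x <- c("Infantry", "Infantry", "Infantry", "Artillery", "Cavalry")

# Tabulate the values and sort from most to least common
counts <- sort(table(x), decreasing = TRUE)

# freqRatio: frequency of the most common value over the second most common
freq_ratio <- counts[1] / counts[2]

# percentUnique: unique values as a percentage of the number of samples
percent_unique <- 100 * length(unique(x)) / length(x)

freq_ratio      # 3
percent_unique  # 60
```

With three Infantry values against one Artillery, the ratio is 3, and three unique values out of five samples gives 60 percent unique.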
Let me explain that with the data we're using. For the type feature, the most common value is Infantry, which is roughly three times more common than Artillery. For percentUnique, the lower the percentage, the lower the number of unique values. You can explore this dataframe and adjust the function's freqCut and uniqueCut arguments to determine your relevant cut points. For this example, we'll see whether we have any zero variance features by running this code:
which(feature_variance$zeroVar == 'TRUE')
The output of the preceding code is as follows:
[1] 17
Alas, we see that row 17 (feature 17) has zero variance. Let's see which feature that is:
row.names(feature_variance[17, ])
The output of the preceding code is as follows:
[1] "4.5inch_rifles"
This is quite strange to me. What it means is that I failed to record the number of 4.5-inch rifles for the one Confederate unit that brought them to the battle. An egregious error on my part, discovered using an elegant function from the caret package. Oh well, let's create a new tibble with this feature filtered out for demonstration purposes:
gettysburg_fltrd <- gettysburg[, feature_variance$zeroVar == 'FALSE']
This code eliminates the zero variance feature. To eliminate near-zero variance features as well, just run the code again, substituting feature_variance$nzv for feature_variance$zeroVar.
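As a self-contained sketch of that substitution, here's a toy dataframe (invented for illustration) with one zero variance column and one near-zero variance column. Under caret's defaults, the nzv flag also covers zero variance predictors, so filtering on it removes both kinds in one step:

```r
library(caret)

# Toy data: one constant column, one highly imbalanced column, one healthy one
toy <- data.frame(
  constant   = rep(1, 100),             # zero variance
  imbalanced = c(rep("a", 99), "b"),    # near-zero variance (freqRatio = 99)
  varied     = seq_len(100)             # plenty of unique values
)

fv <- caret::nearZeroVar(toy, saveMetrics = TRUE)

# nzv flags near-zero variance predictors, including zero variance ones
toy_fltrd <- toy[, !fv$nzv, drop = FALSE]
names(toy_fltrd)  # only "varied" should remain
```

The drop = FALSE guards against R collapsing the result to a vector if only one column survives the filter.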
We're now ready to perform the real magic of this process and treat our data.