
Zero and near-zero variance features

Before moving on to dataset treatment, it's easy to eliminate features that have either a single unique value (zero variance) or few unique values combined with a high ratio of the most common value to the next most common value (near-zero variance). To do this, we'll lean on the caret package and its nearZeroVar() function. We get started by creating a dataframe, using the function's defaults except for saveMetrics = TRUE. We need that specification so that the function returns a dataframe of per-feature metrics rather than just the positions of the flagged columns:

feature_variance <- caret::nearZeroVar(gettysburg, saveMetrics = TRUE)
To understand the default settings of the nearZeroVar() function and determine how to customize it to your needs, use the R help function by typing ?nearZeroVar in the Console.
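As a rough illustration of what customizing looks like, here is a minimal sketch on a made-up dataframe. The toy data and its column names (mostly_zero, spread) are assumptions for demonstration only and have nothing to do with the gettysburg data; freqCut and uniqueCut are the function's documented threshold arguments, with defaults of 95/5 and 10:

```r
# Hypothetical toy data: one lopsided column and one well-spread column
toy <- data.frame(
  mostly_zero = c(rep(0, 90), rep(1, 10)),  # most common value is 9x the next
  spread      = 1:100                       # every value is unique
)

# Defaults spelled out explicitly: freqCut = 95/5 (i.e. 19), uniqueCut = 10
default_flags <- caret::nearZeroVar(toy, freqCut = 95/5, uniqueCut = 10,
                                    saveMetrics = TRUE)

# A stricter frequency cut-off: flag anything more lopsided than 80/20
strict_flags <- caret::nearZeroVar(toy, freqCut = 80/20, uniqueCut = 10,
                                   saveMetrics = TRUE)

default_flags["mostly_zero", "nzv"]  # not flagged under the defaults
strict_flags["mostly_zero", "nzv"]   # flagged under the stricter cut-off
```

A feature is flagged as near-zero variance only when its freqRatio exceeds freqCut and its percentUnique is at or below uniqueCut, so lowering freqCut flags more features.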

The feature_variance dataframe is quite interesting, so let's peek at its first six rows:

head(feature_variance)

The output of the preceding code is as follows:

                    freqRatio percentUnique zeroVar   nzv
type                 3.186047     0.5110733   FALSE FALSE
state                1.094118     5.1107325   FALSE FALSE
regiment_or_battery  1.105263    46.8483816   FALSE FALSE
brigade              1.111111    21.1243612   FALSE FALSE
division             1.423077     6.4735945   FALSE FALSE
corps                1.080000     2.3850085   FALSE FALSE

The two key columns are zeroVar and nzv. They indicate whether a feature has zero variance or near-zero variance: TRUE indicates yes and FALSE, not so surprisingly, indicates no. The other two columns need defining:

  • freqRatio: This is the ratio of the frequency of the most common value to the frequency of the second most common value.
  • percentUnique: This is the number of unique values divided by the total number of samples, multiplied by 100.
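The two definitions above can be sketched by hand in base R. The vector below is a made-up stand-in for a categorical column (its values and counts are assumptions, not the real gettysburg data), and the calculation mirrors the definitions rather than reproducing caret's internal code:

```r
# Hypothetical toy column: 6 Infantry, 2 Artillery, 2 Cavalry
unit_type <- c(rep("Infantry", 6), rep("Artillery", 2), rep("Cavalry", 2))

# Count each value and order from most to least common
counts <- sort(table(unit_type), decreasing = TRUE)

# freqRatio: most common count over second most common count
freq_ratio <- as.numeric(counts[1]) / as.numeric(counts[2])

# percentUnique: unique values over total samples, times 100
percent_unique <- 100 * length(unique(unit_type)) / length(unit_type)

freq_ratio      # 6 / 2 = 3
percent_unique  # 3 unique values in 10 samples = 30
```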

Let me explain that with the data we're using. For the type feature, the most common value is Infantry, which is roughly three times more common than Artillery, hence the freqRatio of about 3.19. For percentUnique, the lower the percentage, the fewer unique values there are relative to the number of observations. You can explore this dataframe and adjust the function to determine your relevant cut points. For this example, we'll see whether we have any zero variance features by running this code:

which(feature_variance$zeroVar)

The output of the preceding code is as follows: 

[1] 17

Alas, we see that row 17 (feature 17) has zero variance. Let's see what that could be:

row.names(feature_variance[17, ])

The output of the preceding code is as follows:

[1] "4.5inch_rifles"

This is quite strange to me. What it means is that I failed to record the number of these artillery pieces in the one Confederate unit that brought them to the battle. That's an egregious error on my part, discovered using an elegant function from the caret package. Oh well, let's create a new tibble with this feature filtered out for demonstration purposes:

gettysburg_fltrd <- gettysburg[, !feature_variance$zeroVar]

This code eliminates the zero variance feature. If we also wanted to eliminate near-zero variance features, we'd just run the same code, substituting feature_variance$nzv for feature_variance$zeroVar.
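Here is a minimal sketch of that substitution on a made-up dataframe, since the result depends on the data at hand. The toy columns (constant, rare, varied) are assumptions for illustration; the point it shows is that nzv is also TRUE for zero variance features, so filtering on nzv removes both kinds in one step:

```r
# Hypothetical toy data with one feature of each kind
toy <- data.frame(
  constant = rep(0, 100),            # one unique value: zero variance
  rare     = c(rep(0, 98), 1, 2),    # dominated by one value: near-zero variance
  varied   = 1:100                   # plenty of variation
)

metrics <- caret::nearZeroVar(toy, saveMetrics = TRUE)

# Keep only the columns that are not flagged as near-zero variance;
# drop = FALSE keeps the result a dataframe even if one column remains
toy_fltrd <- toy[, !metrics$nzv, drop = FALSE]

names(toy_fltrd)  # only "varied" survives
```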

We're now ready to perform the real magic of this process and treat our data.
