Weight of evidence and information value

I stumbled into this method several years ago during consulting work. The team I was on worked with very large datasets and was constrained to using SAS statistical software. It was also a critical requirement that the customer teams could easily interpret the models.

With hundreds, even thousands, of possible features to consider, I was privileged to learn the use of weight of evidence (WOE) and information value (IV) from a former rocket scientist. That's right: a person who actually worked on manned space flight. I became an eager pupil. Now, this method isn't a panacea. First of all, it's univariate, so features that it throws out can become significant in a multivariate model, and vice versa. Still, it provides a nice complement to other methods, and you should keep it in your modeling toolbox. I believe it had its origins in the world of credit scoring, so if you work in the financial industry, you may already be familiar with it.

First, let's look at the formula for WOE:
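In a form consistent with the R calculation that follows, the WOE for bin i is the natural log of the bin's share of all events divided by its share of all non-events. The symbols below are notation assumed here for illustration; note that some references invert the ratio, which flips the sign.

```latex
\mathrm{WOE}_i = \ln\!\left(\frac{\%\,\mathrm{events}_i}{\%\,\mathrm{nonevents}_i}\right),
\qquad
\%\,\mathrm{events}_i = \frac{\mathrm{events}_i}{\sum_j \mathrm{events}_j},
\qquad
\%\,\mathrm{nonevents}_i = \frac{\mathrm{nonevents}_i}{\sum_j \mathrm{nonevents}_j}
```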

The WOE serves as a component in the IV. For numeric features, you would bin your data and then calculate the WOE separately for each bin. For categorical features, including one-hot encoded ones, treat each level as its own bin and calculate the WOE for each. Let's take an example and demonstrate in R.
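As a rough sketch of the numeric case, the following bins a feature into quartiles and computes the WOE per bin. The data (`x`, `y`) is hypothetical and constructed only so that the counts are easy to check by hand, and quartile binning is just one possible choice, not a recommendation from the text.

```r
# Hypothetical data: a numeric feature x and a binary outcome y (1 = event),
# built so that the event rate rises from one quartile of x to the next
x <- 1:200
y <- as.integer(x %% 10 < ceiling(x / 50))

# Bin x into quartiles; include.lowest keeps the minimum value in the first bin
breaks <- quantile(x, probs = seq(0, 1, by = 0.25))
bins <- cut(x, breaks = breaks, include.lowest = TRUE)

# Count events and non-events in each bin
counts <- table(bins, y)
events <- counts[, "1"]
nonEvents <- counts[, "0"]

# Each bin's share of all events and of all non-events, then the WOE per bin
percentE <- events / sum(events)
percentNE <- nonEvents / sum(nonEvents)
binWOE <- log(percentE / percentNE)
binWOE
```

With this constructed data the four bins hold 5, 10, 15, and 20 events out of 50 observations each, so the WOE climbs steadily from the lowest bin to the highest.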

Our data consists of one input feature coded as 0 or 1, so we'll have just two bins. For each bin, we calculate our WOE. In bin 1, where the feature equals 0, there are 4 events and 96 non-events. Conversely, in bin 2, where the feature equals 1, there are 12 events and 88 non-events. Let's see how to calculate the WOE for each bin:

> bin1events <- 4
> bin1nonEvents <- 96
> bin2events <- 12
> bin2nonEvents <- 88
> totalEvents <- bin1events + bin2events
> totalNonEvents <- bin1nonEvents + bin2nonEvents
# Now calculate each bin's share of all events and of all non-events
> bin1percentE <- bin1events / totalEvents
> bin1percentNE <- bin1nonEvents / totalNonEvents
> bin2percentE <- bin2events / totalEvents
> bin2percentNE <- bin2nonEvents / totalNonEvents
# It's now possible to produce WOE: the natural log of the ratio of the two shares
> bin1WOE <- log(bin1percentE / bin1percentNE)
> bin2WOE <- log(bin2percentE / bin2percentNE)

Having completed this, you end up with a WOE of roughly -0.74 for bin 1 and 0.45 for bin 2. We now use these values to calculate the IV per bin, then sum the per-bin values to arrive at an overall IV for the feature. The formula is as follows:
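Written to match the R calculation in this example, the IV sums, over all bins, the difference between each bin's event share and non-event share multiplied by that bin's WOE. The symbols are notation assumed here for illustration.

```latex
\mathrm{IV} = \sum_i \left(\%\,\mathrm{events}_i - \%\,\mathrm{nonevents}_i\right) \times \mathrm{WOE}_i
```

Because the difference and the WOE always share the same sign, every bin contributes a non-negative amount to the total.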

Taking our current example, this is our feature IV:

> bin1IV <- (bin1percentE - bin1percentNE) * bin1WOE
> bin2IV <- (bin2percentE - bin2percentNE) * bin2WOE
> bin1IV + bin2IV
[1] 0.3221803
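The same steps generalize to any number of bins. Here is a minimal sketch, assuming you already have per-bin event and non-event counts; `woeIV` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: WOE and IV from per-bin event and non-event counts
woeIV <- function(events, nonEvents) {
  percentE <- events / sum(events)          # each bin's share of all events
  percentNE <- nonEvents / sum(nonEvents)   # each bin's share of all non-events
  woe <- log(percentE / percentNE)          # WOE per bin
  binIV <- (percentE - percentNE) * woe     # IV contribution per bin
  list(woe = woe, binIV = binIV, iv = sum(binIV))
}

# Reproduce the two-bin example from the text
result <- woeIV(events = c(4, 12), nonEvents = c(96, 88))
round(result$woe, 2)  # -0.74  0.45
round(result$iv, 3)   # 0.322
```

One caveat of this sketch: a bin with zero events or zero non-events produces an infinite WOE, so in practice such bins are usually merged with a neighbor or otherwise adjusted before the calculation.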

The IV for the feature is 0.322. Now, what does that mean? The short answer is that it depends. A rule-of-thumb heuristic helps decide what IV threshold makes sense for inclusion in model development:

  • < 0.02 not predictive
  • 0.02 to 0.1 weak
  • 0.1 to 0.3 medium
  • 0.3 to 0.5 strong
  • > 0.5 suspicious
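The bands above can be applied to a vector of IV values with `cut()`. This is only a sketch: `ivStrength` is a hypothetical helper, and since the list doesn't say which band a boundary value such as exactly 0.1 belongs to, assigning it to the higher band (`right = FALSE`) is an assumption.

```r
# Hypothetical helper: label IV values using the heuristic bands
ivStrength <- function(iv) {
  cut(iv,
      breaks = c(-Inf, 0.02, 0.1, 0.3, 0.5, Inf),
      labels = c("not predictive", "weak", "medium", "strong", "suspicious"),
      right = FALSE)  # assumption: a boundary value falls into the higher band
}

as.character(ivStrength(c(0.01, 0.05, 0.322, 0.8)))
# "not predictive" "weak" "strong" "suspicious"
```

By this reading, the feature from our example, with an IV of 0.322, lands in the strong band.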

The example that follows will present us with some interesting decisions about where to draw the line.
