Model training and evaluation

As mentioned previously, we'll be predicting customer satisfaction. The data is based on a former online competition. I've taken the training portion of the data and cleaned it up for our use. 

A full description of the contest and the data is available at the following link: https://www.kaggle.com/c/santander-customer-satisfaction/data.

This is an excellent dataset for a classification problem for many reasons. Like so much customer data, it's very messy, especially before I removed a bunch of useless features (there were something like four dozen zero-variance features). As discussed in the prior two chapters, I addressed missing values, linear dependencies, and highly correlated pairs. I also found the feature names lengthy and uninformative, so I recoded them V1 through V142. The resulting data deals with something that's usually difficult to measure: satisfaction. Because of proprietary methods, no description or definition of satisfaction is given.
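To make the cleanup steps concrete, here is a minimal sketch of two of them: dropping zero-variance features and recoding the remaining names sequentially. It's written in plain Python purely for illustration and is not the code used to prepare the dataset; the helper names and the toy columns are hypothetical.

```python
from statistics import pvariance


def drop_zero_variance(columns):
    """Keep only columns whose values vary.

    `columns` maps a feature name to a list of numeric values.
    """
    return {name: values for name, values in columns.items()
            if pvariance(values) > 0}


def rename_sequentially(columns, prefix="V"):
    """Replace feature names with prefix1, prefix2, ... in their current order."""
    return {f"{prefix}{i}": values
            for i, values in enumerate(columns.values(), start=1)}


# Hypothetical toy data, not taken from the contest dataset.
raw = {
    "long_feature_name_a": [1, 2, 3, 4],
    "long_feature_name_b": [0, 0, 0, 0],  # constant, so zero variance
    "long_feature_name_c": [5, 5, 6, 7],
}

cleaned = rename_sequentially(drop_zero_variance(raw))
print(list(cleaned))  # ['V1', 'V2']
```

A constant column carries no information for separating the classes, which is why it can be removed without loss before modeling.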

Having worked previously in the world of banking, I can assure you that measuring satisfaction is a somewhat challenging proposition, fraught with measurement error. As such, there's quite a bit of noise relative to the signal, and you can expect model performance to be rather poor. Also, the outcome of interest, customer dissatisfaction, is relatively rare compared to customers who aren't dissatisfied. The classic problem is that you end up with quite a few false positives when trying to classify the minority label.
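A short worked example shows why a rare outcome produces so many false positives. Every figure below is a hypothetical assumption chosen for illustration (10,000 customers, 4% dissatisfied, a classifier with 70% sensitivity and 90% specificity); none of them come from the contest data. The sketch is in Python and uses only arithmetic.

```python
def confusion_counts(n, prevalence, sensitivity, specificity):
    """Return expected (TP, FP, TN, FN) counts for a binary classifier."""
    positives = round(n * prevalence)       # dissatisfied customers
    negatives = n - positives               # everyone else
    tp = round(positives * sensitivity)     # dissatisfied, correctly flagged
    fn = positives - tp                     # dissatisfied, missed
    tn = round(negatives * specificity)     # not dissatisfied, correctly passed
    fp = negatives - tn                     # not dissatisfied, wrongly flagged
    return tp, fp, tn, fn


# Hypothetical inputs, for illustration only.
tp, fp, tn, fn = confusion_counts(
    n=10_000, prevalence=0.04, sensitivity=0.70, specificity=0.90
)

precision = tp / (tp + fp)
print(tp, fp, tn, fn)        # 280 960 8640 120
print(f"{precision:.3f}")    # 0.226
```

Under these assumptions, only about 23% of the customers flagged as dissatisfied actually are, even though the classifier is right about 90% of the majority class. The false positives outnumber the true positives simply because the majority class is so much larger.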

As always, you can find the data on GitHub: https://github.com/datameister66/MMLR3rd/blob/master/santander_prepd.RData.

So, let's start by loading the data and training a logistic regression model.