Failure to engineer features

Just throwing data at the problem is not enough, no matter how much of it exists. This may seem obvious, but I have personally experienced, and know of others who have run into, situations where business leaders assumed that vast amounts of raw data combined with the supposed magic of machine learning would solve all their problems. This is one of the reasons the first chapter focuses on a process that properly frames the business problem and the leader's expectations.

Unless your data comes from a designed experiment or has already been preprocessed, raw observational data will probably never be in a form that is ready for modeling. In any project, very little time is actually spent building models. The most time-consuming activities are those of feature engineering: gathering, integrating, cleaning, and understanding the data. For the practical exercises in this book, I would estimate that 90 percent of my time was spent coding these activities rather than modeling, and that is in an environment where most of the datasets are small and easily accessed. In my current role, 99 percent of my time in SAS is spent using PROC SQL and only 1 percent with things such as PROC GENMOD, PROC LOGISTIC, or Enterprise Miner.

When it comes to feature engineering, I fall in the camp of those who say there is no substitute for domain expertise. There seems to be another camp that believes machine learning algorithms can indeed automate most of the feature selection/engineering tasks, and several start-ups are out to prove this very thing. (I have had discussions with a couple of individuals who purport that their methodology does exactly that, but the details were closely guarded secrets.) Let's say that you have several hundred candidate features (independent variables). One way to perform automated feature selection is to compute the univariate information value of each feature. However, a feature that appears totally irrelevant in isolation can become important in combination with another feature. To get around this, you can create numerous combinations of the features. This has potential problems of its own, as you may dramatically increase computational time and cost and/or overfit your model. Speaking of overfitting, let's pursue it as the next caveat.
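
To make the point about univariate information value concrete, here is a minimal Python sketch on a toy dataset. Everything in it is an assumption made for illustration: the function name information_value, the smoothing constant, the XOR-style data, and the figure of 300 candidate features are hypothetical and do not come from any particular library or project. It uses the common definition of information value for a binary target, the sum over feature levels of (share of non-events minus share of events) times the log of their ratio.

```python
import math
from collections import Counter
from itertools import product


def information_value(feature, target, smoothing=0.5):
    """Information value of a categorical feature against a 0/1 target.

    IV = sum over levels of (pct_non_event - pct_event) * ln(pct_non_event / pct_event).
    The smoothing count (an illustrative choice) avoids division by zero
    when a level contains only events or only non-events.
    """
    levels = sorted(set(feature))
    non_events = Counter(f for f, t in zip(feature, target) if t == 0)
    events = Counter(f for f, t in zip(feature, target) if t == 1)
    n_non_events = sum(non_events.values()) + smoothing * len(levels)
    n_events = sum(events.values()) + smoothing * len(levels)

    iv = 0.0
    for level in levels:
        pct_non_event = (non_events[level] + smoothing) / n_non_events
        pct_event = (events[level] + smoothing) / n_events
        iv += (pct_non_event - pct_event) * math.log(pct_non_event / pct_event)
    return iv


# Toy data: the target is 1 only when exactly one of x1, x2 is 1 (XOR),
# so neither feature carries any signal on its own.
rows = [(a, b) for a, b in product([0, 1], repeat=2) for _ in range(25)]
x1 = [a for a, _ in rows]
x2 = [b for _, b in rows]
y = [a ^ b for a, b in rows]

iv_x1 = information_value(x1, y)      # univariate: no separation
iv_x2 = information_value(x2, y)      # univariate: no separation
iv_pair = information_value(rows, y)  # the combined feature separates perfectly

# Cost of searching combinations, for a hypothetical 300 candidate features.
n_pairs = math.comb(300, 2)
n_triples = math.comb(300, 3)

print(f"IV(x1)={iv_x1:.3f}  IV(x2)={iv_x2:.3f}  IV(x1,x2)={iv_pair:.3f}")
print(f"pairs={n_pairs:,}  triples={n_triples:,}")
```

On this toy data each feature scores an information value of zero by itself, while the pair treated as a single combined feature scores very high. The last two lines show the other side of the trade-off: with a few hundred candidates, the number of pairs and triples to evaluate grows into the tens of thousands and millions, which is where the computational cost and the risk of overfitting come from.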