Janitorial work

A large part of data science work is cleanup. In productionized systems, the data would typically be fetched directly from a database, already relatively clean (high-quality production data science work requires a database of clean data). However, we're not in production mode yet; we're still in the model-building phase. For now, it helps to imagine writing a program solely for cleaning data.

Let's look at our requirements. In our data, each column is a variable: most are independent variables, except for the last column, which is the dependent variable. Some variables are categorical, and some are continuous. Our task is to write a function that converts the data from its current form, [][]string, to [][]float64.

To do that, every value has to be converted into a float64. For the continuous variables, this is an easy task: simply parse the string into a float. There are oddities that need to be handled, which I hope you spotted when you opened the file in a spreadsheet. The main pain, however, is in converting categorical data to float64.
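As a rough illustration of the continuous case, the following Go sketch parses a single cell with strconv.ParseFloat. The helper name parseContinuous is hypothetical, and treating empty cells and the literal NA as missing values is an assumption made for this example; the oddities in the actual file may call for different handling.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseContinuous converts one raw cell into a float64.
// ok reports whether the cell held a usable number.
// ASSUMPTION: empty cells and the literal "NA" mark missing values;
// adjust this to whatever oddities the real file contains.
func parseContinuous(cell string) (val float64, ok bool, err error) {
	s := strings.TrimSpace(cell)
	if s == "" || s == "NA" {
		return 0, false, nil // missing value, not an error
	}
	f, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return 0, false, err // malformed number
	}
	return f, true, nil
}

func main() {
	// Hypothetical sample cells, not taken from the dataset.
	for _, cell := range []string{"8450", " 65.0 ", "NA", "abc"} {
		v, ok, err := parseContinuous(cell)
		fmt.Printf("%q -> value=%v ok=%v err=%v\n", cell, v, ok, err)
	}
}
```

Returning a separate ok flag keeps "missing" distinct from "malformed", so the caller can decide later whether to drop the row or impute a value.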

Fortunately for us, people much smarter than us figured this out decades ago. There exists an encoding scheme that allows categorical data to play nicely with linear regression algorithms.
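The text does not name the scheme at this point. One widely used option for regression is one-hot (dummy variable) encoding, in which each category level becomes its own column of 0s and 1s. The following is a minimal sketch assuming that scheme; oneHot is a hypothetical helper, and the sample values are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// oneHot encodes a column of categorical strings as rows of 0s and 1s.
// It returns the sorted list of distinct levels and, for each input value,
// a row with a 1 in the position of its level and 0 everywhere else.
// ASSUMPTION: one-hot encoding is the intended scheme; the source text
// only says that "an encoding scheme" exists.
func oneHot(column []string) (levels []string, encoded [][]float64) {
	// Collect the distinct levels.
	seen := make(map[string]bool)
	for _, v := range column {
		if !seen[v] {
			seen[v] = true
			levels = append(levels, v)
		}
	}
	sort.Strings(levels) // deterministic column order

	// Map each level to its column index.
	index := make(map[string]int, len(levels))
	for i, l := range levels {
		index[l] = i
	}

	// Build one row per input value.
	encoded = make([][]float64, len(column))
	for i, v := range column {
		row := make([]float64, len(levels))
		row[index[v]] = 1
		encoded[i] = row
	}
	return levels, encoded
}

func main() {
	// Hypothetical categorical column.
	levels, encoded := oneHot([]string{"Pave", "Grvl", "Pave"})
	fmt.Println(levels)
	fmt.Println(encoded)
}
```

When the regression includes an intercept term, one of the resulting columns is usually dropped, because the full set of indicator columns always sums to 1 and is therefore perfectly collinear with the intercept.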
