
Preparing and Understanding Data

"We've got to use every piece of data and piece of information, and hopefully that will help us be accurate with our player evaluation. For us, that's our lifeblood."
– Billy Beane, General Manager, Oakland Athletics, subject of the book Moneyball

Research consistently shows that machine learning and data science practitioners spend most of their time manipulating data and preparing it for analysis. Indeed, many find it the most tedious and least enjoyable part of their work. Numerous companies offer solutions to the problem but, in my opinion, results so far have been mixed. Therefore, in this first chapter, I shall endeavor to provide a way of tackling the problem that eases the burden of getting your data ready for machine learning. The methodology introduced in this chapter will serve as the foundation for data preparation and for understanding many of the subsequent chapters. I propose that once you become comfortable with this tried-and-true process, it may very well become your favorite part of machine learning, as it is for me.

The following are the topics that we'll cover in this chapter:

  • Overview 
  • Reading the data
  • Handling duplicate observations
  • Descriptive statistics
  • Exploring categorical variables
  • Handling missing values
  • Zero and near-zero variance features
  • Treating the data
  • Correlation and linearity
