
Preparing and Understanding Data

"We've got to use every piece of data and piece of information, and hopefully that will help us be accurate with our player evaluation. For us, that's our lifeblood."
– Billy Beane, General Manager, Oakland Athletics, subject of the book Moneyball

Research consistently shows that machine learning and data science practitioners spend most of their time manipulating data and preparing it for analysis. Indeed, many find it the most tedious and least enjoyable part of their work. Numerous companies offer solutions to the problem but, in my opinion, results so far have been mixed. Therefore, in this first chapter, I shall endeavor to provide a way of tackling the problem that eases the burden of getting your data ready for machine learning. The methodology introduced here will serve as the foundation for data preparation, and for understanding many of the subsequent chapters. I propose that once you become comfortable with this tried-and-true process, it may very well become your favorite part of machine learning, as it is for me.

The following are the topics that we'll cover in this chapter:

  • Overview 
  • Reading the data
  • Handling duplicate observations
  • Descriptive statistics
  • Exploring categorical variables
  • Handling missing values
  • Zero and near-zero variance features
  • Treating the data
  • Correlation and linearity
