
Summary

The main learning outcomes of this chapter are summarized as follows:

  • Importing a dataset using pandas: read_csv and its variations, reading a dataset with Python's built-in open function, reading a file in chunks with open, reading directly from a URL, specifying the column names from a list, changing the delimiter of a dataset, and so on.
  • Basic exploratory analysis of data: observing a preview of the data, its shape, column names, column types, and summary statistics for numerical variables.
  • Handling missing values: how missing values arise, why it is important to treat them properly, how to treat them by deletion or imputation, and various methods of imputing data.
  • Creating dummy variables: encoding categorical variables as dummy variables so that they can be used in predictive models.
  • Basic plotting: scatter plots, histograms, and boxplots; their meaning and relevance; and how they are plotted.
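As a compact recap, the non-plotting steps above can be sketched in a few lines of pandas. This is only an illustrative sketch: the inline data, the column names (name, age, gender), and the choice of mean imputation are assumptions made for the example, not the datasets or the specific choices used in the chapter.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data with no header row and one missing age.
raw = "Alice;34;F\nBob;;M\nCara;29;F\n"

# Import: custom delimiter, column names supplied from a list.
df = pd.read_csv(
    io.StringIO(raw), sep=";", header=None, names=["name", "age", "gender"]
)

# Basic exploration: preview, shape, column types, summary statistics.
print(df.head())
shape_before = df.shape
print(shape_before)
print(df.dtypes)
print(df.describe())

# Missing values: impute the missing age with the column mean
# (one of several possible imputation methods).
mean_age = df["age"].mean()
df["age"] = df["age"].fillna(mean_age)

# Dummy variables: one indicator column per gender category.
dummies = pd.get_dummies(df["gender"], prefix="gender")
df = pd.concat([df, dummies], axis=1)
print(df)
```

Dropping the incomplete rows with df.dropna() instead of imputing would be the deletion approach; which one is appropriate depends on how much data is missing and why.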

This chapter is a head start on our journey to explore our data and wrangle it into a modelling-ready state. The next chapter goes deeper in this pursuit: we will learn to aggregate values for categorical variables, subset a dataset, merge two datasets, generate random numbers, and sample a dataset.

Cleaning, as we have seen in the last chapter, takes about 80% of the modelling time, so it is of critical importance, and the methods we are learning will come in handy in the pursuit of that goal.
