
Chapter 3. Data Exploration

When we first receive a dataset, most of the time we know only what it relates to, an overview that is not enough to start applying algorithms or building models. Data exploration is therefore of paramount importance in data science. It is a necessary step before creating a model because it highlights the main features of the dataset and clarifies the path to achieving our objectives. Data exploration familiarizes the data scientist with the data and helps identify which general hypotheses the dataset might support. In short, it is a process of extracting information from a dataset without knowing beforehand what to look for.

In this chapter, we will study:

  • Sampling, population, and weight vectors
  • Inferring column types
  • Summary of a dataset
  • Scalar statistics
  • Measures of variation
  • Data exploration using visualizations

Data exploration involves descriptive statistics, the branch of data analysis that reveals patterns by meaningfully summarizing data. This may not lead directly to exact results or to the model that we intend to build, but it helps us understand the data. Suppose there are 10 million people in New Delhi and we calculate the mean height of 1,000 residents chosen at random. The result would not be the exact average height of the people of New Delhi, but it would give a reasonable estimate.
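The idea that a sample mean approximates the population mean can be sketched in a few lines of Julia. This is only an illustration under stated assumptions: the population is synthetic (normally distributed heights with a made-up mean of 165 cm and standard deviation of 10 cm, not real data for New Delhi), it is scaled down to 1 million entries to keep memory use modest, and the names population and heights_sample are hypothetical. Only the Random and Statistics standard libraries are used.

```julia
using Random      # standard library: seeded random number generation
using Statistics  # standard library: mean

# Hypothetical synthetic population of heights in cm (illustrative values, not real data).
rng = MersenneTwister(42)
population = 165 .+ 10 .* randn(rng, 1_000_000)

# Choose 1,000 distinct people at random (sampling without replacement).
sample_idx = randperm(rng, length(population))[1:1_000]
heights_sample = population[sample_idx]

population_mean = mean(population)
sample_mean = mean(heights_sample)

println("Population mean: ", round(population_mean, digits=2))
println("Sample mean:     ", round(sample_mean, digits=2))
println("Difference:      ", round(abs(population_mean - sample_mean), digits=2))
```

The two means will generally differ, but with 1,000 observations the gap is typically small compared with the spread of the heights themselves, which is why a random sample gives a useful idea of the whole population.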

Julia is well suited to data exploration. The StatsBase.jl package provides the necessary statistical functions. Throughout this chapter, we assume that you have added the package:

julia> using Pkg
julia> Pkg.update()
julia> Pkg.add("StatsBase")