
Understanding the Science Behind EDA

In layman's terms, we can define EDA as the science of understanding data. A more formal definition is the process of analyzing and exploring datasets to summarize their characteristics, properties, and latent relationships using statistical, visual, or analytical techniques, or a combination of them.

To cement our understanding, let's break down the definition further. A dataset is typically a combination of numeric and categorical features. To study the data, we might need to explore features individually, and to study relationships, we might need to explore features together. Depending on the number and type of features, we may cross paths with different types of EDA.

To simplify, we can broadly classify the process of EDA as follows:

  • Univariate analysis: Studying a single feature
  • Bivariate analysis: Studying the relationship between two features
  • Multivariate analysis: Studying the relationship between more than two features

For now, we will restrict the scope of the chapter to univariate and bivariate analysis. A few forms of multivariate analysis, such as regression, will be covered in the upcoming chapters.
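As a rough illustration of the difference between the first two categories, the sketch below summarizes one feature on its own and then measures how two features move together. The feature names (age, balance) and their values are hypothetical and are not taken from the book's dataset. The Pearson correlation is only one of several possible bivariate measures.

```python
import math
import statistics

# Hypothetical toy dataset: each list is one feature, aligned by row.
age = [25, 32, 47, 51, 38, 29]
balance = [1200.0, 1500.0, 3100.0, 3600.0, 2200.0, 1400.0]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Univariate: summarize a single feature on its own.
age_median = statistics.median(age)
age_range = (min(age), max(age))

# Bivariate: quantify the relationship between two features.
age_balance_corr = pearson(age, balance)

print(f"median age: {age_median}, range: {age_range}")
print(f"correlation(age, balance): {age_balance_corr:.3f}")
```

A correlation close to 1 for this toy data only says that the two made-up features rise together. It says nothing about real customers.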

To accomplish each of the previously mentioned analyses, we can use visualization techniques such as boxplots, scatter plots, and bar charts; statistical techniques such as hypothesis testing; or simple analytical techniques such as averages, frequency counts, and so on.
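The simple analytical techniques mentioned above (averages and frequency counts) need very little machinery. The following is a minimal sketch using only the Python standard library. The records, the job categories, and the balance values are made up for illustration.

```python
from collections import Counter, defaultdict
from statistics import fmean

# Hypothetical records: (job, balance). Both names are illustrative only.
records = [
    ("admin", 1200.0),
    ("technician", 1500.0),
    ("admin", 1800.0),
    ("retired", 3100.0),
    ("technician", 2500.0),
    ("admin", 900.0),
]

# Frequency count: how often each category occurs.
job_counts = Counter(job for job, _ in records)

# Average of a numeric feature, overall and within each category.
overall_mean = fmean(balance for _, balance in records)
by_job = defaultdict(list)
for job, balance in records:
    by_job[job].append(balance)
mean_by_job = {job: fmean(values) for job, values in by_job.items()}

print(job_counts.most_common())
print(round(overall_mean, 2), mean_by_job)
```

Comparing the per-category averages with the overall average is already a small bivariate analysis, since it relates a numeric feature to a categorical one.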

Breaking this down further, there is another dimension to consider: the type of feature, numeric or categorical. For each type of analysis mentioned (univariate and bivariate), the appropriate visual technique depends on the type of the feature. For univariate analysis of a numeric variable, for example, we could use a histogram or a boxplot, whereas for a categorical variable we might use a frequency bar chart. We will get into the details of the overall EDA exercise using a lazy programming approach; that is, we will explore the context and details of each analysis as and when it occurs in the book.
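To make the dependence on feature type concrete, here is a small sketch that picks a summary based on the values it receives: equal-width histogram bins for a numeric feature and category frequencies for anything else. The helper names summarize_feature and text_chart, the bin count, and the sample data are all assumptions made for this example. A real analysis would more likely use a plotting library.

```python
from collections import Counter


def summarize_feature(values, bins=4):
    """Return (label, count) pairs: equal-width bins for numeric values,
    category frequencies otherwise."""
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if not is_numeric:
        return Counter(values).most_common()
    low, high = min(values), max(values)
    width = (high - low) / bins or 1  # avoid zero width when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end, so clamp it to the last bin.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return [
        (f"{low + i * width:g}-{low + (i + 1) * width:g}", c)
        for i, c in enumerate(counts)
    ]


def text_chart(pairs):
    """Render (label, count) pairs as a crude horizontal bar chart."""
    return "\n".join(f"{label:>12} | {'#' * count}" for label, count in pairs)


# Hypothetical sample data, one numeric feature and one categorical feature.
ages = [25, 32, 47, 51, 38, 29, 33, 41]
jobs = ["admin", "technician", "admin", "retired", "technician", "admin"]

print(text_chart(summarize_feature(ages)))
print(text_chart(summarize_feature(jobs)))
```

The same function call yields a histogram-style summary for ages and a frequency-style summary for jobs, which mirrors the choice between a histogram and a frequency bar chart described above.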

With the basic background context set for the exercise, let's get ready for a specific EDA exercise.
