
Introducing Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA), or Initial Data Analysis (IDA), is an approach to data analysis that attempts to maximize insight into data. This includes assessing the quality and structure of the data, calculating summary or descriptive statistics, and plotting appropriate graphs. It can uncover underlying structures and suggest how the data should be modeled. Furthermore, EDA helps us detect outliers, errors, and anomalies in our data, and deciding what to do about such data is often more important than other, more sophisticated analysis. EDA enables us to test our underlying assumptions, discover clusters and other patterns in our data, and identify the possible relationships between various variables. A careful EDA process is vital to understanding the data and is sometimes sufficient to reveal such poor data quality that using a more sophisticated model-based analysis is not justified.
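As a rough illustration of the quality checks and descriptive statistics mentioned above, the following sketch summarizes a single numeric column using only the Python standard library. The `summarize` function, the sample values, and the 1.5 × IQR rule for flagging outliers are assumptions made for this example; they are one common convention, not a method prescribed by the text.

```python
import statistics


def summarize(values):
    """Return basic quality checks and descriptive statistics for one column."""
    # Treat None as a missing value and exclude it from the statistics.
    present = [v for v in values if v is not None]
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: flag points beyond 1.5 * IQR from the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": median,
        "stdev": statistics.stdev(present),
        "min": min(present),
        "max": max(present),
        "outliers": [v for v in present if v < low or v > high],
    }


# Hypothetical sample column with one missing value and one suspicious value.
sample = [10, 11, 12, 13, 14, 14, 15, 98, None]
summary = summarize(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Whether a flagged value such as 98 is an error or a genuine observation is still a judgment call for the analyst; the summary only points to where to look.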

Typically, the graphical techniques used in EDA are simple, consisting of plots of the raw data and of simple statistics. The focus is on the structures and models that the data reveals or that best fit the data. EDA techniques include scatter plots, box plots, histograms, probability plots, and so on. Most EDA techniques use all of the data and make no underlying assumptions about it. Through such exploration, the analyst builds intuition, or gets a "feel", for the dataset. More specifically, the graphical techniques allow us to efficiently select and validate appropriate models, test our assumptions, identify relationships, select estimators, detect outliers, and so on.

EDA involves a lot of trial and error, and several iterations. The best approach is to start simple and then add complexity as you go along. For example, you should plot simple histograms and scatter plots to quickly start developing an intuition for your data. The same principle applies to modeling, where there is a major trade-off between simple models and more accurate ones. Simple models are often much easier to interpret and understand. They can get you to 90% accuracy very quickly, whereas a more complex model might take weeks or months to deliver an additional 2% improvement.
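To make "start simple" concrete, here is a minimal sketch of a text histogram built with plain Python. The `histogram_counts` helper, the `ages` values, and the bin edges are all hypothetical and chosen only for illustration; in practice you would more likely use a plotting library.

```python
def histogram_counts(values, bins, low, high):
    """Count values falling into `bins` equal-width bins on [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            # A value exactly on the top edge is placed in the last bin.
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts


# Hypothetical column of ages.
ages = [22, 25, 27, 31, 33, 34, 35, 38, 41, 45, 47, 52, 58, 63]
counts = histogram_counts(ages, bins=5, low=20, high=70)

# Print one row of '#' marks per bin as a quick first look at the shape.
for i, count in enumerate(counts):
    start = 20 + i * 10
    print(f"{start:>3}-{start + 10:<3} {'#' * count}")
```

Even this crude view shows where the values concentrate and how the distribution tails off, which is often enough to decide what to examine next.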
