Applying descriptive statistics

Having preprocessed the dataset, let's do some sanity checking using descriptive statistics techniques. 

A quick first check is the dataframe's info() method, which lists the non-null count and data type of each column:

dfs.info()

The output of the preceding code is as follows:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37554 entries, 1 to 78442
Data columns (total 6 columns):
subject 37367 non-null object
from 37554 non-null object
date 37554 non-null datetime64[ns, UTC]
to 36882 non-null object
label 36962 non-null object
thread 37554 non-null object
dtypes: datetime64[ns, UTC](1), object(5)
memory usage: 2.0+ MB
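The non-null counts in this output are not the same for every column, so some columns contain missing values. The following is a minimal sketch of that arithmetic, using only the numbers printed above; the variable names are illustrative and not part of the original code.

```python
# Row count and per-column non-null counts, copied from the info() output above.
total_rows = 37554
non_null = {
    "subject": 37367,
    "from": 37554,
    "date": 37554,
    "to": 36882,
    "label": 36962,
    "thread": 37554,
}

# Missing values per column = total rows minus non-null entries.
missing = {column: total_rows - count for column, count in non_null.items()}

for column, count in missing.items():
    if count:
        print(f"{column}: {count} missing ({count / total_rows:.2%})")
```

On the dataframe itself, dfs.isnull().sum() should report the same per-column counts directly.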

We will learn more about descriptive statistics in Chapter 5, Descriptive Statistics. Note that the dataframe holds 37,554 emails, each described by six columns: subject, from, date, to, label, and thread. Let's check the first few entries of the email dataset:

dfs.head(10)

The preceding code displays the first 10 rows of the dataframe.

Note that our dataframe so far contains six different columns. Take a look at the from field: it contains both the sender's name and the email address. For our analysis, we only need the email address, so we can use a regular expression to extract it and rewrite the column.
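As a rough sketch of that idea, the snippet below pulls an address out of a sender string with Python's re module. The pattern, the extract_email helper, and the sample strings are assumptions made for illustration; the pattern is deliberately simple and will not cover every valid address format.

```python
import re

# Simplified email pattern (an assumption for this sketch, not a full RFC 5322 matcher).
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_email(sender):
    """Return the first email address found in sender, or None if there is none."""
    if not isinstance(sender, str):
        return None
    match = EMAIL_PATTERN.search(sender)
    return match.group(0) if match else None


# Hypothetical sender strings in the "Name <address>" style described above.
samples = [
    "Jane Doe <jane.doe@example.com>",
    "john@example.org",
    "no address here",
]

for sample in samples:
    print(repr(sample), "->", extract_email(sample))
```

Assuming the from column holds strings like these, the helper could be applied to the whole column with dfs['from'] = dfs['from'].apply(extract_email).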
