
Data refactoring

We noticed that the from field contains more information than we need. We just need to extract an email address from that field. Let's do some refactoring:

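1. First of all, import the regular expression package, along with NumPy, which the function in the next step uses for its NaN value:

import re
import numpy as np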

2. Next, let's create a function that takes an entire string from any column and extracts an email address:

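def extract_email_ID(string):
    # Prefer an address wrapped in angle brackets, such as "Name <user@host>"
    email = re.findall(r'<(.+?)>', string)
    if not email:
        # Otherwise fall back to any whitespace-separated token containing '@'
        email = list(filter(lambda y: '@' in y, string.split()))
    return email[0] if email else np.nan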

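The preceding function is pretty straightforward, right? We use a regular expression to find the text enclosed in angle brackets, which is where the email address usually sits. If nothing matches, we fall back to the first whitespace-separated token that contains an @ sign. If there is still no email address, we populate the field with NaN. If you are not sure about regular expressions, don't worry. Just read the Appendix.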
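To see how the function behaves, here is a minimal sketch that runs it on three made-up strings. The sample values are assumptions for illustration only and do not come from the dataset: one has the address in angle brackets, one is a bare address, and one has no address at all.

```python
import re
import numpy as np

def extract_email_ID(string):
    # Prefer an address wrapped in angle brackets, such as "Name <user@host>"
    email = re.findall(r'<(.+?)>', string)
    if not email:
        # Otherwise fall back to any whitespace-separated token containing '@'
        email = list(filter(lambda y: '@' in y, string.split()))
    return email[0] if email else np.nan

# Hypothetical sample values, not taken from the real mailbox data
samples = [
    'Jane Doe <jane.doe@example.com>',  # address inside angle brackets
    'john@example.org',                 # bare address, no brackets
    'Mailer Daemon',                    # no address at all
]
results = [extract_email_ID(s) for s in samples]
print(results)
```

Note that the function expects a string. If the column already contains missing values (floats such as NaN), re.findall raises a TypeError, so such rows would need to be handled before the function is applied.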

3. Next, let's apply the function to the from column:

dfs['from'] = dfs['from'].apply(lambda x: extract_email_ID(x))

We used a lambda function to call extract_email_ID on every value in the column. Passing the function directly, as in apply(extract_email_ID), has the same effect.

4. Next, we are going to refactor the label field. The logic is simple. If an email is from your own email address, then it is a sent email. Otherwise, it is a received email, that is, an inbox email:

myemail = 'itsmeskm99@gmail.com'
dfs['label'] = dfs['from'].apply(lambda x: 'sent' if x==myemail else 'inbox')

The preceding code is self-explanatory.
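As a quick check of the labeling rule, the following sketch applies it to a tiny, made-up DataFrame. The addresses and the value of myemail here are placeholders chosen for illustration, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Placeholder address used only for this illustration
myemail = 'me@example.com'

# Hypothetical 'from' column: own address, someone else's, and a missing value
dfs = pd.DataFrame({'from': ['me@example.com', 'friend@example.org', np.nan]})

# Same rule as in the text: own address means 'sent', everything else 'inbox'
dfs['label'] = dfs['from'].apply(lambda x: 'sent' if x == myemail else 'inbox')

labels = dfs['label'].tolist()
print(labels)
```

One consequence worth noticing: a row whose from value is NaN (because no address could be extracted) never equals myemail, so it is labeled inbox as well.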
