Title: Hands-On Exploratory Data Analysis with Python
Authors: Suresh Kumar Mukhiya, Usman Ahmed
Updated: 2021-06-24 16:44:56

Data refactoring

We noticed that the from field contains more information than we need. We just need to extract an email address from that field. Let's do some refactoring:

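1. First of all, import the regular expression package, along with NumPy, which the function in the next step uses to represent a missing value:

import re
import numpy as np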

2. Next, let's create a function that takes an entire string from any column and extracts an email address:

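def extract_email_ID(string):
  # Prefer an address wrapped in angle brackets, e.g. "Name <user@example.com>"
  email = re.findall(r'<(.+?)>', string)
  if not email:
    # Otherwise fall back to any whitespace-separated token containing '@'
    email = list(filter(lambda y: '@' in y, string.split()))
  # Return the first match, or NaN when no address is found
  return email[0] if email else np.nan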

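The preceding function is pretty straightforward, right? The regular expression <(.+?)> captures whatever sits between angle brackets, which is where a from field of the form Name <address> keeps the email address. If there are no angle brackets, we fall back to the first whitespace-separated token that contains an @ sign. If there is still no email address, we populate the field with NaN. Well, if you are not sure about regular expressions, don't worry. Just read the Appendix.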
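To make the two branches of the function concrete, here is a minimal sketch that calls it on three made-up strings. The sample values and the names bracketed, bare, and missing are assumptions for illustration only; they are not taken from the dataset.

```python
import re
import numpy as np

def extract_email_ID(string):
    # Same logic as the function above
    email = re.findall(r'<(.+?)>', string)
    if not email:
        email = list(filter(lambda y: '@' in y, string.split()))
    return email[0] if email else np.nan

# Hypothetical sample values, not taken from the dataset
bracketed = extract_email_ID('Jane Doe <jane.doe@example.com>')  # regex branch
bare = extract_email_ID('jane.doe@example.com')                  # fallback branch
missing = extract_email_ID('Mail Delivery Subsystem')            # no address at all

print(bracketed)  # jane.doe@example.com
print(bare)       # jane.doe@example.com
print(missing)    # nan
```

Note that the function expects a string. A value that is already NaN (a float) would raise a TypeError inside re.findall, so any missing entries in the column need to be handled before the function is applied.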

3. Next, let's apply the function to the from column:

dfs['from'] = dfs['from'].apply(lambda x: extract_email_ID(x))

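We used the lambda function to pass each and every value in the column to extract_email_ID. Since the lambda only forwards its argument, passing the function directly, as in dfs['from'].apply(extract_email_ID), would work equally well.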
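The following sketch shows the same step on a tiny DataFrame, so the effect on the column is easy to inspect. The name sample and its three rows are hypothetical stand-ins for dfs, chosen only for illustration.

```python
import re
import numpy as np
import pandas as pd

def extract_email_ID(string):
    # Same logic as the function defined earlier
    email = re.findall(r'<(.+?)>', string)
    if not email:
        email = list(filter(lambda y: '@' in y, string.split()))
    return email[0] if email else np.nan

# Hypothetical stand-in for the dfs DataFrame
sample = pd.DataFrame({'from': [
    'Jane Doe <jane.doe@example.com>',
    'john@example.org',
    'No address here',
]})

# Replace each raw value with the extracted address (or NaN)
sample['from'] = sample['from'].apply(extract_email_ID)

# Count the rows for which no address could be extracted
n_missing = int(sample['from'].isna().sum())
print(sample['from'].tolist())
print(n_missing)
```

Checking the number of NaN values right after the step is a quick way to see how many rows had no recognizable address.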

4. Next, we are going to refactor the label field. The logic is simple: if an email is from your own email address, it is a sent email. Otherwise, it is a received email, that is, an inbox email:

myemail = 'itsmeskm99@gmail.com'
dfs['label'] = dfs['from'].apply(lambda x: 'sent' if x==myemail else 'inbox')

The preceding code is self-explanatory. Note that a row whose from value is NaN never equals myemail, so it is labeled inbox.
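As a quick check of the labeling rule, here is a small sketch on made-up data. The address me@example.com and the rows in sample are assumptions for illustration. The np.where line is an alternative we are adding here, not part of the original steps; it expresses the same rule without a Python-level lambda.

```python
import numpy as np
import pandas as pd

# Hypothetical address and rows, used only for illustration
myemail = 'me@example.com'
sample = pd.DataFrame({'from': ['me@example.com', 'friend@example.org', np.nan]})

# Same rule as above: 'sent' if the sender is us, otherwise 'inbox'
sample['label'] = sample['from'].apply(lambda x: 'sent' if x == myemail else 'inbox')
print(sample['label'].tolist())  # ['sent', 'inbox', 'inbox']

# Vectorized alternative: compare the whole column at once
vectorized = np.where(sample['from'] == myemail, 'sent', 'inbox')
print(vectorized.tolist())  # ['sent', 'inbox', 'inbox']
```

Both versions label the NaN row as inbox, because a missing value never compares equal to the address.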
