
Introducing Regexes

Regular expressions can be a little challenging at first, but they are very powerful. They are generic abstractions and work across many programming languages, not just Python:

import re
re.split(r'\W+', 'Words, words, words.')
> ['Words', 'words', 'words', '']

The regular expression \W+ matches a non-word character (anything other than a letter, digit, or underscore) repeated one or more times, so splitting on it leaves us with the runs of word characters in between:

words_alphanumeric = re.split(r'\W+', text)
print(len(words_alphanumeric), len(words))

The output of the preceding code is (109111, 107431).
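As an aside, since \w and \W are complements, splitting on runs of non-word characters yields the same tokens as extracting runs of word characters with re.findall; the only difference is the trailing empty string. A quick illustrative check on a throwaway sample string:

import re
sample = 'Words, words, words.'
# \w+ extracts runs of word characters directly
re.findall(r'\w+', sample)
> ['Words', 'words', 'words']
# splitting on \W+ gives the same tokens, plus a trailing '' because the string ends with '.'
re.split(r'\W+', sample)
> ['Words', 'words', 'words', '']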

Let’s preview the words we extracted:

print(words_alphanumeric[90:200])

The following is the output we got from the preceding code:

   ['BOHEMIA', 'I', 'To', 'Sherlock', 'Holmes', 'she', 'is', 'always', 'THE', 'woman', 'I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name', 'In', 'his', 'eyes', 'she', 'eclipses', 'and', 'predominates', 'the', 'whole', 'of', 'her', 'sex', 'It', 'was', 'not', 'that', 'he', 'felt', 'any', 'emotion', 'akin', 'to', 'love', 'for', 'Irene', 'Adler', 'All', 'emotions', 'and', 'that', 'one', 'particularly', 'were', 'abhorrent', 'to', 'his', 'cold', 'precise', 'but', 'admirably', 'balanced', 'mind', 'He', 'was', 'I', 'take', 'it', 'the', 'most', 'perfect', 'reasoning', 'and', 'observing', 'machine', 'that', 'the', 'world', 'has', 'seen', 'but', 'as', 'a', 'lover', 'he', 'would', 'have', 'placed', 'himself', 'in', 'a', 'false', 'position', 'He', 'never', 'spoke', 'of', 'the', 'softer', 'passions', 'save', 'with', 'a', 'gibe', 'and', 'a', 'sneer', 'They', 'were', 'admirable']

We notice that Adler no longer has the punctuation mark attached to it. This is what we wanted. Mission accomplished?

What was the trade-off we made here? To understand that, let's look at another example:

words_break = re.split(r'\W+', "Isn't he coming home for dinner with the red-headed girl?")
print(words_break)

The following is the output we got from the preceding code:

 ['Isn', 't', 'he', 'coming', 'home', 'for', 'dinner', 'with', 'the', 'red', 'headed', 'girl', '']

We have split Isn't into Isn and t. This isn't good if you're working with, say, email or Twitter data, because you would have a lot more of these contractions and abbreviations. As a minor annoyance, we also have an extra empty token, '', at the end. Similarly, because the hyphen is a non-word character, red-headed is broken into two words: red and headed. We have no straightforward way to restore this connection if we are only given the tokenized version.
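The empty token, at least, is easy to deal with: re.split produces it because the sentence ends on a non-word character, so we can drop falsy entries after splitting. A minimal sketch, reusing the sentence above:

import re
sentence = "Isn't he coming home for dinner with the red-headed girl?"
# keep only non-empty tokens after splitting
[token for token in re.split(r'\W+', sentence) if token]
> ['Isn', 't', 'he', 'coming', 'home', 'for', 'dinner', 'with', 'the', 'red', 'headed', 'girl']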

We can write custom rules in our tokenization strategy to cover most of these edge cases. Or, we can use something that has already been written for us.
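For instance, one possible custom rule is to match tokens with re.findall instead of splitting, using a pattern that allows apostrophes and hyphens inside a word. This is only a sketch of the idea, not a complete tokenizer:

import re
# word characters, optionally joined by apostrophes or hyphens,
# so "Isn't" and "red-headed" survive as single tokens
pattern = re.compile(r"\w+(?:['-]\w+)*")
pattern.findall("Isn't he coming home for dinner with the red-headed girl?")
> ["Isn't", 'he', 'coming', 'home', 'for', 'dinner', 'with', 'the', 'red-headed', 'girl']

Even this small pattern leaves plenty of cases unhandled (possessives, dashes, ellipses, URLs), and the rules pile up quickly, which is exactly why ready-made tokenizers exist.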
