
Introducing Regexes

Regular expressions can be a little challenging at first, but they are very powerful. They are generic abstractions and work across many programming languages, not just Python:

import re
re.split(r'\W+', 'Words, words, words.')
> ['Words', 'words', 'words', '']

The regular expression \W+ matches a non-word character (anything other than a letter, digit, or underscore) repeated one or more times, so splitting on it leaves us with the runs of word characters in between:

words_alphanumeric = re.split(r'\W+', text)
print(len(words_alphanumeric), len(words))

The output of the preceding code is (109111, 107431).
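As an aside, since \w and \W are complements, splitting on runs of non-word characters yields the same tokens as extracting runs of word characters with re.findall; the only difference is the trailing empty string. A quick illustrative check on a throwaway sample string:

import re
sample = 'Words, words, words.'
# \w+ extracts runs of word characters directly
re.findall(r'\w+', sample)
> ['Words', 'words', 'words']
# splitting on \W+ gives the same tokens, plus a trailing '' because the string ends with '.'
re.split(r'\W+', sample)
> ['Words', 'words', 'words', '']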

Let’s preview the words we extracted:

print(words_alphanumeric[90:200])

The following is the output we got from the preceding code:

   ['BOHEMIA', 'I', 'To', 'Sherlock', 'Holmes', 'she', 'is', 'always', 'THE', 'woman', 'I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name', 'In', 'his', 'eyes', 'she', 'eclipses', 'and', 'predominates', 'the', 'whole', 'of', 'her', 'sex', 'It', 'was', 'not', 'that', 'he', 'felt', 'any', 'emotion', 'akin', 'to', 'love', 'for', 'Irene', 'Adler', 'All', 'emotions', 'and', 'that', 'one', 'particularly', 'were', 'abhorrent', 'to', 'his', 'cold', 'precise', 'but', 'admirably', 'balanced', 'mind', 'He', 'was', 'I', 'take', 'it', 'the', 'most', 'perfect', 'reasoning', 'and', 'observing', 'machine', 'that', 'the', 'world', 'has', 'seen', 'but', 'as', 'a', 'lover', 'he', 'would', 'have', 'placed', 'himself', 'in', 'a', 'false', 'position', 'He', 'never', 'spoke', 'of', 'the', 'softer', 'passions', 'save', 'with', 'a', 'gibe', 'and', 'a', 'sneer', 'They', 'were', 'admirable']

We notice that Adler no longer has the punctuation mark attached to it. This is what we wanted. Mission accomplished?

What was the trade-off we made here? To understand that, let's look at another example:

words_break = re.split(r'\W+', "Isn't he coming home for dinner with the red-headed girl?")
print(words_break)

The following is the output we got from the preceding code:

 ['Isn', 't', 'he', 'coming', 'home', 'for', 'dinner', 'with', 'the', 'red', 'headed', 'girl', '']

We have split Isn't into Isn and t. This isn't good if you're working with, say, email or Twitter data, because you would have a lot more of these contractions and abbreviations. As a minor annoyance, we also have an extra empty token, '', at the end. Similarly, because the hyphen is a non-word character, red-headed is broken into two words: red and headed. We have no straightforward way to restore this connection if we are only given the tokenized version.
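The empty token, at least, is easy to deal with: re.split produces it because the sentence ends on a non-word character, so we can drop falsy entries after splitting. A minimal sketch, reusing the sentence above:

import re
sentence = "Isn't he coming home for dinner with the red-headed girl?"
# keep only non-empty tokens after splitting
[token for token in re.split(r'\W+', sentence) if token]
> ['Isn', 't', 'he', 'coming', 'home', 'for', 'dinner', 'with', 'the', 'red', 'headed', 'girl']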

We can write custom rules in our tokenization strategy to cover most of these edge cases. Or, we can use something that has already been written for us.
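For instance, one possible custom rule is to match tokens with re.findall instead of splitting, using a pattern that allows apostrophes and hyphens inside a word. This is only a sketch of the idea, not a complete tokenizer:

import re
# word characters, optionally joined by apostrophes or hyphens,
# so "Isn't" and "red-headed" survive as single tokens
pattern = re.compile(r"\w+(?:['-]\w+)*")
pattern.findall("Isn't he coming home for dinner with the red-headed girl?")
> ["Isn't", 'he', 'coming', 'home', 'for', 'dinner', 'with', 'the', 'red-headed', 'girl']

Even this small pattern leaves plenty of cases unhandled (possessives, dashes, ellipses, URLs), and the rules pile up quickly, which is exactly why ready-made tokenizers exist.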
