- Natural Language Processing with Python Quick Start Guide
- Nirant Kasliwal
- 2021-06-10 18:36:39
Introducing Regexes
Regular expressions can be a little challenging at first, but they are very powerful. Their core syntax is largely shared across programming languages, so what you learn here carries over beyond Python:
import re
re.split(r'\W+', 'Words, words, words.')
> ['Words', 'words', 'words', '']
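The regular expression \W+ matches one or more non-word characters, that is, anything other than letters, digits, and the underscore. Splitting on it breaks the text wherever punctuation or whitespace occurs. The trailing '' in the output above appears because the string ends with a separator (the final period):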
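To make the difference between \w (word characters) and \W (everything else) concrete, here is a minimal sketch using only the standard re module. The variable names are illustrative and not from the original text.

```python
import re

sample = 'Words, words, words.'

# \w+ matches runs of word characters: letters, digits, and underscore.
word_runs = re.findall(r'\w+', sample)

# \W+ matches runs of everything else; these are the separators
# that re.split(r'\W+', ...) cuts on.
separator_runs = re.findall(r'\W+', sample)

print(word_runs)       # ['Words', 'words', 'words']
print(separator_runs)  # [', ', ', ', '.']
```

The final '.' is itself a separator with nothing after it, which is why splitting on \W+ leaves an empty string at the end of the list.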
words_alphanumeric = re.split(r'\W+', text)
print(len(words_alphanumeric), len(words))
The output of the preceding code is 109111 107431.
Let’s preview the words we extracted:
print(words_alphanumeric[90:200])
The following is the output we got from the preceding code:
['BOHEMIA', 'I', 'To', 'Sherlock', 'Holmes', 'she', 'is', 'always', 'THE', 'woman', 'I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name', 'In', 'his', 'eyes', 'she', 'eclipses', 'and', 'predominates', 'the', 'whole', 'of', 'her', 'sex', 'It', 'was', 'not', 'that', 'he', 'felt', 'any', 'emotion', 'akin', 'to', 'love', 'for', 'Irene', 'Adler', 'All', 'emotions', 'and', 'that', 'one', 'particularly', 'were', 'abhorrent', 'to', 'his', 'cold', 'precise', 'but', 'admirably', 'balanced', 'mind', 'He', 'was', 'I', 'take', 'it', 'the', 'most', 'perfect', 'reasoning', 'and', 'observing', 'machine', 'that', 'the', 'world', 'has', 'seen', 'but', 'as', 'a', 'lover', 'he', 'would', 'have', 'placed', 'himself', 'in', 'a', 'false', 'position', 'He', 'never', 'spoke', 'of', 'the', 'softer', 'passions', 'save', 'with', 'a', 'gibe', 'and', 'a', 'sneer', 'They', 'were', 'admirable']
We notice how Adler no longer has the punctuation mark alongside it. This is what we wanted. Mission accomplished?
What was the trade-off we made here? To understand that, let's look at another example:
words_break = re.split(r'\W+', "Isn't he coming home for dinner with the red-headed girl?")
print(words_break)
The following is the output we got from the preceding code:
['Isn', 't', 'he', 'coming', 'home', 'for', 'dinner', 'with', 'the', 'red', 'headed', 'girl', '']
We have split Isn't into Isn and t. This isn't good if you're working with, say, email or Twitter data, because such text contains many more contractions and abbreviations. As a minor annoyance, we have an extra empty token, '', at the end, because the sentence ends with a separator (the question mark). Similarly, because we split on every punctuation character, red-headed is broken into two words: red and headed. We have no straightforward way to restore this connection if we are only given the tokenized version.
We can write custom rules in our tokenization strategy to cover most of these edge cases. Or, we can use something that has already been written for us.
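As an illustration of what one such custom rule might look like, the sketch below matches tokens directly instead of splitting on separators. The pattern and the tokenize helper are hypothetical, not the book's method, and they assume plain ASCII apostrophes and hyphens.

```python
import re

# Hypothetical rule: a run of word characters, optionally continued by
# an apostrophe or hyphen that is followed by more word characters.
# Assumes ASCII ' and - only; curly apostrophes would still split.
TOKEN_PATTERN = re.compile(r"\w+(?:['-]\w+)*")

def tokenize(sentence):
    """Return the tokens matched by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(sentence)

tokens = tokenize("Isn't he coming home for dinner with the red-headed girl?")
print(tokens)
# ["Isn't", 'he', 'coming', 'home', 'for', 'dinner', 'with', 'the', 'red-headed', 'girl']
```

Because re.findall returns only the matches, the trailing empty token disappears as a side effect. A single rule like this still misses plenty of cases (URLs, emoticons, abbreviations such as U.S.), which is the argument for using an existing tokenizer.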