- Natural Language Processing with Python Quick Start Guide
- Nirant Kasliwal
- 2021-06-10 18:36:39
Intuitive – split by whitespace
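The following lines of code split the entire text body on whitespace. Called without arguments, split() breaks on any run of spaces, tabs, or newlines, not only on the single space character ' ':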
words = text.split()
print(len(words))
107431
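To see why calling split() without arguments matters, here is a minimal sketch that compares it with split(' '). The sample string is made up for illustration and is not taken from the book's corpus:

```python
# Hypothetical sample with a double space and a newline in it
sample = "To  Sherlock Holmes\nshe is always THE woman."

# No argument: split on any run of whitespace, no empty strings produced
by_whitespace = sample.split()

# Explicit ' ': split on every single space only; newlines are left alone
by_space = sample.split(' ')

print(by_whitespace)
print(by_space)
```

The second call leaves an empty string where the double space was and keeps 'Holmes\nshe' as one token, which is usually not what we want when tokenizing running text.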
Let's preview a rather large segment from our list of tokens:
print(words[90:200])  # start with the first chapter, ignoring the index for now
['To', 'Sherlock', 'Holmes', 'she', 'is', 'always', 'THE', 'woman.', 'I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name.', 'In', 'his', 'eyes', 'she', 'eclipses', 'and', 'predominates', 'the', 'whole', 'of', 'her', 'sex.', 'It', 'was', 'not', 'that', 'he', 'felt', 'any', 'emotion', 'akin', 'to', 'love', 'for', 'Irene', 'Adler.', 'All', 'emotions,', 'and', 'that', 'one', 'particularly,', 'were', 'abhorrent', 'to', 'his', 'cold,', 'precise', 'but', 'admirably', 'balanced', 'mind.', 'He', 'was,', 'I', 'take', 'it,', 'the', 'most', 'perfect', 'reasoning', 'and', 'observing', 'machine', 'that', 'the', 'world', 'has', 'seen,', 'but', 'as', 'a', 'lover', 'he', 'would', 'have', 'placed', 'himself', 'in', 'a', 'false', 'position.', 'He', 'never', 'spoke', 'of', 'the', 'softer', 'passions,', 'save', 'with', 'a', 'gibe', 'and', 'a', 'sneer.', 'They', 'were', 'admirable', 'things', 'for']
The way punctuation is handled here is not desirable. Punctuation often stays attached to the neighboring word, such as the full stop at the end of 'Adler.' and the comma at the end of 'emotions,'. Quite often we want words to be separated from punctuation, because words convey far more meaning than punctuation in most datasets.
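As a rough way to spot the problem, the sketch below lists the tokens that end in a punctuation character. The short tokens list is copied by hand from the preview above, and the name glued is only an illustrative choice:

```python
import string

# A handful of tokens taken from the preview output above
tokens = ['Irene', 'Adler.', 'All', 'emotions,', 'and', 'that', 'one', 'particularly,']

# Keep only the tokens whose last character is ASCII punctuation
glued = [t for t in tokens if t[-1] in string.punctuation]

print(glued)
```

Three of these eight tokens carry trailing punctuation, so a vocabulary built from them would treat 'emotions,' and 'emotions' as different words.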
Let's look at a shorter example:
'red-headed woman on the street'.split()
The following is the output from the preceding code:
['red-headed', 'woman', 'on', 'the', 'street']
Note how the hyphenated word red-headed was not split. This is something we may or may not want to keep. We will come back to this, so keep it in mind.
One way to tackle this punctuation challenge is to simply extract words and discard everything else. This means that we will discard all non-ASCII characters and punctuation.
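One possible sketch of this idea uses a regular expression from Python's standard re module. The pattern, the re.ASCII flag, and the helper name extract_words are assumptions made for illustration and may differ from the approach the book goes on to use:

```python
import re

# \w+ with re.ASCII matches runs of [a-zA-Z0-9_] only, so punctuation
# and non-ASCII characters are skipped (digits and underscores are kept)
WORD_PATTERN = re.compile(r'\w+', flags=re.ASCII)

def extract_words(text):
    """Return the ASCII word-like runs found in text (hypothetical helper)."""
    return WORD_PATTERN.findall(text)

print(extract_words('All emotions, and that one particularly,'))
print(extract_words('red-headed woman on the street'))
```

Note that this choice also splits red-headed into 'red' and 'headed', and would break a contraction such as don't into 'don' and 't', so it trades one set of tokenization decisions for another.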