Intuitive – split by whitespace

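The following lines of code simply split the entire text body on whitespace. Called with no arguments, split() breaks the string on any run of whitespace (spaces, tabs, and newlines), not only on the space character ' ':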

words = text.split()
print(len(words))

107431
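To make the behavior of split() concrete, here is a minimal sketch that compares calling it with no argument against calling it with an explicit ' '. The sample string and the variable names are made up for illustration and are not part of the original text body:

```python
# Illustrative sample with a double space and a newline (not from the corpus).
sample = "To Sherlock  Holmes\nshe is always THE woman."

# No argument: split on any run of whitespace (spaces, tabs, newlines).
tokens_any = sample.split()

# Explicit ' ': split on every single space character only.
tokens_space = sample.split(' ')

print(tokens_any)
print(tokens_space)
```

With an explicit ' ', the double space produces an empty string and the newline stays inside a token, which is why the no-argument form is usually the better default for raw text.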

Let's preview a rather large segment from our list of tokens:

print(words[90:200])  # start with the first chapter, ignoring the index for now
['To', 'Sherlock', 'Holmes', 'she', 'is', 'always', 'THE', 'woman.', 'I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name.', 'In', 'his', 'eyes', 'she', 'eclipses', 'and', 'predominates', 'the', 'whole', 'of', 'her', 'sex.', 'It', 'was', 'not', 'that', 'he', 'felt', 'any', 'emotion', 'akin', 'to', 'love', 'for', 'Irene', 'Adler.', 'All', 'emotions,', 'and', 'that', 'one', 'particularly,', 'were', 'abhorrent', 'to', 'his', 'cold,', 'precise', 'but', 'admirably', 'balanced', 'mind.', 'He', 'was,', 'I', 'take', 'it,', 'the', 'most', 'perfect', 'reasoning', 'and', 'observing', 'machine', 'that', 'the', 'world', 'has', 'seen,', 'but', 'as', 'a', 'lover', 'he', 'would', 'have', 'placed', 'himself', 'in', 'a', 'false', 'position.', 'He', 'never', 'spoke', 'of', 'the', 'softer', 'passions,', 'save', 'with', 'a', 'gibe', 'and', 'a', 'sneer.', 'They', 'were', 'admirable', 'things', 'for']

The way punctuation is handled here is not desirable. Punctuation often stays attached to the neighboring word, such as the full stop at the end of Adler. and the comma that is part of emotions,. Quite often, we want words to be separated from punctuation, because words convey a lot more meaning than punctuation in most datasets.

Let's look at a shorter example:

'red-headed woman on the street'.split()

The following is the output from the preceding code:

['red-headed', 'woman', 'on', 'the', 'street']

Note that the hyphenated word red-headed was not split. This is something we may or may not want to keep. We will come back to this, so keep it in mind.

One way to tackle this punctuation challenge is to simply extract words and discard everything else. This means that we will discard all non-ASCII characters and punctuation.
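As a minimal sketch of this idea, a regular expression can pull out runs of ASCII letters and digits and drop everything else. The pattern and the helper name extract_words are assumptions chosen for illustration, not necessarily the approach taken later in the text:

```python
import re

# Assumed pattern: one or more ASCII letters or digits in a row.
# Punctuation, whitespace, and non-ASCII characters act as separators.
WORD_PATTERN = re.compile(r"[A-Za-z0-9]+")


def extract_words(text):
    """Return the ASCII alphanumeric runs found in text, in order."""
    return WORD_PATTERN.findall(text)


print(extract_words("All emotions, and that one particularly, were abhorrent"))
print(extract_words("red-headed woman on the street"))
```

Note the trade-off: the punctuation is gone, but red-headed now becomes two tokens, red and headed, and a word containing a non-ASCII letter would be cut at that letter.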