
Tokenization

Given a text sequence, tokenization is the task of breaking it into fragments, which can be words, characters, or sentences. Sometimes certain characters, such as punctuation marks, digits, and emoticons, are removed as well. These fragments are the so-called tokens used for further processing. Moreover, in computational linguistics, tokens composed of one word are also called unigrams; bigrams are composed of two consecutive words; trigrams of three consecutive words; and n-grams of n consecutive words. Here is an example of tokenization:

We can implement word-based tokenization using the word_tokenize function in NLTK. We will use the three-line input text 'I am reading a book. It is Python Machine Learning By Example, 2nd edition.' as an example, as shown in the following commands:

>>> from nltk.tokenize import word_tokenize
>>> sent = '''I am reading a book.
... It is Python Machine Learning By Example,
... 2nd edition.'''
>>> print(word_tokenize(sent))
['I', 'am', 'reading', 'a', 'book', '.', 'It', 'is', 'Python', 'Machine', 'Learning', 'By', 'Example', ',', '2nd', 'edition', '.']

Word tokens are obtained.

The word_tokenize function keeps punctuation marks and digits, and only discards whitespace and newlines.
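To tie this back to the n-grams defined earlier, NLTK's ngrams utility groups consecutive tokens into tuples. Here is a minimal sketch built on the tokens we just obtained, printing only the first three bigrams:

>>> from nltk.util import ngrams
>>> tokens = word_tokenize(sent)
>>> print(list(ngrams(tokens, 2))[:3])
[('I', 'am'), ('am', 'reading'), ('reading', 'a')]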

You might think word tokenization is simply splitting a sentence by spaces and punctuation. Here's an interesting example showing that tokenization is more complex than that:

>>> sent2 = 'I have been to U.K. and U.S.A.'
>>> print(word_tokenize(sent2))
['I', 'have', 'been', 'to', 'U.K.', 'and', 'U.S.A', '.']

The tokenizer accurately recognizes 'U.K.' and 'U.S.A' as single tokens, rather than splitting them into 'U', '.', 'K', and so on.
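For contrast, a naive split on non-word characters (an illustrative sketch, not part of the original example) would tear these abbreviations apart:

>>> import re
>>> print(re.findall(r'\w+', sent2))
['I', 'have', 'been', 'to', 'U', 'K', 'and', 'U', 'S', 'A']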

spaCy also has an outstanding tokenization feature. It uses an accurately trained model that is constantly updated. Assuming the spacy package is already installed, we can download its English model, en_core_web_sm, with the following command:

python -m spacy download en_core_web_sm

Then, we'll load the en_core_web_sm model and parse the sentence using this model:

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> tokens2 = nlp(sent2)
>>> print([token.text for token in tokens2])
['I', 'have', 'been', 'to', 'U.K.', 'and', 'U.S.A.']
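As mentioned at the start of this section, punctuation tokens are sometimes removed before further processing. With spaCy, this can be done by filtering on token attributes such as is_punct and is_space; the following is a small sketch reusing the nlp model loaded above on our first input text:

>>> doc = nlp(sent)
>>> print([t.text for t in doc if not t.is_punct and not t.is_space])
['I', 'am', 'reading', 'a', 'book', 'It', 'is', 'Python', 'Machine', 'Learning', 'By', 'Example', '2nd', 'edition']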

We can also segment text into sentences. For example, on the same input text, we can use the sent_tokenize function from NLTK, as shown in the following commands:

>>> from nltk.tokenize import sent_tokenize
>>> print(sent_tokenize(sent))
['I am reading a book.', 'It is Python Machine Learning By Example,\n2nd edition.']

Two sentence-based tokens are returned, as there are two sentences in the input text, regardless of the newline following the comma.
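For comparison, spaCy can also segment sentences through the sents attribute of a parsed document. The following is a minimal sketch reusing the nlp model loaded earlier; it should produce essentially the same segmentation, although the exact span boundaries depend on the model version:

>>> doc2 = nlp(sent)
>>> for s in doc2.sents:
...     print(s.text.strip())
I am reading a book.
It is Python Machine Learning By Example,
2nd edition.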
