- Python Machine Learning By Example
- Yuxi (Hayden) Liu
Tokenization
Given a text sequence, tokenization is the task of breaking it into fragments, which can be words, characters, or sentences. Certain characters, such as punctuation marks, digits, and emoticons, are sometimes removed. These fragments are the so-called tokens used for further processing. Moreover, in computational linguistics, tokens composed of one word are also called unigrams; bigrams are composed of two consecutive words; trigrams of three consecutive words; and n-grams of n consecutive words. Here is an example of tokenization:

We can implement word-based tokenization using the word_tokenize function in NLTK. We will use a three-line input text, 'I am reading a book.', followed by 'It is Python Machine Learning By Example,' and '2nd edition.', as an example, as shown in the following commands (if the Punkt tokenizer models are not yet available, they can be downloaded first with nltk.download('punkt')):
>>> from nltk.tokenize import word_tokenize
>>> sent = '''I am reading a book.
... It is Python Machine Learning By Example,
... 2nd edition.'''
>>> print(word_tokenize(sent))
['I', 'am', 'reading', 'a', 'book', '.', 'It', 'is', 'Python', 'Machine', 'Learning', 'By', 'Example', ',', '2nd', 'edition', '.']
Word tokens are obtained.
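To tie this back to the n-grams mentioned earlier, here is a minimal sketch that builds bigrams from these word tokens; it assumes the ngrams helper from nltk.util, which pairs up consecutive tokens:
>>> from nltk.util import ngrams
>>> tokens = word_tokenize(sent)
>>> # each bigram is a tuple of two consecutive word tokens
>>> print(list(ngrams(tokens, 2))[:3])
[('I', 'am'), ('am', 'reading'), ('reading', 'a')]
Passing 3 instead of 2 would produce trigrams in the same way.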
You might think word tokenization is simply splitting a sentence by space and punctuation. Here's an interesting example showing that tokenization is more complex than you think:
>>> sent2 = 'I have been to U.K. and U.S.A.'
>>> print(word_tokenize(sent2))
['I', 'have', 'been', 'to', 'U.K.', 'and', 'U.S.A', '.']
The tokenizer accurately recognizes the words 'U.K.' and 'U.S.A' as tokens instead of 'U' and '.' followed by 'K', for example.
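By contrast, a plain whitespace split keeps punctuation glued to its neighboring word; a quick illustration using Python's built-in str.split on the first sentence:
>>> # naive splitting on spaces, no punctuation handling
>>> print('I am reading a book.'.split(' '))
['I', 'am', 'reading', 'a', 'book.']
Here, 'book.' comes back as a single token, which is why dedicated tokenizers are preferred.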
spaCy also has an outstanding tokenization feature. It uses an accurately trained model that is constantly updated. To install its English model, en_core_web_sm, we can run the following command:
python -m spacy download en_core_web_sm
Then, we'll load the en_core_web_sm model and parse the sentence using this model:
>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> tokens2 = nlp(sent2)
>>> print([token.text for token in tokens2])
['I', 'have', 'been', 'to', 'U.K.', 'and', 'U.S.A.']
We can also segment text based on sentence. For example, on the same input text, using the sent_tokenize function from NLTK, we have the following commands:
>>> from nltk.tokenize import sent_tokenize
>>> print(sent_tokenize(sent))
['I am reading a book.', 'It is Python Machine Learning By Example,\n2nd edition.']
Two sentence-based tokens are returned, as there are two sentences in the input text, regardless of the newline after the comma.
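For comparison, spaCy can segment sentences as well. Here is a minimal sketch, reusing the nlp model loaded earlier and iterating over the doc.sents attribute of the parsed text (the exact boundaries reported depend on the model's parser):
>>> doc = nlp(sent)
>>> # sentence spans are exposed through the doc.sents generator
>>> print([s.text for s in doc.sents])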