- Natural Language Processing with Python Quick Start Guide
- Nirant Kasliwal
- 2021-06-10 18:36:39
spaCy for tokenization
spaCy loads the English model using the following .load syntax. This tells spaCy what rules, logic, weights, and other information to use:
%%time
import spacy
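# Download the model once from a shell (this is not Python code):
#   python -m spacy download en
# Note: spaCy 3.x dropped the 'en' shortcut; there the model is named 'en_core_web_sm'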
nlp = spacy.load('en')
While we use only 'en', or English, examples in this book, spaCy supports these features for more languages. I have used its multi-language tokenizer for Hindi as well, and have been satisfied with it. We now pass our text to the nlp object:
doc = nlp(text)
This creates a spaCy Doc object, doc. The object stores pre-computed linguistic features, including tokens. Some NLP libraries, especially in the Java and C ecosystems, compute linguistic features such as tokens, lemmas, and parts of speech only when the corresponding function is called. spaCy instead computes them all up front, as soon as the text is passed to nlp.
We can retrieve the tokens by iterating over doc. In the following code, we convert doc to a list and print a slice of it:
print(list(doc)[150:200])
The following is the output from the preceding code:
[whole, of, her, sex, ., It, was, not, that, he, felt,
, any, emotion, akin, to, love, for, Irene, Adler, ., All, emotions, ,, and, that,
, one, particularly, ,, were, abhorrent, to, his, cold, ,, precise, but,
, admirably, balanced, mind, ., He, was, ,, I, take, it, ,]
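As a smaller illustration of the same idea, here is a minimal sketch that reads tokens straight off a doc. It assumes a blank English pipeline created with spacy.blank("en") (tokenizer only) in place of the full model loaded above, so it runs without a model download; the sample sentence is ours, not from the book's text.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) stands in for the
# full model used in the text, so no model download is needed here.
nlp = spacy.blank("en")
doc = nlp("It was not love.")

# The tokens already exist on the doc; iterating only reads them.
tokens = [token.text for token in doc]

# is_punct is a lexical flag available on every token.
punct_flags = [token.is_punct for token in doc]

print(tokens)
print(punct_flags)

# A doc can also be sliced directly, which returns a Span.
print(doc[1:3].text)
```

Slicing the doc directly, as in the last line, avoids building a full list first when only a window of tokens is needed.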
Conveniently, spaCy tokenizes all punctuation and words. They are returned as individual tokens. Let's try the example that we didn't like earlier:
words = nlp("Isn't he coming home for dinner with the red-headed girl?")
print([token for token in words])
> [Is, n't, he, coming, home, for, dinner, with, the, red, -, headed, girl, ?]
Here are the observations:
- spaCy split Isn't correctly: Is and n't.
- red-headed was broken into three tokens: red, -, and headed. Since the punctuation information isn't lost, we can restore the original red-headed token if we want to.
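The second observation can be sketched in plain Python. This is one possible approach, not spaCy's own API: the helper rejoin_hyphenated is hypothetical, and the (text, trailing whitespace) pairs are written out by hand to mirror what [(t.text, t.whitespace_) for t in words] would give for the sentence above.

```python
def rejoin_hyphenated(pairs):
    """Merge tokens joined by a hyphen with no whitespace in between.

    pairs is a list of (text, trailing_whitespace) tuples, mirroring
    spaCy's token.text and token.whitespace_ attributes.
    """
    merged = []
    prev_text, prev_ws = None, " "
    for text, ws in pairs:
        # True when nothing separated this token from the previous one.
        touches_prev = prev_text is not None and prev_ws == ""
        if touches_prev and (text == "-" or prev_text == "-"):
            merged[-1] += text  # glue onto the previous piece
        else:
            merged.append(text)
        prev_text, prev_ws = text, ws
    return merged


# Hand-written pairs for the example sentence (an assumption that matches
# the token list printed above).
pairs = [
    ("Is", ""), ("n't", " "), ("he", " "), ("coming", " "), ("home", " "),
    ("for", " "), ("dinner", " "), ("with", " "), ("the", " "),
    ("red", ""), ("-", ""), ("headed", " "), ("girl", ""), ("?", ""),
]

restored = rejoin_hyphenated(pairs)
print(restored)
```

Only tokens adjacent to a hyphen are merged, so Is and n't stay separate even though no whitespace sits between them.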