Book title: Natural Language Processing with Python Quick Start Guide
Author: Nirant Kasliwal
Updated: 2021-06-10 18:36:39

spaCy for tokenization

spaCy loads the English model using the following .load syntax. The loaded model tells spaCy which rules, logic, weights, and other information to use:

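 %%time
 import spacy
 # !python -m spacy download en
 # Uncomment the line above to download the model from inside the notebook.
 # Note: spaCy v3 removed the 'en' shortcut; there, load the full package name, for example 'en_core_web_sm'.
 nlp = spacy.load('en')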

While we use only 'en' (English) examples in this book, spaCy supports these features for more languages. I have also used its multi-language tokenizer for Hindi and have been satisfied with the results.

The %%time syntax is a Jupyter cell magic that reports the CPU time and wall time taken to execute the cell in a Jupyter notebook.
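Outside a notebook, the same two numbers can be approximated with the standard library. The sketch below is only an illustration and not what %%time runs internally: timed is a hypothetical helper name, and it assumes time.process_time for CPU time and time.perf_counter for wall time.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, cpu_seconds, wall_seconds)."""
    cpu_start = time.process_time()   # CPU time consumed by this process
    wall_start = time.perf_counter()  # elapsed real ("wall") time
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, cpu_seconds, wall_seconds


# Example: time a small stand-in workload (hypothetical, not from the book).
result, cpu_seconds, wall_seconds = timed(sum, range(100_000))
print(f"CPU time: {cpu_seconds:.4f} s, wall time: {wall_seconds:.4f} s")
```

Wall time is usually at least as large as CPU time for single-threaded code, because it also counts time spent waiting on disk or other processes.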

doc = nlp(text)

This creates a spaCy object, doc. The object stores pre-computed linguistic features, including tokens. Some NLP libraries, especially in the Java and C ecosystems, compute linguistic features such as tokens, lemmas, and parts of speech only when the specific function for that feature is called. spaCy instead computes them all up front, when the text is passed to it.

spaCy pre-computes most linguistic features – all you have to do is retrieve them from the object.
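The difference between the two designs can be sketched in plain Python. This is a toy illustration, not spaCy's implementation: LazyText and EagerDoc are hypothetical names, and whitespace splitting stands in for real tokenization.

```python
class LazyText:
    """Computes a feature only when its method is called (call-time style)."""

    def __init__(self, text):
        self.text = text

    def tokens(self):
        # The work happens here, on every call.
        return self.text.split()


class EagerDoc:
    """Computes features once, at construction, and stores them."""

    def __init__(self, text):
        self.text = text
        # The work happens here; later access is a plain lookup.
        self.tokens = text.split()
        self.lowercased = [token.lower() for token in self.tokens]

    def __iter__(self):
        # Iterating the object yields the stored tokens,
        # loosely mirroring iteration over a spaCy doc.
        return iter(self.tokens)


lazy = LazyText("All emotions were abhorrent")
eager = EagerDoc("All emotions were abhorrent")

print(lazy.tokens())     # computed at call time
print(list(eager))       # already computed, only retrieved
print(eager.lowercased)  # a second stored feature
```

The eager design pays the full cost once, which suits workflows that read several features from the same text.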

We can retrieve the tokens by iterating over the doc object. In the following code, we convert the iterator to a list and print a slice of it:

print(list(doc)[150:200])

The following is the output from the preceding code:

[whole, of, her, sex, ., It, was, not, that, he, felt,
   , any, emotion, akin, to, love, for, Irene, Adler, ., All, emotions, ,, and, that,
   , one, particularly, ,, were, abhorrent, to, his, cold, ,, precise, but,
   , admirably, balanced, mind, ., He, was, ,, I, take, it, ,]

Conveniently, spaCy tokenizes both words and punctuation, returning each as an individual token. Let's try the example that we didn't like earlier:

words = nlp("Isn't he coming home for dinner with the red-headed girl?")
print([token for token in words])
> [Is, n't, he, coming, home, for, dinner, with, the, red, -, headed, girl, ?]

Here are the observations:

spaCy split Isn't correctly, into Is and n't.
red-headed was broken into three tokens: red, -, and headed. Since the punctuation information isn't lost, we can restore the original red-headed token if we want to.
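One way to see why the split is reversible: each token can carry a record of whether whitespace followed it. The sketch below is plain Python and assumes a list of (text, trailing whitespace) pairs, which with spaCy could be built as [(t.text, t.whitespace_) for t in words]. The pairs are written out by hand here so the example runs without a model, and merge_hyphenated is a hypothetical helper, not part of spaCy.

```python
# Hand-written stand-in for [(t.text, t.whitespace_) for t in words].
pairs = [
    ("Is", ""), ("n't", " "), ("he", " "), ("coming", " "), ("home", " "),
    ("for", " "), ("dinner", " "), ("with", " "), ("the", " "),
    ("red", ""), ("-", ""), ("headed", " "), ("girl", ""), ("?", ""),
]


def merge_hyphenated(pairs):
    """Rejoin word, '-', word runs that had no whitespace between them."""
    merged = []  # merged token texts
    i = 0
    while i < len(pairs):
        text, space = pairs[i]
        # Absorb "-" plus the following word while nothing separates them.
        while space == "" and i + 2 < len(pairs) and pairs[i + 1] == ("-", ""):
            text += "-" + pairs[i + 2][0]
            space = pairs[i + 2][1]
            i += 2
        merged.append(text)
        i += 1
    return merged


restored = merge_hyphenated(pairs)
print(restored)
```

Only hyphen-joined runs are merged, so Is and n't stay separate even though no whitespace sits between them. On a real doc, spaCy's doc.retokenize() context manager is one way to merge a span such as red, -, headed in place.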
