官术网_书友最值得收藏!

spaCy for tokenization

spaCy loads the English model using the preceding .load syntax. This tells spaCy what rules, logic, weights, and other information to use:

 %%time
import spacy
# python -m spacy download en
# uncomment above line to download the model
nlp = spacy.load('en')

While we use only 'en' or English examples in this book, spaCy supports these features for more languages. I have used their multi-language tokenizer for Hindi as well, and have been satisfied with the same:

The %%time syntax measures the CPU and Wall time at your runtime execution for the cell in a Jupyter not  ebook.
doc = nlp(text)

This creates a spaCy object, doc. The object stores pre-computed linguistic features, including tokens. Some NLP libraries, especially in the Java and C ecosystem, compute linguistic features such as tokens, lemmas, and parts of speech when that specific function is called. Instead, spaCy computes them all at initialization when the text is passed to it.

spaCy pre-computes most linguistic features all you have to do is retrieve them from the object.

We can retrieve them by calling the object iterator. In the following code, we call the iterator and list it:

print(list(doc)[150:200])

The following is the output from the preceding code:

[whole, of, her, sex, ., It, was, not, that, he, felt,
, any, emotion, akin, to, love, for, Irene, Adler, ., All, emotions, ,, and, that,
, one, particularly, ,, were, abhorrent, to, his, cold, ,, precise, but,
, admirably, balanced, mind, ., He, was, ,, I, take, it, ,]

Conveniently, spaCy tokenizes all punctuation and words. They are returned as individual tokens. Let's try the example that we didn't like earlier:

words = nlp("Isn't he coming home for dinner with the red-headed girl?")
print([token for token in words])
> [Is, n't, he, coming, home, for, dinner, with, the, red, -, headed, girl, ?]

Here are the observations:

  • spaCy got the Isn't split correct: Is and n't.
  • red-headed was broken into three tokens: red, -, and headed. Since the punctuation information isn't lost, we can restore the original red-headed token if we want to.
主站蜘蛛池模板: 保康县| 德令哈市| 安宁市| 蒲江县| 皋兰县| 长乐市| 海盐县| 奉节县| 澄江县| 德令哈市| 炉霍县| 龙山县| 固阳县| 鄂托克前旗| 津市市| 桦川县| 金昌市| 南川市| 周口市| 鄂温| 大港区| 洱源县| 江门市| 普兰县| 光泽县| 益阳市| 武功县| 崇文区| 上林县| 分宜县| 错那县| 东山县| 阳江市| 绍兴市| 光泽县| 乌拉特前旗| 稻城县| 青冈县| 辉县市| 湟中县| 黄石市|