
How does the spaCy tokenizer work?

The simplest explanation comes from the spaCy docs (spacy-101) themselves.

First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

  • Does the substring match a tokenizer exception rule? For example, don't does not contain whitespace, but should be split into two tokens, do and n't, while U.K. should always remain one token.
  • Can a prefix, suffix, or infix be split off? For example, punctuation such as commas, periods, hyphens, or quotes is split off into separate tokens, as the sketch after this list shows.
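
To see both checks in action, here is a minimal sketch using a blank English pipeline (assuming spaCy is installed; spacy.blank("en") loads only the tokenizer, so no trained model download is needed). The sample sentence and the expected output are illustrative.

```python
import spacy

# A blank English pipeline still carries the rule-based tokenizer,
# including its exception rules and prefix/suffix/infix patterns.
nlp = spacy.blank("en")

doc = nlp("We don't live in the U.K., do we?")
print([token.text for token in doc])
# Expected (illustrative) output:
# ['We', 'do', "n't", 'live', 'in', 'the', 'U.K.', ',', 'do', 'we', '?']
```

Note how the exception rule splits don't into do and n't while keeping U.K. as a single token, and how the trailing comma and question mark are split off as suffixes.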