- Natural Language Processing with Python Quick Start Guide
- Nirant Kasliwal
- 101 words
- 2021-06-10 18:36:39
How does the spaCy tokenizer work?
The simplest explanation comes from the spaCy documentation itself (spacy-101).
First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:
- Does the substring match a tokenizer exception rule? For example, "don't" does not contain whitespace, but should be split into two tokens, "do" and "n't", while "U.K." should always remain one token.
- Can a prefix, suffix, or infix be split off? For example, punctuation such as commas, periods, hyphens, or quotes.
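
The two checks above can be sketched in a few lines of plain Python. This is a toy illustration only, not spaCy's actual implementation: the names toy_tokenize and tokenize_chunk, and the small EXCEPTIONS, PREFIXES, SUFFIXES, and INFIX tables, are hypothetical and far smaller than the language-specific rules spaCy ships with.

```python
import re

# Hypothetical, deliberately tiny rule tables (spaCy's real ones are much larger).
EXCEPTIONS = {"don't": ["do", "n't"], "U.K.": ["U.K."]}
PREFIXES = ('"', "'", "(")
SUFFIXES = ('"', "'", ")", ",", ".", "!", "?")
INFIX = re.compile(r"(-)")  # the capturing group keeps the hyphen as its own piece


def tokenize_chunk(chunk):
    """Tokenize one whitespace-delimited substring."""
    leading, trailing = [], []
    while chunk:
        # Check 1: an exception rule takes priority over affix splitting.
        if chunk in EXCEPTIONS:
            return leading + EXCEPTIONS[chunk] + trailing[::-1]
        # Check 2: peel off one prefix or suffix, then run both checks again.
        if chunk.startswith(PREFIXES):
            leading.append(chunk[0])
            chunk = chunk[1:]
        elif chunk.endswith(SUFFIXES):
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        else:
            break
    # Whatever remains is split on infixes such as hyphens.
    middle = [piece for piece in INFIX.split(chunk) if piece]
    return leading + middle + trailing[::-1]


def toy_tokenize(text):
    """Split on spaces, then tokenize each substring from left to right."""
    tokens = []
    for chunk in text.split(" "):
        tokens.extend(tokenize_chunk(chunk))
    return tokens


print(toy_tokenize("We don't ship to the U.K., sadly."))
```

Here the substring "U.K.," first loses its trailing comma as a suffix; the remainder "U.K." then matches an exception rule and stays whole, while "don't" is split into "do" and "n't" by its exception rule. In spaCy itself, recent versions expose nlp.tokenizer.explain(text), which reports the rule or pattern that produced each token.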
