
How does the spaCy tokenizer work?

The simplest explanation comes from the spaCy docs (spaCy 101) themselves.

First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

  • Does the substring match a tokenizer exception rule? For example, don't does not contain whitespace, but should be split into two tokens, do and n't, while U.K. should always remain one token.
  • Can a prefix, suffix, or infix be split off? For example, punctuation such as commas, periods, hyphens, or quotes can be split off from the remaining substring, as the sketch after this list shows.
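
Both checks are easy to see in practice. The following is a minimal sketch using a blank English pipeline (no trained model needed, since the tokenizer rules are part of spaCy's English language data); the exact token list shown in the comment is the expected behavior, but may vary slightly across spaCy versions:

import spacy

# A blank English pipeline contains only the rule-based tokenizer.
nlp = spacy.blank("en")

doc = nlp("We don't live in the U.K., unfortunately.")
print([token.text for token in doc])
# Expected (roughly): ['We', 'do', "n't", 'live', 'in', 'the',
#                      'U.K.', ',', 'unfortunately', '.']
# "don't" is split by an exception rule into "do" and "n't",
# "U.K." stays one token, and trailing punctuation is split off as a suffix.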