
How does the spaCy tokenizer work?

The simplest explanation comes from the spaCy docs (spacy-101) themselves.

First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

  • Does the substring match a tokenizer exception rule? For example, don't does not contain whitespace, but should be split into two tokens, do and n't, while U.K. should always remain one token.
  • Can a prefix, suffix, or infix be split off? For example, punctuation such as commas, periods, hyphens, or quotes is split off into separate tokens, as the sketch after this list shows.
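
To see both checks in action, here is a minimal sketch using a blank English pipeline (assuming spaCy is installed; spacy.blank("en") loads only the tokenizer, so no trained model download is needed). The sample sentence and the expected output are illustrative.

```python
import spacy

# A blank English pipeline still carries the rule-based tokenizer,
# including its exception rules and prefix/suffix/infix patterns.
nlp = spacy.blank("en")

doc = nlp("We don't live in the U.K., do we?")
print([token.text for token in doc])
# Expected (illustrative) output:
# ['We', 'do', "n't", 'live', 'in', 'the', 'U.K.', ',', 'do', 'we', '?']
```

Note how the exception rule splits don't into do and n't while keeping U.K. as a single token, and how the trailing comma and question mark are split off as suffixes.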