Bread and butter – most common tasks

There are several well-known text cleaning ideas. They have all made their way into today's most popular tools, such as NLTK, Stanford CoreNLP, and spaCy. I like spaCy for two main reasons:

  • It's an industry-grade NLP library, unlike NLTK, which is mainly meant for teaching.
  • It offers a good speed-to-performance trade-off. spaCy is written in Cython, which gives it C-like performance while exposing a Python API.

spaCy is actively maintained and developed, and incorporates the best methods available for most challenges.

By the end of this section, you will be able to do the following:

  • Understand tokenization and do it manually yourself using spaCy
  • Understand why stop word removal and case standardization work, with spaCy examples
  • Differentiate between stemming and lemmatization, with spaCy lemmatization examples
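
As a rough preview of the first two goals, here is a minimal sketch that assumes spaCy v3 is installed. It uses a blank English pipeline, so no trained model is needed; the sample sentence and variable names are made up for illustration. Lemmatization is left out here because it typically needs a trained pipeline or lookup tables.

```python
import spacy

# A blank English pipeline: tokenizer and lexical attributes only,
# so no trained model download is needed for this sketch.
nlp = spacy.blank("en")

text = "The cats are sitting on the mat."  # hypothetical sample sentence
doc = nlp(text)

# Tokenization: spaCy splits the raw string into Token objects.
tokens = [token.text for token in doc]

# Stop word removal and case standardization: drop stop words and
# punctuation, then lowercase whatever remains.
content_words = [
    token.lower_
    for token in doc
    if not token.is_stop and not token.is_punct
]

print(tokens)
print(content_words)
```

The filtering relies only on per-token attributes (`is_stop`, `is_punct`, `lower_`), which is why a blank pipeline is enough for this step.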