
Bread and butter – most common tasks

There are several well-known text cleaning techniques, and all of them have made their way into today's most popular tools, such as NLTK, Stanford CoreNLP, and spaCy. I like spaCy for two main reasons:

  • It's an industry-grade NLP library, unlike NLTK, which is mainly meant for teaching.
  • It has a good speed-to-performance trade-off. spaCy is written in Cython, which gives it C-like performance while you still write Python code.

spaCy is actively maintained and developed, and incorporates the best methods available for most challenges.

By the end of this section, you will be able to do the following:

  • Understand tokenization and do it manually using spaCy
  • Understand why stop word removal and case standardization work, with spaCy examples
  • Differentiate between stemming and lemmatization, with spaCy lemmatization examples
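
To make these goals concrete before bringing in spaCy, here is a minimal standard-library sketch of tokenization, case standardization, and stop word removal. It is a toy illustration, not spaCy's approach: the regular expression, the `tokenize` and `normalize` helpers, and the tiny `STOP_WORDS` set are all assumptions made up for this example.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only;
# real libraries such as spaCy ship far larger lists.
STOP_WORDS = {"the", "is", "a", "an", "over", "of", "and", "to"}


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def normalize(tokens):
    """Lowercase every token, then drop punctuation and stop words."""
    lowered = [token.lower() for token in tokens]
    return [t for t in lowered if t.isalnum() and t not in STOP_WORDS]


tokens = tokenize("The quick brown fox is jumping over the lazy dog.")
content_words = normalize(tokens)

print(tokens)
print(content_words)
```

Note that "jumping" survives unchanged: reducing it to "jump" is the job of stemming or lemmatization, which this sketch does not attempt.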