
Bread and butter – most common tasks

There are several well-known text cleaning techniques, and all of them have made their way into today's most popular tools, such as NLTK, Stanford CoreNLP, and spaCy. I like spaCy for two main reasons:

  • It's an industry-grade NLP library, unlike NLTK, which is mainly meant for teaching.
  • It offers a good trade-off between speed and accuracy. spaCy is written in Cython, which gives it C-like speed behind a Python API.

spaCy is actively maintained and developed, and incorporates the best methods available for most challenges.

By the end of this section, you will be able to do the following:

  • Understand tokenization and do it manually yourself using spaCy
  • Understand why stop word removal and case standardization work, with spaCy examples
  • Differentiate between stemming and lemmatization, with spaCy lemmatization examples
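
As a preview of the first two goals, here is a minimal sketch of tokenization, case standardization, and stop word removal. It assumes spaCy is installed and uses `spacy.blank("en")`, a tokenizer-only English pipeline, so that no trained model has to be downloaded. The sample sentence and the variable names are illustrative and not taken from the text.

```python
import spacy

# Assumption: spaCy is installed. spacy.blank("en") builds a tokenizer-only
# English pipeline, so this sketch needs no trained model download.
nlp = spacy.blank("en")

# Illustrative sentence, not taken from the text.
doc = nlp("The quick brown foxes are jumping over the lazy dogs.")

# Tokenization: a Doc is a sequence of Token objects.
tokens = [token.text for token in doc]

# Case standardization: lower_ holds the lowercased form of each token.
lowered = [token.lower_ for token in doc]

# Stop word removal: drop tokens flagged as stop words or punctuation.
content_words = [
    token.lower_ for token in doc
    if not token.is_stop and not token.is_punct
]

print(tokens)
print(lowered)
print(content_words)
```

Lemmatization is left out of this sketch on purpose: a blank pipeline has no lemmatizer component, so `token.lemma_` only becomes useful once a trained pipeline is loaded.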