- Natural Language Processing with Python Quick Start Guide
- Nirant Kasliwal
Tokenization
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
Here is an example of tokenization:
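The original example did not survive extraction; the following minimal sketch illustrates the idea described above, with an illustrative sentence of our own (not the book's), splitting text into tokens while discarding punctuation:

```python
import re

# Illustrative sentence (an assumption, not the book's example text).
text = "Friends, Romans, Countrymen, lend me your ears!"

# Keep runs of word characters as tokens; punctuation is thrown away.
tokens = re.findall(r"\w+", text)
print(tokens)
# → ['Friends', 'Romans', 'Countrymen', 'lend', 'me', 'your', 'ears']
```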
It is, in fact, sometimes useful to distinguish between tokens and words. But here, for ease of understanding, we will use them interchangeably.
We will convert the raw text into a list of words. This should preserve the original ordering of the text.
There are several ways to do this, so let's try a few of them out. We will program two methods from scratch to build our intuition, and then check how spaCy handles tokenization.
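The two from-scratch approaches and the spaCy comparison might be sketched as follows. Which two methods the book actually builds is an assumption here; naive whitespace splitting and a regex-based tokenizer are common starting points. Note that both preserve the original word order, as required above:

```python
import re

text = "We will convert the raw text into a list of words."

# Method 1: naive whitespace split. Fast, but punctuation
# stays attached to neighboring words ("words.").
whitespace_tokens = text.split()

# Method 2: regex tokenization. Word-character runs become tokens,
# and each punctuation mark becomes a token of its own.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)

# For comparison, spaCy's tokenizer (requires a one-time setup:
#   pip install spacy && python -m spacy download en_core_web_sm):
# import spacy
# nlp = spacy.load("en_core_web_sm")
# spacy_tokens = [token.text for token in nlp(text)]
```

Both from-scratch methods return tokens in reading order; the difference is only in how punctuation is handled, which is exactly the intuition the spaCy comparison builds on.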