- Natural Language Processing with Python Quick Start Guide
- Nirant Kasliwal
- 2021-06-10 18:36:38
Exploring the loaded data
How many unique characters can we see?
For reference, ASCII defines 128 characters (code points 0 to 127), so if the text were pure ASCII we would expect to see at most 128 unique characters:
unique_chars = list(set(text))
unique_chars.sort()
print(unique_chars)
print(f'There are {len(unique_chars)} unique characters, including both ASCII and Unicode characters')
The preceding code returns the following output:
['\n', ' ', '!', '"', '$', '%', '&', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'à', 'â', 'è', 'é']
There are 85 unique characters, including both ASCII and Unicode characters
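The character list mixes plain ASCII with a few accented letters, and it can help to separate the two groups explicitly. The following is a minimal sketch, not taken from the book: `sample_text` is a hypothetical stand-in for the `text` variable loaded earlier, and the names `ascii_chars` and `non_ascii_chars` are assumptions made for illustration.

```python
# Hypothetical sample standing in for the `text` loaded earlier.
sample_text = "Vis-à-vis the pâté, he said: café!\n"

# Same idea as above: collect the unique characters in sorted order.
unique_chars = sorted(set(sample_text))

# A character is ASCII when its code point is in the range 0 to 127.
ascii_chars = [c for c in unique_chars if ord(c) < 128]
non_ascii_chars = [c for c in unique_chars if ord(c) >= 128]

print(f'{len(ascii_chars)} ASCII and {len(non_ascii_chars)} non-ASCII characters')
print(non_ascii_chars)
```

Running the same split on the real `text` should isolate just the accented letters at the end of the list shown above.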
Machine learning models usually expect text as a sequence of individual tokens, such as single words, rather than as one long string. The next section explains what this means.
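As a rough preview of what "individual tokens" means, the sketch below splits a sentence on whitespace. This is only the most naive approach and is not necessarily the method the book goes on to use; `sample_sentence` is a made-up example.

```python
# Hypothetical example sentence, not taken from the loaded data.
sample_sentence = "Holmes sat down, and lit his pipe."

# str.split() with no arguments splits on any run of whitespace.
tokens = sample_sentence.split()

print(tokens)
# ['Holmes', 'sat', 'down,', 'and', 'lit', 'his', 'pipe.']
```

Note that punctuation stays attached to the neighbouring word ('down,' and 'pipe.'), which is one reason a plain whitespace split is rarely enough on its own.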