- Natural Language Processing with Python Quick Start Guide
- Nirant Kasliwal
- 2021-06-10 18:36:38
Exploring the loaded data
How many unique characters can we see?
For reference, ASCII defines 128 characters (code points 0 to 127), so a purely ASCII text would contain at most 128 unique characters:
unique_chars = list(set(text))
unique_chars.sort()
print(unique_chars)
print(f'There are {len(unique_chars)} unique characters, including both ASCII and Unicode characters')
The preceding code prints the following output:
['\n', ' ', '!', '"', '$', '%', '&', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'à', 'â', 'è', 'é']
There are 85 unique characters, including both ASCII and Unicode characters
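To see which of the unique characters fall outside ASCII, one option is to split them by code point. The following is a minimal sketch, not part of the original walkthrough: the helper name split_by_ascii and the sample string are assumptions made for illustration, and in practice you would pass the loaded text instead.

```python
def split_by_ascii(text):
    """Return (ascii_chars, non_ascii_chars) as sorted lists of unique characters.

    Hypothetical helper for illustration; it is not defined in the book.
    """
    unique_chars = sorted(set(text))
    # ASCII covers code points 0 through 127, so ord(ch) < 128 identifies it.
    ascii_chars = [ch for ch in unique_chars if ord(ch) < 128]
    non_ascii_chars = [ch for ch in unique_chars if ord(ch) >= 128]
    return ascii_chars, non_ascii_chars


# Assumed sample string; replace it with the loaded text to inspect real data.
sample = "Voilà, a café déjà vu!\n"
ascii_chars, non_ascii_chars = split_by_ascii(sample)
print(f'{len(ascii_chars)} ASCII characters: {ascii_chars}')
print(f'{len(non_ascii_chars)} non-ASCII characters: {non_ascii_chars}')
```

Applied to the loaded text, the second list should contain only the accented letters that appear at the end of the sorted output above.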
For our machine learning models, we often need the text split into individual tokens, usually single words. Let's explain what this means in the next section.