Exploring the loaded data

How many unique characters can we see?

For reference, ASCII defines 128 characters (code points 0 to 127), so a text that was pure ASCII would contain at most 128 unique characters:

# `text` is the string loaded in the previous step
unique_chars = sorted(set(text))
print(unique_chars)
print(f'There are {len(unique_chars)} unique characters, including both ASCII and Unicode characters')

The preceding code prints the following output:

   ['\n', ' ', '!', '"', '$', '%', '&', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'à', 'â', 'è', 'é']
There are 85 unique characters, including both ASCII and Unicode characters
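
The output above mixes plain ASCII characters with a few accented letters. To see which characters fall outside ASCII, one option is to split the unique characters by code point. The following is only a minimal sketch: the helper name split_by_ascii and the sample string are hypothetical stand-ins for illustration, not part of the original code, and in practice you would pass set(text) instead.

```python
def split_by_ascii(chars):
    """Return (ascii_chars, non_ascii_chars), each sorted by code point."""
    # ASCII covers code points 0 to 127 inclusive.
    ascii_chars = sorted(c for c in chars if ord(c) < 128)
    non_ascii_chars = sorted(c for c in chars if ord(c) >= 128)
    return ascii_chars, non_ascii_chars


# Hypothetical sample standing in for the `text` loaded earlier.
sample_text = "Voil\u00e0, the caf\u00e9 is open!\n"

ascii_chars, non_ascii_chars = split_by_ascii(set(sample_text))
print(f"{len(ascii_chars)} ASCII characters, {len(non_ascii_chars)} non-ASCII characters")
print(non_ascii_chars)
```

Sorting by code point keeps the result consistent with the sorted list printed earlier, so the non-ASCII characters appear in the same relative order as they do at the end of that list.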

Our machine learning models usually need the text broken up into individual tokens, such as single words. The next section explains what this means.
