Exploring the loaded data

How many unique characters can we see?

For reference, ASCII defines 128 characters (code points 0 to 127), so if the text were pure ASCII we would expect to see at most 128 unique characters:

# Collect the distinct characters in the text and sort them for display
unique_chars = sorted(set(text))
print(unique_chars)
print(f'There are {len(unique_chars)} unique characters, including both ASCII and Unicode characters')

The preceding code returns the following output:

   ['\n', ' ', '!', '"', '$', '%', '&', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'à', 'a', 'è', 'é']
There are 85 unique characters, including both ASCII and Unicode characters
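
The output includes a few accented letters that fall outside ASCII. As a minimal sketch of how one might separate the two groups, the following uses a made-up sample string and a hypothetical helper named split_by_ascii; neither comes from the original code, and in practice you would pass the loaded text instead:

```python
# Hypothetical sample; in the chapter, `text` holds the loaded dataset.
sample_text = "Caf\u00e9 \u00e0 la carte, d\u00e8s 9:00!"


def split_by_ascii(text):
    """Return (ascii_chars, non_ascii_chars) as sorted lists of unique characters."""
    unique_chars = sorted(set(text))
    # ASCII covers code points 0 to 127; anything above is non-ASCII Unicode.
    ascii_chars = [c for c in unique_chars if ord(c) < 128]
    non_ascii_chars = [c for c in unique_chars if ord(c) >= 128]
    return ascii_chars, non_ascii_chars


ascii_chars, non_ascii_chars = split_by_ascii(sample_text)
print(f'{len(ascii_chars)} ASCII and {len(non_ascii_chars)} non-ASCII unique characters')
print(non_ascii_chars)
```

Checking ord(c) < 128 is one simple way to draw the line; str.isascii() (Python 3.7+) would work as well.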

Machine learning models usually need text to be split into individual tokens, such as single words. The next section explains what this means.
