Title: Natural Language Processing with Python Quick Start Guide
Author: Nirant Kasliwal
Updated: 2021-06-10 18:36:38

Exploring the loaded data

How many unique characters can we see?

For reference, ASCII defines 128 characters (code points 0 to 127), so if the text were pure ASCII, we would expect to see at most 128 unique characters:

# Collect the distinct characters in the text, sorted by code point
unique_chars = sorted(set(text))
print(unique_chars)
print(f'There are {len(unique_chars)} unique characters, including both ASCII and Unicode characters')

The preceding code returns the following output:

   ['\n', ' ', '!', '"', '$', '%', '&', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'à', 'â', 'è', 'é']
   There are 85 unique characters, including both ASCII and Unicode characters
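
To see which of the unique characters fall outside ASCII, one option is to split them by code point. The following is a minimal sketch, not code from the book: `sample_text` is a hypothetical stand-in for the `text` variable loaded earlier, and the threshold of 128 assumes the standard 7-bit ASCII range.

```python
# Hypothetical sample standing in for the `text` loaded earlier
sample_text = "Voilà, a café\nwith 2 crèmes!"

# Distinct characters, sorted by code point
unique_chars = sorted(set(sample_text))

# ASCII covers code points 0 to 127; everything else is non-ASCII
ascii_chars = [ch for ch in unique_chars if ord(ch) < 128]
non_ascii_chars = [ch for ch in unique_chars if ord(ch) >= 128]

print(f'{len(ascii_chars)} ASCII characters: {ascii_chars}')
print(f'{len(non_ascii_chars)} non-ASCII characters: {non_ascii_chars}')
```

Running the same split on the full text would show how many of its unique characters are ASCII and how many are accented letters.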

Our machine learning models usually need the text split into individual tokens, such as single words. The next section explains what this means.
