
Tokenization

When dealing with natural language sentences, the first step is typically to tokenize the sentence. Given a sentence such as The child was learning a new word and was using it excessively. "Shan't!", she cried, we need to split it into the components that make it up. We call each component a token, hence the name of the process: tokenization. One possible tokenization method is a simple strings.Split(a, " ").

Here's a simple program:

package main

import (
    "fmt"
    "strings"
)

func main() {
    a := "The child was learning a new word and was using it excessively. \"shan't!\", she cried"
    dict := make(map[string]struct{})
    words := strings.Split(a, " ")
    for _, word := range words {
        fmt.Println(word)
        dict[word] = struct{}{} // add the word to the set of words already seen before
    }
}

This is the output we will get:

The
child
was
learning
a
new
word
and
was
using
it
excessively.
"shan't!",
she
cried

Now think about this in the context of adding words to a dictionary to learn. Let's say we want to use the same set of English words to form a new sentence: she shan't be learning excessively. (Forgive the poor implications of the sentence.) We add it to our program and check whether each of its words shows up in the dictionary:

package main

import (
    "fmt"
    "strings"
)

func main() {
    a := "The child was learning a new word and was using it excessively. \"shan't!\", she cried"
    dict := make(map[string]struct{})
    words := strings.Split(a, " ")
    for _, word := range words {
        dict[word] = struct{}{} // add the word to the set of words already seen before
    }

    b := "she shan't be learning excessively."
    words = strings.Split(b, " ")
    for _, word := range words {
        _, ok := dict[word]
        fmt.Printf("Word: %v - %v\n", word, ok)
    }
}

This leads to the following result:

Word: she - true
Word: shan't - false
Word: be - false
Word: learning - true
Word: excessively. - true

A superior tokenization algorithm would yield a result as follows:

The
child
was
learning
a
new
word
and
was
using
it
excessively
.
"
sha
n't
!
"
,
she
cried

One thing to note is that the symbols and punctuation are now tokens. Another is that shan't is now split into two tokens: sha and n't. The word shan't is a contraction of shall and not; therefore, it is tokenized into two words. This tokenization strategy is specific to English. Another peculiarity of English is that words are separated by a boundary marker: the humble space. In languages with no word boundary markers, such as Chinese or Japanese, tokenization becomes significantly more complicated. Add to that languages such as Vietnamese, where there are markers for syllable boundaries but not word boundaries, and you have a very complicated tokenizer on your hands.
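
To make the idea concrete, here is a minimal sketch of such a tokenizer (not the one that produced the output above): it emits each run of letters or digits as a word and each punctuation mark as its own token. Note that it does not split contractions the way the sha and n't example does; shan't comes out as shan, ', and t.

package main

import (
    "fmt"
    "unicode"
)

// tokenize is a minimal sketch: runs of letters or digits become word tokens,
// each punctuation or symbol rune becomes a token of its own, and whitespace
// only separates tokens. Contractions are not handled specially.
func tokenize(s string) []string {
    var tokens []string
    var word []rune
    flush := func() {
        if len(word) > 0 {
            tokens = append(tokens, string(word))
            word = word[:0]
        }
    }
    for _, r := range s {
        switch {
        case unicode.IsLetter(r) || unicode.IsDigit(r):
            word = append(word, r)
        case unicode.IsSpace(r):
            flush()
        default: // punctuation and symbols become their own tokens
            flush()
            tokens = append(tokens, string(r))
        }
    }
    flush()
    return tokens
}

func main() {
    a := "The child was learning a new word and was using it excessively. \"shan't!\", she cried"
    for _, tok := range tokenize(a) {
        fmt.Println(tok)
    }
}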

The details of a good tokenization algorithm are fairly complicated, and tokenization is worthy of a book of its own, so we shan't cover it here.

The best part about the LingSpam corpus is that the tokenization has already been done. One caveat is that compound words and contractions are not split into separate tokens as in the shan't example; they are treated as single words. For the purposes of a spam classifier, this is fine. However, when working on other kinds of NLP projects, the reader might want to consider better tokenization strategies.

A final note about tokenization strategies: English is not a particularly regular language. Despite this, regular expressions are useful for small datasets. For this project, you may get away with the following regular expression:
const re = `([A-Z])(\.[A-Z])+\.?|\w+(-\w+)*|\$?\d+(\.\d+)?%?|\.\.\.|[][.,;"'?():-_` + "`]"
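
Here is a minimal sketch of how this expression might be used with Go's regexp package; the sentence b is reused from the earlier example. Note that this expression tokenizes shan't as shan, ', and t rather than as sha and n't.

package main

import (
    "fmt"
    "regexp"
)

const re = `([A-Z])(\.[A-Z])+\.?|\w+(-\w+)*|\$?\d+(\.\d+)?%?|\.\.\.|[][.,;"'?():-_` + "`]"

func main() {
    tokenizer := regexp.MustCompile(re)
    b := "she shan't be learning excessively."
    // FindAllString returns every non-overlapping match in order,
    // which here serves as the list of tokens.
    fmt.Println(tokenizer.FindAllString(b, -1))
}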