
Normalizing and lemmatizing

In the previous section, I wrote that all the words in the second example, she shan't be excessively learned, are already in the dictionary from the first sentence. The observant reader might note that the word be isn't actually in the dictionary. From a linguistic point of view, though, my claim isn't necessarily false. The word be is the root word of is, and was is the past tense of is. The idea here is that instead of adding words to the dictionary exactly as they appear, we should add their root words. This is called lemmatization. Continuing from the previous example, the following are the lemmatized words from the first sentence:

the
child
be
learn
a
new
word
and
be
use
it
excessively
shall
not
she
cry

Again, I would like to point out an inconsistency that will be immediately obvious to the observant reader. Specifically, the word excessively has the root word excess. So why was excessively listed? The answer is that lemmatization isn't exactly a straightforward lookup of the root word in a dictionary. Often, in complex NLP-related tasks, words have to be lemmatized according to the context they appear in. That's beyond the scope of this chapter because, as before, it's a fairly involved topic that could span an entire chapter of a book on NLP preprocessing.
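To make the idea concrete, here is a minimal sketch of the naive approach: a context-free lookup from an inflected form to its root. The lemmas table and the lemmatize function are hypothetical names I made up for this illustration, and the table is hand-written to cover only a few of the words in our examples. It is not how a real lemmatizer works.

```go
package main

import (
	"fmt"
	"strings"
)

// lemmas is a hypothetical, hand-written lookup table. It covers only a
// handful of inflected forms and is for illustration, not a real lexicon.
var lemmas = map[string]string{
	"is":       "be",
	"was":      "be",
	"learning": "learn",
	"learned":  "learn",
	"using":    "use",
	"cried":    "cry",
}

// lemmatize lowercases the word and returns its root if the table has one.
// Words that are not in the table are returned unchanged (but lowercased).
func lemmatize(word string) string {
	w := strings.ToLower(word)
	if lemma, ok := lemmas[w]; ok {
		return lemma
	}
	return w
}

func main() {
	for _, w := range []string{"was", "learning", "using", "cried", "excessively"} {
		fmt.Printf("%s -> %s\n", w, lemmatize(w))
	}
}
```

Note what a table like this cannot do: it maps learned to learn regardless of whether the word is used as a verb or as an adjective, and it has nothing to say about excessively at all. That is precisely the context problem described above.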

So, let's go back to the topic of adding a word to a dictionary. Another useful thing to do is to normalize the words. In English, this typically means lowercasing the text, replacing Unicode combining character sequences with their composed equivalents, and the like. In the Go ecosystem, there is an extended standard library package that handles the Unicode part of this: golang.org/x/text/unicode/norm. In particular, if we are going to work on real datasets, I personally prefer the NFC normalization form. A good resource on string normalization is the Go blog post at https://blog.golang.org/normalization. The content is not specific to Go, and is a good guide to string normalization in general.

The LingSpam corpus comes with variants that are normalized (by lowercasing and NFC) and lemmatized. They can be found in the lemm and lemm_stop variants of the corpus.
