PoS tagging

We can apply an off-the-shelf tagger from NLTK or combine multiple taggers to customize the tagging process. The built-in tagging function, pos_tag, is easy to use directly, as in pos_tag(input_tokens). Behind the scenes, however, the result is a prediction from a pre-built supervised learning model, which was trained on a large corpus of correctly tagged words.
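To make the idea of a tagger learned from a tagged corpus concrete, here is a minimal sketch in plain Python. It is not how pos_tag works internally, since NLTK's real tagger uses a far richer model and feature set. The tiny training corpus and the names train_unigram_tagger and tag are hypothetical, introduced only for illustration. The sketch learns the most frequent tag for each word and backs off to a default tag for unseen words, which is the simplest form of combining two taggers.

```python
from collections import Counter, defaultdict

# Tiny hand-made training corpus (hypothetical data, for illustration only).
train_sents = [
    [("I", "PRP"), ("am", "VBP"), ("reading", "VBG"), ("a", "DT"), ("book", "NN"), (".", ".")],
    [("I", "PRP"), ("book", "VBP"), ("a", "DT"), ("flight", "NN"), (".", ".")],
    [("a", "DT"), ("book", "NN"), ("is", "VBZ"), ("a", "DT"), ("gift", "NN"), (".", ".")],
]


def train_unigram_tagger(tagged_sents):
    """Learn the most frequent tag for each (lowercased) word."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(tokens, model, default_tag="NN"):
    """Tag each token with its learned tag, backing off to default_tag."""
    return [(tok, model.get(tok.lower(), default_tag)) for tok in tokens]


model = train_unigram_tagger(train_sents)
tagged = tag(["I", "am", "reading", "a", "novel", "."], model)
print(tagged)
```

Here "book" is tagged NN because it appears as a noun twice and as a verb once in the toy corpus, while the unseen word "novel" falls back to the default tag. A real tagger also looks at the surrounding context, which is what lets it separate the noun and verb uses of the same word.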

Reusing an earlier example, we can perform PoS tagging as follows:

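>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> # Sentence from the earlier example (reconstructed here from the tokens below)
>>> sent = 'I am reading a book. It is Python Machine Learning By Example, 2nd edition.'
>>> tokens = word_tokenize(sent)
>>> print(nltk.pos_tag(tokens))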
[('I', 'PRP'), ('am', 'VBP'), ('reading', 'VBG'), ('a', 'DT'), ('book', 'NN'), ('.', '.'), ('It', 'PRP'), ('is', 'VBZ'), ('Python', 'NNP'), ('Machine', 'NNP'), ('Learning', 'NNP'), ('By', 'IN'), ('Example', 'NNP'), (',', ','), ('2nd', 'CD'), ('edition', 'NN'), ('.', '.')]

Each token is returned paired with its PoS tag. The tags come from the Penn Treebank tagset, and we can check the meaning of any tag using the help function. Looking up PRP and VBP, for example, gives us the following output:

>>> nltk.help.upenn_tagset('PRP')
PRP: pronoun, personal
hers herself him himself hisself it itself me myself one oneself ours ourselves ownself self she thee theirs them themselves they thou thy us
>>> nltk.help.upenn_tagset('VBP')
VBP: verb, present tense, not 3rd person singular
predominate wrap resort sue twist spill cure lengthen brush terminate appear tend stray glisten obtain comprise detest tease attract emphasize mold postpone sever return wag ...
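Because the tagger returns an ordinary list of (token, tag) tuples, standard Python tools are enough to work with the result. The following sketch copies the tagged output shown earlier into a literal so that it runs without NLTK; the variable names tag_counts and proper_nouns are chosen here for illustration.

```python
from collections import Counter

# Tagged output copied from the pos_tag example above.
tagged = [('I', 'PRP'), ('am', 'VBP'), ('reading', 'VBG'), ('a', 'DT'),
          ('book', 'NN'), ('.', '.'), ('It', 'PRP'), ('is', 'VBZ'),
          ('Python', 'NNP'), ('Machine', 'NNP'), ('Learning', 'NNP'),
          ('By', 'IN'), ('Example', 'NNP'), (',', ','), ('2nd', 'CD'),
          ('edition', 'NN'), ('.', '.')]

# Count how often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

# Keep only the tokens tagged as singular proper nouns (NNP).
proper_nouns = [token for token, tag in tagged if tag == 'NNP']

print(tag_counts['NNP'], tag_counts['NN'])
print(proper_nouns)
```

Filtering by tag in this way is a common first step, for instance to keep only nouns or proper nouns before further text processing.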

In spaCy, getting a PoS tag is also easy. Each token object parsed from an input sentence has an attribute called pos_, which holds the coarse-grained tag we are looking for:

>>> print([(token.text, token.pos_) for token in tokens2])
[('I', 'PRON'), ('have', 'VERB'), ('been', 'VERB'), ('to', 'ADP'), ('U.K.', 'PROPN'), ('and', 'CCONJ'), ('U.S.A.', 'PROPN')]