Stopwords

Since you are reading this, I assume you are familiar with English, and you may have noticed that some words are used more often than others: words such as the, there, from, and so on. The task of classifying whether an email is spam or ham is inherently statistical in nature. When certain words are used often in a document (such as an email), they carry more weight in indicating what that document is about. For example, I received an email today about cats (I am a patron of the Cat Protection Society). The word cat or cats occurred 11 times out of the 120 or so words. It would not be difficult to conclude that the email is about cats.

However, the word the showed up 19 times. If we were to classify the topic of the email by a count of words, the email would be classified under the topic the. Connective words such as these are useful in understanding the specific context of the sentences, but for a naïve statistical analysis, they often add nothing more than noise. So, we have to remove them.
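To make the effect of counting with and without connective words concrete, here is a minimal sketch in Python. The stopword list, the sample sentence, and the helper names tokenize and word_counts are all made up for illustration; they are not taken from the LingSpam corpus or from any particular library.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
# A real project would choose a list suited to its own data.
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "from", "there", "on"})


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(text, stopwords=frozenset()):
    """Count tokens, skipping any token found in `stopwords`."""
    return Counter(tok for tok in tokenize(text) if tok not in stopwords)


# Made-up sample text standing in for an email.
sample = (
    "The cat sat on the mat. The cat and the other cats like the mat, "
    "and the cat sleeps there."
)

raw = word_counts(sample)                 # every token is counted
filtered = word_counts(sample, STOPWORDS)  # stopwords are skipped

print(raw.most_common(1))       # the most frequent token overall
print(filtered.most_common(1))  # the most frequent token after filtering
```

With the raw counts, the most frequent token is the; once the stopwords are filtered out, cat rises to the top, which matches the intuition about the topic of the text.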

Stopwords are often specific to projects, and I'm not a particular fan of removing them outright. However, the LingSpam corpus includes two variants, stop and lemm_stop, in which a stopword list has already been applied and the stopwords removed.