
The project 

What we want to do is simple: given an email, decide whether it is kosher (which we call ham) or spam. We will be using the LingSpam corpus. The emails in that corpus are a little dated, since spammers update their techniques and vocabulary all the time. However, I chose the LingSpam corpus for a good reason: it is already nicely preprocessed. The original scope of this chapter was to introduce the preprocessing of emails; however, preprocessing natural language is a subject large enough to fill an entire book, so we will use a dataset that has already been preprocessed. This allows us to focus more on the mechanics of a very elegant algorithm.

Fear not, though, as I will briefly walk through the basics of preprocessing. Be warned, however, that the complexity climbs very steeply, so be prepared to be sucked into a black hole of many hours spent preprocessing natural language. At the end of this chapter, I will also recommend some libraries that are useful for preprocessing.
