
Classification - Spam Email Detection

What makes you you? I have dark hair, pale skin, and Asiatic features. I wear glasses. My facial structure is vaguely round, with extra subcutaneous fat in my cheeks compared to my peers. What I have done is describe the features of my face. Each of these features can be thought of as a point within a probability continuum. What is the probability of having dark hair? Among my friends, dark hair is a very common feature, and so are glasses (remarkably, of the 300 or so people I polled on my Facebook page, 281 require prescription glasses). The epicanthic folds of my eyes are probably less common, as is the extra subcutaneous fat in my cheeks.

Why am I bringing up my facial features in a chapter about spam classification? It's because the principles are the same. If I show you a photo of a human face, what is the probability that the photo is of me? We can say that the probability that the photo is of my face is a combination of the probability of having dark hair, the probability of having pale skin, the probability of having an epicanthic fold, and so on. From a Naive point of view, we can think of each feature as contributing independently to the probability that the photo is of me: the fact that I have an epicanthic fold in my eyes is independent of the fact that my skin is of a yellow pallor. But, of course, with recent advancements in genetics, this has been shown to be patently untrue. These features are, in real life, correlated with one another. We will explore this in a future chapter.

Despite this real-life dependence between the features, we can still assume the Naive position and treat each feature's probability as an independent contribution to the probability that the photo is one of my face.
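The Naive position can be made concrete with a small sketch. Under the independence assumption, the probability of seeing all the features together is simply the product of the individual probabilities. The Python snippet below is only an illustration: apart from the glasses figure (281 out of roughly 300, from the informal poll above), every number is a made-up placeholder, and the names `feature_probs`, `naive_joint`, and `naive_joint_log` are hypothetical.

```python
import math

# Hypothetical per-feature probabilities, for illustration only.
# Only the glasses figure comes from the informal poll in the text.
feature_probs = {
    "dark_hair": 0.9,        # assumed value
    "glasses": 281 / 300,    # 281 of roughly 300 people polled
    "epicanthic_fold": 0.3,  # assumed value
    "round_cheeks": 0.2,     # assumed value
}


def naive_joint(probs):
    """Multiply the probabilities as if the features were independent."""
    return math.prod(probs)


def naive_joint_log(probs):
    """Compute the same product as a sum of logs, which avoids
    numerical underflow when there are many small probabilities."""
    return math.exp(sum(math.log(p) for p in probs))


joint = naive_joint(feature_probs.values())
joint_via_logs = naive_joint_log(feature_probs.values())
print(f"naive joint probability: {joint:.4f}")
```

The log-space variant is included because a product over many features (or, later, many words in an email) quickly becomes too small to represent reliably as a floating-point number.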

In this chapter, we will build an email spam classification system using a Naive Bayes algorithm, which can be used well beyond email spam classification. Along the way, we will explore the very basics of natural language processing, and how probability is inherently tied to the very language we use. A probabilistic understanding of language will be built from the ground up, starting with term frequency-inverse document frequency (TF-IDF). The TF-IDF values will then be translated into Bayesian probabilities, which are used to classify the emails.
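As a rough preview of the idea behind TF-IDF, the following Python sketch scores each word in a tiny corpus. It assumes one common textbook variant (term frequency as count divided by document length, inverse document frequency as the natural log of the number of documents over the number of documents containing the term), which may differ from the exact formulation developed later in the chapter. The three "emails" and the function name `tf_idf` are hypothetical.

```python
import math
from collections import Counter

# A toy corpus of three tokenized "emails" (hypothetical data).
docs = [
    "win money now".split(),
    "meeting notes attached".split(),
    "win a free prize now".split(),
]


def tf_idf(docs):
    """Return one {term: score} dict per document.

    tf  = occurrences of the term in the document / document length
    idf = ln(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


scores = tf_idf(docs)
for doc, doc_scores in zip(docs, scores):
    print(" ".join(doc), doc_scores)
```

In this sketch a word that appears in only one document, such as "money", scores higher than a word shared across documents, such as "win", which is the intuition that makes TF-IDF useful for telling emails apart.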
