Extracting N-grams

In standard quantitative analysis of text, N-grams are sequences of N tokens (for example, words or characters). For instance, given the text "The quick brown fox jumped over the lazy dog", if our tokens are words, then the 1-grams are the, quick, brown, fox, jumped, over, the, lazy, and dog; the 2-grams are the quick, quick brown, brown fox, and so on; and the 3-grams are the quick brown, quick brown fox, brown fox jumped, and so on. Just as the local statistics of a text allowed us to build a Markov chain for statistical prediction and text generation from a corpus, N-grams allow us to model the local statistical properties of our corpus. Our ultimate goal is to use N-gram counts to help predict whether a sample is malicious or benign. In this recipe, we demonstrate how to extract N-gram counts from a sample.
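To make the idea concrete, here is a minimal sketch of N-gram counting over word tokens using only the Python standard library (the function name `extract_ngram_counts` is illustrative, not from the recipe itself):

```python
from collections import Counter

def extract_ngram_counts(tokens, n):
    """Count all length-n sliding windows (N-grams) over a token sequence."""
    # zip over n staggered views of the list yields each consecutive n-tuple
    return Counter(zip(*(tokens[i:] for i in range(n))))

tokens = "the quick brown fox jumped over the lazy dog".split()

unigrams = extract_ngram_counts(tokens, 1)
bigrams = extract_ngram_counts(tokens, 2)

print(unigrams[("the",)])         # "the" occurs twice
print(bigrams[("quick", "brown")])
```

The same function works unchanged on character tokens (pass a string instead of a word list) or on byte sequences of a binary sample, which is the setting used when featurizing malware.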