
How it works...

Both the literature and industry practice suggest that the most frequent N-grams are also the most informative ones for a malware classification algorithm. For this reason, in this recipe, we write functions to extract them from a file. We start by importing some helpful libraries for our extraction of N-grams (step 1). In particular, we import the collections module and the ngrams function from nltk. The collections module allows us to convert a list of N-grams into a frequency count of those N-grams, while the ngrams function allows us to take an ordered list of bytes and obtain a list of N-grams. We specify the file we would like to analyze and write a function that reads all of the bytes of a given file (steps 2 and 3). We define a few more convenience functions before we begin the extraction. In particular, we write a function that takes a file's sequence of bytes and outputs a list of its N-grams (step 4), and a function that takes a file and outputs the counts of its N-grams (step 5). We are now ready to pass in a file and extract its N-grams. We do so to obtain the counts of the 4-grams of our file (step 6) and then display the 10 most common of them, along with their counts (step 7). We see that some of the most frequent N-grams, such as (0,0,0,0) and (255,255,255,255), may not be very informative, because frequency alone does not guarantee that an N-gram is useful. For this reason, we will utilize feature selection methods to cut out the less informative N-grams in our next recipe.
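The steps described above can be sketched roughly as follows. This is a minimal illustration rather than the recipe's exact code: it uses only the standard library, so the sliding-window helper stands in for nltk's ngrams function, the function names are illustrative, and the input is a small synthetic file written to a temporary path instead of a real sample.

```python
import collections
import os
import tempfile


def read_file(file_path):
    """Read a file's bytes and return them as a list of integers (0-255)."""
    with open(file_path, "rb") as binary_file:
        return list(binary_file.read())


def byte_sequence_to_ngrams(byte_sequence, n):
    """Slide a window of length n over the sequence, one tuple per position.

    Stand-in for nltk's ngrams function (an assumption made so that the
    sketch has no third-party dependency).
    """
    return zip(*(byte_sequence[i:] for i in range(n)))


def extract_ngram_counts(file_path, n):
    """Count how often each N-gram of bytes occurs in the file."""
    file_bytes = read_file(file_path)
    return collections.Counter(byte_sequence_to_ngrams(file_bytes, n))


# Synthetic input (hypothetical data): runs of 0x00 and 0xFF, then four
# distinct bytes, mimicking the padding often seen in binaries.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes([0] * 8 + [255] * 6 + [1, 2, 3, 4]))
    sample_path = tmp.name

ngram_counts = extract_ngram_counts(sample_path, 4)
os.remove(sample_path)

# Display the 10 most common 4-grams along with their counts.
top_ngrams = ngram_counts.most_common(10)
print(top_ngrams)
```

Even on this toy input, the padding-like 4-grams (0,0,0,0) and (255,255,255,255) come out on top, which illustrates why a feature selection step is useful after raw frequency counting.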
