
How to do it...

In the following steps, we will enumerate all the 4-grams of a sample file and examine the most frequent ones:

  1. We begin by importing the collections library to facilitate counting and the ngrams library from nltk to ease extraction of N-grams:
import collections
from nltk import ngrams
  2. We specify which file we would like to analyze:
file_to_analyze = "python-3.7.2-amd64.exe"
  3. We define a convenience function to read in a file's bytes:
def read_file(file_path):
    """Reads in the binary sequence of a binary file."""
    with open(file_path, "rb") as binary_file:
        data = binary_file.read()
    return data
  4. We write a convenience function to take a byte sequence and obtain N-grams:
def byte_sequence_to_Ngrams(byte_sequence, N):
    """Creates a list of N-grams from a byte sequence."""
    Ngrams = ngrams(byte_sequence, N)
    return list(Ngrams)

  5. We write a function to take a file and obtain its N-gram counts:
def binary_file_to_Ngram_counts(file, N):
    """Takes a binary file and outputs the N-gram counts of its binary sequence."""
    filebyte_sequence = read_file(file)
    file_Ngrams = byte_sequence_to_Ngrams(filebyte_sequence, N)
    return collections.Counter(file_Ngrams)
  6. We specify that our desired value is N=4 and obtain the counts of all 4-grams in the file:
extracted_Ngrams = binary_file_to_Ngram_counts(file_to_analyze, 4)
  7. We list the 10 most common 4-grams of our file:
print(extracted_Ngrams.most_common(10))

The result is as follows:

[((0, 0, 0, 0), 24201), ((139, 240, 133, 246), 1920), ((32, 116, 111, 32), 1791), ((255, 255, 255, 255), 1663), ((108, 101, 100, 32), 1522), ((100, 32, 116, 111), 1519), ((97, 105, 108, 101), 1513), ((105, 108, 101, 100), 1513), ((70, 97, 105, 108), 1505), ((101, 100, 32, 116), 1503)]
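Many of the frequent 4-grams above are printable ASCII fragments; for instance, (70, 97, 105, 108) spells "Fail", suggesting repeated error strings in the binary. As a quick sanity check, a small helper can decode the printable ones. The following is a minimal sketch: the sample_counts Counter is a hypothetical stand-in seeded with a few of the counts shown, not the full extracted_Ngrams result.

```python
import collections

# Hypothetical subset of the 4-gram counts shown above (assumption:
# stands in for the full extracted_Ngrams Counter).
sample_counts = collections.Counter({
    (0, 0, 0, 0): 24201,
    (32, 116, 111, 32): 1791,
    (70, 97, 105, 108): 1505,
})

def ngram_to_text(ngram):
    """Render an N-gram of byte values as text, or None if any byte is non-printable."""
    chars = [chr(b) for b in ngram]
    if all(c.isprintable() for c in chars):
        return "".join(chars)
    return None

# Decode the most common 4-grams where possible.
for ngram, count in sample_counts.most_common():
    print(ngram, count, ngram_to_text(ngram))
```

Decoding in this way makes it easier to tell apart structural padding (runs of zero bytes) from embedded strings when inspecting a file's most frequent N-grams.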