
Phishing detection with logistic regression

In this section, we are going to build a phishing detector from scratch with a logistic regression algorithm. Logistic regression is a well-known statistical technique used to make binary predictions, that is, to assign each sample to one of two classes.
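As a minimal sketch of what a logistic regression model computes at prediction time: it takes a weighted sum of the input features, passes it through the sigmoid function to obtain a probability, and thresholds that probability to pick one of the two classes. The weights, bias, and sample values below are made up for illustration and do not come from the phishing dataset.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights and one hypothetical sample with three features.
weights = [0.8, -0.5, 1.2]
bias = -0.1
sample = [1, -1, 1]

probability = predict_probability(sample, weights, bias)
# Threshold at 0.5 to choose between the two classes (1 and -1).
label = 1 if probability >= 0.5 else -1
print(probability, label)
```

Training a logistic regression model amounts to finding the weights and bias that make these probabilities match the labeled data as closely as possible; scikit-learn does that for us later in this section.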

As in every machine learning project, we need data to feed our machine learning model. For our model, we are going to use the Phishing Websites Data Set from the UCI Machine Learning Repository. You can check it out at https://archive.ics.uci.edu/ml/datasets/Phishing+Websites.

The dataset is provided as an arff file:

The following is a snapshot from the dataset:

For better manipulation, we have organized the dataset into a csv file:

As you probably noticed from the attributes, each line of the dataset is represented in the following format: {30 attributes (having_IP_Address, URL_Length, abnormal_URL, and so on)} + {1 attribute (Result)}.
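The text does not show how the arff file was turned into a csv file. One possible way, sketched here with only the standard library, is to collect the attribute names from the `@attribute` lines and write everything after `@data` as comma-separated rows. The function name `arff_to_csv_rows` and the three-attribute excerpt are hypothetical, and the parser assumes a simple ARFF file with no quoted values or sparse rows.

```python
import csv
import io

def arff_to_csv_rows(arff_lines):
    """Return (attribute_names, data_rows) from the lines of a simple ARFF file."""
    names, rows, in_data = [], [], False
    for raw in arff_lines:
        line = raw.strip()
        if not line or line.startswith('%'):
            continue  # skip blank lines and ARFF comments
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(',')])
        elif lower.startswith('@attribute'):
            names.append(line.split()[1])  # second token is the attribute name
        elif lower.startswith('@data'):
            in_data = True
    return names, rows

# A tiny hypothetical ARFF excerpt in the same shape as the dataset.
example = """@relation phishing
@attribute having_IP_Address { -1,1 }
@attribute URL_Length { 1,0,-1 }
@attribute Result { -1,1 }
@data
-1,1,-1
1,0,1
"""

names, rows = arff_to_csv_rows(example.splitlines())
buffer = io.StringIO()
# Write the data rows only, with no header line, so the file holds integers only.
csv.writer(buffer).writerows(rows)
print(names)
print(buffer.getvalue())
```

Leaving the header out keeps the csv file purely numeric, which suits loading it as an integer array.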

For our model, we are going to import two machine learning libraries, NumPy and scikit-learn, which we already installed in Chapter 1, Introduction to Machine Learning in Pentesting.

Let's open the Python environment and load the required libraries:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Next, load the data:

training_data = np.genfromtxt('dataset.csv', delimiter=',', dtype=np.int32)

Identify the inputs (all of the attributes, except for the last one) and the outputs (the last attribute):

inputs = training_data[:, :-1]
outputs = training_data[:, -1]
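To see what these two slices select, here is a small sketch on a made-up array with two attribute columns followed by a result column. The values are hypothetical; only the slicing pattern matches the code above.

```python
import numpy as np

# Hypothetical 3-row table: two attribute columns followed by the Result column.
toy = np.array([[ 1, -1,  1],
                [-1,  0, -1],
                [ 1,  1,  1]], dtype=np.int32)

toy_inputs = toy[:, :-1]   # every row, every column except the last
toy_outputs = toy[:, -1]   # every row, last column only

print(toy_inputs.shape)
print(toy_outputs)
```

On the real dataset, the same slices give a two-dimensional array with 30 columns for the inputs and a one-dimensional array of labels for the outputs.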

In the previous chapter, we discussed how we need to divide the dataset into training data and testing data. Here, we use the first 2,000 rows for training and the remaining rows for testing:

training_inputs = inputs[:2000]
training_outputs = outputs[:2000]
testing_inputs = inputs[2000:]
testing_outputs = outputs[2000:]
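Taking the first 2,000 rows works when the rows are in no particular order. If the file happened to be sorted (for example, by label), the training set would not be representative. A hedged alternative, sketched below with NumPy only, is to shuffle the row indices before splitting. The helper name `shuffled_split` and the seed are hypothetical choices; scikit-learn's `train_test_split` in `sklearn.model_selection` serves the same purpose.

```python
import numpy as np

def shuffled_split(inputs, outputs, n_train, seed=0):
    """Shuffle row indices, then split inputs and outputs at n_train."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(inputs))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return inputs[train_idx], outputs[train_idx], inputs[test_idx], outputs[test_idx]

# Hypothetical data: 10 rows, 2 attributes, with output i belonging to row i.
demo_inputs = np.arange(20).reshape(10, 2)
demo_outputs = np.arange(10)

tr_x, tr_y, te_x, te_y = shuffled_split(demo_inputs, demo_outputs, n_train=7)
print(tr_x.shape, te_x.shape)
```

Indexing inputs and outputs with the same index arrays keeps every row paired with its own label after the shuffle.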

Create the scikit-learn logistic regression classifier:

classifier = LogisticRegression()

Train the classifier:

classifier.fit(training_inputs, training_outputs)

Make predictions:

predictions = classifier.predict(testing_inputs)
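Besides hard class labels, a fitted `LogisticRegression` can also report class probabilities through `predict_proba`, which is useful when you want to tune how cautious the detector is. The sketch below uses a hypothetical one-feature toy dataset rather than the phishing data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, labels -1 and 1.
toy_inputs = np.array([[-2], [-1], [1], [2]])
toy_outputs = np.array([-1, -1, 1, 1])

toy_classifier = LogisticRegression()
toy_classifier.fit(toy_inputs, toy_outputs)

new_samples = np.array([[-1.5], [1.5]])
# One row per sample; columns follow the order of toy_classifier.classes_.
probabilities = toy_classifier.predict_proba(new_samples)
toy_predictions = toy_classifier.predict(new_samples)

print(toy_classifier.classes_)
print(probabilities)
print(toy_predictions)
```

Each row of the probability array sums to 1, and `predict` returns the class with the higher probability.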

Let's print out the accuracy of our phishing detector model:

accuracy = 100.0 * accuracy_score(testing_outputs, predictions)
print("The accuracy of your Logistic Regression on testing data is: " + str(accuracy))

The accuracy of our model is approximately 85%. This means that the model classified about 85 out of every 100 test URLs correctly, counting both phishing and legitimate ones. This is a reasonable result for a first model. But let's try to make an even better model with decision trees, using the same data.
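Accuracy counts correct predictions over both classes, so on its own it does not say how many phishing URLs were actually caught. A minimal sketch of the difference, in plain Python, is to count true and false positives and negatives and derive both accuracy and the detection rate (recall). The labels below are made up, and treating -1 as the phishing class is an assumption; check the dataset description before reusing it.

```python
def confusion_counts(actual, predicted, positive=-1):
    """Count true/false positives and negatives for the given positive label."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

# Hypothetical labels: -1 is assumed to mean phishing, 1 legitimate.
actual    = [-1, -1, -1, -1, 1, 1, 1, 1, 1, 1]
predicted = [-1, -1,  1,  1, 1, 1, 1, 1, 1, 1]

tp, fp, tn, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)   # share of all URLs classified correctly
recall = tp / (tp + fn)              # share of phishing URLs that were caught
print(accuracy, recall)
```

In this toy case, the accuracy is 0.8 while only half of the phishing URLs are detected, which is why it is worth looking at both numbers when judging a detector.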
