
Loading and preparing the dataset

The dataset we are going to use for this example is the famous Iris dataset for plant classification. It contains 150 plant samples with four measurements of each: sepal length, sepal width, petal length, and petal width (all in centimeters). First used in 1936, it is one of the classic datasets in data mining. There are three classes: Iris Setosa, Iris Versicolour, and Iris Virginica. The aim is to determine which type of plant a sample is by examining its measurements.

The scikit-learn library contains this dataset built-in, making it straightforward to load:

from sklearn.datasets import load_iris

dataset = load_iris()
X = dataset.data    # feature matrix: 150 samples, 4 measurements each
y = dataset.target  # class labels: 0, 1, or 2

You can also print(dataset.DESCR) to see an outline of the dataset, including some details about the features.
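
For example, the following prints the description and confirms the shape of the data (the shapes shown assume the standard 150-sample Iris data):

print(dataset.DESCR)
print(X.shape, y.shape)  # (150, 4) (150,)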

The features in this dataset are continuous values, meaning they can take any value within a range. Measurements are a good example of this type of feature: a measurement can take the value 1, 1.2, 1.25, and so on. Another aspect of continuous features is that values close to each other indicate similarity. A plant with a sepal length of 1.2 cm is similar to a plant with a sepal length of 1.25 cm.

In contrast are categorical features. These features, while often represented as numbers, cannot be compared in the same way. In the Iris dataset, the class values are an example of a categorical feature: class 0 represents Iris Setosa, class 1 represents Iris Versicolour, and class 2 represents Iris Virginica. The numbering doesn't mean that Iris Setosa is more similar to Iris Versicolour than it is to Iris Virginica, even though the class values are closer. The numbers here represent categories; all we can say is whether two categories are the same or different.
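
To see this mapping in practice, the object returned by load_iris() also exposes the class names alongside the integer labels:

print(dataset.target_names)  # ['setosa' 'versicolor' 'virginica']
print(y[:5])                 # [0 0 0 0 0] -- the first samples are all Iris Setosa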

There are other types of features too, which we will cover in later chapters. These include pixel intensities, word frequencies, and n-grams.

While the features in this dataset are continuous, the algorithm we will use in this example requires categorical features. Turning a continuous feature into a categorical feature is a process called discretization.

A simple discretization algorithm is to choose a threshold: any value below the threshold is given the value 0, and any value at or above it is given the value 1. For our threshold, we will use the mean (average) value of each feature. To start with, we compute the mean of each feature:

# Compute the per-feature mean over all samples (axis=0 averages down the columns)
attribute_means = X.mean(axis=0)

The result from this code is an array of length 4, which is the number of features we have: the first value is the mean of the first feature, and so on. Next, we use this to transform our dataset from one with continuous features to one with discrete categorical features:

import numpy as np

n_features = X.shape[1]  # 4 features in the Iris data
assert attribute_means.shape == (n_features,)
X_d = np.array(X >= attribute_means, dtype='int')  # 1 where the value is at or above the mean, 0 otherwise
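
As a quick check, compare the first sample before and after discretization. With the standard Iris data, only the sepal width (3.5 cm) is above its feature mean (about 3.06 cm):

print(X[0])    # [5.1 3.5 1.4 0.2]
print(X_d[0])  # [0 1 0 0]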

We will use this new X_d dataset (for X discretized) for our training and testing, rather than the original dataset (X).
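
As a preview of how X_d will be used, here is a minimal sketch of a train/test split (train_test_split and the fixed random_state are illustrative choices, not prescribed by the steps above):

from sklearn.model_selection import train_test_split

# Hold out 25% of the discretized samples for testing (the default proportion);
# fixing random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X_d, y, random_state=14)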
