
Loading the dataset

This dataset, Ionosphere, consists of readings from high-frequency antennas. The aim of the antennas is to determine whether there is a structure in the ionosphere, a region of the upper atmosphere. We consider readings with a structure to be good, while those that do not have structure are deemed bad. The aim of this application is to build a data mining classifier that can determine whether an image is good or bad.

(Image Credit: https://www.flickr.com/photos/geckzilla/16149273389/)

You can download this dataset from the UCI Machine Learning repository, which hosts many datasets for different data mining applications. Go to http://archive.ics.uci.edu/ml/datasets/Ionosphere and click on Data Folder. Download the ionosphere.data and ionosphere.names files to a folder on your computer. For this example, I'll assume that you have put the dataset in a directory called Data in your home folder. You can place the data in another folder; just be sure to update your data folder (here, and in all other chapters).
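If you prefer to fetch the files from a script rather than the browser, the following is a minimal sketch that downloads both files into a Data directory in your home folder using only the standard library. The direct file URLs are an assumption based on the UCI repository's usual layout, so adjust them if the site has changed:

import os
from urllib.request import urlretrieve

# Assumed direct URLs following the usual UCI repository layout
base_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/ionosphere/"
data_folder = os.path.join(os.path.expanduser("~"), "Data")
os.makedirs(data_folder, exist_ok=True)

for filename in ["ionosphere.data", "ionosphere.names"]:
    # Save each file into the Data folder described above
    urlretrieve(base_url + filename, os.path.join(data_folder, filename))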

The location of your home folder depends on your operating system. For Windows, it is usually at C:\Documents and Settings\username. For Linux machines, it is usually at /home/username, and for Mac it is usually at /Users/username. You can get your home folder by running this Python code inside a Jupyter Notebook:

import os
print(os.path.expanduser("~"))

For each row in the dataset, there are 35 values. The first 34 are measurements taken from the 17 antennas (two values for each antenna). The last is either 'g' or 'b'; that stands for good and bad, respectively.
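If you want to check this format yourself before loading everything, a minimal sketch like the following reads just the first row; it assumes the file sits in the Data folder described above:

import csv
import os

# Peek at the first row: 34 float-valued measurements followed by 'g' or 'b'
with open(os.path.join(os.path.expanduser("~"), "Data", "ionosphere.data"), 'r') as f:
    first_row = next(csv.reader(f))

print(len(first_row))   # 35 values per row
print(first_row[-1])    # the class label: 'g' or 'b'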

Start the Jupyter Notebook server and create a new notebook called Ionosphere Nearest Neighbors. To start with, we load the NumPy and csv libraries that we will need for our code, and set the filename of the dataset.

import os
import csv
import numpy as np

data_filename = os.path.join(os.path.expanduser("~"), "Data", "ionosphere.data")

We then create the X and y NumPy arrays to store the dataset in. The sizes of these arrays are known from the dataset. Don't worry if you don't know the size of future datasets - we will use other methods to load the dataset in future chapters and you won't need to know this size beforehand:

X = np.zeros((351, 34), dtype='float') 
y = np.zeros((351,), dtype='bool')

The dataset is in a Comma-Separated Values (CSV) format, which is a commonly used format for datasets. We are going to use the csv module to load this file. Import it and set up a csv reader object, then loop through the file, setting the appropriate row in X and class value in y for every line in our dataset:

with open(data_filename, 'r') as input_file:
    reader = csv.reader(input_file)
    for i, row in enumerate(reader):
        # Get the data, converting each item to a float
        data = [float(datum) for datum in row[:-1]]
        # Set the appropriate row in our dataset
        X[i] = data
        # 1 if the class is 'g', 0 otherwise
        y[i] = row[-1] == 'g'

We now have a dataset of samples and features in X as well as the corresponding classes in y, as we did in the classification example in Chapter 1, Getting Started with Data Mining.
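As a quick sanity check, you can confirm the shapes of the arrays and count how many samples were labelled good:

print(X.shape)   # expect (351, 34)
print(y.shape)   # expect (351,)
print(y.sum())   # number of samples whose class is 'g' (good)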

To begin with, try applying the OneR algorithm from Chapter 1, Getting Started with Data Mining to this dataset. It won't work very well, as the information in this dataset is spread out across the correlations between certain features. OneR looks only at the values of a single feature and cannot pick up information in more complex datasets very well. Other algorithms, including Nearest Neighbor, merge information from multiple features, making them applicable in more scenarios. The downside is that they are often more computationally expensive.
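As a preview of where we are heading, here is a minimal sketch that applies scikit-learn's nearest neighbor classifier to this dataset and estimates its accuracy with cross-validation. It assumes scikit-learn is installed and uses the classifier's default parameters:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Nearest neighbor classifier with default parameters, evaluated by cross-validation
estimator = KNeighborsClassifier()
scores = cross_val_score(estimator, X, y, scoring='accuracy')
print("Average accuracy: {0:.1f}%".format(np.mean(scores) * 100))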
