Preprocessing the data

Now let's open up the Jupyter Notebook and get started on our first program, using the methods that we discussed in the previous section:

  1. The first thing we need to do is load the various libraries that we need. We will also load the iris dataset from the scikit-learn library, using the following code:
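The original listing for this step is not shown, so the following is a minimal sketch of the imports it most likely needs. It assumes that NumPy, pandas, and scikit-learn are installed; the aliases np and pd are the conventional ones rather than something the text specifies.

```python
# Assumed set of imports for this walkthrough; the original listing is not available.
import numpy as np                       # array type used by the dataset
import pandas as pd                      # DataFrame used in the later steps
from sklearn.datasets import load_iris   # loader for the bundled iris dataset
```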

  2. After importing all the required libraries and the dataset, we create an object called iris_obj, which holds the loaded iris dataset. We then use its data attribute to preview the dataset, which results in the following output:

Notice that it's a NumPy array. It holds the measurements that we want: each row is one flower, and each column corresponds to a feature.
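As a rough sketch of this step, the dataset can be loaded and previewed as follows. The name iris_obj comes from the text; printing the type, the shape, and only the first five rows is an assumption made here to keep the output short.

```python
from sklearn.datasets import load_iris

iris_obj = load_iris()        # container holding the data, the target, and metadata

print(type(iris_obj.data))    # the measurements are stored in a NumPy array
print(iris_obj.data.shape)    # (number of flowers, number of features)
print(iris_obj.data[:5])      # preview the first five rows only
```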

  3. We will now see what those feature names are in the following output:

As you can see here, the first column shows the sepal length, the next column shows the sepal width, the third column shows the petal length, and the final column shows the petal width.
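The column names described above can be inspected with the feature_names attribute of the loaded object. This is a small sketch, not necessarily the exact call used in the original listing.

```python
from sklearn.datasets import load_iris

iris_obj = load_iris()

# One name per column of iris_obj.data, in column order
print(iris_obj.feature_names)
```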

  4. Now, there is a fifth column that is not displayed here; it's referred to as the target column. This is stored in a separate array; we will now look at this column as follows:

This displays the target column in an array.
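A minimal sketch of looking at the target column, assuming it is read from the target attribute of iris_obj:

```python
from sklearn.datasets import load_iris

iris_obj = load_iris()

# One integer code per flower, aligned row by row with iris_obj.data
print(iris_obj.target)
print(iris_obj.target.shape)
```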

  5. Now, to see the labels that the values in the target array stand for, we can use the following code:

As you can see, the target column consists of data with three different labels. The flowers come from either the setosa, the versicolor, or the virginica species.
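The three species labels can be read from the target_names attribute. The sketch below also pairs each integer code with its label, which is an addition made here for illustration.

```python
from sklearn.datasets import load_iris

iris_obj = load_iris()

print(iris_obj.target_names)   # the species labels, indexed by target code

# Show which integer code corresponds to which species
for code, name in enumerate(iris_obj.target_names):
    print(code, name)
```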

  6. Our next step is to take this dataset and turn it into a pandas DataFrame, using the following code:

This results in the following output:

As you can see, we have successfully loaded the data into a DataFrame.
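One way to build such a DataFrame is sketched below. The variable name iris and the two-step construction are assumptions; the column name species is taken from the text, and the original listing may combine the arrays differently.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_obj = load_iris()

# Four feature columns, named after iris_obj.feature_names
iris = pd.DataFrame(iris_obj.data, columns=iris_obj.feature_names)

# Append the target array as a fifth column (still numeric at this point)
iris["species"] = iris_obj.target

print(iris.head())
```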

  7. We can see that the species column still represents the species with numeric values. So, we will replace the numbers in this final column with strings that name the species, using the following code block:

The following screenshot shows the result:

As you can see, the species column now has the actual species names, which makes it much easier to work with the data.

Now, for this dataset, the fact that each flower belongs to one of several species suggests that we may want to group the data when we compute statistical summaries. Therefore, we can try grouping by species.
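The replacement of numeric codes with species names can be sketched as follows. The mapping is built from iris_obj.target_names and applied with Series.map; this is one possible approach, and the original listing may use a different call. The names iris and species_names are assumptions.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_obj = load_iris()
iris = pd.DataFrame(iris_obj.data, columns=iris_obj.feature_names)
iris["species"] = iris_obj.target

# Map each integer code to its species label, e.g. 0 to "setosa"
species_names = {code: str(name) for code, name in enumerate(iris_obj.target_names)}
iris["species"] = iris["species"].map(species_names)

print(iris.head())
```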

  8. So, we will now group the dataset by the species column, and then print out a few rows of each group to make sure that everything is working. We will use the following lines of code to do so:

This results in the following output:
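The grouping step can be sketched with DataFrame.groupby, as shown here. The names iris and grouped, and the choice to print three rows per group, are assumptions made for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_obj = load_iris()
iris = pd.DataFrame(iris_obj.data, columns=iris_obj.feature_names)
iris["species"] = [str(iris_obj.target_names[code]) for code in iris_obj.target]

# Split the rows into one group per species
grouped = iris.groupby("species")

# Print the group label and a few rows of each group as a sanity check
for name, group in grouped:
    print(name)
    print(group.head(3))
```

Iterating over the grouped object yields one (label, sub-DataFrame) pair per species, which is a convenient way to confirm that the groups have the expected contents.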

Now that the data has been loaded and set up, we will use it to perform some basic statistical operations in the next section.
