The diamond dataset

Let's make actual predictions about diamond prices by using different ensemble learning models. We will use the diamonds dataset (which can be found here: https://www.kaggle.com/shivam2503/diamonds). This dataset has the prices, among other features, of almost 54,000 diamonds. The dataset is structured as follows:

  • Feature information: A dataframe with 53,940 rows and 10 variables
  • Price: Price in US dollars

The following are the nine predictive features:

  • carat: This feature represents weight of the diamond (0.2-5.01)
  • cut: This feature represents quality of the cut (Fair, Good, Very Good, Premium, and Ideal)
  • color: This feature represents diamond color, from J (worst) to D (best)
  • clarity: This feature represents a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
  • x: This feature represents length of diamond in mm (0-10.74)
  • y: This feature represents width of diamond in mm (0-58.9)
  • z: This feature represents depth of diamond in mm (0-31.8)
  • depth: This feature represents the total depth percentage, z/mean(x, y) = 2 * z/(x + y), expressed as a percentage (43-79)
  • table: This feature represents width of the top of the diamond relative to the widest point (43-95)

The x, y, and z variables denote the size of the diamonds.
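The depth formula in the list above can be checked with a few lines of Python. This is only a sketch: the function name depth_percentage and the measurements passed to it are hypothetical, chosen to make the arithmetic easy to follow.

```python
def depth_percentage(x, y, z):
    """Total depth as a percentage: 100 * z / mean(x, y) = 100 * 2 * z / (x + y)."""
    return 100 * 2 * z / (x + y)

# Hypothetical diamond: 4.0 mm long, 4.0 mm wide, 2.4 mm deep
print(round(depth_percentage(4.0, 4.0, 2.4), 1))  # 60.0
```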

The libraries that we will use are numpy, matplotlib, and pandas. For importing these libraries, the following lines of code can be used:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

The following screenshot shows the lines of code that we use to call the raw dataset:
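A minimal sketch of what loading the data might look like with pandas is given here. The real file name is an assumption (something like diamonds.csv, saved from the Kaggle page), so the example reads a few illustrative rows from an in-memory string instead, which keeps it runnable on its own.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded file. With the real data this would
# be something like pd.read_csv("diamonds.csv"); the file name is an assumption.
sample_csv = io.StringIO(
    "carat,cut,color,clarity,depth,table,price,x,y,z\n"
    "0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43\n"
    "0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31\n"
    "0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31\n"
)
diamonds = pd.read_csv(sample_csv)

print(diamonds.shape)   # (number of rows, number of columns)
print(diamonds.dtypes)  # cut, color, and clarity are read as text columns
print(diamonds.head())
```

If the downloaded file carries an extra unnamed index column, passing index_col=0 to read_csv is one way to drop it.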

The preceding dataset has some numerical features and some categorical features. Here, 53,940 is the exact number of samples that we have in this dataset. To encode the information in the categorical features, we use the one-hot encoding technique, which transforms each categorical feature into a set of dummy features. This is needed because scikit-learn estimators expect numerical input.

The following screenshot shows the lines of code used for the transformation of the categorical features to numbers:

This is done with the get_dummies function from pandas. The final dataset looks similar to the one in the following screenshot:

Here, for each of the categories in a categorical variable, we have a dummy feature. Its value is 1 when the diamond belongs to that category and 0 when it does not.
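The encoding described above can be illustrated on a toy example. The frame below is made up for illustration and is not the real dataset; dtype=int is passed so that the dummy columns hold 1 and 0 rather than True and False, which is the default in recent pandas versions.

```python
import pandas as pd

# Tiny illustrative frame; the values are invented for this example.
toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.30],
    "cut": ["Ideal", "Premium", "Good"],
    "color": ["E", "E", "J"],
})

# One dummy column per category of each listed feature.
encoded = pd.get_dummies(toy, columns=["cut", "color"], dtype=int)
print(encoded)
```

Passing drop_first=True would drop one level per feature, since that level is implied when all the remaining dummies are 0. Whether to do so is a modeling choice.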

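Next, to rescale the data, we will use the RobustScaler class to transform all the features to a similar scale. RobustScaler centers each feature on its median and scales it by its interquartile range, which makes it less sensitive to outliers than scaling by the mean and standard deviation.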

The following screenshot shows the lines of code used for importing the train_test_split function and the RobustScaler class:

Here, we extract the features into the X matrix, define the target, and then use the train_test_split function from scikit-learn to partition the data into training and test sets.
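A minimal sketch of the split-and-rescale step follows. It uses random numbers as a stand-in for the encoded diamonds features, and the test_size and random_state values are assumptions rather than values taken from the text. The scaler is fitted on the training set only and then applied to both sets, which is a common way to keep test-set information out of the preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the feature matrix X and the target y (price).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.normal(size=200)

# test_size and random_state are assumed values for this illustration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# Learn the median and interquartile range from the training data only,
# then apply the same transformation to the test data.
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```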
