
Normalization or standardization

This technique aims to give the dataset the properties of a standard normal distribution, that is, a mean of 0 and a standard deviation of 1. It shifts and rescales the values; it does not change the shape of their distribution.

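These properties are obtained by calculating the so-called z-scores of the dataset samples, with the following formula:

$$z = \frac{x - \mu}{\sigma}$$

Here, $x$ is a sample value, $\mu$ is the mean of the feature, and $\sigma$ is its standard deviation.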
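The z-score calculation, z = (x - mean) / standard deviation, can be sketched directly in NumPy before turning to a library. This is a minimal illustration, not the book's code: the helper name `z_scores` and the sample values are made up for the example, and the population standard deviation (`ddof=0`) is assumed, which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

def z_scores(values):
    """Return (x - mean) / std for a 1-D array (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()  # population standard deviation (ddof=0)
    return (values - mean) / std

# Made-up sample values, for illustration only
sample = np.array([12.0, 11.5, 11.0, 12.0, 10.5])
scaled = z_scores(sample)

# After scaling, the mean is (numerically) 0 and the standard deviation is 1
print(scaled.mean(), scaled.std())
```

Note that a constant column has a standard deviation of 0, so this sketch would divide by zero in that case; a real implementation needs to guard against it.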

Let's visualize and practice this new concept with the help of scikit-learn, reading a file from the MPG dataset, which records city-cycle fuel consumption in miles per gallon together with the following columns: mpg, cylinders, displacement, horsepower, weight, acceleration, model year, origin, and car name.

from sklearn import preprocessing
import pandas as pd
import matplotlib.pyplot as plt

# Load the MPG dataset and list its columns
df = pd.read_csv("data/mpg.csv")
print(df.columns)

# Fit the scaler on two features, then standardize them
partialcolumns = df[['acceleration', 'mpg']]
std_scale = preprocessing.StandardScaler().fit(partialcolumns)
df_std = std_scale.transform(partialcolumns)

# Plot the original data (grey triangles) and the standardized data
plt.figure(figsize=(10, 8))
plt.scatter(partialcolumns['acceleration'], partialcolumns['mpg'], color="grey", marker='^')
plt.scatter(df_std[:, 0], df_std[:, 1])
plt.show()

The following picture allows us to compare the non-normalized and normalized data representations:
[Figure: Depiction of the original dataset and its normalized counterpart.]
It's very important to keep track of the scaling parameters so that the resulting data can be denormalized at evaluation time; otherwise, the results no longer represent the original data. This matters especially when the model is applied to regression, where the predicted values are not useful unless they are scaled back to the original units.
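As a minimal sketch of denormalizing, a fitted `StandardScaler` keeps the mean and standard deviation it learned, and its `inverse_transform` method maps standardized values back to the original units. The data below are made-up (acceleration, mpg) pairs used only for illustration, not values from the MPG file.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up (acceleration, mpg) pairs, for illustration only
data = np.array([
    [12.0, 18.0],
    [11.5, 15.0],
    [11.0, 18.0],
    [12.0, 16.0],
    [10.5, 17.0],
])

# Fit once and keep the scaler: it stores the per-column mean and scale
scaler = StandardScaler().fit(data)
scaled = scaler.transform(data)

# Map the standardized values back to the original units
restored = scaler.inverse_transform(scaled)
print(np.allclose(restored, data))
```

In a regression setting, the same idea applies to the target: if the target column was standardized with its own scaler before training, the model's predictions would be passed through that scaler's `inverse_transform` before being reported or compared with the original values.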