
Loading the dataset

While it may not be the most fun part of a machine learning problem, loading the data is an important step. I'm going to cover my data-loading methodology here so that you can get a feel for how I handle loading a dataset.

from sklearn.preprocessing import StandardScaler
import pandas as pd

TRAIN_DATA = "./data/train/train_data.csv"
VAL_DATA = "./data/val/val_data.csv"
TEST_DATA = "./data/test/test_data.csv"

def load_data():
    """Loads the train, val, and test datasets from disk."""
    train = pd.read_csv(TRAIN_DATA)
    val = pd.read_csv(VAL_DATA)
    test = pd.read_csv(TEST_DATA)

    # we will use sklearn's StandardScaler to scale our data to 0 mean, unit variance.
    # fit on train only, then apply the same transform to val and test.
    scaler = StandardScaler()
    train = scaler.fit_transform(train)
    val = scaler.transform(val)
    test = scaler.transform(test)

    # we will use a dict to keep all this data tidy.
    data = dict()

    # columns 0-8 are the features; column 9 (the last) is the target, alcohol.
    data["train_y"] = train[:, 9]
    data["train_X"] = train[:, 0:9]
    data["val_y"] = val[:, 9]
    data["val_X"] = val[:, 0:9]
    data["test_y"] = test[:, 9]
    data["test_X"] = test[:, 0:9]

    # it's a good idea to keep the scaler (or at least the mean/variance)
    # so we can unscale predictions later.
    data["scaler"] = scaler
    return data

When I'm reading data from a CSV file, an Excel spreadsheet, or even a DBMS, my first step is usually to load it into a pandas DataFrame.
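As a minimal sketch of that first step (the column names here are made up for illustration), pd.read_csv accepts file paths and file-like objects alike, so you can try it without any files on disk:

```python
import io

import pandas as pd

# A tiny in-memory CSV standing in for one of the data files on disk.
csv_text = "fixed_acidity,volatile_acidity,alcohol\n7.0,0.27,8.8\n6.3,0.30,9.5\n"

# read_csv infers column names from the header row and dtypes from the values.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 3)
```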

It's important to normalize our data so that each feature is on a comparable scale and the values stay in a range our activation functions handle well. Here, I used scikit-learn's StandardScaler, which rescales each feature to zero mean and unit variance. Note that the scaler is fit on the training set only and then applied unchanged to the validation and test sets, so no information leaks from those sets into training.
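To make the fit-on-train, transform-on-val pattern concrete, here is a small standalone sketch with made-up numbers (not the actual wine data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for two feature columns of the train and val sets.
train = np.array([[8.0, 0.27], [6.3, 0.30], [8.1, 0.28], [7.2, 0.23]])
val = np.array([[7.0, 0.27], [6.9, 0.32]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from train only
val_scaled = scaler.transform(val)          # reuse the train statistics

# After fitting, each training column has zero mean and unit variance;
# the val columns are shifted by the *train* statistics, not their own.
print(train_scaled.mean(axis=0))
print(train_scaled.std(axis=0))
```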

This gives us an overall dataset with shape (4898, 10): nine feature columns plus the target in the last column. Our target variable, alcohol, is given as a percentage between 8% and 14.2%.

I've randomly sampled and divided the data into train, val, and test datasets prior to loading the data, so we don't have to worry about that here.
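That upstream split isn't shown in this chapter, but one common way to produce a three-way split (a sketch, not necessarily the exact procedure used here) is two chained calls to scikit-learn's train_test_split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the full dataset: 100 rows, 10 columns.
data = np.arange(1000).reshape(100, 10)

# First split off 20% of the rows, then halve that remainder into val/test,
# giving an 80/10/10 split. random_state makes the split reproducible.
train, rest = train_test_split(data, test_size=0.2, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)
print(train.shape, val.shape, test.shape)  # (80, 10) (10, 10) (10, 10)
```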

Lastly, the load_data() function returns a dictionary that keeps everything tidy and in one place. If you see me reference data["train_X"] later, just know that I'm referencing the training dataset that I've stored in that dictionary of data.
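Keeping the scaler in that dictionary pays off at prediction time: the model's outputs live in standardized units, and the scaler's stored per-column statistics let you convert them back. Here is a hedged sketch with randomly generated stand-in data (the real dataset's values will differ), assuming the target was the last of ten columns when the scaler was fit:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fake stand-in for the full 10-column dataset (target in column 9).
rng = np.random.default_rng(0)
raw = rng.uniform(8.0, 14.2, size=(100, 10))

scaler = StandardScaler()
scaled = scaler.fit_transform(raw)

# Pretend these are model predictions in scaled space for the target column.
preds_scaled = scaled[:5, 9]

# Invert the standardization by hand using the scaler's stored statistics.
preds = preds_scaled * scaler.scale_[9] + scaler.mean_[9]
print(np.allclose(preds, raw[:5, 9]))  # True
```

The same round trip is what scaler.inverse_transform does, but it expects all ten columns at once; indexing mean_ and scale_ directly is simpler when you only need the target column.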

The code and data for this project are both available on the book's GitHub site (https://github.com/mbernico/deep_learning_quick_reference).
