
Managing categorical data

In many classification problems, the target dataset is made up of categorical labels, which most algorithms cannot process directly. An encoding is needed, and scikit-learn offers at least two valid options. Let's consider a very small dataset made of 10 samples, each with two numerical features and a categorical label:

import numpy as np

>>> X = np.random.uniform(0.0, 1.0, size=(10, 2))
>>> Y = np.random.choice(('Male','Female'), size=(10))
>>> X[0]
array([ 0.8236887 , 0.11975305])
>>> Y[0]
'Male'

The first option is to use the LabelEncoder class, which adopts a dictionary-oriented approach: each category label is associated with a progressive integer number, that is, an index into an instance array called classes_:

from sklearn.preprocessing import LabelEncoder

>>> le = LabelEncoder()
>>> yt = le.fit_transform(Y)
>>> print(yt)
[0 0 0 1 0 1 1 0 0 1]

>>> le.classes_
array(['Female', 'Male'], dtype='|S6')
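
Once fitted, the encoder can also map new occurrences of the known labels to the same integers through the transform() method:

>>> le.transform(['Female', 'Male', 'Male'])
array([0, 1, 1])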

The inverse transformation can be obtained in this simple way:

>>> output = [1, 0, 1, 1, 0, 0]
>>> decoded_output = [le.classes_[i] for i in output]
>>> print(decoded_output)
['Male', 'Female', 'Male', 'Male', 'Female', 'Female']
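
The same result can be obtained more directly with the inverse_transform() method provided by LabelEncoder:

>>> le.inverse_transform([1, 0, 1, 1, 0, 0])
array(['Male', 'Female', 'Male', 'Male', 'Female', 'Female'], dtype='|S6')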

This approach is simple and works well in many cases, but it has a drawback: all labels are turned into sequential numbers. A classifier that works with real values will then treat two labels as similar merely because their codes are close, without any concern for semantics. For this reason, it's often preferable to use so-called one-hot encoding, which binarizes the data. For labels, it can be achieved using the LabelBinarizer class:

from sklearn.preprocessing import LabelBinarizer

>>> lb = LabelBinarizer()
>>> Yb = lb.fit_transform(Y)
>>> Yb
array([[1],
       [0],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [1],
       [1]])

>>> lb.inverse_transform(Yb)
array(['Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male',
       'Male', 'Male'], dtype='|S6')

In this case, each categorical label is first turned into a positive integer and then transformed into a vector where only one feature is 1 while all the others are 0 (as the previous output shows, with only two classes LabelBinarizer produces a single binary column instead). This means, for example, that the output of a softmax distribution, with a peak corresponding to the main class, can be easily turned into a discrete vector where the only non-null element corresponds to the right class. For example (assuming a five-class problem and a generic classifier called model that has already been trained):

import numpy as np

>>> Y = lb.fit_transform(Y)
>>> Y[0:3]
array([[0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0]])

>>> Yp = model.predict(X[0:1])
>>> Yp
array([[ 0.002, 0.991, 0.001, 0.005, 0.001]])

>>> Ypr = np.round(Yp)
>>> Ypr
array([[ 0., 1., 0., 0., 0.]])

>>> lb.inverse_transform(Ypr)
array(['Female'], dtype='|S6')
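
Rounding works when one probability is clearly dominant, but it can produce an all-zero vector when the distribution is flatter. A more robust alternative (a small sketch, reusing the fitted binarizer and the hypothetical model from above) is to select the index of the maximum probability directly:

>>> lb.classes_[np.argmax(Yp, axis=1)]
array(['Female'], dtype='|S6')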

Another approach can be adopted when the categorical features are structured like a list of dictionaries (not necessarily dense; they can have values for only a few features). For example:

data = [
    { 'feature_1': 10.0, 'feature_2': 15.0 },
    { 'feature_1': -5.0, 'feature_3': 22.0 },
    { 'feature_3': -2.0, 'feature_4': 10.0 }
]

In this case, scikit-learn offers the DictVectorizer and FeatureHasher classes; they both produce sparse matrices of real numbers that can be fed into any machine learning model. FeatureHasher has a limited memory consumption and adopts MurmurHash 3 (see https://en.wikipedia.org/wiki/MurmurHash for further information). The code for these two methods is shown as follows:

from sklearn.feature_extraction import DictVectorizer, FeatureHasher

>>> dv = DictVectorizer()
>>> Y_dict = dv.fit_transform(data)

>>> Y_dict.todense()
matrix([[ 10.,  15.,   0.,   0.],
        [ -5.,   0.,  22.,   0.],
        [  0.,   0.,  -2.,  10.]])

>>> dv.vocabulary_
{'feature_1': 0, 'feature_2': 1, 'feature_3': 2, 'feature_4': 3}
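
DictVectorizer also allows going back from the matrix to the original representation (limited to the non-zero entries) through its inverse_transform() method:

>>> dv.inverse_transform(Y_dict)
[{'feature_1': 10.0, 'feature_2': 15.0}, {'feature_1': -5.0, 'feature_3': 22.0}, {'feature_3': -2.0, 'feature_4': 10.0}]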

>>> fh = FeatureHasher()
>>> Y_hashed = fh.fit_transform(data)

>>> Y_hashed.todense()
matrix([[ 0., 0., 0., ..., 0., 0., 0.],
        [ 0., 0., 0., ..., 0., 0., 0.],
        [ 0., 0., 0., ..., 0., 0., 0.]])
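
By default, FeatureHasher allocates 2**20 output columns, which is why the dense view above is almost entirely made of zeros. When the number of distinct features is known to be small, the dimensionality can be reduced through the n_features parameter (at the cost of a higher collision probability):

>>> fh = FeatureHasher(n_features=8)
>>> fh.fit_transform(data).shape
(3, 8)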

In both cases, I suggest you read the original scikit-learn documentation to learn about all the possible options and parameters.

When working with categorical features (normally converted into positive integers through LabelEncoder), it's also possible to apply one-hot encoding to only some columns of the dataset using the OneHotEncoder class. In the following example, the first feature is a binary index which indicates 'Male' or 'Female':

from sklearn.preprocessing import OneHotEncoder

data = [
    [0, 10],
    [1, 11],
    [1, 8],
    [0, 12],
    [0, 15]
]

>>> oh = OneHotEncoder(categorical_features=[0])
>>> Y_oh = oh.fit_transform(data)

>>> Y_oh.todense()
matrix([[  1.,   0.,  10.],
        [  0.,   1.,  11.],
        [  0.,   1.,   8.],
        [  1.,   0.,  12.],
        [  1.,   0.,  15.]])
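
Note that the categorical_features parameter belongs to older scikit-learn releases and has been removed in newer ones. If you're working with a recent version, the same result can be obtained by combining OneHotEncoder with ColumnTransformer; the following is a minimal equivalent sketch:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

>>> ct = ColumnTransformer([('gender', OneHotEncoder(), [0])], remainder='passthrough')
>>> Y_oh = ct.fit_transform(data)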

Considering that these approaches can significantly increase the number of features (one-hot encoding adds one column per category), all the classes adopt sparse matrices based on the SciPy implementation. See https://docs.scipy.org/doc/scipy-0.18.1/reference/sparse.html for further information.
