
Transforming data – PCA and LDA with scikit-learn

Often, a transformation can make data more digestible. In particular, data scientists use transformations to rotate the data onto the axes of greatest overall variation, with the aim of representing similar information with a smaller number of dimensions. We can use the iris dataset as an example, taking its four features and representing similar information in two dimensions. Let's start with principal component analysis (PCA) to orient the data onto the axes of highest variation. The iris dataset has only four dimensions, but this technique can be used on data with tens or hundreds of features:

# reduce dimensions with PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
out_pca = pca.fit_transform(df[['sepal length in cm',
                                'sepal width in cm',
                                'petal length in cm',
                                'petal width in cm']])
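
Before moving on, it is worth a quick sanity check on how much of the original variance these two components retain; the fitted PCA object exposes this through its explained_variance_ratio_ attribute:

# fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)

If the two fractions sum to a value close to 1.0, very little information was lost in the reduction.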

Now, let's create a pandas DataFrame with the output data and use the .head() sanity check to see what we have:

df_pca = pd.DataFrame(data = out_pca, columns = ['pca1', 'pca2'])
print(df_pca.head())

You will see the following output after executing the preceding code:

This looks good, but we are missing the target or label column (species). Let's add the column by concatenating with the original DataFrame. This gives us a PCA DataFrame (df_pca) that is ready for downstream work and predictions. Then, let's plot it to see what the transformed data looks like in just two dimensions:

df_pca = pd.concat([df_pca, df[['species']]], axis = 1)
print(df_pca.head())
sns.lmplot(x="pca1", y="pca2", hue="species", data=df_pca, fit_reg=False)

You will see the following output after executing the preceding code:

The following plot is obtained after the execution of the same code snippet:

We now have our higher-dimensional data represented in two easily digestible, plottable dimensions. However, can we do better? The goal of PCA is to orient the data in the direction of greatest variation. However, it ignores some important information from our dataset – for instance, the labels are not used; perhaps we can extract even better transformation vectors if we include them. The most popular labeled dimension-reduction technique is called linear discriminant analysis (LDA). The math that follows groups the data by class label and then finds the direction of greatest separation between the classes:
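
In the standard formulation (Fisher's criterion), LDA finds the projection vector $w$ that maximizes the ratio of between-class scatter to within-class scatter:

$$J(w) = \frac{w^{T} S_B \, w}{w^{T} S_W \, w}$$

where $S_B$ is the between-class scatter matrix and $S_W$ is the within-class scatter matrix, both computed using the class labels.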

Ignoring labels in the transformation step can be desirable for some problems (especially those with unreliable class labels), as it avoids pulling the reduced component vectors in an unhelpful direction. For this reason, I recommend that you always start with PCA before deciding whether you need to do any further work. Indeed, unless your dataset is large, the computation time for PCA is short, so there's no harm in starting here.
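
One practical caveat: PCA is driven by raw variance, so features measured on larger scales will dominate the components. The iris features are all in centimeters, so this matters little here, but with mixed units it is common to standardize first. Here is a minimal sketch, assuming the same df as above:

from sklearn.preprocessing import StandardScaler

# standardize each feature to zero mean and unit variance, then project
scaled = StandardScaler().fit_transform(df.iloc[:, :4])
out_pca_scaled = PCA(n_components=2).fit_transform(scaled)

Now, let's apply LDA to the same four features:
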
# reduce dimensions with LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)

# format dataframe
out_lda = lda.fit_transform(X=df.iloc[:,:4], y=df['species'])
df_lda = pd.DataFrame(data = out_lda, columns = ['lda1', 'lda2'])
df_lda = pd.concat([df_lda, df[['species']]], axis = 1)

# sanity check
print(df_lda.head())

# plot
sns.lmplot(x="lda1", y="lda2", hue="species", data=df_lda, fit_reg=False)

You will see the following output after executing the preceding code:

The following plot is obtained after the execution of the same code snippet:
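
As with PCA, you can check how the discriminative information divides across the two new axes. With the default svd solver, the fitted LinearDiscriminantAnalysis estimator exposes an explained_variance_ratio_ attribute:

# proportion of variance explained by each discriminant axis
print(lda.explained_variance_ratio_)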

The scatter plots may tempt you into thinking that the PCA and LDA techniques performed the same transformation on the data. Let's look a little closer at the first component of each using the powerful violin plot routine. We will begin with PCA, as follows:

sns.violinplot(x='species',y='pca1', data=df_pca).set_title("Violin plot: Feature = PCA_1")

You will see the following output after executing the preceding code:

Now, let's plot the first LDA component, as follows:

sns.violinplot(x='species',y='lda1', data=df_lda).set_title("Violin plot: Feature = LDA_1")

You will see the following output after executing the preceding code:
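
To compare the two transformations quantitatively rather than by eye, one option is scikit-learn's silhouette score, which measures how well separated the labeled groups are in each two-dimensional embedding. The following is a sketch built on the df_pca and df_lda frames created above:

from sklearn.metrics import silhouette_score

# higher scores mean the species form tighter, better-separated groups
print(silhouette_score(df_pca[['pca1', 'pca2']], df_pca['species']))
print(silhouette_score(df_lda[['lda1', 'lda2']], df_lda['species']))

A higher score for the LDA embedding would confirm what the violin plots suggest: using the labels during the transformation buys extra class separation.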
