
Example of a generative Gaussian mixture

We can now implement this model in Python using a simple bidimensional dataset, created using the make_blobs() function provided by Scikit-Learn:

import numpy as np
from sklearn.datasets import make_blobs

nb_samples = 1000
nb_unlabeled = 750

X, Y = make_blobs(n_samples=nb_samples, n_features=2, centers=2, cluster_std=2.5, random_state=100)

# Randomly select the indices of the points that will become unlabeled (class -1)
unlabeled_idx = np.random.choice(np.arange(0, nb_samples, 1), replace=False, size=nb_unlabeled)
Y[unlabeled_idx] = -1

We have created 1,000 samples belonging to 2 classes; 750 points have then been randomly selected to become our unlabeled dataset (the corresponding class has been set to -1). We can now initialize two Gaussian distributions by defining their means, covariance matrices, and weights. One possibility is to use random values:

import numpy as np

# First Gaussian
m1 = np.random.uniform(-7.5, 10.0, size=2)
c1 = np.random.uniform(5.0, 15.0, size=(2, 2))
c1 = np.dot(c1, c1.T)  # Make the covariance matrix positive semi-definite
q1 = 0.5

# Second Gaussian
m2 = np.random.uniform(-7.5, 10.0, size=2)
c2 = np.random.uniform(5.0, 15.0, size=(2, 2))
c2 = np.dot(c2, c2.T)  # Make the covariance matrix positive semi-definite
q2 = 0.5

However, as the covariance matrices must be positive semi-definite, it's necessary either to alter the random values (by multiplying each matrix by its own transpose, as done above) or to set hard-coded initial parameters. In this case, we could pick the following example:

import numpy as np

# First Gaussian
m1 = np.array([-3.0, -4.5])
c1 = np.array([[25.0, 5.0],
               [5.0, 35.0]])
q1 = 0.5

# Second Gaussian
m2 = np.array([5.0, 10.0])
c2 = np.array([[25.0, -10.0],
               [-10.0, 25.0]])
q2 = 0.5
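
If desired, it's possible to verify that both hard-coded matrices are indeed positive semi-definite by checking that their eigenvalues are non-negative (this quick check is an addition to the original snippet):

# A symmetric matrix is positive semi-definite if and only if
# all of its eigenvalues are non-negative
print(np.linalg.eigvalsh(c1))
print(np.linalg.eigvalsh(c2))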

The resulting plot is shown in the following graph, where the small diamonds represent the unlabeled points and the bigger dots, the samples belonging to the known classes:

Initial configuration of the Gaussian mixture
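
A possible way to obtain such a plot (this snippet is not part of the original example; it assumes that Matplotlib is available and reuses the variables defined above) is to draw the points together with a few constant-density ellipses computed from each mean and covariance matrix:

import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

def gaussian_ellipse(mean, cov, n_std):
    # The eigen-decomposition of the covariance matrix gives the
    # orientation and the semi-axes of a constant-density ellipse
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = eigvals.argsort()[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    angle = np.degrees(np.arctan2(eigvecs[1, 0], eigvecs[0, 0]))
    width, height = 2.0 * n_std * np.sqrt(eigvals)
    return Ellipse(xy=mean, width=width, height=height, angle=angle, fill=False)

fig, ax = plt.subplots(figsize=(12, 8))

ax.scatter(X[Y == -1, 0], X[Y == -1, 1], marker='d', s=25, label='Unlabeled')
ax.scatter(X[Y == 0, 0], X[Y == 0, 1], marker='o', s=80, label='Class 0')
ax.scatter(X[Y == 1, 0], X[Y == 1, 1], marker='o', s=80, label='Class 1')

for n_std in (1.0, 2.0, 3.0):
    ax.add_patch(gaussian_ellipse(m1, c1, n_std))
    ax.add_patch(gaussian_ellipse(m2, c2, n_std))

ax.legend()
plt.show()

The n_std values only control how many concentric ellipses are drawn for each Gaussian.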

The two Gaussians are represented by the concentric ellipses. We can now execute the training procedure. For simplicity, we repeat the update for a fixed number of iterations; the reader can easily modify the code in order to introduce a convergence threshold (a possible variant is sketched at the end of this example):

from scipy.stats import multivariate_normal

nb_iterations = 5

for i in range(nb_iterations):
    Pij = np.zeros((nb_samples, 2))

    # E-step: compute the responsibilities p(yi|xj, theta, w)
    for j in range(nb_samples):
        if Y[j] == -1:
            p1 = multivariate_normal.pdf(X[j], m1, c1, allow_singular=True) * q1
            p2 = multivariate_normal.pdf(X[j], m2, c2, allow_singular=True) * q2
            Pij[j] = np.array([p1, p2]) / (p1 + p2)
        else:
            Pij[j, :] = [1.0, 0.0] if Y[j] == 0 else [0.0, 1.0]

    # M-step: update means, weights, and covariance matrices
    n = np.sum(Pij, axis=0)

    m1 = np.dot(Pij[:, 0], X) / n[0]
    m2 = np.dot(Pij[:, 1], X) / n[1]

    q1 = n[0] / float(nb_samples)
    q2 = n[1] / float(nb_samples)

    c1 = np.zeros((2, 2))
    c2 = np.zeros((2, 2))

    for t in range(nb_samples):
        c1 += Pij[t, 0] * np.outer(X[t] - m1, X[t] - m1)
        c2 += Pij[t, 1] * np.outer(X[t] - m2, X[t] - m2)

    c1 /= n[0]
    c2 /= n[1]

At the beginning of each cycle, we initialize the Pij matrix that will be used to store the p(yi|xj,θ,w) values. Then, for each sample, we compute p(yi|xj,θ,w), taking into account whether the point is labeled or not. The Gaussian probability is computed using the SciPy function multivariate_normal.pdf(). When the whole Pij matrix has been populated, we can update the means and covariance matrices of both Gaussians, together with the relative weights. The algorithm is very fast; after five iterations, we get the stable state represented in the following graph:

The two Gaussians have perfectly mapped the space by setting their parameters so as to cover the high-density regions. We can check for some unlabeled points, as follows:

print(np.round(X[Y==-1][0:10], 3))

[[ 1.67    7.204]
 [-1.347  -5.672]
 [-2.395  10.952]
 [-0.261   6.526]
 [ 1.053   8.961]
 [-0.579  -7.431]
 [ 0.956   9.739]
 [-5.889   5.227]
 [-2.761   8.615]
 [-1.777   4.717]]

It's easy to locate them in the previous plot. The corresponding classes can be obtained through the last Pij matrix:

print(np.round(Pij[Y==-1][0:10], 3))

[[ 0.002  0.998]
 [ 1.     0.   ]
 [ 0.     1.   ]
 [ 0.003  0.997]
 [ 0.     1.   ]
 [ 1.     0.   ]
 [ 0.     1.   ]
 [ 0.007  0.993]
 [ 0.     1.   ]
 [ 0.02   0.98 ]]
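
If hard labels are also needed for the unlabeled points, they can be obtained from the final Pij matrix; the following short sketch (an addition to the original example) assigns each point to the Gaussian with the highest responsibility:

# Assign each unlabeled point to the Gaussian with the highest responsibility
Y_pred = Y.copy()
Y_pred[Y == -1] = np.argmax(Pij[Y == -1], axis=1)
print(Y_pred[Y == -1][0:10])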

It's immediately possible to verify that the unlabeled points have been correctly labeled and assigned to the right cluster. This algorithm is very fast and produces excellent results in terms of density estimation. In Chapter 5, EM Algorithm and Applications, we are going to discuss a general version of this algorithm, explaining the complete training procedure based on the EM algorithm.
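
Finally, as anticipated when introducing the training loop, the fixed number of iterations can be replaced with a convergence criterion. The following is only an illustrative sketch of such a variant (it reuses the imports and the same E-step and M-step shown above, and the tolerance value is arbitrary); it stops as soon as both means have almost stopped moving between two consecutive iterations:

tolerance = 1e-4
max_iterations = 100

for iteration in range(max_iterations):
    previous_m1, previous_m2 = m1.copy(), m2.copy()

    # E-step: recompute the responsibilities (one-hot values for labeled points)
    Pij = np.zeros((nb_samples, 2))
    for j in range(nb_samples):
        if Y[j] == -1:
            p1 = multivariate_normal.pdf(X[j], m1, c1, allow_singular=True) * q1
            p2 = multivariate_normal.pdf(X[j], m2, c2, allow_singular=True) * q2
            Pij[j] = np.array([p1, p2]) / (p1 + p2)
        else:
            Pij[j, :] = [1.0, 0.0] if Y[j] == 0 else [0.0, 1.0]

    # M-step: update weights, means, and covariance matrices
    n = np.sum(Pij, axis=0)

    m1 = np.dot(Pij[:, 0], X) / n[0]
    m2 = np.dot(Pij[:, 1], X) / n[1]

    q1 = n[0] / float(nb_samples)
    q2 = n[1] / float(nb_samples)

    c1 = sum(Pij[t, 0] * np.outer(X[t] - m1, X[t] - m1) for t in range(nb_samples)) / n[0]
    c2 = sum(Pij[t, 1] * np.outer(X[t] - m2, X[t] - m2) for t in range(nb_samples)) / n[1]

    # Stop as soon as both means have (almost) stopped moving
    if max(np.linalg.norm(m1 - previous_m1), np.linalg.norm(m2 - previous_m2)) < tolerance:
        break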

In all the examples that involve random numbers, the seed is set equal to 1,000 (np.random.seed(1000)). Other values or subsequent experiments without resetting it can yield slightly different results.
