
Multiple layer model

A multi-layer perceptron (MLP) is a feedforward net with multiple layers. A second linear layer, called the hidden layer, is added to the previous example.

Having two linear layers following each other is equivalent to having a single linear layer.
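
To see why, here is a minimal NumPy sketch (not from the book's code; the shapes and variable names are hypothetical) that composes two linear layers and checks that a single linear layer with W = W1.dot(W2) and b = b1.dot(W2) + b2 produces the same output:

import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(5, 784)                       # a batch of 5 hypothetical inputs
W1, b1 = rng.rand(784, 500), rng.rand(500)
W2, b2 = rng.rand(500, 10), rng.rand(10)

two_layers = (x.dot(W1) + b1).dot(W2) + b2   # output of two stacked linear layers
W, b = W1.dot(W2), b1.dot(W2) + b2           # an equivalent single linear layer
print(np.allclose(two_layers, x.dot(W) + b)) # True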

With a non-linear function (also called a non-linearity or transfer function) between the two linear layers, the model no longer simplifies into a linear one, and can represent many more functions, in order to capture more complex patterns in the data.

Activation functions help the units saturate (ON-OFF behavior) and reproduce the activations of biological neurons.

The Rectified Linear Unit (ReLU) is given by the following expression:

(x + T.abs_(x)) / 2.0

The Leaky Rectified Linear Unit (Leaky ReLU) is given by the following expression:

( (1 + leak) * x + (1 - leak) * T.abs_(x) ) / 2.0

Here, leak is a parameter that defines the slope for negative input values. In the leaky rectifier, this parameter is fixed.

The activation named PReLU (Parametric ReLU) treats the leak parameter as learnable.
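
As an illustration, here is a minimal sketch (not the book's code, assuming the usual imports of numpy as np, theano, and theano.tensor as T) in which the leak becomes a trainable shared variable that can be appended to the parameter list, so that gradient descent updates it along with the weights:

# Hypothetical PReLU sketch: the leak is a trainable shared variable
leak = theano.shared(np.asarray(0.25, dtype=theano.config.floatX), name='leak')

def prelu(x):
    return ((1 + leak) * x + (1 - leak) * T.abs_(x)) / 2.0

# later, include it in the parameter list, for example params = [W1, b1, leak, ...]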

More generally, a piecewise linear activation can be learned by adding a linear layer followed by a maxout activation over n_pool units:

T.max([x[:, n::n_pool] for n in range(n_pool)], axis=0)

This takes, for each output unit, the maximum of its n_pool underlying learned linearities.
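
As a sketch (with hypothetical W_max and b_max names, using the shared_glorot_uniform and shared_zeros helpers defined later in this section and the symbolic input x), a maxout hidden layer can be built from a linear layer producing n_hidden * n_pool values, reduced by the maximum over each group of n_pool of them:

# Hypothetical maxout sketch: n_hidden * n_pool linearities, reduced to n_hidden units
n_pool = 4
W_max = shared_glorot_uniform( (n_in, n_hidden * n_pool), name='W_max' )
b_max = shared_zeros( (n_hidden * n_pool,), name='b_max' )
lin = T.dot(x, W_max) + b_max
hidden_output = T.max([lin[:, n::n_pool] for n in range(n_pool)], axis=0)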

Sigmoid function (T.nnet.sigmoid) is given as:

T.nnet.sigmoid(x)

HardSigmoid function is given as:

T.clip(x + 0.5, 0., 1.)

HardTanh function is given as:

T.clip(x, -1., 1.)

Tanh function is given as:

T.tanh(x)
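
To get a feel for these activations, here is a small standalone sketch (not part of the model) that compiles the ReLU, HardSigmoid, HardTanh, and Tanh expressions into a single Theano function and evaluates them on a few sample values:

import numpy as np
import theano
import theano.tensor as T

v = T.vector('v')
# ReLU, HardSigmoid, HardTanh and Tanh, compiled side by side
activations = theano.function([v], [
    (v + T.abs_(v)) / 2.0,
    T.clip(v + 0.5, 0., 1.),
    T.clip(v, -1., 1.),
    T.tanh(v),
])
for out in activations(np.array([-2., -0.5, 0., 0.5, 2.], dtype=theano.config.floatX)):
    print(out)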

Written in Python, this two-layer network model is as follows:

batch_size = 600
n_in = 28 * 28     # MNIST images are 28x28 pixels
n_hidden = 500     # number of units in the hidden layer
n_out = 10         # number of output classes (digits 0 to 9)

def shared_zeros(shape, dtype=theano.config.floatX, name='', n=None):
    # Shared variable initialized with zeros
    shape = shape if n is None else (n,) + shape
    return theano.shared(np.zeros(shape, dtype=dtype), name=name)

def shared_glorot_uniform(shape, dtype=theano.config.floatX, name='', n=None):
    # Shared variable initialized with the Glorot/Xavier uniform scheme:
    # values drawn uniformly from [-a, a] with a = sqrt(6 / v)
    if isinstance(shape, int):
        high = np.sqrt(6. / shape)
    else:
        high = np.sqrt(6. / (np.sum(shape[:2]) * np.prod(shape[2:])))
    shape = shape if n is None else (n,) + shape
    return theano.shared(np.asarray(
        np.random.uniform(
            low=-high,
            high=high,
            size=shape),
        dtype=dtype), name=name)

# First layer: Glorot-initialized weights and zero biases
W1 = shared_glorot_uniform( (n_in, n_hidden), name='W1' )
b1 = shared_zeros( (n_hidden,), name='b1' )

# Hidden layer output, with a tanh non-linearity
hidden_output = T.tanh(T.dot(x, W1) + b1)

# Second (output) layer, followed by a softmax to produce class probabilities
W2 = shared_zeros( (n_hidden, n_out), name='W2' )
b2 = shared_zeros( (n_out,), name='b2' )

model = T.nnet.softmax(T.dot(hidden_output, W2) + b2)
params = [W1, b1, W2, b2]

In deep nets, if weights are initialized to zero with the shared_zeros method, the signal will not flow through the network correctly from end to end. If weights are initialized with values that are too large, most activation functions saturate after a few steps. So, we need to ensure that the values can be passed to the next layer during forward propagation, and that the gradients can be passed back to the previous layer during back-propagation.

We also need to break the symmetry between neurons. If the weights of all neurons are zero (or if they are all equal), they will all evolve in exactly the same way, and the model will not learn much.

The researcher Xavier Glorot studied an algorithm to initialize weights in an optimal way. It consists of drawing the weights from a Gaussian or uniform distribution with zero mean and the following variance:

Var(W) = 2 / v,    with    v = n_in + n_out

Here are the variables from the preceding formula:

  • n_in is the number of inputs the layer receives during feedforward propagation
  • n_out is the number of gradients the layer receives during back-propagation

In the case of a linear model, the shape parameter is a tuple, and v is simply numpy.sum( shape[:2] ) (in this case, numpy.prod(shape[2:]) is 1).

The variance of a uniform distribution on [-a, a] is a**2 / 3, so the bound a can be computed as follows:

a = sqrt( 6 / v ) = sqrt( 6 / (n_in + n_out) )
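
As a quick numerical check, for the first layer of this network (n_in = 784, n_out = 500), the bound is about 0.068:

import numpy as np
a = np.sqrt(6. / (784 + 500))
print(a)    # ~0.068, so W1 is drawn uniformly from roughly [-0.068, 0.068]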

The cost can be defined in the same way as before, but the gradient descent update needs to be adapted to deal with the list of parameters, [W1, b1, W2, b2]:

g_params = T.grad(cost=cost, wrt=params)
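
Here, cost is assumed to be the same negative log-likelihood as in the previous single-layer example, which would read along these lines:

# Assumed cost (same negative log-likelihood as in the previous example):
# mean over the minibatch of the log-probability assigned to the correct class
cost = -T.mean(T.log(model)[T.arange(y.shape[0]), y])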

The training loop requires an updated training function:

learning_rate = 0.01
# One SGD update rule per parameter in the list
updates = [
        (param, param - learning_rate * gparam)
        for param, gparam in zip(params, g_params)
    ]

train_model = theano.function(
    inputs=[index],
    outputs=cost,
    updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size]
    }
)

In this case, the learning rate is global to the net, with all weights being updated at the same rate. The learning rate is set to 0.01 instead of 0.13. We'll speak about hyperparameter tuning in the training section.

The training loop remains unchanged. The full code is given in the 2-multi.py file.
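
For reference, the unchanged loop has roughly the following shape (a sketch, assuming n_epochs and the shared train_set_x from the previous example; see the 2-multi.py file for the exact version):

# Sketch of the (unchanged) training loop: call the compiled training
# function on every minibatch index, for a number of epochs
n_train_batches = train_set_x.get_value(borrow=True).shape[0] // batch_size

for epoch in range(n_epochs):
    for minibatch_index in range(n_train_batches):
        avg_cost = train_model(minibatch_index)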

Execution time on the GPU is 5 minutes and 55 seconds, while on the CPU it is 51 minutes and 36 seconds.

After 1,000 iterations, the error has dropped to 2%, which is a lot better than the previous 5% error rate, but part of it might be due to overfitting. We'll compare the different models later.
