Models

Parameters
As a slightly more complex example, suppose we want to define the following network:

\[
\begin{aligned}
h &= \tanh(V i + a) \\
o &= \tanh(W h + b) \\
e &= \lVert o - c \rVert^2
\end{aligned}
\]

where \(i\) is the input vector and \(c\) is the correct output vector. The parameters of the model are the matrices \(V\) and \(W\) and the vectors \(a\) and \(b\).
Parameters are created like constants, using parameter(value), where value is the initial value of the parameter:
import numpy
from penne import *  # parameter, constant, tanh, dot, distance2, SGD, etc. (assumed top-level exports)

nh = 3  # number of hidden units

V = parameter(numpy.random.uniform(-1., 1., (nh, 2)))  # hidden-layer weights (nh x 2)
a = parameter(numpy.zeros((nh,)))                       # hidden-layer bias
W = parameter(numpy.random.uniform(-1., 1., (1, nh)))   # output-layer weights (1 x nh)
b = parameter(numpy.zeros((1,)))                        # output-layer bias
The inputs and correct outputs are going to be “constants” whose value we will change from example to example:
i = constant(numpy.empty((2,)))
c = constant(numpy.empty((1,)))
Finally, define the network. This is nearly a straight copy of the equations above:
h = tanh(dot(V, i) + a)
o = tanh(dot(W, h) + b)
e = distance2(o, c)
Training
To train the network, first create a trainer object (here SGD; see below for other trainers). Then feed it expressions using its receive method, which updates the parameters to try to minimize each expression and also returns the expression's value.
import random

trainer = SGD(learning_rate=0.1)

# XOR training data: [input 1, input 2, correct output]
data = [[-1., -1., -1.],
        [-1.,  1.,  1.],
        [ 1., -1.,  1.],
        [ 1.,  1., -1.]] * 10

for epoch in range(10):
    random.shuffle(data)
    loss = 0.
    for x, y, z in data:
        i.value[...] = [x, y]   # set the input "constant"
        c.value[...] = [z]      # set the correct output
        loss += trainer.receive(e)
    print(loss/len(data))
1.08034928912
0.98879616038
1.00183385115
0.951137577661
0.840384066165
0.314003950596
0.0539702267511
0.0295536827621
0.0192921979733
0.0140214011032
Loading and Saving
To save the model, call save_model(file), where file is a file-like object. To load a model, you must build your expressions in exactly the same way that you did up to the point that you saved the model, then call load_model(file).
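For concreteness, here is a minimal sketch of saving and reloading the network defined above. The filename and the binary file modes are assumptions; the text only requires a file-like object.

# Save the trained parameters (V, a, W, b).
# The filename and "wb"/"rb" modes are assumptions; save_model and
# load_model just need a file-like object.
with open("xor.model", "wb") as f:
    save_model(f)

# Later, in a fresh session: rebuild V, a, W, b, i, c, h, o, e exactly as
# above, then restore the saved parameter values.
with open("xor.model", "rb") as f:
    load_model(f)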
Reference
- penne.parameter(value, model=[])
  A parameter that is to be trained.
  Parameters:
    - value (Numpy array) – The initial value of the new parameter.
- class penne.optimize.StochasticGradientDescent(learning_rate=0.1, clip_gradients=None)
  Stochastic gradient descent.
  Parameters:
    - learning_rate – Learning rate.
    - clip_gradients – Maximum l2 norm of the gradients, or None.

- penne.optimize.SGD
  alias of StochasticGradientDescent
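As a usage note, a trainer can be constructed with non-default settings; for example, SGD with gradient clipping (the specific numbers below are illustrative, not recommendations):

# SGD that clips the l2 norm of the gradients at 5.0
# (learning rate and clipping threshold are illustrative values).
trainer = SGD(learning_rate=0.05, clip_gradients=5.0)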
- class penne.optimize.AdaGrad(learning_rate=0.1, epsilon=1e-08)
  AdaGrad (diagonal version).
  John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR 12:2121-2159, 2011.
  Parameters:
    - learning_rate – Learning rate.
    - epsilon – Small constant to prevent division by zero.

- penne.optimize.Adagrad
  alias of AdaGrad
- class penne.optimize.AdaDelta(decay=0.95, epsilon=1e-06)
  AdaDelta.
  Matthew D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv:1212.5701, 2012.
  Parameters:
    - decay – Decay rate of RMS average of updates and gradients.
    - epsilon – Small constant to prevent division by zero.

- penne.optimize.Adadelta
  alias of AdaDelta
- class penne.optimize.Momentum(learning_rate=0.01, decay=0.9)
  Stochastic gradient descent with momentum.
  Parameters:
    - learning_rate – Learning rate.
    - decay – Decay rate of sum of gradients (also known as the momentum coefficient).
- class penne.optimize.NesterovMomentum(learning_rate=0.01, decay=0.9)
  Momentum-like version of Nesterov accelerated gradient.
  Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proc. ICML, 2013.
  Parameters:
    - learning_rate – Learning rate.
    - decay – Decay rate of sum of gradients (also known as the momentum coefficient).
- class penne.optimize.RMSprop(learning_rate=0.01, decay=0.9, epsilon=1e-08)
  RMSprop.
  Hinton. Overview of mini-batch gradient descent. http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
  Parameters:
    - learning_rate – Learning rate.
    - decay – Decay rate of RMS average of gradients.
    - epsilon – Small constant to prevent division by zero.
- class penne.optimize.Adam(learning_rate=0.001, decay1=0.9, decay2=0.999, epsilon=1e-08)
  Adam.
  Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In ICLR 2015. http://arxiv.org/pdf/1412.6980.pdf
  Parameters:
    - learning_rate – Learning rate.
    - decay1 – Decay rate of average of gradients.
    - decay2 – Decay rate of RMS average of gradients.
    - epsilon – Small constant to prevent division by zero.
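Any of these trainers can stand in for SGD in the training loop above. A minimal sketch, assuming the classes are imported from penne.optimize as listed in this reference (the hyperparameters shown are just the documented defaults):

from penne.optimize import Adam, AdaDelta

# Drop-in replacements for the SGD trainer in the Training section;
# the rest of the loop (trainer.receive(e)) is unchanged.
trainer = Adam(learning_rate=0.001, decay1=0.9, decay2=0.999)
# or:
trainer = AdaDelta(decay=0.95, epsilon=1e-06)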