Models

Parameters
As a slightly more complex example, suppose we want to define the following network:

\[
\begin{aligned}
h &= \tanh(V i + a) \\
o &= \tanh(W h + b) \\
e &= \lVert o - c \rVert^2
\end{aligned}
\]

where \(i\) is the input vector and \(c\) is the correct output vector. The parameters of the model are the matrices \(V\) and \(W\) and the vectors \(a\) and \(b\).
Parameters are created like constants, using parameter(value), where value is the initial value of the parameter:
import numpy
from penne import *  # parameter, constant, tanh, dot, distance2, SGD, etc. (assumed top-level exports)

nh = 3  # number of hidden units

V = parameter(numpy.random.uniform(-1., 1., (nh, 2)))  # hidden-layer weights (nh x 2)
a = parameter(numpy.zeros((nh,)))                       # hidden-layer bias
W = parameter(numpy.random.uniform(-1., 1., (1, nh)))   # output-layer weights (1 x nh)
b = parameter(numpy.zeros((1,)))                        # output-layer bias
The inputs and correct outputs are going to be “constants” whose value we will change from example to example:
i = constant(numpy.empty((2,)))
c = constant(numpy.empty((1,)))
Finally, define the network. This is nearly a straight copy of the equations above:
h = tanh(dot(V, i) + a)
o = tanh(dot(W, h) + b)
e = distance2(o, c)
Training
To train the network, first create a trainer object (here SGD; see below for other trainers). Then feed it expressions using its receive method, which updates the parameters to try to minimize each expression and also returns the expression's value.
import random

trainer = SGD(learning_rate=0.1)

# XOR training data: [input 1, input 2, correct output]
data = [[-1., -1., -1.],
        [-1.,  1.,  1.],
        [ 1., -1.,  1.],
        [ 1.,  1., -1.]] * 10

for epoch in range(10):
    random.shuffle(data)
    loss = 0.
    for x, y, z in data:
        i.value[...] = [x, y]   # set the input "constant"
        c.value[...] = [z]      # set the correct output
        loss += trainer.receive(e)
    print(loss/len(data))
1.08034928912
0.98879616038
1.00183385115
0.951137577661
0.840384066165
0.314003950596
0.0539702267511
0.0295536827621
0.0192921979733
0.0140214011032
Loading and Saving
To save the model, call save_model(file), where file is a file-like object. To load a model, you must build your expressions in exactly the same way that you did up to the point that you saved the model, then call load_model(file).
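For concreteness, here is a minimal sketch of saving and reloading the network defined above. The filename and the binary file modes are assumptions; the text only requires a file-like object.

# Save the trained parameters (V, a, W, b).
# The filename and "wb"/"rb" modes are assumptions; save_model and
# load_model just need a file-like object.
with open("xor.model", "wb") as f:
    save_model(f)

# Later, in a fresh session: rebuild V, a, W, b, i, c, h, o, e exactly as
# above, then restore the saved parameter values.
with open("xor.model", "rb") as f:
    load_model(f)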
Reference
- penne.parameter(value, model=[])
  A parameter that is to be trained.
  Parameters:
    - value (Numpy array) – The initial value of the new parameter.
- class penne.optimize.StochasticGradientDescent(learning_rate=0.1, clip_gradients=None)
  Stochastic gradient descent.
  Parameters:
    - learning_rate – Learning rate.
    - clip_gradients – Maximum l2 norm of the gradients, or None.

- penne.optimize.SGD
  alias of StochasticGradientDescent
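As a usage note, a trainer can be constructed with non-default settings; for example, SGD with gradient clipping (the specific numbers below are illustrative, not recommendations):

# SGD that clips the l2 norm of the gradients at 5.0
# (learning rate and clipping threshold are illustrative values).
trainer = SGD(learning_rate=0.05, clip_gradients=5.0)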
- class penne.optimize.AdaGrad(learning_rate=0.1, epsilon=1e-08)
  AdaGrad (diagonal version).
  John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR 12:2121-2159, 2011.
  Parameters:
    - learning_rate – Learning rate.
    - epsilon – Small constant to prevent division by zero.

- penne.optimize.Adagrad
  alias of AdaGrad
- class penne.optimize.AdaDelta(decay=0.95, epsilon=1e-06)
  AdaDelta.
  Matthew D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv:1212.5701, 2012.
  Parameters:
    - decay – Decay rate of RMS average of updates and gradients.
    - epsilon – Small constant to prevent division by zero.

- penne.optimize.Adadelta
  alias of AdaDelta
- class penne.optimize.Momentum(learning_rate=0.01, decay=0.9)
  Stochastic gradient descent with momentum.
  Parameters:
    - learning_rate – Learning rate.
    - decay – Decay rate of sum of gradients (also known as the momentum coefficient).
- class penne.optimize.NesterovMomentum(learning_rate=0.01, decay=0.9)
  Momentum-like version of Nesterov accelerated gradient.
  Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proc. ICML, 2013.
  Parameters:
    - learning_rate – Learning rate.
    - decay – Decay rate of sum of gradients (also known as the momentum coefficient).
- class penne.optimize.RMSprop(learning_rate=0.01, decay=0.9, epsilon=1e-08)
  RMSprop.
  Hinton. Overview of mini-batch gradient descent. http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
  Parameters:
    - learning_rate – Learning rate.
    - decay – Decay rate of RMS average of gradients.
    - epsilon – Small constant to prevent division by zero.
- class penne.optimize.Adam(learning_rate=0.001, decay1=0.9, decay2=0.999, epsilon=1e-08)
  Adam.
  Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In ICLR 2015. http://arxiv.org/pdf/1412.6980.pdf
  Parameters:
    - learning_rate – Learning rate.
    - decay1 – Decay rate of average of gradients.
    - decay2 – Decay rate of RMS average of gradients.
    - epsilon – Small constant to prevent division by zero.
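Any of these trainers can stand in for SGD in the training loop above. A minimal sketch, assuming the classes are imported from penne.optimize as listed in this reference (the hyperparameters shown are just the documented defaults):

from penne.optimize import Adam, AdaDelta

# Drop-in replacements for the SGD trainer in the Training section;
# the rest of the loop (trainer.receive(e)) is unchanged.
trainer = Adam(learning_rate=0.001, decay1=0.9, decay2=0.999)
# or:
trainer = AdaDelta(decay=0.95, epsilon=1e-06)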