E cient BackProp
Yann LeCun1, Leon Bottou1, Genevieve B. Orr2, and Klaus-Robert M uller3
1 Image Processing Research Department AT& T Labs - Research, 100 Schulz Drive,
Red Bank, NJ 07701-7033, USA
2 Willamette University, 900 State Street, Salem, OR 97301, USA
3 GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany
fyann,leonbg@research.att.com, gorr@willamette.edu, klaus@ rst.gmd.de
originally published in
Orr, G. and M uller, K. \Neural Networks: tricks of the trade",
Springer, 1998.
Abstract. The convergence of back-propagation learning is analyzed
so as to explain common phenomenon observed by practitioners. Many
undesirable behaviors of backprop can be avoided with tricks that are
rarely exposed in serious technical publications. This paper gives some
of those tricks, and o ers explanations of why they work.
Many authors have suggested that second-order optimization methods
are advantageous for neural net training. It is shown that most \classical"
second-order methods are impractical for large neural networks. A few
methods are proposed that do not have these limitations.
1 Introduction
Backpropagation is a very popular neural network learning algorithm because
it is conceptually simple, computationally e cient, and because it often works.
However, getting it to work well, and sometimes to work at all, can seem more of
an art than a science. Designing and training a network using backprop requires
making many seemingly arbitrary choices such as the number and types of nodes,
layers, learning rates, training and test sets, and so forth. These choices can be
critical, yet there is no foolproof recipe for deciding them because they are largely
problem and data dependent. However, there are heuristics and some underlying
theory that can help guide a practitioner to make better choices.
In the rst section below we introduce standard backpropagation and discuss
a number of simple heuristics or tricks for improving its performance. We next
discuss issues of convergence. We then describe a few \classical" second-order
non-linear optimization techniques and show that their application to neural
network training is very limited, despite many claims to the contrary in the
literature. Finally, we present a few second-order methods that do accelerate
learning in certain cases.
2 Learning and Generalization
There are several approaches to automatic machine learning, but much of the
successful approaches can be categorized as gradient-based learning methods. The
learning machine, as represented in Figure 1, computes a function M(Zp; W)
where Zp is the p-th input pattern, and W represents the collection of adjustable
parameters in the system. A cost function Ep = C(Dp; M(Zp; W)), measures
the discrepancy between Dp, the \correct" or desired output for pattern Zp, and
the output produced by the system. The average cost function Etrain(W) is the
average of the errors Ep over a set of input/output pairs called the training set
f(Z1; D1); ::::(ZP ; DP )g. In the simplest setting, the learning problem consists in
nding the value of W that minimizes Etrain(W). In practice, the performance
of the system on a training set is of little interest. The more relevant measure
is the error rate of the system in the eld, where it would be used in practice.
This performance is estimated by measuring the accuracy on a set of samples
disjoint from the training set, called the test set. The most commonly used cost
function is the Mean Squared Error:
Ep =
1
2
(Dp
M(Zp; W))2;
Etrain =
1
P
X
p=1
Ep
COST FUNCTION
LEARNING
MACHINE
Parameters
Output
E0, E1,....Ep
Error
Desired
Output
D0, D1,...Dp
Input
M(Z,W)
Z0, Z1,... Zp
W
Fig. 1. Gradient-based learning machine.
This chapter is focused on strategies for improving the process of minimizing
the cost function. However, these strategies must be used in conjunction with
methods for maximizing the network's ability to generalize, that is, to predict
the correct targets for patterns the learning system has not previously seen for
more detail).
To understand generalization, let us consider how backpropagation works.
We start with a set of samples each of which is an input/output pair of the
function to be learned. Since the measurement process is often noisy, there may
be errors in the samples. We can imagine that if we collected multiple sets
of samples then each set would look a little di erent because of the noise and
because of the di erent points sampled. Each of these data sets would also result
in networks with minima that are slightly di erent from each other and from the
true function. In this chapter, we concentrate on improving the process of nding
the minimum for the particular set of examples that we are given. Generalization
techniques try to correct for the errors introduced into the network as a result
of our choice of dataset. Both are important.
Several theoretical e orts have analyzed the process of learning by mini-
mizing the error on a training set (a process sometimes called Empirical Risk
Minimization) 40, 41].
Some of those theoretical analyses are based on decomposing the generaliza-
tion error into two terms: bias and variance (see e.g. 12]). The bias is a measure
of how much the network output, averaged over all possible data sets di ers from
the desired function. The variance is a measure of how much the network output
varies between datasets. Early in training, the bias is large because the network
output is far from the desired function. The variance is very small because the
data has had little in uence yet. Late in training, the bias is small because the
network has learned the underlying function. However, if trained too long, the
network will also have learned the noise speci c to that dataset. This is referred
to as overtraining. In such a case, the variance will be large because the noise
varies between datasets. It can be shown that the minimum total error will occur
when the sum of bias and variance are minimal.
There are a number of techniques (e.g. early stopping, regularization) for
maximizing the generalization ability of a network when using backprop.
The idea of this chapter, therefore, is to present minimization strategies
(given a cost function) and the tricks associated with increasing the speed and
quality of the minimization. It is however clear that the choice of the model
(model selection), the architecture and the cost function is crucial for obtaining
a network that generalizes well. So keep in mind that if the wrong model class is
used and no proper model selection is done, then even a superb minimization will
clearly not help very much. In fact, the existence of overtraining has led several
authors to suggest that inaccurate minimization algorithms can be better than
good ones.
3 Standard Backpropagation
Although the tricks and analyses in this paper are primarily presented in the
context of \classical" multi-layer feed-forward neural networks, many of them
also apply to most other gradient-based learning methods.
The simplest form of multilayer learning machine trained with gradient-based
learning is simply a stack of modules, each of which implements a function Xn =
Fn(Wn; Xn 1), where Xn is a vector representing the output of the module, Wn
is the vector of tunable parameters in the module (a subset of W), and Xn 1 is
the module's input vector (as well as the previous module's output vector). The
input X0 to the rst module is the input pattern Zp. If the partial derivative of
Ep with respect to Xn is known, then the partial derivatives of Ep with respect
to Wn and Xn 1 can be computed using the backward recurrence
@Ep
@Wn
=
@F
@W
(Wn; Xn 1)
@Ep
@Xn
@Ep
@Xn 1
=
@F
@X
(Wn; Xn 1)
@Ep
@Xn
(1)
where @F
@W
(Wn; Xn 1) is the Jacobian of F with respect to W evaluated at the
point (Wn; Xn 1), and @F
@X
(Wn; Xn 1) is the Jacobian of F with respect to X.
The Jacobian of a vector function is a matrix containing the partial derivatives
of all the outputs with respect to all the inputs. When the above equations
are applied to the modules in reverse order, from layer N to layer 1, all the
partial derivatives of the cost function with respect to all the parameters can be
computed. The way of computing gradients is known as back-propagation.
Traditional multi-layer neural networks are a special case of the above system
where the modules are alternated layers of matrix multiplications (the weights)
and component-wise sigmoid functions (the units):
Yn = WnXn 1
(2)
Xn = F(Yn)
(3)
where Wn is a matrix whose number of columns is the dimension of Xn 1, and
number of rows is the dimension of Xn. F is a vector function that applies a
sigmoid function to each component of its input. Yn is the vector of weighted
sums, or total inputs, to layer n.
Applying the chain rule to the equation above, the classical backpropagation
equations are obtained:
@Ep
@yi
n
= f0(yi
n)
@Ep
@xi
n
(4)
@Ep
@w
ij
n
= x
j
n 1
@Ep
@yi
n
(5)
@Ep
@xk
n 1
=
X
i
wik
n
@Ep
@yi
n
:
(6)
The above equations can also be written in matrix form:
@Ep
@Yn
= F0(Yn)
@Ep
@Xn
(7)
@Ep
@Wn
= Xn 1
@Ep
@Yn
(8)
@Ep
@Xn 1
= WT
n
@Ep
@Yn
:
(9)
The simplest learning (minimization) procedure in such a setting is the gra-
dient descent algorithm where W is iteratively adjusted as follows:
W(t) = W(t 1)
@E
@W
:
(10)
In the simplest case, is a scalar constant. More sophisticated procedures use
variable . In other methods
takes the form of a diagonal matrix, or is an
estimate of the inverse Hessian matrix of the cost function (second derivative
matrix) such as in the Newton and Quasi-Newton methods described later in
the chapter. A proper choice of is important and will be discussed at length
later.
4 A Few Practical Tricks
Backpropagation can be very slow particularly for multilayered networks where
the cost surface is typically non-quadratic, non-convex, and high dimensional
with many local minima and/or at regions. There is no formula to guarantee
that (1) the network will converge to a good solution, (2) convergence is swift, or
(3) convergence even occurs at all. However, in this section we discuss a number
of tricks that can greatly improve the chances of nding a good solution while
also decreasing the convergence time often by orders of magnitude. More detailed
theoretical justi cations will be given in later sections.
4.1 Stochastic versus Batch learning.
At each iteration, equation (10) requires a complete pass through the entire
dataset in order to compute the average or true gradient. This is referred to as
batch learning since an entire \batch" of data must be considered before weights
are updated. Alternatively, one can use stochastic (online) learning where a
single example fZt; Dtg is chosen (e.g. randomly) from the training set at each
iteration t. An estimate of the true gradient is then computed based on the error
Et of that example, and then the weights are updated:
W(t + 1) = W(t)
@Et
@W
:
(11)
Because this estimate of the gradient is noisy, the weights may not move precisely
down the gradient at each iteration. As we shall see, this \noise" at each iteration
can be advantageous. Stochastic learning is generally the preferred method for
basic backpropagation for the following three reasons:
Advantages of Stochastic Learning
1. Stochastic learning is usually much faster than batch learning.
2. Stochastic learning also often results in better solutions.
3. Stochastic learning can be used for tracking changes.
Stochastic learning is most often much faster than batch learning particularly
on large redundant datasets. The reason for this is simple to show. Consider the
simple case where a training set of size 1000 is inadvertently composed of 10
identical copies of a set with 100 samples. Averaging the gradient over all 1000
patterns gives the exact same result as computing the gradient based on just
the rst 100. Thus, batch gradient descent is wasteful because it recomputes
the same quantity 10 times before one parameter update. On the other hand,
stochastic gradient will see a full epoch as 10 iterations through a 100-long
training set. In practice, examples rarely appear more than once in a dataset,
but there are usually clusters of patterns that are very similar. For example in
phoneme classi cation, all of the patterns for the phoneme / / will (hopefully)
contain much of the same information. It is this redundancy that can make batch
learning much slower than on-line.
Stochastic learning also often results in better solutions because of the noise
in the updates. Nonlinear networks usually have multiple local minima of di er-
ing depths. The goal of training is to locate one of these minima. Batch learning
will discover the minimum of whatever basin the weights are initially placed. In
stochastic learning, the noise present in the updates can result in the weights
jumping into the basin of another, possibly deeper, local minimum. This has
been demonstrated in certain simpli ed cases 15,30].
Stochastic learning is also useful when the function being modeled is chang-
ing over time, a quite common scenario in industrial applications where the
data distribution changes gradually over time (e.g. due to wear and tear of the
machines). If the learning machine does not detect and follow the change it is
impossible to learn the data properly and large generalization errors will result.
With batch learning, changes go undetected and we obtain rather bad results
since we are likely to average over several rules, whereas on-line learning { if
operated properly (see below in section 4.7) { will track the changes and yield
good approximation results.
Despite the advantages of stochastic learning, there are still reasons why one
might consider using batch learning:
Advantages of Batch Learning
1. Conditions of convergence are well understood.
2. Many acceleration techniques (e.g. conjugate gradient) only op-
erate in batch learning.
3. Theoretical analysis of the weight dynamics and convergence
rates are simpler.
These advantages stem from the same noise that make stochastic learning
advantageous. This noise, which is so critical for nding better local minima also
prevents full convergence to the minimum. Instead of converging to the exact
minimum, the convergence stalls out due to the weight uctuations. The size of
the uctuations depend on the degree of noise of the stochastic updates. The
variance of the uctuations around the local minimum is proportional to the
learning rate
28,27,6]. So in order to reduce the uctuations we can either
decrease (anneal) the learning rate or have an adaptive batch size. In theory 13,
30, 36, 35] it is shown that the optimal annealing schedule of the learning rate is
of the form
c
t
;
(12)
where t is the number of patterns presented and c is a constant. In practice, this
may be too fast
Another method to remove noise is to use \mini-batches", that is, start with
a small batch size and increase the size as training proceeds. M ller discusses
one method for doing this 25] and Orr 31] discusses this for linear problems.
However, deciding the rate at which to increase the batch size and which inputs
to include in the small batches is as di cult as determining the proper learning
rate. E ectively the size of the learning rate in stochastic learning corresponds
to the respective size of the mini batch.
Note also that the problem of removing the noise in the data may be less
critical than one thinks because of generalization. Overtraining may occur long
before the noise regime is even reached.
Another advantage of batch training is that one is able to use second order
methods to speed the learning process. Second order methods speed learning
by estimating not just the gradient but also the curvature of the cost surface.
Given the curvature, one can estimate the approximate location of the actual
minimum.
Despite the advantages of batch updates, stochastic learning is still often the
preferred method particularly when dealing with very large data sets because it
is simply much faster.
4.2 Shu ing the Examples
Networks learn the fastest from the most unexpected sample. Therefore, it is
advisable to choose a sample at each iteration that is the most unfamiliar to
the system. Note, this applies only to stochastic learning since the order of
input presentation is irrelevant for batch1. Of course, there is no simple way
to know which inputs are information rich, however, a very simple trick that
crudely implements this idea is to simply choose successive examples that are
from di erent classes since training examples belonging to the same class will
most likely contain similar information.
Another heuristic for judging how much new information a training example
contains is to examine the error between the network output and the target value
when this input is presented. A large error indicates that this input has not been
learned by the network and so contains a lot of new information. Therefore, it
makes sense to present this input more frequently. Of course, by \large" we mean
relative to all of the other training examples. As the network trains, these relative
errors will change and so should the frequency of presentation for a particular
input pattern. A method that modi es the probability of appearance of each
pattern is called an emphasizing scheme.
Choose Examples with Maximum Information Content
1. Shu e the training set so that successive training examples
never (rarely) belong to the same class.
2. Present input examples that produce a large error more fre-
quently than examples that produce a small error.
1 The order in which gradients are summed in batch may be a ected by roundo error
if there is a signi cant range of gradient values.
However, one must be careful when perturbing the normal frequencies of
input examples because this changes the relative importance that the network
places on di erent examples. This may or may not be desirable. For example, this
technique applied to data containing outliers can be disastrous because outliers
can produce large errors yet should not be presented frequently. On the other
hand, this technique can be particularly bene cial for boosting the performance
for infrequently occurring inputs, e.g. /z/ in phoneme recognition
4.3 Normalizing the Inputs
Convergence is usually faster if the average of each input variable over the train-
ing set is close to zero. To see this, consider the extreme case where all the inputs
are positive. Weights to a particular node in the rst weight layer are updated by
an amount proportional to x where is the (scalar) error at that node and x is
the input vector (see equations (5) and (10)). When all of the components of an
input vector are positive, all of the updates of weights that feed into a node will
be the same sign (i.e. sign()). As a result, these weights can only all decrease
or all increase together for a given input pattern. Thus, if a weight vector must
change direction it can only do so by zigzagging which is ine cient and thus
very slow.
In the above example, the inputs were all positive. However, in general, any
shift of the average input away from zero will bias the updates in a particular
direction and thus slow down learning. Therefore, it is good to shift the inputs
so that the average over the training set is close to zero. This heuristic should
be applied at all layers which means that we want the average of the outputs
of a node to be close to zero because these outputs are the inputs to the next
layer 19]. This problem can be addressed by coordinating how the inputs are
transformed with the choice of sigmoidal activation function. Here we discuss
the input transformation. The discussion of the sigmoid follows.
Convergence is faster not only if the inputs are shifted as described above
but also if they are scaled so that all have about the same covariance, Ci, where
Ci =
1
P
P
X
p=1
(z
p
i )2:
(13)
Here, P is the number of training examples, Ci is the covariance of the ith input
variable and z
p
i is the ith component of the pth training example. Scaling speeds
learning because it helps to balance out the rate at which the weights connected
to the input nodes learn. The value of the covariance should be matched with
that of the sigmoid used. For the sigmoid given below, a covariance of 1 is a
good choice.
The exception to scaling all covariances to the same value occurs when it is
known that some inputs are of less signi cance than others. In such a case, it
can be bene cial to scale the less signi cant inputs down so that they are \less
visible" to the learning process.
Transforming the Inputs
1. The average of each input variable over the training set should be close
to zero.
2. Scale input variables so that their covariances are about the same.
3. Input variables should be uncorrelated if possible.
The above two tricks of shifting and scaling the inputs are quite simple to
implement. Another trick that is quite e ective but more di cult to implement
is to decorrelate the inputs. Consider the simple network in Figure 2. If inputs
are uncorrelated then it is possible to solve for the value of w1 that minimizes
the error without any concern for w2, and vice versa. In other words, the two
variables are independent (the system of equations is diagonal). With correlated
inputs, one must solve for both simultaneously which is a much harder problem.
Principal component analysis (also known as the Karhunen-Loeve expansion)
can be used to remove linear correlations in inputs 10].
Inputs that are linearly dependent (the extreme case of correlation) may
also produce degeneracies which may slow learning. Consider the case where one
input is always twice the other input (z2 = 2z1). The network output is constant
along lines W2 = v (1=2)W1, where v is a constant. Thus, the gradient is zero
along these directions (see Figure 2). Moving along these lines has absolutely no
e ect on learning. We are trying to solve in 2-D what is e ectively only a 1-D
problem. Ideally we want to remove one of the inputs which will decrease the
size of the network.
Figure 3 shows the entire process of transforming inputs. The steps are (1)
shift inputs so the mean is zero, (2) decorrelate inputs, and (3) equalize covari-
ances.
z
1
z
2
y
ω
1
ω
2
ω
2
ω
1
Lines of
constant E
Fig. 2. Linearly dependent inputs.
Mean
Cancellation
KL-
Expansion
Covariance
Equalization
Fig. 3. Transformation of inputs.
4.4 The Sigmoid
Nonlinear activation functions are what give neural networks their nonlinear
capabilities. One of the most common forms of activation function is the sigmoid
which is a monotonically increasing function that asymptotes at some nite value
as 1 is approached. The most common examples are the standard logistic
function f(x)=1=(1 + e x) and hyperbolic tangent f(x) = tanh(x) shown in
Figure 4. Sigmoids that are symmetric about the origin (e.g. see Figure 4b) are
preferred for the same reason that inputs should be normalized, namely, because
they are more likely to produce outputs (which are inputs to the next layer)
that are on average close to zero. This is in contrast, say, to the logistic function
whose outputs are always positive and so must have a mean that is positive.
Sigmoids
1. Symmetric sigmoids such as hyperbolic tangent often converge faster
than the standard logistic function.
2. A recommended sigmoid 19] is: f(x)=1:7159 tanh 2
3
x . Since the
tanh function is sometimes computationally expensive, an approxima-
tion of it by a ratio of polynomials can be used instead.
3. Sometimes it is helpful to add a small linear term, e.g. f(x) = tanh(x)+
ax so as to avoid at spots.
-6
-4
-2
2
4
6
0.2
0.4
0.6
0.8
1
-3
-2
-1
1
2
3
-1.5
-1
-0.5
0.5
1
1.5
(a)
(b)
Fig. 4. (a) Not recommended: the standard logistic function, f(x)=1=(1 + e x). (b)
Hyperbolic tangent, f(x)=1:7159 tanh 2
3x .
The constants in the recommended sigmoid given above have been chosen so
that, when used with transformed inputs (see previous discussion), the variance
of the outputs will also be close to 1 because the e ective gain of the sigmoid
is roughly 1 over its useful range. In particular, this sigmoid has the properties
(a) f( 1) = 1, (b) the second derivative is a maximum at x = 1, and (c) the
e ective gain is close to 1.
One of the potential problems with using symmetric sigmoids is that the
error surface can be very at near the origin. For this reason it is good to avoid
initializing with very small weights. Because of the saturation of the sigmoids,
the error surface is also at far from the origin. Adding a small linear term to
the sigmoid can sometimes help avoid the at regions.
4.5 Choosing Target Values
In classi cation problems, target values are typically binary (e.g. f-1,+1g). Com-
mon wisdom might seem to suggest that the target values be set at the value of
the sigmoid's asymptotes. However, this has several drawbacks.
First, instabilities can result. The training process will try to drive the output
as close as possible to the target values, which can only be achieved asymptoti-
cally. As a result, the weights (output and even hidden) are driven to larger and
larger values where the sigmoid derivative is close to zero. The very large weights
increase the gradients, however, these gradients are then multiplied by an expo-
nentially small sigmoid derivative (except when a twisting term2 is added to the
sigmoid) producing a weight update close to zero. As a result, the weights may
become stuck.
Second, when the outputs saturate, the network gives no indication of con-
dence level. When an input pattern falls near a decision boundary the output
class is uncertain. Ideally this should be re ected in the network by an out-
put value that is in between the two possible target values, i.e. not near either
asymptote. However, large weights tend to force all outputs to the tails of the sig-
moid regardless of the uncertainty. Thus, the network may predict a wrong class
without giving any indication of its low con dence in the result. Large weights
that saturate the nodes make it impossible to di erentiate between typical and
nontypical examples.
A solution to these problems is to set the target values to be within the
range of the sigmoid, rather than at the asymptotic values. Care must be taken,
however, to insure that the node is not restricted to only the linear part of the
sigmoid. Setting the target values to the point of the maximum second derivative
on the sigmoid is the best way to take advantage of the nonlinearity without
saturating the sigmoid. This is another reason the sigmoid in Figure 4b is a
good choice. It has maximum second derivative at 1 which correspond to the
binary target values typical in classi cation problems.
Targets
Choose target values at the point of the maximum second derivative on the
sigmoid so as to avoid saturating the output units.
2 A twisting term is a small linear term added to the node output, e.g.
f(x) = tanh(x) + ax.
4.6 Initializing the weights
The starting values of the weights can have a signi cant e ect on the training
process. Weights should be chosen randomly but in such a way that the sig-
moid is primarily activated in its linear region. If weights are all very large then
the sigmoid will saturate resulting in small gradients that make learning slow.
If weights are very small then gradients will also be very small. Intermediate
weights that range over the sigmoid's linear region have the advantage that (1)
the gradients are large enough that learning can proceed and (2) the network
will learn the linear part of the mapping before the more di cult nonlinear part.
Achieving this requires coordination between the training set normalization,
the choice of sigmoid, and the choice of weight initialization. We start by requir-
ing that the distribution of the outputs of each node have a standard deviation
() of approximately 1. This is achieved at the input layer by normalizing the
training set as described earlier. To obtain a standard deviation close to 1 at
the output of the rst hidden layer we just need to use the above recommended
sigmoid together with the requirement that the input to the sigmoid also have a
standard deviation y = 1. Assuming the inputs, yi, to a unit are uncorrelated
with variance 1, the standard deviation of the units weighted sum will be
yi =
0
@X
j
w2
ij
1
A
1=2
:
(14)
Thus, to insure that the yi are approximately 1 the weights should be randomly
drawn from a distribution with mean zero and a standard deviation given by
w = m 1=2
(15)
where m is the number of inputs to the unit.
Initializing Weights
Assuming that:
1. the training set has been normalized, and
2. the sigmoid from Figure 4b has been used
then weights should be randomly drawn from a distribution (e.g. uniform)
with mean zero and standard deviation
w = m 1=2
(16)
where m is the fan-in (the number of connections feeding into the node).
4.7 Choosing Learning rates
There is at least one well-principled method (described in section 9.2) for esti-
mating the ideal learning rate . Many other schemes (most of them rather em-
pirical) have been proposed in the literature to automatically adjust the learning
rate. Most of those schemes decrease the learning rate when the weight vector
\oscillates", and increase it when the weight vector follows a relatively steady
direction. The main problem with these methods is that they are not appropriate
for stochastic gradient or on-line learning because the weight vector uctuates
all the time.
Beyond choosing a single global learning rate, it is clear that picking a dif-
ferent learning rate i for each weight can improve the convergence. A well-
principled way of doing this, based on computing second derivatives, is described
in section 9.1. The main philosophy is to make sure that all the weights in the
network converge roughly at the same speed.
Depending upon the curvature of the error surface, some weights may require
a small learning rate in order to avoid divergence, while others may require a large
learning rate in order to converge at a reasonable speed. Because of this, learning
rates in the lower layers should generally be larger than in the higher layers (see
Figure 21). This corrects for the fact that in most neural net architectures, the
second derivative of the cost function with respect to weights in the lower layers
is generally smaller than that of the higher layers. The rationale for the above
heuristics will be discussed in more detail in later sections along with suggestions
for how to choose the actual value of the learning rate for the di erent weights
(see section 9.1).
If shared weights are used such as in time-delay neural networks (TDNN)
42] or convolutional networks 20], the learning rate should be proportional to
the square root of the number of connections sharing that weight, because we
know that the gradients are a sum of more-or-less independent terms.
Equalize the Learning Speeds
{ give each weight its own learning rate
{ learning rates should be proportional to the square root of the
number of inputs to the unit
{ weights in lower layers should typically be larger than in the
higher layers
Other tricks for improving the convergence include:
Momentum Momentum
w(t + 1) =
@Et+1
@w
+
w(t);
can increase speed when the cost surface is highly nonspherical because it damps
the size of the steps along directions of high curvature thus yielding a larger
e ective learning rate along the directions of low curvature 43] ( denotes the
strength of the momentum term). It has been claimed that momentum generally
helps more in batch mode than in stochastic mode, but no systematic study of
this are known to the authors.
Adaptive learning rates Many authors, including Sompolinsky et al. 37],
Darken & Moody 9], Sutton 38], Murata et al. 28] have proposed rules for
automatically adapting the learning rates (see also 16]). These rules control the
speed of convergence by increasing or decreasing the learning rate based on the
error.
We assume the following facts for a learning rate adaptation algorithm: (1)
the smallest eigenvalue of the Hessian (see Eq.(27)) is su ciently smaller than the
second smallest eigenvalue and (2) therefore after a large number of iterations,
the parameter vector w(t) will approach the minimum from the direction of
the minimum eigenvector of the Hessian (see Eq.(27), Figure 5). Under these
conditions the evolution of the estimated parameter can be thought of as a one-
dimensional process and the minimum eigenvector v can be approximated (for
a large number of iterations: see Figure 5) by
v = h
@E
@w
i=kh
@E
@w
ik;
where k k denotes the L2 norm. Hence we can adopt a projection
= hvT @E
@w
i = kh
@E
@w
ik
to the approximated minimum Eigenvector v as a one dimensional measure of
the distance to the minimum. This distance can be used to control the learning
rate (for details see 28])
w(t + 1) = w(t + 1)
t
@Et
@w
;
(17)
r(t + 1) = (1
)r(t) +
@Et
@w
; (0 < < 1)
(18)
(t + 1) = (t) + (t) (kr(t + 1)k (t));
(19)
where controls the leak size of the average, ; are constants and r is used as
auxiliary variable to calculate the leaky average of the gradient @E
@w
.
Note that this set of rules is easy to compute and straightforward to im-
plement. We simply have to keep track of an additional vector in Eq.(18): the
averaged gradient r. The norm of this vector then controls the size of the learning
rate (see Eq.(19)). The algorithm follows the simple intuition: far away from the
minimum (large distance ) it proceeds in big steps and close to the minimum
it anneals the learning rate (for theoretical details see 28]).
4.8 Radial Basis Functions vs Sigmoid Units
Although most systems use nodes based on dot products and sigmoids, many
other types of units (or layers) can be used. A common alternative is the radial
basis function (RBF) network (see 7, 26, 5, 32]) In RBF networks, the dot prod-
uct of the weight and input vector is replaced with a Euclidean distance between
Fig. 5. Convergence of the ow. During the nal stage of learning the average ow is
approximately one dimensional towards the minimum w and it is a good approxima-
tion of the minimum eigenvalue direction of the Hessian.
the input and weight and the sigmoid is replaced by an exponential. The output
activity is computed, e.g. for one output, as
g(x) =
N
X
i=1
wi exp
1
22
i
kx
ik2 ;
where i (i) is the mean (standard deviation) of the i-th Gaussian. These units
can replace or coexist with the standard units and they are usually trained by
combination of gradient descent (for output units) and unsupervised clustering
for determining the means and widths of the RBF units.
Unlike sigmoidal units which can cover the entire space, a single RBF unit
covers only a small local region of the input space. This can be an advantage
because learning can be faster. RBF units may also form a better set of basis
functions to model the input space than sigmoid units, although this is highly
problem dependent. On the negative side, the locality property of RBFs may be
a disadvantage particularly in high dimensional spaces because may units are
needed to cover the spaces. RBFs are more appropriate in (low dimensional)
upper layers and sigmoids in (high dimensional) lower layers.
5 Convergence of Gradient Descent
5.1 A Little Theory
In this section we examine some of the theory behind the tricks presented earlier.
We begin in one dimension where the update equation for gradient descent can
be written as
W(t + 1) = W(t)
dE(W)
dW
:
(20)
We would like to know how the value of a ects convergence and the learning
speed. Figure 6 illustrates the learning behavior for several di erent sizes of
when the weight W starts out in the vicinity of a local minimum. In one dimen-
sion, it is easy to de ne the optimal learning rate, opt, as being the learning
rate that will move the weight to the minimum, Wmin, in precisely one step
(see Figure 6(i)b). If is smaller than opt then the stepsize will be smaller and
convergence will take multiple timesteps. If is between opt and 2opt then the
weight will oscillate around Wmin but will eventually converge (Figure 6(i)c). If
is more than twice the size of opt (Figure 6(i)d) then the stepsize is so large
that the weight ends up farther from Wmin than before. Divergence results.
E (ω)
ω
η > ηopt
ωmin
E (ω)
ω
η < ηopt
ωmin
a)
c)
E (ω)
ω
η = ηopt
ωmin
E (ω)
ω
η > 2 ηopt
ωmin
b)
d)
E (ω)
ω
η = ηopt
ω
∆ω
dE/dω
ωc
ωmin
ωc
ωmin
d E(ωc )
dω
(i)
(ii)
Fig. 6. Gradient descent for di erent learning rates.
What is the optimal value of the learning rate opt? Let us rst consider
the case in 1-dimension. Assuming that E can be approximated by a quadratic
function, opt can be derived by rst expanding E in a Taylor series about the
current weight, Wc:
E(W) = E(Wc)+(W Wc)
dE(Wc)
dW
+
1
2
(W Wc)2 d2E(Wc)
dW2
+ :::;
(21)
where we use the shorthand
dE(Wc)
dW
dE
dW W=Wc
. If E is quadratic the second
order derivative is constant and the higher order terms vanish. Di erentiating
both sides with respect to W then gives
dE(W)
dW
=
dE(Wc)
dW
+ (W Wc)
d2E(Wc)
dW2
:
(22)
Setting W = Wmin and noting that dE(Wmin)=dW = 0, we are left after rear-
ranging with
Wmin = Wc
d2E(Wc)
dW2
1
dE(Wc)
dW
:
(23)
Comparing this with the update equation (20), we nd that we can reach a
minimum in one step if
opt =
d2E(Wc)
dW2
1
:
(24)
Perhaps an easier way to obtain this same result is illustrated in Figure 6(ii).
The bottom graph plots the gradient of E as a function of W. Since E is
quadratic, the gradient is simply a straight line with value zero at the mini-
mum and
@E(Wc)
@W
at the current weight Wc. @2E=@2W is simply the slope of this
line and is computed using the standard slope formula
@2E=@2W =
@E(Wc)=@W 0
Wc
Wmin
:
(25)
Solving for Wmin then gives equation (23).
While the learning rate that gives fastest convergence is opt, the largest
learning rate that can be used without causing divergence is (also see Fig-
ure 6(i)d)
max = 2opt:
(26)
If E is not exactly quadratic then the higher order terms in equation (21) are
not precisely zero and (23) is only an approximation. In such a case, it may
take multiple iterations to locate the minimum even when using opt, however,
convergence can still be quite fast.
In multiple dimensions, determining opt is a bit more di cult because the
right side of (24) is a matrix H 1 where H is called the Hessian whose compo-
nents are given by
Hij
@2E
@Wi@Wj
(27)
with 1 i; j N, and N equal to the total number of weights.
H is a measure of the curvature of E. In two dimensions, the lines of constant
E for a quadratic cost are oval in shape as shown in Figure 7. The eigenvectors of
H point in the directions of the major and minor axes. The eigenvalues measure
the steepness of E along the corresponding eigendirection.
Example. In the least mean square (LMS) algorithm, we have a single layer
linear network with error function
E(W) =
1
2P
P
X
p=1
jdp
X
i
wix
p
i j2
(28)
Eigenvectors
of H
ω
min,1
ω
min,2
ν
1
ν
2
ω
1
ω
2
P
E
(a)
(b)
Fig. 7. Lines of constant E.
where P is the number of training vectors. The Hessian in this case turns out
the be the same as the covariance matrix of the inputs,
H =
1
P
X
p
xpxpT :
(29)
Thus, each eigenvalue of H is also a measure of the covariance or spread of the
inputs along the corresponding eigendirection as shown in Figure 8.
x
1
x
2
Fig. 8. For the LMS algorithm, the eigenvectors and eigenvalues of H measure the
spread of the inputs in input space.
Using a scalar learning rate is problematic in multiple dimensions. We want
to be large so that convergence is fast along the shallow directions of E (small
eigenvalues of H), however, if is too large the weights will diverge along the
steep directions (large eigenvalues of H). To see this more speci cally, let us
again expand E, but this time about a minimum
E(W) E(Wmin) +
1
2
(W Wmin)T H(Wmin)(W Wmin):
(30)
Di erentiating (30) and using the result in the update equation (20) gives
W(t + 1) = W(t)
@E(t)
@W
(31)
= W(t)
H(Wmin)(W(t) Wmin):
(32)
Subtracting Wmin from both sides gives
(W(t + 1) Wmin)=(I
H(Wmin))(W(t) Wmin):
(33)
If the prefactor (I
H(Wmin)) is a matrix transformation that always shrinks
a vector (i.e. its eigenvalues all have magnitude less than 1) then the update
equation will converge.
How does this help with choosing the learning rates? Ideally we want dif-
ferent learning rates along the di erent eigendirections. This is simple if the
eigendirections are lined up with the coordinate axes of the weights. In such a
case, the weights are uncoupled and we can assign each weight its own learning
rate based on the corresponding eigenvalue. However, if the weights are coupled
then we must rst rotate H such that H is diagonal, i.e. the coordinate axes line
up with the eigendirections (see Figure 7b). This is the purpose of diagonalizing
the Hessian discussed earlier.
Let
be the rotation matrix such that
= HT
(34)
where is diagonal and T
= I. The cost function then can be written as
E(W) E(Wmin) +
1
2
(W Wmin)T T
H(Wmin)
T
(W Wmin)]:
(35)
Making a change of coordinates to
= (W
Wmin) simpli es the above
equation to
E() E(0) +
1
2
T
(36)
and the transformed update equation becomes
(t +1)=(I
)(t):
(37)
Note that I
is diagonal with diagonal components 1
i. This equation
will converge if j1
ij < 1, i.e. < 2
i
for all i. If constrained to have a single
scalar learning rate for all weights then we must require
<
2
max
(38)
in order to avoid divergence, where max is the largest eigenvalue of H. For
fastest convergence we have
opt =
1
max
:
(39)
If min is a lot smaller than max then convergence will be very slow along the
min direction. In fact, convergence time is proportional to the condition number
max=min so that it is desirable to have as small an eigenvalue spread as
possible.
However, since we have rotated H to be aligned with the coordinate axes,
(37) consists actually of N independent 1-dimensional equations. Therefore, we
can choose a learning rate for each weight independent of the others. We see
that the optimal rate for the ith weight i is opt;i = 1
i
.
5.2 Examples
Linear Network Figure 10 displays a set of 100 examples drawn from two
Gaussian distributed classes centered at (-0.4,-0.8) and (0.4,0.8). The eigenvalues
of the covariance matrix are 0.84 and 0.036. We train a single layer linear network
with 2 inputs, 1 output, 2 weights, and 1 bias (see Figure (9)) using the LMS
algorithm in batch mode. Figure 11 displays the weight trajectory and error
during learning when using a learning rates of = 1:5 and 2.5. Note that the
learning rate (see Eq. 38) max = 2=max = 2=:84 = 2:38 will cause divergences
as is evident for = 2:5.
ω
1
ω
0
ω
2
χ
0
y
χ
1
Fig. 9. Simple linear network.
−1.4−1.2 −1 −0.8−0.6−0.4−0.2 0
0.2 0.4 0.6 0.8
1
1.2 1.4
−1.4
−1.2
−1
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
1.4
Fig. 10. Two classes drawn from gaussian distri-
butions centered at (-0.4,-0.8) and (0.4,0.8).
Figure 12 shows the same example using stochastic instead of batch mode
learning. Here, a learning rate of = 0:2 is used. One can see that the trajectory
is much noisier than in batch mode since only an estimate of the gradient is
used at each iteration. The cost is plotted as a function of epoch. An epoch
here is simply de ned as 100 input presentations which, for stochastic learning,
Weight space
−1 −0.8−0.6−0.4−0.2 0
0.2 0.4 0.6 0.8 1
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
Log MSE (dB)
0
1
2
3
4
5
6
7
8
9
10
−20
−15
−10
−5
0
epochs
Weight space
−1 −0.8−0.6−0.4−0.2 0
0.2 0.4 0.6 0.8 1
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
Log MSE (dB)
0
1
2
3
4
5
6
7
8
9
10
−20
−15
−10
−5
0
epochs
(a)
(b)
Fig. 11. Weight trajectory and error curve during learning for (a) = 1:5 and (b)
= 2:5.
corresponds to 100 weight updates. In batch, an epoch corresponds to one weight
update.
Weight space
−1 −0.8−0.6−0.4−0.2 0
0.2 0.4 0.6 0.8 1
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
Log MSE (dB)
0
1
2
3
4
5
6
7
8
9
10
−20
−15
−10
−5
0
epochs
batch
Fig. 12. Weight trajectory and error
curve during stochastic learning for =
0:2.
Log MSE (dB)
0
1
2
3
4
5
6
7
8
9
10
−20
−15
−10
−5
0
Weight space
−2
−1.6 −1.2 −0.8 −0.4
0
0.4
0.8
1.2
1.6
2
2.4
−1.4
−1.2
−1
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
epochs
Fig. 13. Weight trajectories and errors
for 1-1-1 network trained using stochas-
tic learning.
Multilayer Network Figure 14 shows the architecture for a very simple mul-
tilayer network. It has 1 input, 1 hidden, and 1 output node. There are 2 weights
and 2 biases. The activation function is f(x)=1:71 tanh((2=3)x). The training
set contains 10 examples from each of 2 classes. Both classes are Gaussian dis-
tributed with standard deviation 0.4. Class 1 has a mean of -1 and class 2 has a
mean of +1. Target values are -1 for class 1 and +1 for class 2. Figure 13 shows
the stochastic trajectory for the example.
ω
1
ω
0
ω
2
ω
3
χ
y
Fig. 14. The minimal multilayer network.
5.3 Input Transformations and Error Surface Transformations
Revisited
We can use the results of the previous section to justify several of the tricks
discussed earlier.
Subtract the means from the input variables
The reason for the above trick is that a nonzero mean in the input variables
creates a very large eigenvalue. This means the condition number will be large,
i.e. the cost surface will be steep in some directions and shallow in others so that
convergence will be very slow. The solution is to simply preprocess the inputs
by subtracting their means.
For a single linear neuron, the eigenvectors of the Hessian (with means sub-
tracted) point along the principal axes of the cloud of training vectors (recall
Figure 8). Inputs that have a large variation in spread along di erent directions
of the input space will have a large condition number and slow learning. And so
we recommend:
Normalize the variances of the input variables.
If the input variables are correlated, this will not make the error surface
spherical, but it will possibly reduce its eccentricity.
Correlated input variables usually cause the eigenvectors of H to be rotated
away from the coordinate axes (Figure 7a versus 7b) thus weight updates are not
decoupled. Decoupled weights make the \one learning rate per weight" method
optimal, thus, we have the following trick:
Decorrelate the input variables.
Now suppose that the input variables of a neuron have been decorrelated,
the Hessian for this neuron is then diagonal and its eigenvalues point along the
coordinate axes. In such a case the gradient is not the best descent direction
as can be seen in Fig 7b. At the point P, an arrow shows that gradient does
not point towards the minimum. However, if we instead assign each weight its
own learning rate (equal the inverse of the corresponding eigenvalue) then the
descent direction will be in the direction of the other arrow that points directly
towards the minimum:
Use a separate learning rate for each weight.
6 Classical second order optimization methods
In the following we will brie y introduce the Newton, conjugate gradient, Gauss-
Newton, Levenberg Marquardt and the Quasi-Newton (BFGS) method (see also
11, 34, 3, 5]).
6.1 Newton Algorithm
To get an understanding of the Newton method let us recapitulate the results
from section 5.1. Assuming a quadratic loss function E (see Eq.(21)) as depicted
in Figure 6(ii), we can compute the weight update along the lines of Eq.(21)-(23)
w =
@2E
@w2
1
@E
@w
= H(w) 1 @E
@w
;
(40)
where must to be chosen in the range 0 < < 1 since E is in practice not
perfectly quadratic. In this equation information about the Hessian H is taken
into account. If the error function was quadratic one step would be su cient to
converge.
Usually the energy surface around the minimum is rather ellipsoid, or in
the extreme like a taco shell, depending on the conditioning of the Hessian. A
whitening transform, well known from signal processing literature 29] can change
this ellipsoid shape to a spherical shape through u =
1=2w (see Figure 15 and
Eq.(34)). So the inverse Hessian in Eq.(40) basically spheres out the error surface
locally. The following two approaches can be shown to be equivalent: (a) use the
Newton algorithm in an untransformed weight space and (b) do usual gradient
descent in a whitened coordinate system (see Figure 15) 19].
Summarizing, the Newton algorithm converges in one step if the error func-
tion is quadratic and (unlike gradient descent) it is invariant with respect to
linear transformations of the input vectors. This means that the convergence
time is not a ected by shifts, scaling and rotation of input vectors. However
one of the main drawbacks is that an N
N Hessian matrix must be stored
and inverted, which takes O(N3) per iterations and is therefore impractical for
more than a few variables. Since the error function is in general non-quadratic,
there is no guarantee of convergence. If the Hessian is not positive de nite (if
it has some zero or even negative Eigenvalues where the error surface is at or
some directions are curved downward), then the Newton algorithm will diverge,
so the Hessian must be positive de nite. Of course the Hessian matrix of multi-
layer networks is in general not positive de nite everywhere. For these reasons
the Newton algorithm in its original form is not usable for general neural net-
work learning. However it gives good insights for developing more sophisticated
algorithms, as discussed in the following.
ω
-½
Λ Θ′
-½
ΘΛ
U
Network
input
output
ω
Network
ω
input
output
Θ
-½
Λ
U
Newton Algorithm here ......
....is like Gradient Descent
there
Fig. 15. Sketch of the whitening properties of the Newton algorithm.
6.2 Conjugate Gradient
There are several important properties in conjugate gradient optimization: (1)
it is a O(N) method, (2) it doesn't use the Hessian explicitly, (3) it attempts
to nd descent directions that try to minimally spoil the result achieved in the
previous iterations, (4) it uses a line search, and most importantly, (5) it works
only for batch learning.
The third property is shown in Figure 16. Assume we pick a descent direction,
e.g. the gradient, then we minimize along a line in this direction (line search).
Subsequently we should try to nd a direction along which the gradient does
not change its direction, but merely its length (conjugate direction), because
moving along this direction will not spoil the result of the previous iteration.
The evolution of the descent directions k at iteration k is given as
k = rE(wk) + k k 1;
(41)
where the choice of k can be done either according to Fletcher and Reeves 34]
first descent
direction
gradients
conjugate
direction
Fig. 16. Sketch of conjugate gradient directions in a 2D error surface.
k =
rE(wk)T rE(wk)
rE(wk 1)T rE(wk 1)
(42)
or Polak and Ribiere
k =
(rE(wk) rE(wk 1))T rE(wk)
rE(wk 1)T rE(wk 1)
:
(43)
Two directions k and k 1 are de ned as conjugate if
T
k Hk 1 = 0;
i.e. conjugate directions are orthogonal directions in the space of an identity
Hessian matrix (see Figure 17). Very important for convergence in both choices is
ρ
κ−1
ρ
κ
κ
ω
Fig. 17. Sketch of conjugate gradient directions in a 2D error surface.
a good line search procedure. For a perfectly quadratic function with N variables
a convergence within N steps can be proved. For non-quadratic functions Polak
and Ribiere's choice seems more robust. Conjugate gradient (41) can also be
viewed as a smart choice for choosing the momentum term known in neural
network training. It has been applied with large success in multi-layer network
training on problems that are moderate sized with rather low redundancy in the
data. Typical applications range from function approximation, robotic control
39], time-series prediction and other real valued problems where high accuracy
is wanted. Clearly on large and redundant (classi cation) problems stochastic
backpropagation is faster. Although attempts have been made to de ne mini-
batches 25], the main disadvantage of conjugate gradient methods remains that
it is a batch method (partly due to the precision requirements in line search
procedure).
6.3 Quasi-Newton (BFGS)
The Quasi-Newton (BFGS) method (1) iteratively computes an estimate of the
inverse Hessian, (2) is an O(N2) algorithm, (3) requires line search and (4) it
works only for batch learning.
The positive de nite estimate of the inverse Hessian is done directly without
requiring matrix inversion and by only using gradient information. Algorithmi-
cally this can be described as follows: (1) rst a positive de nite matrix M is
chosen, e.g. M = I, (2) then the search direction is set to
(t) = M(t)rE(w(t));
(3) a line search is performed along , which gives the update for the parameters
at time t
w(t) = w(t 1)
(t)(t):
Finally (4) the estimate of the inverse Hessian is updated. Compared to the
Newton algorithm the Quasi-Newton approach only needs gradient information.
The most successful Quasi-Newton algorithm is the Broyden-Fletcher-Goldfarb-
Shanno (BFGS) method. The update rule for the estimate of the inverse Hessian
is
M(t) = M(t 1) 1 +
T M
T
T
T
T M + M T
T
;
(44)
where some abbreviations have been used for the following N 1 vectors
= rE(w(t)) rE(w(t 1))
= w(t) w(t 1):
(45)
Although, as mentioned above, the complexity is only O(N2), we are still re-
quired to store a N
N matrix, so the algorithm is only practical for small
networks with non-redundant training sets. Recently some variants exist that
aim to reduce storage requirements (see e.g. 3]).
6.4 Gauss-Newton and Levenberg Marquardt
Gauss-Newton and Levenberg Marquardt algorithm (1) use the square Jacobi
approximation, (2) are mainly designed for batch learning, (3) have a complexity
of O(N3) and (4) most important, they work only for mean squared error loss
functions. The Gauss-Newton algorithm is like the Newton algorithm, however
the Hessian is approximated by the square of the Jacobian (see also section 7.2
for a further discussion)
w =
X
p
@f(w; xp)
@w
T
@f(w; xp)
@w
! 1
rE(w):
(46)
The Levenberg Marquardt method is like the Gauss-Newton above, but it has a
regularization parameter that prevents it from blowing up, if some eigenvalues
are small
w =
X
p
@f(w; xp)
@w
T
@f(w; xp)
@w
+ I
! 1
rE(w);
(47)
where I denotes the unity matrix. The Gauss Newton method is valid for
quadratic cost functions however a similar procedure also works with Kullback-
Leibler cost and is called Natural Gradient (see e.g. 1, 44, 2]).
7 Tricks to compute the Hessian information in
multilayer networks
We will now discuss several techniques aimed at computing full or partial Hessian
information by (a) nite di erence method, (b) square Jacobian approximation
(for Gauss-Newton and Levenberg-Marquardt algorithm), (c) computation of
the diagonal of the Hessian and (d) by obtaining a product of the Hessian and
a vector without computing the Hessian. Other semi-analytical techniques that
allow the computation of the full Hessian are omitted because they are rather
complicated and also require many forward/backward propagation steps 5, 8].
7.1 Finite Di erence
We can write the k-th line of the Hessian
H(k) =
@(rE(w))
@wk
rE(w + k) rE(w)
;
where k = (0;0;0;:::;1;:::;0) is a vector of zeros and only one 1 at the k-th
position. This can be implemented with a simple recipe: (1) compute the total
gradient by multiple forward and backward propagation steps. (2) Add to the
k-th parameter and compute again the gradient, and nally (3) subtract both
results and divide by . Due to numerical errors in this computation scheme the
resulting Hessian might not be perfectly symmetric. In this case it should be
symmetrized as described below.
7.2 Square Jacobian approximation for the Gauss-Newton and
Levenberg-Marquardt algorithms
Assuming a mean squared cost function
E(w) =
1
2
X
p
(dp
f(w; xp))T (dp
f(w; xp))
(48)
then the gradient is
@E(w)
@w
=
X
p
(dp
f(w; xp))T @f(w; xp)
@w
(49)
and the Hessian follows as
H(w) =
X
p
@f(w; xp)
@w
T
@f(w; xp)
@w
+
X
p
(dp
f(w; xp))T @2f(w; xp)
@w@w
:
(50)
A simplifying approximation of the Hessian is the square of the Jacobian which
is a positive semi-de nite matrix of dimension: N O
H(w)
X
p
@f(w; xp)
@w
T
@f(w; xp)
@w
;
(51)
where the second term from Eq.(50) was dropped. This is equivalent to assuming
that the network is a linear function of the parameters w. Again this is readily
implemented for the k-th column of the Jacobian: for all training patterns, (1)
we forward propagate, then (2) set the activity of the output units to 0 and only
the k-th output to 1, (3) a backpropagation step is taken and the gradient is
accumulated.
7.3 Backpropagating second derivatives
Let us consider a multi-layer system with some functional blocks with Ni inputs,
No outputs and N parameters of the form O = F(W; X). Now assume we knew
@2E=@O2, which is a No
No matrix. Then it is straight forward to compute
this matrix
@2E
@W2
=
@O
@W
T @2E
@O2
@O
@W
+
@E
@O
@2O
@W2
:
(52)
We can drop the second term in Eq.(52) and the resulting estimate of the Hessian
is positive semi-de nite. A further reduction is achieved, if we ignore all but the
diagonal terms of @2E
@O2 :
@2E
@w2
i
=
X
k
@2E
@o2
k
@ok
@wi
2
:
(53)
A similar derivation can be done to obtain the Ni times Ni matrix @2E=@x2.
7.4 Backpropagating the diagonal Hessian in neural nets
Backpropagation procedures for computing the diagonal Hessian are well known
18,4,19]. It is assumed that each layer in the network has the functional form
oi = f(yi) = f(
P
j wijxj) (see Figure 18 for the sigmoidal network). Using the
Gauss-Newton approximation (dropping the term that contain f00(y)) we obtain:
@2E
@y2
k
=
@2E
@o2
k
(f0(yk))
2
;
(54)
@2E
@w2
ki
=
@2E
@y2
k
x2
i
(55)
and
@2E
@x2
i
X
k
@2E
@y2
k
w2
ki:
(56)
With f being a Gaussian nonlinearity as shown in Figure 18 for the RBF net-
works we obtain
@2E
@w2
ki
=
@2E
@y2
k
(xi
wki)2
(57)
and
@2E
@x2
i
=
X
k
@2E
@y2
k
(xi
wki)2:
(58)
The cost of computing the diagonal second derivatives by running these equa-
tions from the last layer to the rst one is essentially the same as the regular
backpropation pass used for the gradient, except that the square of the weights
are used in the weighted sums. This technique is applied in the \optimal brain
damage" pruning procedure (see 21]).
||ω−
x ||2
y
x
f( )
ω
x
x
ω
y
z
1/2
Fig. 18. Backpropagating the diagonal Hessian: sigmoids (left) and RBFs (right).
7.5 Computing the product of the Hessian and a vector
In many methods that make use of the Hessian, the Hessian is used exclusively in
products with a vector. Interestingly, there is a way of computing such products
without going through the trouble of computing the Hessian itself. The nite
di erence method can ful ll this task for an arbitrary vector
H
1 @E
@w
(w +
)
@E
@w
(w) ;
(59)
using only two gradient computations (at point w and w +
respectively),
which can be readily computed with backprop ( is a small constant).
This method can be applied to compute the principal eigenvector and eigen-
value of H by the power method. By iterating and setting
(t + 1) =
H(t)
k (t)k
;
(60)
the vector (t) will converge to the largest eigenvector of H and k (t)k to the
corresponding eigenvalue 23,14,10]. See also 33] for an even more accurate
method that (1) does not use nite di erences and (2) has similar complexity.
8 Analysis of the Hessian in multi-layer networks
It is interesting to understand how some of the tricks shown previously in uence
on the Hessian, i.e. how does the Hessian change with architecture and details of
the implementation. Typically, the eigenvalue distribution of the Hessian looks
like the one sketched in Figure 20: a few small eigenvalues, many medium ones
and few very large ones. We will now argue that the large eigenvalues will cause
the trouble in the training process because 23,22]
{ non-zero mean inputs or neuron states 22]
{ wide variations of the second derivatives from layer to layer
{ correlation between state variables.
To exemplify this, we show the eigenvalue distribution of a network trained
on OCR data in Figure 20. Clearly, there is a wide spread of eigenvalues (see
Figure 19) and we observe that the ratio between e.g. the rst and the eleventh
eigenvalue is about 8. The long tail of the eigenvalue distribution (see Figure 20)
is rather painful because the ratio between the largest and smallest eigenvalue
gives the conditioning of the learning problem. A large ratio corresponds to a
big di erence in the axis of the ellipsoidal shaped error function: the larger the
ratio, the more we nd a taco-shell shaped minima, which are extremely steep
towards the small axis and very at along the long axis.
Another general characteristic of the Hessian in multi-layer networks is the
spread between layers. In Figure 21 we roughly sketch how the shape of the
Hessian varies from being rather at in the rst layer to being quite steep in
0
100 200 300 400 500 600 700 800
−6
−5.5
−5
−4.5
−4
−3.5
−3
−2.5
−2
−1.5
−1
−0.5
0
Eigenvalue order
10
Log Eigenvalue
the ratio between the 1st and
the 11th eigenvalues is 8
Fig. 19. Eigenvalue spectrum in a 4 layer shared weights network (256 128 64 10)
trained on 320 handwritten digits.
the last layer. This a ects the learning speed and can provide an ingredient
to explain the slow learning in lower layers and the fast (sometime oscillating)
learning in the last layer. A trick to compensate this di erent scale of learning is
to use the inverse diagonal Hessian to control the learning rate (see also section
6.
9 Applying Second Order Methods to Multilayer
Networks
Before we concentrate in this section on how to tailor second order techniques
for training large networks, let us rst repeat some rather pessimistic facts about
applying classical second order methods. Techniques using full Hessian informa-
tion (Gauss -Newton, Levenberg-Marquardt and BFGS) can only apply to very
small networks trained in batch mode, however those small networks are not
the ones that need speeding up the most. Most second order methods (conju-
gate gradient, BFGS, ...) require a line-search and can therefore not be used
in the stochastic mode. Many of the tricks discussed previously apply only to
batch learning. From our experience we know that a carefully tuned stochastic
gradient descent is hard to beat on large classi cation problems. For smaller
problems that require accurate real-valued outputs like in function approxima-
tion or control problems, we see that conjugate gradient (with Polak-Ribiere
0
2
4
6
8
10
12
14
16
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Eigenvalue magnitude
Number of Eigenvalues
Big killers
Fig. 20. Eigenvalue spectrum in a 4 layer shared weights network (256 128 64 10)
trained on 320 handwritten digits.
Fig. 21. Multilayered architecture: the second derivative is often smaller in lower layers.
Eq.(43)) o ers the best combination of speed, reliability and simplicity. Several
attempts using \mini batches" in applying conjugate gradient to large and re-
dundant problems have been made recently 17,25,31]. A variant of conjugate
gradient optimization (called scaled CG) seems interesting: here the line search
procedure is replaced by a 1D Levenberg Marquardt type algorithm 24].
9.1 A stochastic diagonal Levenberg Marquardt method
To obtain a stochastic version of the Levenberg Marquardt algorithm the idea
is to compute the diagonal Hessian through a running estimate of the second
derivative with respect to each parameter. The instantaneous second derivative
can be obtained via backpropagation as shown in the formulas of section 7. As
soon as we have those running estimates we can use them to compute individual
learning rates for each parameter
ki =
h @2E
@w2
ki
i+
;
(61)
where denotes the global learning rate, and h @2E
@w2
ki
i is a running estimate of the
diagonal second derivative with respect to wki. is a parameter to prevent ki
from blowing up in case the second derivative is small, i.e. when the optimization
moves in at parts of the error function. The running estimate is computed as
h
@2E
@w2
ki
inew = (1
)h
@2E
@w2
ki
iold +
@2Ep
@w2
ki
;
(62)
where
is a small constant that determines the amount of memory that is
being used. The second derivatives can be computed prior to training over e.g.
a subset of the training set. Since they change only very slowly they only need
to be reestimated every few epochs. Note that the additional cost over regular
backpropagation is negligible and convergence is { as a rule of thumb { about
three times faster than a carefully tuned stochastic gradient algorithm.
In Figure 22 and 23 we see the convergence of the stochastic diagonal Leven-
berg Marquardt method (61) for a toy example with two di erent sets of learn-
ing rates. Obviously the experiment shown Figure 22 contains fewer uctuations
than in Figure 23 due to smaller learning rates.
9.2 Computing the principal Eigenvalue/vector of the Hessian
In the following we give three tricks for computing the principal eigenvalue/Vec-
tor of the Hessian without having to compute the Hessian itself. Remember
that in section 4.7 we also introduced a method to approximate the smallest
eigenvector of the Hessian (without having to compute the Hessian) through
averaging (see also 28]).
Weight space
−1
−0.8 −0.6 −0.4 −0.2
0
0.2
0.4
0.6
0.8
1
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
Log MSE (dB)
0
1
2
3
4
5
6
7
8
9
10
−20
−15
−10
−5
0
Hessian
largest
eigenvalue:
λ = 0.84
max
η = 2.38
max
epochs
Learning
rates:
η0 = 0.12
η1 = 0.03
η2 = 0.02
Maximum
admissible
Learning
rate (batch):
Fig. 22. Stochastic diagonal Levenberg-Marquardt algorithm. Data set from 2 Gaus-
sians with 100 examples. The network has one linear unit, 2 inputs and 1 output, i.e.
three parameters (2 weights, 1 bias).
Weight space
−1
−0.8 −0.6 −0.4 −0.2
0
0.2
0.4
0.6
0.8
1
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
Log MSE (dB)
0
1
2
3
4
5
6
7
8
9
10
−20
−15
−10
−5
0
Hessian
largest
eigenvalue:
λ = 0.84
max
η = 2.38
max
epochs
Learning
rates:
Maximum
admissible
Learning
rate (batch):
η0 = 0.76
η1 = 0.18
η2 = 0.12
Fig. 23. Stochastic diagonal Levenberg-Marquardt algorithm. Data set from 2 Gaus-
sians with 100 examples. The network has one linear unit, 2 inputs and 1 output, i.e.
three parameters (2 weights, 1 bias).
Power Method We repeat the result of our discussion in section 7.5: starting
from a random initial vector , the iteration
new = H
old
k oldk
;
will eventually converge to the principal eigenvector (or a vector in the principal
eigenspace) and k oldk will converge to the corresponding eigenvalue 14,10].
Taylor Expansion Another method makes use of the fact that small pertur-
bations of the gradient also lead to the principal eigenvector of H
new =
1 @E
@w
(w +
old
k oldk
)
@E
@w
(w) ;
(63)
where is a small constant. One iteration of this procedure requires two forward
and two backward propagation steps for each pattern in the training set.
Online Computation of
The following rule makes use of the running av-
erage to obtain the largest eigenvalue of the average Hessian very fast
new = (1
) +
1 @Ep
@w
(w +
old
k oldk
)
@E
@w
(w) :
(64)
To summarize, the eigenvalue/vector computations:
1. a random vector is chosen for initialization of ,
2. an input pattern is presented with desired output, a forward and backward
propagation, step is performed and the gradients G(w) are stored,
3.
old
k oldk
is added to the current weight vector w,
4. a forward and backward propagation step is performed with the perturbed
weight vector and the gradients G(w0) are stored,
5. the di erence 1=(G(w0) G(w)) is computed and the running average of
the eigenvector is updated,
6. we loop from (2)-(6) until a reasonably stable result is obtained for ,
7. the optimal learning rate is then given as
opt =
1
k k
:
In Figure 24 we see the evolution of the eigenvalue as a function of the number of
pattern presentations for a neural network in a handwritten character recognition
task. In practice we adapt the leak size of the running average in order to get
fewer uctuations (as also indicated on the gure). In the gure we see that
after fewer than 100 pattern presentations the correct order of magnitude for
the eigenvalue, i.e the learning rate is reached. From the experiments we also
observe that the uctuations of the average Hessian over training are small.
0
50
100
150
200
250
300
350
400
0
10
20
30
40
50
60
70
80
Number of pattern presentations
eigenvalue
γ=0.1 γ=0.03
γ=0.01
γ=0.003
Fig. 24. Evolution of the eigenvalue as a function of the number of pattern presen-
tations for a shared weight network with 5 layers, 64638 connections and 1278 free
parameters. The training set consists of 1000 handwritten digits.
In Figure 25 and 26 we start with the same initial conditions, and perform
a xed number of epochs with learning rates computed by multiplying the pre-
dicted learning rate by a prede ned constant. Choosing constant 1 (i.e. using the
predicted optimal rate) always gives residual errors which are very close to the
error achieved by the best choice of the constant. In other words, the \predicted
optimal rate" is optimal enough.
0 0.250.50.75 1 1.251.51.75 2 2.252.52.75 3 3.253.53.75 4
0
0.5
1
1.5
2
2.5
LEARNING RATE
PREDICTED OPTIMAL LEARNING RATE
MEAN SQUARED ERROR
1 epoch
2 epochs
3 epochs
4 epochs
5 epochs
Fig. 25. Mean squared error as a function of the ratio between learning rate and
predicted optimal learning rate for a fully connected network (784 30 10). The
training set consists of 300 handwritten digits.
0
0.25 0.5 0.75
1
1.25 1.5 1.75
2
2.25 2.5 2.75
3
0
0.5
1
1.5
2
2.5
LEARNING RATE
PREDICTED OPTIMAL LEARNING RATE
MEAN SQUARED ERROR
1 epoch
2 epochs
3 epochs
4 epochs
5 epochs
Fig. 26. Mean squared error as a function of the ratio between learning rate and
predicted optimal learning rate for a shared weight network with 5 layers (1024
1568
392
400
100
10), 64638 (local) connections and 1278 free parameters
(shared weights). The training set consists of 1000 handwritten digits.
10 Discussion and Conclusion
According to the recommendations mentioned above, a practitioner facing a
multi-layer neural net training problem would go through the following steps:
{ shu e the examples
{ center the input variables by subtracting the mean
{ normalize the input variable to a standard deviation of 1
{ if possible, decorrelate the input variables.
{ pick a network with the sigmoid function shown in gure 4
{ set the target values within the range of the sigmoid, typically +1 and -1.
{ initialize the weights to random values as prescribed by 16.
The preferred method for training the network should be picked as follows:
{ if the training set is large (more than a few hundred samples) and redundant,
and if the task is classi cation, use stochastic gradient with careful tuning,
or use the stochastic diagonal Levenberg Marquardt method.
{ if the training set is not too large, or if the task is regression, use conjugate
gradient.
Classical second-order methods are impractical in almost all useful cases.
The non-linear dynamics of stochastic gradient descent in multi-layer neural
networks, particularly as it pertains to generalization, is still far from being well
understood. More theoretical work and systematic experimental work is needed.
Acknowledgement Y.L. & L.B. & K.-R. M. gratefully acknowledge mutual
exchange grants from DAAD and NSF.
References
1. S. Amari. Neural learning in structured parameter spaces | natural riemannian
gradient. In Michael C. Mozer, Michael I. Jordan, and Thomas Petsche, editors,
Advances in Neural Information Processing Systems, volume 9, page 127. The MIT
Press, 1997.
2. S. Amari. Natural gradient works e ciently in learning. Neural Computation,
10(2):251{276, 1998.
3. R. Battiti. First- and second-order methods for learning: Between steepest descent
and newton's method. Neural Computation, 4:141{166, 1992.
4. S. Becker and Y. LeCun. Improving the convergence of backbropagation learning
with second oder metho ds. In David Touretzky, Geo rey Hinton, and Terrence
Sejnowski, editors, Proceedings of the 1988 Connectionist Models Summer School,
pages 29{37. Lawrence Erlbaum Associates, 1989.
5. C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford,
1995.
6. L. Bottou. Online algorithms and stochastic approximations. In David Saad, editor,
Online Learning in Neural Networks (1997 Workshop at the Newton Institute),
Cambridge, 1998. The Newton Institute Series, Cambridge University Press.
7. D. S. Broomhead and D. Lowe. Multivariable function interpolation and adaptive
networks. Complex Systems, 2:321{355, 1988.
8. W. L. Buntine and A. S. Weigend. Computing second order derivatives in Feed-
Forward networks: A review. IEEE Transactions on Neural Networks, 1993. To
appear.
9. C. Darken and J. E. Moody. Note on learning rate schedules for stochastic opti-
mization. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances
in Neural Information Processing Systems, volume 3, pages 832{838. Morgan Kauf-
mann, San Mateo, CA, 1991.
10. K. I. Diamantaras and S. Y. Kung. Principal Component Neural Networks. Wiley,
New York, 1996.
11. R. Fletcher. Practical Methods of Optimization, chapter 8.7 : Polynomial time
algorithms, pages 183{188. John Wiley & Sons, New York, second edition, 1987.
12. S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance
dilemma. Neural Computation, 4(1):1{58, 1992.
13. L. Goldstein. Mean square optimality in the continuous time Robbins Monro pro-
cedure. Technical Report DRB-306, Dept. of Mathematics, University of Southern
California, LA, 1987.
14. G. H. Golub and C. F. Van Loan. Matrix Computations, 2nd ed. Johns Hopkins
University Press, Baltimore, 1989.
15. T.M. Heskes and B. Kappen. On-line learning processes in arti cial neural net-
works. In J. G. Tayler, editor, Mathematical Approaches to Neural Networks, vol-
ume 51, pages 199{233. Elsevier, Amsterdam, 1993.
16. Robert A. Jacobs. Increased rates of convergence through learning rate adaptation.
Neural Networks, 1:295{307, 1988.
17. A. H. Kramer and A. Sangiovanni-Vincentelli. E cient parallel learning algorithms
for neural networks. In D. S. Touretzky, editor, Advances in Neural Information
Processing Systems. Proceedings of the 1988 Conference, pages 40{48, San Mateo,
CA, 1989. Morgan Kaufmann.
18. Y. LeCun. Modeles connexionnistes de l'apprentissage (connectionist learning mod-
els). PhD thesis, Universit e P. et M. Curie (Paris VI), 1987.
19. Y. LeCun. Generalization and network design strategies. In R. Pfeifer, Z. Schreter,
F. Fogelman, and L. Steels, editors, Connectionism in Perspective, Amsterdam,
1989. Elsevier. Proceedings of the International Conference Connectionism in Per-
spective, University of Z urich, 10. { 13. October 1988.
20. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and
L. D. Jackel. Handwritten digit recognition with a backpropagation network. In
D. S. Touretsky, editor, Advances in Neural Information Processing Systems, vol.
2, San Mateo, CA, 1990. Morgan Kaufman.
21. Y. LeCun, J.S. Denker, and S.A. Solla. Optimal brain damage. In D. S. Touretsky,
editor, Advances in Neural Information Processing Systems, vol. 2, pages 598{605,
1990.
22. Y. LeCun, I. Kanter, and S. A. Solla. Second order properties of error surfaces. In
Advances in Neural Information Processing Systems, vol. 3, San Mateo, CA, 1991.
Morgan Kaufmann.
23. Y. LeCun, P. Y. Simard, and B. Pearlmutter. Automatic learning rate maximiza-
tion by on-line estimation of the hessian's eigenvectors. In Giles, Hanson, and
Cowan, editors, Advances in Neural Information Processing Systems, vol. 5, San
Mateo, CA, 1993. Morgan Kaufmann.
24. M. M ller. A scaled conjugate gradient algorithm for fast supervised learning.
Neural Networks, 6:525{533, 1993.
25. M. M ller. Supervised learning on large redundant training sets. International
Journal of Neural Systems, 4(1):15{25, 1993.
26. J. E. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing
units. Neural Computation, 1:281{294, 1989.
27. N. Murata. (in Japanese). PhD thesis, University of Tokyo, 1992.
28. N. Murata, K.-R. M uller, A. Ziehe, and S. Amari. Adaptive on-line learning in
changing environments. In Michael C. Mozer, Michael I. Jordan, and Thomas
Petsche, editors, Advances in Neural Information Processing Systems, volume 9,
page 599. The MIT Press, 1997.
29. A.V. Oppenheim and R.W. Schafer. Digital Signal Processing. Prentice Hall,
Englewood Cli s, 1975.
30. G. B. Orr. Dynamics and Algorithms for Stochastic learning. PhD thesis, Oregon
Graduate Institute, 1995.
31. G.B. Orr. Removing noise in on-line search using adaptive batch sizes. In
Michael C. Mozer, Michael I. Jordan, and Thomas Petsche, editors, Advances in
Neural Information Processing Systems, volume 9, page 232. The MIT Press, 1997.
32. M.J.L. Orr. Regularization in the selection of radial basis function centers. Neural
Computation, 7(3):606{623, 1995.
33. B.A. Pearlmutter. Fast exact multiplication by the hessian. Neural Computation,
6:147{160, 1994.
34. W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical
Recipies in C: The art of Scienti c Programming. Cambridge University Press,
Cambridge, England, 1988.
35. D. Saad, editor. Online Learning in Neural Networks (1997 Workshop at the New-
ton Institute). The Newton Institute Series, Cambridge University Press, Cam-
bridge, 1998.
36. D. Saad and S. A. Solla. Exact solution for on-line learning in multilayer neural
networks. Physical Review Letters, 74:4337{4340, 1995.
37. H. Sompolinsky, N. Barkai, and H.S. Seung. On-line learning of dichotomies: al-
gorithms and learning curves. In J-H. Oh, C. Kwon, and S. Cho, editors, Neural
Networks: The Statistical Mechanics Perspective, pages 105{130. Singapore: World
Scienti c, 1995.
38. R.S. Sutton. Adapting bias by gradient descent: An incremental version of delta-
bar-delta. In William Swartout, editor, Proceedings of the 10th National Conference
on Arti cial Intelligence, pages 171{176, San Jose, CA, July 1992. MIT Press.
39. P. van der Smagt. Minimisation methods for training feed-forward networks. Neural
Networks, 7(1):1{11, 1994.
40. V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York,
1995.
41. V. Vapnik. Statistical Learning Theory Wiley, New York, 1998.
42. A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang. Phoneme recog-
nition using time-delay neural networks. IEEE Transactions on Acoustics, Speech,
and Signal Processing, ASSP-37:328{339, 1989.
43. W. Wiegerinck, A. Komoda, and T. Heskes. Stochastic dynamics of learning with
momentum in neural networks. Journal of Physics A, 27:4425{4437, 1994.
44. H.H. Yang and S. Amari. The e ciency and the robustness of natural gradient
descent learning rule. In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla,
editors, Advances in Neural Information Processing Systems, volume 10. The MIT
Press, 1998.