Institute for Advanced Management Systems Research
Department of Information Technologies
Åbo Akademi University
The delta learning rule - Tutorial
Robert Fullér
© 2010 rfuller@abo.fi
November 4, 2010
Table of Contents
1. Error correction learning
2. Linear activation function
3. The delta learning rule
4. The delta learning rule with semilinear activation function
5. The generalized delta learning rule
6. Effectiveness of neural networks
1. Error correction learning
The error correction learning procedure is simple enough in
conception. During training, an input is presented to the network
and flows through it, generating a set of values on the output
units. The actual output is then compared with the desired target,
and the match is computed. If the output and target match, no
change is made to the net. However, if the output differs from
the target, a change must be made to some of the connections.
Let us first recall the definition of the derivative of a
single-variable function.
Definition 1.1. The derivative of f at an interior point x of its
domain, denoted by f'(x), is defined by

f'(x) = lim_{x_n → x} [f(x) − f(x_n)]/(x − x_n).
Let us consider a differentiable function f : R → R. If f'(x) > 0
then f is increasing at x; if f'(x) < 0 then f is decreasing at x;
and if f'(x) = 0 then f can have a local maximum, minimum or
inflection point at x.
A differentiable function is always increasing in the direction
of its derivative, and decreasing in the opposite direction.
It means that if we want to find one of the local minima of a
function f starting from a point x_0, then we should search for a
second candidate:
• on the right-hand side of x_0 if f'(x_0) < 0, since f is decreasing at x_0;
• on the left-hand side of x_0 if f'(x_0) > 0, since f is increasing at x_0.
The equation of the tangent line at the point (x_0, f(x_0)) is given by

[y − f(x_0)]/(x − x_0) = f'(x_0),

that is,

y = f(x_0) + (x − x_0)f'(x_0).
The next approximation, denoted by x_1, is a solution to the equation

f(x_0) + (x − x_0)f'(x_0) = 0,

that is,

x_1 = x_0 − f(x_0)/f'(x_0).

This idea can be applied successively, that is,

x_{n+1} = x_n − f(x_n)/f'(x_n).
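This iteration (Newton's method) translates directly into code. Here is a
minimal Python sketch; the function f, its derivative, and the starting
point are illustrative choices, not taken from the tutorial:

    def newton(f, fprime, x0, tol=1e-10, max_iter=100):
        """Iterate x := x - f(x)/f'(x) until |f(x)| < tol."""
        x = x0
        for _ in range(max_iter):
            if abs(f(x)) < tol:
                break
            x = x - f(x) / fprime(x)
        return x

    # Solve f(x) = x^2 - 2 = 0 starting from x0 = 1; converges to sqrt(2).
    print(newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0))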
The above procedure is a typical descent method. In a descent
method the next iteration w^{n+1} should satisfy the property

f(w^{n+1}) < f(w^n),

i.e. the value of f at w^{n+1} is smaller than its value at the
previous approximation w^n.
Figure 1: The downhill direction is negative at x_0.
The above procedure is a typical descent method.
i.e. theInvalue
a of f at
descent wn+1 the
method is smaller than its
next iteration previous
wn+1 shouldvalue at
n
w . satisfy the following property
f (wn+1procedure,
In error correction learning ) < f (wn) each iteration of a de-
scent method
i.e. the calculates thewdownhill
value of f at n+1 direction
is smaller (opposite
than its previ- of the
n
direction
ousofvalue
the derivative)
n
at w . at w which means that for a suffi-
Toc JJ II J I Back J Doc Doc I
Section 1: Error correction learning 8
ciently small η > 0 the inequality
f (wn − ηf 0 (wn )) < f (wn )
should hold, and we let wn+1 be the vector
wn+1 = wn − ηf 0 (wn )
The same idea applies to functions of several variables. Let
f : R^n → R be a real-valued function. Whatever the next iteration
w^{n+1} is, it should again satisfy f(w^{n+1}) < f(w^n), and for a
sufficiently small η > 0 we let

w^{n+1} = w^n − ηf'(w^n),

where f'(w^n) now denotes the gradient of f at w^n, defined below.
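As a concrete illustration, the following minimal Python sketch performs
this descent iteration in one variable; the choice of f, η, and the number
of iterations are illustrative assumptions:

    def descent(fprime, w0, eta=0.1, n_iter=100):
        """Iterate w := w - eta * f'(w)."""
        w = w0
        for _ in range(n_iter):
            w = w - eta * fprime(w)
        return w

    # Minimize f(w) = w^2, with f'(w) = 2w; the iterates tend to the minimizer 0.
    print(descent(lambda w: 2.0 * w, w0=5.0))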
Let f : R^n → R be a real-valued function and let e ∈ R^n with
||e|| = 1 be a given direction. The derivative of f with respect
to e at w is defined as

∂_e f(w) = lim_{t → +0} [f(w + te) − f(w)]/t.
If

e = (0, …, 1, …, 0)^T,

with the 1 in the i-th position, i.e. e is the i-th basic direction,
then instead of ∂_e f(w) we write ∂_i f(w), which is defined by

∂_i f(w) = lim_{t → +0} [f(w_1, …, w_i + t, …, w_n) − f(w_1, …, w_i, …, w_n)]/t.

The gradient of f at w, denoted by f'(w), is defined by

f'(w) = (∂_1 f(w), …, ∂_n f(w))^T.
Example 1.1. Let f(w_1, w_2) = w_1^2 + w_2^2; then the gradient of f
is given by

f'(w) = 2w = (2w_1, 2w_2)^T.
Figure 2: The derivative of f with respect to the direction e.

The gradient vector always points in the uphill direction of f.
The downhill (steepest descent) direction of f at w is the opposite
of the uphill direction, i.e. the downhill direction is −f'(w),
which is

(−∂_1 f(w), …, −∂_n f(w))^T.
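The limit definition above also suggests a numerical check: a finite
difference with a small step approximates each partial derivative. A small
Python sketch, tested against Example 1.1 (the step size h is an
illustrative choice):

    def partial(f, w, i, h=1e-6):
        """Forward-difference estimate of the i-th partial derivative of f at w."""
        w_h = list(w)
        w_h[i] += h
        return (f(w_h) - f(w)) / h

    f = lambda w: w[0] ** 2 + w[1] ** 2   # Example 1.1
    w = [3.0, -2.0]
    print([partial(f, w, i) for i in range(2)])  # close to (2w1, 2w2) = (6, -4)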
2. Linear activation function
Definition 2.1. A linear activation function is a mapping
f : R → R such that f(t) = t for all t ∈ R.
Suppose we are given a single-layer network with n input units
and m linear output units, i.e. the output of the i-th neuron can
be written as

o_i = net_i = <w_i, x> = w_{i1}x_1 + · · · + w_{in}x_n,

for i = 1, …, m. Assume we have the following training set

{(x^1, y^1), …, (x^K, y^K)}

where x^k = (x^k_1, …, x^k_n), y^k = (y^k_1, …, y^k_m), k = 1, …, K.
The basic idea of the delta learning rule is to define a measure
of the overall performance of the system and then to find a way
to optimize that performance.

Figure 3: Single-layer feedforward network with m output units.
In our network, we can define the performance of the system as

E = Σ_{k=1}^K E_k = (1/2) Σ_{k=1}^K ||y^k − o^k||^2.

That is,

E = (1/2) Σ_{k=1}^K Σ_{i=1}^m (y^k_i − o^k_i)^2
  = (1/2) Σ_{k=1}^K Σ_{i=1}^m (y^k_i − <w_i, x^k>)^2,

where i indexes the output units; k indexes the input/output
pairs to be learned; y^k_i indicates the target for a particular output
unit on a particular pattern;

o^k_i := <w_i, x^k>

indicates the actual output for that unit on that pattern; and E is
the total error of the system. The goal, then, is to minimize this
function. It turns out, if the output functions are differentiable,
that this problem has a simple solution: namely, we can assign
a particular unit blame in proportion to the degree to which
changes in that unit’s activity lead to changes in the error.
That is, we change the weights of the system in proportion to
the derivative of the error with respect to the weights.
The rule for changing weights following presentation of
input/output pair (x, y) (we omit the index k for the sake of
simplicity) is given by the gradient descent method, i.e. we
minimize the quadratic error function by using the following
iteration process

w_{ij} := w_{ij} − η ∂E_k/∂w_{ij},

where η > 0 is the learning rate.
Let us now compute the partial derivative of the error function
E_k with respect to w_{ij}:

∂E_k/∂w_{ij} = (∂E_k/∂net_i)(∂net_i/∂w_{ij}) = −(y_i − o_i)x_j,

where net_i = w_{i1}x_1 + · · · + w_{in}x_n. That is,

w_{ij} := w_{ij} + η(y_i − o_i)x_j

for j = 1, …, n.
Definition 2.2. The error signal term, denoted by δ_i and called
delta, produced by the i-th output neuron is defined as

δ_i = −∂E_k/∂net_i = y_i − o_i.
For linear output units, δ_i is nothing else but the difference
between the desired and computed output values of the i-th neuron.
So the delta learning rule can be written as

w_i := w_i + η(y_i − o_i)x,

where w_i is the weight vector of the i-th output unit, x is the
actual input to the system, y_i is the i-th coordinate of the desired
output, o_i is the i-th coordinate of the computed output, and
η > 0 is the learning rate.
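A single update step of this rule is easy to express in code. The following
minimal Python sketch applies one delta-rule step to one output unit; the
numbers are illustrative, not from the tutorial:

    def delta_update(w, x, y_i, eta):
        """One delta-rule step for a linear output unit: w := w + eta*(y_i - o_i)*x."""
        o_i = sum(wj * xj for wj, xj in zip(w, x))   # o_i = <w_i, x>
        return [wj + eta * (y_i - o_i) * xj for wj, xj in zip(w, x)]

    print(delta_update(w=[0.0, 0.0, 0.0], x=[1.0, 1.0, 0.0], y_i=1.0, eta=0.5))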
If we have only one output unit then the delta learning rule
collapses into

w := w + η(y − o)x = w + ηδx,

where δ denotes the difference between the desired and the
computed output.

Figure 4: A single neuron net.

A key advantage of neural network systems is that these simple,
yet powerful learning procedures can be defined, allowing the
systems to adapt to their environments. The essential character of
such networks is that they map similar input patterns to similar
output patterns. This characteristic is what allows these networks
to make reasonable generalizations and perform reasonably on
patterns that have never before been presented. The similarity of
patterns in a connectionist system is determined by their overlap.
The overlap in such networks is determined outside the learning
system itself, by whatever produces the patterns. The standard
delta rule essentially implements gradient descent in sum-squared
error for linear activation functions. It should be noted that the
delta learning rule was introduced only recently for neural network
training by McClelland and Rumelhart. This rule parallels the
discrete perceptron training rule; it can also be called the
continuous perceptron training rule.
3. The delta learning rule
The delta learning rule with linear activation functions. Given
are K training pairs arranged in the training set
{(x^1, y^1), …, (x^K, y^K)}

where x^k = (x^k_1, …, x^k_n) and y^k = (y^k_1, …, y^k_m), k = 1, …, K.
• Step 1 η > 0, Emax > 0 are chosen
• Step 2 Weights wij are initialized at small random values,
k := 1, and the running error E is set to 0
• Step 3 Training starts here. Input x^k is presented, x := x^k,
y := y^k, and the output o = (o_1, …, o_m)^T is computed, where

o_i = <w_i, x> = w_i^T x

for i = 1, …, m.
• Step 4 Weights are updated:

w_{ij} := w_{ij} + η(y_i − o_i)x_j

• Step 5 Cumulative cycle error is computed by adding the
present error to E:

E := E + (1/2)||y − o||^2

• Step 6 If k < K then k := k + 1 and we continue the training
by going back to Step 3; otherwise we go to Step 7
• Step 7 The training cycle is completed. For E < Emax
terminate the training session. If E > Emax then E is set to 0
and we initiate a new training cycle by going back to Step 3
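Steps 1-7 translate directly into a short program. Here is a minimal Python
sketch for a single linear output unit; the learning rate, error bound,
cycle limit, and the use of the Exercise 1 training set below are
illustrative assumptions:

    import random

    def train_delta_linear(patterns, eta=0.1, e_max=1e-4, max_cycles=1000):
        """Delta rule with linear activation, one output unit (Steps 1-7)."""
        n = len(patterns[0][0])
        w = [random.uniform(-0.1, 0.1) for _ in range(n)]          # Step 2
        for _ in range(max_cycles):                                # new cycles
            error = 0.0                                            # running error E
            for x, y in patterns:                                  # Steps 3-6
                o = sum(wj * xj for wj, xj in zip(w, x))           # o = <w, x>
                w = [wj + eta * (y - o) * xj for wj, xj in zip(w, x)]  # Step 4
                error += 0.5 * (y - o) ** 2                        # Step 5
            if error < e_max:                                      # Step 7
                break
        return w

    # Training set of Exercise 1 below; w converges near (1, 0, 0).
    data = [([1, 1, 1], 1), ([1, 1, 0], 1), ([1, 0, 0], 1), ([0, 1, 1], 0),
            ([0, 0, 1], 0), ([0, 1, 0], 0), ([1, 0, 1], 1)]
    print(train_delta_linear(data))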
Exercise 1. Construct a single-neuron network which computes
the following function. The training set is

     x1  x2  x3   o(x1, x2, x3)
 1.   1   1   1   1
 2.   1   1   0   1
 3.   1   0   0   1
 4.   0   1   1   0
 5.   0   0   1   0
 6.   0   1   0   0
 7.   1   0   1   1
Solution 1. For example, w1 = 1 and w2 = w3 = 0.
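This can be checked directly; a quick Python verification, assuming a
linear unit o = <w, x>:

    w = [1, 0, 0]   # w1 = 1, w2 = w3 = 0
    data = [([1, 1, 1], 1), ([1, 1, 0], 1), ([1, 0, 0], 1), ([0, 1, 1], 0),
            ([0, 0, 1], 0), ([0, 1, 0], 0), ([1, 0, 1], 1)]
    print(all(sum(wj * xj for wj, xj in zip(w, x)) == y for x, y in data))  # True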
Exercise 2. The error function to be minimized is given by

E(w_1, w_2, w_3, w_4) = (1/2)[(w_1 − w_2 − w_3)^2 + (−w_1 + w_2 − w_3)^2
+ (−w_1 − w_2 − 1)^2 + (w_4 + 1)^2].

Find analytically the gradient vector

E'(w) = (∂_1 E(w), ∂_2 E(w), ∂_3 E(w), ∂_4 E(w))^T.

Find analytically the weight vector w* that minimizes the error
function, such that E'(w*) = 0. Derive the steepest descent
method for the minimization of E.
Solution 2. The gradient vector of E is

E'(w) = (3w_1 − w_2 + 1, −w_1 + 3w_2 + 1, 2w_3, w_4 + 1)^T

and

w* = (−1/2, −1/2, 0, −1)^T

is the unique solution to the equation E'(w) = 0.

The steepest descent method for the minimization of E reads

w := w − ηE'(w),

where η > 0 is the learning constant. That is,

w_1 := w_1 − η(3w_1 − w_2 + 1),
w_2 := w_2 − η(−w_1 + 3w_2 + 1),
w_3 := w_3 − η(2w_3),
w_4 := w_4 − η(w_4 + 1).
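A quick numerical check of this iteration in Python (η, the starting
point, and the iteration count are illustrative) converges to
w* = (−1/2, −1/2, 0, −1):

    def grad_E(w):
        """Gradient of E from Solution 2."""
        w1, w2, w3, w4 = w
        return [3*w1 - w2 + 1, -w1 + 3*w2 + 1, 2*w3, w4 + 1]

    w, eta = [0.0, 0.0, 0.0, 0.0], 0.1
    for _ in range(200):
        w = [wi - eta * gi for wi, gi in zip(w, grad_E(w))]
    print(w)   # approximately [-0.5, -0.5, 0.0, -1.0]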
4. The delta learning rule with semilinear activation function
The standard delta rule essentially implements gradient descent
in sum-squared error for linear activation functions.
We now describe the delta learning rule with a semilinear
activation function. For simplicity we explain the learning
algorithm in the case of a single-output network.
Toc JJ II J I Back J Doc Doc I
If we have only one output unit then the delta learn-
ing
Section 4: Therule collapses
delta learning into
rule with semilinear activation function 26
x1 w1
! = f(net)
f
xn wn
A single neuron net.
Figure 5: A single neuron net.
The output of the neuron
w := is computed
w + η(y − o)x = by
w +unipolar
ηδx sigmoidal
activation function
where δ denotes the difference 1between the desired
o(< w, x >)
and the computed =
output. .
1 + exp(−wT x)
A key
Suppose advantage
we are given theoffollowing
neural network
trainingsystems
set is that
these simple, yet powerful learning procedures can
be defined, allowing the systems to adapt to their
environments.
Toc JJ II J I Back J Doc Doc I
 No.   input values                  desired output value
 1.    x^1 = (x^1_1, …, x^1_n)       y^1
 2.    x^2 = (x^2_1, …, x^2_n)       y^2
 …     …                             …
 K.    x^K = (x^K_1, …, x^K_n)       y^K
The system first uses the input vector x^k to produce its own
output o^k and then compares this with the desired output y^k.
If we use a linear output unit then, whatever the final weight
vector is, the output function of the network is a line (in the
two-dimensional case).
Figure 6: A pattern set from an almost linear function f.

It means that the delta learning rule with linear output function
can approximate only a pattern set derived from an almost linear
function.
Figure 7: A pattern set from a nonlinear function f.

However, if the patterns are derived from a nonlinear function f,
then the chances of a good approximation are small. This is why we
use semilinear activation functions.
Let

E_k = (1/2)(y^k − o^k)^2 = (1/2)(y^k − o(<w, x^k>))^2
    = (1/2)[y^k − 1/(1 + exp(−w^T x^k))]^2

be our measure of the error on input/output pattern k, and let

E = Σ_{k=1}^K E_k = E_1 + E_2 + · · · + E_K

be our overall measure of the error.
The rule for changing weights following presentation of
input/output pair k is given by the gradient descent method, i.e.
we minimize the quadratic error function by using the following
iteration process

w := w − ηE'_k(w).
Let us now compute the gradient vector of the error function E_k
at the point w:

E'_k(w) = (d/dw)[(1/2)(y^k − 1/(1 + exp(−w^T x^k)))^2]
        = −(y^k − o^k)o^k(1 − o^k)x^k,

where o^k = 1/(1 + exp(−w^T x^k)); here we used the identity
o'(t) = o(t)(1 − o(t)) for the unipolar sigmoid. Therefore our
learning rule for w is

w := w + η(y^k − o^k)o^k(1 − o^k)x^k,

which can be written as

w := w + ηδ_k x^k,

where

δ_k = (y^k − o^k)o^k(1 − o^k).
The delta learning rule with unipolar sigmoidal activation function.
Given are K training pairs arranged in the training set
{(x^1, y^1), …, (x^K, y^K)}

where x^k = (x^k_1, …, x^k_n) and y^k ∈ R, k = 1, …, K.
• Step 1 η > 0, Emax > 0 are chosen
• Step 2 Weights w are initialized at small random values,
k := 1, and the running error E is set to 0
• Step 3 Training starts here. Input x^k is presented, x := x^k,
y := y^k, and the output o is computed:

o = o(<w, x>) = 1/(1 + exp(−w^T x))

• Step 4 Weights are updated:

w := w + η(y − o)o(1 − o)x
• Step 5 Cumulative cycle error is computed by adding the
present error to E:

E := E + (1/2)(y − o)^2
• Step 6 If k < K then k := k + 1 and we continue the training
by going back to Step 3; otherwise we go to Step 7
• Step 7 The training cycle is completed. For E < Emax
terminate the training session. If E > Emax then E is set to 0
and we initiate a new training cycle by going back to Step 3
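As with the linear case, these steps map onto a short training loop. A
minimal Python sketch for one sigmoidal output unit; η, Emax, and the
cycle limit are illustrative assumptions:

    import math, random

    def train_delta_sigmoid(patterns, eta=0.5, e_max=0.01, max_cycles=10000):
        """Delta rule with unipolar sigmoidal activation, one output unit."""
        n = len(patterns[0][0])
        w = [random.uniform(-0.1, 0.1) for _ in range(n)]       # Step 2
        for _ in range(max_cycles):
            error = 0.0
            for x, y in patterns:                               # one training cycle
                o = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, x))))
                w = [wj + eta * (y - o) * o * (1 - o) * xj      # Step 4
                     for wj, xj in zip(w, x)]
                error += 0.5 * (y - o) ** 2                     # Step 5
            if error < e_max:                                   # Step 7
                break
        return w

Note that for targets in {0, 1} the sigmoid can only approach the target
values asymptotically, so Emax should not be chosen too small.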
In this case, without hidden units, the error surface is shaped
like a bowl with only one minimum, so gradient descent is
guaranteed to find the best set of weights. With hidden units,
however, it is not so obvious how to compute the derivatives, and
the error surface is not concave upwards, so there is the danger
of getting stuck in local minima.
We illustrate the delta learning rule with the bipolar sigmoidal
activation function

f(t) = 2/(1 + exp(−t)) − 1.

Note that f'(t) = (1/2)(1 − f(t)^2), which is where the factor
(1/2)(1 − o^2) in the update rule below comes from.
Example 4.1. The delta learning rule with bipolar sigmoidal
activation function. Given are K training pairs arranged in the
training set
{(x^1, y^1), …, (x^K, y^K)}

where x^k = (x^k_1, …, x^k_n) and y^k ∈ R, k = 1, …, K.
• Step 1 η > 0, Emax > 0 are chosen
• Step 2 Weights w are initialized at small random values,
k := 1, and the running error E is set to 0
• Step 3 Training starts here. Input x^k is presented, x := x^k,
y := y^k, and the output o is computed:

o = o(<w, x>) = 2/(1 + exp(−w^T x)) − 1

• Step 4 Weights are updated:

w := w + η(y − o)(1/2)(1 − o^2)x

• Step 5 Cumulative cycle error is computed by adding the
present error to E:

E := E + (1/2)(y − o)^2
• Step 6 If k < K then k := k + 1 and we continue the training
by going back to Step 3; otherwise we go to Step 7
• Step 7 The training cycle is completed. For E < Emax
terminate the training session. If E > Emax then E is set to 0
and we initiate a new training cycle by going back to Step 3
Exercise 3. The error function to be minimized is given by

E(w_1, w_2) = (1/2)[(w_2 − w_1)^2 + (1 − w_1)^2].

Find analytically the gradient vector

E'(w) = (∂_1 E(w), ∂_2 E(w))^T.

Find analytically the weight vector w* that minimizes the error
function, such that E'(w*) = 0. Derive the steepest descent
method for the minimization of E.
Solution 3. The gradient vector of E is

E'(w) = ((w_1 − w_2) + (w_1 − 1), (w_2 − w_1))^T = (2w_1 − w_2 − 1, w_2 − w_1)^T

and w* = (1, 1)^T is the unique solution to the equation

(2w_1 − w_2 − 1, w_2 − w_1)^T = (0, 0)^T.

The steepest descent method for the minimization of E reads

(w_1(t + 1), w_2(t + 1))^T = (w_1(t), w_2(t))^T − η(2w_1(t) − w_2(t) − 1, w_2(t) − w_1(t))^T,

where η > 0 is the learning constant and t indexes the number of
iterations.
That is,

w_1(t + 1) = w_1(t) − η(2w_1(t) − w_2(t) − 1),
w_2(t + 1) = w_2(t) − η(w_2(t) − w_1(t)),

or, equivalently,

w_1 := w_1 − η(2w_1 − w_2 − 1),
w_2 := w_2 − η(w_2 − w_1).
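A quick numerical check of this iteration in Python (η, the starting
point, and the iteration count are illustrative) converges to w* = (1, 1):

    w1, w2, eta = 0.0, 0.0, 0.1
    for _ in range(500):
        w1, w2 = w1 - eta * (2 * w1 - w2 - 1), w2 - eta * (w2 - w1)
    print(w1, w2)   # both approximately 1.0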
5. The generalized delta learning rule
We now focus on generalizing the delta learning rule for
feedforward layered neural networks. The architecture of the
two-layer network considered below is shown in the figure. It has,
strictly speaking, two layers of processing neurons.
If, however, the layers of nodes are counted, then the network
can also be labeled as a three-layer network. There is no
agreement in the literature as to which approach is to be used to
describe network architectures. In this text we will use the term
layer in reference to the actual number of existing and processing
neuron layers. Layers with neurons whose outputs are not directly
accessible are called internal or hidden layers. Thus the network
of the figure is a two-layer network, which can be called a single
hidden-layer network.

The generalized delta rule is the most often used supervised
learning algorithm of feedforward multi-layer neural networks.
For simplicity we consider only a neural network with one hidden
layer and one output node.
Figure 8: Layered neural network with two continuous perceptron layers.

Figure 9: Two-layer neural network with one output node.

The measure of the error on an input/output training pattern
(x^k, y^k) is defined by

E_k(W, w) = (1/2)(y^k − O^k)^2,

where O^k is the computed output, and the overall measure of the
error is

E(W, w) = Σ_{k=1}^K E_k(W, w).
If an input vector x^k is presented to the network then it
generates the following output:

O^k = 1/(1 + exp(−W^T o^k)),

where o^k is the output vector of the hidden layer,

o^k_l = 1/(1 + exp(−w_l^T x^k)),

and w_l denotes the weight vector of the l-th hidden neuron,
l = 1, …, L.

The rule for changing weights following presentation of
input/output pair k is given by the gradient descent method, i.e.
we minimize the quadratic error function by using the following
iteration process

W := W − η ∂E_k(W, w)/∂W,
w_l := w_l − η ∂E_k(W, w)/∂w_l,

for l = 1, …, L, where η > 0 is the learning rate.
By using the chain rule for derivatives of composed functions we
get

∂E_k(W, w)/∂W = (1/2) (∂/∂W)[y^k − 1/(1 + exp(−W^T o^k))]^2
              = −(y^k − O^k)O^k(1 − O^k)o^k.

So the rule for changing the weights of the output unit is

W := W + η(y^k − O^k)O^k(1 − O^k)o^k = W + ηδ_k o^k,

that is,

W_l := W_l + ηδ_k o^k_l,

for l = 1, …, L, where we have used the notation

δ_k = (y^k − O^k)O^k(1 − O^k).
Let us now compute the partial derivative of E_k with respect to
w_l:

∂E_k(W, w)/∂w_l = −(y^k − O^k)O^k(1 − O^k)W_l o^k_l(1 − o^k_l)x^k
                = −δ_k W_l o^k_l(1 − o^k_l)x^k,

i.e. the rule for changing the weights of the hidden units is

w_l := w_l + ηδ_k W_l o^k_l(1 − o^k_l)x^k,

for l = 1, …, L. That is,

w_{lj} := w_{lj} + ηδ_k W_l o^k_l(1 − o^k_l)x^k_j,

for j = 1, …, n.
Summary. The generalized delta learning rule (error
backpropagation learning). We are given the training set

{(x^1, y^1), …, (x^K, y^K)}

where x^k = (x^k_1, …, x^k_n) and y^k ∈ R, k = 1, …, K.
• Step 1 η > 0, Emax > 0 are chosen
• Step 2 Weights w are initialized at small random values,
k := 1, and the running error E is set to 0
• Step 3 Training starts here. Input x^k is presented, x := x^k,
y := y^k, and the output O is computed:

O = 1/(1 + exp(−W^T o)),

where o is the output vector of the hidden layer, with components

o_l = 1/(1 + exp(−w_l^T x))
• Step 4 Weights of the output unit are updated:

W := W + ηδo,

where δ = (y − O)O(1 − O)
• Step 5 Weights of the hidden units are updated:

w_l := w_l + ηδW_l o_l(1 − o_l)x, for l = 1, …, L
• Step 6 Cumulative cycle error is computed by adding the
present error to E:

E := E + (1/2)(y − O)^2

• Step 7 If k < K then k := k + 1 and we continue the training
by going back to Step 3; otherwise we go to Step 8
• Step 8 The training cycle is completed. For E < Emax
terminate the training session. If E > Emax then E := 0, k := 1
and we initiate a new training cycle by going back to Step 3
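The summary above can be turned into a compact program. Here is a minimal
Python sketch of this generalized delta rule for a network with L hidden
units and one output; the data, η, Emax, and network size are illustrative
assumptions:

    import math, random

    def sigmoid(t):
        return 1.0 / (1.0 + math.exp(-t))

    def train_backprop(patterns, L=3, eta=0.5, e_max=0.01, max_cycles=10000):
        """Generalized delta rule: one hidden layer (L units), one output."""
        n = len(patterns[0][0])
        w = [[random.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(L)]
        W = [random.uniform(-0.5, 0.5) for _ in range(L)]
        for _ in range(max_cycles):
            error = 0.0
            for x, y in patterns:
                o = [sigmoid(sum(wlj * xj for wlj, xj in zip(wl, x))) for wl in w]
                O = sigmoid(sum(Wl * ol for Wl, ol in zip(W, o)))
                delta = (y - O) * O * (1 - O)
                # Step 5 uses the old output weights W_l, so update them last.
                for l in range(L):
                    w[l] = [wlj + eta * delta * W[l] * o[l] * (1 - o[l]) * xj
                            for wlj, xj in zip(w[l], x)]
                W = [Wl + eta * delta * ol for Wl, ol in zip(W, o)]  # Step 4
                error += 0.5 * (y - O) ** 2   # Step 6
            if error < e_max:                 # Step 8
                break
        return W, w

    # Illustrative usage (no bias terms, as in the text):
    data = [([0.0, 1.0], 1.0), ([1.0, 0.0], 0.0)]
    print(train_backprop(data))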
Exercise 4. Derive the backpropagation learning rule with bipolar
sigmoidal activation function

f(t) = 2/(1 + exp(−t)) − 1.
6. Effectiveness of neural networks
Funahashi (1989) showed that infinitely large neural networks
with a single hidden layer are capable of approximating all
continuous functions. Namely, he proved the following theorem.

Theorem 6.1. Let φ(x) be a nonconstant, bounded and monotone
increasing continuous function. Let K ⊂ R^n be a compact set and

f : K → R

be a real-valued continuous function on K. Then for arbitrary
ε > 0 there exists an integer N and real constants w_i, w_{ij} such
that

f̃(x_1, …, x_n) = Σ_{i=1}^N w_i φ(Σ_{j=1}^n w_{ij} x_j)

satisfies

||f − f̃||_∞ = sup_{x∈K} |f(x) − f̃(x)| ≤ ε.
In other words, any continuous mapping can be approximated in the
sense of the uniform topology on K by input-output mappings of
two-layer networks whose output functions are φ(x) for the hidden
layer and linear for the output layer.
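In code, an approximator of the form given in Theorem 6.1 looks as follows;
this Python sketch only evaluates the network (the weights and the choice
of the unipolar sigmoid for φ are illustrative, and finding weights that
achieve a given ε is the training problem treated earlier):

    import math

    def funahashi_net(x, w_out, w_hidden):
        """Evaluate f~(x) = sum_i w_i * phi(sum_j w_ij * x_j), phi = sigmoid."""
        phi = lambda t: 1.0 / (1.0 + math.exp(-t))
        return sum(wi * phi(sum(wij * xj for wij, xj in zip(row, x)))
                   for wi, row in zip(w_out, w_hidden))

    # N = 2 hidden units, n = 2 inputs; the weights are illustrative.
    print(funahashi_net([0.3, 0.7], w_out=[1.5, -0.5],
                        w_hidden=[[1.0, 2.0], [0.5, -1.0]]))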
Figure 10: Funahashi's network.

The Stone-Weierstrass theorem from classical real analysis can be
used to show that certain network architectures possess the
universal approximation capability. By employing the
Stone-Weierstrass theorem in the design of our networks, we also
guarantee that the networks can compute certain polynomial
expressions: if we are given networks exactly computing two
functions, f_1 and f_2, then a larger network can exactly compute
a polynomial expression of f_1 and f_2.
Theorem 6.2. (Stone-Weierstrass) Let domain K be a compact space
of n dimensions, and let G be a set of continuous real-valued
functions on K satisfying the following criteria:
1. The constant function f(x) = 1 is in G.
2. For any two points x_1 ≠ x_2 in K, there is an f in G such
that f(x_1) ≠ f(x_2).
3. If f_1 and f_2 are two functions in G, then f_1 f_2 and
α_1 f_1 + α_2 f_2 are also in G, for any two real numbers α_1 and α_2.
Then G is dense in C(K), the set of continuous real-valued
functions on K. In other words, for any ε > 0 and any function g
in C(K), there exists an f in G such that

||f − g||_∞ = sup_{x∈K} |f(x) − g(x)| ≤ ε.
The key to satisfying the Stone-Weierstrass theorem is to find
functions that transform multiplication into addition so that
products can be written as summations. There are at least three
generic functions that accomplish this transformation: exponential
functions, partial fractions, and step functions.
The following networks satisfy the Stone-Weierstrass theorem.
• Decaying-exponential networks
Exponential functions are basic to the process of transforming
multiplication into addition in several kinds of networks:

exp(x_1) exp(x_2) = exp(x_1 + x_2).
Let G be the set of all continuous functions that can be computed
by arbitrarily large decaying-exponential networks on domain
K = [0, 1]^n:

G = { f(x_1, …, x_n) = Σ_{i=1}^N w_i exp(−Σ_{j=1}^n w_{ij} x_j), w_i, w_{ij} ∈ R }.

Then G is dense in C(K).
• Fourier networks
• Exponentiated-function networks
• Modified logistic networks
• Modified sigma-pi and polynomial networks
Let G be the set of all continuous functions that can be computed
by arbitrarily large modified sigma-pi or polynomial networks on
domain K = [0, 1]^n:

G = { f(x_1, …, x_n) = Σ_{i=1}^N w_i Π_{j=1}^n x_j^{w_{ij}}, w_i, w_{ij} ∈ R }.

Then G is dense in C(K).
• Step functions and perceptron networks
• Partial fraction networks