Institute for Advanced Management Systems Research
Department of Information Technologies
Åbo Akademi University
The delta learning rule - Tutorial
Robert Fullér
© 2010 rfuller@abo.fi
November 4, 2010
Table of Contents
1. Error correction learning
2. Linear activation function
3. The delta learning rule
4. The delta learning rule with semilinear activation function
5. The generalized delta learning rule
6. Effectiveness of neural networks
1. Error correction learning
The error correction learning procedure is simple enough in
conception. During training, an input is presented to the network
and flows through it, generating a set of values on the output
units. The actual output is then compared with the desired target,
and the match is computed. If the output and target match, no
change is made to the net. However, if the output differs from
the target, a change must be made to some of the connections.
Let us first recall the definition of the derivative of a
single-variable function.
Definition 1.1. The derivative of f at an interior point x of its
domain, denoted by f'(x), is defined by

f'(x) = lim_{x_n → x} [f(x) − f(x_n)]/(x − x_n).
Let us consider a differentiable function f : R → R. If f'(x) > 0
then f is increasing at x; if f'(x) < 0 then f is decreasing at x;
and if f'(x) = 0 then f can have a local maximum, minimum or
inflection point at x.
A differentiable function is always increasing in the direction
of its derivative, and decreasing in the opposite direction.
It means that if we want to find one of the local minima of a
function f starting from a point x_0, then we should search for a
second candidate:
• on the right-hand side of x_0 if f'(x_0) < 0, since f is decreasing at x_0;
• on the left-hand side of x_0 if f'(x_0) > 0, since f is increasing at x_0.
The equation of the tangent line at the point (x_0, f(x_0)) is given by

[y − f(x_0)]/(x − x_0) = f'(x_0),

that is,

y = f(x_0) + (x − x_0)f'(x_0).
The next approximation, denoted by x_1, is a solution to the equation

f(x_0) + (x − x_0)f'(x_0) = 0,

that is,

x_1 = x_0 − f(x_0)/f'(x_0).

This idea can be applied successively, that is,

x_{n+1} = x_n − f(x_n)/f'(x_n).
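This iteration (Newton's method) translates directly into code. Here is a
minimal Python sketch; the function f, its derivative, and the starting
point are illustrative choices, not taken from the tutorial:

    def newton(f, fprime, x0, tol=1e-10, max_iter=100):
        """Iterate x := x - f(x)/f'(x) until |f(x)| < tol."""
        x = x0
        for _ in range(max_iter):
            if abs(f(x)) < tol:
                break
            x = x - f(x) / fprime(x)
        return x

    # Solve f(x) = x^2 - 2 = 0 starting from x0 = 1; converges to sqrt(2).
    print(newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0))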
The above procedure is a typical descent method. In a descent
method the next iteration w^{n+1} should satisfy the property

f(w^{n+1}) < f(w^n),

i.e. the value of f at w^{n+1} is smaller than its value at the
previous approximation w^n.
Figure 1: The downhill direction is negative at x_0.
The above procedure is a typical descent method.
i.e. theInvalue
a of f at
descent wn+1 the
method is smaller than its
next iteration previous
wn+1 shouldvalue at
n
w . satisfy the following property
f (wn+1procedure,
In error correction learning ) < f (wn) each iteration of a de-
scent method
i.e. the calculates thewdownhill
value of f at n+1 direction
is smaller (opposite
than its previ- of the
n
direction
ousofvalue
the derivative)
n
at w . at w which means that for a suffi-
Toc JJ II J I Back J Doc Doc I
Section 1: Error correction learning 8
ciently small η > 0 the inequality
f (wn − ηf 0 (wn )) < f (wn )
should hold, and we let wn+1 be the vector
wn+1 = wn − ηf 0 (wn )
The same idea applies to functions of several variables. Let
f : R^n → R be a real-valued function. Whatever the next iteration
w^{n+1} is, it should again satisfy f(w^{n+1}) < f(w^n), and for a
sufficiently small η > 0 we let

w^{n+1} = w^n − ηf'(w^n),

where f'(w^n) now denotes the gradient of f at w^n, defined below.
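As a concrete illustration, the following minimal Python sketch performs
this descent iteration in one variable; the choice of f, η, and the number
of iterations are illustrative assumptions:

    def descent(fprime, w0, eta=0.1, n_iter=100):
        """Iterate w := w - eta * f'(w)."""
        w = w0
        for _ in range(n_iter):
            w = w - eta * fprime(w)
        return w

    # Minimize f(w) = w^2, with f'(w) = 2w; the iterates tend to the minimizer 0.
    print(descent(lambda w: 2.0 * w, w0=5.0))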
Let f : R^n → R be a real-valued function and let e ∈ R^n with
||e|| = 1 be a given direction. The derivative of f with respect
to e at w is defined as

∂_e f(w) = lim_{t → +0} [f(w + te) − f(w)]/t.
If

e = (0, …, 1, …, 0)^T,

with the 1 in the i-th position, i.e. e is the i-th basic direction,
then instead of ∂_e f(w) we write ∂_i f(w), which is defined by

∂_i f(w) = lim_{t → +0} [f(w_1, …, w_i + t, …, w_n) − f(w_1, …, w_i, …, w_n)]/t.

The gradient of f at w, denoted by f'(w), is defined by

f'(w) = (∂_1 f(w), …, ∂_n f(w))^T.
Example 1.1. Let f(w_1, w_2) = w_1^2 + w_2^2; then the gradient of f
is given by

f'(w) = 2w = (2w_1, 2w_2)^T.
Figure 2: The derivative of f with respect to the direction e.

The gradient vector always points in the uphill direction of f.
The downhill (steepest descent) direction of f at w is the opposite
of the uphill direction, i.e. the downhill direction is −f'(w),
which is

(−∂_1 f(w), …, −∂_n f(w))^T.
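The limit definition above also suggests a numerical check: a finite
difference with a small step approximates each partial derivative. A small
Python sketch, tested against Example 1.1 (the step size h is an
illustrative choice):

    def partial(f, w, i, h=1e-6):
        """Forward-difference estimate of the i-th partial derivative of f at w."""
        w_h = list(w)
        w_h[i] += h
        return (f(w_h) - f(w)) / h

    f = lambda w: w[0] ** 2 + w[1] ** 2   # Example 1.1
    w = [3.0, -2.0]
    print([partial(f, w, i) for i in range(2)])  # close to (2w1, 2w2) = (6, -4)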
2. Linear activation function
Definition 2.1. A linear activation function is a mapping
f : R → R such that f(t) = t for all t ∈ R.
Suppose we are given a single-layer network with n input units
and m linear output units, i.e. the output of the i-th neuron can
be written as

o_i = net_i = <w_i, x> = w_{i1}x_1 + · · · + w_{in}x_n,

for i = 1, …, m. Assume we have the following training set

{(x^1, y^1), …, (x^K, y^K)}

where x^k = (x^k_1, …, x^k_n), y^k = (y^k_1, …, y^k_m), k = 1, …, K.
The basic idea of the delta learning rule is to define a measure
of the overall performance of the system and then to find a way
to optimize that performance.

Figure 3: Single-layer feedforward network with m output units.
In our network, we can define the performance of the system as

E = Σ_{k=1}^K E_k = (1/2) Σ_{k=1}^K ||y^k − o^k||^2.

That is,

E = (1/2) Σ_{k=1}^K Σ_{i=1}^m (y^k_i − o^k_i)^2
  = (1/2) Σ_{k=1}^K Σ_{i=1}^m (y^k_i − <w_i, x^k>)^2,

where i indexes the output units; k indexes the input/output
pairs to be learned; y^k_i indicates the target for a particular output
unit on a particular pattern;

o^k_i := <w_i, x^k>

indicates the actual output for that unit on that pattern; and E is
the total error of the system. The goal, then, is to minimize this
function. It turns out, if the output functions are differentiable,
that this problem has a simple solution: namely, we can assign
a particular unit blame in proportion to the degree to which
changes in that unit’s activity lead to changes in the error.
That is, we change the weights of the system in proportion to
the derivative of the error with respect to the weights.
The rule for changing weights following presentation of
input/output pair (x, y) (we omit the index k for the sake of
simplicity) is given by the gradient descent method, i.e. we
minimize the quadratic error function by using the following
iteration process

w_{ij} := w_{ij} − η ∂E_k/∂w_{ij},

where η > 0 is the learning rate.
Let us now compute the partial derivative of the error function
E_k with respect to w_{ij}:

∂E_k/∂w_{ij} = (∂E_k/∂net_i)(∂net_i/∂w_{ij}) = −(y_i − o_i)x_j,

where net_i = w_{i1}x_1 + · · · + w_{in}x_n. That is,

w_{ij} := w_{ij} + η(y_i − o_i)x_j

for j = 1, …, n.
Definition 2.2. The error signal term, denoted by δ_i and called
delta, produced by the i-th output neuron is defined as

δ_i = −∂E_k/∂net_i = y_i − o_i.
For linear output units, δ_i is nothing else but the difference
between the desired and computed output values of the i-th neuron.
So the delta learning rule can be written as

w_i := w_i + η(y_i − o_i)x,

where w_i is the weight vector of the i-th output unit, x is the
actual input to the system, y_i is the i-th coordinate of the desired
output, o_i is the i-th coordinate of the computed output, and
η > 0 is the learning rate.
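A single update step of this rule is easy to express in code. The following
minimal Python sketch applies one delta-rule step to one output unit; the
numbers are illustrative, not from the tutorial:

    def delta_update(w, x, y_i, eta):
        """One delta-rule step for a linear output unit: w := w + eta*(y_i - o_i)*x."""
        o_i = sum(wj * xj for wj, xj in zip(w, x))   # o_i = <w_i, x>
        return [wj + eta * (y_i - o_i) * xj for wj, xj in zip(w, x)]

    print(delta_update(w=[0.0, 0.0, 0.0], x=[1.0, 1.0, 0.0], y_i=1.0, eta=0.5))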
If we have only one output unit then the delta learning rule
collapses into

w := w + η(y − o)x = w + ηδx,

where δ denotes the difference between the desired and the
computed output.

Figure 4: A single neuron net.

A key advantage of neural network systems is that these simple,
yet powerful learning procedures can be defined, allowing the
systems to adapt to their environments. The essential character of
such networks is that they map similar input patterns to similar
output patterns. This characteristic is what allows these networks
to make reasonable generalizations and perform reasonably on
patterns that have never before been presented. The similarity of
patterns in a connectionist system is determined by their overlap.
The overlap in such networks is determined outside the learning
system itself, by whatever produces the patterns. The standard
delta rule essentially implements gradient descent in sum-squared
error for linear activation functions. It should be noted that the
delta learning rule was introduced only recently for neural network
training by McClelland and Rumelhart. This rule parallels the
discrete perceptron training rule; it can also be called the
continuous perceptron training rule.
3. The delta learning rule
The delta learning rule with linear activation functions. Given
are K training pairs arranged in the training set
{(x^1, y^1), …, (x^K, y^K)}

where x^k = (x^k_1, …, x^k_n) and y^k = (y^k_1, …, y^k_m), k = 1, …, K.
• Step 1 η > 0, Emax > 0 are chosen
• Step 2 Weights wij are initialized at small random values,
k := 1, and the running error E is set to 0
• Step 3 Training starts here. Input x^k is presented, x := x^k,
y := y^k, and the output o = (o_1, …, o_m)^T is computed, where

o_i = <w_i, x> = w_i^T x

for i = 1, …, m.
• Step 4 Weights are updated:

w_{ij} := w_{ij} + η(y_i − o_i)x_j

• Step 5 Cumulative cycle error is computed by adding the
present error to E:

E := E + (1/2)||y − o||^2

• Step 6 If k < K then k := k + 1 and we continue the training
by going back to Step 3; otherwise we go to Step 7
• Step 7 The training cycle is completed. For E < Emax
terminate the training session. If E > Emax then E is set to 0
and we initiate a new training cycle by going back to Step 3
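Steps 1-7 translate directly into a short program. Here is a minimal Python
sketch for a single linear output unit; the learning rate, error bound,
cycle limit, and the use of the Exercise 1 training set below are
illustrative assumptions:

    import random

    def train_delta_linear(patterns, eta=0.1, e_max=1e-4, max_cycles=1000):
        """Delta rule with linear activation, one output unit (Steps 1-7)."""
        n = len(patterns[0][0])
        w = [random.uniform(-0.1, 0.1) for _ in range(n)]          # Step 2
        for _ in range(max_cycles):                                # new cycles
            error = 0.0                                            # running error E
            for x, y in patterns:                                  # Steps 3-6
                o = sum(wj * xj for wj, xj in zip(w, x))           # o = <w, x>
                w = [wj + eta * (y - o) * xj for wj, xj in zip(w, x)]  # Step 4
                error += 0.5 * (y - o) ** 2                        # Step 5
            if error < e_max:                                      # Step 7
                break
        return w

    # Training set of Exercise 1 below; w converges near (1, 0, 0).
    data = [([1, 1, 1], 1), ([1, 1, 0], 1), ([1, 0, 0], 1), ([0, 1, 1], 0),
            ([0, 0, 1], 0), ([0, 1, 0], 0), ([1, 0, 1], 1)]
    print(train_delta_linear(data))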
Exercise 1. Construct a single-neuron network which computes
the following function. The training set is

     x1  x2  x3   o(x1, x2, x3)
 1.   1   1   1   1
 2.   1   1   0   1
 3.   1   0   0   1
 4.   0   1   1   0
 5.   0   0   1   0
 6.   0   1   0   0
 7.   1   0   1   1
Solution 1. For example, w1 = 1 and w2 = w3 = 0.
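This can be checked directly; a quick Python verification, assuming a
linear unit o = <w, x>:

    w = [1, 0, 0]   # w1 = 1, w2 = w3 = 0
    data = [([1, 1, 1], 1), ([1, 1, 0], 1), ([1, 0, 0], 1), ([0, 1, 1], 0),
            ([0, 0, 1], 0), ([0, 1, 0], 0), ([1, 0, 1], 1)]
    print(all(sum(wj * xj for wj, xj in zip(w, x)) == y for x, y in data))  # True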
Exercise 2. The error function to be minimized is given by

E(w_1, w_2, w_3, w_4) = (1/2)[(w_1 − w_2 − w_3)^2 + (−w_1 + w_2 − w_3)^2
+ (−w_1 − w_2 − 1)^2 + (w_4 + 1)^2].

Find analytically the gradient vector

E'(w) = (∂_1 E(w), ∂_2 E(w), ∂_3 E(w), ∂_4 E(w))^T.

Find analytically the weight vector w* that minimizes the error
function, such that E'(w*) = 0. Derive the steepest descent
method for the minimization of E.
Solution 2. The gradient vector of E is

E'(w) = (3w_1 − w_2 + 1, −w_1 + 3w_2 + 1, 2w_3, w_4 + 1)^T

and

w* = (−1/2, −1/2, 0, −1)^T

is the unique solution to the equation E'(w) = 0.

The steepest descent method for the minimization of E reads

w := w − ηE'(w),

where η > 0 is the learning constant. That is,

w_1 := w_1 − η(3w_1 − w_2 + 1),
w_2 := w_2 − η(−w_1 + 3w_2 + 1),
w_3 := w_3 − η(2w_3),
w_4 := w_4 − η(w_4 + 1).
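A quick numerical check of this iteration in Python (η, the starting
point, and the iteration count are illustrative) converges to
w* = (−1/2, −1/2, 0, −1):

    def grad_E(w):
        """Gradient of E from Solution 2."""
        w1, w2, w3, w4 = w
        return [3*w1 - w2 + 1, -w1 + 3*w2 + 1, 2*w3, w4 + 1]

    w, eta = [0.0, 0.0, 0.0, 0.0], 0.1
    for _ in range(200):
        w = [wi - eta * gi for wi, gi in zip(w, grad_E(w))]
    print(w)   # approximately [-0.5, -0.5, 0.0, -1.0]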
4. The delta learning rule with semilinear activation function
The standard delta rule essentially implements gradient descent
in sum-squared error for linear activation functions.
We now describe the delta learning rule with a semilinear
activation function. For simplicity we explain the learning
algorithm in the case of a single-output network.
Toc JJ II J I Back J Doc Doc I
If we have only one output unit then the delta learn-
ing
Section 4: Therule collapses
delta learning into
rule with semilinear activation function 26
x1 w1
! = f(net)
f
xn wn
A single neuron net.
Figure 5: A single neuron net.
The output of the neuron
w := is computed
w + η(y − o)x = by
w +unipolar
ηδx sigmoidal
activation function
where δ denotes the difference 1between the desired
o(< w, x >)
and the computed =
output. .
1 + exp(−wT x)
A key
Suppose advantage
we are given theoffollowing
neural network
trainingsystems
set is that
these simple, yet powerful learning procedures can
be defined, allowing the systems to adapt to their
environments.
Toc JJ II J I Back J Doc Doc I
 No.   input values                  desired output value
 1.    x^1 = (x^1_1, …, x^1_n)       y^1
 2.    x^2 = (x^2_1, …, x^2_n)       y^2
 …     …                             …
 K.    x^K = (x^K_1, …, x^K_n)       y^K
The system first uses the input vector x^k to produce its own
output o^k and then compares this with the desired output y^k.
If we use a linear output unit then, whatever the final weight
vector is, the output function of the network is a line (in the
two-dimensional case).
Figure 6: A pattern set from an almost linear function f.

It means that the delta learning rule with linear output function
can approximate only a pattern set derived from an almost linear
function.
Figure 7: A pattern set from a nonlinear function f.

However, if the patterns are derived from a nonlinear function f,
then the chances of a good approximation are small. This is why we
use semilinear activation functions.
Let

E_k = (1/2)(y^k − o^k)^2 = (1/2)(y^k − o(<w, x^k>))^2
    = (1/2)[y^k − 1/(1 + exp(−w^T x^k))]^2

be our measure of the error on input/output pattern k, and let

E = Σ_{k=1}^K E_k = E_1 + E_2 + · · · + E_K

be our overall measure of the error.
The rule for changing weights following presentation of
input/output pair k is given by the gradient descent method, i.e.
we minimize the quadratic error function by using the following
iteration process

w := w − ηE'_k(w).
Let us now compute the gradient vector of the error function E_k
at the point w:

E'_k(w) = (d/dw)[(1/2)(y^k − 1/(1 + exp(−w^T x^k)))^2]
        = −(y^k − o^k)o^k(1 − o^k)x^k,

where o^k = 1/(1 + exp(−w^T x^k)); here we used the identity
o'(t) = o(t)(1 − o(t)) for the unipolar sigmoid. Therefore our
learning rule for w is

w := w + η(y^k − o^k)o^k(1 − o^k)x^k,

which can be written as

w := w + ηδ_k x^k,

where

δ_k = (y^k − o^k)o^k(1 − o^k).
The delta learning rule with unipolar sigmoidal activation function.
Given are K training pairs arranged in the training set
{(x^1, y^1), …, (x^K, y^K)}

where x^k = (x^k_1, …, x^k_n) and y^k ∈ R, k = 1, …, K.
• Step 1 η > 0, Emax > 0 are chosen
• Step 2 Weights w are initialized at small random values,
k := 1, and the running error E is set to 0
• Step 3 Training starts here. Input x^k is presented, x := x^k,
y := y^k, and the output o is computed:

o = o(<w, x>) = 1/(1 + exp(−w^T x))

• Step 4 Weights are updated:

w := w + η(y − o)o(1 − o)x
• Step 5 Cumulative cycle error is computed by adding the
present error to E:

E := E + (1/2)(y − o)^2
• Step 6 If k < K then k := k + 1 and we continue the training
by going back to Step 3; otherwise we go to Step 7
• Step 7 The training cycle is completed. For E < Emax
terminate the training session. If E > Emax then E is set to 0
and we initiate a new training cycle by going back to Step 3
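As with the linear case, these steps map onto a short training loop. A
minimal Python sketch for one sigmoidal output unit; η, Emax, and the
cycle limit are illustrative assumptions:

    import math, random

    def train_delta_sigmoid(patterns, eta=0.5, e_max=0.01, max_cycles=10000):
        """Delta rule with unipolar sigmoidal activation, one output unit."""
        n = len(patterns[0][0])
        w = [random.uniform(-0.1, 0.1) for _ in range(n)]       # Step 2
        for _ in range(max_cycles):
            error = 0.0
            for x, y in patterns:                               # one training cycle
                o = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, x))))
                w = [wj + eta * (y - o) * o * (1 - o) * xj      # Step 4
                     for wj, xj in zip(w, x)]
                error += 0.5 * (y - o) ** 2                     # Step 5
            if error < e_max:                                   # Step 7
                break
        return w

Note that for targets in {0, 1} the sigmoid can only approach the target
values asymptotically, so Emax should not be chosen too small.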
In this case, without hidden units, the error surface is shaped
like a bowl with only one minimum, so gradient descent is
guaranteed to find the best set of weights. With hidden units,
however, it is not so obvious how to compute the derivatives, and
the error surface is not concave upwards, so there is the danger
of getting stuck in local minima.
We illustrate the delta learning rule with the bipolar sigmoidal
activation function

f(t) = 2/(1 + exp(−t)) − 1.

Note that f'(t) = (1/2)(1 − f(t)^2), which is where the factor
(1/2)(1 − o^2) in the update rule below comes from.
Example 4.1. The delta learning rule with bipolar sigmoidal
activation function. Given are K training pairs arranged in the
training set
{(x^1, y^1), …, (x^K, y^K)}

where x^k = (x^k_1, …, x^k_n) and y^k ∈ R, k = 1, …, K.
• Step 1 η > 0, Emax > 0 are chosen
• Step 2 Weights w are initialized at small random values,
k := 1, and the running error E is set to 0
• Step 3 Training starts here. Input x^k is presented, x := x^k,
y := y^k, and the output o is computed:

o = o(<w, x>) = 2/(1 + exp(−w^T x)) − 1

• Step 4 Weights are updated:

w := w + η(y − o)(1/2)(1 − o^2)x

• Step 5 Cumulative cycle error is computed by adding the
present error to E:

E := E + (1/2)(y − o)^2
• Step 6 If k < K then k := k + 1 and we continue the training
by going back to Step 3; otherwise we go to Step 7
• Step 7 The training cycle is completed. For E < Emax
terminate the training session. If E > Emax then E is set to 0
and we initiate a new training cycle by going back to Step 3
Exercise 3. The error function to be minimized is given by

E(w_1, w_2) = (1/2)[(w_2 − w_1)^2 + (1 − w_1)^2].

Find analytically the gradient vector

E'(w) = (∂_1 E(w), ∂_2 E(w))^T.

Find analytically the weight vector w* that minimizes the error
function, such that E'(w*) = 0. Derive the steepest descent
method for the minimization of E.
Solution 3. The gradient vector of E is

E'(w) = ((w_1 − w_2) + (w_1 − 1), (w_2 − w_1))^T = (2w_1 − w_2 − 1, w_2 − w_1)^T

and w* = (1, 1)^T is the unique solution to the equation

(2w_1 − w_2 − 1, w_2 − w_1)^T = (0, 0)^T.

The steepest descent method for the minimization of E reads

(w_1(t + 1), w_2(t + 1))^T = (w_1(t), w_2(t))^T − η(2w_1(t) − w_2(t) − 1, w_2(t) − w_1(t))^T,

where η > 0 is the learning constant and t indexes the number of
iterations.
That is,

w_1(t + 1) = w_1(t) − η(2w_1(t) − w_2(t) − 1),
w_2(t + 1) = w_2(t) − η(w_2(t) − w_1(t)),

or, equivalently,

w_1 := w_1 − η(2w_1 − w_2 − 1),
w_2 := w_2 − η(w_2 − w_1).
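A quick numerical check of this iteration in Python (η, the starting
point, and the iteration count are illustrative) converges to w* = (1, 1):

    w1, w2, eta = 0.0, 0.0, 0.1
    for _ in range(500):
        w1, w2 = w1 - eta * (2 * w1 - w2 - 1), w2 - eta * (w2 - w1)
    print(w1, w2)   # both approximately 1.0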
5. The generalized delta learning rule
We now focus on generalizing the delta learning rule for
feedforward layered neural networks. The architecture of the
two-layer network considered below is shown in the figure. It has,
strictly speaking, two layers of processing neurons.
If, however, the layers of nodes are counted, then the network
can also be labeled as a three-layer network. There is no
agreement in the literature as to which approach is to be used to
describe network architectures. In this text we will use the term
layer in reference to the actual number of existing and processing
neuron layers. Layers with neurons whose outputs are not directly
accessible are called internal or hidden layers. Thus the network
of the figure is a two-layer network, which can be called a single
hidden-layer network.

The generalized delta rule is the most often used supervised
learning algorithm of feedforward multi-layer neural networks.
For simplicity we consider only a neural network with one hidden
layer and one output node.
Figure 8: Layered neural network with two continuous perceptron layers.

Figure 9: Two-layer neural network with one output node.

The measure of the error on an input/output training pattern
(x^k, y^k) is defined by

E_k(W, w) = (1/2)(y^k − O^k)^2,

where O^k is the computed output, and the overall measure of the
error is

E(W, w) = Σ_{k=1}^K E_k(W, w).
If an input vector x^k is presented to the network then it
generates the following output:

O^k = 1/(1 + exp(−W^T o^k)),

where o^k is the output vector of the hidden layer,

o^k_l = 1/(1 + exp(−w_l^T x^k)),

and w_l denotes the weight vector of the l-th hidden neuron,
l = 1, …, L.

The rule for changing weights following presentation of
input/output pair k is given by the gradient descent method, i.e.
we minimize the quadratic error function by using the following
iteration process

W := W − η ∂E_k(W, w)/∂W,
w_l := w_l − η ∂E_k(W, w)/∂w_l,

for l = 1, …, L, where η > 0 is the learning rate.
By using the chain rule for derivatives of composed functions we
get

∂E_k(W, w)/∂W = (1/2) (∂/∂W)[y^k − 1/(1 + exp(−W^T o^k))]^2
              = −(y^k − O^k)O^k(1 − O^k)o^k.

So the rule for changing the weights of the output unit is

W := W + η(y^k − O^k)O^k(1 − O^k)o^k = W + ηδ_k o^k,

that is,

W_l := W_l + ηδ_k o^k_l,

for l = 1, …, L, where we have used the notation

δ_k = (y^k − O^k)O^k(1 − O^k).
Let us now compute the partial derivative of E_k with respect to
w_l:

∂E_k(W, w)/∂w_l = −(y^k − O^k)O^k(1 − O^k)W_l o^k_l(1 − o^k_l)x^k
                = −δ_k W_l o^k_l(1 − o^k_l)x^k,

i.e. the rule for changing the weights of the hidden units is

w_l := w_l + ηδ_k W_l o^k_l(1 − o^k_l)x^k,

for l = 1, …, L. That is,

w_{lj} := w_{lj} + ηδ_k W_l o^k_l(1 − o^k_l)x^k_j,

for j = 1, …, n.
Summary. The generalized delta learning rule (error
backpropagation learning). We are given the training set

{(x^1, y^1), …, (x^K, y^K)}

where x^k = (x^k_1, …, x^k_n) and y^k ∈ R, k = 1, …, K.
• Step 1 η > 0, Emax > 0 are chosen
• Step 2 Weights w are initialized at small random values,
k := 1, and the running error E is set to 0
• Step 3 Training starts here. Input x^k is presented, x := x^k,
y := y^k, and the output O is computed:

O = 1/(1 + exp(−W^T o)),

where o is the output vector of the hidden layer, with components

o_l = 1/(1 + exp(−w_l^T x))
• Step 4 Weights of the output unit are updated:

W := W + ηδo,

where δ = (y − O)O(1 − O)
• Step 5 Weights of the hidden units are updated:

w_l := w_l + ηδW_l o_l(1 − o_l)x, for l = 1, …, L
• Step 6 Cumulative cycle error is computed by adding the
present error to E:

E := E + (1/2)(y − O)^2

• Step 7 If k < K then k := k + 1 and we continue the training
by going back to Step 3; otherwise we go to Step 8
• Step 8 The training cycle is completed. For E < Emax
terminate the training session. If E > Emax then E := 0, k := 1
and we initiate a new training cycle by going back to Step 3
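The summary above can be turned into a compact program. Here is a minimal
Python sketch of this generalized delta rule for a network with L hidden
units and one output; the data, η, Emax, and network size are illustrative
assumptions:

    import math, random

    def sigmoid(t):
        return 1.0 / (1.0 + math.exp(-t))

    def train_backprop(patterns, L=3, eta=0.5, e_max=0.01, max_cycles=10000):
        """Generalized delta rule: one hidden layer (L units), one output."""
        n = len(patterns[0][0])
        w = [[random.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(L)]
        W = [random.uniform(-0.5, 0.5) for _ in range(L)]
        for _ in range(max_cycles):
            error = 0.0
            for x, y in patterns:
                o = [sigmoid(sum(wlj * xj for wlj, xj in zip(wl, x))) for wl in w]
                O = sigmoid(sum(Wl * ol for Wl, ol in zip(W, o)))
                delta = (y - O) * O * (1 - O)
                # Step 5 uses the old output weights W_l, so update them last.
                for l in range(L):
                    w[l] = [wlj + eta * delta * W[l] * o[l] * (1 - o[l]) * xj
                            for wlj, xj in zip(w[l], x)]
                W = [Wl + eta * delta * ol for Wl, ol in zip(W, o)]  # Step 4
                error += 0.5 * (y - O) ** 2   # Step 6
            if error < e_max:                 # Step 8
                break
        return W, w

    # Illustrative usage (no bias terms, as in the text):
    data = [([0.0, 1.0], 1.0), ([1.0, 0.0], 0.0)]
    print(train_backprop(data))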
Exercise 4. Derive the backpropagation learning rule with bipolar
sigmoidal activation function

f(t) = 2/(1 + exp(−t)) − 1.
6. Effectiveness of neural networks
Funahashi (1989) showed that infinitely large neural networks
with a single hidden layer are capable of approximating all
continuous functions. Namely, he proved the following theorem.

Theorem 6.1. Let φ(x) be a nonconstant, bounded and monotone
increasing continuous function. Let K ⊂ R^n be a compact set and

f : K → R

be a real-valued continuous function on K. Then for arbitrary
ε > 0 there exists an integer N and real constants w_i, w_{ij} such
that

f̃(x_1, …, x_n) = Σ_{i=1}^N w_i φ(Σ_{j=1}^n w_{ij} x_j)

satisfies

||f − f̃||_∞ = sup_{x∈K} |f(x) − f̃(x)| ≤ ε.
In other words, any continuous mapping can be approximated in the
sense of the uniform topology on K by input-output mappings of
two-layer networks whose output functions are φ(x) for the hidden
layer and linear for the output layer.
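In code, an approximator of the form given in Theorem 6.1 looks as follows;
this Python sketch only evaluates the network (the weights and the choice
of the unipolar sigmoid for φ are illustrative, and finding weights that
achieve a given ε is the training problem treated earlier):

    import math

    def funahashi_net(x, w_out, w_hidden):
        """Evaluate f~(x) = sum_i w_i * phi(sum_j w_ij * x_j), phi = sigmoid."""
        phi = lambda t: 1.0 / (1.0 + math.exp(-t))
        return sum(wi * phi(sum(wij * xj for wij, xj in zip(row, x)))
                   for wi, row in zip(w_out, w_hidden))

    # N = 2 hidden units, n = 2 inputs; the weights are illustrative.
    print(funahashi_net([0.3, 0.7], w_out=[1.5, -0.5],
                        w_hidden=[[1.0, 2.0], [0.5, -1.0]]))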
Figure 10: Funahashi's network.

The Stone-Weierstrass theorem from classical real analysis can be
used to show that certain network architectures possess the
universal approximation capability. By employing the
Stone-Weierstrass theorem in the design of our networks, we also
guarantee that the networks can compute certain polynomial
expressions: if we are given networks exactly computing two
functions, f_1 and f_2, then a larger network can exactly compute
a polynomial expression of f_1 and f_2.
Theorem 6.2. (Stone-Weierstrass) Let domain K be a compact space
of n dimensions, and let G be a set of continuous real-valued
functions on K satisfying the following criteria:
1. The constant function f(x) = 1 is in G.
2. For any two points x_1 ≠ x_2 in K, there is an f in G such
that f(x_1) ≠ f(x_2).
3. If f_1 and f_2 are two functions in G, then f_1 f_2 and
α_1 f_1 + α_2 f_2 are also in G, for any two real numbers α_1 and α_2.
Then G is dense in C(K), the set of continuous real-valued
functions on K. In other words, for any ε > 0 and any function g
in C(K), there exists an f in G such that

||f − g||_∞ = sup_{x∈K} |f(x) − g(x)| ≤ ε.
The key to satisfying the Stone-Weierstrass theorem is to find
functions that transform multiplication into addition so that
products can be written as summations. There are at least three
generic functions that accomplish this transformation: exponential
functions, partial fractions, and step functions.
The following networks satisfy the Stone-Weierstrass theorem.
• Decaying-exponential networks
Exponential functions are basic to the process of transforming
multiplication into addition in several kinds of networks:

exp(x_1) exp(x_2) = exp(x_1 + x_2).
Let G be the set of all continuous functions that can be computed
by arbitrarily large decaying-exponential networks on domain
K = [0, 1]^n:

G = { f(x_1, …, x_n) = Σ_{i=1}^N w_i exp(−Σ_{j=1}^n w_{ij} x_j), w_i, w_{ij} ∈ R }.

Then G is dense in C(K).
• Fourier networks
• Exponentiated-function networks
• Modified logistic networks
• Modified sigma-pi and polynomial networks
Let G be the set of all continuous functions that can be computed
by arbitrarily large modified sigma-pi or polynomial networks on
domain K = [0, 1]^n:

G = { f(x_1, …, x_n) = Σ_{i=1}^N w_i Π_{j=1}^n x_j^{w_{ij}}, w_i, w_{ij} ∈ R }.

Then G is dense in C(K).
• Step functions and perceptron networks
• Partial fraction networks