Multi Layer Perceptron
Tirtharaj Dash
Perceptron I
Table 1: Exclusive–OR
x1 x2 Class
0 0 0
0 1 1
1 0 1
1 1 0
- A similar table can also be made if we consider the inputs to be bipolar (+1, −1).
- It is quite easy to verify that the XOR problem is not linearly separable (see the sketch below).
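One quick way to see this (not from the slides; an illustrative brute-force check in Python, with an arbitrary weight grid) is to sweep candidate weights (w0, w1, w2) of a hard-threshold perceptron and observe that no setting classifies all four XOR patterns correctly:

import itertools
import numpy as np

# The four XOR patterns and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 1, 1, 0])

def perceptron(x, w0, w1, w2):
    # Hard-threshold unit: fires iff w1*x1 + w2*x2 + w0 > 0.
    return int(w1 * x[0] + w2 * x[1] + w0 > 0)

# Sweep a coarse grid of weights; if XOR were linearly separable,
# some setting in a wide enough grid would classify all 4 patterns.
grid = np.arange(-2.0, 2.01, 0.1)
solutions = [
    (w0, w1, w2)
    for w0, w1, w2 in itertools.product(grid, repeat=3)
    if all(perceptron(x, w0, w1, w2) == y for x, y in zip(X, t))
]
print("weight settings that solve XOR:", len(solutions))   # prints 0

A finite grid is of course not a proof; the inequalities on the next slide show the contradiction exactly.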
Perceptron IV
Assuming a single threshold unit with weights w1, w2 and bias w0 could realise XOR, the four patterns would require

0 × w1 + 0 × w2 + w0 ≤ 0  ⟹  w0 ≤ 0             (3)
0 × w1 + 1 × w2 + w0 > 0  ⟹  w0 > −w2            (4)
1 × w1 + 0 × w2 + w0 > 0  ⟹  w0 > −w1            (5)
1 × w1 + 1 × w2 + w0 ≤ 0  ⟹  w0 ≤ −(w1 + w2)     (6)

Adding (4) and (5) gives 2w0 > −(w1 + w2), while adding (3) and (6) gives 2w0 ≤ −(w1 + w2); the requirements contradict each other, so no such weights exist.
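By contrast, one hidden layer resolves the contradiction. The following sketch (hand-picked weights, purely illustrative) computes XOR as OR minus AND using two hidden threshold units and one output threshold unit:

import numpy as np

def step(v):
    # Hard threshold: 1 if v > 0 else 0 (applied elementwise).
    return (v > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden unit h1 computes OR  (fires if x1 + x2 - 0.5 > 0),
# hidden unit h2 computes AND (fires if x1 + x2 - 1.5 > 0).
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-0.5, -1.5])
h = step(X @ W_hidden + b_hidden)

# Output unit computes h1 AND (NOT h2), i.e. OR minus AND = XOR.
w_out = np.array([1.0, -1.0])
b_out = -0.5
y = step(h @ w_out + b_out)
print(y)   # [0 1 1 0]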
In the MLP error measure E(n) = (1/2) Σk∈C ek²(n), C is the set of all neurons in the output layer. With the training dataset consisting of N examples, the error energy averaged over the dataset is Eav = (1/N) Σn E(n), the sum running over the N examples.
MLP VII
- Similarly,
  ∂ej(n)/∂yj(n) = −1    (17)
- Differentiating yj(n) = fj(vj(n)) w.r.t. vj(n) gives
  ∂yj(n)/∂vj(n) = fj′(vj(n))    (18)
- Finally,
  ∂vj(n)/∂wji(n) = yi(n)    (19)
- So, we can write
  ∂E(n)/∂wji(n) = −ej(n) fj′(vj(n)) yi(n)    (20)
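As a sanity check of (20) (not part of the slides), the sketch below takes a single sigmoid output unit with E(n) = (1/2) ej²(n) and compares the analytic gradient −ej(n) fj′(vj(n)) yi(n) with a finite-difference estimate; all data and variable names are illustrative.

import numpy as np

rng = np.random.default_rng(0)
y_prev = rng.normal(size=3)        # yi(n): outputs feeding output node j (illustrative)
w = rng.normal(size=3)             # wji(n): weights into output node j
d = 1.0                            # desired response dj(n)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def error_energy(w):
    # E(n) = 1/2 * ej(n)^2 with ej(n) = dj(n) - yj(n) and yj(n) = sigmoid(vj(n)).
    return 0.5 * (d - sigmoid(w @ y_prev)) ** 2

# Analytic gradient from (20): -ej(n) * fj'(vj(n)) * yi(n), using
# fj'(v) = sigmoid(v) * (1 - sigmoid(v)) for the logistic activation.
y = sigmoid(w @ y_prev)
e = d - y
grad_analytic = -e * y * (1.0 - y) * y_prev

# Central finite-difference estimate of the same partial derivatives.
eps = 1e-6
grad_numeric = np.array([
    (error_energy(w + eps * np.eye(3)[i]) - error_energy(w - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-8))   # True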
Back Propagation with SGD V
- The correction ∆wji(n) applied to wji(n) is defined by the delta rule:
  ∆wji(n) = −η ∂E(n)/∂wji(n)    (21)
  where η is the learning-rate parameter of the back-propagation algorithm. Recall that in gradient descent we take a step opposite to the direction of the gradient; the minus sign in the delta rule signifies the same thing in weight space.
- Now, combining (20) and (21), we can write
  ∆wji(n) = η ej(n) fj′(vj(n)) yi(n)
- The term on the right, ej(n) fj′(vj(n)), can be written as δj(n), which is called the local gradient at the jth node; the update then reads ∆wji(n) = η δj(n) yi(n) (see the sketch below).
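A tiny usage sketch of the delta rule (illustrative data, arbitrary learning rate): repeated updates ∆wji(n) = η δj(n) yi(n) on a single sigmoid output unit, printing the instantaneous error at each step.

import numpy as np

rng = np.random.default_rng(1)
y_prev = rng.normal(size=3)          # yi(n), held fixed for this toy example
w = rng.normal(size=3)               # wji(n)
d, eta = 1.0, 0.5                    # desired response and learning rate (arbitrary)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

for n in range(5):
    y = sigmoid(w @ y_prev)          # yj(n)
    e = d - y                        # ej(n)
    delta = e * y * (1.0 - y)        # local gradient: delta_j(n) = ej(n) fj'(vj(n))
    w = w + eta * delta * y_prev     # delta rule: dw_ji(n) = eta * delta_j(n) * yi(n)
    print(f"step {n}: E(n) = {0.5 * e ** 2:.5f}")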
E(n) = (1/2) Σk∈C ek²(n)    (26)
- From the chain rule for partial derivatives, ∂ek(n)/∂yj(n) in the above equation can be expanded as ∂ek(n)/∂vk(n) · ∂vk(n)/∂yj(n), where
  ∂vk(n)/∂yj(n) = wkj(n)    (32)
Back Propagation with SGD XI
- Coming back to the term ∂E(n)/∂yj(n), we can now write
  ∂E(n)/∂yj(n) = −Σk ek(n) fk′(vk(n)) wkj(n)    (33)
               = −Σk δk(n) wkj(n)    (34)
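To make (33)–(34) concrete, here is a sketch of back propagation through one hidden layer (sigmoid activations, illustrative shapes and names, not taken from the slides): the hidden-node local gradient is δj(n) = fj′(vj(n)) Σk δk(n) wkj(n), and the resulting ∂E(n)/∂wji(n) = −δj(n) yi(n) is checked against finite differences.

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=4)                 # inputs yi(n) feeding the hidden layer
W1 = rng.normal(size=(3, 4))           # weights wji(n) into the 3 hidden nodes
W2 = rng.normal(size=(2, 3))           # weights wkj(n) into the 2 output nodes
d = np.array([1.0, 0.0])               # desired responses dk(n)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def error_energy(W1):
    y_hidden = sigmoid(W1 @ x)
    y_out = sigmoid(W2 @ y_hidden)
    return 0.5 * np.sum((d - y_out) ** 2)           # E(n) = 1/2 * sum_k ek(n)^2

# Forward pass
v1 = W1 @ x
y1 = sigmoid(v1)
v2 = W2 @ y1
y2 = sigmoid(v2)
e = d - y2

# Local gradients: output layer first, then the hidden layer via (34)
delta_out = e * y2 * (1.0 - y2)                     # delta_k(n) = ek(n) fk'(vk(n))
delta_hid = (W2.T @ delta_out) * y1 * (1.0 - y1)    # delta_j(n) = fj'(vj(n)) * sum_k delta_k(n) wkj(n)

# From the delta rule, dE(n)/dwji(n) = -delta_j(n) * yi(n) for the hidden-layer weights.
grad_analytic = -np.outer(delta_hid, x)

# Check against central finite differences.
eps = 1e-6
grad_numeric = np.zeros_like(W1)
for j in range(W1.shape[0]):
    for i in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[j, i] += eps
        Wm[j, i] -= eps
        grad_numeric[j, i] = (error_energy(Wp) - error_energy(Wm)) / (2 * eps)
print(np.allclose(grad_analytic, grad_numeric, atol=1e-8))    # True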
- In complexity regularization, the cost function to be minimized takes the form J(w, b) = Jav(w, b) + λ Jc(w, b).
- The term Jav(w, b) is the average error measure, such as the MSE, whose evaluation extends over the output neurons of the network and is carried out for all the training examples on an epoch-by-epoch basis.
- The second term, Jc(w, b), is a complexity penalty, where the notion of complexity is measured in terms of the network (weights w, b) alone; its inclusion imposes on the solution prior knowledge that we may have about the models being considered.
Complexity regularization III
- For the present situation, λ can be regarded as a regularization parameter, which represents the relative importance of the complexity-penalty term with respect to the performance-metric term.
- When λ is zero, the back-propagation learning process is unconstrained, with the network being completely determined from the training examples.
- When λ is made infinitely large, on the other hand, the implication is that the constraint imposed by the complexity penalty is by itself sufficient to specify the network, which is another way of saying that the training examples are unreliable.
- In practical applications of complexity regularization, the regularization parameter λ is assigned a value somewhere between these two limiting cases (the sketch below illustrates the two extremes).
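The two limiting behaviours can be seen even in a linear stand-in for the network (ridge regression; this is not what the slides use, but its cost has the same Jav + λ Jc shape and a closed-form minimizer): with λ = 0 the weights are fixed entirely by the training data, and as λ grows the penalty dominates and drives the weights toward zero.

import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(50, 5))                    # toy design matrix
y = A @ rng.normal(size=5) + 0.1 * rng.normal(size=50)

for lam in [0.0, 1.0, 1e6]:
    # Minimizer of ||A w - y||^2 + lam * ||w||^2 (linear stand-in for the
    # regularized cost; a closed form exists only in this linear case).
    w = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ y)
    train_err = np.mean((A @ w - y) ** 2)
    print(f"lambda={lam:>9.1f}  ||w||={np.linalg.norm(w):.4f}  train MSE={train_err:.4f}")
# lambda = 0: the solution is fixed entirely by the training data.
# lambda very large: the penalty dominates and shrinks w toward 0.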
Complexity regularization IV
- In its simplest form, the term Jc(w, b) could be written as the squared norm of the weights, Jc(w, b) = ‖w‖² = Σi wi².
- Recall the plain gradient-descent update for a weight Wi:
  Wi = Wi − η ∂J/∂Wi    (37)
Example of weight decay II
- If η is large, you will have a correspondingly large modification of the weights Wi (at least this much we know).
- In order to effectively limit the number of free parameters in our MLP model, so as to avoid overfitting, it is possible to regularize the cost function by changing it to something like
  Jreg(W) = J(W) + (λ/2) ‖W‖²
- Applying gradient descent to this new cost function, we obtain
  Wi = Wi − η ∂J/∂Wi − η λ Wi
- The new term −η λ Wi coming from the regularization causes each weight to decay in proportion to its own size, as the check below illustrates.
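A small numeric check of this equivalence (illustrative quadratic base cost J(W) = (1/2)‖AW − b‖², arbitrary η and λ): one gradient step on the regularized cost is the same as shrinking the weights by the factor (1 − ηλ) and then taking the plain gradient step.

import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
W = rng.normal(size=5)                 # current weight vector (illustrative)
eta, lam = 0.1, 0.01                   # learning rate and regularization strength (arbitrary)

def grad_J(W):
    # Gradient of the illustrative base cost J(W) = 1/2 * ||A W - b||^2.
    return A.T @ (A @ W - b)

# One gradient step on the regularized cost J(W) + (lam/2) * ||W||^2 ...
W_reg = W - eta * (grad_J(W) + lam * W)

# ... is identical to shrinking W by (1 - eta*lam) and applying the plain step.
W_decay = (1 - eta * lam) * W - eta * grad_J(W)
print(np.allclose(W_reg, W_decay))     # True: L2 regularization acts as weight decay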
Network committee (towards generalization) I
- Suppose we train N networks separately on the same task. The output of the ith network can be written as the underlying function plus an error term:
  fi(X) = f(X) + εi(X)    (38)
Network committee (towards generalization) III
- The expected squared error of the ith network is then
  Ei = E[(fi(X) − f(X))²]    (39)
  Ei = E[εi²(X)]    (40)
- By the Cauchy–Schwarz inequality, with the sums running over the N networks (i = 1, …, N),
  (Σi εi)² ≤ N Σi εi²    (52)
- Therefore, the error of the committee (the average of the N networks) is bounded by the average of the individual errors:
  EC ≤ Eav    (53)
- We should note that, at a marginal increase in computational cost, we obtain some reduction in error due to the reduced variance achieved by averaging over different networks (illustrated by the simulation below).
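An illustrative simulation of the bound EC ≤ Eav (synthetic data; the individual "networks" are simulated as the true function plus independent noise, standing in for separately trained models): the mean-squared error of the averaged prediction is never worse than the average of the individual errors.

import numpy as np

rng = np.random.default_rng(4)
X = np.linspace(0, 1, 200)
f_true = np.sin(2 * np.pi * X)                  # the underlying function f(X)

N = 10                                          # number of committee members
# Each "network" fi(X) = f(X) + eps_i(X); here eps_i is simulated noise
# standing in for the error a separately trained network would make.
errors = rng.normal(scale=0.3, size=(N, X.size))
f_members = f_true + errors

E_individual = np.mean((f_members - f_true) ** 2, axis=1)   # Ei for each network
E_av = E_individual.mean()                                  # average individual error
f_committee = f_members.mean(axis=0)                        # committee output
E_C = np.mean((f_committee - f_true) ** 2)                  # committee error

print(f"E_av = {E_av:.4f}, E_C = {E_C:.4f}, E_C <= E_av: {E_C <= E_av}")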
Network committee (towards generalization) VIII