MACHINE LEARNING
(20BT60501)
COURSE DESCRIPTION:
Concept learning, General to specific ordering, Decision tree
learning, Support vector machine, Artificial neural networks,
Multilayer neural networks, Bayesian learning, Instance based
learning, reinforcement learning.
Subject: MACHINE LEARNING (20BT60501)
Topic: Unit III – ARTIFICIAL NEURAL NETWORKS
Prepared By:
Dr.J.Avanija
Professor
Dept. of CSE
Sree Vidyanikethan Engineering College
Tirupati.
Unit III – ARTIFICIAL NEURAL NETWORKS
Neural network representations
Appropriate problems for neural network learning
Perceptrons
Multilayer networks and Backpropagation algorithm
Convergence and local minima
Representational power of feedforward networks
Hypothesis space search and inductive bias
Hidden layer representations, Generalization
Overfitting, Stopping criterion
An Example - Face Recognition.
What is Artificial Neural Network?
The term "Artificial Neural Network" is derived from biological
neural networks, which form the structure of the human brain. Just as
the human brain has neurons interconnected with one another,
artificial neural networks have neurons interconnected with one
another in the various layers of the network. These neurons are
known as nodes.
What is Artificial Neural Network?
Biological Neural Network    Artificial Neural Network
Dendrites                    Inputs
Cell nucleus                 Nodes
Synapse                      Weights
Axon                         Output
Artificial Neural Network Representation
Dendrites from the biological neural network correspond to inputs in
artificial neural networks, the cell nucleus corresponds to nodes, synapses
correspond to weights, and the axon corresponds to the output.
The relationship between outputs and inputs keeps changing because the
neurons in our brain are "learning."
Architecture of Artificial Neural Network
Artificial Neural Network primarily consists of three layers:
Input Layer
Hidden Layer
Output Layer
The artificial neural network takes the inputs, computes the weighted
sum of the inputs, and includes a bias.
This computation is represented in the form of a transfer function.
A Simple Neural Network
When to consider Neural Networks?
Input is high-dimensional, discrete or real-valued (e.g., raw sensor input)
Output is discrete or real-valued
Output is a vector of values
Possibly noisy data
Form of the target function is unknown
Human readability of the result is unimportant
Appropriate Problems for Neural Network
Learning
Instances are represented by many attribute-value pairs (e.g.,
the pixels of a picture)
The target function output may be discrete-valued, real-valued,
or a vector of several real- or discrete-valued attributes.
The training examples may contain errors.
Long training times are acceptable.
Fast evaluation of the learned target function may be required.
The ability for humans to understand the learned target function is
not important.
Artificial Neural Network - Components
Nodes – interconnected processing elements (units or neurons)
A neuron is connected to other neurons by connection links.
Each connection link is associated with a weight, which carries information
about the input signal.
ANN processing elements are called neurons or artificial neurons,
since they have the capability to model networks of the original neurons found in
the brain.
The internal state of a neuron is called its activation or activity level,
which is a function of the inputs the neuron receives.
A neuron can send only one signal at a time.
Artificial Neural Network - Components
X1 and X2 – input neurons
Y – output neuron
Weighted interconnection links – w1 and w2
The net input is calculated as
    y_in = x1 w1 + x2 w2
The output is obtained by applying a function over the net input:
    y = f(y_in)
The function applied over the net input is called the activation function.
ACTIVATION FUNCTIONS
To make the network work efficiently and produce the exact output, an activation is applied.
The activation function is applied over the net input to calculate the output of an ANN.
Information processing of a processing element has two major parts: input and output.
An integration function (f) is associated with the input of the processing element.
Several activation functions are available:
1. Identity function:
It is a linear function defined as
    f(x) = x for all x
The output is the same as the input.
2. Binary step function
It is defined as
    f(x) = 1 if x >= θ, 0 if x < θ
where θ represents the threshold value.
It is used in single-layer nets to convert the net input to an output that is binary (0 or 1).
contd..
3. Bipolar step function:
It is defined as
    f(x) = 1 if x >= θ, -1 if x < θ
where θ represents the threshold value.
It is used in single-layer nets to convert the net input to an output that is bipolar (+1 or -1).
4. Sigmoid function
Used in backpropagation nets. Two types:
a) Binary sigmoid function – also called the logistic sigmoid function or unipolar
sigmoid function. It is defined as
    f(x) = 1 / (1 + e^(-λx))
where λ is the steepness parameter. Its derivative is f'(x) = λ f(x)[1 - f(x)].
The range of the sigmoid function is 0 to 1.
contd..
b) Bipolar sigmoid function
It is defined as
    f(x) = (1 - e^(-λx)) / (1 + e^(-λx))
where λ is the steepness parameter, and the sigmoid range is between -1 and +1.
The derivative of this function is
    f'(x) = (λ/2) [1 + f(x)][1 - f(x)]
It is closely related to the hyperbolic tangent function, which is written as
    h(x) = (e^x - e^(-x)) / (e^x + e^(-x))
contd..
The derivative of the hyperbolic tangent function is
    h'(x) = [1 + h(x)][1 - h(x)]
5. Ramp function
It is defined as
    f(x) = 1 if x > 1, x if 0 <= x <= 1, 0 if x < 0
The graphical representation of all these functions is given in the upcoming figure.
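The activation functions listed above can be sketched directly in Python. This is a minimal illustration; the parameter names `lam` (the steepness λ) and `theta` (the threshold θ) and their default values are assumptions for the sketch.

```python
import math

def identity(x):
    return x                      # f(x) = x

def binary_step(x, theta=0.0):
    return 1 if x >= theta else 0 # output in {0, 1}

def bipolar_step(x, theta=0.0):
    return 1 if x >= theta else -1  # output in {+1, -1}

def binary_sigmoid(x, lam=1.0):
    # logistic / unipolar sigmoid, range (0, 1)
    return 1.0 / (1.0 + math.exp(-lam * x))

def bipolar_sigmoid(x, lam=1.0):
    # range (-1, +1); closely related to tanh
    return (1.0 - math.exp(-lam * x)) / (1.0 + math.exp(-lam * x))

def ramp(x):
    # clamps the input to [0, 1]
    if x > 1:
        return 1
    if x < 0:
        return 0
    return x
```

For example, `binary_sigmoid(0)` is 0.5 and `bipolar_sigmoid(0)` is 0, matching the midpoints of their ranges.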
IMPORTANT TERMINOLOGIES
Weight
The weights contain information about the input signal.
They are used by the net to solve the problem.
Weights are represented in terms of a matrix, called the connection matrix.
If the weight matrix W contains all the weight elements of an ANN, then the
set of all W matrices determines the set of all possible
information-processing configurations.
The ANN can be realized by finding an appropriate matrix W.
Weights encode long-term memory (LTM), and the activation states
of the network encode short-term memory (STM) in a neural network.
Contd..
Bias
Bias has an impact in calculating the net input.
Bias is included by adding a component x0 = 1 to the input vector x.
The net input is then calculated as
    y_in = b + Σ xi wi
The bias is of two types:
Positive bias – increases the net input
Negative bias – decreases the net input
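A one-line sketch of the net-input calculation with bias, as described above (the function name `net_input` is chosen here for illustration):

```python
def net_input(x, w, b):
    """y_in = b + sum_i x_i * w_i.
    A positive bias b increases the net input; a negative bias decreases it."""
    return b + sum(xi * wi for xi, wi in zip(x, w))
```

For inputs (1, 2) with weights (0.5, 0.25) and bias 0.1, the net input is 0.1 + 0.5 + 0.5 = 1.1.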
Contd..
Threshold
It is a set value based upon which the final output is calculated.
The calculated net input and the threshold are compared to obtain the
network output.
The activation function based on the threshold is defined as
    f(net) = 1 if net >= θ, -1 if net < θ
where θ is the fixed threshold value.
Perceptron
A perceptron unit is the basic building block of an artificial neural
network system.
A perceptron takes a vector of real-valued inputs, calculates a
linear combination of these inputs, then outputs 1 if the result
is greater than some threshold and -1 otherwise.
Given inputs x1 through xn, the output o(x1, ..., xn) computed by
the perceptron is
    o(x1, ..., xn) = 1 if w0 + w1 x1 + w2 x2 + ... + wn xn > 0, -1 otherwise
where each wi is a real-valued constant, or weight, that determines the
contribution of input xi to the perceptron output.
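The perceptron output rule above can be sketched in a few lines. Here `w[0]` plays the role of the threshold weight w0 (with an implicit input x0 = 1); the weight values in the usage example are illustrative, not from the slides.

```python
def perceptron_output(x, w):
    """o(x1..xn) = 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1.
    w[0] is the threshold weight w0; x holds the n inputs."""
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if net > 0 else -1
```

With w = [-1.5, 1, 1] the unit behaves as an AND gate over 0/1 inputs: only x = (1, 1) gives a net input above zero.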
Perceptron Training Rule
Perceptron Training Rule
One way to learn an acceptable weight vector is to begin with random
weights, then iteratively apply the perceptron to each training example,
modifying the perceptron weights whenever it misclassifies an example.
This process is repeated, iterating through the training examples
until the perceptron classifies all of them correctly.
Weights are modified at each step according to the perceptron training
rule, which updates the weight wi associated with input xi as
    wi ← wi + Δwi,  where Δwi = η (t - o) xi
Here t is the target output, o is the actual output, and η is the
learning rate.
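The iterative procedure described above can be sketched as a small training loop. This is a minimal illustration assuming ±1 targets, a threshold weight in `w[0]`, and default values for `eta` and `epochs`:

```python
def train_perceptron(examples, w, eta=0.1, epochs=100):
    """Perceptron training rule: wi <- wi + eta*(t - o)*xi.
    examples: list of (x_vector, target) with targets in {+1, -1}.
    w[0] is the threshold weight (implicit bias input x0 = 1)."""
    for _ in range(epochs):
        errors = 0
        for x, t in examples:
            net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            o = 1 if net > 0 else -1
            if o != t:                         # update only on misclassification
                errors += 1
                w[0] += eta * (t - o)          # bias input x0 = 1
                for i, xi in enumerate(x):
                    w[i + 1] += eta * (t - o) * xi
        if errors == 0:                        # converged: all examples correct
            break
    return w
```

On a linearly separable problem such as AND (with ±1 targets), the loop converges to a separating weight vector, in line with the perceptron convergence property.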
Perceptron Training Rule
Perceptron Training Example
Designing AND Gate
Consider w1=1.2, w2=0.6, Threshold =1
Learning rate n=0.5
First Input:
A=0, B=0, Target=0
Σ wi.xi = 1.2*0 + 0.6*0 = 0
Apply the activation function:
Since the value is not greater than the threshold 1, the output is 0.
The actual output is the same as the target, so no weight update is needed.
Second Input:
A=0, B=1, Target=0
Σ wi.xi = 1.2*0 + 0.6*1 = 0.6
Since the value is not greater than the threshold 1, the output is 0.
The actual output is the same as the target, so no weight update is needed.
Perceptron Training Example
Designing AND Gate
Consider w1=1.2, w2=0.6, Threshold =1
Learning rate n=0.5
Third Input:
A=1, B=0, Target=0
Σ wi.xi = 1.2*1 + 0.6*0 = 1.2
Since the value is greater than the threshold 1, the output is 1.
The actual output differs from the target, so update the weights:
wi = wi + η(t - o)xi
w1 = 1.2 + 0.5(0-1)*1 = 0.7
w2 = 0.6 + 0.5(0-1)*0 = 0.6
Perceptron Training Example
Designing AND Gate
Consider w1=0.7, w2=0.6, Threshold =1
Learning rate n=0.5
First Input:
A=0, B=0, Target=0
Σ wi.xi = 0.7*0 + 0.6*0 = 0
Since the value is not greater than the threshold 1, the output is 0.
The actual output is the same as the target, so no weight update is needed.
Second Input:
A=0, B=1, Target=0
Σ wi.xi = 0.7*0 + 0.6*1 = 0.6
Since the value is not greater than the threshold 1, the output is 0.
The actual output is the same as the target, so no weight update is needed.
Perceptron Training Example
Designing AND Gate
Consider w1=0.7, w2=0.6, Threshold =1
Learning rate n=0.5
Third Input:
A=1, B=0, Target=0
Σ wi.xi = 0.7*1 + 0.6*0 = 0.7
Since the value is not greater than the threshold 1, the output is 0.
The actual output is the same as the target, so no weight update is needed.
Perceptron Training Example
Designing AND Gate
Consider w1=0.7, w2=0.6, Threshold =1
Learning rate n=0.5
Fourth Input:
A=1, B=1, Target=1
Σ wi.xi = 0.7*1 + 0.6*1 = 1.3
Since the value is greater than the threshold 1, the output is 1.
The actual output is the same as the target, so no weight update is needed.
All four examples are now classified correctly, so training stops with
w1 = 0.7 and w2 = 0.6.
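The hand-worked passes above can be replayed programmatically. This sketch uses the same setup (w1 = 1.2, w2 = 0.6, threshold 1, learning rate 0.5, output 0/1) and repeats over the four AND-gate examples until no weight changes; the function name `train_and_gate` is chosen for illustration.

```python
def train_and_gate():
    """Replay of the worked AND-gate example: threshold unit, wi += eta*(t-o)*xi."""
    w = [1.2, 0.6]
    theta, eta = 1.0, 0.5
    data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    changed = True
    while changed:                         # repeat passes until no update occurs
        changed = False
        for x, t in data:
            net = w[0] * x[0] + w[1] * x[1]
            o = 1 if net > theta else 0    # step activation with threshold 1
            if o != t:
                changed = True
                w[0] += eta * (t - o) * x[0]
                w[1] += eta * (t - o) * x[1]
    return w
```

Running it reproduces the result worked out by hand: only the third input of the first pass triggers an update, and training settles at w1 = 0.7, w2 = 0.6.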
Gradient Descent and Delta Rule
The perceptron rule finds a successful weight vector when the training
examples are linearly separable, but fails to converge if the examples are
not linearly separable.
The delta rule can be used to train on examples that are not linearly
separable.
The delta rule converges toward a best-fit approximation to the target
concept even when the training examples are not linearly separable.
The delta rule uses gradient descent to search the hypothesis space of
possible weight vectors and finds the weights that best fit the training examples.
This rule is important because gradient descent provides the basis for the
Backpropagation algorithm, which can learn networks with many
interconnected units.
Gradient descent can also serve as the basis for learning algorithms that
search through hypothesis spaces containing many different types of
continuously parameterized hypotheses.
Gradient Descent and Delta Rule
Visualizing the hypothesis space
Gradient Descent and Delta Rule
The delta rule is best understood by considering the task of training an
unthresholded perceptron, i.e., a linear unit, for which the output o is given by
    o(x) = w · x
Thus a linear unit corresponds to the first stage of a perceptron, without the
threshold.
To derive a weight learning rule for linear units, begin by specifying a
measure for the training error of a hypothesis (weight vector) relative to
the training examples.
Although there are many ways to define this error, one common measure is
    E(w) = (1/2) Σ_{d∈D} (td - od)²
where D is the set of training examples, td is the target output for training
example d, and od is the output of the linear unit for training example d.
Derivation of Gradient Descent Rule
How can we find the direction of steepest descent along the error surface?
This direction can be found by computing the derivative of E with respect to
each component of the vector w.
This vector derivative is called the gradient of E with respect to w,
written ∇E(w):
    ∇E(w) = [∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn]
Since the gradient specifies the direction of steepest increase of E, the
training rule for gradient descent is
    w ← w + Δw,  where Δw = -η ∇E(w)
Here η is a positive constant called the learning rate, which determines the
step size in the gradient descent search. The negative sign is present because
we want the weight vector to move in the direction that decreases E.
Derivation of Gradient Descent Rule
This training rule can also be written in its component form:
    wi ← wi + Δwi,  where Δwi = -η ∂E/∂wi
Derivation of Gradient Descent Rule
Differentiating E from the definition above gives
    ∂E/∂wi = ∂/∂wi (1/2) Σ_d (td - od)²
           = Σ_d (td - od) ∂/∂wi (td - w · xd)
           = Σ_d (td - od)(-xid)
where xid denotes the single input component xi for training example d.
Gradient Descent Algorithm
Gradient descent and the delta rule can be used to learn from data that is
not linearly separable.
Weights are updated using the rule
    wi ← wi + Δwi,  where Δwi = η Σ_d (td - od) xid
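The batch delta rule above can be sketched for a linear unit. This is a minimal illustration; the function name, the default `eta`, and the epoch count are assumptions chosen so the loop settles on noise-free data:

```python
def gradient_descent_linear(examples, w, eta=0.05, epochs=500):
    """Batch delta rule for a linear unit o = w . x (w[0] is the bias weight,
    with implicit input x0 = 1). Per epoch: delta_wi = eta * sum_d (t_d - o_d) * x_id."""
    for _ in range(epochs):
        delta = [0.0] * len(w)
        for x, t in examples:
            xs = (1.0,) + tuple(x)          # prepend bias input x0 = 1
            o = sum(wi * xi for wi, xi in zip(w, xs))
            for i, xi in enumerate(xs):
                delta[i] += eta * (t - o) * xi   # accumulate over all examples
        w = [wi + di for wi, di in zip(w, delta)]
    return w
```

On examples drawn exactly from t = 2x + 1, the weights converge toward the bias 1 and slope 2, since that weight vector minimizes E.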
Gradient Descent Algorithm
Gradient Descent Algorithm
Gradient descent is an important general paradigm for learning.
It is a strategy for searching through a large or infinite
hypothesis space that can be applied whenever
The hypothesis space contains continuously parameterized hypotheses
(e.g., the weights in a linear unit), and
The error can be differentiated with respect to these hypothesis parameters.
Gradient Descent Algorithm
The key practical difficulties in applying gradient descent are:
Converging to a local minimum can be slow (it can require many
thousands of gradient descent steps).
If there are multiple local minima in the error surface, there is no
guarantee that the procedure will find the global minimum.
Differences between the Delta Rule and the Perceptron Learning Rule
The error (t - o) in the delta rule is not restricted to the values 0, 1, and -1
as in the perceptron learning rule, but may take any value.
The delta rule can be derived for any differentiable output/activation
function f, whereas the perceptron learning rule works only for a threshold
output function.
Multilayer Networks
Multilayer networks learned by the Backpropagation algorithm are capable of
expressing a rich variety of nonlinear decision surfaces.
A multilayer network can learn using the gradient descent algorithm,
provided a differentiable threshold unit is used.
Multiple layers of linear units still produce only linear functions.
Perceptrons have a discontinuous threshold, which is not differentiable and
therefore unsuitable for gradient descent.
Multilayer Networks
What is required is a unit whose output is a nonlinear, differentiable
function of its inputs. One solution is the sigmoid unit.
Like the perceptron, it first computes a linear combination of its inputs and
then applies a threshold to the result. But the thresholded output is a
continuous function of its input, ranging from 0 to 1. It is often referred
to as a squashing function.
The sigmoid unit computes its output as
    o = σ(w · x),  where σ(y) = 1 / (1 + e^(-y))
Backpropagation Algorithm
The Backpropagation algorithm learns the weights for a multilayer network,
given a network with a fixed set of units and interconnections.
It uses gradient descent to minimize the error between the network output
values and the target values.
Redefine the error E by summing the errors over all of the network output units:
    E(w) = (1/2) Σ_{d∈D} Σ_{k∈outputs} (tkd - okd)²
where outputs is the set of output units in the network, and tkd and okd are
the target and output values associated with the kth output unit and training
example d.
The learning problem faced by Backpropagation is to search the large
hypothesis space defined by all possible weight values for all units in the
network.
Gradient descent can be used to find a hypothesis that minimizes E.
Backpropagation Algorithm
1. Inputs X arrive through the preconnected path.
2. The input is modeled using real weights W. The weights are usually
selected randomly.
3. Calculate the output of every neuron, from the input layer through the
hidden layers to the output layer.
4. Calculate the error in the outputs.
5. Travel back from the output layer to the hidden layers, adjusting the
weights so that the error decreases.
6. Repeat until the desired output is achieved.
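The six steps above can be sketched for a small one-hidden-layer network with a single sigmoid output unit. This is a minimal illustration, not the exact network of the slides; the layer size, learning rate, epoch count, and random seed are assumptions.

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def predict(W1, W2, x):
    # forward pass: input -> hidden (sigmoid) -> output (sigmoid)
    h = [sigmoid(sum(w * xi for w, xi in zip(row, list(x) + [1.0]))) for row in W1]
    return sigmoid(sum(w * hi for w, hi in zip(W2, h + [1.0])))

def train_backprop(examples, n_hidden=3, eta=0.5, epochs=5000, seed=0):
    """Steps 1-6 above for a one-hidden-layer net with one sigmoid output.
    The last entry of each weight vector is a bias weight (input fixed at 1)."""
    rng = random.Random(seed)
    n_in = len(examples[0][0])
    # step 2: small random initial weights
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    for _ in range(epochs):                      # step 6: repeat
        for x, t in examples:                    # step 1: inputs arrive
            xs = list(x) + [1.0]
            # step 3: compute the output of every neuron, layer by layer
            h = [sigmoid(sum(w * xi for w, xi in zip(row, xs))) for row in W1]
            hs = h + [1.0]
            o = sigmoid(sum(w * hi for w, hi in zip(W2, hs)))
            # step 4: error term of the output unit
            delta_o = o * (1 - o) * (t - o)
            # step 5: propagate the error back and adjust the weights
            delta_h = [h[j] * (1 - h[j]) * W2[j] * delta_o for j in range(n_hidden)]
            for j in range(n_hidden + 1):
                W2[j] += eta * delta_o * hs[j]
            for j in range(n_hidden):
                for i, xi in enumerate(xs):
                    W1[j][i] += eta * delta_h[j] * xi
    return W1, W2
```

Trained on the AND-gate examples with 0/1 targets, the network's output ends up on the correct side of 0.5 for all four inputs.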
Backpropagation Algorithm
Backpropagation Algorithm
Backpropagation Algorithm
[Figure: an example multilayer network with an input layer, hidden layers,
and an output layer; input links carry weights such as w14 and w15, and the
output units produce O6 and O7.]
Backpropagation Algorithm
Backpropagation Algorithm
Backpropagation Algorithm Example
Assume that each neuron has a sigmoid activation function. Perform a
forward pass and a backward pass on the network, assuming that the target
output y is 0.5 and the learning rate η is 1. Then perform another forward pass.
Backpropagation Algorithm Example
Forward Pass:
Compute output for y3,y4,y5
Target y = 0.5
Backpropagation Algorithm Example
Each weight is changed by
    Δwji = η δj xji
where η is the learning rate and δj is the error term of unit j, derived
from the error measure.
Backpropagation Algorithm Example
Backpropagation Algorithm Example
Backpropagation Algorithm Example
Similarly update all other weights
Backpropagation Algorithm Example
Forward Pass:
Compute output for y3,y4,y5
Adding Momentum
Backpropagation is a widely used algorithm and has many variations. A
common variation is to alter the weight update rule so that the weight
update on the nth iteration depends partially on the update from the
(n-1)th iteration:
    Δwji(n) = η δj xji + α Δwji(n-1)
Here α (0 ≤ α < 1) is the momentum term.
Momentum helps to speed up convergence and to roll through small local
minima on the error surface.
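The momentum update above can be sketched as a small helper that carries the previous step (the "velocity") alongside the weights. The function name and defaults are illustrative; `grad_step` stands for the δj·xji terms from backpropagation:

```python
def momentum_update(w, grad_step, velocity, eta=0.1, alpha=0.9):
    """delta_w(n) = eta * grad_step + alpha * delta_w(n-1).
    velocity holds delta_w(n-1); returns the new weights and new velocity."""
    new_v = [eta * g + alpha * v for g, v in zip(grad_step, velocity)]
    new_w = [wi + dv for wi, dv in zip(w, new_v)]
    return new_w, new_v
```

With a constant gradient term, each successive step grows (0.1, then 0.1 + 0.9·0.1 = 0.19, ...), which is exactly the "rolling ball" acceleration that momentum provides.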
Derivation of the Backpropagation Rule
To derive the equation for updating weights in the Backpropagation
algorithm, the gradient descent rule is used.
Stochastic gradient descent iterates through the training examples one at a
time; for each training example d it descends the gradient of the error Ed
with respect to that example.
For each training example d, every weight wji is updated by adding to it Δwji:
    Δwji = -η ∂Ed/∂wji
Derivation of the Backpropagation Rule
The error on training example d is
    Ed(w) = (1/2) Σ_{k∈outputs} (tk - ok)²
where outputs is the set of output units in the network, tk is the target
value of unit k for training example d, and ok is the output of unit k on
training example d.
Notations used
xji = the ith input to unit j
wji = the weight associated with the ith input to unit j
netj = Σi wji xji (the weighted sum of inputs for unit j)
oj = the output computed by unit j
tj = the target output for unit j
σ = the sigmoid function
outputs = the set of units in the final layer of the network
Downstream(j) = the set of units whose immediate inputs include the output of unit j
Derivation of the Backpropagation Rule
Notice that weight wji can influence the rest of the network only through
netj. Therefore, we can use the chain rule:
    ∂Ed/∂wji = (∂Ed/∂netj)(∂netj/∂wji) = (∂Ed/∂netj) xji
Derivation of the Backpropagation Rule
To derive a convenient expression for ∂Ed/∂netj, we consider two cases:
Case 1, where unit j is an output unit of the network
Case 2, where unit j is an internal (hidden) unit of the network
Derivation of the Backpropagation Rule
Case 1, where unit j is an output unit of the network:
netj can influence the network only through oj, so we can use the chain rule:
    ∂Ed/∂netj = (∂Ed/∂oj)(∂oj/∂netj)
Derivation of the Backpropagation Rule
Case 1, training rule for output unit weights:
    ∂Ed/∂oj = -(tj - oj)  and  ∂oj/∂netj = oj(1 - oj)
so
    ∂Ed/∂netj = -(tj - oj) oj (1 - oj)
and the weight update is
    Δwji = η δj xji,  with  δj = (tj - oj) oj (1 - oj)
Derivation of the Backpropagation Rule
Case 2, training rule for hidden unit weights:
For a hidden unit j, netj influences the network outputs only through the
units in Downstream(j). Therefore
    ∂Ed/∂netj = Σ_{k∈Downstream(j)} (∂Ed/∂netk)(∂netk/∂oj)(∂oj/∂netj)
              = Σ_{k∈Downstream(j)} (-δk) wkj oj(1 - oj)
giving
    δj = oj (1 - oj) Σ_{k∈Downstream(j)} δk wkj
and the weight update
    Δwji = η δj xji
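The derived rules can be sanity-checked numerically: on a tiny 2-2-1 sigmoid network, the gradient implied by the δ terms should match a central finite-difference estimate of ∂Ed/∂wji. All weights and inputs below are made-up illustrative values:

```python
import math

def check_backprop_gradient():
    """Finite-difference check of the derived rules:
    output unit: delta_k = o_k(1-o_k)(t_k-o_k)
    hidden unit: delta_j = o_j(1-o_j) * sum_k delta_k * w_kj
    and dEd/dw_ji = -delta_j * x_ji, for Ed = (1/2) sum_k (t_k - o_k)^2."""
    sig = lambda y: 1.0 / (1.0 + math.exp(-y))
    x, t = [0.4, -0.3], 0.8
    w_h = [[0.2, -0.1], [0.05, 0.3]]   # hidden-layer weights (no bias, for brevity)
    w_o = [0.5, -0.4]                  # output-layer weights

    def forward(wh, wo):
        h = [sig(sum(w * xi for w, xi in zip(row, x))) for row in wh]
        o = sig(sum(w * hi for w, hi in zip(wo, h)))
        return h, o

    h, o = forward(w_h, w_o)
    delta_o = o * (1 - o) * (t - o)
    delta_h = [h[j] * (1 - h[j]) * w_o[j] * delta_o for j in range(2)]
    analytic = -delta_h[0] * x[0]      # dEd/dw for hidden weight w_h[0][0]

    eps = 1e-6                         # central-difference estimate of the same derivative
    E = lambda wh: 0.5 * (t - forward(wh, w_o)[1]) ** 2
    w_plus = [[w_h[0][0] + eps, w_h[0][1]], w_h[1]]
    w_minus = [[w_h[0][0] - eps, w_h[0][1]], w_h[1]]
    numeric = (E(w_plus) - E(w_minus)) / (2 * eps)
    return analytic, numeric
```

The two values agree to many decimal places, confirming that the hidden-unit δ formula is the correct gradient of Ed.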