Introduction to deep learning, Deep
feed forward network
Mr. Sivadasan E T
Associate Professor
Vidya Academy of Science and Technology, Thrissur
Deep feed forward network
Deep feedforward networks, also often called
feedforward neural networks, or multilayer perceptrons
(MLPs), are the quintessential deep learning models.
The goal of a feedforward network is to approximate some
function f ∗. For example, for a classifier, y = f ∗(x) maps an
input x to a category y.
Deep feed forward network
A feedforward network defines a mapping
y = f (x; θ)
and learns the value of the parameters θ that result in the
best function approximation.
These models are called feedforward because information flows
through the function being evaluated from x, through the
intermediate computations used to define f , and finally to the
output y.
Deep feed forward network
Feedforward networks are of extreme importance to machine
learning practitioners.
They form the basis of many important commercial
applications.
For example, the convolutional networks used for object
recognition from photos are a specialized kind of feedforward
network.
Deep feed forward network
Feedforward networks are a conceptual stepping stone on
the path to recurrent networks, which power many natural
language applications.
Feedforward neural networks are called networks because
they are typically represented by composing together many
different functions.
Deep feed forward network
For example,
we might have three functions f^(1), f^(2), and f^(3)
connected in a chain, to form
f(x) = f^(3)(f^(2)(f^(1)(x))).
These chain structures are the most commonly used structures
of neural networks.
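As a rough sketch of this chained structure (a minimal NumPy example; the layer sizes, random weights, and tanh activations are assumptions for illustration, not part of the slides):

```python
import numpy as np

def make_layer(W, b):
    """Return a layer function h -> tanh(W h + b)."""
    return lambda h: np.tanh(W @ h + b)

rng = np.random.default_rng(0)

# Three layer functions with assumed sizes 4 -> 5 -> 5 -> 3.
f1 = make_layer(rng.normal(size=(5, 4)), np.zeros(5))
f2 = make_layer(rng.normal(size=(5, 5)), np.zeros(5))
f3 = make_layer(rng.normal(size=(3, 5)), np.zeros(3))

def f(x):
    # The chain f(x) = f^(3)(f^(2)(f^(1)(x))).
    return f3(f2(f1(x)))

x = rng.normal(size=4)
print(f(x))  # output of the three-layer chain
```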
Deep feed forward network
In this case, f^(1) is called the first layer of the network,
f^(2) is called the second layer, and so on.
Deep feed forward network
The overall length of the chain gives the depth of the model.
It is from this terminology that the name “deep learning”
arises.
The final layer of a feedforward network is called the output
layer.
Deep feed forward network
The behavior of the other layers is not directly specified by the
training data.
Because the training data does not show the desired output for
each of these layers, these layers are called hidden layers.
Deep feed forward network
Finally, these networks are called neural because they are
loosely inspired by neuroscience.
Each hidden layer of the network is typically vector-valued.
The dimensionality of these hidden layers determines the
width of the model.
Initialization
Initialization is particularly important in neural networks
because of the stability issues associated with neural network
training.
Neural networks often exhibit stability problems in the sense
that the activations of each layer either become successively
weaker or successively stronger.
Initialization
The effect is exponentially related to the depth of the network,
and is therefore particularly severe in deep networks.
One way of ameliorating this effect to some extent is to choose
good initialization points in such a way that the gradients are
stable across the different layers.
Initialization
One possible approach to initialize the weights is to generate
random values from a Gaussian distribution with zero mean
and a small standard deviation, such as 10⁻².
Typically, this will result in small random values that are both
positive and negative.
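A minimal sketch of this naive scheme, assuming NumPy and a hypothetical fully connected layer with n_in inputs and n_out outputs:

```python
import numpy as np

def naive_gaussian_init(n_in, n_out, std=1e-2, seed=0):
    """Draw every weight from N(0, std^2); biases start at zero."""
    rng = np.random.default_rng(seed)
    W = rng.normal(loc=0.0, scale=std, size=(n_out, n_in))
    b = np.zeros(n_out)
    return W, b

W, b = naive_gaussian_init(n_in=100, n_out=50)
print(W.mean(), W.std())  # close to 0 and 0.01
```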
Initialization
One problem with this initialization is that it is not sensitive to
the number of inputs to a specific neuron.
For example, if one neuron has only 2 inputs and another has 100
inputs, the output of the latter is far more sensitive to the
average weight because of the additive effect of more inputs
(which shows up as a much larger gradient).
Initialization
Example,
1. Neuron A with 2 Inputs:
Suppose this neuron has only two input
connections.
The output of this neuron depends heavily on each individual
weight, because only two inputs contribute to the output.
Any small change in a single weight can significantly affect
the neuron's output.
Initialization
2. Neuron B with 100 Inputs:
This neuron has 100 input connections.
The effect of each individual weight on the output diminishes
because the contributions of all 100 inputs are combined; even if
a few weights change, the impact on the neuron's output is less
pronounced due to this averaging effect. At the same time,
because 100 weighted inputs are summed, the overall variance of
the output is much larger than for Neuron A.
Initialization
In general, it can be shown that the variance of the outputs
scales linearly with the number of inputs, and therefore the
standard deviation scales with the square root of the number of
inputs.
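The small NumPy experiment below (with unit-variance inputs and a fixed weight scale chosen purely as assumptions) illustrates this: the variance of a neuron's pre-activation grows roughly in proportion to the number of inputs r, so its standard deviation grows like sqrt(r).

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 10_000  # number of independent neurons simulated per setting

for r in (2, 100, 1000):
    W = rng.normal(scale=1e-2, size=(n_trials, r))  # weights ~ N(0, 0.01^2)
    x = rng.normal(size=(n_trials, r))              # unit-variance inputs (assumption)
    pre_activation = (W * x).sum(axis=1)            # sum of r weighted inputs
    print(r, pre_activation.var())                  # grows roughly linearly with r
```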
Initialization
To balance this effect, each weight is initialized to a value drawn
from a Gaussian distribution with standard deviation sqrt(1/r),
where r is the number of inputs to that neuron.
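A minimal sketch of this fan-in-scaled initialization, again assuming NumPy and a hypothetical layer shape:

```python
import numpy as np

def fan_in_gaussian_init(n_in, n_out, seed=0):
    """Draw each weight from N(0, 1/n_in) so that the pre-activation
    variance stays roughly constant regardless of the number of inputs."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=np.sqrt(1.0 / n_in), size=(n_out, n_in))

for r in (2, 100):
    W = fan_in_gaussian_init(n_in=r, n_out=1000)
    print(r, W.std(), np.sqrt(1.0 / r))  # empirical std vs. target sqrt(1/r)
```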
Xavier initialization or Glorot
Xavier (or Glorot) initialization is a weight initialization
technique designed to help neural networks converge more
efficiently during training.
It was introduced by Xavier Glorot and Yoshua Bengio in their
paper "Understanding the difficulty of training deep feedforward
neural networks."
Xavier initialization or Glorot
The weights are initialized in such a way that:
The variance of the outputs of each layer is the same as the
variance of its inputs.
The gradients during backpropagation have a similar variance
across layers, preventing vanishing or exploding gradients.
This is achieved by carefully scaling the initial weights based on
the number of input and output neurons.
Xavier initialization or Glorot initialization.
Let r_in and r_out respectively be the fan-in and fan-out for a
particular neuron.
The weights are then drawn from a Gaussian distribution with
standard deviation sqrt(2/(r_in + r_out)).
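A minimal sketch of the Gaussian form of Xavier/Glorot initialization described above (NumPy; the fan-in and fan-out values are arbitrary assumptions):

```python
import numpy as np

def glorot_normal_init(fan_in, fan_out, seed=0):
    """Draw weights from N(0, 2 / (fan_in + fan_out)), the Gaussian
    form of Xavier/Glorot initialization."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(scale=std, size=(fan_out, fan_in))

W = glorot_normal_init(fan_in=256, fan_out=128)
print(W.std(), np.sqrt(2.0 / (256 + 128)))  # empirical std vs. target
```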
Symmetry breaking.
An important consideration when using randomized methods is
symmetry breaking.
If all weights are initialized to the same value (such as 0), all
updates in a layer will move in lock-step.
As a result, identical features will be created by the neurons in a
layer.
It is important to have a source of asymmetry among the neurons
to begin with.
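The toy example below (a NumPy sketch with an assumed two-input network and squared-error loss) illustrates the point: when both hidden neurons start from identical weights they receive identical gradients, so they remain identical after every update, whereas a random start lets them diverge.

```python
import numpy as np

def grad_step(W, v, x, t, lr=0.1):
    """One gradient step for a tiny net: h = tanh(W x), y = v . h,
    loss = 0.5 * (y - t)^2."""
    h = np.tanh(W @ x)
    y = v @ h
    dy = y - t
    dv = dy * h
    dW = np.outer(dy * v * (1 - h**2), x)
    return W - lr * dW, v - lr * dv

x, t = np.array([1.0, -2.0]), 1.0

# Symmetric start: both hidden neurons have identical weights.
W = np.full((2, 2), 0.5)
v = np.full(2, 0.5)
for _ in range(10):
    W, v = grad_step(W, v, x, t)
print(W)  # the two rows are still identical: the neurons never differentiate

# Random start breaks the symmetry.
rng = np.random.default_rng(0)
W, v = rng.normal(size=(2, 2)), rng.normal(size=2)
for _ in range(10):
    W, v = grad_step(W, v, x, t)
print(W)  # rows differ: the neurons learn different features
```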
Thank You!