Unit 2.1

deep learning

Uploaded by

jadhavrohan7337

MIT Art Design and Technology University

MIT School of Computing, Pune


21BTCS031 – Deep Learning & Neural Networks

Class - L.Y. CORE (SEM-I)

Unit - II Deep Networks


Dr. Anant Kaulage
Dr. Sunita Parinam
Dr. Mayura Shelke
Dr. Aditya Pai
AY 2024-2025 SEM-I
Deep Neural Networks

Unit II
Introduction
□ Modern deep learning provides a very powerful
framework for supervised learning.
□ By adding more layers and more units within a
layer, a deep network can represent functions
of increasing complexity
□ Deep feedforward networks, also often
called feedforward neural networks, or
multilayer perceptrons (MLPs) are the
quintessential deep learning models.
□ The goal of a feedforward network is to
approximate some function f*.
Introduction
□ For a classifier, for example, y = f*(x) maps an input x to a category y.
□ A feedforward network defines a mapping y = f
(x; θ) and learns the value of the parameters θ
that result in the best function approximation
□ Feedforward: information flows through the
function being evaluated from x, through the
intermediate computations used to define f,
and finally to the output y.
□ There are no feedback connections in which outputs of
the model are fed back into itself
Introduction
□ When feedforward neural networks are
extended to include feedback connections,
they are called recurrent neural networks
□ Feedforward neural networks are called
networks because they are typically
represented by composing together many
different functions
■ f(x) = f^(3)(f^(2)(f^(1)(x)))
□ During neural network training, we drive f(x) to
match f∗(x).
y ≈ f∗(x)
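The chain structure f(x) = f^(3)(f^(2)(f^(1)(x))) can be sketched in a few lines of Python. The three scalar "layers" f1, f2 and f3 below are hypothetical and chosen only for illustration; they are not taken from the slides.

```python
from functools import reduce


def compose(*layers):
    """Return a function that applies the layers left to right:
    compose(f1, f2, f3)(x) == f3(f2(f1(x)))."""
    return lambda x: reduce(lambda value, layer: layer(value), layers, x)


# Hypothetical scalar layers, for illustration only.
def f1(x):
    return 2.0 * x + 1.0      # first layer: an affine map


def f2(h):
    return max(0.0, h)        # second layer: a simple nonlinearity


def f3(h):
    return h - 3.0            # third (output) layer: another affine map


f = compose(f1, f2, f3)
print(f(2.0), f(-1.0))
```

The number of functions in the chain is the depth of the model; here the depth is 3.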
Introduction
□ Linear models, such as logistic regression and linear
regression, are appealing because they can be fit
efficiently and reliably
□ To extend linear models to represent nonlinear
functions of x, we can apply the linear model not to x
itself but to a transformed input φ(x)
■ φ is a nonlinear transformation
□ Three ways to choose the mapping φ
■ Use a very generic φ, such as the infinite-dimensional φ
that is implicitly used by kernel machines based on the
RBF kernel
■ Manually engineer φ
■ Use deep learning to learn φ
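As a small illustration of the second option (manually engineering φ), the sketch below applies a linear model to a hand-made feature map for the XOR function, which returns 1 when exactly one of its two binary inputs is 1. The feature map phi and the weights w, b are assumptions picked by hand for this example; they do not come from the slides.

```python
def phi(x):
    """Hand-engineered nonlinear feature map: adds the product x1 * x2."""
    x1, x2 = x
    return [x1, x2, x1 * x2]


def linear_model(features, w, b):
    """Plain linear model applied to the transformed input phi(x)."""
    return sum(wi * fi for wi, fi in zip(w, features)) + b


X = [[0, 0], [0, 1], [1, 0], [1, 1]]

# Hand-picked weights (an assumption): y = x1 + x2 - 2 * x1 * x2
w, b = [1.0, 1.0, -2.0], 0.0

predictions = [linear_model(phi(x), w, b) for x in X]
print(predictions)
```

The model is still linear in φ(x), but nonlinear in x. Deep learning replaces the hand-written phi with a learned one.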
Learning XOR
□ XOR function: when exactly one of the two binary
inputs is equal to 1, the XOR function returns 1;
otherwise it returns 0.
□ target function, y = f∗(x)
□ Our model provides a function y = f(x;θ)
□ our learning algorithm will adapt the
parameters θ to make f as similar as possible
to f∗
□ X = {[0, 0]^T, [0, 1]^T, [1, 0]^T, [1, 1]^T}
□ Treat this as a regression problem and use a mean
squared error loss function
Learning XOR
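□ Evaluated on our whole training set, the MSE
loss function is
■ J(θ) = (1/4) Σ_{x∈X} (f∗(x) − f(x; θ))²
□ Linear model
■ f(x; w, b) = x^T w + b
□ We can minimize J(θ) in closed form with respect to w
and b using the normal equations
■ The solution is w = 0 and b = 1/2: the linear model
outputs 0.5 everywhere, so it cannot represent XOR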
□ Simple feedforward network with one hidden
layer containing two hidden units
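The failure of the linear model on XOR can be checked numerically. The sketch below does not solve the normal equations directly; instead it verifies that the gradient of the MSE vanishes at w = 0, b = 1/2. Because the MSE of a linear model is convex, a point with zero gradient is a global minimum. The helper names are hypothetical.

```python
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [0, 1, 1, 0]  # XOR targets


def predict(x, w, b):
    """Linear model f(x; w, b) = x^T w + b."""
    return w[0] * x[0] + w[1] * x[1] + b


def mse(w, b):
    """Mean squared error over the four XOR points."""
    return sum((y - predict(x, w, b)) ** 2 for x, y in zip(X, Y)) / len(X)


def mse_gradient(w, b):
    """Gradient of the MSE with respect to w and b."""
    n = len(X)
    errors = [predict(x, w, b) - y for x, y in zip(X, Y)]
    grad_w = [2.0 / n * sum(e * x[i] for e, x in zip(errors, X)) for i in range(2)]
    grad_b = 2.0 / n * sum(errors)
    return grad_w, grad_b


w_star, b_star = [0.0, 0.0], 0.5
grad_w, grad_b = mse_gradient(w_star, b_star)
outputs = [predict(x, w_star, b_star) for x in X]
print(grad_w, grad_b, outputs, mse(w_star, b_star))
```

At the optimum every input is mapped to 0.5, which is why a hidden layer with a nonlinearity is needed.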
Learning XOR
□ h = f^(1)(x; W, c)
□ y = f^(2)(h; w, b)
□ Complete model: f(x; W, c, w, b) = f^(2)(f^(1)(x))
□ The hidden units apply a nonlinear function g, called
an activation function:
h = g(W^T x + c)
□ With the rectified linear unit g(z) = max{0, z}:
f(x; W, c, w, b) = w^T max{0, W^T x + c} + b
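The model f(x; W, c, w, b) = w^T max{0, W^T x + c} + b can be evaluated directly on the four XOR inputs. The parameter values below are one known exact solution, written down by hand rather than learned; gradient-based training would typically find a different but equivalent set of parameters.

```python
def relu(z):
    """Rectified linear unit applied to a scalar."""
    return max(0.0, z)


def xor_network(x, W, c, w, b):
    """f(x) = w^T max{0, W^T x + c} + b with two hidden units."""
    # Hidden layer: unit j sees sum_i W[i][j] * x[i] + c[j]
    h = [relu(sum(W[i][j] * x[i] for i in range(2)) + c[j]) for j in range(2)]
    # Linear output layer
    return sum(w[j] * h[j] for j in range(2)) + b


# Hand-specified solution (an assumption of this sketch, not a trained result).
W = [[1.0, 1.0],
     [1.0, 1.0]]
c = [0.0, -1.0]
w = [1.0, -2.0]
b = 0.0

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
outputs = [xor_network(x, W, c, w, b) for x in X]
print(outputs)
```

The hidden layer maps [0, 1] and [1, 0] to the same point h = [1, 0], which makes the problem linearly solvable in h-space.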
Gradient Based Learning
□ The largest difference between the linear models
and neural networks is that the nonlinearity of a
neural network causes most interesting loss
functions to become non-convex
□ Neural networks are usually trained by using
iterative, gradient-based optimizers that merely drive
the cost function to a very low value
□ This contrasts with the linear equation solvers used to
train linear regression models and the convex optimization
algorithms with global convergence guarantees used for
logistic regression
□ Convex optimization converges starting from any
initial parameters
Gradient Based Learning
□ What makes non-convex optimization
hard?
■ Potentially many local minima
■ Saddle points
■ Very flat regions
■ Widely varying curvature
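The effect of local minima can be seen on a one-dimensional toy problem. The function f(x) = x^4 - 3x^2 + x below is a hypothetical example chosen for this sketch; it has two local minima, and plain gradient descent ends up in a different one depending on the starting point.

```python
def f(x):
    """A simple non-convex function with two local minima."""
    return x ** 4 - 3 * x ** 2 + x


def grad_f(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1


def gradient_descent(x0, lr=0.01, steps=2000):
    """Plain gradient descent from the initial point x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


x_left = gradient_descent(-2.0)   # starts left of both minima
x_right = gradient_descent(2.0)   # starts right of both minima
print(x_left, f(x_left))
print(x_right, f(x_right))
```

Both runs converge to a point with zero gradient, but only the run started at -2.0 reaches the lower of the two minima. For a convex function the two runs would agree.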
Gradient Based Learning
□ Examples of non-convex problems
■ Matrix completion, principal component
analysis
■ Low-rank models and tensor decomposition
■ Maximum likelihood estimation with hidden
variables
■ The big one: deep neural networks
Gradient Based Learning
□ How to solve non-convex problems
■ Stochastic gradient descent
■ Mini-batching
■ SVRG (stochastic variance reduced gradient)
■ Momentum
□ There are also specialized methods for
solving non-convex problems
■ Alternating minimization methods
■ Branch-and-bound methods
■ These generally aren’t very popular for
machine learning problems
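A minimal sketch of stochastic gradient descent with mini-batching and momentum follows. To keep the result deterministic and easy to check, it is applied to a small noiseless least-squares problem (which is convex); the same update rule is what is normally applied to neural networks. The data, hyperparameters and function name are assumptions made for this example.

```python
import random


def sgd_momentum(data, lr=0.1, momentum=0.9, batch_size=8, epochs=200, seed=0):
    """Fit y = w * x + b by minibatch SGD with momentum on the squared error."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0      # parameters
    vw, vb = 0.0, 0.0    # velocity terms accumulated by momentum
    for _ in range(epochs):
        rng.shuffle(data)  # new random minibatches every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error on this minibatch only
            gw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            gb = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            # Momentum update: velocity is a decaying sum of past gradients
            vw = momentum * vw - lr * gw
            vb = momentum * vb - lr * gb
            w += vw
            b += vb
    return w, b


# Synthetic noiseless data from y = 2x + 1 (hypothetical example data)
xs = [-1.0 + 2.0 * i / 63 for i in range(64)]
data = [(x, 2.0 * x + 1.0) for x in xs]

w, b = sgd_momentum(data)
print(w, b)
```

Each step uses only 8 of the 64 examples, so the gradient is a noisy estimate of the full gradient; momentum smooths these estimates over successive steps.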
Cost Functions: Conditional
Distribution
□ An important aspect of the design of a deep
neural network is the choice of the cost
function
□ our parametric model defines a distribution p(y
| x;θ ) and we simply use the principle of
maximum likelihood
□ cost function: cross-entropy between the
training data and the model’s predictions
□ Most modern neural networks are trained using
maximum likelihood
■ cost function : negative log-likelihood
Cost Functions
Conditional Statistics
□ Instead of learning a full probability distribution
p(y | x; θ) we often want to learn just one
conditional statistic of y given x.
□ For example, we may have a predictor f(x; θ)
that we wish to use to predict the mean of y
Output Units
□ The choice of cost function is tightly coupled with
the choice of output unit.
□ Most of the time, we simply use the cross-entropy
between the data distribution and the model
distribution.
□ The choice of how to represent the output then
determines the form of the cross-entropy function
□ The role of the output layer is to provide some
additional transformation from the features to
complete the task that the network must perform.
Linear Units for Gaussian
Output Distributions
□ Given features h, a layer of linear output units
produces a vector ŷ = W^T h + b
□ Linear output layers are often used to produce
the mean of a conditional Gaussian distribution
■ p(y | x) = N(y; ŷ, I)
■ Gaussian distribution over y with mean ŷ and
covariance I
□ Maximizing the log-likelihood is then equivalent
to minimizing the mean squared error
□ Because linear units do not saturate, they pose
little difficulty for gradient based optimization
algorithms
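The equivalence between maximizing the Gaussian log-likelihood and minimizing the squared error can be checked for a scalar output with unit variance. In that case the negative log-likelihood is 0.5 (y - ŷ)² plus a constant that does not depend on ŷ. The target value and candidate predictions below are hypothetical.

```python
import math


def gaussian_nll(y, y_hat):
    """Negative log-density of N(y; y_hat, 1) for a scalar y."""
    return 0.5 * (y - y_hat) ** 2 + 0.5 * math.log(2.0 * math.pi)


def half_squared_error(y, y_hat):
    """The part of the negative log-likelihood that depends on y_hat."""
    return 0.5 * (y - y_hat) ** 2


y = 1.5                               # observed target (example value)
candidates = [0.0, 1.0, 1.5, 2.5]     # candidate predictions y_hat

# The two costs differ by the same constant for every candidate ...
offsets = [gaussian_nll(y, c) - half_squared_error(y, c) for c in candidates]

# ... so they are minimized by the same prediction.
best_by_nll = min(candidates, key=lambda c: gaussian_nll(y, c))
best_by_mse = min(candidates, key=lambda c: half_squared_error(y, c))
print(offsets, best_by_nll, best_by_mse)
```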
Linear Units for Gaussian
Output Distributions
□ Functions that saturate (become very flat)
make gradient-based learning difficult
■ Because the gradient becomes very small
■ This happens when the activation functions that
produce the output of hidden or output units saturate
□ Negative log-likelihood helps avoid the
saturation problem for many models
■ Many output units involve an exp function that
saturates when its argument is very negative
■ The log function in the negative log-likelihood cost
function undoes the exp of some output units
□ Possible use in VAE
Sigmoid Units for Bernoulli
Output Distributions
□ Many tasks require predicting the value of a
binary variable y .
□ E.g. classification problems with two classes
□ The maximum-likelihood approach is to define
a Bernoulli distribution over y conditioned on x
□ A Bernoulli distribution is defined by just a
single number.
□ The neural net needs to predict only P(y = 1 |
x).
□ For this number to be a valid probability, it
must lie in the interval [0, 1].
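A sigmoid output unit squashes a real-valued score z into a valid probability P(y = 1 | x) = σ(z). The sketch below also shows the matching negative log-likelihood written with the softplus function, log(1 + exp(z)), which keeps the loss from saturating when the prediction is confidently wrong. Function names are assumptions of this sketch.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z):
    """log(1 + exp(z)), computed in a numerically stable way."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def bernoulli_nll(y, z):
    """-log P(y | x) for y in {0, 1} when P(y = 1 | x) = sigmoid(z)."""
    return softplus((1 - 2 * y) * z)


print(sigmoid(0.0))                 # an undecided score gives probability 0.5
print(bernoulli_nll(1, 2.0))        # small loss: confident and correct
print(bernoulli_nll(1, -1000.0))    # large loss that still grows linearly in |z|
```

Because the log in the loss undoes the exp in the sigmoid, the gradient with respect to z stays large when the model is badly wrong, instead of vanishing.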
Softmax Units for Multinoulli
Output Distributions
□ Any time we wish to represent a probability
distribution over a discrete variable with n
possible values, we may use the softmax function
□ Softmax functions are most often used as the
output of a classifier, to represent the probability
distribution over n different classes.
□ For a discrete variable with n values, we need to
produce a vector ŷ with ŷ_i = P(y = i | x)
□ First, a linear layer predicts unnormalized log
probabilities: z = W^T h + b, where
z_i = log P̃(y = i | x)
□ The softmax function then exponentiates and
normalizes z to obtain ŷ:
softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
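The softmax step that turns the unnormalized log probabilities z into a distribution can be sketched as follows. Subtracting max(z) before exponentiating is a common stabilization trick; it does not change the result, because softmax is invariant to adding a constant to every z_i. The example scores are hypothetical.

```python
import math


def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j), computed stably."""
    m = max(z)                                   # shift so the largest score is 0
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]


z = [2.0, 1.0, -1.0]           # example unnormalized log probabilities
y_hat = softmax(z)
shifted = softmax([zi + 100.0 for zi in z])
print(y_hat, sum(y_hat))
```

Without the shift, scores such as 1000.0 would overflow in exp; with it, softmax([1000.0, 1000.0]) evaluates to [0.5, 0.5].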
Hidden Units
□ How to choose the type of hidden unit to use in the
hidden layers of the model?
□ The design of hidden units is an extremely active area of
research and does not yet have many definitive guiding
theoretical principles
□ Rectified linear units are an excellent default choice
□ Positives:
■ Gives large and consistent gradients (does not saturate) when
active
■ Efficient to optimize; typically converges much faster than
sigmoid or tanh
□ Negatives:
■ Output is not zero-centered
■ Units can "die": an inactive unit receives zero gradient, so it
stops updating
Hidden Units
□ Sigmoid
□ Tanh
□ Radial basis function
□ Softplus
□ Hard Tanh
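For reference, the hidden unit types listed above can be written as scalar functions. This is a minimal sketch using their standard textbook definitions; the RBF unit is shown with an assumed center vector and width sigma, which are illustrative parameters.

```python
import math


def relu(z):
    """Rectified linear unit: max{0, z}."""
    return max(0.0, z)


def sigmoid(z):
    """Logistic sigmoid; saturates toward 0 and 1 for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def tanh(z):
    """Hyperbolic tangent; zero-centered, saturates toward -1 and 1."""
    return math.tanh(z)


def softplus(z):
    """Smooth version of the rectifier: log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def hard_tanh(z):
    """Piecewise-linear tanh: max(-1, min(1, z))."""
    return max(-1.0, min(1.0, z))


def rbf(x, center, sigma=1.0):
    """Radial basis function unit: exp(-||center - x||^2 / sigma^2)."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / sigma ** 2)


for name, fn in [("relu", relu), ("sigmoid", sigmoid), ("tanh", tanh),
                 ("softplus", softplus), ("hard_tanh", hard_tanh)]:
    print(name, [round(fn(z), 3) for z in (-2.0, 0.0, 2.0)])
```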
Architecture Design
□ A key design consideration for neural networks is
the architecture
□ How many units the network should have and how these
units should be connected to each other
□ Most neural networks are organized into layers
□ The main architectural considerations are to choose
the depth of the network and the width of each
layer
□ a network with even one hidden layer is
sufficient to fit the training set
□ Deeper networks are often able to use far fewer
units per layer and far fewer parameters, and
often generalize to the test set
□ However, they are also often harder to optimize
□ The ideal network architecture for a task must
be found via experimentation guided by
monitoring the validation set error
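The trade-off between depth and width can be made concrete by counting parameters. The sketch below counts weights and biases of a fully connected network from its layer sizes. The two architectures compared are hypothetical, and the parameter count alone says nothing about which one fits or generalizes better; that still has to be checked on the validation set.

```python
def count_parameters(layer_sizes):
    """Number of weights and biases in a fully connected network.

    layer_sizes lists the input size, each hidden layer width, and the
    output size, e.g. [100, 64, 10].
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


shallow_wide = [100, 1000, 10]          # one wide hidden layer
deep_narrow = [100, 64, 64, 64, 10]     # three narrow hidden layers

print(count_parameters(shallow_wide))   # 111010
print(count_parameters(deep_narrow))    # 15434
```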
Universal Approximation
Properties and Depth
□ At first glance, we may presume that learning a nonlinear
function requires designing a specialized model family for
that kind of nonlinearity
□ Fortunately, feedforward networks with hidden layers
provide a universal approximation framework
□ the universal approximation theorem states that a
feedforward network with a linear output layer and at
least one hidden layer with any “squashing” activation
function can approximate any Borel measurable
function from one finite-dimensional space to another
with any desired non-zero amount of error, provided
that the network is given enough hidden units
□ Any continuous function on a closed and bounded
subset of R^n is Borel measurable, so it can be
approximated by such a network
□ Mathematically speaking, any neural network
architecture aims at finding a mathematical function
y = f(x) that maps attributes (x) to outputs (y).
□ The accuracy of this function i.e. mapping differs
depending on the distribution of the dataset and the
architecture of the network employed.
□ The function f(x) can be arbitrarily complex.
□ The Universal Approximation Theorem tells us that
neural networks have a kind of universality: no
matter what f(x) is, there is a network that can
approximate it as closely as desired.
□ This result holds for any number of inputs and outputs.
Universal Approximation
Properties and Depth
□ The universal approximation theorem means that
regardless of what function we are trying to learn, we
know that a large MLP will be able to represent this
function.
□ However, we are not guaranteed that the training
algorithm will be able to learn that function
■ Reasons: optimization algorithm used for training may not
be able to find the value of the parameters
■ training algorithm might choose the wrong function due to
overfitting
□ The universal approximation theorem says that there
exists a network large enough to achieve any degree of
accuracy we desire.
□ How large? The theorem does not say.
□ In summary, a feedforward network with
a single hidden layer is sufficient to represent
any function, but the layer may be
infeasibly large and may fail to learn and
generalize correctly.
□ In many circumstances, using deeper
models can reduce the number of units
required to represent the desired
function and can reduce the amount of
generalization error
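The role of the number of hidden units can be illustrated with a hand-built network. The sketch below constructs a one-hidden-layer network with a linear output layer whose weights are set by hand (not learned) so that it interpolates a target function piecewise linearly. It uses rectified linear units rather than the "squashing" activations named in the theorem above; universal approximation results are also known for rectified linear units. The target sin(x) on [0, π] and all helper names are assumptions of this example.

```python
import math


def relu(z):
    return max(0.0, z)


def build_relu_network(target, a, b, n_hidden):
    """One hidden layer of n_hidden ReLU units plus a linear output layer.

    Hidden unit k computes relu(x - knot_k). The output weights are chosen
    so that the network equals the piecewise-linear interpolant of target
    on n_hidden equal segments of [a, b].
    """
    h = (b - a) / n_hidden
    knots = [a + k * h for k in range(n_hidden + 1)]
    values = [target(x) for x in knots]
    slopes = [(values[k + 1] - values[k]) / h for k in range(n_hidden)]
    # Each unit contributes the change in slope at its knot.
    output_weights = [slopes[0]] + [slopes[k] - slopes[k - 1]
                                    for k in range(1, n_hidden)]
    hidden_offsets = knots[:n_hidden]
    output_bias = values[0]

    def network(x):
        hidden = [relu(x - c) for c in hidden_offsets]
        return output_bias + sum(w * hk for w, hk in zip(output_weights, hidden))

    return network


def max_error(network, target, a, b, n_points=1000):
    """Largest absolute error on an evenly spaced grid over [a, b]."""
    xs = [a + (b - a) * i / (n_points - 1) for i in range(n_points)]
    return max(abs(network(x) - target(x)) for x in xs)


errors = {n: max_error(build_relu_network(math.sin, 0.0, math.pi, n),
                       math.sin, 0.0, math.pi)
          for n in (4, 16, 64)}
print(errors)
```

The error shrinks as hidden units are added, which matches the existence claim of the theorem. The construction says nothing about whether a training algorithm would find these weights.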
Other Architectures
□ In practice, neural networks show considerably more diversity
than a single chain of layers
□ Specialized architectures for computer vision called
convolutional networks
□ Feedforward networks may also be generalized to the
recurrent neural networks for sequence processing

Figure: Empirical results showing that deeper networks generalize better
