Unit 2.1

deep learning

Uploaded by

jadhavrohan7337

MIT Art Design and Technology University

MIT School of Computing, Pune

21BTCS031 – Deep Learning & Neural Networks

Class - L.Y. CORE (SEM-I)

Unit - II Deep Networks

Dr. Anant Kaulage
Dr. Sunita Parinam
Dr. Mayura Shelke
Dr. Aditya Pai
AY 2024-2025 SEM-I
Deep Neural Networks

Unit II
Introduction
□ Modern deep learning provides a very powerful
framework for supervised learning.
□ By adding more layers and more units within a
layer, a deep network can represent functions
of increasing complexity
□ Deep feedforward networks, also called
feedforward neural networks or multilayer
perceptrons (MLPs), are the quintessential
deep learning models.
□ The goal of a feedforward network is to
approximate some function f*.
Introduction
□ For a classifier, for example, y = f*(x) maps
an input x to a category y.
□ A feedforward network defines a mapping
y = f(x; θ) and learns the value of the parameters
θ that results in the best function approximation.
□ Feedforward: information flows through the
function being evaluated from x, through the
intermediate computations used to define f,
and finally to the output y.
□ There are no feedback connections in which
outputs of the model are fed back into itself.
Introduction
□ When feedforward neural networks are
extended to include feedback connections,
they are called recurrent neural networks
□ Feedforward neural networks are called
networks because they are typically
represented by composing together many
different functions
■ $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$
□ During neural network training, we drive f(x) to
match f∗(x).
■ Each training example x comes with a label y ≈ f∗(x)
Introduction
□ Linear models, such as logistic regression and
linear regression, are appealing because they can
be fit efficiently and reliably
□ To extend linear models to represent nonlinear
functions of x, we can apply the linear model not to x
itself but to a transformed input φ(x)
■ φ is a nonlinear transformation
□ Options for choosing the mapping φ
■ Use a very generic φ, such as the infinite-dimensional φ
that is implicitly used by kernel machines based on the
RBF kernel
■ Manually engineer φ
■ Use deep learning to learn φ
Learning XOR
□ XOR function: when exactly one of the two binary
inputs is equal to 1, the XOR function returns 1;
otherwise it returns 0.
□ XOR provides the target function y = f∗(x)
□ Our model provides a function y = f(x;θ)
□ Our learning algorithm adapts the
parameters θ to make f as similar as possible
to f∗
□ Training set: $X = \{[0, 0]^\top, [0, 1]^\top, [1, 0]^\top, [1, 1]^\top\}$
□ Treat this as a regression problem and use a mean
squared error loss function
Learning XOR
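□ Evaluated on our whole training set, the MSE
loss function is
$J(\theta) = \frac{1}{4} \sum_{x \in X} \left( f^*(x) - f(x; \theta) \right)^2$
□ Linear model, with θ consisting of w and b:
$f(x; w, b) = x^\top w + b$
□ Minimize J(θ) in closed form with respect to w
and b using the normal equations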
□ Simple feedforward network with one hidden
layer containing two hidden units
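As a minimal sketch of why the linear model is not enough, the snippet below fits f(x; w, b) = xᵀw + b to the four XOR points by least squares. Using NumPy and `np.linalg.lstsq` instead of writing out the normal equations by hand is an assumption made for illustration; both give the same minimizer here.

```python
import numpy as np

# The four XOR inputs and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so the bias b is fitted together with w.
A = np.hstack([X, np.ones((4, 1))])

# Least-squares solution of A @ [w1, w2, b] = y.
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = theta[:2], theta[2]
predictions = A @ theta

print("w =", w, "b =", b)
print("predictions =", predictions)
```

The fit gives w = 0 and b = 1/2, so the linear model predicts 0.5 for every input and cannot represent XOR.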
Learning XOR
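□ Hidden layer: $h = f^{(1)}(x; W, c)$
□ Output layer: $y = f^{(2)}(h; w, b)$
□ Complete model: $f(x; W, c, w, b) = f^{(2)}(f^{(1)}(x))$
□ $f^{(1)}$ applies an affine transformation followed by
a fixed nonlinear function g, called an activation
function: $h = g(W^\top x + c)$
□ With the rectified linear unit $g(z) = \max\{0, z\}$:
$f(x; W, c, w, b) = w^\top \max\{0, W^\top x + c\} + b$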
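The network with one hidden layer of two rectified linear units can be checked numerically. The sketch below evaluates the model on all four inputs. The parameter values are an assumption: they are one hand-chosen solution, not learned by training and not given on the slides.

```python
import numpy as np

def relu(z):
    """Rectified linear activation, applied element-wise."""
    return np.maximum(0.0, z)

def f(X, W, c, w, b):
    """w^T max{0, W^T x + c} + b, evaluated for every row x of X."""
    h = relu(X @ W + c)   # X @ W computes W^T x for each row of X
    return h @ w + b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hand-chosen parameters (assumed for illustration).
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

outputs = f(X, W, c, w, b)
print(outputs)
```

With these values the network returns 0, 1, 1, 0 for the four inputs, which is exactly the XOR function.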
Gradient Based Learning
□ The largest difference between the linear models
and neural networks is that the nonlinearity of a
neural network causes most interesting loss
functions to become non-convex
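□ Neural networks are therefore usually trained with
iterative, gradient-based optimizers that merely
drive the cost function to a very low value
□ By contrast, linear regression models are trained
with linear equation solvers, and logistic regression
with convex optimization algorithms that have global
convergence guarantees
□ Convex optimization converges starting from any
initial parameters; gradient descent on a non-convex
loss has no such guarantee and is sensitive to the
values of the initial parameters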
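To illustrate why initial parameters matter once the loss is non-convex, here is a small sketch of plain gradient descent on a one-dimensional function with two minima. The function, learning rate, and step count are hypothetical choices made only for this example.

```python
def f(x):
    """A non-convex function with two local minima (illustrative choice)."""
    return x**4 - 3 * x**2 + x

def grad_f(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    """Plain gradient descent starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

x_left = gradient_descent(-2.0)   # starts left of both minima
x_right = gradient_descent(2.0)   # starts right of both minima

print(x_left, f(x_left))
print(x_right, f(x_right))
```

The two runs settle in different minima with different loss values, so the result depends on where the optimizer starts.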
Gradient Based Learning
□ What makes non-convex optimization
hard?
■ Potentially many local minima
■ Saddle points
■ Very flat regions
■ Widely varying curvature
Gradient Based Learning
□ Examples of non-convex problems
■ Matrix completion, principal component
analysis
■ Low-rank models and tensor decomposition
■ Maximum likelihood estimation with hidden
variables
■ The big one: deep neural networks
Gradient Based Learning
□ How to solve non-convex problems
■ Stochastic gradient descent
■ Mini-batching
■ SVRG (stochastic variance reduced gradient)
■ Momentum
□ There are also specialized methods for
solving non-convex problems
■ Alternating minimization methods
■ Branch-and-bound methods
■ These generally aren’t very popular for
machine learning problems
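The first three techniques in the list combine naturally. The following is a minimal sketch of mini-batch stochastic gradient descent with momentum. For brevity it is applied to a small synthetic regression problem, which is convex; the same update rule is what gets applied to non-convex neural network losses. The data, learning rate, momentum coefficient, and batch size are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): y = 3x - 1 plus a little noise.
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x - 1.0 + 0.01 * rng.standard_normal(200)

theta = np.zeros(2)      # parameters [slope, intercept]
velocity = np.zeros(2)   # momentum buffer
lr, beta, batch_size = 0.1, 0.9, 20

for epoch in range(50):
    order = rng.permutation(len(x))            # reshuffle every epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]  # one mini-batch
        residual = theta[0] * x[idx] + theta[1] - y[idx]
        # Gradient of the mean squared error on this mini-batch only.
        grad = 2 * np.array([np.mean(residual * x[idx]), np.mean(residual)])
        velocity = beta * velocity - lr * grad  # momentum update
        theta = theta + velocity

print(theta)
```

Each step uses only `batch_size` examples, so the gradient is a noisy estimate; the velocity term averages successive gradients and smooths that noise.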
Cost Functions: Conditional Distribution
□ An important aspect of the design of a deep
neural network is the choice of the cost
function
□ In most cases, our parametric model defines a
distribution p(y | x; θ) and we simply use the
principle of maximum likelihood
□ The cost function is then the cross-entropy between
the training data and the model’s predictions
□ Most modern neural networks are trained using
maximum likelihood
■ Cost function: negative log-likelihood
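As a sketch of how the two names describe the same quantity, the snippet below computes the mean negative log-likelihood of the correct classes and the cross-entropy against one-hot targets for a small batch. The probabilities and labels are made-up numbers, and the function name is a hypothetical helper.

```python
import numpy as np

def negative_log_likelihood(probs, labels):
    """Mean of -log p(y_i | x_i) over the batch."""
    picked = probs[np.arange(len(labels)), labels]  # probability of the true class
    return -np.mean(np.log(picked))

# Model predictions p(y | x) for three examples and three classes (made up).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 2])

loss = negative_log_likelihood(probs, labels)

# Cross-entropy between the one-hot empirical distribution and the model.
one_hot = np.eye(3)[labels]
cross_entropy = -np.mean(np.sum(one_hot * np.log(probs), axis=1))

print(loss, cross_entropy)
```

Both expressions give the same number, because the one-hot targets select exactly the log-probability of the observed class.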
Cost Functions: Conditional Statistics
□ Instead of learning a full probability distribution
p(y | x; θ), we often want to learn just one
conditional statistic of y given x.
□ For example, we may have a predictor f(x; θ)
that we wish to use to predict the mean of y
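A simplified numerical check of the link between mean squared error and the mean: the sketch below ignores x and searches over constant predictions, so it covers only the unconditional case. The sample values and the search grid are assumptions.

```python
import numpy as np

# A small sample of targets (made up), including one outlier.
y = np.array([1.0, 2.0, 2.0, 3.0, 10.0])

# Try constant predictions on a grid with step 0.01.
candidates = np.linspace(0.0, 12.0, 1201)
mse = ((y[None, :] - candidates[:, None]) ** 2).mean(axis=1)

best = candidates[np.argmin(mse)]
print(best, y.mean())
```

The constant that minimizes the mean squared error coincides with the sample mean, which is why training with MSE yields a predictor of the mean of y.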
Output Units
□ The choice of cost function is tightly coupled with
the choice of output unit.
□ Most of the time, we simply use the cross-entropy
between the data distribution and the model
distribution.
□ The choice of how to represent the output then
determines the form of the cross-entropy function
□ The role of the output layer is to provide some
additional transformation from the features to
complete the task that the network must perform.
Linear Units for Gaussian Output Distributions
□ Given features h, a layer of linear output units
produces a vector $\hat{y} = W^\top h + b$
□ Linear output layers are often used to produce
the mean of a conditional Gaussian distribution
■ $p(y \mid x) = \mathcal{N}(y; \hat{y}, I)$
■ Gaussian distribution over y with mean $\hat{y}$ and
covariance I
□ Maximizing the log-likelihood is then equivalent
to minimizing the mean squared error
□ Because linear units do not saturate, they pose
little difficulty for gradient-based optimization
algorithms
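The equivalence between maximum likelihood and mean squared error can be written out in one line. In this sketch n denotes the dimension of y, a symbol introduced here for illustration.

```latex
-\log p(y \mid x)
  = -\log \mathcal{N}(y; \hat{y}, I)
  = \frac{1}{2} \lVert y - \hat{y} \rVert_2^2 + \frac{n}{2} \log(2\pi)
```

The second term does not depend on the parameters, so minimizing the negative log-likelihood over the training set is the same as minimizing the squared error, up to a constant factor and offset.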
Linear Units for Gaussian Output Distributions
□ Functions that saturate (become very flat) make
learning difficult
■ Because the gradient becomes very small
■ This happens when the activation functions that
produce the output of hidden or output units saturate
□ Negative log-likelihood helps avoid the
saturation problem for many models
■ Many output units involve an exp function that
saturates when its argument is very negative
■ The log function in the negative log-likelihood cost
function undoes the exp of such units
□ Possible use in VAE
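The effect of the log undoing the exp can be seen numerically. The sketch below compares the gradient with respect to the pre-activation z for a sigmoid output unit under squared error and under negative log-likelihood, at a point where the unit is confidently wrong. The sigmoid unit and the specific values of z and y are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_mse(z, y):
    """d/dz of (sigmoid(z) - y)^2."""
    s = sigmoid(z)
    return 2.0 * (s - y) * s * (1.0 - s)

def grad_nll(z, y):
    """d/dz of -[y log s + (1 - y) log(1 - s)] with s = sigmoid(z); simplifies to s - y."""
    return sigmoid(z) - y

# The target is 1 but the unit is saturated near 0.
z, y = -10.0, 1.0
g_mse = grad_mse(z, y)
g_nll = grad_nll(z, y)

print(g_mse, g_nll)
```

Under squared error the gradient almost vanishes even though the prediction is wrong, while under negative log-likelihood it stays close to -1, so learning can still correct the mistake.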
Sigmoid Units for Bernoulli Output Distributions
□ Many tasks require predicting the value of a
binary variable y.
□ E.g. classification problems with two classes
□ The maximum-likelihood approach is to define
a Bernoulli distribution over y conditioned on x
□ A Bernoulli distribution is defined by just a
single number.
□ The neural net needs to predict only P(y = 1 |
x).
□ For this number to be a valid probability, it
must lie in the interval [0, 1].
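One standard way to keep the prediction inside [0, 1] is a sigmoid output unit applied to a linear layer. The sketch below writes the unit and its negative log-likelihood loss; the symbols w, b, z and the softplus function ζ are introduced here as assumptions.

```latex
\hat{y} = P(y = 1 \mid x) = \sigma(z), \qquad
z = w^\top h + b, \qquad
\sigma(z) = \frac{1}{1 + e^{-z}}

J(\theta) = -\log P(y \mid x)
          = -\log \sigma\big((2y - 1)\, z\big)
          = \zeta\big((1 - 2y)\, z\big), \qquad
\zeta(a) = \log(1 + e^{a})
```

Written this way, the loss saturates only when (1 − 2y)z is very negative, which is when the model already has the right answer.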
Softmax Units for Multinoulli Output Distributions
□ Any time we wish to represent a probability
distribution over a discrete variable with n
possible values, we may use the softmax function
□ Softmax functions are most often used as the
output of a classifier, to represent the probability
distribution over n different classes.
□ For a discrete variable with n values, we need to
produce a vector $\hat{y}$ with $\hat{y}_i = P(y = i \mid x)$
□ First, a linear layer predicts unnormalized log
probabilities: $z = W^\top h + b$, where
$z_i = \log \tilde{P}(y = i \mid x)$
□ The softmax function then exponentiates and
normalizes z to obtain $\hat{y}$:
$\mathrm{softmax}(z)_i = \exp(z_i) / \sum_j \exp(z_j)$
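A minimal NumPy sketch of a softmax output: exponentiate the unnormalized log probabilities z and normalize them so they sum to 1. Subtracting the maximum before exponentiating is a common numerical-stability trick and an addition not mentioned on the slides; the example values of z are made up.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    shifted = z - np.max(z, axis=-1, keepdims=True)  # does not change the result
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

z = np.array([2.0, 1.0, 0.1])   # unnormalized log probabilities (made up)
y_hat = softmax(z)

# Without the shift, exp(1000) would overflow.
big = softmax(np.array([1000.0, 1000.0]))

print(y_hat, y_hat.sum())
print(big)
```

Adding the same constant to every entry of z leaves the output unchanged, which is why the shift is safe.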
Hidden Units
□ How to choose the type of hidden unit to use in the
hidden layers of the model?
□ The design of hidden units is an extremely active area of
research and does not yet have many definitive guiding
theoretical principles
□ Rectified linear units are an excellent default choice
□ Positives:
■ Gives large and consistent gradients (does not saturate) when
active
■ Efficient to optimize; often converges much faster than
sigmoid or tanh
□ Negatives:
■ Output is not zero-centered
■ Units can "die": an inactive unit has zero gradient, so
it receives no updates while it stays inactive
Hidden Units
□ Sigmoid
□ Tanh
□ Radial basis function
□ Softplus
□ Hard Tanh
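The hidden unit types named above can be written as short element-wise functions. This is a sketch with assumed names and conventions: the RBF unit uses a centre vector `w` and width `sigma` as hypothetical parameters, and `relu_grad` is included to show why an inactive rectified linear unit receives no gradient.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    """Derivative of relu: 1 where the unit is active, 0 where it is inactive."""
    return (z > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def softplus(z):
    """log(1 + exp(z)), computed stably; a smooth version of relu."""
    return np.logaddexp(0.0, z)

def hard_tanh(z):
    """Piecewise-linear tanh: clip to [-1, 1]."""
    return np.clip(z, -1.0, 1.0)

def rbf(x, w, sigma):
    """Radial basis function unit: large only when x is close to the centre w."""
    return np.exp(-np.sum((x - w) ** 2) / sigma**2)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), relu_grad(z), hard_tanh(z))
```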
Architecture Design
□ Architecture is a key design consideration for
neural networks
□ It covers how many units the network should have
and how these units should be connected to each other
□ Most neural networks are organized into layers
□ The main architectural considerations are to choose
the depth of the network and the width of each
layer
□ A network with even one hidden layer is
sufficient to fit the training set
□ Deeper networks are often able to use far fewer
units per layer and far fewer parameters, and
often generalize well to the test set
□ They are also often harder to optimize
□ The ideal network architecture for a task must
be found via experimentation guided by
monitoring the validation set error
Universal Approximation Properties and Depth
□ At first glance, we may presume that learning a
nonlinear function requires designing a specialized
model
□ Fortunately, feedforward networks with hidden layers
provide a universal approximation framework
□ The universal approximation theorem states that a
feedforward network with a linear output layer and at
least one hidden layer with any “squashing” activation
function can approximate any Borel measurable
function from one finite-dimensional space to another
with any desired non-zero amount of error, provided
that the network is given enough hidden units
□ Any continuous function on a closed and bounded
subset of $\mathbb{R}^n$ is Borel measurable, and can
therefore be approximated by a neural network
□ Mathematically speaking, any neural network
architecture aims to find a mathematical function
y = f(x) that maps attributes (x) to outputs (y).
□ The accuracy of this mapping differs
depending on the distribution of the dataset and the
architecture of the network employed.
□ The function f(x) can be arbitrarily complex.
□ The universal approximation theorem tells us that
neural networks have a kind of universality: no
matter what f(x) is, there is a network that can
approximate it to the desired accuracy.
□ This result holds for any number of inputs and outputs.
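The role of "enough hidden units" can be illustrated with a small experiment. The sketch below approximates sin(x) with a single hidden layer of tanh units and a linear output layer, and reports the error for increasing widths. It rests on several assumptions: the hidden weights are random and fixed, only the output layer is fitted (by least squares), and the target function, weight scale, and widths are arbitrary choices. It illustrates the theorem; it is not a proof and not how networks are usually trained.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.linspace(-np.pi, np.pi, 400)
target = np.sin(x)

# One hidden layer of tanh ("squashing") units with random, fixed weights.
max_units = 200
w_hidden = rng.normal(0.0, 2.0, size=max_units)
b_hidden = rng.normal(0.0, 2.0, size=max_units)
H_all = np.tanh(np.outer(x, w_hidden) + b_hidden)   # shape (400, 200)

def fit_rmse(n_units):
    """Fit a linear output layer on the first n_units hidden units; return the RMSE."""
    H = np.hstack([H_all[:, :n_units], np.ones((len(x), 1))])  # add output bias
    coef, *_ = np.linalg.lstsq(H, target, rcond=None)
    return float(np.sqrt(np.mean((H @ coef - target) ** 2)))

rmse = {n: fit_rmse(n) for n in (5, 20, 200)}
print(rmse)
```

Because the smaller networks use a subset of the larger network's hidden units, the fitting error can only stay the same or shrink as units are added.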
Universal Approximation Properties and Depth
□ The universal approximation theorem means that
regardless of what function we are trying to learn, we
know that a large MLP will be able to represent this
function.
□ However, we are not guaranteed that the training
algorithm will be able to learn that function
■ The optimization algorithm used for training may not
be able to find the parameter values that correspond
to the desired function
■ The training algorithm might choose the wrong function
due to overfitting
□ The universal approximation theorem says that there
exists a network large enough to achieve any degree of
accuracy we desire.
□ How large? The theorem does not say how large this
network must be.
□ In summary, a feedforward network with
a single hidden layer is sufficient to represent
any function, but the layer may be
infeasibly large and may fail to learn and
generalize correctly.
□ In many circumstances, using deeper
models can reduce the number of units
required to represent the desired
function and can reduce the amount of
generalization error
Other Architectures
□ In practice, neural networks show considerably more
diversity than simple chains of layers
□ Specialized architectures for computer vision are called
convolutional networks
□ Feedforward networks may also be generalized to
recurrent neural networks for sequence processing

[Figure: empirical results showing that deeper networks generalize better]
