
Introduction to Deep Learning

Tan Minh Nguyen


Department of Mathematics, NUS
A Cool Thing: Bias-Variance Tradeoff

Underfitting
Fit the linear hypothesis h𝒟(x) = w₀ + w₁x to data whose true trend is the quadratic ȳ(x) = w₀ + w₁x + w₂x² (the "just right" model).
[Figure: left, the quadratic fit (just right); right, the linear fit (underfitting); x-axis: x (e.g., size of the house), y-axis: y]

Bias toward linear models
→ A strongly biased solution

Now let's redo linear regression on a different dataset.
When fitting the model to a different dataset, the new linear solution does not differ too much from the old one
→ A low-variance solution
Overfitting
Fit the degree-5 polynomial h𝒟(x) = w₀ + w₁x + w₂x² + … + w₅x⁵ to data whose true trend is the quadratic ȳ(x) = w₀ + w₁x + w₂x² (the "just right" model).
[Figure: left, the quadratic fit (just right); right, the wiggly degree-5 fit (overfitting); x-axis: x (e.g., size of the house), y-axis: y]

Our hypothesis space contains all polynomials up to 5th order
→ No strong bias

Now redo the higher-order polynomial regression on a different dataset.
When fitting the model to a different dataset, the new solution does differ a lot from the old one
→ A high-variance solution
Bias of learner
Given: dataset 𝒟 with M samples: 𝒟 = {(x_i, y_i)}_{i=1}^{M}
Learn: for different datasets 𝒟, you will get different solutions f(x)
Expected prediction: f̄(x) = E_𝒟[f(x)]

Bias: the difference between the expected prediction and the truth
• Measures how well you expect to represent the true solution
• Decreases with more complex models

bias² = ∫_x (E_𝒟[f(x)] − y(x))² p(x) dx
Variance of learner
Given: dataset 𝒟 with M samples: 𝒟 = {(x_i, y_i)}_{i=1}^{M}
Learn: for different datasets 𝒟, you will get different solutions f(x)
Expected prediction: f̄(x) = E_𝒟[f(x)]

Variance: the difference between what you expect to learn, i.e., f̄, and what you learn from a particular dataset
• Measures how sensitive the learner is to a specific dataset
• Decreases with simpler models

variance = ∫_x E_𝒟[(f̄(x) − f(x))²] p(x) dx
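These two integrals can be estimated numerically by fitting many independently drawn datasets. A minimal sketch, assuming a quadratic ground truth with Gaussian noise and uniform p(x) on [0,1] (the generator, noise level, and degrees are illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
y_true = lambda x: 1.0 + 2.0 * x - 1.5 * x**2     # assumed ground truth ȳ(x)

def fit_predict(degree, x_grid, M=20, trials=500):
    """Fit `trials` independent datasets of size M; return predictions on x_grid."""
    preds = np.empty((trials, len(x_grid)))
    for t in range(trials):
        x = rng.uniform(0, 1, M)
        y = y_true(x) + rng.normal(0, 0.3, M)     # noisy samples
        coeffs = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coeffs, x_grid)
    return preds

x_grid = np.linspace(0, 1, 200)
for degree in (1, 2, 5):
    preds = fit_predict(degree, x_grid)
    f_bar = preds.mean(axis=0)                    # f̄(x) = E_𝒟[f(x)]
    bias2 = np.mean((f_bar - y_true(x_grid))**2)  # ≈ ∫ (f̄ − ȳ)² p dx
    var = np.mean(preds.var(axis=0))              # ≈ ∫ E_𝒟[(f̄ − f)²] p dx
    print(f"degree {degree}: bias² ≈ {bias2:.4f}, variance ≈ {var:.4f}")

Degree 1 should show large bias and small variance; degree 5 the reverse, matching the two slides above.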
Neural Networks
What we've seen so far
So far, we introduced two types of hypothesis spaces containing functions of the form

f(x) = Σ_{j=1}^{M} a_j φ_j(x)

Main difference:
1. The feature maps φ_j are fixed
Examples: linear models, linear basis models, SVM
2. The feature maps φ_j are adapted to data
Examples: standard/boosted/bagged decision trees
In this lecture, we introduce another class belonging to 2: neural networks
Neural Networks
Machine learning models based on neural networks are called deep learning models.

Deep learning models have become increasingly popular thanks to the increasing amount of data and computational power (e.g., powerful graphics processing units (GPUs)).
[Figure: growth of the ImageNet dataset and of GPU compute]
Examples of Deep Learning Models

https://chat.openai.com/ (ChatGPT)
Examples of Deep Learning Models

https://labs.openai.com/ (DALL·E)
Shallow Neural Networks
History
Neural networks originated from an attempt to model the collective interaction of neurons.

https://towardsdatascience.com/the-differences-between-artificial-and-biological-neural-networks-a8b46db828b7
Neural Networks for Regression
The neural network (NN) hypothesis space is quite like linear basis models:

f(x) = Σ_{j=1}^{M} v_j σ(w_j^⊤ x + b_j)

Trainable variables: v_j ∈ ℝ, b_j ∈ ℝ, w_j ∈ ℝ^d

• w_j are the weights of the hidden layer
• b_j are the biases of the hidden layer
• v_j are the weights of the output layer
• σ: ℝ → ℝ is the activation function
Activation Functions
• Sigmoid
  σ(z) = 1 / (1 + e^{−z})
• Tanh
  σ(z) = tanh(z)
• Rectified Linear Unit (ReLU)
  σ(z) = max(0, z)
• Leaky-ReLU
  σ(z) = z if z ≥ 0, δz if z < 0
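As a concrete reference, here is a minimal NumPy sketch of these activations and of the width-M regression network f(x) = Σ_j v_j σ(w_j^⊤ x + b_j) from the previous slide (the random initialization and sizes are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, delta=0.01):
    return np.where(z >= 0, z, delta * z)

def shallow_nn(x, W, b, v, act=np.tanh):
    """f(x) = sum_j v_j * act(w_j^T x + b_j).
    W: (M, d) hidden weights, b: (M,) biases, v: (M,) output weights."""
    return v @ act(W @ x + b)

# Example: width M = 8, input dimension d = 2
rng = np.random.default_rng(0)
M, d = 8, 2
W, b, v = rng.normal(size=(M, d)), rng.normal(size=M), rng.normal(size=M)
print(shallow_nn(np.array([0.5, -1.0]), W, b, v, act=sigmoid))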
Universal Approximation Theorem
One of the foundational results for neural networks is the universal approximation theorem.

In words, it says the following:

Any continuous function f* on a compact domain can be approximated by neural networks to arbitrary precision, provided there are enough neurons (M large enough).

The neural network hypothesis space of arbitrarily large width has zero approximation error!

Distance(f*, ℋ) = 0

[Figure: f* sitting at the edge of the hypothesis space ℋ, with f_M ∈ ℋ approaching it]
"Proof" in a Special Case
Let us consider a 1D continuous function on the unit interval
f*: [0,1] → ℝ
Step 1: Approximate f* by a step function
Step 2: Use two sigmoids to make a step
Step 3: Sum the resulting functions
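Step 2 can be made concrete: the difference of two steep sigmoids is a localized "bump" that is ≈ 1 on one piece of the interval and ≈ 0 elsewhere, and summing scaled bumps reproduces the step-function approximation. A minimal sketch (the target f* and steepness k are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, left, right, k=200.0):
    """≈ 1 on [left, right), ≈ 0 elsewhere: difference of two steep sigmoids."""
    return sigmoid(k * (x - left)) - sigmoid(k * (x - right))

def step_approx(f_star, x, n_pieces=50):
    """Sum of bumps, each scaled by f* at the midpoint of its piece."""
    edges = np.linspace(0.0, 1.0, n_pieces + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    return sum(f_star(m) * bump(x, l, r)
               for m, l, r in zip(mids, edges[:-1], edges[1:]))

f_star = lambda x: np.sin(2 * np.pi * x)     # illustrative target
x = np.linspace(0, 1, 1000)
print("max error:", np.abs(step_approx(f_star, x) - f_star(x)).max())

Increasing n_pieces (i.e., the width M) drives the error down, which is the content of the theorem in this special case.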
Curse of Dimensionality
Although this idea can be extended to high dimensions, it introduces an issue.
How many patches of linear size ε are there in [0,1]^d?
• d = 1: 1/ε pieces
• d = 2: 1/ε² pieces
• General d: 1/ε^d pieces
Even if, somehow, we only need a constant number of neurons to approximate each piece, we would still need 𝒪(ε^{−d}) neurons! For example, with ε = 0.1 and d = 100, that is already 10¹⁰⁰ pieces.
This is known as the curse of dimensionality that plagues high-dimensional problems.

Do neural networks suffer from the curse of dimensionality?
Linear and Nonlinear Approximation
Linear vs Nonlinear Approximation
Recall:

f(x) = Σ_{j=1}^{M} v_j φ_j(x)

1. Linear approximation: the φ_j are fixed

2. Nonlinear approximation: the φ_j are adapted

What is the difference?

The significance of data-dependent feature maps
Let us consider some motivating examples

Suppose we want to write a vector u in 3D in terms of its coordinate components:

u = u₁e₁ + u₂e₂ + u₃e₃

Suppose we can only use 2 coordinate axes, say e₁ and e₂.
What is the best approximation û of u?

Example:
u = (3, 1, 2) = 3e₁ + 1e₂ + 2e₃
û = 3e₁ + 1e₂
Error: ‖u − û‖ = 2

• What if we can pick which two bases to use after seeing u?

• What if we can pick two bases from a much larger set?
Functions behave just like vectors!
• Each φ_j is like a coordinate axis. They play the role of e_j.
Important difference: there are an infinite number of them
• The oracle function f* plays the role of u

Writing

f*(x) = Σ_{j=1}^{M} v_j φ_j(x)

is like expanding a vector into its components, but we can't have all components since M is finite.
If we get to choose which components to include in the sum after seeing some information on f*, we can usually do much better!
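The vector example makes this concrete. A small sketch comparing a fixed choice of axes against choosing the axes after seeing u, keeping the k largest components (the 3D vector is the one from the slide):

import numpy as np

def fixed_approx(u, k):
    """'Linear' approximation: always keep the first k coordinates."""
    u_hat = np.zeros_like(u)
    u_hat[:k] = u[:k]
    return u_hat

def adaptive_approx(u, k):
    """'Nonlinear' approximation: keep the k largest-magnitude coordinates."""
    u_hat = np.zeros_like(u)
    idx = np.argsort(np.abs(u))[-k:]   # chosen after seeing u
    u_hat[idx] = u[idx]
    return u_hat

u = np.array([3.0, 1.0, 2.0])
for approx in (fixed_approx, adaptive_approx):
    err = np.linalg.norm(u - approx(u, k=2))
    print(approx.__name__, "error:", err)
# fixed (e1, e2):    error = 2.0
# adaptive (e1, e3): error = 1.0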
Linear Approximation
Basis independent of data
Nonlinear Approximation
Basis depends on data
Overcoming the Curse of Dimensionality
Under some technical assumptions, for any continuous (+ other conditions) function f*: [0,1]^d → ℝ, there exists a width-M neural network f_M such that

‖f* − f_M‖² ≤ 𝒪(M^{−1})

This result was first proved in [Barron, 1993].

This is a tremendous improvement over linear approximation, where we usually have

‖f* − f_M‖² ≤ 𝒪(M^{−2α/d})

The constant α measures the smoothness of f*.
Optimizing Neural Networks
Optimization
The universal approximation theorem is an approximation result
• We know there is a good approximator of f* in ℋ
• But, we do not yet know how to find it

[Figure: inside the hypothesis space ℋ, optimization (using 𝒟) moves from an initial guess f₀ toward f_M ≈ f*]
Empirical Risk Minimization for Neural Networks
We can parameterize the hypothesis space

ℋ = {f : f(x) = f_θ(x), θ ∈ Θ}

Then, empirical risk minimization is

min_{θ∈Θ} Φ(θ) = (1/N) Σ_{i=1}^{N} L(f_θ(x_i), y_i)

Here, Φ is the total loss and Φ_i(θ) = L(f_θ(x_i), y_i) is the sample loss.
Gradient Descent
Consider minimizing the total loss or objective

Φ(θ) over θ ∈ Θ = ℝ^p (unconstrained)

A necessary first-order optimality condition:

∇Φ(θ*) = 0

Two choices:
• Solve ∇Φ(θ*) = 0 directly
• Use an iterative method
The Effect of Learning Rate
Look at the GD iteration

θ_{k+1} = θ_k − η∇Φ(θ_k)

• When η is too small, updates are slow

• When η is too large, the updates may become unstable
Example
One dimension:

Φ(θ) = ½θ²
∇Φ(θ) = θ

GD iterates:
θ_{k+1} = θ_k − ηθ_k = (1 − η)θ_k

Solution:
θ_k = (1 − η)^k θ₀
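The closed form shows convergence when |1 − η| < 1 (i.e., 0 < η < 2) and divergence otherwise. A quick numerical check (the step sizes are illustrative):

# GD on Φ(θ) = ½θ², where ∇Φ(θ) = θ, starting from θ₀ = 1.
for eta in (0.1, 1.0, 1.9, 2.1):
    theta = 1.0
    for _ in range(50):
        theta = theta - eta * theta      # θ_{k+1} = (1 − η)θ_k
    print(f"eta = {eta}: theta_50 = {theta:.3e}")
# eta < 2 → converges to 0; eta > 2 → |theta| blows up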
Convergence of GD
Provided η < ‖∇²Φ‖^{−1}, it can be shown that

‖∇Φ(θ_k)‖ → 0

However, this does not mean that θ_k → θ*!

Most important problem: there may be local minima
Local vs Global Minima
θ* is a local minimum of Φ if there exists δ > 0 such that
Φ(θ*) ≤ Φ(θ)
for all θ with ‖θ − θ*‖ ≤ δ

θ* is a global minimum of Φ if
Φ(θ*) ≤ Φ(θ) for all θ

[Figure: a one-dimensional objective with a local minimum (within radius δ) and a lower global minimum]

When does GD find a global minimum?
Convex Functions
A class of objective/loss functions for which local minima are also global is called convex functions.

Definition:

A function Φ is convex if
Φ(λθ + (1 − λ)θ′) ≤ λΦ(θ) + (1 − λ)Φ(θ′)
for all θ, θ′ ∈ ℝ^p and all λ ∈ [0,1].

Geometric meaning? The graph of Φ lies below the chord connecting any two of its points.
Examples

Convex: [plot]

Non-convex: [plot]
Important Property
If Φ is convex, then all local minima are also global!

Proof by picture:
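An algebraic version of the picture argument (a short sketch, not from the slides):

Let $\theta^*$ be a local minimum of a convex $\Phi$ and let $\theta$ be arbitrary.
For $\lambda \in (0,1]$, convexity gives
\[
  \Phi\big(\theta^* + \lambda(\theta - \theta^*)\big)
  = \Phi\big((1-\lambda)\theta^* + \lambda\theta\big)
  \le (1-\lambda)\,\Phi(\theta^*) + \lambda\,\Phi(\theta).
\]
For $\lambda$ small enough, $\theta^* + \lambda(\theta - \theta^*)$ lies within
distance $\delta$ of $\theta^*$, so local minimality gives
$\Phi(\theta^*) \le \Phi\big(\theta^* + \lambda(\theta - \theta^*)\big)$.
Combining the two inequalities and rearranging,
$\lambda\,\Phi(\theta^*) \le \lambda\,\Phi(\theta)$, i.e. $\Phi(\theta^*) \le \Phi(\theta)$.
Since $\theta$ was arbitrary, $\theta^*$ is a global minimum.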
GD on Convex Functions
When Φ is convex, GD finds a global minimum. In fact, there is a rate estimate!
Stochastic Gradient Descent
GD is an optimization algorithm for general differentiable functions, but in empirical risk minimization we have some structure:

Φ(θ) = (1/N) Σ_{i=1}^{N} Φ_i(θ)

Challenges to GD?
• ∇Φ(θ) = (1/N) Σ_{i=1}^{N} ∇Φ_i(θ), so a gradient evaluation requires a summation of N terms
• This is very expensive when N is large
Stochastic gradient descent relies on the following idea: at each step, we take a random sub-sample of the dataset as an approximation of the full gradient.

Gradient Descent (GD)

θ_{k+1} = θ_k − η∇Φ(θ_k)

Stochastic Gradient Descent (SGD)

θ_{k+1} = θ_k − η (1/B) Σ_{i∈I_k} ∇Φ_i(θ_k)

where I_k is a random sub-sample of {1, 2, …, N} of size B

This is efficient if B is small and N is large!
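A minimal sketch of the SGD update on a least-squares model; the data, model, and batch size are illustrative assumptions, not from the slides:

import numpy as np

rng = np.random.default_rng(0)
N, d, B, eta = 1000, 5, 32, 0.1
X = rng.normal(size=(N, d))
y = X @ np.arange(1.0, d + 1)            # assumed ground-truth weights 1..d

def grad(theta, idx):
    """Gradient of the mean squared loss over the samples in idx."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ theta - yb) / len(idx)

theta = np.zeros(d)
for k in range(200):
    I_k = rng.choice(N, size=B, replace=False)   # random sub-sample of size B
    theta = theta - eta * grad(theta, I_k)       # SGD step (GD: idx = all N)
print("learned:", np.round(theta, 2))            # ≈ [1, 2, 3, 4, 5]

Each step touches only B = 32 of the N = 1000 samples, which is the source of the efficiency gain.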


The Dynamics of SGD
Consider sample objectives

Φ₁(θ) = ½(θ − ξ)²    Φ₂(θ) = ½(θ + ξ)²

Total objective: Φ(θ) = ½(θ² + ξ²)

SGD vs GD dynamics?
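On this toy objective, GD follows the exact gradient ∇Φ(θ) = θ, while SGD picks Φ₁ or Φ₂ at random and so sees a gradient θ ∓ ξ. A short sketch of both trajectories (the values of ξ, η, and the starting point are illustrative):

import numpy as np

rng = np.random.default_rng(0)
xi, eta, steps = 1.0, 0.1, 200

theta_gd = theta_sgd = 2.0
for _ in range(steps):
    theta_gd = theta_gd - eta * theta_gd                   # full gradient: θ
    sign = rng.choice([-1.0, 1.0])                         # pick Φ₂ or Φ₁
    theta_sgd = theta_sgd - eta * (theta_sgd + sign * xi)  # sample gradient: θ ± ξ
print(f"GD:  {theta_gd:.4f}")    # → 0
print(f"SGD: {theta_sgd:.4f}")   # fluctuates around 0; the noise does not vanish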
Deep Neural Networks
Deep neural networks are an extension of shallow networks.
Idea: we stack many hidden layers together

x_0 = x
x_1 = σ(W_0 x_0 + b_0)
x_2 = σ(W_1 x_1 + b_1)
x_3 = σ(W_2 x_2 + b_2)
⋮
x_L = σ(W_{L−1} x_{L−1} + b_{L−1})
f(x) = v^⊤ x_L
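A minimal sketch of this stacked forward pass; the layer widths and initialization are illustrative:

import numpy as np

def deep_nn(x, Ws, bs, v, act=np.tanh):
    """x_{t+1} = act(W_t x_t + b_t) for each layer, then f(x) = v^T x_L."""
    for W, b in zip(Ws, bs):
        x = act(W @ x + b)
    return v @ x

# Example: 3 hidden layers, widths 4 → 8 → 8 → 8
rng = np.random.default_rng(0)
widths = [4, 8, 8, 8]
Ws = [rng.normal(size=(m, n)) for n, m in zip(widths[:-1], widths[1:])]
bs = [rng.normal(size=m) for m in widths[1:]]
v = rng.normal(size=widths[-1])
print(deep_nn(rng.normal(size=4), Ws, bs, v))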
Optimizing Deep Neural Networks
Analogous to shallow NNs, deep NNs can also be optimized with (stochastic) gradient descent.

However, due to the repeated feed-forward structure, we need an efficient algorithm to compute the gradients.

This is known as the back-propagation algorithm.
Review: Chain Rule
Consider functions
G: ℝ^m → ℝ^n
H: ℝ^d → ℝ^m

Then, the chain rule of calculus gives

∇_x G(H(x)) = ∇_H G(H(x)) ∇_x H(x)

In component form, we have

∂G_i/∂x_j = Σ_k (∂G_i/∂H_k) · (∂H_k/∂x_j)
Back-Propagation
Let us consider a network

x_0 = x
x_{t+1} = g_t(x_t, W_t),  t = 0, …, T
f(x) = x_{T+1}  (v is absorbed into the last layer)

Loss function (just consider one sample):

Φ(θ) = L(x_{T+1}, y),  θ ≡ (W_0, …, W_T)

We want to compute ∇_{W_t} Φ(θ) for t = 0, 1, …, T.

1. Generally, x_{T+1} has the following dependence:
x_{T+1} ≡ x_{T+1}(x, W_0, …, W_T)

2. But, given x_{t+1}, x_{T+1} no longer depends on W_s for s ≤ t:

x_{T+1} ≡ x_{T+1}(x_{t+1}, W_{t+1}, …, W_T)
x_{t+1} ≡ x_{t+1}(x, W_0, …, W_t)

3. Use the chain rule on

L(x_{T+1}, y) ≡ L(x_{T+1}(x_{t+1}(x, W_0, …, W_t), W_{t+1}, …, W_T), y)

giving

∇_{W_t} L(x_{T+1}, y) = (∇_{W_t} x_{t+1})^⊤ ∇_{x_{t+1}} L(x_{T+1}, y)

where ∇_{W_t} x_{t+1} = ∇_W g_t(x_t, W_t) is known from the layer, and the second factor is defined below.

4. So, we have defined
p_t = ∇_{x_t} L(x_{T+1}, y)
Once we know p_t, we are done! How to compute p_t?

For t = T + 1, this is easy:

p_{T+1} = ∇_{x_{T+1}} L(x_{T+1}, y)

For t ≤ T, we use the chain rule again to derive a recursion:

p_t = (∇_{x_t} x_{t+1})^⊤ ∇_{x_{t+1}} L(x_{T+1}, y)

and so

p_t = (∇_x g_t(x_t, W_t))^⊤ p_{t+1}
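Putting the recursion together for the concrete layer g_t(x_t, W_t) = σ(W_t x_t) with a squared loss; a minimal sketch (bias terms are omitted for brevity, and tanh is the assumed activation):

import numpy as np

def forward(x, Ws):
    """x_{t+1} = tanh(W_t x_t); returns all intermediate activations."""
    xs = [x]
    for W in Ws:
        xs.append(np.tanh(W @ xs[-1]))
    return xs

def backprop(xs, Ws, y):
    """Returns grads[t] = ∇_{W_t} L, for L = ½‖x_{T+1} − y‖²."""
    p = xs[-1] - y                           # p_{T+1} = ∇_{x_{T+1}} L
    grads = [None] * len(Ws)
    for t in reversed(range(len(Ws))):
        z = Ws[t] @ xs[t]
        s = 1.0 - np.tanh(z) ** 2            # σ'(z) for σ = tanh
        grads[t] = np.outer(s * p, xs[t])    # ∇_{W_t} L = (σ'(z) ⊙ p_{t+1}) x_tᵀ
        p = Ws[t].T @ (s * p)                # p_t = (∇_x g_t)ᵀ p_{t+1}
    return grads

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
xs = forward(rng.normal(size=3), Ws)
grads = backprop(xs, Ws, y=np.zeros(2))
print([g.shape for g in grads])              # [(4, 3), (2, 4)]

One forward pass stores the x_t, and one backward pass propagates p_t, so all gradients cost about as much as two forward evaluations.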
Summary
Approximation Properties of Neural Networks
• Nonlinear approximation: adapted to data
• Universal approximation property, overcomes curse of
dimensionality

Optimizing Neural Networks


• (Stochastic) Gradient Descent
• For deep NNs, compute gradients using back-propagation
Homework 1