
Introduction to Deep Learning

Tan Minh Nguyen


Department of Mathematics, NUS
A Cool Thing: Bias-Variance Tradeoff

Underfitting
Fit the linear hypothesis h𝒟(x) = w₀ + w₁x to data whose true trend is the quadratic ȳ(x) = w₀ + w₁x + w₂x² (the "just right" model).
[Figure: left, the quadratic fit (just right); right, the linear fit (underfitting); x-axis: x (e.g., size of the house), y-axis: y]

Bias toward linear models
→ A strongly biased solution

Now let's redo linear regression on a different dataset.
When fitting the model to a different dataset, the new linear solution does not differ too much from the old one
→ A low-variance solution
Overfitting
Fit the degree-5 polynomial h𝒟(x) = w₀ + w₁x + w₂x² + … + w₅x⁵ to data whose true trend is the quadratic ȳ(x) = w₀ + w₁x + w₂x² (the "just right" model).
[Figure: left, the quadratic fit (just right); right, the wiggly degree-5 fit (overfitting); x-axis: x (e.g., size of the house), y-axis: y]

Our hypothesis space contains all polynomials up to 5th order
→ No strong bias

Now redo the higher-order polynomial regression on a different dataset.
When fitting the model to a different dataset, the new solution does differ a lot from the old one
→ A high-variance solution
Bias of learner
Given: dataset 𝒟 with M samples: 𝒟 = {(x_i, y_i)}_{i=1}^{M}
Learn: for different datasets 𝒟, you will get different solutions f(x)
Expected prediction: f̄(x) = E_𝒟[f(x)]

Bias: the difference between the expected prediction and the truth
• Measures how well you expect to represent the true solution
• Decreases with more complex models

bias² = ∫_x (E_𝒟[f(x)] − y(x))² p(x) dx
Variance of learner
Given: dataset 𝒟 with M samples: 𝒟 = {(x_i, y_i)}_{i=1}^{M}
Learn: for different datasets 𝒟, you will get different solutions f(x)
Expected prediction: f̄(x) = E_𝒟[f(x)]

Variance: the difference between what you expect to learn, i.e., f̄, and what you learn from a particular dataset
• Measures how sensitive the learner is to a specific dataset
• Decreases with simpler models

variance = ∫_x E_𝒟[(f̄(x) − f(x))²] p(x) dx
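These two integrals can be estimated numerically by fitting many independently drawn datasets. A minimal sketch, assuming a quadratic ground truth with Gaussian noise and uniform p(x) on [0,1] (the generator, noise level, and degrees are illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
y_true = lambda x: 1.0 + 2.0 * x - 1.5 * x**2     # assumed ground truth ȳ(x)

def fit_predict(degree, x_grid, M=20, trials=500):
    """Fit `trials` independent datasets of size M; return predictions on x_grid."""
    preds = np.empty((trials, len(x_grid)))
    for t in range(trials):
        x = rng.uniform(0, 1, M)
        y = y_true(x) + rng.normal(0, 0.3, M)     # noisy samples
        coeffs = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coeffs, x_grid)
    return preds

x_grid = np.linspace(0, 1, 200)
for degree in (1, 2, 5):
    preds = fit_predict(degree, x_grid)
    f_bar = preds.mean(axis=0)                    # f̄(x) = E_𝒟[f(x)]
    bias2 = np.mean((f_bar - y_true(x_grid))**2)  # ≈ ∫ (f̄ − ȳ)² p dx
    var = np.mean(preds.var(axis=0))              # ≈ ∫ E_𝒟[(f̄ − f)²] p dx
    print(f"degree {degree}: bias² ≈ {bias2:.4f}, variance ≈ {var:.4f}")

Degree 1 should show large bias and small variance; degree 5 the reverse, matching the two slides above.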
Neural Networks
What we've seen so far
So far, we introduced two types of hypothesis spaces containing functions of the form

f(x) = Σ_{j=1}^{M} a_j φ_j(x)

Main difference:
1. The feature maps φ_j are fixed
Examples: linear models, linear basis models, SVM
2. The feature maps φ_j are adapted to data
Examples: standard/boosted/bagged decision trees
In this lecture, we introduce another class belonging to 2: neural networks
Neural Networks
Machine learning models based on neural networks are called deep learning models.

Deep learning models have become increasingly popular thanks to the increasing amount of data and computational power (e.g., powerful graphics processing units (GPUs)).
[Figure: growth of the ImageNet dataset and of GPU compute]
Examples of Deep Learning Models

https://chat.openai.com/ (ChatGPT)
Examples of Deep Learning Models

https://labs.openai.com/ (DALL·E)
Shallow Neural Networks
History
Neural networks originated from an attempt to model the collective interaction of neurons.

https://towardsdatascience.com/the-differences-between-artificial-and-biological-neural-networks-a8b46db828b7
Neural Networks for Regression
The neural network (NN) hypothesis space is quite like linear basis models:

f(x) = Σ_{j=1}^{M} v_j σ(w_j^⊤ x + b_j)

Trainable variables: v_j ∈ ℝ, b_j ∈ ℝ, w_j ∈ ℝ^d

• w_j are the weights of the hidden layer
• b_j are the biases of the hidden layer
• v_j are the weights of the output layer
• σ: ℝ → ℝ is the activation function
Activation Functions
• Sigmoid
  σ(z) = 1 / (1 + e^{−z})
• Tanh
  σ(z) = tanh(z)
• Rectified Linear Unit (ReLU)
  σ(z) = max(0, z)
• Leaky-ReLU
  σ(z) = z if z ≥ 0, δz if z < 0
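As a concrete reference, here is a minimal NumPy sketch of these activations and of the width-M regression network f(x) = Σ_j v_j σ(w_j^⊤ x + b_j) from the previous slide (the random initialization and sizes are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, delta=0.01):
    return np.where(z >= 0, z, delta * z)

def shallow_nn(x, W, b, v, act=np.tanh):
    """f(x) = sum_j v_j * act(w_j^T x + b_j).
    W: (M, d) hidden weights, b: (M,) biases, v: (M,) output weights."""
    return v @ act(W @ x + b)

# Example: width M = 8, input dimension d = 2
rng = np.random.default_rng(0)
M, d = 8, 2
W, b, v = rng.normal(size=(M, d)), rng.normal(size=M), rng.normal(size=M)
print(shallow_nn(np.array([0.5, -1.0]), W, b, v, act=sigmoid))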
Universal Approximation Theorem
One of the foundational results for neural networks is the universal approximation theorem.

In words, it says the following:

Any continuous function f* on a compact domain can be approximated by neural networks to arbitrary precision, provided there are enough neurons (M large enough).

The neural network hypothesis space of arbitrarily large width has zero approximation error!

Distance(f*, ℋ) = 0

[Figure: f* sitting at the edge of the hypothesis space ℋ, with f_M ∈ ℋ approaching it]
"Proof" in a Special Case
Let us consider a 1D continuous function on the unit interval
f*: [0,1] → ℝ
Step 1: Approximate f* by a step function
Step 2: Use two sigmoids to make a step
Step 3: Sum the resulting functions
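Step 2 can be made concrete: the difference of two steep sigmoids is a localized "bump" that is ≈ 1 on one piece of the interval and ≈ 0 elsewhere, and summing scaled bumps reproduces the step-function approximation. A minimal sketch (the target f* and steepness k are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, left, right, k=200.0):
    """≈ 1 on [left, right), ≈ 0 elsewhere: difference of two steep sigmoids."""
    return sigmoid(k * (x - left)) - sigmoid(k * (x - right))

def step_approx(f_star, x, n_pieces=50):
    """Sum of bumps, each scaled by f* at the midpoint of its piece."""
    edges = np.linspace(0.0, 1.0, n_pieces + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    return sum(f_star(m) * bump(x, l, r)
               for m, l, r in zip(mids, edges[:-1], edges[1:]))

f_star = lambda x: np.sin(2 * np.pi * x)     # illustrative target
x = np.linspace(0, 1, 1000)
print("max error:", np.abs(step_approx(f_star, x) - f_star(x)).max())

Increasing n_pieces (i.e., the width M) drives the error down, which is the content of the theorem in this special case.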
Curse of Dimensionality
Although this idea can be extended to high dimensions, it introduces an issue.
How many patches of linear size ε are there in [0,1]^d?
• d = 1: 1/ε pieces
• d = 2: 1/ε² pieces
• General d: 1/ε^d pieces
Even if, somehow, we only need a constant number of neurons to approximate each piece, we would still need 𝒪(ε^{−d}) neurons! For example, with ε = 0.1 and d = 100, that is already 10¹⁰⁰ pieces.
This is known as the curse of dimensionality that plagues high-dimensional problems.

Do neural networks suffer from the curse of dimensionality?
Linear and Nonlinear Approximation
Linear vs Nonlinear Approximation
Recall:

f(x) = Σ_{j=1}^{M} v_j φ_j(x)

1. Linear approximation: the φ_j are fixed

2. Nonlinear approximation: the φ_j are adapted

What is the difference?

The significance of data-dependent feature maps
Let us consider some motivating examples

Suppose we want to write a vector u in 3D in terms of its coordinate components:

u = u₁e₁ + u₂e₂ + u₃e₃

Suppose we can only use 2 coordinate axes, say e₁ and e₂.
What is the best approximation û of u?

Example:
u = (3, 1, 2) = 3e₁ + 1e₂ + 2e₃
û = 3e₁ + 1e₂
Error: ‖u − û‖ = 2

• What if we can pick which two bases to use after seeing u?

• What if we can pick two bases from a much larger set?
Functions behave just like vectors!
• Each φ_j is like a coordinate axis. They play the role of e_j.
Important difference: there are an infinite number of them
• The oracle function f* plays the role of u

Writing

f*(x) = Σ_{j=1}^{M} v_j φ_j(x)

is like expanding a vector into its components, but we can't have all components since M is finite.
If we get to choose which components to include in the sum after seeing some information on f*, we can usually do much better!
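The vector example makes this concrete. A small sketch comparing a fixed choice of axes against choosing the axes after seeing u, keeping the k largest components (the 3D vector is the one from the slide):

import numpy as np

def fixed_approx(u, k):
    """'Linear' approximation: always keep the first k coordinates."""
    u_hat = np.zeros_like(u)
    u_hat[:k] = u[:k]
    return u_hat

def adaptive_approx(u, k):
    """'Nonlinear' approximation: keep the k largest-magnitude coordinates."""
    u_hat = np.zeros_like(u)
    idx = np.argsort(np.abs(u))[-k:]   # chosen after seeing u
    u_hat[idx] = u[idx]
    return u_hat

u = np.array([3.0, 1.0, 2.0])
for approx in (fixed_approx, adaptive_approx):
    err = np.linalg.norm(u - approx(u, k=2))
    print(approx.__name__, "error:", err)
# fixed (e1, e2):    error = 2.0
# adaptive (e1, e3): error = 1.0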
Linear Approximation
Basis independent of data
Nonlinear Approximation
Basis depends on data
Overcoming the Curse of Dimensionality
Under some technical assumptions, for any continuous (+ other conditions) function f*: [0,1]^d → ℝ, there exists a width-M neural network f_M such that

‖f* − f_M‖² ≤ 𝒪(M^{−1})

This result was first proved in [Barron, 1993].

This is a tremendous improvement over linear approximation, where we usually have

‖f* − f_M‖² ≤ 𝒪(M^{−2α/d})

The constant α measures the smoothness of f*.
Optimizing Neural Networks
Optimization
The universal approximation theorem is an approximation result
• We know there is a good approximator of f* in ℋ
• But, we do not yet know how to find it

[Figure: inside the hypothesis space ℋ, optimization (using 𝒟) moves from an initial guess f₀ toward f_M ≈ f*]
Empirical Risk Minimization for Neural Networks
We can parameterize the hypothesis space

ℋ = {f : f(x) = f_θ(x), θ ∈ Θ}

Then, empirical risk minimization is

min_{θ∈Θ} Φ(θ) = (1/N) Σ_{i=1}^{N} L(f_θ(x_i), y_i)

Here, Φ is the total loss and Φ_i(θ) = L(f_θ(x_i), y_i) is the sample loss.
Gradient Descent
Consider minimizing the total loss or objective

Φ(θ) over θ ∈ Θ = ℝ^p (unconstrained)

A necessary first-order optimality condition:

∇Φ(θ*) = 0

Two choices:
• Solve ∇Φ(θ*) = 0 directly
• Use an iterative method
The Effect of Learning Rate
Look at the GD iteration

θ_{k+1} = θ_k − η∇Φ(θ_k)

• When η is too small, updates are slow

• When η is too large, the updates may become unstable
Example
One dimension:

Φ(θ) = ½θ²
∇Φ(θ) = θ

GD iterates:
θ_{k+1} = θ_k − ηθ_k = (1 − η)θ_k

Solution:
θ_k = (1 − η)^k θ₀
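The closed form shows convergence when |1 − η| < 1 (i.e., 0 < η < 2) and divergence otherwise. A quick numerical check (the step sizes are illustrative):

# GD on Φ(θ) = ½θ², where ∇Φ(θ) = θ, starting from θ₀ = 1.
for eta in (0.1, 1.0, 1.9, 2.1):
    theta = 1.0
    for _ in range(50):
        theta = theta - eta * theta      # θ_{k+1} = (1 − η)θ_k
    print(f"eta = {eta}: theta_50 = {theta:.3e}")
# eta < 2 → converges to 0; eta > 2 → |theta| blows up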
Convergence of GD
Provided η < ‖∇²Φ‖^{−1}, it can be shown that

‖∇Φ(θ_k)‖ → 0

However, this does not mean that θ_k → θ*!

Most important problem: there may be local minima
Local vs Global Minima
θ* is a local minimum of Φ if there exists δ > 0 such that
Φ(θ*) ≤ Φ(θ)
for all θ with ‖θ − θ*‖ ≤ δ

θ* is a global minimum of Φ if
Φ(θ*) ≤ Φ(θ) for all θ

[Figure: a one-dimensional objective with a local minimum (within radius δ) and a lower global minimum]

When does GD find a global minimum?
Convex Functions
A class of objective/loss functions for which local minima are also global is called convex functions.

Definition:

A function Φ is convex if
Φ(λθ + (1 − λ)θ′) ≤ λΦ(θ) + (1 − λ)Φ(θ′)
for all θ, θ′ ∈ ℝ^p and all λ ∈ [0,1].

Geometric meaning? The graph of Φ lies below the chord connecting any two of its points.
Examples

Convex: [plot]

Non-convex: [plot]
Important Property
If Φ is convex, then all local minima are also global!

Proof by picture:
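An algebraic version of the picture argument (a short sketch, not from the slides):

Let $\theta^*$ be a local minimum of a convex $\Phi$ and let $\theta$ be arbitrary.
For $\lambda \in (0,1]$, convexity gives
\[
  \Phi\big(\theta^* + \lambda(\theta - \theta^*)\big)
  = \Phi\big((1-\lambda)\theta^* + \lambda\theta\big)
  \le (1-\lambda)\,\Phi(\theta^*) + \lambda\,\Phi(\theta).
\]
For $\lambda$ small enough, $\theta^* + \lambda(\theta - \theta^*)$ lies within
distance $\delta$ of $\theta^*$, so local minimality gives
$\Phi(\theta^*) \le \Phi\big(\theta^* + \lambda(\theta - \theta^*)\big)$.
Combining the two inequalities and rearranging,
$\lambda\,\Phi(\theta^*) \le \lambda\,\Phi(\theta)$, i.e. $\Phi(\theta^*) \le \Phi(\theta)$.
Since $\theta$ was arbitrary, $\theta^*$ is a global minimum.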
GD on Convex Functions
When Φ is convex, GD finds a global minimum. In fact, there is a rate estimate!
Stochastic Gradient Descent
GD is an optimization algorithm for general differentiable functions, but in empirical risk minimization we have some structure:

Φ(θ) = (1/N) Σ_{i=1}^{N} Φ_i(θ)

Challenges to GD?
• ∇Φ(θ) = (1/N) Σ_{i=1}^{N} ∇Φ_i(θ), so a gradient evaluation requires a summation of N terms
• This is very expensive when N is large
Stochastic gradient descent relies on the following idea: at each step, we take a random sub-sample of the dataset as an approximation of the full gradient.

Gradient Descent (GD)

θ_{k+1} = θ_k − η∇Φ(θ_k)

Stochastic Gradient Descent (SGD)

θ_{k+1} = θ_k − η (1/B) Σ_{i∈I_k} ∇Φ_i(θ_k)

where I_k is a random sub-sample of {1, 2, …, N} of size B

This is efficient if B is small and N is large!
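A minimal sketch of the SGD update on a least-squares model; the data, model, and batch size are illustrative assumptions, not from the slides:

import numpy as np

rng = np.random.default_rng(0)
N, d, B, eta = 1000, 5, 32, 0.1
X = rng.normal(size=(N, d))
y = X @ np.arange(1.0, d + 1)            # assumed ground-truth weights 1..d

def grad(theta, idx):
    """Gradient of the mean squared loss over the samples in idx."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ theta - yb) / len(idx)

theta = np.zeros(d)
for k in range(200):
    I_k = rng.choice(N, size=B, replace=False)   # random sub-sample of size B
    theta = theta - eta * grad(theta, I_k)       # SGD step (GD: idx = all N)
print("learned:", np.round(theta, 2))            # ≈ [1, 2, 3, 4, 5]

Each step touches only B = 32 of the N = 1000 samples, which is the source of the efficiency gain.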


The Dynamics of SGD
Consider sample objectives

Φ₁(θ) = ½(θ − ξ)²    Φ₂(θ) = ½(θ + ξ)²

Total objective: Φ(θ) = ½(θ² + ξ²)

SGD vs GD dynamics?
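On this toy objective, GD follows the exact gradient ∇Φ(θ) = θ, while SGD picks Φ₁ or Φ₂ at random and so sees a gradient θ ∓ ξ. A short sketch of both trajectories (the values of ξ, η, and the starting point are illustrative):

import numpy as np

rng = np.random.default_rng(0)
xi, eta, steps = 1.0, 0.1, 200

theta_gd = theta_sgd = 2.0
for _ in range(steps):
    theta_gd = theta_gd - eta * theta_gd                   # full gradient: θ
    sign = rng.choice([-1.0, 1.0])                         # pick Φ₂ or Φ₁
    theta_sgd = theta_sgd - eta * (theta_sgd + sign * xi)  # sample gradient: θ ± ξ
print(f"GD:  {theta_gd:.4f}")    # → 0
print(f"SGD: {theta_sgd:.4f}")   # fluctuates around 0; the noise does not vanish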
Deep Neural Networks
Deep neural networks are an extension of shallow networks.
Idea: we stack many hidden layers together

x_0 = x
x_1 = σ(W_0 x_0 + b_0)
x_2 = σ(W_1 x_1 + b_1)
x_3 = σ(W_2 x_2 + b_2)
⋮
x_L = σ(W_{L−1} x_{L−1} + b_{L−1})
f(x) = v^⊤ x_L
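A minimal sketch of this stacked forward pass; the layer widths and initialization are illustrative:

import numpy as np

def deep_nn(x, Ws, bs, v, act=np.tanh):
    """x_{t+1} = act(W_t x_t + b_t) for each layer, then f(x) = v^T x_L."""
    for W, b in zip(Ws, bs):
        x = act(W @ x + b)
    return v @ x

# Example: 3 hidden layers, widths 4 → 8 → 8 → 8
rng = np.random.default_rng(0)
widths = [4, 8, 8, 8]
Ws = [rng.normal(size=(m, n)) for n, m in zip(widths[:-1], widths[1:])]
bs = [rng.normal(size=m) for m in widths[1:]]
v = rng.normal(size=widths[-1])
print(deep_nn(rng.normal(size=4), Ws, bs, v))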
Optimizing Deep Neural Networks
Analogous to shallow NNs, deep NNs can also be optimized with (stochastic) gradient descent.

However, due to the repeated feed-forward structure, we need an efficient algorithm to compute the gradients.

This is known as the back-propagation algorithm.
Review: Chain Rule
Consider functions
G: ℝ^m → ℝ^n
H: ℝ^d → ℝ^m

Then, the chain rule of calculus gives

∇_x G(H(x)) = ∇_H G(H(x)) ∇_x H(x)

In component form, we have

∂G_i/∂x_j = Σ_k (∂G_i/∂H_k) · (∂H_k/∂x_j)
Back-Propagation
Let us consider a network

x_0 = x
x_{t+1} = g_t(x_t, W_t),  t = 0, …, T
f(x) = x_{T+1}  (v is absorbed into the last layer)

Loss function (just consider one sample):

Φ(θ) = L(x_{T+1}, y),  θ ≡ (W_0, …, W_T)

We want to compute ∇_{W_t} Φ(θ) for t = 0, 1, …, T.

1. Generally, x_{T+1} has the following dependence:
x_{T+1} ≡ x_{T+1}(x, W_0, …, W_T)

2. But, given x_{t+1}, x_{T+1} no longer depends on W_s for s ≤ t:

x_{T+1} ≡ x_{T+1}(x_{t+1}, W_{t+1}, …, W_T)
x_{t+1} ≡ x_{t+1}(x, W_0, …, W_t)

3. Use the chain rule on

L(x_{T+1}, y) ≡ L(x_{T+1}(x_{t+1}(x, W_0, …, W_t), W_{t+1}, …, W_T), y)

giving

∇_{W_t} L(x_{T+1}, y) = (∇_{W_t} x_{t+1})^⊤ ∇_{x_{t+1}} L(x_{T+1}, y)

where ∇_{W_t} x_{t+1} = ∇_W g_t(x_t, W_t) is known from the layer, and the second factor is defined below.

4. So, we have defined
p_t = ∇_{x_t} L(x_{T+1}, y)
Once we know p_t, we are done! How to compute p_t?

For t = T + 1, this is easy:

p_{T+1} = ∇_{x_{T+1}} L(x_{T+1}, y)

For t ≤ T, we use the chain rule again to derive a recursion:

p_t = (∇_{x_t} x_{t+1})^⊤ ∇_{x_{t+1}} L(x_{T+1}, y)

and so

p_t = (∇_x g_t(x_t, W_t))^⊤ p_{t+1}
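Putting the recursion together for the concrete layer g_t(x_t, W_t) = σ(W_t x_t) with a squared loss; a minimal sketch (bias terms are omitted for brevity, and tanh is the assumed activation):

import numpy as np

def forward(x, Ws):
    """x_{t+1} = tanh(W_t x_t); returns all intermediate activations."""
    xs = [x]
    for W in Ws:
        xs.append(np.tanh(W @ xs[-1]))
    return xs

def backprop(xs, Ws, y):
    """Returns grads[t] = ∇_{W_t} L, for L = ½‖x_{T+1} − y‖²."""
    p = xs[-1] - y                           # p_{T+1} = ∇_{x_{T+1}} L
    grads = [None] * len(Ws)
    for t in reversed(range(len(Ws))):
        z = Ws[t] @ xs[t]
        s = 1.0 - np.tanh(z) ** 2            # σ'(z) for σ = tanh
        grads[t] = np.outer(s * p, xs[t])    # ∇_{W_t} L = (σ'(z) ⊙ p_{t+1}) x_tᵀ
        p = Ws[t].T @ (s * p)                # p_t = (∇_x g_t)ᵀ p_{t+1}
    return grads

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
xs = forward(rng.normal(size=3), Ws)
grads = backprop(xs, Ws, y=np.zeros(2))
print([g.shape for g in grads])              # [(4, 3), (2, 4)]

One forward pass stores the x_t, and one backward pass propagates p_t, so all gradients cost about as much as two forward evaluations.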
Summary
Approximation Properties of Neural Networks
• Nonlinear approximation: adapted to data
• Universal approximation property, overcomes curse of
dimensionality

Optimizing Neural Networks


• (Stochastic) Gradient Descent
• For deep NNs, compute gradients using back-propagation
Homework 1