Advanced Machine Learning
Lecture 2: Linear models
Sandjai Bhulai
Vrije Universiteit Amsterdam
s.bhulai@vu.nl
8 September 2023
Linear models
Advanced Machine Learning
Polynomial curve fitting
▪ 10 points sampled from sin(2πx) plus a random disturbance
▪ Recall: sin x = x − x^3/3! + x^5/5! − x^7/7! + x^9/9! − ⋯
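A minimal sketch of how such a dataset could be generated (the noise level and the seed are assumptions, not taken from the slides):

    import numpy as np

    rng = np.random.default_rng(0)                 # assumed seed
    N = 10                                         # 10 points, as on the slide
    x = rng.uniform(0.0, 1.0, size=N)              # inputs in [0, 1]
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=N)   # assumed noise std 0.3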
Polynomial curve fitting
▪ Polynomial curve
  y(x, w) = w0 + w1x + w2x^2 + ⋯ + wMx^M = ∑_{j=0}^{M} wj x^j
▪ Performance is measured by
  E(w) = (1/2) ∑_{n=1}^{N} {y(xn, w) − tn}²
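A sketch of fitting the polynomial by minimizing E(w) with ordinary least squares on the Vandermonde matrix (continuing with the x and t generated above; the order M is a free choice):

    M = 3                                          # polynomial order (free choice)
    Phi = np.vander(x, M + 1, increasing=True)     # columns x^0, x^1, ..., x^M
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)    # minimizes E(w) = (1/2) Σ {y(xn, w) − tn}²
    E = 0.5 * np.sum((Phi @ w - t) ** 2)           # the error E(w) from the slide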
Polynomial curve fitting: order 0
y(x, w) = w0 + w1x + w2x^2 + ⋯ + wMx^M = ∑_{j=0}^{M} wj x^j
Polynomial curve fitting: order 1
y(x, w) = w0 + w1x + w2x^2 + ⋯ + wMx^M = ∑_{j=0}^{M} wj x^j
Polynomial curve fitting: order 3
y(x, w) = w0 + w1x + w2x^2 + ⋯ + wMx^M = ∑_{j=0}^{M} wj x^j
Polynomial curve fitting: order 9
y(x, w) = w0 + w1x + w2x^2 + ⋯ + wMx^M = ∑_{j=0}^{M} wj x^j
Overfitting
▪ Root mean square (RMS) error: E_RMS = √(2E(w*)/N)
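Dividing by N makes the error comparable across dataset sizes, and the square root puts it back on the scale of the targets. A small helper, under the same setup as above:

    def rms_error(Phi, t, w):
        # E_RMS = sqrt(2 E(w*) / N), with E(w) = (1/2) Σ {y(xn, w) − tn}²
        E = 0.5 * np.sum((Phi @ w - t) ** 2)
        return np.sqrt(2 * E / len(t))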
Overfitting
Effect of dataset size
▪ Polynomial of order 9 and N = 15
Effect of dataset size
▪ Polynomial of order 9 and N = 100
Regularization
▪ Penalize large coefficient values:
  Ẽ(w) = (1/2) ∑_{n=1}^{N} {y(xn, w) − tn}² + (λ/2) ∥w∥²
▪ λ becomes a model parameter
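A sketch of minimizing Ẽ(w) for the order-9 polynomial; the closed-form solution used here is derived later in this lecture, and the value of λ matches the next slide:

    lam = np.exp(-18)                              # ln λ = −18, as on the next slide
    M = 9
    Phi = np.vander(x, M + 1, increasing=True)
    # minimizer of Ẽ(w) = (1/2) Σ {y(xn, w) − tn}² + (λ/2) ∥w∥²
    w_reg = np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)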
Regularization
▪ Regularization with ln λ = −18
Regularization
▪ Regularization with ln λ = 0
Regularization
▪ E_RMS versus ln λ
Regularization
A deeper analysis
Advanced Machine Learning
What is the issue?
▪ Recall: sin x = x − x^3/3! + x^5/5! − x^7/7! + x^9/9! − ⋯
▪ sin(2πx) can be approximated arbitrarily well by a polynomial, so limited expressiveness of the model class is not the issue
Linear basis function models
▪ General model is
  y(x, w) = ∑_{j=0}^{M−1} wj φj(x) = w⊤φ(x)
▪ The φj are known as basis functions
▪ Typically, φ0(x) = 1, so that w0 acts as a bias
Linear basis function models
▪ General model is
  y(x, w) = ∑_{j=0}^{M−1} wj φj(x) = w⊤φ(x)
▪ Polynomial basis functions: φj(x) = x^j
▪ These are global functions
Linear basis function models
▪ General model is
  y(x, w) = ∑_{j=0}^{M−1} wj φj(x) = w⊤φ(x)
▪ Gaussian basis functions (see the sketch after this list):
  φj(x) = exp{ −(x − μj)² / (2s²) }
▪ These are local functions
> μj controls location
> s controls scale
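A sketch of a Gaussian design matrix; the centers μj and the scale s below are assumptions:

    def gaussian_design(x, mus, s):
        # Φ[n, j] = exp(−(xn − μj)² / (2 s²))
        return np.exp(-(x[:, None] - mus[None, :]) ** 2 / (2 * s ** 2))

    mus = np.linspace(0.0, 1.0, 9)                 # assumed centers
    Phi_gauss = gaussian_design(x, mus, s=0.1)     # assumed scale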
Linear basis function models
▪ General model is
  y(x, w) = ∑_{j=0}^{M−1} wj φj(x) = w⊤φ(x)
▪ Sigmoidal basis functions:
  φj(x) = σ((x − μj)/s), where σ(a) = 1/(1 + exp(−a))
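The sigmoidal analogue, reusing the same (assumed) centers and scale:

    def sigmoid_design(x, mus, s):
        # Φ[n, j] = σ((xn − μj) / s), with σ(a) = 1 / (1 + exp(−a))
        return 1.0 / (1.0 + np.exp(-(x[:, None] - mus[None, :]) / s))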
Maximum likelihood
▪ Assume observations from a deterministic function with added Gaussian noise:
  t = y(x, w) + ϵ, where p(ϵ | β) = 𝒩(ϵ | 0, β⁻¹)
▪ Note that 𝒩(x | μ, σ²) = (2πσ²)^{−1/2} exp{ −(x − μ)²/(2σ²) }, with precision β = 1/σ²
▪ 𝒩(x | μ, σ²) > 0 and ∫_{−∞}^{∞} 𝒩(x | μ, σ²) dx = 1
Maximum likelihood
▪ Assume observations from a deterministic function with added Gaussian noise:
  t = y(x, w) + ϵ, where p(ϵ | β) = 𝒩(ϵ | 0, β⁻¹)
▪ Note that 𝒩(x | μ, σ²) = (2πσ²)^{−1/2} exp{ −(x − μ)²/(2σ²) }
▪ 𝔼[x] = ∫_{−∞}^{∞} x 𝒩(x | μ, σ²) dx = μ
▪ 𝔼[x²] = ∫_{−∞}^{∞} x² 𝒩(x | μ, σ²) dx = μ² + σ²
▪ var[x] = 𝔼[x²] − 𝔼[x]² = σ²
Maximum likelihood
▪ Assume observations from a deterministic function with added Gaussian noise:
  t = y(x, w) + ϵ, where p(ϵ | β) = 𝒩(ϵ | 0, β⁻¹)
▪ This is the same as saying
  p(t | x, w, β) = 𝒩(t | y(x, w), β⁻¹)
▪ Recall: y(x, w) = ∑_{j=0}^{M−1} wj φj(x) = w⊤φ(x)
Maximum likelihood
▪ This is the same as saying
  p(t | x, w, β) = 𝒩(t | y(x, w), β⁻¹)
▪ Given observed inputs X = {x1, …, xN} and targets t = [t1, …, tN]⊤, we obtain the likelihood function
  p(t | X, w, β) = ∏_{n=1}^{N} 𝒩(tn | w⊤φ(xn), β⁻¹)
Maximum likelihood
▪ Taking the logarithm, we get
  ln p(t | w, β) = ∑_{n=1}^{N} ln 𝒩(tn | w⊤φ(xn), β⁻¹)
                = (N/2) ln β − (N/2) ln(2π) − β ED(w)
  where ED(w) = (1/2) ∑_{n=1}^{N} {tn − w⊤φ(xn)}²
▪ Recall: 𝒩(x | μ, σ²) = (2πσ²)^{−1/2} exp{ −(x − μ)²/(2σ²) }
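The decomposition can be checked numerically; a sketch using scipy.stats.norm, reusing the design matrix Phi from earlier (the precision β is an assumed value, and the identity holds for any w):

    from scipy.stats import norm

    beta = 1.0 / 0.3 ** 2                          # assumed precision, matching the assumed noise std
    w_any = rng.normal(size=Phi.shape[1])          # any weight vector works here
    y_pred = Phi @ w_any                           # w⊤φ(xn) for all n
    lhs = norm.logpdf(t, loc=y_pred, scale=beta ** -0.5).sum()
    E_D = 0.5 * np.sum((t - y_pred) ** 2)
    rhs = len(t) / 2 * np.log(beta) - len(t) / 2 * np.log(2 * np.pi) - beta * E_D
    assert np.isclose(lhs, rhs)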
Maximum likelihood
▪ Computing the gradient and setting it to zero yields
  ∇w ln p(t | w, β) = β ∑_{n=1}^{N} {tn − w⊤φ(xn)} φ(xn)⊤ = 0
▪ Solving for w, we get
  wML = (Φ⊤Φ)⁻¹ Φ⊤t
  where (Φ⊤Φ)⁻¹Φ⊤ is the Moore-Penrose pseudo-inverse of Φ, and the design matrix Φ has entries Φnj = φj(xn):
  Φ = ⎡ φ0(x1)  φ1(x1)  ⋯  φM−1(x1) ⎤
      ⎢ φ0(x2)  φ1(x2)  ⋯  φM−1(x2) ⎥
      ⎢   ⋮       ⋮     ⋱     ⋮    ⎥
      ⎣ φ0(xN)  φ1(xN)  ⋯  φM−1(xN) ⎦
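In code the pseudo-inverse solution is one line, although np.linalg.lstsq computes the same minimizer more stably; a sketch with the design matrix Phi from earlier:

    w_ml = np.linalg.pinv(Phi) @ t                 # equals (Φ⊤Φ)⁻¹Φ⊤t when Φ has full column rank
    w_ml2, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # same minimizer, numerically preferred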
Interpretation
▪ Consider y = ΦwML = [φ1, …, φM]wML, with y ∈ 𝒮 ⊆ 𝒯, where 𝒯 is N-dimensional and 𝒮 is M-dimensional
▪ 𝒮 is spanned by φ1, …, φM
▪ wML minimizes the distance between t and its orthogonal projection on 𝒮, i.e., y
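This geometric picture implies that the residual t − y is orthogonal to every basis vector spanning 𝒮, which can be verified numerically (a sketch, continuing with Phi and w_ml from above):

    y_proj = Phi @ w_ml                            # orthogonal projection of t onto S
    residual = t - y_proj
    # residual is orthogonal to every column of Phi (up to round-off)
    assert np.allclose(Phi.T @ residual, 0.0, atol=1e-6)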
Regularization
▪ Consider the error function
  ED(w) + λEW(w)
  (data term + regularization term)
▪ With the sum-of-squares error function and a quadratic regularizer, we get
  (1/2) ∑_{n=1}^{N} {tn − w⊤φ(xn)}² + (λ/2) w⊤w
▪ This is minimized by
  w = (λI + Φ⊤Φ)⁻¹ Φ⊤t
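A sketch of the regularized solution; note that λI + Φ⊤Φ is invertible for any λ > 0, even when Φ⊤Φ alone is singular:

    def ridge_solution(Phi, t, lam):
        # w = (λI + Φ⊤Φ)⁻¹ Φ⊤ t, the minimizer of the quadratically regularized error
        M = Phi.shape[1]
        return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)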
Regularization
▪ With a more general regularizer, we have
  (1/2) ∑_{n=1}^{N} {tn − w⊤φ(xn)}² + (λ/2) ∑_{j=1}^{M} |wj|^q
▪ q = 1 gives the lasso; q = 2 recovers the quadratic regularizer
Regularization
▪ Lasso tends to generate sparser solutions than a quadratic regularizer
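A sketch of this effect with scikit-learn; the α values are assumptions, and sklearn's α plays the role of λ:

    from sklearn.linear_model import Lasso, Ridge

    lasso = Lasso(alpha=0.01).fit(Phi, t)          # q = 1 penalty
    ridge = Ridge(alpha=0.01).fit(Phi, t)          # q = 2 penalty
    print("lasso zero coefficients:", np.sum(lasso.coef_ == 0))   # typically several exact zeros
    print("ridge zero coefficients:", np.sum(ridge.coef_ == 0))   # typically none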