LinReg
IIT Indore
https://chandreshiit.github.io
Slides credit goes to Yi, Yung
November 17, 2024
Warm-Up
https://www.youtube.com/watch?v=wYPUhge9w5c
• In case of Gaussian noise, $\theta_{ML}$ is the θ that minimizes the empirical risk with the squared loss function
◦ Models as functions = models as probabilistic models
• We find θ such that $\frac{dL}{d\theta} = 0$:
$$\frac{dL}{d\theta} = \frac{1}{2\sigma^2}\left(-2(y - X\theta)^T X\right) = \frac{1}{\sigma^2}\left(-y^T X + \theta^T X^T X\right) = 0$$
$$\iff \theta_{ML}^T X^T X = y^T X$$
$$\iff \theta_{ML}^T = y^T X (X^T X)^{-1} \qquad (X^T X \text{ is positive definite if } \operatorname{rk}(X) = D)$$
$$\iff \theta_{ML} = (X^T X)^{-1} X^T y$$
◦ $p(\mathcal{Y} \mid \mathcal{X}, \theta) = \mathcal{N}(y \mid \Phi\theta, \sigma^2 I)$
• MLE: $\theta_{ML} = (\Phi^T \Phi)^{-1} \Phi^T y$
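As a concrete check, here is a minimal NumPy sketch (mine, not from the slides; the quadratic feature map and toy data are made up) that evaluates the closed form $\theta_{ML} = (\Phi^T\Phi)^{-1}\Phi^T y$ and compares it against a generic least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up toy data: scalar inputs, targets linear in the features (bias, x, x^2)
N = 20
x = rng.uniform(-3, 3, size=N)
Phi = np.column_stack([np.ones(N), x, x**2])      # feature matrix Φ
theta_true = np.array([1.0, -2.0, 0.5])
y = Phi @ theta_true + 0.3 * rng.normal(size=N)   # Gaussian noise

# Closed-form MLE: θ_ML = (Φ^T Φ)^(-1) Φ^T y (solve a linear system, avoid explicit inverse)
theta_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Cross-check against a generic least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)
assert np.allclose(theta_ml, theta_lstsq)
print(theta_ml)
```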
Polynomial Fit
• N = 10 data points, where $x_n \sim U[-5, 5]$ and $y_n = -\sin(x_n/5) + \cos(x_n) + \epsilon$, $\epsilon \sim \mathcal{N}(0, 0.2^2)$
• Fit with a polynomial of degree 4 using ML
• MLE: prone to overfitting, where the magnitude of the parameters becomes large.
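A minimal sketch of this experiment (mine; the random seed is arbitrary, so the numbers will not match any particular figure), illustrating how the fitted parameters can become large:

```python
import numpy as np

rng = np.random.default_rng(42)

# Data as on the slide: x_n ~ U[-5, 5], y_n = -sin(x_n/5) + cos(x_n) + eps, eps ~ N(0, 0.2^2)
N = 10
x = rng.uniform(-5, 5, size=N)
y = -np.sin(x / 5) + np.cos(x) + 0.2 * rng.normal(size=N)

# Degree-4 polynomial features: Φ = [1, x, x^2, x^3, x^4]
degree = 4
Phi = np.vander(x, degree + 1, increasing=True)

# Maximum likelihood fit: θ_ML = (Φ^T Φ)^(-1) Φ^T y
theta_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Symptom of overfitting mentioned on the slide: large parameter magnitudes
print("theta_ML:", theta_ml)
print("max |theta|:", np.abs(theta_ml).max())
```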
• Gradient of the negative log-posterior:
$$-\frac{d \log p(\theta \mid \mathcal{X}, \mathcal{Y})}{d\theta} = -\frac{d \log p(\mathcal{Y} \mid \mathcal{X}, \theta)}{d\theta} - \frac{d \log p(\theta)}{d\theta}$$
• With the Gaussian prior $p(\theta) = \mathcal{N}(0, b^2 I)$, this becomes
$$-\frac{d \log p(\theta \mid \mathcal{X}, \mathcal{Y})}{d\theta} = \frac{1}{\sigma^2}\left(\theta^T \Phi^T \Phi - y^T \Phi\right) + \frac{1}{b^2}\theta^T$$
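To make this concrete, here is a small sketch (mine, assuming the prior $\mathcal{N}(0, b^2 I)$ with made-up values of σ and b) that checks the analytic gradient against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 15, 4
Phi = rng.normal(size=(N, K))
y = rng.normal(size=N)
theta = rng.normal(size=K)
sigma, b = 0.3, 1.0            # assumed noise and prior standard deviations

# Negative log-posterior (up to additive constants) for Gaussian likelihood and prior N(0, b^2 I)
def neg_log_post(th):
    return 0.5 / sigma**2 * np.sum((y - Phi @ th) ** 2) + 0.5 / b**2 * np.sum(th**2)

# Analytic gradient from the slide: (1/σ^2)(θ^T Φ^T Φ − y^T Φ) + (1/b^2) θ^T
grad = (theta @ Phi.T @ Phi - y @ Phi) / sigma**2 + theta / b**2

# Finite-difference check
eps = 1e-6
numeric = np.array([(neg_log_post(theta + eps * e) - neg_log_post(theta - eps * e)) / (2 * eps)
                    for e in np.eye(K)])
assert np.allclose(grad, numeric, atol=1e-4)
```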
• MAP vs. ML
$$\theta_{MAP} = \Big(\underbrace{\Phi^T \Phi + \frac{\sigma^2}{b^2} I}_{(*)}\Big)^{-1} \Phi^T y, \qquad \theta_{ML} = (\Phi^T \Phi)^{-1} \Phi^T y$$
• The term $\frac{\sigma^2}{b^2} I$
◦ Ensures that (∗) is symmetric and strictly positive definite
◦ Plays the role of a regularizer
• We will use this later for computing the parameter posterior distribution in Bayesian
linear regression.
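A minimal sketch (mine, reusing the made-up polynomial data above with an assumed prior scale b) comparing $\theta_{MAP}$ and $\theta_{ML}$; the $\frac{\sigma^2}{b^2} I$ term shrinks the parameter magnitudes:

```python
import numpy as np

rng = np.random.default_rng(42)

# Same toy setup as the polynomial-fit example above
N, degree = 10, 4
x = rng.uniform(-5, 5, size=N)
y = -np.sin(x / 5) + np.cos(x) + 0.2 * rng.normal(size=N)
Phi = np.vander(x, degree + 1, increasing=True)
K = Phi.shape[1]

sigma = 0.2   # noise standard deviation (treated as known here)
b = 1.0       # prior standard deviation, p(θ) = N(0, b^2 I)  (assumed value)

# θ_ML = (Φ^T Φ)^(-1) Φ^T y
theta_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# θ_MAP = (Φ^T Φ + (σ^2/b^2) I)^(-1) Φ^T y  -- the added term acts as a regularizer
theta_map = np.linalg.solve(Phi.T @ Phi + (sigma**2 / b**2) * np.eye(K), Phi.T @ y)

print("||theta_ML|| :", np.linalg.norm(theta_ml))
print("||theta_MAP||:", np.linalg.norm(theta_map))
```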
Chapter 9.3.4. For ease of understanding, I’ve slightly changed the organization of these lecture slides from that of the textbook.
Parameter Posterior Distribution (1)
$$p(\theta \mid \mathcal{X}, \mathcal{Y}) = \mathcal{N}(\theta \mid m_N, S_N), \quad \text{where}$$
$$S_N = \left(S_0^{-1} + \sigma^{-2}\Phi^T\Phi\right)^{-1}, \qquad m_N = S_N\left(S_0^{-1}m_0 + \sigma^{-2}\Phi^T y\right)$$
(Proof Sketch) Up to additive constants,
$$-\log p(\mathcal{Y} \mid \mathcal{X}, \theta) - \log p(\theta) = \frac{1}{2}\left(\sigma^{-2} y^T y - 2\sigma^{-2} y^T\Phi\theta + \theta^T\sigma^{-2}\Phi^T\Phi\,\theta + \theta^T S_0^{-1}\theta - 2 m_0^T S_0^{-1}\theta + m_0^T S_0^{-1} m_0\right)$$
$$= \frac{1}{2}\left(\theta^T\left(\sigma^{-2}\Phi^T\Phi + S_0^{-1}\right)\theta - 2\left(\sigma^{-2}\Phi^T y + S_0^{-1} m_0\right)^T\theta\right) + \text{const}$$
• The first term is quadratic in θ, the second is linear in θ
• $p(\theta \mid \mathcal{X}, \mathcal{Y}) \propto \exp(-(\text{quadratic in } \theta)) \implies$ Gaussian distribution
• Assume that $p(\theta \mid \mathcal{X}, \mathcal{Y}) = \mathcal{N}(\theta \mid m_N, S_N)$, and find $m_N$ and $S_N$.
$$-\log \mathcal{N}(\theta \mid m_N, S_N) = \frac{1}{2}(\theta - m_N)^T S_N^{-1}(\theta - m_N) + \text{const}$$
$$= \frac{1}{2}\left(\theta^T S_N^{-1}\theta - 2 m_N^T S_N^{-1}\theta + m_N^T S_N^{-1} m_N\right) + \text{const}$$
• Thus, $S_N^{-1} = \sigma^{-2}\Phi^T\Phi + S_0^{-1}$ and $m_N^T S_N^{-1} = \left(\sigma^{-2}\Phi^T y + S_0^{-1} m_0\right)^T$
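A minimal NumPy sketch (mine, on made-up data) of this posterior update:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 25, 3
Phi = rng.normal(size=(N, K))                    # feature matrix Φ
theta_true = rng.normal(size=K)
sigma = 0.5
y = Phi @ theta_true + sigma * rng.normal(size=N)

# Gaussian prior p(θ) = N(m0, S0)
m0 = np.zeros(K)
S0 = np.eye(K)

# Posterior p(θ | X, Y) = N(θ | mN, SN)
S0_inv = np.linalg.inv(S0)
SN = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma**2)
mN = SN @ (S0_inv @ m0 + Phi.T @ y / sigma**2)

print("posterior mean:", mN)
print("posterior cov :\n", SN)
```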
• Likelihood: $p(\mathcal{Y} \mid \mathcal{X}, \theta)$, Marginal likelihood: $p(\mathcal{Y} \mid \mathcal{X}) = \int p(\mathcal{Y} \mid \mathcal{X}, \theta)\, p(\theta)\, d\theta$
• Recall that the marginal likelihood is important for model selection via the Bayes factor:
$$\text{(Posterior odds)} = \frac{P(M_1 \mid \mathcal{D})}{P(M_2 \mid \mathcal{D})} = \frac{P(\mathcal{D} \mid M_1)P(M_1)/P(\mathcal{D})}{P(\mathcal{D} \mid M_2)P(M_2)/P(\mathcal{D})} = \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{Prior odds}}\; \underbrace{\frac{P(\mathcal{D} \mid M_1)}{P(\mathcal{D} \mid M_2)}}_{\text{Bayes factor}}$$
$$p(\mathcal{Y} \mid \mathcal{X}) = \int p(\mathcal{Y} \mid \mathcal{X}, \theta)\, p(\theta)\, d\theta = \int \mathcal{N}(y \mid \Phi\theta, \sigma^2 I)\, \mathcal{N}(\theta \mid m_0, S_0)\, d\theta = \mathcal{N}(y \mid \Phi m_0, \Phi S_0 \Phi^T + \sigma^2 I)$$
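A minimal sketch (mine; data, prior, and the two candidate models are made up) that evaluates the closed-form marginal likelihood and uses it to compare two feature maps, i.e., a Bayes factor with equal prior odds:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
N = 20
x = rng.uniform(-3, 3, size=N)
y = 1.0 - 2.0 * x + 0.3 * rng.normal(size=N)     # data generated by a degree-1 model
sigma = 0.3

def log_marginal_likelihood(Phi, y, m0, S0, sigma):
    # log N(y | Φ m0, Φ S0 Φ^T + σ^2 I)
    mean = Phi @ m0
    cov = Phi @ S0 @ Phi.T + sigma**2 * np.eye(len(y))
    return multivariate_normal(mean=mean, cov=cov).logpdf(y)

# Candidate models: degree-1 vs degree-4 polynomial features
log_ml = {}
for degree in (1, 4):
    Phi = np.vander(x, degree + 1, increasing=True)
    K = Phi.shape[1]
    log_ml[degree] = log_marginal_likelihood(Phi, y, np.zeros(K), np.eye(K), sigma)
    print(f"degree {degree}: log p(Y|X) = {log_ml[degree]:.2f}")

# With equal prior odds, the log Bayes factor is the difference of log marginal likelihoods
print("log Bayes factor (M1 vs M2):", log_ml[1] - log_ml[4])
```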
• For $f(x) = x^T\theta + \mathcal{N}(0, \sigma^2)$ with a scalar parameter ($X \in \mathbb{R}^N$, $\theta \in \mathbb{R}$):
$$\theta_{ML} = (X^T X)^{-1} X^T y = \frac{X^T y}{X^T X} \in \mathbb{R}, \qquad X\theta_{ML} = \frac{X X^T}{X^T X}\, y$$
◦ Orthogonal projection of y onto the one-dimensional subspace spanned by X
• For $f(x) = \phi^T(x)\theta + \mathcal{N}(0, \sigma^2)$ with $K$ features ($\Phi \in \mathbb{R}^{N \times K}$):
$$\theta_{ML} = (\Phi^T \Phi)^{-1} \Phi^T y \in \mathbb{R}^K, \qquad \Phi\theta_{ML} = \Phi(\Phi^T \Phi)^{-1}\Phi^T\, y$$
◦ Orthogonal projection of y onto the K-dimensional subspace spanned by the columns of Φ
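A small sketch (mine, with made-up Φ and y) verifying numerically that $\Phi\theta_{ML}$ is exactly this orthogonal projection:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 12, 3
Phi = rng.normal(size=(N, K))
y = rng.normal(size=N)

# ML prediction on the training inputs
theta_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
y_hat = Phi @ theta_ml

# Projection matrix onto span(columns of Φ): P = Φ (Φ^T Φ)^(-1) Φ^T
P = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)
assert np.allclose(y_hat, P @ y)

# The residual is orthogonal to every column of Φ
assert np.allclose(Phi.T @ (y - y_hat), 0.0, atol=1e-10)
```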
• Linear regression with a Gaussian likelihood and a conjugate Gaussian prior: nice analytical results and closed forms
• Other forms of likelihoods for other applications (e.g., classification)
• GLM (generalized linear model): y = σ ◦ f (σ: activation function)
◦ No longer linear in θ
◦ Logistic regression: $\sigma(f) = \frac{1}{1 + \exp(-f)} \in [0, 1]$ (interpreted as the probability of y = 1)
◦ Building blocks of (deep) feedforward neural nets
◦ y = σ(Ax + b). A: weight matrix, b: bias vector
◦ K-layer deep neural nets: $x_{k+1} = f_k(x_k)$, $f_k(x_k) = \sigma_k(A_k x_k + b_k)$
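A minimal sketch (mine; weights and the feature vector are made up) of the y = σ ∘ f composition for logistic regression and of one feedforward layer in the same pattern:

```python
import numpy as np

def sigmoid(f):
    """Logistic activation: maps f in R to (0, 1), read as the probability of y = 1."""
    return 1.0 / (1.0 + np.exp(-f))

rng = np.random.default_rng(0)

# GLM / logistic regression: y = σ(φ(x)^T θ)
theta = rng.normal(size=3)
phi_x = np.array([1.0, 0.5, -1.2])      # some feature vector φ(x)
p_y1 = sigmoid(phi_x @ theta)           # probability that y = 1

# One feedforward layer in the same pattern: x_{k+1} = σ_k(A_k x_k + b_k)
A = rng.normal(size=(4, 3))             # weight matrix
b = np.zeros(4)                         # bias vector
x_next = sigmoid(A @ phi_x + b)

print(p_y1, x_next)
```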
• Gaussian process
◦ A distribution over parameters → a distribution over functions
◦ Gaussian process: distribution over functions without detouring via parameters
◦ Closely related to BLR and support vector regression; also interpreted as a Bayesian neural network with a single hidden layer and an infinite number of units
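The slides defer this topic, but as a pointer, here is a minimal GP regression sketch (mine; the RBF kernel, its lengthscale, and the toy data are assumptions) that produces a posterior over function values directly, without an explicit parameter vector:

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel k(a, b) = exp(-(a - b)^2 / (2 l^2)) for scalar inputs."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
x_train = rng.uniform(-5, 5, size=8)
y_train = -np.sin(x_train / 5) + np.cos(x_train) + 0.2 * rng.normal(size=8)
x_test = np.linspace(-5, 5, 50)
sigma = 0.2

# GP posterior over f(x_test): a distribution over functions, no detour via parameters
K = rbf_kernel(x_train, x_train) + sigma**2 * np.eye(len(x_train))
K_s = rbf_kernel(x_test, x_train)
K_ss = rbf_kernel(x_test, x_test)

mean = K_s @ np.linalg.solve(K, y_train)
cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
print(mean[:5])
```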
• Gaussian likelihood, but non-Gaussian prior
◦ When N ≪ D (small training data)
◦ Prior that enforces sparsity, e.g., Laplace prior
◦ A linear regression with the Laplace prior = linear regression with LASSO (L1
regularization)
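A minimal sketch (mine; λ, the step size, and the sparse toy problem are made up) of L1-regularized least squares solved by proximal gradient descent (ISTA), which corresponds to the MAP estimate under a Laplace prior:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(0)
N, D = 20, 50                        # N << D: fewer data points than parameters
theta_true = np.zeros(D)
theta_true[:3] = [2.0, -1.5, 1.0]    # sparse ground truth
Phi = rng.normal(size=(N, D))
y = Phi @ theta_true + 0.1 * rng.normal(size=N)

# ISTA for min_θ 0.5 * ||y - Φθ||^2 + λ ||θ||_1
lam = 0.5
step = 1.0 / np.linalg.norm(Phi, 2) ** 2   # 1 / Lipschitz constant of the smooth part
theta = np.zeros(D)
for _ in range(500):
    grad = Phi.T @ (Phi @ theta - y)
    theta = soft_threshold(theta - step * grad, step * lam)

print("number of nonzero coefficients:", np.count_nonzero(np.abs(theta) > 1e-6))
```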