
Lecture: Linear Regression¹

CS 2XX: Mathematics for AI and ML

Chandresh Kumar Maurya

IIT Indore
https://chandreshiit.github.io

November 17, 2024

¹ Slides credit: Yi, Yung
Warm-Up

Please watch this tutorial video by Luis Serrano on PCA.

https://www.youtube.com/watch?v=wYPUhge9w5c



Roadmap

(1) Problem Formulation


(2) Parameter Estimation: ML
(3) Parameter Estimation: MAP
(4) Bayesian Linear Regression
(5) Maximum Likelihood as Orthogonal Projection



Regression Problem

• For some input values $x_n$, we observe noisy function values $y_n = f(x_n) + \epsilon$


• Goal: infer the function f that generalizes well to function values at new inputs
• Applications: time-series analysis, control and robotics, image recognition, etc.



Formulation

Notation for simplification (this is the notation the textbook uses):

$$p(y \mid x) = p_{Y \mid X}(y \mid x), \quad Y \sim \mathcal{N}(\mu, \sigma^2) \;\xrightarrow{\text{simplifies}}\; \mathcal{N}(y \mid f(x), \sigma^2)$$

• Assume: linear regression, Gaussian noise
• $y = f(x) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$
• Likelihood: for $x \in \mathbb{R}^D$ and $y \in \mathbb{R}$, $p(y \mid x) = \mathcal{N}(y \mid f(x), \sigma^2)$
• Linear regression with parameter $\theta \in \mathbb{R}^D$, i.e., $f(x) = x^T \theta$:
$$p(y \mid x) = \mathcal{N}(y \mid x^T \theta, \sigma^2) \iff y = x^T \theta + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

Likelihood with Gaussian noise: $p(y \mid x) = \mathcal{N}(y \mid x^T \theta, \sigma^2)$



Parameter Estimation

• Training set $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$

• Assuming the $N$ data samples are i.i.d., the likelihood factorizes as
$$p(\mathcal{Y} \mid \mathcal{X}, \theta) = \prod_{n=1}^{N} p(y_n \mid x_n, \theta) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid x_n^T \theta, \sigma^2),$$
where $\mathcal{X} = \{x_1, \ldots, x_N\}$ and $\mathcal{Y} = \{y_1, \ldots, y_N\}$

• Estimation methods: ML and MAP
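
As a quick numerical sanity check of this factorization, here is a minimal sketch (the data, dimensions, and noise level below are illustrative assumptions, not from the slides) that evaluates the log-likelihood both as a sum of per-sample Gaussian log-densities and via the equivalent squared-error form:

```python
import numpy as np
from scipy.stats import norm

# Illustrative data: N samples, D features, noise level sigma (all assumed).
rng = np.random.default_rng(0)
N, D, sigma = 20, 3, 0.5
X = rng.normal(size=(N, D))                 # rows are x_n^T
theta = rng.normal(size=D)
y = X @ theta + sigma * rng.normal(size=N)

# log p(Y | X, theta) = sum_n log N(y_n | x_n^T theta, sigma^2)
loglik_factored = norm.logpdf(y, loc=X @ theta, scale=sigma).sum()

# The same quantity written with the squared-error form.
loglik_squared = (-0.5 * N * np.log(2 * np.pi * sigma**2)
                  - 0.5 * np.sum((y - X @ theta) ** 2) / sigma**2)

print(np.isclose(loglik_factored, loglik_squared))  # True
```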



MLE (Maximum Likelihood Estimation) (1)
 
• $\theta_{ML} = \arg\max_{\theta}\, p(\mathcal{Y} \mid \mathcal{X}, \theta) = \arg\min_{\theta}\, \big(-\log p(\mathcal{Y} \mid \mathcal{X}, \theta)\big)$

• For Gaussian noise with $X = [x_1, \ldots, x_N]^T$ and $y = [y_1, \ldots, y_N]^T$,

$$-\log p(\mathcal{Y} \mid \mathcal{X}, \theta) = -\log \prod_{n=1}^{N} p(y_n \mid x_n, \theta) = -\sum_{n=1}^{N} \log p(y_n \mid x_n, \theta) = \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n^T \theta)^2 + \text{const} = \frac{1}{2\sigma^2} \|y - X\theta\|^2 + \text{const}$$

Negative log-likelihood for $y = x^T \theta + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$:
$$-\log p(\mathcal{Y} \mid \mathcal{X}, \theta) = \frac{1}{2\sigma^2} \|y - X\theta\|^2 + \text{const}$$



MLE (Maximum Likelihood Estimation) (2)

• For Gaussian noise with $X = [x_1, \ldots, x_N]^T$ and $y = [y_1, \ldots, y_N]^T$,

$$\theta_{ML} = \arg\min_{\theta} \frac{1}{2\sigma^2} \|y - X\theta\|^2, \qquad L(\theta) = \frac{1}{2\sigma^2} \|y - X\theta\|^2$$

• In the case of Gaussian noise, $\theta_{ML}$ is the $\theta$ that minimizes the empirical risk with the squared loss function
  ◦ The "models as functions" view and the "models as probabilistic models" view lead to the same estimate here



MLE (Maximum Likelihood Estimation) (3)

• We find $\theta$ such that $\frac{dL}{d\theta} = 0$:

$$\frac{dL}{d\theta} = \frac{1}{2\sigma^2} \big(-2(y - X\theta)^T X\big) = \frac{1}{\sigma^2} \big(-y^T X + \theta^T X^T X\big) = 0$$
$$\iff \theta_{ML}^T X^T X = y^T X$$
$$\iff \theta_{ML}^T = y^T X (X^T X)^{-1} \qquad (X^T X \text{ is positive definite if } \operatorname{rk}(X) = D)$$
$$\iff \theta_{ML} = (X^T X)^{-1} X^T y$$
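
A minimal numpy sketch of this closed-form solution on synthetic data (the dimensions, noise level, and variable names are illustrative assumptions):

```python
import numpy as np

# Synthetic data; dimensions, noise level, and names are illustrative.
rng = np.random.default_rng(0)
N, D, sigma = 50, 3, 0.2
X = rng.normal(size=(N, D))                 # design matrix, rows x_n^T
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + sigma * rng.normal(size=N)

# theta_ML = (X^T X)^{-1} X^T y: solve the normal equations rather than
# forming the inverse explicitly.
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_ml)  # close to theta_true when the noise is small
```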



MLE with Features
• Linear regression: linear in the parameters
  ◦ $\phi(x)^T \theta$ is also fine, where $\phi(x)$ can be non-linear (we will cover this later)
  ◦ $\phi(x)$ are the features
• Linear regression with parameter $\theta \in \mathbb{R}^K$ and features $\phi : \mathbb{R}^D \mapsto \mathbb{R}^K$:
$$p(y \mid x) = \mathcal{N}(y \mid \phi(x)^T \theta, \sigma^2) \iff y = \phi(x)^T \theta + \epsilon = \sum_{k=0}^{K-1} \theta_k \phi_k(x) + \epsilon$$
• Example: polynomial regression. For $x \in \mathbb{R}$ and $\theta \in \mathbb{R}^K$, we lift the original 1-D input into a $K$-dimensional feature space with monomials $x^k$:
$$\phi(x) = \begin{bmatrix} \phi_0(x) \\ \vdots \\ \phi_{K-1}(x) \end{bmatrix} = \begin{bmatrix} 1 \\ \vdots \\ x^{K-1} \end{bmatrix} \in \mathbb{R}^K \implies f(x) = \sum_{k=0}^{K-1} \theta_k x^k$$



Feature Matrix and MLE
• Now, for the entire training set $\{x_1, \ldots, x_N\}$, define the feature matrix
$$\Phi := \begin{bmatrix} \phi^T(x_1) \\ \vdots \\ \phi^T(x_N) \end{bmatrix} = \begin{bmatrix} \phi_0(x_1) & \cdots & \phi_{K-1}(x_1) \\ \vdots & & \vdots \\ \phi_0(x_N) & \cdots & \phi_{K-1}(x_N) \end{bmatrix} \in \mathbb{R}^{N \times K}, \qquad \Phi_{ij} = \phi_j(x_i), \quad \phi_j : \mathbb{R}^D \mapsto \mathbb{R}$$

• Negative log-likelihood: similarly to the case $y = X\theta$,
  ◦ $p(\mathcal{Y} \mid \mathcal{X}, \theta) = \mathcal{N}(y \mid \Phi\theta, \sigma^2 I)$
  ◦ Negative log-likelihood for $y = \phi^T(x)\theta + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$:
$$-\log p(\mathcal{Y} \mid \mathcal{X}, \theta) = \frac{1}{2\sigma^2} \|y - \Phi\theta\|^2 + \text{const}$$

• MLE: $\theta_{ML} = (\Phi^T \Phi)^{-1} \Phi^T y$ (see the sketch below)
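
A sketch of the feature-matrix version, assuming monomial features built with np.vander (the data, dimensions, and feature choice are illustrative):

```python
import numpy as np

# Illustrative scalar inputs and monomial features phi_j(x) = x^j.
rng = np.random.default_rng(0)
N, K, sigma = 30, 4, 0.3
x = rng.uniform(-1, 1, size=N)
y = 0.5 - x + 2 * x**3 + sigma * rng.normal(size=N)

# Feature matrix Phi with Phi[i, j] = phi_j(x_i) = x_i^j, j = 0, ..., K-1.
Phi = np.vander(x, K, increasing=True)

# theta_ML = (Phi^T Phi)^{-1} Phi^T y
theta_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(theta_ml)
```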
Polynomial Fit
• $N = 10$ data points, where $x_n \sim U[-5, 5]$ and $y_n = -\sin(x_n/5) + \cos(x_n) + \epsilon$, $\epsilon \sim \mathcal{N}(0, 0.2^2)$
• Fit with a polynomial of degree 4 using ML (a sketch follows below)
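
A sketch that reproduces the setup described above, assuming the degree-4 fit includes a constant term (so 5 features) and picking an arbitrary random seed:

```python
import numpy as np

# Setup from the slide; the random seed and evaluation grid are my own choices.
rng = np.random.default_rng(42)
N, sigma, degree = 10, 0.2, 4
x = rng.uniform(-5, 5, size=N)
y = -np.sin(x / 5) + np.cos(x) + sigma * rng.normal(size=N)

Phi = np.vander(x, degree + 1, increasing=True)      # 1, x, ..., x^4
theta_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Evaluate the fitted polynomial on a dense grid (e.g., for plotting).
x_grid = np.linspace(-5, 5, 200)
f_grid = np.vander(x_grid, degree + 1, increasing=True) @ theta_ml
```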



Overfitting in Linear Regression

• A higher polynomial degree always fits the training data better (the training error monotonically decreases)

• The test error, however, increases beyond some polynomial degree
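
A sketch of this effect, assuming a fresh test set drawn from the same distribution as the training data:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.2

def make_data(n):
    x = rng.uniform(-5, 5, size=n)
    return x, -np.sin(x / 5) + np.cos(x) + sigma * rng.normal(size=n)

def rmse(theta, x, y):
    Phi = np.vander(x, len(theta), increasing=True)
    return np.sqrt(np.mean((Phi @ theta - y) ** 2))

x_train, y_train = make_data(10)   # small training set
x_test, y_test = make_data(200)    # held-out test set

for degree in range(10):
    Phi = np.vander(x_train, degree + 1, increasing=True)
    theta = np.linalg.lstsq(Phi, y_train, rcond=None)[0]
    # Training RMSE keeps shrinking with the degree; test RMSE
    # eventually grows once the model starts to overfit.
    print(degree, rmse(theta, x_train, y_train), rmse(theta, x_test, y_test))
```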



MAPE (Maximum A Posteriori Estimation)

• MLE is prone to overfitting, where the magnitudes of the parameters become large.

• A prior distribution $p(\theta)$ helps: it encodes which values of $\theta$ are plausible.

• MAPE and Bayes' theorem:

$$p(\theta \mid \mathcal{X}, \mathcal{Y}) = \frac{p(\mathcal{Y} \mid \mathcal{X}, \theta)\, p(\theta)}{p(\mathcal{Y} \mid \mathcal{X})} \implies \theta_{MAP} \in \arg\min_{\theta} \big(-\log p(\mathcal{Y} \mid \mathcal{X}, \theta) - \log p(\theta)\big)$$

• Gradient:
$$-\frac{d \log p(\theta \mid \mathcal{X}, \mathcal{Y})}{d\theta} = -\frac{d \log p(\mathcal{Y} \mid \mathcal{X}, \theta)}{d\theta} - \frac{d \log p(\theta)}{d\theta}$$



MAPE for Gaussian Prior (1)

• Example: a (conjugate) Gaussian prior $p(\theta) = \mathcal{N}(0, b^2 I)$
  ◦ For a Gaussian likelihood, a Gaussian prior $\implies$ a Gaussian posterior L6(6)
• Negative log-posterior

Negative log-posterior for $y = \phi^T(x)\theta + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, and prior $p(\theta) = \mathcal{N}(0, b^2 I)$:
$$-\log p(\theta \mid \mathcal{X}, \mathcal{Y}) = \frac{1}{2\sigma^2} (y - \Phi\theta)^T (y - \Phi\theta) + \frac{1}{2b^2} \theta^T \theta + \text{const}$$

• Gradient:
$$-\frac{d \log p(\theta \mid \mathcal{X}, \mathcal{Y})}{d\theta} = \frac{1}{\sigma^2} \big(\theta^T \Phi^T \Phi - y^T \Phi\big) + \frac{1}{b^2} \theta^T$$



MAPE for Gaussian Prior (2)

• MAP vs. ML:

$$\theta_{MAP} = \Big( \underbrace{\Phi^T \Phi + \frac{\sigma^2}{b^2} I}_{(*)} \Big)^{-1} \Phi^T y, \qquad \theta_{ML} = (\Phi^T \Phi)^{-1} \Phi^T y$$

• The term $\frac{\sigma^2}{b^2} I$
  ◦ ensures that $(*)$ is symmetric and strictly positive definite
  ◦ plays the role of a regularizer (see the sketch below)
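
A sketch comparing the two estimates on the polynomial-features setup, with illustrative values assumed for $\sigma$ and $b$:

```python
import numpy as np

# Polynomial features as before; sigma and b are assumed values.
rng = np.random.default_rng(1)
N, K, sigma, b = 10, 5, 0.2, 1.0
x = rng.uniform(-5, 5, size=N)
y = -np.sin(x / 5) + np.cos(x) + sigma * rng.normal(size=N)
Phi = np.vander(x, K, increasing=True)

theta_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
theta_map = np.linalg.solve(Phi.T @ Phi + (sigma**2 / b**2) * np.eye(K),
                            Phi.T @ y)

# The MAP estimate is shrunk toward zero relative to the ML estimate.
print(np.linalg.norm(theta_ml), np.linalg.norm(theta_map))
```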



Aside: MAPE for General Gaussian Prior (3)

• Example: a (conjugate) Gaussian prior $p(\theta) = \mathcal{N}(m_0, S_0)$

• Negative log-posterior

Negative log-posterior for $y = \phi^T(x)\theta + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, and prior $p(\theta) = \mathcal{N}(m_0, S_0)$:
$$-\log p(\theta \mid \mathcal{X}, \mathcal{Y}) = \frac{1}{2\sigma^2} (y - \Phi\theta)^T (y - \Phi\theta) + \frac{1}{2} (\theta - m_0)^T S_0^{-1} (\theta - m_0) + \text{const}$$

• We will use this later for computing the parameter posterior distribution in Bayesian
linear regression.



Regularization: MAPE vs. Explicit Regularizer

• Explicit regularizer in regularized least squares (RLS):
$$\|y - \Phi\theta\|^2 + \lambda \|\theta\|^2$$
• MAPE with Gaussian prior $p(\theta) = \mathcal{N}(0, b^2 I)$
  ◦ Negative log-Gaussian prior:
$$-\log p(\theta) = \frac{1}{2b^2} \theta^T \theta + \text{const}$$
  ◦ $\lambda = 1/(2b^2)$ plays the role of the regularization parameter
• It is then not surprising that
$$\theta_{RLS} = \big(\Phi^T \Phi + \lambda I\big)^{-1} \Phi^T y$$
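
A quick numerical check of this closed form: since $\|y - \Phi\theta\|^2 + \lambda\|\theta\|^2 = \|[y; 0] - [\Phi; \sqrt{\lambda}\, I]\theta\|^2$, the same minimizer can be obtained from an augmented least-squares problem (a sketch with illustrative data and $\lambda$):

```python
import numpy as np

# Illustrative data and regularization strength.
rng = np.random.default_rng(2)
N, K, lam = 10, 5, 0.1
x = rng.uniform(-5, 5, size=N)
y = -np.sin(x / 5) + np.cos(x) + 0.2 * rng.normal(size=N)
Phi = np.vander(x, K, increasing=True)

# Closed form: theta_RLS = (Phi^T Phi + lambda I)^{-1} Phi^T y
theta_rls = np.linalg.solve(Phi.T @ Phi + lam * np.eye(K), Phi.T @ y)

# Equivalent augmented least-squares formulation.
Phi_aug = np.vstack([Phi, np.sqrt(lam) * np.eye(K)])
y_aug = np.concatenate([y, np.zeros(K)])
theta_aug = np.linalg.lstsq(Phi_aug, y_aug, rcond=None)[0]

print(np.allclose(theta_rls, theta_aug))  # True
```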



Bayesian Linear Regression
• Earlier, ML and MAP. Now, fully Bayesian L8(4)
• Model:
  ◦ prior: $p(\theta) = \mathcal{N}(m_0, S_0)$
  ◦ likelihood: $p(y \mid x, \theta) = \mathcal{N}\big(y \mid \phi^T(x)\theta, \sigma^2\big)$
  ◦ joint: $p(y, \theta \mid x) = p(y \mid x, \theta)\, p(\theta)$
• Goal: for an input $x_*$, we want to compute the posterior predictive distribution² of $y_*$:
$$p(y_* \mid x_*, \mathcal{X}, \mathcal{Y}) = \int \underbrace{p(y_* \mid x_*, \theta)}_{\text{likelihood}}\; \underbrace{p(\theta \mid \mathcal{X}, \mathcal{Y})}_{(*)}\; d\theta$$
  ◦ $(*)$: the parameter posterior distribution, which needs to be computed

² Chapter 9.3.4. For ease of understanding, I've slightly changed the organization of these lecture slides from that of the textbook.
Parameter Posterior Distribution (1)

• Parameter posterior distribution Chapter 9.3.3:

$$p(\theta \mid \mathcal{X}, \mathcal{Y}) = \mathcal{N}(\theta \mid m_N, S_N), \quad \text{where} \quad S_N = \big(S_0^{-1} + \sigma^{-2} \Phi^T \Phi\big)^{-1}, \quad m_N = S_N \big(S_0^{-1} m_0 + \sigma^{-2} \Phi^T y\big)$$

(Proof sketch)

• From the negative log-posterior for the general Gaussian prior,

$$-\log p(\theta \mid \mathcal{X}, \mathcal{Y}) = \frac{1}{2\sigma^2} (y - \Phi\theta)^T (y - \Phi\theta) + \frac{1}{2} (\theta - m_0)^T S_0^{-1} (\theta - m_0) + \text{const}$$



Parameter Posterior Distribution (2)

$$= \frac{1}{2} \Big( \sigma^{-2} y^T y - 2\sigma^{-2} y^T \Phi \theta + \theta^T \sigma^{-2} \Phi^T \Phi \theta + \theta^T S_0^{-1} \theta - 2 m_0^T S_0^{-1} \theta + m_0^T S_0^{-1} m_0 \Big)$$
$$= \frac{1}{2} \Big( \theta^T \big(\sigma^{-2} \Phi^T \Phi + S_0^{-1}\big) \theta - 2 \big(\sigma^{-2} \Phi^T y + S_0^{-1} m_0\big)^T \theta \Big) + \text{const}$$

• Quadratic term in $\theta$: $\theta^T (\sigma^{-2} \Phi^T \Phi + S_0^{-1}) \theta$; linear term in $\theta$: $-2 (\sigma^{-2} \Phi^T y + S_0^{-1} m_0)^T \theta$
• $p(\theta \mid \mathcal{X}, \mathcal{Y}) \propto \exp(\text{quadratic in } \theta) \implies$ Gaussian distribution
• Assume that $p(\theta \mid \mathcal{X}, \mathcal{Y}) = \mathcal{N}(\theta \mid m_N, S_N)$, and find $m_N$ and $S_N$:

$$-\log \mathcal{N}(\theta \mid m_N, S_N) = \frac{1}{2} (\theta - m_N)^T S_N^{-1} (\theta - m_N) + \text{const} = \frac{1}{2} \Big( \theta^T S_N^{-1} \theta - 2 m_N^T S_N^{-1} \theta + m_N^T S_N^{-1} m_N \Big) + \text{const}$$

• Thus, $S_N^{-1} = \sigma^{-2} \Phi^T \Phi + S_0^{-1}$ and $m_N^T S_N^{-1} = \big(\sigma^{-2} \Phi^T y + S_0^{-1} m_0\big)^T$
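
A sketch that computes $m_N$ and $S_N$ numerically, assuming an illustrative prior $m_0 = 0$, $S_0 = b^2 I$; with this zero-mean isotropic prior, the posterior mean reduces to the MAP estimate from earlier:

```python
import numpy as np

# Illustrative data and prior: m_0 = 0, S_0 = b^2 I (assumed values).
rng = np.random.default_rng(3)
N, K, sigma, b = 10, 5, 0.2, 1.0
x = rng.uniform(-5, 5, size=N)
y = -np.sin(x / 5) + np.cos(x) + sigma * rng.normal(size=N)
Phi = np.vander(x, K, increasing=True)

m0, S0 = np.zeros(K), b**2 * np.eye(K)
S0_inv = np.linalg.inv(S0)

# S_N = (S_0^{-1} + sigma^{-2} Phi^T Phi)^{-1}
SN = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma**2)
# m_N = S_N (S_0^{-1} m_0 + sigma^{-2} Phi^T y)
mN = SN @ (S0_inv @ m0 + Phi.T @ y / sigma**2)

# With this zero-mean isotropic prior, m_N equals the MAP estimate.
theta_map = np.linalg.solve(Phi.T @ Phi + (sigma**2 / b**2) * np.eye(K),
                            Phi.T @ y)
print(np.allclose(mN, theta_map))  # True
```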



Posterior Predictions (1)

• Posterior predictive distribution L6(5):

$$p(y_* \mid x_*, \mathcal{X}, \mathcal{Y}) = \int p(y_* \mid x_*, \theta)\, p(\theta \mid \mathcal{X}, \mathcal{Y})\, d\theta$$
$$= \int \mathcal{N}\big(y_* \mid \phi^T(x_*)\theta, \sigma^2\big)\, \mathcal{N}(\theta \mid m_N, S_N)\, d\theta$$
$$= \mathcal{N}\big(y_* \mid \phi^T(x_*)\, m_N,\; \phi^T(x_*)\, S_N\, \phi(x_*) + \sigma^2\big)$$

• The predictive mean $\phi^T(x_*)\, m_N$ coincides with the prediction made by the MAP estimate $\theta_{MAP}$
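
A self-contained sketch of the predictive mean and variance at a test input $x_*$, under the same illustrative prior and data assumptions as in the previous sketch:

```python
import numpy as np

# Same illustrative prior and data assumptions as before.
rng = np.random.default_rng(3)
N, K, sigma, b = 10, 5, 0.2, 1.0
x = rng.uniform(-5, 5, size=N)
y = -np.sin(x / 5) + np.cos(x) + sigma * rng.normal(size=N)
Phi = np.vander(x, K, increasing=True)

S0_inv = np.eye(K) / b**2                       # prior N(0, b^2 I)
SN = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma**2)
mN = SN @ (Phi.T @ y / sigma**2)

x_star = 1.5
phi_star = np.vander([x_star], K, increasing=True)[0]
pred_mean = phi_star @ mN
pred_var = phi_star @ SN @ phi_star + sigma**2  # includes the noise variance
print(pred_mean, pred_var)
```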



Posterior Predictions (2)

• BLR: Bayesian Linear Regression



Computing Marginal Likelihood

• Likelihood: $p(\mathcal{Y} \mid \mathcal{X}, \theta)$; marginal likelihood: $p(\mathcal{Y} \mid \mathcal{X}) = \int p(\mathcal{Y} \mid \mathcal{X}, \theta)\, p(\theta)\, d\theta$
• Recall that the marginal likelihood is important for model selection via the Bayes factor:

$$(\text{Posterior odds}) = \frac{P(M_1 \mid D)}{P(M_2 \mid D)} = \frac{P(D \mid M_1) P(M_1) / P(D)}{P(D \mid M_2) P(M_2) / P(D)} = \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{Prior odds}}\; \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}}$$

• For Bayesian linear regression:

$$p(\mathcal{Y} \mid \mathcal{X}) = \int p(\mathcal{Y} \mid \mathcal{X}, \theta)\, p(\theta)\, d\theta = \int \mathcal{N}(y \mid \Phi\theta, \sigma^2 I)\, \mathcal{N}(\theta \mid m_0, S_0)\, d\theta = \mathcal{N}\big(y \mid \Phi m_0,\; \Phi S_0 \Phi^T + \sigma^2 I\big)$$
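
Since the marginal likelihood is itself a Gaussian in $y$, it can be evaluated directly; a sketch using scipy.stats.multivariate_normal (the data and prior values are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative data and prior.
rng = np.random.default_rng(4)
N, K, sigma, b = 10, 5, 0.2, 1.0
x = rng.uniform(-5, 5, size=N)
y = -np.sin(x / 5) + np.cos(x) + sigma * rng.normal(size=N)
Phi = np.vander(x, K, increasing=True)
m0, S0 = np.zeros(K), b**2 * np.eye(K)

# log p(Y | X) = log N(y | Phi m0, Phi S0 Phi^T + sigma^2 I)
log_evidence = multivariate_normal.logpdf(
    y, mean=Phi @ m0, cov=Phi @ S0 @ Phi.T + sigma**2 * np.eye(N))
print(log_evidence)
```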



ML as Orthogonal Projection

• For $f(x) = x^T \theta + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, with a single parameter ($D = 1$, so $X \in \mathbb{R}^N$ is a column vector),
$$\theta_{ML} = (X^T X)^{-1} X^T y = \frac{X^T y}{X^T X} \in \mathbb{R}, \qquad X\theta_{ML} = \frac{X X^T}{X^T X}\, y$$
  ◦ Orthogonal projection of $y$ onto the one-dimensional subspace spanned by $X$
• For $f(x) = \phi^T(x)\theta + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, with $\theta \in \mathbb{R}^K$,
$$\theta_{ML} = (\Phi^T \Phi)^{-1} \Phi^T y, \qquad \Phi\theta_{ML} = \Phi (\Phi^T \Phi)^{-1} \Phi^T y$$
  ◦ Orthogonal projection of $y$ onto the $K$-dimensional subspace spanned by the columns of $\Phi$
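
A small numerical check of the projection property: the residual $y - \Phi\theta_{ML}$ is orthogonal to every column of $\Phi$ (the data below are illustrative):

```python
import numpy as np

# Illustrative data; any full-rank feature matrix works here.
rng = np.random.default_rng(5)
N, K = 20, 4
x = rng.uniform(-5, 5, size=N)
y = -np.sin(x / 5) + np.cos(x) + 0.2 * rng.normal(size=N)
Phi = np.vander(x, K, increasing=True)

theta_ml = np.linalg.lstsq(Phi, y, rcond=None)[0]
residual = y - Phi @ theta_ml

# Phi^T (y - Phi theta_ML) = 0: the fitted values Phi theta_ML are the
# orthogonal projection of y onto the column space of Phi.
print(np.allclose(Phi.T @ residual, 0, atol=1e-6))  # True
```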



Summary and Other Issues (1)

• Linear regression with a Gaussian likelihood and conjugate Gaussian priors: nice analytical results and closed forms
• Other forms of likelihoods suit other applications (e.g., classification)
• GLM (generalized linear model): $y = \sigma \circ f$ ($\sigma$: activation function)
  ◦ No longer linear in $\theta$
  ◦ Logistic regression: $\sigma(f) = \frac{1}{1 + \exp(-f)} \in [0, 1]$ (interpreted as the probability of the label being 1)
  ◦ Building blocks of (deep) feedforward neural nets
  ◦ $y = \sigma(Ax + b)$, where $A$ is a weight matrix and $b$ is a bias vector
  ◦ $K$-layer deep neural nets: $x_{k+1} = f_k(x_k)$, $f_k(x_k) = \sigma_k(A_k x_k + b_k)$
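
A tiny sketch of the GLM view for logistic regression, a single sigmoid applied to a linear model (the shapes and values are purely illustrative):

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

# Purely illustrative shapes and values.
rng = np.random.default_rng(6)
D = 3
x = rng.normal(size=D)
A = rng.normal(size=(1, D))   # weight matrix
b = np.zeros(1)               # bias vector

# GLM / logistic regression: y = sigma(A x + b), a probability in [0, 1].
y = sigmoid(A @ x + b)
print(y)
```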



Summary and Other Issues (2)

• Gaussian process
  ◦ A distribution over parameters induces a distribution over functions
  ◦ A Gaussian process is a distribution over functions without detouring via parameters
  ◦ Closely related to BLR and support vector regression; can also be interpreted as a Bayesian neural network with a single hidden layer and an infinite number of units
• Gaussian likelihood, but non-Gaussian prior
  ◦ Useful when $N \ll D$ (small training data)
  ◦ Prior that enforces sparsity, e.g., a Laplace prior
  ◦ Linear regression with a Laplace prior is equivalent to linear regression with LASSO (L1 regularization)



Questions?



Review Questions

1)

