
Lecture: Linear Regression¹

CS 2XX: Mathematics for AI and ML

Chandresh Kumar Maurya

IIT Indore
https://chandreshiit.github.io

November 17, 2024

¹ Slides credit: Yi, Yung
Warm-Up

Please watch this tutorial video by Luis Serrano on PCA.

https://www.youtube.com/watch?v=wYPUhge9w5c



Roadmap

(1) Problem Formulation


(2) Parameter Estimation: ML
(3) Parameter Estimation: MAP
(4) Bayesian Linear Regression
(5) Maximum Likelihood as Orthogonal Projection



Regression Problem

• For some input values $x_n$, we observe noisy function values $y_n = f(x_n) + \epsilon$


• Goal: infer the function f that generalizes well to function values at new inputs
• Applications: time-series analysis, control and robotics, image recognition, etc.



Formulation

Notation for simplification (this is the notation the textbook uses):

$$p(y \mid x) = p_{Y \mid X}(y \mid x), \quad Y \sim \mathcal{N}(\mu, \sigma^2) \;\xrightarrow{\text{simplifies}}\; \mathcal{N}(y \mid f(x), \sigma^2)$$

• Assume: linear regression, Gaussian noise
• $y = f(x) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$
• Likelihood: for $x \in \mathbb{R}^D$ and $y \in \mathbb{R}$, $p(y \mid x) = \mathcal{N}(y \mid f(x), \sigma^2)$
• Linear regression with parameter $\theta \in \mathbb{R}^D$, i.e., $f(x) = x^T \theta$:
$$p(y \mid x) = \mathcal{N}(y \mid x^T \theta, \sigma^2) \iff y = x^T \theta + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

Likelihood with Gaussian noise: $p(y \mid x) = \mathcal{N}(y \mid x^T \theta, \sigma^2)$



Parameter Estimation

• Training set $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$

• Assuming the $N$ data samples are i.i.d., the likelihood factorizes as
$$p(\mathcal{Y} \mid \mathcal{X}, \theta) = \prod_{n=1}^{N} p(y_n \mid x_n, \theta) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid x_n^T \theta, \sigma^2),$$
where $\mathcal{X} = \{x_1, \ldots, x_N\}$ and $\mathcal{Y} = \{y_1, \ldots, y_N\}$

• Estimation methods: ML and MAP
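
As a quick numerical sanity check of this factorization, here is a minimal sketch (the data, dimensions, and noise level below are illustrative assumptions, not from the slides) that evaluates the log-likelihood both as a sum of per-sample Gaussian log-densities and via the equivalent squared-error form:

```python
import numpy as np
from scipy.stats import norm

# Illustrative data: N samples, D features, noise level sigma (all assumed).
rng = np.random.default_rng(0)
N, D, sigma = 20, 3, 0.5
X = rng.normal(size=(N, D))                 # rows are x_n^T
theta = rng.normal(size=D)
y = X @ theta + sigma * rng.normal(size=N)

# log p(Y | X, theta) = sum_n log N(y_n | x_n^T theta, sigma^2)
loglik_factored = norm.logpdf(y, loc=X @ theta, scale=sigma).sum()

# The same quantity written with the squared-error form.
loglik_squared = (-0.5 * N * np.log(2 * np.pi * sigma**2)
                  - 0.5 * np.sum((y - X @ theta) ** 2) / sigma**2)

print(np.isclose(loglik_factored, loglik_squared))  # True
```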



MLE (Maximum Likelihood Estimation) (1)
 
• $\theta_{ML} = \arg\max_{\theta}\, p(\mathcal{Y} \mid \mathcal{X}, \theta) = \arg\min_{\theta}\, \big(-\log p(\mathcal{Y} \mid \mathcal{X}, \theta)\big)$

• For Gaussian noise with $X = [x_1, \ldots, x_N]^T$ and $y = [y_1, \ldots, y_N]^T$,

$$-\log p(\mathcal{Y} \mid \mathcal{X}, \theta) = -\log \prod_{n=1}^{N} p(y_n \mid x_n, \theta) = -\sum_{n=1}^{N} \log p(y_n \mid x_n, \theta) = \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n^T \theta)^2 + \text{const} = \frac{1}{2\sigma^2} \|y - X\theta\|^2 + \text{const}$$

Negative log-likelihood for $y = x^T \theta + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$:
$$-\log p(\mathcal{Y} \mid \mathcal{X}, \theta) = \frac{1}{2\sigma^2} \|y - X\theta\|^2 + \text{const}$$



MLE (Maximum Likelihood Estimation) (2)

• For Gaussian noise with $X = [x_1, \ldots, x_N]^T$ and $y = [y_1, \ldots, y_N]^T$,

$$\theta_{ML} = \arg\min_{\theta} \frac{1}{2\sigma^2} \|y - X\theta\|^2, \qquad L(\theta) = \frac{1}{2\sigma^2} \|y - X\theta\|^2$$

• In the case of Gaussian noise, $\theta_{ML}$ is the $\theta$ that minimizes the empirical risk with the squared loss function
  ◦ The "models as functions" view and the "models as probabilistic models" view lead to the same estimate here



MLE (Maximum Likelihood Estimation) (3)

• We find $\theta$ such that $\frac{dL}{d\theta} = 0$:

$$\frac{dL}{d\theta} = \frac{1}{2\sigma^2} \big(-2(y - X\theta)^T X\big) = \frac{1}{\sigma^2} \big(-y^T X + \theta^T X^T X\big) = 0$$
$$\iff \theta_{ML}^T X^T X = y^T X$$
$$\iff \theta_{ML}^T = y^T X (X^T X)^{-1} \qquad (X^T X \text{ is positive definite if } \operatorname{rk}(X) = D)$$
$$\iff \theta_{ML} = (X^T X)^{-1} X^T y$$
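
A minimal numpy sketch of this closed-form solution on synthetic data (the dimensions, noise level, and variable names are illustrative assumptions):

```python
import numpy as np

# Synthetic data; dimensions, noise level, and names are illustrative.
rng = np.random.default_rng(0)
N, D, sigma = 50, 3, 0.2
X = rng.normal(size=(N, D))                 # design matrix, rows x_n^T
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + sigma * rng.normal(size=N)

# theta_ML = (X^T X)^{-1} X^T y: solve the normal equations rather than
# forming the inverse explicitly.
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_ml)  # close to theta_true when the noise is small
```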



MLE with Features
• Linear regression: linear in the parameters
  ◦ $\phi(x)^T \theta$ is also fine, where $\phi(x)$ can be non-linear (we will cover this later)
  ◦ $\phi(x)$ are the features
• Linear regression with parameter $\theta \in \mathbb{R}^K$ and features $\phi : \mathbb{R}^D \mapsto \mathbb{R}^K$:
$$p(y \mid x) = \mathcal{N}(y \mid \phi(x)^T \theta, \sigma^2) \iff y = \phi(x)^T \theta + \epsilon = \sum_{k=0}^{K-1} \theta_k \phi_k(x) + \epsilon$$
• Example: polynomial regression. For $x \in \mathbb{R}$ and $\theta \in \mathbb{R}^K$, we lift the original 1-D input into a $K$-dimensional feature space with monomials $x^k$:
$$\phi(x) = \begin{bmatrix} \phi_0(x) \\ \vdots \\ \phi_{K-1}(x) \end{bmatrix} = \begin{bmatrix} 1 \\ \vdots \\ x^{K-1} \end{bmatrix} \in \mathbb{R}^K \implies f(x) = \sum_{k=0}^{K-1} \theta_k x^k$$



Feature Matrix and MLE
• Now, for the entire training set $\{x_1, \ldots, x_N\}$, define the feature matrix
$$\Phi := \begin{bmatrix} \phi^T(x_1) \\ \vdots \\ \phi^T(x_N) \end{bmatrix} = \begin{bmatrix} \phi_0(x_1) & \cdots & \phi_{K-1}(x_1) \\ \vdots & & \vdots \\ \phi_0(x_N) & \cdots & \phi_{K-1}(x_N) \end{bmatrix} \in \mathbb{R}^{N \times K}, \qquad \Phi_{ij} = \phi_j(x_i), \quad \phi_j : \mathbb{R}^D \mapsto \mathbb{R}$$

• Negative log-likelihood: similarly to the case $y = X\theta$,
  ◦ $p(\mathcal{Y} \mid \mathcal{X}, \theta) = \mathcal{N}(y \mid \Phi\theta, \sigma^2 I)$
  ◦ Negative log-likelihood for $y = \phi^T(x)\theta + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$:
$$-\log p(\mathcal{Y} \mid \mathcal{X}, \theta) = \frac{1}{2\sigma^2} \|y - \Phi\theta\|^2 + \text{const}$$

• MLE: $\theta_{ML} = (\Phi^T \Phi)^{-1} \Phi^T y$ (see the sketch below)
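
A sketch of the feature-matrix version, assuming monomial features built with np.vander (the data, dimensions, and feature choice are illustrative):

```python
import numpy as np

# Illustrative scalar inputs and monomial features phi_j(x) = x^j.
rng = np.random.default_rng(0)
N, K, sigma = 30, 4, 0.3
x = rng.uniform(-1, 1, size=N)
y = 0.5 - x + 2 * x**3 + sigma * rng.normal(size=N)

# Feature matrix Phi with Phi[i, j] = phi_j(x_i) = x_i^j, j = 0, ..., K-1.
Phi = np.vander(x, K, increasing=True)

# theta_ML = (Phi^T Phi)^{-1} Phi^T y
theta_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(theta_ml)
```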
Polynomial Fit
• $N = 10$ data points, where $x_n \sim U[-5, 5]$ and $y_n = -\sin(x_n/5) + \cos(x_n) + \epsilon$, $\epsilon \sim \mathcal{N}(0, 0.2^2)$
• Fit with a polynomial of degree 4 using ML (a sketch follows below)
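
A sketch that reproduces the setup described above, assuming the degree-4 fit includes a constant term (so 5 features) and picking an arbitrary random seed:

```python
import numpy as np

# Setup from the slide; the random seed and evaluation grid are my own choices.
rng = np.random.default_rng(42)
N, sigma, degree = 10, 0.2, 4
x = rng.uniform(-5, 5, size=N)
y = -np.sin(x / 5) + np.cos(x) + sigma * rng.normal(size=N)

Phi = np.vander(x, degree + 1, increasing=True)      # 1, x, ..., x^4
theta_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Evaluate the fitted polynomial on a dense grid (e.g., for plotting).
x_grid = np.linspace(-5, 5, 200)
f_grid = np.vander(x_grid, degree + 1, increasing=True) @ theta_ml
```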



Overfitting in Linear Regression

• A higher polynomial degree always fits the training data better (the training error monotonically decreases)

• The test error, however, increases beyond some polynomial degree
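
A sketch of this effect, assuming a fresh test set drawn from the same distribution as the training data:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.2

def make_data(n):
    x = rng.uniform(-5, 5, size=n)
    return x, -np.sin(x / 5) + np.cos(x) + sigma * rng.normal(size=n)

def rmse(theta, x, y):
    Phi = np.vander(x, len(theta), increasing=True)
    return np.sqrt(np.mean((Phi @ theta - y) ** 2))

x_train, y_train = make_data(10)   # small training set
x_test, y_test = make_data(200)    # held-out test set

for degree in range(10):
    Phi = np.vander(x_train, degree + 1, increasing=True)
    theta = np.linalg.lstsq(Phi, y_train, rcond=None)[0]
    # Training RMSE keeps shrinking with the degree; test RMSE
    # eventually grows once the model starts to overfit.
    print(degree, rmse(theta, x_train, y_train), rmse(theta, x_test, y_test))
```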



MAPE (Maximum A Posteriori Estimation)

• MLE is prone to overfitting, where the magnitudes of the parameters become large.

• A prior distribution $p(\theta)$ helps: it encodes which values of $\theta$ are plausible.

• MAPE and Bayes' theorem:

$$p(\theta \mid \mathcal{X}, \mathcal{Y}) = \frac{p(\mathcal{Y} \mid \mathcal{X}, \theta)\, p(\theta)}{p(\mathcal{Y} \mid \mathcal{X})} \implies \theta_{MAP} \in \arg\min_{\theta} \big(-\log p(\mathcal{Y} \mid \mathcal{X}, \theta) - \log p(\theta)\big)$$

• Gradient:
$$-\frac{d \log p(\theta \mid \mathcal{X}, \mathcal{Y})}{d\theta} = -\frac{d \log p(\mathcal{Y} \mid \mathcal{X}, \theta)}{d\theta} - \frac{d \log p(\theta)}{d\theta}$$



MAPE for Gaussian Prior (1)

• Example: a (conjugate) Gaussian prior $p(\theta) = \mathcal{N}(0, b^2 I)$
  ◦ For a Gaussian likelihood, a Gaussian prior $\implies$ a Gaussian posterior L6(6)
• Negative log-posterior

Negative log-posterior for $y = \phi^T(x)\theta + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, and prior $p(\theta) = \mathcal{N}(0, b^2 I)$:
$$-\log p(\theta \mid \mathcal{X}, \mathcal{Y}) = \frac{1}{2\sigma^2} (y - \Phi\theta)^T (y - \Phi\theta) + \frac{1}{2b^2} \theta^T \theta + \text{const}$$

• Gradient:
$$-\frac{d \log p(\theta \mid \mathcal{X}, \mathcal{Y})}{d\theta} = \frac{1}{\sigma^2} \big(\theta^T \Phi^T \Phi - y^T \Phi\big) + \frac{1}{b^2} \theta^T$$



MAPE for Gaussian Prior (2)

• MAP vs. ML:

$$\theta_{MAP} = \Big( \underbrace{\Phi^T \Phi + \frac{\sigma^2}{b^2} I}_{(*)} \Big)^{-1} \Phi^T y, \qquad \theta_{ML} = (\Phi^T \Phi)^{-1} \Phi^T y$$

• The term $\frac{\sigma^2}{b^2} I$
  ◦ ensures that $(*)$ is symmetric and strictly positive definite
  ◦ plays the role of a regularizer (see the sketch below)
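
A sketch comparing the two estimates on the polynomial-features setup, with illustrative values assumed for $\sigma$ and $b$:

```python
import numpy as np

# Polynomial features as before; sigma and b are assumed values.
rng = np.random.default_rng(1)
N, K, sigma, b = 10, 5, 0.2, 1.0
x = rng.uniform(-5, 5, size=N)
y = -np.sin(x / 5) + np.cos(x) + sigma * rng.normal(size=N)
Phi = np.vander(x, K, increasing=True)

theta_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
theta_map = np.linalg.solve(Phi.T @ Phi + (sigma**2 / b**2) * np.eye(K),
                            Phi.T @ y)

# The MAP estimate is shrunk toward zero relative to the ML estimate.
print(np.linalg.norm(theta_ml), np.linalg.norm(theta_map))
```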



Aside: MAPE for General Gaussian Prior (3)

• Example: a (conjugate) Gaussian prior $p(\theta) = \mathcal{N}(m_0, S_0)$

• Negative log-posterior

Negative log-posterior for $y = \phi^T(x)\theta + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, and prior $p(\theta) = \mathcal{N}(m_0, S_0)$:
$$-\log p(\theta \mid \mathcal{X}, \mathcal{Y}) = \frac{1}{2\sigma^2} (y - \Phi\theta)^T (y - \Phi\theta) + \frac{1}{2} (\theta - m_0)^T S_0^{-1} (\theta - m_0) + \text{const}$$

• We will use this later for computing the parameter posterior distribution in Bayesian
linear regression.



Regularization: MAPE vs. Explicit Regularizer

• Explicit regularizer in regularized least squares (RLS):
$$\|y - \Phi\theta\|^2 + \lambda \|\theta\|^2$$
• MAPE with Gaussian prior $p(\theta) = \mathcal{N}(0, b^2 I)$
  ◦ Negative log-Gaussian prior:
$$-\log p(\theta) = \frac{1}{2b^2} \theta^T \theta + \text{const}$$
  ◦ $\lambda = 1/(2b^2)$ plays the role of the regularization parameter
• It is then not surprising that
$$\theta_{RLS} = \big(\Phi^T \Phi + \lambda I\big)^{-1} \Phi^T y$$
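
A quick numerical check of this closed form: since $\|y - \Phi\theta\|^2 + \lambda\|\theta\|^2 = \|[y; 0] - [\Phi; \sqrt{\lambda}\, I]\theta\|^2$, the same minimizer can be obtained from an augmented least-squares problem (a sketch with illustrative data and $\lambda$):

```python
import numpy as np

# Illustrative data and regularization strength.
rng = np.random.default_rng(2)
N, K, lam = 10, 5, 0.1
x = rng.uniform(-5, 5, size=N)
y = -np.sin(x / 5) + np.cos(x) + 0.2 * rng.normal(size=N)
Phi = np.vander(x, K, increasing=True)

# Closed form: theta_RLS = (Phi^T Phi + lambda I)^{-1} Phi^T y
theta_rls = np.linalg.solve(Phi.T @ Phi + lam * np.eye(K), Phi.T @ y)

# Equivalent augmented least-squares formulation.
Phi_aug = np.vstack([Phi, np.sqrt(lam) * np.eye(K)])
y_aug = np.concatenate([y, np.zeros(K)])
theta_aug = np.linalg.lstsq(Phi_aug, y_aug, rcond=None)[0]

print(np.allclose(theta_rls, theta_aug))  # True
```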



Bayesian Linear Regression
• Earlier, ML and MAP. Now, fully Bayesian L8(4)
• Model:
  ◦ prior: $p(\theta) = \mathcal{N}(m_0, S_0)$
  ◦ likelihood: $p(y \mid x, \theta) = \mathcal{N}\big(y \mid \phi^T(x)\theta, \sigma^2\big)$
  ◦ joint: $p(y, \theta \mid x) = p(y \mid x, \theta)\, p(\theta)$
• Goal: for an input $x_*$, we want to compute the posterior predictive distribution² of $y_*$:
$$p(y_* \mid x_*, \mathcal{X}, \mathcal{Y}) = \int \underbrace{p(y_* \mid x_*, \theta)}_{\text{likelihood}}\; \underbrace{p(\theta \mid \mathcal{X}, \mathcal{Y})}_{(*)}\; d\theta$$
  ◦ $(*)$: the parameter posterior distribution, which needs to be computed

² Chapter 9.3.4. For ease of understanding, I've slightly changed the organization of these lecture slides from that of the textbook.
Parameter Posterior Distribution (1)

• Parameter posterior distribution Chapter 9.3.3:

$$p(\theta \mid \mathcal{X}, \mathcal{Y}) = \mathcal{N}(\theta \mid m_N, S_N), \quad \text{where} \quad S_N = \big(S_0^{-1} + \sigma^{-2} \Phi^T \Phi\big)^{-1}, \quad m_N = S_N \big(S_0^{-1} m_0 + \sigma^{-2} \Phi^T y\big)$$

(Proof sketch)

• From the negative log-posterior for the general Gaussian prior,

$$-\log p(\theta \mid \mathcal{X}, \mathcal{Y}) = \frac{1}{2\sigma^2} (y - \Phi\theta)^T (y - \Phi\theta) + \frac{1}{2} (\theta - m_0)^T S_0^{-1} (\theta - m_0) + \text{const}$$



Parameter Posterior Distribution (2)

$$= \frac{1}{2} \Big( \sigma^{-2} y^T y - 2\sigma^{-2} y^T \Phi \theta + \theta^T \sigma^{-2} \Phi^T \Phi \theta + \theta^T S_0^{-1} \theta - 2 m_0^T S_0^{-1} \theta + m_0^T S_0^{-1} m_0 \Big)$$
$$= \frac{1}{2} \Big( \theta^T \big(\sigma^{-2} \Phi^T \Phi + S_0^{-1}\big) \theta - 2 \big(\sigma^{-2} \Phi^T y + S_0^{-1} m_0\big)^T \theta \Big) + \text{const}$$

• Quadratic term in $\theta$: $\theta^T (\sigma^{-2} \Phi^T \Phi + S_0^{-1}) \theta$; linear term in $\theta$: $-2 (\sigma^{-2} \Phi^T y + S_0^{-1} m_0)^T \theta$
• $p(\theta \mid \mathcal{X}, \mathcal{Y}) \propto \exp(\text{quadratic in } \theta) \implies$ Gaussian distribution
• Assume that $p(\theta \mid \mathcal{X}, \mathcal{Y}) = \mathcal{N}(\theta \mid m_N, S_N)$, and find $m_N$ and $S_N$:

$$-\log \mathcal{N}(\theta \mid m_N, S_N) = \frac{1}{2} (\theta - m_N)^T S_N^{-1} (\theta - m_N) + \text{const} = \frac{1}{2} \Big( \theta^T S_N^{-1} \theta - 2 m_N^T S_N^{-1} \theta + m_N^T S_N^{-1} m_N \Big) + \text{const}$$

• Thus, $S_N^{-1} = \sigma^{-2} \Phi^T \Phi + S_0^{-1}$ and $m_N^T S_N^{-1} = \big(\sigma^{-2} \Phi^T y + S_0^{-1} m_0\big)^T$
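
A sketch that computes $m_N$ and $S_N$ numerically, assuming an illustrative prior $m_0 = 0$, $S_0 = b^2 I$; with this zero-mean isotropic prior, the posterior mean reduces to the MAP estimate from earlier:

```python
import numpy as np

# Illustrative data and prior: m_0 = 0, S_0 = b^2 I (assumed values).
rng = np.random.default_rng(3)
N, K, sigma, b = 10, 5, 0.2, 1.0
x = rng.uniform(-5, 5, size=N)
y = -np.sin(x / 5) + np.cos(x) + sigma * rng.normal(size=N)
Phi = np.vander(x, K, increasing=True)

m0, S0 = np.zeros(K), b**2 * np.eye(K)
S0_inv = np.linalg.inv(S0)

# S_N = (S_0^{-1} + sigma^{-2} Phi^T Phi)^{-1}
SN = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma**2)
# m_N = S_N (S_0^{-1} m_0 + sigma^{-2} Phi^T y)
mN = SN @ (S0_inv @ m0 + Phi.T @ y / sigma**2)

# With this zero-mean isotropic prior, m_N equals the MAP estimate.
theta_map = np.linalg.solve(Phi.T @ Phi + (sigma**2 / b**2) * np.eye(K),
                            Phi.T @ y)
print(np.allclose(mN, theta_map))  # True
```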



Posterior Predictions (1)

• Posterior predictive distribution L6(5):

$$p(y_* \mid x_*, \mathcal{X}, \mathcal{Y}) = \int p(y_* \mid x_*, \theta)\, p(\theta \mid \mathcal{X}, \mathcal{Y})\, d\theta$$
$$= \int \mathcal{N}\big(y_* \mid \phi^T(x_*)\theta, \sigma^2\big)\, \mathcal{N}(\theta \mid m_N, S_N)\, d\theta$$
$$= \mathcal{N}\big(y_* \mid \phi^T(x_*)\, m_N,\; \phi^T(x_*)\, S_N\, \phi(x_*) + \sigma^2\big)$$

• The predictive mean $\phi^T(x_*)\, m_N$ coincides with the prediction made by the MAP estimate $\theta_{MAP}$
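
A self-contained sketch of the predictive mean and variance at a test input $x_*$, under the same illustrative prior and data assumptions as in the previous sketch:

```python
import numpy as np

# Same illustrative prior and data assumptions as before.
rng = np.random.default_rng(3)
N, K, sigma, b = 10, 5, 0.2, 1.0
x = rng.uniform(-5, 5, size=N)
y = -np.sin(x / 5) + np.cos(x) + sigma * rng.normal(size=N)
Phi = np.vander(x, K, increasing=True)

S0_inv = np.eye(K) / b**2                       # prior N(0, b^2 I)
SN = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma**2)
mN = SN @ (Phi.T @ y / sigma**2)

x_star = 1.5
phi_star = np.vander([x_star], K, increasing=True)[0]
pred_mean = phi_star @ mN
pred_var = phi_star @ SN @ phi_star + sigma**2  # includes the noise variance
print(pred_mean, pred_var)
```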



Posterior Predictions (2)

• BLR: Bayesian Linear Regression



Computing Marginal Likelihood

• Likelihood: $p(\mathcal{Y} \mid \mathcal{X}, \theta)$; marginal likelihood: $p(\mathcal{Y} \mid \mathcal{X}) = \int p(\mathcal{Y} \mid \mathcal{X}, \theta)\, p(\theta)\, d\theta$
• Recall that the marginal likelihood is important for model selection via the Bayes factor:

$$(\text{Posterior odds}) = \frac{P(M_1 \mid D)}{P(M_2 \mid D)} = \frac{P(D \mid M_1) P(M_1) / P(D)}{P(D \mid M_2) P(M_2) / P(D)} = \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{Prior odds}}\; \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}}$$

• For Bayesian linear regression:

$$p(\mathcal{Y} \mid \mathcal{X}) = \int p(\mathcal{Y} \mid \mathcal{X}, \theta)\, p(\theta)\, d\theta = \int \mathcal{N}(y \mid \Phi\theta, \sigma^2 I)\, \mathcal{N}(\theta \mid m_0, S_0)\, d\theta = \mathcal{N}\big(y \mid \Phi m_0,\; \Phi S_0 \Phi^T + \sigma^2 I\big)$$
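
Since the marginal likelihood is itself a Gaussian in $y$, it can be evaluated directly; a sketch using scipy.stats.multivariate_normal (the data and prior values are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative data and prior.
rng = np.random.default_rng(4)
N, K, sigma, b = 10, 5, 0.2, 1.0
x = rng.uniform(-5, 5, size=N)
y = -np.sin(x / 5) + np.cos(x) + sigma * rng.normal(size=N)
Phi = np.vander(x, K, increasing=True)
m0, S0 = np.zeros(K), b**2 * np.eye(K)

# log p(Y | X) = log N(y | Phi m0, Phi S0 Phi^T + sigma^2 I)
log_evidence = multivariate_normal.logpdf(
    y, mean=Phi @ m0, cov=Phi @ S0 @ Phi.T + sigma**2 * np.eye(N))
print(log_evidence)
```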



ML as Orthogonal Projection

• For $f(x) = x^T \theta + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, with a single parameter ($D = 1$, so $X \in \mathbb{R}^N$ is a column vector),
$$\theta_{ML} = (X^T X)^{-1} X^T y = \frac{X^T y}{X^T X} \in \mathbb{R}, \qquad X\theta_{ML} = \frac{X X^T}{X^T X}\, y$$
  ◦ Orthogonal projection of $y$ onto the one-dimensional subspace spanned by $X$
• For $f(x) = \phi^T(x)\theta + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, with $\theta \in \mathbb{R}^K$,
$$\theta_{ML} = (\Phi^T \Phi)^{-1} \Phi^T y, \qquad \Phi\theta_{ML} = \Phi (\Phi^T \Phi)^{-1} \Phi^T y$$
  ◦ Orthogonal projection of $y$ onto the $K$-dimensional subspace spanned by the columns of $\Phi$
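
A small numerical check of the projection property: the residual $y - \Phi\theta_{ML}$ is orthogonal to every column of $\Phi$ (the data below are illustrative):

```python
import numpy as np

# Illustrative data; any full-rank feature matrix works here.
rng = np.random.default_rng(5)
N, K = 20, 4
x = rng.uniform(-5, 5, size=N)
y = -np.sin(x / 5) + np.cos(x) + 0.2 * rng.normal(size=N)
Phi = np.vander(x, K, increasing=True)

theta_ml = np.linalg.lstsq(Phi, y, rcond=None)[0]
residual = y - Phi @ theta_ml

# Phi^T (y - Phi theta_ML) = 0: the fitted values Phi theta_ML are the
# orthogonal projection of y onto the column space of Phi.
print(np.allclose(Phi.T @ residual, 0, atol=1e-6))  # True
```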



Summary and Other Issues (1)

• Linear regression with a Gaussian likelihood and conjugate Gaussian priors: nice analytical results and closed forms
• Other forms of likelihoods suit other applications (e.g., classification)
• GLM (generalized linear model): $y = \sigma \circ f$ ($\sigma$: activation function)
  ◦ No longer linear in $\theta$
  ◦ Logistic regression: $\sigma(f) = \frac{1}{1 + \exp(-f)} \in [0, 1]$ (interpreted as the probability of the label being 1)
  ◦ Building blocks of (deep) feedforward neural nets
  ◦ $y = \sigma(Ax + b)$, where $A$ is a weight matrix and $b$ is a bias vector
  ◦ $K$-layer deep neural nets: $x_{k+1} = f_k(x_k)$, $f_k(x_k) = \sigma_k(A_k x_k + b_k)$
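
A tiny sketch of the GLM view for logistic regression, a single sigmoid applied to a linear model (the shapes and values are purely illustrative):

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

# Purely illustrative shapes and values.
rng = np.random.default_rng(6)
D = 3
x = rng.normal(size=D)
A = rng.normal(size=(1, D))   # weight matrix
b = np.zeros(1)               # bias vector

# GLM / logistic regression: y = sigma(A x + b), a probability in [0, 1].
y = sigmoid(A @ x + b)
print(y)
```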



Summary and Other Issues (2)

• Gaussian process
  ◦ A distribution over parameters induces a distribution over functions
  ◦ A Gaussian process is a distribution over functions without detouring via parameters
  ◦ Closely related to BLR and support vector regression; can also be interpreted as a Bayesian neural network with a single hidden layer and an infinite number of units
• Gaussian likelihood, but non-Gaussian prior
  ◦ Useful when $N \ll D$ (small training data)
  ◦ Prior that enforces sparsity, e.g., a Laplace prior
  ◦ Linear regression with a Laplace prior is equivalent to linear regression with LASSO (L1 regularization)



Questions?



Review Questions

1)

