LINEAR REGRESSION
J. Elder
CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
Credits
Some of these slides were sourced and/or modified from Christopher Bishop, Microsoft UK.
Linear Regression Topics
What is linear regression?
Example: polynomial curve fitting
Other basis families
Solving linear regression problems
Regularized regression
Multiple linear regression
Bayesian linear regression
What is Linear Regression?
In classification, we seek to identify the categorical class C_k associated with a given input vector x.
In regression, we seek to identify (or estimate) a continuous variable y associated with a given input vector x.
y is called the dependent variable.
x is called the independent variable.
If y is a vector, we call this multiple regression.
We will focus on the case where y is a scalar.
Notation:
y will denote the continuous model of the dependent variable.
t will denote discrete noisy observations of the dependent variable (sometimes called the target variable).
Where is the Linear in Linear Regression?
In regression we assume that y is a function of x. The exact nature of this function is governed by an unknown parameter vector w:

y = y(\mathbf{x}, \mathbf{w})

The regression is linear if y is linear in w. In other words, we can express y as

y = \mathbf{w}^\top \phi(\mathbf{x})

where \phi(\mathbf{x}) is some (potentially nonlinear) function of \mathbf{x}.
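For example (an illustration, not from the slides): the quadratic model

y = w_0 + w_1 x + w_2 x^2

is nonlinear in the input x, yet linear in the parameters w (here \phi(x) = (1, x, x^2)^\top), so it is still a linear regression model.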
Linear Basis Function Models
Generally,

y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x}),

where the \phi_j(\mathbf{x}) are known as basis functions.
Typically \phi_0(\mathbf{x}) = 1, so that w_0 acts as a bias.
In the simplest case, we use linear basis functions: \phi_d(\mathbf{x}) = x_d.
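As a runnable sketch (the course uses MATLAB; NumPy is assumed here, and the function name is mine), the N x M design matrix with entries Phi[n, j] = phi_j(x_n) for a polynomial basis can be built as:

    import numpy as np

    def design_matrix(x, M):
        """Polynomial design matrix: Phi[n, j] = phi_j(x_n) = x_n**j.
        Column 0 is all ones, so w[0] acts as the bias term."""
        return np.vander(x, M, increasing=True)

    x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
    Phi = design_matrix(x, 4)   # shape (5, 4): basis functions 1, x, x^2, x^3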
Example: Polynomial Bases
Polynomial basis functions:

\phi_j(x) = x^j

These are global:
a small change in x affects all basis functions;
a small change in a basis function affects y for all x.
Example: Polynomial Curve Fitting
[Figure: noisy observations of a sinusoidal generating function]
Sum-of-Squares Error Function
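Recorded here for reference (this is Bishop's sum-of-squares error, which the slide's figure illustrates):

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}) - t_n \right)^2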
[Figure: the error corresponds to the squared vertical displacements of the data points from the fitted curve]
1st Order Polynomial
[Figure: 1st-order polynomial fit to the data]
3rd Order Polynomial
[Figure: 3rd-order polynomial fit to the data]
9th Order Polynomial
[Figure: 9th-order polynomial fit to the data (overfitting)]
Regularization
Penalize large coefficient values
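In symbols (following Bishop), adding a quadratic penalty on the coefficients gives the regularized error

\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}) - t_n \right)^2 + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2,

where \lambda controls the trade-off between fitting the data and keeping the weights small.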
[Figures: 9th-order polynomial fits for increasing values of the regularization coefficient \lambda]
Probabilistic View of Curve Fitting
Why least squares?
Model the noise (the deviation of the data from the model) as i.i.d. Gaussian:

t = y(\mathbf{x}, \mathbf{w}) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \beta^{-1}),

where \beta \equiv \frac{1}{\sigma^2} is the precision of the noise.
Maximum Likelihood
We determine w_ML by minimizing the squared error E(w).
Thus least-squares regression reflects an assumption that the noise is i.i.d. Gaussian.
Given w_ML, we can then estimate the variance of the noise:
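Following Bishop, the maximum-likelihood noise variance is the mean squared residual:

\frac{1}{\beta_{ML}} = \sigma_{ML}^2 = \frac{1}{N} \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}_{ML}) - t_n \right)^2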
Predictive Distribution
[Figure: generating function, observed data, maximum-likelihood prediction, and posterior over t]
MAP: A Step towards Bayes
Prior knowledge about probable values of w can be incorporated into the regression. With a zero-mean isotropic Gaussian prior

p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}),

the posterior over w is proportional to the product of the likelihood times the prior:

p(\mathbf{w} \mid \mathbf{t}) \propto p(\mathbf{t} \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha).

The result is to introduce a new quadratic term in w into the error function to be minimized:

\tilde{E}(\mathbf{w}) = \frac{\beta}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^\top \phi(\mathbf{x}_n) \right)^2 + \frac{\alpha}{2} \mathbf{w}^\top \mathbf{w}.

Thus regularized (ridge) regression reflects a 0-mean isotropic Gaussian prior on the weights.
Gaussian Bases
Gaussian basis functions:

\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)

Think of these as interpolation functions.
These are local:
a small change in x affects only nearby basis functions;
a small change in a basis function affects y only for nearby x.
\mu_j and s control location and scale (width).
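A minimal NumPy sketch of a Gaussian design matrix (illustrative; the centre locations mu and the width s are free modelling choices):

    import numpy as np

    def gaussian_design_matrix(x, mu, s):
        """Phi[n, j] = exp(-(x_n - mu_j)^2 / (2 s^2)) for inputs x of
        shape (N,) and centres mu of shape (M,); s is the common width."""
        return np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2.0 * s ** 2))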
Maximum Likelihood and Linear Least Squares
Assume observations from a deterministic function with added Gaussian noise:

t = y(\mathbf{x}, \mathbf{w}) + \epsilon, \quad \text{where } p(\epsilon \mid \beta) = \mathcal{N}(\epsilon \mid 0, \beta^{-1}),

which is the same as saying

p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}\left( t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1} \right).

Given observed inputs X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\} and targets \mathbf{t} = [t_1, \ldots, t_N]^\top, we obtain the likelihood function

p(\mathbf{t} \mid X, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\left( t_n \mid \mathbf{w}^\top \phi(\mathbf{x}_n), \beta^{-1} \right).
Maximum Likelihood and Linear Least Squares
Taking the logarithm, we get

\ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi - \beta E_D(\mathbf{w}),

where

E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^\top \phi(\mathbf{x}_n) \right)^2

is the sum-of-squares error.
Maximum Likelihood and Least Squares
Computing the gradient with respect to w and setting it to zero yields

\nabla_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \beta \sum_{n=1}^{N} \left( t_n - \mathbf{w}^\top \phi(\mathbf{x}_n) \right) \phi(\mathbf{x}_n)^\top = 0.

Solving for w, we get

\mathbf{w}_{ML} = \Phi^\dagger \mathbf{t} = \left( \Phi^\top \Phi \right)^{-1} \Phi^\top \mathbf{t},

where \Phi is the N \times M design matrix with \Phi_{nj} = \phi_j(\mathbf{x}_n), and \Phi^\dagger \equiv (\Phi^\top \Phi)^{-1} \Phi^\top is the Moore-Penrose pseudo-inverse.
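A NumPy sketch of this solution (np.linalg.lstsq applies the pseudo-inverse in a numerically stable way rather than forming the inverse explicitly; the function name is mine):

    import numpy as np

    def fit_ml(Phi, t):
        """Maximum-likelihood weights: w_ML = pinv(Phi) @ t, via least squares."""
        w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
        return w_ml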
End of Lecture 8
Regularized Least Squares
Consider the error function

E(\mathbf{w}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})

(a data term plus a regularization term, where \lambda is called the regularization coefficient). With the sum-of-squares error function and a quadratic regularizer, we get

\frac{1}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^\top \phi(\mathbf{x}_n) \right)^2 + \frac{\lambda}{2} \mathbf{w}^\top \mathbf{w},

which is minimized by

\mathbf{w} = \left( \lambda \mathbf{I} + \Phi^\top \Phi \right)^{-1} \Phi^\top \mathbf{t}.

The \lambda \mathbf{I} term adds a "ridge" to the diagonal of \Phi^\top \Phi; thus the name ridge regression.
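A corresponding NumPy sketch of the closed-form ridge solution (np.linalg.solve is preferred to an explicit matrix inverse; the function name is mine):

    import numpy as np

    def fit_ridge(Phi, t, lam):
        """Ridge weights: w = (lam*I + Phi^T Phi)^{-1} Phi^T t."""
        M = Phi.shape[1]
        return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)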
Regularized Least Squares
With a more general regularizer, we have

\frac{1}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^\top \phi(\mathbf{x}_n) \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q.

q = 2 gives the quadratic regularizer; q = 1 gives the lasso (least absolute shrinkage and selection operator).
Regularized Least Squares
Lasso generates sparse solutions.
[Figure: iso-contours of the data term E_D(w) together with an iso-contour of the regularization term E_W(w), for the quadratic and lasso regularizers. The corners of the lasso constraint region make the constrained optimum land on the axes, driving some weights exactly to zero.]
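A quick illustration of lasso sparsity, assuming scikit-learn is available (the data here are made up for the example):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 10))
    t = X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=50)  # only 2 relevant features

    lasso = Lasso(alpha=0.1).fit(X, t)
    print(lasso.coef_)   # most coefficients are driven exactly to zero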
Solving Regularized Systems
Quadratic regularization has the advantage that the solution is closed form.
Non-quadratic regularizers generally do not have closed-form solutions.
Lasso can be framed as minimizing a quadratic error with linear constraints, and thus represents a convex optimization problem that can be solved by quadratic programming or other convex optimization methods.
We will discuss quadratic programming when we cover SVMs.
Multiple Outputs
Analogous to the single-output case, we have

\mathbf{y}(\mathbf{x}, \mathbf{W}) = \mathbf{W}^\top \phi(\mathbf{x}),

where \mathbf{y} is a K-dimensional output vector and \mathbf{W} is an M \times K parameter matrix. Given observed inputs X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\} and targets \mathbf{T} = [\mathbf{t}_1, \ldots, \mathbf{t}_N]^\top, we obtain the log-likelihood function

\ln p(\mathbf{T} \mid X, \mathbf{W}, \beta) = \frac{NK}{2} \ln\left( \frac{\beta}{2\pi} \right) - \frac{\beta}{2} \sum_{n=1}^{N} \left\lVert \mathbf{t}_n - \mathbf{W}^\top \phi(\mathbf{x}_n) \right\rVert^2.
Multiple Outputs
Maximizing with respect to W, we obtain

\mathbf{W}_{ML} = \left( \Phi^\top \Phi \right)^{-1} \Phi^\top \mathbf{T}.

If we consider a single target variable t_k, we see that

\mathbf{w}_k = \left( \Phi^\top \Phi \right)^{-1} \Phi^\top \mathbf{t}_k, \quad \text{where } \mathbf{t}_k = [t_{1k}, \ldots, t_{Nk}]^\top,

which is identical to the single-output case.
Some Useful MATLAB Functions
polyfit
Least-squares fit of a polynomial of specified order to given data.
regress
More general function that computes linear weights for a least-squares fit.
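For readers working in Python instead, np.polyfit and np.polyval play the role of MATLAB's polyfit (an illustrative sketch with made-up data):

    import numpy as np

    x = np.linspace(0.0, 1.0, 10)
    t = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(1).normal(size=10)
    w = np.polyfit(x, t, 3)      # cubic coefficients, highest power first
    t_hat = np.polyval(w, x)     # evaluate the fitted polynomial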
Bayesian Linear Regression
Rev. Thomas Bayes, 1702 - 1761
Bayesian Linear Regression
Define a conjugate prior over w:

p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0).

Combining this with the likelihood function and using results for marginal and conditional Gaussian distributions gives the posterior

p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N),

where

\mathbf{m}_N = \mathbf{S}_N \left( \mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \Phi^\top \mathbf{t} \right), \quad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \Phi^\top \Phi.
Bayesian Linear Regression
A common choice for the prior is

p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}),

for which

\mathbf{m}_N = \beta \mathbf{S}_N \Phi^\top \mathbf{t}, \quad \mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \Phi^\top \Phi.

Thus \mathbf{m}_N represents the ridge regression solution with \lambda = \alpha / \beta.
Next we consider an example.
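A NumPy sketch of this posterior computation (function name mine, not from the slides):

    import numpy as np

    def posterior(Phi, t, alpha, beta):
        """Posterior N(w | m_N, S_N) under the zero-mean isotropic prior:
        S_N^{-1} = alpha*I + beta*Phi^T Phi,  m_N = beta * S_N Phi^T t."""
        M = Phi.shape[1]
        S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
        m_N = beta * S_N @ Phi.T @ t
        return m_N, S_N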
Bayesian Linear Regression
[Figures: sequential Bayesian learning for a linear model. For 0, 1, 2, and 20 observed data points, each example shows the likelihood for the latest observation (x_n, t_n), the prior/posterior over w, and samples from the model in data space. The posterior concentrates as data accumulate.]
Predictive Distribution
Predict t for new values of x by integrating over w:

p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w} = \mathcal{N}\left( t \mid \mathbf{m}_N^\top \phi(\mathbf{x}), \sigma_N^2(\mathbf{x}) \right),

where

\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \phi(\mathbf{x})^\top \mathbf{S}_N \phi(\mathbf{x}).
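Continuing the posterior() sketch above, the predictive mean and variance at new inputs can be computed as:

    import numpy as np

    def predict(Phi_new, m_N, S_N, beta):
        """Predictive mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x),
        evaluated row-wise for a design matrix Phi_new of new inputs."""
        mean = Phi_new @ m_N
        var = 1.0 / beta + np.einsum('ij,jk,ik->i', Phi_new, S_N, Phi_new)
        return mean, var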
Predictive Distribution
[Figures: sinusoidal data fit with 9 Gaussian basis functions, for 1, 2, 4, and 25 data points. Each example shows the predictive mean E[t | t, \alpha, \beta], the predictive distribution p(t | t, \alpha, \beta), and samples of y(x, w) drawn from the posterior. With a single data point, notice how much bigger our uncertainty is relative to the ML method; the predictive variance shrinks near the observed data as more points arrive.]
Equivalent Kernel
The predictive mean can be written

y(\mathbf{x}, \mathbf{m}_N) = \mathbf{m}_N^\top \phi(\mathbf{x}) = \beta\, \phi(\mathbf{x})^\top \mathbf{S}_N \Phi^\top \mathbf{t} = \sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n)\, t_n,

where

k(\mathbf{x}, \mathbf{x}') = \beta\, \phi(\mathbf{x})^\top \mathbf{S}_N \phi(\mathbf{x}')

is known as the equivalent kernel or smoother matrix. The predictive mean is thus a weighted sum of the training data target values t_n.
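A minimal sketch of the equivalent kernel as a matrix, under the same assumptions as the posterior() sketch above:

    import numpy as np

    def smoother_matrix(Phi_new, Phi_train, S_N, beta):
        """K[i, n] = beta * phi(x_i)^T S_N phi(x_n); the predictive mean at
        the new inputs is then the weighted sum K @ t of the training targets."""
        return beta * Phi_new @ S_N @ Phi_train.T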
Equivalent Kernel
The weight of t_n depends on the distance between x and x_n; nearby x_n carry more weight.