IAML: Linear Regression
Nigel Goddard
School of Informatics
Semester 1
1 / 38
Overview
I The linear model
I Fitting the linear model to data
I Probabilistic interpretation of the error function
I Examples of regression problems
I Dealing with multiple outputs
I Generalized linear regression
I Radial basis function (RBF) models
2 / 38
The Regression Problem
I Classification and regression problems:
I Classification: target of prediction is discrete
I Regression: target of prediction is continuous
I Training data: a set D of pairs (x_i, y_i) for i = 1, ..., n, where
x_i ∈ R^D and y_i ∈ R
I Today: Linear regression, i.e., relationship between x and
y is linear.
I Although this is simple (and limited), it:
I Is more powerful than you would expect
I Is the basis for more complex nonlinear methods
I Teaches a lot about regression and classification
3 / 38
Examples of regression problems
I Robot inverse dynamics: predicting what torques are
needed to drive a robot arm along a given trajectory
I Electricity load forecasting: generating hourly forecasts two
days in advance
I Predicting staffing requirements at help desks based on
historical data and product and sales information
I Predicting the time to failure of equipment based on
utilization and environmental conditions
4 / 38
The Linear Model
I Linear model
f(x; w) = w_0 + w_1 x_1 + ... + w_D x_D = φ(x) w

where φ(x) = (1, x_1, ..., x_D) = (1, x^T) and

w = (w_0, w_1, ..., w_D)^T    (1)
I The maths of fitting linear models to data is easy. We use
the notation φ(x) to make generalisation easy later.
5 / 38
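A minimal NumPy sketch of this model, with made-up weights and inputs (the helper names phi and f are just for illustration):

```python
# Minimal sketch of the linear model f(x; w) = phi(x) . w  (illustrative values).
import numpy as np

def phi(x):
    """Feature map phi(x) = (1, x_1, ..., x_D): prepend a constant 1 for the bias w_0."""
    return np.concatenate(([1.0], x))

def f(x, w):
    """Linear model: dot product of the feature vector with the weight vector."""
    return phi(x) @ w

w = np.array([0.5, 2.0, -1.0])   # (w_0, w_1, w_2), made up for the example
x = np.array([3.0, 4.0])         # a single input with D = 2 features
print(f(x, w))                   # 0.5 + 2*3 - 1*4 = 2.5
```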
Toy example: Data
[Figure: scatter plot of the toy dataset, y against x (x from −3 to 3, y from −2 to 4).]
6 / 38
Toy example: Data
[Figure: the toy dataset shown in two side-by-side panels, y against x.]
7 / 38
With two features
Instead of a line, a plane. With more features, a hyperplane.

[Figure 3.1 from Hastie, Tibshirani, and Friedman: linear least squares fitting with X ∈ R^2. We seek the linear function of X that minimizes the sum of squared residuals from Y.]
8 / 38
With more features
CPU Performance Data Set
I Predict: PRP: published relative performance
I MYCT: machine cycle time in nanoseconds (integer)
I MMIN: minimum main memory in kilobytes (integer)
I MMAX: maximum main memory in kilobytes (integer)
I CACH: cache memory in kilobytes (integer)
I CHMIN: minimum channels in units (integer)
I CHMAX: maximum channels in units (integer)
9 / 38
With more features
PRP = - 56.1
+ 0.049 MYCT
+ 0.015 MMIN
+ 0.006 MMAX
+ 0.630 CACH
- 0.270 CHMIN
+ 1.46 CHMAX
10 / 38
In matrix notation
I Design matrix is n × (D + 1)
    [ 1  x_11  x_12  ...  x_1D ]
Φ = [ 1  x_21  x_22  ...  x_2D ]
    [ ...                      ]
    [ 1  x_n1  x_n2  ...  x_nD ]
I x_ij is the jth component of the training input x_i
I Let y = (y_1, ..., y_n)^T
I Then ŷ = Φw is ...?
11 / 38
Linear Algebra: The 1-Slide Version
What is matrix multiplication?
    [ a_11  a_12  a_13 ]        [ b_1 ]
A = [ a_21  a_22  a_23 ],   b = [ b_2 ]
    [ a_31  a_32  a_33 ]        [ b_3 ]
First consider matrix times vector, i.e., Ab. Two answers:
1. Ab is a linear combination of the columns of A
         [ a_11 ]       [ a_12 ]       [ a_13 ]
Ab = b_1 [ a_21 ] + b_2 [ a_22 ] + b_3 [ a_23 ]
         [ a_31 ]       [ a_32 ]       [ a_33 ]
2. Ab is a vector. Each element of the vector is the dot
product between b and one row of A.
     [ (a_11, a_12, a_13) b ]
Ab = [ (a_21, a_22, a_23) b ]
     [ (a_31, a_32, a_33) b ]
12 / 38
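The two views can be checked numerically; a small NumPy sketch with arbitrary example values:

```python
# Checking the two views of Ab numerically (arbitrary example values).
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])
b = np.array([1., 0., 2.])

# View 1: Ab is a linear combination of the columns of A, weighted by the entries of b.
view1 = b[0] * A[:, 0] + b[1] * A[:, 1] + b[2] * A[:, 2]

# View 2: each element of Ab is the dot product of one row of A with b.
view2 = np.array([A[i, :] @ b for i in range(3)])

print(np.allclose(A @ b, view1), np.allclose(A @ b, view2))  # True True
```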
Linear model (part 2)
In matrix notation:
I Design matrix is n × (D + 1)
    [ 1  x_11  x_12  ...  x_1D ]
Φ = [ 1  x_21  x_22  ...  x_2D ]
    [ ...                      ]
    [ 1  x_n1  x_n2  ...  x_nD ]
I x_ij is the jth component of the training input x_i
I Let y = (y_1, ..., y_n)^T
I Then ŷ = Φw is the model’s predicted values on training
inputs.
13 / 38
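A short NumPy sketch of building the design matrix and computing ŷ = Φw on synthetic inputs (the weights here are arbitrary, just to show the shapes):

```python
# Sketch: build the n x (D+1) design matrix Phi and compute y_hat = Phi w.
import numpy as np

rng = np.random.default_rng(0)
n, D = 5, 3
X = rng.normal(size=(n, D))              # toy training inputs
w = rng.normal(size=D + 1)               # some weight vector (first entry is the bias w_0)

Phi = np.column_stack([np.ones(n), X])   # prepend a column of ones
y_hat = Phi @ w                          # predicted values on the training inputs
print(Phi.shape, y_hat.shape)            # (5, 4) (5,)
```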
Solving for Model Parameters
This looks like what we’ve seen in linear algebra
y = Φw
We know y and Φ but not w.
So why not take w = Φ^{-1} y? (You can’t, but why?)
14 / 38
Solving for Model Parameters
This looks like what we’ve seen in linear algebra
y = Φw
We know y and Φ but not w.
So why not take w = Φ^{-1} y? (You can’t, but why?)
Three reasons:
I Φ is not square. It is n × (D + 1).
I The system is overconstrained (n equations for D + 1
parameters), in other words
I The data has noise
15 / 38
Loss function
Want a loss function O(w) such that:
I We can minimize it with respect to w.
I At its minimum, ŷ looks like y.
I (Recall: ŷ depends on w)
ŷ = Φw
16 / 38
Fitting a linear model to data
I A common choice: squared error
(makes the maths easy)

  O(w) = Σ_{i=1}^n (y_i − w^T x_i)^2

I In the picture: this is the sum of the squared lengths of the
black sticks.
I (Each one is called a residual, i.e., each y_i − w^T x_i)

[Figure 3.1 from Hastie, Tibshirani, and Friedman: linear least squares fitting with X ∈ R^2. We seek the linear function of X that minimizes the sum of squared residuals from Y.]
17 / 38
Fitting a linear model to data
I Given a dataset D of pairs (x_i, y_i) for i = 1, ..., n
I Squared error makes the maths easy:

  E(w) = Σ_{i=1}^n (y_i − f(x_i; w))^2 = (y − Φw)^T (y − Φw)

I We want to minimize this with respect to w.
I The error surface is a parabolic bowl.

[Figure: the error E[w] plotted over (w_0, w_1) is a parabolic bowl. Figure: Tom Mitchell]

I How do we do this?
18 / 38
The Solution
I Answer: to minimize O(w) = Σ_{i=1}^n (y_i − w^T x_i)^2, set the partial
derivatives to 0.
I This has an analytical solution

  ŵ = (Φ^T Φ)^{-1} Φ^T y

I (Φ^T Φ)^{-1} Φ^T is the pseudo-inverse of Φ
I First check: Does this make sense? Do the matrix
dimensions line up?
I Then: Why is this called a pseudo-inverse?
I Finally: What happens if there are no features?
19 / 38
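A hedged NumPy sketch of this solution on synthetic data; np.linalg.pinv computes the pseudo-inverse, and np.linalg.lstsq is the numerically preferred way to get the same answer:

```python
# Sketch of w_hat = (Phi^T Phi)^{-1} Phi^T y on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
n, D = 50, 3
X = rng.normal(size=(n, D))
w_true = np.array([1.0, -2.0, 0.5, 3.0])          # (w_0, ..., w_3), made up
Phi = np.column_stack([np.ones(n), X])
y = Phi @ w_true + 0.1 * rng.normal(size=n)       # linear signal plus a little noise

w_hat = np.linalg.pinv(Phi) @ y                   # pseudo-inverse solution
w_hat_ls = np.linalg.lstsq(Phi, y, rcond=None)[0] # same answer, better numerics
print(np.allclose(w_hat, w_hat_ls))               # True (up to rounding)
```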
Probabilistic interpretation of O(w)
I Assume that y = w^T x + ε, where ε ∼ N(0, σ^2)
I (This is an exact linear relationship plus Gaussian noise.)
I This implies that y_i | x_i ∼ N(w^T x_i, σ^2), i.e.

  − log p(y_i | x_i) = log √(2π) + log σ + (y_i − w^T x_i)^2 / (2σ^2)

I So minimising O(w) is equivalent to maximising the likelihood!
I Can view w^T x as E[y | x].
I Squared residuals allow estimation of σ^2:

  σ̂^2 = (1/n) Σ_{i=1}^n (y_i − w^T x_i)^2
20 / 38
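A small sketch of this variance estimate on synthetic data (the true noise standard deviation of 0.5 is an assumption of the example, not from the slides):

```python
# Sketch: MLE of the noise variance from the squared residuals of a fitted line.
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(0, 5, size=n)
y = 2.0 * x + 1.0 + 0.5 * rng.normal(size=n)      # true sigma = 0.5 (assumed for the example)

Phi = np.column_stack([np.ones(n), x])
w_hat = np.linalg.lstsq(Phi, y, rcond=None)[0]

sigma2_hat = np.mean((y - Phi @ w_hat) ** 2)      # (1/n) * sum of squared residuals
print(sigma2_hat)                                 # roughly 0.25
```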
Fitting this into the general structure for learning algorithms:
I Define the task: regression
I Decide on the model structure: linear regression model
I Decide on the score function: squared error (likelihood)
I Decide on optimization/search method to optimize the
score function: calculus (analytic solution)
21 / 38
Sensitivity to Outliers
I Linear regression is sensitive to outliers
I Example: Suppose y = 0.5x + ε, where ε ∼ N(0, 0.25),
and then add a point (2.5, 3):
[Figure: scatter plot of the sampled points plus the added outlier at (2.5, 3); x from 0 to 5, y from 0 to 3.]
22 / 38
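A sketch reproducing the spirit of this example: sample from y = 0.5x + ε, then fit with and without the extra point (2.5, 3) and compare the coefficients (sample size and seed are arbitrary choices):

```python
# Sketch of the outlier example: y = 0.5x + eps with eps ~ N(0, 0.25),
# then one extra point at (2.5, 3). The fitted coefficients shift noticeably.
import numpy as np

def fit_line(x, y):
    """Least-squares fit of an intercept and slope."""
    Phi = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(Phi, y, rcond=None)[0]   # (intercept, slope)

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, size=10)
y = 0.5 * x + rng.normal(scale=0.5, size=10)        # sd 0.5, i.e. variance 0.25

print("without outlier:", fit_line(x, y))
print("with outlier:   ", fit_line(np.append(x, 2.5), np.append(y, 3.0)))
```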
Diagnostics
Graphical diagnostics can be useful for checking:
I Is the relationship obviously nonlinear? Look for structure
in the residuals.
I Are there obvious outliers?
The goal isn’t to find all problems. You can’t. The goal is to find
obvious, embarrassing problems.
Example: plot the residuals against the fitted values. Stats packages will
do this for you.
23 / 38
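A minimal matplotlib sketch of a residuals-vs-fitted plot; Phi, y and w_hat are assumed to come from a model fitted as on the earlier slides:

```python
# Sketch of a residuals-vs-fitted diagnostic plot.
# Assumes Phi, y and w_hat come from a fit like the one on the earlier slides.
import matplotlib.pyplot as plt

def residual_plot(Phi, y, w_hat):
    """Scatter the residuals against the fitted values; structure suggests nonlinearity."""
    fitted = Phi @ w_hat
    residuals = y - fitted
    plt.scatter(fitted, residuals)
    plt.axhline(0.0, linestyle="--")
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.show()
```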
Dealing with multiple outputs
I Suppose there are q different targets for each input x
I We introduce a different wi for each target dimension, and
do regression separately for each one
I This is often called multivariate (multiple-output) regression
24 / 38
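A sketch of multi-output fitting with NumPy: stack the q targets as columns of a matrix Y and solve the least-squares problem once; the data here is synthetic and only illustrates the shapes:

```python
# Sketch: q targets per input, solved with one least-squares call.
# Each column of W_hat is the weight vector for one target dimension.
import numpy as np

rng = np.random.default_rng(4)
n, D, q = 50, 3, 2
X = rng.normal(size=(n, D))
Phi = np.column_stack([np.ones(n), X])
W_true = rng.normal(size=(D + 1, q))               # made-up weights, one column per target
Y = Phi @ W_true + 0.1 * rng.normal(size=(n, q))   # n x q matrix of targets

W_hat = np.linalg.lstsq(Phi, Y, rcond=None)[0]     # shape (D+1, q)
print(W_hat.shape)
```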
Basis expansion
I We can easily transform the original attributes x
non-linearly into φ(x) and do linear regression on them
[Figure: examples of basis functions on [−1, 1] — polynomials, Gaussians, and sigmoids.]
Figure credit: Chris Bishop, PRML
25 / 38
I Design matrix is n × m
    [ φ_1(x_1)  φ_2(x_1)  ...  φ_m(x_1) ]
Φ = [ φ_1(x_2)  φ_2(x_2)  ...  φ_m(x_2) ]
    [ ...                               ]
    [ φ_1(x_n)  φ_2(x_n)  ...  φ_m(x_n) ]
I Let y = (y_1, ..., y_n)^T
I Minimize E(w) = |y − Φw|^2. As before we have an
analytical solution

  ŵ = (Φ^T Φ)^{-1} Φ^T y

I (Φ^T Φ)^{-1} Φ^T is the pseudo-inverse of Φ
26 / 38
Example: polynomial regression
φ(x) = (1, x, x^2, ..., x^M)^T

[Figure: polynomial fits of order M = 0, 1, 3, and 9 to a toy dataset, t against x.]
Figure credit: Chris Bishop, PRML
27 / 38
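A sketch of polynomial regression as linear regression on the expanded features, using a sin-shaped toy target in the spirit of the Bishop figure (the data-generating details are assumptions of the example):

```python
# Sketch: polynomial regression = linear regression on phi(x) = (1, x, ..., x^M).
import numpy as np

def poly_design_matrix(x, M):
    """n x (M+1) design matrix with columns x^0, x^1, ..., x^M."""
    return np.vander(x, M + 1, increasing=True)

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, size=20)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=20)  # toy sin-shaped target (assumption)

Phi = poly_design_matrix(x, M=3)
w_hat = np.linalg.lstsq(Phi, y, rcond=None)[0]          # usual least-squares solution
print(w_hat)
```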
More about the features
I Transforming the features can be important.
I Example: Suppose I want to predict the CPU performance.
I Maybe one of the features is manufacturer.
x_1 = 1 if Intel, 2 if AMD, 3 if Apple, 4 if Motorola
I Let’s use this as a feature. Will this work?
28 / 38
More about the features
I Transforming the features can be important.
I Example: Suppose I want to predict the CPU performance.
I Maybe one of the features is manufacturer.
x_1 = 1 if Intel, 2 if AMD, 3 if Apple, 4 if Motorola
I Let’s use this as a feature. Will this work?
I Not the way you want. Do you really believe AMD is double
Intel?
29 / 38
More about the features
I Instead convert this into 0/1
x_1 = 1 if Intel, 0 otherwise
x_2 = 1 if AMD, 0 otherwise
...
I Note this is a consequence of linearity. We saw something
similar with text in the first week.
I Other good transformations: log, square, square root
30 / 38
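A small sketch of this 0/1 (one-hot) encoding; the manufacturer names and example rows are purely illustrative:

```python
# Sketch of one-hot (0/1) encoding for a categorical 'manufacturer' attribute.
import numpy as np

manufacturers = ["Intel", "AMD", "Apple", "Motorola"]   # the possible levels
data = ["AMD", "Intel", "Intel", "Apple"]               # example rows (illustrative)

# One indicator column per level: 1 if the row has that manufacturer, else 0.
one_hot = np.array([[1.0 if m == level else 0.0 for level in manufacturers]
                    for m in data])
print(one_hot)
```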
Radial basis function (RBF) models
I Set φ_i(x) = exp(−½ |x − c_i|^2 / α^2).
I Need to position these “basis functions” at previously
chosen centres c_i and with a given width α. There are
many ways to do this, but choosing a subset of the
datapoints as centres is one method that is quite effective.
I Finding the weights is the same as ever: the
pseudo-inverse solution.
31 / 38
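A sketch of the RBF feature map with centres taken as a subset of the data points, as suggested above (data, centres and width are illustrative choices):

```python
# Sketch of the RBF feature map phi_i(x) = exp(-0.5 * |x - c_i|^2 / alpha^2),
# with centres chosen as a subset of the data points.
import numpy as np

def rbf_features(X, centres, alpha):
    """Return the n x m matrix of RBF activations; X is (n, d), centres is (m, d)."""
    sq_dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / alpha ** 2)

rng = np.random.default_rng(6)
X = rng.uniform(0, 7, size=(30, 1))       # toy 1-D inputs on [0, 7]
centres = X[:5]                           # a few data points used as centres
Phi = rbf_features(X, centres, alpha=1.0)
# The weights are then found exactly as before, e.g. np.linalg.lstsq(Phi, y, rcond=None).
```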
RBF example
[Figure: the RBF example dataset, y against x (x from 0 to 7, y from −0.5 to 2).]
32 / 38
RBF example

[Figure: the RBF example data, y against x, shown in two panels.]
33 / 38
An RBF feature
[Figure: left panel, the original data (y against x); right panel, y plotted against the RBF feature with c_1 = 3, α_1 = 1.]
34 / 38
Another RBF feature
Notice how the feature functions “specialize” in input space.
[Figure: left panel, the original data (y against x); right panel, y plotted against the RBF feature with c_2 = 6, α_2 = 1.]
35 / 38
RBF example
Run the RBF with both basis functions above, plot the residuals
y_i − φ(x_i)^T w

[Figure: left panel, the original data (y against x); right panel, the residuals of the two-basis-function RBF model, plotted against x.]
36 / 38
RBF: Ay, there’s the rub
I So why not use RBFs for everything?
I Short answer: You might need too many basis functions.
I This is especially true in high dimensions (we’ll say more
later)
I Too many means you probably overfit.
I Extreme example: Centre one on each training point.
I Also: notice that we haven’t seen yet in the course how to
learn the RBF parameters, i.e., the mean and standard
deviation of each kernel
I Main point of presenting RBFs now: Set up later methods
like support vector machines
37 / 38
Summary
I Linear regression is often useful out of the box.
I More useful than it would seem, because linear means
linear in the parameters. You can do a nonlinear transform
of the data first, e.g., polynomial, RBF. This point will come
up again.
I Maximum likelihood solution is computationally efficient
(pseudo-inverse)
I Danger of overfitting, especially with many features or
basis functions
38 / 38