NB 12
• Lecture notes: gradient for solving the multiple linear regression problem.
• LSD dataset
• Fitting a linear model
• Gradient explanation and examples
• Algorithms for linear least squares problem
• Solution of the normal equations
• Sources of error: sensitivities and perturbations
• Stress-testing and stability
• QR decomposition
• Performance tradeoff
• Cost of solving and scalability
• Online vs. batch regression
Overview of Linear Regression
We observe m data points (xᵢ, yᵢ). The univariate model predicts ŷᵢ = α xᵢ + β, where α is the slope and β is the intercept. The residual is the error at xᵢ of the model's prediction relative to the observed value, rᵢ = ŷᵢ − yᵢ.

Minimizing the residuals. With ŷᵢ the predicted response and yᵢ the observed response, we want to choose α and β to minimize the sum of squared residuals. To minimize this multivariate function, we take its partial derivatives with respect to each parameter and set them to zero at the extreme points.

More generally, collect the m observations into a data matrix X and the parameters into a vector θ = (θ₀, θ₁, …, θₙ). The best-fit parameters minimize the sum of squared residuals,
θ* ∈ argmin_θ ǁXθ − yǁ₂².
Differentiating and setting the result to zero leads to the condition 2XᵀXθ − 2Xᵀy = 0, as derived next.
Gradients & Residual Minimization
The gradient is the vector analog of the first derivative for a scalar function of a vector variable. For a function f(θ) of the vector θ = (θ₀, θ₁, …, θₙ), the gradient of f with respect to θ is the vector whose elements are the partial derivatives of f with respect to each element of θ.

Because we are summing squared residuals, the objective ǁXθ − yǁ₂² is a paraboloid, the multidimensional analog of a parabola. Viewed as a 3-D surface it is a bowl; viewed as a 2-D contour plot it is the same graph flattened, with the minimum at the center. When we implement the gradient calculation, we get a vector at every point instead of a scalar; drawn as little arrows on the contour plot, each arrow points in the direction of the gradient.

A useful identity: the gradient of vᵀv with respect to v is 2v. Applying the analogous calculation to the residual gives
∇_θ ǁXθ − yǁ₂² = 2XᵀXθ − 2Xᵀy,
where Xᵀ is the transpose of X.

What is θ*? It is the value of θ at which the gradient is zero, i.e., an extreme point and therefore a candidate for minimizing the objective:
2XᵀXθ − 2Xᵀy = 0  ⟹  XᵀXθ = Xᵀy,
the normal equation(s). We then plug X and y into the normal equations to solve for θ*.
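To make the gradient and the normal equations concrete, here is a small numerical sanity check. This is only an illustrative sketch; the tiny random problem and the variable names below are made up for the example and are not part of the notebook.

import numpy as np

# Tiny synthetic problem (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.normal(size=(10, 1))
theta = rng.normal(size=(3, 1))

# Analytic gradient of ||X theta - y||_2^2 with respect to theta
grad = 2*X.T.dot(X.dot(theta)) - 2*X.T.dot(y)

# Finite-difference check of the first component
f = lambda t: float(np.sum((X.dot(t) - y)**2))
h = 1e-6
e0 = np.zeros((3, 1)); e0[0] = 1.0
fd0 = (f(theta + h*e0) - f(theta - h*e0)) / (2*h)
assert np.isclose(grad[0, 0], fd0, rtol=1e-4)

# Setting the gradient to zero gives the normal equations, X^T X theta = X^T y
theta_star = np.linalg.solve(X.T.dot(X), X.T.dot(y))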
What does it cost to run the algorithm?
At an extreme point of argmin_θ ǁXθ − yǁ₂² (the sum of squared residuals, or errors), the gradient vanishes:
2(XᵀXθ − Xᵀy) = 0,
which yields the normal equations, XᵀXθ = Xᵀy, i.e., the linear least squares problem.
Cost: O(mn²), where m is the number of observations and n is the number of predictors. This is the asymptotic cost of fitting a linear model.

An online (incremental) algorithm for regression
When we see only one or a few data points at a time, it is only reasonable to do work that grows linearly with the number of observations. But solving XᵀXθ = Xᵀy from scratch each time a new observation arrives costs roughly n² + 2n² + 3n² + ⋯ + mn² = O(m²n²) in total.
Another approach: there are two ways to view the data matrix X.
1. By columns: X = [x₀ x₁ ⋯ xₙ], where each column xⱼ collects one predictor across all m observations.
2. By rows: X has m rows, one per observation, each of which we write as x̂ᵢᵀ.
The rows view suggests updating the model one observation at a time, taking a partial step per new data point.
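In NumPy terms, the two views correspond to slicing the data matrix by columns or by rows. A quick illustrative sketch (the array here is made up):

import numpy as np

X = np.arange(12.0).reshape(4, 3)   # 4 observations (rows), 3 predictors (columns)
x_j = X[:, 1]                       # columns view: predictor j across all observations
x_hat_k = X[2, :]                   # rows view: observation k, i.e., the row written above as x̂_k^T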
Part 0: Sample dataset (LSD)
In 1968, Wagner, Agahajanian, and Bing conducted a study to determine whether you could improve a student's math test scores using lysergic acid diethylamide, also known as "LSD."
Here is the original data source. The code cell below downloads the file from an alternative location, for compatibility with the Azure Notebooks platform you are using.
import requests
import os
import hashlib
import io
def on_vocareum():
return os.path.exists('.voc')
if on_vocareum():
URL_BASE = "https://cse6040.gatech.edu/datasets/"
DATA_PATH = "../resource/asnlib/publicdata/"
else:
URL_BASE = "https://github.com/cse6040/labs-fa17/raw/master/datasets/"
DATA_PATH = ""
Let's take a look at the data, first as a table and then using a scatter plot.
In [ ]: df = read_fwf('{}lsd.dat'.format(DATA_PATH),
colspecs=[(0, 4), (7, 13)],
names=['lsd_concentration', 'exam_score'])
display(df)
scatter(df['lsd_concentration'], df['exam_score'])
xlabel ('LSD Tissue Concentration')
title ('Shocking news: Math scores degrade with increasing LSD!');
Fitting a model
Exercise 0 (2 points). Complete the function below so that it computes α and β for the univariate model, y ≈ αx + β, given observations stored as NumPy arrays: y[:] for the responses and x[:] for the predictor. Use the linear regression formulas derived in class.
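For reference, here is a minimal sketch of one way linreg_fit could be written, using the standard closed-form formulas for simple linear regression. Treat it as one possibility, not necessarily the formulas exactly as presented in class.

import numpy as np

def linreg_fit(x, y):
    """Fit y ≈ alpha*x + beta by ordinary least squares; returns (alpha, beta)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    alpha = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar)**2).sum()
    beta = y_bar - alpha * x_bar
    return alpha, beta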
print("alpha:", alpha)
print("beta:", beta)
x, y = df['lsd_concentration'], df['exam_score']
alpha, beta = linreg_fit(x, y)
r = alpha*x + beta - y
ssqr = r.dot(r)
ssqr_ex = 253.88132881
print("\n(Passed!)")
x_fit = [x.min(), x.max()]                   # two endpoints suffice to draw a line
y_fit = [alpha*xi + beta for xi in x_fit]
scatter(x, y, marker='o')
plot(x_fit, y_fit, 'r--')
xlabel('LSD Tissue Concentration')
title('Best-fit linear model');
Fin! If you've gotten this far without errors, your notebook is ready to submit.
Exercise 0 (1 point). Implement the Python function, f(x0, x1), so that it computes f(x0, x1) = x0² + x1².
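A minimal implementation consistent with the checks below, assuming the target function is indeed x0² + x1²:

def f(x0, x1):
    return x0**2 + x1**2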
f_vec = vectorize(f)
theta = randn(1000)
assert all(isclose(f_vec(sin(theta), cos(theta)), 1.0))
print("\n(Passed!)")
(Passed!)
The gradient
Let's create a mesh of coordinate values:
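The cell that builds the mesh is not shown; it presumably looks something like the following sketch, with the grid bounds and spacing inferred from the printed output below.

import numpy as np
from numpy import vectorize

grid = np.linspace(-2, 2, 11)      # 11 points from -2 to 2, spaced 0.4 apart
X0, X1 = np.meshgrid(grid, grid)   # coordinate matrices for the 2-D grid
Z = vectorize(f)(X0, X1)           # evaluate f at every grid point

print("X0:\n", X0)
print("X1:\n", X1)
print("Z:\n", Z)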
X0:
[[-2. -1.6 -1.2 -0.8 -0.4 0. 0.4 0.8 1.2 1.6 2. ]
[-2. -1.6 -1.2 -0.8 -0.4 0. 0.4 0.8 1.2 1.6 2. ]
[-2. -1.6 -1.2 -0.8 -0.4 0. 0.4 0.8 1.2 1.6 2. ]
[-2. -1.6 -1.2 -0.8 -0.4 0. 0.4 0.8 1.2 1.6 2. ]
[-2. -1.6 -1.2 -0.8 -0.4 0. 0.4 0.8 1.2 1.6 2. ]
[-2. -1.6 -1.2 -0.8 -0.4 0. 0.4 0.8 1.2 1.6 2. ]
[-2. -1.6 -1.2 -0.8 -0.4 0. 0.4 0.8 1.2 1.6 2. ]
[-2. -1.6 -1.2 -0.8 -0.4 0. 0.4 0.8 1.2 1.6 2. ]
[-2. -1.6 -1.2 -0.8 -0.4 0. 0.4 0.8 1.2 1.6 2. ]
[-2. -1.6 -1.2 -0.8 -0.4 0. 0.4 0.8 1.2 1.6 2. ]
[-2. -1.6 -1.2 -0.8 -0.4 0. 0.4 0.8 1.2 1.6 2. ]]
X1:
[[-2. -2. -2. -2. -2. -2. -2. -2. -2. -2. -2. ]
[-1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6]
[-1.2 -1.2 -1.2 -1.2 -1.2 -1.2 -1.2 -1.2 -1.2 -1.2 -1.2]
[-0.8 -0.8 -0.8 -0.8 -0.8 -0.8 -0.8 -0.8 -0.8 -0.8 -0.8]
[-0.4 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]
[ 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8]
[ 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2]
[ 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6]
[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. ]]
Z:
[[8. 6.56 5.44 4.64 4.16 4. 4.16 4.64 5.44 6.56 8. ]
[6.56 5.12 4. 3.2 2.72 2.56 2.72 3.2 4. 5.12 6.56]
[5.44 4. 2.88 2.08 1.6 1.44 1.6 2.08 2.88 4. 5.44]
[4.64 3.2 2.08 1.28 0.8 0.64 0.8 1.28 2.08 3.2 4.64]
[4.16 2.72 1.6 0.8 0.32 0.16 0.32 0.8 1.6 2.72 4.16]
[4. 2.56 1.44 0.64 0.16 0. 0.16 0.64 1.44 2.56 4. ]
[4.16 2.72 1.6 0.8 0.32 0.16 0.32 0.8 1.6 2.72 4.16]
[4.64 3.2 2.08 1.28 0.8 0.64 0.8 1.28 2.08 3.2 4.64]
[5.44 4. 2.88 2.08 1.6 1.44 1.6 2.08 2.88 4. 5.44]
[6.56 5.12 4. 3.2 2.72 2.56 2.72 3.2 4. 5.12 6.56]
[8. 6.56 5.44 4.64 4.16 4. 4.16 4.64 5.44 6.56 8. ]]
ax2d = fig.add_subplot(122)
cp = ax2d.contour(X0, X1, Z)
xlabel('x0')
ylabel('x1')
Out[6]: Text(0,0.5,'x1')
The gradient of f(x0, x1) with respect to (x0, x1) is ∇f = (∂f/∂x0, ∂f/∂x1) = (2 x0, 2 x1).
Exercise 1 (1 point). Implement a function, grad_f(x0, x1), that implements the gradient shown above. It should return a pair of values, since the gradient for this f has two components.
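A sketch consistent with the gradient above and with the checks below:

def grad_f(x0, x1):
    return 2*x0, 2*x1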
grad_f_vec = vectorize(grad_f)
z = randn(5)
gx, gy = grad_f_vec(z, -z)
assert all(isclose(gx*0.5, z)) and all(isclose(gy*(-0.5), z)), "Your function might have a bug..."
print("\n(Passed!)")
(Passed!)
Fin! If you've gotten this far without errors, your notebook is ready to submit.
Part 2: Algorithms for the linear least squares problem
Recall the linear regression problem: given a data matrix, X, and responses y, we wish to determine the model parameters θ that minimize ǁXθ − yǁ₂².
This problem is also known as the linear least squares problem.
You may rightly ask, why bother with such details? Here are three reasons it's worth looking more closely.
1. It's helpful to have some deeper intuition for how one formalizes a mathematical problem and derives a computational solution, in
case you ever encounter a problem that does not exactly fit what a canned library can do for you.
2. If you have ever used a statistical analysis package, it's likely you have encountered "strange" numerical errors or warnings.
Knowing how problems are derived can help you understand what might have gone wrong. We will see an example below.
3. Because data analysis is quickly evolving, it's likely that new problems and new models will not exactly fit the template of existing
models. Therefore, it's possible you will need to derive a new model or know how to talk to someone who can derive one for you.
Implementation note. In this notebook, we ask that you use the following convention: any column vector should be explicit. That means its shape should have two dimensions, where the column dimension equals one (1).
Exercise 0 (ungraded). Inspect the following code cell and make sure you understand the difference between the two conventions for storing a vector: as a one-dimensional array versus as a two-dimensional array (matrix) where the number of columns equals one (1). When you are asked to produce a vector, we will generally ask you to follow the second convention (z_colvec).
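The inspected cell is not reproduced above; here is a sketch of the two conventions it presumably contrasts.

import numpy as np

z_array = np.array([1.0, 2.0, 3.0])                 # 1-D array, shape (3,)
z_colvec = np.reshape(z_array, (len(z_array), 1))   # explicit column vector, shape (3, 1)

print("`z_array`:\n", z_array)
print("\n`z_colvec`:\n", z_colvec)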
`z_array`:
[1. 2. 3.]
`z_colvec`:
[[1.]
[2.]
[3.]]
Before beginning, run this code cell to load some of the key modules you'll need.
# Viz
from IPython.display import display, Math
from matplotlib.pyplot import figure, subplot, xlim, ylim
from matplotlib.pyplot import scatter, axis, xlabel, ylabel, title, plot
%matplotlib inline
# From: https://stackoverflow.com/questions/17129290/numpy-2d-and-1d-array-to-latex-bmatrix
def nparray_to_bmatrix(a):
"""Returns a LaTeX bmatrix"""
assert len(a.shape) <= 2, 'bmatrix can at most display two dimensions'
lines = str(a).replace('[', '').replace(']', '').splitlines()
rv = [r'\begin{bmatrix}']
rv += [' ' + ' & '.join(l.split()) + r'\\' for l in lines]
rv += [r'\end{bmatrix}']
return '\n'.join(rv)
Your data consists of m observations and n + 1 variables. One of these variables is the response variable, y, which you want to predict from the other variables, {x₀, …, x_{n−1}}. You wish to fit a linear model of the following form to these data,
yᵢ ≈ x_{i,0} θ₀ + x_{i,1} θ₁ + ⋯ + x_{i,n−1} θ_{n−1} + θₙ,
where {θⱼ | 0 ≤ j ≤ n} is the set of unknown coefficients. Your modeling task is to choose values for these coefficients that "best fit" the data.
If we further define a set of dummy variables, x_{i,n} ≡ 1.0, associated with the θₙ parameter, then the model can be written more compactly in matrix notation as
y ≈ Xθ.
Visually, you can also arrange the observations into a tibble like this one:

y          x₀          x₁          ⋯   x_{n−1}        xₙ
y₀         x_{0,0}     x_{0,1}     ⋯   x_{0,n−1}      1.0
y₁         x_{1,0}     x_{1,1}     ⋯   x_{1,n−1}      1.0
y₂         x_{2,0}     x_{2,1}     ⋯   x_{2,n−1}      1.0
⋮          ⋮           ⋮               ⋮              ⋮
y_{m−1}    x_{m−1,0}   x_{m−1,1}   ⋯   x_{m−1,n−1}    1.0

This tibble includes an extra column (variable), xₙ, whose entries are all equal to 1.0.
Synthetic problem generator. For the exercises in this notebook, we will generate synthetic data. The function, gen_problem(m, n), will return a triple, X, y, theta, which are an m x (n+1) data matrix X, a response vector y, and the "true" model parameters theta. We will then run two different numerical algorithms that estimate theta from X and y, and see how their answers compare against the true value.
Note 1. The problem generator constructs the data matrix X such that each entry (i, j) is iʲ. This structure makes it an instance of a Vandermonde matrix (https://en.wikipedia.org/wiki/Vandermonde_matrix), which arises when fitting a polynomial to data. The "true" parameter vector θ is set to all ones, and y is computed simply by summing the rows of X.
Note 2. Although our usual convention is to make the last column of X all ones, the Vandermonde matrix has its first column set to all ones. The ordering is not important in this problem, but it does mean one would interpret θ₀ as the intercept rather than θₙ, which will be our usual convention.
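Based on the two notes above, here is a sketch of what gen_problem(m, n) presumably does; the notebook's actual implementation may differ in details.

import numpy as np

def gen_problem(m, n):
    """Synthetic Vandermonde-style problem: X[i, j] = i**j, theta = all ones, y = row sums of X."""
    X = np.empty((m, n+1))
    for j in range(n+1):
        X[:, j] = np.arange(m)**j
    theta_true = np.ones((n+1, 1))
    y = np.sum(X, axis=1, keepdims=True)   # equals X.dot(theta_true) since theta_true is all ones
    return X, y, theta_true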
X =
[[ 1.  0.  0.]
 [ 1.  1.  1.]
 [ 1.  2.  4.]
 [ 1.  3.  9.]
 [ 1.  4. 16.]
 [ 1.  5. 25.]
 [ 1.  6. 36.]
 [ 1.  7. 49.]
 [ 1.  8. 64.]
 [ 1.  9. 81.]]

y =
[[ 1.]
 [ 3.]
 [ 7.]
 [13.]
 [21.]
 [31.]
 [43.]
 [57.]
 [73.]
 [91.]]

θ =
[[1.]
 [1.]
 [1.]]
We are interested primarily in overdetermined systems, meaning X has more rows than columns, i.e., m > n + 1, as shown above. That's because we typically have more observations (data points, or rows) than predictors (variables, or columns). For such problems, there is generally no θ that satisfies Xθ = y exactly. Therefore, to identify some solution, we need to ask for the "best" fit and say what we mean by "best." For linear regression, the usual definition of "best" is minimizing the sum-of-squared residual error:
θ* = argmin_θ ǁXθ − yǁ₂².
Solving this minimization problem is equivalent to solving a special system known as the normal equations,
XᵀXθ = Xᵀy.
Algorithm 1: solve the normal equations directly. The steps are:
1. Form C ≡ XᵀX. This object is sometimes called the Gram matrix (https://en.wikipedia.org/wiki/Gramian_matrix) or Gramian of X.
2. Form b ≡ Xᵀy.
3. Solve Cθ = b for θ.
But, is this a "good" algorithm? There are at least three dimensions along which we might answer this question.
1. Is it accurate enough?
2. Is it fast enough?
3. Is it memory-efficient enough?
Exercise 1 (3 points). Implement a function, solve_neq(X, y), that implements Algorithm 1. It should return a Numpy vector containing the model parameter estimates.
Recall the steps of the algorithm as previously outlined:
Your algorithm should carry out these steps. For the third step, use Scipy's routine, scipy.linalg.solve() (https://docs.scipy.org/doc/scipy/reference/tutorial/linalg.html#solving-linear-system). It has an option that allows you to indicate that C is symmetric positive definite, which will be true of C for our synthetic problem.
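A minimal sketch of such a solver, assuming y follows the column-vector convention and using SciPy's assume_a='pos' option (available in recent SciPy versions) to exploit the symmetric positive definiteness of C:

import numpy as np
from scipy.linalg import solve

def solve_neq(X, y):
    C = X.T.dot(X)                      # Step 1: Gram matrix
    b = X.T.dot(y)                      # Step 2: right-hand side
    return solve(C, b, assume_a='pos')  # Step 3: solve C theta = b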
The code cell will run your function to compute a set of parameter estimates. It will store these in a variable named theta_neq, which we will refer to later.
theta_neq = solve_neq(X, y)
try:
del np.linalg.lstsq
solve_neq(X, y)
except NameError as n:
if re.findall('lstsq', n.args[0]):
print("*** Double-check that you did not try to use `lstsq()`. ***")
raise n
except AttributeError as a:
if re.findall('lstsq', a.args[0]):
print("*** Double-check that you did not try to use `lstsq()`. ***")
raise a
finally:
np.linalg.lstsq = SAVE_LSTSQ
print("\n(Passed!)")
(Passed!)
Exercise 2 (1 point). Write a function to calculate the residual norm, ǁrǁ₂ = ǁXθ − yǁ₂.
Although we are minimizing ǁrǁ₂², for this exercise your function should return ǁrǁ₂.
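A minimal sketch of such a function (the name calc_residual_norm matches how it is called later in the notebook):

import numpy as np

def calc_residual_norm(X, y, theta):
    r = X.dot(theta) - y
    return np.linalg.norm(r)   # Euclidean norm of the residual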
(Passed.)
Sources of error
We said before that one question we should ask about our algorithm is whether it is "accurate enough." But what does that mean?
Exercise 3 (ungraded). For any modeling problem, there will be several sources of error. Describe at least three such sources.
1. There will be errors in the inputs. That is, the data itself may only represent measurements of a certain accuracy.
2. There will be errors in the model. That is, the model is only an approximation of the underlying phenomena.
3. There will be errors in the algorithm. That is, you may implement an algorithm that can only approximately estimate the parameters of the model.
4. There will be roundoff errors. Recall that floating-point arithmetic necessarily represents all values finitely, which means you may lose accuracy every time you do an arithmetic operation.
Perturbations. One way to understand error in a numerical computation is to consider how sensitive the computed solution is to perturbations to the inputs.
That is, suppose we change X by an amount ΔX. We can then ask by how much the computed model parameters θ change. If they change by a lot, our method for computing them may be overly sensitive to perturbations. Instead, we might prefer one method over another one that is more sensitive to such changes.
Let's see how Algorithm 1 fares under small perturbations. But first, we'll need a method to generate a random perturbation of a certain maximum size.
Exercise 4 (2 points). Implement a function that returns an m × n matrix whose entries are uniformly randomly distributed in the interval [0, ϵ], for a given value of ϵ.
Hint: Check out Numpy's module for generating (pseudo)random numbers: numpy.random
(https://docs.scipy.org/doc/numpy/reference/routines.random.html)
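A sketch using numpy.random.uniform, which is one of several reasonable choices:

import numpy as np

def random_mat(m, n, eps):
    """Return an m-by-n matrix whose entries are drawn uniformly at random from [0, eps]."""
    return np.random.uniform(low=0.0, high=eps, size=(m, n))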
print(random_mat(3, 2, 1e-3))
[[0.00059339 0.00057468]
[0.00092299 0.00063703]
[0.0003482 0.00091574]]
(Passed.)
Exercise 5 (2 points). Use your random_mat() function to write another function, perturb_system(X, y, eps), that creates two "perturbations" to the system defined by X and y, as sketched below:
1. Let ΔX be the first perturbation. It should have the same dimensions as X, and its entries should lie in the interval [−ϵ, ϵ]. The value of ϵ is given by eps.
2. The second is Δy, a small perturbation to the response variable, y. Its entries should also lie in the same interval, [−ϵ, ϵ].
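A sketch that reuses random_mat(), shifting a [0, 2ϵ] draw to obtain entries in [−ϵ, ϵ]; it assumes y is stored as an explicit column vector:

def perturb_system(X, y, eps):
    Delta_X = random_mat(X.shape[0], X.shape[1], 2*eps) - eps
    Delta_y = random_mat(y.shape[0], y.shape[1], 2*eps) - eps
    return X + Delta_X, y + Delta_y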
EPSILON = 0.1
X_perturbed, y_perturbed = perturb_system(X, y, EPSILON)
Delta_X = X_perturbed - X
Delta_y = y_perturbed - y
display(Math(r'\Delta X = {}, \quad \Delta y = {}'.format(nparray_to_bmatrix(Delta_X[:5, :]),
nparray_to_bmatrix(Delta_y[:5]))))
ΔX (first five rows) =
[[-0.0106095   0.01422098 -0.06028698]
 [ 0.06029219 -0.02538373  0.08494938]
 [ 0.06021075 -0.09076259  0.07316003]
 [ 0.03928084 -0.05499615 -0.07508389]
 [-0.00887685  0.07215012 -0.03684651]]

Δy (first five entries) =
[[-0.04424354]
 [ 0.01997222]
 [-0.02917007]
 [ 0.04818775]
 [ 0.06929618]]
Delta_X = X_perturbed - X
Delta_y = y_perturbed - y
(Passed.)
Sensitivity of Algorithm 1
Let's now run the following code, which uses your code from above to perform a "sensitivity experiment." In particular, the function
run_perturbation_trials() will repeatedly perturb the system and measure the resulting change to the estimated θ .
All of the estimated θ are stored in an array, Thetas_neq. Each column k of Thetas_neq, or Thetas_neq[:, k], is one of the calculated estimates under a random perturbation of the system.
The size of the random perturbation is set, by default, to eps=0.01. Recall that our synthetic problem consists of numerical values that are all greater than or equal to one, so this perturbation may be regarded as fairly small.
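The experiment driver is defined elsewhere in the notebook; the following sketch shows roughly what it must do, assuming perturb_system() behaves as in Exercise 5. The number of trials used here is an arbitrary assumption.

import numpy as np

def run_perturbation_trials(solver, X, y, eps=0.01, trials=100):
    Thetas = np.zeros((X.shape[1], trials))   # column k holds the estimate from trial k
    for k in range(trials):
        X_p, y_p = perturb_system(X, y, eps)
        Thetas[:, k:k+1] = solver(X_p, y_p)   # solver returns a column vector
    return Thetas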
Thetas_neq = run_perturbation_trials(solve_neq, X, y)
print("Unperturbed solution:")
print(theta_neq)
Unperturbed solution:
[[1.]
[1.]
[1.]]
First few perturbed solutions (columns):
[[0.99028632 0.99231283 1.02243806 1.00139275 1.01358837]
[1.00529885 1.00407045 0.99118553 0.99889269 0.9952955 ]
[0.99954928 0.99960384 1.00074977 1.00007595 1.0003454 ]]
Here is a quick plot that shows two coordinates of the true parameters (red star), compared to all of the perturbed estimates (blue points). We would have more confidence in the algorithm's computed solutions if it did not appear to be too sensitive to changes in the input.
Since θ may have more than two coordinates, the code below shows only the first two.
You should observe that the changes in the estimates are of the same order as the perturbation. So for this example system, the algorithm seems reliable enough.
Stress-testing Algorithm 1
This experiment suggests all is fine. But what should we expect to happen?
In Numpy, there is a condition number estimator that will tell us approximately what the condition number is for a given matrix. Let's compare κ(X) against κ(C) = κ(XᵀX):
show_cond_fancy(cond_X, 'X')
show_cond_fancy(cond_XTX, 'X^T X')
show_cond_fancy(cond_X**2, 'X', opt='^2')
κ(X) ≈ 1.07 × 10²
κ(X)² ≈ 1.15 × 10⁴
Ill-conditioning. As it happens, κ(C) is roughly the square of κ(X). So, by forming C explicitly and then trying to solve a system based on it, we make the problem more difficult. Indeed, if the problem is ill-conditioned enough, this algorithm based on directly constructing the normal equations will produce very different results even under small changes to the input, and we call the algorithm unstable.
In this particular example, the condition numbers are not very "big." You would be more concerned if the condition numbers were close to 1/ϵ_d, where ϵ_d is machine epsilon. In double precision, recall that ϵ_d ≈ 10⁻¹⁵, so the values shown above are nothing to be worried about.
But what if we had a "hard" problem, that is, one whose condition number is large? The synthetic data generator allows us to create such a problem by making the problem bigger. Let's try that next.
In [15]: # Generate a "hard" problem
m_hard, n_hard = 100, 6
X_hard, y_hard, theta_hard_true = gen_problem(m_hard, n_hard)
cond_X_hard = np.linalg.cond(X_hard)
cond_XTX_hard = np.linalg.cond(X_hard.T.dot(X_hard))
κ(X_h) ≈ 1.72 × 10¹²
κ(X_hᵀ X_h) ≈ 2.82 × 10²³
These condition numbers are much larger. So, let's run the same sensitivity experiment as before, and see how the estimate varies for the hard problem. How does it compare to the well-conditioned case?
Observe that the computed estimates can be relatively far from the true value, even getting the sign completely wrong in the case of θ₀.
Algorithm 2: QR decomposition
A different method for solving an overdetermined system is to use a tool from linear algebra known as the QR decomposition (https://en.wikipedia.org/wiki/QR_decomposition).
Here is how we can use QR. If X has linearly independent columns, then we would first factor the m × n matrix X into the product X = QR, where Q is an m × n orthogonal matrix and R is an invertible n × n upper-triangular matrix. (These dimensions assume m ≥ n.) That Q is orthogonal means that QᵀQ = I, the identity matrix; R being upper-triangular means all of its entries below the main diagonal are zero.
Next, observe that the normal equations can be transformed if we substitute X = QR:
XᵀXθ = Xᵀy
RᵀQᵀQRθ = RᵀQᵀy
Rθ = Qᵀy.
Lastly, because R is triangular, solving a system is "easy" using (backward) substitution. Consider the following 3 × 3 example (taken from here
(http://www.purplemath.com/modules/systlin6.htm)):
[ 5   4  −1 ] [ θ₀ ]   [  0 ]
[ 0  10  −3 ] [ θ₁ ] = [ 11 ].
[ 0   0   1 ] [ θ₂ ]   [  3 ]

Because it is upper-triangular, you can see right away that 1·θ₂ = 3, so θ₂ = 3. Then, going to the equation above it, 10θ₁ − 3θ₂ = 10θ₁ − 3(3) = 11, so θ₁ = 2. Lastly, 5θ₀ + 4θ₁ − θ₂ = 5θ₀ + 4(2) − 3 = 0, so θ₀ = −1.
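To see the same answer numerically, here is a quick check of this example using SciPy's triangular solver; this demo is not part of the original notebook.

import numpy as np
from scipy.linalg import solve_triangular

R = np.array([[5., 4., -1.],
              [0., 10., -3.],
              [0., 0., 1.]])
z = np.array([0., 11., 3.])
print(solve_triangular(R, z, lower=False))   # approximately [-1.  2.  3.]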
So, to summarize, a different algorithm to solve Xθ ≈ y using QR would look like the following:
1. Compute X = QR.
2. Form the modified right-hand side, z = Q Ty.
3. Use back-substitution to solve Rθ = z.
Conditioning. What about the sensitivity of this algorithm? Given the factorization, we only need to solve linear systems involving R. Therefore, it's κ(R) that governs the numerical stability of the algorithm. So if κ(R) is comparable to κ(X), then the algorithm should be about as stable as one can expect any algorithm to be.
[[ 1. 0. 0.]
[ 1. 1. 1.]
[ 1. 2. 4.]
[ 1. 3. 9.]
[ 1. 4. 16.]]
...
Q: (10, 3)
R: (3, 3) ==
[[ -3.16227766 -14.23024947 -90.12491331]
[ 0. 9.08295106 81.74655956]
[ 0. 0. 22.97825059]]
assert type(Q) is np.ndarray, "`Q` is not a Numpy array but should be."
assert type(R) is np.ndarray, "`R` is not a Numpy array but should be."
assert Q.shape == (m, n+1), "`Q` has the wrong shape: it's {} rather than {}.".format(Q.shape, (m, n+1))
assert R.shape == (n+1, n+1), "`R` has the wrong shape: it's {} rather than {}.".format(R.shape, (n+1, n+1))
for i in range(R.shape[0]):
for j in range(i):
assert np.isclose(R[i][j], 0.0), "R[{}][{}] == {} instead of 0!".format(i, j, R[i][j])
QTQ = Q.T.dot(Q)
assert np.isclose(QTQ, np.eye(Q.shape[1])).all(), "Q^T Q is not nearly the identity matrix, as it should be."
print("\n(Passed!)")
(Passed!)
Condition number of R. Let's check the condition number of R empirically, to verify that it is comparable to κ(X).
show_cond_fancy(cond_X, 'X')
show_cond_fancy(cond_XTX, 'X^T X')
show_cond_fancy(cond_R, 'R')
κ(X) ≈ 1.07 × 10²
κ(R) ≈ 1.07 × 10²
Exercise 7 (3 points). Implement a function, solve_qr(X, y), which uses the QR-based algorithm to estimate θ .
To solve the triangular system, use Scipy's specialized function, available as sp.linalg.solve_triangular()
(https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.solve_triangular.html).
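A minimal sketch of the QR-based solver; it assumes numpy.linalg.qr (reduced mode) for the factorization, which is one reasonable choice.

import numpy as np
from scipy.linalg import solve_triangular

def solve_qr(X, y):
    Q, R = np.linalg.qr(X)    # Step 1: thin QR; Q is m-by-(n+1), R is (n+1)-by-(n+1)
    z = Q.T.dot(y)            # Step 2: modified right-hand side
    return solve_triangular(R, z, lower=False)   # Step 3: back-substitution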
theta_qr = solve_qr(X, y)
print("Residual norm:")
calc_residual_norm(X, y, theta_qr)
Residual norm:
Out[20]: 1.5434895314732317e-14
try:
del np.linalg.lstsq
solve_qr(X, y)
except NameError as n:
if re.findall('lstsq', n.args[0]):
print("*** Double-check that you did not try to use `lstsq()`. ***")
raise n
except AttributeError as a:
if re.findall('lstsq', a.args[0]):
print("*** Double-check that you did not try to use `lstsq()`. ***")
raise a
finally:
np.linalg.lstsq = SAVE_LSTSQ
print("\n(Passed!)")
(Passed!)
Is QR more stable? Let's run the same perturbation experiments on the "hard" regression problem and see the result.
You should observe that the QR-based method does, indeed, produce estimates much closer to the true value despite the problem's high condition number.
Performance tradeoff. Although QR produces more reliable results, there can be a performance tradeoff, as the following quick test should show.
In [23]: print("=== Performance of the normal equations-based algorithm ===")
%timeit solve_neq(X_hard, y_hard)
This notebook only has one exercise, but it is not graded. So, you should complete it for your own edification.
We will also implement a linear least squares solver, estimate_coeffs(), that simply calls Numpy's lstsq() routine.
m = 50
theta_true = generate_model (1)
(X, y) = generate_data (m, theta_true, sigma=0.1)
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.plot (X[:, 1], y, 'b+') # Noisy observations
ax1.plot (X[:, 1], X.dot (theta), 'r*') # Fit
ax1.plot (X[:, 1], X.dot (theta_true), 'go') # True solution
Dimensions of X: (50, 2)
Dimensions of theta_true: (2, 1)
Dimensions of y: (50, 1)
Condition number of X: 4.1056142618693565
True model coefficients: [[0.74734942 0.76050905]]
Estimated model coefficients: [[0.73296025 0.74693774]]
Benchmark varying m. Let's benchmark the time to compute θ when the dimension n of each data point is fixed but the number of points m varies. How does the running time scale with m?
n = 32 # dimension
M = [100, 1000, 10000, 100000, 1000000]
times = [0.] * len (M)
for (i, m) in enumerate (M):
theta_true = generate_model (n)
(X, y) = generate_data (m, theta_true, sigma=0.1)
t = %timeit -o estimate_coeffs (X, y)
times[i] = t.best
804 µs ± 7.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.13 ms ± 931 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.8 ms ± 899 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
107 ms ± 13.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.39 s ± 47.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.loglog (M, times, 'bo')
ax1.loglog (M, t_linear, 'r--')
ax1.set_xlabel ('m (number of observations)')
fig.suptitle ('Running time (fixed number of predictors)')
Benchmark varying n. Now let's fix the number of observations m and vary the number of predictors n.
Hint: You can adapt the preceding benchmark. Also, note that the code cell following the one immediately below will plot your results against both linear and quadratic reference trends in n.
347 µs ± 238 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
520 µs ± 372 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
998 µs ± 75.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.96 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.82 ms ± 454 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
16.7 ms ± 1.78 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
40.1 ms ± 7.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
97.9 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
249 ms ± 366 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
876 ms ± 178 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.loglog (N, times, 'bo')
ax1.loglog (N, t_linear, 'r--')
ax1.loglog (N, t_quadratic, 'g--')
ax1.set_xlabel ('n (number of predictors)')
fig.suptitle ('Running time (fixed number of observations)')
Thus, the empirical scaling appears to be pretty reasonable, being roughly linear in m. And while being quadratic in n sounds bad, one expects n to be much smaller than m in practical regression problems.
Fin! If you've gotten this far without errors, your notebook is ready to submit.
To start, here is some code to help generate synthetic problems of a certain size, namely, m × (n + 1), where m is the number of observations and n is the number of predictors. The +1 comes from our usual dummy coefficient for a non-zero intercept.
An online algorithm
The empirical scaling of linear least squares appears to be pretty good, being roughly linear in m or at worst quadratic in n. But there is still a downside in time and storage: each time there is a change in the data, you appear to need to form the data matrix all over again and recompute the solution from scratch, possibly touching the entire data set again!
This begs the question: is there a way to incrementally update the model coefficients whenever a new data point, or perhaps a small batch of new data points, arrives? Such a procedure would be considered incremental or online, rather than batched or offline.
Setup: Key assumptions and main goal. In the discussion that follows, assume that you only get to see the observations one at a time. Let (y_k, x̂_kᵀ) denote the current observation. (Relative to our previous notation, this tuple is just element k of y and row k of X.)
We will use x̂_kᵀ to denote row k of X, since we previously used x_j to denote column j of X. That is,

X = [ x₀  ⋯  xₙ ] = [ x̂₀ᵀ       ]
                    [ ⋮          ]
                    [ x̂_{m−1}ᵀ  ],

where the first form is our previous "columns-view" representation and the second form is our "rows-view."
Additionally, assume that, at the time the k-th observation arrives, you start with a current estimate of the parameters, θ̃(k), which is a vector. If for some reason you need to refer to element i of that vector, use θ̃_i(k). You will then compute a new estimate, θ̃(k+1), using θ̃(k) and (y_k, x̂_kᵀ). For the discussion below, further assume that you throw out θ̃(k) once you have θ̃(k+1).
As for your goal, recall that in the batch setting you start with all of the observations, (y, X). From this starting point, you may estimate the linear regression model's parameters, θ, by solving Xθ ≈ y. In the online setting, you compute estimates one at a time. After seeing all m observations in X, your goal is to compute a final estimate θ̃(m−1) ≈ θ.
An intuitive (but flawed) idea. Indeed, there is a technique from the signal processing literature that we can apply to the linear regression problem: the least mean square (LMS) algorithm. Before describing it, let's start with an initial idea.
Suppose that you have a current estimate of the parameters, θ̃(k), when you get a new sample, (y_k, x̂_kᵀ). The error in your prediction will be
y_k − x̂_kᵀ θ̃(k).
Ideally, this error would be zero. So, let's ask if there exists a correction, Δ_k, such that
y_k − x̂_kᵀ (θ̃(k) + Δ_k) = 0,
that is,
y_k − x̂_kᵀ θ̃(k) = x̂_kᵀ Δ_k.
Then, you could compute a new estimate of the parameters by θ̃(k+1) = θ̃(k) + Δ_k.
This idea has a major flaw, which we will discuss below. But before we do, please try the following exercise.
Mental exercise (no points). Verify that the following choice of Δ k would make the preceding equation true.
Δ_k = (x̂_k / ǁx̂_kǁ₂²) (y_k − x̂_kᵀ θ̃(k)).
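Here is a quick numerical sanity check of this choice of Δ_k; it is purely illustrative, and the random data below are made up.

import numpy as np

rng = np.random.default_rng(1)
x_hat = rng.normal(size=(4, 1))     # one observation, as a column vector
theta_k = rng.normal(size=(4, 1))   # current estimate
y_k = rng.normal()

pred_err = y_k - (x_hat.T.dot(theta_k)).item()
delta_k = x_hat * pred_err / (x_hat.T.dot(x_hat)).item()

# After the correction, the prediction error on this observation is (numerically) zero
residual_after = y_k - (x_hat.T.dot(theta_k + delta_k)).item()
assert np.isclose(residual_after, 0.0)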
Refining (or rather, "hacking") the basic idea: The least mean square (LMS) procedure. The basic idea sketched above has at least one major flaw: this choice of Δ_k might allow you to correctly predict y_k from x̂_k and the new estimate θ̃(k+1) = θ̃(k) + Δ_k, but there is no guarantee that this new estimate preserves the quality of predictions made at all previous iterations!
There are a number of ways to deal with this problem, which include carrying out an update with respect to some (or all) previous data. However, there is also a simpler "hack" that, though it might require some parameter tuning, can be made to work in practice.
That hack is as follows. Rather than using Δ_k as computed above, let's compute a different update that has a "fudge" factor, ϕ:
θ̃(k+1) = θ̃(k) + Δ_k, where Δ_k = ϕ x̂_k (y_k − x̂_kᵀ θ̃(k)).
A big question is how to choose ϕ. There is some analysis out there that can help. We will just state the result of this analysis without proof.
Let λ_max(XᵀX) be the largest eigenvalue of XᵀX. The result is that, as the number of samples goes to infinity, any choice of ϕ satisfying the following condition will eventually converge to the best least-squares estimator of θ̃, that is, the estimate of θ̃ you would have gotten by solving the linear least squares problem with all of the data:
0 < ϕ < 2 / λ_max(XᵀX).
This condition is not very satisfying, because you cannot really know λ_max(XᵀX) until you've seen all the data, whereas we would like to apply the method online, as the data arrive. Nevertheless, in practice you can imagine hybrid schemes that, given an initial batch of data points, use the QR fitting procedure to get a starting estimate for θ̃ as well as to estimate a value of ϕ to use for all future updates.
(The update again, for reference: θ̃(k+1) = θ̃(k) + Δ_k, where Δ_k = ϕ x̂_k (y_k − x̂_kᵀ θ̃(k)).)
To start, let's generate an initial 1-D problem (2 regression coefficients, a slope, and an intercept), and solve it using the batch procedure.
Recall that we need a value for ϕ, for which we have an upper bound involving λ_max(XᵀX). Let's "cheat" by computing it explicitly, even though in practice we would need to do something different.
In [4]: m = 100000
n = 1
theta_true = generate_model(n)
(X, y) = generate_data(m, theta_true, sigma=0.1)
theta = estimate_coeffs(X, y)
e_rel = rel_diff(theta, theta_true)
126970.43769457837
Exercise 1 (5 points). Implement the online LMS algorithm in the code cell below where indicated. It should produce a final parameter estimate, theta_lms, as a column vector.
In addition, the skeleton code below uses rel_diff() to record the relative difference between the estimate and the true vector, storing the k-th relative difference in rel_diffs[k]. Doing so will allow you to see the convergence behavior of the method.
Lastly, to help you out, we've defined a constant, PHI, in terms of λ_max(XᵀX) that you can use for ϕ.
In practice, you would only maintain the current estimate, or maybe just a few recent estimates, rather than all of them. Since we want to inspect these vectors later, go ahead and store them all.
rel_diffs = np.zeros(m+1)       # rel_diffs[k] = relative error of the k-th estimate
theta_k = np.zeros((n+1, 1))    # current estimate, stored as a column vector
for k in range(m):
    rel_diffs[k] = rel_diff(theta_k, theta_true)
    # ... your LMS update of theta_k goes here ...
theta_lms = theta_k
rel_diffs[m] = rel_diff(theta_lms, theta_true)
Let's compare the true coefficients against the estimates, both from the batch algorithm and the online algorithm. The values of the variables below may change if the notebook is re-run from the start.
print("\n('Passed' -- this cell appears to run without error, but we aren't checking the solutio
[[0.87103594 0.46153527]]
[[0.87172037 0.46005275]]
[0.74984566 0.40192802]
('Passed' -- this cell appears to run without error, but we aren't checking the solution.)
Let's also compute the relative differences between each estimate Theta[:, k] and the true coefficients theta_true, measured in the two-norm, to see whether the estimate is converging to the truth.
Finally, if the dimension is n=1, let's go ahead and do a sanity-check regression fit plot. The plot can change if the notebook is re-run from the start.
Exercise 2 (ungraded, optional). We said previously that, in practice, you would probably do some sort of hybrid scheme that mixes full batch updates (possibly only initially) and incremental updates. Implement such a scheme and describe what you observe. You might observe a different plot each time the cell is re-run.
if theta_0 is None:
    theta_k = np.zeros((n))
else:
    theta_k = theta_0
theta_k = np.reshape(theta_k, (n, 1))   # enforce the explicit column-vector convention
for k in range(m):
    x_k = X[k:k+1, :].T                 # observation k as a column vector (x̂_k)
    r_k = y[k] - x_k.T.dot(theta_k)     # prediction error on this observation
    delta_k = PHI * r_k * x_k           # LMS correction with fudge factor PHI
    theta_k = theta_k + delta_k
return theta_k
(100000, 2)
[[0.54400102]
[0.33553129]]
(99900, 2)
[[0.51075904]
[0.39327854]]
=== Sum of squared errors ===
True: [[994.76733987]]
Batch: [[994.76638474]]
Batch-100: [[1040.65200804]]
LMS(phi=0.014605068287428212): [[999.28578087]]
Batch (k <= 100) + LMS: [[999.28578087]]
=== Relative error in the coefficients (theta) ===
Batch: 0.00025460403740161595
Batch-100: 0.12047813595445639
LMS(phi=0.014605068287428212): 0.018194600645439905
Batch (k <= 100) + LMS: 0.018194600645439905