Transpose & Dot Product
Def: The transpose of an m × n matrix A is the n × m matrix AT whose
columns are the rows of A.
So: The columns of AT are the rows of A. The rows of AT are the columns
of A.
Example: If A = [ 1 2 3 ; 4 5 6 ] (a 2 × 3 matrix), then AT = [ 1 4 ; 2 5 ; 3 6 ] (a 3 × 2 matrix).
Convention: From now on, vectors v ∈ Rn will be regarded as “columns”
(i.e.: n × 1 matrices). Therefore, vT is a “row vector” (a 1 × n matrix).
Observation: Let v, w ∈ Rn . Then vT w = v · w. This is because:
vT w = [ v1 · · · vn ] [ w1 ; · · · ; wn ] = v1 w1 + · · · + vn wn = v · w.
Where theory is concerned, the key property of transposes is the following:
Prop 18.2: Let A be an m × n matrix. Then for x ∈ Rn and y ∈ Rm :
(Ax) · y = x · (AT y).
Here, · is the dot product of vectors.
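A quick numerical sanity check of Prop 18.2 (a minimal Python/NumPy sketch; the matrix and vectors below are arbitrary choices, not taken from the text):

    import numpy as np

    A = np.array([[1., 2., 3.],
                  [4., 5., 6.]])       # 2 x 3, so A : R^3 -> R^2
    x = np.array([1., -1., 2.])        # x in R^3
    y = np.array([3., 0.5])            # y in R^2

    # (Ax) . y should equal x . (A^T y)
    print(np.dot(A @ x, y), np.dot(x, A.T @ y))   # both print 20.5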
Extended Example
Let A be a 5 × 3 matrix, so A : R3 → R5 .
◦ N (A) is a subspace of R3
◦ C(A) is a subspace of R5
The transpose AT is a 3 × 5 matrix, so AT : R5 → R3 .
◦ C(AT ) is a subspace of R3
◦ N (AT ) is a subspace of R5
Observation: Both C(AT ) and N (A) are subspaces of R3 . Might there
be a geometric relationship between the two? (No, they’re not equal.) Hm...
Also: Both N (AT ) and C(A) are subspaces of R5 . Might there be a
geometric relationship between the two? (Again, they’re not equal.) Hm...
Orthogonal Complements
Def: Let V ⊂ Rn be a subspace. The orthogonal complement of V is the
set
V ⊥ = {x ∈ Rn | x · v = 0 for every v ∈ V }.
So, V ⊥ consists of the vectors which are orthogonal to every vector in V .
Fact: If V ⊂ Rn is a subspace, then V ⊥ ⊂ Rn is a subspace.
Examples in R3 :
◦ The orthogonal complement of V = {0} is V ⊥ = R3
◦ The orthogonal complement of V = {z-axis} is V ⊥ = {xy-plane}
◦ The orthogonal complement of V = {xy-plane} is V ⊥ = {z-axis}
◦ The orthogonal complement of V = R3 is V ⊥ = {0}
Examples in R4 :
◦ The orthogonal complement of V = {0} is V ⊥ = R4
◦ The orthogonal complement of V = {w-axis} is V ⊥ = {xyz-space}
◦ The orthogonal complement of V = {zw-plane} is V ⊥ = {xy-plane}
◦ The orthogonal complement of V = {xyz-space} is V ⊥ = {w-axis}
◦ The orthogonal complement of V = R4 is V ⊥ = {0}
Prop 19.3-19.4-19.5: Let V ⊂ Rn be a subspace. Then:
(a) dim(V ) + dim(V ⊥ ) = n
(b) (V ⊥ )⊥ = V
(c) V ∩ V ⊥ = {0}
(d) V + V ⊥ = Rn .
Part (d) means: “Every vector x ∈ Rn can be written as a sum x = v + w
where v ∈ V and w ∈ V ⊥ .”
Also, it turns out that the expression x = v + w is unique: that is, there
is only one way to write x as a sum of a vector in V and a vector in V ⊥ .
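A small numerical illustration of these facts (a Python/NumPy sketch; V is the span of two arbitrary vectors in R4, and V ⊥ is computed as the null space of the matrix whose rows are those vectors):

    import numpy as np

    # V = span of the rows of M, a subspace of R^4
    M = np.array([[1., 0., 2., -1.],
                  [0., 1., 1.,  3.]])

    # V-perp = {x : M x = 0}: read a basis off the SVD (right-singular
    # vectors belonging to zero singular values).
    _, s, Vt = np.linalg.svd(M)
    rank = np.sum(s > 1e-10)
    W = Vt[rank:]                        # rows of W: a basis of V-perp

    print(M.shape[1] == rank + W.shape[0])   # dim V + dim V-perp = n   -> True
    print(np.allclose(M @ W.T, 0))           # each basis vector of V-perp is
                                             # orthogonal to all of V   -> True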
Meaning of C(AT ) and N (AT )
Q: What does C(AT ) mean? Well, the columns of AT are the rows of A. So:
C(AT ) = column space of AT
= span of columns of AT
= span of rows of A.
For this reason: We call C(AT ) the row space of A.
Q: What does N (AT ) mean? Well:
x ∈ N (AT ) ⇐⇒ AT x = 0
⇐⇒ (AT x)T = 0T
⇐⇒ xT A = 0T .
So, for an m × n matrix A, we see that: N (AT ) = {x ∈ Rm | xT A = 0T }.
For this reason: We call N (AT ) the left null space of A.
Relationships among the Subspaces
Theorem: Let A be an m × n matrix. Then:
◦ C(AT ) = N (A)⊥
◦ N (AT ) = C(A)⊥
Corollary: Let A be an m × n matrix. Then:
◦ C(A) = N (AT )⊥
◦ N (A) = C(AT )⊥
Prop 18.3: Let A be an m × n matrix. Then rank(A) = rank(AT ).
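A quick check of the theorem and of Prop 18.3 on a small example (a Python/NumPy sketch; the matrix is an arbitrary rank-2 choice):

    import numpy as np

    A = np.array([[1., 2., 0.],
                  [2., 4., 1.],
                  [3., 6., 1.]])         # 3 x 3, rank 2

    print(np.linalg.matrix_rank(A) == np.linalg.matrix_rank(A.T))   # True

    # C(A^T) = N(A)-perp: every row of A is orthogonal to every vector in N(A).
    _, s, Vt = np.linalg.svd(A)
    nullA = Vt[np.sum(s > 1e-10):]       # rows: a basis of N(A)
    print(np.allclose(A @ nullA.T, 0))   # rows of A are orthogonal to N(A) -> True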
Motivating Questions for Reading
Problem 1: Let b ∈ C(A). So, the system of equations Ax = b does have
solutions, possibly infinitely many.
Q: What is the solution x of Ax = b with ‖x‖ smallest?
Problem 2: Let b ∉ C(A). So, the system of equations Ax = b does not
have any solutions. In other words, Ax − b ≠ 0.
Q: What is the vector x that minimizes the error ‖Ax − b‖? That is, what
is the vector x that comes closest to being a solution to Ax = b?
Orthogonal Projection
Def: Let V ⊂ Rn be a subspace. Then every vector x ∈ Rn can be written
uniquely as
x = v + w, where v ∈ V and w ∈ V ⊥ .
The orthogonal projection onto V is the function ProjV : Rn → Rn
given by: ProjV (x) = v. (Note that ProjV ⊥ (x) = w.)
Prop 20.1: Let V ⊂ Rn be a subspace. Then:
ProjV + ProjV ⊥ = In .
Of course, we already knew this: We have x = v+w = ProjV (x)+ProjV ⊥ (x).
Formula: Let {v1 , . . . , vk } be a basis of V ⊂ Rn . Let A be the n × k matrix
A = [ v1 · · · vk ].
Then:
ProjV = A(AT A)−1 AT . (∗)
Geometry Observations: Let V ⊂ Rn be a subspace, and x ∈ Rn a vector.
(1) The distance from x to V is: ‖ProjV ⊥ (x)‖ = ‖x − ProjV (x)‖.
(2) The vector in V that is closest to x is: ProjV (x).
Derivation of (∗): Notice ProjV (x) is a vector in V = span(v1 , . . . , vk ) = C(A) = Range(A), and
therefore ProjV (x) = Ay for some vector y ∈ Rk .
Now notice that x − ProjV (x) = x − Ay is a vector in V ⊥ = C(A)⊥ = N (AT ), which means
that AT (x − Ay) = 0, which means AT x = AT Ay.
Now, it turns out that our matrix AT A is invertible (proof in L20), so we get y = (AT A)−1 AT x.
Thus, ProjV (x) = Ay = A(AT A)−1 AT x. ♦
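A minimal Python/NumPy sketch of formula (∗) (the basis vectors below, spanning a plane V in R3, and the vector x are arbitrary choices):

    import numpy as np

    # Columns of A: a basis {v1, v2} of a plane V in R^3
    A = np.array([[1., 0.],
                  [1., 1.],
                  [0., 2.]])

    P = A @ np.linalg.inv(A.T @ A) @ A.T    # Proj_V = A (A^T A)^{-1} A^T

    x = np.array([3., -1., 4.])
    v = P @ x                               # Proj_V(x): the vector in V closest to x
    w = x - v                               # Proj_{V-perp}(x)

    print(np.allclose(A.T @ w, 0))          # w is orthogonal to V       -> True
    print(np.allclose(P @ P, P))            # projecting twice = once    -> True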
Minimum Magnitude Solution
Prop 19.6: Let b ∈ C(A) (so Ax = b has solutions). Then there exists
exactly one vector x0 ∈ C(AT ) with Ax0 = b.
And: Among all solutions of Ax = b, the vector x0 has the smallest length.
In other words: There is exactly one vector x0 in the row space of A which
solves Ax = b – and this vector is the solution of smallest length.
To Find x0 : Start with any solution x of Ax = b. Then
x0 = ProjC(AT ) (x).
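A small Python/NumPy sketch of this recipe (the matrix, b, and the starting solution are arbitrary choices): project a particular solution onto the row space C(AT ), then check that the result still solves the system, has smaller length, and agrees with NumPy's pseudoinverse (which returns the minimum-norm solution):

    import numpy as np

    A = np.array([[1., 1., 0.],
                  [0., 1., 1.]])           # 2 x 3, full row rank: Ax = b is solvable
    b = np.array([2., 3.])

    x = np.array([2., 0., 3.])             # one particular solution (A @ x == b)

    # Project x onto C(A^T) using formula (*) with the columns of A^T as basis.
    R = A.T
    P = R @ np.linalg.inv(R.T @ R) @ R.T
    x0 = P @ x                             # the minimum-length solution

    print(np.allclose(A @ x0, b))                    # still solves Ax = b   -> True
    print(np.linalg.norm(x0) < np.linalg.norm(x))    # shorter than x        -> True
    print(np.allclose(x0, np.linalg.pinv(A) @ b))    # matches pinv solution -> True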
Least Squares Approximation
Idea: Suppose b ∉ C(A). So, Ax = b has no solutions, so Ax − b ≠ 0.
We want to find the vector x∗ which minimizes the error ‖Ax∗ − b‖. That
is, we want the vector x∗ for which Ax∗ is the closest vector in C(A) to b.
In other words, we want the vector x∗ for which Ax∗ − b is orthogonal to
C(A). So, Ax∗ − b ∈ C(A)⊥ = N (AT ), meaning that AT (Ax∗ − b) = 0, i.e.:
AT Ax∗ = AT b.
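A minimal Python/NumPy sketch of the normal equations (the data are made up; the classic use case is fitting a line y ≈ c0 + c1 t to data points, which has no exact solution):

    import numpy as np

    t = np.array([0., 1., 2., 3.])
    y = np.array([1., 2.1, 2.9, 4.2])           # made-up data, not exactly on a line

    A = np.column_stack([np.ones_like(t), t])   # columns: 1 and t
    b = y

    x_star = np.linalg.solve(A.T @ A, A.T @ b)  # solve  A^T A x* = A^T b
    print(x_star)                               # [intercept, slope]

    # Same answer from NumPy's built-in least-squares routine:
    print(np.linalg.lstsq(A, b, rcond=None)[0])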
Quadratic Forms (Intro)
Given an m × n matrix A, we can regard it as a linear transformation
T : Rn → Rm . In the special case where the matrix A is a symmetric matrix,
we can also regard A as defining a “quadratic form”:
Def: Let A be a symmetric n × n matrix. The quadratic form associated
to A is the function QA : Rn → R given by:
QA (x) = x · Ax (· is the dot product)
= xT Ax = [ x1 · · · xn ] A [ x1 ; · · · ; xn ].
Notice that quadratic forms are not linear transformations!
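A tiny Python/NumPy sketch of evaluating a quadratic form (the symmetric matrix and the input vector are arbitrary examples):

    import numpy as np

    A = np.array([[2., 1.],
                  [1., 3.]])                # symmetric 2 x 2

    def Q_A(x):
        return x @ A @ x                    # x^T A x  (equivalently x . (Ax))

    x = np.array([1., -2.])
    print(Q_A(x))                           # 2(1)^2 + 2(1)(-2) + 3(-2)^2 = 10

    # Not linear: Q_A(2x) = 4 Q_A(x), not 2 Q_A(x).
    print(Q_A(2 * x), 4 * Q_A(x))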
Orthonormal Bases
Def: A basis {w1 , . . . , wk } for a subspace V is an orthonormal basis if:
(1) The basis vectors are mutually orthogonal: wi · wj = 0 (for i ≠ j);
(2) The basis vectors are unit vectors: wi · wi = 1 (i.e.: ‖wi ‖ = 1).
Orthonormal bases are nice for (at least) two reasons:
(a) It is much easier to find the B-coordinates [v]B of a vector when the
basis B is orthonormal;
(b) It is much easier to find the projection matrix onto a subspace V
when we have an orthonormal basis for V .
Prop: Let {w1 , . . . , wk } be an orthonormal basis for a subspace V ⊂ Rn .
(a) Every vector v ∈ V can be written
v = (v · w1 )w1 + · · · + (v · wk )wk .
(b) For all x ∈ Rn :
ProjV (x) = (x · w1 )w1 + · · · + (x · wk )wk .
(c) Let A be the matrix with columns {w1 , . . . , wk }. Then AT A = Ik , so:
ProjV = A(AT A)−1 AT = AAT .
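A quick Python/NumPy check of parts (b) and (c) (a sketch; the orthonormal basis below spans an arbitrary plane in R3):

    import numpy as np

    # An orthonormal basis {w1, w2} for a plane V in R^3
    w1 = np.array([1., 1., 0.]) / np.sqrt(2)
    w2 = np.array([0., 0., 1.])
    A = np.column_stack([w1, w2])

    print(np.allclose(A.T @ A, np.eye(2)))   # A^T A = I_2            -> True

    P = A @ A.T                              # Proj_V = A A^T
    x = np.array([1., 2., 3.])

    # Part (b): Proj_V(x) = (x . w1) w1 + (x . w2) w2
    print(np.allclose(P @ x, (x @ w1) * w1 + (x @ w2) * w2))          # True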
Orthogonal Matrices
Def: An orthogonal matrix is an invertible matrix C such that
C −1 = C T .
Example: Let {v1 , . . . , vn } be an orthonormal basis for Rn . Then the matrix
C = [ v1 · · · vn ]
is an orthogonal matrix.
In fact, every orthogonal matrix C looks like this: the columns of any
orthogonal matrix form an orthonormal basis of Rn .
Where theory is concerned, the key property of orthogonal matrices is:
Prop 22.4: Let C be an orthogonal matrix. Then for v, w ∈ Rn :
Cv · Cw = v · w.
Gram-Schmidt Process
Since orthonormal bases have so many nice properties, it would be great
if we had a way of actually manufacturing orthonormal bases. That is:
Goal: We are given a basis {v1 , . . . , vk } for a subspace V ⊂ Rn . We would
like an orthonormal basis {w1 , . . . , wk } for our subspace V .
Notation: We will let
V1 = span(v1 )
V2 = span(v1 , v2 )
..
.
Vk = span(v1 , . . . , vk ) = V.
Idea: Build an orthonormal basis for V1 , then for V2 , . . . , up to Vk = V .
Gram-Schmidt Algorithm: Let {v1 , . . . , vk } be a basis for V ⊂ Rn .
(1) Define w1 = v1 /‖v1 ‖.
(2) Having defined {w1 , . . . , wj }, let
yj+1 = vj+1 − ProjVj (vj+1 )
= vj+1 − (vj+1 · w1 )w1 − (vj+1 · w2 )w2 − · · · − (vj+1 · wj )wj ,
and define wj+1 = yj+1 /‖yj+1 ‖.
Then {w1 , . . . , wk } is an orthonormal basis for V .
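A short Python/NumPy implementation of the algorithm exactly as stated in steps (1)-(2) (a sketch; the input basis is an arbitrary example, and linear independence of the input is assumed):

    import numpy as np

    def gram_schmidt(vs):
        # Turn linearly independent vectors into an orthonormal basis of their span.
        ws = []
        for v in vs:
            y = v - sum((v @ w) * w for w in ws)   # y = v - Proj_{V_j}(v)
            ws.append(y / np.linalg.norm(y))       # normalize
        return ws

    vs = [np.array([1., 1., 0.]), np.array([1., 0., 1.])]
    W = np.column_stack(gram_schmidt(vs))
    print(np.allclose(W.T @ W, np.eye(2)))         # orthonormal -> True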
Definiteness
Def: Let Q : Rn → R be a quadratic form.
We say Q is positive definite if Q(x) > 0 for all x ≠ 0.
We say Q is negative definite if Q(x) < 0 for all x ≠ 0.
We say Q is indefinite if there are vectors x for which Q(x) > 0, and also
vectors x for which Q(x) < 0.
Def: Let A be a symmetric matrix.
We say A is positive definite if QA (x) = xT Ax > 0 for all x ≠ 0.
We say A is negative definite if QA (x) = xT Ax < 0 for all x ≠ 0.
We say A is indefinite if there are vectors x for which xT Ax > 0, and
also vectors x for which xT Ax < 0.
In other words:
◦ A is positive definite ⇐⇒ QA is positive definite.
◦ A is negative definite ⇐⇒ QA is negative definite.
◦ A is indefinite ⇐⇒ QA is indefinite.
The Hessian
Def: Let f : Rn → R be a function. Its Hessian at a ∈ Rn is the symmetric
matrix of second partials:
Hf (a) = [ fx1 x1 (a) · · · fx1 xn (a) ; . . . ; fxn x1 (a) · · · fxn xn (a) ].
Note that the Hessian is a symmetric matrix. Therefore, we can also
regard Hf (a) as a quadratic form:
QHf (a) (x) = xT Hf (a) x = [ x1 · · · xn ] [ fx1 x1 (a) · · · fx1 xn (a) ; . . . ; fxn x1 (a) · · · fxn xn (a) ] [ x1 ; · · · ; xn ].
In particular, it makes sense to ask whether the Hessian is positive definite,
negative definite, or indefinite.
Single-Variable Calculus Review
Recall: In calculus, you learned that for a function f : R → R, a critical
point is a point a ∈ R where f 0 (a) = 0 or f 0 (a) does not exist.
You learned that if f (x) has a local min/max at x = a, then x = a is a
critical point. Of course, the converse is false: critical points don’t have to
be local minima or local maxima (e.g., they could be inflection points.)
You also learned the “second derivative test.” If x = a is a critical point
for f (x), then f 00 (a) > 0 tells us that x = a is a local min, whereas f 00 (a) < 0
tells us that x = a is a local max.
It would be nice to have similar statements in higher dimensions:
Critical Points & Second Derivative Test
Def: A critical point of f : Rn → R is a point a ∈ Rn at which Df (a) = 0T
or Df (a) is undefined.
In other words, each partial derivative ∂f /∂xi (a) is zero or undefined.
Theorem: If f : Rn → R has a local max / local min at a ∈ Rn , then a is a
critical point of f .
N.B.: The converse of this theorem is false! Critical points do not have to
be a local max or local min – e.g., they could be saddle points.
Def: A saddle point of f : Rn → R is a critical point of f that is not a local
max or local min.
Second Derivative Test: Let f : Rn → R be a function, and a ∈ Rn be a
critical point of f .
(a) If Hf (a) is positive definite, then a is a local min of f .
(b) If Hf (a) is positive semi-definite, then a is a local min or a saddle point.
(c) If Hf (a) is negative definite, then a is a local max of f .
(d) If Hf (a) is negative semi-definite, then a is a local max or a saddle point.
(e) If Hf (a) is indefinite, then a is a saddle point of f .
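A small worked example in Python/NumPy (a sketch; the function f(x, y) = x3 − 3x + y2 is an arbitrary choice whose gradient and Hessian are easy to write by hand, and definiteness is tested via the eigenvalue criterion from the Corollary after the Spectral Theorem below):

    import numpy as np

    # f(x, y) = x^3 - 3x + y^2
    # grad f = (3x^2 - 3, 2y)  ->  critical points at (1, 0) and (-1, 0)
    def hessian(x, y):
        return np.array([[6. * x, 0.],
                         [0.,     2.]])

    for a in [(1., 0.), (-1., 0.)]:
        eigs = np.linalg.eigvalsh(hessian(*a))     # Hessian is symmetric
        if np.all(eigs > 0):
            verdict = "local min"                  # positive definite
        elif np.all(eigs < 0):
            verdict = "local max"                  # negative definite
        elif eigs.min() < 0 < eigs.max():
            verdict = "saddle point"               # indefinite
        else:
            verdict = "inconclusive (semi-definite)"
        print(a, eigs, verdict)                    # (1,0): local min; (-1,0): saddle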
Local Extrema vs Global Extrema
Finding Local Extrema: We want to find the local extrema of a function
f : Rn → R.
(i) Find the critical points of f .
(ii) Use the Second Derivative Test to decide if the critical points are local
maxima / minima / saddle points.
Theorem: Let f : Rn → R be a function. If R ⊂ Rn is a closed and bounded
region, then f has a global max and a global min on R.
Finding Global Extrema: We want to find the global extrema of a func-
tion f : Rn → R on a region R ⊂ Rn .
(1) Find the critical points of f on the interior of R.
(2) Find the extreme values of f on the boundary of R. (Lagrange mult.)
Then:
◦ The largest value from Steps (1)-(2) is a global max value.
◦ The smallest value from Steps (1)-(2) is a global min value.
Lagrange Multipliers (Constrained Optimization)
Notation: Let f : Rn → Rm be a function, and S ⊂ Rn be a subset.
The restricted function f |S : S → Rm is the same exact function as f , but
where the domain is restricted to S.
Theorem: Suppose we want to optimize a function f (x1 , . . . , xn ) constrained
to a level set S = {g(x1 , . . . , xn ) = c}.
If a is an extreme value of f |S on the level set S = {g(x1 , . . . , xn ) = c},
and if ∇g(a) ≠ 0, then
∇f (a) = λ∇g(a)
for some constant λ.
Reason: If a is an extreme value of f |S on the level set S, then Dv f (a) = 0
for all vectors v that are tangent to the level set S. Therefore, ∇f (a) · v = 0
for all vectors v that are tangent to S.
This means that ∇f (a) is orthogonal to the level set S, so ∇f (a) must be
a scalar multiple of the normal vector ∇g(a). That is, ∇f (a) = λ∇g(a).
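A tiny numerical check of the theorem (a Python/NumPy sketch; the example f(x, y) = x + y constrained to the circle g(x, y) = x2 + y2 = 1 is an arbitrary choice whose maximizer (1/√2, 1/√2) is known):

    import numpy as np

    # f(x, y) = x + y  on the circle  g(x, y) = x^2 + y^2 = 1
    grad_f = lambda x, y: np.array([1., 1.])
    grad_g = lambda x, y: np.array([2. * x, 2. * y])

    a = np.array([1., 1.]) / np.sqrt(2)        # maximizer of f on the circle

    gf, gg = grad_f(*a), grad_g(*a)
    lam = gf[0] / gg[0]                        # candidate multiplier
    print(np.allclose(gf, lam * gg))           # grad f(a) = lambda grad g(a) -> True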
Motivation for Eigenvalues & Eigenvectors
We want to understand a quadratic form QA (x), which might be ugly and
complicated.
Idea: Maybe there’s an orthonormal basis B = {w1 , . . . , wn } of Rn that
is somehow “best suited to A” – so that with respect to the basis B, the
quadratic form QA looks simple.
What do we mean by “basis suited to A”? And does such a basis always
exist? Well:
Spectral Theorem: Let A be a symmetric n × n matrix. Then there exists
an orthonormal basis B = {w1 , . . . , wn } of Rn such that each w1 , . . . , wn is
an eigenvector of A.
i.e.: There is an orthonormal basis of Rn consisting of eigenvectors of A.
Why is this good? Well, since B is a basis, every w ∈ Rn can be written
w = u1 w1 + · · · + un wn . (That is, the B-coordinates of w are (u1 , . . . , un ).)
It then turns out that:
QA (w) = QA (u1 w1 + · · · + un wn )
= (u1 w1 + · · · + un wn ) · A(u1 w1 + · · · + un wn )
= λ1 (u1 )2 + λ2 (u2 )2 + · · · + λn (un )2 . (yay!)
In other words: the quadratic form QA is in diagonal form with respect to
the basis B. We have made QA look as simple as possible!
Also: the coefficients λ1 , . . . , λn are exactly the eigenvalues of A.
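A Python/NumPy sketch of this diagonal form (the symmetric matrix and the coordinate vector u are arbitrary; np.linalg.eigh returns the eigenvalues of a symmetric matrix together with an orthonormal basis of eigenvectors):

    import numpy as np

    A = np.array([[2., 1.],
                  [1., 2.]])                   # symmetric

    lams, W = np.linalg.eigh(A)                # columns of W: orthonormal eigenvectors

    u = np.array([0.7, -1.3])                  # B-coordinates (u1, u2) of some vector w
    w = W @ u                                  # w = u1 w1 + u2 w2

    # Q_A(w) should equal lambda1 u1^2 + lambda2 u2^2
    print(np.isclose(w @ A @ w, np.sum(lams * u**2)))    # True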
Corollary: Let A be a symmetric n × n matrix, with eigenvalues λ1 , . . . , λn .
(a) A is positive-definite ⇐⇒ all of λ1 , . . . , λn are positive.
(b) A is negative-definite ⇐⇒ all of λ1 , . . . , λn are negative.
(c) A is indefinite ⇐⇒ there is a positive eigenvalue λi > 0 and a negative
eigenvalue λj < 0.
Useful Fact: Let A be any n × n matrix, with eigenvalues λ1 , . . . , λn . Then
det(A) = λ1 λ2 · · · λn .
Cor: If any one of the eigenvalues λ1 , . . . , λn is zero, then det(A) = 0.
What is a (Unit) Sphere?
◦ The 1-sphere (the “unit circle”) is S1 = {(x, y) ∈ R2 | x2 + y 2 = 1} ⊂ R2 .
◦ The 2-sphere (the “sphere”) is S2 = {(x, y, z) ∈ R3 | x2 +y 2 +z 2 = 1} ⊂ R3 .
◦ The 3-sphere is S3 = {(x, y, z, w) ∈ R4 | x2 + y 2 + z 2 + w2 = 1} ⊂ R4 .
Note that the 3-sphere is not the same as the unit ball {x2 + y 2 + z 2 ≤ 1}.
◦ The (n − 1)-sphere is the set
Sn−1 = {(x1 , . . . , xn ) ∈ Rn | (x1 )2 + · · · + (xn )2 = 1}
= {x ∈ Rn | ‖x‖2 = 1} ⊂ Rn .
In other words, Sn−1 consists of the unit vectors in Rn .
Optimizing Quadratic Forms on Spheres
Problem: Optimize a quadratic form QA : Rn → R on the sphere Sn−1 ⊂ Rn .
That is, what are the maxima and minima of QA (w) subject to the con-
straint that ‖w‖ = 1?
Solution: Let λmax and λmin be the largest and smallest eigenvalues of A.
◦ The maximum value of QA for unit vectors is λmax . Any unit vector wmax
which attains this maximum is an eigenvector of A with eigenvalue λmax .
◦ The minimum value of QA for unit vectors is λmin . Any unit vector wmin
which attains this minimum is an eigenvector of A with eigenvalue λmin .
Corollary: Let A be a symmetric n × n matrix.
(a) A is positive-definite ⇐⇒ the minimum value of QA restricted to unit
vector inputs is positive (i.e., iff λmin > 0).
(b) A is negative-definite ⇐⇒ the maximum value of QA restricted to
unit vector inputs is negative (i.e., iff λmax < 0).
(c) A is indefinite ⇐⇒ λmax > 0 and λmin < 0.
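A quick Python/NumPy check of this solution (a sketch; the symmetric matrix is an arbitrary example, and the unit vectors are random samples):

    import numpy as np

    A = np.array([[3., 1., 0.],
                  [1., 2., 0.],
                  [0., 0., -1.]])              # symmetric

    lams, W = np.linalg.eigh(A)                # eigenvalues in increasing order
    lam_min, lam_max = lams[0], lams[-1]

    rng = np.random.default_rng(0)
    vs = rng.standard_normal((1000, 3))
    vs /= np.linalg.norm(vs, axis=1, keepdims=True)      # random unit vectors
    Q = np.einsum('ij,jk,ik->i', vs, A, vs)              # Q_A(v) for each v

    print(lam_min <= Q.min() and Q.max() <= lam_max)     # True
    # The extreme values are attained at unit eigenvectors:
    print(np.isclose(W[:, -1] @ A @ W[:, -1], lam_max))  # True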
Directional First & Second Derivatives
Def: Let f : Rn → R be a function, a ∈ Rn be a point.
The directional derivative of f at a in the direction v is:
Dv f (a) = ∇f (a) · v.
The “directional second derivative” of f at a in the direction v is:
QHf (a) (v) = vT Hf (a)v.
That is: the quadratic form whose associated matrix is the Hessian Hf (a).
Q: What direction v increases the directional derivative the most? What
direction v decreases the directional derivative the most?
A: We’ve learned this: the gradient ∇f (a) is the direction of greatest in-
crease, whereas −∇f (a) is the direction of greatest decrease.
New Questions:
◦ What direction v increases the directional second derivative the most?
◦ What direction v decreases the directional second derivative the most?
Answer: The (unit) directions of minimum and maximum second derivative
are (unitized) eigenvectors of Hf (a), and so they are mutually orthogonal.
The max/min values of the directional second derivative are the max/min
eigenvalues of Hf (a).