
Transpose & Dot Product

Def: The transpose of an m × n matrix A is the n × m matrix AT whose


columns are the rows of A.
So: The columns of AT are the rows of A. The rows of AT are the columns
of A.  
Example: If A = [ 1 2 3 ; 4 5 6 ] (rows separated by semicolons), then AT = [ 1 4 ; 2 5 ; 3 6 ].
Convention: From now on, vectors v ∈ Rn will be regarded as “columns”
(i.e.: n × 1 matrices). Therefore, vT is a “row vector” (a 1 × n matrix).

Observation: Let v, w ∈ Rn . Then vT w = v · w. This is because:


 
vT w = [ v1 · · · vn ] [ w1 ; . . . ; wn ] = v1 w1 + · · · + vn wn = v · w.
Where theory is concerned, the key property of transposes is the following:

Prop 18.2: Let A be an m × n matrix. Then for x ∈ Rn and y ∈ Rm :


(Ax) · y = x · (AT y).
Here, · is the dot product of vectors.
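As a quick sanity check, here is a minimal numerical sketch of Prop 18.2, assuming NumPy is available; the matrix and vectors are arbitrary illustrative choices, not taken from these notes.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # a 2 x 3 matrix, so A : R^3 -> R^2
x = np.array([1.0, -1.0, 2.0])    # x in R^3
y = np.array([3.0, 0.5])          # y in R^2

lhs = np.dot(A @ x, y)            # (Ax) . y
rhs = np.dot(x, A.T @ y)          # x . (A^T y)
assert np.isclose(lhs, rhs)       # both come out to 20.5 here
```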

Extended Example
Let A be a 5 × 3 matrix, so A : R3 → R5 .
◦ N (A) is a subspace of R3
◦ C(A) is a subspace of R5

The transpose AT is a 3 × 5 matrix, so AT : R5 → R3 .


◦ C(AT ) is a subspace of R3
◦ N (AT ) is a subspace of R5

Observation: Both C(AT ) and N (A) are subspaces of R3 . Might there


be a geometric relationship between the two? (No, they’re not equal.) Hm...

Also: Both N (AT ) and C(A) are subspaces of R5 . Might there be a


geometric relationship between the two? (Again, they’re not equal.) Hm...
Orthogonal Complements
Def: Let V ⊂ Rn be a subspace. The orthogonal complement of V is the
set
V ⊥ = {x ∈ Rn | x · v = 0 for every v ∈ V }.
So, V ⊥ consists of the vectors which are orthogonal to every vector in V .

Fact: If V ⊂ Rn is a subspace, then V ⊥ ⊂ Rn is a subspace.

Examples in R3 :
◦ The orthogonal complement of V = {0} is V ⊥ = R3
◦ The orthogonal complement of V = {z-axis} is V ⊥ = {xy-plane}
◦ The orthogonal complement of V = {xy-plane} is V ⊥ = {z-axis}
◦ The orthogonal complement of V = R3 is V ⊥ = {0}

Examples in R4 :
◦ The orthogonal complement of V = {0} is V ⊥ = R4
◦ The orthogonal complement of V = {w-axis} is V ⊥ = {xyz-space}
◦ The orthogonal complement of V = {zw-plane} is V ⊥ = {xy-plane}
◦ The orthogonal complement of V = {xyz-space} is V ⊥ = {w-axis}
◦ The orthogonal complement of V = R4 is V ⊥ = {0}

Prop 19.3-19.4-19.5: Let V ⊂ Rn be a subspace. Then:


(a) dim(V ) + dim(V ⊥ ) = n
(b) (V ⊥ )⊥ = V
(c) V ∩ V ⊥ = {0}
(d) V + V ⊥ = Rn .

Part (d) means: “Every vector x ∈ Rn can be written as a sum x = v + w


where v ∈ V and w ∈ V ⊥ .”
Also, it turns out that the expression x = v + w is unique: that is, there
is only one way to write x as a sum of a vector in V and a vector in V ⊥ .
Meaning of C(AT ) and N (AT )
Q: What does C(AT ) mean? Well, the columns of AT are the rows of A. So:
C(AT ) = column space of AT
= span of columns of AT
= span of rows of A.
For this reason: We call C(AT ) the row space of A.

Q: What does N (AT ) mean? Well:


x ∈ N (AT ) ⇐⇒ AT x = 0
⇐⇒ (AT x)T = 0T
⇐⇒ xT A = 0T .
So, for an m × n matrix A, we see that: N (AT ) = {x ∈ Rm | xT A = 0T }.
For this reason: We call N (AT ) the left null space of A.

Relationships among the Subspaces


Theorem: Let A be an m × n matrix. Then:
◦ C(AT ) = N (A)⊥
◦ N (AT ) = C(A)⊥

Corollary: Let A be an m × n matrix. Then:


◦ C(A) = N (AT )⊥
◦ N (A) = C(AT )⊥

Prop 18.3: Let A be an m × n matrix. Then rank(A) = rank(AT ).
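A small numerical sketch of the theorem and Prop 18.3, assuming NumPy; the matrix below is an arbitrary rank-2 example, and the null-space basis is read off from the SVD.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],    # a dependent row, so rank(A) = 2
              [0.0, 1.0, 1.0]])

assert np.linalg.matrix_rank(A) == np.linalg.matrix_rank(A.T)   # Prop 18.3

# Null-space basis: right singular vectors whose singular values vanish.
U, s, Vt = np.linalg.svd(A)
null_basis = Vt[np.sum(s > 1e-10):]       # rows spanning N(A)
assert np.allclose(A @ null_basis.T, 0)   # they really solve Ax = 0
assert np.allclose(null_basis @ A.T, 0)   # every row of A is orthogonal to N(A),
                                          # i.e. C(A^T) sits inside N(A)^perp
```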

Motivating Questions for Reading


Problem 1: Let b ∈ C(A). So, the system of equations Ax = b does have
solutions, possibly infinitely many.
Q: What is the solution x of Ax = b with ‖x‖ the smallest?

Problem 2: Let b ∉ C(A). So, the system of equations Ax = b does not
have any solutions. In other words, Ax − b ≠ 0.
Q: What is the vector x that minimizes the error ‖Ax − b‖? That is, what
is the vector x that comes closest to being a solution to Ax = b?
Orthogonal Projection
Def: Let V ⊂ Rn be a subspace. Then every vector x ∈ Rn can be written
uniquely as
x = v + w, where v ∈ V and w ∈ V ⊥ .
The orthogonal projection onto V is the function ProjV : Rn → Rn
given by: ProjV (x) = v. (Note that ProjV ⊥ (x) = w.)

Prop 20.1: Let V ⊂ Rn be a subspace. Then:


ProjV + ProjV ⊥ = In .
Of course, we already knew this: We have x = v+w = ProjV (x)+ProjV ⊥ (x).

Formula: Let {v1 , . . . , vk } be a basis of V ⊂ Rn . Let A be the n × k matrix


 

A = v1 · · · vk .

Then:
ProjV = A(AT A)−1 AT . (∗)

Geometry Observations: Let V ⊂ Rn be a subspace, and x ∈ Rn a vector.


(1) The distance from x to V is: ‖ProjV ⊥ (x)‖ = ‖x − ProjV (x)‖.
(2) The vector in V that is closest to x is: ProjV (x).

Derivation of (∗): Notice ProjV (x) is a vector in V = span(v1 , . . . , vk ) = C(A) = Range(A), and
therefore ProjV (x) = Ay for some vector y ∈ Rk .
Now notice that x − ProjV (x) = x − Ay is a vector in V ⊥ = C(A)⊥ = N (AT ), which means
that AT (x − Ay) = 0, which means AT x = AT Ay.
Now, it turns out that our matrix AT A is invertible (proof in L20), so we get y = (AT A)−1 AT x.
Thus, ProjV (x) = Ay = A(AT A)−1 AT x. ♦
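Here is a minimal sketch of formula (∗) in code, assuming NumPy; the basis vectors are arbitrary example choices.

```python
import numpy as np

v1 = np.array([1.0, 0.0, 1.0, 0.0])
v2 = np.array([0.0, 1.0, 1.0, 1.0])
A = np.column_stack([v1, v2])            # the n x k matrix with the basis as columns

P = A @ np.linalg.inv(A.T @ A) @ A.T     # Proj_V = A (A^T A)^{-1} A^T

x = np.array([1.0, 2.0, 3.0, 4.0])
v = P @ x                                # the piece of x lying in V
w = x - v                                # the piece of x lying in V^perp
assert np.allclose(P @ P, P)             # projecting twice changes nothing
assert np.allclose(A.T @ w, 0)           # w is orthogonal to every basis vector of V
```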
Minimum Magnitude Solution
Prop 19.6: Let b ∈ C(A) (so Ax = b has solutions). Then there exists
exactly one vector x0 ∈ C(AT ) with Ax0 = b.
And: Among all solutions of Ax = b, the vector x0 has the smallest length.

In other words: There is exactly one vector x0 in the row space of A which
solves Ax = b – and this vector is the solution of smallest length.

To Find x0 : Start with any solution x of Ax = b. Then

x0 = ProjC(AT ) (x).
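A sketch of Prop 19.6 in code, assuming NumPy; the system below is an arbitrary consistent example, and np.linalg.pinv is used only as an independent cross-check (it returns the same minimum-length solution).

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0]])
b = np.array([6.0, 2.0])                  # chosen so that b is in C(A)

x_particular = np.array([1.0, 1.0, 1.0])  # one solution of Ax = b
assert np.allclose(A @ x_particular, b)

R = A.T                                   # columns of R span the row space C(A^T)
P_row = R @ np.linalg.inv(R.T @ R) @ R.T  # projection onto C(A^T)
x0 = P_row @ x_particular                 # the minimum-length solution

assert np.allclose(A @ x0, b)                          # x0 still solves the system
assert np.allclose(x0, np.linalg.pinv(A) @ b)          # agrees with the pseudoinverse
assert np.linalg.norm(x0) <= np.linalg.norm(x_particular)
```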

Least Squares Approximation


Idea: Suppose b ∉ C(A). So, Ax = b has no solutions, so Ax − b ≠ 0.
We want to find the vector x∗ which minimizes the error ‖Ax∗ − b‖. That
is, we want the vector x∗ for which Ax∗ is the closest vector in C(A) to b.

In other words, we want the vector x∗ for which Ax∗ − b is orthogonal to


C(A). So, Ax∗ − b ∈ C(A)⊥ = N (AT ), meaning that AT (Ax∗ − b) = 0, i.e.:

AT Ax∗ = AT b.
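A minimal least-squares sketch, assuming NumPy: solve the normal equations AT Ax∗ = AT b directly and compare with np.linalg.lstsq; the data are arbitrary.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])              # 3 x 2, full column rank
b = np.array([1.0, 2.0, 2.0])           # not in C(A), so Ax = b has no solution

x_star = np.linalg.solve(A.T @ A, A.T @ b)      # normal equations
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)   # library least squares
assert np.allclose(x_star, x_ref)

residual = A @ x_star - b
assert np.allclose(A.T @ residual, 0)   # the error Ax* - b is orthogonal to C(A)
```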

Quadratic Forms (Intro)


Given an m × n matrix A, we can regard it as a linear transformation
T : Rn → Rm . In the special case where the matrix A is a symmetric matrix,
we can also regard A as defining a “quadratic form”:

Def: Let A be a symmetric n × n matrix. The quadratic form associated


to A is the function QA : Rn → R given by:

QA (x) = x · Ax (· is the dot product)


 
= xT Ax = [ x1 · · · xn ] A [ x1 ; . . . ; xn ].
Notice that quadratic forms are not linear transformations!
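A small sketch of a quadratic form in code, assuming NumPy; the symmetric matrix is an arbitrary example, and the last line illustrates the non-linearity.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])             # symmetric

def Q(x):
    return x @ A @ x                   # Q_A(x) = x . Ax = x^T A x

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
print(Q(x), Q(y), Q(x + y))            # 2.0 3.0 7.0, so Q(x + y) != Q(x) + Q(y)
```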
Orthonormal Bases
Def: A basis {w1 , . . . , wk } for a subspace V is an orthonormal basis if:
(1) The basis vectors are mutually orthogonal: wi · wj = 0 (for i ≠ j);
(2) The basis vectors are unit vectors: wi · wi = 1 (i.e.: ‖wi ‖ = 1).

Orthonormal bases are nice for (at least) two reasons:


(a) It is much easier to find the B-coordinates [v]B of a vector when the
basis B is orthonormal;
(b) It is much easier to find the projection matrix onto a subspace V
when we have an orthonormal basis for V .

Prop: Let {w1 , . . . , wk } be an orthonormal basis for a subspace V ⊂ Rn .


(a) Every vector v ∈ V can be written
v = (v · w1 )w1 + · · · + (v · wk )wk .
(b) For all x ∈ Rn :
ProjV (x) = (x · w1 )w1 + · · · + (x · wk )wk .
(c) Let A be the matrix with columns {w1 , . . . , wk }. Then AT A = Ik , so:
ProjV = A(AT A)−1 AT = AAT .
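A quick numerical sketch of parts (b) and (c), assuming NumPy; the orthonormal pair below is an arbitrary example.

```python
import numpy as np

w1 = np.array([1.0, 0.0, 0.0])
w2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(2.0)   # unit, and orthogonal to w1
A = np.column_stack([w1, w2])

assert np.allclose(A.T @ A, np.eye(2))          # A^T A = I_k

x = np.array([3.0, 1.0, 5.0])
proj_by_sums = (x @ w1) * w1 + (x @ w2) * w2    # (x . w1) w1 + (x . w2) w2
proj_by_matrix = (A @ A.T) @ x                  # Proj_V = A A^T
assert np.allclose(proj_by_sums, proj_by_matrix)
```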

Orthogonal Matrices
Def: An orthogonal matrix is an invertible matrix C such that
C −1 = C T .
Example: Let {v1 , . . . , vn } be an orthonormal basis for Rn . Then the matrix
 

C = v1 · · · vn 

is an orthogonal matrix.
In fact, every orthogonal matrix C looks like this: the columns of any
orthogonal matrix form an orthonormal basis of Rn .

Where theory is concerned, the key property of orthogonal matrices is:

Prop 22.4: Let C be an orthogonal matrix. Then for v, w ∈ Rn :


Cv · Cw = v · w.
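A quick numerical check of Prop 22.4, assuming NumPy; the orthogonal matrix is manufactured here from the QR factorization of a random matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
C, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # C has orthonormal columns
assert np.allclose(C.T @ C, np.eye(4))             # so C^{-1} = C^T

v = rng.standard_normal(4)
w = rng.standard_normal(4)
assert np.isclose((C @ v) @ (C @ w), v @ w)        # Cv . Cw = v . w
```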
Gram-Schmidt Process
Since orthonormal bases have so many nice properties, it would be great
if we had a way of actually manufacturing orthonormal bases. That is:

Goal: We are given a basis {v1 , . . . , vk } for a subspace V ⊂ Rn . We would


like an orthonormal basis {w1 , . . . , wk } for our subspace V .

Notation: We will let

V1 = span(v1 )
V2 = span(v1 , v2 )
. . .
Vk = span(v1 , . . . , vk ) = V.

Idea: Build an orthonormal basis for V1 , then for V2 , . . . , up to Vk = V .

Gram-Schmidt Algorithm: Let {v1 , . . . , vk } be a basis for V ⊂ Rn .


(1) Define w1 = v1 /‖v1 ‖.
(2) Having defined {w1 , . . . , wj }, let

yj+1 = vj+1 − ProjVj (vj+1 )
= vj+1 − (vj+1 · w1 )w1 − (vj+1 · w2 )w2 − · · · − (vj+1 · wj )wj ,
and define wj+1 = yj+1 /‖yj+1 ‖.
Then {w1 , . . . , wk } is an orthonormal basis for V .
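A minimal Gram-Schmidt sketch, assuming NumPy; it follows the algorithm above literally (subtract the projection onto the span of the previous w's, then normalize), with an arbitrary two-vector basis as the test case.

```python
import numpy as np

def gram_schmidt(vectors):
    """Return an orthonormal basis for span(vectors); assumes the input is a basis."""
    ws = []
    for v in vectors:
        y = v - sum((v @ w) * w for w in ws)   # y_{j+1} = v_{j+1} - Proj_{V_j}(v_{j+1})
        ws.append(y / np.linalg.norm(y))       # w_{j+1} = y_{j+1} / ||y_{j+1}||
    return ws

basis = [np.array([1.0, 1.0, 0.0]),
         np.array([1.0, 0.0, 1.0])]
w1, w2 = gram_schmidt(basis)
assert np.isclose(w1 @ w2, 0)                              # mutually orthogonal
assert np.isclose(w1 @ w1, 1) and np.isclose(w2 @ w2, 1)   # unit vectors
```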
Definiteness
Def: Let Q : Rn → R be a quadratic form.
We say Q is positive definite if Q(x) > 0 for all x ≠ 0.
We say Q is negative definite if Q(x) < 0 for all x ≠ 0.
We say Q is indefinite if there are vectors x for which Q(x) > 0, and also
vectors x for which Q(x) < 0.

Def: Let A be a symmetric matrix.


We say A is positive definite if QA (x) = xT Ax > 0 for all x ≠ 0.
We say A is negative definite if QA (x) = xT Ax < 0 for all x ≠ 0.
We say A is indefinite if there are vectors x for which xT Ax > 0, and
also vectors x for which xT Ax < 0.

In other words:
◦ A is positive definite ⇐⇒ QA is positive definite.
◦ A is negative definite ⇐⇒ QA is negative definite.
◦ A is indefinite ⇐⇒ QA is indefinite.

The Hessian
Def: Let f : Rn → R be a function. Its Hessian at a ∈ Rn is the symmetric
matrix of second partials:
Hf (a) = [ fx1 x1 (a) · · · fx1 xn (a) ; . . . ; fxn x1 (a) · · · fxn xn (a) ].

Note that the Hessian is a symmetric matrix. Therefore, we can also


regard Hf (a) as a quadratic form:
QHf (a) (x) = xT Hf (a) x = [ x1 · · · xn ] Hf (a) [ x1 ; . . . ; xn ].
In particular, it makes sense to ask whether the Hessian is positive definite,
negative definite, or indefinite.
Single-Variable Calculus Review
Recall: In calculus, you learned that for a function f : R → R, a critical
point is a point a ∈ R where f ′ (a) = 0 or f ′ (a) does not exist.
You learned that if f (x) has a local min/max at x = a, then x = a is a
critical point. Of course, the converse is false: critical points don't have to
be local minima or local maxima (e.g., they could be inflection points).
You also learned the “second derivative test.” If x = a is a critical point
for f (x), then f ′′ (a) > 0 tells us that x = a is a local min, whereas f ′′ (a) < 0
tells us that x = a is a local max.

It would be nice to have similar statements in higher dimensions:

Critical Points & Second Derivative Test


Def: A critical point of f : Rn → R is a point a ∈ Rn at which Df (a) = 0T
or Df (a) is undefined.
In other words, each partial derivative ∂f /∂xi (a) is zero or undefined.

Theorem: If f : Rn → R has a local max / local min at a ∈ Rn , then a is a


critical point of f .

N.B.: The converse of this theorem is false! Critical points do not have to
be a local max or local min – e.g., they could be saddle points.

Def: A saddle point of f : Rn → R is a critical point of f that is not a local


max or local min.

Second Derivative Test: Let f : Rn → R be a function, and a ∈ Rn be a


critical point of f .
(a) If Hf (a) is positive definite, then a is a local min of f .
(b) If Hf (a) is positive semi-definite, then a is a local min or a saddle point.
(c) If Hf (a) is negative definite, then a is a local max of f .
(d) If Hf (a) is negative semi-definite, then a is a local max or a saddle point.
(e) If Hf (a) is indefinite, then a is a saddle point of f .
Local Extrema vs Global Extrema
Finding Local Extrema: We want to find the local extrema of a function
f : Rn → R.
(i) Find the critical points of f .
(ii) Use the Second Derivative Test to decide if the critical points are local
maxima / minima / saddle points.

Theorem: Let f : Rn → R be a function. If R ⊂ Rn is a closed and bounded


region, then f has a global max and a global min on R.

Finding Global Extrema: We want to find the global extrema of a func-


tion f : Rn → R on a region R ⊂ Rn .
(1) Find the critical points of f on the interior of R.
(2) Find the extreme values of f on the boundary of R. (Lagrange mult.)
Then:
◦ The largest value from Steps (1)-(2) is a global max value.
◦ The smallest value from Steps (1)-(2) is a global min value.

Lagrange Multipliers (Constrained Optimization)


Notation: Let f : Rn → Rm be a function, and S ⊂ Rn be a subset.
The restricted function f |S : S → Rm is the same exact function as f , but
where the domain is restricted to S.

Theorem: Suppose we want to optimize a function f (x1 , . . . , xn ) constrained


to a level set S = {g(x1 , . . . , xn ) = c}.
If a is an extreme value of f |S on the level set S = {g(x1 , . . . , xn ) = c},
and if ∇g(a) ≠ 0, then
∇f (a) = λ∇g(a)
for some constant λ.

Reason: If a is an extreme value of f |S on the level set S, then Dv f (a) = 0


for all vectors v that are tangent to the level set S. Therefore, ∇f (a) · v = 0
for all vectors v that are tangent to S.
This means that ∇f (a) is orthogonal to the level set S, so ∇f (a) must be
a scalar multiple of the normal vector ∇g(a). That is, ∇f (a) = λ∇g(a). 
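A small worked check of the Lagrange condition, assuming NumPy. For f (x, y) = xy constrained to the circle g(x, y) = x2 + y2 = 1 (an example chosen here for illustration), the maximizer a = (1/√2, 1/√2) satisfies ∇f (a) = λ∇g(a) with λ = 1/2.

```python
import numpy as np

a = np.array([1.0, 1.0]) / np.sqrt(2.0)   # the constrained maximizer of f(x, y) = x*y
grad_f = np.array([a[1], a[0]])           # grad f = (y, x)
grad_g = 2.0 * a                          # grad g = (2x, 2y)

lam = grad_f[0] / grad_g[0]               # the multiplier, here 1/2
assert np.allclose(grad_f, lam * grad_g)  # grad f(a) = lambda * grad g(a)
```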
Motivation for Eigenvalues & Eigenvectors
We want to understand a quadratic form QA (x), which might be ugly and
complicated.
Idea: Maybe there’s an orthonormal basis B = {w1 , . . . , wn } of Rn that
is somehow “best suited to A” – so that with respect to the basis B, the
quadratic form QA looks simple.

What do we mean by “basis suited to A”? And does such a basis always
exist? Well:

Spectral Theorem: Let A be a symmetric n × n matrix. Then there exists


an orthonormal basis B = {w1 , . . . , wn } of Rn such that each w1 , . . . , wn is
an eigenvector of A.
i.e.: There is an orthonormal basis of Rn consisting of eigenvectors of A.

Why is this good? Well, since B is a basis, every w ∈ Rn can be written


w = u1 w1 + · · · + un wn . (That is, the B-coordinates of w are (u1 , . . . , un ).)
It then turns out that:

QA (w) = QA (u1 w1 + · · · + un wn )
= (u1 w1 + · · · + un wn ) · A(u1 w1 + · · · + un wn )
= λ1 (u1 )2 + λ2 (u2 )2 + · · · + λn (un )2 . (yay!)

In other words: the quadratic form QA is in diagonal form with respect to


the basis B. We have made QA look as simple as possible!
Also: the coefficients λ1 , . . . , λn are exactly the eigenvalues of A.
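A sketch of the Spectral Theorem in action, assuming NumPy: np.linalg.eigh returns the eigenvalues of a symmetric matrix together with an orthonormal eigenvector basis, and in those coordinates QA really is a weighted sum of squares. The matrix and vector are arbitrary examples.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                 # symmetric
lam, W = np.linalg.eigh(A)                 # columns of W: orthonormal eigenvectors

w = np.array([3.0, -1.0])                  # any vector in R^2
u = W.T @ w                                # its B-coordinates (since W^{-1} = W^T)
assert np.isclose(w @ A @ w, np.sum(lam * u**2))   # Q_A(w) = sum_i lambda_i u_i^2
```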

Corollary: Let A be a symmetric n × n matrix, with eigenvalues λ1 , . . . , λn .


(a) A is positive-definite ⇐⇒ all of λ1 , . . . , λn are positive.
(b) A is negative-definite ⇐⇒ all of λ1 , . . . , λn are negative.
(c) A is indefinite ⇐⇒ there is a positive eigenvalue λi > 0 and a negative
eigenvalue λj < 0.
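In practice, this corollary (together with the Second Derivative Test above) is applied by simply inspecting the signs of the eigenvalues. A hedged sketch, assuming NumPy; the matrix is the Hessian of the arbitrary example f (x, y) = x2 − 3xy + y2 at the origin.

```python
import numpy as np

H = np.array([[ 2.0, -3.0],
              [-3.0,  2.0]])          # Hessian of f(x, y) = x^2 - 3xy + y^2 at (0, 0)

eigs = np.linalg.eigvalsh(H)          # eigenvalues of a symmetric matrix
if np.all(eigs > 0):
    verdict = "positive definite: local min"
elif np.all(eigs < 0):
    verdict = "negative definite: local max"
elif np.any(eigs > 0) and np.any(eigs < 0):
    verdict = "indefinite: saddle point"
else:
    verdict = "semi-definite: the test alone is inconclusive"
print(eigs, verdict)                  # eigenvalues -1 and 5 here, so a saddle point
```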

Useful Fact: Let A be any n × n matrix, with eigenvalues λ1 , . . . , λn . Then

det(A) = λ1 λ2 · · · λn .

Cor: If any one of the eigenvalues is zero (λj = 0 for some j), then det(A) = 0.


What is a (Unit) Sphere?
◦ The 1-sphere (the “unit circle”) is S1 = {(x, y) ∈ R2 | x2 + y2 = 1} ⊂ R2 .
◦ The 2-sphere (the “sphere”) is S2 = {(x, y, z) ∈ R3 | x2 + y2 + z2 = 1} ⊂ R3 .
◦ The 3-sphere is S3 = {(x, y, z, w) ∈ R4 | x2 + y2 + z2 + w2 = 1} ⊂ R4 .
Note that the 3-sphere is not the same as the unit ball {x2 + y2 + z2 ≤ 1}.

◦ The (n − 1)-sphere is the set

Sn−1 = {(x1 , . . . , xn ) ∈ Rn | (x1 )2 + · · · + (xn )2 = 1}


= {x ∈ Rn | ‖x‖2 = 1} ⊂ Rn .

In other words, Sn−1 consists of the unit vectors in Rn .

Optimizing Quadratic Forms on Spheres


Problem: Optimize a quadratic form QA : Rn → R on the sphere Sn−1 ⊂ Rn .
That is, what are the maxima and minima of QA (w) subject to the con-
straint that ‖w‖ = 1?

Solution: Let λmax and λmin be the largest and smallest eigenvalues of A.
◦ The maximum value of QA for unit vectors is λmax . Any unit vector wmax
which attains this maximum is an eigenvector of A with eigenvalue λmax .
◦ The minimum value of QA for unit vectors is λmin . Any unit vector wmin
which attains this minimum is an eigenvector of A with eigenvalue λmin .
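A numerical sketch of this solution, assuming NumPy; the symmetric matrix is an arbitrary example, and the random sampling is only there to illustrate that no unit vector beats the eigenvalue bounds.

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 2.0]])
lam, W = np.linalg.eigh(A)                     # eigenvalues in increasing order
w_min, w_max = W[:, 0], W[:, -1]               # unit eigenvectors

assert np.isclose(w_max @ A @ w_max, lam[-1])  # Q_A(w_max) = lambda_max
assert np.isclose(w_min @ A @ w_min, lam[0])   # Q_A(w_min) = lambda_min

rng = np.random.default_rng(1)
for _ in range(1000):
    u = rng.standard_normal(2)
    u /= np.linalg.norm(u)                     # a random unit vector
    assert lam[0] - 1e-9 <= u @ A @ u <= lam[-1] + 1e-9
```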

Corollary: Let A be a symmetric n × n matrix.


(a) A is positive-definite ⇐⇒ the minimum value of QA restricted to unit
vector inputs is positive (i.e., iff λmin > 0).
(b) A is negative-definite ⇐⇒ the maximum value of QA restricted to
unit vector inputs is negative (i.e., iff λmax < 0).
(c) A is indefinite ⇐⇒ λmax > 0 and λmin < 0.
Directional First & Second Derivatives
Def: Let f : Rn → R be a function, a ∈ Rn be a point.
The directional derivative of f at a in the direction v is:

Dv f (a) = ∇f (a) · v.

The “directional second derivative” of f at a in the direction v is:

QHf (a) (v) = vT Hf (a)v.

That is: the quadratic form whose associated matrix is the Hessian Hf (a).

Q: What direction v increases the directional derivative the most? What


direction v decreases the directional derivative the most?

A: We’ve learned this: the gradient ∇f (a) is the direction of greatest in-
crease, whereas −∇f (a) is the direction of greatest decrease.

New Questions:
◦ What direction v increases the directional second derivative the most?
◦ What direction v decreases the directional second derivative the most?

Answer: The (unit) directions of minimum and maximum second derivative


are (unitized) eigenvectors of Hf (a), and so they are mutually orthogonal.
The max/min values of the directional second derivative are the max/min
eigenvalues of Hf (a).
