Chapter 5: Vector Calculus (Math For Machine Learning)
Vector Calculus
Equivalent to 2 slots. Lecturers can use extra materials for other slots in the chapter.
Chapter 5. Vector Calculus
[Figure] The function f(x) = cos(x) is approximated by Taylor polynomials around x_0 = 1.
e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots = \sum_{k=0}^{\infty} \frac{x^k}{k!}

\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k+1}}{(2k+1)!}

\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k}}{(2k)!}

\frac{1}{1-x} = 1 + x + x^2 + \cdots = \sum_{k=0}^{\infty} x^k.
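As a quick illustration of how such approximations behave in practice, here is a minimal Python sketch (the evaluation point x = 2 and the chosen degrees are arbitrary) that builds Taylor polynomials of cos around x_0 = 1 and compares them with the exact value.

```python
import math
import numpy as np

def taylor_cos(x, x0, degree):
    """Evaluate the Taylor polynomial of cos about x0 at the point x.

    The k-th derivative of cos cycles through cos, -sin, -cos, sin.
    """
    derivs = [np.cos(x0), -np.sin(x0), -np.cos(x0), np.sin(x0)]
    return sum(derivs[k % 4] / math.factorial(k) * (x - x0) ** k
               for k in range(degree + 1))

x0, x = 1.0, 2.0
for degree in (0, 1, 2, 4, 8):
    print(f"degree {degree}: {taylor_cos(x, x0, degree):+.6f}   true: {np.cos(x):+.6f}")
```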
5.2 Partial Differentiation and Gradients
Definition
A function f : R^n → R is a rule that assigns to each point x ∈ R^n with coordinates x_1, . . . , x_n a real value f(x) = f(x_1, . . . , x_n).

Definition
For a function f : R^n → R, the partial derivatives of f are defined by

\frac{\partial f}{\partial x_1} := \lim_{h \to 0} \frac{f(x_1 + h, x_2, \dots, x_n) - f(x_1, x_2, \dots, x_n)}{h}
  ⋮
\frac{\partial f}{\partial x_n} := \lim_{h \to 0} \frac{f(x_1, x_2, \dots, x_n + h) - f(x_1, x_2, \dots, x_n)}{h}

Note. When finding the partial derivative ∂f/∂x_i, we let only x_i vary and keep all other variables constant.
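The limit definition also suggests a simple numerical check: approximate each partial derivative by a finite difference in one coordinate while keeping the others fixed. Below is a minimal sketch; the test function and the step size h are arbitrary illustrative choices.

```python
import numpy as np

def partial_derivative(f, x, i, h=1e-6):
    """Central-difference estimate of the partial derivative of f w.r.t. x_i at x."""
    e = np.zeros_like(x, dtype=float)
    e[i] = h
    return (f(x + e) - f(x - e)) / (2 * h)

# Test function f(x, y) = x^2 * y + sin(y); hand-computed partials:
# df/dx = 2xy, df/dy = x^2 + cos(y).
f = lambda v: v[0] ** 2 * v[1] + np.sin(v[1])
x = np.array([1.0, 2.0])
print(partial_derivative(f, x, 0), 2 * x[0] * x[1])           # both close to 4.0
print(partial_derivative(f, x, 1), x[0] ** 2 + np.cos(x[1]))  # both close to 0.5839
```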
Example
Find the partial derivatives of the function f(x, y) = \frac{1}{1 + x^2 + 3y^4}.

Answer.

\frac{\partial f(x, y)}{\partial x} = -\frac{1}{(1 + x^2 + 3y^4)^2} \cdot \frac{\partial}{\partial x}(1 + x^2 + 3y^4) = -\frac{2x}{(1 + x^2 + 3y^4)^2},

\frac{\partial f(x, y)}{\partial y} = -\frac{1}{(1 + x^2 + 3y^4)^2} \cdot \frac{\partial}{\partial y}(1 + x^2 + 3y^4) = -\frac{12y^3}{(1 + x^2 + 3y^4)^2}.
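Such computations are easy to double-check with a computer algebra system; here is a small SymPy sketch for the example above.

```python
import sympy as sp

x, y = sp.symbols('x y')
f = 1 / (1 + x**2 + 3*y**4)
print(sp.simplify(sp.diff(f, x)))   # -2*x/(x**2 + 3*y**4 + 1)**2
print(sp.simplify(sp.diff(f, y)))   # -12*y**3/(x**2 + 3*y**4 + 1)**2
```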
Example
Find the gradient of the function f(x_1, x_2, x_3) = x_1^2 x_2^3 - 4x_2^2 x_3.

Answer.

\nabla_x f = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \frac{\partial f}{\partial x_3} \end{bmatrix} = \begin{bmatrix} 2x_1 x_2^3 & 3x_1^2 x_2^2 - 8x_2 x_3 & -4x_2^2 \end{bmatrix}.
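Again, a short SymPy sketch can confirm the gradient entries.

```python
import sympy as sp

x1, x2, x3 = sp.symbols('x1 x2 x3')
f = x1**2 * x2**3 - 4 * x2**2 * x3
grad = [sp.diff(f, v) for v in (x1, x2, x3)]
print(grad)   # [2*x1*x2**3, 3*x1**2*x2**2 - 8*x2*x3, -4*x2**2]
```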
Sum rule: ∇_x [f + g] = ∇_x f + ∇_x g.
Product rule: ∇_x [fg] = g(x) ∇_x f + f(x) ∇_x g.
Chain rule: if x_1, . . . , x_n are all functions of t, then

f'(t) = \frac{df}{dt} = \frac{\partial f}{\partial x_1} \frac{dx_1}{dt} + \cdots + \frac{\partial f}{\partial x_n} \frac{dx_n}{dt} = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \frac{dx_i}{dt}.
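The chain rule can be checked symbolically as well. A minimal sketch, with an arbitrarily chosen f(x_1, x_2) and path (x_1(t), x_2(t)), compares the derivative of the composed function with the sum of partial derivatives times dx_i/dt.

```python
import sympy as sp

t = sp.symbols('t')
x1, x2 = sp.symbols('x1 x2')
f = x1**2 + 2 * x1 * x2                 # f(x1, x2), chosen for illustration
path = {x1: sp.sin(t), x2: sp.cos(t)}   # x1(t), x2(t)

# Left-hand side: differentiate the composed function f(x1(t), x2(t)) directly.
lhs = sp.diff(f.subs(path), t)

# Right-hand side: chain rule, sum of partials times dx_i/dt.
rhs = sum(sp.diff(f, v).subs(path) * sp.diff(expr, t) for v, expr in path.items())

print(sp.simplify(lhs - rhs))           # 0
```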
Definition
The partial derivative of a vector-valued function f : R^n → R^m with respect to x_i, i = 1, . . . , n, is given as the vector

\frac{\partial \mathbf{f}}{\partial x_i} = \begin{bmatrix} \frac{\partial f_1}{\partial x_i} \\ \vdots \\ \frac{\partial f_m}{\partial x_i} \end{bmatrix} \in \mathbb{R}^m.
Example
Consider f : R^2 → R^3 with f(x, y) = (xy^2, y^3, x^2 - y^2)^T. Then

\frac{\partial \mathbf{f}}{\partial x} = \begin{bmatrix} y^2 \\ 0 \\ 2x \end{bmatrix}, \qquad \frac{\partial \mathbf{f}}{\partial y} = \begin{bmatrix} 2xy \\ 3y^2 \\ -2y \end{bmatrix}.
Collecting these vectors for all variables gives the Jacobian of f : R^n → R^m, the m × n matrix

J = \nabla_x \mathbf{f} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}.
For a linear function f(x) = Ax, the Jacobian is the matrix itself:

\nabla_x f = A.

Example
Consider a vector-valued function f : R^2 → R^3,

\mathbf{f}(x_1, x_2) = \begin{bmatrix} e^{-x_1 + x_2} \\ x_1 x_2^2 \\ \sin(x_1) \end{bmatrix}.

The Jacobian of f is

J = \nabla_x \mathbf{f} = \begin{bmatrix} -e^{-x_1 + x_2} & e^{-x_1 + x_2} \\ x_2^2 & 2x_1 x_2 \\ \cos(x_1) & 0 \end{bmatrix}.
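A small SymPy sketch reproduces this Jacobian.

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
f = sp.Matrix([sp.exp(-x1 + x2), x1 * x2**2, sp.sin(x1)])
print(f.jacobian([x1, x2]))
# Matrix([[-exp(-x1 + x2), exp(-x1 + x2)],
#         [x2**2,          2*x1*x2      ],
#         [cos(x1),        0            ]])
```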
In a deep (multi-layer) neural network with K layers, the model output is a composition of functions,

y = (f_K ∘ f_{K-1} ∘ · · · ∘ f_1)(x),

where x are the inputs (e.g., images), y are the observations (e.g., class labels), and every function f_i, i = 1, . . . , K, possesses its own parameters.

In the i-th layer:

f_i(x_{i-1}) = σ(A_{i-1} x_{i-1} + b_{i-1}),

where x_{i-1} is the output of layer i − 1 and σ is an activation function (e.g., the sigmoid, ReLU, or tanh function).

Training these models requires us to compute the gradient of a loss function L w.r.t. all model parameters θ_j = {A_j, b_j}, j = 0, . . . , K − 1.
Writing the forward pass recursively:

f_0 := x,
f_i := σ_i(A_{i-1} f_{i-1} + b_{i-1}), i = 1, . . . , K.
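To make the layer-by-layer computation concrete, here is a minimal NumPy sketch of the forward pass; the layer sizes, the tanh activation, and the random parameters are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters theta_j = {A_j, b_j} for a small network with layer sizes 4 -> 8 -> 3.
sizes = [4, 8, 3]
params = [(rng.standard_normal((m, n)), rng.standard_normal(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x, params, sigma=np.tanh):
    """Compute f_K(...f_1(x)...) with f_i = sigma(A_{i-1} f_{i-1} + b_{i-1})."""
    f = x                       # f_0 := x
    for A, b in params:
        f = sigma(A @ f + b)    # f_i := sigma(A_{i-1} f_{i-1} + b_{i-1})
    return f

print(forward(rng.standard_normal(4), params))
```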
Example
Consider the function f(x) = \sqrt{x^2 + \exp(x^2)} + \cos(x^2 + \exp(x^2)), with intermediate variables a = x^2, b = exp(a), c = a + b, d = \sqrt{c}, e = cos(c), f = d + e. We get

\frac{\partial a}{\partial x} = 2x, \qquad \frac{\partial b}{\partial a} = \exp(a),
\frac{\partial c}{\partial a} = 1, \qquad \frac{\partial c}{\partial b} = 1, \qquad \frac{\partial d}{\partial c} = \frac{1}{2\sqrt{c}},
\frac{\partial e}{\partial c} = -\sin c, \qquad \frac{\partial f}{\partial d} = \frac{\partial f}{\partial e} = 1.

Working backwards from f to x:

\frac{\partial f}{\partial c} = \frac{\partial f}{\partial d}\frac{\partial d}{\partial c} + \frac{\partial f}{\partial e}\frac{\partial e}{\partial c} = 1 \cdot \frac{1}{2\sqrt{c}} + 1 \cdot (-\sin c)
\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c}\frac{\partial c}{\partial b} = \frac{\partial f}{\partial c}
\frac{\partial f}{\partial a} = \frac{\partial f}{\partial b}\frac{\partial b}{\partial a} + \frac{\partial f}{\partial c}\frac{\partial c}{\partial a} = \frac{\partial f}{\partial b}\exp(a) + \frac{\partial f}{\partial c}
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a}\frac{\partial a}{\partial x} = \frac{\partial f}{\partial a} \cdot 2x
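The same computation can be written out directly in code: evaluate the intermediate variables in a forward pass, then accumulate the partial derivatives in reverse order. The sketch below is illustrative only and checks the result against a central finite difference.

```python
import numpy as np

def f_and_grad(x):
    # Forward pass: intermediate variables of
    # f(x) = sqrt(x^2 + exp(x^2)) + cos(x^2 + exp(x^2)).
    a = x**2
    b = np.exp(a)
    c = a + b
    d = np.sqrt(c)
    e = np.cos(c)
    f = d + e

    # Backward pass: chain rule, from f back to x.
    df_dd, df_de = 1.0, 1.0
    df_dc = df_dd * 1.0 / (2.0 * np.sqrt(c)) + df_de * (-np.sin(c))
    df_db = df_dc * 1.0
    df_da = df_db * np.exp(a) + df_dc * 1.0
    df_dx = df_da * 2.0 * x
    return f, df_dx

x0 = 1.3
f0, g0 = f_and_grad(x0)
h = 1e-6
fd = (f_and_grad(x0 + h)[0] - f_and_grad(x0 - h)[0]) / (2 * h)
print(g0, fd)   # the two values should agree to about 6 decimal places
```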
More generally, suppose x_1, . . . , x_d are input variables and x_{d+1}, . . . , x_D are intermediate variables computed as

For i = d + 1, . . . , D : x_i = g_i(x_{Pa(x_i)}),

where the g_i(·) are elementary functions and x_{Pa(x_i)} are the parent nodes of the variable x_i in the graph. Let f = x_D. By the chain rule,

\frac{\partial f}{\partial x_i} = \sum_{x_j : x_i \in Pa(x_j)} \frac{\partial f}{\partial x_j} \frac{\partial x_j}{\partial x_i} = \sum_{x_j : x_i \in Pa(x_j)} \frac{\partial f}{\partial x_j} \frac{\partial g_j}{\partial x_i}.
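This sum over child nodes is what reverse-mode automatic differentiation implements. The following toy sketch (not a library API; the graph encodes the previous example) stores, for each node, its parents together with the local partial derivatives, and then accumulates ∂f/∂x_i backwards through the graph.

```python
import math

# Example graph for f(x) = sqrt(x^2 + exp(x^2)) + cos(x^2 + exp(x^2)), at x = 1.3.
x = 1.3
a = x**2
b = math.exp(a)
c = a + b
d = math.sqrt(c)
e = math.cos(c)
f = d + e

values = [x, a, b, c, d, e, f]
# For each node i: a dict {parent index: local partial derivative d(x_i)/d(parent)}.
local = [
    {},                                 # x (input, no parents)
    {0: 2 * x},                         # a = x^2
    {1: math.exp(a)},                   # b = exp(a)
    {1: 1.0, 2: 1.0},                   # c = a + b
    {3: 1.0 / (2.0 * math.sqrt(c))},    # d = sqrt(c)
    {3: -math.sin(c)},                  # e = cos(c)
    {4: 1.0, 5: 1.0},                   # f = d + e
]

# Reverse pass: adjoint[j] = df/dx_j, accumulated over all children of node j.
adjoint = [0.0] * len(values)
adjoint[-1] = 1.0                       # df/df = 1
for i in range(len(values) - 1, 0, -1):
    for parent, dpart in local[i].items():
        adjoint[parent] += adjoint[i] * dpart

print(adjoint[0])                       # df/dx
```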
5.7 Higher-Order Derivatives
Consider a function f : R^2 → R of two variables x, y. We use the notation for higher-order partial derivatives:

\frac{\partial^2 f}{\partial x^2} = \frac{\partial}{\partial x}\left(\frac{\partial f}{\partial x}\right), \qquad \frac{\partial^2 f}{\partial y^2} = \frac{\partial}{\partial y}\left(\frac{\partial f}{\partial y}\right),
\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial}{\partial x}\left(\frac{\partial f}{\partial y}\right), \qquad \frac{\partial^2 f}{\partial y \partial x} = \frac{\partial}{\partial y}\left(\frac{\partial f}{\partial x}\right), \quad \cdots

If f(x, y) is a twice (continuously) differentiable function, then

\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x}.
Definition
The Hessian matrix of f is

H = \nabla^2_{x,y} f(x, y) = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix}.
Example
Let f(x, y) = e^{-x^2 + 2y}. Find the Hessian matrix of f at (0, 0).

Answer. We have

\frac{\partial f}{\partial x} = -2x e^{-x^2 + 2y}, \qquad \frac{\partial f}{\partial y} = 2 e^{-x^2 + 2y},
\frac{\partial^2 f}{\partial x^2} = (-2 + 4x^2) e^{-x^2 + 2y}, \qquad \frac{\partial^2 f}{\partial y^2} = 4 e^{-x^2 + 2y},
\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x} = -4x e^{-x^2 + 2y}.

Hence

\nabla^2_{(x,y)} f(0, 0) = \begin{bmatrix} -2 & 0 \\ 0 & 4 \end{bmatrix}.
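For a quick check, SymPy's hessian helper reproduces this result.

```python
import sympy as sp

x, y = sp.symbols('x y')
f = sp.exp(-x**2 + 2*y)
H = sp.hessian(f, (x, y))
print(H.subs({x: 0, y: 0}))   # Matrix([[-2, 0], [0, 4]])
```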
For f : R^n → R differentiable at a, the linearization of f around a is

f(x) = f(a) + \nabla_x f(a) \cdot (x - a) + R_1(x, a),

where the error term R_1(x, a) goes to zero faster than a constant times ‖x − a‖^2 as x → a.

Definition
The first-order Taylor polynomial of f at a is

f(a) + \nabla_x f(a) \cdot (x - a).
Example
Find the first-order Taylor polynomial of f(x, y) = x^2 + 2xy^3 at a = (1, 2).

Answer.

\frac{\partial f}{\partial x} = 2x + 2y^3, \quad \frac{\partial f}{\partial y} = 6xy^2 \quad \Rightarrow \quad \frac{\partial f}{\partial x}(1, 2) = 18, \quad \frac{\partial f}{\partial y}(1, 2) = 24.

Since f(1, 2) = 17, the first-order Taylor polynomial of f at (1, 2) is

17 + 18(x − 1) + 24(y − 2).
Similarly, for f twice differentiable at a,

f(x) = f(a) + \nabla_x f(a) \cdot (x - a) + \frac{1}{2}(x - a)^T \nabla^2_x f(a) (x - a) + R_2(x, a),

where the error term R_2 = R_2(x, a) goes to zero faster than a constant times ‖x − a‖^3 as x → a.

Definition
The second-order Taylor polynomial of f at a is

f(a) + \nabla_x f(a) \cdot (x - a) + \frac{1}{2}(x - a)^T \nabla^2_x f(a) (x - a).
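To see the first- and second-order Taylor polynomials side by side, here is a minimal SymPy sketch for f(x, y) = x^2 + 2xy^3 at a = (1, 2); the nearby test point (1.1, 2.1) is an arbitrary choice.

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 + 2 * x * y**3
a = {x: 1, y: 2}

grad = sp.Matrix([sp.diff(f, v) for v in (x, y)]).subs(a)
H = sp.hessian(f, (x, y)).subs(a)
dx = sp.Matrix([x - 1, y - 2])

T1 = f.subs(a) + (grad.T * dx)[0]                       # first-order Taylor polynomial
T2 = T1 + sp.Rational(1, 2) * (dx.T * H * dx)[0]        # second-order Taylor polynomial

pt = {x: 1.1, y: 2.1}
print(f.subs(pt), T1.subs(pt), T2.subs(pt))
# f ≈ 21.584, T1 = 21.2, T2 = 21.57; the quadratic model is closer to f.
```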
We have studied: partial derivatives and gradients, gradients of vector-valued functions (Jacobians), the chain rule and backpropagation, higher-order derivatives (the Hessian), and linearization with multivariate Taylor polynomials.