Artificial Intelligence II (CS4442 & CS9542)
A Brief Review of Mathematics for Machine Learning
Boyu Wang
Department of Computer Science
University of Western Ontario
Outline
If you have NOT taken a linear algebra course (e.g., MATH 1600B), this course (at
least the first half) could be extremely difficult for you!
I Linear Algebra
I Probability
I Vector Calculus & Optimization
Why worry about the math?
I There are lots of easy-to-use machine learning packages out
there.
I However, to get really useful results, you need good
mathematical intuitions about certain machine learning
principles, as well as the inner workings of the individual
algorithms.
- Choose the right algorithm(s) for the problem
- Make good choices on parameter settings, validation
strategies
- Troubleshoot poor / ambiguous results
- Do a better job of coding algorithms
- Apply for a PhD at a top-tier university
Linear Algebra
Notation Reference
Table of Notations

Notation            Meaning
R                   set of real numbers (one-dimensional space)
R^n                 set of n-tuples of real numbers (n-dimensional space)
R^{m×n}             set of m × n real matrices
a                   a scalar or a vector (i.e., a ∈ R or a ∈ R^n)
A                   a matrix (i.e., A ∈ R^{m×n})
I                   identity matrix
A^{-1}              inverse of a square matrix A: AA^{-1} = A^{-1}A = I
a^T, A^T            transpose of a vector/matrix
⟨a, b⟩              dot product of vectors: ⟨a, b⟩ = a^T b = Σ_{i=1}^n ai bi
||a||_2, ||a||_1    ℓ2-norm, ℓ1-norm of a
|A|                 determinant of a square matrix A
tr(A)               trace of a square matrix A
Linear algebra applications
I Operations on or between vectors and matrices.
I Dimensionality reduction.
I Linear regression.
I Many others.
Why vectors and matrices?
Slide credit: Jeff Howbert
Vectors
I Definition: an n-tuple of values (usually real numbers).
I Can be written in column form or row form (column form is
conventional). Vector elements referenced by subscript.
a = \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix} = [a_1, \ldots, a_n]^T
I Can think of a vector as: a point in space or a directed line
segment with a magnitude and direction.
Vector arithmetic
I Addition of two vectors
- add corresponding elements: a + b = [a1 + b1, . . . , an + bn]^T
- result is a vector
I Scalar multiplication of a vector
- multiply each element by the scalar: λa = [λa1, . . . , λan]^T
- result is a vector
I Inner/dot product of two vectors
- multiply corresponding elements, then add the products:
  ⟨a, b⟩ = a · b = a^T b = b^T a = Σ_{i=1}^n ai bi
- result is a scalar
I ℓ2-norm of a vector
- ||a|| = √⟨a, a⟩ = √(a^T a) = √(a1² + · · · + an²)
- a^T b = ||a|| ||b|| cos(θ)
- Euclidean distance between two vectors: ||a − b||
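A minimal NumPy sketch of these vector operations (the example vectors are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# addition and scalar multiplication (element-wise)
print(a + b)            # [5. 7. 9.]
print(2.0 * a)          # [2. 4. 6.]

# inner (dot) product: sum of element-wise products -> a scalar
print(a @ b)            # 32.0

# l2-norm and the cosine relation a^T b = ||a|| ||b|| cos(theta)
norm_a = np.sqrt(a @ a)                             # same as np.linalg.norm(a)
cos_theta = (a @ b) / (norm_a * np.linalg.norm(b))

# Euclidean distance between two vectors
print(np.linalg.norm(a - b))
```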
Matrices
A vector can be regarded as a special case of a matrix in which one of the
matrix dimensions equals 1.
I Matrix transpose (denoted ^T): swap columns and rows; an m × n
matrix becomes an n × m matrix
I Addition of two matrices
I Scalar multiplication of a matrix
Matrix multiplication
If A is an m × n matrix (i.e., A ∈ R^{m×n}) and B is an n × p matrix (i.e.,
B ∈ R^{n×p}),

A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}, \quad
B = \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1p} \\ b_{21} & b_{22} & \cdots & b_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ b_{n1} & b_{n2} & \cdots & b_{np} \end{bmatrix}

then the matrix product C = AB (denoted without multiplication dots) is
defined to be the m × p matrix

C = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1p} \\ c_{21} & c_{22} & \cdots & c_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ c_{m1} & c_{m2} & \cdots & c_{mp} \end{bmatrix},

where c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}: c_{ij} is given by the inner product between the
i-th row of A and the j-th column of B.
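A small NumPy check of this definition (a sketch with arbitrary 2 × 3 and 3 × 2 matrices):

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])      # A in R^{2x3}
B = np.array([[1., 0.],
              [0., 1.],
              [2., 3.]])          # B in R^{3x2}

C = A @ B                         # C = AB is 2x2
# entry c_ij is the inner product of row i of A and column j of B
c_01 = A[0, :] @ B[:, 1]
print(C)
print(np.isclose(C[0, 1], c_01))  # True
```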
Matrix multiplication
Properties:
I A(BC) = (AB)C
I AB ≠ BA (in general)
I (AB)^T = B^T A^T
I RULE: In any chain of matrix multiplications, the column
dimension of one matrix in the chain must match the row
dimension of the following matrix in the chain.
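These properties are easy to verify numerically (a sketch with random square matrices, so that both AB and BA are defined):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))

print(np.allclose(A @ (B @ C), (A @ B) @ C))   # associativity: True
print(np.allclose(A @ B, B @ A))               # generally False
print(np.allclose((A @ B).T, B.T @ A.T))       # True
```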
Square matrices and symmetric matrices
A is a square matrix if it has the same number of rows and columns
(i.e., A ∈ R^{n×n}). If A = A^T, then A is also a symmetric matrix.
I Special kinds
- diagonal matrix
- identity matrix
- positive-definite matrix: x^T Ax > 0 for any x ≠ 0
- invertible (or non-singular) matrix and its inverse: A is invertible if
there exists a matrix B such that AB = BA = I. If B exists, it is
unique and is called the inverse matrix of A, denoted A^{-1}.
- orthogonal matrix: A is an orthogonal matrix if A^T = A^{-1}, which
entails AA^T = A^T A = I
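A short NumPy sketch of these special cases (the example matrices are arbitrary):

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])                  # symmetric: A == A.T

# positive-definite check: equivalently, all eigenvalues of symmetric A are positive
print(np.all(np.linalg.eigvalsh(A) > 0))  # True

# inverse: A A^{-1} = I
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))  # True

# orthogonal matrix: Q^T = Q^{-1}, i.e. Q Q^T = I (e.g. a 2-D rotation)
theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.allclose(Q @ Q.T, np.eye(2)))    # True
```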
Eigenvalues and eigenvectors
Let A be an n × n square matrix. If we can find a scalar λ and a unit
vector v such that

Av = λv,

then λ is an eigenvalue of A, and v is its corresponding eigenvector.
I A = V Λ V^{-1} is the eigendecomposition of A (when A has n linearly
independent eigenvectors), where V is the n × n matrix whose i-th
column is the i-th eigenvector of A, and Λ is the diagonal matrix whose
diagonal elements are the corresponding eigenvalues.
I If A is positive-definite ⇒ all the eigenvalues are positive
I If A is symmetric ⇒ V is an orthogonal matrix: V^{-1} = V^T
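A NumPy sketch of these facts (the matrix below is an arbitrary symmetric example):

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])                  # symmetric

eigvals, V = np.linalg.eig(A)             # columns of V are eigenvectors
Lam = np.diag(eigvals)

# A = V Lambda V^{-1}
print(np.allclose(A, V @ Lam @ np.linalg.inv(V)))      # True
# symmetric A  =>  V can be taken orthogonal: V^{-1} = V^T
print(np.allclose(V.T @ V, np.eye(2)))                  # True
# Av = lambda v for the first eigenpair
print(np.allclose(A @ V[:, 0], eigvals[0] * V[:, 0]))   # True
```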
Probability
Why probability
To characterize the uncertainties of the world!
I Uncertain data
- Missing data
- Noisy data
I Uncertain knowledge
- Incomplete enumeration of conditions or effects
- Incomplete knowledge of causality in the domain
- Stochastic effects
Probability theory provides powerful tools for modeling and dealing
with uncertainty.
Interpretations of probability
The probability that a coin will land heads is 0.5
I Frequentist interpretation: to represent long run frequencies of
events.
- If we flip the coin many times, we expect it to land heads about
half the time.
I Bayesian interpretation: to quantify our uncertainty about
something – related to information rather than repeated trials.
- We believe the coin is equally likely to land heads or tails on the
next toss.
Random variables
The expression p(A) denotes the probability that the event A is true.
I 0 ≤ p(A) ≤ 1
I p(¬A) denotes the probability that the event A is false:
p(¬A) = 1 − p(A).
I The probability of a disjunction is given by:
p(A ∨ B) = p(A) + p(B) − p(A ∧ B)
Fundamental concepts
I A union of two events – the probability of A or B:
p(A ∨ B) = p(A) + p(B) − p(A ∧ B). If A and B are mutually
exclusive, p(A ∨ B) = p(A) + p(B)
I Joint probability – the probability of the joint event A and B:
p(A, B) = p(A ∧ B) = p(A|B)p(B), where p(A|B) is the
conditional probability of event A given B:

p(A|B) = p(A, B) / p(B)

If A and B are independent of each other, we have
p(A|B) = p(A).
I Chain rule:
p(X1:N) = p(X1) p(X2|X1) p(X3|X1, X2) · · · p(XN|X1:N−1)
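A tiny numerical sketch of the conditional-probability and chain-rule identities (the joint probabilities below are made-up values):

```python
# joint distribution over two binary events A and B (assumed values, summing to 1)
p_joint = {(True, True): 0.20, (True, False): 0.15,
           (False, True): 0.30, (False, False): 0.35}

p_B = p_joint[(True, True)] + p_joint[(False, True)]   # marginal p(B)
p_A_given_B = p_joint[(True, True)] / p_B               # p(A|B) = p(A,B) / p(B)
p_AB = p_A_given_B * p_B                                # chain rule: p(A,B) = p(A|B) p(B)

print(p_A_given_B)                                      # 0.4
print(abs(p_AB - p_joint[(True, True)]) < 1e-12)        # True
```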
Bayes Rule

p(A|B) = p(B|A) × p(A) / p(B)

p(hypothesis|evidence) = p(evidence|hypothesis) × p(hypothesis) / p(evidence)

posterior = likelihood × prior / marginal distribution

I The most important formula in probabilistic machine learning
I Allows us to reason from evidence to hypotheses
- Example: p(headache) = 1/10, p(flu) = 1/40,
p(headache|flu) = 1/2. Given the evidence of headache,
what is p(flu|headache)?
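Plugging the given numbers into Bayes' rule as a quick check (only the arithmetic is added here):

```python
p_headache = 1 / 10           # p(evidence)
p_flu = 1 / 40                # p(hypothesis), the prior
p_headache_given_flu = 1 / 2  # p(evidence | hypothesis), the likelihood

# Bayes' rule: p(flu | headache) = p(headache | flu) p(flu) / p(headache)
p_flu_given_headache = p_headache_given_flu * p_flu / p_headache
print(p_flu_given_headache)   # 0.125, i.e. 1/8
```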
Probability distributions
I Discrete distributions
- binomial and Bernoulli distributions
- multinomial and multinoulli distributions
- Poisson distribution
I Continuous distributions
- Gaussian distribution
- Laplace distribution
- gamma distribution
Gaussian distribution
1-dimensional Gaussian distribution
N(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(x-\mu)^2}
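A small sketch that evaluates this density with NumPy and compares it against scipy.stats.norm (SciPy assumed available; µ and σ are arbitrary):

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.0
x = np.linspace(-5.0, 7.0, 5)

# density formula from the slide
pdf_manual = 1.0 / np.sqrt(2 * np.pi * sigma**2) * np.exp(-(x - mu)**2 / (2 * sigma**2))
pdf_scipy = norm.pdf(x, loc=mu, scale=sigma)

print(np.allclose(pdf_manual, pdf_scipy))   # True
```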
Gaussian distribution
n-dimensional (multivariate) Gaussian distribution
N(\mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}
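The analogous sketch for the multivariate density (µ and Σ below are arbitrary, with Σ symmetric positive-definite):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.5])

d = x - mu
n = len(mu)
# density formula from the slide
pdf_manual = (np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d)
              / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))))

print(np.isclose(pdf_manual, multivariate_normal(mean=mu, cov=Sigma).pdf(x)))  # True
```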
Vector Calculus & Optimization
Fundamental concepts
I derivative: the sensitivity to change of the function value (output
value) with respect to a change in its argument (input value):

f'(x) = \lim_{a \to 0} \frac{f(x + a) - f(x)}{a}

second derivative – f''(x): the derivative of f'(x)
I convex function:
∀x1, x2, ∀t ∈ [0, 1]: f(tx1 + (1 − t)x2) ≤ tf(x1) + (1 − t)f(x2)
- f''(x) ≥ 0 for all x ⇔ f(x) is a convex function
- If f(x) is a convex function, f'(x0) = 0 ⇒ x0 is the global
minimum point (i.e., f(x0) ≤ f(x) for all x)
I chain rule: Let F(x) = f(g(x)); then F'(x) = f'(g(x)) g'(x)
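A small numerical sketch of the derivative definition and the chain rule (the functions and step size below are arbitrary choices):

```python
import numpy as np

f = lambda x: np.sin(x)          # outer function
g = lambda x: x**2               # inner function
F = lambda x: f(g(x))            # F(x) = f(g(x))

def derivative(h, x, eps=1e-6):
    """Finite-difference approximation of h'(x)."""
    return (h(x + eps) - h(x)) / eps

x0 = 0.7
# chain rule: F'(x) = f'(g(x)) g'(x) = cos(x^2) * 2x
print(derivative(F, x0))
print(np.cos(x0**2) * 2 * x0)    # nearly the same value
```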
Vector Calculus
I gradient: the derivative of a multi-variable function f with respect
to x = [x1, . . . , xn]^T:

\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}

where ∂f/∂xi is the partial derivative of f with respect to xi.
I f(x) = x1^2 + 3x2 + x2x3:

\frac{\partial f}{\partial x_1} = 2x_1, \quad \frac{\partial f}{\partial x_2} = 3 + x_3, \quad \frac{\partial f}{\partial x_3} = x_2

⇒ ∇f(x) = [2x1, 3 + x3, x2]^T
I Let f(x) = a^T x, f(x) = x^T Ax; what is ∇f(x)?
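A finite-difference check of the gradient worked out above (a minimal sketch; the evaluation point is arbitrary):

```python
import numpy as np

f = lambda x: x[0]**2 + 3 * x[1] + x[1] * x[2]
grad_analytic = lambda x: np.array([2 * x[0], 3 + x[2], x[1]])

def numerical_gradient(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x)) / eps   # partial derivative w.r.t. x_i
    return g

x0 = np.array([1.0, -2.0, 0.5])
print(grad_analytic(x0))                 # [ 2.   3.5 -2. ]
print(numerical_gradient(f, x0))         # approximately the same
```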
I Hessian matrix: the second derivative of a multi-variable function f with
respect to x = [x1, . . . , xn]^T:

H = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}

where \frac{\partial^2 f}{\partial x_i \partial x_j} is the mixed partial derivative of f. The order of
differentiation does not matter (Schwarz's theorem).
I f(x) = x1^2 + 3x2 + x2x3:

\frac{\partial f}{\partial x_1} = 2x_1, \quad \frac{\partial f}{\partial x_2} = 3 + x_3, \quad \frac{\partial f}{\partial x_3} = x_2

\frac{\partial^2 f}{\partial x_1^2} = 2, \quad \frac{\partial^2 f}{\partial x_2^2} = \frac{\partial^2 f}{\partial x_3^2} = 0, \quad \frac{\partial^2 f}{\partial x_1 \partial x_2} = \frac{\partial^2 f}{\partial x_1 \partial x_3} = 0, \quad \frac{\partial^2 f}{\partial x_2 \partial x_3} = 1

⇒ H = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}
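A finite-difference sketch of the Hessian for the same example (the step size and the second-difference formula are illustrative choices; NumPy assumed):

```python
import numpy as np

f = lambda x: x[0]**2 + 3 * x[1] + x[1] * x[2]

def numerical_hessian(f, x, eps=1e-4):
    """Approximate H_ij = d^2 f / (dx_i dx_j) with second differences."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = eps, eps
            H[i, j] = (f(x + ei + ej) - f(x + ei) - f(x + ej) + f(x)) / eps**2
    return H

x0 = np.array([1.0, -2.0, 0.5])
print(np.round(numerical_hessian(f, x0), 3))
# [[2. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```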
Convex multi-variable function
∀x1, x2, ∀t ∈ [0, 1]: f(tx1 + (1 − t)x2) ≤ tf(x1) + (1 − t)f(x2)
I H is positive semidefinite (for all x) ⇔ f(x) is a convex function
(positive-definite ⇒ strictly convex)
I If f(x) is a convex function, ∇f(x0) = 0 ⇒ x0 is the global
minimum point (i.e., f(x0) ≤ f(x) for all x)
I f(x) = x1^2 + 3x2 + x2x3:

H = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}

The eigenvalues of H are −1, 1, 2 ⇒ f(x) is not convex
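The same convexity check can be done numerically (a sketch; NumPy's eigvalsh is used since H is symmetric):

```python
import numpy as np

H = np.array([[2., 0., 0.],
              [0., 0., 1.],
              [0., 1., 0.]])

eigvals = np.linalg.eigvalsh(H)   # eigenvalues of the symmetric H, ascending
print(eigvals)                    # [-1.  1.  2.]
print(np.all(eigvals >= 0))       # False -> H is not positive semidefinite, f is not convex
```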
Function minimization
I Most machine learning problems can be formulated as a
function minimization problem
I Solve the equation:

∇f(x) = 0    (1)

Done!
I Really?
- If f is not convex, we may obtain a suboptimal solution.
- We don't always have an analytical solution for (1).
Example: f(x) = x1^2 + e^{x2} + sin(x3 + log(x1)).
I Gradient descent!
Gradient descent
An optimization algorithm used to minimize a function by iteratively
moving in the direction of steepest descent as defined by the negative
of the gradient.
https://www.kodefork.com/learn/machine-learning/linear-regression-with-one-variable/
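A minimal gradient descent sketch on a simple convex quadratic (the objective, learning rate, and stopping rule here are illustrative assumptions, not from the course):

```python
import numpy as np

f = lambda x: (x[0] - 3.0)**2 + (x[1] + 1.0)**2       # convex, minimum at (3, -1)
grad = lambda x: np.array([2 * (x[0] - 3.0), 2 * (x[1] + 1.0)])

x = np.zeros(2)        # starting point
eta = 0.1              # learning rate (step size)

for _ in range(100):
    g = grad(x)
    if np.linalg.norm(g) < 1e-8:   # stop when the gradient is (almost) zero
        break
    x = x - eta * g                # move in the direction of steepest descent

print(x)               # approximately [ 3. -1.]
```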