
Artificial Intelligence II (CS4442 & CS9542)

A Brief Review of Mathematics for Machine Learning

Boyu Wang
Department of Computer Science
University of Western Ontario
Outline

If you have NOT taken a linear algebra course (e.g., MATH 1600B), this course (at
least the first half) could be extremely difficult for you!

- Linear Algebra
- Probability
- Vector Calculus & Optimization
Why worry about the math?

- There are lots of easy-to-use machine learning packages out there.
- However, to get really useful results, you need good mathematical intuitions about
  certain machine learning principles, as well as the inner workings of the individual
  algorithms.
  - Choose the right algorithm(s) for the problem
  - Make good choices on parameter settings, validation strategies
  - Troubleshoot poor / ambiguous results
  - Do a better job of coding algorithms
  - Apply for a PhD at a top-tier university
Linear Algebra

Notation Reference

Table: Table of Notations

Notation            Meaning
R                   set of real numbers (one-dimensional space)
R^n                 set of n-tuples of real numbers (n-dimensional space)
R^{m×n}             set of m × n matrices
a                   a scalar or vector (i.e., a ∈ R or a ∈ R^n)
A                   a matrix (i.e., A ∈ R^{m×n})
I                   identity matrix
A^{-1}              inverse of a square matrix A: A A^{-1} = A^{-1} A = I
a^T, A^T            transpose of a vector/matrix
⟨a, b⟩              dot product of vectors: ⟨a, b⟩ = a^T b = Σ_{i=1}^n a_i b_i
||a||_2, ||a||_1    ℓ2-norm, ℓ1-norm of a
|A|                 determinant of a square matrix A
tr(A)               trace of a square matrix A
Linear algebra applications

- Operations on or between vectors and matrices.
- Dimensionality reduction.
- Linear regression.
- Many others.
Why vectors and matrices?

Slide credit: Jeff Howbert

Vectors

- Definition: an n-tuple of values (usually real numbers).
- Can be written in column form or row form (column form is conventional).
  Vector elements are referenced by subscript:

      a = [a_1, ..., a_n]^T   (an n × 1 column vector)

- Can think of a vector as a point in space, or as a directed line segment with a
  magnitude and direction.
Vector arithmetic

- Addition of two vectors
  - add corresponding elements: a + b = [a_1 + b_1, ..., a_n + b_n]^T
  - result is a vector
- Scalar multiplication of a vector
  - multiply each element by the scalar: λa = [λa_1, ..., λa_n]^T
  - result is a vector
- Inner/dot product of two vectors
  - multiply corresponding elements, then add the products:
    ⟨a, b⟩ = a · b = a^T b = b^T a = Σ_{i=1}^n a_i b_i
  - result is a scalar
- ℓ2-norm of a vector
  - ||a|| = √⟨a, a⟩ = √(a^T a) = √(Σ_{i=1}^n a_i²)
  - a^T b = ||a|| ||b|| cos(θ)
  - Euclidean distance between two vectors: ||a − b||
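As an illustration, here is how these operations look in NumPy (a small sketch; the vectors a and b are made-up examples):

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([4.0, 0.0, -1.0])

    print(a + b)                     # elementwise addition -> vector
    print(2.5 * a)                   # scalar multiplication -> vector
    print(a @ b)                     # inner/dot product -> scalar: 1*4 + 2*0 + 3*(-1) = 1
    print(np.linalg.norm(a))         # l2-norm: sqrt(1 + 4 + 9)
    print(np.linalg.norm(a - b))     # Euclidean distance between a and b

    # a^T b = ||a|| ||b|| cos(theta)
    cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))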
Matrices

A vector can be regarded as a special case of a matrix, where one of the matrix
dimensions equals 1.

- Matrix transpose (denoted ^T): swap columns and rows. An m × n matrix becomes
  an n × m matrix.
- Addition of two matrices
- Scalar multiplication of a matrix
Matrix multiplication

If A is an m × n matrix (i.e., A ∈ R^{m×n}) and B is an n × p matrix (i.e., B ∈ R^{n×p}),

        [ a_11  a_12  ...  a_1n ]          [ b_11  b_12  ...  b_1p ]
    A = [ a_21  a_22  ...  a_2n ]      B = [ b_21  b_22  ...  b_2p ]
        [  ...   ...  ...   ... ]          [  ...   ...  ...   ... ]
        [ a_m1  a_m2  ...  a_mn ]          [ b_n1  b_n2  ...  b_np ]

then the matrix product C = AB (written without a multiplication dot) is defined to be
the m × p matrix

        [ c_11  c_12  ...  c_1p ]
    C = [ c_21  c_22  ...  c_2p ]
        [  ...   ...  ...   ... ]
        [ c_m1  c_m2  ...  c_mp ]

where c_ij = Σ_{k=1}^n a_ik b_kj: c_ij is given by the dot product between the i-th row
of A and the j-th column of B.
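A minimal NumPy sketch of this definition (the matrices A and B below are arbitrary examples): the explicit loop implements c_ij = Σ_k a_ik b_kj and is checked against NumPy's built-in product.

    import numpy as np

    A = np.array([[1., 2., 3.],
                  [4., 5., 6.]])         # 2 x 3
    B = np.array([[1., 0.],
                  [0., 1.],
                  [2., -1.]])            # 3 x 2

    m, n = A.shape
    n2, p = B.shape
    assert n == n2                       # inner dimensions must match

    # c_ij = sum_k a_ik * b_kj
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            C[i, j] = sum(A[i, k] * B[k, j] for k in range(n))

    assert np.allclose(C, A @ B)         # matches NumPy's matrix product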
Matrix multiplication

Properties:

- A(BC) = (AB)C
- AB ≠ BA (in general)
- (AB)^T = B^T A^T
- RULE: in any chain of matrix multiplications, the column dimension of one matrix in
  the chain must match the row dimension of the following matrix in the chain.
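These properties are easy to spot-check numerically; a small sketch with randomly generated matrices (sizes chosen so every product is defined):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(3, 4))
    B = rng.normal(size=(4, 5))
    C = rng.normal(size=(5, 2))
    S = rng.normal(size=(3, 3))
    T = rng.normal(size=(3, 3))

    assert np.allclose(A @ (B @ C), (A @ B) @ C)   # associativity
    assert np.allclose((A @ B).T, B.T @ A.T)       # transpose reverses the order
    print(np.allclose(S @ T, T @ S))               # almost surely False: AB != BA in general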
Square matrices and symmetric matrices

A is a square matrix if it has the same number of rows and columns (i.e., A ∈ R^{n×n}).
If A = A^T, then A is also a symmetric matrix.

- Special kinds
  - diagonal matrix
  - identity matrix
  - positive-definite matrix: x^T A x > 0 for any x ≠ 0
  - invertible (non-singular) matrix and its inverse: A is invertible if there exists a
    matrix B such that AB = BA = I. If B exists, it is unique and is called the inverse
    matrix of A, denoted A^{-1}.
  - orthogonal matrix: A is an orthogonal matrix if A^T = A^{-1}, which entails
    A A^T = A^T A = I
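A small NumPy sketch of these definitions; the symmetric positive-definite matrix A and the rotation matrix Q below are made-up examples:

    import numpy as np

    A = np.array([[2., 1.],
                  [1., 2.]])             # symmetric: A == A.T
    assert np.allclose(A, A.T)

    # positive-definite: x^T A x > 0, checked for a few random nonzero x
    rng = np.random.default_rng(0)
    for _ in range(5):
        x = rng.normal(size=2)
        assert x @ A @ x > 0

    # inverse: A A^{-1} = A^{-1} A = I
    A_inv = np.linalg.inv(A)
    assert np.allclose(A @ A_inv, np.eye(2))

    # orthogonal matrix: Q^T = Q^{-1}, i.e. Q Q^T = I (a 2-D rotation, for example)
    theta = 0.3
    Q = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    assert np.allclose(Q @ Q.T, np.eye(2))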
Eigenvalues and eigenvectors

Let A be an n × n square matrix. If we can find a scalar λ and a unit vector v such that

    A v = λ v,

then λ is an eigenvalue of A, and v is its corresponding eigenvector.

- A = V Λ V^{-1} is the eigendecomposition of A, where V is the n × n matrix whose
  i-th column is the i-th eigenvector of A, and Λ is the diagonal matrix whose diagonal
  elements are the corresponding eigenvalues.
- If A is positive-definite ⇒ all the eigenvalues are positive
- If A is symmetric ⇒ V is an orthogonal matrix: V^{-1} = V^T
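A sketch of the eigendecomposition in NumPy, reusing the same made-up symmetric, positive-definite matrix as above:

    import numpy as np

    A = np.array([[2., 1.],
                  [1., 2.]])             # symmetric, positive-definite

    eigvals, V = np.linalg.eig(A)        # columns of V are unit eigenvectors
    Lam = np.diag(eigvals)

    # A v = lambda v for each eigenpair
    for i in range(2):
        assert np.allclose(A @ V[:, i], eigvals[i] * V[:, i])

    # eigendecomposition A = V Lam V^{-1}
    assert np.allclose(A, V @ Lam @ np.linalg.inv(V))

    print(eigvals)                           # all positive (A is positive-definite)
    print(np.allclose(V.T @ V, np.eye(2)))   # True: V is orthogonal since A is symmetric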
Probability

Why probability

To characterize the uncertainties of the world!

- Uncertain data
  - Missing data
  - Noisy data
- Uncertain knowledge
  - Incomplete enumeration of conditions or effects
  - Incomplete knowledge of causality in the domain
  - Stochastic effects

Probability theory provides powerful tools for modeling and dealing with uncertainty.
Interpretations of probability

"The probability that a coin will land heads is 0.5."

- Frequentist interpretation: to represent long-run frequencies of events.
  - If we flip the coin many times, we expect it to land heads about half the time.
- Bayesian interpretation: to quantify our uncertainty about something – related to
  information rather than repeated trials.
  - We believe the coin is equally likely to land heads or tails on the next toss.
Random variables

The expression p(A) denotes the probability that the event A is true.

- 0 ≤ p(A) ≤ 1
- p(¬A) denotes the probability that the event A is false: p(¬A) = 1 − p(A).
- The probability of a disjunction is given by:

      p(A ∨ B) = p(A) + p(B) − p(A ∧ B)
Fundamental concepts

- Union of two events – the probability of A or B:
  p(A ∨ B) = p(A) + p(B) − p(A ∧ B). If A and B are mutually exclusive,
  p(A ∨ B) = p(A) + p(B).
- Joint probability – the probability of the joint event A and B:
  p(A, B) = p(A ∧ B) = p(A|B) p(B), where p(A|B) is the conditional probability of
  event A given B:

      p(A|B) = p(A, B) / p(B)

  If A and B are independent of each other, we have p(A|B) = p(A).
- Chain rule:

      p(X_{1:N}) = p(X_1) p(X_2|X_1) p(X_3|X_1, X_2) ··· p(X_N|X_{1:N−1})
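A tiny numerical sketch of the product rule p(A, B) = p(A|B) p(B), using a made-up joint distribution over two binary events A and B:

    # made-up joint probabilities p(A, B) over binary events (they sum to 1)
    p_joint = {(True, True): 0.10, (True, False): 0.20,
               (False, True): 0.30, (False, False): 0.40}

    p_B = sum(p for (a, b), p in p_joint.items() if b)     # p(B) = 0.40
    p_AB = p_joint[(True, True)]                           # p(A, B) = 0.10
    p_A_given_B = p_AB / p_B                               # p(A|B) = 0.25

    # product rule: p(A, B) = p(A|B) p(B)
    assert abs(p_A_given_B * p_B - p_AB) < 1e-12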
Bayes Rule

    p(A|B) = p(B|A) p(A) / p(B)

    p(hypothesis|evidence) = p(evidence|hypothesis) p(hypothesis) / p(evidence)

    posterior = likelihood × prior / marginal distribution

- The most important formula in probabilistic machine learning
- Allows us to reason from evidence to hypotheses
  - Example: p(headache) = 1/10, p(flu) = 1/40, p(headache|flu) = 1/2.
    Given the evidence of headache, what is p(flu|headache)?
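Working out the example with a few lines of Python (the numbers are the ones given on the slide):

    p_headache = 1 / 10           # p(evidence)
    p_flu = 1 / 40                # p(hypothesis), the prior
    p_headache_given_flu = 1 / 2  # p(evidence | hypothesis), the likelihood

    # Bayes rule: p(flu | headache) = p(headache | flu) * p(flu) / p(headache)
    p_flu_given_headache = p_headache_given_flu * p_flu / p_headache
    print(p_flu_given_headache)   # 0.125, i.e. 1/8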
Probability distributions

- Discrete distributions
  - binomial and Bernoulli distributions
  - multinomial and multinoulli distributions
  - Poisson distribution
- Continuous distributions
  - Gaussian distribution
  - Laplace distribution
  - gamma distribution
Gaussian distribution

1-dimensional Gaussian distribution:

    N(µ, σ²) = (1 / √(2πσ²)) exp(−(x − µ)² / (2σ²))
Gaussian distribution

n-dimensional (multivariate) Gaussian distribution:

    N(µ, Σ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp(−(1/2) (x − µ)^T Σ^{-1} (x − µ))
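A sketch that evaluates this density directly in NumPy and, assuming SciPy is available, cross-checks it against scipy.stats.multivariate_normal; the values of mu, Sigma, and x are arbitrary examples:

    import numpy as np
    from scipy.stats import multivariate_normal

    mu = np.array([0.0, 1.0])
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])       # must be positive-definite
    x = np.array([0.5, 0.5])

    n = len(mu)
    diff = x - mu
    density = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / \
              ((2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5)

    assert np.isclose(density, multivariate_normal(mu, Sigma).pdf(x))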
Vector Calculus & Optimization

Fundamental concepts

- Derivative: the sensitivity to change of the function value (output value) with respect
  to a change in its argument (input value):

      f′(x) = lim_{a→0} (f(x + a) − f(x)) / a

  The second derivative f″(x) is the derivative of f′(x).
- Convex function:

      ∀x_1, x_2, ∀t ∈ [0, 1]:  f(t x_1 + (1 − t) x_2) ≤ t f(x_1) + (1 − t) f(x_2)

  - f″(x) ≥ 0 ⇔ f(x) is a convex function
  - If f(x) is a convex function, f′(x_0) = 0 ⇒ x_0 is the global minimum point
    (i.e., f(x_0) ≤ f(x))
- Chain rule: let F(x) = f(g(x)); then F′(x) = f′(g(x)) g′(x)
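A quick numerical sanity check of the chain rule on a made-up example, F(x) = f(g(x)) with f(u) = sin(u) and g(x) = x²:

    import math

    f, df = math.sin, math.cos                   # f(u) = sin(u),  f'(u) = cos(u)
    g, dg = (lambda x: x**2), (lambda x: 2*x)    # g(x) = x^2,     g'(x) = 2x

    def F(x):
        return f(g(x))

    x0, a = 1.3, 1e-6
    numeric = (F(x0 + a) - F(x0)) / a            # definition of the derivative
    analytic = df(g(x0)) * dg(x0)                # chain rule: f'(g(x)) g'(x)
    print(numeric, analytic)                     # agree to roughly 1e-5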
Vector Calculus

- Gradient: the derivative of a multi-variable function f with respect to
  x = [x_1, ..., x_n]^T:

      ∇f(x) = [∂f/∂x_1, ..., ∂f/∂x_n]^T

  where ∂f/∂x_i is the partial derivative of f with respect to x_i.
- Example: f(x) = x_1² + 3x_2 + x_2 x_3:

      ∂f/∂x_1 = 2x_1,   ∂f/∂x_2 = 3 + x_3,   ∂f/∂x_3 = x_2

      ⇒ ∇f(x) = [2x_1, 3 + x_3, x_2]^T

- Let f(x) = a^T x, or f(x) = x^T A x; what is ∇f(x)?
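A numerical check of the gradient of the example f(x) = x_1² + 3x_2 + x_2 x_3 using the finite-difference definition of the derivative (the test point x0 is an arbitrary choice):

    import numpy as np

    def f(x):
        return x[0]**2 + 3*x[1] + x[1]*x[2]

    def grad_f(x):
        # analytic gradient from the slide: [2x1, 3 + x3, x2]^T
        return np.array([2*x[0], 3 + x[2], x[1]])

    x0 = np.array([1.0, -2.0, 0.5])
    eps = 1e-6
    numeric = np.array([(f(x0 + eps*np.eye(3)[i]) - f(x0)) / eps for i in range(3)])

    print(numeric)          # approximately [2, 3.5, -2]
    print(grad_f(x0))       # [2. , 3.5, -2. ]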
- Hessian matrix: the second derivative of a multi-variable function f with respect to
  x = [x_1, ..., x_n]^T:

          [ ∂²f/∂x_1²       ∂²f/∂x_1∂x_2   ...   ∂²f/∂x_1∂x_n ]
      H = [ ∂²f/∂x_2∂x_1    ∂²f/∂x_2²      ...   ∂²f/∂x_2∂x_n ]
          [      ...             ...       ...        ...     ]
          [ ∂²f/∂x_n∂x_1    ∂²f/∂x_n∂x_2   ...   ∂²f/∂x_n²    ]

  where ∂²f/∂x_i∂x_j is the mixed partial derivative of f. The order of differentiation
  does not matter (Schwarz's theorem).

- Example: f(x) = x_1² + 3x_2 + x_2 x_3:

      ∂f/∂x_1 = 2x_1,   ∂f/∂x_2 = 3 + x_3,   ∂f/∂x_3 = x_2

      ∂²f/∂x_1² = 2,   ∂²f/∂x_2² = ∂²f/∂x_3² = 0,
      ∂²f/∂x_1∂x_2 = ∂²f/∂x_1∂x_3 = 0,   ∂²f/∂x_2∂x_3 = 1

              [ 2  0  0 ]
      ⇒  H =  [ 0  0  1 ]
              [ 0  1  0 ]
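The same example's Hessian can be recovered numerically by finite-differencing the analytic gradient; a small sketch (the test point and step size are arbitrary choices):

    import numpy as np

    def grad_f(x):                        # gradient of f(x) = x1^2 + 3x2 + x2*x3
        return np.array([2*x[0], 3 + x[2], x[1]])

    x0 = np.array([1.0, -2.0, 0.5])
    eps = 1e-6
    # column j of H is (approximately) the change of the gradient along x_j
    H = np.array([(grad_f(x0 + eps*np.eye(3)[j]) - grad_f(x0)) / eps
                  for j in range(3)]).T

    print(np.round(H))       # [[2 0 0], [0 0 1], [0 1 0]], matching the slide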
Convex multi-variable function

    ∀x_1, x_2, ∀t ∈ [0, 1]:  f(t x_1 + (1 − t) x_2) ≤ t f(x_1) + (1 − t) f(x_2)

- H is positive-definite ⇔ f(x) is a convex function
- If f(x) is a convex function, ∇f(x_0) = 0 ⇒ x_0 is the global minimum point
  (i.e., f(x_0) ≤ f(x))
- Example: f(x) = x_1² + 3x_2 + x_2 x_3:

          [ 2  0  0 ]
      H = [ 0  0  1 ]
          [ 0  1  0 ]

  The eigenvalues of H are −1, 1, 2 ⇒ f(x) is not convex.
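The convexity test for this example, done in a couple of lines of NumPy:

    import numpy as np

    H = np.array([[2., 0., 0.],
                  [0., 0., 1.],
                  [0., 1., 0.]])

    eigvals = np.linalg.eigvalsh(H)      # eigenvalues of the symmetric Hessian
    print(eigvals)                       # [-1.  1.  2.]
    print(np.all(eigvals > 0))           # False -> H not positive-definite -> f not convex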
Function minimization

- Most machine learning problems can be formulated as a function minimization
  problem.
- Solve the equation:

      ∇f(x) = 0                (1)

  Done!
- Really?
  - If f is not convex, we may only obtain a suboptimal (locally optimal) solution.
  - We may not have an analytical solution for (1).
    Example: f(x) = x_1² + e^{x_2} + sin(x_3 + log(x_1)).
- Gradient descent!
Gradient descent

An optimization algorithm used to minimize a function by iteratively moving in the
direction of steepest descent, as defined by the negative of the gradient.

https://www.kodefork.com/learn/machine-learning/linear-regression-with-one-variable/
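A minimal gradient-descent sketch in NumPy; the quadratic objective, learning rate, and stopping rule below are illustrative choices, not taken from the slides:

    import numpy as np

    def f(x):                            # a simple convex quadratic to minimize
        return x[0]**2 + 2*x[1]**2

    def grad_f(x):
        return np.array([2*x[0], 4*x[1]])

    x = np.array([3.0, -2.0])            # initial guess
    lr = 0.1                             # learning rate (step size)

    for _ in range(200):
        g = grad_f(x)
        if np.linalg.norm(g) < 1e-8:     # stop when the gradient is (nearly) zero
            break
        x = x - lr * g                   # step in the direction of steepest descent

    print(x)                             # close to the minimizer [0, 0]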
