Essentials of Machine Learning
Lesson 02 - Probability
Thushari Silva, PhD
What is probability?
• Quantification of uncertainty
• Frequentist interpretation: long run frequencies of events
e.g.: The probability of a particular coin landing heads up is 0.5
• Bayesian interpretation: quantify our degrees of belief about something
e.g.: the probability of it raining tomorrow is 0.3
• Not possible to repeat “tomorrow” many times
• Basic rules of probability are the same, no matter which interpretation is
adopted
Random Variables
• A random variable (RV) X denotes a quantity that is subject to variation due to
chance
• May denote the result of an experiment (e.g. flipping a coin) or the
measurement of a real-world fluctuating quantity (e.g. temperature)
• Use capital letters to denote random variables and lower case letters to denote
values that they take, e.g. p(X = x)
• A discrete variable takes on values from a finite or countably infinite set
• Probability mass function p(X = x) for discrete random variables
Random Variables – Examples
• Examples:
• Colour of a car: blue, green, red
• Number of children in a family: 0, 1, 2, 3, 4, 5, 6, > 6
• Toss two coins, let X = (number of heads)². X can take on the values 0, 1 and 4.
• Example: p(Colour = red) = 0.3
• ∑_x P(X = x) = 1
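A minimal Python sketch of the two-coin example above: building the pmf of X = (number of heads)² and checking that it sums to 1 (variable names here are illustrative).

```python
from itertools import product

# Two fair coins: enumerate the 4 equally likely outcomes (H = 1, T = 0)
outcomes = list(product([0, 1], repeat=2))

# X = (number of heads)^2 ; build its probability mass function
pmf = {}
for outcome in outcomes:
    x = sum(outcome) ** 2
    pmf[x] = pmf.get(x, 0) + 1 / len(outcomes)

print(pmf)                # {0: 0.25, 1: 0.5, 4: 0.25}
print(sum(pmf.values()))  # 1.0 -- a pmf sums to one
```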
Continuous Random Variables
• Continuous RVs take on values that vary continuously within one or more real
intervals
• Probability density function (pdf) p(x) for a continuous random variable X
𝑃(𝑎 ≤ 𝑋 ≤ 𝑏) = ∫_a^b 𝑝(𝑥) 𝑑𝑥
therefore
𝑃(𝑥 ≤ 𝑋 ≤ 𝑥 + 𝛿𝑥) ≅ 𝑝(𝑥) 𝛿𝑥
• ∫ 𝑝(𝑥) 𝑑𝑥 = 1 (but values of p(x) can be greater than 1)
• Examples (coming soon): Gaussian, Gamma, Exponential, Beta
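A small numerical sketch, assuming a standard Gaussian as the example pdf: P(a ≤ X ≤ b) is approximated by a Riemann sum of p(x) dx, the total mass is close to 1, and the density itself can exceed 1.

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density, used here only as an example pdf."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# P(a <= X <= b) approximated by a Riemann sum of p(x) dx
a, b, dx = -1.0, 1.0, 1e-4
xs = np.arange(a, b, dx)
print(np.sum(gaussian_pdf(xs) * dx))   # ~0.6827 for a standard Gaussian

# Total probability integrates to (approximately) 1
xs = np.arange(-10.0, 10.0, dx)
print(np.sum(gaussian_pdf(xs) * dx))   # ~1.0

# p(x) itself can be greater than 1, e.g. a narrow Gaussian at its peak
print(gaussian_pdf(0.0, sigma=0.1))    # ~3.99 > 1
```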
Expectation
• Consider a function f(x) mapping from x onto numerical values
• 𝐸[𝑓(𝑥)] = ∑_x 𝑓(𝑥) 𝑃(𝑥)    (discrete)
  𝐸[𝑓(𝑥)] = ∫ 𝑓(𝑥) 𝑝(𝑥) 𝑑𝑥    (continuous)
• With f(x) = x, we obtain the mean, μ_x
• With f(x) = (x − μ_x)², we obtain the variance
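A minimal sketch computing the mean and variance as expectations of f(x), using the discrete two-coin pmf from the earlier example (P(X=0)=0.25, P(X=1)=0.5, P(X=4)=0.25).

```python
# Discrete pmf for X = (number of heads)^2 from the two-coin example
pmf = {0: 0.25, 1: 0.5, 4: 0.25}

# E[f(x)] = sum_x f(x) P(x)
def expectation(f, pmf):
    return sum(f(x) * p for x, p in pmf.items())

mean = expectation(lambda x: x, pmf)               # f(x) = x
var = expectation(lambda x: (x - mean) ** 2, pmf)  # f(x) = (x - mu)^2
print(mean, var)   # 1.5 2.25
```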
Joint distributions
• Properties of several random variables are important for modelling complex
problems
• 𝑃(𝑋_1 = 𝑥_1, 𝑋_2 = 𝑥_2, …, 𝑋_D = 𝑥_D)
• “,” is read as “and”
• Example: Grade and Intelligence (from Koller and Friedman, 2009)
Intelligence = low Intelligence = high
Grade = A 0.07 0.18
Grade = B 0.28 0.09
Grade = C 0.35 0.03
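A minimal sketch representing the Grade/Intelligence joint table above as a NumPy array and checking that the entries sum to 1.

```python
import numpy as np

# Joint table: rows = Grade (A, B, C), columns = Intelligence (low, high)
joint = np.array([[0.07, 0.18],
                  [0.28, 0.09],
                  [0.35, 0.03]])

print(joint.sum())   # 1.0 (up to floating-point rounding) -- all combinations sum to one
```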
Marginal Probability
• The sum rule
𝑝(𝑥) = ∑_y 𝑝(𝑥, 𝑦)
• p(Grade = A)? (see the worked check below)
• Replace sum by an integral for continuous RVs
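Applying the sum rule to the table gives p(Grade = A) = 0.07 + 0.18 = 0.25. A minimal sketch of the same marginalisation in NumPy:

```python
import numpy as np

# Joint table: rows = Grade (A, B, C), columns = Intelligence (low, high)
joint = np.array([[0.07, 0.18],
                  [0.28, 0.09],
                  [0.35, 0.03]])

p_grade = joint.sum(axis=1)          # marginalise out Intelligence
p_intelligence = joint.sum(axis=0)   # marginalise out Grade
print(p_grade)          # ~[0.25 0.37 0.38] -> p(Grade = A) = 0.25
print(p_intelligence)   # ~[0.7 0.3]
```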
Conditional Probability
• Let X and Y be two disjoint groups of variables, such that p(Y = y) > 0. Then the conditional
probability distribution (CPD) of X given Y = y is given by:
𝑝(𝑋 = 𝑥 | 𝑌 = 𝑦) = 𝑝(𝑥 | 𝑦) = 𝑝(𝑥, 𝑦) / 𝑝(𝑦)
• Product rule
p(𝑋, 𝑌) = 𝑝(𝑋) 𝑝(𝑌 | 𝑋) = 𝑝(𝑌) 𝑝(𝑋 | 𝑌)
• Example: In the grades example, what is p(Intelligence = high | Grade = A)? (see the sketch below)
• ∑_x p(X = x | Y = y) = 1 for all y
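A minimal sketch for the example above: p(Intelligence = high | Grade = A) = p(Grade = A, Intelligence = high) / p(Grade = A) = 0.18 / 0.25 = 0.72.

```python
import numpy as np

# Joint table: rows = Grade (A, B, C), columns = Intelligence (low, high)
joint = np.array([[0.07, 0.18],
                  [0.28, 0.09],
                  [0.35, 0.03]])

p_grade_A = joint[0].sum()                # p(Grade = A) = 0.25
p_high_given_A = joint[0, 1] / p_grade_A  # p(Intelligence = high | Grade = A)
print(p_high_given_A)                     # 0.72

# Each conditional distribution sums to 1: sum_x p(x | y) = 1 for every y
print((joint[0] / p_grade_A).sum())       # ~1.0
```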
Chain Rule
• The chain rule is derived by repeated application of the product rule
𝑝(𝑋_1, 𝑋_2, …, 𝑋_D) = 𝑝(𝑋_1, 𝑋_2, …, 𝑋_{D−1}) 𝑝(𝑋_D | 𝑋_1, 𝑋_2, …, 𝑋_{D−1})
                    = 𝑝(𝑋_1, 𝑋_2, …, 𝑋_{D−2}) 𝑝(𝑋_{D−1} | 𝑋_1, 𝑋_2, …, 𝑋_{D−2}) 𝑝(𝑋_D | 𝑋_1, 𝑋_2, …, 𝑋_{D−1})
                    = …
                    = 𝑝(𝑋_1) ∏_{i=2}^{D} 𝑝(𝑋_i | 𝑋_1, 𝑋_2, …, 𝑋_{i−1})
Exercise: give decompositions of p(x, y, z) using the chain rule (one possible answer below)
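One possible worked answer; any ordering of the variables gives a valid decomposition:
p(x, y, z) = p(x) p(y | x) p(z | x, y)
           = p(y) p(x | y) p(z | x, y)
           = p(z) p(x | z) p(y | x, z)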
Bayes' Rule
• From the product rule,
𝑃(𝑋 | 𝑌) = 𝑃(𝑌 | 𝑋) 𝑃(𝑋) / 𝑃(𝑌)
         = 𝑃(𝑌 | 𝑋) 𝑃(𝑋) / ∑_x 𝑃(𝑌 | 𝑋 = 𝑥) 𝑃(𝑋 = 𝑥)
Bayes’ rule example
• Consider the following medical diagnosis problem.
Suppose you decide to have a medical test for cancer. If the test is positive, what is the
probability that you have cancer? The test has a sensitivity of 80%, and the prior probability of
having cancer is 0.004.
Assume that false positives are quite likely, i.e. p(x = 1 | y = 0) = 0.1, where x denotes the test result and y whether cancer is present.
p(x = 1 | y = 1) = 0.8,  p(y = 1 | x = 1) = ?
p(y = 1 | x = 1) = p(x = 1 | y = 1) p(y = 1) / [ p(x = 1 | y = 1) p(y = 1) + p(x = 1 | y = 0) p(y = 0) ]
                = (0.8 × 0.004) / (0.8 × 0.004 + 0.1 × 0.996)
                ≅ 0.031 ≈ 3%
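A minimal Python sketch of this calculation (x = test result, y = cancer status, as in the slide; the function name is illustrative):

```python
def posterior(likelihood, prior, false_positive_rate):
    """p(y = 1 | x = 1) via Bayes' rule for a binary y."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# p(x=1|y=1) = 0.8, p(y=1) = 0.004, p(x=1|y=0) = 0.1
print(posterior(0.8, 0.004, 0.1))   # ~0.0311 -- about 3%
```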
Probabilistic Inference using Bayes' Rule
• Tuberculosis (TB) and a skin test (Test)
• p(TB = yes) = 0.001 (for subjects who get tested)
• p(Test = yes | TB = yes) = 0.95
• p(Test = no | TB = no) = 0.95
• Person gets a positive test result. What is p(TB = yes |Test = yes)?
𝑃(𝑇𝐵 = 𝑦𝑒𝑠 | 𝑇𝑒𝑠𝑡 = 𝑦𝑒𝑠) = 𝑃(𝑇𝑒𝑠𝑡 = 𝑦𝑒𝑠 | 𝑇𝐵 = 𝑦𝑒𝑠) 𝑃(𝑇𝐵 = 𝑦𝑒𝑠) / [ 𝑃(𝑇𝑒𝑠𝑡 = 𝑦𝑒𝑠 | 𝑇𝐵 = 𝑦𝑒𝑠) 𝑃(𝑇𝐵 = 𝑦𝑒𝑠) + 𝑃(𝑇𝑒𝑠𝑡 = 𝑦𝑒𝑠 | 𝑇𝐵 = 𝑛𝑜) 𝑃(𝑇𝐵 = 𝑛𝑜) ]
                          = (0.95 × 0.001) / (0.95 × 0.001 + 0.05 × 0.999)
                          ≅ 0.0187
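The same computation for the TB example, as a quick numerical check:

```python
# p(Test=yes|TB=yes) = 0.95, p(TB=yes) = 0.001, p(Test=yes|TB=no) = 1 - 0.95 = 0.05
p_tb_given_pos = 0.95 * 0.001 / (0.95 * 0.001 + 0.05 * 0.999)
print(p_tb_given_pos)   # ~0.0187 -- a positive test still leaves TB unlikely
```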
Independence
• Let X and Y be two disjoint groups of variables. Then X is said to be independent
of Y if and only if
𝑝(𝑋 | 𝑌) = p(X), for all possible values x and y of X and Y;
otherwise X is said to be dependent on Y
• Using the definition of conditional probability, we get an equivalent expression
for the independence condition
𝑝(𝑋, 𝑌) = p(X)p(Y)
• X independent of Y ⟺ Y independent of X (independence is symmetric)
• Independence of a set of variables. X1,…,XD are independent iff
𝑝(𝑋_1, 𝑋_2, …, 𝑋_D) = ∏_{i=1}^{D} 𝑝(𝑋_i)
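A minimal sketch checking (non-)independence on the Grade/Intelligence table: the joint entries are compared with the product of the marginals.

```python
import numpy as np

# Joint table: rows = Grade (A, B, C), columns = Intelligence (low, high)
joint = np.array([[0.07, 0.18],
                  [0.28, 0.09],
                  [0.35, 0.03]])

p_grade = joint.sum(axis=1)          # marginal over Intelligence
p_intelligence = joint.sum(axis=0)   # marginal over Grade

# If Grade and Intelligence were independent, the joint would equal the
# outer product of the marginals: p(x, y) = p(x) p(y)
product_of_marginals = np.outer(p_grade, p_intelligence)
print(np.allclose(joint, product_of_marginals))   # False -- they are dependent
print(joint[0, 1], product_of_marginals[0, 1])    # 0.18 vs ~0.075 (= 0.25 * 0.3)
```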
Conditional Independence
• Let X, Y and Z be three disjoint groups of variables. X is said to be conditionally
independent of Y given Z iff:
p(x | y, z) = p(x | z)
for all possible values of x, y and z.