Introduction to Bayesian Statistics
Richard Yi Da Xu
School of Computing & Communication, UTS
January 31, 2017
Random variables
- Pre-university: a number is just a fixed value. Now: a random variable X is not fixed; it takes values according to a probability distribution.

When we talk about probabilities:

- When X is a continuous random variable, it has a probability density function (pdf).
- When X is a discrete random variable, it has a probability mass function (pmf).

p(x) = p(X = x) denotes the probability that the random variable X is equal to a fixed number x, e.g., the probability that the number of machine learning participants equals 20.
Mean or Expectation
- Discrete case:
  $\mu = E(X) = \frac{1}{N} \sum_{i=1}^{N} x_i$
- Continuous case:
  $\mu = E(X) = \int_{x \in S} x \, p(x) \, dx$
- We can also measure the expectation of a function of X:
  $E(f(X)) = \int_{x \in S} f(x) \, p(x) \, dx$
  For example (a small sampling sketch follows below),
  $E(\cos(X)) = \int_{x \in S} \cos(x) \, p(x) \, dx \qquad E(X^2) = \int_{x \in S} x^2 \, p(x) \, dx$
- What about $f(E(X))$? We will discuss this later, when we meet Jensen's inequality in Expectation-Maximization.
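To see these definitions in action, here is a minimal Monte Carlo sketch; the choice of X ~ N(0, 1) is an illustrative assumption, not from the slides. Expectations become sample averages.

import numpy as np

# A minimal Monte Carlo sketch: approximate E(cos(X)) and E(X^2) by
# sample averages, assuming (for illustration) X ~ N(0, 1).
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1_000_000)

print(np.cos(x).mean())  # ~ exp(-1/2) = 0.6065 for a standard normal
print((x ** 2).mean())   # ~ 1.0, the second moment of N(0, 1)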
Variance: an intuitive explanation

- You have data X = {2, 3, 3, 2, 1, 4}, i.e., $x_1 = 2, x_2 = 3, \ldots, x_6 = 4$.
- You have the mean:
  $\mu = \frac{2 + 3 + 3 + 2 + 1 + 4}{6} = 2.5$
- The variance is then (a quick numerical check follows this list):
  $\mathrm{VAR}(\text{data}) = \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
- Division by N is intuitive: otherwise, more data would imply more variance.
- Also think about what kind of values VAR and $\sigma$ can take; we will later look at what kind of distribution is required for them.
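A quick sanity check of the two quantities above (a sketch, not part of the original slides):

# Mean and variance of the slide's data
data = [2, 3, 3, 2, 1, 4]
mu = sum(data) / len(data)                          # 2.5
var = sum((x - mu) ** 2 for x in data) / len(data)  # ~0.917
print(mu, var)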
Two alternative expressions

People sometimes use:

- You have data X = {2, 3, 3, 2, 1, 4}, i.e., $x_1 = 2, x_2 = 3, \ldots, x_6 = 4$.

$$\begin{aligned}
\mathrm{VAR}(X) = \sigma^2 &= \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \\
&= \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \frac{1}{N} \sum_{i=1}^{N} 2 x_i \mu + \frac{1}{N} \sum_{i=1}^{N} \mu^2 \\
&= \frac{1}{N} \sum_{i=1}^{N} x_i^2 - 2\mu \underbrace{\frac{1}{N} \sum_{i=1}^{N} x_i}_{\mu} + \mu^2 \\
&= \left( \frac{1}{N} \sum_{i=1}^{N} x_i^2 \right) - \mu^2
\end{aligned}$$

Other times, people use:

- You have data X = {1, 2, 3, 4}, and $P(X = 1) = \frac{1}{6}$, $P(X = 2) = \frac{2}{6}$, $P(X = 3) = \frac{2}{6}$ and $P(X = 4) = \frac{1}{6}$ (the empirical pmf of the data on the left).

$$\text{Discrete:}\ \mathrm{VAR}(X) = \sigma^2 = \sum_{x \in X} (x - \mu)^2 p(x) \qquad \text{Continuous:}\ \mathrm{VAR}(X) = \sigma^2 = \int_{x \in X} (x - \mu)^2 p(x) \, dx$$

$$\begin{aligned}
\mathrm{VAR}(X) &= \sum_{x \in X} \left( x^2 - 2\mu x + \mu^2 \right) p(x) \\
&= \sum_{x \in X} x^2 p(x) - 2\mu \underbrace{\sum_{x \in X} x \, p(x)}_{\mu} + \mu^2 \underbrace{\sum_{x \in X} p(x)}_{1} \\
&= \sum_{x \in X} x^2 p(x) - \mu^2
\end{aligned}$$

It is easy to verify that both sides are the same.
Numerical example

First version:

- X = {2, 3, 3, 2, 1, 4}, i.e., $x_1 = 2, x_2 = 3, \ldots, x_6 = 4$.

$$\begin{aligned}
\mathrm{VAR}(X) &= \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \\
&= \frac{1}{6} \left[ (2 - 2.5)^2 + (3 - 2.5)^2 + (3 - 2.5)^2 + (2 - 2.5)^2 + (1 - 2.5)^2 + (4 - 2.5)^2 \right] \\
&\approx 0.917
\end{aligned}$$

Second version:

- X = {1, 2, 3, 4}, and $P(X = 1) = \frac{1}{6}$, $P(X = 2) = \frac{2}{6}$, $P(X = 3) = \frac{2}{6}$ and $P(X = 4) = \frac{1}{6}$.

$$\begin{aligned}
\mathrm{VAR}(X) = \sigma^2 &= \sum_{x \in X} (x - \mu)^2 p(x) \\
&= (1 - 2.5)^2 \tfrac{1}{6} + (2 - 2.5)^2 \tfrac{2}{6} + (3 - 2.5)^2 \tfrac{2}{6} + (4 - 2.5)^2 \tfrac{1}{6} \\
&\approx 0.917
\end{aligned}$$

Both sides are the same.
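A short verification that the two expressions agree (a sketch; the pmf is the empirical distribution of the data):

data = [2, 3, 3, 2, 1, 4]
pmf = {1: 1/6, 2: 2/6, 3: 2/6, 4: 1/6}
mu = sum(data) / len(data)

v1 = sum(x ** 2 for x in data) / len(data) - mu ** 2   # (1/N) sum x_i^2 - mu^2
v2 = sum((x - mu) ** 2 * p for x, p in pmf.items())    # sum (x - mu)^2 p(x)
print(v1, v2)                                          # both ~0.9167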
An important fact about the variance

$$\begin{aligned}
\mathrm{VAR}(X) = E\left[ (X - E(X))^2 \right] &= \int_{x \in S} (x - \mu)^2 p(x) \, dx \\
&= \int_{x \in S} x^2 p(x) \, dx - 2\mu \int_{x \in S} x \, p(x) \, dx + \mu^2 \int_{x \in S} p(x) \, dx \\
&= E(X^2) - (E(X))^2
\end{aligned}$$

Think of VAR(X) as the "mean-subtracted" second-order moment of the random variable X.
Joint distributions
- The following is a table of the joint density Pr(X, Y):

             Y = 0    Y = 1    Y = 2    Total
    X = 0    0        3/15     3/15     6/15
    X = 1    2/15     6/15     0        8/15
    X = 2    1/15     0        0        1/15
    Total    3/15     9/15     3/15     1

- This table shows Pr(X, Y), or Pr(X = x, Y = y).
- For example, p(X = 1, Y = 1) = 6/15 (a numerical check follows the exercises).
- Exercise: what is the probability that X = 2, Y = 1?
- Exercise: what is the probability that X = 3, Y = 2?
- Exercise: what is the value of
  $\sum_{i=0}^{2} \sum_{j=0}^{2} \Pr(X = i, Y = j)$?
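The joint table is easy to manipulate as an array; a minimal sketch (the numpy layout is my own choice):

import numpy as np

# The joint table Pr(X, Y): rows are X = 0, 1, 2; columns are Y = 0, 1, 2.
P = np.array([[0, 3, 3],
              [2, 6, 0],
              [1, 0, 0]]) / 15

print(P[1, 1])        # Pr(X = 1, Y = 1) = 6/15 = 0.4
print(P.sum())        # sum over all i, j of Pr(X = i, Y = j)
print(P.sum(axis=0))  # marginal Pr(Y) = [3/15, 9/15, 3/15]
print(P.sum(axis=1))  # marginal Pr(X) = [6/15, 8/15, 1/15]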
Marginal distributions
             Y = 0    Y = 1    Y = 2    Total
    X = 0    0        3/15     3/15     6/15
    X = 1    2/15     6/15     0        8/15
    X = 2    1/15     0        0        1/15
    Total    3/15     9/15     3/15     1

- Using the sum rule, the marginal distribution is given by:
  $\Pr(X) = \sum_{y \in S_y} \Pr(x, y) \qquad \text{or} \qquad p(X) = \int_{y \in S_y} p(x, y) \, dy$
- For example:
  $\Pr(Y = 1) = \sum_{i=0}^{2} p(X = i, Y = 1) = \frac{3}{15} + \frac{6}{15} + \frac{0}{15} = \frac{9}{15}$
- Exercise: what are Pr(X = 2) and Pr(X = 1)?
Conditional distributions
             Y = 0    Y = 1    Y = 2    Total
    X = 0    0        3/15     3/15     6/15
    X = 1    2/15     6/15     0        8/15
    X = 2    1/15     0        0        1/15
    Total    3/15     9/15     3/15     1

- Conditional density:
  $p(X|Y) = \frac{p(X, Y)}{p(Y)} = \frac{p(Y|X) p(X)}{p(Y)} = \frac{p(Y|X) p(X)}{\sum_X p(Y|X) p(X)}$
- What about p(X | Y = y)? Pick an example:
  $p(X = 1 | Y = 1) = \frac{p(X = 1, Y = 1)}{p(Y = 1)} = \frac{6/15}{9/15} = \frac{2}{3}$
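Conditioning is just renormalizing a slice of the joint table; a minimal sketch (same array layout as before):

import numpy as np

# Conditioning on Y = 1 renormalizes the Y = 1 column of the joint table.
P = np.array([[0, 3, 3],
              [2, 6, 0],
              [1, 0, 0]]) / 15

print(P[:, 1] / P[:, 1].sum())   # Pr(X | Y = 1) = [1/3, 2/3, 0]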
Conditional distributions: Exercise
             Y = 0    Y = 1    Y = 2    Total
    X = 0    0        3/15     3/15     6/15
    X = 1    2/15     6/15     0        8/15
    X = 2    1/15     0        0        1/15
    Total    3/15     9/15     3/15     1

- The formula for the conditional density:
  $p(X|Y) = \frac{p(X, Y)}{p(Y)} = \frac{p(Y|X) p(X)}{p(Y)} = \frac{p(Y|X) p(X)}{\sum_X p(Y|X) p(X)}$
- Exercise: what is p(X = 2 | Y = 1)?
- Exercise: what is p(X = 1 | Y = 2)?
Independence
If X and Y are independent:

- p(X|Y) = p(X)
- p(X, Y) = p(X) p(Y)
- The two facts are related: when X and Y are independent,
  $p(X|Y) = \frac{p(X, Y)}{p(Y)} = \frac{p(X) p(Y)}{p(Y)} = p(X)$

X and Y are NOT independent:

             Y = 0    Y = 1    Y = 2    Total
    X = 0    0        3/15     3/15     6/15
    X = 1    2/15     6/15     0        8/15
    X = 2    1/15     0        0        1/15
    Total    3/15     9/15     3/15     1

X and Y are independent (each cell is the product of its row and column marginals):

             Y = 0     Y = 1     Y = 2     Total
    X = 0    18/225    54/225    18/225    6/15
    X = 1    24/225    72/225    24/225    8/15
    X = 2    3/225     9/225     3/225     1/15
    Total    3/15      9/15      3/15      1
Conditional Independence
- Imagine we have three random variables: X, Y and Z.
- Once we know Z, knowing Y does NOT tell us any additional information about X.
- Therefore:
  $\Pr(X | Y, Z) = \Pr(X | Z)$
- This means that X is conditionally independent of Y given Z.
- If $\Pr(X | Y, Z) = \Pr(X | Z)$, then what about $\Pr(X, Y | Z)$?

$$\begin{aligned}
\Pr(X, Y | Z) = \frac{\Pr(X, Y, Z)}{\Pr(Z)} &= \frac{\Pr(X | Y, Z) \Pr(Y, Z)}{\Pr(Z)} \\
&= \Pr(X | Y, Z) \Pr(Y | Z) \\
&= \Pr(X | Z) \Pr(Y | Z)
\end{aligned}$$
An example of Conditional Independence
We will study dynamic models later.

[Graphical model: a chain of hidden states $x_{t-1} \to x_t \to x_{t+1}$, with each hidden state $x_t$ emitting an observation $y_t$.]

From this model, we can see:
$p(x_t | x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}) = p(x_t | x_{t-1})$
$p(y_t | x_1, \ldots, x_{t-1}, x_t, y_1, \ldots, y_{t-1}) = p(y_t | x_t)$

For now, ask whether a given variable is the only item that "blocks" the path between two (or more) variables.
Another Example: Bayesian Linear Regression
We have data pairs:

- Input: $X = x_1, \ldots, x_N$
- Output: $Y = y_1, \ldots, y_N$

Each pair $x_i$ and $y_i$ is related through the model equation:
$y_i = f(x_i | w) + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$

- Input alone isn't going to tell you the model parameter: $p(w|X) = p(w)$
- Output alone isn't going to tell you the model parameter: $p(w|Y) = p(w)$
- Obviously: $p(w|X, Y) \neq p(w)$

Posterior over the parameter w (a concrete sketch follows below):
$p(w | x, y) = \frac{p(y | w, x) \, p(w | x) \, p(x)}{p(y | x) \, p(x)} = \frac{p(y | w, x) \, p(w)}{p(y | x)} = \frac{p(y | w, x) \, p(w)}{\int_w p(y | x, w) \, p(w) \, dw}$
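In the conjugate Gaussian case (linear f, Gaussian prior on w) this posterior has a closed form; a minimal sketch, where the basis, hyper-parameters, and data are all illustrative assumptions:

import numpy as np

# Sketch: posterior p(w | x, y) for y_i = w^T phi(x_i) + eps_i,
# eps_i ~ N(0, sigma^2), with prior w ~ N(0, I / alpha).
rng = np.random.default_rng(0)
N, sigma, alpha = 50, 0.3, 1.0
x = rng.uniform(-1, 1, N)
w_true = np.array([0.5, -1.0])            # hypothetical "true" weights
Phi = np.column_stack([np.ones(N), x])    # basis phi(x) = [1, x]
y = Phi @ w_true + rng.normal(0, sigma, N)

# Gaussian prior x Gaussian likelihood => Gaussian posterior N(m_N, S_N)
S_N = np.linalg.inv(alpha * np.eye(2) + Phi.T @ Phi / sigma ** 2)
m_N = S_N @ Phi.T @ y / sigma ** 2
print(m_N)   # posterior mean; close to w_true with enough data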
Expectation of Joint probabilities
Given that (X, Y) is a two-dimensional random variable:

- Continuous case:
  $E[f(X, Y)] = \int_{y \in S_y} \int_{x \in S_x} f(x, y) \, p(x, y) \, dx \, dy$
- Discrete case:
  $E[f(X, Y)] = \sum_{i=1}^{N_i} \sum_{j=1}^{N_j} f(X = i, Y = j) \, p(X = i, Y = j)$
Numerical example

p(X, Y):

             Y = 1    Y = 2    Y = 3
    X = 1    0        3/15     3/15
    X = 2    2/15     6/15     0
    X = 3    1/15     0        0

f(X, Y):

             Y = 1    Y = 2    Y = 3
    X = 1    6        7        8
    X = 2    3        6        2
    X = 3    1        8        6

$$\begin{aligned}
E[f(X, Y)] &= \sum_{i=1}^{N_i} \sum_{j=1}^{N_j} f(X = i, Y = j) \, p(X = i, Y = j) \\
&= 6 \times 0 + 7 \times \tfrac{3}{15} + 8 \times \tfrac{3}{15} + 3 \times \tfrac{2}{15} + 6 \times \tfrac{6}{15} \\
&\quad + 2 \times 0 + 1 \times \tfrac{1}{15} + 8 \times 0 + 6 \times 0 \\
&= \tfrac{88}{15} \approx 5.87
\end{aligned}$$
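The double sum is just the elementwise product of the two tables, summed; a one-line check (a sketch):

import numpy as np

# E[f(X, Y)] as the sum of the elementwise product of the two tables
p = np.array([[0, 3, 3], [2, 6, 0], [1, 0, 0]]) / 15   # p(X, Y)
f = np.array([[6, 7, 8], [3, 6, 2], [1, 8, 6]])        # f(X, Y)
print((f * p).sum())   # 88/15 ~ 5.867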
Conditional Expectation
This is a useful property for later (the tower property):

$$\begin{aligned}
E[E(Y | X)] &= \int_X E(Y | X) \, p(x) \, dx \\
&= \int_X \left[ \int_Y y \, p(y | x) \, dy \right] p(x) \, dx = \int_X \int_Y y \, p(y, x) \, dy \, dx \\
&= \int_Y y \left[ \int_X p(y, x) \, dx \right] dy \\
&= \int_Y y \, p(y) \, dy = E(Y)
\end{aligned}$$
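A discrete sanity check of this property on the earlier joint table (a sketch; in the discrete case the integrals become sums):

import numpy as np

# Check E[E(Y|X)] = E(Y) on the joint table from the earlier slides
# (rows X = 0, 1, 2; columns Y = 0, 1, 2).
P = np.array([[0, 3, 3], [2, 6, 0], [1, 0, 0]]) / 15
y_vals = np.array([0, 1, 2])

E_Y = (P.sum(axis=0) * y_vals).sum()          # direct: sum_y y p(y)
pX = P.sum(axis=1)                            # marginal p(x)
E_Y_given_X = (P * y_vals).sum(axis=1) / pX   # E(Y | X = x) for each x
print(E_Y, (E_Y_given_X * pX).sum())          # both print 1.0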
Bayesian Predictive distribution
Let's put marginal distributions and conditional independence to the test:

- Very often in machine learning, you want to compute the probability of new data $y^*$ given training data Y, i.e., $p(y^* | Y)$. You assume there is some model that explains both Y and $y^*$. The model parameter is θ:
  $p(y^* | Y) = \int_\theta p(y^* | \theta) \, p(\theta | Y) \, d\theta$
- Exercise: explain why the above works.
Revisit Bayes Theorem
Instead of using arbitrary random variable symbols, we now use:

- θ for the model parameter
- $X = x_1, \ldots, x_n$ for the dataset:

$$\underbrace{p(\theta | X)}_{\text{posterior}} = \frac{\overbrace{p(X | \theta)}^{\text{likelihood}} \ \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(X)}_{\text{normalization constant}}} = \frac{p(X | \theta) \, p(\theta)}{\int_\theta p(X | \theta) \, p(\theta) \, d\theta}$$
An Intrusion Detection System (IDS) Example
The setting: imagine that out of all TCP connections (say, millions), 1% are intrusions:

- When there is an intrusion, the probability that the system sends an alarm is 87%.
- When there is no intrusion, the probability that the system sends an alarm is 6%.
- Prior probability: 1% of connections are intrusions, so
  $p(\theta = \text{intrusion}) = 0.01 \qquad p(\theta = \text{no intrusion}) = 0.99$
- Likelihood probability:
  - Given an intrusion occurs, the probability that the system sends an alarm is 87%:
    $p(X = \text{alarm} | \theta = \text{intrusion}) = 0.87 \qquad p(X = \text{no alarm} | \theta = \text{intrusion}) = 0.13$
  - Given there is no intrusion, the probability that the system sends an alarm is 6%:
    $p(X = \text{alarm} | \theta = \text{no intrusion}) = 0.06 \qquad p(X = \text{no alarm} | \theta = \text{no intrusion}) = 0.94$
Posterior
- We are interested in the posterior probability Pr(θ|X).
- There are two possible values for the parameter θ and two possible observations X.
- Therefore, there are four rates we need to compute:
  - True Positive: when the system sends an alarm, the probability that an intrusion occurred:
    Pr(θ = intrusion | X = alarm)
  - False Positive: when the system sends an alarm, the probability that there is no intrusion:
    Pr(θ = no intrusion | X = alarm)
  - True Negative: when the system sends no alarm, the probability that there is no intrusion:
    Pr(θ = no intrusion | X = no alarm)
  - False Negative: when the system sends no alarm, the probability that an intrusion occurred:
    Pr(θ = intrusion | X = no alarm)
- Question: which are the two probabilities you'd like to maximise?
Apply Bayes Theorem in this setting
$$\Pr(\theta | X) = \frac{\Pr(X | \theta) \Pr(\theta)}{\sum_\theta \Pr(X | \theta) \Pr(\theta)} = \frac{\Pr(X | \theta) \Pr(\theta)}{\Pr(X | \theta = \text{intrusion}) \Pr(\theta = \text{intrusion}) + \Pr(X | \theta = \text{no intrusion}) \Pr(\theta = \text{no intrusion})}$$
Apply Bayes Theorem in this setting
True Positive rate: when the system sends an alarm, what is the probability that an intrusion occurred?

$$\begin{aligned}
&\Pr(\theta = \text{intrusion} | X = \text{alarm}) \\
&= \frac{\Pr(X = \text{alarm} | \theta = \text{intrusion}) \Pr(\theta = \text{intrusion})}{\Pr(X = \text{alarm} | \theta = \text{intrusion}) \Pr(\theta = \text{intrusion}) + \Pr(X = \text{alarm} | \theta = \text{no intrusion}) \Pr(\theta = \text{no intrusion})} \\
&= \frac{0.87 \times 0.01}{0.87 \times 0.01 + 0.06 \times 0.99} = 0.1278
\end{aligned}$$

False Positive rate: when the system sends an alarm, what is the probability that there is no intrusion?

$$\begin{aligned}
&\Pr(\theta = \text{no intrusion} | X = \text{alarm}) \\
&= \frac{\Pr(X = \text{alarm} | \theta = \text{no intrusion}) \Pr(\theta = \text{no intrusion})}{\Pr(X = \text{alarm} | \theta = \text{intrusion}) \Pr(\theta = \text{intrusion}) + \Pr(X = \text{alarm} | \theta = \text{no intrusion}) \Pr(\theta = \text{no intrusion})} \\
&= \frac{0.06 \times 0.99}{0.87 \times 0.01 + 0.06 \times 0.99} = 0.8722
\end{aligned}$$
Apply Bayes Theorem in this setting
False Negative: when the system sends no alarm, what is the probability that an intrusion occurred?

$$\begin{aligned}
&\Pr(\theta = \text{intrusion} | X = \text{no alarm}) \\
&= \frac{\Pr(X = \text{no alarm} | \theta = \text{intrusion}) \Pr(\theta = \text{intrusion})}{\Pr(X = \text{no alarm} | \theta = \text{intrusion}) \Pr(\theta = \text{intrusion}) + \Pr(X = \text{no alarm} | \theta = \text{no intrusion}) \Pr(\theta = \text{no intrusion})} \\
&= \frac{0.13 \times 0.01}{0.13 \times 0.01 + 0.94 \times 0.99} = 0.0014
\end{aligned}$$

True Negative: when the system sends no alarm, what is the probability that there is no intrusion?

$$\begin{aligned}
&\Pr(\theta = \text{no intrusion} | X = \text{no alarm}) \\
&= \frac{\Pr(X = \text{no alarm} | \theta = \text{no intrusion}) \Pr(\theta = \text{no intrusion})}{\Pr(X = \text{no alarm} | \theta = \text{intrusion}) \Pr(\theta = \text{intrusion}) + \Pr(X = \text{no alarm} | \theta = \text{no intrusion}) \Pr(\theta = \text{no intrusion})} \\
&= \frac{0.94 \times 0.99}{0.13 \times 0.01 + 0.94 \times 0.99} = 0.9986
\end{aligned}$$
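These four numbers are easy to reproduce; a minimal sketch using the prior and likelihood values from the slides:

# Reproduce the four posterior probabilities (a sketch; values from the slides)
p_intrusion = 0.01
p_alarm = {True: 0.87, False: 0.06}   # Pr(alarm | intrusion?), keyed by intrusion

def posterior(intrusion, alarm):
    """Pr(theta = intrusion? | X = alarm?) by Bayes theorem."""
    def joint(i):
        prior = p_intrusion if i else 1 - p_intrusion
        lik = p_alarm[i] if alarm else 1 - p_alarm[i]
        return lik * prior
    return joint(intrusion) / (joint(True) + joint(False))

print(posterior(True, True))     # True Positive  ~0.1278
print(posterior(False, True))    # False Positive ~0.8722
print(posterior(True, False))    # False Negative ~0.0014
print(posterior(False, False))   # True Negative  ~0.9986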
A statistical way to think about posterior inference

Posterior inference finds the best q(θ) ∈ Q to approximate p(θ|X), in the sense of:

$$\begin{aligned}
&\inf_{q(\theta) \in Q} \left\{ \mathrm{KL}(q(\theta) \,\|\, p(\theta)) - E_{\theta \sim q(\theta)}[\ln p(X | \theta)] \right\} \\
&= \inf_{q(\theta) \in Q} \left\{ \int_\theta \ln \frac{q(\theta)}{p(\theta)} \, q(\theta) \, d\theta - \int_\theta \ln p(X | \theta) \, q(\theta) \, d\theta \right\} \\
&= \inf_{q(\theta) \in Q} \int_\theta \left[ \ln q(\theta) - (\ln p(\theta) + \ln p(X | \theta)) \right] q(\theta) \, d\theta \\
&= \inf_{q(\theta) \in Q} \int_\theta \ln \frac{q(\theta)}{p(\theta) \, p(X | \theta)} \, q(\theta) \, d\theta \\
&= \inf_{q(\theta) \in Q} \int_\theta \ln \frac{q(\theta)}{p(\theta | X) \, p(X)} \, q(\theta) \, d\theta \qquad \text{since } p(\theta) p(X | \theta) = p(\theta | X) p(X) \\
&= \inf_{q(\theta) \in Q} \left\{ \mathrm{KL}(q(\theta) \,\|\, p(\theta | X)) \right\} - \ln p(X)
\end{aligned}$$

Since ln p(X) does not depend on q, minimizing this objective is exactly minimizing KL(q(θ) ‖ p(θ|X)); a numerical check follows below.
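On a discrete grid of θ values this equivalence is easy to check numerically; a sketch (the grid, prior, and likelihood below are illustrative assumptions):

import numpy as np

# Sketch: on a discrete grid, the q that minimizes
# KL(q || prior) - E_q[ln p(X|theta)] is exactly the posterior.
theta = np.linspace(0.01, 0.99, 99)                  # e.g. a coin's bias
prior = np.full(theta.size, 1 / theta.size)          # uniform prior
loglik = 7 * np.log(theta) + 3 * np.log(1 - theta)   # 7 heads, 3 tails

post = prior * np.exp(loglik)                        # Bayes theorem
post /= post.sum()

def objective(q):
    return np.sum(q * (np.log(q) - np.log(prior))) - np.sum(q * loglik)

q_uniform = prior.copy()
print(objective(post) < objective(q_uniform))        # True: the posterior wins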