
Introduction to Bayesian Statistics

Richard Yi Da Xu

School of Computing & Communication, UTS

January 31, 2017



Random variables

- Pre-university: a number is just a fixed value.

When we talk about probabilities:
- When X is a continuous random variable, it has a probability density function (pdf).
- When X is a discrete random variable, it has a probability mass function (pmf).

p(x) = p(X = x) denotes the probability that the random variable X is equal to a fixed number x, e.g.,

the probability that (number of machine learning participants) = 20



Mean or Expectation

- Discrete case:

$$\mu = E(X) = \frac{1}{N}\sum_{i=1}^{N} x_i$$

- Continuous case:

$$\mu = E(X) = \int_{x \in S} x\, p(x)\, dx$$

- We can also measure the expectation of a function:

$$E(f(X)) = \int_{x \in S} f(x)\, p(x)\, dx$$

For example,

$$E(\cos(X)) = \int_{x \in S} \cos(x)\, p(x)\, dx \qquad E(X^2) = \int_{x \in S} x^2\, p(x)\, dx$$

- What about f(E(X))? We discuss this later with Jensen's inequality in Expectation-Maximization.
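As a quick numerical check, here is a minimal Python sketch (assuming NumPy is available) that evaluates these expectations for a small discrete pmf; the values and probabilities are chosen to match the variance examples later in these slides:

```python
import numpy as np

# A small discrete pmf: values {1, 2, 3, 4} with probabilities 1/6, 2/6, 2/6, 1/6.
x = np.array([1.0, 2.0, 3.0, 4.0])
p = np.array([1, 2, 2, 1]) / 6
assert np.isclose(p.sum(), 1.0)

mean  = np.sum(x * p)              # E(X) = 2.5
e_cos = np.sum(np.cos(x) * p)      # E(cos(X))
e_sq  = np.sum(x**2 * p)           # E(X^2) = 43/6 ~ 7.17

print(mean, e_cos, e_sq)
```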



Variances: an intuitive explanation

- You have data X = {2, 3, 3, 2, 1, 4}, i.e., x_1 = 2, x_2 = 3, ..., x_6 = 4.
- You have the mean:

$$\mu = \frac{2+3+3+2+1+4}{6} = 2.5$$

- The variance is then:

$$\mathrm{VAR}(\text{data}) = \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

- Division by N is intuitive: otherwise, more data would imply more variance.
- Also think about what kind of values VAR and σ can take; we will later look at what kind of distribution is required for them.
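A minimal Python sketch (assuming NumPy) that computes this mean and variance for the data above:

```python
import numpy as np

# The data from the slide.
data = np.array([2, 3, 3, 2, 1, 4], dtype=float)

mu  = data.mean()                    # (2+3+3+2+1+4)/6 = 2.5
var = np.mean((data - mu) ** 2)      # (1/N) sum (x_i - mu)^2

print(mu, var)  # 2.5, ~0.9167
# Note: np.var(data) uses this same 1/N convention by default.
```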



Two alternative expressions

People sometimes use:
- You have data X = {2, 3, 3, 2, 1, 4}, i.e., x_1 = 2, x_2 = 3, ..., x_6 = 4.

$$\mathrm{VAR}(X) = \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\sum_{i=1}^{N} 2 x_i \mu + \frac{1}{N}\sum_{i=1}^{N} \mu^2$$

$$= \frac{1}{N}\sum_{i=1}^{N} x_i^2 - 2\mu \underbrace{\frac{1}{N}\sum_{i=1}^{N} x_i}_{\mu} + \mu^2 = \left(\frac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \mu^2$$

Other times, people use:
- You have data X = {1, 2, 3, 4}, and P(X = 1) = 1/6, P(X = 2) = 2/6, P(X = 3) = 2/6 and P(X = 4) = 1/6.

$$\text{Discrete:}\quad \mathrm{VAR}(X) = \sigma^2 = \sum_{x \in X}(x - \mu)^2\, p(x) \qquad \text{Continuous:}\quad \mathrm{VAR}(X) = \sigma^2 = \int_{x \in X}(x - \mu)^2\, p(x)\, dx$$

$$\mathrm{VAR}(X) = \sum_{x \in X}\left(x^2 - 2\mu x + \mu^2\right) p(x) = \sum_{x \in X} x^2 p(x) - 2\mu \underbrace{\sum_{x \in X} x\, p(x)}_{\mu} + \mu^2 \underbrace{\sum_{x \in X} p(x)}_{1} = \left(\sum_{x \in X} x^2 p(x)\right) - \mu^2$$

It's easy to verify that both sides are the same.



Numerical example

First version:
- X = {2, 3, 3, 2, 1, 4}, i.e., x_1 = 2, x_2 = 3, ..., x_6 = 4.

$$\mathrm{VAR}(X) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{6}\left[(2-2.5)^2 + (3-2.5)^2 + (3-2.5)^2 + (2-2.5)^2 + (1-2.5)^2 + (4-2.5)^2\right] \approx 0.917$$

Second version:
- X = {1, 2, 3, 4}, and P(X = 1) = 1/6, P(X = 2) = 2/6, P(X = 3) = 2/6 and P(X = 4) = 1/6.

$$\mathrm{VAR}(X) = \sigma^2 = \sum_{x \in X}\underbrace{(x - \mu)^2}_{f(x)} p(x) = (1-2.5)^2\,\frac{1}{6} + (2-2.5)^2\,\frac{2}{6} + (3-2.5)^2\,\frac{2}{6} + (4-2.5)^2\,\frac{1}{6} \approx 0.917$$

Both sides are the same.
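A minimal Python check (assuming NumPy) that the two versions agree on this data:

```python
import numpy as np

data = np.array([2, 3, 3, 2, 1, 4], dtype=float)
mu = data.mean()

# First version: average squared deviation over the raw samples.
var1 = np.mean((data - mu) ** 2)

# Second version: sum over distinct values weighted by the empirical pmf.
values, counts = np.unique(data, return_counts=True)
pmf = counts / counts.sum()              # P(1)=1/6, P(2)=2/6, P(3)=2/6, P(4)=1/6
var2 = np.sum((values - mu) ** 2 * pmf)

print(var1, var2)  # both ~0.9167
```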



Important fact about the variance

$$\mathrm{VAR}(X) = E[(X - E(X))^2] = \int_{x \in S}(x - \mu)^2\, p(x)\, dx = \int_{x \in S} x^2 p(x)\, dx - 2\mu \int_{x \in S} x\, p(x)\, dx + \mu^2 \int_{x \in S} p(x)\, dx$$

$$= E(X^2) - (E(X))^2$$

Think of VAR(X) as the "mean-subtracted" second-order moment of the random variable X.



Joint distributions

- The following is a table form of the joint density Pr(X, Y):

         Y = 0    Y = 1    Y = 2    Total
X = 0    0        3/15     3/15     6/15
X = 1    2/15     6/15     0        8/15
X = 2    1/15     0        0        1/15
Total    3/15     9/15     3/15     1

- This table shows Pr(X, Y), i.e., Pr(X = x, Y = y).
- For example, p(X = 1, Y = 1) = 6/15.
- Exercise: what is the probability that X = 2, Y = 1?
- Exercise: what is the probability that X = 3, Y = 2?
- Exercise: what is the value of

$$\sum_{i=0}^{2}\sum_{j=0}^{2} \Pr(X = i, Y = j)?$$
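A minimal Python sketch (assuming NumPy) encoding this joint table; the prints check the example and the exercises:

```python
import numpy as np

# Joint pmf Pr(X, Y): rows are X = 0, 1, 2; columns are Y = 0, 1, 2.
joint = np.array([[0, 3, 3],
                  [2, 6, 0],
                  [1, 0, 0]]) / 15

print(joint[1, 1])   # Pr(X=1, Y=1) = 6/15
print(joint[2, 1])   # Pr(X=2, Y=1) = 0
# X = 3 is outside the table's support, so Pr(X=3, Y=2) = 0.
print(joint.sum())   # total probability mass = 1
```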



Marginal distributions

         Y = 0    Y = 1    Y = 2    Total
X = 0    0        3/15     3/15     6/15
X = 1    2/15     6/15     0        8/15
X = 2    1/15     0        0        1/15
Total    3/15     9/15     3/15     1

- Using the sum rule, the marginal distribution tells us that:

$$\Pr(X) = \sum_{y \in S_y} \Pr(x, y) \quad\text{or}\quad p(x) = \int_{y \in S_y} p(x, y)\, dy$$

- For example:

$$\Pr(Y = 1) = \sum_{i=0}^{2} p(X = i, Y = 1) = \frac{3}{15} + \frac{6}{15} + 0 = \frac{9}{15}$$

- Exercise: what are Pr(X = 2) and Pr(X = 1)?
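A minimal Python sketch (assuming NumPy) of the sum rule on this table; note it also checks the exercise answers:

```python
import numpy as np

joint = np.array([[0, 3, 3],
                  [2, 6, 0],
                  [1, 0, 0]]) / 15

p_x = joint.sum(axis=1)   # marginal Pr(X): sum over Y -> [6/15, 8/15, 1/15]
p_y = joint.sum(axis=0)   # marginal Pr(Y): sum over X -> [3/15, 9/15, 3/15]

print(p_y[1])             # Pr(Y=1) = 9/15
print(p_x[2], p_x[1])     # Pr(X=2) = 1/15, Pr(X=1) = 8/15
```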



Conditional distributions

         Y = 0    Y = 1    Y = 2    Total
X = 0    0        3/15     3/15     6/15
X = 1    2/15     6/15     0        8/15
X = 2    1/15     0        0        1/15
Total    3/15     9/15     3/15     1

- Conditional density:

$$p(X|Y) = \frac{p(X, Y)}{p(Y)} = \frac{p(Y|X)\,p(X)}{p(Y)} = \frac{p(Y|X)\,p(X)}{\sum_X p(Y|X)\,p(X)}$$

- What about p(X|Y = y)? Pick an example:

$$p(X = 1|Y = 1) = \frac{p(X = 1, Y = 1)}{p(Y = 1)} = \frac{6/15}{9/15} = \frac{2}{3}$$
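A minimal Python sketch (assuming NumPy) of this conditioning step on the same table:

```python
import numpy as np

joint = np.array([[0, 3, 3],
                  [2, 6, 0],
                  [1, 0, 0]]) / 15

# Condition on Y = 1: take that column and renormalise by Pr(Y = 1).
p_x_given_y1 = joint[:, 1] / joint[:, 1].sum()
print(p_x_given_y1)   # [1/3, 2/3, 0], so p(X=1|Y=1) = 2/3 as on the slide
```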



Conditional distributions: Exercise

         Y = 0    Y = 1    Y = 2    Total
X = 0    0        3/15     3/15     6/15
X = 1    2/15     6/15     0        8/15
X = 2    1/15     0        0        1/15
Total    3/15     9/15     3/15     1

- The formulation for the conditional density:

$$p(X|Y) = \frac{p(X, Y)}{p(Y)} = \frac{p(Y|X)\,p(X)}{p(Y)} = \frac{p(Y|X)\,p(X)}{\sum_X p(Y|X)\,p(X)}$$

- Exercise: what is p(X = 2|Y = 1)?
- Exercise: what is p(X = 1|Y = 2)?



Independence

If X and Y are independent:

- p(X|Y) = p(X)
- p(X, Y) = p(X)p(Y)
- The two facts are related when X and Y are independent:

$$p(X|Y) = \frac{p(X, Y)}{p(Y)} = \frac{p(X)\,p(Y)}{p(Y)} = p(X)$$

X and Y are NOT independent:

         Y = 0    Y = 1    Y = 2    Total
X = 0    0        3/15     3/15     6/15
X = 1    2/15     6/15     0        8/15
X = 2    1/15     0        0        1/15
Total    3/15     9/15     3/15     1

X and Y ARE independent (each entry is the product of its row and column marginals):

         Y = 0     Y = 1     Y = 2     Total
X = 0    18/225    54/225    18/225    6/15
X = 1    24/225    72/225    24/225    8/15
X = 2    3/225     9/225     3/225     1/15
Total    3/15      9/15      3/15      1
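A minimal Python sketch (assuming NumPy) of the independence check for both tables:

```python
import numpy as np

joint = np.array([[0, 3, 3],
                  [2, 6, 0],
                  [1, 0, 0]]) / 15
p_x = joint.sum(axis=1)                 # [6/15, 8/15, 1/15]
p_y = joint.sum(axis=0)                 # [3/15, 9/15, 3/15]

# Independence would require p(X, Y) = p(X)p(Y) everywhere.
print(np.allclose(joint, np.outer(p_x, p_y)))   # False: NOT independent

# The second table is exactly the product of those marginals, so it passes:
joint_indep = np.outer(p_x, p_y)        # e.g. (8/15)*(9/15) = 72/225
print(np.allclose(joint_indep, np.outer(joint_indep.sum(axis=1),
                                        joint_indep.sum(axis=0))))  # True
```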



Conditional Independence

- Imagine we have three random variables: X, Y and Z.
- Once we know Z, knowing Y does NOT tell us any additional information about X.
- Therefore:

$$\Pr(X|Y, Z) = \Pr(X|Z)$$

- This means that X is conditionally independent of Y given Z.
- If Pr(X|Y, Z) = Pr(X|Z), then what about Pr(X, Y|Z)?

$$\Pr(X, Y|Z) = \frac{\Pr(X, Y, Z)}{\Pr(Z)} = \frac{\Pr(X|Y, Z)\,\Pr(Y, Z)}{\Pr(Z)} = \Pr(X|Y, Z)\,\Pr(Y|Z) = \Pr(X|Z)\,\Pr(Y|Z)$$



An example of Conditional Independence

We will study dynamic models later.

[Graphical model: a hidden Markov chain of states x_{t-1} -> x_t -> x_{t+1}, where each hidden state x_t emits an observation y_t.]

From this model, we can see:

$$p(x_t|x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}) = p(x_t|x_{t-1})$$

$$p(y_t|x_1, \ldots, x_{t-1}, x_t, y_1, \ldots, y_{t-1}) = p(y_t|x_t)$$

For now, think of whether a given variable is the only item that "blocks" the path between two (or more) variables.



Another Example: Bayesian Linear Regression

We have data pairs:

- Input: X = x_1, ..., x_N
- Output: Y = y_1, ..., y_N

Each pair x_i and y_i is related through the model equation:

$$y_i = f(x_i|w) + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$$

- Input alone isn't going to tell you the model parameter: p(w|X) = p(w).
- Output alone isn't going to tell you the model parameter: p(w|Y) = p(w).
- Obviously: p(w|X, Y) ≠ p(w).

Posterior over the parameter w:

$$p(w|x, y) = \frac{p(y|w, x)\,p(w|x)\,p(x)}{p(y|x)\,p(x)} = \frac{p(y|w, x)\,p(w)}{p(y|x)} = \frac{p(y|w, x)\,p(w)}{\int_w p(y|x, w)\,p(w)\,dw}$$
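A minimal sketch of this posterior, assuming the conjugate special case f(x|w) = wᵀx with a Gaussian prior w ~ N(0, α⁻¹I), where the posterior is available in closed form; all numerical values here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 2
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, N)])  # bias + one feature
w_true = np.array([0.5, -1.0])
sigma2, alpha = 0.1, 1.0                                  # noise var, prior precision
y = X @ w_true + rng.normal(0, np.sqrt(sigma2), N)

# Closed-form Gaussian posterior p(w | X, y) = N(m, S).
S_inv = alpha * np.eye(D) + (X.T @ X) / sigma2
S = np.linalg.inv(S_inv)
m = S @ (X.T @ y) / sigma2

print(m)   # posterior mean, close to w_true given enough data
```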



Expectation of Joint probabilities

Given that (X, Y) is a two-dimensional random variable:

- Continuous case:

$$E[f(X, Y)] = \int_{y \in S_y}\int_{x \in S_x} f(x, y)\, p(x, y)\, dx\, dy$$

- Discrete case:

$$E[f(X, Y)] = \sum_{i=1}^{N_i}\sum_{j=1}^{N_j} f(X = i, Y = j)\, p(X = i, Y = j)$$



Numerical example

p(X, Y):
         Y = 1    Y = 2    Y = 3
X = 1    0        3/15     3/15
X = 2    2/15     6/15     0
X = 3    1/15     0        0

f(X, Y):
         Y = 1    Y = 2    Y = 3
X = 1    6        7        8
X = 2    3        6        2
X = 3    1        8        6

$$E[f(X, Y)] = \sum_{i=1}^{N_i}\sum_{j=1}^{N_j} f(X = i, Y = j)\, p(X = i, Y = j)$$

$$= 6\times 0 + 7\times\frac{3}{15} + 8\times\frac{3}{15} + 3\times\frac{2}{15} + 6\times\frac{6}{15} + 2\times 0 + 1\times\frac{1}{15} + 8\times 0 + 6\times 0 = \frac{88}{15} \approx 5.87$$
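A minimal Python check (assuming NumPy) of this expectation:

```python
import numpy as np

# p(X, Y) and f(X, Y) from the slide (rows X = 1..3, columns Y = 1..3).
p = np.array([[0, 3, 3],
              [2, 6, 0],
              [1, 0, 0]]) / 15
f = np.array([[6, 7, 8],
              [3, 6, 2],
              [1, 8, 6]], dtype=float)

print(np.sum(f * p))   # E[f(X, Y)] = 88/15 ~ 5.87
```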



Conditional Expectation

This is a useful property for later (the law of total expectation):

$$E_X[E(Y|X)] = \int_X E(Y|X)\, p(x)\, dx = \int_X \left[\int_Y y\, p(y|x)\, dy\right] p(x)\, dx = \int_X \int_Y y\, p(y, x)\, dy\, dx$$

$$= \int_Y y \left[\int_X p(y, x)\, dx\right] dy = \int_Y y\, p(y)\, dy = E(Y)$$
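A minimal discrete check of this property (assuming NumPy), reusing the joint table from the previous slide:

```python
import numpy as np

# Joint table with rows X = 1..3 and columns Y = 1..3.
p = np.array([[0, 3, 3],
              [2, 6, 0],
              [1, 0, 0]]) / 15
y_vals = np.array([1.0, 2.0, 3.0])

p_x = p.sum(axis=1)                      # marginal of X
p_y = p.sum(axis=0)                      # marginal of Y
e_y = np.sum(y_vals * p_y)               # E(Y) directly = 2.0

# E(Y|X=i) = sum_j y_j p(y_j|x_i); then average over X.
e_y_given_x = (p * y_vals).sum(axis=1) / p_x
print(np.sum(e_y_given_x * p_x), e_y)    # both 2.0
```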



Bayesian Predictive distribution

Let's put marginal distributions and conditional independence to the test:

- Very often in machine learning, you want to compute the probability of new data y* given training data Y, i.e., p(y*|Y). You assume there is some model that explains both Y and y*. The model parameter is θ:

$$p(y^*|Y) = \int_\theta p(y^*|\theta)\, p(\theta|Y)\, d\theta$$

- Exercise: explain why the above works.
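A Monte Carlo sketch of this integral for one concrete (hypothetical) choice of model, a Beta-Bernoulli coin, where the posterior p(θ|Y) happens to be a Beta distribution we can sample from:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([1, 0, 1, 1, 0, 1])          # observed training data Y

# With a Beta(1, 1) prior, the posterior p(theta|Y) is Beta(1+heads, 1+tails).
heads, tails = data.sum(), len(data) - data.sum()
theta_samples = rng.beta(1 + heads, 1 + tails, size=100_000)

# p(y* = 1 | Y) = E_{theta ~ p(theta|Y)}[ p(y* = 1 | theta) ]
print(theta_samples.mean())                  # ~ (1 + heads)/(2 + len(data)) = 5/8
```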



Revisit Bayes Theorem

Instead of using arbitrary random variable symbols, we now use:

- θ for the model parameter
- X = x_1, ..., x_n for the dataset:

$$\underbrace{p(\theta|X)}_{\text{posterior}} = \frac{\overbrace{p(X|\theta)}^{\text{likelihood}}\ \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(X)}_{\text{normalization constant}}} = \frac{p(X|\theta)\,p(\theta)}{\int_\theta p(X|\theta)\,p(\theta)\,d\theta}$$



An Intrusion Detection System (IDS) Example

The setting: imagine that out of all TCP connections (say, millions), 1% are intrusions:
- When there is an intrusion, the probability that the system sends an alarm is 87%.
- When there is no intrusion, the probability that the system sends an alarm is 6%.

- Prior probability: 1% of connections are intrusions, so

p(θ = intrusion) = 0.01,  p(θ = no intrusion) = 0.99

- Likelihood probability:
  - Given an intrusion occurs, the probability that the system sends an alarm is 87%:

p(X = alarm|θ = intrusion) = 0.87,  p(X = no alarm|θ = intrusion) = 0.13

  - Given there is no intrusion, the probability that the system sends an alarm is 6%:

p(X = alarm|θ = no intrusion) = 0.06,  p(X = no alarm|θ = no intrusion) = 0.94



Posterior

- We are interested in the posterior probability Pr(θ|X).
- There are two possible values for the parameter θ and two possible observations X.
- Therefore, there are four rates we need to compute:
  - True Positive: when the system sends an alarm, the probability that an intrusion occurs:

Pr(θ = intrusion|X = alarm)

  - False Positive: when the system sends an alarm, the probability that there is no intrusion:

Pr(θ = no intrusion|X = alarm)

  - True Negative: when the system sends no alarm, the probability that there is no intrusion:

Pr(θ = no intrusion|X = no alarm)

  - False Negative: when the system sends no alarm, the probability that an intrusion occurs:

Pr(θ = intrusion|X = no alarm)

- Question: which are the two probabilities you'd like to maximise?



Apply Bayes Theorem in this setting

$$\Pr(\theta|X) = \frac{\Pr(X|\theta)\Pr(\theta)}{\sum_\theta \Pr(X|\theta)\Pr(\theta)} = \frac{\Pr(X|\theta)\Pr(\theta)}{\Pr(X|\theta = \text{intrusion})\Pr(\theta = \text{intrusion}) + \Pr(X|\theta = \text{no intrusion})\Pr(\theta = \text{no intrusion})}$$



Apply Bayes Theorem in this setting

True Positive rate: when the system sends an alarm, what is the probability that an intrusion occurs?

$$\Pr(\theta = \text{intrusion}|X = \text{alarm}) = \frac{\Pr(X = \text{alarm}|\theta = \text{intrusion})\Pr(\theta = \text{intrusion})}{\Pr(X = \text{alarm}|\theta = \text{intrusion})\Pr(\theta = \text{intrusion}) + \Pr(X = \text{alarm}|\theta = \text{no intrusion})\Pr(\theta = \text{no intrusion})}$$

$$= \frac{0.87 \times 0.01}{0.87 \times 0.01 + 0.06 \times 0.99} \approx 0.1278$$

False Positive rate: when the system sends an alarm, what is the probability that there is no intrusion?

$$\Pr(\theta = \text{no intrusion}|X = \text{alarm}) = \frac{\Pr(X = \text{alarm}|\theta = \text{no intrusion})\Pr(\theta = \text{no intrusion})}{\Pr(X = \text{alarm}|\theta = \text{intrusion})\Pr(\theta = \text{intrusion}) + \Pr(X = \text{alarm}|\theta = \text{no intrusion})\Pr(\theta = \text{no intrusion})}$$

$$= \frac{0.06 \times 0.99}{0.87 \times 0.01 + 0.06 \times 0.99} \approx 0.8722$$



Apply Bayes Theorem in this setting

False Negative: when the system sends no alarm, what is the probability that an intrusion occurs?

$$\Pr(\theta = \text{intrusion}|X = \text{no alarm}) = \frac{\Pr(X = \text{no alarm}|\theta = \text{intrusion})\Pr(\theta = \text{intrusion})}{\Pr(X = \text{no alarm}|\theta = \text{intrusion})\Pr(\theta = \text{intrusion}) + \Pr(X = \text{no alarm}|\theta = \text{no intrusion})\Pr(\theta = \text{no intrusion})}$$

$$= \frac{0.13 \times 0.01}{0.13 \times 0.01 + 0.94 \times 0.99} \approx 0.0014$$

True Negative: when the system sends no alarm, what is the probability that there is no intrusion?

$$\Pr(\theta = \text{no intrusion}|X = \text{no alarm}) = \frac{\Pr(X = \text{no alarm}|\theta = \text{no intrusion})\Pr(\theta = \text{no intrusion})}{\Pr(X = \text{no alarm}|\theta = \text{intrusion})\Pr(\theta = \text{intrusion}) + \Pr(X = \text{no alarm}|\theta = \text{no intrusion})\Pr(\theta = \text{no intrusion})}$$

$$= \frac{0.94 \times 0.99}{0.13 \times 0.01 + 0.94 \times 0.99} \approx 0.9986$$
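A minimal Python sketch that reproduces all four rates from the prior and the two likelihoods:

```python
# Prior and likelihoods from the IDS example.
p_intrusion = 0.01
# p(alarm | intrusion) = 0.87, so p(no alarm | intrusion)    = 0.13
# p(alarm | clean)     = 0.06, so p(no alarm | no intrusion) = 0.94

def posterior(lik_intr, lik_clean):
    """Bayes theorem for the two-valued parameter theta."""
    num = lik_intr * p_intrusion
    den = lik_intr * p_intrusion + lik_clean * (1 - p_intrusion)
    return num / den

tp = posterior(0.87, 0.06)    # Pr(intrusion | alarm)       ~ 0.1278
fp = 1 - tp                   # Pr(no intrusion | alarm)    ~ 0.8722
fn = posterior(0.13, 0.94)    # Pr(intrusion | no alarm)    ~ 0.0014
tn = 1 - fn                   # Pr(no intrusion | no alarm) ~ 0.9986

print(tp, fp, fn, tn)
```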



Statistics way to think about Posterior Inference

Posterior inference seeks the best q(θ) ∈ Q to approximate p(θ|X), such that:

$$\inf_{q(\theta)\in Q}\left\{ \mathrm{KL}(q(\theta)\,\|\,p(\theta)) - E_{\theta\sim q(\theta)}[\ln p(X|\theta)] \right\}$$

$$= \inf_{q(\theta)\in Q}\left\{ \int_\theta \ln\frac{q(\theta)}{p(\theta)}\, q(\theta)\, d\theta - \int_\theta \ln p(X|\theta)\, q(\theta)\, d\theta \right\}$$

$$= \inf_{q(\theta)\in Q} \int_\theta \left[\ln q(\theta) - \left(\ln p(\theta) + \ln p(X|\theta)\right)\right] q(\theta)\, d\theta$$

$$= \inf_{q(\theta)\in Q} \int_\theta \ln\left(\frac{q(\theta)}{p(\theta)\,p(X|\theta)}\right) q(\theta)\, d\theta$$

$$= \inf_{q(\theta)\in Q} \int_\theta \ln\left(\frac{q(\theta)}{p(X)\,p(\theta|X)}\right) q(\theta)\, d\theta$$

$$= \inf_{q(\theta)\in Q}\left\{ \mathrm{KL}(q(\theta)\,\|\,p(\theta|X)) \right\} - \ln p(X)$$

Since ln p(X) does not depend on q, minimising the original objective is equivalent to minimising KL(q(θ)‖p(θ|X)).
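A minimal discrete-grid sketch of this equivalence (assuming NumPy; the Bernoulli model and all values are hypothetical). It evaluates the objective KL(q‖prior) − E_q[ln p(X|θ)] and checks that the exact grid posterior beats a competing q:

```python
import numpy as np

theta = np.linspace(0.01, 0.99, 99)           # grid over a scalar parameter
prior = np.ones_like(theta) / len(theta)      # uniform prior on the grid
heads, n = 7, 10                              # hypothetical Bernoulli data
log_lik = heads * np.log(theta) + (n - heads) * np.log(1 - theta)

def objective(q):
    """KL(q || prior) - E_q[ln p(X|theta)] over the grid."""
    return np.sum(q * np.log(q / prior)) - np.sum(q * log_lik)

post = prior * np.exp(log_lik)
post /= post.sum()                            # exact grid posterior

other = np.ones_like(theta) / len(theta)      # some competing q
print(objective(post) < objective(other))     # True: the posterior minimises it
```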

