
Machine Learning CS 4641-7641

Density Estimation
Mahdi Roozbahani
Georgia Tech

These slides are inspired by Le Song.


Why do we love exponential terms?
Outline

• Overview
• Parametric Density Estimation
• Nonparametric Density Estimation

Continuous variable
• Continuous probability distribution
• Probability density function (density values): ∫ f_X(x) dx = 1
• Example: temperature (a real number); Gaussian distribution

Discrete variable
• Discrete probability distribution
• Probability mass function (probability values): Σ_{x ∈ A} f_X(x) = 1
• Example: coin flip (an integer); Bernoulli distribution
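As a quick sanity check on these two normalization conditions, here is a small Python sketch (Python is used for these asides even though the deck's later code is MATLAB; the Bernoulli bias θ = 0.3 is a hypothetical choice): the PMF sums to 1 over {0, 1}, and the Gaussian PDF integrates to 1, approximated by a Riemann sum.

```python
import math

# Bernoulli PMF (discrete): probabilities over {0, 1} sum to 1
theta = 0.3  # hypothetical bias
pmf = {0: 1 - theta, 1: theta}
assert abs(sum(pmf.values()) - 1.0) < 1e-12

# Gaussian PDF (continuous): integrate numerically over a wide range
mu, sigma = 0.0, 1.0

def gaussian_pdf(x):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# simple Riemann sum over [-10, 10), which captures essentially all the mass
dx = 0.001
n_steps = 20000
total = sum(gaussian_pdf(-10 + i * dx) * dx for i in range(n_steps))
print(round(total, 4))  # → 1.0
```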
Why Density Estimation?

Assess the density of observing a particular data point.

Example: Test Scores

A histogram is an estimate of the probability distribution of a continuous variable.
Parametric Density Estimation

Example: a coin flip with outcomes x ∈ {0, 1}:
1 → Heads
0 → Tails
θ: probability of each possible outcome.

For d-dimensional data, a Gaussian is parameterized by a mean μ ∈ R^d and a covariance Σ ∈ R^{d×d}.
Nonparametric Density Estimation

Parametric vs. Nonparametric Density Estimation
Outline

• Overview
• Parametric Density Estimation
• Nonparametric Density Estimation

Estimating Parametric Models

Given n data points X = {x₁, x₂, …, xₙ} and a parametric model p(x|θ), using the parameters we can estimate the density of each data point:

L(θ|X) = p(X|θ) = ∏_{i=1}^{n} p(xᵢ|θ)
Example Problem

X = {x₁, x₂, …, xₙ} = {1, 0, 1, …, 0},  xᵢ ∈ {0, 1}

L(θ|xᵢ) = p(xᵢ|θ) = θ^{xᵢ} (1 − θ)^{1−xᵢ}
MLE for Biased Coin

l(θ|X) = Σᵢ log p(xᵢ|θ) = Σᵢ [xᵢ log θ + (1 − xᵢ) log(1 − θ)]

Setting dl/dθ = 0 gives θ̂ = (1/n) Σᵢ xᵢ
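The closed form θ̂ = (1/n) Σᵢ xᵢ can be checked numerically; a Python sketch with a small hypothetical coin-flip sample, comparing the closed form against a brute-force grid search over the log-likelihood:

```python
import math

# Hypothetical coin-flip sample (1 = Heads, 0 = Tails): 7 heads out of 10
x = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
n = len(x)

# MLE in closed form: fraction of heads
theta_hat = sum(x) / n
print(theta_hat)  # → 0.7

def log_likelihood(theta):
    # Bernoulli log-likelihood from the slide above
    return sum(xi * math.log(theta) + (1 - xi) * math.log(1 - theta) for xi in x)

# grid search over candidate theta values; the maximizer matches the closed form
candidates = [i / 100 for i in range(1, 100)]
best = max(candidates, key=log_likelihood)
print(best)  # → 0.7
```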
Estimating Gaussian Distributions

Given n data points X = {x₁, x₂, …, xₙ},  xᵢ ∈ R.

Density of a data point:

p(xᵢ|μ, σ) = (1/√(2πσ²)) exp(−(xᵢ − μ)² / (2σ²))
Estimating Gaussian Distributions

Likelihood of the whole dataset:

L(μ, σ|X) = ∏_{i=1}^{n} (1/√(2πσ²)) exp(−(xᵢ − μ)² / (2σ²))
MLE for Gaussian Distribution

l(θ|X) = l(μ, σ|X) = Σ_{i=1}^{n} log p(xᵢ|μ, σ) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^{n} (xᵢ − μ)²
MLE for Gaussian Distribution

Maximize l(μ, σ|X) with respect to μ and σ²:

∂l/∂μ = (1/σ²) Σ_{i=1}^{n} (xᵢ − μ) = 0  ⟹  μ̂ = (1/n) Σ_{i=1}^{n} xᵢ

∂l/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^{n} (xᵢ − μ)² = 0  ⟹  σ̂² = (1/n) Σ_{i=1}^{n} (xᵢ − μ̂)²
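A Python sketch checking the two closed forms on synthetic data (the true parameters μ = 5, σ = 2 and the sample size are arbitrary choices for illustration):

```python
import math
import random

random.seed(0)
# Hypothetical sample from N(mu=5, sigma=2)
x = [random.gauss(5.0, 2.0) for _ in range(10000)]
n = len(x)

# MLE closed forms from the derivation above
mu_hat = sum(x) / n
sigma2_hat = sum((xi - mu_hat) ** 2 for xi in x) / n  # divide by n, not n-1

print(round(mu_hat, 1), round(math.sqrt(sigma2_hat), 1))  # ≈ 5.0 2.0
```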
Example

Outline

• Overview
• Parametric Density Estimation
• Nonparametric Density Estimation

Can be used for:


• Visualization
• Classification
• Regression

Example: Test Scores

1-D Histogram

Given N data points X = {x₁, x₂, …, x_N}, xᵢ ∈ [0, 1), partition [0, 1) into M equal bins:

B₁ = [0, 1/M), B₂ = [1/M, 2/M), …, B_l = [(l−1)/M, l/M), …, B_M = [(M−1)/M, 1)

For a new test data point x that falls in bin B_l, the estimated probability that x is drawn from the distribution p(x) is:

p̂(x) = (M/N) Σ_{i=1}^{N} 1(xᵢ ∈ B_l) = (number of points in bin B_l, c_l) / (total number of data points × bin width)

Does ∫ p̂(x) dx = 1?
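A minimal Python implementation of this estimator on hypothetical uniform data, confirming numerically that the piecewise-constant density integrates to 1:

```python
import random

random.seed(1)
N, M = 1000, 10                           # data points, bins
x = [random.random() for _ in range(N)]   # samples in [0, 1)

# count points per bin B_l = [(l-1)/M, l/M)
counts = [0] * M
for xi in x:
    counts[int(xi * M)] += 1

# histogram density: p(x) = c_l / (N * bin_width) = M * c_l / N for x in B_l
def p_hat(x_new):
    return M * counts[int(x_new * M)] / N

# each bin contributes c_l/N, so the total integral is 1
integral = sum(p_hat((l + 0.5) / M) * (1 / M) for l in range(M))
print(round(integral, 6))  # → 1.0
```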
Why is Histogram Valid?

∫₀¹ p̂(x) dx = ∫₀¹ (M/N) Σ_{i=1}^{N} 1(xᵢ ∈ B_l) dx

= ∫₀^{1/M} (M/N) Σᵢ 1(xᵢ ∈ B₁) dx + ∫_{1/M}^{2/M} (M/N) Σᵢ 1(xᵢ ∈ B₂) dx + … + ∫_{(M−1)/M}^{1} (M/N) Σᵢ 1(xᵢ ∈ B_M) dx

= ∫₀^{1/M} (M/N) c₁ dx + ∫_{1/M}^{2/M} (M/N) c₂ dx + … + ∫_{(M−1)/M}^{1} (M/N) c_M dx

= Σ_{l=1}^{M} (M/N) c_l (l/M − (l−1)/M) = Σ_{l=1}^{M} c_l / N = 1
Higher-Dimensional Histogram

Given n data points X = {x₁, x₂, …, xₙ}, xᵢ ∈ [0, 1)^d, partition the space into M^d bins of side length 1/M.

Two-dimensional data:

M = 10 (number of bins in each dimension)

M² = 100 (total number of bins for two-dimensional data)
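The exponential growth of M^d is easy to see numerically; a short Python loop (M = 10 bins per dimension and N = 1000 data points are hypothetical values, chosen to show how quickly the average points-per-bin collapses):

```python
M, N = 10, 1000  # hypothetical: 10 bins per dimension, 1000 data points
for d in [1, 2, 3, 5, 10]:
    bins = M ** d
    # dimension, total bins, average points per bin
    print(d, bins, N / bins)
```

With d = 10 there are 10¹⁰ bins for only 10³ points, so nearly every bin is empty; this is the curse of dimensionality for histograms.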
Output Depends on Where You Put the Bins

(Figure: the same data binned with two different bin placements; the estimated density at the same query point changes from 0.5 to 0.25.)
Kernel Density Estimation

p̂(x) = (1/N) Σ_{i=1}^{N} (1/h) K((x_l − xᵢ)/h),  where x_l is a gridline point, the xᵢ are the N data points, and h is the bandwidth.
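A direct Python transcription of this estimator (a sketch; the data points and bandwidth below are hypothetical):

```python
import math

def gaussian_kernel(u):
    # K(u) = exp(-u^2/2) / sqrt(2*pi)
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x_query, data, h):
    """p(x) = (1/N) * sum_i (1/h) * K((x - x_i) / h)"""
    N = len(data)
    return sum(gaussian_kernel((x_query - xi) / h) for xi in data) / (N * h)

data = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]  # hypothetical sample
print(round(kde(0.0, data, h=1.0), 4))    # ≈ 0.1082
```

Nearby points (−0.4, −1.3) dominate the estimate at x = 0, while distant points (5.1, 6.2) contribute almost nothing, which is exactly the local-averaging behavior the kernel provides.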
Smoothing Kernel Functions

Example

p̂(x) = (1/N) Σ_{i=1}^{N} (1/h) K((x_l − xᵢ)/h)
Effect of the Kernel Bandwidth (h)

p̂(x) = (1/N) Σ_{i=1}^{N} (1/h) K((x_l − xᵢ)/h)
Visual Example

50 data points are given to us (plotted on the range −15 to 15).
Visual Example

Let's build a 20-bin histogram of the data frequencies.
Visual Example

Let's create 200 uniform gridlines (x_l) to get a smoother density function, or simply apply the estimator at each data point.
For each linearly spaced gridline x_l, calculate the Gaussian kernel value over the given 50 points:

p̂(x) = (1/N) Σ_{i=1}^{N} (1/h) K(uᵢ),  uᵢ = (x_l − xᵢ)/h,  K(uᵢ) = (1/√(2π)) e^(−uᵢ²/2)
Stack the kernel values into a 200×50 matrix: one row per linearly spaced gridline (x_l = −15 at the top through x_l = 15 at the bottom), one column per data point. As an example, the row of kernel heights for the line at x_l = 0 reads:

0.0000 0.0001 0.0002 0.0003 0.0004 … 0.0 0.0 0.0 0.0 … 0.0004 0.0003 0.0002 0.0001 0.0000

The density at a gridline is the mean of its row, e.g. P(x_l = 0) = mean(L₀), matching

p̂(x) = (1/N) Σ_{i=1}^{N} (1/h) K((x_l − xᵢ)/h)
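The same row-wise computation can be sketched in plain Python: build the 200×50 matrix of scaled kernel values, then take the mean of each row (the synthetic two-cluster sample below stands in for the 50 points shown on the slides):

```python
import math
import random

random.seed(2)
# hypothetical 50-point sample: two clusters, mimicking the slide's data
data = [random.gauss(-7, 2) if i < 25 else random.gauss(7, 2) for i in range(50)]
grid = [-15 + 30 * l / 199 for l in range(200)]  # 200 linearly spaced gridlines
h = 1.0

# 200x50 matrix: entry (l, i) = K((x_l - x_i)/h) / h
K = [[math.exp(-0.5 * ((xl - xi) / h) ** 2) / (h * math.sqrt(2 * math.pi))
      for xi in data] for xl in grid]

# density at each gridline = mean of its row
px = [sum(row) / len(row) for row in K]

# sanity check: the estimate integrates to approximately 1 over the grid
dx = grid[1] - grid[0]
print(round(sum(p * dx for p in px), 2))
```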
Visual Example

The resulting density estimate, based on the Gaussian kernel estimator (interactive example).
Numerical Example (for σ = 1)

% Data: there are 200 data points (roughly between -13 and 13)
randn('seed',1) % used for reproducibility
x = [randn(100,1)-10; randn(100,1)+10]; % two Normals mixed (GROUND TRUTH)

Silverman's rule of thumb for the bandwidth:

h = (4σ̂⁵ / (3N))^(1/5) ≈ 1.06 σ̂ N^(−1/5)

h = std(x)*(4/3/numel(x))^(1/5); % bandwidth estimated by Silverman's Rule of Thumb

% Let's apply density estimation over 1000 linearly spaced points (x_l)
xl = linspace(-25,+25,1000); % gridlines
% Let's generate the "TRUE" density over all gridlines given the "Ground Truth" information
truepdf_firstnormal = exp(-.5*(xl-10).^2)/sqrt(2*pi);
truepdf_secondnormal = exp(-.5*(xl+10).^2)/sqrt(2*pi);
truepdf = truepdf_firstnormal/2 + truepdf_secondnormal/2;
% divided by 2, because each component contributes half of the mixture
plot(xl,truepdf) % plot the true density
% Let's calculate the Gaussian kernel density at each linearly spaced point over the 200 given data points

p̂(x) = (1/N) Σ_{i=1}^{N} (1/h) K(uᵢ),  uᵢ = (x_l − xᵢ)/h

Gaussian kernel: K(uᵢ) = (1/√(2π)) e^(−uᵢ²/2)

for l=1:numel(xl) % loop over gridlines (x_l)
    u = (xl(l) - x)./h; % length of u is 200
    Ku = exp(-.5*u.^2)/sqrt(2*pi); % Gaussian kernel values K(u)
    Ku = Ku./h; % scale by 1/h
    px(l) = mean(Ku); % estimated density at gridline x_l
end
plot(xl,truepdf) % truepdf
hold on
plot(xl,px) % px
Two-Dimensional Examples

Choosing the Kernel Bandwidth

h ≈ 1.06 σ̂ N^(−1/5)
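Both forms of Silverman's rule give essentially the same bandwidth, since (4/3)^(1/5) ≈ 1.06; a Python sketch on synthetic standard-normal data (sample size and seed are arbitrary):

```python
import math
import random

random.seed(3)
x = [random.gauss(0.0, 1.0) for _ in range(500)]
N = len(x)

# sample standard deviation (MLE version, dividing by N)
mean = sum(x) / N
sigma_hat = math.sqrt(sum((xi - mean) ** 2 for xi in x) / N)

# Silverman's rule of thumb: h = (4*sigma^5 / (3N))^(1/5) ≈ 1.06 * sigma * N^(-1/5)
h_exact = (4 * sigma_hat ** 5 / (3 * N)) ** 0.2
h_approx = 1.06 * sigma_hat * N ** -0.2
print(round(h_exact, 3), round(h_approx, 3))  # the two values agree closely
```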
Nonparametric vs. Parametric
Summary

• Parametric density estimation
    Maximum likelihood estimation
    Different parametric forms

• Nonparametric density estimation
    Histogram
    Kernel density estimation
