Machine Learning CS 4641-7641
Density Estimation
Mahdi Roozbahani
Georgia Tech
These slides are inspired by Le Song's.
Why do we love exponential terms?
Outline
• Overview
• Parametric Density Estimation
• Nonparametric Density Estimation
Continuous variable:
• Continuous probability distribution
• Probability density function: ∫ f_X(x) dx = 1
• f_X(x) is a density value
• Example: temperature (a real number), modeled by a Gaussian distribution

Discrete variable:
• Discrete probability distribution
• Probability mass function: Σ_{x∈A} f_X(x) = 1
• f_X(x) is a probability value
• Example: coin flip (an integer), modeled by a Bernoulli distribution
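As a quick check of the two normalization rules above, here is a small Python sketch (the θ, μ, and σ values are arbitrary, chosen only for illustration): a Bernoulli pmf sums to 1 over {0, 1}, and a Gaussian pdf integrates to 1 over the real line (approximated numerically).

```python
import math

# Bernoulli pmf: f_X(x) = theta^x * (1 - theta)^(1 - x), x in A = {0, 1}
theta = 0.7  # arbitrary coin bias, for illustration
pmf_total = sum(theta**x * (1 - theta)**(1 - x) for x in [0, 1])

# Gaussian pdf: f_X(x) = exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)
mu, sigma = 20.0, 5.0  # e.g., a temperature distribution (made-up numbers)
def gaussian_pdf(x):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# A Riemann sum over mu +/- 10 sigma approximates the integral over R
step = 0.01
xs = [mu - 10 * sigma + k * step for k in range(int(20 * sigma / step) + 1)]
pdf_total = step * sum(gaussian_pdf(x) for x in xs)

print(pmf_total, round(pdf_total, 4))
```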
Why Density Estimation?
Assess the density of seeing a particular data point.
Example: Test Scores
A histogram is an estimate of the probability distribution of a continuous variable.
Parametric Density Estimation
• Bernoulli (coin flip): x ∈ {0, 1}, where 1 → Head and 0 → Tails; parameter θ = probability of each possible outcome
• Multivariate Gaussian: x ∈ R^d; parameters are a mean vector (d values) and a covariance matrix (d × d values)
Nonparametric Density Estimation
Parametric vs. Nonparametric Density Estimation
Outline
• Overview
• Parametric Density Estimation
• Nonparametric Density Estimation
Estimating Parametric Models
Given n i.i.d. data points X = {x_1, x_2, …, x_n}, the parameters θ let us evaluate the density of each data point, and the likelihood of the whole dataset is the product:
p(X | θ) = ∏_{i=1}^n p(x_i | θ)
Example Problem
X = {x_1, x_2, …, x_n} = {1, 0, 1, …, 0}, x_i ∈ {0, 1}
L(θ | x_i) = p(x_i | θ) = θ^{x_i} (1 − θ)^{1 − x_i}
MLE for Biased Coin
L(θ | X) = ∏_{i=1}^n θ^{x_i} (1 − θ)^{1 − x_i}
l(θ | X) = log L(θ | X) = Σ_{i=1}^n [ x_i log θ + (1 − x_i) log(1 − θ) ]
Setting dl/dθ = 0:
(Σ_i x_i)/θ − (n − Σ_i x_i)/(1 − θ) = 0  ⇒  θ̂ = (1/n) Σ_{i=1}^n x_i
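A small Python sketch (the data values are hypothetical) checking this result: the closed-form MLE θ̂ = (1/n) Σ x_i agrees with a brute-force grid search over the log-likelihood.

```python
import math

x = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # hypothetical coin flips: 1 = Head, 0 = Tails
n = len(x)

def log_likelihood(theta):
    # l(theta | X) = sum_i [x_i log(theta) + (1 - x_i) log(1 - theta)]
    return sum(xi * math.log(theta) + (1 - xi) * math.log(1 - theta) for xi in x)

theta_hat = sum(x) / n  # closed-form MLE: fraction of heads

# Brute-force check: the grid argmax lands on the same value
grid = [k / 1000 for k in range(1, 1000)]
theta_grid = max(grid, key=log_likelihood)
print(theta_hat, theta_grid)  # 0.7 0.7
```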
Estimating Gaussian Distributions
X = {x_1, x_2, …, x_n}, x_i ∈ R
Density of a data point:
p(x_i | μ, σ) = (1/√(2πσ²)) exp(−(x_i − μ)² / (2σ²))
Estimating Gaussian Distributions
Likelihood of the whole dataset:
L(μ, σ | X) = ∏_{i=1}^n (1/√(2πσ²)) exp(−(x_i − μ)² / (2σ²))
MLE for Gaussian Distribution
Log-likelihood:
l(θ | X) = l(μ, σ | X) = log ∏_{i=1}^n (1/√(2πσ²)) exp(−(x_i − μ)² / (2σ²))
= −(n/2) log(2πσ²) − Σ_{i=1}^n (x_i − μ)² / (2σ²)
MLE for Gaussian Distribution
Maximize l(μ, σ | X) with respect to μ and σ:
∂l/∂μ = Σ_{i=1}^n (x_i − μ)/σ² = 0  ⇒  μ̂ = (1/n) Σ_{i=1}^n x_i
∂l/∂σ = −n/σ + Σ_{i=1}^n (x_i − μ)²/σ³ = 0  ⇒  σ̂² = (1/n) Σ_{i=1}^n (x_i − μ̂)²
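The two estimators can be sanity-checked in Python on synthetic data (the true μ = 5 and σ = 2 are arbitrary choices for this sketch):

```python
import math
import random

random.seed(0)
x = [random.gauss(5, 2) for _ in range(10000)]  # synthetic sample, true mu = 5, sigma = 2
n = len(x)

mu_hat = sum(x) / n                                   # MLE of the mean
sigma2_hat = sum((xi - mu_hat) ** 2 for xi in x) / n  # MLE of the variance (1/n, not 1/(n-1))

print(round(mu_hat, 1), round(math.sqrt(sigma2_hat), 1))
```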
Example
Outline
• Overview
• Parametric Density Estimation
• Nonparametric Density Estimation
Can be used for:
• Visualization
• Classification
• Regression
Example: Test Scores
1-D Histogram
Given n data points X = {x_1, x_2, …, x_n}, x_i ∈ [0, 1).
Divide [0, 1) into M equal-width bins:
B_1 = [0, 1/M), B_2 = [1/M, 2/M), …, B_l = [(l−1)/M, l/M), …, B_M = [(M−1)/M, 1)
Each bin has width 1/M.
For a new test data point x that falls in bin B_l, the estimated density is
p̂(x) = c_l / (n × 1/M), where c_l = Σ_{i=1}^n 1(x_i ∈ B_l) is the number of points in bin B_l
i.e., p̂(x) = (number of points in bin B_l) / (total number of data points × bin width)
Does this integrate to 1, i.e., ∫ p̂(x) dx = 1?
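The bin-count estimator above can be sketched in Python (n, M, and the uniform data are arbitrary choices for illustration):

```python
import random

random.seed(1)
n, M = 1000, 10                      # n data points in [0, 1), M equal-width bins
x = [random.random() for _ in range(n)]

# c_l = number of points in bin B_l = [(l-1)/M, l/M)
counts = [0] * M
for xi in x:
    counts[int(xi * M)] += 1         # xi in [0, 1), so int(xi * M) is in 0..M-1

def p_hat(point):
    # density estimate = count in the point's bin / (n * bin width)
    return counts[int(point * M)] / (n * (1 / M))

# The estimate integrates to 1: sum over bins of (density * bin width)
total = sum(p_hat((l + 0.5) / M) * (1 / M) for l in range(M))
print(total)
```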
Why is Histogram Valid?
∫₀¹ p̂(x) dx = ∫₀¹ (M/n) Σ_{i=1}^n 1(x_i ∈ B_l) dx
= ∫₀^{1/M} (M/n) c_1 dx + ∫_{1/M}^{2/M} (M/n) c_2 dx + … + ∫_{(M−1)/M}^1 (M/n) c_M dx
= (M/n) c_1 (1/M) + (M/n) c_2 (1/M) + … + (M/n) c_M (1/M)
= (1/n) Σ_{l=1}^M c_l = n/n = 1
Higher-Dimensional Histogram
Given n data points X = {x_1, x_2, …, x_n}, x_i ∈ [0, 1)^d.
Dividing each dimension into M bins gives M^d bins, each of volume (1/M)^d.
Two-dimensional data:
M = 10 (number of bins in each dimension)
M² = 100 (total number of bins for two-dimensional data)
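A Python sketch of the 2-D case (M = 10 per axis, so M² = 100 bins; the uniform data are made up): the density estimate now divides by the bin volume (1/M)^d, and it still integrates to 1.

```python
import random

random.seed(2)
n, M, d = 5000, 10, 2                      # 2-D data in [0, 1)^2
data = [(random.random(), random.random()) for _ in range(n)]

counts = [[0] * M for _ in range(M)]       # M x M = M^d bins
for a, b in data:
    counts[int(a * M)][int(b * M)] += 1

bin_volume = (1 / M) ** d                  # each bin has volume (1/M)^d

def p_hat(a, b):
    return counts[int(a * M)][int(b * M)] / (n * bin_volume)

# Sum over bins of (density * bin volume) recovers 1
total = sum(counts[i][j] / (n * bin_volume) * bin_volume
            for i in range(M) for j in range(M))
print(M ** d, total)  # 100 bins; total integrates to 1
```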
Output Depends on Where You Put the Bins
Kernel Density Estimation
Given N data points, place the estimate on gridlines and smooth each data point x_i with a kernel K of bandwidth h:
p̂(x_l) = (1/N) Σ_{i=1}^N (1/h) K((x_l − x_i)/h),  where x_l = gridline
Smoothing Kernel Functions
Example
p̂(x_l) = (1/N) Σ_{i=1}^N (1/h) K((x_l − x_i)/h)
Effect of the Kernel Bandwidth (h)
p̂(x_l) = (1/N) Σ_{i=1}^N (1/h) K((x_l − x_i)/h)
Visual Example
We are given 50 data points.
Visual Example
Let's build a histogram with 20 bins.
Visual Example
Let's create 200 uniform gridlines (x_l) to get a smoother density function, or simply evaluate the estimate at each data point.
For each linearly spaced gridline x_l, calculate the Gaussian kernel value over the given 50 points:
p̂(x_l) = (1/N) Σ_{i=1}^N (1/h) K(u_i),  u_i = (x_l − x_i)/h,  K(u_i) = (1/√(2π)) e^{−u_i²/2}
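The procedure can be sketched in Python (the 50 data points are drawn from a made-up distribution, and h = 1 is an arbitrary bandwidth for illustration):

```python
import math
import random

random.seed(3)
h = 1.0                                             # bandwidth (arbitrary choice here)
data = [random.gauss(0, 3) for _ in range(50)]      # the 50 given points (synthetic)
grid = [-15 + 30 * l / 199 for l in range(200)]     # 200 uniform gridlines on [-15, 15]

def K(u):
    # Gaussian kernel: K(u) = exp(-u^2 / 2) / sqrt(2 pi)
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

# One row per gridline x_l, one column per data point x_i: a 200 x 50 table
heights = [[K((xl - xi) / h) for xi in data] for xl in grid]

# Density at each gridline: mean of its row, divided by h
density = [sum(row) / (len(data) * h) for row in heights]
```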
The kernel heights form a 200 × 50 matrix: one row per linearly spaced gridline x_l (from x_l = −15 to x_l = 15), one column per data point. As an example, the row of kernel heights for the gridline at 0 looks like:
0.0000 0.0001 0.0002 0.0003 0.0004 … 0.0 0.0 0.0 0.0 … 0.0004 0.0003 0.0002 0.0001 0.0000
The density at x_l = 0 is the mean of that row (with each entry divided by h):
p̂(x_l = 0) = mean(L_0)
In general: p̂(x_l) = (1/N) Σ_{i=1}^N (1/h) K((x_l − x_i)/h)
Visual Example
Density estimate based on the Gaussian kernel estimator.
Numerical Example (for σ = 1)
% Data: there are 200 data points, roughly in (-13, 13)
randn('seed',1) % Used for reproducibility
x = [randn(100,1)-10; randn(100,1)+10]; % Two Normals mixed (GROUND TRUTH)
Silverman's rule-of-thumb bandwidth:
h = (4σ̂⁵ / (3N))^(1/5) ≈ 1.06 σ̂ N^(−1/5)
h = std(x)*(4/3/numel(x))^(1/5); % Bandwidth estimated by Silverman's Rule of Thumb
% Apply density estimation over 1000 linearly spaced points (xl)
xl = linspace(-25,+25,1000); % gridlines
% Generate the "TRUE" density over the gridlines given the "Ground Truth" information
truepdf_firstnormal = exp(-.5*(xl-10).^2)/sqrt(2*pi);
truepdf_secondnormal = exp(-.5*(xl+10).^2)/sqrt(2*pi);
truepdf = truepdf_firstnormal/2 + truepdf_secondnormal/2;
% divided by 2, because each mixture component has weight 1/2
plot(xl,truepdf) % Plot true density (against xl, not x)
% Calculate the Gaussian kernel density at each linearly spaced point
% over the 200 given data points:
p̂(x_l) = (1/N) Σ_{i=1}^N (1/h) K(u_i),  u_i = (x_l − x_i)/h
Gaussian kernel: K(u_i) = (1/√(2π)) e^{−u_i²/2}
for l=1:numel(xl) % loop over grid lines (xl); xl is a row vector, so use numel, not size(xl,1)
u = (xl(l) - x)./h; % length of u is 200
Ku = exp(-.5*u.^2)/sqrt(2*pi); % Gaussian kernel heights
Ku = Ku./h;
px(l) = mean(Ku); % density estimate at gridline xl(l)
end
plot(xl,truepdf) % true density
hold on
plot(xl,px) % kernel density estimate
legend('truepdf','px')
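For readers not using MATLAB, here is a rough Python translation of the example above (same mixture and Silverman bandwidth; the seed and grid are illustrative, and the random stream will not match MATLAB's):

```python
import math
import random

random.seed(1)
# Two mixed Normals, as in the MATLAB example (ground truth)
x = [random.gauss(-10, 1) for _ in range(100)] + [random.gauss(10, 1) for _ in range(100)]
N = len(x)

# Silverman's rule of thumb: h = (4 sigma^5 / (3N))^(1/5) ~ 1.06 sigma N^(-1/5)
mean = sum(x) / N
sigma = math.sqrt(sum((xi - mean) ** 2 for xi in x) / (N - 1))  # sample std, like MATLAB std(x)
h = sigma * (4 / (3 * N)) ** (1 / 5)

xl = [-25 + 50 * i / 999 for i in range(1000)]  # 1000 gridlines on [-25, 25]

def K(u):
    # Gaussian kernel
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

# Kernel density estimate at each gridline
px = [sum(K((g - xi) / h) / h for xi in x) / N for g in xl]
```

Note that Silverman's rule assumes roughly Gaussian (unimodal) data; on this bimodal mixture the sample std is large, so h comes out large and the estimate is oversmoothed relative to the true density.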
Two-Dimensional Examples
Choosing the Kernel Bandwidth
h ≈ 1.06 σ̂ N^(−1/5)
Nonparametric vs. Parametric
Summary
• Parametric density estimation
Maximum likelihood estimation
Different parametric forms
• Nonparametric density estimation
Histogram
Kernel density estimation