gh-115532: Add kernel density estimation to the statistics module (gh-115863) · python/cpython@6d34eb0 · GitHub

Commit 6d34eb0

gh-115532: Add kernel density estimation to the statistics module (gh-115863)
1 parent 6a3236f commit 6d34eb0

File tree

5 files changed, +285 -41 lines changed


Doc/library/statistics.rst

Lines changed: 49 additions & 40 deletions
@@ -76,6 +76,7 @@ or sample.
 :func:`fmean`            Fast, floating point arithmetic mean, with optional weighting.
 :func:`geometric_mean`   Geometric mean of data.
 :func:`harmonic_mean`    Harmonic mean of data.
+:func:`kde`              Estimate the probability density distribution of the data.
 :func:`median`           Median (middle value) of data.
 :func:`median_low`       Low median of data.
 :func:`median_high`      High median of data.
@@ -259,6 +260,54 @@ However, for reading convenience, most of the examples show sorted sequences.
 .. versionchanged:: 3.10
    Added support for *weights*.

+
+.. function:: kde(data, h, kernel='normal')
+
+   `Kernel Density Estimation (KDE)
+   <https://www.itm-conferences.org/articles/itmconf/pdf/2018/08/itmconf_sam2018_00037.pdf>`_:
+   Create a continuous probability density function from discrete samples.
+
+   The basic idea is to smooth the data using `a kernel function
+   <https://en.wikipedia.org/wiki/Kernel_(statistics)>`_
+   to help draw inferences about a population from a sample.
+
+   The degree of smoothing is controlled by the scaling parameter *h*,
+   which is called the bandwidth.  Smaller values emphasize local
+   features while larger values give smoother results.
+
+   The *kernel* determines the relative weights of the sample data
+   points.  Generally, the choice of kernel shape does not matter
+   as much as the more influential bandwidth smoothing parameter.
+
+   Kernels that give some weight to every sample point include
+   *normal* or *gauss*, *logistic*, and *sigmoid*.
+
+   Kernels that only give weight to sample points within the bandwidth
+   include *rectangular* or *uniform*, *triangular*, *parabolic* or
+   *epanechnikov*, *quartic* or *biweight*, *triweight*, and *cosine*.
+
+   A :exc:`StatisticsError` will be raised if the *data* sequence is empty.
+
+   `Wikipedia has an example
+   <https://en.wikipedia.org/wiki/Kernel_density_estimation#Example>`_
+   where we can use :func:`kde` to generate and plot a probability
+   density function estimated from a small sample:
+
+   .. doctest::
+
+      >>> sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
+      >>> f_hat = kde(sample, h=1.5)
+      >>> xarr = [i/100 for i in range(-750, 1100)]
+      >>> yarr = [f_hat(x) for x in xarr]
+
+   The points in ``xarr`` and ``yarr`` can be used to make a PDF plot:
+
+   .. image:: kde_example.png
+      :alt: Scatter plot of the estimated probability density function.
+
+   .. versionadded:: 3.13
+
+
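The new API leaves bandwidth selection entirely to the caller. A common starting point, not part of this commit, is Silverman's rule of thumb; the sketch below (`silverman_h` is a hypothetical helper name) computes it using only functions already in the statistics module:

```python
from statistics import quantiles, stdev

def silverman_h(sample):
    # Silverman's rule-of-thumb bandwidth for roughly bell-shaped data:
    #     h = 0.9 * min(sigma, IQR / 1.34) * n ** (-1/5)
    # Hypothetical helper -- not part of the statistics module.
    n = len(sample)
    sigma = stdev(sample)
    q1, _, q3 = quantiles(sample, n=4)   # default 'exclusive' method
    return 0.9 * min(sigma, (q3 - q1) / 1.34) * n ** -0.2

sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
h = silverman_h(sample)   # roughly 2.2 for this sample
```

The resulting `h` can then be passed straight to `kde(sample, h)`; smaller hand-picked values will show more local detail, as the docs note.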
 .. function:: median(data)

    Return the median (middle value) of numeric data, using the common "mean of
@@ -1095,46 +1144,6 @@ The final prediction goes to the largest posterior. This is known as the
     'female'


-Kernel density estimation
-*************************
-
-It is possible to estimate a continuous probability density function
-from a fixed number of discrete samples.
-
-The basic idea is to smooth the data using `a kernel function such as a
-normal distribution, triangular distribution, or uniform distribution
-<https://en.wikipedia.org/wiki/Kernel_(statistics)#Kernel_functions_in_common_use>`_.
-The degree of smoothing is controlled by a scaling parameter, ``h``,
-which is called the *bandwidth*.
-
-.. testcode::
-
-   def kde_normal(sample, h):
-       "Create a continuous probability density function from a sample."
-       # Smooth the sample with a normal distribution kernel scaled by h.
-       kernel_h = NormalDist(0.0, h).pdf
-       n = len(sample)
-       def pdf(x):
-           return sum(kernel_h(x - x_i) for x_i in sample) / n
-       return pdf
-
-`Wikipedia has an example
-<https://en.wikipedia.org/wiki/Kernel_density_estimation#Example>`_
-where we can use the ``kde_normal()`` recipe to generate and plot
-a probability density function estimated from a small sample:
-
-.. doctest::
-
-   >>> sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
-   >>> f_hat = kde_normal(sample, h=1.5)
-   >>> xarr = [i/100 for i in range(-750, 1100)]
-   >>> yarr = [f_hat(x) for x in xarr]
-
-The points in ``xarr`` and ``yarr`` can be used to make a PDF plot:
-
-.. image:: kde_example.png
-   :alt: Scatter plot of the estimated probability density function.
-
 ..
    # These modelines must appear within the last ten lines of the file.
    kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;
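The ``kde_normal()`` recipe removed above and the new ``kde(..., kernel='normal')`` compute the same estimate. A quick standalone check, re-implementing both (the second as a sketch mirroring the commit's formula, since ``statistics.kde`` only exists on 3.13+):

```python
from math import exp, sqrt, tau
from statistics import NormalDist

def kde_normal(sample, h):
    "The removed recipe: smooth the sample with a normal kernel scaled by h."
    kernel_h = NormalDist(0.0, h).pdf
    n = len(sample)
    def pdf(x):
        return sum(kernel_h(x - x_i) for x_i in sample) / n
    return pdf

def kde_new_style(sample, h):
    # Mirrors the new kde(..., kernel='normal'): sum K((x - x_i) / h)
    # over the sample, divided by n * h, with K the standard normal pdf.
    c = 1 / sqrt(tau)                       # tau = 2 * pi
    K = lambda t: c * exp(-0.5 * t * t)
    n = len(sample)
    def pdf(x):
        return sum(K((x - x_i) / h) for x_i in sample) / (n * h)
    return pdf

sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
f_old = kde_normal(sample, h=1.5)
f_new = kde_new_style(sample, h=1.5)
```

Both forms agree because ``NormalDist(0, h).pdf(x - x_i)`` equals ``K((x - x_i) / h) / h`` algebraically.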

Doc/whatsnew/3.13.rst

Lines changed: 8 additions & 0 deletions

@@ -476,6 +476,14 @@ sqlite3
     for filtering database objects to dump.
     (Contributed by Mariusz Felisiak in :gh:`91602`.)

+statistics
+----------
+
+* Add :func:`statistics.kde` for kernel density estimation.
+  This makes it possible to estimate a continuous probability density function
+  from a fixed number of discrete samples.
+  (Contributed by Raymond Hettinger in :gh:`115863`.)
+
 subprocess
 ----------

Lib/statistics.py

Lines changed: 167 additions & 1 deletion
@@ -112,6 +112,7 @@
     'fmean',
     'geometric_mean',
     'harmonic_mean',
+    'kde',
     'linear_regression',
     'mean',
     'median',
@@ -137,7 +138,7 @@
 from itertools import count, groupby, repeat
 from bisect import bisect_left, bisect_right
 from math import hypot, sqrt, fabs, exp, erf, tau, log, fsum, sumprod
-from math import isfinite, isinf
+from math import isfinite, isinf, pi, cos, cosh
 from functools import reduce
 from operator import itemgetter
 from collections import Counter, namedtuple, defaultdict
@@ -802,6 +803,171 @@ def multimode(data):
     return [value for value, count in counts.items() if count == maxcount]

+def kde(data, h, kernel='normal'):
+    """Kernel Density Estimation:  Create a continuous probability
+    density function from discrete samples.
+
+    The basic idea is to smooth the data using a kernel function
+    to help draw inferences about a population from a sample.
+
+    The degree of smoothing is controlled by the scaling parameter h
+    which is called the bandwidth.  Smaller values emphasize local
+    features while larger values give smoother results.
+
+    The kernel determines the relative weights of the sample data
+    points.  Generally, the choice of kernel shape does not matter
+    as much as the more influential bandwidth smoothing parameter.
+
+    Kernels that give some weight to every sample point:
+
+       normal or gauss
+       logistic
+       sigmoid
+
+    Kernels that only give weight to sample points within
+    the bandwidth:
+
+       rectangular or uniform
+       triangular
+       parabolic or epanechnikov
+       quartic or biweight
+       triweight
+       cosine
+
+    A StatisticsError will be raised if the data sequence is empty.
+
+    Example
+    -------
+
+    Given a sample of six data points, construct a continuous
+    function that estimates the underlying probability density:
+
+        >>> sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
+        >>> f_hat = kde(sample, h=1.5)
+
+    Compute the area under the curve:
+
+        >>> sum(f_hat(x) for x in range(-20, 20))
+        1.0
+
+    Plot the estimated probability density function at
+    evenly spaced points from -6 to 10:
+
+        >>> for x in range(-6, 11):
+        ...     density = f_hat(x)
+        ...     plot = ' ' * int(density * 400) + 'x'
+        ...     print(f'{x:2}: {density:.3f} {plot}')
+        ...
+        -6: 0.002 x
+        -5: 0.009    x
+        -4: 0.031             x
+        -3: 0.070                             x
+        -2: 0.111                                             x
+        -1: 0.125                                                   x
+         0: 0.110                                             x
+         1: 0.086                                   x
+         2: 0.068                            x
+         3: 0.059                        x
+         4: 0.066                           x
+         5: 0.082                                 x
+         6: 0.082                                 x
+         7: 0.058                        x
+         8: 0.028            x
+         9: 0.009    x
+        10: 0.002 x
+
+    References
+    ----------
+
+    Kernel density estimation and its application:
+    https://www.itm-conferences.org/articles/itmconf/pdf/2018/08/itmconf_sam2018_00037.pdf
+
+    Kernel functions in common use:
+    https://en.wikipedia.org/wiki/Kernel_(statistics)#Kernel_functions_in_common_use
+
+    Interactive graphical demonstration and exploration:
+    https://demonstrations.wolfram.com/KernelDensityEstimation/
+
+    """
+
+    n = len(data)
+    if not n:
+        raise StatisticsError('Empty data sequence')
+
+    if not isinstance(data[0], (int, float)):
+        raise TypeError('Data sequence must contain ints or floats')
+
+    if h <= 0.0:
+        raise StatisticsError(f'Bandwidth h must be positive, not {h=!r}')
+
+    match kernel:
+
+        case 'normal' | 'gauss':
+            c = 1 / sqrt(2 * pi)
+            K = lambda t: c * exp(-1/2 * t * t)
+            support = None
+
+        case 'logistic':
+            # 1.0 / (exp(t) + 2.0 + exp(-t))
+            K = lambda t: 1/2 / (1.0 + cosh(t))
+            support = None
+
+        case 'sigmoid':
+            # (2/pi) / (exp(t) + exp(-t))
+            c = 1 / pi
+            K = lambda t: c / cosh(t)
+            support = None
+
+        case 'rectangular' | 'uniform':
+            K = lambda t: 1/2
+            support = 1.0
+
+        case 'triangular':
+            K = lambda t: 1.0 - abs(t)
+            support = 1.0
+
+        case 'parabolic' | 'epanechnikov':
+            K = lambda t: 3/4 * (1.0 - t * t)
+            support = 1.0
+
+        case 'quartic' | 'biweight':
+            K = lambda t: 15/16 * (1.0 - t * t) ** 2
+            support = 1.0
+
+        case 'triweight':
+            K = lambda t: 35/32 * (1.0 - t * t) ** 3
+            support = 1.0
+
+        case 'cosine':
+            c1 = pi / 4
+            c2 = pi / 2
+            K = lambda t: c1 * cos(c2 * t)
+            support = 1.0
+
+        case _:
+            raise StatisticsError(f'Unknown kernel name: {kernel!r}')
+
+    if support is None:
+
+        def pdf(x):
+            return sum(K((x - x_i) / h) for x_i in data) / (n * h)
+
+    else:
+
+        sample = sorted(data)
+        bandwidth = h * support
+
+        def pdf(x):
+            i = bisect_left(sample, x - bandwidth)
+            j = bisect_right(sample, x + bandwidth)
+            supported = sample[i : j]
+            return sum(K((x - x_i) / h) for x_i in supported) / (n * h)
+
+    pdf.__doc__ = f'PDF estimate with {kernel=!r} and {h=!r}'
+
+    return pdf
+
+
 # Notes on methods for computing quantiles
 # ----------------------------------------
 #
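For compact-support kernels, the ``bisect_left``/``bisect_right`` calls above restrict the sum to sample points within ``x ± h * support`` without changing the result. A minimal standalone sketch of that optimization with the triangular kernel (support = 1.0), comparing the truncated sum against a naive full sum in which the kernel is clamped to zero outside its support:

```python
from bisect import bisect_left, bisect_right

def pdf_naive(data, h, x):
    # Full sum over every point; clamp the triangular kernel to zero
    # outside |t| <= 1 so out-of-support points contribute nothing.
    K = lambda t: max(0.0, 1.0 - abs(t))
    return sum(K((x - x_i) / h) for x_i in data) / (len(data) * h)

def pdf_fast(data, h, x):
    # As in the commit: sort once, then use bisect to sum only the
    # points inside the closed interval [x - h, x + h].
    sample = sorted(data)
    K = lambda t: 1.0 - abs(t)
    i = bisect_left(sample, x - h)
    j = bisect_right(sample, x + h)
    return sum(K((x - x_i) / h) for x_i in sample[i:j]) / (len(sample) * h)

data = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
```

The truncation turns each evaluation from O(n) into O(log n + k), where k is the number of points within the bandwidth, which is why the commit sorts the data once up front for these kernels.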

Lib/test/test_statistics.py

Lines changed: 60 additions & 0 deletions
@@ -2353,6 +2353,66 @@ def test_mixed_int_and_float(self):
         self.assertAlmostEqual(actual_mean, expected_mean, places=5)


+class TestKDE(unittest.TestCase):
+
+    def test_kde(self):
+        kde = statistics.kde
+        StatisticsError = statistics.StatisticsError
+
+        kernels = ['normal', 'gauss', 'logistic', 'sigmoid', 'rectangular',
+                   'uniform', 'triangular', 'parabolic', 'epanechnikov',
+                   'quartic', 'biweight', 'triweight', 'cosine']
+
+        sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
+
+        # The approximate integral of a PDF should be close to 1.0
+
+        def integrate(func, low, high, steps=10_000):
+            "Numeric approximation of a definite function integral."
+            dx = (high - low) / steps
+            midpoints = (low + (i + 1/2) * dx for i in range(steps))
+            return sum(map(func, midpoints)) * dx
+
+        for kernel in kernels:
+            with self.subTest(kernel=kernel):
+                f_hat = kde(sample, h=1.5, kernel=kernel)
+                area = integrate(f_hat, -20, 20)
+                self.assertAlmostEqual(area, 1.0, places=4)
+
+        # Check error cases
+
+        with self.assertRaises(StatisticsError):
+            kde([], h=1.0)                          # Empty dataset
+        with self.assertRaises(TypeError):
+            kde(['abc', 'def'], 1.5)                # Non-numeric data
+        with self.assertRaises(TypeError):
+            kde(iter(sample), 1.5)                  # Data is not a sequence
+        with self.assertRaises(StatisticsError):
+            kde(sample, h=0.0)                      # Zero bandwidth
+        with self.assertRaises(StatisticsError):
+            kde(sample, h=-1.0)                     # Negative bandwidth
+        with self.assertRaises(TypeError):
+            kde(sample, h='str')                    # Wrong bandwidth type
+        with self.assertRaises(StatisticsError):
+            kde(sample, h=1.0, kernel='bogus')      # Invalid kernel
+
+        # Test name and docstring of the generated function
+
+        h = 1.5
+        kernel = 'cosine'
+        f_hat = kde(sample, h, kernel)
+        self.assertEqual(f_hat.__name__, 'pdf')
+        self.assertIn(kernel, f_hat.__doc__)
+        self.assertIn(str(h), f_hat.__doc__)
+
+        # Test closed interval for the support boundaries.
+        # In particular, 'uniform' should be non-zero at the boundaries.
+
+        f_hat = kde([0], 1.0, 'uniform')
+        self.assertEqual(f_hat(-1.0), 1/2)
+        self.assertEqual(f_hat(1.0), 1/2)
+
+
 class TestQuantiles(unittest.TestCase):

     def test_specific_cases(self):
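The midpoint-rule ``integrate()`` helper in the test above can also confirm the normalizing constants used for the compact-support kernels in the implementation: each kernel function must integrate to 1 over its support interval [-1, 1]. A standalone sketch, with the kernel formulas copied from the commit:

```python
from math import cos, pi

def integrate(K, low, high, steps=100_000):
    # Midpoint-rule approximation, mirroring the test's integrate() helper.
    dx = (high - low) / steps
    return sum(K(low + (i + 0.5) * dx) for i in range(steps)) * dx

# Compact-support kernels from the commit, all defined on [-1, 1].
kernels = {
    'rectangular': lambda t: 1/2,
    'triangular':  lambda t: 1.0 - abs(t),
    'parabolic':   lambda t: 3/4 * (1.0 - t * t),
    'quartic':     lambda t: 15/16 * (1.0 - t * t) ** 2,
    'triweight':   lambda t: 35/32 * (1.0 - t * t) ** 3,
    'cosine':      lambda t: pi/4 * cos(pi/2 * t),
}
areas = {name: integrate(K, -1.0, 1.0) for name, K in kernels.items()}
```

Each area should come out at 1.0 to within the quadrature error, which is why the resulting ``pdf`` (the sum of n scaled kernels divided by ``n * h``) integrates to 1.0 as the tests verify.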
Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+Add kernel density estimation to the statistics module.

0 commit comments