FIX `KernelDensity` incorrectly handling bandwidth by Charlie-XIAO · Pull Request #27971 · scikit-learn/scikit-learn · GitHub

FIX KernelDensity incorrectly handling bandwidth #27971


Open · wants to merge 24 commits into main

Conversation

@Charlie-XIAO Charlie-XIAO (Contributor) commented Dec 17, 2023

Towards #25623, #26658.

Note

See this gist for some results of this PR. The scikit-learn results should be (almost) consistent with scipy results.

By the way, though not related to this PR, the implementation of weighted KDE in scikit-learn seems to be very slow (#10803). It needs to traverse all points in a node and sum up their weights every time, which makes the tree-based implementation (which should be fast) several times slower than even scipy's naive implementation as the data size scales up.
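
For illustration, a rough timing sketch of that comparison; the data sizes, bandwidth value, and query set are arbitrary choices for this example, not taken from the PR or the linked issue:

import numpy as np
from time import perf_counter
from scipy.stats import gaussian_kde
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.random((50_000, 2))
weights = rng.random(50_000)
X_test = X[:1_000]

# Tree-based weighted KDE in scikit-learn.
tic = perf_counter()
KernelDensity(bandwidth=0.1).fit(X, sample_weight=weights).score_samples(X_test)
print(f"scikit-learn: {perf_counter() - tic:.2f}s")

# Naive weighted KDE in scipy (expects data of shape (n_features, n_samples)).
tic = perf_counter()
gaussian_kde(X.T, weights=weights).pdf(X_test.T)
print(f"scipy: {perf_counter() - tic:.2f}s")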

github-actions bot commented Dec 17, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit 6379922.

@glemaitre glemaitre self-requested a review January 9, 2024 09:38
@Charlie-XIAO Charlie-XIAO (Contributor, Author) commented Jul 21, 2024

It seems that I forgot to apply sample weights to the covariance and to the automatic bandwidth calculation, but this still doesn't pass the test that "sample weight is equivalent to repetition". The results are indeed closer than before, so there is probably something else I overlooked 🤔

Comment on lines +207 to +210
assert_allclose(
np.exp(scores_weight), np.exp(scores_ref_sampling), atol=1e-8, rtol=1e-2
)
assert_allclose(sample_weight, sample_ref_sampling, rtol=1e-2)
@Charlie-XIAO Charlie-XIAO (Contributor, Author) commented Jul 22, 2024


I relaxed the checks here quite a bit to pass the test; I am not sure if this is reasonable. I checked the scipy implementation gaussian_kde and it cannot pass the strict version either (unless the scipy implementation is also buggy 🤔).

Test code

import numpy as np
from numpy.testing import assert_allclose
from scipy.stats import gaussian_kde

n_samples = 400
size_test = 20
rng = np.random.default_rng(0)

# Integer weights, so that weighting can be compared against repetition.
X = rng.random((n_samples, 1))
weights = 1 + (10 * X.sum(axis=1)).astype(np.int8)
X_repetitions = np.repeat(X, weights, axis=0)
n_samples_test = size_test
test_points = rng.random((n_samples_test, 1))

# Weighted KDE vs. KDE fit on the explicitly repeated dataset.
kde1 = gaussian_kde(X.T, weights=weights)
scores_weight = kde1.pdf(test_points.T)
sample_weight = kde1.resample(1)
kde2 = gaussian_kde(X_repetitions.T)
scores_ref_sampling = kde2.pdf(test_points.T)
sample_ref_sampling = kde2.resample(1)

assert_allclose(scores_weight, scores_ref_sampling)
# Note: the two resample(1) calls are independent random draws, so this
# strict comparison is not expected to pass.
assert_allclose(sample_weight, sample_ref_sampling)

@glemaitre glemaitre (Member) commented
I assume that the equivalence holds in expectation, so the strict check will not work. I think this is something @snath-xoc might help us with here. She is solving a similar issue regarding this equivalence in other estimators. I assume that the KDE could be added to the list of estimators to look after.

@snath-xoc snath-xoc (Contributor) commented
Thanks for looping me in @glemaitre. @Charlie-XIAO, the test itself looks O.K. at a first glance. It may be that, due to the stochasticity, the values only converge in expectation rather than matching exactly. I will add this to my tests for these kinds of situations and see how it performs.

Member commented
Maybe check that the bandwidth_ value is the same with integer weights and repetitions before trying to sample from the models.
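
A minimal sketch of that suggested check, assuming the estimator from this PR accepts bandwidth="scott" and exposes a fitted bandwidth_ attribute; the data and weights here are placeholders:

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.random((100, 1))
weights = rng.integers(1, 5, size=100)

kde_w = KernelDensity(bandwidth="scott").fit(X, sample_weight=weights)
kde_r = KernelDensity(bandwidth="scott").fit(np.repeat(X, weights, axis=0))
# Under the desired equivalence, the estimated bandwidths should match.
assert np.isclose(kde_w.bandwidth_, kde_r.bandwidth_)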

Member commented
I don't see anything that is expected to behave stochastically at fit time in the code. In particular the KernelDensity constructor does not accept a random_state argument.

@Charlie-XIAO Charlie-XIAO (Contributor, Author) commented Sep 5, 2024

bandwidth_ is not the same if we use scott or silverman, and I don't think they should be the same by definition. If we directly specify a float bandwidth then bandwidth_ would indeed be the same, but even in the latter case both scipy and scikit-learn fail (in exactly the same way).
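
For context, a sketch of the two rule-of-thumb factors as documented for scipy.stats.gaussian_kde (background only, not code from this PR); with weights, scipy substitutes neff for n, which is why the weighted and repeated fits disagree:

def scott_factor(n, d):
    # Scott's rule of thumb, as documented for scipy.stats.gaussian_kde.
    return n ** (-1.0 / (d + 4))

def silverman_factor(n, d):
    # Silverman's rule of thumb, as documented for scipy.stats.gaussian_kde.
    return (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))

# Example: weights [1, 3, 2, 5] give n = 11 by repetition, but
# neff = 121 / 39 ≈ 3.1, so the factors (hence bandwidth_) differ.
print(scott_factor(11, 1), scott_factor(121 / 39, 1))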

Member commented
I don't agree with the way the neff thing is defined for weighted samples. It fundamentally breaks the sample weight / sample repetition equivalence, as highlighted in https://github.com/scikit-learn/scikit-learn/pull/27971/files#r1762904747. Maybe we should report this upstream in case it hasn't been already.
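
A small numeric illustration of that mismatch, using the documented scipy formula (not code from the PR): with integer weights the repeated dataset has sum(w) points, while neff is generally smaller.

import numpy as np

w = np.array([1, 3, 2, 5])
n_repeated = w.sum()                  # 11 points after explicit repetition
neff = w.sum() ** 2 / np.sum(w ** 2)  # 121 / 39 ≈ 3.10
print(n_repeated, neff)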

@Charlie-XIAO Charlie-XIAO marked this pull request as ready for review July 22, 2024 06:53
@glemaitre glemaitre self-requested a review July 23, 2024 11:27
@glemaitre glemaitre (Member) left a comment

The fact that we are in line with the scipy implementation makes me think that this is the right fix.

I'm wondering if we should have this in the changelog under the changed models section, because someone fixing the bandwidth manually will observe a change of behaviour. I don't think that we can preserve the previous behaviour.

@Charlie-XIAO Charlie-XIAO (Contributor, Author) commented Jul 23, 2024

Yes, it is impossible (and unreasonable) to retain the original behavior. I moved the changelog entry to changed models and added some additional explanations (based on previous changed-models entries). Please let me know if the wording should be changed or any other information is needed.

By the way, I just realized that I didn't look at resampling when kernel="tophat" (maybe there's a relevant test missing as well, otherwise I imagine something would fail) 😢 It needs to be fixed as part of this PR. Update: it seems that I accidentally made it work correctly already...

@glemaitre glemaitre self-requested a review September 5, 2024 06:51
@snath-xoc snath-xoc (Contributor) commented Sep 5, 2024

I believe the reason the sample-weighting tests don't pass without a higher rtol is that the sample weights are factored in using:

np.cov(..., aweights=sample_weight)  (line 251 of _kde.py)

Using fweights instead:

np.cov(..., fweights=sample_weight)

returns identical covariance matrices when weighting versus repeating samples. A simple minimal reproducer is as follows:

import numpy as np
from numpy.testing import assert_equal

n_samples = 1000
d = 3

rng = np.random.RandomState(3)  # define the RNG before drawing X
X = rng.rand(n_samples, d)
sample_weight = rng.randint(0, 3, size=X.shape[0])

X_repeated = np.repeat(X, sample_weight, axis=0)
sample_weight = sample_weight / sample_weight.sum()
cov_weighted = np.cov(X, aweights=sample_weight, rowvar=False)
cov_repeated = np.cov(X_repeated, rowvar=False)

assert_equal(cov_repeated, cov_weighted)

Which gives the assertion error:

AssertionError: 
Arrays are not equal

Mismatched elements: 9 / 9 (100%)
Max absolute difference among violations: 5.80480489e-05
Max relative difference among violations: 0.00068234
 ACTUAL: array([[ 0.085014, -0.005551,  0.002722],
       [-0.005551,  0.083199,  0.00039 ],
       [ 0.002722,  0.00039 ,  0.078405]])
 DESIRED: array([[ 0.085072, -0.005554,  0.002724],
       [-0.005554,  0.083256,  0.00039 ],
       [ 0.002724,  0.00039 ,  0.078458]])

@ogrisel and @Charlie-XIAO, we may need to think about how to pass sample weights in here: fweights is preferable, however it only accepts integer sample weights.
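
To illustrate why fweights is preferred, a sketch (not from the PR) showing that with integer weights, fweights reproduces the repeated covariance even with the default ddof=1:

import numpy as np

rng = np.random.RandomState(3)
X = rng.rand(1000, 3)
w = rng.randint(0, 3, size=X.shape[0])

cov_f = np.cov(X, fweights=w, rowvar=False)
cov_rep = np.cov(np.repeat(X, w, axis=0), rowvar=False)
np.testing.assert_allclose(cov_f, cov_rep)  # passes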

@ogrisel ogrisel (Member) commented Sep 5, 2024

@snath-xoc Interesting finding! The numpy doc is not very explicit about the mathematical definition of aweights though: https://numpy.org/doc/stable/reference/generated/numpy.cov.html

Looking at the code:

  • both kinds of weights are multiplied with one another, but only aweights interacts with the ddof / bias parameters at the end (when bias=False or ddof != 0; sketched below);
  • I think it's ok to pass floating-point weights as the frequency weights (fweights) when looking at the code.
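
For reference, a paraphrased sketch of how np.cov combines the two kinds of weights into its normalization factor, simplified from the numpy source (not the verbatim implementation):

def cov_normalization(fweights=None, aweights=None, ddof=1):
    # Combine both kinds of weights into a single weight vector
    # (this sketch assumes at least one of them is given).
    w = fweights
    if aweights is not None:
        w = aweights if w is None else w * aweights
    w_sum = w.sum()
    if ddof == 0:
        return w_sum
    if aweights is None:
        # fweights behave like repetition counts: plain ddof subtraction.
        return w_sum - ddof
    # Only aweights interact with ddof beyond a plain subtraction.
    return w_sum - ddof * (w * aweights).sum() / w_sum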

@snath-xoc snath-xoc (Contributor) commented
Agreed, the numpy doc is not very informative. I tried putting normalised weights into fweights (i.e. sample_weight / sample_weight.sum()) and it gives the error

TypeError: fweights must be integer

I think this is because fweights is meant for frequency-based weighting, which implies some repetition; I am not sure what strategy the absolute weighting (aweights) corresponds to though.

Comment on a diff excerpt (the n_effective_samples computation in _kde.py):

    sample_weight, X, dtype=np.float64, ensure_non_negative=True
)
normalized_sample_weight = sample_weight / sample_weight.sum()
n_effective_samples = 1 / np.sum(normalized_sample_weight**2)
@snath-xoc snath-xoc (Contributor) commented

@Charlie-XIAO would you have a reference for why you use the following (sorry, I may just be missing something here)?

n_effective_samples = 1 / np.sum(normalized_sample_weight**2)

@Charlie-XIAO Charlie-XIAO (Contributor, Author) commented

See the scipy doc of gaussian_kde: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html. It has the neff attribute, short for the effective number of samples, computed as

neff = sum(weights)^2 / sum(weights^2)

Since normalized_sample_weight sums to 1, this reduces to 1 / np.sum(normalized_sample_weight**2), which is what the code uses.

@ogrisel ogrisel (Member) commented Sep 17, 2024

This is breaking the usual "reweighting is repeating" semantics for sample_weight that we would like to enforce in all scikit-learn estimators.

import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
import numpy as np

rng = np.random.default_rng(seed=42)
n_samples = 30
data = np.concatenate(
    [
        rng.standard_normal(size=n_samples // 2) - 1,
        3 * rng.standard_normal(size=n_samples // 2) + 5,
    ]
)
weights = rng.integers(low=0, high=5, size=n_samples)

# bw_method = "scott"
bw_method = "silverman"  # similar discrepancy with "scott"
kde_weights = gaussian_kde(
    data, bw_method=bw_method, weights=weights.astype(np.float64)
)
kde_repetitions = gaussian_kde(np.repeat(data, weights), bw_method=bw_method)

x = np.linspace(data.min(), data.max(), 100)
fig, (ax0, ax1) = plt.subplots(nrows=2, ncols=1, figsize=(8, 8))
ax0.plot(x, kde_weights.evaluate(x), label="weights")
ax0.plot(x, kde_repetitions.evaluate(x), label="repetitions")
ax0.legend()

bins = np.linspace(data.min(), data.max(), 30)
ax1.hist(data, weights=weights, bins=bins, alpha=0.5, label="weights")
ax1.hist(np.repeat(data, weights), bins=bins, alpha=0.5, label="repetitions")
ax1.legend()

[image: the "weights" and "repetitions" KDE curves (top) and histograms (bottom) visibly diverge]

@ogrisel ogrisel (Member) commented Sep 17, 2024

Note that in scipy's Gaussian KDE implementation, if you use a fixed bandwidth (e.g. bw_method = 0.4 in the code above), then the repetitions vs weights equivalence seems to be (almost) restored when n_samples is large enough (here 30):

[image: with a fixed bandwidth, the "weights" and "repetitions" KDE curves nearly coincide]

@Charlie-XIAO Charlie-XIAO (Contributor, Author) commented Sep 17, 2024

Scipy does not specify ddof=0, and uses neff = sum(weights)^2 / sum(weights^2) (for scott and silverman), which is essentially the same as what is currently in this PR, so in #27971 (comment) I think the small deviation comes from ddof?

Regarding scott and silverman, I briefly searched through the references in the scipy docs but did not find anything useful for the weighted case, so I don't know where the scipy formula really comes from...

In https://github.com/Charlie-XIAO/scikit-learn/pull/3/files#r1761412299 you changed it to neff = sum(weights); I wonder what the intuition behind that is as well 🤔

@ogrisel ogrisel (Member) commented Sep 17, 2024

I tried patching scipy's implementation: passing bias=True / ddof=0 to its internal call to np.cov fixes the remaining (smaller) discrepancy between weighted and repeated data points (as it does for scikit-learn's implementation).
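
A sketch of the np.cov behaviour being relied on here (illustrative, not the actual scipy patch): with normalized float weights, aweights matches the repeated covariance when bias=True, i.e. ddof=0:

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(500, 2)
w = rng.randint(1, 4, size=X.shape[0])

cov_rep = np.cov(np.repeat(X, w, axis=0), rowvar=False, bias=True)
cov_aw = np.cov(X, aweights=w / w.sum(), rowvar=False, bias=True)
np.testing.assert_allclose(cov_rep, cov_aw)  # passes with bias=True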

@Charlie-XIAO Charlie-XIAO (Contributor, Author) commented
Indeed, we would need to use aweights because we wouldn't want to force users to provide integer weights... And I don't think fweights vs. aweights makes a real difference in the final result?

@ogrisel ogrisel (Member) commented Sep 16, 2024

> Indeed, we would need to use aweights because we wouldn't want to force users to provide integer weights... And I don't think fweights vs. aweights makes a real difference in the final result?

As @snath-xoc found out, we don't want aweights to interact with the ddof term, to avoid breaking the usual "sample_weight is equivalent to repetitions" property. One possible way to deal with this is to pass ddof=0.

Here is a PR that shows that we can use a much smaller rtol as a result: Charlie-XIAO#3

The alternative would be to implement our own numpy.cov with the fweights scheme but without enforcing an integer dtype for sample_weight.

@ogrisel ogrisel (Member) commented Sep 20, 2024

> And I don't think fweights vs. aweights makes a real difference in the final result?

It does, because fweights and aweights do not interact in the same way with ddof=1 / bias=False: the fweights scheme has the weighted/repeated equivalence property we want (which is what #29818 expects, similar to what I checked in #27971 (comment) for gaussian_kde and plt.hist), even with ddof=1, while aweights only gives this property when ddof=0 / bias=True.

If we want to keep bias=False, we could swap the np.cov implementation with our own, where we would implement the fweights scheme while allowing for floating-point values.
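
A minimal sketch of that alternative, as a hypothetical helper (not the PR's code) implementing the fweights semantics for floating-point sample_weight:

import numpy as np

def weighted_cov(X, sample_weight, ddof=1):
    # Covariance treating sample_weight as (possibly fractional) counts.
    w = np.asarray(sample_weight, dtype=np.float64)
    mean = np.average(X, axis=0, weights=w)
    Xc = X - mean
    # fweights-style normalization: total weight minus ddof.
    return (Xc * w[:, None]).T @ Xc / (w.sum() - ddof)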

@Charlie-XIAO Charlie-XIAO (Contributor, Author) commented
I now see that ddof=0 is the correct way to go. We still need to deal with the neff thing though. The current version (i.e. the scipy version) would not mathematically preserve the equivalence between integer weights and repetition, I believe. sum(weights) does not seem to work either (it breaks under constant rescaling of the weights, I think?). I wonder if you've figured out the mathematically correct way, @ogrisel?
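
A quick numeric check of that scaling concern (illustrative only): neff = sum(w)^2 / sum(w^2) is invariant to rescaling the weights, whereas sum(w) is not:

import numpy as np

w = np.array([1.0, 3.0, 2.0, 5.0])
for scale in (1.0, 0.1, 10.0):
    ws = scale * w
    # The sum changes with the scale; neff stays the same (~3.10).
    print(ws.sum(), ws.sum() ** 2 / np.sum(ws ** 2))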

@ogrisel ogrisel (Member) commented Sep 21, 2024

Not yet. There are still failures in my side PR when I enable the silverman and scott methods in the sample weight test. But I will be busy with PyData Paris until the end of next week.

@Charlie-XIAO Charlie-XIAO (Contributor, Author) commented
That's fine, I'm also busy with something else at the moment 🫠 When I have time I will look into the internals again (there should be something we can prove to be mathematically correct, I believe...).

Projects
Status: In Progress

4 participants