FIX `KernelDensity` incorrectly handling bandwidth by Charlie-XIAO · Pull Request #27971 · scikit-learn/scikit-learn · GitHub

FIX KernelDensity incorrectly handling bandwidth #27971


Open · wants to merge 24 commits into main

Conversation

@Charlie-XIAO Charlie-XIAO (Contributor) commented Dec 17, 2023

Towards #25623, #26658.

Note

See this gist for some results of this PR. The scikit-learn results should be (almost) consistent with scipy results.

By the way, though not related to this PR, the implementation of weighted KDE in scikit-learn seems to be very slow (#10803). It needs to traverse all points in a node and sum up their weights every time, which makes the tree-based implementation (which should be fast) several times slower than even scipy's naive implementation as the data size scales up.
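
For illustration, a rough timing sketch of that comparison; the data sizes, bandwidth value, and query set are arbitrary choices for this example, not taken from the PR or the linked issue:

import numpy as np
from time import perf_counter
from scipy.stats import gaussian_kde
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.random((50_000, 2))
weights = rng.random(50_000)
X_test = X[:1_000]

# Tree-based weighted KDE in scikit-learn.
tic = perf_counter()
KernelDensity(bandwidth=0.1).fit(X, sample_weight=weights).score_samples(X_test)
print(f"scikit-learn: {perf_counter() - tic:.2f}s")

# Naive weighted KDE in scipy (expects data of shape (n_features, n_samples)).
tic = perf_counter()
gaussian_kde(X.T, weights=weights).pdf(X_test.T)
print(f"scipy: {perf_counter() - tic:.2f}s")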

github-actions bot commented Dec 17, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit 6379922.

@glemaitre glemaitre self-requested a review January 9, 2024 09:38
@Charlie-XIAO Charlie-XIAO (Contributor, Author) commented Jul 21, 2024

It seems that I forgot to apply sample weights to the covariance and to the automatic bandwidth calculation, but this still doesn't pass the test that "sample weight is equivalent to repetition". The results are indeed closer than before, so there is probably something else I overlooked 🤔

Comment on lines +207 to +210
assert_allclose(
np.exp(scores_weight), np.exp(scores_ref_sampling), atol=1e-8, rtol=1e-2
)
assert_allclose(sample_weight, sample_ref_sampling, rtol=1e-2)
@Charlie-XIAO Charlie-XIAO (Contributor, Author) commented Jul 22, 2024


I relaxed the checks here quite a bit to pass the test; I am not sure if this is reasonable. I checked the scipy implementation gaussian_kde and it cannot pass the strict version either (unless the scipy implementation is also buggy 🤔).

Test code

import numpy as np
from numpy.testing import assert_allclose
from scipy.stats import gaussian_kde

n_samples = 400
size_test = 20
rng = np.random.default_rng(0)

# Integer weights, so that weighting can be compared against repetition.
X = rng.random((n_samples, 1))
weights = 1 + (10 * X.sum(axis=1)).astype(np.int8)
X_repetitions = np.repeat(X, weights, axis=0)
n_samples_test = size_test
test_points = rng.random((n_samples_test, 1))

# Weighted KDE vs. KDE fit on the explicitly repeated dataset.
kde1 = gaussian_kde(X.T, weights=weights)
scores_weight = kde1.pdf(test_points.T)
sample_weight = kde1.resample(1)
kde2 = gaussian_kde(X_repetitions.T)
scores_ref_sampling = kde2.pdf(test_points.T)
sample_ref_sampling = kde2.resample(1)

assert_allclose(scores_weight, scores_ref_sampling)
# Note: the two resample(1) calls are independent random draws, so this
# strict comparison is not expected to pass.
assert_allclose(sample_weight, sample_ref_sampling)

@glemaitre glemaitre (Member) commented
I assume that the equivalence holds in expectation, so the strict check will not work. I think this is something @snath-xoc might help us with here. She is solving a similar issue regarding this equivalence in other estimators. I assume that the KDE could be added to the list of estimators to look after.

@snath-xoc snath-xoc (Contributor) commented
Thanks for looping me in @glemaitre. @Charlie-XIAO, the test itself looks O.K. at a first glance. It may be that, due to the stochasticity, the values only converge in expectation rather than matching exactly. I will add this to my tests for these kinds of situations and see how it performs.

Member commented
Maybe check that the bandwidth_ value is the same with integer weights and repetitions before trying to sample from the models.
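
A minimal sketch of that suggested check, assuming the estimator from this PR accepts bandwidth="scott" and exposes a fitted bandwidth_ attribute; the data and weights here are placeholders:

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.random((100, 1))
weights = rng.integers(1, 5, size=100)

kde_w = KernelDensity(bandwidth="scott").fit(X, sample_weight=weights)
kde_r = KernelDensity(bandwidth="scott").fit(np.repeat(X, weights, axis=0))
# Under the desired equivalence, the estimated bandwidths should match.
assert np.isclose(kde_w.bandwidth_, kde_r.bandwidth_)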

Member commented
I don't see anything that is expected to behave stochastically at fit time in the code. In particular the KernelDensity constructor does not accept a random_state argument.

@Charlie-XIAO Charlie-XIAO (Contributor, Author) commented Sep 5, 2024

bandwidth_ is not the same if we use scott or silverman, and I don't think they should be the same by definition. If we directly specify a float bandwidth then bandwidth_ would indeed be the same, but even in the latter case both scipy and scikit-learn fail (in exactly the same way).
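
For context, a sketch of the two rule-of-thumb factors as documented for scipy.stats.gaussian_kde (background only, not code from this PR); with weights, scipy substitutes neff for n, which is why the weighted and repeated fits disagree:

def scott_factor(n, d):
    # Scott's rule of thumb, as documented for scipy.stats.gaussian_kde.
    return n ** (-1.0 / (d + 4))

def silverman_factor(n, d):
    # Silverman's rule of thumb, as documented for scipy.stats.gaussian_kde.
    return (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))

# Example: weights [1, 3, 2, 5] give n = 11 by repetition, but
# neff = 121 / 39 ≈ 3.1, so the factors (hence bandwidth_) differ.
print(scott_factor(11, 1), scott_factor(121 / 39, 1))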

Member commented
I don't agree with the way the neff thing is defined for weighted samples. It fundamentally breaks the sample weight / sample repetition equivalence, as highlighted in https://github.com/scikit-learn/scikit-learn/pull/27971/files#r1762904747. Maybe we should report this upstream in case it hasn't been already.
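
A small numeric illustration of that mismatch, using the documented scipy formula (not code from the PR): with integer weights the repeated dataset has sum(w) points, while neff is generally smaller.

import numpy as np

w = np.array([1, 3, 2, 5])
n_repeated = w.sum()                  # 11 points after explicit repetition
neff = w.sum() ** 2 / np.sum(w ** 2)  # 121 / 39 ≈ 3.10
print(n_repeated, neff)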

@Charlie-XIAO Charlie-XIAO marked this pull request as ready for review July 22, 2024 06:53
@glemaitre glemaitre self-requested a review July 23, 2024 11:27
@glemaitre glemaitre (Member) left a comment

The fact that we are in line with the scipy implementation makes me think that this is the right fix.

I'm wondering if we should have this in the changelog under the changed models section, because someone fixing the bandwidth manually will observe a change of behaviour. I don't think that we can preserve the previous behaviour.

@Charlie-XIAO Charlie-XIAO (Contributor, Author) commented Jul 23, 2024

Yes, it is impossible (and unreasonable) to retain the original behavior. I moved the changelog entry to changed models and added some additional explanations (based on previous changed-models entries). Please let me know if the wording should be changed or any other information is needed.

By the way, I just realized that I didn't look at resampling when kernel="tophat" (maybe there's a relevant test missing as well, otherwise I imagine something would fail) 😢 It needs to be fixed as part of this PR. Update: it seems that I accidentally made it work correctly already...

@glemaitre glemaitre self-requested a review September 5, 2024 06:51
@snath-xoc snath-xoc (Contributor) commented Sep 5, 2024

I believe the reason the sample-weighting tests don't pass without a higher rtol is that the sample weights are factored in using:

np.cov(..., aweights=sample_weight)  (line 251 of _kde.py)

Using fweights instead:

np.cov(..., fweights=sample_weight)

returns identical covariance matrices when weighting versus repeating samples. A simple minimal reproducer is as follows:

import numpy as np
from numpy.testing import assert_equal

n_samples = 1000
d = 3

rng = np.random.RandomState(3)  # define the RNG before drawing X
X = rng.rand(n_samples, d)
sample_weight = rng.randint(0, 3, size=X.shape[0])

X_repeated = np.repeat(X, sample_weight, axis=0)
sample_weight = sample_weight / sample_weight.sum()
cov_weighted = np.cov(X, aweights=sample_weight, rowvar=False)
cov_repeated = np.cov(X_repeated, rowvar=False)

assert_equal(cov_repeated, cov_weighted)

Which gives the assertion error:

AssertionError: 
Arrays are not equal

Mismatched elements: 9 / 9 (100%)
Max absolute difference among violations: 5.80480489e-05
Max relative difference among violations: 0.00068234
 ACTUAL: array([[ 0.085014, -0.005551,  0.002722],
       [-0.005551,  0.083199,  0.00039 ],
       [ 0.002722,  0.00039 ,  0.078405]])
 DESIRED: array([[ 0.085072, -0.005554,  0.002724],
       [-0.005554,  0.083256,  0.00039 ],
       [ 0.002724,  0.00039 ,  0.078458]])

@ogrisel and @Charlie-XIAO, we may need to think about how to pass sample weights in here: fweights is preferable, however it only accepts integer sample weights.
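
To illustrate why fweights is preferred, a sketch (not from the PR) showing that with integer weights, fweights reproduces the repeated covariance even with the default ddof=1:

import numpy as np

rng = np.random.RandomState(3)
X = rng.rand(1000, 3)
w = rng.randint(0, 3, size=X.shape[0])

cov_f = np.cov(X, fweights=w, rowvar=False)
cov_rep = np.cov(np.repeat(X, w, axis=0), rowvar=False)
np.testing.assert_allclose(cov_f, cov_rep)  # passes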

@ogrisel ogrisel (Member) commented Sep 5, 2024

@snath-xoc Interesting finding! The numpy doc is not very explicit about the mathematical definition of aweights though: https://numpy.org/doc/stable/reference/generated/numpy.cov.html

Looking at the code:

  • both kinds of weights are multiplied with one another, but only aweights interacts with the ddof / bias parameters at the end (when bias=False or ddof != 0; sketched below);
  • I think it's ok to pass floating-point weights as the frequency weights (fweights) when looking at the code.
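
For reference, a paraphrased sketch of how np.cov combines the two kinds of weights into its normalization factor, simplified from the numpy source (not the verbatim implementation):

def cov_normalization(fweights=None, aweights=None, ddof=1):
    # Combine both kinds of weights into a single weight vector
    # (this sketch assumes at least one of them is given).
    w = fweights
    if aweights is not None:
        w = aweights if w is None else w * aweights
    w_sum = w.sum()
    if ddof == 0:
        return w_sum
    if aweights is None:
        # fweights behave like repetition counts: plain ddof subtraction.
        return w_sum - ddof
    # Only aweights interact with ddof beyond a plain subtraction.
    return w_sum - ddof * (w * aweights).sum() / w_sum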

@snath-xoc snath-xoc (Contributor) commented
Agreed, the numpy doc is not very informative. I tried putting normalised weights into fweights (i.e. sample_weight / sample_weight.sum()) and it gives the error

TypeError: fweights must be integer

I think this is because fweights is meant for frequency-based weighting, which implies some repetition; I am not sure what strategy the absolute weighting (aweights) corresponds to though.

Comment on a diff excerpt (the n_effective_samples computation in _kde.py):

    sample_weight, X, dtype=np.float64, ensure_non_negative=True
)
normalized_sample_weight = sample_weight / sample_weight.sum()
n_effective_samples = 1 / np.sum(normalized_sample_weight**2)
@snath-xoc snath-xoc (Contributor) commented

@Charlie-XIAO would you have a reference for why you use the following (sorry, I may just be missing something here)?

n_effective_samples = 1 / np.sum(normalized_sample_weight**2)

@Charlie-XIAO Charlie-XIAO (Contributor, Author) commented

See the scipy doc of gaussian_kde: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html. It has the neff attribute, short for the effective number of samples, computed as

neff = sum(weights)^2 / sum(weights^2)

Since normalized_sample_weight sums to 1, this reduces to 1 / np.sum(normalized_sample_weight**2), which is what the code uses.

@ogrisel ogrisel (Member) commented Sep 17, 2024

This is breaking the usual "reweighting is repeating" semantics for sample_weight that we would like to enforce in all scikit-learn estimators.

import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
import numpy as np

rng = np.random.default_rng(seed=42)
n_samples = 30
data = np.concatenate(
    [
        rng.standard_normal(size=n_samples // 2) - 1,
        3 * rng.standard_normal(size=n_samples // 2) + 5,
    ]
)
weights = rng.integers(low=0, high=5, size=n_samples)

# bw_method = "scott"
bw_method = "silverman"  # similar discrepancy with "scott"
kde_weights = gaussian_kde(
    data, bw_method=bw_method, weights=weights.astype(np.float64)
)
kde_repetitions = gaussian_kde(np.repeat(data, weights), bw_method=bw_method)

x = np.linspace(data.min(), data.max(), 100)
fig, (ax0, ax1) = plt.subplots(nrows=2, ncols=1, figsize=(8, 8))
ax0.plot(x, kde_weights.evaluate(x), label="weights")
ax0.plot(x, kde_repetitions.evaluate(x), label="repetitions")
ax0.legend()

bins = np.linspace(data.min(), data.max(), 30)
ax1.hist(data, weights=weights, bins=bins, alpha=0.5, label="weights")
ax1.hist(np.repeat(data, weights), bins=bins, alpha=0.5, label="repetitions")
ax1.legend()

[image: the "weights" and "repetitions" KDE curves (top) and histograms (bottom) visibly diverge]

@ogrisel ogrisel (Member) commented Sep 17, 2024

Note that in scipy's Gaussian KDE implementation, if you use a fixed bandwidth (e.g. bw_method = 0.4 in the code above), then the repetitions vs weights equivalence seems to be (almost) restored when n_samples is large enough (here 30):

[image: with a fixed bandwidth, the "weights" and "repetitions" KDE curves nearly coincide]

@Charlie-XIAO Charlie-XIAO (Contributor, Author) commented Sep 17, 2024

Scipy does not specify ddof=0, and uses neff = sum(weights)^2 / sum(weights^2) (for scott and silverman), which is essentially the same as what is currently in this PR, so in #27971 (comment) I think the small deviation comes from ddof?

Regarding scott and silverman, I briefly searched through the references in the scipy docs but did not find anything useful for the weighted case, so I don't know where the scipy formula really comes from...

In https://github.com/Charlie-XIAO/scikit-learn/pull/3/files#r1761412299 you changed it to neff = sum(weights); I wonder what the intuition behind that is as well 🤔

@ogrisel ogrisel (Member) commented Sep 17, 2024

I tried patching scipy's implementation: passing bias=True / ddof=0 to its internal call to np.cov fixes the remaining (smaller) discrepancy between weighted and repeated data points (as it does for scikit-learn's implementation).
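
A sketch of the np.cov behaviour being relied on here (illustrative, not the actual scipy patch): with normalized float weights, aweights matches the repeated covariance when bias=True, i.e. ddof=0:

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(500, 2)
w = rng.randint(1, 4, size=X.shape[0])

cov_rep = np.cov(np.repeat(X, w, axis=0), rowvar=False, bias=True)
cov_aw = np.cov(X, aweights=w / w.sum(), rowvar=False, bias=True)
np.testing.assert_allclose(cov_rep, cov_aw)  # passes with bias=True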

@Charlie-XIAO Charlie-XIAO (Contributor, Author) commented
Indeed, we would need to use aweights because we wouldn't want to force users to provide integer weights... And I don't think fweights vs. aweights makes a real difference in the final result?

@ogrisel ogrisel (Member) commented Sep 16, 2024

> Indeed, we would need to use aweights because we wouldn't want to force users to provide integer weights... And I don't think fweights vs. aweights makes a real difference in the final result?

As @snath-xoc found out, we don't want aweights to interact with the ddof term, to avoid breaking the usual "sample_weight is equivalent to repetitions" property. One possible way to deal with this is to pass ddof=0.

Here is a PR that shows that we can use a much smaller rtol as a result: Charlie-XIAO#3

The alternative would be to implement our own numpy.cov with the fweights scheme but without enforcing an integer dtype for sample_weight.

@ogrisel ogrisel (Member) commented Sep 20, 2024

> And I don't think fweights vs. aweights makes a real difference in the final result?

It does, because fweights and aweights do not interact in the same way with ddof=1 / bias=False: the fweights scheme has the weighted/repeated equivalence property we want (which is what #29818 expects, similar to what I checked in #27971 (comment) for gaussian_kde and plt.hist), even with ddof=1, while aweights only gives this property when ddof=0 / bias=True.

If we want to keep bias=False, we could swap the np.cov implementation with our own, where we would implement the fweights scheme while allowing for floating-point values.
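
A minimal sketch of that alternative, as a hypothetical helper (not the PR's code) implementing the fweights semantics for floating-point sample_weight:

import numpy as np

def weighted_cov(X, sample_weight, ddof=1):
    # Covariance treating sample_weight as (possibly fractional) counts.
    w = np.asarray(sample_weight, dtype=np.float64)
    mean = np.average(X, axis=0, weights=w)
    Xc = X - mean
    # fweights-style normalization: total weight minus ddof.
    return (Xc * w[:, None]).T @ Xc / (w.sum() - ddof)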

@Charlie-XIAO Charlie-XIAO (Contributor, Author) commented
I now see that ddof=0 is the correct way to go. We still need to deal with the neff thing though. The current version (i.e. the scipy version) would not mathematically preserve the equivalence between integer weights and repetition, I believe. sum(weights) does not seem to work either (it breaks under constant rescaling of the weights, I think?). I wonder if you've figured out the mathematically correct way, @ogrisel?
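
A quick numeric check of that scaling concern (illustrative only): neff = sum(w)^2 / sum(w^2) is invariant to rescaling the weights, whereas sum(w) is not:

import numpy as np

w = np.array([1.0, 3.0, 2.0, 5.0])
for scale in (1.0, 0.1, 10.0):
    ws = scale * w
    # The sum changes with the scale; neff stays the same (~3.10).
    print(ws.sum(), ws.sum() ** 2 / np.sum(ws ** 2))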

@ogrisel ogrisel (Member) commented Sep 21, 2024

Not yet. There are still failures in my side PR when I enable the silverman and scott methods in the sample weight test. But I will be busy with PyData Paris until the end of next week.

@Charlie-XIAO Charlie-XIAO (Contributor, Author) commented
That's fine, I'm also busy with something else at the moment 🫠 When I have time I will look into the internals again (there should be something we can prove to be mathematically correct, I believe...).

Projects
Status: In Progress

4 participants