FIX KernelDensity incorrectly handling bandwidth #27971
Conversation
It seems that I forgot to apply sample weight to the covariance and the automatic bandwidth calculation - but this still doesn't pass the test that "sample weight is equivalent to repetition". The results are indeed closer than before - but probably there's something else I overlooked 🤔
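For context, here is roughly where the weights now enter, as a minimal sketch (the helper names are mine and the formulas are the ones discussed further down in this thread, not a verbatim excerpt of the diff):

```python
import numpy as np

def weighted_scott_factor(X, sample_weight):
    # Hypothetical helper (not the actual PR diff): Scott's factor based on
    # the effective number of samples rather than the raw sample count,
    # mirroring scipy.stats.gaussian_kde's handling of weights.
    w = sample_weight / sample_weight.sum()
    neff = 1.0 / np.sum(w**2)
    return neff ** (-1.0 / (X.shape[1] + 4))

def weighted_covariance(X, sample_weight):
    # Weighted covariance via aweights, the approach discussed below.
    return np.cov(X, rowvar=False, aweights=sample_weight)
```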
```python
assert_allclose(
    np.exp(scores_weight), np.exp(scores_ref_sampling), atol=1e-8, rtol=1e-2
)
assert_allclose(sample_weight, sample_ref_sampling, rtol=1e-2)
```
I relaxed the checks here quite a bit to pass the test; not sure if this is reasonable. I checked the scipy implementation `gaussian_kde` and it cannot pass the strict version either (unless the scipy implementation is also buggy 🤔).
Test code
```python
import numpy as np
from scipy.stats import gaussian_kde
from numpy.testing import assert_allclose

n_samples = 400
size_test = 20
rng = np.random.default_rng(0)
X = rng.random((n_samples, 1))
weights = 1 + (10 * X.sum(axis=1)).astype(np.int8)
X_repetitions = np.repeat(X, weights, axis=0)
test_points = rng.random((size_test, 1))

# Weighted KDE vs. KDE fitted on the explicitly repeated samples.
kde1 = gaussian_kde(X.T, weights=weights)
scores_weight = kde1.pdf(test_points.T)
sample_weight = kde1.resample(1)

kde2 = gaussian_kde(X_repetitions.T)
scores_ref_sampling = kde2.pdf(test_points.T)
sample_ref_sampling = kde2.resample(1)

assert_allclose(scores_weight, scores_ref_sampling)
assert_allclose(sample_weight, sample_ref_sampling)
```
I assume that the equivalence is in expectation, so the strict check will not work. I think this is something that @snath-xoc might be able to help us with here. She is solving a similar issue regarding this equivalence in other estimators. I assume that the KDE could be added to the list of estimators to look after.
Thanks for looping me in @glemaitre. @Charlie-XIAO the test itself looks OK at a first glance. It may be that, due to the stochasticity, the values will only converge in expectation rather than to exact equality. I will add this to my tests for these kinds of situations and see how it performs.
Maybe check that the `bandwidth_` value is the same with integer weights and repetitions before trying to sample from the models.
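For instance, a quick sketch of the suggested check (assuming a scikit-learn version where `KernelDensity` accepts `bandwidth="scott"` and exposes `bandwidth_`):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.random((50, 2))
w = rng.integers(1, 4, size=50)  # strictly positive integer weights

kde_weighted = KernelDensity(bandwidth="scott").fit(X, sample_weight=w)
kde_repeated = KernelDensity(bandwidth="scott").fit(np.repeat(X, w, axis=0))
# The two fitted bandwidths should agree if weighting matches repetition.
print(kde_weighted.bandwidth_, kde_repeated.bandwidth_)
```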
I don't see anything that is expected to behave stochastically at fit time in the code. In particular, the `KernelDensity` constructor does not accept a `random_state` argument.
`bandwidth_` is not the same if we use `scott` or `silverman`, and I don't think they should be the same by definition. If we directly specify a float bandwidth then `bandwidth_` would indeed be the same. But even in the latter case both scipy and scikit-learn fail (in exactly the same way).
I don't agree with the way the `neff` thing is defined with weighted samples. It fundamentally breaks the sample weight / sample repetition equivalence as highlighted in https://github.com/scikit-learn/scikit-learn/pull/27971/files#r1762904747. Maybe we should report this upstream in case it hasn't been already.
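To make the incompatibility concrete with a tiny example (plain arithmetic, nothing scikit-learn specific):

```python
import numpy as np

w = np.array([1, 3])
neff = w.sum() ** 2 / np.sum(w**2)  # 16 / 10 = 1.6
n_repeated = w.sum()                # repeating the samples gives n = 4
print(neff, n_repeated)  # 1.6 vs 4: the bandwidth factors cannot match
```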
The fact that we are in line with the scipy implementation makes me think that this is the right fix.
I'm wondering if we should have a note in the changelog under changed models, because someone fixing the bandwidth manually will observe a change of behaviour. I don't think that we can preserve the previous behaviour.

Yes, it is impossible (and unreasonable) to retain the original behavior. I moved the changelog entry to changed models and added some additional explanations (based on previous changed-models entries). Please let me know if the wording should be changed or any other information is needed.
I believe the reason that the tests for sample weighting aren't passing at a higher rtol is that the sample weights are factored in using `np.cov(..., aweights=sample_weight)` (line 251 of `_kde.py`). Using `fweights` instead, i.e. `np.cov(..., fweights=sample_weight)`, returns identical covariance matrices when weighting versus repeating samples. A simple minimal reproducer is as follows:

```python
import numpy as np
from numpy.testing import assert_equal

n_samples = 1000
d = 3
rng = np.random.RandomState(3)
X = rng.rand(n_samples, d)
sample_weight = rng.randint(0, 3, size=X.shape[0])
X_repeated = np.repeat(X, sample_weight, axis=0)
sample_weight = sample_weight / sample_weight.sum()

cov_weighted = np.cov(X, aweights=sample_weight, rowvar=False)
cov_repeated = np.cov(X_repeated, rowvar=False)
assert_equal(cov_repeated, cov_weighted)
```

which gives an assertion error.
@ogrisel and @Charlie-XIAO we may need to think about how to add sample weights in here: `fweights` is preferred, however it only accepts integer sample weights.
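For reference, the `fweights` variant of the reproducer above does pass (using the raw integer weights, without normalization):

```python
import numpy as np
from numpy.testing import assert_allclose

rng = np.random.RandomState(3)
X = rng.rand(1000, 3)
sample_weight = rng.randint(0, 3, size=X.shape[0])
X_repeated = np.repeat(X, sample_weight, axis=0)

# With fweights, weighting is exactly equivalent to repeating samples.
cov_weighted = np.cov(X, fweights=sample_weight, rowvar=False)
cov_repeated = np.cov(X_repeated, rowvar=False)
assert_allclose(cov_repeated, cov_weighted)
```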
@snath-xoc Interesting finding! The numpy doc is not very explicit about the mathematical definition of `fweights` and `aweights`. Looking at the code:
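A simplified paraphrase of the divisor logic inside `np.cov` (the wrapper function is mine, not numpy's API; not a verbatim excerpt):

```python
import numpy as np

def cov_divisor(w, aweights, ddof, n_samples):
    # w is the combined weight vector fweights * aweights (or None).
    if w is None:
        return n_samples - ddof
    w_sum = w.sum()
    if ddof == 0:
        return w_sum                      # plain weighted average
    if aweights is None:
        return w_sum - ddof               # fweights behave like repetitions
    return w_sum - ddof * np.sum(w * aweights) / w_sum
```

So with `fweights` and the default `ddof=1` the divisor is `sum(fweights) - 1`, exactly what repeating the samples yields, while the `aweights` correction term depends on the weights themselves, which is where the mismatch above comes from.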
Agreed, the numpy doc is not so informative. I tried putting normalised weights into `fweights` (i.e. `sample_weight / sample_weight.sum()`) and it gives an error. I think it is because this is for frequency-based weighting, which must correspond to some resampling; I am not sure what strategy the absolute weighting (`aweights`) uses though.
```python
    sample_weight, X, dtype=np.float64, ensure_non_negative=True
)
normalized_sample_weight = sample_weight / sample_weight.sum()
n_effective_samples = 1 / np.sum(normalized_sample_weight**2)
```
@Charlie-XIAO would you have a reference as to why you use the following (sorry, I may just be missing something here)?

```python
n_effective_samples = 1 / np.sum(normalized_sample_weight**2)
```
Scipy doc of `gaussian_kde`: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html. It has the `neff` thing, which is short for the effective number of samples, computed as `neff = sum(weights)^2 / sum(weights^2)`.
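Note that for normalized weights this coincides with the line in the diff above: with `w_norm = w / w.sum()` we have `sum(w_norm) == 1`, hence

`neff = sum(w_norm)^2 / sum(w_norm^2) = 1 / sum(w_norm^2)`

which is exactly `n_effective_samples = 1 / np.sum(normalized_sample_weight**2)` (also known as Kish's effective sample size).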
This is breaking the usual "reweighting is repeating" semantics for `sample_weight` that we would like to enforce in all scikit-learn estimators:
```python
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
import numpy as np

rng = np.random.default_rng(seed=42)
n_samples = 30
data = np.concatenate(
    [
        rng.standard_normal(size=n_samples // 2) - 1,
        3 * rng.standard_normal(size=n_samples // 2) + 5,
    ]
)
weights = rng.integers(low=0, high=5, size=n_samples)

# bw_method = "scott"
bw_method = "silverman"  # similar discrepancy with "scott"

kde_weights = gaussian_kde(
    data, bw_method=bw_method, weights=weights.astype(np.float64)
)
kde_repetitions = gaussian_kde(np.repeat(data, weights), bw_method=bw_method)

x = np.linspace(data.min(), data.max(), 100)
fig, (ax0, ax1) = plt.subplots(nrows=2, ncols=1, figsize=(8, 8))
ax0.plot(x, kde_weights.evaluate(x), label="weights")
ax0.plot(x, kde_repetitions.evaluate(x), label="repetitions")
ax0.legend()

bins = np.linspace(data.min(), data.max(), 30)
ax1.hist(data, weights=weights, bins=bins, alpha=0.5, label="weights")
ax1.hist(np.repeat(data, weights), bins=bins, alpha=0.5, label="repetitions")
ax1.legend()
```
Scipy does not specify `ddof=0`, and uses `neff = sum(weights)^2 / sum(weights^2)` (for scott and silverman), which is essentially the same as currently in this PR, so in #27971 (comment) I think the small deviation comes from `ddof`?
Regarding scott and silverman, I briefly searched through the references in the scipy docs and did not find anything useful for the weighted case, so I don't know where the scipy formula really comes from...
In https://github.com/Charlie-XIAO/scikit-learn/pull/3/files#r1761412299 you changed to `neff = sum(weights)`; I wonder what's the intuition behind that as well 🤔
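One possible reading (my guess at the intuition, not confirmed by that PR): for integer weights, repeating sample `i` exactly `w_i` times yields a dataset of `n = sum(weights)` points, so setting `neff = sum(weights)` reproduces the repetition behaviour exactly, e.g. for Scott's rule:

`scott_weighted = sum(weights)^(-1/(d+4)) = n_repeated^(-1/(d+4)) = scott_repeated`

whereas Kish's `sum(weights)^2 / sum(weights^2)` only equals `sum(weights)` when all weights are equal.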
I tried patching scipy's implementation, and passing `bias=True` / `ddof=0` to its own internal call to `np.cov` fixes the remaining (smaller) discrepancy between weighted and repeated data points (as it does for scikit-learn's implementation).
Indeed we would need to use `ddof=0`.
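Extending the earlier reproducer to double-check this: with `ddof=0`, `aweights` matches the repeated-samples covariance (up to floating-point round-off):

```python
import numpy as np
from numpy.testing import assert_allclose

rng = np.random.RandomState(3)
X = rng.rand(1000, 3)
sample_weight = rng.randint(0, 3, size=X.shape[0])
X_repeated = np.repeat(X, sample_weight, axis=0)

cov_weighted = np.cov(X, aweights=sample_weight, rowvar=False, ddof=0)
cov_repeated = np.cov(X_repeated, rowvar=False, ddof=0)
assert_allclose(cov_repeated, cov_weighted)  # passes with ddof=0
```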
As @snath-xoc found out, using `fweights` gives an exact match, but we don't want to restrict `sample_weight` to integer values. Here is a PR that shows that we can use a much smaller `rtol`. The alternative would be to implement our own weighted covariance.
It does because [...]. If we want to keep [...].
I now see that [...].
Not yet. There are still failures in my side PR when I enable the silverman and scott methods in the sample weight test. But I will be busy with PyData Paris until the end of next week.
That's fine, I'm also busy with something else recently 🫠 When I have time I will look into the internals again (there should be something which we can prove to be mathematically correct, I believe...).
Towards #25623, #26658.
Note: See this gist for some results of this PR. The scikit-learn results should be (almost) consistent with the scipy results.
By the way, though not related to this PR, the implementation of weighted KDE in scikit-learn seems to be very slow (#10803). It needs to traverse all points in a node and sum their weights every time, which makes the tree implementation (which should be fast) several times slower even than the naive implementation of scipy as data size scales up.
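For reference, a naive O(n_train x n_test) weighted Gaussian KDE looks like the following sketch (my own illustration of the kind of baseline being compared against, not scipy's actual code):

```python
import numpy as np
from scipy.special import logsumexp

def naive_weighted_kde_logpdf(X_train, weights, X_test, bandwidth):
    # Weighted KDE with an isotropic Gaussian kernel of width `bandwidth`:
    # log p(x) = logsumexp_i(log w_i - 0.5 * ||x - x_i||^2 / h^2) + const.
    w = weights / weights.sum()
    d = X_train.shape[1]
    sq_dists = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    log_kernel = -0.5 * sq_dists / bandwidth**2
    log_norm = -0.5 * d * np.log(2 * np.pi * bandwidth**2)
    return logsumexp(log_kernel, axis=1, b=w) + log_norm
```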