Automatic bandwidth calculation valid only for normalized data #26658
Comments
Related issue: #25623
I agree it looks like a bug. Please feel free to open a PR and rework your reproducer into a non-regression test checking that scikit-learn's KDE approximately matches scipy's implementation on this toy data. Please also check that your solution fixes #25623.
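Such a non-regression test might look roughly like the sketch below. The names and tolerances are my own, and it sidesteps the bug by handing scikit-learn the explicit effective bandwidth (scipy's dimensionless `factor` times the sample standard deviation) that a fixed `bandwidth="scott"` implementation would compute internally:

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
# Small-scale data (std far from 1) is exactly what triggers the reported bug.
data = rng.normal(loc=5.0, scale=0.01, size=500)
grid = np.linspace(4.95, 5.05, 200)

kde_scipy = gaussian_kde(data, bw_method="scott")
# scipy's kernel std is factor * std(data); a correct "scott" implementation
# in scikit-learn would set bandwidth_ to this same value.
effective_bw = kde_scipy.factor * data.std(ddof=1)

kde_sklearn = KernelDensity(kernel="gaussian", bandwidth=effective_bw).fit(
    data.reshape(-1, 1)
)
density_sklearn = np.exp(kde_sklearn.score_samples(grid.reshape(-1, 1)))
density_scipy = kde_scipy.pdf(grid)

# With matching effective bandwidths the two density estimates agree closely.
assert np.allclose(density_sklearn, density_scipy, rtol=1e-5)
```

With the fix in place, the same assertion should hold when `KernelDensity` is constructed with `bandwidth="scott"` directly.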
Hi @ogrisel, not sure exactly who to ping, but after some investigation I want to compare two approaches here:
Correctness
Multi-dimensional data
Effects on other public interfaces

One public interface that I notice is the
Other issues

For both approaches, especially Approach 2, I think we should change the default to

Summary

Personally I would prefer Approach 2. I tried a "hack" implementation locally and it seems to work well, agreeing almost completely with the scipy result when using the Gaussian kernel. However, I cannot guarantee that my analysis is completely correct. What do maintainers think about this?

P.S. If any further information is needed, please let me know.

Attachments

Testing code (1D):

```python
import os
from itertools import product

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde, norm
from sklearn.neighbors import KernelDensity

if __name__ == "__main__":
    savepath = os.path.join(os.path.dirname(__file__), "../figures")
    if not os.path.exists(savepath):
        os.mkdir(savepath)

    for scale, bw in product([0.01, 1, 100], ["scott", 1.0]):
        # Unimodal and bimodal Gaussian data at the given scale.
        data1 = np.random.normal(5 * scale, scale, 10000)
        data2 = np.concatenate(
            [np.random.normal(0, scale, 3000), np.random.normal(5 * scale, scale, 7000)]
        )
        xs1 = np.linspace(start=0, stop=10 * scale, num=1000)
        xs2 = np.linspace(start=-5 * scale, stop=10 * scale, num=1000)
        ref1 = norm(5 * scale, scale).pdf(xs1)
        ref2 = 0.3 * norm(0, scale).pdf(xs2) + 0.7 * norm(5 * scale, scale).pdf(xs2)

        # Only two datasets are defined above, so zip over those two (the
        # original snippet referenced an undefined data3/xs3/ref3).
        for i, (data, xs, ref) in enumerate(
            zip([data1, data2], [xs1, xs2], [ref1, ref2])
        ):
            kd_sklearn = KernelDensity(kernel="gaussian", bandwidth=bw).fit(
                data.reshape(-1, 1)
            )
            sklearn_result = np.exp(kd_sklearn.score_samples(xs.reshape(-1, 1)))
            sklearn_bw = kd_sklearn.bandwidth_

            kd_scipy = gaussian_kde(data, bw_method=bw)
            scipy_result = kd_scipy.pdf(xs)
            scipy_bw = kd_scipy.factor

            plt.plot(xs, scipy_result, label=f"scipy ({scipy_bw:.3f})", lw=5, alpha=0.3)
            plt.plot(xs, sklearn_result, label=f"sklearn ({sklearn_bw:.3f})")
            plt.fill(xs, ref, alpha=0.1, color="black")
            plt.legend()
            plt.title(f"scale={scale}, bw={bw}")
            plt.savefig(os.path.join(savepath, f"{scale}-{bw}-data{i + 1}.png"))
            # plt.show()
            print("\033[92m.\033[0m", end="", flush=True)
            plt.clf()
```

Figures produced by my hack implementation (using the testing code above, 1D)
Figures produced by the current main branch (using the testing code above, 1D, zipped)

Download: main-1d-test.zip

2D example

The data for this example is simply stacking the two univariate data, the first with

To further validate the result, scipy has the
@Charlie-XIAO thanks for the summary and experiments, and sorry for the late feedback. I agree with your conclusions. Would you mind opening a PR with approach 2?
Note that the two approaches are not necessarily mutually exclusive. In the future we could decide to implement the variance-based approach as well, if someone can highlight a real-life case where approach 1 would perform significantly better than approach 2.
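For reference, a short sketch of how the 1D rule-of-thumb bandwidths relate to scipy's dimensionless factor. The formulas are the standard Scott and Silverman rules, not scikit-learn's current code; note that scipy's `"silverman"` uses the factor `(n * (d + 2) / 4) ** (-1 / (d + 4))` times the data std rather than the IQR-based rule shown here:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.RandomState(42)
data = rng.normal(0.0, 3.0, size=1000)
n = data.size

# Scott's rule for 1D data: h = sigma * n ** (-1/5).
scott_bw = data.std(ddof=1) * n ** (-1.0 / 5.0)

# Silverman's IQR-based rule of thumb: h = 0.9 * min(sigma, IQR / 1.34) * n ** (-1/5).
iqr = np.subtract(*np.percentile(data, [75, 25]))
silverman_bw = 0.9 * min(data.std(ddof=1), iqr / 1.34) * n ** (-1.0 / 5.0)

# scipy stores the dimensionless factor and scales the data covariance by it
# internally, so the effective kernel std is sqrt(kde.covariance).
kde = gaussian_kde(data, bw_method="scott")
assert np.isclose(np.sqrt(kde.covariance[0, 0]), scott_bw)
```

This is the scaling step that the scikit-learn string bandwidths currently skip: they return only the `n ** (-1/5)`-style factor without the `sigma` term.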
@ogrisel Thanks for your reply! Unfortunately I'm kind of busy at the moment, but I will revisit this and open a PR in a month or so.
Describe the bug

`sklearn.neighbors.KernelDensity` supports automatic (optimal) bandwidth calculation via `bandwidth='silverman'` and `bandwidth='scott'`. The implementation computes the appropriate observation-count bandwidth factor (proportional to `nobs ** (-1/5)` in one dimension) but does not scale it by the standard deviation or interquartile range of the dataset. Roughly, the algorithm should multiply the dimensionless factor by the dataset's standard deviation. See, e.g., Wikipedia. The implementation in `scipy.stats._kde` is correct.

Steps/Code to Reproduce
Expected Results
The automatic scikit-learn density curve should approximately match the SciPy curve, roughly following the shape of the underlying data histogram.
Actual Results
The automatic scikit-learn bandwidth produces a nearly flat PDF.
Versions