Automatic bandwidth calculation valid only for normalized data · Issue #26658 · scikit-learn/scikit-learn

Open · kwdwrd-puff opened this issue Jun 21, 2023 · 7 comments

kwdwrd-puff commented Jun 21, 2023

Describe the bug

sklearn.neighbors.KernelDensity supports automatic (rule-of-thumb) bandwidth calculation via bandwidth = 'silverman' and bandwidth = 'scott'. The algorithm computes the appropriate sample-size-dependent bandwidth factor (proportional to nobs**(-1/5) in one dimension) but does not adjust for the standard deviation or interquartile range of the dataset. Roughly, the algorithm should multiply the bandwidth factor by an estimate of the data's spread, rather than use the factor as the bandwidth directly.

See, e.g., the Wikipedia article on kernel density estimation. The implementation in scipy.stats._kde is correct.
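For reference, a minimal sketch of the one-dimensional Silverman rule of thumb (the helper name is illustrative, not scikit-learn or scipy API); note how the n**(-1/5) factor multiplies a spread estimate instead of standing alone:

import numpy as np

def silverman_bandwidth_1d(x):
    """Silverman's rule of thumb: h = 0.9 * min(std, IQR / 1.34) * n**(-1/5)."""
    n = len(x)
    std = np.std(x, ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))  # 75th minus 25th percentile
    # The spread estimate scales the factor; dropping it (the reported bug)
    # leaves a bandwidth that is far too wide for small-scale data.
    return 0.9 * min(std, iqr / 1.34) * n ** (-1 / 5)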

Steps/Code to Reproduce

import matplotlib.pyplot as plot
import numpy as np
from sklearn.neighbors import KernelDensity
from scipy.stats import gaussian_kde

data = np.random.normal( scale = 0.01, size = 100 )

#
# 1. sklearn (auto)
#
kd_sklearn_auto = KernelDensity( kernel = 'gaussian', bandwidth = 'silverman' )
kd_sklearn_auto.fit( np.reshape( data, ( -1, 1 ) ) )

#
# 2. sklearn (manual): simplified Silverman rule, 0.9 * std * n^(-1/5)
#
kd_sklearn_manual = KernelDensity( kernel = 'gaussian', bandwidth = 0.9 * np.std( data ) / len( data ) ** ( 1 / 5 ) )
kd_sklearn_manual.fit( np.reshape( data, ( -1, 1 ) ) )

#
# 3. scipy
#
kd_scipy = gaussian_kde( data, bw_method = 'silverman' )

#
# 4. show the difference
#
xs = np.arange( start = -0.05, stop = 0.05, step = 1e-4 )
plot.plot( xs, np.exp( kd_sklearn_auto.score_samples( np.reshape( xs, ( -1, 1 ) ) ) ), label = 'KDE SKLearn (auto)' )
plot.plot( xs, np.exp( kd_sklearn_manual.score_samples( np.reshape( xs, ( -1, 1 ) ) ) ), label = 'KDE SKLearn (manual)' )
plot.plot( xs, kd_scipy.pdf( xs ), label = 'KDE SciPy' )
plot.hist( data, label = 'Data' )
plot.legend()
plot.show()

Expected Results

The automatic scikit-learn bandwidth curve should approximately match the SciPy curve, roughly following the shape of the underlying data histogram.

Actual Results

The automatic scikit-learn bandwidth produces an essentially flat PDF.

Versions

System:
    python: 3.9.5 (default, Nov 23 2021, 15:27:38)  [GCC 9.3.0]
executable: /local_disk0/.ephemeral_nfs/envs/pythonEnv-67c47d19-3f15-49a2-ab8f-bcf25a2bc29f/bin/python
   machine: Linux-5.15.0-1038-azure-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.2.2
          pip: 21.2.4
   setuptools: 58.0.4
        numpy: 1.20.3
        scipy: 1.7.1
       Cython: 0.29.24
       pandas: 1.3.4
   matplotlib: 3.4.3
       joblib: 1.2.0
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       filepath: /databricks/python3/lib/python3.9/site-packages/numpy.libs/libopenblasp-r0-5bebc122.3.13.dev.so
         prefix: libopenblas
       user_api: blas
   internal_api: openblas
        version: 0.3.13.dev
    num_threads: 6
threading_layer: pthreads
   architecture: Haswell

       filepath: /local_disk0/.ephemeral_nfs/envs/pythonEnv-67c47d19-3f15-49a2-ab8f-bcf25a2bc29f/lib/python3.9/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
         prefix: libgomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 6

       filepath: /databricks/python3/lib/python3.9/site-packages/scipy.libs/libopenblasp-r0-085ca80a.3.9.so
         prefix: libopenblas
       user_api: blas
   internal_api: openblas
        version: 0.3.9
    num_threads: 6
threading_layer: pthreads
   architecture: Haswell
kwdwrd-puff added the Bug and Needs Triage labels Jun 21, 2023
ogrisel (Member) commented Jun 29, 2023

For the record, here is the generated plot:

[figure: generated plot from the reproducer above]

betatim (Member) commented Jun 29, 2023

Related issue #25623

betatim removed the Needs Triage label Jun 29, 2023
ogrisel (Member) commented Jun 29, 2023

I agree it looks like a bug. Please feel free to open a PR and rework your reproducer into a non-regression test that checks that scikit-learn's KDE approximately matches scipy's implementation on this toy data (a sketch follows below).

Please check that your solution is a fix for #25623.
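For illustration, a minimal sketch of what such a non-regression test could look like (the test name and tolerance are assumptions, not the final test):

import numpy as np
from scipy.stats import gaussian_kde
from sklearn.neighbors import KernelDensity

def test_silverman_bandwidth_matches_scipy():
    # After the fix, the automatic Silverman bandwidth should make
    # scikit-learn's KDE agree with scipy's on small-scale 1D data.
    rng = np.random.RandomState(0)
    data = rng.normal(scale=0.01, size=100)
    xs = np.linspace(-0.05, 0.05, 200)

    kde = KernelDensity(kernel="gaussian", bandwidth="silverman")
    kde.fit(data.reshape(-1, 1))
    density_sklearn = np.exp(kde.score_samples(xs.reshape(-1, 1)))
    density_scipy = gaussian_kde(data, bw_method="silverman").pdf(xs)

    np.testing.assert_allclose(density_sklearn, density_scipy, rtol=0.1)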

Charlie-XIAO (Contributor) commented Jul 3, 2023

Hi @ogrisel, not sure who exactly to ping, but after some investigation I want to compare two approaches here:

  • Approach 1: When using automatic bandwidth calculation (scott or silverman), multiply the current coefficient by np.std(X), as proposed in this issue.

  • Approach 2: Treat bandwidth only as a coefficient, whether it is a strategy (scott or silverman) or a given float value. Then choose the bandwidth matrix proportional to the square root of the covariance matrix (scaled by this factor) and apply its inverse to X, as is also mentioned in KernelDensity incorrect handling of bandwidth #25623 (comment); see the sketch after this list.
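A minimal sketch of Approach 2 for the Gaussian kernel (variable names and the whitening step are illustrative, not the proposed scikit-learn API):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) * [0.01, 100.0]  # very different scales per dimension

n, d = X.shape
factor = n ** (-1.0 / (d + 4))  # Scott's factor, as in scipy

# Bandwidth matrix proportional to the square root of the covariance matrix.
L = np.linalg.cholesky(np.cov(X, rowvar=False))
H = factor * L

# Applying H^{-1} to X whitens the data: a unit-bandwidth KDE on X_white is
# equivalent to a KDE on X with bandwidth matrix H (density values are then
# rescaled by 1 / |det H| to account for the change of variables).
X_white = X @ np.linalg.inv(H).T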

Correctness

  • Approach 1: statsmodels is using this approach.

  • Approach 2: scipy is using this approach.

Multi-dimensional data

  • Approach 1: The variance can differ across dimensions, so it is necessary to use np.std(X, axis=0) instead of np.std(X), turning bandwidth into a vector. This does not fit well into the current scikit-learn framework, because our algorithm treats bandwidth as a scalar to simplify calculations. That part is simple to solve: we can transform X in advance using the inverse of the diagonal matrix constructed from this vector (see the sketch after this list). However, things become more complicated when users pass a float: that is fine when the variances in different dimensions coincide, but unreasonable otherwise. Users would then have to specify an array-like bandwidth, which makes bandwidth cross-validation (e.g. using GridSearchCV) much harder (infeasible IMO). Note that statsmodels has different classes for univariate and multivariate KDE, and indeed the multivariate class requires a bandwidth vector when not using an automatic strategy.

  • Approach 2: There is literature supporting that this is a good choice of bandwidth matrix. The problems of Approach 1 do not arise here. One (possible) issue is that users would not have direct access to the bandwidth matrix, only to the factor (but will users really need that?).
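A sketch of the per-dimension scaling that Approach 1 would need in the multivariate case (again illustrative, not an API proposal):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) * [0.01, 100.0]

n, d = X.shape
factor = n ** (-1.0 / (d + 4))  # Scott's factor

# bandwidth becomes a vector: one entry per dimension.
h = factor * np.std(X, axis=0)

# Equivalent to transforming X up front by the inverse of diag(h); a
# unit-bandwidth KDE on X_scaled corresponds to the diagonal bandwidth
# matrix diag(h) on X (density values rescaled by 1 / np.prod(h)).
X_scaled = X / h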

Effects on other public interfaces

One public interface that I notice is the kernel_density method of KDTree and BallTree.

  • Approach 1: No effect, because kernel_density takes only a float bandwidth, while this approach only changes behavior for the automatic bandwidth strategies.

  • Approach 2: Here the h argument would become a factor rather than the real bandwidth. It would also require changing the Cython code, because the transformation of X would have to be done internally when the kernel_density method is used; see the snippet below.
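For context, a small usage sketch of the current interface, where h is the actual kernel bandwidth (under Approach 2 the same argument would be reinterpreted as a factor):

import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
tree = KDTree(X)
# Today h is the real kernel bandwidth; under Approach 2 it would multiply
# the square root of the covariance, and the transformation of X would have
# to happen inside the Cython code.
density = tree.kernel_density(X[:5], h=0.5, kernel="gaussian")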

Other issues

For both approaches, especially Approach 2, I think we should change the default to "scott" or "silverman", because the current default value of 1.0 works poorly unless the data happens to have small variance.

Summary

Personally I would prefer Approach 2. I tried a "hack" implementation locally, and it seems to work well and almost completely agrees with the scipy result when using the Gaussian kernel. However, I cannot guarantee that my analysis is completely correct. What do the maintainers think about this?

P.S. If any further information is needed, please let me know.

Attachments

Testing code (1D)

import os
from itertools import product

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde, norm

from sklearn.neighbors import KernelDensity

if __name__ == "__main__":
    savepath = os.path.join(os.path.dirname(__file__), "../figures")
    if not os.path.exists(savepath):
        os.mkdir(savepath)

    for scale, bw in product([0.01, 1, 100], ["scott", 1.0]):
        data1 = np.random.normal(5 * scale, scale, 10000)
        data2 = np.concatenate(
            [np.random.normal(0, scale, 3000), np.random.normal(5 * scale, scale, 7000)]
        )
        xs1 = np.linspace(start=0, stop=10 * scale, num=1000)
        xs2 = np.linspace(start=-5 * scale, stop=10 * scale, num=1000)
        ref1 = norm(5 * scale, scale).pdf(xs1)
        ref2 = 0.3 * norm(0, scale).pdf(xs2) + 0.7 * norm(5 * scale, scale).pdf(xs2)

        for i, (data, xs, ref) in enumerate(
            zip([data1, data2], [xs1, xs2], [ref1, ref2])
        ):
            kd_sklearn = KernelDensity(kernel="gaussian", bandwidth=bw).fit(data.reshape(-1, 1))
            sklearn_result = np.exp(kd_sklearn.score_samples(xs.reshape(-1, 1)))
            sklearn_bw = kd_sklearn.bandwidth_

            kd_scipy = gaussian_kde(data, bw_method=bw)
            scipy_result = kd_scipy.pdf(xs)
            scipy_bw = kd_scipy.factor

            plt.plot(xs, scipy_result, label=f"scipy ({scipy_bw:.3f})", lw=5, alpha=0.3)
            plt.plot(xs, sklearn_result, label=f"sklearn ({sklearn_bw:.3f})")
            plt.fill(xs, ref, alpha=0.1, color="black")
            plt.legend()
            plt.title(f"scale={scale}, bw={bw}")

            plt.savefig(os.path.join(savepath, f"{scale}-{bw}-data{i + 1}.png"))
            # plt.show()
            print("\033[92m.\033[0m", end="", flush=True)
            plt.clf()
Figures produced by my hack implementation (using the testing code above, 1D)

Scale  Bandwidth  Data 1                     Data 2
1      scott      [figure 1-scott-data1]     [figure 1-scott-data2]
1      1.0        [figure 1-1.0-data1]       [figure 1-1.0-data2]
0.01   scott      [figure 0.01-scott-data1]  [figure 0.01-scott-data2]
0.01   1.0        [figure 0.01-1.0-data1]    [figure 0.01-1.0-data2]
100    scott      [figure 100-scott-data1]   [figure 100-scott-data2]
100    1.0        [figure 100-1.0-data1]     [figure 100-1.0-data2]
Figures produced by the current main branch (using the testing code above, 1D, zipped)

Download: main-1d-test.zip

2D example

The data for this example is simply a stack of the two univariate datasets, the first with scale=0.01 and the second with scale=100. The strategy chosen here is bandwidth="scott" (I believe the other cases are more or less similar). The result is projected respectively onto the xz- and yz-planes. As shown below, with the hack implementation the result again agrees with scipy; a sketch of the data construction follows the figure.

[figure: 2D KDE comparison projected onto the xz- and yz-planes]
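A sketch of how such stacked 2D data could be assembled, reusing the generator pattern from the testing code above (names are illustrative):

import numpy as np

rng = np.random.default_rng(0)
dim1 = rng.normal(5 * 0.01, 0.01, 10000)  # first univariate dataset, scale=0.01
dim2 = rng.normal(5 * 100, 100, 10000)    # second univariate dataset, scale=100
data_2d = np.column_stack([dim1, dim2])   # shape (10000, 2)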

To further validate the result, scipy has the resample method, and I also "hack"-modified the sample method of KernelDensity (for the Gaussian kernel; tophat TBD). The two dimensions are respectively used to plot the histograms on the x- and y-axes, and the bivariate data is used to create the hexagonal binning plot in the middle. It seems that both scipy and the hack implementation work well; a comparison sketch follows the figure.

[figure: resampling comparison with marginal histograms and a hexagonal binning plot]
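A minimal sketch of how the two samplers can be compared (sample size and seed are illustrative; the patched sample method is my local hack, not released API):

import numpy as np
from scipy.stats import gaussian_kde
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
data_2d = np.column_stack(
    [rng.normal(5 * 0.01, 0.01, 10000), rng.normal(5 * 100, 100, 10000)]
)

# scipy expects shape (d, n) and returns samples in the same layout.
scipy_samples = gaussian_kde(data_2d.T, bw_method="scott").resample(5000).T

# sklearn's sample() supports the Gaussian kernel; with the hack applied,
# bandwidth="scott" should yield samples matching scipy's.
kde = KernelDensity(kernel="gaussian", bandwidth="scott").fit(data_2d)
sklearn_samples = kde.sample(5000, random_state=0)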

ogrisel (Member) commented Oct 9, 2023

@Charlie-XIAO thanks for the summary and experiments, and sorry for the late feedback. I agree with your conclusions. Would you mind opening a PR with approach 2?

ogrisel (Member) commented Oct 9, 2023

Note that the two approaches are not necessarily mutually exclusive. In the future we could decide to implement the variance-based approach as well, if someone can highlight a real-life case where Approach 1 would perform significantly better than Approach 2.

Charlie-XIAO (Contributor) commented Nov 17, 2023

@ogrisel Thanks for your reply! Unfortunately I am quite busy at the moment, but I will revisit this and make a PR in a month or so.
