KBinsDiscretizer: Automatic determination of number of bins · Issue #9337 · scikit-learn/scikit-learn · GitHub
jnothman opened this issue Jul 12, 2017 · 4 comments
Labels: Enhancement, Moderate (anything that requires some knowledge of conventions and best practices), module:preprocessing

Comments

jnothman (Member) commented Jul 12, 2017

One small extension to KBinsDiscretizer is to allow the number of bins to be guessed by the estimator, using one of the strategies supported by np.histogram. We very possibly don't want to implement all of the options, but fd, sturges and auto might be appropriate.

However, I'm not actually sure how useful these estimates are in discretization, when they have been designed for visualisation. So a contribution would be best accompanied by an example that showed that this automatic determination was better for machine learning than a fixed number of bins across all features.
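The extension described above could be sketched as follows. This is not scikit-learn code, just a hedged illustration of the idea: use np.histogram_bin_edges (which already implements the fd, sturges and auto strategies) to estimate a per-feature bin count, then pass the resulting array to KBinsDiscretizer, which accepts n_bins as an array-like of per-feature values.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))

# Per-feature bin count estimated by np.histogram's "auto" strategy:
# number of bins = number of edges minus one.
n_bins = [len(np.histogram_bin_edges(col, bins="auto")) - 1 for col in X.T]

# Feed the estimated counts to KBinsDiscretizer as an array-like.
kbd = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="uniform")
Xt = kbd.fit_transform(X)
```

An estimator parameter such as `n_bins="auto"` could wrap exactly this logic internally; whether the resulting bin counts actually help downstream models is the open question raised above.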

jnothman added labels Enhancement, Moderate, Need Contributor (Jul 12, 2017)
jnothman changed the title from "discrete branch: automatic determination of number of bins" to "KBinsDiscretizer: Automatic determination of number of bins" (Jul 12, 2018)
glevv (Contributor) commented Jan 14, 2021

There are a bunch of methods for calculating the number of bins for a histogram (look here). They're easy to implement, but I'm not sure about the use cases.

glevv commented Jan 18, 2021

I tried it, and I don't think it's going to work well.

A simple example: feature ~ N(0, 1) with 10_000 samples. By Scott's rule we get n_bins = ceil(3.49 · 1 / cbrt(10_000)) = 1, but the minimum number of bins is 2. The same thing happens with the Freedman-Diaconis rule. To summarize the problem: these rules are too sensitive to feature scaling.
Rules that rely only on the number of samples, like Sturges, Rice and sqrt, are more stable. The difference between Rice and sqrt is tiny; Sturges' rule is the most promising and works somewhat better than the previous two, but I'm not sure that's enough to add it to KBinsDiscretizer.
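The sample-size-only rules mentioned above are one-line formulas. As a quick sketch (the standard textbook definitions, not scikit-learn code), here they are for n = 10_000:

```python
import numpy as np

n = 10_000
# Sturges: ceil(log2(n)) + 1
sturges = int(np.ceil(np.log2(n))) + 1
# Rice: ceil(2 * n^(1/3))
rice = int(np.ceil(2 * n ** (1 / 3)))
# Square-root rule: ceil(sqrt(n))
sqrt_rule = int(np.ceil(np.sqrt(n)))
```

For n = 10_000 these give 15, 44 and 100 bins respectively. None of them look at the values themselves, which is why they are insensitive to feature scaling, at the cost of ignoring the shape of the distribution.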

jnothman (Member, Author) commented Jan 19, 2021 via email

glevv commented Jan 26, 2021

Agreed.

I can make the changes; it's actually a very small fix.
