KBinsDiscretizer: Automatic determination of number of bins #9337
Comments
There are a number of methods for calculating the number of bins for a histogram (look here). They are easy to implement, but I am not sure about the use cases. |
I tried it and don't think it's going to work well. Simple example: feature ~ N(0,1) with 10_000 samples. By Scott's rule we will have: n_bins = ceil(3.49*1/cbrt(10_000)) = 1. The minimum number of bins is 2. The same thing happens with the Freedman-Diaconis rule. To summarize the problem: it's too sensitive to feature scaling. |
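As a cross-check on the arithmetic in this comment, here is a minimal sketch using NumPy. It assumes the quantity 3.49*sigma/cbrt(n) is read as Scott's bin width, which is how NumPy's `scott` estimator treats it, with the bin count obtained by dividing the data range by that width. The seed, the variable names, and the scale factor are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)  # feature ~ N(0, 1), as in the comment

# Scott's rule gives a bin *width*; for this sample it is about 0.16.
width = 3.49 * x.std() / np.cbrt(x.size)

# Dividing the data range by the width gives a bin count.
n_bins_manual = int(np.ceil((x.max() - x.min()) / width))

# NumPy applies the same idea: n + 1 edges delimit n bins.
n_bins_numpy = len(np.histogram_bin_edges(x, bins="scott")) - 1

# Rescaling the feature rescales both the range and the width,
# so the count derived this way should be (essentially) unchanged.
n_bins_scaled = len(np.histogram_bin_edges(1000.0 * x, bins="scott")) - 1

print(width, n_bins_manual, n_bins_numpy, n_bins_scaled)
```

Under this reading, the value that rounds up to 1 is the width rather than the count, and the count itself stays well above the minimum of 2 for this sample.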
I think basing it on the number of samples might be better than the current
heuristic of "always use 5 unless the user says otherwise".
|
Agreed. I can make the changes; it's actually a very small fix. |
One small extension to KBinsDiscretizer is to allow the number of bins to be guessed by the estimator, using one of the strategies supported by `np.histogram`. We very possibly don't want to implement all of the options, but `fd`, `sturges` and `auto` might be appropriate.

However, I'm not actually sure how useful these estimates are in discretization, when they have been designed for visualisation. So a contribution would be best accompanied by an example that showed that this automatic determination was better for machine learning than a fixed number of bins across all features.
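To make the proposal concrete, here is a minimal sketch of how per-feature bin counts might be derived from NumPy's estimators. The helper name `guess_n_bins` and its `min_bins` floor are hypothetical and not part of scikit-learn; `np.histogram_bin_edges` is the NumPy function that computes the edges for the string strategies accepted by `np.histogram`.

```python
import numpy as np


def guess_n_bins(X, strategy="auto", min_bins=2):
    """Hypothetical helper: return one bin count per column of X.

    `strategy` is any string accepted by np.histogram_bin_edges
    (e.g. "fd", "sturges", "auto"). `min_bins` is an assumed floor,
    since a discretizer needs at least 2 bins.
    """
    X = np.asarray(X, dtype=float)
    counts = []
    for col in X.T:
        edges = np.histogram_bin_edges(col, bins=strategy)
        # n + 1 edges delimit n bins
        counts.append(max(len(edges) - 1, min_bins))
    return np.array(counts)


rng = np.random.default_rng(0)
X = np.column_stack([
    rng.standard_normal(1_000),    # symmetric feature
    rng.exponential(size=1_000),   # skewed feature
])
for strategy in ("fd", "sturges", "auto"):
    print(strategy, guess_n_bins(X, strategy))
```

The resulting array could then be passed as `n_bins` to `KBinsDiscretizer`, which accepts an array-like with one bin count per feature. Whether counts chosen this way actually help a downstream model is the open question raised above, so this sketch only shows the mechanics.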