[MRG] Efficiency updates to KBinsDiscretizer #19290
Conversation
'auto' option added (Sturges rule); KMeans algorithm changed from 'auto' to 'full'; format strings changed to f-strings.
Thanks! Please add a test and an entry in the change log for version 1.0.
I also wonder how easily we could empirically demonstrate that this is a reasonable rule for discretisation (rather than histogramming).
@@ -126,7 +127,7 @@ class KBinsDiscretizer(TransformerMixin, BaseEstimator):
     """

     @_deprecate_positional_args
-    def __init__(self, n_bins=5, *, encode='onehot', strategy='quantile',
+    def __init__(self, n_bins='auto', *, encode='onehot', strategy='quantile',
should we call it 'sturges' instead of 'auto'?
When I was experimenting I called it 'auto', since I used three different rules: Sqrt, Rice, Sturges.
The same logic applies here: maybe in the future the formula will be changed, and we would have to rename the 'sturges' option to something different, which could also mean changing docs, tests, etc. Calling it 'auto' and just updating the description is more sustainable imo.
if orig_bins == 'auto':
    # calculate number of bins
    # depending on number of samples with Sturges rule
    orig_bins = int(np.ceil(np.log2(n_samples) + 1.))
I wonder whether +1 is actually beneficial for discretisation. The number of bins looks fairly big for smallish numbers.
It is needed for a very small number of samples:
log2(1) = 0 but log2(1) + 1 = 1
log2(2) = 1 but log2(2) + 1 = 2
etc.
For larger n_samples, adding 1 will not drastically change the output, so I would vote to keep it.
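For reference, here is how the formula in this diff behaves for a few sample sizes (values computed directly from the expression above):

```python
import numpy as np

# Sturges rule as used in this PR: n_bins = ceil(log2(n_samples) + 1)
for n_samples in (1, 2, 10, 100, 1_000, 100_000):
    n_bins = int(np.ceil(np.log2(n_samples) + 1.0))
    print(f"{n_samples:>7} samples -> {n_bins:>2} bins")
# 1 -> 1, 2 -> 2, 10 -> 5, 100 -> 8, 1000 -> 11, 100000 -> 18
```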
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
You mean in
The simplest explanation is that always compressing to 5 bins (the previous default) loses more and more information the more unique samples we get. With different formulas for binning we can achieve good compression and preserve more information. For a continuous feature with n_samples=1000, compressing it to 5 unique values cuts a lot of information while giving great compression; compressing it to 11 (the Sturges value for n_samples=1000) keeps much more of it while still compressing well.
Changing the default value of
I like "auto" but I'm not sure we'd change the behavior without a deprecation cycle anyway.
Here are a few comments. Please also add a test to check that the FutureWarning is issued; an example is here:
@pytest.mark.parametrize("precompute_distances", ["auto", False, True])
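A minimal sketch of such a test, assuming the FutureWarning fires when n_bins is left at its default and that its message mentions n_bins (both are assumptions about this PR's final wording, not the actual test):

```python
import numpy as np
import pytest

from sklearn.preprocessing import KBinsDiscretizer


def test_n_bins_default_future_warning():
    # Assumed behavior: leaving n_bins at its default triggers a FutureWarning
    # announcing the upcoming change of the default from 5 to 'auto'.
    X = np.random.RandomState(0).rand(20, 2)
    with pytest.warns(FutureWarning, match="n_bins"):
        KBinsDiscretizer(encode="ordinal").fit(X)
```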
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Kbd changes
Still 2 tests failing
All green (it was just a CI worker that failed to start)
if isinstance(orig_bins, str):
    if orig_bins == 'auto':
        # calculate number of bins with Sturges rule
        orig_bins = max(int(np.ceil(np.log2(n_samples) + 1.)), 2)
Is there an advantage of using the 'auto' behavior defined by np.histogram_bin_edges? In their case, they take the maximum of 'sturges' and 'fd'.
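NumPy's behavior can be checked directly; a small illustrative comparison on a random Gaussian sample (numbers will vary with the data):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(size=1000)

for rule in ("sturges", "fd", "auto"):
    edges = np.histogram_bin_edges(x, bins=rule)
    print(rule, len(edges) - 1)
# 'auto' picks whichever of 'sturges'/'fd' gives more bins (the smaller width).
```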
I already talked about this a bit in #9337
- Freedman-Diaconis and Scott's rules are both based on a variance measure (IQR and STD), so they are very dependent on the scaling of the features (see the sketch after this list). Example: a feature with 1000 samples will have std=1 after StandardScaler, so 3.49*1/cbrt(1000) = 0.349, and for the FD rule this number would be even smaller. The same applies to every scaling transformation. Since scaling is a very popular technique, I don't see the point in additional computations that could lead to a degenerate solution (yes, we could preemptively check every feature's std/IQR, but that would increase time and memory consumption). Another example would be a feature with small absolute values, like 0.01, 0.002, 0.0003, etc.; in that case it would be even worse. Basically these rules rely on the absolute value of the variance. Normalized measures should work better (CV instead of STD, QCD instead of IQR), but this is more of a research than a practice topic. And to finish the topic: all three (Sturges, FD, Scott) rely on some sort of "normality" of the distribution, which is not always the case. It would be better to mix one of them with some other rule that doesn't rely on normality.
- It won't work as straightforwardly as max(sturges, fd). It should be at least something like max(sturges, fd, 2). We would also need to calculate the IQR for every feature in the dataset and determine n_bins per feature, instead of a unified approach. It would require a lot of tests, checks and fallbacks to get right, while also increasing the memory and time footprint. Not that it cannot be done, it's just that it may not improve model performance on data binned in such a fashion.
- I tested the Rice, Sqrt and Sturges rules on some toy datasets with different types of models (linear, RF) and different tasks (classification, regression). The difference between Rice and Sqrt was non-existent, while the Sturges rule outperformed the other two. The Rice (2*cbrt(n)) and Sqrt (sqrt(n)) rules will always give bigger values than the Sturges rule, and so would FD and Scott's (or at least that is what we expect them to do), but as I mentioned above, more bins does not equal more accuracy. Not saying that it is always the case, just a thought to consider.
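To make the arithmetic in the first point concrete, a minimal sketch (Scott's rule suggests a bin width, not a count; the 3.49 constant is the one used in the example above):

```python
import numpy as np

# Scott's rule bin width: h = 3.49 * std / cbrt(n_samples).
# After StandardScaler the std is 1, so the suggested width shrinks with n only:
n_samples, std = 1000, 1.0
print(3.49 * std / np.cbrt(n_samples))  # ~0.349
```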
Thank you for your detailed thoughts!
If one were planning to discretize a feature, I do not think we would require them to scale it first. This means that, for benchmarking, I would assume that the data is not scaled when comparing the different techniques.
> I tested the Rice, Sqrt and Sturges rules on some toy datasets with different types of models
Can you provide the code for these benchmarks? When implementing a feature without much literature backing it, I tend to try to see what kind of benchmarks we can show to see when this is useful. There appear to be benchmarks here:
but they do not use sturges or fd. I could be missing a paper somewhere that actually looks into sturges or fd for ML.
> We would also need to calculate the IQR for every feature in the dataset and determine n_bins per feature, instead of a unified approach.
Codewise, it would be okay, since the IQR computation is np.quantile(X, [0.25, 0.75], axis=0), and we already support n_bins as an array-like.
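A rough sketch of what that per-feature computation could look like end to end (illustrative only, not code from this PR; the conversion from FD width to a bin count via the feature range, and the floor of 2 bins, are assumptions taken from this discussion):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(0)
X = rng.rand(1000, 3)
n_samples = X.shape[0]

# Per-feature IQR and the Freedman-Diaconis bin width 2 * IQR / cbrt(n).
q25, q75 = np.quantile(X, [0.25, 0.75], axis=0)
fd_width = 2 * (q75 - q25) / np.cbrt(n_samples)

# Convert each width into a bin count, with a floor of 2 bins per feature.
n_bins = np.maximum(np.ceil(np.ptp(X, axis=0) / fd_width).astype(int), 2)

Xt = KBinsDiscretizer(n_bins=n_bins, encode="ordinal").fit_transform(X)
print(n_bins, Xt.shape)
```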
Also looks like @amueller and @janvanrijn wrote a paper on this here: https://arxiv.org/pdf/2007.07588.pdf where they wrote on page 3:
> The Freedman-Diaconis rule is known to be resilient to outliers. An alternative of this rule is Sturges formula [14]. The automatic histogram binning, as implemented in the python library numpy [10], uses the maximum of the Freedman-Diaconis rule and Sturges formula. However, Sturges formula assumes that the data follows a Gaussian distribution. Since we have no reason to assume that this holds for our data, we only use the Freedman-Diaconis rule.
My bad. The FD rule suggests not the number of bins but the width of the bins. We cannot compare the two directly.
- In the original paper, FD assumed that the number of bins should be proportional to cbrt(N) (cbrt(6*gamma^(-1/3)*N), to be precise), which is essentially the Rice rule. If you really want to use the Rice rule, then we don't even need to compare it to Sturges, since it will always be bigger (and grow faster), and we won't need to calculate the IQR.
- If we really want to compare Sturges and FD, then we will need to calculate the IQR (np.diff(np.nanquantile(X, [0.25, 0.75], axis=0), axis=0) or stats.iqr(X, axis=0, nan_policy='omit')), so we will need to refactor the code (pass X to the _validate_bins function, at least). On top of that, to convert a bin width to a number of bins we will need to calculate the range np.ptp(X, axis=0), which will also introduce a dependence on the 'uniform' strategy, while KBinsDiscretizer also has 'quantile' and 'kmeans' (and I guess a spline strategy could be next). The 'uniform' strategy makes a lot of sense for plotting the data (which is what np.histogram is for) because it preserves the shape of the distribution, but it is not always used in ML (every GBM and GPU RF uses the 'quantile' strategy, for example). On top of that, FD will almost always give bigger numbers than Sturges, so we don't even need to compare to it; just max(FD, 2) would be enough.
- The IQR is robust to outliers, yes, but that's not what I said. I said it relies on the normality of the data, or at least on not-so-skewed data. In the original paper, they optimized an L2 functional, which is useful only for symmetric distributions. Optimizing L2 on exponentially distributed data won't give the expected results, for instance.
- Those papers compare different discretization strategies. In KBinsDiscretizer we have 'uniform' (equal width), 'quantile' (equal frequency) and 'kmeans'. Can you please elaborate on how the discretization strategy is connected to selecting the number of bins?
- The code was pretty simple: make_classification and make_regression to create the data, then KBinsDiscretizer + LogisticRegression/RandomForestClassifier or Ridge/RandomForestRegressor in cross_val_score with the 'r2' and 'roc_auc' metrics (sketched below).
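Roughly, that setup could be reproduced along these lines (a sketch, not the exact benchmark code that was run; dataset sizes and model settings are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
n = len(X)

# Candidate bin-count rules discussed above.
rules = {
    "sqrt": int(np.ceil(np.sqrt(n))),
    "rice": int(np.ceil(2 * np.cbrt(n))),
    "sturges": int(np.ceil(np.log2(n) + 1)),
}
for name, n_bins in rules.items():
    pipe = make_pipeline(
        KBinsDiscretizer(n_bins=n_bins, encode="onehot", strategy="quantile"),
        LogisticRegression(max_iter=1000),
    )
    score = cross_val_score(pipe, X, y, scoring="roc_auc", cv=5).mean()
    print(name, n_bins, round(score, 3))
```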
> Can you please elaborate on how the discretization strategy is connected to selecting the number of bins?
I was mistaken and the papers I listed do not apply.
On the surface this looks like a simple feature for sklearn, but we still have inclusion criteria. The only source I have that directly speaks about sturges and fd in the ML context is the https://arxiv.org/pdf/2007.07588.pdf paper, where they explicitly use fd.
To try to move this forward, I suggest:
- Changing the name from 'auto' to 'sturges'. We can not change the behavior of 'auto' without a deprecation cycle, so being more explicit would be better. (A deprecation cycle consists of giving a warning about the changing behavior for 2 releases and then changing the behavior.)
- A benchmark comparing it to the current default=5, or a paper that uses sturges in the ML context.

I suspect that comparing sturges to n_bins=5 would result in sturges being better. For simplicity, it could use make_* but with different n_samples to see how the performance compares between 5 and sturges.
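Such a comparison could look roughly like this (a sketch; the Sturges count is computed inline rather than relying on this PR's 'auto' option, and the data generator and model are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

for n_samples in (100, 1_000, 10_000):
    X, y = make_classification(n_samples=n_samples, n_features=10, random_state=0)
    sturges = int(np.ceil(np.log2(n_samples) + 1))
    for n_bins in (5, sturges):  # current default vs Sturges
        pipe = make_pipeline(
            KBinsDiscretizer(n_bins=n_bins, encode="onehot", strategy="quantile"),
            LogisticRegression(max_iter=1000),
        )
        score = cross_val_score(pipe, X, y, scoring="roc_auc", cv=5).mean()
        print(n_samples, n_bins, round(score, 3))
```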
Should be ok now
Oh, it's not merged. Should I close then?
@glevv: it just needs reviewers and members to approve and integrate your changes.
Is this idea abandoned? Shame, it would be useful :)
The speed improvement has been addressed in #19934.
Reference Issues/PRs
Fixes #19256 - Change KMeans algorithm in KBinsDiscretizer from 'auto' (elkan) to 'full'