[MRG] Efficiency updates to KBinsDiscretizer #19290
Conversation
'auto' option added (Sturges rule); KMeans algorithm changed from 'auto' to 'full'; format strings changed to f-strings.
Thanks! Please add a test and an entry in the change log for version 1.0.
I also wonder how easily we could empirically demonstrate that this is a reasonable rule for discretisation (rather than histogramming).
@@ -126,7 +127,7 @@ class KBinsDiscretizer(TransformerMixin, BaseEstimator):
     """

     @_deprecate_positional_args
-    def __init__(self, n_bins=5, *, encode='onehot', strategy='quantile',
+    def __init__(self, n_bins='auto', *, encode='onehot', strategy='quantile',
should we call it 'sturges' instead of 'auto'?
When I was experimenting I called it 'auto', since I used three different rules: Sqrt, Rice, Sturges.
The same logic applies here: maybe in the future the formula will be changed, and we would have to rename the 'sturges' option to something different, which could also mean changing docs, tests, etc. Calling it 'auto' and just updating the description is more sustainable imo.
if orig_bins == 'auto':
    # calculate number of bins
    # depending on number of samples with Sturges rule
    orig_bins = int(np.ceil(np.log2(n_samples) + 1.))
I wonder whether +1 is actually beneficial for discretisation. The number of bins looks fairly big for smallish numbers.
It is needed for a very small number of samples:
log2(1) = 0 but log2(1) + 1 = 1
log2(2) = 1 but log2(2) + 1 = 2
etc.
For larger n_samples, adding 1 will not drastically change the output, so I would vote to keep it.
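For reference, here is how the formula in this diff behaves for a few sample sizes (values computed directly from the expression above):

```python
import numpy as np

# Sturges rule as used in this PR: n_bins = ceil(log2(n_samples) + 1)
for n_samples in (1, 2, 10, 100, 1_000, 100_000):
    n_bins = int(np.ceil(np.log2(n_samples) + 1.0))
    print(f"{n_samples:>7} samples -> {n_bins:>2} bins")
# 1 -> 1, 2 -> 2, 10 -> 5, 100 -> 8, 1000 -> 11, 100000 -> 18
```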
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
You mean in
The simplest explanation is that always compressing to 5 bins (the previous default) loses more and more information the more unique samples we get. With different formulas for binning we can achieve good compression and preserve more information. For a continuous feature with n_samples=1000, compressing it to 5 unique values cuts a lot of information while giving great compression; compressing it to 11 (the Sturges value for n_samples=1000) keeps much more of it while still compressing well.
Changing the default value of
I like "auto" but I'm not sure we'd change the behavior without a deprecation cycle anyway.
Here are a few comments. Please also add a test to check that the FutureWarning is issued; an example is here:
@pytest.mark.parametrize("precompute_distances", ["auto", False, True])
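A minimal sketch of such a test, assuming the FutureWarning fires when n_bins is left at its default and that its message mentions n_bins (both are assumptions about this PR's final wording, not the actual test):

```python
import numpy as np
import pytest

from sklearn.preprocessing import KBinsDiscretizer


def test_n_bins_default_future_warning():
    # Assumed behavior: leaving n_bins at its default triggers a FutureWarning
    # announcing the upcoming change of the default from 5 to 'auto'.
    X = np.random.RandomState(0).rand(20, 2)
    with pytest.warns(FutureWarning, match="n_bins"):
        KBinsDiscretizer(encode="ordinal").fit(X)
```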
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Kbd changes
Still 2 tests failing
All green (it was just a CI worker that failed to start)
if isinstance(orig_bins, str):
    if orig_bins == 'auto':
        # calculate number of bins with Sturges rule
        orig_bins = max(int(np.ceil(np.log2(n_samples) + 1.)), 2)
Is there an advantage of using the 'auto' behavior defined by np.histogram_bin_edges? In their case, they take the maximum of 'sturges' and 'fd'.
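NumPy's behavior can be checked directly; a small illustrative comparison on a random Gaussian sample (numbers will vary with the data):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(size=1000)

for rule in ("sturges", "fd", "auto"):
    edges = np.histogram_bin_edges(x, bins=rule)
    print(rule, len(edges) - 1)
# 'auto' picks whichever of 'sturges'/'fd' gives more bins (the smaller width).
```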
I already talked about this a bit in #9337
- Freedman-Diaconis and Scott's rules are both based on a variance measure (IQR and STD), so they are very dependent on the scaling of the features (see the sketch after this list). Example: a feature with 1000 samples will have std=1 after StandardScaler, so 3.49*1/cbrt(1000) = 0.349, and for the FD rule this number would be even smaller. The same applies to every scaling transformation. Since scaling is a very popular technique, I don't see the point in additional computations that could lead to a degenerate solution (yes, we could preemptively check every feature's std/IQR, but that would increase time and memory consumption). Another example would be a feature with small absolute values, like 0.01, 0.002, 0.0003, etc.; in that case it would be even worse. Basically these rules rely on the absolute value of the variance. Normalized measures should work better (CV instead of STD, QCD instead of IQR), but this is more of a research than a practice topic. And to finish the topic: all three (Sturges, FD, Scott) rely on some sort of "normality" of the distribution, which is not always the case. It would be better to mix one of them with some other rule that doesn't rely on normality.
- It won't work as straightforwardly as max(sturges, fd). It should be at least something like max(sturges, fd, 2). We would also need to calculate the IQR for every feature in the dataset and determine n_bins per feature, instead of a unified approach. It would require a lot of tests, checks and fallbacks to get right, while also increasing the memory and time footprint. Not that it cannot be done, it's just that it may not improve model performance on data binned in such a fashion.
- I tested the Rice, Sqrt and Sturges rules on some toy datasets with different types of models (linear, RF) and different tasks (classification, regression). The difference between Rice and Sqrt was non-existent, while the Sturges rule outperformed the other two. The Rice (2*cbrt(n)) and Sqrt (sqrt(n)) rules will always give bigger values than the Sturges rule, and so would FD and Scott's (or at least that is what we expect them to do), but as I mentioned above, more bins does not equal more accuracy. Not saying that it is always the case, just a thought to consider.
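To make the arithmetic in the first point concrete, a minimal sketch (Scott's rule suggests a bin width, not a count; the 3.49 constant is the one used in the example above):

```python
import numpy as np

# Scott's rule bin width: h = 3.49 * std / cbrt(n_samples).
# After StandardScaler the std is 1, so the suggested width shrinks with n only:
n_samples, std = 1000, 1.0
print(3.49 * std / np.cbrt(n_samples))  # ~0.349
```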
Thank you for your detailed thoughts!
If one were planning to discretize a feature, I do not think we would require them to scale it first. This means that, for benchmarking, I would assume that the data is not scaled when comparing the different techniques.
> I tested the Rice, Sqrt and Sturges rules on some toy datasets with different types of models
Can you provide the code for these benchmarks? When implementing a feature without much literature backing it, I tend to try to see what kind of benchmarks we can show to see when this is useful. There appear to be benchmarks here:
but they do not use sturges or fd. I could be missing a paper somewhere that actually looks into sturges or fd for ML.
> We would also need to calculate the IQR for every feature in the dataset and determine n_bins per feature, instead of a unified approach.
Codewise, it would be okay, since the IQR computation is np.quantile(X, [0.25, 0.75], axis=0), and we already support n_bins as an array-like.
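A rough sketch of what that per-feature computation could look like end to end (illustrative only, not code from this PR; the conversion from FD width to a bin count via the feature range, and the floor of 2 bins, are assumptions taken from this discussion):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(0)
X = rng.rand(1000, 3)
n_samples = X.shape[0]

# Per-feature IQR and the Freedman-Diaconis bin width 2 * IQR / cbrt(n).
q25, q75 = np.quantile(X, [0.25, 0.75], axis=0)
fd_width = 2 * (q75 - q25) / np.cbrt(n_samples)

# Convert each width into a bin count, with a floor of 2 bins per feature.
n_bins = np.maximum(np.ceil(np.ptp(X, axis=0) / fd_width).astype(int), 2)

Xt = KBinsDiscretizer(n_bins=n_bins, encode="ordinal").fit_transform(X)
print(n_bins, Xt.shape)
```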
Also looks like @amueller and @janvanrijn wrote a paper on this here: https://arxiv.org/pdf/2007.07588.pdf where they wrote on page 3:
> The Freedman-Diaconis rule is known to be resilient to outliers. An alternative of this rule is Sturges formula [14]. The automatic histogram binning, as implemented in the python library numpy [10], uses the maximum of the Freedman-Diaconis rule and Sturges formula. However, Sturges formula assumes that the data follows a Gaussian distribution. Since we have no reason to assume that this holds for our data, we only use the Freedman-Diaconis rule.
My bad. The FD rule suggests not the number of bins but the width of the bins. We cannot compare the two directly.
- In the original paper, FD assumed that the number of bins should be proportional to cbrt(N) (cbrt(6*gamma^(-1/3)*N), to be precise), which is essentially the Rice rule. If you really want to use the Rice rule, then we don't even need to compare it to Sturges, since it will always be bigger (and grow faster), and we won't need to calculate the IQR.
- If we really want to compare Sturges and FD, then we will need to calculate the IQR (np.diff(np.nanquantile(X, [0.25, 0.75], axis=0), axis=0) or stats.iqr(X, axis=0, nan_policy='omit')), so we will need to refactor the code (pass X to the _validate_bins function, at least). On top of that, to convert a bin width to a number of bins we will need to calculate the range np.ptp(X, axis=0), which will also introduce a dependence on the 'uniform' strategy, while KBinsDiscretizer also has 'quantile' and 'kmeans' (and I guess a spline strategy could be next). The 'uniform' strategy makes a lot of sense for plotting the data (which is what np.histogram is for) because it preserves the shape of the distribution, but it is not always used in ML (every GBM and GPU RF uses the 'quantile' strategy, for example). On top of that, FD will almost always give bigger numbers than Sturges, so we don't even need to compare to it; just max(FD, 2) would be enough.
- The IQR is robust to outliers, yes, but that's not what I said. I said it relies on the normality of the data, or at least on not-so-skewed data. In the original paper, they optimized an L2 functional, which is useful only for symmetric distributions. Optimizing L2 on exponentially distributed data won't give the expected results, for instance.
- Those papers compare different discretization strategies. In KBinsDiscretizer we have 'uniform' (equal width), 'quantile' (equal frequency) and 'kmeans'. Can you please elaborate on how the discretization strategy is connected to selecting the number of bins?
- The code was pretty simple: make_classification and make_regression to create the data, then KBinsDiscretizer + LogisticRegression/RandomForestClassifier or Ridge/RandomForestRegressor in cross_val_score with the 'r2' and 'roc_auc' metrics (sketched below).
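Roughly, that setup could be reproduced along these lines (a sketch, not the exact benchmark code that was run; dataset sizes and model settings are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
n = len(X)

# Candidate bin-count rules discussed above.
rules = {
    "sqrt": int(np.ceil(np.sqrt(n))),
    "rice": int(np.ceil(2 * np.cbrt(n))),
    "sturges": int(np.ceil(np.log2(n) + 1)),
}
for name, n_bins in rules.items():
    pipe = make_pipeline(
        KBinsDiscretizer(n_bins=n_bins, encode="onehot", strategy="quantile"),
        LogisticRegression(max_iter=1000),
    )
    score = cross_val_score(pipe, X, y, scoring="roc_auc", cv=5).mean()
    print(name, n_bins, round(score, 3))
```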
> Can you please elaborate on how the discretization strategy is connected to selecting the number of bins?
I was mistaken and the papers I listed do not apply.
On the surface this looks like a simple feature for sklearn, but we still have inclusion criteria. The only source I have that directly speaks about sturges and fd in the ML context is the https://arxiv.org/pdf/2007.07588.pdf paper, where they explicitly use fd.
To try to move this forward, I suggest:
- Changing the name from 'auto' to 'sturges'. We can not change the behavior of 'auto' without a deprecation cycle, so being more explicit would be better. (A deprecation cycle consists of giving a warning about the changing behavior for 2 releases and then changing the behavior.)
- A benchmark comparing it to the current default=5, or a paper that uses sturges in the ML context.

I suspect that comparing sturges to n_bins=5 would result in sturges being better. For simplicity, it could use make_* but with different n_samples to see how the performance compares between 5 and sturges.
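Such a comparison could look roughly like this (a sketch; the Sturges count is computed inline rather than relying on this PR's 'auto' option, and the data generator and model are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

for n_samples in (100, 1_000, 10_000):
    X, y = make_classification(n_samples=n_samples, n_features=10, random_state=0)
    sturges = int(np.ceil(np.log2(n_samples) + 1))
    for n_bins in (5, sturges):  # current default vs Sturges
        pipe = make_pipeline(
            KBinsDiscretizer(n_bins=n_bins, encode="onehot", strategy="quantile"),
            LogisticRegression(max_iter=1000),
        )
        score = cross_val_score(pipe, X, y, scoring="roc_auc", cv=5).mean()
        print(n_samples, n_bins, round(score, 3))
```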
Should be ok now
Oh, it's not merged. Should I close then?
@glevv: it just needs reviewers and members to approve and integrate your changes.
Is this idea abandoned? Shame, it would be useful :)
The speed improvement has been addressed in #19934.
Reference Issues/PRs
Fixes #19256 - Change KMeans algorithm in KBinsDiscretizer from 'auto' (elkan) to 'full'