[MRG] Efficiency updates to KBinsDiscretizer by glevv · Pull Request #19290 · scikit-learn/scikit-learn · GitHub

[MRG] Efficiency updates to KBinsDiscretizer #19290


Closed
wants to merge 79 commits into from

Conversation

glevv
Contributor
@glevv glevv commented Jan 28, 2021

Reference Issues/PRs

Fixes #19256 - Change KMeans algorithm in KBinsDiscretizer from 'auto' (elkan) to 'full'

glevv added 3 commits January 26, 2021 15:33
'auto' option added (Sturges rule);
KMeans algorithm changed from 'auto' to 'full';
Format-strings changed to f-strings.
Member
@jnothman jnothman left a comment


Thanks! Please add a test and an entry in the change log for version 1.0.

I also wonder how easily we could empirically demonstrate that this is a reasonable rule for discretisation (rather than histogramming).

@@ -126,7 +127,7 @@ class KBinsDiscretizer(TransformerMixin, BaseEstimator):
     """

     @_deprecate_positional_args
-    def __init__(self, n_bins=5, *, encode='onehot', strategy='quantile',
+    def __init__(self, n_bins='auto', *, encode='onehot', strategy='quantile',
Member


should we call it 'sturges' instead of 'auto'?

Contributor Author


When I was experimenting I called it 'auto', since I used three different rules: Sqrt, Rice, Sturges.
The same logic applies here: the formula may change in the future, and we would then have to rename the 'sturges' option to something different, which would also mean changing docs, tests, etc. Calling it 'auto' and just updating the description is more sustainable, imo.

if orig_bins == 'auto':
    # calculate the number of bins from the number of samples
    # using Sturges' rule
    orig_bins = int(np.ceil(np.log2(n_samples) + 1.))
Member


I wonder whether +1 is actually beneficial for discretisation. The number of bins looks fairly big for smallish numbers.

Contributor Author


It is needed for a very small number of samples:
log2(1) = 0 but log2(1) + 1 = 1
log2(2) = 1 but log2(2) + 1 = 2
etc.
For larger n_samples, adding 1 will not drastically change the output, so I would vote to keep it.
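
For illustration, a quick sketch (not part of the PR) of how the +1 term affects the Sturges count for a few sample sizes:

    import numpy as np

    # Sturges' rule with and without the +1 term; the +1 mainly matters for tiny n_samples
    for n_samples in (1, 2, 10, 100, 1000, 100000):
        with_plus_one = int(np.ceil(np.log2(n_samples) + 1.0))
        without = int(np.ceil(np.log2(n_samples)))
        print(f"n={n_samples}: with +1 -> {with_plus_one}, without -> {without}")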

glevv and others added 2 commits January 28, 2021 10:48
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
@glevv
Contributor Author
glevv commented Jan 28, 2021

Thanks! Please add a test and an entry in the change log for version 1.0.

You mean in v1.0.rst?

I also wonder how easily we could empirically demonstrate that this is a reasonable rule for discretisation (rather than histogramming).

The simplest explanation is that always compressing to 5 bins (the previous default) loses more and more information as the number of unique samples grows. With a formula for the number of bins we can achieve good compression while preserving more information. For a continuous feature with n_samples=1000, compressing it to 5 unique values discards a lot of information while giving great compression; compressing it to 11 unique values (log2(1000) + 1) preserves much more information and still gives good compression.
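
As a rough illustration of this trade-off (a hypothetical check, not a benchmark from this PR), one can compare how well a continuous feature survives a round trip through 5 vs. 11 quantile bins:

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer

    rng = np.random.RandomState(0)
    X = rng.normal(size=(1000, 1))

    for n_bins in (5, 11):  # 11 = ceil(log2(1000) + 1), the Sturges count
        kbd = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='quantile')
        X_binned = kbd.fit_transform(X)
        X_restored = kbd.inverse_transform(X_binned)
        mse = np.mean((X - X_restored) ** 2)
        print(f"n_bins={n_bins}: reconstruction MSE = {mse:.4f}")

The larger bin count should lose less information (lower reconstruction error) while still compressing the feature to a handful of unique values.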

@jeremiedbb
Member

Changing the default value of n_bins requires a deprecation cycle (https://scikit-learn.org/stable/developers/contributing.html#change-the-default-value-of-a-parameter)

the formula may change in the future, and we would then have to rename the 'sturges' option to something different, which would also mean changing docs, tests, etc.

I like "auto" but I'm not sure we'd change the behavior without a deprecation cycle anyway.

Member
@jeremiedbb jeremiedbb left a comment


Here are a few comments. Please also add a test to check that the FutureWarning is issued (see, for example, the existing test parametrized with @pytest.mark.parametrize("precompute_distances", ["auto", False, True])).

glevv and others added 2 commits January 29, 2021 12:43
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
@glevv
Contributor Author
glevv commented Feb 2, 2021

Still 2 tests failing

@glevv glevv changed the title Kbd changes [MRG] Auto determinations of number of bins and efficiency updates to KBinsDiscretizer Feb 2, 2021
@jeremiedbb
Member

Still 2 tests failing

All green (it was just a CI worker that failed to start)

if isinstance(orig_bins, str):
    if orig_bins == 'auto':
        # calculate number of bins with Sturges' rule
        orig_bins = max(int(np.ceil(np.log2(n_samples) + 1.)), 2)
Member


Is there an advantage of using the 'auto' behavior defined by np.histogram_bin_edges? In their case, they take the maximum of 'sturges' and 'fd'.
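
For reference, numpy's behaviour can be checked directly (a small sketch, not from the PR):

    import numpy as np

    rng = np.random.RandomState(0)
    x = rng.normal(size=1000)

    # numpy's 'auto' takes the larger of the 'sturges' and 'fd' estimates
    for bins in ('sturges', 'fd', 'auto'):
        edges = np.histogram_bin_edges(x, bins=bins)
        print(bins, len(edges) - 1)  # number of bins = number of edges - 1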

Contributor Author


I already talked about this a bit in #9337

  1. Freedman-Diaconis and Scott's rule are both based on a dispersion measure (IQR and STD respectively), so they are very dependent on the scaling of the features. Example: a feature with 1000 samples has std=1 after StandardScaler, giving a bin width of 3.49 * 1 / cbrt(1000) = 0.349; for the FD rule this number would be even smaller. The same applies to every scaling transformation. Since scaling is a very popular technique, I don't see the point in additional computations that could lead to a degenerate solution (yes, we could preemptively check every feature's std/IQR, but that would increase time and memory consumption).
    Another example would be a feature with small absolute values, like 0.01, 0.002, 0.0003, etc., where it gets even worse. Basically these rules rely on absolute measures of dispersion; normalized measures should work better (CV instead of STD, QCD instead of IQR), but that is more of a research topic than a practical one.
    And to finish the topic: all three (Sturges, FD, Scott) rely on some sort of "normality" of the distribution, which does not always hold. It would be better to combine one of them with another rule that doesn't rely on normality.
  2. It won't work as straightforwardly as max(sturges, fd); it should be at least something like max(sturges, fd, 2). We would also need to calculate the IQR of every feature in the dataset and determine n_bins per feature, instead of a unified approach. Getting that right requires a lot of tests, checks and fallbacks, while also increasing the memory and time footprint. Not that it cannot be done, it's just that it may not improve model performance on data binned in such a fashion.
  3. I tested the Rice, Sqrt and Sturges rules on some toy datasets with different types of models (linear, RF) and different tasks (classification, regression). The difference between Rice and Sqrt was non-existent, while the Sturges rule outperformed the other two. The Rice (2 * cbrt(n)) and Sqrt (sqrt(n)) rules will always give bigger values than the Sturges rule, and so should FD and Scott (or at least that's what we expect them to do), but as I mentioned above, more bins does not equal more accuracy (see the sketch after this list). Not saying that is always the case, just a thought to consider.
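
For concreteness, a small sketch (not from the PR) of how the three counting rules mentioned in item 3 grow with the sample size:

    import numpy as np

    # Sturges, Rice and Sqrt give a bin count directly from n_samples
    for n in (10, 100, 1000, 10000, 100000):
        sturges = int(np.ceil(np.log2(n) + 1))
        rice = int(np.ceil(2 * np.cbrt(n)))
        sqrt = int(np.ceil(np.sqrt(n)))
        print(f"n={n}: sturges={sturges}, rice={rice}, sqrt={sqrt}")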

Member


Thank you for your detailed thoughts!

If one were planning to discretize a feature, I do not think we would require them to scale it first. This means that, for benchmarking, I would assume the data is not scaled when comparing the different techniques.

tested the Rice, Sqrt and Sturges rules on some toy datasets with different types of models

Can you provide the code for these benchmarks? When implementing a feature without much literature backing it, I try to see what kind of benchmarks we can show to demonstrate when it is useful. There look to be benchmarks here:

but they do not use Sturges or FD. I could be missing a paper somewhere that actually looks into Sturges or FD for ML.

We would also need to calculate the IQR of every feature in the dataset and determine n_bins per feature, instead of a unified approach.

Codewise, it would be okay, since the IQR computation is np.percentile(X, [25, 75], axis=0), and we already support n_bins as an array-like.

It also looks like @amueller and @janvanrijn wrote a paper on this: https://arxiv.org/pdf/2007.07588.pdf, where they write on page 3:

The Freedman-Diaconis rule is known to be resilient to outliers. An alternative of this rule is Sturges formula [14]. The automatic histogram binning, as implemented in the python library numpy [10], uses the maximum of the Freedman-Diaconis rule and Sturges formula. However, Sturges formula assumes that the data follows a Gaussian distribution. Since we have no reason to assume that this holds for our data, we only use the Freedman-Diaconis rule.

Contributor Author
@glevv glevv Feb 6, 2021


My bad. The FD rule suggests not the number of bins but the width of the bins; we cannot compare the two directly.

  1. In the original paper, FD assumed that the number of bins should be proportional to cbrt(N) (cbrt(6 * gamma^(-1/3) * N), to be precise), which is essentially the Rice rule. If we really want to use the Rice rule, then we don't even need to compare it to Sturges, since it will always be bigger (and grow faster), and we won't need to calculate the IQR.
  2. If we really want to compare Sturges and FD, then we will need to calculate the IQR (np.diff(np.nanquantile(X, [0.25, 0.75], axis=0), axis=0) or stats.iqr(X, axis=0, nan_policy='omit')), so we will need to refactor the code (at least pass X to the _validate_bins function). On top of that, to convert the bin width into a number of bins we will need to calculate the range, np.ptp(X, axis=0), which also introduces a dependence on the 'uniform' strategy, while KBinsDiscretizer also has 'quantile' and 'kmeans' (and I guess a spline strategy could be next); see the sketch after this list. The 'uniform' strategy makes a lot of sense for plotting the data (which is what np.histogram is for) because it preserves the shape of the distribution, but it is not always used in ML (every GBM and GPU RF uses the 'quantile' strategy, for example). On top of that, FD will almost always give bigger numbers than Sturges, so we don't even need to compare against it; just max(FD, 2) would be enough.
  3. The IQR is robust to outliers, yes, but that's not what I said. I said it relies on normality of the data, or at least on data that is not too skewed. In the original paper, they optimized an L2 functional, which is useful only for symmetric distributions; optimizing L2 on exponentially distributed data won't give the expected results, for instance.
  4. Those papers compare different discretization strategies. In KBinsDiscretizer we have 'uniform' (equal width), 'quantile' (equal frequency) and 'kmeans'. Can you please elaborate on how the discretization strategy is connected to selecting the number of bins?
  5. The code was pretty simple: make_classification and make_regression to create the data, then KBinsDiscretizer + LogisticRegression/RandomForestClassifier or Ridge/RandomForestRegressor in cross_val_score with the R2 and ROC AUC metrics.
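
To make item 2 concrete, here is a hypothetical sketch of the per-feature FD conversion (the array-like n_bins is existing KBinsDiscretizer API; the rest is only an illustration, not code from this PR):

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer

    rng = np.random.RandomState(0)
    X = rng.lognormal(size=(1000, 3))
    n_samples = X.shape[0]

    # per-feature IQR, then the Freedman-Diaconis bin *width*
    iqr = np.diff(np.nanquantile(X, [0.25, 0.75], axis=0), axis=0).ravel()
    width = 2.0 * iqr * n_samples ** (-1.0 / 3.0)

    # convert the width into a per-feature bin *count* via the range, with a floor of 2
    n_bins = np.maximum(np.ceil(np.ptp(X, axis=0) / width).astype(int), 2)

    kbd = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='uniform')
    X_binned = kbd.fit_transform(X)
    print(n_bins)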

Member


Can you please elaborate on how the discretization strategy is connected to selecting the number of bins?

I was mistaken and the papers I listed do not apply.

On the surface this looks like a simple feature to add to sklearn, but we still have inclusion criteria. The only source I have that directly speaks about Sturges and FD in the ML context is the https://arxiv.org/pdf/2007.07588.pdf paper, where they explicitly use FD.

To try to move this forward, I suggest:

  1. Changing the name from 'auto' to 'sturges'. We cannot change the behavior of 'auto' without a deprecation cycle, so being more explicit would be better. (A deprecation cycle consists of warning about the behavior change for 2 releases and then changing the behavior.)
  2. Benchmark it against the current default of 5, or find a paper that uses Sturges in the ML context.

I suspect that comparing Sturges to n_bins=5 would show Sturges being better. For simplicity, the benchmark could use make_* with different n_samples to see how the performance compares between 5 and Sturges; a possible setup is sketched below.
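
A possible version of that benchmark (a minimal sketch assuming make_classification data and the quantile-binning + LogisticRegression setup described above; not code from this PR):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import KBinsDiscretizer

    for n_samples in (200, 1000, 5000):
        X, y = make_classification(n_samples=n_samples, n_features=10,
                                   n_informative=5, random_state=0)
        sturges = max(int(np.ceil(np.log2(n_samples) + 1)), 2)
        for n_bins in (5, sturges):
            pipe = make_pipeline(
                KBinsDiscretizer(n_bins=n_bins, encode='onehot', strategy='quantile'),
                LogisticRegression(max_iter=1000),
            )
            score = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc').mean()
            print(f"n_samples={n_samples}, n_bins={n_bins}: ROC AUC = {score:.3f}")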

@glevv glevv changed the title [MRG] Auto determinations of number of bins and efficiency updates to KBinsDiscretizer [WIP] Auto determinations of number of bins and efficiency updates to KBinsDiscretizer Feb 7, 2021
@glevv glevv changed the title [WIP] Auto determinations of number of bins and efficiency updates to KBinsDiscretizer [WIP] Efficiency updates to KBinsDiscretizer Feb 8, 2021
@glevv glevv changed the title [WIP] Efficiency updates to KBinsDiscretizer [MRG] Efficiency updates to KBinsDiscretizer Feb 8, 2021
@glevv
Contributor Author
glevv commented Feb 9, 2021

Should be ok now

@glevv
Contributor Author
glevv commented Apr 19, 2021

Oh, it's not merged. Should I close then?

@jjerphan
Member

@glevv: it just needs reviewers and members to approve and integrate your changes.
There's a final comment from @jnothman that you need to address. Your changes to message formatting are relevant but should probably be introduced in another PR.

@glevv glevv closed this Apr 20, 2021
@ghost
ghost commented Jul 23, 2021

Is this idea abandoned? Shame, it would be useful :)

@jeremiedbb
Member
jeremiedbb commented Jul 23, 2021

The speed improvement has been addressed in #19934.
It does not seem that there is a consensus on the bin strategy, so this is still on hold.

Successfully merging this pull request may close these issues.

Change KMeans algorithm for KBinsDiscretizer from 'elkan' (default) to 'full'