QuantileTransformer quantiles can be unordered because of rounding errors which cause np.interp to return nonsense results · Issue #15733 · scikit-learn/scikit-learn

Closed
fcharras opened this issue Nov 28, 2019 · 12 comments · Fixed by #15751

@fcharras
Contributor
fcharras commented Nov 28, 2019

Description

At inference time, QuantileTransformer uses np.interp. The documentation of that function states: "Does not check that the x-coordinate sequence xp is increasing. If xp is not increasing, the results are nonsense." Within QuantileTransformer, the xp values are quantiles. To ensure that np.interp behaves correctly, the quantiles stored in self.quantiles_ must be ordered, i.e. np.all(np.diff(self.quantiles_, axis=0) >= 0) must hold true.

I've found that, because of rounding errors, this sometimes does not hold. It is a serious issue because it causes inference to behave erratically (for instance, a sample will not be transformed the same way depending on its position within the input); it is very confusing and very hard to debug.
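To illustrate the failure mode, here is a toy example (the arrays are made up for illustration, not real fitted quantiles): np.interp raises no error on an unordered xp, so the only defense is the monotonicity check described above.

```python
import numpy as np

# A properly sorted knot sequence interpolates as expected.
xp_sorted = np.array([0.0, 1.0, 2.0, 3.0])
fp = np.array([0.0, 10.0, 20.0, 100.0])
print(np.interp(1.5, xp_sorted, fp))  # 15.0, as expected

# Swap two knots: np.interp still runs without any warning or error,
# but its result is unreliable ("nonsense", per the NumPy docs).
xp_unsorted = np.array([0.0, 2.0, 1.0, 3.0])
print(np.interp(1.5, xp_unsorted, fp))  # no error raised, value unreliable

# The monotonicity check from above detects the problem:
print(np.all(np.diff(xp_sorted) >= 0))    # True
print(np.all(np.diff(xp_unsorted) >= 0))  # False
```

This is why a tiny inversion in self.quantiles_ can silently corrupt transformed values instead of failing loudly.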

Steps/Code to Reproduce

Finding a minimal example is really hard. I will provide an example that I managed to isolate and that reproduces the issue with 100% reproducibility; however, since it happens because of a very tiny rounding error and this feature makes use of randomness (for sampling), I hope it is not hardware-dependent.

Here is a gist that defines an array of shape (300, 2); I can reproduce the bug with the following code:

import numpy as np
from sklearn.preprocessing import QuantileTransformer
X = np.loadtxt('gistfile1.txt', delimiter=',')
quantile_transformer = QuantileTransformer(n_quantiles=150).fit(X)
print(np.all(np.diff(quantile_transformer.quantiles_, axis=0) >= 0))

Expected Results

The previous code should print True

Actual Results

It prints False

Versions

I have taken note of the QuantileTransformer fixes in 0.21.3 (ensuring that n_quantiles <= n_samples) and I have already checked that this issue is unrelated. As the minimal example shows, the input has 300 samples and n_quantiles is set to 150 anyway.

[GCC 5.4.0 20160609]
NumPy 1.15.4
SciPy 1.3.3
Scikit-Learn 0.19.2

Quickfix

I haven't investigated deeply enough to understand the cause of the rounding error. Here is a suggestion for a quick, dirty fix for anyone who meets the same issue: if the quantiles are unordered, replace them with something like np.minimum.accumulate(quantile_transformer.quantiles_[::-1])[::-1] (I think it's better than forcing a sort).
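As a sketch of that workaround on a toy 1-D array (the values are illustrative, standing in for a column of self.quantiles_ with a rounding-induced inversion):

```python
import numpy as np

# Toy per-feature quantiles with a small inversion (3.0 followed by 2.0).
quantiles = np.array([1.0, 3.0, 2.0, 4.0])
print(np.all(np.diff(quantiles) >= 0))  # False: unordered

# The suggested fix: a reversed running minimum. It lowers the earlier
# offending values instead of reordering them, so the result is
# non-decreasing without sorting.
fixed = np.minimum.accumulate(quantiles[::-1])[::-1]
print(fixed)                            # [1. 2. 2. 4.]
print(np.all(np.diff(fixed) >= 0))      # True

# The alternative np.maximum.accumulate(quantiles) would instead raise
# the later values, giving [1. 3. 3. 4.]; both restore monotonicity.
```

Either accumulate variant only nudges the few values involved in the inversion, which is why it is arguably gentler than a full sort.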

@tirthasheshpatel
Contributor

This issue is closely related to #11000.
I figured out that the error was due to the float32 dtype. It can be worked around by casting the data to np.float64 with X.astype(np.float64).
Alternatively, a dtype parameter could be added to the QuantileTransformer class to support the np.float64 datatype.

@glemaitre
Member
glemaitre commented Dec 2, 2019

I figured out that the error was due to dtype float32. This can be fixed by transforming the data to np.float64 by X.astype(np.float64).

I am not sure about your diagnosis. Taking the first column of the data in the gist file and computing only np.percentile and np.nanpercentile reveals the issue with float64 as well.

import numpy as np
from sklearn.preprocessing import QuantileTransformer
X = np.loadtxt('tmp.txt', delimiter=',')
quantile_transformer = QuantileTransformer(n_quantiles=150).fit(X)
xx = np.percentile(X[:, 0], quantile_transformer.references_*100)
print(np.all(np.diff(xx) >= 0))
False

@tirthasheshpatel
Contributor
tirthasheshpatel commented Dec 2, 2019

@glemaitre
It turns out that self.references_ must also be cast to np.float64.

@glemaitre
Member

quantile_transformer.references_*100 is already float64 in the above snippet.

@NicolasHug
Member

@fcharras can you reproduce the issue with simply calling np.nanpercentile?

If so, I would assume this is a bug in NumPy.

@glemaitre
Member

@NicolasHug I tried both nanpercentile and percentile and I can reproduce the issue with each. I am currently trying to debug the NumPy implementation.

@glemaitre
Member

As a note, we should fix the code upstream and import the fix into scikit-learn instead of applying a hotfix. We could consider a hotfix if the upstream bug turns out to be complicated to solve quickly.

@NicolasHug
Member

Agreed, though the fix proposed in #15751 seems simple enough that it could maybe be considered for inclusion.

@fcharras
Contributor Author
fcharras commented Dec 2, 2019

@tirthasheshpatel be careful: the issue is not easily reproducible. For example, if you randomly remove a few rows from the gist I provided, it is fine most of the time. So I'd be cautious about calling this bug fixed without a more extensive test suite. I think the quickfix ensures that the most harmful part of this bug is dealt with.

In my opinion it is dangerous that np.interp assumes the input is sorted without checking it. I guess this is for performance, but in the case of QuantileTransformer the check could be added anyway, since the performance stakes are low (it can be checked just once during fit, in linear time with respect to the number of quantiles). If the check does not pass, I'd recommend, if not applying this quickfix, at least raising a RuntimeError (because this issue can silently cause disastrous outcomes).
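A minimal sketch of such a fit-time guard (the function name check_quantiles_sorted is hypothetical, not a scikit-learn API; it just illustrates the linear-time check and RuntimeError suggested above):

```python
import numpy as np

def check_quantiles_sorted(quantiles):
    # Hypothetical linear-time guard that could run once at the end of
    # fit(): raise loudly instead of letting np.interp silently misbehave.
    if not np.all(np.diff(quantiles, axis=0) >= 0):
        raise RuntimeError(
            "Fitted quantiles are not monotonically non-decreasing; "
            "np.interp would silently return incorrect results."
        )

# (n_quantiles, n_features) arrays: one valid, one with an inversion.
ok = np.array([[0.0, 1.0], [0.5, 2.0], [1.0, 3.0]])
check_quantiles_sorted(ok)  # passes silently

bad = np.array([[0.0, 1.0], [0.5, 2.0], [0.4, 3.0]])  # column 0 decreases
try:
    check_quantiles_sorted(bad)
except RuntimeError as err:
    print("caught:", err)
```

Since the check runs once per fit over n_quantiles values, its cost is negligible next to computing the percentiles themselves.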

@glemaitre
Member

So the error is due to a floating-point precision error. The relevant lines are here

Since the quantiles are computed with a weighted scheme, we can get a precision error. I think it is difficult to change the code without breaking the vectorization. I am thinking that np.maximum.accumulate might be a nice fix directly in NumPy.

@fcharras Do you want to propose this fix in NumPy yourself and see what the NumPy developers think? If you are short on time, I can take care of it and loop in NumPy.

@fcharras
Contributor Author
fcharras commented Dec 3, 2019

I'll keep an eye on the issue, but I am indeed short on time, sorry. It would be great if you could take it forward.

@glemaitre
Member

A potential fix was submitted here: numpy/numpy#15098
