QuantileTransformer quantiles can be unordered because of rounding errors which cause np.interp to return nonsense results #15733
Comments
This issue is closely related to #11000.
I am not sure about your diagnostic. Taking the first column of the data in the gist file and computing only the percentiles:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

X = np.loadtxt('tmp.txt', delimiter=',')
quantile_transformer = QuantileTransformer(n_quantiles=150).fit(X)
xx = np.percentile(X[:, 0], quantile_transformer.references_ * 100)
print(np.all(np.diff(xx) >= 0))
```

prints:

```
False
```
@glemaitre
@fcharras can you reproduce the issue by simply calling `np.nanpercentile`? If so, I would assume this is a bug in numpy.
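A minimal sketch of such a check, bypassing scikit-learn entirely (the `tmp.txt` filename and the evenly spaced reference grid mirror the snippet above and are assumptions, not part of the original comment):

```python
import numpy as np

# Load the same (300, 2) array from the gist, assumed saved as 'tmp.txt'.
X = np.loadtxt('tmp.txt', delimiter=',')

# QuantileTransformer(n_quantiles=150) uses an evenly spaced reference grid;
# np.linspace(0, 1, 150) is assumed here to match references_.
references = np.linspace(0, 1, 150)

q = np.nanpercentile(X[:, 0], references * 100)
# False here would point at numpy's percentile computation rather than
# scikit-learn's bookkeeping.
print(np.all(np.diff(q) >= 0))
```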
@NicolasHug I'll try.
As a note, we need to fix the code upstream and import the fix into scikit-learn instead of doing a hotfix. We could consider a hotfix if the bug upstream turns out to be too complicated to solve quickly.
Agreed, though the fix proposed in #15751 seems simple enough that it could perhaps be considered for inclusion.
@tirthasheshpatel be careful: the issue is not easily reproducible. For example, if you randomly remove a few rows from the gist I provided, it is fine most of the time, so I'd be cautious about calling this bug fixed without a more extensive test suite. I think the quickfix ensures that the most harmful part of this bug is dealt with. In my opinion it is dangerous that
So the error is due to some floating-point precision error. The relevant lines are here. Since the quantiles are computed with some weighted scheme, we can get a precision error. I think it is difficult to change the code without breaking the vectorization. I am thinking that the

@fcharras Do you want to propose this fix in NumPy yourself and see what the developers from NumPy think? If you are short on time, I can take care of it and loop in NumPy.
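As an illustration of how such a precision error can surface, here is a purely illustrative sketch that scans random data for a violation; finding one is not guaranteed on every run or platform, and the lognormal data is an assumption, not the gist data:

```python
import numpy as np

# np.percentile (linear interpolation) combines two order statistics with
# a weight derived from the requested percentile; each such expression is
# rounded independently, which is where a tiny non-monotonic step can appear.
rng = np.random.RandomState(0)
percentiles = np.linspace(0, 100, 150)

for it in range(1000):
    x = rng.lognormal(size=300)
    q = np.percentile(x, percentiles)
    if not np.all(np.diff(q) >= 0):
        print("non-monotonic quantile sequence found at iteration", it)
        break
else:
    print("no violation found in this scan")
```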
I'll keep an eye on the issue, but I am indeed short on time, sorry. It would be awesome if you could loop in NumPy.
A potential fix was submitted upstream: numpy/numpy#15098
Description
At inference time, `QuantileTransformer` uses `np.interp`. The documentation of this function states:

> Does not check that the x-coordinate sequence xp is increasing. If xp is not increasing, the results are nonsense.

Within `QuantileTransformer`, `xp` are the quantiles. To ensure that `np.interp` behaves correctly, we must ensure that the quantiles stored in `self.quantiles_` are ordered, i.e. that `np.all(np.diff(self.quantiles_, axis=0) >= 0)` holds true.

I've found that, because of rounding errors, sometimes this does not hold. It is actually a very big issue because it causes inference to behave very erratically (for instance, a sample will not be transformed the same way depending on its position within the input); it is very confusing and very hard to debug.
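To make the quoted caveat concrete, here is a tiny standalone illustration (the toy values are chosen for this sketch and are not taken from `QuantileTransformer`):

```python
import numpy as np

# Toy quantile-like grid with a rounding-sized downward step at index 2,
# which violates np.interp's precondition that xp be increasing.
xp = np.array([0.0, 1.0, 1.0 - 1e-12, 2.0])
fp = np.array([0.0, 1.0, 1.5, 2.0])

print(np.all(np.diff(xp) >= 0))   # False: the same check described above
x = np.linspace(0.0, 2.0, 9)
print(np.interp(x, xp, fp))       # per the docs, these results are nonsense
```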
Steps/Code to Reproduce
Finding a minimal example is really hard. I will provide an example I've managed to isolate that reproduces the issue with 100% reproducibility; however, since it happens because of a very tiny rounding error and this feature makes use of randomness (for sampling), I hope it is not hardware-dependent.
Here is a gist that defines an array of size `(300, 2)`; I can reproduce the bug with the following code:
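A minimal sketch of such a reproduction, based on the check described above (the `tmp.txt` filename standing in for the gist data is an assumption):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# 'tmp.txt' stands in for the (300, 2) array from the gist.
X = np.loadtxt('tmp.txt', delimiter=',')

quantile_transformer = QuantileTransformer(n_quantiles=150).fit(X)

# The fitted quantiles must be non-decreasing along axis 0 for np.interp
# to behave correctly at transform time.
print(np.all(np.diff(quantile_transformer.quantiles_, axis=0) >= 0))
```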
Expected Results

The previous code should print `True`.
Actual Results
It prints `False`.
Versions
I have taken note of the fixes to `QuantileTransformer` in 21.3 (ensuring that `n_quantiles <= n_samples`) and I have already checked that this issue is unrelated: as can be seen in the minimal example, the input has 300 samples and the parameter `n_quantiles` is set to 150 anyway.

Quickfix
I haven't investigated more deeply to understand the cause of the rounding error. Here is a suggestion of a quick, dirty fix for anyone who runs into the same issue: if `quantiles_` is unordered, replace it with something like `np.minimum.accumulate(quantile_transformer.quantiles_[::-1])[::-1]` (I think this is better than forcing a sort).
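A sketch of how that patch-up could be applied to a fitted transformer (the variable `quantile_transformer` is assumed to come from the reproduction snippet above; this is the reporter's suggested workaround, not an official fix):

```python
import numpy as np

# Enforce a non-decreasing quantile grid per feature (axis 0) by taking a
# running minimum from the right; only the offending entries are touched,
# unlike a full sort.
q = quantile_transformer.quantiles_
quantile_transformer.quantiles_ = np.minimum.accumulate(q[::-1])[::-1]

# The monotonicity check from the description should now pass.
print(np.all(np.diff(quantile_transformer.quantiles_, axis=0) >= 0))
```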