-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
TST Work-around for Ubuntu atlas float32 failure #24198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
In the last commit, I can reproduce locally the Debian 32 bit failure on |
In the last commit the build pass for all the seeds in the Debian 32bit build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to extract a minimal reproducer to report the bug upstream but in the mean time, +1 for this workaround.
Actually I re-read the linked comment and there is probably no bug to report to SciPy / LAPACK. Maybe our FastICA code would benefit from a special if condition to avoid dividing by zero?
# XXX we generate noise even when add_noise=False to avoid a rare case | ||
# where the test fails for float32. For more details see: | ||
# https://github.com/scikit-learn/scikit-learn/issues/24131#issuecomment-1208091119 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# XXX we generate noise even when add_noise=False to avoid a rare case | |
# where the test fails for float32. For more details see: | |
# https://github.com/scikit-learn/scikit-learn/issues/24131#issuecomment-1208091119 | |
# XXX we consume the RNG even when add_noise=False to avoid a rare case | |
# where the test fails for float32 with Atlas. For more details see: | |
# https://github.com/scikit-learn/scikit-learn/issues/24131#issuecomment-1208091119 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing my review state because we should probably attempt to fix the division by zero instead of using a workaround in the tests.
We already observed numerical instabilities when reviewing #22806 def _sym_decorrelation(W):
"""Symmetric decorrelation
i.e. W <- (W * W.T) ^{-1/2} * W
"""
s, u = linalg.eigh(np.dot(W, W.T))
# Avoid sqrt of negative values because of rounding errors. Note that #
# np.sqrt(tiny) is larger than tiny and therefore this clipping also #
# prevents division by zero in the next step. #
s = np.clip(s, a_min=np.finfo(W.dtype).tiny, a_max=None) # <--
# u (resp. s) contains the eigenvectors (resp. square roots of
# the eigenvalues) of W * W.T
return np.linalg.multi_dot([u * (1.0 / np.sqrt(s)), u.T, W]) There's also this old issue that already mentions the instabilities in fastica #2735 and it looks to be the same division by 0 issue from |
For reference, I tried to run It could be a numerical stability issue in Atlas itself. Still it would be interesting to check if a revived SVD solver as in #2738 makes this test stable on Atlas. |
Does this PR completes or adapts #24244 to resolve |
#24244 was an easy fix to fix some of the failures by using a higher tolerance for float32 tests. This PR was an hacky and questionable attempt to hide #24131 (comment) under the carpet. To be honest if we want a quick fix to not see the CI red from time to time, we can always skip the test for Atlas and the single global random seed for which it fails. Looks like the consensus instead is to try to fix the fastica algorithm. Full disclosure: I am not planning to work on this in the near future. |
Thanks for providing context. I think |
We could very likely From what I could quickly gather, it does not seem like there is a reliable way of using |
Do you think |
I think
But using
|
ATLAS is not supported by threadpoolctl because it doesn't use multithreading.
Can you post the full result of ldd ? and ideally resolve the symlinks ? |
|
test_fastica_simple
7d2ede5
to
73a418b
Compare
test_fastica_simple
test_fastica_simple
hum, maybe libopenblas is linked to another .so because threadpoolctl only looks for |
The work-around works locally (when setting DISTRIB=ubuntu), it seems like it works in the CI too, e.g. this Ubuntu Atlas build shows a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with that until someone is motivated to take a look at fastica :)
Thanks I'll create an issue about it when I get the time. |
Done in #25038 |
This fixes the Ubuntu atlas issue seen in #24131 (comment). I have tried using the magic
[all random seeds]
syntax for the CI, let's see if that works 🤞.This is definitely a kludgy work-around but maybe good enough since the failure seems to happen quite rarely (see below for more details)?
The snippet below reproduces the problem (both for Ubuntu 20.04 with atlas and when installing from conda-forge). Trying different
random_state
for thefastica
call, the exception happens 1-3 times out of 10000 so maybe this is rare enough to use this work-around.More than happy to create an issue about the division by zero in fastica for a more proper fix