numpy.linalg.LinAlgError: Eigenvalues did not converge on ARM64 builds #19411
Comments
There are some passing tests more recent than 4 days ago. Looking at the worker info at the top of the jobs:
The 1.21.x branch has also started failing, so I don't think the problem is in numpy.
This is really weird. Any idea how to find the cause of this problem? It seems that there was no change in numpy or openblas that could explain this. Could it be some faulty hardware or another change in the software environment? I also tried to run those tests locally in a Linux arm64 Docker image on an Apple M1 machine, and they pass without any problem.
A successful job reports this build system information:
A failing job reports:
I opened an issue on the Travis CI forum; please comment/upvote so it gets some attention.
On my local machine with a successful run:
We might want to run
Note that this is not Travis specific: we also observe the same problem on Circle CI. But maybe they use the same hardware provider as Travis (Travis uses https://metal.equinix.com/, but I don't know about Circle CI).
My CentOS-based local machine (Kunpeng 920) runs the tests successfully, but there is a compiler error when a parallel build is enabled.
Here is the lscpu output of a failed run on Travis:
Unfortunately, this did not print the CPU flags. Maybe because of virtualization?
I also tried to trigger a run on conda-forge, and the scikit-learn tests pass on the ARM64 workers of the Drone CI (the only failure is an unrelated warning problem which was fixed in the scikit-learn main branch).
Apparently we can request aarch64 access from https://github.com/WorksOnArm/cluster by opening a new issue like this one for conda-forge: WorksOnArm/cluster#193. I think each project needs to do that separately. From the conda-forge issue it seems that conda-forge got at least a "c2.large.arm" machine in Feb 2020, which seems to be one of these Equinix Metal server offerings. For NumPy, let's discuss this at the weekly meeting.
I think I am making some progress: on the Travis machine where I observe the failure (the one that has not ...). To introspect the corename detected by OpenBLAS, I use the following augmented version of threadpoolctl (see joblib/threadpoolctl#85 for details):
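The exact snippet is not shown above; as a rough sketch of the same kind of introspection, assuming a threadpoolctl version recent enough to expose the detected OpenBLAS architecture in `threadpool_info()` (older releases simply omit the key):

```python
# Minimal sketch: print the OpenBLAS runtime information exposed by
# threadpoolctl. The "architecture" key holds the corename detected by
# OpenBLAS on recent threadpoolctl versions; .get() keeps the script
# working on older releases where the key is absent.
import numpy as np  # imported so that NumPy's bundled OpenBLAS is loaded
from threadpoolctl import threadpool_info

for module in threadpool_info():
    if module.get("internal_api") == "openblas":
        print("filepath:    ", module.get("filepath"))
        print("version:     ", module.get("version"))
        print("architecture:", module.get("architecture"))
```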
Here is the (failing) run: https://app.travis-ci.com/github/scikit-learn/scikit-learn/builds/232084976#L3799-L3825
If I force the use of the
then the tests pass again on Travis: https://app.travis-ci.com/github/scikit-learn/scikit-learn/builds/232087064
So it's possible that the problem comes from a bug in the
So a workaround for now is to force
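The exact setting is cut off above; as far as I know, OpenBLAS DYNAMIC_ARCH builds honor the `OPENBLAS_CORETYPE` environment variable, so such a workaround could look like the sketch below (the generic `ARMV8` kernel is an assumed choice, not necessarily the one used in the original comment):

```python
# Force OpenBLAS to use a specific kernel instead of the one picked by its
# runtime CPU detection. The variable must be set before the OpenBLAS shared
# library is loaded, i.e. before importing numpy/scipy.
# "ARMV8" (the generic aarch64 kernel) is an assumed value here.
import os

os.environ["OPENBLAS_CORETYPE"] = "ARMV8"

import numpy as np  # noqa: E402  (imported after setting the env var on purpose)

rng = np.random.default_rng(0)
a = rng.normal(size=(200, 200))
w = np.linalg.eigvalsh(a @ a.T)  # raises LinAlgError if the kernel misbehaves
print("smallest eigenvalue:", w.min())
```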
Using
For information, OpenBLAS detects the ARM64 variants based on the "CPU part" element of /proc/cpuinfo. On the failing Travis node, the contents of that file are:
I will try to file a proper bug report with the OpenBLAS developers tomorrow.
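For reference, OpenBLAS's runtime detection on Linux/aarch64 reads the "CPU part" field from /proc/cpuinfo; here is a small illustrative sketch (not from the original thread) to print that value on a given node:

```python
# Print the distinct "CPU part" values reported by the kernel on aarch64.
# OpenBLAS's runtime detection maps these part numbers to its ARM64 kernels
# (e.g. Neoverse N1 is part 0xd0c), so this value decides which kernel a
# DYNAMIC_ARCH build selects.
parts = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.lower().startswith("cpu part"):
            parts.add(line.split(":", 1)[1].strip())
print("CPU part value(s):", sorted(parts) or ["<not reported>"])
```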
Unfortunately, the fixes don't work for the docker images used to test the numpy wheels.
Can you please run:
before running the tests on the numpy wheels CI?
This data was obtained inside the Docker image where the tests are run:
graviton2
arm64 (80 processors)
So indeed the
What do you mean? Setting the ...
In any case, we should probably try to write a minimal C reproducer to report the issue to OpenBLAS, but I won't have the time to do it soon...
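A C reproducer would call LAPACK directly; in the meantime, a rough Python-level sketch (matrix size and iteration count are arbitrary assumptions) of the kind of stress loop that shows the error on an affected machine:

```python
# Repeatedly run symmetric eigendecompositions; on an affected OpenBLAS
# kernel this intermittently raises LinAlgError("Eigenvalues did not converge").
import numpy as np

rng = np.random.default_rng(42)
failures = 0
for i in range(100):
    a = rng.normal(size=(300, 300))
    sym = a @ a.T  # symmetric positive semi-definite, should always converge
    try:
        np.linalg.eigh(sym)
    except np.linalg.LinAlgError as exc:
        failures += 1
        print(f"iteration {i}: {exc}")
print(f"{failures} failure(s) out of 100 runs")
```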
It seems that the problem has disappeared magically on the scikit-learn Travis CI (see scikit-learn/scikit-learn#20476 after
It seems to be the same as for the previously failing runs... so I am confused. Maybe this was a transient hardware/virtualization problem at the hosting provider... In any case, we can no longer debug it. Maybe we can close this issue.
There was an OpenBLAS fix for the problem in 0.3.16, probably OpenMathLib/OpenBLAS#3278.
It seems unrelated (not the same CPU part).
The fix was probably released in 0.3.16, from the changelog:
Currently, there are some problems with the neoversen1 kernel, which make computations using BLAS via scipy unstable on this architecture. See this comment: numpy/numpy#19411 (comment)
TestPolynomial.test_poly and other tests recently started to fail on the Travis ARM64 nightly builds: https://travis-ci.com/github/numpy/numpy/jobs/521809468
The last successful ARM64 test run is 4 days old:
https://travis-ci.com/github/numpy/numpy/builds/231467572
Note that we observed similar failures on the Circle CI and Travis CI ARM64 builds for the scikit-learn project, using only stable releases of numpy and scipy, so this is probably not caused by a change in numpy itself.
We observed a failing job and a successful run that both used:
Link to those (weekly running) scikit-learn jobs:
I am not sure what caused this change since openblas is embedded in those wheels.
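To see which OpenBLAS build a given wheel actually ships and loads, something along these lines can help (a sketch; the exact fields reported depend on the installed numpy and threadpoolctl versions):

```python
# Inspect the BLAS that the installed numpy wheel loads at runtime.
import numpy as np
from threadpoolctl import threadpool_info

np.show_config()  # build-time BLAS/LAPACK configuration recorded in the wheel
for module in threadpool_info():
    print(module.get("internal_api"), module.get("version"), module.get("filepath"))
```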