numpy.linalg.LinAlgError: Eigenvalues did not converge on ARM64 builds #19411

Closed

ogrisel opened this issue Jul 5, 2021 · 23 comments

@ogrisel
Contributor
ogrisel commented Jul 5, 2021

TestPolynomial.test_poly and other tests recently started to fail on the Travis ARM64 nightly builds:

https://travis-ci.com/github/numpy/numpy/jobs/521809468

The last successful ARM64 test run is 4 days old:

https://travis-ci.com/github/numpy/numpy/builds/231467572

Note that we observed similar failures on Circle CI and Travis CI ARM64 builds for the scikit-learn project using only stable releases of numpy and scipy, so this is probably not caused by a change in numpy itself.

We observed a failing job and a successful run that both used:

  • numpy-1.21.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
  • scipy-1.7.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

Link to those (weekly running) scikit-learn jobs:

I am not sure what caused this change since openblas is embedded in those wheels.
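For reference, below is a minimal sketch (assumption: not the exact failing numpy test) of the kind of call where the error surfaces. The LinAlgError is raised by numpy.linalg's eigenvalue routines when the underlying LAPACK *geev call does not converge, and np.roots() reaches the same code path via the eigenvalues of the companion matrix.

import numpy as np

# Repeatedly compute eigenvalues of random matrices; on the affected ARM64
# workers this is where the "Eigenvalues did not converge" error shows up.
rng = np.random.RandomState(0)
for i in range(100):
    a = rng.standard_normal((50, 50))
    try:
        np.linalg.eigvals(a)
    except np.linalg.LinAlgError as exc:
        print(f"iteration {i}: {exc}")
        break
else:
    print("no convergence failure observed on this machine")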

@charris
Member
charris commented Jul 5, 2021

There are some passing tests more recent than 4 days. Looking at the worker info at the top of the jobs:

  • lxd-arm64-02 fails
  • lxd-arm64-05 works
  • lxd-arm64-06 works

The 1.21.x branch has also started failing, so I don't think the problem is in numpy.

@ogrisel
Contributor Author
ogrisel commented Jul 6, 2021

This is really weird. Any idea how to find the cause of this problem? It seems that there was no change in numpy or openblas that could explain this. Could it be some faulty hardware or another change in the software environment?

I also tried to run those tests locally on a Linux arm64 Docker image on an Apple M1 machine; they pass without any problem.

@mattip
Member
mattip commented Jul 6, 2021

A successful job reports build system information

Runtime kernel version: 5.8.0-55-generic
travis-build version: 091d532a
Build image provisioning date and time
Fri Jul 10 13:20:51 UTC 2020
Operating System Details
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.6 LTS
Release:	16.04
Codename:	xenial
Linux Version
5.0.0-23-generic

A failing job reports

Runtime kernel version: 5.8.0-59-generic
travis-build version: 091d532a
Build image provisioning date and time
Mon Nov  2 09:23:45 UTC 2020
Operating System Details
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.1 LTS
Release:	20.04
Codename:	focal
Linux Version
5.0.0-23-generic

@mattip
Member
mattip commented Jul 6, 2021

I opened an issue on the travis-ci forum; please comment/upvote so it gets some attention:
https://travis-ci.community/t/arm64-builds-failing-blas-tests-did-the-hardware-change-recently/11813

@ogrisel
Contributor Author
ogrisel commented Jul 6, 2021

On my local machine with a successful run:

$ uname -a
Linux a8c27e8c4c7a 5.10.25-linuxkit #1 SMP PREEMPT Tue Mar 23 09:24:45 UTC 2021 aarch64 GNU/Linux
$ cat /etc/issue.net
Debian GNU/Linux 10
$ cat /etc/debian_version 
10.7
$ lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               0
Model name:          Cortex-A57
Stepping:            r1p0
BogoMIPS:            48.00
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 fphp asimdhp cpuid dit

@ogrisel
Contributor Author
ogrisel commented Jul 6, 2021

We might want to run lscpu before running the tests to collect additional info.
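For example, a hypothetical conftest.py hook could dump it once per test session so the CI logs always carry this information (a sketch, not an existing scikit-learn or numpy helper):

import subprocess

def pytest_sessionstart(session):
    # Dump CPU information at the start of the test session for debugging.
    for cmd in (["lscpu"], ["cat", "/proc/cpuinfo"]):
        try:
            print(subprocess.run(cmd, capture_output=True, text=True).stdout)
        except OSError:
            pass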

@ogrisel
Contributor Author
ogrisel commented Jul 6, 2021

Note that this is not Travis specific: we also observe the same problem on Circle CI:

https://app.circleci.com/pipelines/github/scikit-learn/scikit-learn/15376/workflows/6dc4ec4f-c32a-4bc8-be42-a6ac42bb346f/jobs/143767

But maybe they use the same hardware provider as Travis (Travis uses https://metal.equinix.com/, but I don't know about Circle CI).

@Qiyu8
Member
Qiyu8 commented Jul 6, 2021

My CentOS-based local machine (KunPeng 920) can run the tests successfully, but there is a compiler error when parallel build is enabled.

$ uname -a
Linux ecs-9d50 4.19.36-vhulk1907.1.0.h962.eulerosv2r8.aarch64 #1 SMP Fri Jan 8 13:18:01 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
$ cat /etc/os-release
NAME="EulerOS"
VERSION="2.0 (SP8)"
ID="euleros"
ID_LIKE="rhel fedora centos"
VERSION_ID="2.0"
PRETTY_NAME="EulerOS 2.0 (SP8)"
ANSI_COLOR="0;31"
$ lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           0x48
Model:               0
Stepping:            0x1
CPU MHz:             2600.000
CPU max MHz:         2600.0000
CPU min MHz:         2600.0000
BogoMIPS:            200.00
L1d cache:           64K
L1i cache:           64K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-7
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm

@ogrisel
Contributor Author
ogrisel commented Jul 7, 2021

Here is the lscpu output of a failed run on travis:

Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                80
On-line CPU(s) list:   0-79
Thread(s) per core:    1
Core(s) per socket:    80
Socket(s):             1
NUMA node(s):          1
L1d cache:             64K
L1i cache:             64K
L2 cache:              1024K
NUMA node0 CPU(s):     0-79

Unfortunately, this did not print the CPU flags. Maybe because of virtualization?

@ogrisel
Contributor Author
ogrisel commented Jul 7, 2021

I also tried to trigger a run on conda-forge, and the scikit-learn tests pass on the ARM64 workers of the Drone CI (the only failure is an unrelated warning problem which was already fixed in the scikit-learn main branch).

@mattip
Member
mattip commented Jul 7, 2021

Apparently we can request aarch64 access from https://github.com/WorksOnArm/cluster by opening a new issue like the one for conda-forge, WorksOnArm/cluster#193. I think each project needs to do that separately. From the conda-forge issue it seems that conda-forge, in Feb 2020, got at least a "c2.large.arm" machine, which seems to be one of these offerings from Equinix Metal. For NumPy, let's discuss this at the weekly meeting.

@ogrisel
Contributor Author
ogrisel commented Jul 7, 2021

I think I am making some progress: on the Travis machine where I observe the failure (the one with no Flags entry in the lscpu output), OpenBLAS detects the neoversen1 core. On my Apple M1 machine it detects armv8 under macOS and cortexa57 under Linux aarch64 via Docker, and the tests pass on that machine with both cortexa57 and armv8.

To introspect the corename detected by openblas, use the following augmented version of threadpoolctl (see joblib/threadpoolctl#85 for details):

$ pip install git+https://github.com/ogrisel/threadpoolctl.git@openblas_get_corename
$ python -m threadpoolctl -i numpy
[
  {
    "filepath": "/home/travis/miniconda/envs/testenv/lib/libopenblasp-r0.3.15.so",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.15",
    "num_threads": 2,
    "threading_layer": "pthreads",
    "corename": "neoversen1"
  }
]
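For scripted checks, the same information can be read programmatically (a sketch, assuming the patched branch above is installed; later threadpoolctl releases expose the same value under the "architecture" key, as in a comment further down this thread):

import numpy  # noqa: F401  (importing numpy loads its bundled OpenBLAS)
from threadpoolctl import threadpool_info

for module in threadpool_info():
    if module.get("internal_api") == "openblas":
        # "corename" in the patched branch, "architecture" in later releases
        core = module.get("corename") or module.get("architecture")
        print(module.get("version"), core)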

Here is the (failing) run:

https://app.travis-ci.com/github/scikit-learn/scikit-learn/builds/232084976#L3799-L3825

If I force the use of the armv8 kernel by setting the following environment variable:

export OPENBLAS_CORETYPE=armv8

then the tests pass again on travis:

https://app.travis-ci.com/github/scikit-learn/scikit-learn/builds/232087064

So it's possible that the problem comes from a bug in the neoversen1 kernel in OpenBLAS 0.3.15 (the latest version at the moment), or that the Travis machines make OpenBLAS detect the wrong kernel (because the CPU flags are not properly reported).

@ogrisel
Contributor Author
ogrisel commented Jul 7, 2021

So a workaround for now is to export OPENBLAS_CORETYPE=armv8 in the Travis CI config of numpy, scipy, scikit-learn...
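The same thing can be done from Python as long as the variable is set before numpy is imported for the first time (a sketch; in CI the exported shell variable is the simpler option):

import os

# Must be set before the bundled OpenBLAS is loaded, i.e. before importing numpy.
os.environ["OPENBLAS_CORETYPE"] = "armv8"

import numpy as np  # noqa: E402

print(np.linalg.eigvals(np.eye(3)))  # now runs on the generic armv8 kernels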

@charris
Member
charris commented Jul 7, 2021

Using arch: arm64-graviton2 in travis.yml also fixes the problem for NumPy. See #19426. Note s390x seems to have gone missing.

@ogrisel
Contributor Author
ogrisel commented Jul 7, 2021

For reference, OpenBLAS detects the ARM64 variants based on the "CPU part" field of /proc/cpuinfo (see the source code for more details).

On the failing Travis node, the contents of /proc/cpuinfo are:

processor	: 0
BogoMIPS	: 50.00
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x3
CPU part	: 0xd0c
CPU revision	: 1

processor	: 1
BogoMIPS	: 50.00
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x3
CPU part	: 0xd0c
CPU revision	: 1

processor	: 2
BogoMIPS	: 50.00
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x3
CPU part	: 0xd0c
CPU revision	: 1
[...]

I will try to do a proper bug report to OpenBLAS developers tomorrow.
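For illustration, here is a simplified sketch of that detection logic (assumption: this is not the actual OpenBLAS source, and the part-number table is truncated to the two cores mentioned in this thread):

# 0xd07 is the ARM part number for Cortex-A57 and 0xd0c for Neoverse N1,
# which is what the failing Travis workers report above.
KNOWN_PARTS = {
    "0xd07": "cortexa57",
    "0xd0c": "neoversen1",
}

def detect_corename(cpuinfo_path="/proc/cpuinfo"):
    with open(cpuinfo_path) as f:
        for line in f:
            if line.lower().startswith("cpu part"):
                part = line.split(":", 1)[1].strip()
                return KNOWN_PARTS.get(part, "armv8")
    return "armv8"  # fall back to the generic ARMv8 target

print(detect_corename())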

@charris
Member
charris commented Jul 8, 2021

Unfortunately, the fixes don't work for the docker images used to test the numpy wheels.

@ogrisel
Contributor Author
ogrisel commented Jul 8, 2021

Can you please run:

lscpu
pip install git+https://github.com/ogrisel/threadpoolctl.git@openblas_get_corename
python -m threadpoolctl -i numpy
cat /proc/cpuinfo

before running the tests on the numpy wheels CI?

@charris
Member
charris commented Jul 9, 2021

@ogrisel

This data was obtained inside the docker image where the tests are run.

graviton2

[
  {
    "filepath": "/venv/lib/python3.7/site-packages/numpy.libs/libopenblasp-r0-afb71072.3.13.dev.so",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.13.dev",
    "num_threads": 2,
    "threading_layer": "pthreads",
    "corename": "neoversen1"
  }
]
processor	: 0
BogoMIPS	: 243.75
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x3
CPU part	: 0xd0c
CPU revision	: 1

processor	: 1
BogoMIPS	: 243.75
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x3
CPU part	: 0xd0c
CPU revision	: 1

arm64 (80 processors)

[
  {
    "filepath": "/venv/lib/python3.8/site-packages/numpy.libs/libopenblasp-r0-afb71072.3.13.dev.so",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.13.dev",
    "num_threads": 64,
    "threading_layer": "pthreads",
    "corename": "neoversen1"
  }
]
processor	: 0
BogoMIPS	: 50.00
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x3
CPU part	: 0xd0c
CPU revision	: 1

@ogrisel
Contributor Author
ogrisel commented Jul 28, 2021

So indeed the arm64 (80 processors) and graviton2 machines look very similar from an lscpu point of view (in particular the CPU part identifier that OpenBLAS relies on to detect the CPU architecture is the same).

Unfortunately, the fixes don't work for the docker images used to test the numpy wheels.

What do you mean? Setting the OPENBLAS_CORETYPE=armv8 environment variable does not work on the arm64 host used to generate the manylinux wheels? Or using the graviton2 host does not work for the manylinux wheels?

In any case, we should probably try to write a minimal C reproducer to report the issue to OpenBLAS but I won't have the time to do it soon...

@ogrisel
Contributor Author
ogrisel commented Jul 29, 2021

It seems that the problem has magically disappeared on the scikit-learn Travis CI (see scikit-learn/scikit-learn#20476 after a1e1a03). There is still a (single) test failure but it's unrelated to the originally reported numpy.linalg.LinAlgError: Eigenvalues did not converge problem. Here is the CPU info of the arm64 80-core machine from the last successful run.

[
  {
    "filepath": "/home/travis/miniconda/envs/testenv/lib/libgomp.so.1.0.0",
    "prefix": "libgomp",
    "user_api": "openmp",
    "internal_api": "openmp",
    "version": null,
    "num_threads": 2
  },
  {
    "filepath": "/home/travis/miniconda/envs/testenv/lib/libopenblasp-r0.3.17.so",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.17",
    "num_threads": 2,
    "threading_layer": "pthreads",
    "architecture": "neoversen1"
  }
]
processor	: 0
BogoMIPS	: 50.00
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x3
CPU part	: 0xd0c
CPU revision	: 1

It seems to be the same as for the previously failing runs... so I am confused. Maybe this was a transient hardware / virtualization problem on the hosting provider...

In any case we can no longer debug this. Maybe we can close.

@charris
Member
charris commented Jul 29, 2021

There was an OpenBLAS fix for the problem in 0.3.16, probably OpenMathLib/OpenBLAS#3278.

@charris charris closed this as completed Jul 29, 2021
@ogrisel
Contributor Author
ogrisel commented Jul 30, 2021

There was an OpenBLAS fix for the problem in 0.3.16, probably OpenMathLib/OpenBLAS#3278.

It seems unrelated (not the same CPU part: 0xd05 vs 0xd0c on Travis CI); however, it's true that the new runs are using OpenBLAS 0.3.17 while the failed runs used 0.3.15.

@ogrisel
Contributor Author
ogrisel commented Jul 30, 2021

The fix was probably released in 0.3.16, from the changelog:

  • fixed missing restore of a register in the recently rewritten DNRM2 kernel
    for ThunderX2 and Neoverse N1 that could cause spurious failures in e.g.
    DGEEV
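A quick way to check whether a given environment still ships an affected build is to compare the bundled OpenBLAS version against 0.3.16 (a sketch, assuming threadpoolctl is installed and treating anything older than 0.3.16 as potentially affected, based on the changelog entry above):

import numpy  # noqa: F401
from threadpoolctl import threadpool_info

def parse(version):
    # "0.3.13.dev" -> (0, 3, 13), "0.3.17" -> (0, 3, 17)
    return tuple(int(p) for p in version.split(".")[:3] if p.isdigit())

for module in threadpool_info():
    if module.get("internal_api") == "openblas":
        version = module.get("version") or "0.0.0"
        if parse(version) < (0, 3, 16):
            print(f"OpenBLAS {version} predates the 0.3.16 DNRM2/Neoverse N1 fix")
        else:
            print(f"OpenBLAS {version} includes the fix")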

jjerphan added a commit to jjerphan/scikit-learn that referenced this issue Aug 5, 2021
Currently, there are some problems with the neoversen1 kernel,
which makes computations using BLAS via scipy unstable for this
architecture.

See this comment:
numpy/numpy#19411 (comment)