8000 BUG: crashes in sort/argsort on macOS arm64 · Issue #25464 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

BUG: crashes in sort/argsort on macOS arm64 #25464

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rgommers opened this issue D 8000 ec 22, 2023 · 53 comments
Closed

BUG: crashes in sort/argsort on macOS arm64 #25464

rgommers opened this issue Dec 22, 2023 · 53 comments
Assignees
Labels
Milestone

Comments

@rgommers
Copy link
Member

Not sure exactly how long this has been happening, but the macOS arm64 CI job on Cirrus is crashing fairly regularly right now. This happened both before and after the downgrade of the Highway git submodule to 1.0.7

The CI logs show things like:

________________________ _core/tests/test_multiarray.py ________________________
[gw1] darwin -- Python 3.10.11 /private/var/folders/c2/b2kmn__12wq57m82ltz0hl1w0000gn/T/cibw-run-3g7m_5go/cp310-macosx_arm64/venv-test/bin/python
worker 'gw1' crashed while running '_core/tests/test_multiarray.py::test_sort_float[e-N151]'

It's not just that one test, there are multiple tests and input values that crash.

It could be multiple things, but gh-25247 is probably the first thing to check (Cc @Mousius).

@Mousius
Copy link
Member
Mousius commented Dec 22, 2023

Could you link to a full Cirrus run @rgommers?

@rgommers
Copy link
Member Author

Sure, here is an example: https://cirrus-ci.com/task/4660394644471808

@andyfaff
Copy link
Member

If it's the main ci run (not a wheel) then this should be visible from PR runs.

@seiko2plus
Copy link
Member

I don't think it's related to #25247 for the obvious reason that the compiler used within cirrus-ci doesn't support SVE:

Test features "NEON NEON_FP16 NEON_VFPV4 ASIMD" : Supported 
Test features "ASIMDHP" : Supported 
Test features "ASIMDFHM" : Supported 
Test features "SVE" : Unsupported due to Compiler fails against the test code of "SVE"
Configuring npy_cpu_dispatch_config.h using configuration
Message: 
CPU Optimization Options
  baseline:
    Requested : min
    Enabled   : NEON NEON_FP16 NEON_VFPV4 ASIMD
  dispatch:
    Requested : max -xop -fma4
    Enabled   : ASIMDHP ASIMDFHM

It may related to #24018 or #25045

@Mousius
Copy link
Member
Mousius commented Jan 25, 2024

I've been unable to reproduce this even with the same exact setup, can anyone reproduce this reliably? 🤔 @jan-wassenberg can you think of any changes that might've fixed something like this in the next Highway release?

@Mousius Mousius self-assigned this Jan 25, 2024
@jan-wassenberg
Copy link
Contributor

Ah, your testing is with latest Highway? 1.0.7 was Aug30. Since then there were indeed some changes to VQSort:

Those, especially the first, may well have fixed such a bug.

@Mousius
Copy link
Member
Mousius commented Jan 26, 2024

Ah, your testing is with latest Highway? 1.0.7 was Aug30. Since then there were indeed some changes to VQSort:

Those, especially the first, may well have fixed such a bug.

Do you have an update on the next release? We rolled back to 1.0.7 due to compiler issues with the latest HEAD and wanting to choose the releases for stability.

@jan-wassenberg
Copy link
Contributor

:) I understand. Am currently fixing warnings from recent work and hope/expect we can do a release next week.

@seberg seberg added this to the 2.0.0 release milestone Feb 7, 2024
@mattip
Copy link
Member
mattip commented Feb 7, 2024

Ping @jan-wassenberg. This is problematic since we only see this in the infrequent wheel builds, and the crash prevents the wheel uploads. So downstream projects are not testing against our latest builds.

@jan-wassenberg
Copy link
Contributor

Acknowledged :) Sorry that I am running behind, am at a conference this week with minimal coding time. I am postponing further work on Highway's ThreadPool until the release is done.

@jan-wassenberg
Copy link
Contributor

@mattip A quick update, the pre-release work is done and we're currently waiting for two reviewers who should be back tomorrow.

@jan-wassenberg
Copy link
Contributor

FYI the 1.1.0 release is now done :)

@rgommers
Copy link
Member Author

Great, thanks @jan-wassenberg!

@Mousius are you planning to submit the PR to update to the new Highway version? Would be nice to get this one off the 2.0 milestone.

@Mousius
Copy link
Member
Mousius commented Feb 22, 2024

@jan-wassenberg excellent news! 😸
@rgommers #25873 😸

@r-devulap
Copy link
Member

perhaps we can close this? has anyone reproduced the crash and confirmed highway 1.1.0 fixes it?

@rgommers
Copy link
Member Author

That crash was happening irregularly in CI (reported several times more in addition to this issue, for example gh-25685 and gh-25686). I wasn't able to reproduce it locally on a macOS M1 machine.

So yes, let's go with your suggestion and close this issue - let's keep an eye out for it happening again.

@rgommers
Copy link
Member Author
rgommers commented Mar 1, 2024

In the wheel build run on Cirrus CI that just ran on main ([CI logs](https://cirrus-ci.com/build/6671602109120512, with commit c3a32b5), 2 out of 4 arm64 jobs crashed in test_partition with:

worker 'gw2' crashed while running '_core/tests/test_multiarray.py::test_partition_int[uint32-N66]'

and

worker 'gw0' crashed while running '_core/tests/test_multiarray.py::test_partition_fp[float32-N82]'

The other two jobs passed. There doesn't seem to be a correlation with platform or Python versions; of the two failing jobs, one was Python 3.10 on the Sonoma image, the other Python 3.11 on the Monterey image.

Looking back at previous cron invocations, we had similar issues. On 25 Feb two jobs failed with:

_core/tests/test_multiarray.py::test_argsort_float[float32-N255]

_core/tests/test_multiarray.py::test_argsort_int[int32-N252]

And on 18 Feb one jobs failed with:

_core/tests/test_multiarray.py::test_sort_float[e-N223]

I'm re-running the two failed jobs now; we still have an issue here though.

@r-devulap
Copy link
Member
r-devulap commented Mar 1, 2024

@rgommers Are these only on ARM? np.partition and np.argsort are vectorized only on x86 (avx-512 and avx2). ARM still uses the NumPy scalar version.

@rgommers
Copy link
Member Author
rgommers commented Mar 1, 2024

Yes, arm64 jobs only.

ARM still uses the NumPy scalar version.

Don't we use Highway for these since gh-24018?

@r-devulap
Copy link
Member
r-devulap commented Mar 1, 2024

Don't we use Highway for these since gh-24018?

only np.sort. not for np.partition and np.argsort. Although, I just realized the tests test_partition_int and test_partition_fp call np.sort. It is possible it could be coming from there.

assert_arr_partitioned(np.sort(arr)[k], k,
np.partition(arr, k, kind='introselect'))
assert_arr_partitioned(np.sort(arr)[k], k,
arr[np.argpartition(arr, k, kind='introselect')])

@rgommers
Copy link
Member Author
rgommers commented Mar 1, 2024

I think that was the assumption here. That's why we wanted to try the new Highway release.

@r-devulap
Copy link
Member

yeah, the fact that we have failure in the sort test _core/tests/test_multiarray.py::test_sort_float[e-N223] also points to it.

@mattip
Copy link
Member
mattip commented Mar 10, 2024

There is a similar crash on the PyPy 3.9 macos arm64 wheel builds, which repeated the second time I ran it

@charris
Copy link
Member
charris commented Mar 11, 2024

See https://github.com/numpy/numpy/pull/25987/checks?check_run_id=22490803090 for a sort failure on arm64.

@ngoldbaum
Copy link
Member

Is it possible we could get better error messages by disabling pytest parallelism?

@mattip
Copy link
Member
mattip commented Mar 11, 2024

If we can't figure this out, I think we should revert using Highway in the arm64 sort ufuncs. It has proven unstable, we can't release with this kind of heisenbug.

@rgommers
Copy link
Member Author

Is it possible we could get better error messages by disabling pytest parallelism?

Unlikely, I don't think pytest gives useful output on segfaults either way.

If we can't figure this out, I think we should revert using Highway in the arm64 sort ufuncs. It has proven unstable, we can't release with this kind of heisenbug.

That sounds right to me - we should do so later this week unless the problem is resolved, and then work towards re-enabling it in main.

@mattip
Copy link
Member
mattip commented Mar 11, 2024
6D40

It seems the PyPy case fails consistently, perhaps exploration could start there.

cc @seiko2plus @Mousius.

@Mousius
Copy link
Member
Mousius commented Mar 11, 2024

That sounds right to me - we should do so later this week unless the problem is resolved, and then work towards re-enabling it in main.

@seiko2plus pointed out that we could leave this enabled for non-AArch64, as they don't seem to be erroring. That'd leave most of the code intact and the benefit still available to some users 😸

@rgommers
Copy link
Member Author

x86-64 is covered by x86-simd-sort, right? And architectures other than Intel/Arm are only tested lightly under Qemu, which could hide a problem like this perhaps? With so few users on those non-standard architectures, we may not get any testing signal in time.

@r-devulap
Copy link
Member
r-devulap commented Mar 11, 2024

x86-64 is covered by x86-simd-sort, right?

yes.

With so few users on those non-standard architectures, we may not get any testing signal in time. 9E88

@jan-wassenberg might be able to clarify which architectures they test in their CI/release process for vqsort. NumPy currently uses highway for SVE, ASIMD, ASIMDHP, VSX2.

@mattip
Copy link
Member
mattip commented Mar 11, 2024

could leave this enabled for non-AArch64, as they don't seem to be erroring

What would be the easiest way to disable Highway sorting code for arm64?

@r-devulap
Copy link
Member

@mattip just remove ASIMD (or is it SVE?) from this line:

SVE, ASIMD, VSX2, # FIXME: disable VXE due to runtime segfault

@Mousius
Copy link
Member
Mousius commented Mar 11, 2024

It'd be ASIMD, we haven't seen any issues with SVE as yet 🤔 We could also only enable it for Linux, which we've not seen any issues on?

@mattip
Copy link
Member
mattip commented Mar 11, 2024

@Mousius how consistent was the segfault you saw in the comment above? Enough that you could see that disabling ASIMD solves the problem?

@jan-wassenberg
Copy link
Contributor

x86-64 is covered by x86-simd-sort, right?

yes.

With so few users on those non-standard architectures, we may not get any testing signal in time.

@jan-wassenberg might be able to clarify which architectures they test in their CI/release process for vqsort. NumPy currently uses highway for SVE, ASIMD, ASIMDHP, VSX2.

Sure, CI is x86, WASM, RVV, Armv7, Armv8 A F438 SIMD (native HW), Arm SVE, Arm SVE2, plus msan/asan.
Release: also GCC, POWER8, POWER9, Z14, Z15.

@Mousius
Copy link
Member
Mousius commented Mar 12, 2024

@Mousius how consistent was the segfault you saw in the comment above? Enough that you could see that disabling ASIMD solves the problem?

I saw it once; it's proven very difficult to reproduce again despite the same conditions. Disabling ASIMD should fix it by disabling any optimisation whilst we figure out why clang on darwin is causing issues. Unfortunately, it would also disable ASIMD for Linux, which seems well-tested by Highway, and we haven't seen any failures from the Linux AArch64 wheel builds?

If it seems acceptable to limit it to Linux, I'm happy to raise a patch so we can get something in to mitigate this.

@mattip
Copy link
Member
mattip commented Mar 12, 2024

Looking back over the cirrus CI runs https://cirrus-ci.com/github/numpy/numpy, and diving into the last failures, I only see arm64 ones. Can we change the meson build condition to include the platform?

@Mousius
Copy link
Member
Mousius commented Mar 12, 2024

@mattip #26000 is what I'm suggesting.

@mattip
Copy link
Member
mattip commented Mar 12, 2024

Thanks for the immediate mitigation, that will take the pressure off. Any progress with this?

If I compile with -Db_lundef=false -Db_sanitize=address, I get import errors when running the tests

@jan-wassenberg
Copy link
Contributor

We have found and fixed a corner case in PartitionRightmost. When the rightmost keys all belong to the left partition, we previously loaded (and wrote back unchanged) memory beyond the array. This works on SVE and AVX-512 thanks to predication, but fails on NEON especially if the array to sort is at the end of a memory page.

@mattip
Copy link
Member
mattip commented Apr 6, 2024

Cool. I understand that fix will be part of an upcoming release, then we can update vendored Highway to the next version? Or should we update to latest unreleased HEAD to test the fix?

@jan-wassenberg
Copy link
Contributor

Right, generally we recommend using a release. But for purposes of testing to make sure this fixes the heisenbug, using HEAD seems reasonable :)

@Mousius
Copy link
Member
Mousius commented Apr 12, 2024

Raised #26273 😸

@mattip
Copy link
Member
mattip commented May 26, 2024

I think we can close this?

@rgommers
Copy link
Member Author

Yes, it looks like we're in the clear now!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

10 participants
0