-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
BUG: crashes in sort/argsort on macOS arm64 #25464
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Could you link to a full Cirrus run @rgommers? |
Sure, here is an example: https://cirrus-ci.com/task/4660394644471808 |
If it's the main ci run (not a wheel) then this should be visible from PR runs. |
I don't think it's related to #25247 for the obvious reason that the compiler used within Test features "NEON NEON_FP16 NEON_VFPV4 ASIMD" : Supported
Test features "ASIMDHP" : Supported
Test features "ASIMDFHM" : Supported
Test features "SVE" : Unsupported due to Compiler fails against the test code of "SVE"
Configuring npy_cpu_dispatch_config.h using configuration
Message:
CPU Optimization Options
baseline:
Requested : min
Enabled : NEON NEON_FP16 NEON_VFPV4 ASIMD
dispatch:
Requested : max -xop -fma4
Enabled : ASIMDHP ASIMDFHM |
I've been unable to reproduce this even with the same exact setup, can anyone reproduce this reliably? 🤔 @jan-wassenberg can you think of any changes that might've fixed something like this in the next Highway release? |
Ah, your testing is with latest Highway? 1.0.7 was Aug30. Since then there were indeed some changes to VQSort:
Those, especially the first, may well have fixed such a bug. |
Do you have an update on the next release? We rolled back to |
:) I understand. Am currently fixing warnings from recent work and hope/expect we can do a release next week. |
Ping @jan-wassenberg. This is problematic since we only see this in the infrequent wheel builds, and the crash prevents the wheel uploads. So downstream projects are not testing against our latest builds. |
Acknowledged :) Sorry that I am running behind, am at a conference this week with minimal coding time. I am postponing further work on Highway's ThreadPool until the release is done. |
@mattip A quick update, the pre-release work is done and we're currently waiting for two reviewers who should be back tomorrow. |
FYI the 1.1.0 release is now done :) |
Great, thanks @jan-wassenberg! @Mousius are you planning to submit the PR to update to the new Highway version? Would be nice to get this one off the 2.0 milestone. |
@jan-wassenberg excellent news! 😸 |
perhaps we can close this? has anyone reproduced the crash and confirmed highway 1.1.0 fixes it? |
That crash was happening irregularly in CI (reported several times more in addition to this issue, for example gh-25685 and gh-25686). I wasn't able to reproduce it locally on a macOS M1 machine. So yes, let's go with your suggestion and close this issue - let's keep an eye out for it happening again. |
In the wheel build run on Cirrus CI that just ran on
and
The other two jobs passed. There doesn't seem to be a correlation with platform or Python versions; of the two failing jobs, one was Python 3.10 on the Sonoma image, the other Python 3.11 on the Monterey image. Looking back at previous cron invocations, we had similar issues. On 25 Feb two jobs failed with:
And on 18 Feb one jobs failed with:
I'm re-running the two failed jobs now; we still have an issue here though. |
@rgommers Are these only on ARM? |
Yes, arm64 jobs only.
Don't we use Highway for these since gh-24018? |
only numpy/numpy/_core/tests/test_multiarray.py Lines 10095 to 10098 in 2e3f52f
|
I think that was the assumption here. That's why we wanted to try the new Highway release. |
yeah, the fact that we have failure in the sort test |
There is a similar crash on the PyPy 3.9 macos arm64 wheel builds, which repeated the second time I ran it |
See https://github.com/numpy/numpy/pull/25987/checks?check_run_id=22490803090 for a sort failure on arm64. |
Is it possible we could get better error messages by disabling pytest parallelism? |
If we can't figure this out, I think we should revert using Highway in the arm64 sort ufuncs. It has proven unstable, we can't release with this kind of heisenbug. |
Unlikely, I don't think
That sounds right to me - we should do so later this week unless the problem is resolved, and then work towards re-enabling it in |
It seems the PyPy case fails consistently, perhaps exploration could start there. cc @seiko2plus @Mousius. |
@seiko2plus pointed out that we could leave this enabled for non-AArch64, as they don't seem to be erroring. That'd leave most of the code intact and the benefit still available to some users 😸 |
x86-64 is covered by |
yes.
@jan-wassenberg might be able to clarify which architectures they test in their CI/release process for vqsort. NumPy currently uses highway for SVE, ASIMD, ASIMDHP, VSX2. |
What would be the easiest way to disable Highway sorting code for arm64? |
It'd be |
@Mousius how consistent was the segfault you saw in the comment above? Enough that you could see that disabling |
Sure, CI is x86, WASM, RVV, Armv7, Armv8 A
F438
SIMD (native HW), Arm SVE, Arm SVE2, plus msan/asan. |
I saw it once; it's proven very difficult to reproduce again despite the same conditions. Disabling ASIMD should fix it by disabling any optimisation whilst we figure out why clang on darwin is causing issues. Unfortunately, it would also disable ASIMD for Linux, which seems well-tested by Highway, and we haven't seen any failures from the Linux AArch64 wheel builds? If it seems acceptable to limit it to Linux, I'm happy to raise a patch so we can get something in to mitigate this. |
Looking back over the cirrus CI runs https://cirrus-ci.com/github/numpy/numpy, and diving into the last failures, I only see arm64 ones. Can we change the meson build condition to include the platform? |
Thanks for the immediate mitigation, that will take the pressure off. Any progress with this?
|
We have found and fixed a corner case in PartitionRightmost. When the rightmost keys all belong to the left partition, we previously loaded (and wrote back unchanged) memory beyond the array. This works on SVE and AVX-512 thanks to predication, but fails on NEON especially if the array to sort is at the end of a memory page. |
Cool. I understand that fix will be part of an upcoming release, then we can update vendored Highway to the next version? Or should we update to latest unreleased HEAD to test the fix? |
Right, generally we recommend using a release. But for purposes of testing to make sure this fixes the heisenbug, using HEAD seems reasonable :) |
Raised #26273 😸 |
I think we can close this? |
Yes, it looks like we're in the clear now! |
Not sure exactly how long this has been happening, but the macOS arm64 CI job on Cirrus is crashing fairly regularly right now. This happened both before and after the downgrade of the Highway git submodule to 1.0.7
The CI logs show things like:
It's not just that one test, there are multiple tests and input values that crash.
It could be multiple things, but gh-25247 is probably the first thing to check (Cc @Mousius).
The text was updated successfully, but these errors were encountered: