BUG: poor multithreaded performance scaling for small arrays (nogil) #27786
Comments
Without diving in too deeply, are you controlling for the OpenBLAS threading underneath NumPy? You might want to use threadpoolctl or set the BLAS thread count explicitly.
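For reference, this is roughly what pinning the BLAS pool looks like with threadpoolctl (a sketch, not from the original report; setting OPENBLAS_NUM_THREADS=1 in the environment before starting Python is an equivalent way to rule this out):

```python
# Sketch: pin the BLAS thread pool to one thread so that any scaling
# differences come from the Python-level threading, not from OpenBLAS
# spawning its own workers underneath each NumPy call.
import numpy as np
from threadpoolctl import threadpool_limits

x = np.arange(100, dtype=np.float64)
with threadpool_limits(limits=1, user_api="blas"):
    for _ in range(5000):
        x += 0.01
        x[0] += x.mean() * 0.01
```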
I tried this and I don't see any significant change in results.
Thanks. Now that I look at this more carefully, it seems there is some thread contention limiting NumPy's performance. Looking only at the "NumPy Threads" graphs, they appear to scale "per call": i.e. if you divide the MFLOPS by the vector length, the graphs align, which suggests the bottleneck is a fixed per-call overhead rather than per-element work.
When I run the attached script, I get:
My understanding is that closures aren't pickleable, so you can't use them with multiprocessing the way you're doing. That said, if I hack the script to skip the multiprocessing benchmark, then I am able to reproduce the behavior described in the issue:
Agreed! Python free-threading has biased reference counting, "a fast-path for objects owned by the current thread". I wonder if NumPy could do something similar for whatever it's doing per call.
Hmmm, the script runs as-is for me. Maybe some platform or version difference? Here's an alternate snippet that just removes the synchronization for the multiprocessing case. So, the timing includes the setup (not just computation), which is sloppy, but you'll still get similar results.
Thanks! I wonder if your site environment patches Python to use something like a pickle replacement that handles closures.

I'm going to try to profile this today using a few tools to see if this is due to lock contention inside CPython or inside NumPy. If it's due to locking that we added in NumPy to make things thread-safe, maybe we can simply disable it or find a quick fix for the scaling issue. It's also probably worth adding documentation to py-free-threading.github.io to help with debugging issues like this. This is the first time this has come up so far in NumPy, but doubtless it will happen a lot across the ecosystem.
I rewrote the test script. I went ahead and edited the issue description so it only includes one graph and working reproducer code; hope you don't mind @eundersander.
I've attached a flamegraph I generated. I ran this script:

```python
import threading
import numpy as np

def numpy_worker(barrier, array_length, num_iterations):
    """Worker function for NumPy computations."""
    x = np.arange(array_length, dtype=np.float64)
    barrier.wait()
    for _ in range(num_iterations):
        x += 0.01  # Element-wise operation
        x[0] += x.mean() * 0.01  # Reduction operation
    print(f"ended {threading.get_ident()}")

num_workers = 1024
barrier = threading.Barrier(num_workers + 1)
workers = []
for _ in range(num_workers):
    t = threading.Thread(target=numpy_worker, args=(barrier, 100, 5000))
    workers.append(t)
    t.start()
barrier.wait()
[t.join() for t in workers]
```

So clearly the way we're using a critical section to lock the ufunc identity cache doesn't scale well:

numpy/numpy/_core/src/umath/dispatching.c, lines 979 to 983 at b85a149
I wonder how hard it would be to make the ufunc identity cache per-thread. It's tricky because the cache hangs off of a global ufunc object. If instead we had some kind of thread-local registry mapping ufuncs to the corresponding identity cache, that would avoid the need for any kind of locking. @colesbury I'd appreciate your thoughts here.
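To make the idea concrete, here is a Python-level sketch of a thread-local registry (illustrative only; the real cache lives in C in dispatching.c, so this shows the shape of the data structure rather than a proposed implementation):

```python
import threading

# Hypothetical sketch: each thread keeps its own mapping from a ufunc to
# that ufunc's dispatch/identity cache, so lookups never touch shared
# state and need no lock. The cost is one cache per (thread, ufunc) pair.
_tls = threading.local()

def get_identity_cache(ufunc):
    registry = getattr(_tls, "registry", None)
    if registry is None:
        registry = _tls.registry = {}
    cache = registry.get(ufunc)
    if cache is None:
        cache = registry[ufunc] = {}  # per-thread cache, filled on demand
    return cache
```

The obvious cost is that each thread fills and stores its own copy of the cache.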
Yeah, we should make the fast path of the lookup here scale better:

numpy/numpy/_core/src/umath/dispatching.c, lines 773 to 779 at b85a149
One option is to use a readers-writer lock, but that has downsides.

I'll think about it some more.
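For anyone following along, a readers-writer lock lets any number of threads read the cache concurrently while a writer filling it gets exclusive access. A minimal Python sketch of the classic counter-based construction (illustrative; any real lock here would be in C):

```python
import threading

class ReadWriteLock:
    """Many concurrent readers, exclusive writers (illustrative sketch)."""

    def __init__(self):
        self._readers = 0
        self._readers_lock = threading.Lock()  # protects the reader count
        self._writer_lock = threading.Lock()   # held while readers or a writer are active

    def acquire_read(self):
        with self._readers_lock:
            self._readers += 1
            if self._readers == 1:
                self._writer_lock.acquire()    # first reader blocks writers

    def release_read(self):
        with self._readers_lock:
            self._readers -= 1
            if self._readers == 0:
                self._writer_lock.release()    # last reader lets writers in

    def acquire_write(self):
        self._writer_lock.acquire()

    def release_write(self):
        self._writer_lock.release()
```

Even this sketch hints at one trade-off: readers still serialize briefly on the counter lock, so the read path is cheaper than full mutual exclusion but not contention-free.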
Yeah, it's basically a custom dict lookup that needs to be safe (and if that fails, we shouldn't worry about speed).
Unfortunately it's not just that there's a custom hashmap; the problem is that if you have a cache miss and lots of threads are doing similar ufunc operations, then you're almost certain to have races there.
I think that's fine -- we can lock around the other parts. |
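In Python terms, the pattern being described looks roughly like this (a hypothetical sketch, not NumPy's code): read the cache without the lock, and only take the lock, re-checking after acquiring it, when filling in a missing entry.

```python
import threading

_cache = {}              # illustrative stand-in for the per-ufunc dispatch cache
_cache_lock = threading.Lock()

def lookup(key, compute):
    # Fast path: plain read, no lock. This is only OK if the container
    # itself is safe to read concurrently, which is the hard part above.
    hit = _cache.get(key)
    if hit is not None:
        return hit
    # Slow path: take the lock, re-check, and fill the cache exactly once.
    with _cache_lock:
        hit = _cache.get(key)
        if hit is None:
            hit = _cache[key] = compute(key)
        return hit
```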
See #27859 |
See #27896 for a second try at a fix :) |
Describe the issue:
Using Python 3.13 free-threading, I observe that multithreaded performance (MFLOPS) scales poorly for NumPy array computation, especially on small arrays. For comparison, performance scales well for (1) multiprocess computation and (2) multithreaded/multiprocess ordinary Python list computation. Although I measure MFLOPS here, I would guess the underlying performance issue is some per-array-access overhead, perhaps a lock that results in thread contention.
In the attached benchmark, the workers are embarrassingly parallel. I create an array/list in each worker thread/process and then do computation on it. I only time the computation, not the setup.
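As a rough sketch of that structure (hypothetical and simplified from the description above, not the actual benchmark script): each worker builds its own array, waits at a barrier, and only the compute loop is timed.

```python
import threading
import time
import numpy as np

def worker(barrier, length, iters, results, idx):
    x = np.arange(length, dtype=np.float64)   # setup, not timed
    barrier.wait()
    start = time.perf_counter()
    for _ in range(iters):
        x += 0.01                             # one flop per element
    elapsed = time.perf_counter() - start
    results[idx] = (length * iters) / elapsed / 1e6   # MFLOPS for this worker

num_threads, length, iters = 8, 100, 100_000
barrier = threading.Barrier(num_threads)
results = [0.0] * num_threads
threads = [threading.Thread(target=worker, args=(barrier, length, iters, results, i))
           for i in range(num_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"total MFLOPS across {num_threads} threads: {sum(results):.1f}")
```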
The main takeaway from the plots below is that numpy performance drops dramatically when using 8+ threads in the same process (solid orange line).
The original reporter used an AMD Ryzen Threadripper PRO 5955WX 16-Cores, but @ngoldbaum edited the description and reproducer script and used a Macbook Pro M3 Max to reproduce the original report with the updated script.
Benchmark text output:
Reproduce the code example:
Error message:
No response
Python and NumPy Versions:
2.1.3
3.13.0 experimental free-threading build | packaged by conda-forge | (main, Oct 8 2024, 20:16:19) [GCC 13.3.0]
Runtime Environment:
Context for the issue:
Our application is RL training with a robotics simulator. We use multiprocessing, with each worker doing mostly-independent CPU-heavy work. I was excited to try python 3.13 free-threading to reduce the cost of gathering results from workers--use multithreading instead of multiprocessing and thus avoid interprocess communication overhead. Instead, I see a big drop in overall performance. We use a lot of small numpy arrays for 3D math (3D positions, 4D rotation quaternions, 4x4 transform matrices, etc.).