Add torch.cuda.streams.ExternalStream
#57781
Conversation
cc: @gmarkall @pentschev @jakirkham @kkraus14 for vis |
Codecov Report
@@ Coverage Diff @@
## master #57781 +/- ##
==========================================
+ Coverage 76.44% 76.52% +0.07%
==========================================
Files 1990 2008 +18
Lines 199690 201099 +1409
==========================================
+ Hits 152651 153885 +1234
- Misses 47039 47214 +175 |
See previous attempt #39567 |
This is basically the same thing as the previous PR, which @dzhulgakov convinced me would be OK to take as a stopgap. So I guess I'm OK with this too. I'm a little concerned about the ROCm failure, though I swear I've seen this event failure in some other PRs too. |
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
giving others a chance to review |
According to @ngimel the only blocker is making sure you don't silently overflow the 32 external streams you get (just need to add a little error checking there) |
@ezyang thanks for the follow-up!
I believe this is not as easy as it might seem:

import torch

streams = []
stream_ids = set()
for i in range(128):
    streams.append(torch.cuda.Stream())
    if streams[-1].cuda_stream in stream_ids:
        raise ValueError(f'Stream {i} reallocated')
    else:
        stream_ids.add(streams[-1].cuda_stream)

----
Traceback (most recent call last):
  File "stream_stress.py", line 8, in <module>
    raise ValueError(f'Stream {i} reallocated')
ValueError: Stream 32 reallocated

There is no destructor in CUDAStream that allows marking a stream as used or unused, so I believe the only cleaner way is to have a hash map that holds the externally allocated streams with the |
Have we considered doing pointer bit packing tricks? :) Stream/CUDAStream is currently 1 byte for device type, 1 byte for device index, and 4 bytes for stream id: https://github.com/pytorch/pytorch/blob/master/c10/core/Stream.h#L120. We don't seem to pass around the raw stream id much at all. So let's say we extend StreamId to 6 bytes. On x86-64, pointers are only 48 bits long, so we can easily pack the cudaStream_t pointer as the stream id. Of course we need to distinguish the stream type, but that's easy: the CUDA stream currently has the following scheme for allocating ids: https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDAStream.cpp#L76 - they are always small (7 bits), so those aren't valid pointers either. This way we can just say that if a StreamId is more than, say, 256, it is reinterpreted as a cudaStream_t (which is defined as a pointer, by the way). I know it sounds crazy, but it's a small change from today's code and doesn't bring any need for bookkeeping / static limitations |
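A rough Python sketch of the packing scheme described above (the threshold and mask constants are illustrative, not the actual values used in c10):

```python
# Rough sketch of the proposed id scheme (constants are illustrative).
# Pool-allocated streams have small ids (a few bits), so any id above a
# small threshold can be reinterpreted as a raw cudaStream_t pointer
# belonging to an external stream.
NATIVE_ID_THRESHOLD = 256   # ids below this belong to the native pool
NATIVE_INDEX_MASK = 0x7F    # pool ids are 7 bits in the current scheme

def pack_external(stream_ptr):
    # On x86-64 a user-space pointer fits in 48 bits, so it fits in a
    # widened StreamId; it is never a valid (small) native id.
    assert stream_ptr >= NATIVE_ID_THRESHOLD, "pointer collides with native ids"
    return stream_ptr

def unpack(stream_id):
    if stream_id >= NATIVE_ID_THRESHOLD:
        # External: the id *is* the cudaStream_t pointer.
        return ("external", stream_id)
    # Native: decode the pool index from the low bits.
    return ("native", stream_id & NATIVE_INDEX_MASK)
```

With this sketch, `unpack(pack_external(0x7f3a00001000))` comes back as an external stream carrying the original pointer, while small ids stay native pool streams.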
To be sure I understand this proposal: you're suggesting that instead of subclassing CUDAStream with something like CUDAStreamExternal, CUDAStream itself simply understands whether it is a "native" or "external" stream, and packs, unpacks, and destroys itself appropriately? For example, the unpack method would change to interrogate the id, infer whether the CUDAStream is "native" or "external", and configure itself properly. Similarly, the CUDAStream class would get a destructor that checks whether the stream is "external" and, if so, destroys it? That makes sense to me if I'm understanding you correctly, @dzhulgakov. We should be very careful to document this. |
Bit packing is very reasonable and might just moot our entire conversation here. However, just to clarify the earlier point: @emcastillo Sorry, I wasn't clear, I was suggesting something very very dumb: just hard error at the point we overflow. Don't bother trying to find unused stream ids that can be reused. That should be simple to implement, and blow up very loudly when someone tries to use too many external streams. |
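A minimal sketch of the "hard error on overflow" idea: track how many external-stream slots have been handed out and fail loudly past the limit. The registry class and the 32-slot limit here are illustrative, not the real c10 implementation.

```python
# Hypothetical registry that refuses to overflow: once the fixed number
# of slots is exhausted, registering another external stream raises
# instead of silently reusing an existing slot.
MAX_EXTERNAL_STREAMS = 32  # illustrative; mirrors the 32-stream limit seen above

class ExternalStreamRegistry:
    def __init__(self):
        self._used = 0

    def register(self, stream_ptr):
        if self._used >= MAX_EXTERNAL_STREAMS:
            raise RuntimeError(
                f"Cannot register more than {MAX_EXTERNAL_STREAMS} external "
                "streams; refusing to silently reuse a stream slot"
            )
        self._used += 1
        return self._used - 1  # slot index handed to the caller
```

The 33rd call to `register` blows up loudly, which is exactly the behavior being proposed.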
We can go even simpler and say that the ownership of the external stream stays with the caller. So we don't even need the destructor |
That sounds good. Is that the DLPack ownership model for streams? That the producer manages the stream lifetime independent of the consumer? |
I think it is reasonable to assume that the stream life-cycle is the responsibility of the caller. |
In the DLPack protocol, the consumer hands its stream to the producer, and the producer decides what to do (to sync immediately or just establish the stream order). |
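A toy model of that handoff (the classes are stand-ins, not real torch or DLPack objects; only the shape of the protocol is taken from the DLPack Python-API convention, where the consumer passes its stream as the `stream=` argument of `__dlpack__`):

```python
# Toy illustration of the DLPack stream handoff: the consumer hands its
# stream to the producer, and the producer decides whether to sync
# immediately or just establish stream order.
class ConsumerStream:
    def __init__(self):
        self.waits_on = []  # events this stream has been ordered after

class Producer:
    def __dlpack__(self, stream=None):
        if stream is not None:
            # Establish stream order: make the consumer's stream wait on
            # the producer's pending work (instead of a full device sync).
            stream.waits_on.append("producer-work-event")
        return "dlpack-capsule"  # a real implementation returns a PyCapsule
```

Under this model the producer never takes ownership of the consumer's stream; it only orders work against it, which matches the "ownership stays with the caller" point above.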
I'm slowly grinding through the bit patterns |
I verified that the bit twiddling works correctly in the sign-extended case using Crux (thanks @atomb) with the following program:
Crux seems to work pretty well and is able to detect bugs when I slightly modify the program to introduce them (e.g., forgetting to write 1ULL instead of 1, or forgetting to cast to a same-size unsigned integer type before widening). |
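The sign-extension hazard being checked here can also be demonstrated in plain Python by emulating C's 64-bit signed arithmetic. All helper names and constants below are illustrative, not the actual c10 code:

```python
# Packing a 48-bit pointer into a signed 64-bit StreamId can flip the id
# negative when bit 47 is set, so unpacking must mask back down to 48
# bits rather than trusting the raw signed value.
MASK48 = (1 << 48) - 1

def to_signed64(x):
    """Reinterpret a 64-bit pattern as a signed int64, like C would."""
    x &= (1 << 64) - 1
    return x - (1 << 64) if x & (1 << 63) else x

def sign_extend48(ptr48):
    """What storing a 48-bit pointer in an int64 with sign extension does."""
    if ptr48 & (1 << 47):
        ptr48 |= ~MASK48  # fill the top 16 bits with ones
    return to_signed64(ptr48)

def unpack_ptr(stream_id):
    # Correct: mask first; comparing the raw signed value would be a bug.
    return stream_id & MASK48

high_ptr = (1 << 47) | 0x1234          # pointer with bit 47 set
packed = sign_extend48(high_ptr)
assert packed < 0                       # a naive `id > 256` check fails here
assert unpack_ptr(packed) == high_ptr   # masking recovers the pointer
```

This is the same class of bug as forgetting the `1ULL`/unsigned-cast idioms mentioned above: the widening happens with the sign bit along for the ride.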
c10/cuda/CUDAStream.cpp
Outdated
AT_ASSERT(ptr);
return ptr->stream;
int64_t stream_id = unwrap().id();
// the stream_ids managed from the pool have only 8 bits
this comment out of date now
this is so great, thank you so much for writing this
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
need to skip tests on rocm |
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Needs internal updates (related to int to int64) |
needs more internal updates (related to CUDA) |
This pull request has been reverted by 689a5ed. |
what happened? 😥 |
@emcastillo If you could push a new copy of the PR, that would be a help. My guess is that the use of ctypes to access CUDA library functionality is what's not working. We can instead stick all the functions we need in torch/csrc/cuda/shared/cudart.cpp, and that should be good enough to get the test going. |
Sure! let me work on it. |
Summary: This is required in pytorch#57110 (comment). We need to provide means to synchronize on externally allocated streams for DLPack support in the Python array data API.
cc mruberry rgommers leofang asi1024 kmaehashi
Pull Request resolved: pytorch#57781
Reviewed By: mrshenli
Differential Revision: D28326365
Pulled By: ezyang
fbshipit-source-id: b67858c8033949951b49a3d319f649884dfd0a91
Summary: Previous is #57781. We now add two CUDA bindings to avoid using ctypes, fixing a Windows issue. However, we still use ctypes to allocate the stream and create its pointer (we could do this with a 0-dim tensor too if that feels better).
cc ezyang rgommers ngimel mruberry
Pull Request resolved: #59527
Reviewed By: albanD
Differential Revision: D29053062
Pulled By: ezyang
fbshipit-source-id: 661e7e58de98b1bdb7a0871808cd41d91fe8f13f