[DTensor][random] defer DTensor RNG state sync until first random op call or manual_seed call; support more flexible OffsetBasedRNGTracker init #147025
Conversation
…call or manual_seed call; support more flexible OffsetBasedRNGTracker init [ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147025
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Unrelated Failures as of commit 58d77b2 with merge base 580f118.
NEW FAILURE - The following job has failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
LGTM in general. I left some comments.
Can you also add a PR description?
… random op call or manual_seed call; support more flexible OffsetBasedRNGTracker init" [ghstack-poisoned]
@XilunWu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
if self._device.type != "cuda":
    raise RuntimeError(
        f"{self.__class__.__name__} instantiation requires the presence of "
        f"CUDA/CUDA-like device. Got {self._device.type} instead."
    )
PR LGTM overall.
Remind me: do we have a CPU-compatible RNG tracker today?
@wconstab Unfortunately no, and it may not be very useful IMO: the main uses of DTensor random ops are model init and dropout() in model training, and neither is expected to run on CPU.
If a use case shows up, a CPU RNG tracker could be added, either by us or by users extending the base tracker class. Let me know if we already have such a use case.
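For context, here is a rough sketch of what the device restriction in the hunk above means for a CPU mesh today. This is illustrative only: the private import path of `OffsetBasedRNGTracker` and the exact error text are assumptions based on the diff shown above, and the script is expected to run under `torchrun` so a process group can be created.

```python
# Sketch only: shows the CUDA-only check above being hit for a CPU DeviceMesh.
# The private import path is an assumption and may change between releases.
import os

import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor._random import OffsetBasedRNGTracker

# Run under torchrun, e.g. `torchrun --nproc-per-node=2 this_script.py`.
world_size = int(os.environ["WORLD_SIZE"])
cpu_mesh = init_device_mesh("cpu", (world_size,))  # initializes a gloo process group

try:
    OffsetBasedRNGTracker(cpu_mesh)  # non-CUDA meshes are rejected by the check above
except RuntimeError as err:
    print(err)  # "... requires the presence of CUDA/CUDA-like device. Got cpu instead."

dist.destroy_process_group()
```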
@pytorchbot merge -f "unrelated CI failure -- periodic / linux-focal-cuda12.4-py3.10-gcc9-bazel-test / build-and-test (default, 1, 1, lf.linux.4xlarge.nvidia.gpu) is on main"
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
…call or manual_seed call; support more flexible OffsetBasedRNGTracker init (#147025)

Differential Revision: [D70201856](https://our.internmc.facebook.com/intern/diff/D70201856)

Pull Request resolved: #147025
Approved by: https://github.com/kwen2501
Stack from ghstack (oldest at bottom):
Resolves #146767.
May also resolve #147584.
### Summary

This PR removes the RNG tracker init from the `distribute_tensor` call for the following reasons:

1. If the user does not use random ops on DTensor, there is no need to init the DTensor RNG, which currently requires a CUDA device to be present.
2. It complies with the 0-communication semantic of `src_data_rank=None` shard distribution.

Besides, `OffsetBasedRNGTracker` only accepts a `DeviceMesh` argument to its constructor.

### Consequence

DTensor RNG initialization is deferred until the first DTensor random op call or `torch.distributed.tensor.random.manual_seed` call (see the sketch at the end of this description).

### Test

`pytest test/distributed/tensor/test_random_ops.py`
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`
`pytest test/distributed/tensor/parallel/test_tp_style.py`
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
Differential Revision: D70201856
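To make the deferred initialization concrete, here is a minimal usage sketch using only public DTensor APIs. The two-GPU single-node setup and the choice of `uniform_` as the first random op are illustrative assumptions, not part of this PR.

```python
# Run under torchrun, e.g. `torchrun --nproc-per-node=2 demo.py` (assumes 2 GPUs, single node).
import os

import torch
import torch.distributed as dist
import torch.distributed.tensor.random as dt_random
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
mesh = init_device_mesh("cuda", (int(os.environ["WORLD_SIZE"]),))

x = torch.arange(16, dtype=torch.float32, device="cuda").reshape(4, 4)

# With this PR, sharding alone no longer initializes or syncs the DTensor RNG state.
dt = distribute_tensor(x, mesh, [Shard(0)])

# The RNG tracker is now set up lazily by whichever of these happens first:
dt_random.manual_seed(1234, mesh)  # explicit seeding on the mesh ...
dt.uniform_()                      # ... or the first DTensor random op (e.g. init, dropout)

dist.destroy_process_group()
```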