[DTensor][random] defer DTensor RNG state sync until first random op call or manual_seed call; support more flexible OffsetBasedRNGTracker init #147025
Conversation
…call or manual_seed call; support more flexible OffsetBasedRNGTracker init [ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147025
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Unrelated Failures as of commit 58d77b2 with merge base 580f118.
NEW FAILURE - The following job has failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
LGTM in general. I left some comments.
Can you also add a PR description?
… random op call or manual_seed call; support more flexible OffsetBasedRNGTracker init" [ghstack-poisoned]
@XilunWu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
if self._device.type != "cuda":
    raise RuntimeError(
        f"{self.__class__.__name__} instantiation requires the presence of "
        f"CUDA/CUDA-like device. Got {self._device.type} instead."
    )
PR LGTM overall.
Remind me: do we have a CPU-compatible RNG tracker today?
@wconstab Unfortunately no, and it may not be very useful IMO: the main uses of DTensor random ops are model init and dropout() in model training, and neither is expected to run on CPU.
If a use case shows up, a CPU RNG tracker could be added, either by us or by users extending the base tracker class. Let me know if we already have such a use case.
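For context, here is a rough sketch of what the device restriction in the hunk above means for a CPU mesh today. This is illustrative only: the private import path of `OffsetBasedRNGTracker` and the exact error text are assumptions based on the diff shown above, and the script is expected to run under `torchrun` so a process group can be created.

```python
# Sketch only: shows the CUDA-only check above being hit for a CPU DeviceMesh.
# The private import path is an assumption and may change between releases.
import os

import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor._random import OffsetBasedRNGTracker

# Run under torchrun, e.g. `torchrun --nproc-per-node=2 this_script.py`.
world_size = int(os.environ["WORLD_SIZE"])
cpu_mesh = init_device_mesh("cpu", (world_size,))  # initializes a gloo process group

try:
    OffsetBasedRNGTracker(cpu_mesh)  # non-CUDA meshes are rejected by the check above
except RuntimeError as err:
    print(err)  # "... requires the presence of CUDA/CUDA-like device. Got cpu instead."

dist.destroy_process_group()
```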
@pytorchbot merge -f "unrelated CI failure -- periodic / linux-focal-cuda12.4-py3.10-gcc9-bazel-test / build-and-test (default, 1, 1, lf.linux.4xlarge.nvidia.gpu) is on main"
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
…call or manual_seed call; support more flexible OffsetBasedRNGTracker init (#147025)

Differential Revision: [D70201856](https://our.internmc.facebook.com/intern/diff/D70201856)

Pull Request resolved: #147025
Approved by: https://github.com/kwen2501
Stack from ghstack (oldest at bottom):
Resolves #146767.
May also resolve #147584.
### Summary

This PR removes the RNG tracker init from the `distribute_tensor` call for the following reasons:

1. If the user does not use random ops on DTensor, there is no need to init the DTensor RNG, which currently requires a CUDA device to be present.
2. It complies with the 0-communication semantic of `src_data_rank=None` shard distribution.

Besides, `OffsetBasedRNGTracker` only accepts a `DeviceMesh` argument to its constructor.

### Consequence

DTensor RNG initialization is deferred until the first DTensor random op call or `torch.distributed.tensor.random.manual_seed` call (see the sketch at the end of this description).

### Test

`pytest test/distributed/tensor/test_random_ops.py`
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`
`pytest test/distributed/tensor/parallel/test_tp_style.py`
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
Differential Revision: D70201856
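To make the deferred initialization concrete, here is a minimal usage sketch using only public DTensor APIs. The two-GPU single-node setup and the choice of `uniform_` as the first random op are illustrative assumptions, not part of this PR.

```python
# Run under torchrun, e.g. `torchrun --nproc-per-node=2 demo.py` (assumes 2 GPUs, single node).
import os

import torch
import torch.distributed as dist
import torch.distributed.tensor.random as dt_random
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
mesh = init_device_mesh("cuda", (int(os.environ["WORLD_SIZE"]),))

x = torch.arange(16, dtype=torch.float32, device="cuda").reshape(4, 4)

# With this PR, sharding alone no longer initializes or syncs the DTensor RNG state.
dt = distribute_tensor(x, mesh, [Shard(0)])

# The RNG tracker is now set up lazily by whichever of these happens first:
dt_random.manual_seed(1234, mesh)  # explicit seeding on the mesh ...
dt.uniform_()                      # ... or the first DTensor random op (e.g. init, dropout)

dist.destroy_process_group()
```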