[NOT FOR LANDING] experimental NVSHMEM integration #146593

yifuwang · 2025-02-06T10:26:18Z

Stack from ghstack (oldest at bottom):

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

[ghstack-poisoned]

pytorch-bot · 2025-02-06T10:26:23Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146593

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures, 2 Unrelated Failures

As of commit bcd4a36 with merge base 07b9fe0:

NEW FAILURES - The following jobs have failed:

Check mergeability of ghstack PR / ghstack-mergeability-check (gh)
RuntimeError: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x fe66899 returned non-zero exit code 1
pull / cuda12.4-py3.10-gcc9-sm75 / build (gh)
undefined reference to `c10d::nvshmem_extension::nvshmem_reduce_scatter_out(at::Tensor&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, at::Tensor&)`
pull / linux-focal-cuda11.8-py3.10-gcc9 / build (gh)
Process completed with exit code 1.
pull / linux-focal-cuda12.4-py3.10-gcc9 / build (gh)
undefined reference to `c10d::nvshmem_extension::nvshmem_reduce_scatter_out(at::Tensor&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, at::Tensor&)`
pull / linux-focal-cuda12.4-py3.10-gcc9-sm89 / build (gh)
undefined reference to `c10d::nvshmem_extension::nvshmem_reduce_scatter_out(at::Tensor&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, at::Tensor&)`
pull / linux-focal-rocm6.3-py3.10 / build (gh)
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/CUDASymmetricMemoryOps.cu:726:13: error: no member named 'nvshmem_extension' in namespace 'c10'
pull / linux-jammy-cuda11.8-cudnn9-py3.9-clang12 / build (gh)
Process completed with exit code 1.

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

pull / linux-focal-py3.13-clang10 / test (dynamo_wrapped, 2, 3, lf.linux.2xlarge) (gh) (disabled by #144902 but the issue was closed recently and a rebase is needed to make it pass)
test_quantization.py::TestQuantizePT2EAffineQuantization::test_channel_group_quantization
pull / linux-focal-py3.9-clang10 / test (dynamo_wrapped, 2, 3, lf.linux.2xlarge) (gh) (disabled by #144902 but the issue was closed recently and a rebase is needed to make it pass)
test_quantization.py::TestQuantizePT2EAffineQuantization::test_channel_group_quantization

This comment was automatically generated by Dr. CI and updates every 15 minutes.


seth-howell · 2025-02-12T22:36:15Z

caffe2/CMakeLists.txt

```diff
+    set_target_properties(nvshmem_extension PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
+    target_compile_options(nvshmem_extension PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-rdc=true>)
+    target_link_libraries(nvshmem_extension PRIVATE
+        ${NVSHMEM_LIB_DIR}/libnvshmem.a
```


Typically we dynamically link libnvshmem_host.so and statically link libnvshmem_device.a.

You also don't need to link the extension to nvshmem_bootstrap_uid.so. It will be dynamically opened by NVSHMEM.

Alternatively, if you are only using host APIs, you can forgo linking to libnvshmem_device.a and dynamically load libnvshmem_host.so, which would mean your module has no build-time NVSHMEM dependencies at all.
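The typical linking layout recommended above can be sketched in CMake as follows. This is a minimal sketch, not the configuration used in this PR: it reuses the `nvshmem_extension` target and the `NVSHMEM_LIB_DIR` variable from the diff under review, and assumes a hypothetical `NVSHMEM_INCLUDE_DIR` variable pointing at the NVSHMEM headers.

```cmake
# Sketch: nvshmem_extension and NVSHMEM_LIB_DIR come from the diff under review;
# NVSHMEM_INCLUDE_DIR is a hypothetical variable pointing at the NVSHMEM headers.
set_target_properties(nvshmem_extension PROPERTIES
  CUDA_SEPARABLE_COMPILATION ON)  # NVSHMEM device APIs need relocatable device code

target_include_directories(nvshmem_extension PRIVATE ${NVSHMEM_INCLUDE_DIR})

target_link_libraries(nvshmem_extension PRIVATE
  ${NVSHMEM_LIB_DIR}/libnvshmem_host.so    # host library: linked dynamically
  ${NVSHMEM_LIB_DIR}/libnvshmem_device.a)  # device library: linked statically

# Deliberately no link to nvshmem_bootstrap_uid.so: per the review comment,
# NVSHMEM opens its bootstrap plugin dynamically at runtime.
```

In the host-API-only alternative, the `libnvshmem_device.a` entry would be dropped as well.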

> Alternatively, if you are only using host APIs, you can forego linking to libnvshmem_device.a and dynamically load libnvshmem_host.so

@seth-howell I tried this, but it didn't work. I might've done something wrong, but I got a dynamic linker error complaining about a missing symbol (I forgot the name). I tried building with the host compiler, including only the host header, and calling only nvshmem_init, but none of that helped.

Sorry, I probably should have provided a little more context. If you are dynamically loading the library, you will want to use the nvshmemx_hostlib_init_attr and nvshmemx_hostlib_finalize APIs instead. https://docs.nvidia.com/nvshmem/api/gen/api/setup.html#nvshmemx-hostlib-init-attr

seth-howell · 2025-02-12T23:43:16Z

caffe2/CMakeLists.txt

```diff
    target_compile_definitions(torch_cuda PRIVATE USE_NCCL)
  endif()
+
+  # Use env var for these for now for prototyping purposes
```


FWIW, outside of the prototyping phase, NVSHMEM does support CMake's find_package command.
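For reference, a `find_package`-based version of the same link step might look like the sketch below. The imported target names `nvshmem::nvshmem_host` and `nvshmem::nvshmem_device` are assumptions here; check the names exported by the package config shipped with your NVSHMEM install.

```cmake
# Sketch: locate NVSHMEM through its installed CMake package config.
# If it is not found, point CMAKE_PREFIX_PATH at the install prefix
# (or NVSHMEM_DIR at the directory holding the package config file).
find_package(NVSHMEM REQUIRED)

# Assumed imported target names; verify against your NVSHMEM installation.
target_link_libraries(nvshmem_extension PRIVATE
  nvshmem::nvshmem_host
  nvshmem::nvshmem_device)
```

Compared with hard-coded paths under `NVSHMEM_LIB_DIR`, imported targets would also carry include directories and other usage requirements with them.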



Adding NVSHMEM as a backend for `SymmetricMemory`, implemented in `NVSHMEMSymmetricMemory.cu`. Moving some helper functions from `CUDASymmetricMemory.cu` to `CUDASymmetricMemoryUtils.cpp` so that they can be shared with `NVSHMEMSymmetricMemory`; these are mostly side-band exchange helpers (`store_all_gather`, `IpcChannel`, etc.). Adding `TORCH_SYMMEM` to control which implementation is used for CUDA tensors; currently supported values are `CUDA` (in-house implementation) and `NVSHMEM`. The NVSHMEM feature is gated by the build-time flag `USE_NVSHMEM=1`, and setting `NVSHMEM_HOME` is required (TODO). Ported most code from #146593.

Pull Request resolved: #151261
Approved by: https://github.com/fegin, https://github.com/fduwjj
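The build-time gating described in this commit message could be wired up roughly as in the following hypothetical CMake sketch, modeled on the `USE_NCCL` pattern in `caffe2/CMakeLists.txt`. It is not the code from #151261; only `USE_NVSHMEM`, `NVSHMEM_HOME`, and the `torch_cuda` target come from this thread.

```cmake
# Hypothetical sketch: gate the NVSHMEM SymmetricMemory backend on a build flag.
option(USE_NVSHMEM "Build the NVSHMEM SymmetricMemory backend" OFF)

if(USE_NVSHMEM)
  # Mirrors the "NVSHMEM_HOME setting is required" note in the commit message.
  if(NOT DEFINED ENV{NVSHMEM_HOME})
    message(FATAL_ERROR "USE_NVSHMEM=1 requires the NVSHMEM_HOME environment variable")
  endif()
  set(NVSHMEM_HOME "$ENV{NVSHMEM_HOME}")

  target_include_directories(torch_cuda PRIVATE "${NVSHMEM_HOME}/include")
  # Lets C++/CUDA sources compile the backend conditionally (#ifdef USE_NVSHMEM).
  target_compile_definitions(torch_cuda PRIVATE USE_NVSHMEM)
endif()
```

With such a build in place, the runtime choice between the two implementations is still made through `TORCH_SYMMEM` (`CUDA` or `NVSHMEM`), as the commit message describes.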

[NOT FOR LANDING] experimental NVSHMEM integration

702bb7a

[ghstack-poisoned]

yifuwang mentioned this pull request Feb 6, 2025

clang-format CUDASymmetricMemory.cu #146592

Closed

pytorch-bot added the oncall: distributed and release notes: distributed (c10d) labels Feb 6, 2025

Update on "[NOT FOR LANDING] experimental NVSHMEM integration"

792d2da

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]

yifuwang pushed a commit that referenced this pull request Feb 6, 2025

[NOT FOR LANDING] experimental NVSHMEM integration

541bc69

ghstack-source-id: 6c41422 Pull Request resolved: #146593

Update on "[NOT FOR LANDING] experimental NVSHMEM integration"

25e75fa

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]

yifuwang pushed a commit that referenced this pull request Feb 10, 2025

[NOT FOR LANDING] experimental NVSHMEM integration

24d855c

ghstack-source-id: dc491e4 Pull Request resolved: #146593

seth-howell reviewed Feb 14, 2025


yifuwang pushed a commit to yifuwang/pytorch that referenced this pull request Feb 25, 2025

[NOT FOR LANDING] experimental NVSHMEM integration

76dd06c

ghstack-source-id: dc491e4 Pull Request resolved: pytorch#146593

Update on "[NOT FOR LANDING] experimental NVSHMEM integration"

bcd4a36

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]

yifuwang pushed a commit that referenced this pull request Mar 4, 2025

[NOT FOR LANDING] experimental NVSHMEM integration

fe66899

ghstack-source-id: 7f19667 Pull Request resolved: #146593

pytorchbot added the open source label Mar 6, 2025

kwen2501 added the no-stale label Apr 9, 2025

kwen2501 mentioned this pull request Apr 29, 2025

[SymmMem] Experimental NVSHMEM integration #151261

Closed


5 participants
