[NOT FOR LANDING] experimental NVSHMEM integration by yifuwang · Pull Request #146593 · pytorch/pytorch · GitHub

Conversation

@yifuwang (Collaborator) commented Feb 6, 2025

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Feb 6, 2025
pytorch-bot bot commented Feb 6, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146593

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures, 2 Unrelated Failures

As of commit bcd4a36 with merge base 07b9fe0:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
yifuwang pushed a commit that referenced this pull request Feb 6, 2025
ghstack-source-id: 6c41422
Pull Request resolved: #146593
cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
yifuwang pushed a commit that referenced this pull request Feb 10, 2025
ghstack-source-id: dc491e4
Pull Request resolved: #146593
set_target_properties(nvshmem_extension PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
target_compile_options(nvshmem_extension PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-rdc=true>)
target_link_libraries(nvshmem_extension PRIVATE
${NVSHMEM_LIB_DIR}/libnvshmem.a

@seth-howell commented:

Typically we dynamically link libnvshmem_host.so and statically link libnvshmem_device.a.

You also don't need to link the extension to nvshmem_bootstrap_uid.so. It will be dynamically opened by NVSHMEM.

Alternatively, if you are only using host APIs, you can forgo linking to libnvshmem_device.a and dynamically load libnvshmem_host.so, which would mean you wouldn't actually have any build-time NVSHMEM dependencies in your module.
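
A minimal CMake sketch of the suggested linking (not the PR's actual build code), assuming `NVSHMEM_LIB_DIR` points at the NVSHMEM install's lib directory as in the snippet above:

```cmake
# Link the host library dynamically and the device library statically.
# No explicit link against nvshmem_bootstrap_uid.so: per the comment
# above, NVSHMEM dlopens the bootstrap library itself at runtime.
set_target_properties(nvshmem_extension PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
target_link_libraries(nvshmem_extension PRIVATE
  ${NVSHMEM_LIB_DIR}/libnvshmem_host.so   # dynamically linked host library
  ${NVSHMEM_LIB_DIR}/libnvshmem_device.a  # statically linked device library
)
```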

@yifuwang (Collaborator, Author) replied:

> Alternatively, if you are only using host APIs, you can forgo linking to libnvshmem_device.a and dynamically load libnvshmem_host.so

@seth-howell I tried this but it didn't work. I might've done something wrong, but I got a dynamic linker error complaining about some missing symbol (I forget the name). I tried building with the host compiler, including only the host header, and only calling nvshmem_init, but none of that helped.

@seth-howell replied:

Sorry, I probably should have provided a little more context. If you are dynamically loading the library, you will want to use the nvshmemx_hostlib_init_attr and nvshmemx_hostlib_finalize APIs instead: https://docs.nvidia.com/nvshmem/api/gen/api/setup.html#nvshmemx-hostlib-init-attr
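
A minimal sketch of that init/finalize pairing (not from the PR); the attribute initializer macro and the flags value are assumptions based on the linked setup docs, so verify them against your NVSHMEM version:

```c
// Sketch: host-library init/finalize for the case where
// libnvshmem_host.so is loaded dynamically rather than linked.
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    nvshmemx_init_attr_t attr = NVSHMEMX_INIT_ATTR_INITIALIZER;  // assumed initializer macro
    // flags = 0 assumed to select the default bootstrap; see the linked
    // docs for the NVSHMEMX_INIT_* flag constants.
    if (nvshmemx_hostlib_init_attr(0, &attr) != 0) {
        return 1;
    }
    int mype = nvshmem_my_pe();   // host APIs are usable after hostlib init
    (void)mype;
    nvshmemx_hostlib_finalize();  // pairs with nvshmemx_hostlib_init_attr
    return 0;
}
```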

target_compile_definitions(torch_cuda PRIVATE USE_NCCL)
endif()

# Use env var for these for now for prototyping purposes


FWIW, outside of the prototyping phase, NVSHMEM does support CMake's `find_package` command.
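
For reference, a hedged sketch of consuming NVSHMEM that way; the `find_package` support is per the comment above, but the imported target names below are assumptions to check against the package config shipped with the NVSHMEM install:

```cmake
find_package(NVSHMEM REQUIRED)  # package config provided by the NVSHMEM install
target_link_libraries(nvshmem_extension PRIVATE
  nvshmem::nvshmem_host    # assumed imported target for the host library
  nvshmem::nvshmem_device  # assumed imported target for the device library
)
```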

yifuwang pushed a commit to yifuwang/pytorch that referenced this pull request Feb 25, 2025
cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
yifuwang pushed a commit that referenced this pull request Mar 4, 2025
ghstack-source-id: 7f19667
Pull Request resolved: #146593
kwen2501 added a commit that referenced this pull request Apr 29, 2025
Adding NVSHMEM as a backend for `SymmetricMemory`; the implementation lives in `NVSHMEMSymmetricMemory.cu`.

Moving some helper functions from `CUDASymmetricMemory.cu` to `CUDASymmetricMemoryUtils.cpp` so that they can be shared by `NVSHMEMSymmetricMemory`. These functions are mostly side-band exchange helpers (`store_all_gather`, `IpcChannel`, etc.).

Adding a `TORCH_SYMMEM` environment variable to control which implementation is used for CUDA tensors; currently supported values are `CUDA` (the in-house impl) and `NVSHMEM`.

The NVSHMEM feature is gated by the build-time flag `USE_NVSHMEM=1`; setting `NVSHMEM_HOME` is also required (TODO).

Ported most code from #146593.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k

[ghstack-poisoned]
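
As a hypothetical example of the switch described above (not taken from the commit message): after a `USE_NVSHMEM=1` build with `NVSHMEM_HOME` set, the NVSHMEM implementation would be selected at runtime with something like `TORCH_SYMMEM=NVSHMEM python train.py`, where the script name is illustrative.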
kwen2501 added a commit that referenced this pull request Apr 29, 2025 (same description as above)
kwen2501 added a commit that referenced this pull request May 1, 2025 (same description as above)
kwen2501 added a commit that referenced this pull request May 1, 2025 (same description as above)
pytorchmergebot pushed a commit that referenced this pull request May 1, 2025 (same description as above)

Pull Request resolved: #151261
Approved by: https://github.com/fegin, https://github.com/fduwjj

Labels

no-stale · oncall: distributed (Add this issue/PR to distributed oncall triage queue) · open source · release notes: distributed (c10d) (release notes category)
