c10d/gloo: add ibverbs backend (#153015) · harikodali/pytorch@d900c68 · GitHub

Commit d900c68

d4l3k authored and pytorchmergebot committed
c10d/gloo: add ibverbs backend (pytorch#153015)
Summary:
X-link: pytorch/gloo#437

This provides a new "UnboundBuffer" implementation for the Gloo ibverbs backend so it can be used with PyTorch.

This currently passes basic tests such as `reduce_test` and `send_recv_test`, but there are a number of failures. Putting this up for review so the follow-up fixes are less of a mega PR, and so we can start doing some initial E2E testing with PyTorch.

Known issues:
* recv from any is not supported
* AllreduceBcubeBase2 is failing

Test Plan:
```
buck2 run mode/dbgo //gloo/test:send_recv_test_ibverbs
buck2 test //gloo/test:
GLOO_DEVICE_TRANSPORT=IBVERBS buck2 run @//mode/opt //caffe2/test/distributed:c10d -- -r '.*gloo.*' -f
```

We can't run any of the gloo tests in CI since none of our CI machines have ibverbs, so they're disabled by default and need to be run manually.

Differential Revision: D73291471
Pull Request resolved: pytorch#153015
Approved by: https://github.com/fduwjj
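The test plan selects the new backend by setting `GLOO_DEVICE_TRANSPORT=IBVERBS`. As a rough illustration of how a name-keyed creator registry can be driven by that variable, here is a minimal sketch. `Device`, `deviceRegistry`, and `makeDeviceFromEnv` are hypothetical stand-ins written for this example; they are not the actual `GlooDeviceRegistry` API, and the real lookup logic in PyTorch may differ.

```cpp
#include <cstdlib>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for ::gloo::transport::Device.
struct Device {
  std::string transport;
};

using Creator = std::function<std::shared_ptr<Device>()>;

// Hypothetical name-keyed registry of device creators.
inline std::map<std::string, Creator>& deviceRegistry() {
  static std::map<std::string, Creator> registry;
  return registry;
}

// Picks the creator named by GLOO_DEVICE_TRANSPORT, using `fallback`
// when the variable is unset or empty. Throws on an unknown name.
inline std::shared_ptr<Device> makeDeviceFromEnv(const std::string& fallback) {
  std::string name = fallback;
  const char* raw = std::getenv("GLOO_DEVICE_TRANSPORT");
  if (raw != nullptr && *raw != '\0') {
    name = raw;
  }
  auto it = deviceRegistry().find(name);
  if (it == deviceRegistry().end()) {
    throw std::runtime_error("unknown Gloo transport: " + name);
  }
  return it->second();
}
```

With creators registered under `"TCP"` and `"IBVERBS"`, exporting `GLOO_DEVICE_TRANSPORT=IBVERBS` before the call would return the ibverbs device in this sketch, and leaving it unset would return the fallback.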
1 parent 7cdf504 commit d900c68

1 file changed: +45 -0 lines changed

torch/csrc/distributed/c10d/GlooDeviceFactory.cpp

Lines changed: 45 additions & 0 deletions
@@ -1,5 +1,7 @@
 #include <torch/csrc/distributed/c10d/GlooDeviceFactory.hpp>
 
+#include <torch/csrc/distributed/c10d/Utils.hpp>
+
 #ifdef USE_C10D_GLOO
 
 #include <cstdlib>
@@ -19,6 +21,10 @@
 #include <gloo/transport/uv/device.h>
 #endif
 
+#if GLOO_HAVE_TRANSPORT_IBVERBS
+#include <gloo/transport/ibverbs/device.h>
+#endif
+
 // On Linux, check that the tcp transport is available.
 #ifdef __linux__
 #if !GLOO_HAVE_TRANSPORT_TCP
@@ -140,6 +146,45 @@ C10_REGISTER_CREATOR(GlooDeviceRegistry, WIN32, makeUVDevice)
 C10_REGISTER_CREATOR(GlooDeviceRegistry, UV, makeUVDevice)
 #endif
 
+#if GLOO_HAVE_TRANSPORT_IBVERBS
+static std::shared_ptr<::gloo::transport::Device> makeIBVerbsDevice(
+    const std::string& interface,
+    const std::string& hostname,
+    bool lazyInit) {
+  TORCH_CHECK(hostname.empty(), "ibverbs transport does not support hostname");
+
+  TORCH_CHECK(!lazyInit, "transport does not support lazy init");
+
+  ::gloo::transport::ibverbs::attr attr;
+  attr.name = getCvarString(
+      {
+          "TORCH_GLOO_IBV_NAME",
+      },
+      "");
+  attr.port = getCvarInt(
+      {
+          "TORCH_GLOO_IBV_PORT",
+      },
+      1);
+  attr.index = getCvarInt(
+      {
+          "TORCH_GLOO_IBV_INDEX",
+      },
+      0);
+
+  if (!interface.empty()) {
+    attr.name = interface;
+  }
+
+  // use global port
+  attr.port = 1;
+
+  return ::gloo::transport::ibverbs::CreateDevice(attr);
+}
+
+C10_REGISTER_CREATOR(GlooDeviceRegistry, IBVERBS, makeIBVerbsDevice)
+#endif
+
 namespace {
 std::shared_ptr<::gloo::transport::Device> makeGlooDevice(
     const std::string& interfaceName,
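
The order in which `makeIBVerbsDevice` resolves its attributes is easy to miss in the diff: environment defaults are read first, a non-empty `interface` argument then overrides the device name, and `attr.port` is finally reset to 1 regardless of `TORCH_GLOO_IBV_PORT`. The standalone sketch below mirrors that order without depending on Gloo or PyTorch. `IbvAttr`, `envString`, `envInt`, and `resolveIbvAttr` are hypothetical names introduced here, and the two helpers are simplified stand-ins for `getCvarString` / `getCvarInt` (one variable name, one default), not their real signatures.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical mirror of the three fields the diff sets on
// ::gloo::transport::ibverbs::attr.
struct IbvAttr {
  std::string name;
  int port = 1;
  int index = 0;
};

// Simplified stand-in for getCvarString: returns the env value or `def`.
inline std::string envString(const char* key, const std::string& def) {
  const char* raw = std::getenv(key);
  return raw != nullptr ? std::string(raw) : def;
}

// Simplified stand-in for getCvarInt: parses the env value or returns `def`.
inline int envInt(const char* key, int def) {
  const char* raw = std::getenv(key);
  return raw != nullptr ? std::stoi(raw) : def;
}

// Applies the same resolution order as the diff.
inline IbvAttr resolveIbvAttr(const std::string& iface) {
  IbvAttr attr;
  attr.name = envString("TORCH_GLOO_IBV_NAME", "");
  attr.port = envInt("TORCH_GLOO_IBV_PORT", 1);
  attr.index = envInt("TORCH_GLOO_IBV_INDEX", 0);
  if (!iface.empty()) {
    attr.name = iface;  // an explicit interface wins over the env var
  }
  attr.port = 1;  // as in the diff: the port is reset to 1 unconditionally
  return attr;
}
```

Under this sketch, setting `TORCH_GLOO_IBV_PORT=2` still yields `port == 1`, which matches what the committed code does as written; whether that override is intended is not stated in the commit message.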
