Update base for Update on "Distributed Autograd - FAST mode backward pass implementation." · pytorch/pytorch@0ea3425 · GitHub


Commit 0ea3425

pritam committed
Update base for Update on "Distributed Autograd - FAST mode backward pass implementation."

[test all] This change implements the "FAST" mode distributed autograd backward pass as described in #23110.

At a high level, the backward pass works as follows:

1. We start by computing dependencies on the node that calls `torch.distributed.backward`.
2. This node computes the dependencies starting from the root nodes provided in the backward call and all the 'send' functions present in the current autograd context. The "FAST" mode assumes all 'send' functions are part of the autograd computation.
3. Once the dependency computation is done, the distributed autograd engine calls the local autograd engine to execute the autograd graph. Note that the autograd graph on a single node is not necessarily connected because of inter-node communication. As a result, we have special handling to ensure the local autograd engine executes the entire graph starting from the provided roots and all 'send' functions on the node.
4. When the local autograd engine hits a 'recv' function, it performs an async RPC to send the gradients over to the appropriate node and stores a future in the autograd context to keep track of this RPC.
5. On the destination node, the appropriate 'send' function is looked up and enqueued on the local autograd engine. If this is the first time the node is hearing about this autograd context id on the backward pass, the node computes dependencies for the local autograd engine.
6. As part of computing dependencies, the distributed autograd engine discovers all leaf nodes and ensures those are passed as 'outputs' to the local autograd engine. This avoids running the 'AccumulateGrad' function.
7. The gradients computed for the leaf nodes are then accumulated in `DistAutogradContext` for the appropriate autograd context id.
8. The distributed autograd engine waits for the local autograd engine to complete and also waits for all the 'Futures' (stored in step 4) for the respective RPCs to finish.

We have made the following changes to the local autograd engine for this purpose:

1. Expose GraphTask and NodeTask so that the distributed autograd engine can use them.
2. Expose an `execute_with_graph_task` API, which allows the distributed engine to build a GraphTask and pass it to the local autograd engine.
3. Expose an `enqueue_on_cpu` API, which allows the distributed engine to build a `NodeTask` for a 'send' function and enqueue it on the local autograd engine.

In addition, a few general improvements:

1. Added a `PropagateGradients` RPC call for the 'recv' function to pass gradients to the appropriate node during the backward pass.
2. Use IValues as much as possible in serialization for RpcWithAutograd.
3. If `Future.wait()` receives a message of type EXCEPTION, we throw an appropriate exception instead of just returning the message. This is in line with what most Future.wait() APIs do, and was mostly done to ensure Future.wait() propagates errors correctly on the backward pass.
4. Added a `get_gradients(context_id)` API, which allows users to retrieve a map from Tensor to its respective gradient for the provided context_id on the local node (see the usage sketch below).

Differential Revision: [D17652615](https://our.internmc.facebook.com/intern/diff/D17652615/)

[ghstack-poisoned]
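As a usage sketch, here is roughly how this backward pass is driven from Python. It uses the `torch.distributed.rpc` / `torch.distributed.autograd` module names and signatures as they stabilized in later PyTorch releases, and assumes a two-worker RPC setup already exists; the exact entry point named above (`torch.distributed.backward`) may differ at this commit.

```python
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc

# Assumes rpc.init_rpc("worker0", rank=0, world_size=2) has already run here,
# with a peer "worker1" initialized the same way.
def run_backward_on_worker0():
    with dist_autograd.context() as context_id:
        t1 = torch.rand((3, 3), requires_grad=True)
        t2 = torch.rand((3, 3), requires_grad=True)
        # Forward pass: the RPC records a matching 'send'/'recv' pair in the
        # autograd graphs of the two workers.
        loss = rpc.rpc_sync("worker1", torch.add, args=(t1, t2)).sum()
        # FAST-mode backward: dependencies are computed from the root and from
        # every 'send' function in this context; remote failures propagate as
        # exceptions out of the underlying Future.wait() (improvement 3 above).
        dist_autograd.backward(context_id, [loss])
        # Gradients accumulate per-context rather than in t1.grad / t2.grad.
        return dist_autograd.get_gradients(context_id)
```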
2 parents 3876a60 + f35d7d4 commit 0ea3425

File tree

152 files changed (+13360, -1120 lines)


.circleci/config.yml

Lines changed: 44 additions & 0 deletions
@@ -1312,6 +1312,28 @@ jobs:
       - should_run_job
       - checkout
       - run_brew_for_ios_build
+      - run:
+          name: cert install
+          no_output_timeout: "1h"
+          command: |
+            set -e
+            PROJ_ROOT=/Users/distiller/project
+            cd ${PROJ_ROOT}/ios/TestApp
+            # install fastlane
+            sudo gem install bundler && bundle install
+            # install certificates
+            echo ${IOS_CERT_KEY} >> cert.txt
+            base64 --decode cert.txt -o Certificates.p12
+            rm cert.txt
+            bundle exec fastlane install_cert
+            # install the provisioning profile
+            PROFILE=TestApp_CI.mobileprovision
+            PROVISIONING_PROFILES=~/Library/MobileDevice/Provisioning\ Profiles
+            mkdir -pv "${PROVISIONING_PROFILES}"
+            cd "${PROVISIONING_PROFILES}"
+            echo ${IOS_SIGN_KEY} >> cert.txt
+            base64 --decode cert.txt -o ${PROFILE}
+            rm cert.txt
       - run:
           name: Build
           no_output_timeout: "1h"
@@ -1344,6 +1366,24 @@
             export IOS_ARCH=${IOS_ARCH}
             export IOS_PLATFORM=${IOS_PLATFORM}
             unbuffer ${PROJ_ROOT}/scripts/build_ios.sh 2>&1 | ts
+      - run:
+          name: Test
+          no_output_timeout: "30m"
+          command: |
+            set -e
+            PROJ_ROOT=/Users/distiller/project
+            PROFILE=TestApp_CI
+            # run the ruby build script
+            if ! [ -x "$(command -v xcodebuild)" ]; then
+              echo 'Error: xcodebuild is not installed.'
+              exit 1
+            fi
+            echo ${IOS_DEV_TEAM_ID}
+            ruby ${PROJ_ROOT}/scripts/xcode_build.rb -i ${PROJ_ROOT}/build_ios/install -x ${PROJ_ROOT}/ios/TestApp/TestApp.xcodeproj -p ${IOS_PLATFORM} -c ${PROFILE} -t ${IOS_DEV_TEAM_ID}
+            if ! [ "$?" -eq "0" ]; then
+              echo 'xcodebuild failed!'
+              exit 1
+            fi
 
   # update_s3_htmls job
   # These jobs create html files for every cpu/cu## folder in s3. The html
@@ -1925,14 +1965,18 @@ workflows:
       # Pytorch iOS PR builds
       - pytorch_ios_build:
           name: pytorch_ios_10_2_1_x86_64_build
+          context: org-member
           build_environment: "pytorch-ios-10.2.1-x86_64_build"
+          ios_arch: "x86_64"
           ios_platform: "SIMULATOR"
           requires:
             - setup
       - pytorch_ios_build:
           name: pytorch_ios_10_2_1_arm64_build
+          context: org-member
           build_environment: "pytorch-ios-10.2.1-arm64_build"
           ios_arch: "arm64"
+          ios_platform: "OS"
           requires:
             - setup
       - caffe2_linux_build:

.circleci/verbatim-sources/job-specs-custom.yml

Lines changed: 40 additions & 0 deletions
@@ -412,6 +412,28 @@
       - should_run_job
       - checkout
       - run_brew_for_ios_build
+      - run:
+          name: cert install
+          no_output_timeout: "1h"
+          command: |
+            set -e
+            PROJ_ROOT=/Users/distiller/project
+            cd ${PROJ_ROOT}/ios/TestApp
+            # install fastlane
+            sudo gem install bundler && bundle install
+            # install certificates
+            echo ${IOS_CERT_KEY} >> cert.txt
+            base64 --decode cert.txt -o Certificates.p12
+            rm cert.txt
+            bundle exec fastlane install_cert
+            # install the provisioning profile
+            PROFILE=TestApp_CI.mobileprovision
+            PROVISIONING_PROFILES=~/Library/MobileDevice/Provisioning\ Profiles
+            mkdir -pv "${PROVISIONING_PROFILES}"
+            cd "${PROVISIONING_PROFILES}"
+            echo ${IOS_SIGN_KEY} >> cert.txt
+            base64 --decode cert.txt -o ${PROFILE}
+            rm cert.txt
       - run:
           name: Build
           no_output_timeout: "1h"
@@ -444,3 +466,21 @@
             export IOS_ARCH=${IOS_ARCH}
             export IOS_PLATFORM=${IOS_PLATFORM}
             unbuffer ${PROJ_ROOT}/scripts/build_ios.sh 2>&1 | ts
+      - run:
+          name: Test
+          no_output_timeout: "30m"
+          command: |
+            set -e
+            PROJ_ROOT=/Users/distiller/project
+            PROFILE=TestApp_CI
+            # run the ruby build script
+            if ! [ -x "$(command -v xcodebuild)" ]; then
+              echo 'Error: xcodebuild is not installed.'
+              exit 1
+            fi
+            echo ${IOS_DEV_TEAM_ID}
+            ruby ${PROJ_ROOT}/scripts/xcode_build.rb -i ${PROJ_ROOT}/build_ios/install -x ${PROJ_ROOT}/ios/TestApp/TestApp.xcodeproj -p ${IOS_PLATFORM} -c ${PROFILE} -t ${IOS_DEV_TEAM_ID}
+            if ! [ "$?" -eq "0" ]; then
+              echo 'xcodebuild failed!'
+              exit 1
+            fi
Lines changed: 4 additions & 0 deletions
@@ -1,13 +1,17 @@
   # Pytorch iOS PR builds
   - pytorch_ios_build:
       name: pytorch_ios_10_2_1_x86_64_build
+      context: org-member
       build_environment: "pytorch-ios-10.2.1-x86_64_build"
+      ios_arch: "x86_64"
       ios_platform: "SIMULATOR"
       requires:
         - setup
   - pytorch_ios_build:
       name: pytorch_ios_10_2_1_arm64_build
+      context: org-member
       build_environment: "pytorch-ios-10.2.1-arm64_build"
       ios_arch: "arm64"
+      ios_platform: "OS"
       requires:
         - setup

aten/src/ATen/CMakeLists.txt

Lines changed: 3 additions & 2 deletions
@@ -449,9 +449,10 @@ endif()
 
 # https://stackoverflow.com/questions/11096471/how-can-i-install-a-hierarchy-of-files-using-cmake
 FOREACH(HEADER ${INSTALL_HEADERS})
-  string(REPLACE "${CMAKE_CURRENT_SOURCE_DIR}/" "" HEADER_SUB ${HEADER})
+  string(REPLACE "${CMAKE_CURRENT_SOURCE_DIR}/" "ATen/" HEADER_SUB ${HEADER})
+  string(REPLACE "${Caffe2_SOURCE_DIR}/" "" HEADER_SUB ${HEADER_SUB})
   GET_FILENAME_COMPONENT(DIR ${HEADER_SUB} DIRECTORY)
-  INSTALL(FILES ${HEADER} DESTINATION ${AT_INSTALL_INCLUDE_DIR}/ATen/${DIR})
+  INSTALL(FILES ${HEADER} DESTINATION "${AT_INSTALL_INCLUDE_DIR}/${DIR}")
 ENDFOREACH()
 
 # TODO: Install hip_generated_h when we have it

aten/src/ATen/Dispatch.h

Lines changed: 16 additions & 2 deletions
@@ -146,10 +146,24 @@ inline void deprecated_AT_DISPATCH_ALL_TYPES_AND_HALF_AND_COMPLEX() {}
     switch (_st) {                                                           \
       AT_PRIVATE_CASE_TYPE(at::ScalarType::Double, double, __VA_ARGS__)      \
       AT_PRIVATE_CASE_TYPE(at::ScalarType::Float, float, __VA_ARGS__)        \
-      AT_PRIVATE_CASE_TYPE(at::ScalarType::Half, at::Half, __VA_ARGS__)      \
       AT_PRIVATE_CASE_TYPE(at::ScalarType::ComplexDouble, std::complex<double>, __VA_ARGS__) \
       AT_PRIVATE_CASE_TYPE(at::ScalarType::ComplexFloat, std::complex<float>, __VA_ARGS__) \
-      AT_PRIVATE_CASE_TYPE(at::ScalarType::ComplexHalf, std::complex<at::Half>, __VA_ARGS__) \
+      default:                                                               \
+        AT_ERROR(#NAME, " not implemented for '", toString(_st), "'");       \
+    }                                                                        \
+  }()
+
+#define AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND1(SCALARTYPE, TYPE, NAME, ...) \
+  [&] {                                                                      \
+    const auto& the_type = TYPE;                                             \
+    /* don't use TYPE again in case it is an expensive or side-effect op */  \
+    at::ScalarType _st = ::detail::scalar_type(the_type);                    \
+    switch (_st) {                                                           \
+      AT_PRIVATE_CASE_TYPE(at::ScalarType::Double, double, __VA_ARGS__)      \
+      AT_PRIVATE_CASE_TYPE(at::ScalarType::Float, float, __VA_ARGS__)        \
+      AT_PRIVATE_CASE_TYPE(at::ScalarType::ComplexDouble, std::complex<double>, __VA_ARGS__) \
+      AT_PRIVATE_CASE_TYPE(at::ScalarType::ComplexFloat, std::complex<float>, __VA_ARGS__) \
+      AT_PRIVATE_CASE_TYPE(SCALARTYPE, decltype(c10::impl::ScalarTypeToCPPType<SCALARTYPE>::t), __VA_ARGS__) \
       default:                                                               \
         AT_ERROR(#NAME, " not implemented for '", toString(_st), "'");       \
     }                                                                        \
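The new `AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND1` macro follows the usual AT_DISPATCH pattern: a runtime dtype switch that instantiates the kernel body once per supported static type, with a loud error otherwise. A toy Python model of that behavior (plain Python, nothing here is PyTorch API):

```python
# Toy model of the AT_DISPATCH_* pattern: a runtime dtype tag selects a
# statically-specialized kernel; unsupported dtypes fail loudly, mirroring
# the macro's default: AT_ERROR(...) branch.
KERNELS = {
    "double": lambda: "double kernel",
    "float": lambda: "float kernel",
    "complex<double>": lambda: "complex<double> kernel",
    "complex<float>": lambda: "complex<float> kernel",
    "Half": lambda: "Half kernel",  # the extra "AND1" type slot
}

def dispatch(name, dtype):
    if dtype not in KERNELS:
        raise RuntimeError(f"{name} not implemented for '{dtype}'")
    return KERNELS[dtype]()

print(dispatch("my_op", "complex<float>"))  # -> complex<float> kernel
```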

aten/src/ATen/NumericUtils.h

Lines changed: 7 additions & 0 deletions
@@ -7,6 +7,7 @@
 #include <cmath>
 #include <type_traits>
 #include <c10/util/BFloat16.h>
+#include <c10/util/Complex.h>
 #include <c10/macros/Macros.h>
 
 namespace at {
@@ -31,6 +32,12 @@ inline C10_HOST_DEVICE bool _isnan(T val) {
 #endif
 }
 
+template <typename T,
+          typename std::enable_if<std::is_complex_t<T>::value, int>::type = 0>
+inline bool _isnan(T val) {
+  return std::isnan(std::real(val)) || std::isnan(std::imag(val));
+}
+
 inline C10_HOST_DEVICE bool _isnan(at::BFloat16 val) {
   return at::_isnan(float(val));
 }
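The new overload declares a complex value NaN when either of its components is NaN. Python's standard library implements the same predicate, which makes for a quick sanity check (illustration only, not PyTorch API):

```python
import cmath

# cmath.isnan is true if the real or the imaginary part is NaN, which is
# exactly the predicate the new _isnan overload computes.
print(cmath.isnan(complex(float("nan"), 1.0)))  # True
print(cmath.isnan(complex(1.0, 2.0)))           # False
```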

aten/src/ATen/core/ATenDispatch.h

Lines changed: 2 additions & 1 deletion
@@ -40,7 +40,8 @@ namespace impl {
 // question is whether or not we have access to all the relevant TLS at this
 // point.
 static inline TensorTypeId dispatchTypeId(TensorTypeSet ts) {
-  return (ts - c10::impl::tls_excluded_tensor_type_set()).highestPriorityTypeId();
+  c10::impl::LocalTensorTypeSet local = c10::impl::tls_local_tensor_type_set();
+  return ((ts | local.included_) - local.excluded_).highestPriorityTypeId();
 }
 
 }
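The change makes dispatch honor thread-locally included type ids in addition to excluded ones: local inclusions are unioned into the set, exclusions subtracted, and the highest-priority survivor wins. A toy Python model of that set arithmetic (id names and priority order illustrative):

```python
# Highest-priority dispatch id after applying thread-local overrides.
PRIORITY = ["VariableTensorId", "CUDATensorId", "CPUTensorId"]  # illustrative

def dispatch_type_id(ts, local_included, local_excluded):
    candidates = (set(ts) | local_included) - local_excluded
    return min(candidates, key=PRIORITY.index)  # earlier = higher priority

# With VariableTensorId locally excluded (cf. AutoNonVariableTypeMode below),
# dispatch falls through to the backend kernel:
print(dispatch_type_id({"VariableTensorId", "CPUTensorId"},
                       set(), {"VariableTensorId"}))  # -> CPUTensorId
```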

aten/src/ATen/core/CMakeLists.txt

Lines changed: 14 additions & 14 deletions
@@ -8,22 +8,22 @@ EXCLUDE(ATen_CORE_SRCS "${ATen_CORE_SRCS}" ${ATen_CORE_TEST_SRCS})
 
 # Add files needed from jit folders
 LIST(APPEND ATen_CORE_HEADERS
-  ${CMAKE_CURRENT_SOURCE_DIR}/../../../../torch/csrc/jit/source_range.h
-  ${CMAKE_CURRENT_SOURCE_DIR}/../../../../torch/csrc/jit/script/function_schema_parser.h
-  ${CMAKE_CURRENT_SOURCE_DIR}/../../../../torch/csrc/jit/script/lexer.h
-  ${CMAKE_CURRENT_SOURCE_DIR}/../../../../torch/csrc/jit/script/strtod.h
-  ${CMAKE_CURRENT_SOURCE_DIR}/../../../../torch/csrc/jit/script/parse_string_literal.h
-  ${CMAKE_CURRENT_SOURCE_DIR}/../../../../torch/csrc/jit/script/schema_type_parser.h
-  ${CMAKE_CURRENT_SOURCE_DIR}/../../../../torch/csrc/jit/script/error_report.h
-  ${CMAKE_CURRENT_SOURCE_DIR}/../../../../torch/csrc/jit/script/tree.h
+  ${Caffe2_SOURCE_DIR}/torch/csrc/jit/source_range.h
+  ${Caffe2_SOURCE_DIR}/torch/csrc/jit/script/function_schema_parser.h
+  ${Caffe2_SOURCE_DIR}/torch/csrc/jit/script/lexer.h
+  ${Caffe2_SOURCE_DIR}/torch/csrc/jit/script/strtod.h
+  ${Caffe2_SOURCE_DIR}/torch/csrc/jit/script/parse_string_literal.h
+  ${Caffe2_SOURCE_DIR}/torch/csrc/jit/script/schema_type_parser.h
+  ${Caffe2_SOURCE_DIR}/torch/csrc/jit/script/error_report.h
+  ${Caffe2_SOURCE_DIR}/torch/csrc/jit/script/tree.h
 )
 LIST(APPEND ATen_CORE_SRCS
-  ${CMAKE_CURRENT_SOURCE_DIR}/../../../../torch/csrc/jit/script/error_report.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/../../../../torch/csrc/jit/script/function_schema_parser.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/../../../../torch/csrc/jit/script/lexer.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/../../../../torch/csrc/jit/script/strtod.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/../../../../torch/csrc/jit/script/schema_type_parser.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/../../../../torch/csrc/jit/source_range.cpp
+  ${Caffe2_SOURCE_DIR}/torch/csrc/jit/script/error_report.cpp
+  ${Caffe2_SOURCE_DIR}/torch/csrc/jit/script/function_schema_parser.cpp
+  ${Caffe2_SOURCE_DIR}/torch/csrc/jit/script/lexer.cpp
+  ${Caffe2_SOURCE_DIR}/torch/csrc/jit/script/strtod.cpp
+  ${Caffe2_SOURCE_DIR}/torch/csrc/jit/script/schema_type_parser.cpp
+  ${Caffe2_SOURCE_DIR}/torch/csrc/jit/source_range.cpp
 )
 
 # Pass to parent

aten/src/ATen/core/LegacyTypeDispatch.h

Lines changed: 10 additions & 12 deletions
@@ -12,6 +12,7 @@
 #include <c10/core/ScalarType.h>
 #include <c10/util/Exception.h>
 #include <ATen/core/LegacyDeviceTypeInit.h>
+#include <c10/core/impl/LocalTensorTypeSet.h>
 #include <c10/core/TensorImpl.h>
 #include <ATen/core/ATenDispatch.h>
 #include <ATen/core/TensorBody.h>
@@ -47,22 +48,19 @@ class CAFFE2_API LegacyTypeDispatch {
 
 CAFFE2_API LegacyTypeDispatch& globalLegacyTypeDispatch();
 
-// A RAII, thread local (!) guard that has the following effect:
-//
-// Upon construction: sets NonVariableTypeMode_enabled for the current thread to
-// control whether we are in non-Variable-type mode.
-//
-// Upon destruction: sets NonVariableTypeMode_enabled back to the original value.
+// A RAII, thread local (!) guard that will disable dispatch to variable
+// handler.
 //
 // See NOTE [ Treating Variables as non-Variables in type dispatch ] for details.
 struct CAFFE2_API AutoNonVariableTypeMode {
-  AutoNonVariableTypeMode(bool enabled) : prev_mode(NonVariableTypeMode::is_enabled()) {
-    NonVariableTypeMode::set_enabled(enabled);
-  }
-  ~AutoNonVariableTypeMode() {
-    NonVariableTypeMode::set_enabled(prev_mode);
+  // NB: The enabled parameter must ALWAYS be black, as Henry Ford used to say.
+  // TODO: Eliminate this parameter entirely
+  AutoNonVariableTypeMode(bool enabled = true) :
+    guard_(TensorTypeId::VariableTensorId) {
+
+    TORCH_INTERNAL_ASSERT(enabled);
   }
-  bool prev_mode;
+  c10::impl::ExcludeTensorTypeIdGuard guard_;
 };
 
 } // namespace at

aten/src/ATen/core/NamedTensor.cpp

Lines changed: 19 additions & 16 deletions
@@ -42,12 +42,6 @@ DimnameList default_names(size_t len) {
   return DimnameList(&all_unnamed.front(), len);
 }
 
-void check_names_valid_for(const Tensor& tensor, DimnameList names) {
-  return impl::check_names_valid_for(tensor.unsafeGetTensorImpl(), names);
-}
-
-namespace impl {
-
 static void check_unique_names(DimnameList names) {
   // Strategy: Compare each element with the ones that come after it.
   // Although this is O(N^2), in practice N is small (no more than 25).
@@ -62,6 +56,24 @@ static void check_unique_names(DimnameList names) {
   }
 }
 
+void check_names_valid_for(const Tensor& tensor, DimnameList names) {
+  return impl::check_names_valid_for(tensor.unsafeGetTensorImpl(), names);
+}
+
+void check_names_valid_for(int64_t tensor_dim, DimnameList names) {
+  TORCH_CHECK(
+      tensor_dim <= kMaxNamedTensorDim,
+      "Named tensors only support up to ", kMaxNamedTensorDim, " dims: "
+      "Attempted to create a tensor with dim ", tensor_dim, " with names ", names);
+  TORCH_CHECK(tensor_dim == names.size(),
+      "Number of names (", names.size(), ") and "
+      "number of dimensions in tensor (", tensor_dim, ") ",
+      "do not match. Attempted to create a tensor with names ", names);
+  check_unique_names(names);
+}
+
+namespace impl {
+
 static NamedTensorMeta* get_named_tensor_meta(TensorImpl* impl) {
   if (!NamesMode::is_enabled()) {
     return nullptr;
@@ -77,16 +89,7 @@ static const NamedTensorMeta* get_named_tensor_meta(const TensorImpl* impl) {
 }
 
 void check_names_valid_for(TensorImpl* impl, DimnameList names) {
-  auto ndim = impl->dim();
-  TORCH_CHECK(
-      ndim <= kMaxNamedTensorDim,
-      "Named tensors only support up to ", kMaxNamedTensorDim, " dims: "
-      "Attempted to create a tensor with dim ", ndim, " with names ", names);
-  TORCH_CHECK(ndim == names.size(),
-      "Number of names (", names.size(), ") and "
-      "number of dimensions in tensor (", ndim, ") ",
-      "do not match. Attempted to create a tensor with names ", names);
-  check_unique_names(names);
+  check_names_valid_for(impl->dim(), names);
 }
 
 void internal_set_names_inplace(TensorImpl* impl, optional<DimnameList> names) {
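The validation consolidated into `check_names_valid_for(int64_t, DimnameList)` is what users hit from Python when constructing named tensors (a prototype feature around this release). A minimal sketch, assuming the named tensor prototype is available:

```python
import torch

# One name per dimension passes validation:
t = torch.zeros(2, 3, names=("N", "C"))

# A name-count mismatch trips the TORCH_CHECK above, e.g.:
# "Number of names (1) and number of dimensions in tensor (2) do not match."
try:
    torch.zeros(2, 3, names=("N",))
except RuntimeError as e:
    print(e)
```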
