Update on "sync and async torch.distributed.rpc for builtin operators" · pytorch/pytorch@97da154 · GitHub

Commit 97da154

Update on "sync and async torch.distributed.rpc for builtin operators"
Features:
* sync and async RPC for builtin operators
* RpcAgent API
* ProcessGroupAgent implementation

Goal:
* This is the first PR for #23110, and many follow-ups will come. So let's focus on the overall API and code structure. Details like efficiency and error handling can be improved in future PRs.
* Have a minimal working and testable RPC implementation.
* Make sure the RpcAgent API is sufficient for future ThriftAgent and TensorPipeAgent implementations.
  * A TensorPipe implementation might allocate multiple underlying communication channels of different types, and might also use streaming serialization/deserialization for large tensors. To support this requirement, the current implementation only converts a BuiltinOp into a Message, which contains a byte vector and a tensor table. It is up to the RpcAgent implementation to determine how to serialize a Message object.
  * For ThriftAgent, as Thrift has its own request/response matching solution, Message.id is no longer necessary and can be dropped during serialization. All it needs to do is pass the response Message object to the Future returned by send(...).
* Support blocking and non-blocking RequestCallback.
  * Blocking means the callback won't return before sending out the response.
  * Non-blocking can be achieved by enqueuing the `(from, request, RpcAgent&)` tuple and using a different thread to process it. That is why there is an `RpcAgent&` arg in the param list.

Differential Revision: [D15194693](https://our.internmc.facebook.com/intern/diff/D15194693/)
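The blocking and non-blocking callback styles described in the commit message could be sketched roughly as below. This is a Python illustration only, not the C++ code from the PR: `Message`, `ToyAgent`, `process`, `blocking_callback`, and `make_nonblocking_callback` are hypothetical names, and the only detail taken from the commit message is that the callback receives a `(from, request, agent)` triple so a worker thread can send the response later.

```python
import queue
import threading
from dataclasses import dataclass, field


@dataclass
class Message:
    """Hypothetical stand-in: a byte payload, a tensor table, and an id."""
    payload: bytes
    tensors: list = field(default_factory=list)
    id: int = 0


class ToyAgent:
    """Hypothetical agent that only records what it was asked to send."""

    def __init__(self):
        self.sent = []
        self._lock = threading.Lock()

    def send(self, to, message):
        with self._lock:
            self.sent.append((to, message))


def process(request):
    # Placeholder for running the requested builtin operator.
    return Message(payload=request.payload.upper(), id=request.id)


def blocking_callback(sender, request, agent):
    # Blocking: the response is sent before the callback returns.
    agent.send(sender, process(request))


def make_nonblocking_callback():
    # Non-blocking: enqueue (from, request, agent) and let a worker
    # thread send the response; this is why the agent is an argument.
    work = queue.Queue()

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: stop the worker
                work.task_done()
                break
            sender, request, agent = item
            agent.send(sender, process(request))
            work.task_done()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    def callback(sender, request, agent):
        work.put((sender, request, agent))  # returns immediately

    return callback, work, thread
```

In this sketch both callbacks share one signature, so an agent could accept either without knowing which style it was given.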
2 parents c2c6e6e + 645b981 commit 97da154

371 files changed, +11215 -5474 lines changed


.circleci/README.md

Lines changed: 3 additions & 3 deletions
@@ -272,7 +272,7 @@ Manywheels are pip packages for linux distros. Note that these manywheels are no

 The entrypoint file `builder/manywheel/build_common.sh` is really really complicated because

-* This used to handle building for several different python versions at the same time. The loops have been removed, but there's still unneccessary folders and movements here and there.
+* This used to handle building for several different python versions at the same time. The loops have been removed, but there's still unnecessary folders and movements here and there.
 * The script is never used this way anymore. This extra machinery could be removed.
 * This used to handle testing the pip packages too. This is why there’s testing code at the end that messes with python installations and stuff
 * The script is never used this way anymore. This extra machinery could be removed.
@@ -304,7 +304,7 @@ Note that the MacOS Python wheels are still built in conda environments. Some of

 Libtorch packages are built in the wheel build scripts: manywheel/build_*.sh for linux and build_wheel.sh for mac. There are several things wrong with this

-* It’s confusinig. Most of those scripts deal with python specifics.
+* It’s confusing. Most of those scripts deal with python specifics.
 * The extra conditionals everywhere severely complicate the wheel build scripts
 * The process for building libtorch is different from the official instructions (a plain call to cmake, or a call to a script)

@@ -470,7 +470,7 @@ N.B. installing a brand new miniconda is important. This has to do with how cond
 1. if you installed `foo` in `new_env`, then `path/to/conda_root/envs/new_env/bin/foo` will get called, as expected.
 2. But if you forgot to installed `foo` in `new_env` but happened to previously install it in your root conda env (called ‘base’), then unix/linux will still find `path/to/conda_root/bin/foo` . This is dangerous, since `foo` can be a different version than you want; `foo` can even be for an incompatible python version!

-Newer conda versions and proper python hygeine can prevent this, but just install a new miniconda to be safe.
+Newer conda versions and proper python hygiene can prevent this, but just install a new miniconda to be safe.

 ### Windows

.circleci/cimodel/data/pytorch_build_definitions.py

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@

 DOCKER_IMAGE_PATH_BASE = "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/"

-DOCKER_IMAGE_VERSION = 300
+DOCKER_IMAGE_VERSION = 323


 @dataclass

.circleci/config.yml

Lines changed: 54 additions & 46 deletions
Large diffs are not rendered by default.

.circleci/scripts/binary_linux_test.sh

Lines changed: 3 additions & 0 deletions
@@ -25,6 +25,9 @@ fi
 pkg="/final_pkgs/\$(ls /final_pkgs)"
 if [[ "$PACKAGE_TYPE" == conda ]]; then
 conda install -y "\$pkg" --offline
+if [[ "$DESIRED_CUDA" == 'cpu' ]]; then
+conda install -y cpu-only -c pytorch
+fi
 retry conda install -yq future numpy protobuf six
 if [[ "$DESIRED_CUDA" != 'cpu' ]]; then
 # DESIRED_CUDA is in format cu90 or cu100

.circleci/scripts/binary_populate_env.sh

Lines changed: 6 additions & 2 deletions
@@ -42,7 +42,7 @@ fi
 # option, so the upload was redirected to nightly/devtoolset7 to avoid
 # conflicts with other binaries (there shouldn't be any conflicts). Now we are
 # making devtoolset7 the default.
-if [[ "$DESIRED_DEVTOOLSET" == 'devtoolset7' ]]; then
+if [[ "$DESIRED_DEVTOOLSET" == 'devtoolset7' || "$(uname)" == 'Darwin' ]]; then
 export PIP_UPLOAD_FOLDER='nightly/'
 else
 # On linux machines, this shouldn't actually be called anymore. This is just
@@ -52,7 +52,11 @@ fi

 # We put this here so that OVERRIDE_PACKAGE_VERSION below can read from it
 export DATE="$(date -u +%Y%m%d)"
-export PYTORCH_BUILD_VERSION="1.2.0.dev$DATE+$DESIRED_CUDA"
+if [[ "$(uname)" == 'Darwin' ]] || [[ "$DESIRED_CUDA" == "cu100" ]]; then
+export PYTORCH_BUILD_VERSION="1.2.0.dev$DATE"
+else
+export PYTORCH_BUILD_VERSION="1.2.0.dev$DATE+$DESIRED_CUDA"
+fi
 export PYTORCH_BUILD_NUMBER=1

 cat >>"$envfile" <<EOL
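The new version-string rule in this hunk can be restated as a small function. The sketch below is a Python paraphrase of the shell condition, not code from the commit; `build_version` and its parameter names are hypothetical, and `system` stands in for the output of `uname`.

```python
def build_version(date: str, desired_cuda: str, system: str) -> str:
    """Mirror the shell branch: no local version suffix on macOS or cu100."""
    base = f"1.2.0.dev{date}"
    if system == "Darwin" or desired_cuda == "cu100":
        # These builds get the bare version, e.g. 1.2.0.dev20190801
        return base
    # Everything else keeps the "+<cuda>" suffix, e.g. +cpu or +cu92
    return f"{base}+{desired_cuda}"
```

For example, `build_version("20190801", "cu92", "Linux")` keeps the `+cu92` suffix, while the same call with `"cu100"` or with `system="Darwin"` drops it. The date used here is arbitrary.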

.circleci/scripts/python_doc_push_script.sh

Lines changed: 14 additions & 7 deletions
@@ -9,9 +9,6 @@ pt_checkout="/var/lib/jenkins/workspace"

 echo "python_doc_push_script.sh: Invoked with $*"

-git clone https://github.com/pytorch/pytorch.github.io -b site
-pushd pytorch.github.io
-
 set -ex

 # Argument 1: Where to copy the built documentation to
@@ -34,14 +31,24 @@ if [ "$version" == "master" ]; then
 is_master_doc=true
 fi

-# Argument 3: (optional) If present, we will NOT do any pushing. Used for testing.
+# Argument 3: The branch to push to. Usually is "site"
+branch="$3"
+if [ -z "$branch" ]; then
+echo "error: python_doc_push_script.sh: branch (arg3) not specified"
+exit 1
+fi
+
+# Argument 4: (optional) If present, we will NOT do any pushing. Used for testing.
 dry_run=false
-if [ "$3" != "" ]; then
+if [ "$4" != "" ]; then
 dry_run=true
 fi

 echo "install_path: $install_path version: $version dry_run: $dry_run"

+git clone https://github.com/pytorch/pytorch.github.io -b $branch
+pushd pytorch.github.io
+
 export LC_ALL=C
 export PATH=/opt/conda/bin:$PATH

@@ -92,10 +99,10 @@ git commit -m "auto-generating sphinx docs" || true
 git status

 if [ "$dry_run" = false ]; then
-echo "Pushing to pytorch.github.io:site"
+echo "Pushing to pytorch.github.io:$branch"
 set +x
 /usr/bin/expect <<DONE
-spawn git push origin site
+spawn git push origin $branch
 expect "Username*"
 send "pytorchbot\n"
 expect "Password*"
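The positional-argument contract after this change (branch is now a required third argument, and the dry-run flag moves to the fourth) can be modeled as below. This is a hedged Python sketch of the behavior visible in the diff, not part of the script; `DocPushArgs` and `parse_doc_push_args` are hypothetical names, and checks on the first two arguments that the script may perform elsewhere are not modeled.

```python
from typing import NamedTuple, Sequence


class DocPushArgs(NamedTuple):
    install_path: str
    version: str
    branch: str
    dry_run: bool


def parse_doc_push_args(argv: Sequence[str]) -> DocPushArgs:
    # Pad so missing positions read as empty strings, like "$3" / "$4" in shell.
    padded = list(argv) + [""] * max(0, 4 - len(argv))
    install_path, version, branch, fourth = padded[:4]
    if not branch:
        # The script prints an error and exits 1 in this case.
        raise ValueError("branch (arg3) not specified")
    # Any non-empty fourth argument turns on dry-run mode.
    return DocPushArgs(install_path, version, branch, dry_run=fourth != "")
```

One consequence this makes visible: an old-style call such as `docs/master master dry_run` would now be read as a push to a branch named `dry_run`, which is presumably why every caller in `job-specs-custom.yml` gains an explicit `site` argument in the same commit.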

.circleci/scripts/setup_linux_system_environment.sh

Lines changed: 11 additions & 6 deletions
@@ -37,9 +37,14 @@ sudo apt-get purge -y unattended-upgrades

 cat /etc/apt/sources.list

-# For the bestest luck, kill -9 now
-sudo pkill -9 apt-get || true
-
-# Bail out early if we detect apt/dpkg is stuck
-ps auxfww | (! grep '[a]pt')
-ps auxfww | (! grep '[d]pkg')
+# For the bestest luck, kill again now
+sudo pkill apt || true
+sudo pkill dpkg || true
+
+# Try to detect if apt/dpkg is stuck
+if ps auxfww | grep '[a]pt'; then
+echo "WARNING: There are leftover apt processes; subsequent apt update will likely fail"
+fi
+if ps auxfww | grep '[d]pkg'; then
+echo "WARNING: There are leftover dpkg processes; subsequent apt update will likely fail"
+fi
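Both the old and new versions of this check rely on the `'[a]pt'` bracket pattern so that the `grep` process does not match its own entry in the `ps` output. The Python sketch below illustrates that idea with made-up process lines; `leftover_processes` is a hypothetical helper, and Python's `re` is used here only because it treats this simple bracket expression the same way `grep` does.

```python
import re


def leftover_processes(ps_lines, pattern):
    """Return the ps lines that match the given regular expression."""
    regex = re.compile(pattern)
    return [line for line in ps_lines if regex.search(line)]


# With a plain pattern, the grep command line itself contains "apt".
plain = ["root 101 apt-get update", "jenkins 202 grep apt"]

# With the bracket pattern, the command line reads "grep [a]pt", which
# contains "a]pt" rather than "apt", so the regex [a]pt skips it.
bracketed = ["root 101 apt-get update", "jenkins 202 grep [a]pt"]

self_matching = leftover_processes(plain, "apt")
apt_only = leftover_processes(bracketed, "[a]pt")
```

Here `self_matching` contains both lines while `apt_only` contains only the `apt-get` line. The behavioral change in the diff is separate from this trick: the old script was meant to bail out when a match was found, whereas the new one only prints a warning.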

.circleci/verbatim-sources/header-section.yml

Lines changed: 5 additions & 0 deletions
@@ -7,6 +7,9 @@
 # and then update DOCKER_IMAGE_VERSION at the top of the following files:
 # * cimodel/data/pytorch_build_definitions.py
 # * cimodel/data/caffe2_build_definitions.py
+# And the inline copies of the variable in
+# * verbatim-sources/job-specs-custom.yml
+# (grep for DOCKER_IMAGE)

 docker_config_defaults: &docker_config_defaults
 user: jenkins
@@ -46,6 +49,8 @@ macos_brew_update: &macos_brew_update
 no_output_timeout: "1h"
 command: |
 set -ex
+# See https://discourse.brew.sh/t/fetching-homebrew-repos-is-slow/5374/3
+brew untap caskroom/homebrew-cask
 # moreutils installs a `parallel` executable by default, which conflicts
 # with the executable from the GNU `parallel`, so we must unlink GNU
 # `parallel` first, and relink it afterwards

.circleci/verbatim-sources/job-specs-custom.yml

Lines changed: 10 additions & 9 deletions
@@ -1,7 +1,7 @@
 pytorch_short_perf_test_gpu:
 environment:
 BUILD_ENVIRONMENT: pytorch-short-perf-test-gpu
-DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda9-cudnn7-py3:300"
+DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda9-cudnn7-py3:323"
 PYTHON_VERSION: "3.6"
 USE_CUDA_DOCKER_RUNTIME: "1"
 resource_class: gpu.medium
@@ -41,7 +41,8 @@
 pytorch_python_doc_push:
 environment:
 BUILD_ENVIRONMENT: pytorch-python-doc-push
-DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda9-cudnn7-py3:300"
+# TODO: stop hardcoding this
+DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda9-cudnn7-py3:323"
 resource_class: large
 machine:
 image: ubuntu-1604:201903-01
@@ -67,18 +68,18 @@

 # master branch docs push
 if [[ "${CIRCLE_BRANCH}" == "master" ]]; then
-export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GITHUB_PYTORCHBOT_TOKEN=${GITHUB_PYTORCHBOT_TOKEN}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && . ./.circleci/scripts/python_doc_push_script.sh docs/master master") | docker exec -u jenkins -i "$id" bash) 2>&1'
+export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GITHUB_PYTORCHBOT_TOKEN=${GITHUB_PYTORCHBOT_TOKEN}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && . ./.circleci/scripts/python_doc_push_script.sh docs/master master site") | docker exec -u jenkins -i "$id" bash) 2>&1'

 # stable release docs push. Due to some circleci limitations, we keep
-# an eternal PR open (#16502) for merging v1.0.1 -> master for this job.
-# XXX: The following code is only run on the v1.0.1 branch, which might
+# an eternal PR open for merging v1.2.0 -> master for this job.
+# XXX: The following code is only run on the v1.2.0 branch, which might
 # not be exactly the same as what you see here.
-elif [[ "${CIRCLE_BRANCH}" == "v1.0.1" ]]; then
-export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GITHUB_PYTORCHBOT_TOKEN=${GITHUB_PYTORCHBOT_TOKEN}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && . ./.circleci/scripts/python_doc_push_script.sh docs/stable 1.0.1") | docker exec -u jenkins -i "$id" bash) 2>&1'
+elif [[ "${CIRCLE_BRANCH}" == "v1.2.0" ]]; then
+export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GITHUB_PYTORCHBOT_TOKEN=${GITHUB_PYTORCHBOT_TOKEN}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && . ./.circleci/scripts/python_doc_push_script.sh docs/stable 1.2.0 site dry_run") | docker exec -u jenkins -i "$id" bash) 2>&1'

 # For open PRs: Do a dry_run of the docs build, don't push build
 else
-export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GITHUB_PYTORCHBOT_TOKEN=${GITHUB_PYTORCHBOT_TOKEN}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && . ./.circleci/scripts/python_doc_push_script.sh docs/master master dry_run") | docker exec -u jenkins -i "$id" bash) 2>&1'
+export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GITHUB_PYTORCHBOT_TOKEN=${GITHUB_PYTORCHBOT_TOKEN}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && . ./.circleci/scripts/python_doc_push_script.sh docs/master master site dry_run") | docker exec -u jenkins -i "$id" bash) 2>&1'
 fi

 echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
@@ -91,7 +92,7 @@
 pytorch_cpp_doc_push:
 environment:
 BUILD_ENVIRONMENT: pytorch-cpp-doc-push
-DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda9-cudnn7-py3:300"
+DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda9-cudnn7-py3:323"
 resource_class: large
 machine:
 image: ubuntu-1604:201903-01

.circleci/verbatim-sources/nightly-binary-build-defaults.yml

Lines changed: 2 additions & 0 deletions
@@ -57,6 +57,8 @@ binary_macos_brew_update: &binary_macos_brew_update
 no_output_timeout: "1h"
 command: |
 set -eux -o pipefail
+# See https://discourse.brew.sh/t/fetching-homebrew-repos-is-slow/5374/3
+brew untap caskroom/homebrew-cask
 # moreutils installs a `parallel` executable by default, which conflicts
 # with the executable from the GNU `parallel`, so we must unlink GNU
 # `parallel` first, and relink it afterwards
