Update on "sync and async torch.distributed.rpc for builtin operators"
Features:
* sync and async RPC for builtin operators
* RpcAgent API
* ProcessGroupAgent implementation
Goal:
This is the first PR for #23110, and there will be many followup ones. So let's focus on the overall API and code structure. Details like efficiency and error handling can be improved in future PRs.
* have a minimum working and testable RPC implementation.
* make sure the RpcAgent API is sufficient for future ThriftAgent and TensorPipeAgent implementations
  * For the TensorPipe implementation, it might allocate multiple underlying communication channels with different types, and might also use streaming serialization/deserialization for large tensors. To support this requirement, the current implementation only converts a BuiltinOp into a Message which contains a byte vector and a tensor table. It is up to the RpcAgent implementation to determine how it would like to serialize a Message object.
  * For ThriftAgent, as Thrift has its own request/response matching solution, the Message.id is no longer necessary. Hence the id can be dropped during serialization. All it needs to do is pass the response Message object to the Future returned by send(...).
* support blocking and non-blocking RequestCallback
  * blocking means the callback won't return before sending out the response
  * non-blocking can be achieved by enqueuing the `(from, request, RpcAgent&)` tuple and using a different thread to process it. That is why there is an `RpcAgent&` arg in the param list.
Differential Revision: [D15194693](https://our.internmc.facebook.com/intern/diff/D15194693/)
.circleci/README.md: 22 additions & 24 deletions
@@ -81,7 +81,7 @@ A **binary configuration** is a collection of

* MacOS
* Windows - these are built on Azure pipelines
* devtoolset version (gcc compiler version)
  * This only matters on Linux cause only Linux uses gcc. tldr is gcc made a backwards incompatible change from gcc 4.8 to gcc 5, because it had to change how it implemented std::vector and std::string

### Where are the binaries?
@@ -101,12 +101,12 @@ All binaries are built in CircleCI workflows. There are checked-in workflows (co

# CircleCI structure of the binaries

Some quick vocab:

* A **workflow** is a CircleCI concept; it is a DAG of '**jobs**'. ctrl-f 'workflows' on https://github.com/pytorch/pytorch/blob/master/.circleci/config.yml to see the workflows.
* **jobs** are a sequence of '**steps**'
* **steps** are usually just a bash script or a builtin CircleCI command. *All steps run in new environments, environment variables declared in one script DO NOT persist to following steps*
* CircleCI has a **workspace**, which is essentially a cache between steps of the *same job* in which you can store artifacts between steps.
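For example, from a local checkout you can find the same thing without the browser (a trivial sketch; it assumes the generated config is at .circleci/config.yml, as linked above):

```
# List every place 'workflows' shows up in the generated CircleCI config
grep -n "workflows" .circleci/config.yml
```
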
## How are the workflows structured?
@@ -116,9 +116,9 @@ The nightly binaries have 3 workflows. We have one job (actually 3 jobs: build,

3. For each binary configuration, e.g. linux_conda_3.7_cpu there is a
   1. smoke_linux_conda_3.7_cpu
      1. Downloads the package from the cloud, e.g. using the official pip or conda instructions
      2. Runs the smoke tests
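Concretely, such a smoke job boils down to something like the sketch below (the nightly conda channel and the exact check are assumptions, not copied from the job definition):

```
# Install the nightly package the official way, then sanity-check it
conda install -y -c pytorch-nightly pytorch
python -c "import torch; print(torch.__version__); print(torch.rand(2, 3).sum())"
```
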
## How are the jobs structured?

The jobs are in https://github.com/pytorch/pytorch/tree/master/.circleci/verbatim-sources . Jobs are made of multiple steps. There are some shared steps used by all the binaries/smokes. Steps of these jobs are all delegated to scripts in https://github.com/pytorch/pytorch/tree/master/.circleci/scripts .

* Linux jobs: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/linux-binary-build-defaults.yml
  * binary_linux_build.sh
@@ -177,7 +177,7 @@ CircleCI creates a final yaml file by inlining every <<* segment, so if we were

So, CircleCI has several executor types: macos, machine, and docker are the ones we use. The 'machine' executor gives you two cores on some linux vm. The 'docker' executor gives you considerably more cores (nproc was 32 instead of 2 back when I tried in February). Since the dockers are faster, we try to run everything that we can in dockers. Thus

* linux build jobs use the docker executor. Running them on the docker executor was at least 2x faster than running them on the machine executor
* linux test jobs use the machine executor and spin up their own docker. Why this nonsense? It's cause we run nvidia-docker for our GPU tests; any code that calls into the CUDA runtime needs to be run on nvidia-docker. To run a nvidia-docker you need to install some nvidia packages on the host machine and then call docker with the '--runtime nvidia' argument. CircleCI doesn't support this, so we have to do it ourselves.
  * This is not just a mere inconvenience. **This blocks all of our linux tests from using more than 2 cores.** But there is nothing we can do about it but wait for a fix on circleci's side. Right now, we only run some smoke tests (some simple imports) on the binaries, but this also affects non-binary test jobs.
* linux upload jobs use the machine executor. The upload jobs are so short that it doesn't really matter what they use
* linux smoke test jobs use the machine executor for the same reason as the linux test jobs
@@ -243,7 +243,7 @@ Every type of package has an entrypoint build script that handles the all the im

Both Linux and MacOS use the same code flow for the conda builds.

Conda packages are built with conda-build, see https://conda.io/projects/conda-build/en/latest/resources/commands/conda-build.html

Basically, you pass `conda build` a build folder (pytorch-nightly/ above) that contains a build script and a meta.yaml. The meta.yaml specifies what python environment to build the package in, and what dependencies the resulting package should have, and the build script gets called in the env to build the thing.
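For orientation, the conda-build invocation that paragraph describes looks roughly like this (the recipe path and Python version are illustrative, not taken from the actual build scripts):

```
# conda-build reads the recipe's meta.yaml, sets up the requested python
# env, and runs the recipe's build script to produce the package
conda build builder/conda/pytorch-nightly --python 3.7
```
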
tldr; on conda-build is
@@ -266,7 +266,7 @@ The entrypoint file `builder/conda/build_conda.sh` is complicated because

## Manywheels (linux pip and libtorch packages)

Manywheels are pip packages for linux distros. Note that these manywheels are not actually manylinux compliant.

`builder/manywheel/build_cpu.sh` and `builder/manywheel/build.sh` (for CUDA builds) just set different env vars and then call into `builder/manywheel/build_common.sh`
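The dispatch pattern that sentence describes is roughly the following (a hypothetical skeleton, not the real contents of those scripts; the variable name is a placeholder borrowed from the local-build example further down):

```
#!/bin/bash
# Hypothetical build_cpu.sh-style entrypoint: pin the variant, then delegate
export DESIRED_CUDA=cpu
source "$(dirname "$0")/build_common.sh"
```
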
@@ -313,13 +313,12 @@ Libtorch packages are built in the wheel build scripts: manywheel/build_*.sh for

All linux builds occur in docker images. The docker images are

* soumith/conda-cuda
-  * Has ALL CUDA versions installed. The script pytorch/builder/conda/switch_cuda_version.sh sets /usr/local/cuda to a symlink to e.g. /usr/local/cuda-8.0 to enable different CUDA builds
-  * Also used for cpu builds
-* soumith/manylinux-cuda80
+  * Has ALL CUDA versions installed. The script pytorch/builder/conda/switch_cuda_version.sh sets /usr/local/cuda to a symlink to e.g. /usr/local/cuda-10.0 to enable different CUDA builds
  * Also used for cpu builds
* soumith/manylinux-cuda90
* soumith/manylinux-cuda92
* soumith/manylinux-cuda100
+  * Also used for cpu builds

The Dockerfiles are available in pytorch/builder, but there is no circleci job or script to build these docker images, and they cannot be run locally (unless you have the correct local packages/paths). Only Soumith can build them right now.
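As an aside on how one image covers several CUDA versions: the flow described in the conda-cuda bullet amounts to repointing the CUDA symlink before building, roughly like this (where the script lives inside the container and its argument format are assumptions):

```
# Inside the soumith/conda-cuda image: select the toolkit to build against
source builder/conda/switch_cuda_version.sh 10.0
ls -l /usr/local/cuda   # now points at /usr/local/cuda-10.0
```
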
@@ -380,7 +379,7 @@ The advantage of this flow is that you can make new changes to the base commit a

### Linux

You can easily build Linux binaries locally using docker.
```
# Export whatever variables are important to you. All variables that you'd
# possibly need are in .circleci/scripts/binary_populate_env.sh
# You should probably always export at least these 3 variables
```

@@ -410,14 +409,14 @@ export DESIRED_CUDA=cpu

```
# Call the entrypoint
# `|& tee foo.log` just copies all stdout and stderr output to foo.log
# The builds generate lots of output so you probably need this when
# building locally.
/builder/conda/build_pytorch.sh |& tee build_output.log
```

**Building CUDA binaries on docker**

To build a CUDA binary you need to use `nvidia-docker run` instead of just `docker run` (or you can manually pass `--runtime=nvidia`). This adds some needed libraries and things to build CUDA stuff.

You can build CUDA binaries on CPU only machines, but you can only run CUDA binaries on CUDA machines. This means that you can build a CUDA binary on a docker on your laptop if you so choose (though it’s gonna take a loong time).
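A minimal sketch of what that looks like in practice (both forms come straight from the sentence above; the image name is reused from the docker-image list):

```
# Older setups: use the nvidia-docker wrapper
nvidia-docker run -it soumith/conda-cuda bash
# Equivalent: pass the NVIDIA runtime to plain docker
docker run --runtime=nvidia -it soumith/conda-cuda bash
```
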
@@ -431,7 +430,7 @@ But if you want to try, then I’d recommend

```
# Create a new terminal
# Clear your LD_LIBRARY_PATH and trim as much out of your PATH as you
# know how to do

# Install a new miniconda
```
@@ -462,11 +461,11 @@ export DESIRED_CUDA=cpu

```
path/to/builder/wheel/build_wheel.sh
```

N.B. installing a brand new miniconda is important. This has to do with how conda installations work. See the “General Python” section above, but tldr; is that
1. You make the ‘conda’ command accessible by prepending `path/to/conda_root/bin` to your PATH.
2. You make a new env and activate it, which then also gets prepended to your PATH. Now you have `path/to/conda_root/envs/new_env/bin:path/to/conda_root/bin:$PATH`
3. Now say you (or some code that you ran) call python executable `foo`
   1. if you installed `foo` in `new_env`, then `path/to/conda_root/envs/new_env/bin/foo` will get called, as expected.
   2. But if you forgot to install `foo` in `new_env` but happened to previously install it in your root conda env (called ‘base’), then unix/linux will still find `path/to/conda_root/bin/foo`. This is dangerous, since `foo` can be a different version than you want; `foo` can even be for an incompatible python version!
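In shell terms, the failure mode in item 3.2 looks like this (the paths and `foo` are the illustrative names from the list above):

```
# With the PATH from step 2 in effect:
#   path/to/conda_root/envs/new_env/bin:path/to/conda_root/bin:$PATH
which foo
# -> path/to/conda_root/envs/new_env/bin/foo  if foo was installed into new_env
# -> path/to/conda_root/bin/foo               if foo only exists in the root env
```
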
@@ -475,4 +474,3 @@ Newer conda versions and proper python hygiene can prevent this, but just instal