Enable Link Time Optimization in PyTorch 2.0 Release Binaries - Smaller, Faster, Better Binaries #93955
Description

@Skylion007

🚀 The feature, motivation and pitch

The PyTorch binaries are huge (see #34058). So huge, in fact, that we've had to refactor our codebase just so they fit in pip and conda, we plan on setting up binary size alerts (#93991), and we want to split up our conda packages for faster installs (#93081).

We should do everything we can to keep them small without sacrificing performance. One now commonly supported compiler feature we can use to make the binaries both smaller and potentially faster is Link Time Optimization (LTO). Upgrading to C++17 means that PyTorch can only be built by newer compilers that have full LTO support anyway.

I tested it on the CPU-only PyTorch libraries and found a nearly 10-20MB reduction in the size of the binaries. This is likely to be even more pronounced on CUDA builds.

Why now?

  • PyTorch 2.0 is overhauling a large part of the build structure of the library anyway
  • Upgrading to C++17 and dropping a lot of legacy code should make this easier, and it ensures that only newer compilers with fewer LTO bugs are supported.
  • Now that the minimum supported CUDA version is 11.2, we can try to enable the -dlto option for CUDA, which promises faster CUDA compilation and smaller CUDA binaries through supported device linking (see the sketch after this list).
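
For context, a minimal CMake sketch of what turning on device LTO could look like for a single CUDA target. This is only an illustration: `my_cuda_lib` is a hypothetical target name, not something in the PyTorch build, and the exact plumbing on the device-link step is hedged in the comments.

```cmake
# Minimal sketch, assuming a hypothetical CUDA target `my_cuda_lib`;
# device LTO requires separable compilation (relocatable device code).
set_target_properties(my_cuda_lib PROPERTIES CUDA_SEPARABLE_COMPILATION ON)

# Pass -dlto to the nvcc compile step (CUDA 11.2+). Note that the device
# link step also needs to see -dlto; recent CMake can drive all of this
# through INTERPROCEDURAL_OPTIMIZATION instead (see the Kitware MR linked
# in the Steps section below).
target_compile_options(my_cuda_lib PRIVATE
  $<$<COMPILE_LANGUAGE:CUDA>:-dlto>)
```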

Benefits:

  • Smaller binaries
  • Potentially better performance
  • Better warnings / diagnostics thanks to more info available at link time.
  • Potentially faster build times for CUDA kernels.

Steps:

  • Ideally, this should be as simple as turning on the CMAKE_INTERPROCEDURAL_OPTIMIZATION config option for release builds (see the CMake sketch after this list). I'll need to contact releng about how best to do this. However, gcc only supports fat/classic LTO, which means the linking stage can take a lot longer, so we may need to adjust timeouts on the workers that build the binaries. https://cmake.org/cmake/help/latest/variable/CMAKE_INTERPROCEDURAL_OPTIMIZATION.html
  • Clang supports both ThinLTO and full LTO (defaulting to ThinLTO). ThinLTO adds minimal build overhead, but it is not supported by gcc, and it is also more likely to crash / segfault during the build process due to the added complexity. If we get this working, we could even enable it by default in Release builds. We can also force clang to use the slower, gcc-style full LTO instead.
  • [Optional]: Enabling the -dlto flag on nvcc would probably be the trickiest part, as that option does not have a CMake flag to enable it yet, and it comes with some limitations (e.g. fast-math is not supported when doing device LTO). (Apparently CMake does support it, but only in the newest versions: https://gitlab.kitware.com/cmake/cmake/-/merge_requests/7389). However, this also promises the biggest potential size savings. If even 10% can be saved from the full CUDA build, that can result in nearly 100MB smaller binaries; see https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/ for more info. It could be dramatically more if it can deduplicate assembly code across microarchitectures at link time. Edit: apparently the LTO flag only matters in separable compilation mode (CUDA_SEPARABLE_COMPILATION), which allows faster CUDA compilation.
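
To make the first step above concrete, here is a minimal sketch of gating IPO/LTO on toolchain support using CMake's stock CheckIPOSupported module. The project name `lto_example` is just for the sketch; the commented-out clang flags use clang's own spellings (-flto=thin / -flto=full), which sit outside CMake's IPO abstraction.

```cmake
# Minimal sketch: enable LTO/IPO for Release builds only, and only when
# the toolchain actually supports it.
cmake_minimum_required(VERSION 3.18)
project(lto_example CXX)

include(CheckIPOSupported)
check_ipo_supported(RESULT ipo_supported OUTPUT ipo_error)

if(ipo_supported)
  # Per-config variant: applies IPO only to Release builds, keeping
  # debug builds fast while release binaries get the size win.
  set(CMAKE_INTERPROCEDURAL_OPTIMIZATION_RELEASE ON)
else()
  message(STATUS "IPO/LTO not supported: ${ipo_error}")
endif()

# With clang, ThinLTO vs. full LTO can also be forced explicitly:
# add_compile_options(-flto=thin)   # or -flto=full for gcc-style LTO
# add_link_options(-flto=thin)
```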

POC:
I tested on a libtorch build with gcc-9 and no Intel MKL and was able to shrink the binaries from 199MB to 184MB, a 7-8% reduction in binary size. I did not do any benchmarking, but this could yield a performance increase as well, for just a small compile-time increase. I also did not try enabling LTO on any of the third-party libraries, so the savings would likely be greater if we pursued this fully. The only issue I encountered was some linking errors when force-enabling it on protobuf, but after looking at some related issues, it may be fixed as easily as passing -no-as-needed to gcc (see the sketch below) or just using a newer gcc.
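
For reference, the workaround mentioned above would look roughly like this in CMake; the exact gcc/ld spelling is -Wl,--no-as-needed, forwarded through the compiler driver, and whether it actually fixes the protobuf link errors is an open question.

```cmake
# Sketch of the possible protobuf workaround: stop the linker from
# dropping libraries it considers unneeded when LTO is force-enabled.
# The -Wl, prefix forwards the flag through gcc to ld.
add_link_options("-Wl,--no-as-needed")
```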

Alternatives

Let the binaries remain huge.

Additional context

Happy to help facilitate this, but I could definitely use some help from the releng/infra teams, especially for testing this on all possible configs. The MVP would be just getting this working on GCC, but Clang, MSVC, and the CUDA compilers all support LTO as well. Help wanted on this project for sure.

cc @ezyang @seemethere @malfet @ngimel @soumith @albanD


    Labels

    module: binaries (Anything related to official binaries that we release to users) · module: build (Build system issues) · module: performance (Issues related to performance, either of kernel code or framework glue) · oncall: releng (In support of CI and Release Engineering) · topic: performance (topic category) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
