[RFC] Add a SlimTensor representation to help AOTInductor generate standalone binaries #153242
Labels
oncall: pt2
triaged
This issue has been looked at by a team member, triaged, and prioritized into an appropriate module
🚀 The feature, motivation and pitch
AOTInductor offers a way to compile a PyTorch model, originally written in Python, into a packaged artifact known as a .pt2 file. This file can be loaded in a non-Python environment for model inference. The compiled model still relies on libtorch to perform essential tensor operations, such as creating new tensors, accessing tensor attributes, using Eager operations as a fallback, and so on.
Due to the substantial size of libtorch, loading the entire library for inference on a single model can be undesirable. This is particularly true in scenarios like Serverless Inference or environments with limited resources. By natively compiling PyTorch models into minimal, self-contained binaries, model deployment becomes faster and more efficient.
The core component enabling AOTInductor to generate self-contained binaries is the implementation of a lightweight tensor representation, called SlimTensor. This novel representation is designed to be independent of the existing AtenTensor representation while providing an identical set of C shim APIs. As a result, code generated by AOTInductor can seamlessly utilize SlimTensor without any modifications, thereby facilitating the creation of standalone executables.
One design principle of SlimTensor is to share common infra code with AtenTensor as much as possible. For instance, enum types such as DeviceType and ScalarType, as well as common utility functions, should be shared. Some major refactoring of the core code will be needed to facilitate that.
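As a rough illustration of that sharing (the header paths below are where these enums live in today's codebase; whether SlimTensor consumes them directly or through a refactored shared layer is an open design choice):

```cpp
// Hypothetical sketch: SlimTensor reuses the enums libtorch already defines
// in c10 instead of introducing parallel copies.
#include <c10/core/DeviceType.h>  // c10::DeviceType
#include <c10/core/ScalarType.h>  // c10::ScalarType

namespace slim {
// Aliases keep the generated code's spelling stable regardless of which
// tensor representation it is compiled against.
using DeviceType = c10::DeviceType;
using ScalarType = c10::ScalarType;
}  // namespace slim
```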
The following is, tentatively, what SlimTensor is going to look like:
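(A minimal sketch; only `storage_`, `sizes_`, `strides_`, and `storage_offset_` are named in this RFC, so the remaining members and the concrete types below are illustrative assumptions.)

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Defined inline only to keep this sketch self-contained; in practice these
// would be the shared enums mentioned above.
enum class ScalarType : int8_t { Float, Half, Int /* ... */ };
enum class DeviceType : int8_t { CPU, CUDA /* ... */ };

// Refcounted buffer handle whose deleter decides owning vs. non-owning
// semantics (see the discussion below).
struct Storage {
  std::shared_ptr<void> buffer_;
  void* data() const { return buffer_.get(); }
};

class SlimTensor {
 public:
  // Offset application elided; kept as a member so data accessors can
  // behave like AtenTensor's.
  void* data_ptr() const { return storage_.data(); }
  int64_t dim() const { return static_cast<int64_t>(sizes_.size()); }

 private:
  Storage storage_;             // owning for intermediates, non-owning for inputs
  std::vector<int64_t> sizes_;  // may instead borrow a caller-provided array
  std::vector<int64_t> strides_;
  int64_t storage_offset_ = 0;
  ScalarType dtype_;
  DeviceType device_;
};
```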
The storage_ component holds a reference counter to the underlying buffer, but its semantic behavior can vary between owning and non-owning. For model input tensors, storage_ does not manage the lifetime of their underlying buffers. Conversely, for intermediate tensors, storage_ takes ownership of the underlying buffers.
Similarly, sizes_ and strides_ may or may not own the underlying array. storage_offset_ is kept to make data accessor behavior similar to AtenTensor.
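A minimal sketch of how that dual semantics could be realized, assuming the shared_ptr-based Storage from the sketch above (the actual mechanism is an open design detail):

```cpp
#include <cstdlib>
#include <memory>

// Non-owning: wraps a caller-provided input buffer; the no-op deleter means
// freeing the buffer remains the caller's responsibility.
Storage make_borrowed_storage(void* input_buffer) {
  return Storage{std::shared_ptr<void>(input_buffer, [](void*) {})};
}

// Owning: allocates a buffer for an intermediate tensor; the deleter frees
// it when the last SlimTensor referencing this storage goes away.
Storage make_owned_storage(size_t nbytes) {
  return Storage{std::shared_ptr<void>(std::malloc(nbytes), std::free)};
}
```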
Another big chunk of infra that can be shared between SlimTensor and AtenTensor is the native Eager op implementations. When Inductor/AOTInductor compiles a model, it falls back to the Eager implementation for unsupported ops. When AOTInductor generates standalone binaries, it will need to do the same for unsupported ops. As an example, the implementation of `_weight_int4pack_mm_cuda` can be reused by replacing `at::Tensor` with a generic `typename T` and making sure the Tensor APIs used in those Eager op implementations are supported by both AtenTensor and SlimTensor.
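A sketch of what that templating could look like (the signature below is simplified and hypothetical; the real `_weight_int4pack_mm_cuda` kernel takes more parameters and involves CUDA-specific dispatch, and the `T::empty` factory is an assumed common API):

```cpp
#include <cstdint>

// Hypothetical sketch: the same Eager fallback body, templated on the tensor
// type T instead of hard-coding at::Tensor. T can be instantiated as either
// at::Tensor or SlimTensor, provided both expose the small API used here
// (size(), scalar_type(), device(), data_ptr(), and an empty() factory).
template <typename T>
T weight_int4pack_mm(const T& a, const T& b, int64_t group_size, const T& scales) {
  int64_t m = a.size(0);
  int64_t n = b.size(0);
  // Allocate the output through a factory both representations provide.
  T out = T::empty({m, n}, a.scalar_type(), a.device());
  // ... launch the shared kernel on out.data_ptr(), a.data_ptr(), etc. ...
  return out;
}
```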
Alternatives
No response
Additional context
No response
cc @chauhang @penguinwu