
[RFC] Add a SlimTensor representation to help AOTInductor generate standalone binaries #153242


Open
desertfire opened this issue May 9, 2025 · 2 comments
Labels
oncall: pt2 · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

desertfire (Contributor) commented May 9, 2025

🚀 The feature, motivation and pitch

AOTInductor offers a way to compile a PyTorch model, originally written in Python, into a packaged artifact known as a .pt2 file. This file can be loaded in a non-Python environment for model inference. The compiled model still relies on libtorch to perform essential tensor operations, such as creating new tensors, accessing tensor attributes, using Eager operations as a fallback, and so on.

Due to the substantial size of libtorch, loading the entire library for inference on a single model can be undesirable. This is particularly true in scenarios like Serverless Inference or environments with limited resources. By natively compiling PyTorch models into minimal, self-contained binaries, model deployment becomes faster and more efficient.

The core component enabling AOTInductor to generate self-contained binaries is the implementation of a lightweight tensor representation, called SlimTensor. This novel representation is designed to be independent of the existing AtenTensor representation while providing an identical set of C shim APIs. As a result, code generated by AOTInductor can seamlessly utilize SlimTensor without any modifications, thereby facilitating the creation of standalone executables.

One design principle of SlimTensor is to share common infrastructure code with AtenTensor as much as possible. For instance, enum types such as DeviceType and ScalarType, along with common utility functions, should be shared; some major refactoring of the core code will be needed to facilitate that.
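For instance, assuming those enums remain in c10, a standalone build could reuse them directly rather than redefining them; this is a minimal sketch, and the torch::standalone namespace here is hypothetical:

// Hypothetical sketch: reuse the existing c10 enum types instead of
// duplicating them for the standalone build.
#include <c10/core/DeviceType.h>
#include <c10/core/ScalarType.h>

namespace torch::standalone {
using c10::DeviceType;  // CPU, CUDA, ...
using c10::ScalarType;  // Float, Half, BFloat16, ...
}  // namespace torch::standalone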

The following is tentatively what SlimTensor will look like:

class SlimTensor {
  ...
 private:
  Storage storage_;         // device_type_ and device_index_ are stored in storage_
  IntArrayRef sizes_;       // may own or borrow the underlying array (see below)
  IntArrayRef strides_;     // may own or borrow the underlying array (see below)
  ScalarType dtype_;
  int64_t storage_offset_;  // kept so data accessors behave like AtenTensor's
  size_t numel_;            // cached element count
};

The storage_ component holds a reference counter to the underlying buffer, but its semantic behavior can vary between owning and non-owning. For model input tensors, storage_ does not manage the lifetime of their underlying buffers. Conversely, for intermediate tensors, storage_ takes ownership of the underlying buffers.
Similarly, sizes_ and strides_ may or may not own their underlying arrays. storage_offset_ is kept so that data-accessor behavior matches AtenTensor's.
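As an illustration of that owning/non-owning split, the sketch below models a reference-counted storage whose deleter encodes ownership. This is a hypothetical sketch, not the actual implementation: the names Storage, borrow, and own are assumptions, and the device fields mentioned above are omitted for brevity.

// Hypothetical sketch: reference-counted storage whose deleter decides
// between owning and non-owning semantics.
#include <cstdlib>
#include <memory>

class Storage {
 public:
  // Non-owning: model inputs reference a caller-managed buffer,
  // so the deleter is a no-op.
  static Storage borrow(void* data) {
    return Storage(std::shared_ptr<void>(data, [](void*) {}));
  }
  // Owning: intermediate tensors free their buffer when the last
  // reference goes away.
  static Storage own(size_t nbytes) {
    return Storage(std::shared_ptr<void>(std::malloc(nbytes), std::free));
  }
  void* data() const { return buf_.get(); }

 private:
  explicit Storage(std::shared_ptr<void> buf) : buf_(std::move(buf)) {}
  std::shared_ptr<void> buf_;  // refcount lives here; deleter encodes ownership
};

Copying such a Storage only bumps the reference count; whether the buffer is freed when the count reaches zero is decided by the deleter installed at construction time.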

Another big chunk of infra that can be shared between SlimTensor and AtenTensor is the native Eager op implementations. When Inductor/AOTInductor compiles a model, it falls back to the Eager implementation for unsupported ops; when AOTInductor generates standalone binaries, it will need to do the same. As an example, the implementation of _weight_int4pack_mm_cuda can be reused by replacing at::Tensor with a generic typename T and making sure the Tensor APIs used in those Eager op implementations are supported by both AtenTensor and SlimTensor.

at::Tensor _weight_int4pack_mm_cuda(
    const at::Tensor& A,
    const at::Tensor& B,
    int64_t qGroupSize,
    const at::Tensor& qScaleAndZeros)

->

template<typename T>
T _weight_int4pack_mm_cuda(
    const T& A,
    const T& B,
    int64_t qGroupSize,
    const T& qScaleAndZeros)
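
Under that scheme, a single templated implementation would serve both representations. The call sites below are hypothetical; run_int4_mm and the qGroupSize value are illustrative, not part of the proposal:

// Hypothetical call sites: the same templated kernel serves both
// tensor representations.
template <typename T>
T run_int4_mm(const T& A, const T& B, const T& qScaleAndZeros) {
  return _weight_int4pack_mm_cuda<T>(A, B, /*qGroupSize=*/128, qScaleAndZeros);
}

// In libtorch-backed code:  run_int4_mm<at::Tensor>(A, B, qScaleAndZeros);
// In a standalone binary:   run_int4_mm<SlimTensor>(A, B, qScaleAndZeros);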

Alternatives

No response

Additional context

No response

cc @chauhang @penguinwu

EikanWang (Collaborator) commented May 12, 2025

@desertfire, taking _weight_int4pack_mm_cuda as an example, does this mean we just need to add a shim function on the AOTInductor side without adjusting the schema of the ATen function?

desertfire (Contributor, Author) replied:

Yes, nothing will change in the existing schemas. You can imagine we will add another shim function on the AOTInductor side, but what I mean is that its implementation will share some of the code with the existing one in ATen.
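
For illustration, the sketch below shows what such a shared shim entry point might look like. This is a sketch, not the actual generated shim: the function name and signature are assumptions modeled on the existing aoti_torch_* C shim conventions, the handle and error typedefs are simplified stand-ins, and SlimTensor plus the templated _weight_int4pack_mm_cuda are taken from the RFC text above.

// Hypothetical shim sketch. Simplified stand-ins for the opaque handle
// and error types used by the AOTInductor C shim:
#include <cstdint>
using AtenTensorHandle = void*;  // opaque tensor handle (simplified)
using AOTITorchError = int32_t;
constexpr AOTITorchError AOTI_TORCH_SUCCESS = 0;

extern "C" AOTITorchError aoti_torch_cuda__weight_int4pack_mm(
    AtenTensorHandle A,
    AtenTensorHandle B,
    int64_t qGroupSize,
    AtenTensorHandle qScaleAndZeros,
    AtenTensorHandle* ret) {
  // In a standalone binary the opaque handle wraps a SlimTensor rather
  // than an at::Tensor; the templated kernel is shared between both builds.
  auto* out = new SlimTensor(_weight_int4pack_mm_cuda<SlimTensor>(
      *static_cast<SlimTensor*>(A),
      *static_cast<SlimTensor*>(B),
      qGroupSize,
      *static_cast<SlimTensor*>(qScaleAndZeros)));
  *ret = static_cast<AtenTensorHandle>(out);
  return AOTI_TORCH_SUCCESS;
}

Since the generated code already calls these shim functions by name, swapping the backing tensor type becomes a build-time choice rather than a code-generation change.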

desertfire added the triaged label on May 13, 2025