[RFC] Add a SlimTensor representation to help AOTInductor generate standalone binaries #153242
Labels
oncall: pt2
triaged
This issue has been looked at by a team member, triaged, and prioritized into an appropriate module
🚀 The feature, motivation and pitch
AOTInductor offers a way to compile a PyTorch model, originally written in Python, into a packaged artifact known as a .pt2 file. This file can be loaded in a non-Python environment for model inference. The compiled model still relies on libtorch to perform essential tensor operations, such as creating new tensors, accessing tensor attributes, using Eager operations as a fallback, and so on.
Due to the substantial size of libtorch, loading the entire library for inference on a single model can be undesirable. This is particularly true in scenarios like Serverless Inference or environments with limited resources. By natively compiling PyTorch models into minimal, self-contained binaries, model deployment becomes faster and more efficient.
The core component enabling AOTInductor to generate self-contained binaries is the implementation of a lightweight tensor representation, called SlimTensor. This novel representation is designed to be independent of the existing AtenTensor representation while providing an identical set of C shim APIs. As a result, code generated by AOTInductor can seamlessly utilize SlimTensor without any modifications, thereby facilitating the creation of standalone executables.
One design principle of SlimTensor is to share common infra code with AtenTensor as much as possible. For instance, enum types such as DeviceType and ScalarType, as well as common utility functions, should be shared. Some major refactoring of the core code will be needed to facilitate that.
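As a rough illustration of that sharing (the header paths below are where these enums live in today's codebase; whether SlimTensor consumes them directly or through a refactored shared layer is an open design choice):

```cpp
// Hypothetical sketch: SlimTensor reuses the enums libtorch already defines
// in c10 instead of introducing parallel copies.
#include <c10/core/DeviceType.h>  // c10::DeviceType
#include <c10/core/ScalarType.h>  // c10::ScalarType

namespace slim {
// Aliases keep the generated code's spelling stable regardless of which
// tensor representation it is compiled against.
using DeviceType = c10::DeviceType;
using ScalarType = c10::ScalarType;
}  // namespace slim
```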
The following is, tentatively, what SlimTensor is going to look like:
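(A minimal sketch; only `storage_`, `sizes_`, `strides_`, and `storage_offset_` are named in this RFC, so the remaining members and the concrete types below are illustrative assumptions.)

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Defined inline only to keep this sketch self-contained; in practice these
// would be the shared enums mentioned above.
enum class ScalarType : int8_t { Float, Half, Int /* ... */ };
enum class DeviceType : int8_t { CPU, CUDA /* ... */ };

// Refcounted buffer handle whose deleter decides owning vs. non-owning
// semantics (see the discussion below).
struct Storage {
  std::shared_ptr<void> buffer_;
  void* data() const { return buffer_.get(); }
};

class SlimTensor {
 public:
  // Offset application elided; kept as a member so data accessors can
  // behave like AtenTensor's.
  void* data_ptr() const { return storage_.data(); }
  int64_t dim() const { return static_cast<int64_t>(sizes_.size()); }

 private:
  Storage storage_;             // owning for intermediates, non-owning for inputs
  std::vector<int64_t> sizes_;  // may instead borrow a caller-provided array
  std::vector<int64_t> strides_;
  int64_t storage_offset_ = 0;
  ScalarType dtype_;
  DeviceType device_;
};
```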
The storage_ component holds a reference counter to the underlying buffer, but its semantic behavior can vary between owning and non-owning. For model input tensors, storage_ does not manage the lifetime of their underlying buffers. Conversely, for intermediate tensors, storage_ takes ownership of the underlying buffers.
Similarly, sizes_ and strides_ may or may not own the underlying array. storage_offset_ is kept to make data accessor behavior similar to AtenTensor.
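A minimal sketch of how that dual semantics could be realized, assuming the shared_ptr-based Storage from the sketch above (the actual mechanism is an open design detail):

```cpp
#include <cstdlib>
#include <memory>

// Non-owning: wraps a caller-provided input buffer; the no-op deleter means
// freeing the buffer remains the caller's responsibility.
Storage make_borrowed_storage(void* input_buffer) {
  return Storage{std::shared_ptr<void>(input_buffer, [](void*) {})};
}

// Owning: allocates a buffer for an intermediate tensor; the deleter frees
// it when the last SlimTensor referencing this storage goes away.
Storage make_owned_storage(size_t nbytes) {
  return Storage{std::shared_ptr<void>(std::malloc(nbytes), std::free)};
}
```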
Another big chunk of infra that can be shared between SlimTensor and AtenTensor is the native Eager op implementations. When Inductor/AOTInductor compiles a model, it falls back to the Eager implementation for unsupported ops. When AOTInductor generates standalone binaries, it will need to do the same for unsupported ops. As an example, the implementation of `_weight_int4pack_mm_cuda` can be reused by replacing `at::Tensor` with a generic `typename T` and making sure the Tensor APIs used in those Eager op implementations are supported by both AtenTensor and SlimTensor.
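A sketch of what that templating could look like (the signature below is simplified and hypothetical; the real `_weight_int4pack_mm_cuda` kernel takes more parameters and involves CUDA-specific dispatch, and the `T::empty` factory is an assumed common API):

```cpp
#include <cstdint>

// Hypothetical sketch: the same Eager fallback body, templated on the tensor
// type T instead of hard-coding at::Tensor. T can be instantiated as either
// at::Tensor or SlimTensor, provided both expose the small API used here
// (size(), scalar_type(), device(), data_ptr(), and an empty() factory).
template <typename T>
T weight_int4pack_mm(const T& a, const T& b, int64_t group_size, const T& scales) {
  int64_t m = a.size(0);
  int64_t n = b.size(0);
  // Allocate the output through a factory both representations provide.
  T out = T::empty({m, n}, a.scalar_type(), a.device());
  // ... launch the shared kernel on out.data_ptr(), a.data_ptr(), etc. ...
  return out;
}
```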
Alternatives
No response
Additional context
No response
cc @chauhang @penguinwu