🚀 The feature, motivation and pitch
Terminology
We define intra-device heterogeneous memory as memory with differing properties within the same device.
PyTorch currently supports inter-device heterogeneous memory by enabling applications to allocate Tensors on different devices, i.e. through the device parameter on Tensor constructors:
torch.zeros(..., device="cuda") # allocate Tensor in CUDA memory, using CUDA allocator
Motivation
Heterogeneous host-side memory is becoming more common. Some example implementations that existed previously, are available today, or are coming soon include:
- The Sapphire Rapids HBM series from Intel: 64GB on-chip HBM, with "flat mode" supporting putting HBM and DRAM on different NUMA nodes.
- SiPearl's Rhea and Rhea-2: On-chip HBM and off-chip DDR.
- CXL.mem, expected to become increasingly prevalent, adding another tier of heterogeneous memory to systems utilising it.
- A future trend towards more DDR + Non-volatile Memory (NVM) platforms.
Making the most of these platforms with multiple types of host memory will likely involve application programmers choosing which memory type they want certain allocations to be placed on. In PyTorch, this means allowing applications to specify which memory backs a Tensor, as the Tensor is the main unit of storage within PyTorch. An application may wish to place Tensors on HBM to maximise bandwidth when performing ML operations, or on DRAM to make use of its capacity in cases of oversubscription.
Currently, PyTorch has no way for applications to place Tensors on specific memory types within a device, so application programmers can't maximise performance and memory usage on these heterogeneous systems.
Feature: Intra-device heterogeneous memory support
The proposal is to implement support for another layer of memory heterogeneity: intra-device heterogeneity, on top of inter-device heterogeneity.
For example, where a CPU has multiple types of memory, a PyTorch application can choose which of them backs its Tensors when using the CPU device.
The same could also be applicable to other devices (e.g. CUDA, XPU, etc.) - if they have multiple types of memory, some application programmers in the future may want finer-grained control of the Tensors' backing.
Implementation
Extra parameter for tensor constructors, allocators, and DataPtrs
This heterogeneity will be implemented by the device allocator, by allowing the device allocation function to take an extra argument indicating which memory type/backing it should allocate from.
For the purpose of this proposal, we will call this extra argument device_memory_index - an integer argument with a default value of -1 (which represents the default memory backing for a device), where the integer is a device-specific value for selecting the memory backing. For example, on CPU devices, this could be the NUMA node index.
/* c10/core/Allocator.h */
/* Example change for demonstration purposes only */
struct C10_API Allocator {
  virtual ~Allocator() = default;
  virtual DataPtr allocate(size_t n) = 0;
  virtual DataPtr allocate(size_t n, int64_t device_memory_index) {
    // Default is to ignore device_memory_index; devices can provide
    // implementations using this parameter where necessary.
    return allocate(n);
  }
  // ...
};
DataPtr will also need to store the device_memory_index of allocations.
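A minimal sketch of what that could look like, assuming a simplified view of DataPtr's internals; the device_memory_index_ field and accessor are hypothetical additions:
/* c10/core/Allocator.h */
/* Example change for demonstration purposes only; internals simplified */
class C10_API DataPtr {
  c10::detail::UniqueVoidPtr ptr_;
  Device device_;
  // Hypothetical: which memory backing the allocation came from;
  // -1 means the device's default memory.
  int64_t device_memory_index_ = -1;

 public:
  DataPtr(void* data, void* ctx, DeleterFnPtr ctx_deleter, Device device,
          int64_t device_memory_index = -1)
      : ptr_(data, ctx, ctx_deleter),
        device_(device),
        device_memory_index_(device_memory_index) {}

  int64_t device_memory_index() const { return device_memory_index_; }
  // ... existing members unchanged ...
};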
This extra parameter, in turn, is passed through PyTorch's API from an additional optional argument to Tensor constructors:
/* Example change for demonstration purposes only */
torch.zeros(*size, *, out=None, dtype=None, layout=torch.strided, device=None, device_memory_index=-1, requires_grad=False)
API to query supported heterogeneous memory devices
Additionally, we propose adding an API to allow users to check the heterogeneous memory currently available for use on each device. This would have different implementations per device, but would return an enumeration of the types of memory and their indexes (then used as the device_memory_index to construct Tensors).
# Equivalent - can accept a device or string, like Tensor constructors
>>> x = torch.get_device_memory_types("cpu")
>>> x = torch.get_device_memory_types(torch.device("cpu"))
>>> x
# CPU device allocation would define the indexes as NUMA nodes. However, it's up to the device what the meaning of these values is. User should use get_device_memory_types() to check.
{ CPUMemType.HBM : [0, 3] , CPUMemType.DDR : [1, 2] }
# Example implementation for CUDA, with only one memory type/index
>>> torch.get_device_memory_types("cuda")
{ CUDAMemType.NORMAL: [0] }
Each device implements its own enum of memory types, e.g. CPUMemType, CUDAMemType, etc. A device can return nothing ({}), in which case the user should assume that the device does not support heterogeneous allocation, and either not use device_memory_index, or provide the default value.
For CPU, we suggest that this should be implemented with hwloc, and the device memory indexes should be NUMA nodes. The enum types should match hwloc's available node subtypes, that is: DRAM, HBM, SPM, NVM, MCDRAM, GPUMemory.
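As a rough illustration of what the CPU implementation could build on, the following sketch enumerates NUMA nodes and their hwloc subtypes (assuming hwloc 2.x; how the subtype strings would map onto a CPUMemType enum is left hypothetical):
/* Example only: enumerate NUMA nodes and their hwloc memory subtypes */
#include <hwloc.h>
#include <cstdio>

int main() {
  hwloc_topology_t topo;
  hwloc_topology_init(&topo);
  hwloc_topology_load(topo);

  hwloc_obj_t node = nullptr;
  while ((node = hwloc_get_next_obj_by_type(topo, HWLOC_OBJ_NUMANODE, node)) != nullptr) {
    // subtype is typically NULL for plain DRAM, otherwise one of
    // "DRAM", "HBM", "SPM", "NVM", "MCDRAM", "GPUMemory".
    const char* subtype = node->subtype ? node->subtype : "DRAM";
    std::printf("NUMA node %u: %s\n", node->os_index, subtype);
  }

  hwloc_topology_destroy(topo);
  return 0;
}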
Challenges
Implementing this change involves numerous challenges:
- These Tensor constructors are numerous and don't change often. Updating them involves changing a lot of code, and affects a lot of components within PyTorch, including Inductor/Dynamo.
- Whilst adding default arguments should be okay ABI-wise, adding default arguments to fallback ops currently causes issues. @desertfire has a draft PR adding support for auto-generating the v2/vN shims in this case.
- Increasing the size of TensorOptions could have performance implications. There is an assert and comment in TensorOptions.h:
// We should aspire to fit in one machine-size word; but a size greater than two
// words is too much. (We are doing terribly on 32-bit archs, where we require
// three machine size words to store tensor options. Eek!)
static_assert(
sizeof(TensorOptions) <= sizeof(int64_t) * 2,
"TensorOptions must fit in 128-bits");
The current size of TensorOptions is 16 (device) + 16 (type) + 8 (layout) + 8 (memory format) = 48 bits. So, keeping TensorOptions within 64 bits would be possible with 16-bit device memory indexes, but there are tradeoffs with regard to two things: we likely need one "magic" default value, and filling the 64 bits would mean any future additions would need to spill into a second word on a 64-bit machine. The full 16 bits would likely be excessive (support for ~32k different memories per device), so an int8 type could probably be used to allow 128 or 255 different memories, allowing a future addition of a 1-byte tensor option without exceeding a 64-bit word.
- The meaning of the device memory index would vary from device to device. So, devices would need to individually document how the index should be interpreted for their allocator (or extensions, where they override/provide allocators).
- Testing can be a little difficult. The main relevant testing here should be functional, which is achievable: allocators which "fake" different memory types can be used in testing (see the sketch after this list). For testing something like CPU NUMA allocation, qemu environments can be created with custom NUMA layouts, and multi-socket machines also generally have multiple NUMA nodes. Any performance testing would need to be conducted on real heterogeneous hardware, e.g. Sapphire Rapids HBM platforms, but that kind of performance work falls within the scope of the PyTorch user rather than PyTorch itself.
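A test-only allocator of the kind mentioned above might look roughly like the following (names are hypothetical; it records the requested index while allocating plain host memory, so tests can assert on placement without real heterogeneous hardware):
/* Example only: a "fake" heterogeneous allocator for functional tests */
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <mutex>
#include <unordered_map>

class FakeHeterogeneousAllocator {
  std::unordered_map<void*, int64_t> index_of_;
  std::mutex mu_;

 public:
  void* allocate(size_t nbytes, int64_t device_memory_index) {
    void* p = std::malloc(nbytes);
    std::lock_guard<std::mutex> lock(mu_);
    index_of_[p] = device_memory_index;  // remember where this allocation "went"
    return p;
  }

  // Tests can assert that a Tensor's storage was "placed" on the requested index.
  int64_t index_of(void* p) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = index_of_.find(p);
    return it == index_of_.end() ? -1 : it->second;
  }

  void deallocate(void* p) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      index_of_.erase(p);
    }
    std::free(p);
  }
};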
Usage (PyTorch Users)
The proposed usage of this API would look something like the following:
'''
User can request x is backed by memory index 2 on CPU.
CPU allocator in this example defines the index as the index of the NUMA node to allocate from.
Thus, x should be allocated on CPU's NUMA node 2.
'''
>>> x = torch.zeros(1000, device="cpu", device_memory_index=2)
>>> x.device_memory_index
2
'''
User can move the Tensor between memory types on the same device, or between devices.
Moving between memory types on the same device will be handled by device's allocator.
Moving between devices works as currently.
'''
>>> x = x.to(device_memory_index=0)
>>> x.device_memory_index
0
>>> x = x.to(device="cuda:1")
>>> x.device
device(type='cuda', index=1)
Usage (Device Allocator Implementers)
An example extension utilising the additional memory index parameter for CPU allocation, to select the NUMA node for allocations, could be as follows. We use a hypothetical NUMA allocation API, to avoid getting into the details of how the allocation itself would be implemented. UMF, or a similar library, could be leveraged for a real implementation.
/* extension.h */
struct NUMAAllocator final : at::Allocator {
public:
NUMAAllocator();
~NUMAAllocator();
at::DataPtr allocate(size_t nbytes) override;
at::DataPtr allocate(size_t nbytes, int64_t device_memory_index) override;
at::DeleterFnPtr raw_deleter() const override;
void copy_data(void *dest, const void *src, std::size_t count) const final {
default_copy_data(dest, src, count);
}
};
/* extension.cpp */
// Register our dummy allocator
static NUMAAllocator global_custom_alloc;
NUMAAllocator::NUMAAllocator() {
// allocator state setup
}
NUMAAllocator::~NUMAAllocator()
{
// allocator state teardown
}
static void Delete(void *ptr) {
numaFree(ptr);
}
at::DeleterFnPtr NUMAAllocator::raw_deleter() const {
return &Delete;
}
at::DataPtr NUMAAllocator::allocate(size_t nbytes, int64_t device_memory_index) {
// numaCalloc(int64_t numa_node, size_t num, size_t size)
void *data = numaCalloc(device_memory_index, 1, nbytes);
return {data, data, raw_deleter(), at::Device(at::DeviceType::CPU), device_memory_index};
}
at::DataPtr NUMAAllocator::allocate(size_t nbytes) {
std::cout << "Allocating " << nbytes << "B with no memory index"
<< std::endl;
void *data = numaCalloc(/* current numa node */, 1, nbytes);
return {data, data, raw_deleter(), at::Device(at::DeviceType::CPU)};
}
REGISTER_ALLOCATOR(c10::DeviceType::CPU, &global_custom_alloc);
PYBIND11_MODULE(numa, m) {}
Alternatives
Not Modifying Tensor Constructors
Using State Outside the Tensor Constructor
One could avoid adding a new default argument in Tensor constructors by using some kind of syntax similar to the CUDA allocator mempool construct.
I dismissed this option because a Tensor argument feels more appropriate: the device memory index serves a similar purpose to the device, which is already a Tensor constructor argument. We are not switching the allocator here, just providing it different arguments, and plumbing such constructs through to the allocator could actually end up being more complicated.
Adding an Additional Level of Indexing to torch.devices
Devices in Tensor constructors currently have one level of indexing, to allow selecting a specific device when a system has multiple available (e.g. selecting CUDA GPU 0 in a 4-GPU system: "cuda:0"/torch.device("cuda", 0)). An additional level of indexing could be added - a "sub-index" - which allows for selecting the memory within a device:
# Selects memory index 2 on CPU (CPU device must always be cpu:0)
torch.zeros(1000, device="cpu:0:2")
torch.zeros(1000, device="cpu::2")
torch.zeros(1000, device=torch.device("cpu", 0, 2))
torch.zeros(1000, device=torch.device("cpu", memory_index=2))
# Selects default "x" device, memory index 3
torch.zeros(1000, device="x::3")
A drawback here is that, so far in PyTorch, device has always carried an inherent implication of selecting the computational device that Tensor operations will be performed on. This proposal could change that contract in the future - where devices support multiple instances and multiple memory types - because a change in the device argument would not always change the computation location if only the sub-index changes.
Different Constructor API Design: Argument Format/Meaning
This RFC uses an approach of an integer argument which different devices can ascribe their own meaning to. In the case of the CPU device, it is suggested to use this to represent the NUMA index of the memory.
However, other approaches for the argument were considered. One drawback of the proposed design is the lack of an explicit API feature to distinguish between a "hint" and a "request" (where the difference is: hints can fall back to a different memory type/device silently, while requests would error if the requested type/device could not be allocated from). Alternative designs that could enable this were considered, but ultimately decided against. Currently in PyTorch, if one tries to allocate a tensor from, for example, a CUDA device, lack of available memory triggers an OOM error. This is the model PyTorch users are used to developing against, and silently falling back to an unexpected memory type (even on the same device) could be surprising for users. Instead, if the allocation fails, the application programmer can fall back manually, rather than PyTorch guessing their intended fallback priorities.
Detailed below are the alternatives, their pros and cons, and why I didn't choose to go with them for this RFC.
1. Bitmask of Requested Attributes
The device_memory_index argument could instead be replaced with some device_memory_attrs argument. This would be a bitmask, with different bits representing different attributes the user requests from the backing memory type.
Examples of attribute bits could be: highest bandwidth, lowest latency, or highest capacity. An additional bit could be used to represent whether the requested attributes should be treated as a hint, or a request (fail if memory fulfilling the attributes cannot be provided).
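For illustration only, such a bitmask could look something like the following (the attribute names and bit assignments are invented for this sketch):
/* Hypothetical illustration of a device_memory_attrs bitmask */
#include <cstdint>

enum DeviceMemoryAttr : uint64_t {
  kHighestBandwidth = 1ull << 0,
  kLowestLatency    = 1ull << 1,
  kHighestCapacity  = 1ull << 2,
  kHintOnly         = 1ull << 63,  // fall back silently instead of erroring
};

// e.g. request the highest-bandwidth memory, allowing silent fallback:
constexpr uint64_t attrs = kHighestBandwidth | kHintOnly;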
Pros:
- Allows the user to simply specify the attributes they want, and the allocator will do the work of selecting the memory backing
- Bitmask bit meanings can be the same across devices, or different across devices, depending on how we want to implement it/what the PyTorch community prefers
- Allows specification of hints
- In the case of hints, fallback memory type may be more clear (e.g. for highest bandwidth selector, fallback could reasonably be 2nd highest bandwidth memory type)
Cons:
- Does not allow a user to select exactly what memory type they want. If there are many memory types, it's possible the matrix of hint selections doesn't have a unique mapping for each one (e.g. a system with HBM, DDR, CXL, PMEM, etc.)
- Less transparent to the user exactly which memory type they will get
- Bits may run out - size needs to be chosen carefully for future proofing - would be hard to change later
- Makes the allocator implementations more complex
Overall, I felt the lack of specificity and ability to select each memory type uniquely made this option unviable.
2. Enum for Memory Types
The device_memory_index argument could be replaced with some device_memory_type enum-type argument which looks something like the following (C syntax for simplicity):
enum MEMORY_TYPE {
DDR,
HBM,
CXL,
PMEM,
DISK,
// ...
};
Pros:
- Very transparent to the user which memory type they're asking for/getting
- Can uniquely map every memory type in the system, so the user can request any of them
- Simple API
Cons:
- The enum will have to change every time someone wants to add a new memory type (but this wouldn't break ABI)
- No way to differentiate between different instances of the same memory type (e.g. HBM on socket 0 vs on socket 1)
- Needs an extension to represent hints/demands, e.g. shrink the enum to 7 bits and use the remaining bit to encode this, then treat the combined 7+1 bits as the type (see the sketch below)
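A rough sketch of that 7+1-bit encoding, purely for illustration (names invented here):
/* Hypothetical: pack a 7-bit memory type plus a 1-bit hint flag into one byte */
#include <cstdint>

enum class MemoryType : uint8_t { DDR, HBM, CXL, PMEM, DISK /* ... */ };

constexpr uint8_t kHintBit = 0x80;

constexpr uint8_t encode(MemoryType t, bool hint) {
  return static_cast<uint8_t>(static_cast<uint8_t>(t) | (hint ? kHintBit : 0));
}

constexpr MemoryType decode_type(uint8_t v) {
  return static_cast<MemoryType>(v & 0x7f);
}

constexpr bool decode_hint(uint8_t v) {
  return (v & kHintBit) != 0;
}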
Additional context
This feature was suggested on the PyTorch forums previously.