Add Python API, semantics and implementation details for DLPack by rgommers · Pull Request #106 · data-apis/array-api

Merged: 12 commits, Feb 21, 2021
64 changes: 64 additions & 0 deletions spec/API_specification/array_object.md
@@ -366,6 +366,70 @@ Evaluates `x1_i & x2_i` for each element `x1_i` of an array instance `x1` with the respective element `x2_i` of the array `x2`.
Element-wise results must equal the results returned by the equivalent element-wise function [`bitwise_and(x1, x2)`](elementwise_functions.md#bitwise_andx1-x2-).
```

(method-__dlpack__)=
### \_\_dlpack\_\_(/, *, stream=None)

Exports the array as a DLPack capsule, for consumption by {ref}`function-from_dlpack`.

#### Parameters

- **stream**: _Optional\[int\]_

- An optional pointer to a stream, as a Python integer, provided by the consumer that the producer will use to make the array safe to operate on. The pointer is a positive integer. `-1` is a special value that may be used by the consumer to signal "producer must not do any synchronization". Device-specific notes:

:::{admonition} CUDA
- `None`: producer must assume the legacy default stream (default),
- `1`: the legacy default stream,

Perhaps it would be useful to remove 1 and 2, because they are CUDA-specific for now (ROCm does not support 1 and 2).

Just saying "stream number represented as a Python integer (per platform convention)" would work.


I think we still need to say that 0 is explicitly disallowed for CUDA since it's ambiguous though.


I agree, it would be great to point out that fact. Thanks @kkraus14.

Member Author:

Maybe rather than remove things, add something? E.g. make the description:

Stream number, as a Python integer, for all device types that support streams.

Per-device notes:

  • CUDA: 1, 2, 3...
  • ROCm: 3....
  • anything else, like OpenCL?

and then refer to the DLPack docs (to be written/extended) for more details?

Specs that are too terse are correct, but not all that useful. Given how complex it was to make this converge, I feel like we need to give implementers as much guidance as possible.


Thanks @rgommers, per-device guidance sounds good. I think it is safe to start with CUDA and ROCm and then expand later.

Contributor:

> I think we still need to say that 0 is explicitly disallowed for CUDA since it's ambiguous though.

@kkraus14 This is what I kept saying above 🙂

> Per-device notes:
>
> • CUDA: 1, 2, 3...
> • ROCm: 3....
> • anything else, like OpenCL?

@rgommers For ROCm, I think it's ok to keep 0, since this is the only way to use the legacy default stream (if so desired).

Member Author:

(image attached)

- `2`: the per-thread default stream,
- `> 2`: stream number represented as a Python integer.

Note that `0` is disallowed, since it is ambiguous (it could mean either `None`, `1`, or `2`).
:::

:::{admonition} ROCm
- `None`: producer must assume the legacy default stream (default),
- `0`: the default stream,
- `> 2`: stream number represented as a Python integer.

Using `1` and `2` is not supported.
:::

```{tip}
It is recommended that implementers explicitly handle streams. If
they use the legacy default stream, specifying `1` (CUDA) or `0`
(ROCm) is preferred. `None` is a safe default for developers who do
not want to think about stream handling at all, potentially at the
cost of more synchronization than necessary.
```
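The stream rules above can be illustrated with a producer-side sketch. The class and its return value are hypothetical, not part of the standard; a real implementation would synchronize with the consumer's stream and return a `PyCapsule`:

```python
class ToyCudaArray:
    """Illustrative producer-side handling of the ``stream`` argument
    under the CUDA conventions described above (hypothetical class)."""

    def __dlpack__(self, /, *, stream=None):
        # None -> legacy default stream; -1 -> "do not synchronize";
        # 1 -> legacy default stream; 2 -> per-thread default stream;
        # > 2 -> explicit stream number.  0 is ambiguous, hence disallowed.
        if stream == 0:
            raise ValueError("stream=0 is ambiguous on CUDA; use None, 1 or 2")
        if stream is not None and (not isinstance(stream, int) or stream < -1):
            raise ValueError("stream must be None, -1, 1, 2, or > 2")
        # A real implementation would make the data safe to use on `stream`
        # (e.g. by recording and waiting on an event), then return a
        # PyCapsule wrapping a DLManagedTensor.  This returns a placeholder.
        return "capsule-placeholder"
```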

#### Returns

- **capsule**: _<PyCapsule>_

- A DLPack capsule for the array. See {ref}`data-interchange` for details.

(method-__dlpack_device__)=
### \_\_dlpack\_device\_\_()

Returns device type and device ID in DLPack format. Meant for use within {ref}`function-from_dlpack`.

#### Returns

- **device**: _Tuple\[enum.IntEnum, int\]_

- A tuple `(device_type, device_id)` in DLPack format. Valid device type enum members are:

```
CPU = 1
CUDA = 2
CPU_PINNED = 3
OPENCL = 4
VULKAN = 7
METAL = 8
VPI = 9
ROCM = 10
```
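The enum above maps directly onto `enum.IntEnum`. A minimal sketch of a CPU-resident array's `__dlpack_device__` (the array class is hypothetical):

```python
import enum

class DLDeviceType(enum.IntEnum):
    """The device type codes listed above (DLPack's device enum)."""
    CPU = 1
    CUDA = 2
    CPU_PINNED = 3
    OPENCL = 4
    VULKAN = 7
    METAL = 8
    VPI = 9
    ROCM = 10

class ToyCpuArray:
    """Hypothetical array object living on CPU device 0."""
    def __dlpack_device__(self):
        # Returns (device_type, device_id) in DLPack format.
        return (DLDeviceType.CPU, 0)
```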

(method-__eq__)=
### \_\_eq\_\_(x1, x2, /)

21 changes: 21 additions & 0 deletions spec/API_specification/creation_functions.md
@@ -116,6 +116,27 @@ Returns a two-dimensional array with ones on the `k`th diagonal and zeros elsewhere.

- an array where all elements are equal to zero, except for the `k`th diagonal, whose values are equal to one.

(function-from_dlpack)=
### from_dlpack(x, /)

Returns a new array containing the data from another (array) object with a `__dlpack__` method.

#### Parameters

- **x**: _object_

- input (array) object.

#### Returns

- **out**: _<array>_

- an array containing the data in `x`.

```{note}
The returned array may be either a copy or a view. See {ref}`data-interchange` for details.
```

(function-full)=
### full(shape, fill_value, /, *, dtype=None)

Binary file added spec/_static/images/DLPack_diagram.png
1 change: 1 addition & 0 deletions spec/conf.py
@@ -50,6 +50,7 @@

# MyST options
myst_heading_anchors = 3
myst_enable_extensions = ["colon_fence"]

# -- Options for HTML output -------------------------------------------------

76 changes: 66 additions & 10 deletions spec/design_topics/data_interchange.md
@@ -58,18 +58,74 @@ means the object it is attached to must return a `numpy.ndarray`
containing the data the object holds).
```

## Syntax for data interchange with DLPack

The array API will offer the following syntax for data interchange:

1. A `from_dlpack(x)` function, which accepts (array) objects with a
`__dlpack__` method and uses that method to construct a new array
containing the data from `x`.
2. `__dlpack__(self, /, *, stream=None)` and `__dlpack_device__` methods on the
array object, which will be called from within `from_dlpack`, to query
what device the array is on (may be needed to pass in the correct
stream, e.g. in the case of multiple GPUs) and to access the data.


## Semantics

DLPack describes the memory layout of strided, n-dimensional arrays.
When a user calls `y = from_dlpack(x)`, the library implementing `x` (the
"producer") will provide access to the data from `x` to the library
containing `from_dlpack` (the "consumer"). If possible, this must be
zero-copy (i.e. `y` will be a _view_ on `x`). If not possible, that library
may make a copy of the data. In both cases:
- The producer keeps owning the memory.
- `y` may or may not be a view; therefore the user must keep in mind the
  recommendation to avoid mutating `y` - see {ref}`copyview-mutability`.
- Both `x` and `y` may continue to be used just like arrays created in other ways.

If an array that is accessed via the interchange protocol lives on a
device that the requesting library does not support, it is recommended to
raise a `TypeError`.

Stream handling through the `stream` keyword applies to CUDA and ROCm (perhaps
to other devices that have a stream concept as well, however those haven't been
considered in detail). The consumer must pass the stream it will use to the
producer; the producer must synchronize or wait on the stream when necessary.
In the common case of the default stream being used, synchronization will be
unnecessary so asynchronous execution is enabled.


## Implementation

_Note that while this API standard largely tries to avoid discussing implementation details, some discussion and requirements are needed here because data interchange requires coordination between implementers on, e.g., memory management._

![Diagram of DLPack structs](/_static/images/DLPack_diagram.png)

_DLPack diagram. Dark blue: the structs DLPack defines; light blue: struct members; gray text: enum values of supported devices and data types._

The `__dlpack__` method will produce a `PyCapsule` containing a
`DLManagedTensor`, which will be consumed immediately within
`from_dlpack` - therefore it is consumed exactly once, and it will not be
visible to users of the Python API.

The consumer must set the PyCapsule name to `"used_dltensor"`, and call the
`deleter` of the `DLManagedTensor` when it no longer needs the data.
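The consume-exactly-once handshake can be sketched with plain objects standing in for real PyCapsules (the capsule names follow the convention above; `ToyCapsule` and `consume` are illustrative):

```python
class ToyCapsule:
    """Stand-in for a PyCapsule holding a DLManagedTensor."""
    def __init__(self):
        self.name = "dltensor"        # set by the producer
        self.deleter_called = False
    def deleter(self):
        self.deleter_called = True

def consume(capsule):
    # Renaming the capsule marks it as consumed, so it cannot be
    # consumed a second time.
    if capsule.name != "dltensor":
        raise ValueError("capsule already consumed")
    capsule.name = "used_dltensor"
    # Once the consumer no longer needs the data, it calls the deleter
    # so the producer can release the memory.
    capsule.deleter()
```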

When the `strides` field in the `DLTensor` struct is `NULL`, it indicates a
row-major compact array. If the array is of size zero, the data pointer in
`DLTensor` should be set to either `NULL` or `0`.
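When `strides` is `NULL`, the consumer can derive the strides itself. A sketch of the row-major (C-contiguous) computation, in elements rather than bytes (DLPack strides are expressed in elements):

```python
def compact_row_major_strides(shape):
    """Strides (in elements) that a consumer should assume when the
    DLTensor ``strides`` field is NULL, i.e. a row-major compact array."""
    strides = [1] * len(shape)
    # Each stride is the product of all trailing dimension sizes.
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides
```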

The DLPack version used must be `0.2 <= DLPACK_VERSION < 1.0`. For further
details on DLPack design and how to implement support for it,
refer to [github.com/dmlc/dlpack](https://github.com/dmlc/dlpack).
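A version gate matching the stated range might look like the following (illustrative only; how the producer's version is obtained is implementation-specific):

```python
def dlpack_version_supported(major, minor):
    """True iff (major, minor) satisfies 0.2 <= version < 1.0."""
    return (major, minor) >= (0, 2) and major < 1
```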

:::{warning}
DLPack contains a `device_id`, which will be the device ID (an integer, `0, 1, ...`) that the producer library uses. In practice this will likely be the same numbering as that of the consumer; however, that is not guaranteed. Depending on the hardware type, it may be possible for the consumer library implementation to look up the actual device from the pointer to the data - this is possible for example for CUDA device pointers.

It is recommended that implementers of this array API consider and document
whether the `.device` attribute of the array returned from `from_dlpack` is
guaranteed to be in a certain order or not.
:::