Add user documentation for the FFI approach by timsaucer · Pull Request #1031 · apache/datafusion-python

Merged · 3 commits · Feb 21, 2025

Changes from all commits
11 changes: 7 additions & 4 deletions README.md
@@ -30,10 +30,8 @@ DataFusion's Python bindings can be used as a foundation for building new data s
planning, and logical plan optimizations, and then transpiles the logical plan to Dask operations for execution.
- [DataFusion Ballista](https://github.com/apache/datafusion-ballista) is a distributed SQL query engine that extends
DataFusion's Python bindings for distributed use cases.

It is also possible to use these Python bindings directly for DataFrame and SQL operations, but you may find that
[Polars](http://pola.rs/) and [DuckDB](http://www.duckdb.org/) are more suitable for this use case, since they have
more of an end-user focus and are more actively maintained than these Python bindings.
- [DataFusion Ray](https://github.com/apache/datafusion-ray) is another distributed query engine that uses
DataFusion's Python bindings.

## Features

@@ -114,6 +112,11 @@ Printing the context will show the current configuration settings.
print(ctx)
```

## Extensions

For information about how to extend DataFusion Python, please see the extensions page of the
[online documentation](https://datafusion.apache.org/python/).

## More Examples

See [examples](examples/README.md) for more information.
212 changes: 212 additions & 0 deletions docs/source/contributor-guide/ffi.rst
@@ -0,0 +1,212 @@
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

.. http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

Python Extensions
=================

The DataFusion in Python project is designed to allow users to extend its functionality in a few core
areas. Many users would like to package their extensions as a Python package and easily
integrate that package with this project. This page describes some of the challenges we face
when doing these integrations and the approach our project uses.

The Primary Issue
-----------------

Suppose you wish to use DataFusion and you have a custom data source that can produce tables that
can then be queried against, similar to how you can register a :ref:`CSV <io_csv>` or
:ref:`Parquet <io_parquet>` file. In DataFusion terminology, you likely want to implement a
:ref:`Custom Table Provider <io_custom_table_provider>`. In an effort to make your data source
as performant as possible and to utilize the features of DataFusion, you may decide to write
your source in Rust and then expose it through `PyO3 <https://pyo3.rs>`_ as a Python library.

At first glance, it may appear the best way to do this is to add the ``datafusion-python``
crate as a dependency, provide a ``PyTable``, and then register it with the
``SessionContext``. Unfortunately, this will not work.

When you produce your code as a Python library and it needs to interact with the DataFusion
library, at the lowest level they communicate through an Application Binary Interface (ABI).
The acronym sounds similar to API (Application Programming Interface), but it is distinctly
different.

The ABI sets the standard for how these libraries can share data and functions with each
other. One key difference between Rust and many other compiled languages is that Rust
does not have a stable ABI. What this means in practice is that if you compile a Rust
library with one version of the ``rustc`` compiler and I compile another library to
interface with it using a different compiler version, there is no guarantee the two
interfaces will match.

In practice, this means that a Python library built with ``datafusion-python`` as a Rust
dependency will generally **not** be compatible with the DataFusion Python package, even
if they reference the same version of ``datafusion-python``. If you attempt to do this, it may
work on your local computer if you have built both packages with the same optimizations.
This can sometimes lead to a false expectation that the code will work, but it frequently
breaks the moment you try to use your package against the released packages.

You can find more information about the Rust ABI in their
`online documentation <https://doc.rust-lang.org/reference/abi.html>`_.
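
As a small std-only illustration of why layout stability matters (the struct name here is hypothetical, not a DataFusion type): Rust's default layout is unspecified and free to change between compiler versions, while ``#[repr(C)]`` pins a type to the stable C ABI.

```rust
// Rust's default (`repr(Rust)`) layout may reorder fields and change between
// compiler versions; `#[repr(C)]` pins the layout to the stable C ABI.
// `FfiSafePoint` is an illustrative name only.
#[repr(C)]
struct FfiSafePoint {
    x: f64,
    y: f64,
}

fn main() {
    let p = FfiSafePoint { x: 1.0, y: 2.0 };
    // With #[repr(C)], the first field is guaranteed to sit at offset 0, so a
    // pointer to the struct can be read as a pointer to its first field -- a
    // common C idiom that FFI interfaces rely on.
    let first = unsafe { *(&p as *const FfiSafePoint as *const f64) };
    assert_eq!(first, 1.0);
    assert_eq!(std::mem::size_of::<FfiSafePoint>(), 16);
    println!("layout ok");
}
```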

The FFI Approach
----------------

Rust supports interacting with other programming languages through its Foreign Function
Interface (FFI). The advantage of using the FFI is that it enables you to write data structures
and functions that have a stable ABI. This allows you to use Rust code from C, Python, and
other languages. In fact, the `PyO3 <https://pyo3.rs>`_ library uses the FFI to share data
and functions between Python and Rust.

The approach we are taking in the DataFusion in Python project is to incrementally expose
more portions of the DataFusion project via FFI interfaces. This allows users to write Rust
code that does **not** require the ``datafusion-python`` crate as a dependency, expose their
code in Python via PyO3, and have it interact with the DataFusion Python package.

Early adopters of this approach include `delta-rs <https://delta-io.github.io/delta-rs/>`_,
which has adapted its Table Provider for use in ``datafusion-python`` with only a few lines
of code. Also, the DataFusion Python project uses the existing definitions from
`Apache Arrow CStream Interface <https://arrow.apache.org/docs/format/CStreamInterface.html>`_
to support importing **and** exporting tables. Any Python package that supports reading
the Arrow C Stream interface can work with DataFusion Python out of the box! You can read
more about working with Arrow sources in the :ref:`Data Sources <user_guide_data_sources>`
page.

To learn more about the Foreign Function Interface in Rust, the
`Rustonomicon <https://doc.rust-lang.org/nomicon/ffi.html>`_ is a good resource.
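
A minimal std-only sketch of the two building blocks involved (the function name is hypothetical, not part of ``datafusion-ffi``): ``extern "C"`` gives a function the platform's C calling convention, and ``#[no_mangle]`` keeps its symbol name predictable, so code compiled by a different compiler, or written in a different language, can link against it.

```rust
// `extern "C"` fixes the calling convention; `#[no_mangle]` fixes the symbol
// name. Together they form a stable, C-compatible ABI surface.
// `df_add` is an illustrative name only.
#[no_mangle]
pub extern "C" fn df_add(a: i64, b: i64) -> i64 {
    a + b
}

fn main() {
    // On this side of the boundary it is still an ordinary Rust function.
    assert_eq!(df_add(40, 2), 42);
    println!("{}", df_add(40, 2));
}
```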

Inspiration from Arrow
----------------------

DataFusion is built upon `Apache Arrow <https://arrow.apache.org/>`_. The canonical Python
Arrow implementation, `pyarrow <https://arrow.apache.org/docs/python/index.html>`_, provides
an excellent way to share Arrow data between Python projects without performing any copy
operations on the data. They do this by using a well-defined set of interfaces. You can
find the details about their stream interface
`here <https://arrow.apache.org/docs/format/CStreamInterface.html>`_. The
`Rust Arrow Implementation <https://github.com/apache/arrow-rs>`_ also supports these
``C`` style definitions via the Foreign Function Interface.

In addition to using these interfaces to transfer Arrow data between libraries, ``pyarrow``
goes one step further to make sharing the interfaces easier in Python: it exposes
PyCapsules that contain the expected functionality. This is not merely a ``pyarrow``
implementation detail. The `Arrow PyCapsule Interface
<https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html>`_ is a full
specification for Python, of which ``pyarrow`` is one implementor. A key benefit of a
capsule over a raw pointer is that a capsule can carry a destructor, so the Python garbage
collector can still release the underlying data even if the capsule is never consumed.
You can learn more about PyCapsules from the official
`Python online documentation <https://docs.python.org/3/c-api/capsule.html>`_. PyCapsules
have excellent support in PyO3 already. The
`PyO3 online documentation <https://pyo3.rs/main/doc/pyo3/types/struct.pycapsule>`_ is a good source
for more details on using PyCapsules in Rust.

Two lessons we leverage from the Arrow project in DataFusion Python are:

- We reuse the existing Arrow FFI functionality wherever possible.
- We expose PyCapsules that contain an FFI-stable struct.

Implementation Details
----------------------

The bulk of the code necessary to perform our FFI operations is in the upstream
`DataFusion <https://datafusion.apache.org/>`_ core repository. You can review the code and
documentation in the `datafusion-ffi`_ crate.

Our FFI implementation is narrowly focused on sharing data and functions with Rust-backed
libraries. This allows us to use the `abi_stable crate <https://crates.io/crates/abi_stable>`_.
This is an excellent crate that allows for easy conversion between Rust native types
and FFI-safe alternatives. For example, if you need to pass a ``Vec<String>`` via FFI,
you can simply convert it to an ``RVec<RString>`` in an intuitive manner. It also provides
types like ``RResult`` and ``ROption`` for Rust idioms that do not have an obvious
C equivalent.
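
As a std-only sketch of what such FFI-safe containers do under the hood (the ``FfiBytes`` name is illustrative; the real ``abi_stable`` types handle much more), a vector can cross the boundary as an explicit ``#[repr(C)]`` pointer/length/capacity triple:

```rust
// An FFI-safe byte buffer: explicit pointer, length, and capacity with a fixed
// C layout, the same idea behind `RVec`/`RString`. Illustrative only.
#[repr(C)]
struct FfiBytes {
    ptr: *mut u8,
    len: usize,
    cap: usize,
}

impl From<Vec<u8>> for FfiBytes {
    fn from(mut v: Vec<u8>) -> Self {
        let out = FfiBytes { ptr: v.as_mut_ptr(), len: v.len(), cap: v.capacity() };
        std::mem::forget(v); // ownership crosses the FFI boundary
        out
    }
}

impl From<FfiBytes> for Vec<u8> {
    fn from(b: FfiBytes) -> Self {
        // Safe only because `b` was produced by the conversion above.
        unsafe { Vec::from_raw_parts(b.ptr, b.len, b.cap) }
    }
}

fn main() {
    let ffi: FfiBytes = b"datafusion".to_vec().into();
    let back: Vec<u8> = ffi.into();
    assert_eq!(back, b"datafusion");
    println!("roundtrip ok");
}
```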

The `datafusion-ffi`_ crate has been designed to make it easy to convert from DataFusion
traits into their FFI counterparts. For example, if you have defined a custom
`TableProvider <https://docs.rs/datafusion/45.0.0/datafusion/catalog/trait.TableProvider.html>`_
and you want to create a sharable FFI counterpart, you could write:

.. code-block:: rust

    let my_provider = MyTableProvider::default();
    // The second argument indicates whether filter pushdown is supported; the
    // third can optionally pass a tokio runtime handle for async execution.
    let ffi_provider = FFI_TableProvider::new(Arc::new(my_provider), false, None);

If you were interfacing with a library that provided the above ``FFI_TableProvider`` and
you needed to turn it back into a ``TableProvider``, you can convert it into a
``ForeignTableProvider``, which implements the ``TableProvider`` trait.

.. code-block:: rust

    let foreign_provider: ForeignTableProvider = ffi_provider.into();

If you review the code in `datafusion-ffi`_ you will find that each of the traits we share
across the boundary has two portions, one with an ``FFI_`` prefix and one with a ``Foreign``
prefix. The prefix indicates which side of the FFI boundary a struct is designed to be
used on. The structures with the ``FFI_`` prefix are to be used by the **provider** of the
structure. In the example we are showing, this means the code that has written the
underlying ``TableProvider`` implementation to access your custom data source.
The structures with the ``Foreign`` prefix are to be used by the receiver. In this case,
that is the ``datafusion-python`` library.
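
This two-sided pattern can be sketched with std only (all names here are illustrative and far simpler than the real ``datafusion-ffi`` types): the provider packs its implementation behind a ``#[repr(C)]`` struct of function pointers, and the consumer wraps that struct in a type that re-implements the same trait.

```rust
// Illustrative sketch of the FFI_/Foreign split; not datafusion-ffi's API.
trait Table {
    fn row_count(&self) -> usize;
}

// The "FFI_" side: a C-layout struct of a data pointer plus function pointers.
#[repr(C)]
struct FfiTable {
    data: *const (),
    row_count: unsafe extern "C" fn(*const ()) -> usize,
}

// The provider's concrete implementation.
struct MyTable {
    rows: usize,
}
impl Table for MyTable {
    fn row_count(&self) -> usize {
        self.rows
    }
}

unsafe extern "C" fn my_table_row_count(data: *const ()) -> usize {
    unsafe { (*(data as *const MyTable)).rows }
}

// The "Foreign" side: wraps the C struct and re-implements the trait.
struct ForeignTable(FfiTable);
impl Table for ForeignTable {
    fn row_count(&self) -> usize {
        unsafe { (self.0.row_count)(self.0.data) }
    }
}

fn main() {
    let provider = MyTable { rows: 3 };
    let ffi = FfiTable {
        data: &provider as *const MyTable as *const (),
        row_count: my_table_row_count,
    };
    let foreign = ForeignTable(ffi);
    // The consumer only ever touches the stable C-layout struct.
    assert_eq!(foreign.row_count(), 3);
    println!("{}", foreign.row_count());
}
```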

In order to share these FFI structures, we need to wrap them in some kind of Python object
that can be used to interface from one package to another. As described in the above
section on our inspiration from Arrow, we use a ``PyCapsule``. We can create a ``PyCapsule``
for our provider as follows:

.. code-block:: rust

    let name = CString::new("datafusion_table_provider")?;
    let my_capsule = PyCapsule::new_bound(py, provider, Some(name))?;


On the receiving side, you turn this PyCapsule object back into an ``FFI_TableProvider``,
which can then be turned into a ``ForeignTableProvider``:

.. code-block:: rust

    let capsule = capsule.downcast::<PyCapsule>()?;
    // It is good practice to validate the capsule's name before dereferencing it.
    let provider = unsafe { capsule.reference::<FFI_TableProvider>() };

By convention the ``datafusion-python`` library expects a Python object that has a
``TableProvider`` PyCapsule to have this capsule accessible by calling a function named
``__datafusion_table_provider__``. You can see a complete working example of how to
share a ``TableProvider`` from one Python library to DataFusion Python in the
`repository examples folder <https://github.com/apache/datafusion-python/tree/main/examples/ffi-table-provider>`_.

This section has been written using ``TableProvider`` as an example. It is the first
extension that has been written using this approach and the most thoroughly implemented.
As we continue to expose more of the DataFusion features, we intend to follow this same
design pattern.

Alternative Approach
--------------------

Suppose you need to expose some other feature of DataFusion and cannot wait for the
upstream repository to implement the FFI approach we describe. In this case, you may
decide to depend on the ``datafusion-python`` crate directly instead.

As we discussed, this is not guaranteed to work across different compiler versions and
optimization levels. If you wish to go down this route, there are two approaches we
have identified you can use.
#. Re-export all of ``datafusion-python`` yourself with your extensions built in.
#. Carefully synchronize your software releases with the ``datafusion-python`` CI build
system so that your libraries use the exact same compiler, features, and
optimization level.

We currently do not recommend either of these approaches, as they are difficult to
maintain over a long period and require a tight version coupling between libraries.
In particular, the PyO3 maintainers explicitly recommend against the re-export approach.

Status of Work
--------------

At the time of this writing, the FFI features are under active development. To see
the latest status, we recommend reviewing the code in the `datafusion-ffi`_ crate.

.. _datafusion-ffi: https://crates.io/crates/datafusion-ffi
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -85,6 +85,7 @@ Example
:caption: CONTRIBUTOR GUIDE

contributor-guide/introduction
contributor-guide/ffi

.. _toc.api:
.. toctree::