support FFI query result streams that do not pre-collect · Issue #1011 · apache/datafusion-python · GitHub
support FFI query result streams that do not pre-collect #1011
Open
@matko

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I'm trying to pass the result of a query into my Rust code. Some of my queries produce a lot of data, and I would like to process it in a streaming way, without first loading the entire query result into memory (where it might not even fit).

DataFrame has a method __arrow_c_stream__(), which can be used to cross the FFI boundary and get DataFrame results into a native component. Unfortunately, it calls .collect() internally. This means I can't actually stream over the results while keeping the memory footprint low: the entire dataset has to fit in memory, and the rest of my processing logic has to wait for the collect to complete before it can start.

Describe the solution you'd like
I would like __arrow_c_stream__() or a similar function to produce a RecordBatchReader or even a RecordBatchStream (which also appears to be FFI-wrapped) that streams the query result without first collecting it into memory.
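
To make the difference concrete, here is a minimal sketch in plain Python that models the two behaviours. It does not use the datafusion or Arrow APIs: `produce_batches`, `collected_stream`, `lazy_stream`, and `consume` are hypothetical names invented for illustration, and a list of integers stands in for a record batch. The log records the order in which batches are produced and consumed.

```python
from typing import Iterator, List

Batch = List[int]  # hypothetical stand-in for an Arrow record batch


def produce_batches(num_batches: int, rows_per_batch: int, log: List[str]) -> Iterator[Batch]:
    """Hypothetical query execution that yields one batch at a time."""
    for i in range(num_batches):
        log.append(f"produce {i}")
        start = i * rows_per_batch
        yield list(range(start, start + rows_per_batch))


def collected_stream(num_batches: int, rows_per_batch: int, log: List[str]) -> Iterator[Batch]:
    """Models the pre-collecting behaviour: materialize every batch, then iterate."""
    batches = list(produce_batches(num_batches, rows_per_batch, log))  # whole result in memory
    return iter(batches)


def lazy_stream(num_batches: int, rows_per_batch: int, log: List[str]) -> Iterator[Batch]:
    """Models the requested behaviour: hand out batches as they are produced."""
    return produce_batches(num_batches, rows_per_batch, log)


def consume(stream: Iterator[Batch], log: List[str]) -> int:
    """Hypothetical downstream consumer (the native component in this issue)."""
    total = 0
    for i, batch in enumerate(stream):
        log.append(f"consume {i}")
        total += sum(batch)
    return total


if __name__ == "__main__":
    eager_log: List[str] = []
    consume(collected_stream(3, 2, eager_log), eager_log)
    print(eager_log)  # all "produce" entries come before any "consume" entry

    lazy_log: List[str] = []
    consume(lazy_stream(3, 2, lazy_log), lazy_log)
    print(lazy_log)  # "produce" and "consume" entries alternate
```

In the collected variant, the consumer cannot start until every batch exists, and all batches are alive at once. In the lazy variant, at most one batch needs to be held at a time, which is the property I'm after across the FFI boundary.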

Describe alternatives you've considered
The alternative is accepting that using results from Python in Rust will always require a collect on the Python side first. Given that the infrastructure seems to be in place to pass around readers and streams, this seems silly.

Labels: enhancement (New feature or request)
