Scanning

The scan API provides a builder pattern for reading data from a Vortex file with optional filter, projection, row range, and limit pushdowns. The resulting stream exposes the Arrow C Data Interface (ArrowArrayStream).
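Since the scan result is a plain ArrowArrayStream, any consumer of the Arrow C Stream Interface can read it. The sketch below shows the usual pull loop: call get_next until it returns an array whose release callback is null, release each batch, then release the stream. The struct definitions are copied from the Arrow specification so the example is self-contained. `CountRows`, `MakeFakeStream`, and the `Fake*` helpers are hypothetical names introduced only for illustration, and the fake stream is a stand-in for a real scan.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <utility>
#include <vector>

// Arrow C Data / C Stream Interface structs, as defined by the Arrow spec.
// Real code would take them from Arrow's C ABI header instead.
#ifndef ARROW_C_DATA_INTERFACE
#define ARROW_C_DATA_INTERFACE
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};
#endif
#ifndef ARROW_C_STREAM_INTERFACE
#define ARROW_C_STREAM_INTERFACE
struct ArrowArrayStream {
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
  const char* (*get_last_error)(struct ArrowArrayStream*);
  void (*release)(struct ArrowArrayStream*);
  void* private_data;
};
#endif

// Hypothetical consumer: sums the row counts of all batches, then releases
// the stream. Returns -1 if get_next reports an error.
int64_t CountRows(ArrowArrayStream* stream) {
  assert(stream != nullptr && stream->release != nullptr);
  int64_t total = 0;
  while (true) {
    ArrowArray batch{};
    if (stream->get_next(stream, &batch) != 0) { total = -1; break; }
    if (batch.release == nullptr) break;  // a released array marks end of stream
    total += batch.length;
    batch.release(&batch);                // the consumer owns each batch it pulls
  }
  stream->release(stream);
  return total;
}

// Fake in-memory stream (illustration only): yields one empty-bodied batch
// per entry in `lengths`.
struct FakeState { std::vector<int64_t> lengths; std::size_t next; };

void FakeArrayRelease(ArrowArray* array) { array->release = nullptr; }
int FakeGetSchema(ArrowArrayStream*, ArrowSchema*) { return ENOTSUP; }
const char* FakeGetLastError(ArrowArrayStream*) { return nullptr; }

int FakeGetNext(ArrowArrayStream* stream, ArrowArray* out) {
  auto* state = static_cast<FakeState*>(stream->private_data);
  *out = ArrowArray{};  // release stays null once the batches run out
  if (state->next < state->lengths.size()) {
    out->length = state->lengths[state->next++];
    out->release = FakeArrayRelease;
  }
  return 0;
}

void FakeStreamRelease(ArrowArrayStream* stream) {
  delete static_cast<FakeState*>(stream->private_data);
  stream->release = nullptr;
}

ArrowArrayStream MakeFakeStream(std::vector<int64_t> lengths) {
  ArrowArrayStream stream{};
  stream.get_schema = FakeGetSchema;
  stream.get_next = FakeGetNext;
  stream.get_last_error = FakeGetLastError;
  stream.release = FakeStreamRelease;
  stream.private_data = new FakeState{std::move(lengths), 0};
  return stream;
}
```

With a real scan, the stream passed to `CountRows` would be the ArrowArrayStream returned by IntoStream() or CreateArrayStream() rather than the fake one.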

ScanBuilder

class ScanBuilder

Public Functions

ScanBuilder &WithFilter(expr::Expr &&expr) &

Only include rows that match the filter expressions.

ScanBuilder &WithProjection(expr::Expr &&expr) &

Only include columns that match the projection expressions.

ScanBuilder &WithRowRange(uint64_t row_range_start, uint64_t row_range_end) &

Only include rows in the range [row_range_start, row_range_end).
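Because the range is half-open, adjacent ranges tile a file without overlapping or skipping rows. The following is a minimal sketch of that arithmetic, assuming the total row count is already known. `SplitRowRanges` is a hypothetical helper, not part of the Vortex API.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not part of the Vortex API): tiles [0, total_rows)
// with `parts` contiguous half-open ranges [start, end).
std::vector<std::pair<uint64_t, uint64_t>> SplitRowRanges(uint64_t total_rows,
                                                          uint64_t parts) {
  assert(parts > 0);
  std::vector<std::pair<uint64_t, uint64_t>> ranges;
  const uint64_t base = total_rows / parts;
  const uint64_t extra = total_rows % parts;  // the first `extra` ranges get one more row
  uint64_t start = 0;
  for (uint64_t i = 0; i < parts; ++i) {
    const uint64_t end = start + base + (i < extra ? 1 : 0);
    ranges.emplace_back(start, end);  // `end` is exclusive, so it is also the next start
    start = end;
  }
  return ranges;
}
```

Each pair has the same shape as the (row_range_start, row_range_end) arguments of WithRowRange: the end of one range equals the start of the next, and no row belongs to two ranges.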

ScanBuilder &WithIncludeByIndex(const uint64_t *indices, std::size_t size) &

Only include rows with the given indices.

ScanBuilder &WithLimit(uint64_t limit) &

Limit the number of rows the scan returns.

ScanBuilder &WithOutputSchema(ArrowSchema &output_schema) &

Set the output schema on the scan builder. TODO: currently, if this option is passed, the schema must be the schema that results after the projection is applied.

ArrowArrayStream IntoStream() &&

Consume the scan builder, taking ownership of it, and return a stream of record batches.

StreamDriver IntoStreamDriver() &&

Consume the scan builder, taking ownership of it, and return a stream driver. Under the hood, this function calls ScanBuilder::into_record_batch_reader, and the resulting StreamDriver holds a WorkStealingArrayIterator.
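The ref-qualifiers in the signatures above shape how the builder is used: the `With...` functions are `&`-qualified, so they can only be called on a named (lvalue) builder, while the two `Into...` functions are `&&`-qualified, so the builder has to be passed through std::move to be consumed. The sketch below mirrors only that calling pattern. `ToyScanBuilder`, `IntoRows`, and `RunToyScan` are hypothetical stand-ins, and the way the toy combines range and limit (range first, then limit) is an assumption made for the example, not documented Vortex behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Toy stand-in, NOT the Vortex ScanBuilder: it only mirrors the ref-qualifiers.
class ToyScanBuilder {
 public:
  explicit ToyScanBuilder(std::vector<int> rows) : rows_(std::move(rows)) {}

  // `&`-qualified: callable only on an lvalue; returns *this so calls chain.
  ToyScanBuilder& WithRowRange(uint64_t start, uint64_t end) & {
    assert(start <= end);
    start_ = start;
    end_ = end;
    return *this;
  }

  ToyScanBuilder& WithLimit(uint64_t limit) & {
    limit_ = limit;
    return *this;
  }

  // `&&`-qualified: callable only on an rvalue, so the caller must std::move
  // the builder, which makes the hand-over of ownership visible at the call site.
  std::vector<int> IntoRows() && {
    std::vector<int> out;
    const uint64_t end = std::min<uint64_t>(end_, rows_.size());
    // Toy semantics (an assumption): apply the range first, then the limit.
    for (uint64_t i = start_; i < end && out.size() < limit_; ++i) {
      out.push_back(rows_[i]);
    }
    return out;
  }

 private:
  std::vector<int> rows_;
  uint64_t start_ = 0;
  uint64_t end_ = UINT64_MAX;
  uint64_t limit_ = UINT64_MAX;
};

std::vector<int> RunToyScan() {
  ToyScanBuilder builder(std::vector<int>{10, 11, 12, 13, 14, 15});
  builder.WithRowRange(1, 5).WithLimit(3);  // setters chain on the named builder
  // builder.IntoRows();                    // would not compile: builder is an lvalue
  return std::move(builder).IntoRows();     // consumes the builder
}
```

After the std::move call the builder should be treated as spent, which matches the "take ownership and consume" wording of IntoStream and IntoStreamDriver.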

StreamDriver

class StreamDriver

The StreamDriver internally holds a RecordBatchIteratorAdapter from the Rust side, which is thread-safe and cloneable. The RecordBatchIteratorAdapter internally holds a WorkStealingArrayIterator.

Public Functions

ArrowArrayStream CreateArrayStream() const

Create a stream of record batches.

This function is thread-safe. Multiple threads can each call it to obtain their own stream, and those streams concurrently make progress on the same StreamDriver, which was built from a single ScanBuilder.

Within each thread, record batches are emitted in the order in which they appear in the scan. Across threads, no ordering is guaranteed.

Example: If the scan contains batches [b0, b1, b2, b3, b4, b5] and two threads each call this function and consume their own stream, Thread 1 might receive [b0, b2, b4] and Thread 2 might receive [b1, b3, b5]. Each thread sees its subset in order, but there is no ordering between threads (e.g., Thread 2 could emit b1 before Thread 1 emits b0).
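The ordering guarantee can be reproduced with a small simulation. This is not the Vortex implementation: a shared atomic cursor stands in for the shared StreamDriver, and each worker thread stands in for one stream obtained from CreateArrayStream(). `SimulateSharedScan` and `EachThreadInOrder` are hypothetical names used only here.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Simulation only: each worker repeatedly claims the next unclaimed batch
// index from a shared cursor and records it in its own list.
std::vector<std::vector<std::size_t>> SimulateSharedScan(std::size_t num_batches,
                                                         std::size_t num_threads) {
  assert(num_threads > 0);
  std::atomic<std::size_t> next{0};
  std::vector<std::vector<std::size_t>> received(num_threads);
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < num_threads; ++t) {
    workers.emplace_back([&next, &received, num_batches, t] {
      while (true) {
        const std::size_t batch = next.fetch_add(1);
        if (batch >= num_batches) break;
        received[t].push_back(batch);  // each thread writes only to its own slot
      }
    });
  }
  for (auto& worker : workers) worker.join();
  return received;
}

// Checks the per-thread guarantee: every thread saw its batches in scan order.
bool EachThreadInOrder(const std::vector<std::vector<std::size_t>>& received) {
  return std::all_of(received.begin(), received.end(),
                     [](const std::vector<std::size_t>& batches) {
                       return std::is_sorted(batches.begin(), batches.end());
                     });
}
```

Which batches land on which thread varies from run to run, including the case where one thread receives all of them. The per-thread lists are always increasing, and every batch is delivered exactly once.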