Some APIs feel a bit unintuitive, and I think Polars has really excelled in this area. My suggestion is that we reuse some of those APIs or take inspiration from them. These are the changes I am proposing (I am happy to work on these areas, especially with datafusion-ray becoming a thing):
- `DataFrame.cache() -> DataFrame` ===> `DataFrame.collect() -> DataFrame`
- `DataFrame.collect() -> list[pyarrow.RecordBatch]` ===> `DataFrame.to_batches() -> list[pyarrow.RecordBatch]`
- `DataFrame.join` ===> `DataFrame.join(right: DataFrame, on: str | sequence[str] | None, left_on: str | sequence[str] | None, right_on: str | sequence[str] | None)`
- `DataFrame.schema -> pyarrow.Schema` ===> `DataFrame.schema -> datafusion.Schema`: map Rust Arrow types to datafusion-py types
- `DataFrame.with_column` ===> `DataFrame.with_columns`: allow multiple inputs, either as exprs or as key-value pairs
- `DataFrame.with_column_renamed` ===> `DataFrame.rename()`: a simple `rename` is clear enough, and it should accept a dict as input
- `DataFrame.aggregate` ===> `DataFrame.group_by().agg()`: this feels more natural coming from PySpark/Polars/Pandas
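
The proposed `join` signature leaves open how `on` interacts with `left_on` / `right_on`. Below is a minimal sketch in plain Python of one possible resolution rule. `resolve_join_keys` is a hypothetical helper, not an existing datafusion-python function, and the validation rules (`on` excludes the other two, key lists must have equal length) are assumptions rather than settled behavior.

```python
from __future__ import annotations

from collections.abc import Sequence


def resolve_join_keys(
    on: str | Sequence[str] | None = None,
    left_on: str | Sequence[str] | None = None,
    right_on: str | Sequence[str] | None = None,
) -> tuple[list[str], list[str]]:
    """Hypothetical helper: turn the join arguments into (left_keys, right_keys)."""

    def as_list(value: str | Sequence[str]) -> list[str]:
        # A bare string is a single column name, not a sequence of characters.
        return [value] if isinstance(value, str) else list(value)

    if on is not None:
        # Assumption: `on` cannot be combined with `left_on` / `right_on`.
        if left_on is not None or right_on is not None:
            raise ValueError("pass either `on` or `left_on`/`right_on`, not both")
        keys = as_list(on)
        return keys, list(keys)

    if left_on is None or right_on is None:
        raise ValueError("`left_on` and `right_on` must be given together")

    left_keys, right_keys = as_list(left_on), as_list(right_on)
    if len(left_keys) != len(right_keys):
        raise ValueError("`left_on` and `right_on` must have the same length")
    return left_keys, right_keys
```

With this rule, `on="id"` joins on the same column name on both sides, while `left_on=["a", "b"], right_on=["x", "y"]` pairs the columns positionally.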
Can remove these:
- `DataFrame.select_columns`: already covered by `DataFrame.select`
Missing APIs:
- `DataFrame.cast`: to cast a single column or multiple columns at the top level
- `DataFrame.drop`: to drop columns, instead of writing a very verbose select
- `DataFrame.fill_null` / `fill_nan`: to fill null or NaN values
- `DataFrame.interpolate`: interpolate values per column
- Asof join: missing in the DataFrame API?
- Join on (inequality join)
- `DataFrame.head` / `tail`
- `DataFrame.pivot`
- `DataFrame.unpivot`
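
As an illustration of why `DataFrame.drop` would be cheap to add, it can be expressed as a select over the remaining columns. The sketch below works on plain column names only. `columns_after_drop` is a hypothetical helper, and raising on unknown column names is an assumption (silently ignoring them would be another reasonable choice).

```python
from __future__ import annotations


def columns_after_drop(columns: list[str], *to_drop: str) -> list[str]:
    """Hypothetical helper: the columns a `drop` would keep, in original order."""
    # Assumption: dropping a column that does not exist is an error.
    missing = [name for name in to_drop if name not in columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    dropped = set(to_drop)
    return [name for name in columns if name not in dropped]


# Example: dropping one column out of three keeps the other two in order.
remaining = columns_after_drop(["id", "tmp", "value"], "tmp")
print(remaining)
```

The resulting list is exactly what a user has to spell out by hand in a select today, which is the verbosity a built-in `drop` would remove.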
Optional but useful:
- `DataFrame.with_row_idx`