RFC: Re-work some DataFrame APIs · Issue #875 · apache/datafusion-python · GitHub
Open
@ion-elgreco

Description

Some APIs feel a bit unintuitive; I think Polars really excels in this area. My suggestion is that we reuse some of its APIs or take inspiration from them. These are the changes I am proposing (I am happy to work on these areas, especially with datafusion-ray becoming a thing):

  • DataFrame.cache() -> DataFrame ===> DataFrame.collect() -> DataFrame
  • DataFrame.collect() -> list[pyarrow.RecordBatch] ===> DataFrame.to_batches() -> list[pyarrow.RecordBatch]
  • DataFrame.join ===> DataFrame.join(right: DataFrame, on: str | sequence[str] | None, left_on: str | sequence[str] | None, right_on: str | sequence[str] | None)
  • DataFrame.schema -> pyarrow.Schema ===> DataFrame.schema -> datafusion.Schema: map Rust arrow types to datafusion-py types
  • DataFrame.with_column ===> DataFrame.with_columns: allow multiple inputs, as exprs or as key-value pairs
  • DataFrame.with_column_renamed ===> DataFrame.rename(): a simple "rename" is clear enough, and it should accept a dict as input
  • DataFrame.aggregate ===> DataFrame.group_by().agg(): this feels more natural coming from PySpark/Polars/pandas
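
To make the proposed `join` signature a bit more concrete, here is a minimal sketch of how the three key arguments could be normalized into left and right key lists. The helper name `resolve_join_keys` and its validation rules are hypothetical (nothing here is part of datafusion-python today), and it uses only the standard library:

```python
from __future__ import annotations

from collections.abc import Sequence


def _as_list(keys: str | Sequence[str]) -> list[str]:
    """Normalize a single column name or a sequence of names to a list."""
    return [keys] if isinstance(keys, str) else list(keys)


def resolve_join_keys(
    on: str | Sequence[str] | None = None,
    left_on: str | Sequence[str] | None = None,
    right_on: str | Sequence[str] | None = None,
) -> tuple[list[str], list[str]]:
    """Return (left_keys, right_keys) for a join call.

    Hypothetical helper: the validation rules are an assumption,
    the proposal itself only lists the parameters.
    """
    if on is not None:
        # `on` means the key columns have the same names on both sides.
        if left_on is not None or right_on is not None:
            raise ValueError("pass either `on` or `left_on`/`right_on`, not both")
        keys = _as_list(on)
        return keys, list(keys)
    if left_on is None or right_on is None:
        raise ValueError("`left_on` and `right_on` must be given together")
    left, right = _as_list(left_on), _as_list(right_on)
    if len(left) != len(right):
        raise ValueError("`left_on` and `right_on` must have the same length")
    return left, right


print(resolve_join_keys(on="id"))
# (['id'], ['id'])
print(resolve_join_keys(left_on=["a", "b"], right_on=["x", "y"]))
# (['a', 'b'], ['x', 'y'])
```

Accepting both a plain string and a sequence keeps the common single-key case short, while `left_on`/`right_on` covers tables whose key columns are named differently.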

These can be removed:

  • DataFrame.select_columns: already covered by DataFrame.select

Missing APIs:

  • DataFrame.cast: cast a single column or multiple columns at the top level
  • DataFrame.drop: drop columns, instead of writing a very verbose select
  • DataFrame.fill_null/fill_nan: fill null or NaN values
  • DataFrame.interpolate: interpolate values per column
  • Asof join (missing in the DataFrame API?)
  • Join on inequality conditions (inequality join)
  • DataFrame.head/tail
  • DataFrame.pivot
  • DataFrame.unpivot
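
As an illustration of why `DataFrame.drop` would help, the sketch below computes the "verbose select" that a caller has to build by hand today: every column except the ones to drop. It works on plain lists of column names and is not tied to any real datafusion call; `columns_after_drop` is a hypothetical name, and raising on unknown columns is an assumed policy:

```python
from __future__ import annotations

from collections.abc import Iterable, Sequence


def columns_after_drop(all_columns: Sequence[str], drop: Iterable[str]) -> list[str]:
    """Return the column names to keep, preserving the original order.

    Raises KeyError for names that are not present (an assumed policy).
    """
    to_drop = set(drop)
    missing = to_drop - set(all_columns)
    if missing:
        raise KeyError(f"unknown columns: {sorted(missing)}")
    return [name for name in all_columns if name not in to_drop]


schema_names = ["id", "name", "email", "created_at"]
keep = columns_after_drop(schema_names, ["email"])
print(keep)
# ['id', 'name', 'created_at']
```

The resulting list is what would be passed to a select today; a built-in `drop` would hide this bookkeeping from the caller.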

Optional but useful:

  • DataFrame.with_row_idx

Labels: enhancement (New feature or request)
