10BC0 [SPARK-46167][PS] Add axis implementation to DataFrame.rank by devin-petersohn · Pull Request #54009 · apache/spark · GitHub
[go: up one dir, main page]

Skip to content

Conversation

@devin-petersohn
Copy link
Contributor

What changes were proposed in this pull request?

Adds axis implementation to DataFrame.rank

Why are the changes needed?

Implements missing API

Does this PR introduce any user-facing change?

Yes, new API support

How was this patch tested?

CI

Was this patch authored or co-authored using generative AI tooling?

Co-authored-by: Claude Sonnet 4.5

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
Co-authored-by: Devin Petersohn <devin.petersohn@snowflake.com>
@github-actions
Copy link

JIRA Issue Information

=== Sub-task SPARK-46167 ===
Summary: Add axis, pct and na_option parameter to DataFrame.rank
Assignee: None
Status: Open
Affected: ["4.0.0"]


This comment was automatically generated by GitHub Actions

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
@HyukjinKwon
Copy link
Member

cc @gaogaotiantian FYI

def rank(
self, method: str = "average", ascending: bool = True, numeric_only: bool = False
self,
method: str = "average",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably need to change to a Literal here for type hint.

method: str = "average",
ascending: bool = True,
numeric_only: bool = False,
axis: Axis = 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to make a decision for where axis should be. pandas has it at the very beginning - we are doing a different thing, which means if the user is sending the argument positionally, we would have a different result. On the other hand, if they are doing that, moving axis to the beginning would break their existing code too.

On a side note, pandas is moving towards keyword-only APIs very eagerly. We could also consider doing that here to avoid user sending the wrong argument.

We are incompatible with pandas now - might be a good chance to fix that and hurt the users early.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good point, do we break compatibility with prior versions of Spark or pandas here? Moving to keyword only could break everyone, but most pandas code I've seen explicitly uses the keyword args. It makes sense to me to move to strict keyword support for this API and others moving forward to make the API more explicit (and Pythonic) anyway. Thoughts?

Copy link
Contributor
@gaogaotiantian gaogaotiantian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the test is comprehensive. Took a quick look at the implementation - not the pandas/pyspark df expert but the pattern looks familiar.

The only thing I have is the API, which is brought up inline.

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

0