-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-46167][PS] Add axis implementation to DataFrame.rank #54009
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com> Co-authored-by: Devin Petersohn <devin.petersohn@snowflake.com>
JIRA Issue Information=== Sub-task SPARK-46167 === This comment was automatically generated by GitHub Actions |
|
cc @gaogaotiantian FYI |
python/pyspark/pandas/frame.py
Outdated
| def rank( | ||
| self, method: str = "average", ascending: bool = True, numeric_only: bool = False | ||
| self, | ||
| method: str = "average", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably need to change to a Literal here for type hint.
| method: str = "average", | ||
| ascending: bool = True, | ||
| numeric_only: bool = False, | ||
| axis: Axis = 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to make a decision for where axis should be. pandas has it at the very beginning - we are doing a different thing, which means if the user is sending the argument positionally, we would have a different result. On the other hand, if they are doing that, moving axis to the beginning would break their existing code too.
On a side note, pandas is moving towards keyword-only APIs very eagerly. We could also consider doing that here to avoid user sending the wrong argument.
We are incompatible with pandas now - might be a good chance to fix that and hurt the users early.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good point, do we break compatibility with prior versions of Spark or pandas here? Moving to keyword only could break everyone, but most pandas code I've seen explicitly uses the keyword args. It makes sense to me to move to strict keyword support for this API and others moving forward to make the API more explicit (and Pythonic) anyway. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the test is comprehensive. Took a quick look at the implementation - not the pandas/pyspark df expert but the pattern looks familiar.
The only thing I have is the API, which is brought up inline.
Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
What changes were proposed in this pull request?
Adds axis implementation to DataFrame.rank
Why are the changes needed?
Implements missing API
Does this PR introduce any user-facing change?
Yes, new API support
How was this patch tested?
CI
Was this patch authored or co-authored using generative AI tooling?
Co-authored-by: Claude Sonnet 4.5