-
Notifications
You must be signed in to change notification settings - Fork 3.8k
GH-46572: [Python] expose filter option to python for join #46566
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
See also: |
cc @richardliaw since I discussed this with Richard and he suggested me to give this a try. and it may be helpful for ray project too. Thanks! |
Hi @xingyu-long, thank you for opening a PR! |
@AlenkaF Thanks for taking a look! I just opened the issue to track this (#46572). for the failing tests, probably related to corresponding python callers / function definition. but could you take a look first? since the main part is to enable join option in _acero.pyx, I'd like to get some feedback from the community for this part and see if it makes sense. Thanks! |
Thanks for opening this issue! I've marked the PR as a draft and updated the title. Regarding the call in Also, please make sure to connect it with CC: @raulcd for any additional thoughts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. In principle looks good to me. I would just change it to be the last argument of the function signature. As we are not using keyword only arguments these change is making the signature of the function change those provoking an unnecessary breaking change for users.
Thanks for the suggestion @AlenkaF @raulcd I just updated the code. btw, I observed two things while I am writing tests for this matter
is this intended behavior? for example, let's assume that we have two tables which have some common fields ( but if we can apply the filter on both tables first before we joining two tables, it would be more efficient? that's why I'd like to confirm what's the expected behavior for this filter in c++ implementation.
In [54]: import pandas as pd
...: import pyarrow as pa
...: df1 = pd.DataFrame({'id': [1, 2, 3],
...: 'year': [2020, 2022, 2019]})
...: df2 = pd.DataFrame({'id': [3, 4],
...: 'n_legs': [5, 100],
...: 'animal': ["Brittle stars", "Centipede"]})
...: t1 = pa.Table.from_pandas(df1)
...: t2 = pa.Table.from_pandas(df2)
In [55]: t1.join(t2, 'id', join_type="right outer").combine_chunks()
Out[55]:
pyarrow.Table
year: int64
id: int64
n_legs: int64
animal: string
----
year: [[2019,null]]
id: [[3,4]]
n_legs: [[5,100]]
animal: [["Brittle stars","Centipede"]]
# and then we apply filter expression with intended mismatch here
In [56]: t1.join(t2, 'id', join_type="right outer", filter_expression=pc.equal(pc.field("n_legs"), 200)).combine_chunks()
Out[56]:
pyarrow.Table
year: int64
id: int64
n_legs: int64
animal: string
----
year: [[null,null]]
id: [[3,4]]
n_legs: [[5,100]]
animal: [["Brittle stars","Centipede"]] it seems we didn't return empty, instead, we return the btw, it seems fine with inner join type. In [57]: t1.join(t2, 'id', join_type="inner", filter_expression=pc.equal(pc.field("n_legs"), 200)).combine_chunks()
Out[57]:
pyarrow.Table
id: int64
year: int64
n_legs: int64
animal: string
----
id: []
year: []
n_legs: []
animal: [] this one seems like a bug to me, but I am not sure, @AlenkaF @raulcd could you provide some feedback on these two questions? Thanks! |
I am no expert on this area. |
Thank you @xingyu-long for contributing this! I'd first address your concern of:
Yes, this is expected by SQL semantic. And this is also the difference between you put an expression within Conceptually, all subexpressions in
Yes this is necessary for preserving the SQL-like join semantic - as long as you write the filter in the
In this case you can just do the filter ahead of join, e.g.,
As long as it is what you needed.
|
If my above comment addresses your concern, I'll in turn review the code. Thank you @xingyu-long . |
Thanks @zanmato1984 for your explanation, it makes sense. probably I should mention more details in function docstring for this usage then. at same time, feel free to review the changes since it just exposes what c++ does for python. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
(I pushed a commit merely changing some line orders.)
Thanks! @zanmato1984 Really appreciated it! I will wait for other approvals. |
There 9E88 was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to be consistent and call the new argument filter_expression
instead of expression
as we do on FilterOptions
? See docstring there:
The "filter" operation provides an option to define data filtering
criteria. It selects rows where the given expression evaluates to true.
Filters can be written using pyarrow.compute.Expression, and the
expression must have a return type of boolean.
Parameters
----------
filter_expression : pyarrow.compute.Expression
I think that would make sense, yes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! As @zanmato1984 is the expert on the C++ functionality. This looks good on the Python side to me
@AlenkaF @rok do you want to check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took some time to run code locally so to understand the hash join itself and the changes. Looks great and I have no other comments or questions =)
I need to check if the PR is connected to the issue opened, then l plan to merge this. Thank you again for the contribution @xingyu-long and all the reviews!
|
@github-actions crossbow submit -g python |
Revision: 889c98b Submitted crossbow builds: ursacomputing/crossbow @ actions-e8f44d6eca |
CI failures for |
Thank you all! @zanmato1984 @raulcd @AlenkaF |
Thank you for the contribution! |
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 94e3b3e. There were 73 benchmark results with an error:
There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them. |
I have seen the same Conbench errors on other PRs (#46638 (comment)). It is getting a bit annoying, will try to take time to look into it =) |
Rationale for this change
C++ implementation support filter while performing hash join, however, it didn't expose to python and I think it's good to have this, so other users can avoid additional filter op explicitly in their side.
What changes are included in this PR?
Support filter expression in python binding.
Are these changes tested?
Yes, added new test
test_hash_join_with_filter
.Are there any user-facing changes?
It will expose one more argument for user, i.e., filter_expression for
Table.join
andDatastet.join