GH-46572: [Python] expose filter option to python for join #46566

xingyu-long · 2025-05-23T03:48:53Z

Rationale for this change

The C++ implementation supports a filter while performing a hash join, but this was not exposed to Python. I think it would be good to have, so that users can avoid an explicit additional filter operation on their side.

What changes are included in this PR?

Support filter expression in python binding.

Are these changes tested?

Yes, added new test test_hash_join_with_filter.

Are there any user-facing changes?

It exposes one more argument to users, filter_expression, for Table.join and Dataset.join.

GitHub Issue: [Python] Support filter option for hash join #46572

github-actions · 2025-05-23T03:49:19Z

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

xingyu-long · 2025-05-23T03:50:59Z

cc @richardliaw, since I discussed this with Richard and he suggested I give it a try. It may be helpful for the Ray project too. Thanks!

AlenkaF · 2025-05-23T08:32:46Z

Hi @xingyu-long, thank you for opening a PR!
Could you first open an issue to track the changes and check the failing CI builds? Some of the failing tests are connected.

xingyu-long · 2025-05-23T16:15:39Z

@AlenkaF Thanks for taking a look!

I just opened the issue to track this (#46572). The failing tests are probably related to the corresponding Python callers / function definitions. But could you take a look first? Since the main part is enabling the join option in _acero.pyx, I'd like to get some feedback from the community on this part and see if it makes sense. Thanks!

AlenkaF · 2025-05-26T07:08:40Z

Thanks for opening this issue!

I've marked the PR as a draft and updated the title.

Regarding the call in Table.join: I would suggest placing the new keyword argument at the end of the list — this helps preserve consistency and avoids breaking any assumptions about argument order.

Also, please make sure to connect it with acero.py and _acero.pyx.

CC: @raulcd for any additional thoughts.

raulcd

Thanks for the PR. In principle it looks good to me. I would just make it the last argument of the function signature. As we are not using keyword-only arguments, this change alters the positional order of the signature, thus provoking an unnecessary breaking change for users.
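
To illustrate the concern about argument order, here is a small sketch. The three functions are hypothetical stand-ins with made-up parameter lists, not the real Table.join signature; they only show what happens to an existing positional call when a new parameter is inserted mid-signature versus appended at the end.

```python
# Hypothetical stand-ins for a join signature; names are illustrative only.

def join_v1(right, keys, join_type="left outer", use_threads=True):
    """Original signature, before the new option exists."""
    return {"join_type": join_type, "use_threads": use_threads}


def join_inserted_in_middle(right, keys, filter_expression=None,
                            join_type="left outer", use_threads=True):
    """New option inserted mid-signature: later positional args shift."""
    return {"join_type": join_type, "use_threads": use_threads,
            "filter_expression": filter_expression}


def join_appended(right, keys, join_type="left outer", use_threads=True,
                  filter_expression=None):
    """New option appended last: existing positional calls keep working."""
    return {"join_type": join_type, "use_threads": use_threads,
            "filter_expression": filter_expression}


# An existing caller that passes join_type positionally.
old_call_args = ("right_table", "id", "inner")

before = join_v1(*old_call_args)
broken = join_inserted_in_middle(*old_call_args)   # "inner" lands in filter_expression
compatible = join_appended(*old_call_args)         # "inner" is still join_type
```

With the parameter appended, the old call still selects an inner join; with it inserted in the middle, the same call silently falls back to the default join type.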

xingyu-long · 2025-05-28T06:31:20Z

Thanks for the suggestion @AlenkaF @raulcd I just updated the code.

By the way, I observed two things while writing tests for this:

it seems filter cannot apply for both side, i.e same field for both table/schema, this was implemented in c++ side

arrow/cpp/src/arrow/acero/hash_join_node.cc

Lines 449 to 451 in 0c18968

    
    if (in_left && in_right) {
      return Status::Invalid("FieldRef", ref.ToString(),
                             "was found in both left and right schemas");

is this intended behavior?

for example, let's assume that we have two tables which have some common fields (id and name), and we'd like to join them by id and then filter name with certain pattern. so without exposing this API to python, we probably need to maintain a big intermediate state of temp join and then apply the filter on top of it.

but if we can apply the filter on both tables first before we joining two tables, it would be more efficient? that's why I'd like to confirm what's the expected behavior for this filter in c++ implementation.

I tried exercising the filter with different join types and saw the following surprise (assuming the filter references only one side):

In [54]: import pandas as pd
    ...: import pyarrow as pa
    ...: import pyarrow.compute as pc
    ...: df1 = pd.DataFrame({'id': [1, 2, 3],
    ...:                     'year': [2020, 2022, 2019]})
    ...: df2 = pd.DataFrame({'id': [3, 4],
    ...:                     'n_legs': [5, 100],
    ...:                     'animal': ["Brittle stars", "Centipede"]})
    ...: t1 = pa.Table.from_pandas(df1)
    ...: t2 = pa.Table.from_pandas(df2)

In [55]: t1.join(t2, 'id', join_type="right outer").combine_chunks()
Out[55]:
pyarrow.Table
year: int64
id: int64
n_legs: int64
animal: string
----
year: [[2019,null]]
id: [[3,4]]
n_legs: [[5,100]]
animal: [["Brittle stars","Centipede"]]

# and then we apply filter expression with intended mismatch here
In [56]: t1.join(t2, 'id', join_type="right outer", filter_expression=pc.equal(pc.field("n_legs"), 200)).combine_chunks()
Out[56]:
pyarrow.Table
year: int64
id: int64
n_legs: int64
animal: string
----
year: [[null,null]]
id: [[3,4]]
n_legs: [[5,100]]
animal: [["Brittle stars","Centipede"]]

it seems we didn't return empty, instead, we return the right outer? it seems the join type takes higher priority than filter operation for the final result?

btw, it seems fine with inner join type.

In [57]: t1.join(t2, 'id', join_type="inner", filter_expression=pc.equal(pc.field("n_legs"), 200)).combine_chunks()
Out[57]:
pyarrow.Table
id: int64
year: int64
n_legs: int64
animal: string
----
id: []
year: []
n_legs: []
animal: []

this one seems like a bug to me, but I am not sure, @AlenkaF @raulcd could you provide some feedback on these two questions? Thanks!

raulcd · 2025-05-29T08:28:51Z

I am no expert on this area.
I agree with you that the test you shared seems to show unexpected behaviour. I would expect the filter to be correctly applied.
Having said that, I don't think the issue comes from the code you linked in acero/hash_join_node.cc::CollectFilterColumns; from my understanding, that is the expected behavior. It isn't checking whether there are repeated fields in both schemas; it is checking whether the filter field is in both schemas, in order to avoid ambiguous filter expressions. cc @zanmato1984 @pitrou, who have more knowledge of this area and can help validate whether the test points to a possible bug in right outer join.
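
As a rough illustration of the check described above, here is a plain-Python sketch. It is not Arrow's code: the function name is hypothetical, and the "not found in either schema" branch is an assumption added for completeness. It only mirrors the idea that a field referenced by the filter must resolve to exactly one side.

```python
def side_of_filter_field(name, left_names, right_names):
    """Return "left" or "right" for a filter field, or raise if ambiguous.

    Illustrative only; this models the idea of the C++ check, not its code.
    """
    in_left = name in left_names
    in_right = name in right_names
    if in_left and in_right:
        # Same wording as the C++ error quoted in this thread.
        raise ValueError(
            f"FieldRef {name} was found in both left and right schemas")
    if not in_left and not in_right:
        # Assumption: an unknown field is also rejected.
        raise ValueError(f"FieldRef {name} was not found in either schema")
    return "left" if in_left else "right"


left_schema = ["id", "year"]
right_schema = ["id", "n_legs", "animal"]

n_legs_side = side_of_filter_field("n_legs", left_schema, right_schema)

try:
    side_of_filter_field("id", left_schema, right_schema)
    id_is_ambiguous = False
except ValueError:
    id_is_ambiguous = True
```

Here n_legs resolves to the right side, while id exists on both sides and is rejected as ambiguous.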

zanmato1984 · 2025-05-29T22:03:07Z

Thank you @xingyu-long for contributing this!

I'd first address your concern of:

it seems we didn't return empty, instead, we return the right outer? it seems the join type takes higher priority than filter operation for the final result?

btw, it seems fine with inner join type.

Yes, this is expected by SQL semantics. It is also the difference between putting an expression in the ON condition of a JOIN and putting it in the WHERE clause, e.g.,
FROM t1 LEFT JOIN t2 ON t1.value = x and t2.value = y
is not equivalent to
FROM t1 LEFT JOIN t2 ON true WHERE t1.value = x and t2.value = y
(They are equivalent ONLY for inner joins.)
This is quite understandable because otherwise you wouldn't need most of join types except inner :)

Conceptually, all subexpressions in the ON condition contribute equally to determining whether two rows, one from each side, are a "match" (the whole expression evaluates to true) or a "non-match" (the whole expression evaluates to null or false). It's just that in practice most query engines do a hash join, which requires at least one equality condition with columns from both sides; the columns in such conditions are used as the join "keys" (in your case the join key is implicitly specified by the columns with a common name). The rest of the expression is normally treated as a so-called "residual filter" (this is what your PR adds). Now back to the conceptual view: depending on the join type (inner/left outer/right outer/etc.), rows are then processed differently. Take inner and left outer as two examples:

inner join will keep all the columns from both sides for a match, and discard the entire row for a non-match - this is the same as if you do the filter on the table scan first and then apply the join.
left outer join will always keep the left side columns, and keep the right side columns as well for a match, or discard the right side columns (by filling nulls) for a non-match (but this row is still emitted in the join result).
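
The two cases above can be modelled with a naive nested-loop join in plain Python. This is only a conceptual sketch (the function and its arguments are hypothetical, and Acero uses a hash join rather than nested loops); it shows how the residual filter takes part in deciding what counts as a match.

```python
def residual_join(left_rows, right_rows, key, residual, join_type):
    """Naive model of a join whose ON condition is `key equality AND residual`."""
    if join_type not in ("inner", "left outer"):
        raise ValueError(f"unsupported join type in this sketch: {join_type}")
    # Right-side payload columns (everything except the join key).
    right_cols = sorted({c for r in right_rows for c in r if c != key})
    out = []
    for l in left_rows:
        matched = False
        for r in right_rows:
            # A pair is a match only if the keys are equal AND the residual holds.
            if l[key] == r[key] and residual(l, r):
                matched = True
                out.append({**l, **{c: r[c] for c in right_cols}})
        if not matched and join_type == "left outer":
            # Non-match: keep the left row and fill right columns with nulls.
            out.append({**l, **{c: None for c in right_cols}})
    return out


left = [{"id": 1, "year": 2020}, {"id": 3, "year": 2019}]
right = [{"id": 3, "n_legs": 5}]


def no_match(l, r):
    # Residual filter that never holds for this data.
    return r["n_legs"] == 200


inner_rows = residual_join(left, right, "id", no_match, "inner")
left_outer_rows = residual_join(left, right, "id", no_match, "left outer")
```

With a residual filter that never holds, the inner join produces no rows, while the left outer join still emits every left row with null right-side columns.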

for example, let's assume that we have two tables which have some common fields (id and name), and we'd like to join them by id and then filter name with certain pattern. so without exposing this API to python, we probably need to maintain a big intermediate state of temp join and then apply the filter on top of it.

Yes, this is necessary for preserving SQL-like join semantics, as long as you write the filter in the ON condition. Again, the filter support you are adding is the "residual filter" (the subexpressions other than join keys in the ON condition), not a regular "filter".

but if we can apply the filter on both tables first before we joining two tables, it would be more efficient? that's why I'd like to confirm what's the expected behavior for this filter in c++ implementation.

In this case you can just do the filter ahead of join, e.g.,

t2_filtered = t2.filter(pc.equal(pc.field("n_legs"), 200))
t1.join(t2_filtered, 'id', join_type="right outer")

As long as it is what you needed.

~~> 1. it seems filter cannot apply for both side, i.e same field for both table/schema~~

This is an independent problem. Because a join concatenates columns from both sides, the result table may contain columns with the same name. If so, you won't be able to reference such a column afterwards without ambiguity. You can specify output_suffix_for_left/right to append unique identifiers to their column names so that you can disambiguate them.

zanmato1984 · 2025-05-29T22:10:43Z

If my above comment addresses your concern, I'll in turn review the code. Thank you @xingyu-long .

xingyu-long · 2025-05-30T01:25:38Z

If my above comment addresses your concern, I'll in turn review the code. Thank you @xingyu-long .

Thanks @zanmato1984 for your explanation, it makes sense. I should probably mention more details about this usage in the function docstring, then. At the same time, feel free to review the changes, since they just expose to Python what C++ already does.

zanmato1984

Some nits.

python/pyarrow/acero.py

python/pyarrow/table.pxi

python/pyarrow/tests/test_acero.py

python/pyarrow/_acero.pyx

python/pyarrow/tests/test_acero.py

zanmato1984

LGTM.

(I pushed a commit merely changing some line orders.)

zanmato1984 · 2025-05-30T17:22:53Z

I've approved the PR in terms of its functionality. I think we need another +1 from @AlenkaF @raulcd @pitrou in terms of python (or functionality of course) since I'm no python expert.

xingyu-long · 2025-05-31T03:23:42Z

I've approved the PR in terms of its functionality. I think we need another +1 from @AlenkaF @raulcd @pitrou in terms of python (or functionality of course) since I'm no python expert.

Thanks! @zanmato1984 Really appreciated it!

I will wait for other approvals.

raulcd

Do we want to be consistent and call the new argument filter_expression instead of expression as we do on FilterOptions? See docstring there:

    The "filter" operation provides an option to define data filtering
    criteria. It selects rows where the given expression evaluates to true.
    Filters can be written using pyarrow.compute.Expression, and the
    expression must have a return type of boolean.

    Parameters
    ----------
    filter_expression : pyarrow.compute.Expression

@AlenkaF @rok @pitrou what are your thoughts on that?

python/pyarrow/acero.py

python/pyarrow/_acero.pyx

python/pyarrow/table.pxi

AlenkaF · 2025-06-02T12:10:11Z

Do we want to be consistent and call the new argument filter_expression instead of expression as we do on FilterOptions?

I think that would make sense, yes.

xingyu-long · 2025-06-02T16:42:36Z

Do we want to be consistent and call the new argument filter_expression instead of expression as we do on FilterOptions? See docstring there:

Just updated the code to use the suggested name, filter_expression. Please take a look when you have time @raulcd @AlenkaF, thanks!

raulcd

Thanks! @zanmato1984 is the expert on the C++ functionality; this looks good to me on the Python side.
@AlenkaF @rok do you want to check?

AlenkaF

I took some time to run the code locally to understand the hash join itself and the changes. Looks great and I have no other comments or questions =)

I need to check if the PR is connected to the issue opened, then I plan to merge this. Thank you again for the contribution @xingyu-long and all the reviews!

github-actions · 2025-06-03T09:08:36Z

⚠️ GitHub issue #46572 has been automatically assigned in GitHub to PR creator.

AlenkaF · 2025-06-03T09:10:49Z

@github-actions crossbow submit -g python

github-actions · 2025-06-03T09:13:27Z

Revision: 889c98b

Submitted crossbow builds: ursacomputing/crossbow @ actions-e8f44d6eca

Task	Status
example-python-minimal-build-fedora-conda
example-python-minimal-build-ubuntu-venv
test-conda-python-3.10
test-conda-python-3.10-hdfs-2.9.2
test-conda-python-3.10-hdfs-3.2.1
test-conda-python-3.10-pandas-latest-numpy-latest
test-conda-python-3.11
test-conda-python-3.11-dask-latest
test-conda-python-3.11-dask-upstream_devel
test-conda-python-3.11-hypothesis
test-conda-python-3.11-pandas-latest-numpy-1.26
test-conda-python-3.11-pandas-latest-numpy-latest
test-conda-python-3.11-pandas-nightly-numpy-nightly
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly
test-conda-python-3.11-spark-master
test-conda-python-3.12
test-conda-python-3.12-cpython-debug
test-conda-python-3.13
test-conda-python-3.9
test-conda-python-3.9-pandas-1.1.3-numpy-1.19.5
test-conda-python-emscripten
test-cuda-python-ubuntu-22.04-cuda-11.7.1
test-debian-12-python-3-amd64
test-debian-12-python-3-i386
test-fedora-39-python-3
test-ubuntu-22.04-python-3
test-ubuntu-22.04-python-313-freethreading
test-ubuntu-24.04-python-3

raulcd · 2025-06-03T09:55:51Z

CI failures for example-python-minimal-* are unrelated and due to:

[Python] Jobs fail if Pyarrow version is not correctly generated due to missing remote dev tags #44803

xingyu-long · 2025-06-03T15:36:22Z

Thank you all! @zanmato1984 @raulcd @AlenkaF

raulcd · 2025-06-03T15:58:50Z

Thank you for the contribution!

conbench-apache-arrow · 2025-06-04T05:36:08Z

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 94e3b3e.

There were 73 benchmark results with an error:

Commit Run on arm64-t4g-2xlarge-linux at 2025-06-03 14:18:42Z
- tpch (R) with engine=arrow, format=parquet, language=R, memory_map=False, query_id=TPCH-07, scale_factor=1
- tpch (R) with engine=arrow, format=native, language=R, memory_map=False, query_id=TPCH-09, scale_factor=1
and 71 more (see the report linked below)

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them.

AlenkaF · 2025-06-04T09:10:43Z

I have seen the same Conbench errors on other PRs (#46638 (comment)). It is getting a bit annoying, will try to take time to look into it =)

[Python] expose filter option to python for join

c1c6764

xingyu-long requested review from AlenkaF, raulcd and rok as code owners May 23, 2025 03:48

github-actions bot added the Component: Python and awaiting review labels May 23, 2025

xingyu-long mentioned this pull request May 23, 2025

[Python] Support filter option for hash join #46572

Closed

xingyu-long changed the title ~~[draft][Python] expose filter option to python for join~~ [draft] GH-46572: [python] expose filter option to python for join May 23, 2025

AlenkaF marked this pull request as draft May 26, 2025 07:08

AlenkaF changed the title ~~[draft] GH-46572: [python] expose filter option to python for join~~ GH-46572: [python] expose filter option to python for join May 26, 2025

AlenkaF changed the title ~~GH-46572: [python] expose filter option to python for join~~ GH-46572: [Python] expose filter option to python for join May 26, 2025

raulcd reviewed May 26, 2025


github-actions bot added awaiting changes and removed awaiting review labels May 26, 2025

[Python] update filter_expression as end of arguments

875c952

github-actions bot added awaiting change review and removed awaiting changes labels May 28, 2025

xingyu-long marked this pull request as ready for review May 28, 2025 06:36

zanmato1984 reviewed May 30, 2025


[Python] address comments: rename and add more tests

f8038b0

zanmato1984 reviewed May 30, 2025


python/pyarrow/tests/test_acero.py (outdated, resolved)

xingyu-long and others added 2 commits May 30, 2025 08:04

use pc.scalar(True/False) to always True or False case

17a876e

Adjust some line order

37fd4fc

zanmato1984 approved these changes May 30, 2025


raulcd reviewed Jun 2, 2025


python/pyarrow/acero.py (outdated, resolved)

python/pyarrow/_acero.pyx (outdated, resolved)

python/pyarrow/table.pxi (outdated, resolved)

github-actions bot added awaiting changes and removed awaiting change review labels Jun 2, 2025

xingyu-long added 2 commits June 2, 2025 09:40

update new argument as filter_expression

a83a4c3

update references in test

889c98b

github-actions bot added awaiting change review and removed awaiting changes labels Jun 2, 2025

raulcd approved these changes Jun 3, 2025


github-actions bot added awaiting merge and removed awaiting change review labels Jun 3, 2025

AlenkaF approved these changes Jun 3, 2025


AlenkaF linked an issue Jun 3, 2025 that may be closed by this pull request

[Python] Support filter option for hash join #46572

Closed

AlenkaF merged commit 94e3b3e into apache:main Jun 3, 2025
17 of 18 checks passed

AlenkaF removed the awaiting merge label Jun 3, 2025
