Transform scalar correlated subqueries in Where to DependentJoin #16174

irenjj · 2025-05-24T03:42:17Z

Which issue does this PR close?

Closes Transform scalar correlated subqueries in Where to DependentJoin #16172

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

duongcongtoai · 2025-05-24T05:31:09Z

datafusion/optimizer/src/create_dependent_join.rs

+        .dependent_join_on(
+            subquery_plan,
+            join_type,
+            vec![Expr::Literal(ScalarValue::Boolean(Some(true)))],


nit: we have shorter syntax:

use datafusion_expr::lit;
let some_exprs = vec![lit(true)];

duongcongtoai · 2025-05-24T05:37:47Z

datafusion/optimizer/src/create_dependent_join.rs

+        _config: &dyn OptimizerConfig,
+    ) -> Result<Transformed<LogicalPlan>> {
+        if let LogicalPlan::Filter(ref filter) = plan {
+            match &filter.predicate {


Here are more cases I can think of:

A predicate can be a complex expression, such as

where column1=(scalar_subquery) or column2=(exists_subquery)

In this case, two nested dependent joins will be generated.

The scalar subquery expr is sometimes not a direct child of the predicate, for example

where column1 > 1 + (subquery)

We can have two subqueries in the same binary expr:

where (subquery1) > (subquery2) + 1
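The three cases above share one requirement: the rewrite has to walk the whole predicate instead of matching only its top-level shape. A minimal sketch of such a walk follows. It uses a toy `Expr` enum rather than DataFusion's, and `collect_subqueries` and `binary` are hypothetical helpers introduced only for illustration.

```rust
// Toy expression tree; NOT DataFusion's `Expr`. All names are illustrative.
#[derive(Debug, Clone, PartialEq)]
#[allow(dead_code)]
enum Expr {
    Column(String),
    Literal(i64),
    // The operator is kept as a string for brevity: "=", ">", "+", "OR", ...
    Binary { left: Box<Expr>, op: &'static str, right: Box<Expr> },
    // A scalar subquery, identified by a label instead of a real plan.
    ScalarSubquery(String),
    // An EXISTS subquery, identified the same way.
    Exists(String),
}

/// Walks the whole predicate and appends the label of every subquery it
/// contains, in left-to-right order, no matter how deeply it is nested.
fn collect_subqueries(expr: &Expr, out: &mut Vec<String>) {
    match expr {
        Expr::ScalarSubquery(name) | Expr::Exists(name) => out.push(name.clone()),
        Expr::Binary { left, right, .. } => {
            collect_subqueries(left, out);
            collect_subqueries(right, out);
        }
        Expr::Column(_) | Expr::Literal(_) => {}
    }
}

/// Small constructor that keeps example predicates readable.
fn binary(left: Expr, op: &'static str, right: Expr) -> Expr {
    Expr::Binary { left: Box::new(left), op, right: Box::new(right) }
}
```

Building `column1 > 1 + (subquery)` with `binary` and passing it to `collect_subqueries` finds the subquery even though it sits two levels below the comparison. Each label found this way would correspond to one dependent join in the rewritten plan.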

duongcongtoai · 2025-05-24T05:41:26Z

datafusion/optimizer/src/create_dependent_join.rs

+            subquery_plan,
+            join_type,
+            vec![Expr::Literal(ScalarValue::Boolean(Some(true)))],
+            subquery.outer_ref_columns.clone(),


If the subquery has a nested subquery underneath, I believe this function won't be able to return the outer_ref_columns from the lower level.
For example:

where column1=(
  select count(*) from inner_table_lv1 lv1
  where lv1.column2=lv0.column2
  and exists (
    select * from inner_table_lv2 lv2
    where lv2.column1=lv1.column1 and lv2.column2=lv0.column3
  )
)

In this case, the call to subquery.outer_ref_columns will only return lv0.column2, while the general framework needs to be aware of lv0.column3 as well.
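To make the concern concrete, here is a small sketch of collecting every outer reference that escapes a subquery, including references that sit in deeper nested subqueries. `OuterRef`, `Subquery`, and `escaping_refs` are toy definitions assumed for illustration; they do not mirror DataFusion's `Subquery` or its `outer_ref_columns` field.

```rust
/// A column reference tagged with the nesting level of the query block that
/// owns the column (0 = outermost query). Illustrative type only.
#[derive(Debug, Clone, PartialEq)]
struct OuterRef {
    level: usize,
    column: &'static str,
}

/// Toy subquery: its own nesting level, the outer columns referenced
/// directly inside its block, and any subqueries nested within it.
#[derive(Debug)]
struct Subquery {
    level: usize,
    outer_refs: Vec<OuterRef>,
    nested: Vec<Subquery>,
}

/// Returns every reference that escapes `sq`, i.e. refers to a block
/// shallower than `sq` itself, looking through all nested subqueries.
fn escaping_refs(sq: &Subquery) -> Vec<OuterRef> {
    let mut out = Vec::new();
    collect(sq, sq.level, &mut out);
    out
}

fn collect(sq: &Subquery, boundary: usize, out: &mut Vec<OuterRef>) {
    for r in &sq.outer_refs {
        // Keep only references that cross the boundary; skip duplicates.
        if r.level < boundary && !out.contains(r) {
            out.push(r.clone());
        }
    }
    for inner in &sq.nested {
        collect(inner, boundary, out);
    }
}
```

Modeling the query above as an lv1 subquery (referencing lv0.column2) that contains an lv2 subquery (referencing lv1.column1 and lv0.column3), `escaping_refs` on the lv1 subquery yields both lv0.column2 and lv0.column3. The lv1.column1 reference is dropped because it does not cross the lv1 boundary.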

DataFusion doesn't support cases where depth > 1 at the planner stage. The reason is that each time parse_subquery is called, it uses the outer_query_schema, which is the schema from the previous layer of the query:

pub(super) fn parse_scalar_subquery(
    &self,
    subquery: Query,
    input_schema: &DFSchema,
    planner_context: &mut PlannerContext,
) -> Result<Expr> {
    let old_outer_query_schema =
        planner_context.set_outer_query_schema(Some(input_schema.clone().into()));
    ...

In #16060, I attempted to layer the schemas of query blocks at different depths within the PlannerContext, record the depth of each subquery's own layer within the Subquery, and then pass the PlannerContext into the optimizer. What are your thoughts on this approach? I'd welcome a discussion of your ideas. For multi-layer cases, a more detailed design and discussion may be needed. For now, I'm more inclined to handle simple use cases between adjacent layers first.
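The layering idea can be pictured as a stack of schemas, one per query-block depth, searched from the innermost block outward. The sketch below is an assumption-laden illustration: `LayeredContext` and its methods are hypothetical names, schemas are reduced to lists of column names, and none of this is taken from the code in #16060 or from DataFusion's `PlannerContext`.

```rust
/// Hypothetical stand-in for a planner context that keeps one schema per
/// query-block depth. Schemas are simplified to lists of column names.
struct LayeredContext {
    // schemas[0] is the outermost block; the last entry is the current block.
    schemas: Vec<Vec<&'static str>>,
}

impl LayeredContext {
    fn new() -> Self {
        Self { schemas: Vec::new() }
    }

    /// Pushes the schema of a newly entered query block and returns its depth.
    fn enter_block(&mut self, columns: Vec<&'static str>) -> usize {
        self.schemas.push(columns);
        self.schemas.len() - 1
    }

    /// Pops the schema of the block that has just been planned.
    fn leave_block(&mut self) {
        self.schemas.pop();
    }

    /// Resolves a column by searching from the innermost block outward and
    /// returns the depth of the block that defines it, if any.
    fn resolve(&self, column: &str) -> Option<usize> {
        self.schemas
            .iter()
            .rposition(|cols| cols.iter().any(|c| *c == column))
    }
}
```

With blocks entered for lv0, lv1, and lv2 in that order, resolving lv0.column3 from inside the lv2 block returns depth 0, which is the information a single `outer_query_schema` from the previous layer cannot provide.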

I wonder whether it would be simpler to make the decorrelation optimizer aware of the depth and let it handle the recursion itself 🤔

Since there are multiple optimizer rules, I'm wondering whether the depth could change when other, higher-priority rules rewrite the plan. 🤔

Regarding "Since there are multiple optimizer rules": in the final stage of this epic, we only let one optimizer handle the decorrelation, right?

Also, in the middle of the implementation, even if we maintain multiple decorrelating rules, if an existing rule such as DecorrelatePredicateSubquery or ScalarSubqueryToJoin detects any depth > 1, it will back off and leave the whole query untouched.

#16016

I also implemented something like this, but inside an optimizer. Many details still need to be added, but it can already detect the correlated columns (including the ones with depth > 1), the correlated exprs, and the depth of the dependent join node.

Thanks @duongcongtoai, I've seen your PR; it's much more comprehensive than mine.

Maybe we could implement an initial version first, then list some pending work as tracking issues? I'm actually quite eager to contribute and help out as well.

Yep, I'll try to wrap up some basic use cases and ask for review soon.

logan-keede · 2025-05-24T14:04:20Z

datafusion/optimizer/src/eliminate_cross_join.rs

@@ -351,6 +353,8 @@ fn find_inner_join(
        join_type: JoinType::Inner,
        join_constraint: JoinConstraint::On,
        null_equals_null: false,
+        dependent_join: false,


Can't we use something like JoinType::DependentJoin instead of a boolean to distinguish it?

Using a bool is a little strange, so I left a TODO comment: maybe it's better to add a new logical plan node, DependentJoin.
But if we mark a dependent join with JoinType::DependentJoin, how can we know the real JoinType?

I thought DependentJoin's actual type was to be decided later by the decorrelate optimizer, hence the suggestion, though I am not sure anymore.
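The trade-off in this exchange can be sketched with toy types. The example below shows the "new logical plan node" option mentioned in the TODO: the dependent-ness is carried by a dedicated variant, so the join type field remains available for the real join semantics. `Plan`, `JoinType`, `join_type_of`, and `is_dependent` are illustrative assumptions here, not DataFusion's definitions and not the design this PR settled on.

```rust
// Toy types for illustration only; these are NOT DataFusion's definitions.
#[derive(Debug, Clone, Copy, PartialEq)]
#[allow(dead_code)]
enum JoinType {
    Inner,
    Left,
    LeftSemi,
    LeftAnti,
}

#[derive(Debug, Clone, PartialEq)]
#[allow(dead_code)]
enum Plan {
    Scan(&'static str),
    Join { left: Box<Plan>, right: Box<Plan>, join_type: JoinType },
    // Dedicated node: "dependent" is expressed by the variant itself, so
    // `join_type` stays free to hold the real join semantics.
    DependentJoin {
        left: Box<Plan>,
        right: Box<Plan>,
        join_type: JoinType,
        correlated_columns: Vec<&'static str>,
    },
}

/// Returns the join type of any join-like node, dependent or not.
fn join_type_of(plan: &Plan) -> Option<JoinType> {
    match plan {
        Plan::Join { join_type, .. } | Plan::DependentJoin { join_type, .. } => {
            Some(*join_type)
        }
        Plan::Scan(_) => None,
    }
}

/// True only for the dedicated dependent-join node.
fn is_dependent(plan: &Plan) -> bool {
    matches!(plan, Plan::DependentJoin { .. })
}
```

A JoinType::DependentJoin variant would answer only the `is_dependent` question, and a boolean flag on the ordinary join forces every existing construction site to set it (as in the eliminate_cross_join.rs hunk above). A separate node answers both questions, at the cost of teaching every plan visitor about one more variant.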

Transform scalar correlated subqueries in Where to DependentJoin

f75bf93

github-actions bot added the logical-expr (Logical plan and expressions), optimizer (Optimizer rules), and substrait (Changes to the substrait crate) labels on May 24, 2025

irenjj mentioned this pull request May 24, 2025

General framework to decorrelate the subqueries #5492

Open

irenjj added 2 commits May 24, 2025 13:26

fix clippy

1e9ac37

update test

1ca277a

duongcongtoai reviewed May 24, 2025

logan-keede reviewed May 24, 2025

logan-keede mentioned this pull request May 25, 2025

feat: rewrite subquery into dependent join logical plan #16016

Open
