Tracking issue: push down computation in distributed query #1108

Describe This Problem

Currently, we support a rough form of distributed SQL query by hooking in at the table scan level. As a result, actual computation such as aggregation can't be pushed down.

So I plan to refactor it and support distributed query at the plan level, so that more computation can be pushed down.
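To make the motivation concrete, here is a minimal sketch of what pushing an aggregation down could look like, assuming an AVG is split into a partial step on each executor and a final merge step on the scheduler. The function names `partial_avg` and `final_avg` and the sample data are hypothetical and not part of the codebase.

```python
from typing import List, Sequence, Tuple

# Hypothetical illustration: names and data are assumptions, not project APIs.

def partial_avg(rows: Sequence[float]) -> Tuple[float, int]:
    """Runs where the sub table lives: reduce the rows to (sum, count)."""
    return (sum(rows), len(rows))

def final_avg(partials: Sequence[Tuple[float, int]]) -> float:
    """Runs on the node that started the query: merge the partial states."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Two sub tables of one partitioned table.
partitions: List[List[float]] = [[1.0, 2.0, 3.0], [4.0, 5.0]]

# With push down, only one (sum, count) pair per sub table crosses the network,
# instead of every scanned row.
partials = [partial_avg(p) for p in partitions]
result = final_avg(partials)
```

With hooking only at the table scan level, all five rows would have to be shipped back before any aggregation could start.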

Proposal

1. Background
The existing implementations can be divided into two approaches:

Generate an explicit distributed logical plan, and generate the distributed physical plan afterwards, like Doris.
No explicit distributed logical plan (can't do it because no schema info?), and generate the distributed physical plan directly, like TiDB and DataFusion.

As I see it, they are almost the same. The clearer way is to have an explicit distributed logical plan, but that is a matter of code organization.

The real question is whether we should depend on DataFusion to do this. If we do it ourselves, it may be more controllable, but we may need to design the complete physical plan generation process.

I think we should try to reuse the logic in DataFusion first.

2. General
The work can be broken down as follows:

Generate the distributed physical plan from the original one; I think we can do this by referring to TiDB.
Support querying by physical plan in RemoteEngine.

3. Two roles of node in the proposal
My proposal is designed as follows:

Scheduler node (responsible for invoking the query, dispatching sub queries to executor nodes, and computing the final result).
Executor node (where the sub tables are located, responsible for computing the sub results).

4. Process

Scheduler node generates the initial physical plan of the partitioned table. In this initial physical plan, the TableScan node is just a placeholder (it can't actually be executed) that carries some information for generating the executable plan later, so I name it UnresolvePartitionedScan.
Scheduler node traverses the initial physical plan, finds the sub plans that can be pushed down, and generates the sub plans for remote execution (using the information in UnresolvePartitionedScan). Like UnresolvePartitionedScan, the sub plans can't be executed until they have been sent to the executor nodes and rewritten there, so I name them UnresolveSubScans.
Scheduler node sends the sub plans to the executor nodes and waits for the results, and UnresolveSubScan is converted to ResolvePartitionedScan now.
Executor nodes receive the sub plans and convert the UnresolveSubScan to ResolveSubScan using the carried information and the local catalog.
Executor nodes execute the converted sub plans and return the results.
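
The steps above can be sketched as a small in-process simulation. This is only one possible reading of the proposal, not the actual implementation: the node classes reuse the names from the text, while `Filter`, `build_sub_plans`, `resolve_on_executor`, `execute`, `run_query`, and the single shared `catalog` dict (standing in for each executor's local catalog and for the remote calls) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical plan nodes; only the scan names come from the proposal.
@dataclass(frozen=True)
class UnresolvePartitionedScan:      # placeholder, not executable
    table: str
    sub_tables: Tuple[str, ...]

@dataclass(frozen=True)
class UnresolveSubScan:              # placeholder inside a sub plan
    sub_table: str

@dataclass(frozen=True)
class ResolveSubScan:                # executable on an executor node
    sub_table: str
    rows: Tuple[int, ...]

@dataclass(frozen=True)
class ResolvePartitionedScan:        # holds one sub plan per sub table
    sub_plans: Tuple[object, ...]

@dataclass(frozen=True)
class Filter:                        # stands in for any pushable operator
    min_value: int
    child: object

def build_sub_plans(plan):
    """Scheduler: find what can be pushed down and generate the sub plans."""
    if isinstance(plan, UnresolvePartitionedScan):
        return ResolvePartitionedScan(
            tuple(UnresolveSubScan(t) for t in plan.sub_tables))
    if isinstance(plan, Filter):
        child = build_sub_plans(plan.child)
        if isinstance(child, ResolvePartitionedScan):
            # Push the filter into every sub plan.
            return ResolvePartitionedScan(
                tuple(Filter(plan.min_value, sp) for sp in child.sub_plans))
        return Filter(plan.min_value, child)
    raise TypeError(f"unsupported node: {plan!r}")

def resolve_on_executor(sub_plan, catalog: Dict[str, List[int]]):
    """Executor: rewrite UnresolveSubScan using the local catalog."""
    if isinstance(sub_plan, UnresolveSubScan):
        rows = tuple(catalog[sub_plan.sub_table])
        return ResolveSubScan(sub_plan.sub_table, rows)
    if isinstance(sub_plan, Filter):
        return Filter(sub_plan.min_value,
                      resolve_on_executor(sub_plan.child, catalog))
    raise TypeError(f"unsupported node: {sub_plan!r}")

def execute(plan) -> List[int]:
    """Executor: run a fully resolved sub plan."""
    if isinstance(plan, ResolveSubScan):
        return list(plan.rows)
    if isinstance(plan, Filter):
        return [v for v in execute(plan.child) if v >= plan.min_value]
    raise TypeError(f"not executable: {plan!r}")

def run_query(initial_plan, catalog: Dict[str, List[int]]) -> List[int]:
    """Scheduler: dispatch the sub plans and concatenate the sub results."""
    resolved = build_sub_plans(initial_plan)
    if not isinstance(resolved, ResolvePartitionedScan):
        raise TypeError("this sketch assumes the whole plan is pushed down")
    results: List[int] = []
    for sub_plan in resolved.sub_plans:   # a remote call in a real system
        results.extend(execute(resolve_on_executor(sub_plan, catalog)))
    return results

catalog = {"t_0": [1, 5, 9], "t_1": [2, 7]}
initial = Filter(5, UnresolvePartitionedScan("t", ("t_0", "t_1")))
rows = run_query(initial, catalog)   # [5, 9, 7]
```

The point of the two-stage naming is visible here: nothing built by `build_sub_plans` can run until `resolve_on_executor` has replaced the placeholder scan with one backed by local data.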

Additional Context

No response
