Tracking issue: push down computation in distributed query · Issue #1108 · apache/horaedb · GitHub
Tracking issue: push down computation in distributed query #1108

@Rachelint

Describe This Problem

Currently, we support a rough form of distributed SQL query by hooking in at the table scan level, which means that actual computation such as aggregation can't be pushed down.

So I plan to refactor it and support distributed queries at the plan level, so that more computation can be pushed down.
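
To make the motivation concrete, here is a minimal sketch of what pushing down an aggregation could look like, assuming an AVG split into a partial step (run where each sub table lives) and a final merge (run on the node that started the query). The names `PartialAvg`, `partial_avg`, and `final_avg` are hypothetical and are not HoraeDB or DataFusion APIs.

```rust
// Hypothetical sketch: none of these names come from HoraeDB or DataFusion.

/// Partial state for AVG, small enough to ship between nodes.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PartialAvg {
    sum: f64,
    count: u64,
}

/// Runs where the sub table lives: reduces local rows to one partial state.
fn partial_avg(rows: &[f64]) -> PartialAvg {
    PartialAvg {
        sum: rows.iter().sum(),
        count: rows.len() as u64,
    }
}

/// Runs on the node that started the query: merges the partial states.
/// Returns `None` when no rows were seen at all.
fn final_avg(partials: &[PartialAvg]) -> Option<f64> {
    let sum: f64 = partials.iter().map(|p| p.sum).sum();
    let count: u64 = partials.iter().map(|p| p.count).sum();
    if count == 0 {
        None
    } else {
        Some(sum / count as f64)
    }
}

fn main() {
    // Two sub tables, each reduced to a single partial state before transfer.
    let sub_tables = vec![vec![1.0, 2.0, 3.0], vec![4.0, 5.0]];
    let partials: Vec<PartialAvg> = sub_tables.iter().map(|r| partial_avg(r)).collect();
    println!("{:?}", final_avg(&partials));
}
```

With scan-level hooking, every row of every sub table would have to be transferred before aggregating; in this sketch only one `PartialAvg` per sub table crosses the network.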

Proposal

1. Background
Existing implementations fall into two approaches:

  • Generate an explicit distributed logical plan, then generate the distributed physical plan from it, like Doris.
  • No explicit distributed logical plan (can't do it because there is no schema info?); generate the distributed physical plan directly, like TiDB and DataFusion.

As I see it, the two are almost the same. Having an explicit distributed logical plan is the clearer way, but that is mostly a matter of code organization.

The real question is whether we should depend on DataFusion to do this. If we do it ourselves, it may be more controllable, but we may need to design the complete physical plan generation process.

I think we should try to reuse the logic in DataFusion first.

2. General
The work can be broken down as follows:

  • Generate the distributed physical plan from the original one; I think we can do it by referring to TiDB.
  • Support querying by physical plan in RemoteEngine.

3. Two roles of nodes in the proposal
My proposal defines two roles:

  • Scheduler node (responsible for invoking the query, dispatching sub queries to executor nodes, and computing the final result).
  • Executor node (where the sub tables reside; responsible for computing the sub results).
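
As a rough illustration of the dispatching responsibility, the scheduler needs to know which executor node hosts each sub table. The sketch below groups sub tables by their hosting node; `SubTable`, `group_by_executor`, and the node names are hypothetical and only stand in for whatever routing information the real cluster metadata provides.

```rust
use std::collections::BTreeMap;

/// Hypothetical routing entry: a sub table and the node that hosts it.
struct SubTable {
    name: &'static str,
    node: &'static str,
}

/// Scheduler side: decide which sub queries go to which executor node.
/// A BTreeMap keeps the nodes in a stable order for printing.
fn group_by_executor(sub_tables: &[SubTable]) -> BTreeMap<&'static str, Vec<&'static str>> {
    let mut groups = BTreeMap::new();
    for t in sub_tables {
        groups.entry(t.node).or_insert_with(Vec::new).push(t.name);
    }
    groups
}

fn main() {
    let sub_tables = [
        SubTable { name: "t_0", node: "node-a" },
        SubTable { name: "t_1", node: "node-b" },
        SubTable { name: "t_2", node: "node-a" },
    ];
    // Each entry is one executor node and the sub tables it should compute.
    for (node, tables) in group_by_executor(&sub_tables) {
        println!("{} <- {:?}", node, tables);
    }
}
```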

4. Process

  • The scheduler node generates the initial physical plan of the partitioned table. In this initial physical plan, the TableScan node is just a placeholder (it can't actually be executed) carrying some information for generating the executable plan later, so I name it UnresolvePartitionedScan.
  • The scheduler node traverses the initial physical plan, finds the sub plans that can be pushed down, and generates the sub plans for remote execution (using the information in UnresolvePartitionedScan). Like UnresolvePartitionedScan, these sub plans can't be executed until they have been sent to the executor nodes and rewritten there, so I name them UnresolveSubScans.
  • The scheduler node sends the sub plans to the executor nodes and waits for the results; UnresolveSubScan is converted to ResolvePartitionedScan at this point.
  • The executor nodes receive the sub plans and convert each UnresolveSubScan to a ResolveSubScan using the carried information and the local catalog.
  • The executor nodes execute the converted sub plans and return the results.
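
The rewriting steps above can be sketched with a toy plan tree. This is only one possible reading of the process, not the actual implementation: the `Plan` enum, the `Filter` and `Aggregate` operators, the rule that only `Filter` is pushable, and the slice standing in for the local catalog are all assumptions made for illustration. Only the four scan names come from the proposal.

```rust
// Toy plan tree; everything except the four scan names is hypothetical.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// Scheduler-side placeholder; cannot be executed.
    UnresolvePartitionedScan { sub_tables: Vec<String> },
    /// Placeholder inside a sub plan; cannot be executed until resolved remotely.
    UnresolveSubScan { sub_table: String },
    /// Scheduler-side scan holding one pushed-down sub plan per sub table.
    ResolvePartitionedScan { remote_plans: Vec<Plan> },
    /// Executor-side scan bound to a table found in the local catalog.
    ResolveSubScan { sub_table: String },
    /// Assumed to be pushable in this sketch.
    Filter { predicate: String, input: Box<Plan> },
    /// Assumed to stay on the scheduler in this sketch.
    Aggregate { func: String, input: Box<Plan> },
}

/// Returns one sub plan per sub table if the whole `plan` can run remotely.
fn try_push_down(plan: &Plan) -> Option<Vec<Plan>> {
    match plan {
        Plan::UnresolvePartitionedScan { sub_tables } => Some(
            sub_tables
                .iter()
                .map(|s| Plan::UnresolveSubScan { sub_table: s.clone() })
                .collect(),
        ),
        Plan::Filter { predicate, input } => {
            let subs = try_push_down(input)?;
            Some(
                subs.into_iter()
                    .map(|s| Plan::Filter { predicate: predicate.clone(), input: Box::new(s) })
                    .collect(),
            )
        }
        _ => None,
    }
}

/// Scheduler side: replace the largest pushable subtree with a resolved scan.
fn resolve_on_scheduler(plan: Plan) -> Plan {
    match plan {
        Plan::Aggregate { func, input } => Plan::Aggregate {
            func,
            input: Box::new(resolve_on_scheduler(*input)),
        },
        other => match try_push_down(&other) {
            Some(remote_plans) => Plan::ResolvePartitionedScan { remote_plans },
            None => other,
        },
    }
}

/// Executor side: bind the placeholder to a table in the local catalog.
fn resolve_on_executor(plan: Plan, local_catalog: &[&str]) -> Result<Plan, String> {
    match plan {
        Plan::UnresolveSubScan { sub_table } => {
            if local_catalog.contains(&sub_table.as_str()) {
                Ok(Plan::ResolveSubScan { sub_table })
            } else {
                Err(format!("sub table {} not found in local catalog", sub_table))
            }
        }
        Plan::Filter { predicate, input } => Ok(Plan::Filter {
            predicate,
            input: Box::new(resolve_on_executor(*input, local_catalog)?),
        }),
        other => Ok(other),
    }
}

fn main() {
    let initial = Plan::Aggregate {
        func: "count".to_string(),
        input: Box::new(Plan::Filter {
            predicate: "v > 1".to_string(),
            input: Box::new(Plan::UnresolvePartitionedScan {
                sub_tables: vec!["t_0".to_string(), "t_1".to_string()],
            }),
        }),
    };
    println!("{:#?}", resolve_on_scheduler(initial));
}
```

In this sketch the filter travels with each sub plan while the aggregate stays on the scheduler; a fuller version would also split the aggregate into a pushed-down partial step and a final step.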

Additional Context

No response

Labels

A-query-engine (Area: Query engine), feature (New feature or request), tracking issue (Issue tracks progress for something)
