### Distributed Table Concepts
Distributed tables are used in systems like Azure Synapse, Snowflake, or distributed
databases to spread data across multiple nodes for better performance and scalability.
Here's an explanation of the common distribution methods:
---
#### **a. Hash Distribution**
- **How It Works**: Rows are assigned to nodes based on a hash function applied to a
specific column (distribution key).
- **Best For**: Large tables with predictable query patterns where joins or aggregations
happen on the same column.
- **Advantages**:
- Minimizes data movement during queries (better performance).
- Ensures even distribution if the hash key is chosen well.
- **Example**:
- If `CustomerID` is the hash key, all rows with the same `CustomerID` are stored on the
same node.
- **Use Case**:
- Joining customer orders to customer data using `CustomerID`.
---
#### **b. Round-Robin Distribution**
- **How It Works**: Rows are distributed evenly across all nodes in a circular fashion
without considering the data's content.
- **Best For**: Simple bulk inserts or small tables without specific join requirements.
- **Advantages**:
- Very easy to implement.
- Good for scenarios with minimal joins or data dependencies.
- **Drawback**:
- May require shuffling data during joins or aggregations, which can slow queries.
- **Example**:
- Rows 1, 2, 3, 4 are distributed to nodes A, B, C, D, then repeats.
---
#### **c. Replicated Distribution**
- **How It Works**: The entire table is copied to all nodes in the system.
- **Best For**: Small tables that are frequently joined with larger tables.
- **Advantages**:
- Eliminates data movement during joins.
- Ideal for dimension tables (e.g., product categories or regions).
- **Drawbacks**:
- Storage overhead since the table is duplicated on every node.
- Not suitable for large tables.
- **Example**:
- A `Region` table with 10 rows is replicated across all nodes for fast joins with a `Sales`
table.
---
### Key Differences
| **Feature** | **Hash Distribution** | **Round-Robin Distribution** |
**Replicated Distribution** |
|----------------------|----------------------------------|------------------------------------|----------------------
--------------|
| **Distribution Logic** | Based on a column (key). | Equal distribution (no key). |
Entire table copied to all nodes. |
| **Performance** | Best for joins on the hash key. | Good for bulk loads. | Best for
small tables. |
| **Storage** | Even across nodes. | Even across nodes. | Duplicated on all
nodes. |
| **Data Movement** | Minimal if hash key matches. | Higher during joins. | None
for joins. |
---
### Choosing the Right Distribution
- Use **Hash** for large tables where query performance depends on specific columns.
- Use **Round-Robin** for staging or intermediate tables that don’t involve complex joins.
- Use **Replicated** for small, frequently joined tables to eliminate data movement.