Eshant Garg
Data Engineer, Architect, Advisor
eshant.garg@gmail.com
Introduction
MPP or Massive Parallel Processing
Storage & Data Distribution (Hash, Round-robin, Replicate)
Data types and Table types (Columstore, Heap, Clustered B-tree index)
Partitioning and Distribution key
Applications in Dimensional modeling
Demo – Table Analysis before Migration to Cloud
Azure Synapse MPP Architecture
DWU Loading Ran
3 Tables Report
100 15 20
500 3 4
Source: Microsoft
Azure Storage and Distribution
SQL DW charges separately for storage consumption
A distribution is the basic unit of storage and processing for parallel
queries
Rows are stored across 60 distributions which are run in parallel
Each compute node manages one or more of the 60 distribution
Sharding Patterns
Replicated Tables
• Caches a full copy on each compute node.
• Used for small tables
CREATE TABLE [dbo].[BusinessHierarchies](
[BookId] [nvarchar](250) ,
[Division] [nvarchar](100) ,
[Cluster] [nvarchar](100) ,
[Desk] [nvarchar](100) ,
[Book] [nvarchar](100) ,
[Volcker] [nvarchar](100) ,
[Region] [nvarchar](100)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = REPLICATE
)
;
Source: Microsoft
Round Robin tables
CREATE TABLE [dbo].[Dates](
[Date] [datetime2](3) ,
[DateKey] [decimal](38, 0) ,
..
..
[WeekDay] [nvarchar](100) ,
[Day Of Month] [decimal](38, 0)
)
WITH
(
• Generally use to load staging tables CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = ROUND_ROBIN
• Distribute data evenly across the table without )
;
additional optimization
• Joins are slow, because it requires to reshuffle data
• Default distribution type
Source: Microsoft
Hash Distribution Tables
• Highest performance for large tables
• Each row belong to one particular distribution
• It is used mostly for larger tables
Source: Microsoft
Hash Distribution Tables
Record Product Store
1 Soccer New York
2 Soccer Los Angeles
3 Football Phoenix
Hash Distribution Tables
• Highest performance for large tables
• Each row belong to one particular
distribution
• It is used mostly for larger tables
CREATE TABLE [dbo].[EquityTimeSeriesData](
[Date] [varchar](30) ,
[BookId] [decimal](38, 0) ,
[P&L] [decimal](31, 7) ,
[VaRLower] [decimal](31, 7)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = HASH([P&L])
)
;
Source: Microsoft
Avoid Data Skew
Even Distribution
Determines the method in which Azure SQL Data Warehouse spreads the data
across multiple nodes.
Azure SQL Data Warehouse uses up to 60 distributions when loading data into the
system.
Good Hash Key
Has more than
Distributes
60 distinct
Evenly
values
Is Not Used for
Updated Grouping
Used as Join
condition
What Data Distribution to Use?
Type Great fit for Watch out if…
Replicated Small-dimension tables in a • Many write transaction are on the table
star schema with less than (insert/update/delete)
2GB of storage after • You change DWU provisioning frequently
compression • You use only 2-3 columns, but your table has
many columns
• You index a replicated table
Round-robin (default) • Temporary/Staging table Performance is slow due to data movement
• No obvious joining key or
good candidate column.
hash • Fact tables The distribution key can’t be updated
• Large dimension tables
Data types
Use the smallest data type which will support your data
Avoid defining all character columns to a large default
length
Define columns as VARCHAR rather than NVARCHAR if
you don’t need Unicode
Data types
The goal is to not only save space but also move data as efficiently as possible.
Data types
Some complex data types (XML, geography, etc)
are not supported on Azure SQL Data
Warehouse yet.
Table types
Clustered • Updateable primary storage method
columnstore • Great for read-only
• Data is not in any particular order.
Heap • Use when data has no natural order.
• An index that is physically stored in the same
Clustered Index order as the data being indexed
High compression
Default table type
ratio
Clustered
columnstore
Ideally segments of No Secondary
1M rows Indexes
No index on the data Fast Load
Heap
Allows secondary
No compression
indexes
Sorted index on the data Fast singleton lookup
Clustered
B-Tree
Allows secondary
No compression
indexes
Table Partitioning
Table
Partitioning
Table partitions enable you to divide your data into
smaller groups of data
Improve the efficiency and performance of loading data
by use of partition deletion, switching and merging
Usually data is partitioned on a date column tied to when
the data is loaded into the database
Can also be used to improve query performance
Why Partitioning?
Partitions best practices
Creating a table Too many partitions can hurt
performance under some circumstances
Usually a successful partitioning scheme has 10 or a few
hundred partitions
Clustered column store tables, it is important to consider
how many rows belong to each partition
Before partitions are created, SQL Data warehouse
already divides each table into 60 distributed databases
A highly granular partitioning scheme can work
in SQL Server but hurt performance in Azure
SQL Data Warehouse.
Example
60 Distributions 365 Partitions 21900 Data Buckets
21900 Data Buckets Ideal Segment 21 900 000 000 Rows
Size (1M Rows)
Lower Granularity (week, month)
can perform better depending on
how much data you have.
Fact Tables
Large ones are better as Columnstores
Distributed through Hash key as much as
possible as long as it is even
Partitioned only if the table is large
enough to fill up each segment
Dimension Tables
Can be Hash distributed or Round-Robin if there is no clear candidate join key
Columnstore for large dimensions
Heap or Clustered Index for small dimensions
Add secondary indexes for alternate join columns
Partitioning not recommended
DEMO
Analyse data distribution at On-premises Datawarehouse before migrating to
Azure Synapse Data Pool.
• We will use Microsoft’s AdventureworksDW database as on-premises data warehouse.
• We will analyse one dimension and one fact table.
• Same process can be repeated to other tables of on-premises database.
Summary
MPP or Massive Parallel Processing
Billing = Compute + Storage
Data Distribution (Hash, Round-robin, Replicate)
Data types and Table types
Partitioning Data
Best practice – Fact and Dimension table design
Demo – Analyse Data Distribution