
Internal and Architecture

The document discusses Azure Synapse architecture and best practices for data distribution, types, and table design in an Azure Synapse data warehouse. It covers topics like MPP, billing, data distribution techniques like hash, round-robin and replicate. It also discusses table types, partitioning, and provides best practices for designing fact and dimension tables. It demonstrates analyzing data distribution in an on-premises data warehouse before migrating to Azure Synapse.

Uploaded by

Renganathan Umanath


Eshant Garg

Data Engineer, Architect, Advisor

eshant.garg@gmail.com
Introduction

MPP or Massive Parallel Processing

Storage & Data Distribution (Hash, Round-robin, Replicate)
Data types and Table types (Columnstore, Heap, Clustered B-tree index)
Partitioning and Distribution key
Applications in Dimensional modeling
Demo – Table Analysis before Migration to Cloud
Azure Synapse MPP Architecture

DWU    Loading 3 Tables    Ran Report
100    15                  20
500    3                   4

Source: Microsoft
Azure Storage and Distribution

SQL DW charges separately for storage consumption

A distribution is the basic unit of storage and processing for parallel queries

Rows are stored across 60 distributions which are run in parallel

Each compute node manages one or more of the 60 distributions
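
As a rough illustration of how the 60 distributions could be shared out among compute nodes, here is a minimal Python sketch. The function name `distributions_per_node` is hypothetical, and the even-split rule is an assumption for illustration; the slides only say that each compute node manages one or more of the 60 distributions.

```python
from __future__ import annotations

TOTAL_DISTRIBUTIONS = 60  # fixed count stated in the slides


def distributions_per_node(compute_nodes: int) -> list[int]:
    """Share the 60 distributions across compute nodes as evenly as possible.

    Assumption: an even split. The real service decides the actual mapping.
    """
    if not 1 <= compute_nodes <= TOTAL_DISTRIBUTIONS:
        raise ValueError("compute_nodes must be between 1 and 60")
    base, extra = divmod(TOTAL_DISTRIBUTIONS, compute_nodes)
    # The first `extra` nodes take one more distribution than the rest.
    return [base + 1 if i < extra else base for i in range(compute_nodes)]


if __name__ == "__main__":
    for nodes in (1, 2, 6, 60):
        print(nodes, distributions_per_node(nodes))
```

With one node, that node handles all 60 distributions; with 60 nodes, each handles exactly one, which is why adding compute reduces the work per node.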

Sharding Patterns
Replicated Tables

• Caches a full copy on each compute node.

• Used for small tables

CREATE TABLE [dbo].[BusinessHierarchies](

[BookId] [nvarchar](250) ,
[Division] [nvarchar](100) ,
[Cluster] [nvarchar](100) ,
[Desk] [nvarchar](100) ,
[Book] [nvarchar](100) ,
[Volcker] [nvarchar](100) ,
[Region] [nvarchar](100)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = REPLICATE
)
;

Source: Microsoft
Round Robin tables
• Generally used to load staging tables
• Distributes data evenly across the table without additional optimization
• Joins are slow, because the data has to be reshuffled
• Default distribution type

CREATE TABLE [dbo].[Dates](
[Date] [datetime2](3) ,
[DateKey] [decimal](38, 0) ,
..
..
[WeekDay] [nvarchar](100) ,
[Day Of Month] [decimal](38, 0)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = ROUND_ROBIN
)
;

Source: Microsoft
Hash Distribution Tables

• Highest performance for large tables

• Each row belongs to one particular distribution
• It is used mostly for larger tables

Source: Microsoft
Hash Distribution Tables

Record Product Store

1 Soccer New York
2 Soccer Los Angeles
3 Football Phoenix
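
The sample records above can be used to sketch what "each row belongs to one particular distribution" means. The following Python sketch hashes the Product column into one of 60 buckets. The SHA-256 based hash and the names `distribution_for` and `place_rows` are assumptions for illustration only; the hash function the service actually uses is not described in the slides.

```python
import hashlib

DISTRIBUTIONS = 60  # number of distributions stated in the slides


def distribution_for(key: str) -> int:
    """Map a distribution-key value to a bucket in 0..59 (illustrative hash only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % DISTRIBUTIONS


# The three sample records from the slide: (Record, Product, Store).
rows = [
    (1, "Soccer", "New York"),
    (2, "Soccer", "Los Angeles"),
    (3, "Football", "Phoenix"),
]


def place_rows(records, key_index=1):
    """Return {record_id: distribution} when hashing on the chosen column."""
    return {rec[0]: distribution_for(rec[key_index]) for rec in records}


if __name__ == "__main__":
    print(place_rows(rows))
```

Because the hash is deterministic, records 1 and 2 (both "Soccer") always land in the same distribution, so a join or aggregation on Product needs no data movement for them.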

CREATE TABLE [dbo].[EquityTimeSeriesData](

[Date] [varchar](30) ,
[BookId] [decimal](38, 0) ,
[P&L] [decimal](31, 7) ,
[VaRLower] [decimal](31, 7)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = HASH([P&L])
)
;

Source: Microsoft
Avoid Data Skew
Even Distribution
The distribution type determines how Azure SQL Data Warehouse spreads the data
across multiple nodes.

Azure SQL Data Warehouse spreads rows across 60 distributions when loading data into the
system.
Good Hash Key

• Distributes evenly
• Has more than 60 distinct values
• Is not updated
• Used for grouping
• Used as join condition
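
To make the "more than 60 distinct values" and "distributes evenly" criteria concrete, the following Python sketch counts how many rows each of the 60 distributions would receive for a candidate key. The names `bucket` and `skew_report` are hypothetical, and the SHA-256 hash is a stand-in assumption rather than the hash the service uses.

```python
from collections import Counter
import hashlib

DISTRIBUTIONS = 60


def bucket(value) -> int:
    """Illustrative stand-in for the service's hash: value -> 0..59."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % DISTRIBUTIONS


def skew_report(values):
    """Summarise how evenly a candidate key would spread rows over 60 distributions."""
    counts = Counter(bucket(v) for v in values)
    per_bucket = [counts.get(d, 0) for d in range(DISTRIBUTIONS)]
    average = len(values) / DISTRIBUTIONS
    return {
        "distinct_values": len(set(values)),
        "empty_distributions": per_bucket.count(0),
        # 1.0 means perfectly even; larger means the busiest distribution is overloaded.
        "max_over_average": max(per_bucket) / average if average else 0.0,
    }


if __name__ == "__main__":
    low_cardinality = ["A", "B", "C"] * 1000   # only 3 distinct values
    high_cardinality = list(range(60000))       # one distinct value per row
    print(skew_report(low_cardinality))
    print(skew_report(high_cardinality))
```

A key with only three distinct values leaves at least 57 distributions empty no matter how the hash behaves, which is the skew this slide warns against.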
What Data Distribution to Use?
Replicated
Great fit for:
• Small dimension tables in a star schema with less than 2GB of storage after compression
Watch out if:
• Many write transactions are on the table (insert/update/delete)
• You change DWU provisioning frequently
• You use only 2-3 columns, but your table has many columns
• You index a replicated table

Round-robin (default)
Great fit for:
• Temporary/staging tables
• No obvious joining key or good candidate column
Watch out if:
• Performance is slow due to data movement

Hash
Great fit for:
• Fact tables
• Large dimension tables
Watch out if:
• The distribution key can't be updated
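
The guidance above can be read as a simple decision rule. The Python sketch below encodes only the "great fit" column; the function name `suggest_distribution` and its parameters are hypothetical, and the "watch out" conditions (write-heavy tables, frequent DWU changes, and so on) are deliberately left out.

```python
def suggest_distribution(table_role: str, compressed_gb: float, has_join_key: bool) -> str:
    """Suggest a distribution type from the rules of thumb in the slides.

    table_role: "staging", "dimension" or "fact" (labels assumed for this sketch).
    compressed_gb: table size after compression, in GB.
    has_join_key: whether there is an obvious join column to hash on.
    """
    role = table_role.lower()
    if role == "staging":
        return "ROUND_ROBIN"          # temporary/staging tables
    if role == "dimension" and compressed_gb < 2:
        return "REPLICATE"            # small dimension, under 2GB compressed
    if has_join_key:
        return "HASH"                 # fact tables and large dimensions
    return "ROUND_ROBIN"              # no obvious joining key


if __name__ == "__main__":
    print(suggest_distribution("dimension", 0.5, True))
    print(suggest_distribution("fact", 800, True))
    print(suggest_distribution("fact", 800, False))
```

Treat the result as a starting point only: a replicated table that receives many writes, for example, may still be a poor choice even if it is small.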
Data types

Use the smallest data type that will support your data

Avoid defining all character columns with a large default length

Define columns as VARCHAR rather than NVARCHAR if you don't need Unicode

The goal is not only to save space but also to move data as efficiently as possible.

Some complex data types (XML, geography, etc.) are not yet supported on Azure SQL Data Warehouse.
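
A quick back-of-the-envelope calculation shows why the VARCHAR versus NVARCHAR choice matters. The Python sketch below assumes 1 byte per character for VARCHAR (single-byte collation) and 2 bytes per character for NVARCHAR, and ignores per-value overhead; the helper names are hypothetical. The column lengths are taken from the BusinessHierarchies example earlier in the deck.

```python
# Assumed maximum bytes per declared character (overhead bytes ignored).
BYTES_PER_CHAR = {"varchar": 1, "nvarchar": 2}

# Declared lengths of the seven character columns in [dbo].[BusinessHierarchies].
BUSINESS_HIERARCHIES_LENGTHS = [250, 100, 100, 100, 100, 100, 100]


def max_bytes(type_name: str, length: int) -> int:
    """Maximum data bytes for one column value of the given type and length."""
    return BYTES_PER_CHAR[type_name.lower()] * length


def max_row_bytes(type_name: str, lengths) -> int:
    """Maximum data bytes per row if every column used the same character type."""
    return sum(max_bytes(type_name, n) for n in lengths)


if __name__ == "__main__":
    print("nvarchar:", max_row_bytes("nvarchar", BUSINESS_HIERARCHIES_LENGTHS))
    print("varchar: ", max_row_bytes("varchar", BUSINESS_HIERARCHIES_LENGTHS))
```

Under these assumptions the widest possible row shrinks from 1,700 bytes to 850 bytes, which halves what has to be moved between nodes for those columns.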
Table types
Clustered columnstore
• Updateable primary storage method
• Great for read-only
• Default table type
• High compression ratio
• Ideally segments of 1M rows
• No secondary indexes

Heap
• Data is not in any particular order
• Use when data has no natural order
• No index on the data
• Fast load
• Allows secondary indexes
• No compression

Clustered index (B-tree)
• An index that is physically stored in the same order as the data being indexed
• Sorted index on the data
• Fast singleton lookup
• Allows secondary indexes
• No compression
Table Partitioning
Why Partitioning?

Table partitions enable you to divide your data into smaller groups of data

Improve the efficiency and performance of loading data by use of partition deletion, switching and merging

Usually data is partitioned on a date column tied to when the data is loaded into the database

Can also be used to improve query performance
Partitions best practices

Creating a table with too many partitions can hurt performance under some circumstances

Usually a successful partitioning scheme has between 10 and a few hundred partitions

For clustered columnstore tables, it is important to consider how many rows belong to each partition

Before partitions are created, SQL Data Warehouse already divides each table into 60 distributed databases

A highly granular partitioning scheme can work in SQL Server but hurt performance in Azure SQL Data Warehouse.
Example

60 Distributions × 365 Partitions = 21,900 Data Buckets

21,900 Data Buckets × Ideal Segment Size (1M Rows) = 21,900,000,000 Rows

Lower granularity (week, month) can perform better depending on how much data you have.
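
The arithmetic in this example is easy to rerun for other granularities. The Python sketch below uses the two figures from the slide (60 distributions, 1M rows per ideal segment); the function names and the weekly and monthly partition counts (52 and 12 per year) are assumptions added for comparison.

```python
DISTRIBUTIONS = 60                # every table is already split 60 ways
IDEAL_SEGMENT_ROWS = 1_000_000    # ideal columnstore segment size from the slide


def rows_to_fill(partitions: int) -> int:
    """Rows needed so every distribution/partition bucket holds one ideal segment."""
    return DISTRIBUTIONS * partitions * IDEAL_SEGMENT_ROWS


def max_partitions(total_rows: int) -> int:
    """Largest partition count that still leaves about 1M rows per bucket."""
    return total_rows // (DISTRIBUTIONS * IDEAL_SEGMENT_ROWS)


if __name__ == "__main__":
    for label, partitions in (("daily", 365), ("weekly", 52), ("monthly", 12)):
        print(f"{label:8s} {partitions:4d} partitions -> {rows_to_fill(partitions):,} rows")
    print("2 billion rows supports about", max_partitions(2_000_000_000), "partitions")
```

Daily partitions over a year need 21.9 billion rows to fill every segment, while monthly partitions need 720 million, which is why coarser granularity often suits smaller tables.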
Fact Tables

Large ones are better as columnstores

Distributed through a hash key as much as possible, as long as the distribution is even

Partitioned only if the table is large enough to fill up each segment
Dimension Tables

Can be Hash distributed or Round-Robin if there is no clear candidate join key

Columnstore for large dimensions

Heap or Clustered Index for small dimensions

Add secondary indexes for alternate join columns

Partitioning not recommended

DEMO
Analyse data distribution in the on-premises data warehouse before migrating to
an Azure Synapse SQL pool.

• We will use Microsoft's AdventureWorksDW database as the on-premises data warehouse.

• We will analyse one dimension table and one fact table.
• The same process can be repeated for the other tables of the on-premises database.
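
The slides do not show the demo queries themselves. As one possible way to do this kind of analysis, the Python sketch below builds two T-SQL profiling queries for a candidate distribution column: one for cardinality and one for the most frequent values. The helper names are hypothetical, and the table and column used in the example (`dbo.FactInternetSales`, `ProductKey`) are chosen here for illustration, not taken from the demo.

```python
def quote_ident(name: str) -> str:
    """Bracket-quote a T-SQL identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def profile_queries(schema: str, table: str, column: str) -> dict:
    """Build queries that profile a column as a candidate distribution key."""
    target = f"{quote_ident(schema)}.{quote_ident(table)}"
    col = quote_ident(column)
    return {
        # More than 60 distinct values is the minimum for a usable hash key.
        "cardinality": (
            f"SELECT COUNT(*) AS total_rows, COUNT(DISTINCT {col}) AS distinct_values "
            f"FROM {target};"
        ),
        # A few dominant values signal skew even when cardinality is high.
        "top_values": (
            f"SELECT TOP (10) {col}, COUNT(*) AS row_count "
            f"FROM {target} GROUP BY {col} ORDER BY row_count DESC;"
        ),
    }


if __name__ == "__main__":
    for name, sql in profile_queries("dbo", "FactInternetSales", "ProductKey").items():
        print(f"-- {name}\n{sql}\n")
```

Running the generated queries against each candidate column of a fact or dimension table gives the distinct-value count and the heaviest values, which can then be compared against the good hash key criteria.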
Summary
MPP or Massive Parallel Processing
Billing = Compute + Storage
Data Distribution (Hash, Round-robin, Replicate)
Data types and Table types
Partitioning Data
Best practice – Fact and Dimension table design
Demo – Analyse Data Distribution

