Lakehouse With Delta Lake Deep Dive
Kevin Coyle
Curriculum Developer, Databricks
Agenda
Today Looks Like:
Course Objectives
1. Define core characteristics of the Lakehouse architecture
2. Explain how to build an end-to-end batch and streaming OLAP data pipeline using Delta Lake
3. Follow specified design patterns to make data available for consumption by downstream stakeholders
4. Explain Databricks' recommended best practices in engineering a single source of truth Delta design pattern
Challenges with Modern Data
Most enterprises struggle with data
▪ Data Warehousing: structured data
▪ Data Engineering: structured, semi-structured, and unstructured data
▪ Streaming: streaming data sources
▪ Data Science and ML: structured, semi-structured, and unstructured data
Siloed data teams decrease productivity.
The Emergence of Data Lakes
Data Warehouses
Pros
▪ Great for Business Intelligence (BI) applications
A new standard for building data lakes
▪ An opinionated approach to building data lakes
▪ An open approach to bringing data management and governance to data lakes
Challenges
1. Hard to append data
2. Modification of existing data is difficult
3. Jobs failing midway
4. Real-time operations are hard
5. Costly to keep historical data versions
6. Difficult to handle large metadata
7. "Too many files" problems
8. Poor performance
9. Data quality issues
ACID Transactions
Make every operation transactional: it either fully succeeds, or it is fully aborted for later retries.
Every transaction is recorded as an ordered commit file in the table's transaction log:
/path/to/table/_delta_log
  0000.json
  0001.json
  0002.json
  …
  0010.parquet
Each commit records the actions it performed, for example adding data files (Add file1.parquet, Add file2.parquet) or replacing them (Remove file1.parquet, Add file2.parquet).
Review past transactions
All transactions are recorded, and you can go back in time to review previous versions of the data (i.e. time travel).
SELECT * FROM events TIMESTAMP AS OF ...
SELECT * FROM events VERSION AS OF ...
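A minimal sketch of how this looks in practice, assuming an events Delta table already exists; the version number and timestamp below are placeholders, not values from the slides:

  -- List the table's commit history (version, timestamp, operation, ...)
  DESCRIBE HISTORY events;

  -- Query the table as of a specific version or point in time
  SELECT * FROM events VERSION AS OF 5;
  SELECT * FROM events TIMESTAMP AS OF '2019-08-21T00:00:00';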
Powered by Spark
▪ Spark is built for handling large amounts of data
▪ All Delta Lake metadata is stored in open Parquet format
▪ Portions of it are cached and optimized for fast access
▪ Data and its metadata always co-exist
▪ No need to keep catalog<>data in sync
Indexing
Automatically optimize a layout that enables fast access:
▪ Partitioning: layout for typical queries
▪ Data skipping: prune files based on statistics on numerical columns
▪ Z-ordering: layout to optimize multiple columns
OPTIMIZE events
  ZORDER BY (eventType)
Schema validation and evolution
▪ All data in Delta tables has to adhere to a strict schema (star, etc.)
▪ Includes schema evolution in merge operations
MERGE INTO events
USING changes
ON events.id = changes.id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *
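As a sketch of how schema evolution can be switched on for a MERGE like the one above, the session-level setting below is the Delta Lake configuration for automatic schema merging (shown here as a Spark SQL SET; whether you set it per session or per cluster is a deployment choice):

  -- Allow MERGE to evolve the target schema when the source has new columns
  SET spark.databricks.delta.schema.autoMerge.enabled = true;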
Lakehouse Design
Elements of Delta Lake
▪ Delta Architecture
▪ Delta Storage Layer
▪ Delta Engine
Delta architecture
Data flows from the data lake through successively refined tables toward AI & reporting, with data quality increasing at each stage.
Delta architecture - Bronze
Raw ingestion: source data (CSV, JSON, TXT, …) lands from the data lake into Bronze tables.
Delta architecture - Silver
Silver tables hold data that has been filtered, cleaned, and augmented.
Delta architecture - Gold
Gold tables hold business-level aggregates that serve AI & reporting; data quality is highest at this stage.
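A minimal sketch of this Bronze → Silver → Gold flow in Spark SQL. The table names (events_bronze, events_silver, events_gold), columns (id, event_time, eventType), and source path are hypothetical illustrations, not from the slides:

  -- Bronze: raw ingestion, stored as-is in a Delta table
  CREATE TABLE IF NOT EXISTS events_bronze
  USING DELTA
  AS SELECT * FROM json.`/data/raw/events/`;

  -- Silver: filtered, cleaned, augmented
  CREATE TABLE IF NOT EXISTS events_silver
  USING DELTA
  AS SELECT id, CAST(event_time AS TIMESTAMP) AS event_time, eventType
     FROM events_bronze
     WHERE id IS NOT NULL;

  -- Gold: business-level aggregates for AI & reporting
  CREATE TABLE IF NOT EXISTS events_gold
  USING DELTA
  AS SELECT eventType, to_date(event_time) AS event_date, COUNT(*) AS event_count
     FROM events_silver
     GROUP BY eventType, to_date(event_time);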
Benefits of a Lakehouse
▪ Separation of compute and storage
▪ Infinite storage capacity
▪ Leverage best aspects of a data warehouse
▪ Low data gravity
▪ High data throughput
▪ No limits on data structure
▪ Mix batch and streaming workloads
Enterprise Architectures
Lambda Architecture
Events feed two paths: a streaming path writing continuously to one table, and a batch path whose table is compacted every hour; the two are reconciled through update & merge and reprocessing.
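The "update & merge" step of the diagram can be expressed as a Delta MERGE; a rough sketch, with hypothetical table names events_stream and events_batch and a hypothetical id key:

  -- Fold the continuously written stream table into the hourly batch table
  MERGE INTO events_batch AS b
  USING events_stream AS s
  ON b.id = s.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *;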
Data Mesh
Empowered by Lakehouse
LWD 00
LWD 01
Parquet to Delta Format
What does Parquet look like?
Parquet data
● Columnar storage
● Compression
● Designed for being read by distributed tasks
(Figure: a customers table stored as part files part1–part4, each read in parallel by a Spark task.)
(Figures: with plain Parquet and no transaction log, a write that fails midway leaves partial part files that readers can see; with Delta's transaction log, readers only ever see fully committed files.)
What Delta adds on top of Parquet:
● Schema enforcement on writes, including streams
● Change Data Capture (CDC) and Slowly Changing Dimension (SCD) patterns
● Versioned table states (V0, V1, V2, …) recorded in the transaction log
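One way to move an existing Parquet directory to Delta format is an in-place conversion; a sketch assuming an unpartitioned Parquet table at a hypothetical path:

  -- Rewrites no data; builds a Delta transaction log over the existing Parquet files
  CONVERT TO DELTA parquet.`/data/customers/`;

  -- For a partitioned table, the partition schema must be declared, e.g.:
  -- CONVERT TO DELTA parquet.`/data/events/` PARTITIONED BY (date DATE);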
LWD 02
Delta Lake Components
Delta Tables
▪ Data stored as Parquet part files (part1 … part4)
▪ The transaction log provides a metadata layer for consistency
https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html
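A minimal sketch of creating such a table; only the customers name comes from the diagram, and the column definitions are hypothetical:

  -- Creates the Parquet data files and the _delta_log directory together
  CREATE TABLE IF NOT EXISTS customers (
    id   BIGINT,
    name STRING
  )
  USING DELTA;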
LWD 03
Commit Service
Transaction Log / Metadata
Scalable Metadata
▪ Log-structured storage: changes to the table are stored as ordered, atomic units called commits in a directory named _delta_log
▪ Each commit is a JSON transaction file containing schema & metadata (min/max/etc.); Parquet checkpoints periodically consolidate the log
Example:
  000000.json   Add 1.parquet, Add 2.parquet
  000001.json   Remove 1.parquet, Remove 2.parquet, Add 3.parquet
  …
Serialized Transactions
▪ Writers need to agree on the order of changes, even when there are multiple writers.
(Figure: User 1 and User 2 both try to commit after 000000.json … 000011.json; only one of them can write 000012.json.)
Pessimistic vs. optimistic concurrency
▪ Optimistic concurrency: assume it'll be okay, then check
  ✔ Mutual exclusion is enough
  ❌ Breaks down if there are a lot of conflicts
▪ Pessimistic concurrency: block others from conflicting (locks)
  ✔ Avoids wasted work
  ❌ Requires distributed locks
Solving conflicts optimistically
▪ Record the start version
▪ Record reads/writes
▪ If someone else wins, check if anything you read has changed
▪ Try again
(Figure: starting from 000000.json, User 1 reads the schema and writes A, User 2 reads the schema and writes B; their commits are serialized as 000001.json and 000002.json.)
LWD 04
Delta Engine
Delta Engine - Optimize
OPTIMIZE events
  WHERE date >= current_timestamp() - INTERVAL 1 day
  ZORDER BY (eventType)
(Figure: many small part files of the customers table are compacted into fewer, larger files, with the change recorded in the transaction log.)
Delta Engine - Auto Optimize
▪ Automatic compaction (bin packing)
▪ Solves the streaming small-file problem
(Figure: small files produced by streaming writes are automatically compacted into larger files.)
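On Databricks this behavior can be enabled per table through table properties; a sketch reusing the events table from earlier examples:

  ALTER TABLE events SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',  -- write fewer, larger files
    'delta.autoOptimize.autoCompact'   = 'true'   -- compact small files after writes
  );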
Delta Engine - Z Order
(Figure: the same grid of (x, y) points laid out across part-files in linear order vs. Z-order.)
Data skipping helps identify which part-files can and cannot be skipped.
Consider this query: SELECT * FROM points WHERE x = 2 OR y = 3
Delta Engine - Z Order
With a linear layout, each part-file must be read in and tested for x = 2 OR y = 3; Z-ordering clusters the data so that file-level min/max statistics allow most part-files to be skipped.
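A sketch of applying this to the example above; the points table and its x, y columns follow the slide's illustration and are not a real dataset:

  -- Co-locate rows with similar x and y values in the same part-files,
  -- so min/max statistics can skip files for predicates on either column
  OPTIMIZE points
  ZORDER BY (x, y);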
Delta Lake Summary
▪ Core component of a Lakehouse architecture
▪ Offers guaranteed consistency because it's ACID compliant
▪ Robust data store
▪ Designed to work with Apache Spark
Thank you