
Lakehouse with Delta Lake: Deep Dive

Kevin Coyle
Curriculum Developer, Databricks
Agenda
Today looks like:

1. Challenges with Modern Data
2. How Delta Lake Enables the Lakehouse
3. Lakehouse Design
4. Delta Lake Components
Course Objectives
1. Define core characteristics of the Lakehouse architecture.
2. Explain how Delta Lake supports the Lakehouse architecture.
3. Explain how to build an end-to-end batch and streaming OLAP data pipeline using Delta Lake.
4. Follow specified design patterns to make data available for consumption by downstream stakeholders.
5. Explain Databricks' recommended best practices for engineering a single-source-of-truth Delta design pattern.
Challenges with Modern Data
Most enterprises struggle with data
Data Warehousing | Data Engineering | Streaming | Data Science and ML

▪ Siloed stacks increase data architecture complexity. Each workload has its own stack: structured data goes through extract-transform-load into a data warehouse and data marts for analytics and BI; structured, semi-structured, and unstructured data lands in a data lake for data prep; streaming data sources feed a streaming data engine and a real-time database; and a separate data lake serves data science and machine learning.

▪ Disconnected systems and proprietary data formats make integration difficult. Typical tools per stack: Amazon Redshift, Teradata, Azure Synapse, Google BigQuery, Snowflake, IBM Db2, SAP, Oracle Autonomous Data Warehouse, and Tibco Spotfire for data warehousing; Hadoop, Apache Airflow, Amazon EMR, Apache Spark, Google Dataproc, and Cloudera for data engineering; Apache Kafka, Apache Spark, Apache Flink, Amazon Kinesis, Azure Stream Analytics, Google Dataflow, and Confluent for streaming; Jupyter, Amazon SageMaker, Azure ML Studio, MATLAB, Domino Data Labs, SAS, TensorFlow, and PyTorch for data science and ML.

▪ Siloed data teams decrease productivity. Data analysts, data engineers, and data scientists each work in their own silo.
The Emergence of Data Lakes
Data Warehouses

(Diagram: structured data flows through ETL into a data warehouse that serves business intelligence.)

Pros
• Great for Business Intelligence (BI) applications

Cons
• Limited support for Machine Learning (ML) workloads
• Proprietary systems with only a SQL interface
The Emergence of Data Lakes
Data Lakes

▪ Really cheap, durable storage
  10 nines of durability. Cheap. Infinite scale.
▪ Store all types of raw data
  Video, audio, text, structured, unstructured.
▪ Open, standardized formats
  Parquet format; a big ecosystem of tools operates on these file formats.
Challenges with Data Lakes
1. Hard to append data
   Adding newly arrived data leads to incorrect reads.
2. Modification of existing data is difficult
   GDPR/CCPA require making fine-grained changes to the existing data lake.
3. Jobs failing midway
   Half of the data appears in the data lake; the rest is missing.
Challenges with Data Lakes
4. Real-time operations
   Mixing streaming and batch leads to inconsistency.
5. Costly to keep historical versions of the data
   Regulated environments require reproducibility, auditing, and governance.
6. Difficult to handle large metadata
   For large data lakes, the metadata itself becomes difficult to manage.
Challenges with Data Lakes
7. "Too many files" problems
   Data lakes are not great at handling millions of small files.
8. Hard to get great performance
   Partitioning the data for performance is error-prone and difficult to change.
9. Data quality issues
   It's a constant headache to ensure that all the data is correct and high quality.
How Delta Lake Enables the Lakehouse
A new standard for building data lakes
An opinionated approach to building data lakes

■ Adds reliability, quality, and performance to data lakes
■ Brings the best of data warehousing and data lakes
■ Based on open source and an open format (Parquet) - Delta Lake itself is also open source
Lakehouse
One platform to unify all of your data, analytics, and AI workloads - bringing the data lake and the data warehouse together.
An open approach to bringing data management and governance to data lakes

▪ Better reliability with transactions
▪ 48x faster data processing with indexing
▪ Data governance at scale with fine-grained access control lists
Challenges
1. Hard to append data
2. Modification of existing data difficult
3. Jobs failing midway
4. Real-time operations hard
5. Costly to keep historical data versions
6. Difficult to handle large metadata
7. "Too many files" problems
8. Poor performance
9. Data quality issues

Delta Lake addresses these one by one, starting with ACID transactions.
ACID Transactions
Make every operation transactional: it either fully succeeds, or it is fully aborted for later retries.

Every transaction is recorded as a commit in the table's transaction log:

/path/to/table/_delta_log
  0000.json
  0001.json
  0002.json
  0010.parquet
Each commit lists the actions taken on the table's data files. An append, for example, adds files:

  { Add file1.parquet
    Add file2.parquet }

while an update or delete removes files and adds their replacements:

  { Remove file1.parquet
    Add file2.parquet }
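A minimal sketch of how such commits arise in practice - the table name is hypothetical, assuming a Spark SQL session with Delta Lake:

-- Each statement below becomes one commit (one JSON file) in the table's _delta_log.
CREATE TABLE tx_demo (id INT, eventType STRING) USING DELTA;

INSERT INTO tx_demo VALUES (1, 'click'), (2, 'view');  -- commit: Add <new data file>

DELETE FROM tx_demo WHERE id = 1;                       -- commit: Remove <old file>, Add <rewritten file>

-- One row per commit: version, timestamp, operation, and operation metrics.
DESCRIBE HISTORY tx_demo;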
Review past transactions
All transactions are recorded, and you can go back in time to review previous versions of the data (i.e. time travel):

SELECT * FROM events
TIMESTAMP AS OF ...

SELECT * FROM events
VERSION AS OF ...
Powered by Spark
• Spark is built for handling large amounts of data
• All Delta Lake metadata is stored in open Parquet format
• Portions of it are cached and optimized for fast access
• Data and its metadata always co-exist
• No need to keep the catalog and the data in sync
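Because the metadata lives next to the data, a Delta table can be queried straight from its storage path with no separate catalog entry - a small sketch, path is hypothetical:

-- The _delta_log alongside the Parquet files is all Spark needs to resolve
-- the table's schema and its current list of data files.
SELECT * FROM delta.`/mnt/datalake/customers`;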
Indexing
Automatically optimize a layout that enables fast access

• Partitioning: layout for typical queries
• Data skipping: prune files based on statistics on numericals
• Z-ordering: layout to optimize multiple columns

OPTIMIZE events
ZORDER BY (eventType)
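For the partitioning point, a minimal sketch with hypothetical table and column names; the OPTIMIZE ... ZORDER BY command above then handles the layout within each partition:

-- Partition on a low-cardinality column that appears in typical query filters,
-- so whole partitions can be pruned at read time.
CREATE TABLE events_by_date (id INT, eventType STRING, date DATE)
USING DELTA
PARTITIONED BY (date);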
Schema validation and evolution
• All data in Delta tables has to adhere to a strict schema (star, etc.)
• Includes schema evolution in merge operations

MERGE INTO events
USING changes
ON events.id = changes.id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *
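One note on the evolution side: by default, schema enforcement rejects a MERGE whose source carries columns the target does not have. Delta Lake exposes a session setting that lets the merge add those columns instead - a sketch, not required for the MERGE above:

-- Allows MERGE/INSERT to add columns that exist in the source but not yet
-- in the target table (schema evolution); off by default.
SET spark.databricks.delta.schema.autoMerge.enabled = true;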
Lakehouse Design

Elements of Delta Lake
▪ Delta Architecture
▪ Delta Storage Layer
▪ Delta Engine
Delta architecture

DATA → Bronze (raw ingestion) → Silver (filtered, cleaned, augmented) → Gold (business-level aggregates) → Streaming Analytics, AI & Reporting

Data quality increases from Bronze to Gold.
The Delta architecture design pattern

Raw sources (CSV, JSON, TXT… and the data lake) → Bronze → Silver → Gold → Streaming Analytics, AI & Reporting

Data quality increases as data moves from Bronze to Gold.
Delta architecture - Bronze

Raw files (CSV, JSON, TXT…) and data lake sources land in the Bronze layer: raw ingestion.
Delta architecture - Silver

The Bronze layer feeds the Silver layer: filtered, cleaned, and augmented data.
Delta architecture - Gold

The Silver layer feeds the Gold layer: business-level aggregates that serve streaming analytics, AI, and reporting.
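A batch-flavoured sketch of the bronze → silver → gold pattern in Spark SQL - table names, paths, and columns are hypothetical:

-- Bronze: raw ingestion, schema as-landed.
CREATE TABLE bronze_events USING DELTA AS
SELECT * FROM json.`/mnt/raw/events/`;

-- Silver: filtered, cleaned, augmented.
CREATE TABLE silver_events USING DELTA AS
SELECT id, CAST(ts AS TIMESTAMP) AS ts, eventType
FROM bronze_events
WHERE id IS NOT NULL;

-- Gold: business-level aggregates for reporting.
CREATE TABLE gold_event_counts USING DELTA AS
SELECT eventType, date_trunc('DAY', ts) AS day, count(*) AS events
FROM silver_events
GROUP BY eventType, date_trunc('DAY', ts);

The same pattern can run as a streaming pipeline by reading and writing each layer with Structured Streaming instead of batch CTAS statements.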
Benefits of a Lakehouse
▪ Separation of compute and storage
▪ Infinite storage capacity
▪ Leverage best aspects of a data warehouse
▪ Low data gravity
▪ High data throughput
▪ No limits on data structure
▪ Mix batch and streaming workloads
Enterprise Architectures
Lambda Architecture

(Diagram: events arrive as a stream; a streaming path with stream validation and a batch path - one batch table with data written continuously, one with data compacted every hour - are kept in step through update & merge and reprocessing, and a unified view over them serves AI & Reporting.)
Data Mesh
Empowered by Lakehouse

LWD 00
LWD 01

Parquet to Delta Format
What does Parquet look like?

Parquet data
● Columnar storage
● Compression
● Designed for being read by distributed tasks

(Diagram: a customers table stored as part files part1-part4, each read by its own Spark task; distributed computation is followed by total result generation.)
What can go wrong with Parquet?

▪ Small file problem: many tiny part files mean many short tasks and a long overall duration.
▪ Big file (data skew) problem: one oversized part file means one long task dominates the overall duration.
▪ Corrupt data: a broken schema or a corrupt file makes its task fail, and the overall job fails with it.
▪ Goal: evenly sized, healthy part files so every task takes about the same time and the job succeeds.
Delta reliability and performance features

▪ Consistency: never read broken, unfinished, or wrong data - readers only see files referenced by the transaction log, and schema enforcement keeps bad writes (including from streams) out of the table.
▪ Optimizations on the fly: small files are compacted automatically, with no need for a complex pipeline.
▪ Direct updates and deletes: GDPR deletions and Change Data Capture (CDC & SCD) work directly on the table, again with no need for a complex pipeline.
▪ Time travel: implicit snapshots of the table at each version (V0, V1, V2, …).
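Since this section is about moving from Parquet to the Delta format: an existing Parquet directory can be upgraded in place by generating a transaction log for it - a sketch with a hypothetical path:

-- Writes a _delta_log describing the existing Parquet files; the data itself is not rewritten.
CONVERT TO DELTA parquet.`/mnt/raw/customers`;

-- For a partitioned layout, the partition columns must be declared:
-- CONVERT TO DELTA parquet.`/mnt/raw/customers` PARTITIONED BY (region STRING);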
LWD 02

Delta Components

Delta Lake Components

▪ Delta Tables
▪ Commit Service
▪ Delta Engine
Delta Tables

Delta Tables

A Delta table (e.g. customers) is a directory of Parquet part files (part1, part2, part3, part4) plus a transaction log (TX log). The transaction log provides the metadata layer for consistency; the data itself stays in Parquet.

https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html
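Both halves - the Parquet data files and the log-backed metadata - can be inspected for any table; a small sketch, assuming a table named customers exists:

-- Format, storage location, current number of data files, total size, and more.
DESCRIBE DETAIL customers;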
LWD 03

Commit Service

Transaction Log / Metadata
JSON transactions, Parquet checkpoints

Scalable metadata - log-structured storage
▪ Changes to the table are stored as ordered, atomic units called commits (000000.json, 000001.json, …) in a directory named _delta_log. Each commit lists actions such as Add 1.parquet, Add 2.parquet, Remove 1.parquet, Remove 2.parquet, Add 3.parquet.
▪ Commits also contain the schema & metadata (min/max statistics, etc.).

Serialized transactions
▪ Writers need to agree on the order of changes, even when there are multiple writers - e.g. User 1 and User 2 both trying to commit on top of 000011.json must be serialized into 000012.json and beyond.
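The Parquet checkpoint mentioned above is written periodically (every 10 commits by default) so readers don't have to replay every JSON commit; the cadence is exposed as a table property - a sketch with a hypothetical table name:

-- Checkpoint the log into a Parquet file every 10 commits (the default).
ALTER TABLE customers SET TBLPROPERTIES ('delta.checkpointInterval' = '10');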
Pessimistic vs. optimistic concurrency
▪ Optimistic concurrency - assume it'll be okay, then check
  ✔ Mutual exclusion is enough!
  ❌ Breaks down if there are a lot of conflicts
▪ Pessimistic concurrency - block others from conflicting (locks)
  ✔ Avoids wasted work
  ❌ Requires distributed locks
Solving conflicts optimistically
▪ Record the start version
▪ Record reads/writes
▪ If someone else wins, check if anything you read has changed
▪ Try again

(Example: User 1 reads the schema and writes A; User 2 reads the schema and writes B; because neither read what the other wrote, their commits are serialized one after another - 000001.json, 000002.json - on top of 000000.json.)
LWD 04

Delta Engine

Delta Engine - Optimize

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

Compaction with the OPTIMIZE command

(Diagram: a customers table made up of many small part files is rewritten into a few right-sized files, with the transaction log recording the change.)
Delta Engine - Auto Optimize
Automatic compaction (bin packing) solves the streaming small-file problem.

(Diagram: the many small files produced by a stream are automatically compacted into a few larger files in the customers table.)
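On Databricks, Auto Optimize is switched on per table through table properties - hypothetical table name, property names as documented for Databricks Delta tables:

-- Optimized writes bin-pack data before it is written; auto compaction merges
-- the small files left behind by streaming or frequent small batches.
ALTER TABLE customers SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact'   = 'true'
);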
Delta Engine - Z Order

(Diagram: the same 8x8 grid of (x, y) points, with x and y each running from 0 to 7, laid out in linear order vs. Z-order; the two layouts split the grid into part-files differently.)

Data skipping helps identify which part-files can and cannot be skipped.
Consider this query: SELECT * FROM points WHERE x = 2 OR y = 3
Delta Engine - Z Order
Read in each part-file and test for x = 2 or y = 3.

(Diagram: the part-files that must be read for the query, highlighted on the linear-order and Z-order layouts.)

Linear order: 9 files scanned in total 👎, 21 false positives 👎
Z-order: 7 files scanned in total 👍, 13 false positives 👍
What is data skipping?
▪ Simple, well-known I/O pruning technique used by many DBMSes and Big Data systems
▪ Idea: track file-level stats like min & max and leverage them to avoid scanning irrelevant files
▪ Example - collect per-file stats:

SELECT input_file_name() AS "file_name",
       min(col) AS "col_min",
       max(col) AS "col_max"
FROM table
GROUP BY input_file_name()

file_name  col_min  col_max
file1      6        8
file2      3        10
file3      1        4

▪ A query such as SELECT * FROM table WHERE col = 5 is then answered by first pruning files with those stats:

SELECT file_name FROM index
WHERE col_min <= 5 AND col_max >= 5

Only file2 (min 3, max 10) can contain col = 5, so file1 and file3 are skipped.
Delta performance features
▪ Optimize
  Command that bin-packs files to the right size.
▪ Auto-optimize
  Small and large files are compacted so that data lake applications experience consistently great performance and scalability.
▪ Scalable writes
  Fine-grained conflict resolution allowing multiple writers to succeed.
▪ Data skipping
  Improves read performance by only reading subsets of the files.
▪ Z-Order
  Clusters files in a way that enables data skipping for multi-dimensional filters.
▪ Bloom filters
  Improves read performance by only reading subsets of the files that have data matching users' filters (see the sketch after this list).
▪ Caching
  Automatically caches input Delta (and Parquet) tables, improving read throughput by 2x to 10x.
▪ Skewed join
  Supports joining two datasets with severe data skew, a problem with a lot of real-world datasets. Needed at scale.
▪ Range join (time-series data)
  Supports joining two datasets based on overlapping ranges, such as time-series analysis.
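For the Bloom filter item, Databricks exposes a dedicated index DDL - a sketch with hypothetical table, column, and option values:

-- Skips files whose Bloom filter proves they cannot contain the filtered value;
-- useful for point lookups on high-cardinality columns.
CREATE BLOOMFILTER INDEX ON TABLE customers
FOR COLUMNS (email OPTIONS (fpp = 0.1, numItems = 50000000));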
Summary

Delta Lake Summary
▪ Core component of a Lakehouse architecture
▪ Offers guaranteed consistency because it's ACID compliant
▪ Robust data store
▪ Designed to work with Apache Spark
Thank you