
Lakehouse with Delta Lake: Deep Dive

Kevin Coyle
Curriculum Developer, Databricks
Agenda
Today looks like:

1. Challenges with Modern Data
2. How Delta Lake Enables the Lakehouse
3. Lakehouse Design
4. Delta Lake Components
Course Objectives
1. Define core characteristics of the Lakehouse architecture.
2. Explain how Delta Lake supports the Lakehouse architecture.
3. Explain how to build an end-to-end batch and streaming OLAP data pipeline using Delta Lake.
4. Follow specified design patterns to make data available for consumption by downstream stakeholders.
5. Explain Databricks' recommended best practices for engineering a single-source-of-truth Delta design pattern.
Challenges with Modern Data
Most enterprises struggle with data
Data Warehousing | Data Engineering | Streaming | Data Science and ML

▪ Siloed stacks increase data architecture complexity. Each workload has its own stack: structured data goes through extract-transform-load into a data warehouse and data marts for analytics and BI; structured, semi-structured, and unstructured data lands in a data lake for data prep; streaming data sources feed a streaming data engine and a real-time database; and a separate data lake serves data science and machine learning.

▪ Disconnected systems and proprietary data formats make integration difficult. Typical tools per stack: Amazon Redshift, Teradata, Azure Synapse, Google BigQuery, Snowflake, IBM Db2, SAP, Oracle Autonomous Data Warehouse, and Tibco Spotfire for data warehousing; Hadoop, Apache Airflow, Amazon EMR, Apache Spark, Google Dataproc, and Cloudera for data engineering; Apache Kafka, Apache Spark, Apache Flink, Amazon Kinesis, Azure Stream Analytics, Google Dataflow, and Confluent for streaming; Jupyter, Amazon SageMaker, Azure ML Studio, MATLAB, Domino Data Labs, SAS, TensorFlow, and PyTorch for data science and ML.

▪ Siloed data teams decrease productivity. Data analysts, data engineers, and data scientists each work in their own silo.
The Emergence of Data Lakes
Data Warehouses

(Diagram: structured data flows through ETL into a data warehouse that serves business intelligence.)

Pros
• Great for Business Intelligence (BI) applications

Cons
• Limited support for Machine Learning (ML) workloads
• Proprietary systems with only a SQL interface
The Emergence of Data Lakes
Data Lakes

▪ Really cheap, durable storage
  10 nines of durability. Cheap. Infinite scale.
▪ Store all types of raw data
  Video, audio, text, structured, unstructured.
▪ Open, standardized formats
  Parquet format; a big ecosystem of tools operates on these file formats.
Challenges with Data Lakes
1. Hard to append data
   Adding newly arrived data leads to incorrect reads.
2. Modification of existing data is difficult
   GDPR/CCPA require making fine-grained changes to the existing data lake.
3. Jobs failing midway
   Half of the data appears in the data lake; the rest is missing.
Challenges with Data Lakes
4. Real-time operations
   Mixing streaming and batch leads to inconsistency.
5. Costly to keep historical versions of the data
   Regulated environments require reproducibility, auditing, and governance.
6. Difficult to handle large metadata
   For large data lakes, the metadata itself becomes difficult to manage.
Challenges with Data Lakes
7. "Too many files" problems
   Data lakes are not great at handling millions of small files.
8. Hard to get great performance
   Partitioning the data for performance is error-prone and difficult to change.
9. Data quality issues
   It's a constant headache to ensure that all the data is correct and high quality.
How Delta Lake Enables the Lakehouse
A new standard for building data lakes
An opinionated approach to building data lakes

■ Adds reliability, quality, and performance to data lakes
■ Brings the best of data warehousing and data lakes
■ Based on open source and an open format (Parquet) - Delta Lake itself is also open source
Lakehouse
One platform to unify all of your data, analytics, and AI workloads - bringing the data lake and the data warehouse together.
An open approach to bringing data management and governance to data lakes

▪ Better reliability with transactions
▪ 48x faster data processing with indexing
▪ Data governance at scale with fine-grained access control lists
Challenges
1. Hard to append data
2. Modification of existing data difficult
3. Jobs failing midway
4. Real-time operations hard
5. Costly to keep historical data versions
6. Difficult to handle large metadata
7. "Too many files" problems
8. Poor performance
9. Data quality issues

Delta Lake addresses these one by one, starting with ACID transactions.
ACID Transactions
Make every operation transactional: it either fully succeeds, or it is fully aborted for later retries.

Every transaction is recorded as a commit in the table's transaction log:

/path/to/table/_delta_log
  0000.json
  0001.json
  0002.json
  0010.parquet
Each commit lists the actions taken on the table's data files. An append, for example, adds files:

  { Add file1.parquet
    Add file2.parquet }

while an update or delete removes files and adds their replacements:

  { Remove file1.parquet
    Add file2.parquet }
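A minimal sketch of how such commits arise in practice - the table name is hypothetical, assuming a Spark SQL session with Delta Lake:

-- Each statement below becomes one commit (one JSON file) in the table's _delta_log.
CREATE TABLE tx_demo (id INT, eventType STRING) USING DELTA;

INSERT INTO tx_demo VALUES (1, 'click'), (2, 'view');  -- commit: Add <new data file>

DELETE FROM tx_demo WHERE id = 1;                       -- commit: Remove <old file>, Add <rewritten file>

-- One row per commit: version, timestamp, operation, and operation metrics.
DESCRIBE HISTORY tx_demo;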
Review past transactions
All transactions are recorded, and you can go back in time to review previous versions of the data (i.e. time travel):

SELECT * FROM events
TIMESTAMP AS OF ...

SELECT * FROM events
VERSION AS OF ...
Powered by Spark
• Spark is built for handling large amounts of data
• All Delta Lake metadata is stored in open Parquet format
• Portions of it are cached and optimized for fast access
• Data and its metadata always co-exist
• No need to keep the catalog and the data in sync
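Because the metadata lives next to the data, a Delta table can be queried straight from its storage path with no separate catalog entry - a small sketch, path is hypothetical:

-- The _delta_log alongside the Parquet files is all Spark needs to resolve
-- the table's schema and its current list of data files.
SELECT * FROM delta.`/mnt/datalake/customers`;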
Indexing
Automatically optimize a layout that enables fast access

• Partitioning: layout for typical queries
• Data skipping: prune files based on statistics on numericals
• Z-ordering: layout to optimize multiple columns

OPTIMIZE events
ZORDER BY (eventType)
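For the partitioning point, a minimal sketch with hypothetical table and column names; the OPTIMIZE ... ZORDER BY command above then handles the layout within each partition:

-- Partition on a low-cardinality column that appears in typical query filters,
-- so whole partitions can be pruned at read time.
CREATE TABLE events_by_date (id INT, eventType STRING, date DATE)
USING DELTA
PARTITIONED BY (date);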
Schema validation and evolution
• All data in Delta tables has to adhere to a strict schema (star, etc.)
• Includes schema evolution in merge operations

MERGE INTO events
USING changes
ON events.id = changes.id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *
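One note on the evolution side: by default, schema enforcement rejects a MERGE whose source carries columns the target does not have. Delta Lake exposes a session setting that lets the merge add those columns instead - a sketch, not required for the MERGE above:

-- Allows MERGE/INSERT to add columns that exist in the source but not yet
-- in the target table (schema evolution); off by default.
SET spark.databricks.delta.schema.autoMerge.enabled = true;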
Lakehouse Design

Elements of Delta Lake
▪ Delta Architecture
▪ Delta Storage Layer
▪ Delta Engine
Delta architecture

DATA → Bronze (raw ingestion) → Silver (filtered, cleaned, augmented) → Gold (business-level aggregates) → Streaming Analytics, AI & Reporting

Data quality increases from Bronze to Gold.
The Delta architecture design pattern

Raw sources (CSV, JSON, TXT… and the data lake) → Bronze → Silver → Gold → Streaming Analytics, AI & Reporting

Data quality increases as data moves from Bronze to Gold.
Delta architecture - Bronze

Raw files (CSV, JSON, TXT…) and data lake sources land in the Bronze layer: raw ingestion.
Delta architecture - Silver

The Bronze layer feeds the Silver layer: filtered, cleaned, and augmented data.
Delta architecture - Gold

The Silver layer feeds the Gold layer: business-level aggregates that serve streaming analytics, AI, and reporting.
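A batch-flavoured sketch of the bronze → silver → gold pattern in Spark SQL - table names, paths, and columns are hypothetical:

-- Bronze: raw ingestion, schema as-landed.
CREATE TABLE bronze_events USING DELTA AS
SELECT * FROM json.`/mnt/raw/events/`;

-- Silver: filtered, cleaned, augmented.
CREATE TABLE silver_events USING DELTA AS
SELECT id, CAST(ts AS TIMESTAMP) AS ts, eventType
FROM bronze_events
WHERE id IS NOT NULL;

-- Gold: business-level aggregates for reporting.
CREATE TABLE gold_event_counts USING DELTA AS
SELECT eventType, date_trunc('DAY', ts) AS day, count(*) AS events
FROM silver_events
GROUP BY eventType, date_trunc('DAY', ts);

The same pattern can run as a streaming pipeline by reading and writing each layer with Structured Streaming instead of batch CTAS statements.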
Benefits of a Lakehouse
▪ Separation of compute and storage
▪ Infinite storage capacity
▪ Leverage best aspects of a data warehouse
▪ Low data gravity
▪ High data throughput
▪ No limits on data structure
▪ Mix batch and streaming workloads
Enterprise Architectures
Lambda Architecture

(Diagram: events arrive as a stream; a streaming path with stream validation and a batch path - one batch table with data written continuously, one with data compacted every hour - are kept in step through update & merge and reprocessing, and a unified view over them serves AI & Reporting.)
Data Mesh
Empowered by Lakehouse

LWD 00
LWD 01

Parquet to Delta Format
What does Parquet look like?

Parquet data
● Columnar storage
● Compression
● Designed for being read by distributed tasks

(Diagram: a customers table stored as part files part1-part4, each read by its own Spark task; distributed computation is followed by total result generation.)
What can go wrong with Parquet?

▪ Small file problem: many tiny part files mean many short tasks and a long overall duration.
▪ Big file (data skew) problem: one oversized part file means one long task dominates the overall duration.
▪ Corrupt data: a broken schema or a corrupt file makes its task fail, and the overall job fails with it.
▪ Goal: evenly sized, healthy part files so every task takes about the same time and the job succeeds.
Delta reliability and performance features

▪ Consistency: never read broken, unfinished, or wrong data - readers only see files referenced by the transaction log, and schema enforcement keeps bad writes (including from streams) out of the table.
▪ Optimizations on the fly: small files are compacted automatically, with no need for a complex pipeline.
▪ Direct updates and deletes: GDPR deletions and Change Data Capture (CDC & SCD) work directly on the table, again with no need for a complex pipeline.
▪ Time travel: implicit snapshots of the table at each version (V0, V1, V2, …).
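Since this section is about moving from Parquet to the Delta format: an existing Parquet directory can be upgraded in place by generating a transaction log for it - a sketch with a hypothetical path:

-- Writes a _delta_log describing the existing Parquet files; the data itself is not rewritten.
CONVERT TO DELTA parquet.`/mnt/raw/customers`;

-- For a partitioned layout, the partition columns must be declared:
-- CONVERT TO DELTA parquet.`/mnt/raw/customers` PARTITIONED BY (region STRING);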
LWD 02

Delta Components

Delta Lake Components

▪ Delta Tables
▪ Commit Service
▪ Delta Engine
Delta Tables

Delta Tables

A Delta table (e.g. customers) is a directory of Parquet part files (part1, part2, part3, part4) plus a transaction log (TX log). The transaction log provides the metadata layer for consistency; the data itself stays in Parquet.

https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html
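Both halves - the Parquet data files and the log-backed metadata - can be inspected for any table; a small sketch, assuming a table named customers exists:

-- Format, storage location, current number of data files, total size, and more.
DESCRIBE DETAIL customers;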
LWD 03

Commit Service

Transaction Log / Metadata
JSON transactions, Parquet checkpoints

Scalable metadata - log-structured storage
▪ Changes to the table are stored as ordered, atomic units called commits (000000.json, 000001.json, …) in a directory named _delta_log. Each commit lists actions such as Add 1.parquet, Add 2.parquet, Remove 1.parquet, Remove 2.parquet, Add 3.parquet.
▪ Commits also contain the schema & metadata (min/max statistics, etc.).

Serialized transactions
▪ Writers need to agree on the order of changes, even when there are multiple writers - e.g. User 1 and User 2 both trying to commit on top of 000011.json must be serialized into 000012.json and beyond.
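The Parquet checkpoint mentioned above is written periodically (every 10 commits by default) so readers don't have to replay every JSON commit; the cadence is exposed as a table property - a sketch with a hypothetical table name:

-- Checkpoint the log into a Parquet file every 10 commits (the default).
ALTER TABLE customers SET TBLPROPERTIES ('delta.checkpointInterval' = '10');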
Pessimistic vs. optimistic concurrency
▪ Optimistic concurrency - assume it'll be okay, then check
  ✔ Mutual exclusion is enough!
  ❌ Breaks down if there are a lot of conflicts
▪ Pessimistic concurrency - block others from conflicting (locks)
  ✔ Avoids wasted work
  ❌ Requires distributed locks
Solving conflicts optimistically
▪ Record the start version
▪ Record reads/writes
▪ If someone else wins, check if anything you read has changed
▪ Try again

(Example: User 1 reads the schema and writes A; User 2 reads the schema and writes B; because neither read what the other wrote, their commits are serialized one after another - 000001.json, 000002.json - on top of 000000.json.)
LWD 04

Delta Engine

Delta Engine - Optimize

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

Compaction with the OPTIMIZE command

(Diagram: a customers table made up of many small part files is rewritten into a few right-sized files, with the transaction log recording the change.)
Delta Engine - Auto Optimize
Automatic compaction (bin packing) solves the streaming small-file problem.

(Diagram: the many small files produced by a stream are automatically compacted into a few larger files in the customers table.)
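On Databricks, Auto Optimize is switched on per table through table properties - hypothetical table name, property names as documented for Databricks Delta tables:

-- Optimized writes bin-pack data before it is written; auto compaction merges
-- the small files left behind by streaming or frequent small batches.
ALTER TABLE customers SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact'   = 'true'
);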
Delta Engine - Z Order

(Diagram: the same 8x8 grid of (x, y) points, with x and y each running from 0 to 7, laid out in linear order vs. Z-order; the two layouts split the grid into part-files differently.)

Data skipping helps identify which part-files can and cannot be skipped.
Consider this query: SELECT * FROM points WHERE x = 2 OR y = 3
Delta Engine - Z Order
Read in each part-file and test for x = 2 or y = 3.

(Diagram: the part-files that must be read for the query, highlighted on the linear-order and Z-order layouts.)

Linear order: 9 files scanned in total 👎, 21 false positives 👎
Z-order: 7 files scanned in total 👍, 13 false positives 👍
What is data skipping?
▪ Simple, well-known I/O pruning technique used by many DBMSes and Big Data systems
▪ Idea: track file-level stats like min & max and leverage them to avoid scanning irrelevant files
▪ Example - collect per-file stats:

SELECT input_file_name() AS "file_name",
       min(col) AS "col_min",
       max(col) AS "col_max"
FROM table
GROUP BY input_file_name()

file_name  col_min  col_max
file1      6        8
file2      3        10
file3      1        4

▪ A query such as SELECT * FROM table WHERE col = 5 is then answered by first pruning files with those stats:

SELECT file_name FROM index
WHERE col_min <= 5 AND col_max >= 5

Only file2 (min 3, max 10) can contain col = 5, so file1 and file3 are skipped.
Delta performance features
▪ Optimize
  Command that bin-packs files to the right size.
▪ Auto-optimize
  Small and large files are compacted so that data lake applications experience consistently great performance and scalability.
▪ Scalable writes
  Fine-grained conflict resolution allowing multiple writers to succeed.
▪ Data skipping
  Improves read performance by only reading subsets of the files.
▪ Z-Order
  Clusters files in a way that enables data skipping for multi-dimensional filters.
▪ Bloom filters
  Improves read performance by only reading subsets of the files that have data matching users' filters (see the sketch after this list).
▪ Caching
  Automatically caches input Delta (and Parquet) tables, improving read throughput by 2x to 10x.
▪ Skewed join
  Supports joining two datasets with severe data skew, a problem with a lot of real-world datasets. Needed at scale.
▪ Range join (time-series data)
  Supports joining two datasets based on overlapping ranges, such as time-series analysis.
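For the Bloom filter item, Databricks exposes a dedicated index DDL - a sketch with hypothetical table, column, and option values:

-- Skips files whose Bloom filter proves they cannot contain the filtered value;
-- useful for point lookups on high-cardinality columns.
CREATE BLOOMFILTER INDEX ON TABLE customers
FOR COLUMNS (email OPTIONS (fpp = 0.1, numItems = 50000000));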
Summary

Delta Lake Summary
▪ Core component of a Lakehouse architecture
▪ Offers guaranteed consistency because it's ACID compliant
▪ Robust data store
▪ Designed to work with Apache Spark
Thank you