Welcome!
Introduction to Data Engineering with Delta
Your Hosts for Today
Joel Roland, SA Manager, ANZ
Noble Raveendran, Solutions Architect
Agenda
▪ Overview of Data Engineering powered by Delta Lake
▪ Demo
▪ Q&A
Housekeeping
▪ If you have questions during the session, please post them in the Chat Window
▪ We will have a number of polls during the event - they will pop up, so please respond when they do
The Data Engineer’s Journey...
[Diagram: events arrive as streams (with validation) and as batch tables - data written continuously, compacted every hour - and are merged into a unified view for AI & reporting; updates, merges, and reprocessing get complex with a plain data lake.]
The Data Engineer’s Journey...
Can this be simplified?
[Same diagram: streams, batch tables, validation, updates & merge, and reprocessing feeding a unified view for AI & reporting.]
A Data Engineer’s Dream...
Process data continuously and incrementally as new data arrives, in a cost-efficient way, without having to choose between batch and streaming.
[Diagram: Kinesis and CSV/JSON/TXT files land in the data lake and flow on to AI & reporting.]
What’s missing?
[Same diagram, with a question mark between the data lake and AI & reporting.]
• Ability to read consistent data while data is being written
• Ability to read incrementally from a large table with good throughput
• Ability to rollback in case of bad writes
• Ability to replay historical data alongside newly arrived data
• Ability to handle late-arriving data without having to delay downstream processing
So… What is the answer?
STRUCTURED STREAMING + Delta Lake = Delta Architecture
1. Unify batch & streaming with a continuous data flow model
2. Infinite retention to replay/reprocess historical events as needed
3. Independent, elastic compute and storage to scale while balancing costs
Let’s try it instead with Delta Lake
Well… what is Delta Lake?
Data reliability challenges with data lakes
✗ Failed production jobs leave data in a corrupt state, requiring tedious recovery
✗ Lack of schema enforcement creates inconsistent and low-quality data
✗ Lack of consistency makes it almost impossible to mix appends and reads, batch and streaming
Performance challenges with data lakes
• Too many small or very big files - more time is spent opening and closing files than reading their contents (worse with streaming).
• Partitioning, aka "poor man's indexing", breaks down if you picked the wrong fields or when data has many dimensions or high-cardinality columns.
• No caching - cloud object storage throughput is low (20-50 MB/s/core vs 300 MB/s/core for local SSDs).
A New Standard for Building Data Lakes
Open Format Based on Parquet
With Transactions
Apache Spark APIs
Delta Lake: makes data ready for analytics
Reliability and performance make data ready for Data Science & ML use cases:
• Recommendation Engines
• Risk, Fraud Detection
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
Delta Lake ensures data reliability
[Diagram: batch, streaming, and updates/deletes all land in Parquet files plus a transactional log, producing high-quality, reliable data that is always ready for analytics.]
Key Features
● ACID Transactions ● Unified Batch & Streaming
● Schema Enforcement ● Time Travel / Data Snapshots
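To make these features concrete, here is a minimal PySpark sketch (not from the deck): the table path, schema, and session configuration are assumptions, and on Databricks the Delta configs below are already set for you.

from pyspark.sql import SparkSession

# Assumption: the delta-spark package is installed; on Databricks this builder config is unnecessary.
spark = (SparkSession.builder
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# A batch write creates the Delta table (Parquet files + transactional log) and pins its schema.
events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "eventType"])
events.write.format("delta").mode("append").save("/tmp/delta/events")   # hypothetical path

# Schema enforcement: an append with an unexpected column is rejected.
bad = spark.createDataFrame([(3, "click", "oops")], ["id", "eventType", "unexpected"])
try:
    bad.write.format("delta").mode("append").save("/tmp/delta/events")
except Exception as err:
    print("Rejected by schema enforcement:", err)

# Unified batch & streaming: the same table can also be read as a streaming source.
events_stream = spark.readStream.format("delta").load("/tmp/delta/events")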
Delta Lake optimizes performance
[Diagram: the Databricks optimized engine on top of Parquet files plus a transactional log delivers highly performant queries at scale.]
Key Features
● Indexing ● Data skipping
● Compaction ● Caching
Now, let’s try it with Delta Lake
Data Quality Levels
[Diagram: Kinesis and CSV/JSON/TXT files from the data lake flow through Bronze (raw ingestion), Silver (filtered, cleaned, augmented), and Gold (business-level aggregates) tables into streaming analytics and AI & reporting; quality increases at each step.]
Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.
What does this remind you of?
Data Lifecycle
[Same diagram, annotated step by step with the traditional data lifecycle: OLTP sources, then Staging, then DW/OLAP, lining up with Bronze, Silver, and Gold.]
Data Lifecycle 🡪 Delta Lake Lifecycle
Data Quality Levels
[Same Bronze → Silver → Gold diagram: raw ingestion → filtered, cleaned, augmented → business-level aggregates, with quality increasing at each step.]
Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.
OLTP
[Diagram with the Bronze (raw ingestion) table highlighted.]
• Dumping ground for raw data
• Often with long retention (years)
• Avoid error-prone parsing
Staging
[Diagram with the Silver (filtered, cleaned, augmented) table highlighted.]
Intermediate data with some cleanup applied
Query-able for easy debugging!
DW/OLAP
[Diagram with the Gold (business-level aggregates) table highlighted.]
Clean data, ready for consumption
Read with Spark and other SQL query engines
Streams move data through the Delta Lake
[Diagram: streaming arrows connect the Bronze, Silver, and Gold tables on the way to streaming analytics and AI & reporting.]
• Low-latency or manually triggered
• Eliminates management of schedules and jobs
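A minimal sketch of one such stream (Bronze to Silver); the paths, checkpoint location, and eventType column are hypothetical, and spark is assumed to be a Delta-enabled SparkSession (e.g. in a Databricks notebook).

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()   # assumes Delta is already configured

bronze_path = "/delta/bronze/events"         # hypothetical locations
silver_path = "/delta/silver/events"

(spark.readStream
      .format("delta")                              # read the Bronze table incrementally
      .load(bronze_path)
      .where(F.col("eventType").isNotNull())        # light validation on the way to Silver
      .writeStream
      .format("delta")
      .option("checkpointLocation", "/delta/checkpoints/bronze_to_silver")
      .trigger(once=True)                           # manually triggered; drop this for continuous, low-latency runs
      .start(silver_path))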
Delta Lake also supports batch jobs and standard DML
[Diagram: OVERWRITE, MERGE, INSERT, UPDATE, and DELETE operations applied across the Bronze, Silver, and Gold tables.]
• Retention • GDPR
• Corrections • UPSERTS
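A sketch of the same DML through the Delta Lake Python API; the table path, column names, and predicates are hypothetical, and spark is assumed to be a Delta-enabled SparkSession.

from delta.tables import DeltaTable

customers = DeltaTable.forPath(spark, "/delta/silver/customers")   # hypothetical table

# Retention / GDPR-style delete of a single customer's rows.
customers.delete("customerId = 'c-123'")

# Correction: fix a bad value in place.
customers.update(
    condition="country = 'Austrlia'",
    set={"country": "'Australia'"})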
[Diagram: DELETE applied to the intermediate tables.]
Easy to recompute when business logic changes:
• Clear tables
• Restart streams
Delta Architecture
Connecting the dots...
[Diagram: Kinesis and CSV/JSON/TXT files land in the data lake and flow on to AI & reporting.]
• Ability to read consistent data while data is being written → Snapshot isolation between writers and readers
• Ability to read incrementally from a large table with good throughput → Optimized file source with scalable metadata handling
• Ability to rollback in case of bad writes → Time travel
• Ability to replay historical data alongside newly arrived data → Stream the backfilled historical data through the same pipeline
• Ability to handle late-arriving data without having to delay downstream processing → Stream late-arriving data as it gets added to the table
Get Started with Delta using Spark APIs
Instead of parquet...
CREATE TABLE ...
USING parquet
...
dataframe
  .write
  .format("parquet")
  .save("/data")

… simply say delta
CREATE TABLE ...
USING delta
...
dataframe
  .write
  .format("delta")
  .save("/data")
Using Delta with your Existing Parquet Tables
Step 1: Convert Parquet to Delta Tables
CONVERT TO DELTA parquet.`path/to/table` [NO STATISTICS]
[PARTITIONED BY (col_name1 col_type1, col_name2 col_type2, ...)]
Step 2: Optimize Layout for Fast Queries
OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)
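The same two steps can be driven from a notebook with the Delta Lake Python API and spark.sql; the path reuses the placeholder above, the partition-schema hint is an assumption, and OPTIMIZE/ZORDER is a Databricks-engine command.

from delta.tables import DeltaTable

# Step 1 equivalent; for a partitioned table, pass the partition schema as a third argument, e.g. "date DATE".
DeltaTable.convertToDelta(spark, "parquet.`path/to/table`")

# Step 2 equivalent, run as SQL from the notebook.
spark.sql("OPTIMIZE events ZORDER BY (eventType)")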
Upsert/Merge: Fine-grained Updates
MERGE INTO customers -- Delta table
USING updates
ON customers.customerId = updates.customerId
WHEN MATCHED THEN
UPDATE SET address = updates.address
WHEN NOT MATCHED THEN
INSERT (customerId, address) VALUES (updates.customerId, updates.address)
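The equivalent upsert with the Delta Lake Python API; DeltaTable.forName requires Delta Lake 0.7.0+, and the updates source is assumed to be a registered table or view.

from delta.tables import DeltaTable

customers = DeltaTable.forName(spark, "customers")
updates_df = spark.table("updates")

(customers.alias("customers")
    .merge(updates_df.alias("updates"),
           "customers.customerId = updates.customerId")
    .whenMatchedUpdate(set={"address": "updates.address"})
    .whenNotMatchedInsert(values={"customerId": "updates.customerId",
                                  "address": "updates.address"})
    .execute())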
Time Travel
Reproduce experiments & reports:
SELECT count(*) FROM events TIMESTAMP AS OF timestamp
SELECT count(*) FROM events VERSION AS OF version
spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/events/")

Rollback accidental bad writes:
INSERT INTO my_table
SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
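Two more ways to use time travel from Python, assuming the same /events/ table; the version number and selected columns are illustrative.

# Read a specific version instead of a timestamp.
old_events = spark.read.format("delta").option("versionAsOf", 3).load("/events/")

# Inspect the commit history to find the version/timestamp worth rolling back to.
from delta.tables import DeltaTable
(DeltaTable.forPath(spark, "/events/")
    .history()
    .select("version", "timestamp", "operation")
    .show())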
Databricks Ingest: Auto Loader
Load new data easily and efficiently as it arrives in cloud storage
Before: per-source plumbing with a notification service and message queue for streaming, or batch jobs on a delayed schedule with external triggers and Airflow file sensors - gets too complicated for multiple jobs.
After: Auto Loader
● Pipe data from cloud storage into Delta Lake as it arrives
● "Set and forget" model eliminates complex setup
(See the Auto Loader launch blog post.)
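A minimal Auto Loader sketch, assuming a Databricks notebook (the cloudFiles source is Databricks-specific) and hypothetical landing path, schema, and checkpoint location.

from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([StructField("id", LongType()),
                     StructField("eventType", StringType())])

(spark.readStream
      .format("cloudFiles")                    # Auto Loader source
      .option("cloudFiles.format", "json")     # format of the incoming files
      .schema(schema)
      .load("s3://my-bucket/landing/")         # hypothetical landing location
      .writeStream
      .format("delta")
      .option("checkpointLocation", "/delta/checkpoints/autoloader")
      .start("/delta/bronze/events"))          # pipe into the Bronze Delta table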
Databricks Ingest: Data Ingestion Network
All Your Application, Database, and File Storage Data in your Delta Lake
Delta Lake Connectors
Standardize your big data storage with an open format accessible from various tools, e.g. Amazon Redshift and Amazon Athena.
Databricks Notebooks are a powerful developer environment
to enable Dashboards, Workflows, and Jobs.
Databricks Notebooks: Collaborative, multi-language, enterprise-ready developer environment.
Notebook Dashboards: Create interactive dashboards based on Notebooks with just a few clicks.
Notebook Workflows: Orchestrate multi-stage pipelines based on notebook dependencies.
Notebook Jobs: Schedule, orchestrate, and monitor execution of notebooks.
Notebook Workflows enable orchestration of multi-stage pipelines based on notebook dependencies.
Workflow Definition: APIs allow flexible definition of notebook dependencies, including parallel execution of notebooks.
Workflow Execution: The Databricks Job scheduler manages the execution of notebook workflows.
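A sketch of what such a workflow can look like in code; the notebook paths and parameters are hypothetical, and dbutils is only available inside a Databricks notebook.

from concurrent.futures import ThreadPoolExecutor

# Run one stage and pass parameters; returns whatever the child notebook passes to dbutils.notebook.exit().
ingest_result = dbutils.notebook.run("/pipelines/ingest_bronze", 600, {"date": "2020-06-01"})

# Parallel execution of independent notebooks using plain Python concurrency.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(dbutils.notebook.run, path, 600, {"date": "2020-06-01"})
               for path in ["/pipelines/clean_silver", "/pipelines/aggregate_gold"]]
    results = [f.result() for f in futures]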
Notebook Jobs enable scheduling and monitoring
of Notebooks.
Turn Notebooks into Jobs: Any Databricks Notebook or Notebook Workflow can easily be turned into a job.
Schedule Jobs: Jobs can be configured to execute on a schedule (e.g. daily, hourly).
Orchestrate Jobs: In addition to Notebook Workflows, Jobs can also be orchestrated using third-party tools like Airflow (see the sketch below).
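For the Airflow route, a hedged sketch using the Databricks provider package; the job id, connection name, DAG name, and schedule are assumptions.

from datetime import datetime
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

with DAG("nightly_delta_pipeline", start_date=datetime(2020, 6, 1),
         schedule_interval="@daily", catchup=False) as dag:
    run_job = DatabricksRunNowOperator(
        task_id="run_notebook_job",
        databricks_conn_id="databricks_default",  # connection configured in Airflow
        job_id=42)                                # hypothetical Databricks Job id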
Users of Delta Lake
● Scale: 2 regions, 10 workspaces, 100+ users, 50+ scheduled jobs, 1,000+ notebooks, scores of ML models
● Use cases: finance, assortment, fresh sales tool, fraud engine, personalization
● Improved reliability: petabyte-scale jobs
● 10x lower compute: 640 instances down to 64!
● Faster iterations: multiple weeks to 5-minute deploys!
● Improved performance: queries run faster, >1 hr → <6 sec
● Easier transactional updates: no downtime or consistency issues!
● Simple CDC: easy with MERGE
● Data consistency and integrity: not available before
● Increased data quality: name-match accuracy up from 80% to 95%
● Faster data loads: 24 hours → 20 mins
Databricks Cluster
Demo
Q&A
Your feedback is appreciated
First 20 responses will receive a $10 UberEats voucher
TWO LIVE SESSIONS:
Australia & NZ: 9.00am - 11.30am AEST
Singapore & India: 1.00pm - 3.30pm SGT | 10.30am - 1.00pm IST
Hear the data innovation stories of Atlassian, Coles Group, Grab, Tabcorp, and more.
Register now: https://databricks.com/p/event/data-ai-tour
THE FREE VIRTUAL EVENT FOR DATA TEAMS
June 22-26 | Organized by Databricks
● Extended to 5 days with over 200 sessions
● 4x the pre-conference training
● Keynotes by visionaries and thought leaders
https://databricks.com/sparkaisummit/north-america-2020