Welcome!
Introduction to Data Engineering with Delta
Your Hosts for Today
Joel Roland, SA Manager, ANZ
Noble Raveendran, Solutions Architect
Agenda
▪ Overview of Data Engineering powered by Delta Lake
▪ Demo
▪ Q&A
Housekeeping
▪ If you have questions during the session, please post them in the Chat Window
▪ We will have a number of polls during the event - they will pop up, so please respond when they do
The Data Engineer’s Journey...
[Diagram: events arrive as streams (with validation) and as batch tables - data written continuously, compacted every hour - and are merged into a unified view for AI & reporting; updates, merges, and reprocessing get complex with a plain data lake.]
The Data Engineer’s Journey...
Can this be simplified?
[Same diagram: streams, batch tables, validation, updates & merge, and reprocessing feeding a unified view for AI & reporting.]
A Data Engineer’s Dream...
Process data continuously and incrementally as new data arrives, in a cost-efficient way, without having to choose between batch and streaming.
[Diagram: Kinesis and CSV/JSON/TXT files land in the data lake and flow on to AI & reporting.]
What’s missing?
[Same diagram, with a question mark between the data lake and AI & reporting.]
• Ability to read consistent data while data is being written
• Ability to read incrementally from a large table with good throughput
• Ability to rollback in case of bad writes
• Ability to replay historical data alongside newly arrived data
• Ability to handle late-arriving data without having to delay downstream processing
So… What is the answer?
STRUCTURED STREAMING + Delta Lake = Delta Architecture
1. Unify batch & streaming with a continuous data flow model
2. Infinite retention to replay/reprocess historical events as needed
3. Independent, elastic compute and storage to scale while balancing costs
Let’s try it instead with Delta Lake
Well… what is Delta Lake?
Data reliability challenges with data lakes
✗ Failed production jobs leave data in a corrupt state, requiring tedious recovery
✗ Lack of schema enforcement creates inconsistent and low-quality data
✗ Lack of consistency makes it almost impossible to mix appends and reads, batch and streaming
Performance challenges with data lakes
• Too many small or very big files - more time is spent opening and closing files than reading their contents (worse with streaming).
• Partitioning, aka "poor man's indexing", breaks down if you picked the wrong fields or when data has many dimensions or high-cardinality columns.
• No caching - cloud object storage throughput is low (20-50 MB/s/core vs 300 MB/s/core for local SSDs).
A New Standard for Building Data Lakes
Open Format Based on Parquet
With Transactions
Apache Spark APIs
Delta Lake: makes data ready for analytics
Reliability and performance make data ready for Data Science & ML use cases:
• Recommendation Engines
• Risk, Fraud Detection
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
Delta Lake ensures data reliability
[Diagram: batch, streaming, and updates/deletes all land in Parquet files plus a transactional log, producing high-quality, reliable data that is always ready for analytics.]
Key Features
● ACID Transactions ● Unified Batch & Streaming
● Schema Enforcement ● Time Travel / Data Snapshots
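To make these features concrete, here is a minimal PySpark sketch (not from the deck): the table path, schema, and session configuration are assumptions, and on Databricks the Delta configs below are already set for you.

from pyspark.sql import SparkSession

# Assumption: the delta-spark package is installed; on Databricks this builder config is unnecessary.
spark = (SparkSession.builder
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# A batch write creates the Delta table (Parquet files + transactional log) and pins its schema.
events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "eventType"])
events.write.format("delta").mode("append").save("/tmp/delta/events")   # hypothetical path

# Schema enforcement: an append with an unexpected column is rejected.
bad = spark.createDataFrame([(3, "click", "oops")], ["id", "eventType", "unexpected"])
try:
    bad.write.format("delta").mode("append").save("/tmp/delta/events")
except Exception as err:
    print("Rejected by schema enforcement:", err)

# Unified batch & streaming: the same table can also be read as a streaming source.
events_stream = spark.readStream.format("delta").load("/tmp/delta/events")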
Delta Lake optimizes performance
[Diagram: the Databricks optimized engine on top of Parquet files plus a transactional log delivers highly performant queries at scale.]
Key Features
● Indexing ● Data skipping
● Compaction ● Caching
Now, let’s try it with Delta Lake
Data Quality Levels
[Diagram: Kinesis and CSV/JSON/TXT files from the data lake flow through Bronze (raw ingestion), Silver (filtered, cleaned, augmented), and Gold (business-level aggregates) tables into streaming analytics and AI & reporting; quality increases at each step.]
Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.
What does this remind you of?
Data Lifecycle
[Same diagram, annotated step by step with the traditional data lifecycle: OLTP sources, then Staging, then DW/OLAP, lining up with Bronze, Silver, and Gold.]
Data Lifecycle 🡪 Delta Lake Lifecycle
Data Quality Levels
[Same Bronze → Silver → Gold diagram: raw ingestion → filtered, cleaned, augmented → business-level aggregates, with quality increasing at each step.]
Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.
OLTP
[Diagram with the Bronze (raw ingestion) table highlighted.]
• Dumping ground for raw data
• Often with long retention (years)
• Avoid error-prone parsing
Staging
[Diagram with the Silver (filtered, cleaned, augmented) table highlighted.]
Intermediate data with some cleanup applied
Query-able for easy debugging!
DW/OLAP
[Diagram with the Gold (business-level aggregates) table highlighted.]
Clean data, ready for consumption
Read with Spark and other SQL query engines
Streams move data through the Delta Lake
[Diagram: streaming arrows connect the Bronze, Silver, and Gold tables on the way to streaming analytics and AI & reporting.]
• Low-latency or manually triggered
• Eliminates management of schedules and jobs
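A minimal sketch of one such stream (Bronze to Silver); the paths, checkpoint location, and eventType column are hypothetical, and spark is assumed to be a Delta-enabled SparkSession (e.g. in a Databricks notebook).

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()   # assumes Delta is already configured

bronze_path = "/delta/bronze/events"         # hypothetical locations
silver_path = "/delta/silver/events"

(spark.readStream
      .format("delta")                              # read the Bronze table incrementally
      .load(bronze_path)
      .where(F.col("eventType").isNotNull())        # light validation on the way to Silver
      .writeStream
      .format("delta")
      .option("checkpointLocation", "/delta/checkpoints/bronze_to_silver")
      .trigger(once=True)                           # manually triggered; drop this for continuous, low-latency runs
      .start(silver_path))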
Delta Lake also supports batch jobs and standard DML
[Diagram: OVERWRITE, MERGE, INSERT, UPDATE, and DELETE operations applied across the Bronze, Silver, and Gold tables.]
• Retention • GDPR
• Corrections • UPSERTS
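A sketch of the same DML through the Delta Lake Python API; the table path, column names, and predicates are hypothetical, and spark is assumed to be a Delta-enabled SparkSession.

from delta.tables import DeltaTable

customers = DeltaTable.forPath(spark, "/delta/silver/customers")   # hypothetical table

# Retention / GDPR-style delete of a single customer's rows.
customers.delete("customerId = 'c-123'")

# Correction: fix a bad value in place.
customers.update(
    condition="country = 'Austrlia'",
    set={"country": "'Australia'"})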
[Diagram: DELETE applied to the intermediate tables.]
Easy to recompute when business logic changes:
• Clear tables
• Restart streams
Delta Architecture
Connecting the dots...
[Diagram: Kinesis and CSV/JSON/TXT files land in the data lake and flow on to AI & reporting.]
• Ability to read consistent data while data is being written → Snapshot isolation between writers and readers
• Ability to read incrementally from a large table with good throughput → Optimized file source with scalable metadata handling
• Ability to rollback in case of bad writes → Time travel
• Ability to replay historical data alongside newly arrived data → Stream the backfilled historical data through the same pipeline
• Ability to handle late-arriving data without having to delay downstream processing → Stream late-arriving data as it gets added to the table
Get Started with Delta using Spark APIs
Instead of parquet...
CREATE TABLE ...
USING parquet
...
dataframe
  .write
  .format("parquet")
  .save("/data")

… simply say delta
CREATE TABLE ...
USING delta
...
dataframe
  .write
  .format("delta")
  .save("/data")
Using Delta with your Existing Parquet Tables
Step 1: Convert Parquet to Delta Tables
CONVERT TO DELTA parquet.`path/to/table` [NO STATISTICS]
[PARTITIONED BY (col_name1 col_type1, col_name2 col_type2, ...)]
Step 2: Optimize Layout for Fast Queries
OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)
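The same two steps can be driven from a notebook with the Delta Lake Python API and spark.sql; the path reuses the placeholder above, the partition-schema hint is an assumption, and OPTIMIZE/ZORDER is a Databricks-engine command.

from delta.tables import DeltaTable

# Step 1 equivalent; for a partitioned table, pass the partition schema as a third argument, e.g. "date DATE".
DeltaTable.convertToDelta(spark, "parquet.`path/to/table`")

# Step 2 equivalent, run as SQL from the notebook.
spark.sql("OPTIMIZE events ZORDER BY (eventType)")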
Upsert/Merge: Fine-grained Updates
MERGE INTO customers -- Delta table
USING updates
ON customers.customerId = updates.customerId
WHEN MATCHED THEN
UPDATE SET address = updates.address
WHEN NOT MATCHED THEN
INSERT (customerId, address) VALUES (updates.customerId, updates.address)
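The equivalent upsert with the Delta Lake Python API; DeltaTable.forName requires Delta Lake 0.7.0+, and the updates source is assumed to be a registered table or view.

from delta.tables import DeltaTable

customers = DeltaTable.forName(spark, "customers")
updates_df = spark.table("updates")

(customers.alias("customers")
    .merge(updates_df.alias("updates"),
           "customers.customerId = updates.customerId")
    .whenMatchedUpdate(set={"address": "updates.address"})
    .whenNotMatchedInsert(values={"customerId": "updates.customerId",
                                  "address": "updates.address"})
    .execute())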
Time Travel
Reproduce experiments & reports:
SELECT count(*) FROM events TIMESTAMP AS OF timestamp
SELECT count(*) FROM events VERSION AS OF version
spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/events/")

Rollback accidental bad writes:
INSERT INTO my_table
SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
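Two more ways to use time travel from Python, assuming the same /events/ table; the version number and selected columns are illustrative.

# Read a specific version instead of a timestamp.
old_events = spark.read.format("delta").option("versionAsOf", 3).load("/events/")

# Inspect the commit history to find the version/timestamp worth rolling back to.
from delta.tables import DeltaTable
(DeltaTable.forPath(spark, "/events/")
    .history()
    .select("version", "timestamp", "operation")
    .show())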
Databricks Ingest: Auto Loader
Load new data easily and efficiently as it arrives in cloud storage
Before: per-source plumbing with a notification service and message queue for streaming, or batch jobs on a delayed schedule with external triggers and Airflow file sensors - gets too complicated for multiple jobs.
After: Auto Loader
● Pipe data from cloud storage into Delta Lake as it arrives
● "Set and forget" model eliminates complex setup
(See the Auto Loader launch blog post.)
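A minimal Auto Loader sketch, assuming a Databricks notebook (the cloudFiles source is Databricks-specific) and hypothetical landing path, schema, and checkpoint location.

from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([StructField("id", LongType()),
                     StructField("eventType", StringType())])

(spark.readStream
      .format("cloudFiles")                    # Auto Loader source
      .option("cloudFiles.format", "json")     # format of the incoming files
      .schema(schema)
      .load("s3://my-bucket/landing/")         # hypothetical landing location
      .writeStream
      .format("delta")
      .option("checkpointLocation", "/delta/checkpoints/autoloader")
      .start("/delta/bronze/events"))          # pipe into the Bronze Delta table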
Databricks Ingest: Data Ingestion Network
All Your Application, Database, and File Storage Data in your Delta Lake
Delta Lake Connectors
Standardize your big data storage with an open format accessible from various tools, e.g. Amazon Redshift and Amazon Athena.
Databricks Notebooks are a powerful developer environment
to enable Dashboards, Workflows, and Jobs.
Databricks Notebooks: Collaborative, multi-language, enterprise-ready developer environment.
Notebook Dashboards: Create interactive dashboards based on Notebooks with just a few clicks.
Notebook Workflows: Orchestrate multi-stage pipelines based on notebook dependencies.
Notebook Jobs: Schedule, orchestrate, and monitor execution of notebooks.
Notebook Workflows enable orchestration of multi-stage pipelines based on notebook dependencies.
Workflow Definition: APIs allow flexible definition of notebook dependencies, including parallel execution of notebooks.
Workflow Execution: The Databricks Job scheduler manages the execution of notebook workflows.
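A sketch of what such a workflow can look like in code; the notebook paths and parameters are hypothetical, and dbutils is only available inside a Databricks notebook.

from concurrent.futures import ThreadPoolExecutor

# Run one stage and pass parameters; returns whatever the child notebook passes to dbutils.notebook.exit().
ingest_result = dbutils.notebook.run("/pipelines/ingest_bronze", 600, {"date": "2020-06-01"})

# Parallel execution of independent notebooks using plain Python concurrency.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(dbutils.notebook.run, path, 600, {"date": "2020-06-01"})
               for path in ["/pipelines/clean_silver", "/pipelines/aggregate_gold"]]
    results = [f.result() for f in futures]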
Notebook Jobs enable scheduling and monitoring
of Notebooks.
Turn Notebooks into Jobs: Any Databricks Notebook or Notebook Workflow can easily be turned into a job.
Schedule Jobs: Jobs can be configured to execute on a schedule (e.g. daily, hourly).
Orchestrate Jobs: In addition to Notebook Workflows, Jobs can also be orchestrated using third-party tools like Airflow (see the sketch below).
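For the Airflow route, a hedged sketch using the Databricks provider package; the job id, connection name, DAG name, and schedule are assumptions.

from datetime import datetime
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

with DAG("nightly_delta_pipeline", start_date=datetime(2020, 6, 1),
         schedule_interval="@daily", catchup=False) as dag:
    run_job = DatabricksRunNowOperator(
        task_id="run_notebook_job",
        databricks_conn_id="databricks_default",  # connection configured in Airflow
        job_id=42)                                # hypothetical Databricks Job id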
Users of Delta Lake
● Scale: 2 regions, 10 workspaces, 100+ users, 50+ scheduled jobs, 1,000+ notebooks, scores of ML models
● Use cases: finance, assortment, fresh sales tool, fraud engine, personalization
● Improved reliability: petabyte-scale jobs
● 10x lower compute: 640 instances down to 64!
● Faster iterations: multiple weeks to 5-minute deploys!
● Improved performance: queries run faster, >1 hr → <6 sec
● Easier transactional updates: no downtime or consistency issues!
● Simple CDC: easy with MERGE
● Data consistency and integrity: not available before
● Increased data quality: name-match accuracy up from 80% to 95%
● Faster data loads: 24 hours → 20 mins
Databricks Cluster
Demo
Q&A
Your feedback is appreciated
First 20 responses will receive a $10 UberEats voucher
TWO LIVE SESSIONS:
Australia & NZ: 9.00am - 11.30am AEST
Singapore & India: 1.00pm - 3.30pm SGT | 10.30am - 1.00pm IST
Hear the data innovation stories of Atlassian, Coles Group, Grab, Tabcorp, and more.
Register now: https://databricks.com/p/event/data-ai-tour
THE FREE VIRTUAL EVENT FOR DATA TEAMS
June 22-26 | Organized by Databricks
● Extended to 5 days with over 200 sessions
● 4x the pre-conference training
● Keynotes by visionaries and thought leaders
https://databricks.com/sparkaisummit/north-america-2020