Manually testing a data pipeline
ETL and ELT in Python
Jake Roach
Data Engineer
Testing data pipelines
Data pipelines should be thoroughly tested:
- Validate that data is extracted, transformed, and loaded as expected
- Validating pipelines limits maintenance efforts after deployment
- Identify and fix data quality issues
- Improve data reliability

Tools and techniques to test data pipelines:
- End-to-end testing
- Validating data at "checkpoints"
- Unit testing
Testing and production environments
Testing a pipeline end-to-end
End-to-end testing:
- Confirm that the pipeline runs on repeated attempts
- Validate data at pipeline checkpoints
- Engage in peer review and incorporate feedback
- Ensure consumers can access and are satisfied with the solution
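A minimal sketch of such an end-to-end check, assuming the extract, transform, and load functions from the pipeline module used later in this lesson: it runs the pipeline on repeated attempts and validates a checkpoint after each stage.

import pandas as pd
from pipeline import extract, transform, load

# Run the full pipeline twice to confirm it behaves on repeated attempts
for attempt in range(2):
    raw_stock_data = extract("raw_stock_data.csv")
    clean_stock_data = transform(raw_stock_data)
    load(clean_stock_data)

    # Checkpoint validation after extraction and transformation
    assert isinstance(raw_stock_data, pd.DataFrame)
    assert isinstance(clean_stock_data, pd.DataFrame)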
Validating pipeline checkpoints
# Extract, transform, and load data as part of a pipeline
...

# Take a look at the data made available in a Postgres database
loaded_data = pd.read_sql("SELECT * FROM clean_stock_data", con=db_engine)

print(loaded_data.shape)

(6438, 4)

print(loaded_data.head())

timestamps           volume      open      close
1997-05-15 13:30:00  1443120000  0.121875  0.097917
1997-05-16 13:30:00  294000000   0.098438  0.086458
1997-05-19 13:30:00  122136000   0.088021  0.085417
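The snippet above assumes a db_engine already exists. A minimal sketch of building one with SQLAlchemy (the connection string below is hypothetical):

import sqlalchemy

# Hypothetical Postgres connection string; replace with real credentials
db_engine = sqlalchemy.create_engine(
    "postgresql+psycopg2://user:password@localhost:5432/stock_db"
)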
Validating DataFrames
# Extract, transform, and load data as part of a pipeline
...

# Take a look at the data made available in a Postgres database
loaded_data = pd.read_sql("SELECT * FROM clean_stock_data", con=db_engine)

# Compare the two DataFrames
print(clean_stock_data.equals(loaded_data))

True
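.equals() only returns a boolean. When a mismatch needs diagnosing, pandas also provides pandas.testing.assert_frame_equal, which raises an AssertionError describing the first difference:

import pandas.testing as pdt

# Raises an informative AssertionError if the DataFrames differ
pdt.assert_frame_equal(clean_stock_data, loaded_data)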
Let's practice!
Unit-testing a data pipeline
Validating a data pipeline with unit tests
Unit tests:
- Commonly used in software engineering workflows
- Ensure code works as expected
- Help to validate data
pytest for unit testing
import pandas as pd
from pipeline import extract, transform, load

# Build a unit test, asserting the type of clean_stock_data
def test_transformed_data():
    raw_stock_data = extract("raw_stock_data.csv")
    clean_stock_data = transform(raw_stock_data)
    assert isinstance(clean_stock_data, pd.DataFrame)

> python -m pytest

test_transformed_data .                                                    [100%]
================================ 1 passed in 1.17s ===============================
assert and isinstance
pipeline_type = "ETL"

# Check if pipeline_type is an instance of a str
isinstance(pipeline_type, str)

True

# Assert that pipeline_type indeed takes the value "ETL"
assert pipeline_type == "ETL"

# Combine assert and isinstance
assert isinstance(pipeline_type, str)
AssertionError
pipeline_type = "ETL"

# Create an AssertionError
assert isinstance(pipeline_type, float)

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
AssertionError
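Inside a test suite, an expected AssertionError can itself be asserted with pytest.raises; a small sketch (the test name is hypothetical):

import pytest

def test_pipeline_type_is_not_float():
    pipeline_type = "ETL"

    # The assert below is expected to fail, so the test passes
    with pytest.raises(AssertionError):
        assert isinstance(pipeline_type, float)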
Mocking data pipeline components with fixtures
import pytest
import pandas as pd
from pipeline import extract, transform

@pytest.fixture()
def clean_data():
    raw_stock_data = extract("raw_stock_data.csv")
    clean_stock_data = transform(raw_stock_data)
    return clean_stock_data

def test_transformed_data(clean_data):
    assert isinstance(clean_data, pd.DataFrame)
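By default, a fixture is rebuilt for every test that requests it. If extraction and transformation are expensive, one option is a module-scoped fixture, which caches the result for all tests in the file (a sketch, reusing the same extract and transform functions):

@pytest.fixture(scope="module")
def clean_data():
    # Built once per test module instead of once per test
    raw_stock_data = extract("raw_stock_data.csv")
    return transform(raw_stock_data)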
Unit testing DataFrames
def test_transformed_data(clean_data):
    # Include other assert statements here
    ...

    # Check number of columns
    assert len(clean_data.columns) == 4

    # Check the lower bound of a column
    assert clean_data["open"].min() >= 0

    # Check the range of a column by chaining statements with "and"
    assert clean_data["open"].min() >= 0 and clean_data["open"].max() <= 1000
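Other data quality checks follow the same pattern. For example, a null check and a column-name check (the expected column list mirrors the head() output shown earlier):

def test_data_quality(clean_data):
    # No missing closing prices
    assert clean_data["close"].notnull().all()

    # Columns match what downstream consumers expect
    assert list(clean_data.columns) == ["timestamps", "volume", "open", "close"]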
Let's practice!
Running a data pipeline in production
Data pipeline architecture patterns
Pattern 1: define and run everything in a single file

# Define ETL functions
...

def load(clean_data):
    ...

# Run the data pipeline
raw_stock_data = extract("raw_stock_data.csv")
clean_stock_data = transform(raw_stock_data)
load(clean_stock_data)

> ls
etl_pipeline.py

Pattern 2: import components from a separate module

# Import extract, transform, and load functions
from pipeline_utils import extract, transform, load

# Run the data pipeline
raw_stock_data = extract("raw_stock_data.csv")
clean_stock_data = transform(raw_stock_data)
load(clean_stock_data)

> ls
etl_pipeline.py    pipeline_utils.py
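For completeness, a sketch of what pipeline_utils.py might contain; the function bodies here are assumptions, only the signatures follow the slides:

# pipeline_utils.py
import pandas as pd

def extract(file_path: str) -> pd.DataFrame:
    # Read raw data from a CSV file
    return pd.read_csv(file_path)

def transform(raw_data: pd.DataFrame) -> pd.DataFrame:
    # Placeholder cleaning logic: drop rows with missing values
    return raw_data.dropna()

def load(clean_data: pd.DataFrame) -> None:
    # Hypothetical destination; a real pipeline might write to Postgres
    clean_data.to_csv("clean_stock_data.csv", index=False)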
Running a data pipeline end-to-end
import logging
from pipeline_utils import extract, transform, load

logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.DEBUG)

try:
    # Extract, transform, and load data
    raw_stock_data = extract("raw_stock_data.csv")
    clean_stock_data = transform(raw_stock_data)
    load(clean_stock_data)

    # Log success message
    logging.info("Successfully extracted, transformed and loaded data.")

# Handle exceptions, log messages
except Exception as e:
    logging.error(f"Pipeline failed with error: {e}")
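Once a pipeline runs unattended, it helps to persist logs rather than print them. basicConfig can write the same records to a file (the filename is hypothetical):

import logging

# Send log records to a file instead of the console
logging.basicConfig(
    filename="pipeline.log",
    format="%(levelname)s: %(message)s",
    level=logging.DEBUG,
)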
Orchestrating data pipelines in production
¹ https://open.substack.com/pub/seattledataguy/p/the-state-of-data-engineering-part?r=1po78c&utm_campaign=post&utm_medium=web
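As a rough sketch of what orchestration can look like, here is a minimal Airflow-style DAG that schedules the same three steps daily (assumes Airflow 2.x and the pipeline_utils module from earlier; the DAG id and schedule are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from pipeline_utils import extract, transform, load

def run_pipeline():
    raw_stock_data = extract("raw_stock_data.csv")
    clean_stock_data = transform(raw_stock_data)
    load(clean_stock_data)

with DAG(
    dag_id="stock_data_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="run_pipeline", python_callable=run_pipeline)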
Let's practice!
Congratulations!
Designing and building data pipelines
- Designing sound data pipelines
- Extract, transform, and load architecture
- Exception handling and logging
Advanced ETL techniques
- Handling nested JSON
- Advanced transformation logic
- Persisting data to SQL databases

{
    "863703000": {
        "volume": 1443120000,
        "price": {
            "close": 0.09791,
            "open": 0.12187
        }
    },
    ...
}
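A sketch of flattening that nested structure into tabular rows; the key names follow the JSON shown, while the file name is hypothetical:

import json
import pandas as pd

with open("raw_stock_data.json") as f:
    raw = json.load(f)

# Each top-level key is a timestamp mapping to volume and nested prices
rows = [
    {
        "timestamps": ts,
        "volume": entry["volume"],
        "open": entry["price"]["open"],
        "close": entry["price"]["close"],
    }
    for ts, entry in raw.items()
]
flattened = pd.DataFrame(rows)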
Deploying and maintaining data pipelines
- Validating and testing data pipelines
- Running a pipeline in a production setting
- Orchestration tools
Next steps
Courses and tracks:
- Introduction to Airflow in Python course
- Data Engineer career track
- Associate Data Engineer certification

Tools to explore:
- Apache Airflow
- Astronomer
- Snowflake
Thank you!