Introduction to Data Pipelines
Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
What is a data pipeline?
A set of steps to process data from source(s) to final output
Can consist of any number of steps or components
Can span many systems
We will focus on data pipelines within Spark
What does a data pipeline look like?
Input(s)
CSV, JSON, web services, databases
Transformations
.withColumn(), .filter(), .drop()
Output(s)
CSV, Parquet, database
Validation
Analysis
Pipeline details
Not formally defined in Spark
Typically consists of whatever normal Spark code is required for the task
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField('name', StringType(), False),
    StructField('age', StringType(), False)
])

# Define the schema before loading the file
df = spark.read.format('csv').schema(schema).load('datafile')
df = df.withColumn('id', monotonically_increasing_id())
...
df.write.parquet('outdata.parquet')
df.write.json('outdata.json')
Let's Practice!
Data handling techniques
What are we trying to parse?
Focused on CSV data with issues such as:
Incorrect data
Empty rows
Commented lines
Headers
Nested structures
Multiple delimiters
Non-regular data (differing numbers of columns per row)
Example rows:
width, height, image
# This is a comment
200 300 affenpinscher;0
600 450 Collie;307 Collie;101
600 449 Japanese_spaniel;23
Stanford ImageNet annotations
Identifies dog breeds in images
Provides a list of all identified dogs in an image
Other metadata (base folder, image size, etc.)
Example rows:
02111277 n02111277_3206 500 375 Newfoundland,110,73,416,298
02108422 n02108422_4375 500 375 bull_mastiff,101,90,214,356 \
bull_mastiff,282,74,416,370
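As a hedged sketch (not the course's exact solution), one way to pull the breed and bounding box out of a single annotation entry such as 'Newfoundland,110,73,416,298', assuming a DataFrame df with that entry in a column named dog:

from pyspark.sql.functions import split, col

# Split the comma-separated entry into the breed plus four bounding-box values
df = df.withColumn('dog_parts', split(col('dog'), ','))
df = df.withColumn('breed', col('dog_parts').getItem(0))
df = df.withColumn('start_x', col('dog_parts').getItem(1).cast('integer'))
df = df.withColumn('start_y', col('dog_parts').getItem(2).cast('integer'))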
Removing blank lines, headers, and comments
Spark's CSV parser:
Automatically removes blank lines
Can remove comments using an optional argument
df1 = spark.read.csv('datafile.csv.gz', comment='#')
Handles header fields
Defined via the header argument
Ignored if a schema is defined
df1 = spark.read.csv('datafile.csv.gz', header=True)
Automatic column creation
Spark will:
Automatically create columns in a DataFrame based on the sep argument
df1 = spark.read.csv('datafile.csv.gz', sep=',')
Defaults to using ,
Can still successfully parse if sep is not present in the data
df1 = spark.read.csv('datafile.csv.gz', sep='*')
Stores all of the data in a single column (defaulting to _c0)
Allows you to properly handle nested separators
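A minimal sketch of this technique; the file name, the tab delimiter, and the column handling below are assumptions for illustration:

from pyspark.sql.functions import split

# Read with a separator that never occurs in the data, so each row lands
# whole in the single default column _c0
df1 = spark.read.csv('datafile.csv.gz', sep='*')

# Split manually on the real outer delimiter (assumed here to be a tab),
# leaving any nested comma-separated values intact for later parsing
df1 = df1.withColumn('fields', split(df1['_c0'], '\t'))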
Let's practice!
Data validation
Definition
Validation is:
Verifying that a dataset complies with the expected format
Number of rows / columns
Data types
Complex validation rules
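A minimal sketch of the simpler checks; the expected column count, row count, and column type below are placeholders, not values from the course data:

df = spark.read.parquet('parsed_data.parquet')

# Number of rows / columns
assert len(df.columns) == 5, 'Unexpected number of columns'
assert df.count() > 0, 'No rows found'

# Data types
assert dict(df.dtypes)['company'] == 'string', 'Unexpected type for company'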
Validating via joins
Compares data against known values
Easy to find data in a given set
Comparatively fast
parsed_df = spark.read.parquet('parsed_data.parquet')
company_df = spark.read.parquet('companies.parquet')
verified_df = parsed_df.join(company_df, parsed_df.company == company_df.company)
This inner join automatically removes any rows with a company not present in company_df!
Complex rule validation
Using Spark components to validate logic:
Calculations
Verifying against external source
Likely uses a UDF to modify / verify the DataFrame
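A hedged sketch of rule-based validation with a UDF; the column names and the rule itself are illustrative assumptions rather than the course's data:

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# Hypothetical rule: the stored total must equal the sum of the line items
def total_matches(total, line_items):
    return total == sum(line_items)

udfTotalMatches = udf(total_matches, BooleanType())

validated_df = df.withColumn('is_valid', udfTotalMatches(df.total, df.line_items))
validated_df = validated_df.filter(validated_df.is_valid)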
Let's practice!
Final analysis and delivery
Analysis calculations (UDF)
Calculations using a UDF
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def getAvgSale(saleslist):
    totalsales = 0
    count = 0
    for sale in saleslist:
        # Each entry contributes two sale amounts (indexes 2 and 3)
        totalsales += sale[2] + sale[3]
        count += 2
    return totalsales / count

udfGetAvgSale = udf(getAvgSale, DoubleType())
df = df.withColumn('avg_sale', udfGetAvgSale(df.sales_list))
Analysis calculations (inline)
Inline calculations
df = spark.read.csv('datafile')
df = df.withColumn('avg', (df.total_sales / df.sales_count))
df = df.withColumn('sq_ft', df.width * df.length)
# udfComputeTotal is a UDF assumed to be defined earlier in the pipeline
df = df.withColumn('total_avg_size', udfComputeTotal(df.entries) / df.numEntries)
Let's practice!
Congratulations and next steps
Next Steps
Review Spark documentation
Try working with data on actual clusters
Work with various datasets
Thank you!