DataFrame column operations
Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
DataFrame refresher
DataFrames:
Made up of rows & columns
Immutable
Use various transformation operations to modify data
# Return rows where name starts with "M"
voter_df.filter(voter_df.name.like('M%'))
# Return name and position only
voters = voter_df.select('name', 'position')
Common DataFrame transformations
Filter / Where
voter_df.filter(voter_df.date > '1/1/2019') # or voter_df.where(...)
Select
voter_df.select(voter_df.name)
withColumn
voter_df.withColumn('year', F.year(voter_df.date))  # F is pyspark.sql.functions
drop
voter_df.drop('unused_column')
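A minimal sketch chaining these transformations on the voter_df DataFrame from the refresher (the upper_name column is illustrative):
import pyspark.sql.functions as F

# Each call returns a new, immutable DataFrame
voters = (voter_df
          .where(voter_df.name.isNotNull())           # Filter / Where
          .select('name', 'position')                 # Select
          .withColumn('upper_name', F.upper('name'))  # withColumn
          .drop('name'))                              # drop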
Filtering data
Remove nulls:
voter_df.filter(voter_df['name'].isNotNull())
Remove odd entries:
voter_df.filter(F.year(voter_df.date) > 1800)
Split data from combined sources:
voter_df.where(voter_df['_c0'].contains('VOTE'))
Negate with ~:
voter_df.where(~ voter_df._c1.isNull())
Column string transformations
Contained in pyspark.sql.functions
import pyspark.sql.functions as F
Applied per column as a transformation
voter_df.withColumn('upper', F.upper('name'))
Can create intermediary columns
voter_df.withColumn('splits', F.split('name', ' '))
Can cast to other types
from pyspark.sql.types import IntegerType
voter_df.withColumn('year', voter_df['_c4'].cast(IntegerType()))
ArrayType() column functions
Various utility functions / transformations for interacting with ArrayType() columns:
F.size(<column>) - returns the length of an ArrayType() column
.getItem(<index>) - retrieves a specific item at the given index of a list column
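A short sketch combining these with F.split(), assuming voter_df has a space-separated name column as in the earlier example (the derived column names are illustrative):
import pyspark.sql.functions as F

# Create an ArrayType() column by splitting the name string
voter_df = voter_df.withColumn('splits', F.split('name', ' '))

# Length of each array, and items retrieved by index
voter_df = voter_df.withColumn('name_parts', F.size('splits'))
voter_df = voter_df.withColumn('first_name', voter_df.splits.getItem(0))
voter_df = voter_df.withColumn('last_name', voter_df.splits.getItem(F.size('splits') - 1))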
Let's practice!
Conditional DataFrame column operations
Conditional clauses
Conditional Clauses are:
Inline version of if / then / else
.when()
.otherwise()
Conditional example
.when(<if condition>, <then x>)
df.select(df.Name, df.Age, F.when(df.Age >= 18, "Adult"))
Name Age
Alice 14
Bob 18 Adult
Candice 38 Adult
Another example
Multiple .when()
df.select(df.Name, df.Age,
          F.when(df.Age >= 18, "Adult")
           .when(df.Age < 18, "Minor"))
Name Age
Alice 14 Minor
Bob 18 Adult
Candice 38 Adult
Otherwise
.otherwise() is like else
df.select(df.Name, df.Age,
          F.when(df.Age >= 18, "Adult")
           .otherwise("Minor"))
Name Age
Alice 14 Minor
Bob 18 Adult
Candice 38 Adult
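As a runnable sketch, the conditional result can also be stored in a named column via .withColumn() (the category column name is illustrative):
import pyspark.sql.functions as F

# Inline if / then / else: Adult when Age >= 18, otherwise Minor
df = df.withColumn('category',
                   F.when(df.Age >= 18, 'Adult')
                    .otherwise('Minor'))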
Let's practice!
User defined functions
Defined...
User defined functions, or UDFs
Python method
Wrapped via the pyspark.sql.functions.udf method
Stored as a variable
Called like a normal Spark function
Reverse string UDF
Define a Python method
def reverseString(mystr):
    return mystr[::-1]
Wrap the function and store as a variable (udf and StringType must be imported)
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
udfReverseString = udf(reverseString, StringType())
Use with Spark
user_df = user_df.withColumn('ReverseName',
                             udfReverseString(user_df.Name))
Argument-less example
import random

def sortingCap():
    return random.choice(['G', 'H', 'R', 'S'])
udfSortingCap = udf(sortingCap, StringType())
user_df = user_df.withColumn('Class', udfSortingCap())
Name Age Class
Alice 14 H
Bob 18 S
Candice 63 G
Let's practice!
Partitioning and lazy processing
Partitioning
DataFrames are broken up into partitions
Partition size can vary
Each partition is handled independently
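A small sketch for checking and changing how a DataFrame is partitioned (the target of 4 partitions is arbitrary):
# Number of partitions backing the DataFrame
print(voter_df.rdd.getNumPartitions())

# Redistribute the rows across a different number of partitions
voter_df = voter_df.repartition(4)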
Lazy processing
Transformations are lazy
.withColumn(...)
.select(...)
Nothing is actually done until an action is performed
.count()
.write(...)
Transformations can be re-ordered for best performance
Sometimes causes unexpected behavior
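A sketch of lazy processing in action, assuming voter_df from the earlier examples: the transformation lines return immediately, and no data is touched until the action runs:
import pyspark.sql.functions as F

# Transformations: only recorded in the plan, not executed
voter_df = voter_df.withColumn('upper_name', F.upper('name'))
voter_df = voter_df.select('upper_name')

# Action: triggers the actual computation
print(voter_df.count())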
Adding IDs
Normal ID fields:
Common in relational databases
Usually an increasing, sequential, and unique integer
Not very parallel

id  last name  first name  state
0   Smith      John        TX
1   Wilson     A.          IL
2   Adams      Wendy       OR
Monotonically increasing IDs
pyspark.sql.functions.monotonically_increasing_id()
Integer (64-bit), increases in value, unique
Not necessarily sequential (gaps exist)
Completely parallel
id         last name  first name  state
0          Smith      John        TX
134520871  Wilson     A.          IL
675824594  Adams      Wendy       OR
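A minimal sketch adding such an ID column (the ROW_ID column name is illustrative):
import pyspark.sql.functions as F

# Unique, increasing (but not sequential) 64-bit IDs, generated per partition
voter_df = voter_df.withColumn('ROW_ID', F.monotonically_increasing_id())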
Notes
Remember, Spark is lazy!
IDs are occasionally assigned out of order
If performing a join, IDs may be assigned after the join
Test your transformations
Let's practice!