Python For Data Science

PySpark SQL Basics Cheat Sheet

Learn PySpark SQL online at www.DataCamp.com


> PySpark & Spark SQL

Spark SQL is Apache Spark's module for working with structured data.
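A quick illustration of the structured-data idea above (a minimal sketch: the rows and column names are made up, and it assumes the SparkSession named spark created in the next section):

>>> df = spark.createDataFrame([("Jane", 25), ("Boris", 31)], #Hypothetical rows
                               ["firstName", "age"]) #Column names define the schema
>>> df.printSchema() #Spark SQL tracks column names and types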

> Initializing SparkSession

A SparkSession can be used to create DataFrames, register DataFrames as tables,
execute SQL over tables, cache tables, and read parquet files.

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
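The capabilities listed above, sketched end to end (a hedged example; users.parquet and the view name users are placeholder names, not files that ship with Spark):

>>> df = spark.read.parquet("users.parquet") #Read a parquet file into a DataFrame
>>> df.createOrReplaceTempView("users") #Register the DataFrame as a table
>>> spark.sql("SELECT * FROM users").show() #Execute SQL over the table
>>> spark.catalog.cacheTable("users") #Cache the table in memory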
> Creating DataFrames

From RDDs

>>> from pyspark.sql.types import *

Infer Schema
>>> sc = spark.sparkContext
>>> lines = sc.textFile("people.txt")
>>> parts = lines.map(lambda l: l.split(","))
>>> people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
>>> peopledf = spark.createDataFrame(people)

Specify Schema
>>> people = parts.map(lambda p: Row(name=p[0],
                                     age=int(p[1].strip())))
>>> schemaString = "name age"
>>> fields = [StructField(field_name, StringType(), True)
              for field_name in schemaString.split()]
>>> schema = StructType(fields)
>>> spark.createDataFrame(people, schema).show()

From Spark Data Sources

JSON
>>> df = spark.read.json("customer.json")
>>> df.show()
>>> df2 = spark.read.load("people.json", format="json")

Parquet files
>>> df3 = spark.read.load("users.parquet")

TXT files
>>> df4 = spark.read.text("people.txt")


> Duplicate Values

>>> df = df.dropDuplicates()


> Queries

>>> from pyspark.sql import functions as F

Select
>>> df.select("firstName").show() #Show all entries in firstName column
>>> df.select("firstName", "lastName") \
      .show()
>>> df.select("firstName", #Show all entries in firstName, age and type
              "age",
              F.explode("phoneNumber") \
               .alias("contactInfo")) \
      .select("contactInfo.type",
              "firstName",
              "age") \
      .show()
>>> df.select(df["firstName"], df["age"] + 1) \ #Show all entries in firstName and age, add 1 to the entries of age
      .show()
>>> df.select(df['age'] > 24).show() #Show all entries where age > 24

When
>>> df.select("firstName", #Show firstName and 0 or 1 depending on age > 30
              F.when(df.age > 30, 1) \
               .otherwise(0)) \
      .show()
>>> df[df.firstName.isin("Jane", "Boris")] \ #Show firstName if in the given options
      .collect()

Like
>>> df.select("firstName", #Show firstName, and lastName is TRUE if lastName is like Smith
              df.lastName.like("Smith")) \
      .show()

Startswith - Endswith
>>> df.select("firstName", #Show firstName, and TRUE if lastName starts with Sm
              df.lastName \
                .startswith("Sm")) \
      .show()
>>> df.select(df.lastName.endswith("th")) \ #Show last names ending in th
      .show()

Substring
>>> df.select(df.firstName.substr(1, 3) \ #Return substrings of firstName
                .alias("name")) \
      .collect()

Between
>>> df.select(df.age.between(22, 24)) \ #Show age: values are TRUE if between 22 and 24
      .show()
> Add, Update & Remove Columns

Adding Columns
>>> df = df.withColumn('city', df.address.city) \
           .withColumn('postalCode', df.address.postalCode) \
           .withColumn('state', df.address.state) \
           .withColumn('streetAddress', df.address.streetAddress) \
           .withColumn('telePhoneNumber', F.explode(df.phoneNumber.number)) \
           .withColumn('telePhoneType', F.explode(df.phoneNumber.type))

Updating Columns
>>> df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber')

Removing Columns
>>> df = df.drop("address", "phoneNumber")
>>> df = df.drop(df.address).drop(df.phoneNumber)


> GroupBy

>>> df.groupBy("age") \ #Group by age, count the members in the groups
      .count() \
      .show()
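Counting is not the only option: other aggregates from the functions module (imported as F above) can be applied per group with agg(). A brief sketch, where salary is a hypothetical column not present in the cheat sheet's example data:

>>> df.groupBy("age") \
      .agg(F.avg("salary"), F.max("salary")) \
      .show() #Average and max of the hypothetical salary column, per age group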
> Filter

>>> df.filter(df["age"] > 24).show() #Filter entries of age, keeping only records where age > 24


> Sort

>>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("age", ascending=False).collect()
>>> df.orderBy(["age", "city"], ascending=[0, 1]) \
      .collect()


> Missing & Replacing Values

>>> df.na.fill(50).show() #Replace null values
>>> df.na.drop().show() #Return new df omitting rows with null values
>>> df.na \ #Return new df replacing one value with another
      .replace(10, 20) \
      .show()


> Repartitioning

>>> df.repartition(10) \ #df with 10 partitions
      .rdd \
      .getNumPartitions()
>>> df.coalesce(1).rdd.getNumPartitions() #df with 1 partition


> Running Queries Programmatically

Registering DataFrames as Views
>>> peopledf.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")

Query Views
>>> df5 = spark.sql("SELECT * FROM customer").show()
>>> peopledf2 = spark.sql("SELECT * FROM global_temp.people") \
                     .show()


> Inspect Data

>>> df.dtypes #Return df column names and data types
>>> df.show() #Display the content of df
>>> df.head() #Return first n rows
>>> df.first() #Return first row
>>> df.take(2) #Return the first n rows
>>> df.schema #Return the schema of df
>>> df.describe().show() #Compute summary statistics
>>> df.columns #Return the columns of df
>>> df.count() #Count the number of rows in df
>>> df.distinct().count() #Count the number of distinct rows in df
>>> df.printSchema() #Print the schema of df
>>> df.explain() #Print the (logical and physical) plans


> Output

Data Structures
>>> rdd1 = df.rdd #Convert df into an RDD
>>> df.toJSON().first() #Convert df into a RDD of string
>>> df.toPandas() #Return the contents of df as a Pandas DataFrame

Write & Save to Files
>>> df.select("firstName", "city") \
      .write \
      .save("nameAndCity.parquet")
>>> df.select("firstName", "age") \
      .write \
      .save("namesAndAges.json", format="json")


> Stopping SparkSession

>>> spark.stop()


Learn Data Skills Online at www.DataCamp.com
