Python For Data Science

PySpark SQL Basics Cheat Sheet

Learn PySpark SQL online at www.DataCamp.com


> PySpark & Spark SQL

Spark SQL is Apache Spark's module for working with structured data.
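A quick illustration of the structured-data idea above (a minimal sketch: the rows and column names are made up, and it assumes the SparkSession named spark created in the next section):

>>> df = spark.createDataFrame([("Jane", 25), ("Boris", 31)], #Hypothetical rows
                               ["firstName", "age"]) #Column names define the schema
>>> df.printSchema() #Spark SQL tracks column names and types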

> Initializing SparkSession

A SparkSession can be used to create DataFrames, register DataFrames as tables,
execute SQL over tables, cache tables, and read parquet files.

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
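The capabilities listed above, sketched end to end (a hedged example; users.parquet and the view name users are placeholder names, not files that ship with Spark):

>>> df = spark.read.parquet("users.parquet") #Read a parquet file into a DataFrame
>>> df.createOrReplaceTempView("users") #Register the DataFrame as a table
>>> spark.sql("SELECT * FROM users").show() #Execute SQL over the table
>>> spark.catalog.cacheTable("users") #Cache the table in memory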
> Creating DataFrames

From RDDs

>>> from pyspark.sql.types import *

Infer Schema
>>> sc = spark.sparkContext
>>> lines = sc.textFile("people.txt")
>>> parts = lines.map(lambda l: l.split(","))
>>> people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
>>> peopledf = spark.createDataFrame(people)

Specify Schema
>>> people = parts.map(lambda p: Row(name=p[0],
                                     age=int(p[1].strip())))
>>> schemaString = "name age"
>>> fields = [StructField(field_name, StringType(), True)
              for field_name in schemaString.split()]
>>> schema = StructType(fields)
>>> spark.createDataFrame(people, schema).show()

From Spark Data Sources

JSON
>>> df = spark.read.json("customer.json")
>>> df.show()
>>> df2 = spark.read.load("people.json", format="json")

Parquet files
>>> df3 = spark.read.load("users.parquet")

TXT files
>>> df4 = spark.read.text("people.txt")


> Duplicate Values

>>> df = df.dropDuplicates()


> Queries

>>> from pyspark.sql import functions as F

Select
>>> df.select("firstName").show() #Show all entries in firstName column
>>> df.select("firstName", "lastName") \
      .show()
>>> df.select("firstName", #Show all entries in firstName, age and type
              "age",
              F.explode("phoneNumber") \
               .alias("contactInfo")) \
      .select("contactInfo.type",
              "firstName",
              "age") \
      .show()
>>> df.select(df["firstName"], df["age"] + 1) \ #Show all entries in firstName and age, add 1 to the entries of age
      .show()
>>> df.select(df['age'] > 24).show() #Show all entries where age > 24

When
>>> df.select("firstName", #Show firstName and 0 or 1 depending on age > 30
              F.when(df.age > 30, 1) \
               .otherwise(0)) \
      .show()
>>> df[df.firstName.isin("Jane", "Boris")] \ #Show firstName if in the given options
      .collect()

Like
>>> df.select("firstName", #Show firstName, and lastName is TRUE if lastName is like Smith
              df.lastName.like("Smith")) \
      .show()

Startswith - Endswith
>>> df.select("firstName", #Show firstName, and TRUE if lastName starts with Sm
              df.lastName \
                .startswith("Sm")) \
      .show()
>>> df.select(df.lastName.endswith("th")) \ #Show last names ending in th
      .show()

Substring
>>> df.select(df.firstName.substr(1, 3) \ #Return substrings of firstName
                .alias("name")) \
      .collect()

Between
>>> df.select(df.age.between(22, 24)) \ #Show age: values are TRUE if between 22 and 24
      .show()
> Add, Update & Remove Columns

Adding Columns
>>> df = df.withColumn('city', df.address.city) \
           .withColumn('postalCode', df.address.postalCode) \
           .withColumn('state', df.address.state) \
           .withColumn('streetAddress', df.address.streetAddress) \
           .withColumn('telePhoneNumber', F.explode(df.phoneNumber.number)) \
           .withColumn('telePhoneType', F.explode(df.phoneNumber.type))

Updating Columns
>>> df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber')

Removing Columns
>>> df = df.drop("address", "phoneNumber")
>>> df = df.drop(df.address).drop(df.phoneNumber)


> GroupBy

>>> df.groupBy("age") \ #Group by age, count the members in the groups
      .count() \
      .show()
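Counting is not the only option: other aggregates from the functions module (imported as F above) can be applied per group with agg(). A brief sketch, where salary is a hypothetical column not present in the cheat sheet's example data:

>>> df.groupBy("age") \
      .agg(F.avg("salary"), F.max("salary")) \
      .show() #Average and max of the hypothetical salary column, per age group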
> Filter

>>> df.filter(df["age"] > 24).show() #Filter entries of age, keeping only records where age > 24


> Sort

>>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("age", ascending=False).collect()
>>> df.orderBy(["age", "city"], ascending=[0, 1]) \
      .collect()


> Missing & Replacing Values

>>> df.na.fill(50).show() #Replace null values
>>> df.na.drop().show() #Return new df omitting rows with null values
>>> df.na \ #Return new df replacing one value with another
      .replace(10, 20) \
      .show()


> Repartitioning

>>> df.repartition(10) \ #df with 10 partitions
      .rdd \
      .getNumPartitions()
>>> df.coalesce(1).rdd.getNumPartitions() #df with 1 partition


> Running Queries Programmatically

Registering DataFrames as Views
>>> peopledf.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")

Query Views
>>> df5 = spark.sql("SELECT * FROM customer").show()
>>> peopledf2 = spark.sql("SELECT * FROM global_temp.people") \
                     .show()


> Inspect Data

>>> df.dtypes #Return df column names and data types
>>> df.show() #Display the content of df
>>> df.head() #Return first n rows
>>> df.first() #Return first row
>>> df.take(2) #Return the first n rows
>>> df.schema #Return the schema of df
>>> df.describe().show() #Compute summary statistics
>>> df.columns #Return the columns of df
>>> df.count() #Count the number of rows in df
>>> df.distinct().count() #Count the number of distinct rows in df
>>> df.printSchema() #Print the schema of df
>>> df.explain() #Print the (logical and physical) plans


> Output

Data Structures
>>> rdd1 = df.rdd #Convert df into an RDD
>>> df.toJSON().first() #Convert df into a RDD of string
>>> df.toPandas() #Return the contents of df as a Pandas DataFrame

Write & Save to Files
>>> df.select("firstName", "city") \
      .write \
      .save("nameAndCity.parquet")
>>> df.select("firstName", "age") \
      .write \
      .save("namesAndAges.json", format="json")


> Stopping SparkSession

>>> spark.stop()


Learn Data Skills Online at www.DataCamp.com
