Caching in Spark
Introduction to Spark SQL in Python
Mark Plutowski
Data Scientist
What is caching?
Keeping data in memory so it does not have to be recomputed or re-read each time it is used
df.cache()

To uncache it:

df.unpersist()

To check whether a DataFrame is cached, use is_cached:

df.is_cached     # False  (not yet cached)
df.cache()
df.is_cached     # True
df.unpersist()
df.is_cached     # False  (after unpersisting)
The default storage level used by df.cache() has five properties:

1. useDisk = True
2. useMemory = True
3. useOffHeap = False
4. deserialized = True
5. replication = 1
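You can inspect these on a cached DataFrame through its storageLevel property; a minimal sketch, assuming a DataFrame named df:

df.cache()
level = df.storageLevel
print(level)   # e.g. StorageLevel(True, True, False, True, 1)
print(level.useDisk, level.useMemory, level.useOffHeap, level.deserialized, level.replication)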
df.persist() accepts an explicit storage level; called without arguments it uses the same default as df.cache():

import pyspark

df.persist()
df.persist(storageLevel=pyspark.StorageLevel.MEMORY_AND_DISK)
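Other storage levels can be chosen when disk spill is undesirable. A minimal sketch (not from the slides) of persisting in memory only; the existing cache is dropped first because a storage level cannot be changed in place:

from pyspark import StorageLevel

df.unpersist()                          # drop any existing cache first
df.persist(StorageLevel.MEMORY_ONLY)    # keep the data in memory only, no disk spill
print(df.storageLevel)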
A table registered in the catalog can also be cached:

spark.catalog.isCached(tableName='df')   # False
spark.catalog.cacheTable('df')
spark.catalog.isCached(tableName='df')   # True

spark.catalog.clearCache() removes all cached tables:

spark.catalog.clearCache()
spark.catalog.isCached(tableName='df')   # False
Cache selectively
Use the Spark UI to inspect execution
A Spark Task is a unit of execution that runs on a single CPU.
A Spark Stage is a group of tasks that perform the same computation in parallel, each task typically running on a different subset of the data.
A Spark Job is a computation triggered by an action and sliced into one or more stages.
The Spark UI runs on the driver host, on port 4040 by default; if that port is already taken, the next free port is used (see the sketch after this list for looking the URL up programmatically):

1. http://[DRIVER_HOST]:4040
2. http://[DRIVER_HOST]:4041
3. http://[DRIVER_HOST]:4042
4. http://[DRIVER_HOST]:4043
...
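If you are unsure which port the UI ended up on, the SparkContext exposes the address directly; a minimal sketch, assuming an active SparkSession named spark:

# URL of the Spark UI for the current application (None if the UI is disabled)
print(spark.sparkContext.uiWebUrl)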
To uncache a single table:

spark.catalog.uncacheTable('table1')
spark.catalog.isCached('table1')       # False
spark.catalog.dropTempView('table1')   # also remove the temporary view from the catalog
Caching stores the data in memory, or on disk, at a snapshot in time, so later changes to the underlying data are not reflected in the cache.
spark.sql(query3agg).show()
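One way to decide whether caching pays off is to time the query before and after caching the table it reads from. A rough sketch using Python's time module; time_query is a hypothetical helper, and the assumption that query3agg reads from a temp table named 'df' is mine:

import time

def time_query(q):
    # Run a query and report wall-clock time
    start = time.time()
    spark.sql(q).show()
    print("%.3f seconds" % (time.time() - start))

time_query(query3agg)             # first run: reads from the source
spark.catalog.cacheTable('df')    # assumes query3agg reads from the temp table 'df'
time_query(query3agg)             # this run materializes the cache
time_query(query3agg)             # subsequent runs read from the cache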
Logging primer
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

logging.info("Hello %s", "world")
logging.debug("Hello, take %d", 2)
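With the level set to INFO, only the first message is printed; the DEBUG call is filtered out. As a follow-up (not from the slides), raising the root logger's level makes it appear:

logging.getLogger().setLevel(logging.DEBUG)
logging.debug("Hello, take %d", 2)   # now emitted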
A caveat: Spark programs run with distributed execution, so debugging and logging behave differently than in a single-process program.
def reset(self):
    self.start_time = time.time()
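The snippets below create a timer with t = timer() and call t.elapsed(), so a complete class is assumed. Here is a minimal sketch consistent with the reset method above; the course's exact implementation may differ:

import time
import logging

class timer:
    """Log the seconds elapsed since the timer was created or last reset."""
    def __init__(self):
        self.start_time = time.time()

    def elapsed(self):
        logging.info("elapsed: %.3f seconds", time.time() - self.start_time)
        self.reset()

    def reset(self):
        self.start_time = time.time()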
t = timer()
logging.info("No action here.")                  # no Spark action is triggered
t.elapsed()
logging.debug("df has %d rows.", df.count())
t.elapsed()
# Although DEBUG messages are filtered out at the INFO level, df.count() is
# evaluated before logging.debug is called, so the expensive action still runs.
t = timer()
logger.info("No action here.")
t.elapsed()
if ENABLED:
    logger.info("df has %d rows.", df.count())   # the count now runs only when ENABLED is True
t.elapsed()
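The guarded version assumes a module-level logger and an ENABLED flag, neither of which is defined in the excerpt. One plausible setup (an assumption, not the course's exact code):

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

logger = logging.getLogger(__name__)   # named logger for this module
ENABLED = False                        # set to True to log row counts (triggers df.count())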
Explain
EXPLAIN SELECT * FROM table1
df.registerTempTable('df')   # deprecated in newer Spark versions; df.createOrReplaceTempView('df') is the replacement
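The EXPLAIN statement above can be run from Python as well; a minimal sketch, assuming the temp table 'df' registered above (the exact call used in the course may differ). The fragment that follows is the kind of plan text it returns:

plan_row = spark.sql("EXPLAIN SELECT * FROM df").first()   # single row whose first column is the plan text
print(plan_row[0])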
Batched: true,
Format: Parquet,
Location: InMemoryFileIndex[file:/temp/df.parquet],
PartitionFilters: [],
PushedFilters: [],
ReadSchema: struct<word:string,id:bigint,title:string,part:int>
== Physical Plan ==
FileScan parquet [word#963,id#964L,title#965,part#966]
Batched: true, Format: Parquet,
Location: InMemoryFileIndex[file:/temp/df.parquet],
PartitionFilters: [], PushedFilters: [],
ReadSchema: struct<word:string,id:bigint,title:string,part:int>
== Physical Plan ==
FileScan parquet [word#712,id#713L,title#714,part#715]
Batched: true, Format: Parquet,
Location: InMemoryFileIndex[file:/temp/df.parquet],
PartitionFilters: [], PushedFilters: [],
ReadSchema: struct<word:string,id:bigint,title:string,part:int>
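The next plans include InMemoryTableScan and InMemoryRelation operators, which appear once the DataFrame has been cached. A minimal sequence that produces a plan of this shape, assuming the same parquet-backed df as above:

df.cache()     # mark df for caching; nothing is stored until an action runs
df.count()     # an action materializes the cache
df.explain()   # the plan now shows InMemoryTableScan over InMemoryRelation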
== Physical Plan ==
InMemoryTableScan [word#0, id#1L, title#2, part#3]
+- InMemoryRelation [word#0, id#1L, title#2, part#3], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- FileScan parquet [word#0,id#1L,title#2,part#3]
Batched: true, Format: Parquet, Location:
InMemoryFileIndex[file:/temp/df.parquet],
PartitionFilters: [], PushedFilters: [],
ReadSchema: struct<word:string,id:bigint,title:string,part:int>
== Physical Plan ==
InMemoryTableScan [word#0, id#1L, title#2, part#3]
+- InMemoryRelation [word#0, id#1L, title#2, part#3], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- FileScan parquet [word#0,id#1L,title#2,part#3]
Batched: true, Format: Parquet,
Location: InMemoryFileIndex[file:/temp/df.parquet],
PartitionFilters: [], PushedFilters: [],
ReadSchema: struct<word:string,id:bigint,title:string,part:int>
from pyspark.sql.functions import desc

(df.groupBy('word')
   .count()
   .sort(desc('count'))
   .explain())
Two operators to look for in the plan: InMemoryRelation (the cached data) and InMemoryTableScan (a scan that reads from the cache).
HashAggregate(keys=[word#963], ...)
HashAggregate(keys=[word#963], functions=[count(1)])
The previous plan had the following lines, which are missing from the plan above:
...
+- InMemoryTableScan [word#963]
+- InMemoryRelation [word#963, id#964L, title#965, part#966], true, 10000,
StorageLevel(disk, memory, deserialized, 1 replicas)
...