1. What is Apache Spark, and what is PySpark? How do they relate?
Apache Spark is a big data processing engine. It helps you process huge amounts of
data quickly — even terabytes or petabytes. You can use it to clean, transform, and
analyze large datasets.
PySpark is the Python API for Apache Spark. It lets you write Spark programs using
Python instead of Scala or Java.
2. Explain the difference between RDDs, DataFrames, and DataSets in PySpark.
1. RDD (Resilient Distributed Dataset): RDD is the basic building block of Spark. It is
a low-level, distributed collection of data. You can manually control the data and
operations. Best for complex transformations and unstructured data.
2. DataFrame: DataFrame is like a table in a database (with rows and columns). It is
easier to use than RDDs. Optimized using Catalyst optimizer (faster than RDDs). You
can use SQL-like queries on it.
3. Dataset (not available in PySpark, only in Scala/Java): Dataset combines the best
of RDDs and DataFrames. It is strongly typed (type-safe) like RDDs, and optimized
like DataFrames. PySpark does not support Datasets; use DataFrames instead.
What is SparkSession in PySpark, and why is it important?
SparkSession is the entry point to PySpark. You need it to use Spark's
features, such as reading data, creating DataFrames, and running SQL queries.
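For example, a typical PySpark program starts by creating (or reusing) a session; the file path below is hypothetical:

from pyspark.sql import SparkSession

# The single entry point for DataFrames, SQL, and reading/writing data.
spark = SparkSession.builder.appName("example-app").getOrCreate()

df = spark.read.csv("data.csv", header=True, inferSchema=True)  # hypothetical file
df.createOrReplaceTempView("data")
spark.sql("SELECT COUNT(*) AS total FROM data").show()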
II. Transformations & Actions:
6. Explain the difference between Spark Transformations and Actions. Provide
examples of each.
Transformations: Operations that create a new RDD or DataFrame from an
existing one. They are lazily evaluated.
Examples: select(), filter(), withColumn(), groupBy(), join(), union(), map(), flatMap().
Actions: Operations that trigger the execution of the DAG (directed acyclic graph) of
transformations and return a result to the driver program or write data to an external
storage system.
Examples: show(), count(), collect(), take(), write(), foreach().
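A small sketch of lazy evaluation, assuming a SparkSession named spark and a hypothetical Parquet file:

df = spark.read.parquet("events.parquet")            # hypothetical path
errors = df.filter(df.status == "error")             # transformation: no job runs yet
msgs = errors.select("timestamp", "message")         # transformation: still nothing runs

msgs.show(10)                                        # action: executes the whole DAG
print(msgs.count())                                  # action: runs the plan again unless cached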
7. Differentiate between Narrow and Wide Transformations.
1. Narrow Transformation: Data stays within the same partition, so there is no need to
shuffle data across nodes. Fast and efficient.
✅ Examples: map(), filter(), flatMap()
2. Wide Transformation: Data is shuffled between partitions/nodes, which requires
data movement across the cluster. Slower and more resource-heavy. Used when
data needs to be grouped, joined, or rearranged.
✅ Examples: groupByKey(), reduceByKey(), join()
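For example (assuming a SparkSession named spark), filter() keeps data in place while reduceByKey() forces a shuffle:

rdd = spark.sparkContext.parallelize(range(100), numSlices=4)

# Narrow: each output partition depends on one input partition, no shuffle.
evens = rdd.filter(lambda x: x % 2 == 0)

# Wide: rows with the same key must meet on one partition, so Spark shuffles.
pairs = rdd.map(lambda x: (x % 3, x))
sums = pairs.reduceByKey(lambda a, b: a + b)         # shuffle boundary
print(sums.collect())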
9. Explain Data Partitioning in Spark. How does it affect performance?
Data partitioning in Spark means dividing your big data into smaller parts (called
partitions) so they can be processed in parallel across multiple machines.
Each partition is like a small chunk of the full dataset, and Spark assigns these
chunks to different workers for faster processing. Partitioning directly affects
performance: too few partitions leave executors idle, too many add scheduling
overhead, and skewed partitions make a few tasks run much longer than the rest.
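A quick way to inspect and change partitioning (the path and column name here are hypothetical):

df = spark.read.parquet("sales.parquet")             # hypothetical path
print(df.rdd.getNumPartitions())                     # current number of partitions

df = df.repartition(200, "region")                   # full shuffle, co-locates rows by "region"
df = df.coalesce(50)                                 # merges partitions without a shuffle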
10. What are Broadcast Variables in PySpark?
Broadcast variables allow a programmer to cache a read-only copy of a variable
(e.g., a lookup table or a small DataFrame) on each worker node in the Spark cluster,
rather than shipping it with every task.
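A sketch of both styles, using a made-up lookup table and assuming a SparkSession named spark:

from pyspark.sql.functions import broadcast

# Broadcast a small lookup dict once per executor instead of once per task.
lookup = {"US": "United States", "DE": "Germany"}    # hypothetical lookup table
bc = spark.sparkContext.broadcast(lookup)

rdd = spark.sparkContext.parallelize([("US", 10), ("DE", 5)])
named = rdd.map(lambda kv: (bc.value.get(kv[0], "Unknown"), kv[1]))
print(named.collect())

# For DataFrames, a broadcast hint keeps the small side of a join from being shuffled:
# result = big_df.join(broadcast(small_df), "country_code")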
11. What are Accumulators in PySpark? Give a use case.
Accumulators are shared variables used in PySpark to safely count or sum values
across multiple tasks running in parallel.
They are mainly used for monitoring or debugging, like counting how many bad
records or errors occurred during processing.
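For example, counting bad records while parsing (assuming a SparkSession named spark):

bad_records = spark.sparkContext.accumulator(0)      # starts at 0 on the driver

def parse(line):
    try:
        return int(line)
    except ValueError:
        bad_records.add(1)                           # each task adds to the shared counter
        return None

rdd = spark.sparkContext.parallelize(["1", "2", "oops", "4"])
rdd.map(parse).filter(lambda x: x is not None).count()   # action triggers the tasks
print(bad_records.value)                             # read the total back on the driver

Note that updates made inside transformations can be applied more than once if a task is retried; only updates made inside actions such as foreach() are guaranteed to be counted exactly once.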
12. Explain Spark's memory management (Storage Memory vs. Execution Memory).
How does it handle out-of-memory errors?
Answer: Spark divides executor memory into two main regions:
Storage Memory: Used for caching (persisting) RDDs/DataFrames in memory.
Execution Memory: Used for shuffle, join, sort, and aggregation buffers.
Since Spark 1.6 the two regions share a unified pool: execution can borrow unused
storage memory and evict cached blocks if needed, and Spark spills data to disk when
memory runs low. An out-of-memory error only occurs when a single task's working set
still does not fit, for example because of a heavily skewed partition.
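The split is controlled by a few settings; the values below are Spark's documented defaults plus an example executor size, shown only as a sketch:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-demo")
    .config("spark.executor.memory", "4g")            # total heap per executor (example value)
    .config("spark.memory.fraction", "0.6")           # share of heap for execution + storage
    .config("spark.memory.storageFraction", "0.5")    # portion of that pool protected for storage
    .getOrCreate()
)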
IV. Advanced PySpark Concepts & Data Engineering Practices:
15. What are UDFs (User-Defined Functions) in PySpark?
Answer: UDFs allow you to define custom functions in Python that can be
applied to PySpark DataFrames, extending Spark's built-in functionality.
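A minimal sketch of a Python UDF; the function and column names are made up, and a SparkSession named spark is assumed:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def clean_phone(raw):
    # keep digits only; runs as plain Python on each row
    return "".join(ch for ch in raw if ch.isdigit()) if raw else None

df = spark.createDataFrame([("+1 (555) 010-2030",)], ["phone"])
df.withColumn("phone_clean", clean_phone("phone")).show(truncate=False)

Because Python UDFs run outside the JVM, prefer built-in functions (or pandas_udf) when an equivalent exists; they are usually much faster.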
16. Explain the concept of Data Lineage in Spark.
Data Lineage in Spark means tracking the journey of your data — from its origin to
the final result. It shows how data has been transformed step-by-step through
different operations like map, filter, join, etc.
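You can inspect the recorded lineage for any result (assuming a SparkSession named spark):

df = spark.range(1000)
out = df.filter(df.id % 2 == 0).selectExpr("id * 10 AS value")

out.explain()                        # logical and physical plan built from the lineage
print(out.rdd.toDebugString())       # the RDD lineage graph as text
# If a partition is lost, Spark replays just these recorded steps to rebuild it.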
17. When would you use union() vs. unionByName() in PySpark?
union(): Combines two DataFrames row-wise. It requires the DataFrames to have the
same number of columns and compatible data types in the same order. It performs a
positional union.
unionByName(): Combines two DataFrames row-wise by matching column names, so the
column order can differ. With allowMissingColumns=True (Spark 3.1+), it can also
handle missing columns by filling them with nulls.
Use Cases: union() for rigidly structured data where schema order is guaranteed.
unionByName() for evolving schemas or when combining data from different sources
where column order might vary but names are consistent.
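A small sketch of the difference, using toy DataFrames with made-up values:

df1 = spark.createDataFrame([(1, "a")], ["id", "label"])
df2 = spark.createDataFrame([("b", 2)], ["label", "id"])

# union() is purely positional: with df2's columns reversed it would silently
# mix ids and labels, so it is not what we want here.
# df1.union(df2)

# unionByName() matches columns by name, so the order does not matter.
df1.unionByName(df2).show()

# With allowMissingColumns=True (Spark 3.1+), absent columns are filled with nulls.
df3 = spark.createDataFrame([(3,)], ["id"])
df1.unionByName(df3, allowMissingColumns=True).show()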
What is the role of PySpark in a Data Engineering pipeline?
Answer:
PySpark is used for:
Ingesting large datasets
Transforming data (cleaning, filtering, joining)
Writing data to lakes or warehouses
Serving as the transformation step in ETL pipelines orchestrated by tools like Airflow or Glue
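A minimal end-to-end sketch (the bucket paths and column names are hypothetical, and a SparkSession named spark is assumed):

from pyspark.sql.functions import to_date

raw = spark.read.json("s3://raw-bucket/orders/")                 # ingest

clean = (
    raw.dropna(subset=["order_id"])                              # transform: clean
       .filter(raw.amount > 0)
       .withColumn("order_date", to_date("order_ts"))            # transform: enrich
)

(clean.write
      .mode("overwrite")
      .partitionBy("order_date")
      .parquet("s3://lake-bucket/orders/"))                      # load into the data lake

An orchestrator such as Airflow or Glue would then schedule this script as one task in the pipeline.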
Explain the Catalyst Optimizer and Tungsten Engine.
Answer:
Catalyst Optimizer is Spark’s query optimization engine. It improves execution plans
using techniques like predicate pushdown, constant folding, and logical plan
rewriting.
Tungsten Engine is responsible for physical execution optimizations, including
memory management, code generation, and binary processing.
Together, they make PySpark fast and efficient.
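You can see Catalyst's work with explain(); the path below is hypothetical, and mode="formatted" needs Spark 3.0+ (plain explain() works everywhere):

df = spark.read.parquet("events.parquet")            # hypothetical path
query = df.filter(df.status == "error").select("user_id")

query.explain(mode="formatted")
# In the scan node, look for PushedFilters: Catalyst pushed the status filter
# down to the Parquet reader. The WholeStageCodegen stages in the physical plan
# come from Tungsten's generated code.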
How does PySpark handle missing or null values?
Answer:
Detect: df.filter(df["column_name"].isNull())
Drop: df.dropna()
Fill: df.fillna(value)
Handling nulls is essential for data quality and pipeline reliability.
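Putting the three together on a toy DataFrame (assuming a SparkSession named spark; the values are made up):

from pyspark.sql.functions import col

df = spark.createDataFrame(
    [(1, "alice", None), (2, None, 30), (3, "carol", 25)],
    ["id", "name", "age"],
)

df.filter(col("name").isNull()).show()               # detect rows with a missing name
df.dropna(subset=["name"]).show()                    # drop rows where name is null
df.fillna({"name": "unknown", "age": 0}).show()      # fill per-column defaults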