1. What is Apache Spark, and what is PySpark? How do they relate?
Apache Spark is a big data processing engine. It helps you process huge amounts of
data quickly — even terabytes or petabytes. You can use it to clean, transform, and
analyze large datasets.
PySpark is the Python API for Apache Spark. It lets you write Spark programs using
Python instead of Scala or Java.
2. Explain the difference between RDDs, DataFrames, and DataSets in PySpark.
1. RDD (Resilient Distributed Dataset): RDD is the basic building block of Spark. It is
a low-level, distributed collection of data. You can manually control the data and
operations. Best for complex transformations and unstructured data.
2. DataFrame: DataFrame is like a table in a database (with rows and columns). It is
easier to use than RDDs. Optimized using Catalyst optimizer (faster than RDDs). You
can use SQL-like queries on it.
3. Dataset (not available in PySpark, only in Scala/Java): Dataset combines the best
of RDDs and DataFrames. It is strongly typed (type-safe) like RDDs, and optimized
like DataFrames. PySpark does not support Datasets; use DataFrames instead.
What is SparkSession in PySpark, and why is it important?
SparkSession is the entry point to PySpark. You need it to use Spark's
features, such as reading data, creating DataFrames, and running SQL queries.
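For example, a typical PySpark program starts by creating (or reusing) a session; the file path below is hypothetical:

from pyspark.sql import SparkSession

# The single entry point for DataFrames, SQL, and reading/writing data.
spark = SparkSession.builder.appName("example-app").getOrCreate()

df = spark.read.csv("data.csv", header=True, inferSchema=True)  # hypothetical file
df.createOrReplaceTempView("data")
spark.sql("SELECT COUNT(*) AS total FROM data").show()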
II. Transformations & Actions:
6. Explain the difference between Spark Transformations and Actions. Provide
examples of each.
Transformations: Operations that create a new RDD or DataFrame from an
existing one. They are lazily evaluated.
Examples: select(), filter(), withColumn(), groupBy(), join(), union(), map(), flatMap().
Actions: Operations that trigger the execution of the DAG (directed acyclic graph) of
transformations and return a result to the driver program or write data to an external
storage system.
Examples: show(), count(), collect(), take(), write(), foreach().
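A small sketch of lazy evaluation, assuming a SparkSession named spark and a hypothetical Parquet file:

df = spark.read.parquet("events.parquet")            # hypothetical path
errors = df.filter(df.status == "error")             # transformation: no job runs yet
msgs = errors.select("timestamp", "message")         # transformation: still nothing runs

msgs.show(10)                                        # action: executes the whole DAG
print(msgs.count())                                  # action: runs the plan again unless cached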
7. Differentiate between Narrow and Wide Transformations.
1. Narrow Transformation: Data stays within the same partition, so there is no need to
shuffle data across nodes. Fast and efficient.
✅ Examples: map(), filter(), flatMap()
2. Wide Transformation: Data is shuffled between partitions/nodes, which requires
data movement across the cluster. Slower and more resource-heavy. Used when
data needs to be grouped, joined, or rearranged.
✅ Examples: groupByKey(), reduceByKey(), join()
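For example (assuming a SparkSession named spark), filter() keeps data in place while reduceByKey() forces a shuffle:

rdd = spark.sparkContext.parallelize(range(100), numSlices=4)

# Narrow: each output partition depends on one input partition, no shuffle.
evens = rdd.filter(lambda x: x % 2 == 0)

# Wide: rows with the same key must meet on one partition, so Spark shuffles.
pairs = rdd.map(lambda x: (x % 3, x))
sums = pairs.reduceByKey(lambda a, b: a + b)         # shuffle boundary
print(sums.collect())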
9. Explain Data Partitioning in Spark. How does it affect performance?
Data partitioning in Spark means dividing your big data into smaller parts (called
partitions) so they can be processed in parallel across multiple machines.
Each partition is like a small chunk of the full dataset, and Spark assigns these
chunks to different workers for faster processing. Partitioning directly affects
performance: too few partitions leave executors idle, too many add scheduling
overhead, and skewed partitions make a few tasks run much longer than the rest.
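A quick way to inspect and change partitioning (the path and column name here are hypothetical):

df = spark.read.parquet("sales.parquet")             # hypothetical path
print(df.rdd.getNumPartitions())                     # current number of partitions

df = df.repartition(200, "region")                   # full shuffle, co-locates rows by "region"
df = df.coalesce(50)                                 # merges partitions without a shuffle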
10. What are Broadcast Variables in PySpark?
Broadcast variables allow a programmer to cache a read-only copy of a variable
(e.g., a lookup table or a small DataFrame) on each worker node in the Spark cluster,
rather than shipping it with every task.
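A sketch of both styles, using a made-up lookup table and assuming a SparkSession named spark:

from pyspark.sql.functions import broadcast

# Broadcast a small lookup dict once per executor instead of once per task.
lookup = {"US": "United States", "DE": "Germany"}    # hypothetical lookup table
bc = spark.sparkContext.broadcast(lookup)

rdd = spark.sparkContext.parallelize([("US", 10), ("DE", 5)])
named = rdd.map(lambda kv: (bc.value.get(kv[0], "Unknown"), kv[1]))
print(named.collect())

# For DataFrames, a broadcast hint keeps the small side of a join from being shuffled:
# result = big_df.join(broadcast(small_df), "country_code")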
11. What are Accumulators in PySpark? Give a use case.
Accumulators are shared variables used in PySpark to safely count or sum values
across multiple tasks running in parallel.
They are mainly used for monitoring or debugging, like counting how many bad
records or errors occurred during processing.
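For example, counting bad records while parsing (assuming a SparkSession named spark):

bad_records = spark.sparkContext.accumulator(0)      # starts at 0 on the driver

def parse(line):
    try:
        return int(line)
    except ValueError:
        bad_records.add(1)                           # each task adds to the shared counter
        return None

rdd = spark.sparkContext.parallelize(["1", "2", "oops", "4"])
rdd.map(parse).filter(lambda x: x is not None).count()   # action triggers the tasks
print(bad_records.value)                             # read the total back on the driver

Note that updates made inside transformations can be applied more than once if a task is retried; only updates made inside actions such as foreach() are guaranteed to be counted exactly once.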
12. Explain Spark's memory management (Storage Memory vs. Execution Memory).
How does it handle out-of-memory errors?
Answer: Spark divides executor memory into two main regions:
Storage Memory: Used for caching (persisting) RDDs/DataFrames in memory.
Execution Memory: Used for shuffle, join, sort, and aggregation buffers.
Since Spark 1.6 the two regions share a unified pool: execution can borrow unused
storage memory and evict cached blocks if needed, and Spark spills data to disk when
memory runs low. An out-of-memory error only occurs when a single task's working set
still does not fit, for example because of a heavily skewed partition.
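The split is controlled by a few settings; the values below are Spark's documented defaults plus an example executor size, shown only as a sketch:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-demo")
    .config("spark.executor.memory", "4g")            # total heap per executor (example value)
    .config("spark.memory.fraction", "0.6")           # share of heap for execution + storage
    .config("spark.memory.storageFraction", "0.5")    # portion of that pool protected for storage
    .getOrCreate()
)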
IV. Advanced PySpark Concepts & Data Engineering Practices:
15. What are UDFs (User-Defined Functions) in PySpark?
Answer: UDFs allow you to define custom functions in Python that can be
applied to PySpark DataFrames, extending Spark's built-in functionality.
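A minimal sketch of a Python UDF; the function and column names are made up, and a SparkSession named spark is assumed:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def clean_phone(raw):
    # keep digits only; runs as plain Python on each row
    return "".join(ch for ch in raw if ch.isdigit()) if raw else None

df = spark.createDataFrame([("+1 (555) 010-2030",)], ["phone"])
df.withColumn("phone_clean", clean_phone("phone")).show(truncate=False)

Because Python UDFs run outside the JVM, prefer built-in functions (or pandas_udf) when an equivalent exists; they are usually much faster.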
16. Explain the concept of Data Lineage in Spark.
Data Lineage in Spark means tracking the journey of your data — from its origin to
the final result. It shows how data has been transformed step-by-step through
different operations like map, filter, join, etc.
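You can inspect the recorded lineage for any result (assuming a SparkSession named spark):

df = spark.range(1000)
out = df.filter(df.id % 2 == 0).selectExpr("id * 10 AS value")

out.explain()                        # logical and physical plan built from the lineage
print(out.rdd.toDebugString())       # the RDD lineage graph as text
# If a partition is lost, Spark replays just these recorded steps to rebuild it.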
17. When would you use union() vs. unionByName() in PySpark?
union(): Combines two DataFrames row-wise. It requires the DataFrames to have the
same number of columns and compatible data types in the same order. It performs a
positional union.
unionByName(): Combines two DataFrames row-wise by matching column names, so the
column order can differ. With allowMissingColumns=True (Spark 3.1+), it can also
handle missing columns by filling them with nulls.
Use Cases: union() for rigidly structured data where schema order is guaranteed.
unionByName() for evolving schemas or when combining data from different sources
where column order might vary but names are consistent.
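A small sketch of the difference, using toy DataFrames with made-up values:

df1 = spark.createDataFrame([(1, "a")], ["id", "label"])
df2 = spark.createDataFrame([("b", 2)], ["label", "id"])

# union() is purely positional: with df2's columns reversed it would silently
# mix ids and labels, so it is not what we want here.
# df1.union(df2)

# unionByName() matches columns by name, so the order does not matter.
df1.unionByName(df2).show()

# With allowMissingColumns=True (Spark 3.1+), absent columns are filled with nulls.
df3 = spark.createDataFrame([(3,)], ["id"])
df1.unionByName(df3, allowMissingColumns=True).show()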
What is the role of PySpark in a Data Engineering pipeline?
Answer:
PySpark is used for:
Ingesting large datasets
Transforming data (cleaning, filtering, joining)
Writing data to lakes or warehouses
Serving as the transformation step in ETL pipelines orchestrated by tools like Airflow or Glue
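A minimal end-to-end sketch (the bucket paths and column names are hypothetical, and a SparkSession named spark is assumed):

from pyspark.sql.functions import to_date

raw = spark.read.json("s3://raw-bucket/orders/")                 # ingest

clean = (
    raw.dropna(subset=["order_id"])                              # transform: clean
       .filter(raw.amount > 0)
       .withColumn("order_date", to_date("order_ts"))            # transform: enrich
)

(clean.write
      .mode("overwrite")
      .partitionBy("order_date")
      .parquet("s3://lake-bucket/orders/"))                      # load into the data lake

An orchestrator such as Airflow or Glue would then schedule this script as one task in the pipeline.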
Explain the Catalyst Optimizer and Tungsten Engine.
Answer:
Catalyst Optimizer is Spark’s query optimization engine. It improves execution plans
using techniques like predicate pushdown, constant folding, and logical plan
rewriting.
Tungsten Engine is responsible for physical execution optimizations, including
memory management, code generation, and binary processing.
Together, they make PySpark fast and efficient.
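You can see Catalyst's work with explain(); the path below is hypothetical, and mode="formatted" needs Spark 3.0+ (plain explain() works everywhere):

df = spark.read.parquet("events.parquet")            # hypothetical path
query = df.filter(df.status == "error").select("user_id")

query.explain(mode="formatted")
# In the scan node, look for PushedFilters: Catalyst pushed the status filter
# down to the Parquet reader. The WholeStageCodegen stages in the physical plan
# come from Tungsten's generated code.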
How does PySpark handle missing or null values?
Answer:
Detect: df.filter(df["column_name"].isNull())
Drop: df.dropna()
Fill: df.fillna(value)
Handling nulls is essential for data quality and pipeline reliability.
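Putting the three together on a toy DataFrame (assuming a SparkSession named spark; the values are made up):

from pyspark.sql.functions import col

df = spark.createDataFrame(
    [(1, "alice", None), (2, None, 30), (3, "carol", 25)],
    ["id", "name", "age"],
)

df.filter(col("name").isNull()).show()               # detect rows with a missing name
df.dropna(subset=["name"]).show()                    # drop rows where name is null
df.fillna({"name": "unknown", "age": 0}).show()      # fill per-column defaults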