DELOITTE & EY
DATA ENGINEER INTERVIEW QUESTIONS
CTC: 20+ LPA
EXPERIENCE: 0-3 YEARS
1. Difference between a DataFrame and an RDD in PySpark
• Abstraction Level: RDD (Resilient Distributed Dataset) is a low-level, object-oriented abstraction; DataFrame is a high-level, tabular abstraction (like SQL tables).
• Schema: RDD has no schema and stores data as Java/Scala/Python objects; DataFrame has a schema with named columns and data types.
• Ease of Use: RDD requires more lines of code (functional programming); DataFrame is easier to use and supports SQL-like operations.
• Optimization: RDD has no query optimization; DataFrame is optimized using the Catalyst optimizer.
• Performance: RDD is slower due to the lack of optimization and serialization overhead; DataFrame is faster thanks to optimization and the Tungsten execution engine.
• Use Case: RDD suits complex transformations and custom functions; DataFrame suits structured data analysis and SQL queries.
• Example:
rdd = sc.parallelize([("Alice", 25), ("Bob", 30)])
df = spark.createDataFrame([("Alice", 25), ("Bob", 30)], ["name", "age"])
Summary:
• Use RDDs when you need fine-grained control over data and operations.
• Use DataFrames for performance, simplicity, and SQL integration.
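A minimal sketch showing both abstractions over the same data and how to move between them (the app name is illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd_vs_df").getOrCreate()

# RDD: low-level, no schema, plain Python tuples
rdd = spark.sparkContext.parallelize([("Alice", 25), ("Bob", 30)])

# DataFrame: named columns with types, optimized by Catalyst/Tungsten
df = spark.createDataFrame(rdd, ["name", "age"])

# Converting between the two abstractions
df_from_rdd = rdd.toDF(["name", "age"])   # RDD -> DataFrame
rdd_from_df = df.rdd                      # DataFrame -> RDD of Row objects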
2. Techniques to Optimize PySpark Code Performance
Here are the top techniques for optimizing PySpark performance:
1. Use DataFrame API instead of RDDs
• DataFrames are optimized using Catalyst and Tungsten.
• Example:
df.select("column1").filter(df["column2"] > 10)
2. Persist/Cache Intermediate Results
• Use .cache() or .persist() when reusing data.
df.cache()
3. Filter Early (Predicate Pushdown)
• Apply filter() as early as possible to reduce data volume.
4. Broadcast Join for Smaller Datasets
• Avoid shuffles by broadcasting smaller tables.
from pyspark.sql.functions import broadcast
df1.join(broadcast(df2), "key")
5. Avoid Shuffles Where Possible
• Minimize use of operations like groupBy, repartition, join that cause shuffles.
6. Use .select() Instead of *
• Select only required columns to reduce memory usage.
7. Partitioning
• Use repartition() or coalesce() to control the number of partitions.
• Example:
df.repartition(4) # increases partitions
df.coalesce(1) # reduces partitions
8. Avoid UDFs When Possible
• Use built-in PySpark functions instead of UDFs for better performance (see the sketch after this list).
9. Tuning Spark Configurations
• Example tuning:
spark.conf.set("spark.sql.shuffle.partitions", "200")
10. Monitoring via Spark UI
• Analyze stages, jobs, and tasks in the Spark UI for bottlenecks.
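To illustrate point 8, a minimal sketch (assuming a DataFrame df with a string column name) comparing a built-in function with an equivalent UDF:
from pyspark.sql.functions import col, udf, upper
from pyspark.sql.types import StringType

# Built-in function: handled by Catalyst/Tungsten, no Python serialization overhead
df.withColumn("name_upper", upper(col("name")))

# Equivalent UDF: runs row by row in Python, so it is usually much slower
upper_udf = udf(lambda s: s.upper() if s else None, StringType())
df.withColumn("name_upper", upper_udf(col("name")))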
3. Role of Catalyst Optimizer in Query Execution
Catalyst Optimizer is Spark SQL’s query optimizer that converts logical plans to
efficient physical plans.
Key Roles:
• 1. Analysis: Checks column names and data types against the schema; converts unresolved attributes into resolved ones.
• 2. Logical Optimization: Applies optimization rules such as predicate pushdown, constant folding, and null propagation.
• 3. Physical Planning: Generates multiple physical plans and selects the most optimal one based on cost.
• 4. Code Generation: Uses Whole-Stage Code Generation (WSCG) to compile parts of the plan into Java bytecode for speed.
Examples of Optimizations:
• Predicate Pushdown: Filtering early to reduce data scan.
• Projection Pruning: Reads only necessary columns.
• Join Reordering: Reorders joins based on size and cost.
Why is it important?
• Automates performance improvements.
• Makes DataFrame and SQL APIs more efficient without manual tuning.
• Enables Spark to scale well with large datasets.
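You can inspect the plans Catalyst produces with explain(); a quick sketch (assuming a DataFrame df with age and name columns):
# Prints the parsed, analyzed, and optimized logical plans plus the physical plan
df.filter(df["age"] > 30).select("name").explain(True)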
Final Summary:
• DataFrame vs RDD: DataFrames offer schema, SQL-like syntax, and performance via optimization; RDDs offer flexibility.
• Optimization Techniques: Focus on broadcast joins, filtering early, caching data, using DataFrames over RDDs, and avoiding UDFs.
• Catalyst Optimizer: The core of Spark SQL; it optimizes queries via logical and physical planning and code generation.
4. Common Serialization Formats in PySpark and Their Use
Cases
Serialization formats determine how data is stored or transmitted. In PySpark, choosing the
right format impacts performance, compatibility, and storage efficiency.
Common Formats:
• Parquet: Columnar storage format with schema support; supports compression and predicate pushdown. Ideal for big data analytics and read-heavy workloads, especially with Spark SQL.
• Avro: Row-based format with rich schema support; highly compact. Good for streaming, long-term data storage, or schema evolution scenarios.
• ORC: Optimized columnar format for Hadoop, similar to Parquet but more efficient in some Hive operations. Best for integration with Hive and heavy read/write workloads.
• JSON: Text-based, human-readable format; slower and larger in size. Ideal for web logs, debugging, or configuration files.
• CSV: Simple, row-based plain-text format; no schema support and larger file size. Used for lightweight data exchange, legacy systems, or initial ingestion.
• Delta Lake: Adds ACID transactions and versioning on top of Parquet; supports time travel and upserts. Ideal for modern data lakes, lakehouse architectures, and unified streaming + batch workloads.
• Pickle (Python-specific): Serializes Python objects; not cross-language. Used internally by PySpark UDFs; not recommended for distributed storage.
Recommendation:
• Use Parquet or Delta Lake for analytics and performance.
• Use JSON/CSV only for small datasets or initial ingestion.
• Avoid Pickle for distributed storage due to compatibility issues.
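A minimal sketch of writing and reading the recommended formats (paths are illustrative, and the Delta example assumes the delta-spark package is installed and configured on the session):
# Parquet: columnar, compressed, supports predicate pushdown
df.write.mode("overwrite").parquet("output/events_parquet")
df_parquet = spark.read.parquet("output/events_parquet")

# Delta Lake: Parquet plus ACID transactions, versioning, and time travel
df.write.format("delta").mode("overwrite").save("output/events_delta")
df_delta = spark.read.format("delta").load("output/events_delta")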
5. How to Address Skewed Data in PySpark?
Skewed data means some partitions have disproportionately more data than others. This
causes performance issues and slow stages in Spark.
Techniques to Handle Skewed Data:
1. Salting Keys
• Add a random prefix or suffix to the skewed key to distribute load.
from pyspark.sql.functions import concat, lit, rand
# Add a random salt (0-9) to the skewed key so its rows spread across more partitions
df = df.withColumn("salted_key", concat(df["key"], lit("_"), (rand() * 10).cast("int").cast("string")))
2. Broadcast Join
• If one table is small, broadcast it to avoid shuffle.
df1.join(broadcast(df2), "key")
3. Skew Join Optimization
• Use the Spark 3.0+ adaptive skew-join feature (requires Adaptive Query Execution to be enabled):
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
4. Custom Partitioning
• Use repartition() with custom logic to ensure even data distribution.
5. Avoid GroupBy on Skewed Columns
• Use aggregations after salting or use approximate functions.
Example:
# Original skewed join
df1.join(df2, "user_id")
# After salting: salt the skewed (large) table with a random value, then replicate the
# other table once per salt value so every original key still finds its match
from pyspark.sql.functions import col, concat, lit, rand, explode, array
num_salts = 5
df1_salted = df1.withColumn("salt", (rand() * num_salts).cast("int").cast("string")) \
    .withColumn("user_id_salt", concat(col("user_id").cast("string"), lit("_"), col("salt")))
df2_salted = df2.withColumn("salt", explode(array([lit(str(i)) for i in range(num_salts)]))) \
    .withColumn("user_id_salt", concat(col("user_id").cast("string"), lit("_"), col("salt")))
df1_salted.join(df2_salted, "user_id_salt")
6. How is Memory Management Handled in PySpark?
Spark memory is split into different regions for storage and execution. Effective memory
management is crucial for avoiding OOM (OutOfMemory) errors and ensuring
performance.
Memory Components in Spark:
• Execution Memory: Used for shuffles, joins, and aggregations.
• Storage Memory: Used to cache/persist DataFrames and RDDs.
• User Memory: Used for user-defined data structures.
• Reserved Memory: Fixed portion that Spark reserves for internal use.
Memory Management Features:
1. Unified Memory Management
• From Spark 1.6+, execution and storage memory share a unified pool.
2. Dynamic Allocation
• Spark can add or remove executors based on the workload; enable it at submit time (it cannot be toggled on an already running session):
--conf spark.dynamicAllocation.enabled=true
3. Garbage Collection (GC)
• Use appropriate JVM GC like G1GC to handle large objects and reduce GC pauses.
4. Memory Tuning Parameters
• Adjust executor memory and cores at submit time:
--executor-memory 4G --executor-cores 2 --driver-memory 2G
• Control the memory fractions (set when the application/SparkSession is created, not at runtime):
spark = SparkSession.builder \
    .config("spark.memory.fraction", "0.6") \
    .config("spark.memory.storageFraction", "0.5") \
    .getOrCreate()
5. Persist and Unpersist Data
• Use .persist() or .cache() wisely, and unpersist when not needed to free memory.
Final Summary:
• Serialization Formats: Prefer columnar formats like Parquet/ORC for analytics; use Delta Lake for ACID guarantees.
• Skewed Data Handling: Use salting, broadcast joins, adaptive skew joins, or custom partitioning.
• Memory Management: Optimize using unified memory, dynamic allocation, GC tuning, and careful persist/unpersist.
7. Types of Joins in PySpark and How to Implement Them
PySpark supports all standard SQL join types using the .join() function.
Join Types in PySpark:
• Inner Join: Returns rows with matching keys in both DataFrames. df1.join(df2, "key", "inner")
• Left Join (Left Outer): All rows from the left DF plus matched rows from the right DF; nulls where no match. df1.join(df2, "key", "left")
• Right Join (Right Outer): All rows from the right DF plus matched rows from the left DF; nulls where no match. df1.join(df2, "key", "right")
• Full Outer Join: All rows from both DFs; nulls where no match. df1.join(df2, "key", "outer")
• Left Semi Join: Returns rows from the left DF that have a match in the right DF; the right DF's columns are dropped. df1.join(df2, "key", "left_semi")
• Left Anti Join: Returns rows from the left DF with no match in the right DF. df1.join(df2, "key", "left_anti")
• Cross Join: Cartesian product of the two DFs (all combinations). df1.crossJoin(df2)
Example:
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df2 = spark.createDataFrame([(1, "HR"), (3, "Finance")], ["id", "dept"])
# Inner Join
df1.join(df2, on="id", how="inner").show()
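Using the same df1 and df2, a few of the other join types from the table (a quick sketch):
# Left join: keeps every row of df1; dept is null where there is no match
df1.join(df2, on="id", how="left").show()

# Left semi: only df1 rows that have a match in df2 (df2's columns are dropped)
df1.join(df2, on="id", how="left_semi").show()

# Left anti: only df1 rows with no match in df2
df1.join(df2, on="id", how="left_anti").show()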
8. Purpose of broadcast() and When to Use It
Purpose:
broadcast() is used to send a small DataFrame to all executor nodes to avoid data
shuffling during joins.
Why Use It?
• Optimizes Join Performance when one table is small.
• Avoids shuffle which is expensive and can slow down joins.
• Improves execution time significantly for skewed joins.
When to Use:
• One DataFrame is small enough to fit in memory (~10MB–100MB).
• When performing a join on a large table with a small lookup table.
• For dimension-fact table joins (e.g., product info, country list).
Example:
from pyspark.sql.functions import broadcast
# Large table: transactions, small table: products
transactions.join(broadcast(products), "product_id").show()
Spark Conf Auto-broadcast:
• Spark will automatically broadcast tables smaller than this threshold:
spark.conf.get("spark.sql.autoBroadcastJoinThreshold") # default: 10MB
You can increase it if needed:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024) # 50 MB
9. How to Define and Use UDFs (User Defined Functions)?
A UDF (User Defined Function) allows you to apply custom logic to DataFrame columns
not covered by built-in functions.
Steps to Use UDF in PySpark:
1. Define a Python function
2. Register the function as a UDF
3. Apply it using .withColumn() or .select()
Example: Capitalize First Letter
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Step 1: Define a Python function
def capitalize_name(name):
    return name.capitalize() if name else None
# Step 2: Register UDF
cap_udf = udf(capitalize_name, StringType())
# Step 3: Apply UDF
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])
df.withColumn("cap_name", cap_udf("name")).show()
When to Use UDFs:
• When the transformation logic is not available in built-in functions.
• When applying custom business logic (e.g., text normalization, regex cleaning,
etc.)
Caution:
• UDFs are slower than native PySpark functions.
• They bypass Catalyst Optimizer and cannot be pushed down.
• Prefer built-in functions or pandas UDFs (if using Spark 3.x+).
Pandas UDF Example (for better performance):
from pyspark.sql.functions import pandas_udf
@pandas_udf(StringType())
def upper_case_pandas(col):
    return col.str.upper()
df.withColumn("upper_name", upper_case_pandas(df["name"])).show()
Final Summary:
• Types of Joins: PySpark supports inner, left, right, outer, semi, anti, and cross joins; choose based on the business need.
• broadcast(): Use it to optimize joins when one DataFrame is small; it avoids shuffling and improves speed.
• UDFs: Enable custom column transformations; avoid them unless built-in functions or pandas UDFs cannot handle the logic.
10. What is Lazy Evaluation, and How Does It Affect Jobs?
Definition:
Lazy Evaluation means that Spark does not immediately execute transformations (like
map(), filter(), select(), etc.). Instead, it builds a logical execution plan. The execution only
starts when an action (like show(), count(), collect()) is triggered.
How It Works:
• Transformations: Lazy — they return a new DataFrame/RDD but don't compute
anything immediately.
• Actions: Trigger the actual computation and return results.
Examples:
# Lazy - no computation yet
df_filtered = df.filter(df["age"] > 30)
# Still lazy - chain of transformations
df_transformed = df_filtered.select("name", "age")
# Trigger action - now Spark builds DAG and executes
df_transformed.show()
Benefits of Lazy Evaluation:
1. Optimized Execution Plan: Spark uses the Catalyst Optimizer to reorder and
combine operations for better performance.
2. Pipeline Efficiency: Merges operations to avoid unnecessary disk and memory
usage.
3. Failure Recovery: Because of logical lineage (DAG), Spark can recompute lost
partitions.
Impact:
• Developers must understand that transformations won’t show results or errors until
an action is executed.
• Makes debugging harder, but greatly improves performance.
11. Steps to Create a DataFrame in PySpark
There are multiple ways to create a DataFrame in PySpark. Here are the most common
ones:
1. From a List of Tuples:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
data = [("Alice", 30), ("Bob", 25)]
columns = ["name", "age"]
df = spark.createDataFrame(data, schema=columns)
df.show()
2. From an RDD:
rdd = spark.sparkContext.parallelize(data)
df_from_rdd = rdd.toDF(["name", "age"])
3. From a CSV/Parquet/JSON File:
df_csv = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
df_parquet = spark.read.parquet("path/to/file.parquet")
df_json = spark.read.json("path/to/file.json")
4. From Pandas DataFrame:
import pandas as pd
pdf = pd.DataFrame(data, columns=columns)
df_from_pd = spark.createDataFrame(pdf)
12. Concept of RDDs (Resilient Distributed Datasets)
Definition:
An RDD (Resilient Distributed Dataset) is the core abstraction in Spark that represents
an immutable, distributed collection of objects that can be processed in parallel.
Key Characteristics:
• Resilient: Fault-tolerant via lineage (lost partitions can be recomputed).
• Distributed: Spread across multiple nodes in the cluster.
• Immutable: Once created, cannot be changed; transformations create new RDDs.
• Lazy Evaluation: Like DataFrames, operations are not executed until an action is called.
• In-Memory Computation: Fast operations using RAM.
Creating RDDs:
# From a list
rdd1 = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
# From external file
rdd2 = spark.sparkContext.textFile("path/to/file.txt")
Common Operations:
• Transformations: map(), filter(), flatMap(), union(), distinct()
• Actions: collect(), count(), take(), first(), reduce()
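A small sketch chaining a few of these operations on the rdd1 created above:
# Transformations are lazy; nothing executes until an action is called
squares = rdd1.map(lambda x: x * x)            # 1, 4, 9, 16, 25
evens = squares.filter(lambda x: x % 2 == 0)

# Actions trigger the computation
evens.collect()                                # [4, 16]
squares.reduce(lambda a, b: a + b)             # 55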
When to Use RDDs:
• When you need fine-grained control over the data.
• For low-level transformations or custom business logic.
• When DataFrames are insufficient, e.g., working with binary data or complex data
flows.
Final Summary:
• Lazy Evaluation: Spark builds a plan but defers execution until an action is called; this enables optimization.
• Create DataFrame: Use createDataFrame() from a list, RDD, or pandas DataFrame, or load files (CSV, JSON, Parquet).
• RDDs: Core Spark abstraction for distributed data processing; less optimized than DataFrames but more flexible.
13. Difference Between Actions vs Transformations in
PySpark
• Definition: Transformations define a new dataset from an existing one and are lazy; actions trigger execution and return results.
• Evaluation: Transformations are lazy (not executed until an action is called); actions are eager and trigger the DAG (Directed Acyclic Graph) execution.
• Return Type: Transformations return a new DataFrame/RDD; actions return a final result (e.g., count, collect) or write to output.
• Examples: Transformations: filter(), map(), select(), groupBy(); Actions: show(), count(), collect(), saveAsTextFile()
Example:
# Transformation (lazy)
df_filtered = df.filter(df["age"] > 25)
# Action (triggers execution)
df_filtered.show()
Summary:
• Use transformations to define operations.
• Use actions to trigger computation and get results.
14. Handling Null Values in DataFrames
Null values are common in real-world data and should be handled to avoid incorrect
results or failures in transformations.
1. Drop Nulls:
# Drop rows with any nulls
df.na.drop()
# Drop rows where specific column is null
df.na.drop(subset=["column_name"])
2. Fill Nulls:
# Fill all numeric nulls with 0
df.na.fill(0)
# Fill nulls in specific column with value
df.na.fill({"salary": 0, "department": "Unknown"})
3. Replace Specific Values:
# Replace NULL with "N/A" in a column
df.fillna("N/A", subset=["name"])
4. Filter Rows with Nulls:
# Select rows where 'age' is NOT NULL
df.filter(df["age"].isNotNull())
5. Using when() and isNull():
from pyspark.sql.functions import when
df.withColumn("salary_status", when(df["salary"].isNull(),
"Missing").otherwise("Available"))
Best Practice:
• Analyze the nature and impact of missing values.
• Use domain knowledge to decide between dropping, imputing, or flagging.
15. What is a Partition, and How Do You Manage It?
Definition:
A partition in Spark refers to a logical chunk of data distributed across different nodes in
the cluster. All Spark computations operate per partition in parallel.
Why Are Partitions Important?
• Enable parallelism (more partitions = better resource utilization).
• Affect data shuffling, execution time, and memory use.
• Poor partitioning → data skew, OOM errors, or slow jobs.
Default Behavior:
• Number of partitions depends on input source (e.g., file splits, HDFS blocks).
• For DataFrames, Spark often uses 200 shuffle partitions by default:
spark.conf.get("spark.sql.shuffle.partitions") # Default: 200
Managing Partitions:
• repartition(n): Increases or resets the number of partitions; causes a full shuffle.
• coalesce(n): Reduces the number of partitions without a full shuffle; more efficient.
• partitionBy(): Used when writing data to disk (e.g., Parquet) to physically organize files.
Examples:
# Repartition to 10 partitions (shuffle)
df.repartition(10)
# Coalesce to 1 partition (no shuffle)
df.coalesce(1)
# Write to disk partitioned by column
df.write.partitionBy("region").parquet("output/")
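To check how many partitions a DataFrame currently has, a quick sketch:
# Number of partitions backing the DataFrame
print(df.rdd.getNumPartitions())

# After an explicit repartition
print(df.repartition(10).rdd.getNumPartitions())  # 10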
Best Practice:
• Use coalesce() after filtering to avoid unnecessary shuffles.
• Use partitionBy() when writing to optimize read queries later.
Final Summary:
• Actions vs Transformations: Transformations are lazy and define lineage; actions trigger execution.
• Handling Nulls: Use .na.drop(), .na.fill(), isNull(), and when() to clean or impute data.
• Partitions: Units of parallelism in Spark; manage them with repartition, coalesce, and partitionBy for performance tuning.
16. Difference Between Narrow vs Wide Transformations
Transformations in Spark are classified based on data shuffling behavior.
• Definition: Narrow transformations do not shuffle data between partitions; wide transformations shuffle data across partitions/nodes.
• Dependency: In a narrow transformation, each child partition depends on a single parent partition; in a wide transformation, each child partition depends on multiple parents.
• Execution: Narrow transformations are fast and easily optimized; wide transformations are slower due to network I/O and shuffling.
• Examples: Narrow: map(), filter(), union(), sample(); Wide: groupByKey(), reduceByKey(), join(), distinct()
Visual Difference:
• Narrow: One-to-one partition mapping.
• Wide: Many-to-many or many-to-one (requires shuffling stage).
Why it Matters:
• Wide transformations trigger stage boundaries, which can affect performance.
• Optimizing joins, using reduceByKey() instead of groupByKey() can minimize wide
operations.
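A minimal RDD sketch of the reduceByKey() vs groupByKey() point (assuming a SparkContext sc):
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

# Wide, but combines values within each partition before the shuffle (less data moved)
pairs.reduceByKey(lambda x, y: x + y).collect()   # [('a', 2), ('b', 1)]

# Also wide, but ships every individual value across the network before aggregating
pairs.groupByKey().mapValues(sum).collect()       # [('a', 2), ('b', 1)]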
17. How Does PySpark Infer Schemas, and Why Does It
Matter?
Schema Inference:
When creating a DataFrame from a structured source (like CSV, JSON, RDD), PySpark tries
to infer the data types (schema) automatically unless explicitly provided.
Example:
df = spark.read.csv("data.csv", header=True, inferSchema=True)
• inferSchema=True tells Spark to scan the data and infer types like Integer, String,
Double, etc.
Why It Matters:
• Better Performance: A known schema helps Spark optimize execution plans.
• Data Safety: Prevents issues caused by incorrect type casting.
• Catalyst Optimizer: Needs the schema to perform projection pruning, predicate pushdown, etc.
Caution:
• Schema inference can be costly on large files.
• It’s better to define schema manually for large datasets using StructType.
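A minimal sketch of defining the schema manually with StructType instead of relying on inference:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

# No inference pass over the file; the types are exactly as declared
df = spark.read.csv("data.csv", header=True, schema=schema)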
18. Role of SparkContext in a PySpark App
SparkContext is the entry point to low-level Spark features and the core of any
Spark application.
Key Roles:
• Cluster Connection: Connects PySpark to a Spark cluster (local, YARN, etc.).
• RDD Creation: Enables creating RDDs from local or external sources.
• Resource Management: Manages cluster resources, executors, and job scheduling.
• Broadcast Variables / Accumulators: Provides access to cluster-level shared variables.
Usage:
from pyspark import SparkContext
sc = SparkContext(appName="myApp")
rdd = sc.parallelize([1, 2, 3])
In PySpark 2.x and above, SparkSession wraps SparkContext, so you typically access it
via:
sc = spark.sparkContext
19. Performing Aggregations Effectively in PySpark
PySpark supports efficient aggregations using its DataFrame API.
Common Aggregation Functions:
• count(), sum(), avg(), min(), max()
• Grouped: groupBy().agg()
Example 1: Basic Aggregation
df.groupBy("department").agg({"salary": "avg", "age": "max"}).show()
Example 2: Using agg() with Aliases
from pyspark.sql.functions import avg, sum
df.groupBy("dept").agg(
    avg("salary").alias("avg_salary"),
    sum("bonus").alias("total_bonus")
).show()
Tips for Efficient Aggregation:
• Use reduceByKey() instead of groupByKey(): minimizes shuffling in RDDs.
• Use approxQuantile() for percentiles: reduces execution time (see the sketch below).
• Avoid UDFs in aggregations: prevents Catalyst optimization blockages.
• Repartition by key before grouping (if data is skewed): improves load balancing.
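For example, approxQuantile() computes approximate percentiles without a full sort; a quick sketch assuming a numeric salary column:
# Median and 90th percentile of salary, with 1% relative error
df.approxQuantile("salary", [0.5, 0.9], 0.01)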
20. How to Cache Data for Performance?
Why Cache?
Caching stores intermediate results in memory or disk to avoid recomputation and
improve performance of iterative or repeated operations.
Methods:
• df.cache(): Caches the DataFrame in memory (lazily; materialized by the next action).
• df.persist(StorageLevel): Caches with a custom storage level (e.g., memory + disk).
• df.unpersist(): Releases the memory/disk space.
Example:
df.cache()
df.count() # Action triggers caching
Custom Storage Level:
from pyspark.storagelevel import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
When to Cache:
• When the same DataFrame is used multiple times.
• When computations are expensive or involve wide transformations.
• During machine learning pipelines, graph computations, or iterative
aggregations.
Notes:
• Don't overuse caching—it may cause OutOfMemory errors.
• Always unpersist() when cache is no longer needed.
Final Summary:
• Narrow vs Wide: Narrow = no shuffle (fast); Wide = shuffle across partitions (expensive).
• Schema Inference: Spark auto-detects types; better to define the schema manually for large/critical datasets.
• SparkContext: Entry point to Spark's core functions like RDD creation and broadcasting.
• Aggregations: Use groupBy().agg() with built-in functions; avoid skew and use aliases.
• Caching: Use .cache() or .persist() to speed up repeated computations; monitor memory.