
DELOITTE & EY

DATA ENGINEER INTERVIEW QUESTIONS

CTC: 20+ LPA
EXPERIENCE: 0-3 YEARS

This set of questions covers key PySpark concepts and techniques: the differences between DataFrames and RDDs, optimization techniques, the role of the Catalyst Optimizer, serialization formats, handling skewed data, memory management, types of joins, lazy evaluation, and more, with practical examples and recommendations throughout.

1. Difference between a DataFrame and an RDD in PySpark


| Feature | RDD (Resilient Distributed Dataset) | DataFrame |
| --- | --- | --- |
| Abstraction Level | Low-level abstraction (object-oriented) | High-level abstraction (tabular data, like SQL tables) |
| Schema | No schema; stores data as Java/Scala/Python objects | Has a schema with named columns and data types |
| Ease of Use | Requires more lines of code (functional programming) | Easier to use; supports SQL-like operations |
| Optimization | No query optimization | Optimized by the Catalyst optimizer |
| Performance | Slower due to no optimization and serialization overhead | Faster due to optimization and the Tungsten execution engine |
| Use Case | Complex transformations and custom functions | Structured data analysis and SQL queries |
| Example | rdd = sc.parallelize([("Alice", 25), ("Bob", 30)]) | df = spark.createDataFrame([("Alice", 25), ("Bob", 30)], ["name", "age"]) |

Summary:

• Use RDDs when you need fine-grained control over data and operations.

• Use DataFrames for performance, simplicity, and SQL integration.
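
A minimal sketch contrasting the two abstractions on the same data (assumes an active SparkSession named spark; the column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd_vs_df").getOrCreate()

# RDD: low-level, no schema, functional style
rdd = spark.sparkContext.parallelize([("Alice", 25), ("Bob", 30)])
adults_rdd = rdd.filter(lambda row: row[1] > 26)

# DataFrame: named columns, SQL-like style, optimized by Catalyst
df = spark.createDataFrame(rdd, ["name", "age"])
adults_df = df.filter(df["age"] > 26)

adults_df.show()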


2. Techniques to Optimize PySpark Code Performance
Here are the top techniques for optimizing PySpark performance:

1. Use DataFrame API instead of RDDs

• DataFrames are optimized using Catalyst and Tungsten.

• Example:

df.select("column1").filter(df["column2"] > 10)

2. Persist/Cache Intermediate Results

• Use .cache() or .persist() when reusing data.

df.cache()

3. Filter Early (Predicate Pushdown)

• Apply filter() as early as possible to reduce data volume.

4. Broadcast Join for Smaller Datasets

• Avoid shuffles by broadcasting smaller tables.

from pyspark.sql.functions import broadcast

df1.join(broadcast(df2), "key")

5. Avoid Shuffles Where Possible

• Minimize use of operations like groupBy, repartition, join that cause shuffles.

6. Use .select() Instead of *

• Select only required columns to reduce memory usage.
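
For instance (a sketch assuming df has many more columns than the two we need, here called name and age):

# Project only the required columns and filter as early as possible
df_small = df.select("name", "age").filter(df["age"] > 30)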

7. Partitioning

• Use repartition() or coalesce() to control the number of partitions.

• Example:

df.repartition(4) # increases partitions

df.coalesce(1) # reduces partitions


8. Avoid UDFs When Possible

• Use built-in PySpark functions instead of UDFs for better performance.
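
A hedged comparison, assuming df has a string column called name:

from pyspark.sql.functions import udf, upper
from pyspark.sql.types import StringType

# Python UDF: each row is serialized to a Python worker and back (slow)
to_upper_udf = udf(lambda s: s.upper() if s else None, StringType())
df.withColumn("name_upper", to_upper_udf("name"))

# Built-in function: stays in the JVM and remains visible to the Catalyst optimizer (fast)
df.withColumn("name_upper", upper("name"))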

9. Tuning Spark Configurations

• Example tuning:

spark.conf.set("spark.sql.shuffle.partitions", "200")

10. Monitoring via Spark UI

• Analyze stages, jobs, and tasks in the Spark UI for bottlenecks.

3. Role of Catalyst Optimizer in Query Execution


Catalyst Optimizer is Spark SQL’s query optimizer that converts logical plans to
efficient physical plans.

Key Roles:

| Phase | Description |
| --- | --- |
| 1. Analysis | Checks column names and data types against the schema; converts unresolved attributes into resolved ones. |
| 2. Logical Optimization | Applies optimization rules such as predicate pushdown, constant folding, and null propagation. |
| 3. Physical Planning | Generates multiple physical plans and selects the most optimal one based on cost. |
| 4. Code Generation | Uses Whole-Stage Code Generation (WSCG) to compile parts of the plan into Java bytecode for speed. |

Examples of Optimizations:

• Predicate Pushdown: Filtering early to reduce data scan.

• Projection Pruning: Reads only necessary columns.


• Join Reordering: Reorders joins based on size and cost.

Why is it important?

• Automates performance improvements.

• Makes DataFrame and SQL APIs more efficient without manual tuning.

• Enables Spark to scale well with large datasets.
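
To see these phases at work, you can ask Spark to print the plans it builds (a small sketch assuming a DataFrame df with age and name columns):

# Prints the parsed, analyzed, and optimized logical plans plus the chosen physical plan
df.filter(df["age"] > 30).select("name").explain(True)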

Final Summary:

| Question | Key Takeaway |
| --- | --- |
| DataFrame vs RDD | DataFrames offer schema, SQL-like syntax, and performance via optimization; RDDs offer flexibility. |
| Optimization Techniques | Focus on broadcast joins, filtering early, caching data, using DataFrames over RDDs, and avoiding UDFs. |
| Catalyst Optimizer | Core of Spark SQL; it optimizes queries via logical and physical planning and code generation. |

4. Common Serialization Formats in PySpark and Their Use Cases

Serialization formats determine how data is stored or transmitted. In PySpark, choosing the right format impacts performance, compatibility, and storage efficiency.

Common Formats:

| Format | Description | Use Case |
| --- | --- | --- |
| Parquet | Columnar storage format with schema support; supports compression and predicate pushdown. | Ideal for big data analytics and read-heavy workloads, especially with Spark SQL. |
| Avro | Row-based format with rich schema support; highly compact. | Good for streaming, long-term data storage, or schema evolution scenarios. |
| ORC | Optimized columnar format for Hadoop, similar to Parquet but more efficient in some Hive operations. | Best for integration with Hive and heavy read/write operations. |
| JSON | Text-based, human-readable format; slower and larger in size. | Ideal for web logs, debugging, or configuration files. |
| CSV | Simple, row-based plain-text format; no schema support, larger file size. | Used for lightweight data exchange, legacy systems, or initial ingestion. |
| Delta Lake | Adds ACID transactions and versioning to the Parquet format; supports time travel and upserts. | Ideal for modern data lakes, lakehouse architecture, and streaming + batch unification. |
| Pickle (Python-specific) | Used to serialize Python objects; not cross-language. | Internal use in PySpark UDFs (not recommended for distributed storage). |

Recommendation:

• Use Parquet or Delta Lake for analytics and performance.

• Use JSON/CSV only for small datasets or initial ingestion.

• Avoid Pickle for distributed storage due to compatibility issues.
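
A minimal read/write sketch (paths and DataFrame names are placeholders):

# Write analytics data as Parquet: columnar, compressed, pushdown-friendly
df.write.mode("overwrite").parquet("output/events_parquet")

# Read it back; only the selected columns are scanned
df_parquet = spark.read.parquet("output/events_parquet")

# CSV/JSON remain acceptable for small files or initial ingestion
df_csv = spark.read.csv("input/raw.csv", header=True, inferSchema=True)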

5. How to Address Skewed Data in PySpark?


Skewed data means some partitions have disproportionately more data than others. This
causes performance issues and slow stages in Spark.

Techniques to Handle Skewed Data:

1. Salting Keys
• Add a random prefix or suffix to the skewed key to distribute load.

from pyspark.sql.functions import concat, lit, rand

# Add a salt to the skewed key

df = df.withColumn("salted_key", concat(df["key"], lit("_"), (rand()*10).cast("int")))

2. Broadcast Join

• If one table is small, broadcast it to avoid shuffle.

df1.join(broadcast(df2), "key")

3. Skew Join Optimization

• Use Spark 3.0+ feature:

spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

4. Custom Partitioning

• Use repartition() with custom logic to ensure even data distribution.

5. Avoid GroupBy on Skewed Columns

• Use aggregations after salting or use approximate functions.

Example:

from pyspark.sql.functions import rand, concat_ws, explode, array, lit

# Original skewed join
df1.join(df2, "user_id")

# After salting: add a random salt (as a string, 0-4) to the skewed side...
df1_salted = (df1.withColumn("salt", (rand() * 5).cast("int").cast("string"))
                 .withColumn("user_id_salt", concat_ws("_", "user_id", "salt")))

# ...and replicate the other side once per salt value so every salted key finds its match
df2_salted = (df2.withColumn("salt", explode(array(*[lit(str(i)) for i in range(5)])))
                 .withColumn("user_id_salt", concat_ws("_", "user_id", "salt")))

df1_salted.join(df2_salted, "user_id_salt")

6. How is Memory Management Handled in PySpark?


Spark memory is split into different regions for storage and execution. Effective memory
management is crucial for avoiding OOM (OutOfMemory) errors and ensuring
performance.

Memory Components in Spark:

Component Purpose

Execution Memory Used for shuffle, joins, aggregations

Storage Memory Used to cache/persist dataframes and RDDs

User Memory Memory used for user-defined data structures

Reserved Memory Fixed portion that Spark reserves for internal use

Memory Management Features:

1. Unified Memory Management

• From Spark 1.6+, execution and storage memory share a unified pool.

2. Dynamic Allocation

• Spark dynamically adjusts executors based on workload:

spark.conf.set("spark.dynamicAllocation.enabled", "true")

3. Garbage Collection (GC)

• Use appropriate JVM GC like G1GC to handle large objects and reduce GC pauses.

4. Memory Tuning Parameters

• Adjust executor memory and cores:

--executor-memory 4G --executor-cores 2 --driver-memory 2G


• Control memory fractions (these apply at application startup, so pass them via spark-submit --conf or the SparkSession builder rather than changing them at runtime):

--conf spark.memory.fraction=0.6 --conf spark.memory.storageFraction=0.5

5. Persist and Unpersist Data

• Use .persist() or .cache() wisely, and unpersist when not needed to free memory.

Final Summary:

| Question | Key Points |
| --- | --- |
| Serialization Formats | Prefer columnar formats like Parquet/ORC for analytics; use Delta for ACID. |
| Skewed Data Handling | Use salting, broadcast joins, adaptive skew join, or custom partitioning. |
| Memory Management | Optimize using unified memory, dynamic allocation, GC tuning, and persist/unpersist wisely. |

7. Types of Joins in PySpark and How to Implement Them


PySpark supports all standard SQL join types using the .join() function.

Join Types in PySpark:

| Join Type | Description | Syntax |
| --- | --- | --- |
| Inner Join | Returns rows with matching keys in both DataFrames. | df1.join(df2, "key", "inner") |
| Left Join (Left Outer) | All rows from the left DF plus matched rows from the right DF; nulls where no match. | df1.join(df2, "key", "left") |
| Right Join (Right Outer) | All rows from the right DF plus matched rows from the left DF; nulls where no match. | df1.join(df2, "key", "right") |
| Full Outer Join | All rows from both DFs; nulls where no match. | df1.join(df2, "key", "outer") |
| Left Semi Join | Returns rows from the left DF that have a match in the right DF; the right DF's columns are dropped. | df1.join(df2, "key", "left_semi") |
| Left Anti Join | Returns rows from the left DF that have no match in the right DF. | df1.join(df2, "key", "left_anti") |
| Cross Join | Cartesian product of the two DFs (all combinations). | df1.crossJoin(df2) |

Example:

df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

df2 = spark.createDataFrame([(1, "HR"), (3, "Finance")], ["id", "dept"])

# Inner Join

df1.join(df2, on="id", how="inner").show()
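
A couple of other join types on the same DataFrames, as a quick sketch:

# Left join: keep every row from df1; dept is null where there is no match
df1.join(df2, on="id", how="left").show()

# Left anti join: rows from df1 with no matching id in df2
df1.join(df2, on="id", how="left_anti").show()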

8. Purpose of broadcast() and When to Use It


Purpose:

broadcast() is used to send a small DataFrame to all executor nodes to avoid data
shuffling during joins.

Why Use It?

• Optimizes Join Performance when one table is small.

• Avoids shuffle which is expensive and can slow down joins.


• Improves execution time significantly for skewed joins.

When to Use:

• One DataFrame is small enough to fit in memory (~10MB–100MB).

• When performing a join on a large table with a small lookup table.

• For dimension-fact table joins (e.g., product info, country list).

Example:

from pyspark.sql.functions import broadcast

# Large table: transactions, small table: products

transactions.join(broadcast(products), "product_id").show()

Spark Conf Auto-broadcast:

• Spark will automatically broadcast tables smaller than this threshold:


spark.conf.get("spark.sql.autoBroadcastJoinThreshold") # default: 10MB

You can increase it if needed:


spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024) # 50 MB

9. How to Define and Use UDFs (User Defined Functions)?


A UDF (User Defined Function) allows you to apply custom logic to DataFrame columns
not covered by built-in functions.
Steps to Use UDF in PySpark:

1. Define a Python function

2. Register the function as a UDF

3. Apply it using .withColumn() or .select()

Example: Capitalize First Letter

from pyspark.sql.functions import udf

from pyspark.sql.types import StringType

# Step 1: Define Python function

def capitalize_name(name):

return name.capitalize() if name else None

# Step 2: Register UDF

cap_udf = udf(capitalize_name, StringType())

# Step 3: Apply UDF

df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

df.withColumn("cap_name", cap_udf("name")).show()

When to Use UDFs:

• When the transformation logic is not available in built-in functions.

• When applying custom business logic (e.g., text normalization, regex cleaning,
etc.)
Caution:

• UDFs are slower than native PySpark functions.

• They bypass Catalyst Optimizer and cannot be pushed down.

• Prefer built-in functions or pandas UDFs (if using Spark 3.x+).

Pandas UDF Example (for better performance):

from pyspark.sql.functions import pandas_udf

@pandas_udf(StringType())

def upper_case_pandas(col):

return col.str.upper()

df.withColumn("upper_name", upper_case_pandas(df["name"])).show()

Final Summary:

| Question | Key Takeaway |
| --- | --- |
| Types of Joins | PySpark supports inner, left, right, outer, semi, anti, and cross joins. Choose based on business need. |
| broadcast() | Use to optimize joins when one DF is small; avoids shuffling and improves speed. |
| UDFs | Enable custom column transformations. Avoid unless built-in functions or pandas UDFs can't handle the logic. |

10. What is Lazy Evaluation, and How Does It Affect Jobs?


Definition:

Lazy Evaluation means that Spark does not immediately execute transformations (like
map(), filter(), select(), etc.). Instead, it builds a logical execution plan. The execution only
starts when an action (like show(), count(), collect()) is triggered.

How It Works:

• Transformations: Lazy; they return a new DataFrame/RDD but don't compute anything immediately.

• Actions: Trigger the actual computation and return results.

Examples:

# Lazy - no computation yet

df_filtered = df.filter(df["age"] > 30)

# Still lazy - chain of transformations

df_transformed = df_filtered.select("name", "age")

# Trigger action - now Spark builds DAG and executes

df_transformed.show()

Benefits of Lazy Evaluation:

1. Optimized Execution Plan: Spark uses the Catalyst Optimizer to reorder and
combine operations for better performance.

2. Pipeline Efficiency: Merges operations to avoid unnecessary disk and memory usage.

3. Failure Recovery: Because of logical lineage (DAG), Spark can recompute lost
partitions.
Impact:

• Developers must understand that transformations won’t show results or errors until
an action is executed.

• Makes debugging harder, but greatly improves performance.

11. Steps to Create a DataFrame in PySpark


There are multiple ways to create a DataFrame in PySpark. Here are the most common
ones:

1. From a List of Tuples:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("Alice", 30), ("Bob", 25)]

columns = ["name", "age"]

df = spark.createDataFrame(data, schema=columns)

df.show()

2. From an RDD:

rdd = spark.sparkContext.parallelize(data)

df_from_rdd = rdd.toDF(["name", "age"])

3. From a CSV/Parquet/JSON File:

df_csv = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)


df_parquet = spark.read.parquet("path/to/file.parquet")

df_json = spark.read.json("path/to/file.json")

4. From Pandas DataFrame:

import pandas as pd

pdf = pd.DataFrame(data, columns=columns)

df_from_pd = spark.createDataFrame(pdf)

12. Concept of RDDs (Resilient Distributed Datasets)


Definition:

An RDD (Resilient Distributed Dataset) is the core abstraction in Spark that represents
an immutable, distributed collection of objects that can be processed in parallel.

Key Characteristics:

| Feature | Description |
| --- | --- |
| Resilient | Fault-tolerant via lineage (lost partitions can be recovered) |
| Distributed | Spread across multiple nodes in the cluster |
| Immutable | Once created, cannot be changed; transformations create new RDDs |
| Lazy Evaluation | Like DataFrames, operations are not executed until an action is called |
| In-Memory Computation | Fast operations using RAM |

Creating RDDs:

# From a list

rdd1 = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# From external file

rdd2 = spark.sparkContext.textFile("path/to/file.txt")

Common Operations:

Type Function

Transformations map(), filter(), flatMap(), union(), distinct()

Actions collect(), count(), take(), first(), reduce()

When to Use RDDs:

• When you need fine-grained control over the data.

• For low-level transformations or custom business logic.

• When DataFrames are insufficient, e.g., working with binary data or complex data
flows.
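
A small sketch chaining RDD transformations and actions (assumes an active SparkSession named spark):

rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# Transformations are lazy
squares = rdd.map(lambda x: x * x).filter(lambda x: x > 5)

# Actions trigger execution
print(squares.collect())  # [9, 16, 25]
print(squares.count())    # 3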

Final Summary:
| Question | Key Insight |
| --- | --- |
| Lazy Evaluation | Spark builds a plan but defers execution until an action is called. Enables optimization. |
| Create DataFrame | Use createDataFrame() from a list, RDD, or Pandas DataFrame, or load files (CSV, JSON, Parquet). |
| RDDs | Core Spark abstraction for distributed data processing. Less optimized than DataFrames but more flexible. |

13. Difference Between Actions vs Transformations in PySpark

| Category | Transformations | Actions |
| --- | --- | --- |
| Definition | Operations that define a new dataset from an existing one. Lazy. | Operations that trigger execution and return results. |
| Evaluation | Lazy; not executed until an action is called. | Eager; trigger execution of the DAG (Directed Acyclic Graph). |
| Return Type | Return a new DataFrame/RDD. | Return a final result (e.g., count, collect) or write to output. |
| Examples | filter(), map(), select(), groupBy() | show(), count(), collect(), saveAsTextFile() |

Example:

# Transformation (lazy)

df_filtered = df.filter(df["age"] > 25)

# Action (triggers execution)

df_filtered.show()
Summary:

• Use transformations to define operations.

• Use actions to trigger computation and get results.

14. Handling Null Values in DataFrames


Null values are common in real-world data and should be handled to avoid incorrect
results or failures in transformations.

1. Drop Nulls:

# Drop rows with any nulls

df.na.drop()

# Drop rows where specific column is null

df.na.drop(subset=["column_name"])

2. Fill Nulls:

# Fill all numeric nulls with 0

df.na.fill(0)

# Fill nulls in specific column with value

df.na.fill({"salary": 0, "department": "Unknown"})

3. Fill Nulls in a Specific Column:

# Fill nulls with "N/A" in the 'name' column

df.fillna("N/A", subset=["name"])
4. Filter Rows with Nulls:

# Select rows where 'age' is NOT NULL

df.filter(df["age"].isNotNull())

5. Using when() and isNull():

from pyspark.sql.functions import when

df.withColumn("salary_status", when(df["salary"].isNull(),
"Missing").otherwise("Available"))

Best Practice:

• Analyze the nature and impact of missing values.

• Use domain knowledge to decide between dropping, imputing, or flagging.

15. What is a Partition, and How Do You Manage It?


Definition:

A partition in Spark refers to a logical chunk of data distributed across different nodes in
the cluster. All Spark computations operate per partition in parallel.

Why Are Partitions Important?

• Enable parallelism (more partitions = better resource utilization).

• Affect data shuffling, execution time, and memory use.

• Poor partitioning → data skew, OOM errors, or slow jobs.

Default Behavior:

• Number of partitions depends on input source (e.g., file splits, HDFS blocks).

• For DataFrames, Spark often uses 200 shuffle partitions by default:


spark.conf.get("spark.sql.shuffle.partitions") # Default: 200

Managing Partitions:

Method Purpose

repartition(n) Increases or resets number of partitions; causes a shuffle

coalesce(n) Reduces number of partitions without full shuffle; more efficient

partitionBy() Used during writing data to disk (e.g., Parquet) to physically organize files

Examples:

# Repartition to 10 partitions (shuffle)

df.repartition(10)

# Coalesce to 1 partition (no shuffle)

df.coalesce(1)

# Write to disk partitioned by column

df.write.partitionBy("region").parquet("output/")

Best Practice:

• Use coalesce() after filtering to avoid unnecessary shuffles.

• Use partitionBy() when writing to optimize read queries later.

Final Summary:
| Question | Key Takeaway |
| --- | --- |
| Actions vs Transformations | Transformations are lazy and define lineage; actions trigger execution. |
| Handling Nulls | Use .na.drop(), .na.fill(), isNull(), and when() to clean or impute data. |
| Partitions | Units of parallelism in Spark; manage them with repartition, coalesce, and partitionBy for performance tuning. |

16. Difference Between Narrow vs Wide Transformations


Transformations in Spark are classified based on data shuffling behavior.

| Aspect | Narrow Transformation | Wide Transformation |
| --- | --- | --- |
| Definition | Data is not shuffled between partitions. | Data is shuffled across partitions/nodes. |
| Dependency | Each child partition depends on a single parent partition. | Each child partition depends on multiple parent partitions. |
| Execution | Fast and more optimized. | Slower due to network I/O and shuffling. |
| Examples | map(), filter(), union(), sample() | groupByKey(), reduceByKey(), join(), distinct() |

Visual Difference:

• Narrow: One-to-one partition mapping.

• Wide: Many-to-many or many-to-one (requires shuffling stage).

Why it Matters:

• Wide transformations trigger stage boundaries, which can affect performance.


• Optimizing joins and preferring reduceByKey() over groupByKey() reduces the amount of data shuffled, since reduceByKey combines values on each partition before the shuffle (see the sketch below).
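
A minimal sketch on a small pairs RDD (both operations are wide, but reduceByKey shuffles far less data):

pairs = spark.sparkContext.parallelize([("a", 1), ("b", 1), ("a", 1)])

# groupByKey: every (key, value) pair crosses the network before summing
counts_slow = pairs.groupByKey().mapValues(sum)

# reduceByKey: partial sums are computed per partition first
counts_fast = pairs.reduceByKey(lambda x, y: x + y)

print(sorted(counts_fast.collect()))  # [('a', 2), ('b', 1)]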

17. How Does PySpark Infer Schemas, and Why Does It Matter?

Schema Inference:

When creating a DataFrame from a structured source (like CSV, JSON, RDD), PySpark tries
to infer the data types (schema) automatically unless explicitly provided.

Example:

df = spark.read.csv("data.csv", header=True, inferSchema=True)

• inferSchema=True tells Spark to scan the data and infer types like Integer, String,
Double, etc.

Why It Matters:

| Benefit | Description |
| --- | --- |
| Better Performance | The schema helps Spark optimize execution plans. |
| Data Safety | Prevents issues due to incorrect type casting. |
| Catalyst Optimizer | Needs the schema to perform projection pruning, predicate pushdown, etc. |

Caution:

• Schema inference can be costly on large files.

• It’s better to define schema manually for large datasets using StructType.
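
A sketch of defining the schema explicitly (the column names and types here are placeholders):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", DoubleType(), True),
])

# No inference pass over the data; the declared types are used directly
df = spark.read.csv("data.csv", header=True, schema=schema)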
18. Role of SparkContext in a PySpark App
SparkContext is the entry point to low-level Spark features and the core of any
Spark application.

Key Roles:

| Role | Description |
| --- | --- |
| Cluster Connection | Connects PySpark to a Spark cluster (local, YARN, etc.). |
| RDD Creation | Enables creation of RDDs from local or external sources. |
| Resource Manager | Manages cluster resources, executors, and job scheduling. |
| Broadcast Variables / Accumulators | Provides access to cluster-level shared variables. |

Usage:

from pyspark import SparkContext

sc = SparkContext(appName="myApp")

rdd = sc.parallelize([1, 2, 3])

In PySpark 2.x and above, SparkSession wraps SparkContext, so you typically access it
via:

sc = spark.sparkContext

19. Performing Aggregations Effectively in PySpark


PySpark supports efficient aggregations using its DataFrame API.
Common Aggregation Functions:

• count(), sum(), avg(), min(), max()

• Grouped: groupBy().agg()

Example 1: Basic Aggregation

df.groupBy("department").agg({"salary": "avg", "age": "max"}).show()

Example 2: Using agg() with Aliases

from pyspark.sql.functions import avg, sum

df.groupBy("dept").agg(
    avg("salary").alias("avg_salary"),
    sum("bonus").alias("total_bonus")
).show()

Tips for Efficient Aggregation:

| Tip | Benefit |
| --- | --- |
| Use reduceByKey() instead of groupByKey() | Minimizes shuffling in RDDs |
| Use approxQuantile() for faster percentiles | Reduces execution time |
| Avoid UDFs in aggregations | Prevents Catalyst optimization blockages |
| Repartition based on keys before grouping (if data is skewed) | Improves load balancing |
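
For example, the approximate-percentile tip from the table above (a sketch assuming a numeric salary column):

# Median and 95th percentile with 1% relative error; far cheaper than exact percentiles
median, p95 = df.approxQuantile("salary", [0.5, 0.95], 0.01)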
20. How to Cache Data for Performance?
Why Cache?

Caching stores intermediate results in memory or on disk to avoid recomputation and improve the performance of iterative or repeated operations.

Methods:

Method Description

df.cache() Caches the DataFrame in memory (lazy).

df.persist(StorageLevel) Caches with custom storage levels (memory + disk).

df.unpersist() Releases memory/disk space.

Example:

df.cache()

df.count() # Action triggers caching

Custom Storage Level:

from pyspark.storagelevel import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)

When to Cache:

• When the same DataFrame is used multiple times.

• When computations are expensive or involve wide transformations.

• During machine learning pipelines, graph computations, or iterative aggregations.
Notes:

• Don't overuse caching—it may cause OutOfMemory errors.

• Always unpersist() when cache is no longer needed.

Final Summary:

| Question | Key Insight |
| --- | --- |
| Narrow vs Wide | Narrow = no shuffle (fast); Wide = shuffle across partitions (expensive). |
| Schema Inference | Spark auto-detects types; better to define the schema manually for large/critical datasets. |
| SparkContext | Entry point to Spark's core functions like RDD creation, broadcasting, etc. |
| Aggregations | Use groupBy().agg() with built-in functions; avoid skew and use aliases. |
| Caching | Use .cache() or .persist() to speed up repeated computations; monitor memory. |
