DELOITTE & EY
DATA ENGINEER INTERVIEW QUESTIONS
CTC: 20+ LPA
EXPERIENCE: 0-3 YEARS
1. Difference between a DataFrame and an RDD in PySpark
• Abstraction Level: RDD (Resilient Distributed Dataset) is a low-level, object-oriented abstraction; DataFrame is a high-level, tabular abstraction (like SQL tables).
• Schema: RDD has no schema and stores data as Java/Scala/Python objects; DataFrame has a schema with named columns and data types.
• Ease of Use: RDD requires more lines of code (functional programming); DataFrame is easier to use and supports SQL-like operations.
• Optimization: RDD has no query optimization; DataFrame is optimized using the Catalyst optimizer.
• Performance: RDD is slower due to the lack of optimization and serialization overhead; DataFrame is faster thanks to optimization and the Tungsten execution engine.
• Use Case: RDD suits complex transformations and custom functions; DataFrame suits structured data analysis and SQL queries.
• Example:
rdd = sc.parallelize([("Alice", 25), ("Bob", 30)])
df = spark.createDataFrame([("Alice", 25), ("Bob", 30)], ["name", "age"])
Summary:
• Use RDDs when you need fine-grained control over data and operations.
• Use DataFrames for performance, simplicity, and SQL integration.
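A minimal sketch showing both abstractions over the same data and how to move between them (the app name is illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd_vs_df").getOrCreate()

# RDD: low-level, no schema, plain Python tuples
rdd = spark.sparkContext.parallelize([("Alice", 25), ("Bob", 30)])

# DataFrame: named columns with types, optimized by Catalyst/Tungsten
df = spark.createDataFrame(rdd, ["name", "age"])

# Converting between the two abstractions
df_from_rdd = rdd.toDF(["name", "age"])   # RDD -> DataFrame
rdd_from_df = df.rdd                      # DataFrame -> RDD of Row objects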
2. Techniques to Optimize PySpark Code Performance
Here are the top techniques for optimizing PySpark performance:
1. Use DataFrame API instead of RDDs
• DataFrames are optimized using Catalyst and Tungsten.
• Example:
df.select("column1").filter(df["column2"] > 10)
2. Persist/Cache Intermediate Results
• Use .cache() or .persist() when reusing data.
df.cache()
3. Filter Early (Predicate Pushdown)
• Apply filter() as early as possible to reduce data volume.
4. Broadcast Join for Smaller Datasets
• Avoid shuffles by broadcasting smaller tables.
from pyspark.sql.functions import broadcast
df1.join(broadcast(df2), "key")
5. Avoid Shuffles Where Possible
• Minimize use of operations like groupBy, repartition, join that cause shuffles.
6. Use .select() Instead of *
• Select only required columns to reduce memory usage.
7. Partitioning
• Use repartition() or coalesce() to control the number of partitions.
• Example:
df.repartition(4) # increases partitions
df.coalesce(1) # reduces partitions
8. Avoid UDFs When Possible
• Use built-in PySpark functions instead of UDFs for better performance (see the sketch after this list).
9. Tuning Spark Configurations
• Example tuning:
spark.conf.set("spark.sql.shuffle.partitions", "200")
10. Monitoring via Spark UI
• Analyze stages, jobs, and tasks in the Spark UI for bottlenecks.
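To illustrate point 8, a minimal sketch (assuming a DataFrame df with a string column name) comparing a built-in function with an equivalent UDF:
from pyspark.sql.functions import col, udf, upper
from pyspark.sql.types import StringType

# Built-in function: handled by Catalyst/Tungsten, no Python serialization overhead
df.withColumn("name_upper", upper(col("name")))

# Equivalent UDF: runs row by row in Python, so it is usually much slower
upper_udf = udf(lambda s: s.upper() if s else None, StringType())
df.withColumn("name_upper", upper_udf(col("name")))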
3. Role of Catalyst Optimizer in Query Execution
Catalyst Optimizer is Spark SQL’s query optimizer that converts logical plans to
efficient physical plans.
Key Roles:
• 1. Analysis: Checks column names and data types against the schema; converts unresolved attributes into resolved ones.
• 2. Logical Optimization: Applies optimization rules such as predicate pushdown, constant folding, and null propagation.
• 3. Physical Planning: Generates multiple physical plans and selects the most optimal one based on cost.
• 4. Code Generation: Uses Whole-Stage Code Generation (WSCG) to compile parts of the plan into Java bytecode for speed.
Examples of Optimizations:
• Predicate Pushdown: Filtering early to reduce data scan.
• Projection Pruning: Reads only necessary columns.
• Join Reordering: Reorders joins based on size and cost.
Why is it important?
• Automates performance improvements.
• Makes DataFrame and SQL APIs more efficient without manual tuning.
• Enables Spark to scale well with large datasets.
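You can inspect the plans Catalyst produces with explain(); a quick sketch (assuming a DataFrame df with age and name columns):
# Prints the parsed, analyzed, and optimized logical plans plus the physical plan
df.filter(df["age"] > 30).select("name").explain(True)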
Final Summary:
• DataFrame vs RDD: DataFrames offer schema, SQL-like syntax, and performance via optimization; RDDs offer flexibility.
• Optimization Techniques: Focus on broadcast joins, filtering early, caching data, using DataFrames over RDDs, and avoiding UDFs.
• Catalyst Optimizer: The core of Spark SQL; it optimizes queries via logical and physical planning and code generation.
4. Common Serialization Formats in PySpark and Their Use
Cases
Serialization formats determine how data is stored or transmitted. In PySpark, choosing the
right format impacts performance, compatibility, and storage efficiency.
Common Formats:
• Parquet: Columnar storage format with schema support; supports compression and predicate pushdown. Ideal for big data analytics and read-heavy workloads, especially with Spark SQL.
• Avro: Row-based format with rich schema support; highly compact. Good for streaming, long-term data storage, or schema evolution scenarios.
• ORC: Optimized columnar format for Hadoop, similar to Parquet but more efficient in some Hive operations. Best for integration with Hive and heavy read/write workloads.
• JSON: Text-based, human-readable format; slower and larger in size. Ideal for web logs, debugging, or configuration files.
• CSV: Simple, row-based plain-text format; no schema support and larger file size. Used for lightweight data exchange, legacy systems, or initial ingestion.
• Delta Lake: Adds ACID transactions and versioning on top of Parquet; supports time travel and upserts. Ideal for modern data lakes, lakehouse architectures, and unified streaming + batch workloads.
• Pickle (Python-specific): Serializes Python objects; not cross-language. Used internally by PySpark UDFs; not recommended for distributed storage.
Recommendation:
• Use Parquet or Delta Lake for analytics and performance.
• Use JSON/CSV only for small datasets or initial ingestion.
• Avoid Pickle for distributed storage due to compatibility issues.
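A minimal sketch of writing and reading the recommended formats (paths are illustrative, and the Delta example assumes the delta-spark package is installed and configured on the session):
# Parquet: columnar, compressed, supports predicate pushdown
df.write.mode("overwrite").parquet("output/events_parquet")
df_parquet = spark.read.parquet("output/events_parquet")

# Delta Lake: Parquet plus ACID transactions, versioning, and time travel
df.write.format("delta").mode("overwrite").save("output/events_delta")
df_delta = spark.read.format("delta").load("output/events_delta")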
5. How to Address Skewed Data in PySpark?
Skewed data means some partitions have disproportionately more data than others. This
causes performance issues and slow stages in Spark.
Techniques to Handle Skewed Data:
1. Salting Keys
• Add a random prefix or suffix to the skewed key to distribute load.
from pyspark.sql.functions import concat, lit, rand
# Add a random salt (0-9) to the skewed key so its rows spread across more partitions
df = df.withColumn("salted_key", concat(df["key"], lit("_"), (rand() * 10).cast("int").cast("string")))
2. Broadcast Join
• If one table is small, broadcast it to avoid shuffle.
df1.join(broadcast(df2), "key")
3. Skew Join Optimization
• Use the Spark 3.0+ adaptive skew-join feature (requires Adaptive Query Execution to be enabled):
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
4. Custom Partitioning
• Use repartition() with custom logic to ensure even data distribution.
5. Avoid GroupBy on Skewed Columns
• Use aggregations after salting or use approximate functions.
Example:
# Original skewed join
df1.join(df2, "user_id")
# After salting: salt the skewed (large) table with a random value, then replicate the
# other table once per salt value so every original key still finds its match
from pyspark.sql.functions import col, concat, lit, rand, explode, array
num_salts = 5
df1_salted = df1.withColumn("salt", (rand() * num_salts).cast("int").cast("string")) \
    .withColumn("user_id_salt", concat(col("user_id").cast("string"), lit("_"), col("salt")))
df2_salted = df2.withColumn("salt", explode(array([lit(str(i)) for i in range(num_salts)]))) \
    .withColumn("user_id_salt", concat(col("user_id").cast("string"), lit("_"), col("salt")))
df1_salted.join(df2_salted, "user_id_salt")
6. How is Memory Management Handled in PySpark?
Spark memory is split into different regions for storage and execution. Effective memory
management is crucial for avoiding OOM (OutOfMemory) errors and ensuring
performance.
Memory Components in Spark:
• Execution Memory: Used for shuffles, joins, and aggregations.
• Storage Memory: Used to cache/persist DataFrames and RDDs.
• User Memory: Used for user-defined data structures.
• Reserved Memory: Fixed portion that Spark reserves for internal use.
Memory Management Features:
1. Unified Memory Management
• From Spark 1.6+, execution and storage memory share a unified pool.
2. Dynamic Allocation
• Spark can add or remove executors based on the workload; enable it at submit time (it cannot be toggled on an already running session):
--conf spark.dynamicAllocation.enabled=true
3. Garbage Collection (GC)
• Use appropriate JVM GC like G1GC to handle large objects and reduce GC pauses.
4. Memory Tuning Parameters
• Adjust executor memory and cores at submit time:
--executor-memory 4G --executor-cores 2 --driver-memory 2G
• Control the memory fractions (set when the application/SparkSession is created, not at runtime):
spark = SparkSession.builder \
    .config("spark.memory.fraction", "0.6") \
    .config("spark.memory.storageFraction", "0.5") \
    .getOrCreate()
5. Persist and Unpersist Data
• Use .persist() or .cache() wisely, and unpersist when not needed to free memory.
Final Summary:
• Serialization Formats: Prefer columnar formats like Parquet/ORC for analytics; use Delta Lake for ACID guarantees.
• Skewed Data Handling: Use salting, broadcast joins, adaptive skew joins, or custom partitioning.
• Memory Management: Optimize using unified memory, dynamic allocation, GC tuning, and careful persist/unpersist.
7. Types of Joins in PySpark and How to Implement Them
PySpark supports all standard SQL join types using the .join() function.
Join Types in PySpark:
• Inner Join: Returns rows with matching keys in both DataFrames. df1.join(df2, "key", "inner")
• Left Join (Left Outer): All rows from the left DF plus matched rows from the right DF; nulls where no match. df1.join(df2, "key", "left")
• Right Join (Right Outer): All rows from the right DF plus matched rows from the left DF; nulls where no match. df1.join(df2, "key", "right")
• Full Outer Join: All rows from both DFs; nulls where no match. df1.join(df2, "key", "outer")
• Left Semi Join: Returns rows from the left DF that have a match in the right DF; the right DF's columns are dropped. df1.join(df2, "key", "left_semi")
• Left Anti Join: Returns rows from the left DF with no match in the right DF. df1.join(df2, "key", "left_anti")
• Cross Join: Cartesian product of the two DFs (all combinations). df1.crossJoin(df2)
Example:
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df2 = spark.createDataFrame([(1, "HR"), (3, "Finance")], ["id", "dept"])
# Inner Join
df1.join(df2, on="id", how="inner").show()
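Using the same df1 and df2, a few of the other join types from the table (a quick sketch):
# Left join: keeps every row of df1; dept is null where there is no match
df1.join(df2, on="id", how="left").show()

# Left semi: only df1 rows that have a match in df2 (df2's columns are dropped)
df1.join(df2, on="id", how="left_semi").show()

# Left anti: only df1 rows with no match in df2
df1.join(df2, on="id", how="left_anti").show()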
8. Purpose of broadcast() and When to Use It
Purpose:
broadcast() is used to send a small DataFrame to all executor nodes to avoid data
shuffling during joins.
Why Use It?
• Optimizes Join Performance when one table is small.
• Avoids shuffle which is expensive and can slow down joins.
• Improves execution time significantly for skewed joins.
When to Use:
• One DataFrame is small enough to fit in memory (~10MB–100MB).
• When performing a join on a large table with a small lookup table.
• For dimension-fact table joins (e.g., product info, country list).
Example:
from pyspark.sql.functions import broadcast
# Large table: transactions, small table: products
transactions.join(broadcast(products), "product_id").show()
Spark Conf Auto-broadcast:
• Spark will automatically broadcast tables smaller than this threshold:
spark.conf.get("spark.sql.autoBroadcastJoinThreshold") # default: 10MB
You can increase it if needed:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024) # 50 MB
9. How to Define and Use UDFs (User Defined Functions)?
A UDF (User Defined Function) allows you to apply custom logic to DataFrame columns
not covered by built-in functions.
Steps to Use UDF in PySpark:
1. Define a Python function
2. Register the function as a UDF
3. Apply it using .withColumn() or .select()
Example: Capitalize First Letter
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Step 1: Define a Python function
def capitalize_name(name):
    return name.capitalize() if name else None
# Step 2: Register UDF
cap_udf = udf(capitalize_name, StringType())
# Step 3: Apply UDF
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])
df.withColumn("cap_name", cap_udf("name")).show()
When to Use UDFs:
• When the transformation logic is not available in built-in functions.
• When applying custom business logic (e.g., text normalization, regex cleaning,
etc.)
Caution:
• UDFs are slower than native PySpark functions.
• They bypass Catalyst Optimizer and cannot be pushed down.
• Prefer built-in functions or pandas UDFs (if using Spark 3.x+).
Pandas UDF Example (for better performance):
from pyspark.sql.functions import pandas_udf
@pandas_udf(StringType())
def upper_case_pandas(col):
    return col.str.upper()
df.withColumn("upper_name", upper_case_pandas(df["name"])).show()
Final Summary:
• Types of Joins: PySpark supports inner, left, right, outer, semi, anti, and cross joins; choose based on the business need.
• broadcast(): Use it to optimize joins when one DataFrame is small; it avoids shuffling and improves speed.
• UDFs: Enable custom column transformations; avoid them unless built-in functions or pandas UDFs cannot handle the logic.
10. What is Lazy Evaluation, and How Does It Affect Jobs?
Definition:
Lazy Evaluation means that Spark does not immediately execute transformations (like
map(), filter(), select(), etc.). Instead, it builds a logical execution plan. The execution only
starts when an action (like show(), count(), collect()) is triggered.
How It Works:
• Transformations: Lazy — they return a new DataFrame/RDD but don't compute
anything immediately.
• Actions: Trigger the actual computation and return results.
Examples:
# Lazy - no computation yet
df_filtered = df.filter(df["age"] > 30)
# Still lazy - chain of transformations
df_transformed = df_filtered.select("name", "age")
# Trigger action - now Spark builds DAG and executes
df_transformed.show()
Benefits of Lazy Evaluation:
1. Optimized Execution Plan: Spark uses the Catalyst Optimizer to reorder and
combine operations for better performance.
2. Pipeline Efficiency: Merges operations to avoid unnecessary disk and memory
usage.
3. Failure Recovery: Because of logical lineage (DAG), Spark can recompute lost
partitions.
Impact:
• Developers must understand that transformations won’t show results or errors until
an action is executed.
• Makes debugging harder, but greatly improves performance.
11. Steps to Create a DataFrame in PySpark
There are multiple ways to create a DataFrame in PySpark. Here are the most common
ones:
1. From a List of Tuples:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
data = [("Alice", 30), ("Bob", 25)]
columns = ["name", "age"]
df = spark.createDataFrame(data, schema=columns)
df.show()
2. From an RDD:
rdd = spark.sparkContext.parallelize(data)
df_from_rdd = rdd.toDF(["name", "age"])
3. From a CSV/Parquet/JSON File:
df_csv = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
df_parquet = spark.read.parquet("path/to/file.parquet")
df_json = spark.read.json("path/to/file.json")
4. From Pandas DataFrame:
import pandas as pd
pdf = pd.DataFrame(data, columns=columns)
df_from_pd = spark.createDataFrame(pdf)
12. Concept of RDDs (Resilient Distributed Datasets)
Definition:
An RDD (Resilient Distributed Dataset) is the core abstraction in Spark that represents
an immutable, distributed collection of objects that can be processed in parallel.
Key Characteristics:
• Resilient: Fault-tolerant via lineage (lost partitions can be recomputed).
• Distributed: Spread across multiple nodes in the cluster.
• Immutable: Once created, cannot be changed; transformations create new RDDs.
• Lazy Evaluation: Like DataFrames, operations are not executed until an action is called.
• In-Memory Computation: Fast operations using RAM.
Creating RDDs:
# From a list
rdd1 = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
# From external file
rdd2 = spark.sparkContext.textFile("path/to/file.txt")
Common Operations:
• Transformations: map(), filter(), flatMap(), union(), distinct()
• Actions: collect(), count(), take(), first(), reduce()
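A small sketch chaining a few of these operations on the rdd1 created above:
# Transformations are lazy; nothing executes until an action is called
squares = rdd1.map(lambda x: x * x)            # 1, 4, 9, 16, 25
evens = squares.filter(lambda x: x % 2 == 0)

# Actions trigger the computation
evens.collect()                                # [4, 16]
squares.reduce(lambda a, b: a + b)             # 55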
When to Use RDDs:
• When you need fine-grained control over the data.
• For low-level transformations or custom business logic.
• When DataFrames are insufficient, e.g., working with binary data or complex data
flows.
Final Summary:
• Lazy Evaluation: Spark builds a plan but defers execution until an action is called; this enables optimization.
• Create DataFrame: Use createDataFrame() from a list, RDD, or pandas DataFrame, or load files (CSV, JSON, Parquet).
• RDDs: Core Spark abstraction for distributed data processing; less optimized than DataFrames but more flexible.
13. Difference Between Actions vs Transformations in
PySpark
• Definition: Transformations define a new dataset from an existing one and are lazy; actions trigger execution and return results.
• Evaluation: Transformations are lazy (not executed until an action is called); actions are eager and trigger the DAG (Directed Acyclic Graph) execution.
• Return Type: Transformations return a new DataFrame/RDD; actions return a final result (e.g., count, collect) or write to output.
• Examples: Transformations: filter(), map(), select(), groupBy(); Actions: show(), count(), collect(), saveAsTextFile()
Example:
# Transformation (lazy)
df_filtered = df.filter(df["age"] > 25)
# Action (triggers execution)
df_filtered.show()
Summary:
• Use transformations to define operations.
• Use actions to trigger computation and get results.
14. Handling Null Values in DataFrames
Null values are common in real-world data and should be handled to avoid incorrect
results or failures in transformations.
1. Drop Nulls:
# Drop rows with any nulls
df.na.drop()
# Drop rows where specific column is null
df.na.drop(subset=["column_name"])
2. Fill Nulls:
# Fill all numeric nulls with 0
df.na.fill(0)
# Fill nulls in specific column with value
df.na.fill({"salary": 0, "department": "Unknown"})
3. Replace Specific Values:
# Replace NULL with "N/A" in a column
df.fillna("N/A", subset=["name"])
4. Filter Rows with Nulls:
# Select rows where 'age' is NOT NULL
df.filter(df["age"].isNotNull())
5. Using when() and isNull():
from pyspark.sql.functions import when
df.withColumn("salary_status", when(df["salary"].isNull(),
"Missing").otherwise("Available"))
Best Practice:
• Analyze the nature and impact of missing values.
• Use domain knowledge to decide between dropping, imputing, or flagging.
15. What is a Partition, and How Do You Manage It?
Definition:
A partition in Spark refers to a logical chunk of data distributed across different nodes in
the cluster. All Spark computations operate per partition in parallel.
Why Are Partitions Important?
• Enable parallelism (more partitions = better resource utilization).
• Affect data shuffling, execution time, and memory use.
• Poor partitioning → data skew, OOM errors, or slow jobs.
Default Behavior:
• Number of partitions depends on input source (e.g., file splits, HDFS blocks).
• For DataFrames, Spark often uses 200 shuffle partitions by default:
spark.conf.get("spark.sql.shuffle.partitions") # Default: 200
Managing Partitions:
• repartition(n): Increases or resets the number of partitions; causes a full shuffle.
• coalesce(n): Reduces the number of partitions without a full shuffle; more efficient.
• partitionBy(): Used when writing data to disk (e.g., Parquet) to physically organize files.
Examples:
# Repartition to 10 partitions (shuffle)
df.repartition(10)
# Coalesce to 1 partition (no shuffle)
df.coalesce(1)
# Write to disk partitioned by column
df.write.partitionBy("region").parquet("output/")
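To check how many partitions a DataFrame currently has, a quick sketch:
# Number of partitions backing the DataFrame
print(df.rdd.getNumPartitions())

# After an explicit repartition
print(df.repartition(10).rdd.getNumPartitions())  # 10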
Best Practice:
• Use coalesce() after filtering to avoid unnecessary shuffles.
• Use partitionBy() when writing to optimize read queries later.
Final Summary:
• Actions vs Transformations: Transformations are lazy and define lineage; actions trigger execution.
• Handling Nulls: Use .na.drop(), .na.fill(), isNull(), and when() to clean or impute data.
• Partitions: Units of parallelism in Spark; manage them with repartition, coalesce, and partitionBy for performance tuning.
16. Difference Between Narrow vs Wide Transformations
Transformations in Spark are classified based on data shuffling behavior.
• Definition: Narrow transformations do not shuffle data between partitions; wide transformations shuffle data across partitions/nodes.
• Dependency: In a narrow transformation, each child partition depends on a single parent partition; in a wide transformation, each child partition depends on multiple parents.
• Execution: Narrow transformations are fast and easily optimized; wide transformations are slower due to network I/O and shuffling.
• Examples: Narrow: map(), filter(), union(), sample(); Wide: groupByKey(), reduceByKey(), join(), distinct()
Visual Difference:
• Narrow: One-to-one partition mapping.
• Wide: Many-to-many or many-to-one (requires shuffling stage).
Why it Matters:
• Wide transformations trigger stage boundaries, which can affect performance.
• Optimizing joins, using reduceByKey() instead of groupByKey() can minimize wide
operations.
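A minimal RDD sketch of the reduceByKey() vs groupByKey() point (assuming a SparkContext sc):
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

# Wide, but combines values within each partition before the shuffle (less data moved)
pairs.reduceByKey(lambda x, y: x + y).collect()   # [('a', 2), ('b', 1)]

# Also wide, but ships every individual value across the network before aggregating
pairs.groupByKey().mapValues(sum).collect()       # [('a', 2), ('b', 1)]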
17. How Does PySpark Infer Schemas, and Why Does It
Matter?
Schema Inference:
When creating a DataFrame from a structured source (like CSV, JSON, RDD), PySpark tries
to infer the data types (schema) automatically unless explicitly provided.
Example:
df = spark.read.csv("data.csv", header=True, inferSchema=True)
• inferSchema=True tells Spark to scan the data and infer types like Integer, String,
Double, etc.
Why It Matters:
• Better Performance: A known schema helps Spark optimize execution plans.
• Data Safety: Prevents issues caused by incorrect type casting.
• Catalyst Optimizer: Needs the schema to perform projection pruning, predicate pushdown, etc.
Caution:
• Schema inference can be costly on large files.
• It’s better to define schema manually for large datasets using StructType.
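A minimal sketch of defining the schema manually with StructType instead of relying on inference:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

# No inference pass over the file; the types are exactly as declared
df = spark.read.csv("data.csv", header=True, schema=schema)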
18. Role of SparkContext in a PySpark App
SparkContext is the entry point to low-level Spark features and the core of any
Spark application.
Key Roles:
• Cluster Connection: Connects PySpark to a Spark cluster (local, YARN, etc.).
• RDD Creation: Enables creating RDDs from local or external sources.
• Resource Management: Manages cluster resources, executors, and job scheduling.
• Broadcast Variables / Accumulators: Provides access to cluster-level shared variables.
Usage:
from pyspark import SparkContext
sc = SparkContext(appName="myApp")
rdd = sc.parallelize([1, 2, 3])
In PySpark 2.x and above, SparkSession wraps SparkContext, so you typically access it
via:
sc = spark.sparkContext
19. Performing Aggregations Effectively in PySpark
PySpark supports efficient aggregations using its DataFrame API.
Common Aggregation Functions:
• count(), sum(), avg(), min(), max()
• Grouped: groupBy().agg()
Example 1: Basic Aggregation
df.groupBy("department").agg({"salary": "avg", "age": "max"}).show()
Example 2: Using agg() with Aliases
from pyspark.sql.functions import avg, sum
df.groupBy("dept").agg(
    avg("salary").alias("avg_salary"),
    sum("bonus").alias("total_bonus")
).show()
Tips for Efficient Aggregation:
• Use reduceByKey() instead of groupByKey(): minimizes shuffling in RDDs.
• Use approxQuantile() for percentiles: reduces execution time (see the sketch below).
• Avoid UDFs in aggregations: prevents Catalyst optimization blockages.
• Repartition by key before grouping (if data is skewed): improves load balancing.
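For example, approxQuantile() computes approximate percentiles without a full sort; a quick sketch assuming a numeric salary column:
# Median and 90th percentile of salary, with 1% relative error
df.approxQuantile("salary", [0.5, 0.9], 0.01)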
20. How to Cache Data for Performance?
Why Cache?
Caching stores intermediate results in memory or disk to avoid recomputation and
improve performance of iterative or repeated operations.
Methods:
• df.cache(): Caches the DataFrame in memory (lazily; materialized by the next action).
• df.persist(StorageLevel): Caches with a custom storage level (e.g., memory + disk).
• df.unpersist(): Releases the memory/disk space.
Example:
df.cache()
df.count() # Action triggers caching
Custom Storage Level:
from pyspark.storagelevel import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
When to Cache:
• When the same DataFrame is used multiple times.
• When computations are expensive or involve wide transformations.
• During machine learning pipelines, graph computations, or iterative
aggregations.
Notes:
• Don't overuse caching—it may cause OutOfMemory errors.
• Always unpersist() when cache is no longer needed.
Final Summary:
• Narrow vs Wide: Narrow = no shuffle (fast); Wide = shuffle across partitions (expensive).
• Schema Inference: Spark auto-detects types; better to define the schema manually for large/critical datasets.
• SparkContext: Entry point to Spark's core functions like RDD creation and broadcasting.
• Aggregations: Use groupBy().agg() with built-in functions; avoid skew and use aliases.
• Caching: Use .cache() or .persist() to speed up repeated computations; monitor memory.