✅ Fact Table
A Fact Table stores measurable, quantitative data (facts) for analysis.
• Contains: numeric metrics like sales amount, revenue, quantity, profit, etc.
• Grain: finest level of detail (e.g., one row per sales transaction).
• Has: foreign keys to dimension tables.
• Examples of columns:
○ DateKey, ProductKey, CustomerKey (FKs)
○ SalesAmount, QuantitySold, Discount (facts)
✅ Dimension Table
A Dimension Table stores descriptive, categorical information (context) related to facts.
• Contains: textual attributes that describe the data.
• Used for: filtering, grouping, labeling facts in reports.
• Examples of columns:
○ For Product: ProductKey, ProductName, Category, Brand
○ For Customer: CustomerKey, CustomerName, Location, Gender
Example:
FactSales (Fact Table): DateKey, ProductKey, CustomerKey (FKs); SalesAmount, QuantitySold (facts)
DimProduct (Dimension Table): ProductKey, ProductName, Category, Brand
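To make the star-schema relationship concrete, here is a minimal PySpark sketch (the rows are invented for illustration; column names follow the example above) that joins the fact table to a dimension table and aggregates by a descriptive attribute:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("star_schema_demo").getOrCreate()

# Hypothetical fact rows: keys plus numeric measures
fact_sales = spark.createDataFrame(
    [(20240101, 1, 100.0, 2), (20240102, 2, 250.0, 1)],
    ["DateKey", "ProductKey", "SalesAmount", "QuantitySold"],
)

# Hypothetical dimension rows: descriptive attributes for each product
dim_product = spark.createDataFrame(
    [(1, "Laptop", "Electronics", "BrandA"), (2, "Chair", "Furniture", "BrandB")],
    ["ProductKey", "ProductName", "Category", "Brand"],
)

# Facts are filtered/grouped by dimension attributes in reports
sales_by_category = (
    fact_sales.join(dim_product, "ProductKey")
    .groupBy("Category")
    .agg(F.sum("SalesAmount").alias("TotalSales"))
)
sales_by_category.show()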
✅ 1. Z-Ordering in Delta Lake
Z-Ordering is a performance optimization technique in Delta Lake that improves query speed by reducing the number of files scanned. It works by co-locating related data on disk based on specific columns that are commonly used in filters (like region, hospital_id, or date).
For example, if users often filter data by region and visit_date, we apply Z-Ordering on those columns:
OPTIMIZE delta.`/mnt/data/visits` ZORDER BY (region, visit_date)
This helps Databricks read only the files that are relevant to the query, instead of scanning the whole table, which greatly improves
performance — especially for large datasets.
✅ 2. Time Travel in Delta Lake
Time Travel allows us to access previous versions of a Delta table using a timestamp or version number. Delta automatically keeps a
history of all changes, so we can restore deleted data, debug issues, or audit changes.
There are two ways to use it:
By version:
spark.read.format("delta").option("versionAsOf", 5).load("/mnt/data/patients")
By timestamp:
spark.read.format("delta").option("timestampAsOf", "2024-05-01").load("/mnt/data/patients")
"We once had a case where records were mistakenly updated. Using Delta’s Time Travel feature, I quickly restored the correct
version of the data from the previous day, without needing any backup."
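A minimal sketch of that kind of recovery, assuming the same placeholder path as above: read the previous day's version with Time Travel and overwrite the current table with it (recent Delta Lake / Databricks runtimes also offer a RESTORE TABLE command for the same purpose).
# Read yesterday's state of the table (timestamp is illustrative)
previous_df = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-05-01")
    .load("/mnt/data/patients")
)

# Write it back, replacing the mistakenly updated data
previous_df.write.format("delta").mode("overwrite").save("/mnt/data/patients")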
✅ 3. VACUUM in Delta Lake
Delta Lake keeps old data versions for Time Travel. But over time, this creates unused files that take up storage. VACUUM is used to
permanently delete these files and free up space.
VACUUM delta.`/mnt/data/transactions`
By default, VACUUM only removes files older than the 7-day retention period. If we want to delete more recent files, we must first disable the retention safety check:
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM delta.`/mnt/data/transactions` RETAIN 1 HOURS")
Interview tip:
"We scheduled VACUUM operations weekly in Databricks using ADF triggers. It helped us manage storage costs while still
keeping enough history for Time Travel."
✅ 4. Writing Data in Delta Format (PySpark)
To take full advantage of Delta features (ACID, schema enforcement, time travel), we write data in Delta format using PySpark.
Example:
df.write.format("delta").mode("overwrite").save("/mnt/data/employees")
• overwrite replaces existing data.
• append adds new records.
• delta is the format for transactional tables in Databricks.
Interview tip:
"We always write our curated data in Delta format. It helps us enforce schema, track changes, and allows easy rollbacks using Time
Travel if needed."
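As a small add-on sketch (same placeholder path; new_employees_df is a hypothetical DataFrame), appending records and opting in to schema evolution when the incoming data adds columns:
# Append new records to the existing Delta table
new_employees_df.write.format("delta").mode("append").save("/mnt/data/employees")

# If the incoming DataFrame has extra columns, allow the schema to evolve explicitly
(
    new_employees_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/mnt/data/employees")
)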
✅ 5. Reading Data in Delta Format (PySpark)
You can read Delta tables in two ways:
From a path:
df = spark.read.format("delta").load("/mnt/data/employees")
From a registered table:
df = spark.sql("SELECT * FROM employees")
Delta reads support schema enforcement, partition pruning, and Time Travel options — all of which make it ideal for production
workloads.
Interview tip:
"We read data in Delta format to ensure consistency and take advantage of features like schema enforcement and efficient
reads through partitioning and Z-Ordering."
Question:
How many executors will you assign for a 10 GB file in HDFS?
Answer:
To determine the number of executors, I start by estimating how the file will be partitioned. Since the default HDFS block size is 128 MB, a 10 GB file would be split into about 80 partitions (10 * 1024 / 128 = 80). Spark processes one partition per task, and I generally aim for 5 tasks per executor to balance performance and overhead. So I would assign around 16 executors (80 / 5) for efficient parallelism and throughput.
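The same arithmetic as a quick Python check (block size and tasks-per-executor are the assumptions stated above):
file_size_mb = 10 * 1024            # 10 GB file
block_size_mb = 128                 # default HDFS block size
tasks_per_executor = 5              # rule of thumb used above

partitions = file_size_mb // block_size_mb      # 10240 / 128 = 80 partitions
executors = partitions // tasks_per_executor    # 80 / 5 = 16 executors
print(partitions, executors)                    # 80 16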
Follow-up Question:
How many cores are needed for each executor, and what amount of memory is required for each executor?
Answer:
I typically assign 5 cores per executor, ensuring each core runs one task at a time. This keeps the CPU fully utilized without
overwhelming the executor. For memory, each task ideally needs 300–500 MB, so with 5 tasks, I allocate around 2.5 to 4 GB of
executor memory, adding overhead (~7–10%) for Spark internals. So, 5 cores and ~4 GB memory per executor is a balanced setup.
Additionally, in Azure Databricks, I consider the cluster size and node types. For example, with a Standard_D4s_v3 node (4 vCPUs, 16
GB RAM), I may opt for 2 executors per node with 2 cores and 6–7 GB RAM per executor, allowing room for driver and OS overhead.
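Outside Databricks, where the cluster UI usually controls sizing, the equivalent settings could be sketched in code like this (the values simply mirror the reasoning above and are illustrative, not a prescription):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("executor_sizing_demo")
    .config("spark.executor.instances", "16")         # ~80 partitions / 5 tasks per executor
    .config("spark.executor.cores", "5")              # one running task per core
    .config("spark.executor.memory", "4g")            # ~5 tasks x 500 MB plus headroom
    .config("spark.executor.memoryOverhead", "512m")  # extra room for Spark internals
    .getOrCreate()
)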
What is Apache Spark and how is it different from Hadoop MapReduce?
Apache Spark is a fast, in-memory data processing engine for big data. Unlike Hadoop MapReduce, which reads and writes data to disk between each operation, Spark keeps data in memory, making it much faster. It also provides easier APIs such as SQL and DataFrames, and supports complex workflows efficiently.
Explain the difference between a transformation and an action in Spark.
In Spark, transformations (for example map, filter) are lazy: they define a new RDD but don't execute until an action is called. Actions (for example collect, count) trigger execution and return results. This lazy evaluation allows Spark to optimize execution.
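A tiny sketch of that behaviour with made-up data: defining the filter does nothing by itself; the count() action triggers the actual computation.
numbers = spark.sparkContext.parallelize(range(1, 1001))

# Transformation: lazily defines a new RDD, nothing runs yet
evens = numbers.filter(lambda n: n % 2 == 0)

# Action: triggers execution and returns a result to the driver
print(evens.count())  # 500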
What is the difference between RDD, DataFrame, and Dataset in Spark?
A Resilient Distributed Dataset (RDD) is low level, type safe, and good for complex operations.
A DataFrame is optimized and easy to use, but not type safe.
A Dataset combines the RDD's type safety with the DataFrame's optimizations.
So we use DataFrames/Datasets for performance and RDDs when fine-grained control is needed.
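A short illustration of switching between the two APIs in PySpark (note that the typed Dataset API exists only in Scala and Java; in PySpark we work with RDDs and DataFrames):
# DataFrame: schema-aware and optimized by Catalyst
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.filter(df.id > 1).show()

# Drop down to the underlying RDD when fine-grained control is needed
rdd = df.rdd.map(lambda row: (row.id, row.name.upper()))
print(rdd.collect())  # [(1, 'ALICE'), (2, 'BOB')]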
How does Spark handle fault tolerance?
Spark uses RDD lineage (the DAG of operations) to track how each dataset was produced. If a partition is lost, Spark recomputes it from the source using the lineage, avoiding full job restarts.
What are narrow and wide transformations in Spark?
In a narrow transformation, data stays in the same partition (for example map(), filter()); processing happens within each partition and there is no shuffling of data across the cluster.
In a wide transformation, data moves across partitions (for example join(), groupBy()), causing a shuffle. Wide transformations are slower due to expensive shuffles.
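For example, with some made-up data, the filter() below is a narrow transformation while groupBy().count() forces a shuffle:
orders = spark.createDataFrame(
    [("US", 100), ("IN", 200), ("US", 150), ("UK", 50)],
    ["country", "amount"],
)

# Narrow: each partition is filtered independently, no data movement
us_orders = orders.filter(orders.country == "US")

# Wide: rows with the same key must be brought together, causing a shuffle
per_country = orders.groupBy("country").count()
per_country.show()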
How does Spark SQL optimize query execution?
Spark SQL uses the Catalyst optimizer to convert SQL/DataFrame queries into an optimized logical plan, and then the Tungsten engine compiles it into a physical plan with efficient code generation, memory management, and execution.
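We can see what Catalyst and Tungsten produce with explain(); for example, on the orders DataFrame from the sketch above:
# Prints the parsed, analyzed, optimized logical plans and the physical plan
orders.groupBy("country").sum("amount").explain(True)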
How can you improve the performance of a Spark job?
We can improve the performance of a Spark job with techniques like:
• Cache/persist frequently used data
• Use partitioning properly
• Avoid wide transformations when possible
• Broadcast small tables in joins (see the sketch below)
• Tune executor memory and cores
• Avoid shuffles and skewed data
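A minimal sketch of the broadcast-join item above, reusing the illustrative fact and dimension DataFrames from the star-schema sketch earlier:
from pyspark.sql.functions import broadcast

# Broadcasting the small dimension table avoids shuffling the large fact table
joined = fact_sales.join(broadcast(dim_product), "ProductKey")
joined.show()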
What is the use of the persist() and cache() methods in Spark?
Both store datasets in memory for reuse.
cache() uses the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames) and suits simple, iterative computations where the dataset fits in memory.
persist() lets us choose a custom storage level (for example disk-only for large datasets), which is useful when memory is constrained and we want to optimize for disk storage or serialization.
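A short sketch of both calls, using the placeholder visits path from earlier:
from pyspark import StorageLevel

visits_df = spark.read.format("delta").load("/mnt/data/visits")

# cache(): default storage level, suitable when the data fits in memory
visits_df.cache()
visits_df.count()          # first action materializes the cache

# persist(): pick an explicit storage level, e.g. disk-only for very large data
visits_df.unpersist()
visits_df.persist(StorageLevel.DISK_ONLY)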
What are shuffle operations in Spark and why are they expensive?
Shuffles move data across the network, involving disk I/O, serialization, and network transfer, for example in join() and groupBy(). They are costly and slow, so minimizing shuffles is key to good performance.
Explain how a Spark job is executed internally.
When a job is submitted, the transformations build an RDD DAG. The DAG is split into stages at shuffle boundaries, and each stage contains one task per partition. The cluster manager assigns tasks to executors, which run them and return results. This design enables fault tolerance, parallelism, and optimization.
Apache Spark Architecture
Apache Spark is a distributed computing framework designed for processing large datasets efficiently. Its architecture ensures high performance, scalability, and fault tolerance.
Core Components of Spark Architecture (Spark follows a master-slave architecture)
1. Driver:
○ The Driver is the master process that controls the execution of the application.
○ It splits the job into smaller tasks and distributes them across worker nodes.
○ It maintains metadata, tracks task execution, and collects results.
2. Executors:
○ Executors are worker processes running on each node in the cluster.
○ They execute tasks assigned by the Driver and store intermediate and final results.
○ Each Executor runs multiple tasks and has its own memory for computation and storage.
3. Cluster Manager:
○ The Cluster Manager is responsible for resource allocation and managing the nodes in the cluster.
○ Supported Cluster Managers:
▪ Standalone: Spark's built-in cluster manager.
▪ YARN: Hadoop's cluster manager.
▪ Mesos: General-purpose cluster manager.
▪ Kubernetes: Container-based orchestration.
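As a small illustration of how an application attaches to a cluster manager (the master URL below is a local-mode placeholder; on YARN or Kubernetes it is normally supplied by the submission environment):
from pyspark.sql import SparkSession

# Creating the SparkSession starts the driver; executors are requested from the cluster manager
spark = (
    SparkSession.builder
    .appName("architecture_demo")
    .master("local[*]")     # placeholder; e.g. "yarn" on a Hadoop cluster
    .getOrCreate()
)
print(spark.sparkContext.master)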