✅ Fact Table
A Fact Table stores measurable, quantitative data (facts) for analysis.
• Contains: numeric metrics like sales amount, revenue, quantity, profit, etc.
• Grain: finest level of detail (e.g., one row per sales transaction).
• Has: foreign keys to dimension tables.
• Examples of columns:
○ DateKey, ProductKey, CustomerKey (FKs)
○ SalesAmount, QuantitySold, Discount (facts)
✅ Dimension Table
A Dimension Table stores descriptive, categorical information (context) related to facts.
• Contains: textual attributes that describe the data.
• Used for: filtering, grouping, labeling facts in reports.
• Examples of columns:
○ For Product: ProductKey, ProductName, Category, Brand
○ For Customer: CustomerKey, CustomerName, Location, Gender
Example:
FactSales (Fact Table): DateKey, ProductKey, CustomerKey (FKs); SalesAmount, QuantitySold (facts)
DimProduct (Dimension Table): ProductKey, ProductName, Category, Brand
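To make the star-schema relationship concrete, here is a minimal PySpark sketch (the rows are invented for illustration; column names follow the example above) that joins the fact table to a dimension table and aggregates by a descriptive attribute:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("star_schema_demo").getOrCreate()

# Hypothetical fact rows: keys plus numeric measures
fact_sales = spark.createDataFrame(
    [(20240101, 1, 100.0, 2), (20240102, 2, 250.0, 1)],
    ["DateKey", "ProductKey", "SalesAmount", "QuantitySold"],
)

# Hypothetical dimension rows: descriptive attributes for each product
dim_product = spark.createDataFrame(
    [(1, "Laptop", "Electronics", "BrandA"), (2, "Chair", "Furniture", "BrandB")],
    ["ProductKey", "ProductName", "Category", "Brand"],
)

# Facts are filtered/grouped by dimension attributes in reports
sales_by_category = (
    fact_sales.join(dim_product, "ProductKey")
    .groupBy("Category")
    .agg(F.sum("SalesAmount").alias("TotalSales"))
)
sales_by_category.show()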
✅ 1. Z-Ordering in Delta Lake
Z-Ordering is a performance optimization technique in Delta Lake that improves query speed by reducing the number of files scanned. It works by co-locating related data on disk based on specific columns that are commonly used in filters (like region, hospital_id, or date).
For example, if users often filter data by region and visit_date, we apply Z-Ordering on those columns:
OPTIMIZE delta.`/mnt/data/visits` ZORDER BY (region, visit_date)
This helps Databricks read only the files that are relevant to the query, instead of scanning the whole table, which greatly improves
performance — especially for large datasets.
✅ 2. Time Travel in Delta Lake
Time Travel allows us to access previous versions of a Delta table using a timestamp or version number. Delta automatically keeps a
history of all changes, so we can restore deleted data, debug issues, or audit changes.
There are two ways to use it:
By version:
spark.read.format("delta").option("versionAsOf", 5).load("/mnt/data/patients")
By timestamp:
spark.read.format("delta").option("timestampAsOf", "2024-05-01").load("/mnt/data/patients")
"We once had a case where records were mistakenly updated. Using Delta’s Time Travel feature, I quickly restored the correct
version of the data from the previous day, without needing any backup."
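A minimal sketch of that kind of recovery, assuming the same placeholder path as above: read the previous day's version with Time Travel and overwrite the current table with it (recent Delta Lake / Databricks runtimes also offer a RESTORE TABLE command for the same purpose).
# Read yesterday's state of the table (timestamp is illustrative)
previous_df = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-05-01")
    .load("/mnt/data/patients")
)

# Write it back, replacing the mistakenly updated data
previous_df.write.format("delta").mode("overwrite").save("/mnt/data/patients")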
✅ 3. VACUUM in Delta Lake
Delta Lake keeps old data versions for Time Travel. But over time, this creates unused files that take up storage. VACUUM is used to
permanently delete these files and free up space.
VACUUM delta.`/mnt/data/transactions`
By default, VACUUM only removes files older than the 7-day retention period. If we want to delete more recent files, we must first disable the retention safety check:
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM delta.`/mnt/data/transactions` RETAIN 1 HOURS")
Interview tip:
"We scheduled VACUUM operations weekly in Databricks using ADF triggers. It helped us manage storage costs while still
keeping enough history for Time Travel."
✅ 4. Writing Data in Delta Format (PySpark)
To take full advantage of Delta features (ACID, schema enforcement, time travel), we write data in Delta format using PySpark.
Example:
df.write.format("delta").mode("overwrite").save("/mnt/data/employees")
• overwrite replaces existing data.
• append adds new records.
• delta is the format for transactional tables in Databricks.
Interview tip:
"We always write our curated data in Delta format. It helps us enforce schema, track changes, and allows easy rollbacks using Time
Travel if needed."
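As a small add-on sketch (same placeholder path; new_employees_df is a hypothetical DataFrame), appending records and opting in to schema evolution when the incoming data adds columns:
# Append new records to the existing Delta table
new_employees_df.write.format("delta").mode("append").save("/mnt/data/employees")

# If the incoming DataFrame has extra columns, allow the schema to evolve explicitly
(
    new_employees_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/mnt/data/employees")
)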
✅ 5. Reading Data in Delta Format (PySpark)
You can read Delta tables in two ways:
From a path:
df = spark.read.format("delta").load("/mnt/data/employees")
From a registered table:
df = spark.sql("SELECT * FROM employees")
Delta reads support schema enforcement, partition pruning, and Time Travel options — all of which make it ideal for production
workloads.
Interview tip:
"We read data in Delta format to ensure consistency and take advantage of features like schema enforcement and efficient
reads through partitioning and Z-Ordering."
Question:
How many executors will you assign for a 10 GB file in HDFS?
Answer:
To determine the number of executors, I start by estimating how the file will be partitioned. Since the default HDFS block size is 128 MB, a 10 GB file would be split into about 80 partitions (10 * 1024 / 128 = 80). Spark processes one partition per task, and I generally aim for 5 tasks per executor to balance performance and overhead. So I would assign around 16 executors (80 / 5) for efficient parallelism and throughput.
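The same arithmetic as a quick Python check (block size and tasks-per-executor are the assumptions stated above):
file_size_mb = 10 * 1024            # 10 GB file
block_size_mb = 128                 # default HDFS block size
tasks_per_executor = 5              # rule of thumb used above

partitions = file_size_mb // block_size_mb      # 10240 / 128 = 80 partitions
executors = partitions // tasks_per_executor    # 80 / 5 = 16 executors
print(partitions, executors)                    # 80 16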
Follow-up Question:
How many cores are needed for each executor, and what amount of memory is required for each executor?
Answer:
I typically assign 5 cores per executor, ensuring each core runs one task at a time. This keeps the CPU fully utilized without
overwhelming the executor. For memory, each task ideally needs 300–500 MB, so with 5 tasks, I allocate around 2.5 to 4 GB of
executor memory, adding overhead (~7–10%) for Spark internals. So, 5 cores and ~4 GB memory per executor is a balanced setup.
Additionally, in Azure Databricks, I consider the cluster size and node types. For example, with a Standard_D4s_v3 node (4 vCPUs, 16
GB RAM), I may opt for 2 executors per node with 2 cores and 6–7 GB RAM per executor, allowing room for driver and OS overhead.
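Outside Databricks, where the cluster UI usually controls sizing, the equivalent settings could be sketched in code like this (the values simply mirror the reasoning above and are illustrative, not a prescription):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("executor_sizing_demo")
    .config("spark.executor.instances", "16")         # ~80 partitions / 5 tasks per executor
    .config("spark.executor.cores", "5")              # one running task per core
    .config("spark.executor.memory", "4g")            # ~5 tasks x 500 MB plus headroom
    .config("spark.executor.memoryOverhead", "512m")  # extra room for Spark internals
    .getOrCreate()
)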
What is Apache Spark and how is it different from Hadoop MapReduce?
Apache Spark is a fast, in-memory data processing engine for big data. Unlike Hadoop MapReduce, which reads and writes data to disk between each operation, Spark keeps data in memory, making it much faster. It also provides easier APIs such as SQL and DataFrames, and supports complex workflows efficiently.
Explain the difference between a transformation and an action in Spark.
In Spark, transformations (for example map, filter) are lazy: they define a new RDD but don't execute until an action is called. Actions (for example collect, count) trigger execution and return results. This lazy evaluation allows Spark to optimize execution.
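A tiny sketch of that behaviour with made-up data: defining the filter does nothing by itself; the count() action triggers the actual computation.
numbers = spark.sparkContext.parallelize(range(1, 1001))

# Transformation: lazily defines a new RDD, nothing runs yet
evens = numbers.filter(lambda n: n % 2 == 0)

# Action: triggers execution and returns a result to the driver
print(evens.count())  # 500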
What is the difference between RDD, DataFrame, and Dataset in Spark?
A Resilient Distributed Dataset (RDD) is low level, type safe, and good for complex operations.
A DataFrame is optimized and easy to use, but not type safe.
A Dataset combines the RDD's type safety with the DataFrame's optimizations.
So we use DataFrames/Datasets for performance and RDDs when fine-grained control is needed.
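A short illustration of switching between the two APIs in PySpark (note that the typed Dataset API exists only in Scala and Java; in PySpark we work with RDDs and DataFrames):
# DataFrame: schema-aware and optimized by Catalyst
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.filter(df.id > 1).show()

# Drop down to the underlying RDD when fine-grained control is needed
rdd = df.rdd.map(lambda row: (row.id, row.name.upper()))
print(rdd.collect())  # [(1, 'ALICE'), (2, 'BOB')]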
How does Spark handle fault tolerance?
Spark uses RDD lineage (the DAG of operations) to track how each dataset was produced. If a partition is lost, Spark recomputes it from the source using the lineage, avoiding full job restarts.
What are narrow and wide transformations in Spark?
In a narrow transformation, data stays in the same partition (for example map(), filter()); processing happens within each partition and there is no shuffling of data across the cluster.
In a wide transformation, data moves across partitions (for example join(), groupBy()), causing a shuffle. Wide transformations are slower due to expensive shuffles.
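For example, with some made-up data, the filter() below is a narrow transformation while groupBy().count() forces a shuffle:
orders = spark.createDataFrame(
    [("US", 100), ("IN", 200), ("US", 150), ("UK", 50)],
    ["country", "amount"],
)

# Narrow: each partition is filtered independently, no data movement
us_orders = orders.filter(orders.country == "US")

# Wide: rows with the same key must be brought together, causing a shuffle
per_country = orders.groupBy("country").count()
per_country.show()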
How does Spark SQL optimize query execution?
Spark SQL uses the Catalyst optimizer to convert SQL/DataFrame queries into an optimized logical plan, and then the Tungsten engine compiles it into a physical plan with efficient code generation, memory management, and execution.
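We can see what Catalyst and Tungsten produce with explain(); for example, on the orders DataFrame from the sketch above:
# Prints the parsed, analyzed, optimized logical plans and the physical plan
orders.groupBy("country").sum("amount").explain(True)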
How can you improve the performance of a Spark job?
We can improve the performance of a Spark job with techniques like:
• Cache/persist frequently used data
• Use partitioning properly
• Avoid wide transformations when possible
• Broadcast small tables in joins (see the sketch below)
• Tune executor memory and cores
• Avoid shuffles and skewed data
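A minimal sketch of the broadcast-join item above, reusing the illustrative fact and dimension DataFrames from the star-schema sketch earlier:
from pyspark.sql.functions import broadcast

# Broadcasting the small dimension table avoids shuffling the large fact table
joined = fact_sales.join(broadcast(dim_product), "ProductKey")
joined.show()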
What is the use of the persist() and cache() methods in Spark?
Both store datasets in memory for reuse.
cache() uses the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames) and suits simple, iterative computations where the dataset fits in memory.
persist() lets us choose a custom storage level (for example disk-only for large datasets), which is useful when memory is constrained and we want to optimize for disk storage or serialization.
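A short sketch of both calls, using the placeholder visits path from earlier:
from pyspark import StorageLevel

visits_df = spark.read.format("delta").load("/mnt/data/visits")

# cache(): default storage level, suitable when the data fits in memory
visits_df.cache()
visits_df.count()          # first action materializes the cache

# persist(): pick an explicit storage level, e.g. disk-only for very large data
visits_df.unpersist()
visits_df.persist(StorageLevel.DISK_ONLY)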
What are shuffle operations in Spark and why are they expensive?
Shuffles move data across the network, involving disk I/O, serialization, and network transfer, for example in join() and groupBy(). They are costly and slow, so minimizing shuffles is key to good performance.
Explain how a Spark job is executed internally.
When a job is submitted, the transformations build an RDD DAG. The DAG is split into stages at shuffle boundaries, and each stage contains one task per partition. The cluster manager assigns tasks to executors, which run them and return results. This design enables fault tolerance, parallelism, and optimization.
Apache Spark Architecture
Apache Spark is a distributed computing framework designed for processing large datasets efficiently. Its architecture ensures high performance, scalability, and fault tolerance.
Core Components of Spark Architecture (Spark follows a master-slave architecture)
1. Driver:
○ The Driver is the master process that controls the execution of the application.
○ It splits the job into smaller tasks and distributes them across worker nodes.
○ It maintains metadata, tracks task execution, and collects results.
2. Executors:
○ Executors are worker processes running on each node in the cluster.
○ They execute tasks assigned by the Driver and store intermediate and final results.
○ Each Executor runs multiple tasks and has its own memory for computation and storage.
3. Cluster Manager:
○ The Cluster Manager is responsible for resource allocation and managing the nodes in the cluster.
○ Supported Cluster Managers:
▪ Standalone: Spark's built-in cluster manager.
▪ YARN: Hadoop's cluster manager.
▪ Mesos: General-purpose cluster manager.
▪ Kubernetes: Container-based orchestration.
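As a small illustration of how an application attaches to a cluster manager (the master URL below is a local-mode placeholder; on YARN or Kubernetes it is normally supplied by the submission environment):
from pyspark.sql import SparkSession

# Creating the SparkSession starts the driver; executors are requested from the cluster manager
spark = (
    SparkSession.builder
    .appName("architecture_demo")
    .master("local[*]")     # placeholder; e.g. "yarn" on a Hadoop cluster
    .getOrCreate()
)
print(spark.sparkContext.master)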