GC in Databricks
This guide provides an in-depth exploration of garbage collection (GC) in
Databricks, focusing on its role in managing memory for Apache Spark
applications running on the Java Virtual Machine (JVM). It covers the
fundamentals of GC, its mechanics, triggers, performance impacts, and
optimization strategies, with a particular emphasis on handling large
datasets.
1 Introduction to Garbage Collection
Key Takeaway
Young Generation: Comprises Eden and two Survivor Spaces (S0, S1). New
objects are allocated in Eden; those surviving Minor GC cycles move to Survivor
Spaces, and eventually to the Old Generation if they persist.
Old Generation: Houses long-lived objects, collected via Full GC, which is more
resource-intensive due to the larger memory scope.
Metaspace: Introduced in Java 8, this region stores class metadata and grows
dynamically, separate from the heap's GC cycles.
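The relative sizes of these regions can be adjusted with standard HotSpot flags. A sketch of commonly used options (the values shown are illustrative assumptions, not recommendations):

```
-Xmx16g                      # maximum heap size
-Xmn4g                       # Young Generation size
-XX:SurvivorRatio=8          # ratio of Eden to a single Survivor Space
-XX:MaxMetaspaceSize=512m    # cap Metaspace growth
```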
Parallel GC: Multi-threaded, throughput-focused collector. Best for high-throughput batch jobs; the drawback is longer pause times.
CMS GC: Concurrent collector that minimizes pauses. Best for low-latency interactive apps; the drawback is heap fragmentation.
G1 GC: Region-based collector that balances throughput and pauses. Suitable for the large heaps typical in Spark; the drawback is complex tuning.
ZGC: Low-pause collector designed for massive heaps. Delivers minimal pauses, but is experimental in Databricks.
GC Process
GC operates in two primary modes:
Minor GC: Targets the Young Generation, swiftly reclaiming short-lived objects.
It's triggered when Eden fills, typically completing in milliseconds.
Full GC: Encompasses the entire heap, including Old Generation and Metaspace
(if applicable). It's slower, often pausing the application for seconds, depending
on heap size.
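To observe these pauses in practice, GC logging can be enabled on executors. A sketch using standard HotSpot flags passed through Spark's executor JVM options (pre-Java 9 flag names shown; newer JVMs use -Xlog:gc* instead):

```
spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
```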
Execution Memory: Used for computation tasks like shuffles, joins, and
aggregations. It's dynamic and can borrow from Storage Memory if needed.
Storage Memory: Holds cached DataFrames and broadcast variables; unused
storage capacity can likewise be lent to Execution Memory.
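How much of the heap ends up in this shared pool follows Spark's unified memory formula. A minimal sketch using Spark's documented defaults (300 MB reserved, spark.memory.fraction=0.6, spark.memory.storageFraction=0.5) for a hypothetical 16 GB executor heap:

```python
# Sketch of Spark's unified memory sizing for a 16 GB executor heap.
heap = 16 * 1024**3           # executor heap in bytes (e.g. -Xmx16g)
reserved = 300 * 1024**2      # fixed reserved memory
memory_fraction = 0.6         # spark.memory.fraction default
storage_fraction = 0.5        # spark.memory.storageFraction default

unified = (heap - reserved) * memory_fraction   # shared execution + storage pool
storage = unified * storage_fraction            # soft boundary, borrowable
execution = unified - storage

print(round(unified / 1024**3, 2), "GB unified pool")
```

Raising spark.memory.fraction enlarges this pool at the expense of the user-memory region where task-local objects live, which can shift rather than remove GC pressure.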
Key Takeaway
Low Available Memory: When the heap nears capacity, GC reclaims space to
allow new allocations.
High Object Churn: Rapid object creation and disposal, common in row-by-row
processing or UDFs, triggers frequent Minor GC.
Full GC Events: Occur when Minor GC fails to free sufficient memory or the Old
Generation is saturated.
Databricks-Specific Triggers:
Shuffle Operations: Data redistribution creates temporary objects, filling
Eden quickly.
Memory Generations
The generational model applies: Young Generation handles transient
objects (e.g., shuffle buffers), while Old Generation stores persistent data
(e.g., cached tables). Efficient tuning ensures short-lived objects don't
unnecessarily reach the Old Generation.
GC Cycle
A GC cycle involves marking live objects, sweeping garbage, and
optionally compacting memory. In Databricks:
Minor GC: Frequent and fast, reclaiming short-lived Young Generation objects with sub-second pauses.
Full GC: Comprehensive, impacting all generations and potentially pausing tasks
for seconds or minutes.
The efficiency of garbage collection directly impacts the performance of your
Databricks jobs. Understanding the GC cycle helps you optimize memory
usage and minimize pauses.
6 Optimizing Garbage Collection in Databricks
JVM Tuning
Key JVM adjustments include:
Heap Size: -Xmx16g sets a 16GB heap (in Spark, the executor heap is normally
controlled via spark.executor.memory rather than -Xmx directly), reducing GC
frequency but increasing pause duration if too large.
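In Databricks, collector flags are typically supplied through the cluster's Spark config rather than on a command line. A sketch switching executors to G1 (the flag values are illustrative assumptions, not universal recommendations):

```
spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16m
```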
Spark Configuration
Spark settings can mitigate GC impact. Note that spark.memory.fraction is a
static setting: supply it at cluster or session creation (for example, in the
cluster's Spark config or via
SparkSession.builder.config("spark.memory.fraction", "0.8")); it cannot be
changed mid-session with spark.conf.set.
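A sketch of cluster-level Spark config entries that commonly reduce GC pressure (the values are illustrative assumptions for a specific workload, not defaults):

```
spark.memory.fraction 0.8
spark.memory.storageFraction 0.3
spark.sql.shuffle.partitions 400
spark.serializer org.apache.spark.serializer.KryoSerializer
```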
Additional Techniques
Advanced methods include Kryo serialization (spark.serializer) to reduce
per-object overhead, preferring built-in Spark SQL functions over row-level
Python UDFs, and serialized cache levels (e.g., MEMORY_ONLY_SER) to cut
object churn.
Key Takeaway
Unpersisting DataFrames should free memory, but JVM delays can occur.
Techniques to ensure release:
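A minimal PySpark sketch of releasing cached data promptly (requires a live SparkSession; `spark` and the cached DataFrame `df` are assumed to already exist):

```python
# Assumes an existing SparkSession `spark` and a cached DataFrame `df`.
df.unpersist(blocking=True)   # wait until all cached blocks are actually removed
spark.catalog.clearCache()    # drop every remaining cached table/DataFrame
```

Passing blocking=True avoids the lazy, asynchronous release that lets stale blocks linger in the Old Generation until the next Full GC.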
Real-time Applications
In streaming jobs, Full GC pauses (e.g., 10 seconds) can disrupt data flow,
risking latency spikes or data loss. Low-pause collectors like ZGC may
help, though less common in Databricks.
Multi-tenant Environments
Shared clusters face unpredictable GC due to varying workloads. A
memory-intensive job can trigger frequent GC, slowing others. Dedicated
clusters or resource isolation mitigate this.
Data Skew
Uneven data distribution can overload specific executors, increasing GC
frequency on those nodes while others remain idle, complicating
performance tuning.
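One common mitigation is key salting: a hot key is split into several synthetic sub-keys so its rows spread across executors instead of concentrating object churn on one JVM. A plain-Python sketch of the idea (NUM_SALTS and salt() are illustrative names, not a Spark API):

```python
NUM_SALTS = 8  # number of synthetic sub-keys per hot key

def salt(key: str, row_id: int) -> str:
    # Deterministically spread rows sharing one key across NUM_SALTS buckets.
    return f"{key}_{row_id % NUM_SALTS}"

# 100 rows that all share one skewed key now map to 8 distinct join keys.
salted = {salt("hot_customer", i) for i in range(100)}
```

On the other side of a salted join, the smaller table is duplicated once per salt value so every salted key still finds its match.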
Key Takeaways