Kafka
Apache Kafka is an open-source distributed event streaming platform originally
developed by LinkedIn and later open-sourced as part of the Apache Software
Foundation. It's designed to handle high-volume, real-time data feeds with low latency
and high throughput.
Why Kafka Was Built
Kafka was created to solve several key challenges in data processing:
1. Decoupling of data producers and consumers - Allows systems to communicate
without being directly connected
2. Real-time data processing - Enables immediate reaction to events as they occur
3. High throughput requirements - Originally built to handle LinkedIn's massive
activity stream (billions of messages/day)
4. Durability and fault-tolerance - Data is persisted and replicated to prevent loss
5. Horizontal scalability - Can scale to handle increasing loads by adding more
nodes
Core Components and Concepts
1. Brokers
• Kafka servers that store data and serve clients
• Form a cluster for scalability and fault tolerance
• Each broker is identified by an ID (integer)
2. Topics
• Categories or feed names to which messages are published
• Divided into partitions for parallel processing
• Messages within a partition are ordered
3. Partitions
• Ordered, immutable sequence of messages
• Each partition can be hosted on a different server
• Enables parallelism and scalability
4. Producers
• Applications that publish (write) messages to topics
• Can choose which partition to write to (round-robin, key-based, or custom)
5. Consumers
• Applications that subscribe to (read) messages
• Organized into consumer groups for parallel processing
6. ZooKeeper
• Manages and coordinates Kafka brokers (being replaced by KRaft in newer versions)
• Stores metadata about brokers, topics, and partitions (modern consumers store offsets in the internal __consumer_offsets topic rather than ZooKeeper)
7. Replicas
• Copies of partitions for fault tolerance
• Leader replica handles all read/write requests
• Follower replicas replicate the leader's data
8. Offsets
• Unique ID assigned to each message within a partition
• Consumers track their position via offsets
9. Log Compaction
• Retention policy that keeps the latest value for each key
• Useful for maintaining current state without full history
Essential Configuration Tweaks for Data Engineers
Broker Configurations
1. num.partitions
◦ Default number of partitions per topic
◦ Recommended: Start with 6-10 partitions per topic
num.partitions=6
2. log.retention.ms
◦ How long to keep messages (default is 7 days)
◦ Alternative: log.retention.hours or log.retention.minutes
log.retention.ms=604800000  # 7 days
3. log.retention.bytes
◦ Maximum size of a partition before deletion
log.retention.bytes=1073741824  # 1GB
4. default.replication.factor
◦ Default number of replicas for each partition
◦ Recommended: 3 for production
default.replication.factor=3
5. unclean.leader.election.enable
◦ Whether to allow out-of-sync replicas to become leaders
◦ Recommended: false for data consistency
unclean.leader.election.enable=false
Producer Configurations
1. acks
◦ Number of acknowledgments required for a successful write
◦ acks=0 (none), acks=1 (leader only), acks=all (all in-sync replicas)
acks=all
2. compression.type
◦ Message compression (none, gzip, snappy, lz4, zstd)
compression.type=snappy
3. retries
◦ Number of retries for failed sends
retries=5
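Putting the producer settings above together, a minimal sketch (assuming the kafka-python client, a broker at localhost:9092, and an illustrative topic name):

from kafka import KafkaProducer

# Assumes a broker at localhost:9092 and an existing "orders" topic (illustrative names)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",                  # wait for all in-sync replicas
    compression_type="snappy",   # requires the python-snappy package
    retries=5,
)

# Key-based partitioning: records with the same key land in the same partition
producer.send("orders", key=b"customer-42", value=b'{"amount": 19.99}')
producer.flush()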
Consumer Configurations
1. group.id
◦ Consumer group identifier
group.id=inventory-consumers
2. auto.offset.reset
◦ What to do when no offset is stored (earliest, latest, none)
auto.offset.reset=earliest
3. enable.auto.commit
◦ Whether to automatically commit offsets
enable.auto.commit=false  # Commit manually after processing for at-least-once delivery
Performance Tuning
1. message.max.bytes
◦ Maximum size of a single message
message.max.bytes=1000000  # 1MB
2. num.io.threads
◦ Number of threads handling disk I/O for requests
num.io.threads=8
3. num.network.threads
◦ Number of threads handling network requests
num.network.threads=3
Key Considerations for Data Engineers
1. Partitioning Strategy:
◦ Choose partition keys carefully to avoid skew
◦ More partitions = more parallelism but also more overhead
2. Replication Factor:
◦ Higher replication improves durability but requires more resources
◦ Minimum of 2 for staging, 3 for production
3. Monitoring:
◦ Track consumer lag (messages behind)
◦ Monitor disk usage and network throughput
4. Security:
◦ Enable SSL for encryption
◦ Configure SASL for authentication
◦ Set up ACLs for authorization
5. Upgrade Considerations:
◦ Plan for rolling upgrades
◦ Test compatibility between client and broker versions
Consumer Groups in Kafka
A consumer group is a set of consumers that work together to consume data from one or
more topics. The key characteristics are:
1. Parallel Processing: Consumers in the same group divide the partitions among
themselves
2. Message Distribution: Each partition is consumed by only one consumer in the
group
3. Offset Management: The group maintains its position (offsets) in the partitions
How Consumers Should Consume Data
Basic Consumption Rules
1. One Partition Per Consumer: Within a consumer group, each partition is assigned
to exactly one consumer
2. Scalable Consumption: You can add more consumers (up to the number of
partitions) to increase throughput
3. Rebalancing: When consumers join or leave, partitions are reassigned
automatically
Consumption Scenarios
1. Single Consumer in Group:
◦ Consumes from all partitions
◦ Example: 1 consumer reading from a topic with 4 partitions
2. Multiple Consumers (≤ Partitions):
◦ Each consumer gets exclusive access to one or more partitions
◦ Example: 4 consumers reading from a topic with 4 partitions (1:1 mapping)
3. More Consumers Than Partitions:
◦ Extra consumers remain idle
◦ Example: 5 consumers for a topic with 4 partitions → 1 idle consumer
Can One Consumer Consume From Multiple Groups or Partitions?
1. One Consumer → Multiple Groups
• Yes, at the process level: an application can run several consumer instances, each belonging to a different group
• This is uncommon, because each group receives its own copy of every message (duplicates per group)
2. One Consumer → Multiple Partitions
• Yes: A single consumer can be assigned multiple partitions from the same topic
• This happens automatically when consumers outnumber partitions
3. One Consumer → Multiple Topics
• Yes: A consumer can subscribe to multiple topics
• The partitions from all topics will be assigned according to group rules
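A minimal consumer-group sketch of the rules above (assuming the kafka-python client and a broker at localhost:9092; running a second copy of this script with the same group_id would split the topic's partitions between the two processes):

from kafka import KafkaConsumer

# Topic, group, and broker address are illustrative
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="inventory-consumers",   # members of this group share the partitions
    auto_offset_reset="earliest",
    enable_auto_commit=False,         # commit manually after processing
)

for message in consumer:
    print(message.partition, message.offset, message.value)
    consumer.commit()                 # advance this group's offset for the partition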
Pulsar
Apache Pulsar: Overview, Features, and Comparison with Kafka
1. What is Apache Pulsar?
Apache Pulsar is a distributed, cloud-native pub-sub messaging and streaming
platform designed for high performance, scalability, and durability. It combines the best
features of traditional messaging systems (like RabbitMQ) and streaming systems (like
Kafka) into a unified platform.
2. Fundamental Features of Apache Pulsar
Core Architecture
• Multi-layer Architecture:
◦ Brokers (stateless, handle message dispatch)
◦ BookKeeper (persistent storage layer for logs)
◦ ZooKeeper (coordination and metadata management)
• Unified Messaging Model:
◦ Supports pub-sub (topic-based), queuing (consumer groups), and streaming in a single system.
Messaging & Streaming Capabilities
• Persistent & Non-Persistent Topics:
◦ Messages can be stored durably (persistent) or kept in memory (non-
persistent).
• Multi-Tenancy:
◦ Supports multiple tenants with isolated namespaces and quotas.
• Geo-Replication:
◦ Built-in cross-datacenter replication for disaster recovery.
• Message Retention & Expiry:
◦ Configurable TTL (Time-to-Live) for messages.
◦ Supports long-term storage with tiered storage (offloading to cheaper
storage like S3).
Consumer Models
• Exclusive, Failover, Shared, Key-Shared subscription modes for different
consumption patterns.
• Consumer Acknowledgement (Ack):
◦ Supports individual, cumulative, and negative acknowledgments.
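A minimal sketch of a shared subscription using the pulsar-client Python library (broker URL, topic, and subscription names are illustrative):

import pulsar

# Assumes a broker at pulsar://localhost:6650; names are illustrative
client = pulsar.Client("pulsar://localhost:6650")
consumer = client.subscribe(
    "persistent://public/default/orders",
    subscription_name="order-workers",
    consumer_type=pulsar.ConsumerType.Shared,  # messages are load-balanced across consumers
)

msg = consumer.receive()
try:
    print(msg.data())
    consumer.acknowledge(msg)           # individual ack
except Exception:
    consumer.negative_acknowledge(msg)  # ask the broker to redeliver later
finally:
    client.close()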
3. Advanced Features of Apache Pulsar
1. Tiered Storage
• Automatically offloads older data to cloud storage (S3, GCS, HDFS) while keeping
recent data in BookKeeper.
2. Pulsar Functions (Serverless Computing)
• Lightweight compute functions for stream processing (similar to Kafka Streams but
simpler).
3. Pulsar SQL (via Presto)
• Query historical data in Pulsar using SQL.
4. Transactions (Exactly-Once Semantics)
• Supports end-to-end exactly-once processing (similar to Kafka).
5. Schema Registry
• Enforces schema compatibility (Avro, JSON, Protobuf) for structured data.
6. Multi-Protocol Support
• Supports Kafka-compatible API, MQTT, and AMQP for easier migration.
7. Kubernetes-Native
• Designed for cloud deployments with Helm charts and K8s operators.
4. Apache Pulsar vs. Kafka: Key Differences
Feature | Apache Pulsar | Apache Kafka
Architecture | Multi-layer (brokers + BookKeeper) | Monolithic (brokers store data)
Scalability | Scales brokers & storage independently | Scaling requires rebalancing partitions
Geo-Replication | Built-in multi-region replication | Requires MirrorMaker or Confluent tools
Message Retention | Tiered storage (S3 offloading) | Limited by local disk (KIP-405 tiered storage in Kafka 3.6+)
Consumer Models | Multiple subscription types (Exclusive, Shared, Key-Shared) | Only consumer groups
Processing | Pulsar Functions (serverless) | Kafka Streams (stateful processing)
Latency | Low-latency (microseconds) | Higher latency (milliseconds)
Multi-Tenancy | Strong isolation (tenants, namespaces) | Limited (depends on RBAC)
Kafka Compatibility | Yes (via Pulsar Proxy) | Native Kafka protocol
Ease of Management | Auto-balancing, no manual partitioning | Manual partition management
5. When to Use Pulsar vs. Kafka?
Choose Pulsar if:
✅ You need unified queuing + streaming in one system.
✅ You want built-in geo-replication & multi-tenancy.
✅ You need long-term storage with tiered offloading.
✅ You prefer cloud-native, Kubernetes-friendly deployments.
Choose Kafka if:
✅ You have an existing Kafka ecosystem (Confluent, Kafka Connect).
✅ You need mature stream processing (Kafka Streams, KSQL).
✅ You prefer simpler architecture (fewer moving parts).
6. Conclusion
• Pulsar is a modern alternative to Kafka with better scalability, multi-tenancy, and
cloud-native features.
• Kafka is more mature and widely adopted, with strong stream processing
capabilities.
• Pulsar is gaining traction in enterprises needing hybrid messaging + streaming
with cloud flexibility.
Spark
Spark's architecture is designed to be efficient, scalable, and fault-tolerant.
1. Driver Program
• The central point of a Spark application
• Contains the main() function and creates the SparkContext
• Converts user code into tasks
• Schedules tasks across executors
• Maintains metadata about all Spark components
• Collects results from workers
2. Cluster Manager
• Allocates resources for the application (Standalone, YARN, Kubernetes, or Mesos)
• The driver requests executors from it when the application starts
3. Worker Nodes
• Machines that run the application code in the cluster
• Host executor processes that perform computations
• Store data for the application
4. Executors
• Processes launched on worker nodes that run tasks
• Store computation results in memory or disk
• Communicate with the driver program
• Each application has its own executor processes
5. Tasks
• Smallest unit of work sent to an executor
• Corresponds to computation on a data partition
• Multiple tasks can run in parallel on an executor
Spark Execution Model
Directed Acyclic Graph (DAG) Scheduler
• Breaks down the logical execution plan (RDD lineage) into stages of tasks
• Optimizes the execution plan (pipelining transformations)
• Determines optimal locations to run tasks based on data locality
Key Architectural Features
1. Lazy Evaluation
• Transformations are only computed when an action is called
• Allows for optimization of the execution plan
2. In-Memory Computation
• Data is kept in memory between operations when possible
• Significantly faster than disk-based systems like MapReduce
3. Resilient Distributed Datasets (RDDs)
• Fundamental data structure of Spark
• Immutable, partitioned collections of records
• Can be reconstructed if partitions are lost (lineage)
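A small PySpark sketch of lazy evaluation (names are illustrative): the transformations only record lineage, and nothing executes until the count() action at the end.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# Transformations only build up lineage; no job runs yet
rdd = spark.sparkContext.parallelize(range(1_000_000))
squared = rdd.map(lambda x: x * x)
evens = squared.filter(lambda x: x % 2 == 0)

# The action triggers the DAG scheduler to build stages and run tasks
print(evens.count())

spark.stop()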
Performance Optimizations in Spark Architecture
1. Memory Management: Tunable memory fractions for execution and storage
2. Broadcast Variables: Efficiently send large read-only values to workers
3. Accumulators: Variables that workers can only add to
4. Data Serialization: Options like Java serialization or Kryo
5. Garbage Collection Tuning: Critical for JVM-based operations
What is Garbage Collection?
Garbage Collection is the automatic memory management process in Java Virtual
Machine (JVM) that reclaims memory occupied by objects that are no longer in use.
Common GC Errors in Spark
1. Long GC Pauses: When GC takes too long (seconds), causing task timeouts
2. Frequent GC Cycles: Excessive GC activity slowing down processing
3. OutOfMemoryError (OOM) due to GC overhead: When >98% of CPU time is
spent on GC
Causes of GC Problems in Spark
• Memory pressure: Too many objects in heap
• Improper memory configuration: Incorrect executor/driver memory settings
• Memory leaks: Objects not being properly released
• Large data structures: Big collections that don't fit in memory
• Excessive caching: Caching more data than necessary
Symptoms of GC Issues
• Tasks failing with "ExecutorLostFailure"
• "java.lang.OutOfMemoryError: GC overhead limit exceeded"
• Slow performance with high GC time in Spark UI
• Tasks timing out due to long GC pauses
What is Heap Space?
The heap is the area of memory where Spark stores objects created during execution.
Common Heap Space Errors
1. java.lang.OutOfMemoryError: Java heap space
2. java.lang.OutOfMemoryError: PermGen space (in older JVMs)
3. Container killed by YARN for exceeding memory limits
Causes of Heap Space Errors
1. Insufficient memory allocation:
◦ Executor memory too small for the workload
◦ Driver memory too small for collected results
2. Data skew:
◦ Uneven distribution causing some tasks to process much more data
3. Improper operations:
◦ Collecting large datasets to the driver
◦ Caching unnecessary RDDs/DataFrames
◦ Broadcasting very large variables
4. Memory leaks:
◦ Accumulators not being cleared
◦ Static collections growing indefinitely
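A hedged configuration sketch for mitigating GC and heap pressure; the values are illustrative and should be sized to the actual cluster and workload:

from pyspark.sql import SparkSession

# Illustrative values only; tune to the workload
spark = (
    SparkSession.builder
    .appName("memory-tuning-sketch")
    .config("spark.executor.memory", "8g")           # JVM heap per executor
    .config("spark.executor.memoryOverhead", "2g")   # off-heap headroom (YARN/Kubernetes)
    .config("spark.memory.fraction", "0.6")          # share of heap for execution + storage
    .config("spark.driver.memory", "4g")             # protects against OOM on collect()
    .getOrCreate()
)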
Flink
Apache Flink: Overview, Features, and Comparison with Spark
1. What is Apache Flink?
Apache Flink is a distributed stream processing framework designed for stateful, low-
latency, and high-throughput data processing. Unlike batch-first systems (like Spark),
Flink treats streaming as the core abstraction, with batch as a special case.
2. Fundamental Features of Apache Flink
Core Architecture
• True Streaming Model: Processes data in real-time (unlike micro-batching in Spark
Streaming).
• Event Time Processing: Handles out-of-order events
using watermarks and event-time semantics.
• Stateful Computations: Maintains intermediate state (e.g., counters, windows)
efficiently.
• Fault Tolerance: Uses distributed snapshots (Chandy-Lamport algorithm) for
recovery.
APIs & Libraries
• DataStream API (for unbounded streams)
• DataSet API (for bounded data; deprecated in favor of running the DataStream and Table APIs in batch mode)
• Table API & SQL (unified relational queries on streams & batches)
• Stateful Functions (serverless event-driven processing)
Deployment Modes
• Standalone Cluster
• YARN / Kubernetes / Mesos
• Docker & Cloud (AWS, GCP, Azure)
3. Advanced Features of Apache Flink
1. Exactly-Once Processing
• Guarantees no duplicates even after failures (via checkpointing).
2. State Backends
• Supports different storage backends:
◦ MemoryStateBackend (fast, but not durable)
◦ FsStateBackend (file system, e.g., HDFS)
◦ RocksDBStateBackend (disk-based, scalable for large state)
3. Windowing & Time Semantics
• Event Time / Processing Time / Ingestion Time
• Tumbling, Sliding, Session, Global Windows
4. Connectors & Ecosystem
• Kafka, Kinesis, RabbitMQ (streaming sources/sinks)
• HDFS, S3, JDBC (batch integrations)
• Elasticsearch, Cassandra (output sinks)
5. Machine Learning & Graph Processing
• FlinkML (machine learning library, though less mature than Spark MLlib)
• Gelly (graph processing)
6. Savepoints
• Allows stateful upgrades & reprocessing without losing progress.
7. Adaptive Scaling
• Dynamic scaling of parallel operators (experimental).
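A minimal PyFlink sketch of checkpoint-based exactly-once processing (the interval and pipeline are illustrative, and end-to-end guarantees also depend on the source and sink):

from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(10_000)  # snapshot the pipeline state every 10 seconds
env.get_checkpoint_config().set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)

# Illustrative keyed, stateful pipeline: count events per key
env.from_collection([("user-1", 1), ("user-2", 1), ("user-1", 1)]) \
   .key_by(lambda record: record[0]) \
   .reduce(lambda a, b: (a[0], a[1] + b[1])) \
   .print()

env.execute("exactly-once-sketch")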
4. Apache Flink vs. Spark: Key Differences
Feature | Apache Flink | Apache Spark
Processing Model | True streaming (per-record) | Micro-batching (small RDD chunks)
Latency | Sub-millisecond | Milliseconds to seconds
State Management | First-class keyed state | Requires external storage (e.g., Redis)
Fault Tolerance | Lightweight snapshots | Lineage recomputation (costly)
Event Time Handling | Native support (watermarks) | Limited (needs Structured Streaming)
Batch Processing | Treats batch as a bounded stream | Optimized for batch-first (RDDs)
SQL Support | Unified SQL for streams & batch | Separate APIs (Spark SQL vs. Structured Streaming)
Machine Learning | FlinkML (less mature) | MLlib (more mature)
Ecosystem | Growing (Kafka, Kinesis, connectors) | Larger ecosystem (Spark SQL, GraphX, etc.)
Resource Mgmt. | Fine-grained (task slots) | Coarse-grained (executors)
5. When to Use Flink vs. Spark?
Choose Flink if:
✅ You need real-time streaming with low latency.
✅ You require stateful event-time processing.
✅ You want exactly-once guarantees without micro-batching.
✅ You prefer unified batch & stream APIs.
Choose Spark if:
✅ You primarily do batch processing & ETL.
✅ You need machine learning (MLlib) or graph processing (GraphX).
✅ You rely on Spark SQL or Databricks ecosystem.
✅ You prefer mature, widely-adopted big data tech.
6. Conclusion
• Flink excels in real-time, stateful stream processing with low latency.
• Spark dominates batch analytics, ML, and SQL with a richer ecosystem.
• Flink is the future for streaming, while Spark remains king for batch.
Feature Storage
Feature Storage in Data Science & Engineering
Feature storage is a centralized system designed to store, manage, and serve machine
learning (ML) features—reusable data attributes used for model training and inference. It
ensures consistency, reproducibility, and low-latency access to features across different
stages of the ML lifecycle.
1. Why Feature Storage?
ML models rely on features (e.g., user embeddings, transaction trends, NLP tokens), but
managing them at scale introduces challenges:
• Inconsistent features between training & inference → model drift
• Repetitive computation (same feature calculated multiple times)
• Slow feature retrieval for real-time inference
• Lack of versioning & governance
A feature store solves these by acting as a single source of truth for features.
2. Key Components of a Feature Store
Component | Description | Example
Offline Store | Stores historical features for batch training (slow, high-capacity) | HDFS, Snowflake
Online Store | Low-latency DB for real-time serving (fast, limited history) | Redis, DynamoDB
Feature Registry | Metadata & schema management (versioning, lineage) | Feast, Tecton
Transformation Engine | Computes features (batch/streaming) | Spark, Flink
Serving Layer | Retrieves features for training/inference | gRPC, REST API
3. Core Features of a Feature Store
A. Feature Reusability
• Avoid recomputing the same feature (e.g., "user_30d_avg_spend") across models.
B. Consistency Between Training & Serving
• Ensures the same transformation logic is applied during training and inference.
C. Point-in-Time Correctness (Time Travel)
• Retrieves features as they were at training time to avoid data leakage.
D. Low-Latency Serving
• Optimized for real-time models (e.g., fraud detection, recommendations).
E. Feature Versioning & Lineage
• Tracks changes to features (e.g., "v1: raw data → v2: normalized").
F. Backfill Support
• Recomputes historical features when logic changes.
4. Popular Feature Store Solutions
Tool | Backers | Key Strengths | Best For
Feast | Open-source (Gojek) | Lightweight, integrates with BigQuery | Startups, GCP users
Tecton | Commercial (Uber alumni) | End-to-end, real-time features | Enterprise ML teams
Hopsworks | Open-source | PySpark/Flink integration | On-prem & hybrid clouds
Vertex AI Feature Store | Google Cloud | Managed service | GCP ecosystems
SageMaker Feature Store | AWS | Integrated with SageMaker | AWS users
5. Feature Store vs. Traditional Data Lakes/Warehouses
Aspect | Feature Store | Data Lake/Warehouse
Purpose | Optimized for ML features | General analytics
Latency | µs-ms (online store) | Seconds to minutes
Consistency | Ensures train/serve parity | No built-in guarantees
Transforms | Embedded computation | External ETL needed
Versioning | Built-in (feature lineage) | Manual tracking
6. When Do You Need a Feature Store?
✅ Multiple ML models sharing features
✅ Real-time serving requirements
✅ Frequent feature backfills (e.g., logic changes)
✅ Large-scale feature management (100s of features)
🚫 Not needed for small projects or ad-hoc models.
7. Example Workflow
1. Compute features (e.g., "user_last_5_purchases") using Spark/Flink.
2. Store in offline (Parquet) and online (Redis) stores.
3. Training: Fetch historical features with point-in-time correctness.
4. Inference: Retrieve latest features in <10ms from the online store.
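A hedged sketch of steps 3 and 4 using Feast (assuming a configured feature repo in the working directory; the feature view and entity names are illustrative):

import pandas as pd
from feast import FeatureStore

# Assumes a Feast repo with a feature view "user_stats" containing
# "user_last_5_purchases" (illustrative names)
store = FeatureStore(repo_path=".")

# Training: point-in-time-correct join against the offline store
entity_df = pd.DataFrame({
    "user_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2024-01-15", "2024-01-20"]),
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_stats:user_last_5_purchases"],
).to_df()

# Inference: low-latency lookup from the online store
online_features = store.get_online_features(
    features=["user_stats:user_last_5_purchases"],
    entity_rows=[{"user_id": 1001}],
).to_dict()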
Conclusion
A feature store standardizes feature pipelines, eliminates training-serving skew, and
accelerates ML deployment. Leading tech companies (Uber, Airbnb, Netflix) rely on them
for production ML.
Kinesis vs MSK
Comparison: AWS Kinesis vs. AWS MSK (Managed Streaming for Kafka)
Both AWS Kinesis and AWS MSK are managed streaming services, but they cater to
different use cases. Below is a detailed comparison:
1. Overview
Feature | AWS Kinesis | AWS MSK (Managed Kafka)
Type | AWS-native streaming service | Fully managed Apache Kafka
Use Case | Real-time analytics, log processing, event sourcing | High-throughput, Kafka-compatible event streaming
Protocol | Kinesis Producer Library (KPL), Kinesis Client Library (KCL) | Kafka protocol (supports Kafka clients)
Scalability | Auto-scaling (Kinesis Data Streams On-Demand) | Manual scaling (brokers, storage)
Pricing Model | Pay-per-shard (Provisioned) or pay-per-throughput (On-Demand) | Pay per broker hour + storage
2. Key Features Comparison
A. Performance & Throughput
Metric | Kinesis | MSK
Max Throughput | ~1 MB/s per shard (Provisioned) or unlimited (On-Demand) | Depends on broker size (scales horizontally)
Latency | ~70-200 ms (with KPL) | ~10-50 ms (Kafka-native)
Partitioning | Fixed shards (resharding required for scaling) | Dynamic partitions (Kafka-native)
Ordering Guarantee | Per-shard ordering | Per-partition ordering
B. Durability & Retention
Feature | Kinesis | MSK
Data Retention | Up to 365 days (configurable) | Configurable (default: 7 days)
Replication | 3x replication (AWS-managed) | Configurable (Kafka replication factor)
Storage Backend | AWS-managed (no direct access) | EBS volumes (user-configurable)
C. Ecosystem & Integrations
Integration | Kinesis | MSK
AWS Services | Lambda, Firehose, EMR, Redshift | Same as Kafka (Lambda, Glue, EMR)
Third-Party Tools | Limited (KCL/KPL-based) | Full Kafka ecosystem (connectors, Schema Registry)
Monitoring | CloudWatch metrics | CloudWatch + Prometheus (for Kafka metrics)
D. Management & Operations
Aspect | Kinesis | MSK
Setup Complexity | Simple (serverless-like) | Requires Kafka knowledge
Scaling | Automatic (On-Demand) or manual (Provisioned) | Manual (add brokers/storage)
Maintenance | Fully managed (AWS handles scaling, patching) | Managed brokers, but user manages topics & clients
Security | IAM policies, KMS encryption | IAM + Kafka ACLs, TLS encryption
3. When to Use Which?
✅ Choose AWS Kinesis If:
• You need a simple, fully managed streaming service.
• You’re already using AWS-native services (Lambda, Firehose, Redshift).
• You want automatic scaling (with On-Demand mode).
• Your use case is real-time analytics, clickstreams, or logs.
✅ Choose AWS MSK If:
• You need Kafka compatibility (existing Kafka apps).
• You require higher throughput & lower latency.
• You rely on the Kafka ecosystem (Kafka Connect, Schema Registry).
• You need custom partitioning and consumer groups.
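For the Kinesis side, a minimal producer sketch with boto3 (region and stream name are illustrative, and the stream must already exist); MSK is used through standard Kafka clients, like the producer example in the Kafka section above:

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

kinesis.put_record(
    StreamName="clickstream",   # illustrative stream name
    Data=json.dumps({"user_id": 42, "page": "/checkout"}).encode("utf-8"),
    PartitionKey="42",          # same key -> same shard, preserving per-shard ordering
)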
4. Pricing Comparison
Service | Pricing Model | Example Cost (Monthly)
Kinesis (Provisioned) | $0.015/shard-hour + $0.01/GB data | 10 shards ≈ $110 + data charges
Kinesis (On-Demand) | $0.04/GB processed | 1 TB data ≈ $40
MSK (Development) | $0.10/broker-hour + EBS storage | 3 brokers (t3.small) ≈ $216 + storage
MSK (Production) | $0.35/broker-hour (m5.large) | 6 brokers ≈ $1,512 + storage
💡 Kinesis is cheaper for low/mid-scale, while MSK is cost-effective for high-scale
Kafka workloads.
5. Summary: Kinesis vs. MSK
Decision Factor | Winner
Ease of Use | Kinesis
Throughput & Latency | MSK
Kafka Compatibility | MSK
AWS Native Integrations | Kinesis
Cost Efficiency (Small Scale) | Kinesis
Cost Efficiency (Large Scale) | MSK
Final Recommendation
• For AWS-centric, simple streaming → Kinesis
• For high-performance, Kafka-based systems → MSK
Parquet, Delta Table, Hudi or Iceberg
Comparison Analysis: Parquet vs. Delta Lake vs. Apache Hudi vs. Apache Iceberg
These storage and table formats are widely used in modern data lakes for structured and semi-structured data. Below is a detailed comparison:
1. Overview
Format | Parquet | Delta Lake | Apache Hudi | Apache Iceberg
Type | Columnar storage format | ACID transactional layer on Parquet | Transactional data lake framework | Table format for data lakes
Developed By | Apache | Databricks | Uber | Netflix
Primary Use Case | Batch analytics | ACID transactions, time travel | Incremental processing, upserts | Schema evolution, large-scale analytics
Storage Layer | File format | Built on Parquet | Built on Parquet | Built on Parquet/ORC
ACID Compliance | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes
2. Key Features Comparison
A. Transaction Support & Concurrency
Feature | Parquet | Delta Lake | Hudi | Iceberg
ACID Transactions | ❌ No | ✅ Optimistic concurrency | ✅ Optimistic concurrency | ✅ Optimistic concurrency
Time Travel | ❌ No | ✅ Yes (versioned data) | ✅ Yes (incremental pulls) | ✅ Yes (snapshots)
Schema Evolution | ❌ Manual | ✅ Automatic (backward compatible) | ✅ Supports schema changes | ✅ Advanced schema evolution
Conflict Resolution | ❌ Not applicable | ✅ Optimistic locking | ✅ Custom merge strategies | ✅ Snapshot isolation
B. Performance & Optimization
Feature | Parquet | Delta Lake | Hudi | Iceberg
Compaction | ❌ No | ✅ Auto-compaction | ✅ Auto-compaction | ✅ Optional compaction
Z-Ordering | ❌ No | ✅ Yes (Databricks) | ❌ No | ✅ Yes
Partition Evolution | ❌ No | ✅ Yes | ✅ Yes | ✅ Best-in-class (hidden partitioning)
Indexing | ❌ No | ✅ Databricks-specific | ✅ Built-in indexing | ✅ Metadata-based
C. Ecosystem & Integrations
Feature | Parquet | Delta Lake | Hudi | Iceberg
Spark Support | ✅ Native | ✅ Native (Databricks) | ✅ Good | ✅ Good
Flink Support | ✅ Yes | ❌ Limited | ✅ Yes | ✅ Yes
Presto/Trino | ✅ Yes | ✅ Limited | ✅ Yes | ✅ Best support
Hive | ✅ Yes | ❌ No | ✅ Yes | ✅ Yes
Cloud Support | ✅ All | ✅ Best on Databricks | ✅ AWS, GCP | ✅ AWS, GCP, Azure
3. Use Case Recommendations
✅ When to Use Parquet?
• Batch processing (no updates/deletes needed).
• Cost-efficient storage (columnar format).
• Compatibility with all query engines (Hive, Presto, Spark).
✅ When to Use Delta Lake?
• Databricks environment (tight integration).
• ACID transactions & time travel needed.
• Frequent upserts & merges (e.g., CDC).
✅ When to Use Apache Hudi?
• Incremental processing (near-real-time pipelines).
• Upserts & deletions at scale (e.g., IoT, CDC).
• Optimized for AWS (EMR, Glue).
✅ When to Use Apache Iceberg?
• Multi-engine access (Spark, Flink, Trino).
• Schema evolution & partition evolution needed.
• Large-scale analytics (Netflix-style workloads).
4. Performance Benchmarks (General Trends)
Metric | Parquet | Delta Lake | Hudi | Iceberg
Write Speed | Fast | Moderate (transaction overhead) | Fast (indexed upserts) | Moderate
Read Speed | Very fast | Fast | Fast | Very fast
Update Efficiency | ❌ Rewrites entire file | ✅ Optimized (copy-on-write) | ✅ Optimized (copy-on-write or merge-on-read) | ✅ Optimized
Metadata Scalability | ❌ Manual | ✅ Good | ✅ Good | ✅ Best
5. Vendor & Community Support
Format | Primary Backer | Cloud Support | Adoption
Parquet | Apache | All | Universal
Delta Lake | Databricks | Best on Databricks | High (Databricks users)
Hudi | Uber | AWS (EMR), GCP | Moderate
Iceberg | Netflix | AWS, GCP, Azure | Growing fast
6. Summary: Which One to Choose?
Decision Factor | Best Choice
Basic batch storage | Parquet
ACID transactions (Databricks) | Delta Lake
Upserts & incremental processing | Hudi
Multi-engine, large-scale analytics | Iceberg
Final Recommendation
• For simple analytics → Parquet
• For Databricks users → Delta Lake
• For real-time upserts (AWS) → Hudi
• For open, multi-engine lakes → Iceberg
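A small PySpark sketch contrasting a plain Parquet write with a Delta write (assuming the delta-spark package is on the classpath; the package version and bucket paths are illustrative):

from pyspark.sql import SparkSession

# Assumes the job was started with e.g. --packages io.delta:delta-spark_2.12:3.1.0
spark = (
    SparkSession.builder
    .appName("format-comparison-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "2024-01-01"), (2, "2024-01-02")], ["id", "dt"])

# Plain Parquet: fast columnar storage, but append-only with no transaction log
df.write.mode("overwrite").parquet("s3://my-bucket/events_parquet/")

# Delta: the same Parquet files underneath, plus an ACID transaction log
df.write.format("delta").mode("overwrite").save("s3://my-bucket/events_delta/")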
EMR
AWS EMR is a managed big data platform that simplifies running distributed processing
frameworks like Apache Spark, Hadoop, Hive, HBase, Flink, and Presto on AWS
infrastructure. It provides a fully managed environment to process vast amounts of data
across dynamically scalable EC2 instances.
Why EMR Was Built
AWS created EMR to address several big data challenges:
1. Infrastructure Complexity: Eliminates the need to manually configure and
maintain Hadoop/Spark clusters
2. Scalability: Allows processing of data at any scale (from GBs to PBs)
3. Cost Optimization: Provides transient clusters that can be terminated after job
completion
4. Integration: Seamlessly works with other AWS services (S3, Glue, Redshift, etc.)
5. Framework Flexibility: Supports multiple processing frameworks in one platform
Key Considerations for Data Engineers
1. Cluster Creation
a. Cluster Configuration:
• Node Types:
◦ Master: Manages the cluster (1 node)
◦ Core: Runs tasks and stores data (1+ nodes)
◦ Task: Pure compute nodes (optional, 0+ nodes)
• Instance Selection:
◦ Memory-optimized for memory-intensive workloads (Spark)
◦ Compute-optimized for CPU-intensive jobs (HBase)
◦ Storage-optimized when using HDFS
b. Software Stack:
• Choose appropriate EMR release version (match with framework versions)
• Select applications (Spark, Hadoop, Hive, etc.)
• Configure bootstrap actions for custom setups
Example CLI Creation:
aws emr create-cluster \
--name "Spark Cluster" \
--release-label emr-6.7.0 \
--applications Name=Spark \
--ec2-attributes KeyName=my-key \
--instance-type m5.xlarge \
--instance-count 3 \
--use-default-roles
2. Job Execution
a. Job Submission Methods:
• Direct API calls (for programmatic control)
• CLI submission (for manual runs)
• Step functions (for workflow orchestration)
• Airflow EMR operators (for pipeline integration)
b. Performance Optimization:
• Partitioning: Ensure input data is properly partitioned
• Memory Settings: Configure executor/driver memory in Spark
• Data Locality: Use S3 select or EMRFS for S3 integration
• Spot Instances: For task nodes to reduce costs
Example Spark Job Submission:
aws emr add-steps \
--cluster-id j-3KXXXXXX9IOK \
--steps Type=Spark,Name="Spark Job",\
ActionOnFailure=CONTINUE,\
Args=[--class,org.apache.spark.examples.SparkPi,\
/user/hadoop/share/lib/spark-examples.jar,100]
3. Cluster Maintenance
a. Monitoring:
• Use CloudWatch metrics for cluster health
• Monitor HDFS/disk utilization (if using HDFS)
• Track YARN resource allocation
b. Logging:
• Enable logging to S3 for debugging
• Configure log encryption and retention
• Access logs via EMR console or S3 directly
c. Scaling:
• Use managed scaling for automatic resizing
• Configure scaling policies based on metrics
• Consider task spot fleets for cost savings
4. Cost Management
a. Instance Selection:
• Use spot instances for task nodes (up to 70-90% savings)
• Reserve instances for long-running clusters
• Right-size instances based on workload
b. Cluster Lifecycle:
• Use transient clusters for batch jobs
• Implement auto-termination after job completion
• Consider EMR Serverless for sporadic workloads
c. Storage Options:
• Prefer S3 over HDFS for persistent storage
• Use EBS volumes for temporary HDFS storage
• Implement lifecycle policies for S3 data
Best Practices for Data Engineers
1. Infrastructure as Code:
◦ Use CloudFormation/Terraform for reproducible clusters
◦ Version control bootstrap scripts and configurations
2. Security:
◦ Enable encryption at rest and in transit
◦ Use IAM roles for fine-grained permissions
◦ Implement VPC endpoints for S3 access
3. Data Processing:
◦ Leverage EMRFS for consistent S3 access
◦ Use Spark's dynamic allocation for resource efficiency
◦ Implement checkpointing for long-running jobs
4. Job Design:
◦ Break large jobs into smaller steps
◦ Implement retry logic for transient failures
◦ Separate transformation and loading steps
5. Integration Patterns:
from airflow.providers.amazon.aws.operators.emr import EmrCreateJobFlowOperator
create_emr_cluster = EmrCreateJobFlowOperator(
task_id='create_emr_cluster',
job_flow_overrides={
'Name': 'Analytics Cluster',
'ReleaseLabel': 'emr-6.7.0',
'Applications': [{'Name': 'Spark'}],
'Instances': {
'InstanceGroups': [
{
'Name': 'Master nodes',
'Market': 'ON_DEMAND',
'InstanceRole': 'MASTER',
'InstanceType': 'm5.xlarge',
'InstanceCount': 1,
},
{
'Name': 'Core nodes',
'Market': 'SPOT',
'InstanceRole': 'CORE',
'InstanceType': 'm5.xlarge',
'InstanceCount': 2,
}
],
'KeepJobFlowAliveWhenNoSteps': False,
'TerminationProtected': False,
},
'VisibleToAllUsers': True,
},
aws_conn_id='aws_default',
emr_conn_id='emr_default',
)
Common Pitfalls to Avoid
1. Over-provisioning clusters: Start small and scale as needed
2. Ignoring spot instances: Miss significant cost savings opportunities
3. Poor data partitioning: Causes skewed workloads and slow jobs
4. Hardcoding configurations: Makes environments inflexible
5. Not setting auto-termination: Leads to runaway costs
Modern EMR Features to Leverage
1. EMR Serverless: For sporadic workloads without cluster management
2. EMR on EKS: Run EMR jobs on Kubernetes
3. Runtime roles: Fine-grained permissions per job
4. Managed scaling: Automatic cluster resizing
5. ACID transactions: With EMR 6+ and Spark 3+
Glue
AWS Glue is a fully managed extract, transform, and load (ETL) service that automates
the time-consuming steps of data preparation. It consists of several components:
1. Data Catalog: Central metadata repository (Hive-compatible)
2. ETL Engine: Serverless Spark environment for transformations
3. Scheduler: Job orchestration with dependency resolution
4. DataBrew: Visual data preparation tool (no-code)
5. Studio: Development interface for authoring ETL jobs
Why AWS Glue Was Built
AWS created Glue to address key data integration challenges:
1. Infrastructure Overhead: Eliminates need to manage ETL servers/clusters
2. Metadata Management: Solves the "data discovery" problem
3. Schema Evolution: Handles changing data structures automatically
4. Serverless Execution: Scales automatically to handle job demands
5. AWS Integration: Native connectivity with S3, RDS, Redshift, etc.
Key Considerations for Data Engineers
1. Job Creation
a. Job Type Selection:
• Spark Script: Full control (Python/Scala)
• Visual Editor: Drag-and-drop (limited to common transforms)
• Python Shell: Lightweight jobs (non-Spark)
b. Configuration Essentials:
# Sample Glue Python job parameters
args = {
'--job-language': 'python',
'--TempDir': 's3://bucket/temp/',
'--job-bookmark-option': 'job-bookmark-enable',
'--enable-metrics': '',
'--enable-continuous-cloudwatch-log': 'true',
'--enable-spark-ui': 'true',
'--spark-event-logs-path': 's3://bucket/spark-logs/'
}
c. Data Source/Target Setup:
• Use Glue Crawlers judiciously (can be expensive for frequent runs)
• Prefer direct source specifications in code over crawlers for known schemas
• Configure connection types properly (JDBC, S3, etc.)
2. Performance Optimization
a. Worker Configuration:
• Standard Workers (G.1X/G.2X): Most common workloads
• G.4X/G.8X: Memory-intensive jobs
• G.025X: Low-volume streaming jobs
b. Partitioning Strategies:
# Repartition before write
datasink = glueContext.write_dynamic_frame.from_options(
frame=dynamic_frame,
connection_type="s3",
connection_options={
"path": output_path,
"partitionKeys": ["year", "month", "day"]
},
format="parquet"
)
c. Job Bookmarks:
• Enable for incremental processing
• Understand bookmark limitations (certain sources/transforms break bookmarks)
• Monitor bookmark state via CloudWatch
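A hedged sketch of the pieces a bookmark needs: the job must call job.init()/job.commit(), and each source needs a transformation_ctx (catalog names here are illustrative):

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)   # bookmark state is tracked per transformation_ctx

source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",           # illustrative catalog names
    table_name="transactions",
    transformation_ctx="source",   # required for the bookmark to record progress
)

# ... transforms and writes ...

job.commit()   # persists the bookmark so the next run only reads new data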
3. Maintenance & Operations
a. Monitoring:
• Set up CloudWatch alerts for job failures
• Track metrics: DPU hours, execution time, memory usage
• Enable continuous logging for debugging
b. Error Handling:
• Implement try-catch blocks for robust jobs
• Configure job retries with exponential backoff
• Use dead-letter queues for error records
c. Cost Control:
• Right-size DPU allocations (start small, scale as needed)
• Use timeouts to prevent runaway jobs
• Clean up temporary files automatically
Best Practices
1. Development Workflow:
◦ Use Glue Studio for prototyping
◦ Develop locally with Glue Docker image
◦ Version control all scripts
2. Schema Management:
◦ Explicitly define schemas when possible
◦ Handle schema evolution gracefully
◦ Use Glue Schema Registry for streaming
3. Job Design:
# Good practice: Modular transforms
from awsglue.transforms import Filter, Map

def clean_data(glue_context, dyf):
# Data quality checks
dyf = Filter.apply(
frame=dyf,
f=lambda x: x["field"] is not None
)
# Transformations
dyf = Map.apply(
frame=dyf,
f=lambda x: {**x, "new_field": x["field"].lower()}
)
return dyf
4. Security:
◦ Encrypt data at rest and in transit
◦ Use IAM roles with least privilege
◦ Enable VPC endpoints for private access
Common Pitfalls to Avoid
1. Overusing Crawlers: Running unnecessary crawls that rebuild entire catalogs
2. DPU Overallocation: Using more workers than needed
3. Ignoring Job Bookmarks: Missing incremental load opportunities
4. Poor Error Handling: Jobs failing silently on bad records
5. Hardcoding Paths: Making jobs inflexible to environment changes
Advanced Features
1. Glue Elastic Views:
◦ Create materialized views across data stores
◦ Automatically handles change data capture
2. Glue DataBrew:
◦ Visual data preparation
◦ 250+ built-in transforms
3. Glue Streaming ETL:
◦ Near real-time processing
◦ Integrates with Kinesis, MSK
4. Glue Flex Jobs:
◦ For non-urgent workloads
◦ Runs on spare capacity (70% cost savings)
Integration Example with Other Services
# Sample job integrating S3, Glue Catalog, and Redshift
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glue_context = GlueContext(sc)
# Read from catalog
datasource = glue_context.create_dynamic_frame.from_catalog(
database="sales_db",
table_name="transactions",
transformation_ctx="datasource"
)
# Transform
cleaned_data = clean_data(glue_context, datasource)
# Write to Redshift
glue_context.write_dynamic_frame.from_jdbc_conf(
frame=cleaned_data,
catalog_connection="redshift-connection",
connection_options={
"dbtable": "processed_transactions",
"database": "analytics_db"
},
redshift_tmp_dir="s3://bucket/temp/"
)
Athena
AWS Athena is an interactive serverless query service that enables SQL analysis of data
directly in Amazon S3 using standard ANSI SQL. It's built on the following core principles:
1. Serverless Architecture: No infrastructure to manage
2. Pay-per-Query Pricing: Costs based on data scanned
3. Standard SQL: Presto/Trino SQL dialect support
4. Direct S3 Access: No data loading required
5. Federated Query Capabilities: Combine data across sources
How Athena is Built
High-Level Architecture
1. Foundation: Built on open-source Presto/Trino distributed SQL engine
2. Compute Layer: Transient clusters spin up per query
3. Metadata Layer: Uses AWS Glue Data Catalog (or legacy Hive metastore)
4. Storage Layer: Directly accesses S3 via optimized connectors
5. Optimization Layer: Includes cost-based optimizer (CBO) and partition pruning
Key Components
• Query Engine: Presto 0.217+ with AWS enhancements
• Metadata Store: Glue Data Catalog (database/table definitions)
• Security: IAM policies + S3 ACLs/bucket policies
• Result Cache: Reuses previous query results when possible
Query Execution: HLD vs LLD
High-Level Design (HLD) Flow
1. Query Submission: SQL via console/JDBC/ODBC/REST API
2. Query Planning:
◦ Syntax parsing → Logical plan → Distributed execution plan
◦ Cost-based optimization (join ordering, predicate pushdown)
3. Execution:
◦ Coordinator distributes work to worker nodes
◦ Workers read S3 data in parallel
4. Result Aggregation: Coordinator combines partial results
5. Output: Returns results to client (S3 for large results)
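Besides the console and JDBC/ODBC drivers, queries can be submitted through the API; a minimal boto3 sketch (database, query, and output bucket are illustrative):

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit the query; Athena writes results to the given S3 location
response = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM sales_data WHERE dt = '2023-01-01'",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Check status (a real job would poll until the state leaves QUEUED/RUNNING)
status = athena.get_query_execution(QueryExecutionId=query_id)
if status["QueryExecution"]["Status"]["State"] == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]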
Low-Level Design (LLD) Details
1. Metadata Resolution:
-- Table definition lookup in the Glue Catalog
DESCRIBE sales_data;
2. Partition Pruning:
-- Only scans partitions matching the filter
SELECT * FROM sales_data WHERE dt BETWEEN '2023-01-01' AND '2023-01-31';
3. File Format Handling:
◦ Columnar formats (Parquet/ORC) enable predicate pushdown
◦ Splittable formats allow parallel reads
4. Execution Stages:
◦ Scan → Filter → Aggregate → Join → Sort → Limit
◦ Each stage processed by worker nodes
5. Resource Allocation:
◦ Memory per node: ~10GB per worker
◦ Max workers: Scales automatically based on query complexity
Data Engineer Considerations
1. Performance Optimization
a. Partitioning Strategy:
-- Partitioned table creation example
CREATE TABLE web_logs (
request_time TIMESTAMP,
ip_address VARCHAR(16),
-- ...other columns
)
PARTITIONED BY (
year INT,
month INT,
day INT
)
STORED AS PARQUET
LOCATION 's3://bucket/logs/';
b. File Format Selection:
• Parquet/ORC: Best for analytics (columnar, compression)
• JSON/CSV: Human-readable but slower
• Avro: Row-oriented with schema evolution
c. Compression:
• Snappy (balance of speed/ratio)
• Gzip (better ratio, slower)
• Zstd (modern alternative)
2. Cost Control
a. Data Scanned Reduction:
-- Bad: Scans entire dataset
SELECT * FROM large_table;
-- Good: Uses partition pruning
SELECT user_id FROM large_table
WHERE account_status = 'active'
AND signup_date > CURRENT_DATE - INTERVAL '30' DAY;
b. Workgroups & Limits:
• Create separate workgroups for different teams
• Set per-query/data-scanned limits
• Enforce query result location
c. Caching:
• Enable result reuse for repeated queries
• Use parameterized queries when possible
3. Schema Design
a. Column Projection:
-- Specify only needed columns
SELECT user_id, email FROM users;
b. Type Handling:
• Prefer TIMESTAMP over VARCHAR for dates
• Use appropriate VARCHAR lengths
• Consider nested types (ARRAY, MAP, STRUCT) for complex data
c. Schema Evolution:
• Use Glue Schema Registry for streaming data
• Handle backward compatibility in ETL
4. Integration Patterns
a. Federated Queries:
-- Query across S3 and RDS
SELECT s.*, r.extra_data
FROM s3_archive.sales s
JOIN postgres.public.customers r ON s.cust_id = r.id;
b. ETL Pipeline:
# Sample Glue job writing Athena-optimized output
glue_context.write_dynamic_frame.from_options(
frame=dynamic_frame,
connection_type="s3",
connection_options={
"path": "s3://output-bucket/",
"partitionKeys": ["year", "month"],
"compression": "snappy"
},
format="parquet"
)
5. Monitoring & Tuning
a. Query Analysis:
• Use Athena query history/metrics
• Review EXPLAIN plans
• Monitor partition usage
b. Performance Tuning:
-- Check partition effectiveness by listing the table's partitions
SHOW PARTITIONS web_logs.access_logs;
c. Error Handling:
• Set up SNS alerts for query failures
• Implement retry logic for throttling
• Monitor S3 LIST operation costs
Advanced Features
CTAS (Create Table As Select):
CREATE TABLE optimized_sales
WITH (
format = 'PARQUET',
parquet_compression = 'SNAPPY',
partitioned_by = ARRAY['region']
) AS
SELECT * FROM raw_sales;
UNLOAD for Large Results:
UNLOAD (
SELECT * FROM large_results
)
TO 's3://results-bucket/path/'
WITH (
format = 'PARQUET'
);
Lambda UDFs:
CREATE FUNCTION parse_udf AS 'com.example.ParseUDF'
USING JAR 's3://bucket/code.jar';
ML Inference:
SELECT ml_function('arn:model', input_data) FROM source_table;
Common Pitfalls to Avoid
1. Full Table Scans: Always filter on partitioned columns
2. Small Files Problem: Combine small files (<128MB) into larger ones
3. Over-partitioning: Too many partitions degrade performance
4. Ignoring Statistics: ANALYZE tables after major changes
5. No Result Pagination: Fetching huge result sets directly
Delta Lake
Delta is an open-source storage layer that brings ACID transactions to data lakes. It's
built on top of existing file formats (like Parquet) and adds reliability and performance
optimizations through:
Core Characteristics:
1. ACID Transactions: Atomicity, Consistency, Isolation, Durability
2. Schema Enforcement: Prevents bad data from being written
3. Time Travel: Versioned data access
4. Upserts/Merges: CRUD operations (not just append)
5. Scalable Metadata: Handles petabyte-scale tables
File Structure:
s3://delta-table/
├── _delta_log/                    # Transaction log (JSON files)
│   ├── 00000000000000000000.json
│   └── 00000000000000000001.json
├── part-00000-xxxx.parquet        # Data files
└── part-00001-yyyy.parquet
How Delta Fundamentally Differs from Other Formats
Feature | Delta | Traditional Parquet/ORC | CSV/JSON
Transactions | Full ACID support | None (append-only) | None
Schema Evolution | Managed (enforcement) | Manual handling | No schema
Updates | Supports MERGE/DELETE | Rewrite entire dataset | Append-only
Time Travel | Built-in versioning | Requires manual snapshots | Impossible
Performance | Optimized metadata | File-level only | Slow scans
Consistency | Strong guarantees | Eventual consistency | No guarantees
Delta Lake
Delta Lake is the open-source framework that implements the Delta protocol. It
provides:
1. Reliable Data Lakes: ACID over cloud storage (S3, ADLS, GCS)
2. Spark Integration: Native support in PySpark/Spark SQL
3. Multi-engine Support: Works with Presto, Trino, Redshift, etc.
4. Open Format: No vendor lock-in (standard Parquet + transaction log)
Key Components:
1. Transaction Log: Single source of truth (ordered JSON records)
2. Optimistic Concurrency Control: Handles concurrent writes
3. Vacuum: Clean up old files (configurable retention)
4. Z-Ordering: Colocate related data for faster queries
Technical Differentiators
1. Transaction Log (The Secret Sauce)
• Records all changes as ordered JSON files
• Enables atomic commits through optimistic concurrency (each commit writes the next numbered log file)
• Example transaction entry (simplified):
{
  "commitInfo": { ... },
  "add": {
    "path": "part-00000-xxx.snappy.parquet",
    "size": 123456,
    "partitionValues": {"date": "2023-01-01"},
    "stats": "{ ... }"
  }
}
2. Schema Enforcement
# Fails if new_data doesn't match delta_table schema
delta_table.alias("target").merge(
new_data.alias("source"),
"target.id = source.id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
3. Time Travel
# Read previous version
df = spark.read.format("delta").option("versionAsOf", 12).load("/delta/table")
# Restore to version 10
delta_table.restoreToVersion(10)
Why Data Engineers Choose Delta Lake
1. Reliable Pipelines:
◦ No more "partial writes" during job failures
◦ Concurrent readers/writers won't corrupt data
2. Simplified Architecture:
# Before Delta (manual handling)
df.write.parquet("output/")  # Risky overwrites
# After Delta
df.write.format("delta").mode("overwrite").save("output/")  # Atomic
3. Performance Optimizations:
◦ File skipping using statistics
◦ Z-ordering for faster predicates
◦ Compaction of small files
4. Unified Batch/Streaming:
# Streaming upserts
(spark.readStream
    .format("delta")
    .load("source")
    .writeStream
    .option("checkpointLocation", "/delta/checkpoints/events")  # needed for recovery
    .foreachBatch(upsert_function)
    .start())
Implementation Example
from delta.tables import DeltaTable

# Create Delta table
df.write.format("delta").save("/delta/events")

# Upsert pattern
delta_table = DeltaTable.forPath(spark, "/delta/events")
(delta_table.alias("target")
.merge(source_df.alias("source"), "target.userId = source.userId")
.whenMatchedUpdate(set = {"lastSeen": "source.timestamp"})
.whenNotMatchedInsert(values = {
"userId": "source.userId",
"firstSeen": "source.timestamp",
"lastSeen": "source.timestamp"
})
.execute())
When Not to Use Delta
1. Extremely High-Frequency Writes: The transaction log can become a bottleneck
2. Tiny Datasets: Overhead exceeds benefits
3. Non-Spark Ecosystems: Limited support in some query engines
Airflow
Apache Airflow is an open-source platform to programmatically author, schedule, and
monitor workflows
Core Principles
1. Dynamic Pipeline Generation: Pipelines are defined as Python code (DAGs)
2. Extensibility: Custom operators, hooks, and executors can be created
3. Scalability: Modular architecture with various executors (Local, Celery, Kubernetes)
4. Declarative: Focus on "what" not "how" - define workflows declaratively
5. Reproducibility: Pipelines are version-controlled and repeatable
Key Components
1. Web Server
• Flask-based UI for pipeline monitoring and management
• Views for DAGs, tasks, execution history, and configuration
• Ability to trigger/stop DAGs manually
2. Scheduler
• Heart of Airflow that triggers workflow execution
• Monitors all tasks and DAGs
• Handles dependencies and scheduling logic
3. Executor
• Mechanism that runs tasks (different types available)
• Examples: LocalExecutor, CeleryExecutor, KubernetesExecutor
4. Metadata Database
• Stores state of all DAGs, tasks, variables, connections
• Typically PostgreSQL or MySQL
• Used by all components to coordinate
5. DAG (Directed Acyclic Graph)
• Collection of tasks with defined dependencies
• Represented as Python code in the dags/ folder
• Defines workflow structure and execution order
6. Operators
• Building blocks that define individual tasks
• Determine what actually gets executed
7. Hooks
• Interfaces to external platforms/databases
• Reusable connections to external systems
8. Pools
• Limit concurrency for specific tasks
• Useful for managing resource-intensive operations
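A minimal DAG sketch tying these components together (ids, schedule, and commands are illustrative):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def transform():
    print("transforming")

# DAG definition: the scheduler picks this up from the dags/ folder
with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract >> transform_task   # extract must finish before transform runs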
Common Operators for Data Engineering
1. Spark Operators
• SparkSubmitOperator: Submit Spark jobs to a cluster
submit_spark_job = SparkSubmitOperator(
    task_id='spark_submit_job',
    application='/path/to/your/spark/app.jar',
    conf={'spark.yarn.queue': 'production'},
    application_args=['arg1', 'arg2'],
    dag=dag
)
• SparkJDBCOperator: Execute Spark SQL to JDBC
2. Database Operators
Redshift:
• RedshiftSQLOperator: Execute SQL on Redshift
redshift_query = RedshiftSQLOperator(
    task_id='redshift_query',
    sql='SELECT * FROM important_table',
    redshift_conn_id='redshift_default',
    dag=dag
)
• S3ToRedshiftOperator: Load data from S3 to Redshift
• RedshiftToS3Operator: Unload data from Redshift to S3
MongoDB:
• MongoToS3Operator: Export data from MongoDB to S3
• S3ToMongoOperator: Import data from S3 to MongoDB
• MongoOperator: Execute MongoDB commands
mongo_op = MongoOperator(
    task_id='mongo_aggregate',
    mongo_conn_id='mongo_default',
    collection='analytics',
    query=[{'$group': {'_id': '$user', 'count': {'$sum': 1}}}],
    dag=dag
)
3. Cloud Operators
AWS:
• S3FileTransformOperator: Transform files in S3
• GlueJobOperator: Run AWS Glue jobs
• EMRCreateJobFlowOperator: Create EMR clusters
GCP:
• BigQueryOperator: Run BigQuery queries
• GCSDownloadOperator: Download from Google Cloud Storage
• DataflowOperator: Run Dataflow jobs
4. ETL Operators
• PythonOperator: Execute Python functions
• BashOperator: Execute bash commands
• DockerOperator: Run Docker containers
• KubernetesPodOperator: Run pods on Kubernetes
5. Transfer Operators
• SFTPOperator: Transfer files via SFTP
• FTPOperator: File transfer via FTP
• SQLToS3Operator: Export SQL results to S3
6. Specialized Operators
• HiveOperator: Execute Hive queries
• PrestoToMySqlOperator: Transfer from Presto to MySQL
• SlackAPIPostOperator: Send Slack notifications
S3
1. Standard
2. S3 Intelligent-Tiering
3. Standard Infrequent-Access
4. S3 One-Zone-Infrequent-Access
5. Glacier Instant-Retrieval
6. Glacier Flexible Retrieval
7. Glacier Deep Archive
8. S3 Outposts
AWS S3 offers several storage classes designed for different use cases based on access
frequency, durability, availability, and cost requirements. Here are all the current S3
storage classes with detailed explanations:
1. S3 Standard (General Purpose)
• Use Case: Frequently accessed data
• Availability: 99.99%
• Durability: 99.999999999% (11 9's)
• Minimum Storage Duration: None
• Retrieval Fee: None
• Features:
◦ Low latency and high throughput
◦ Ideal for content distribution, mobile/gaming applications, big data analytics
◦ Supports SSL for data in transit and encryption at rest
2. S3 Intelligent-Tiering
• Use Case: Data with unknown or changing access patterns
• Availability: 99.9% (all tiers)
• Durability: 99.999999999%
• Minimum Storage Duration: 30 days (for Frequent and Infrequent tiers)
• Retrieval Fee: None
• Features:
◦ Automatically moves objects between access tiers:
1. Frequent Access Tier (same pricing as S3 Standard)
2. Infrequent Access Tier (same pricing as S3 Standard-IA)
3. Archive Instant Access Tier (same pricing as S3 Glacier Instant Retrieval)
4. Optional Archive Access and Deep Archive Access Tiers (Glacier-level pricing for rarely accessed data)
◦ No retrieval fees or operational overhead
◦ Small monthly monitoring/auto-tiering fee ($0.0025 per 1,000 objects)
3. S3 Standard-Infrequent Access (S3 Standard-IA)
• Use Case: Less frequently accessed data requiring rapid retrieval
• Availability: 99.9%
• Durability: 99.999999999%
• Minimum Storage Duration: 30 days
• Retrieval Fee: $0.01/GB
• Features:
◦ Lower storage price than Standard but higher retrieval cost
◦ Ideal for backups, disaster recovery, long-term storage of important data
◦ Data stored across multiple AZs
4. S3 One Zone-Infrequent Access (S3 One Zone-IA)
• Use Case: Infrequently accessed, non-critical data
• Availability: 99.5%
• Durability: 99.999999999% (but only in one AZ)
• Minimum Storage Duration: 30 days
• Retrieval Fee: $0.01/GB
• Features:
◦ 20% cheaper than Standard-IA
◦ Data stored in a single AZ (risk of data loss if AZ is destroyed)
◦ Good for secondary backups or easily reproducible data
5. S3 Glacier Instant Retrieval
• Use Case: Archive data needing millisecond retrieval
• Availability: 99.9%
• Durability: 99.999999999%
• Minimum Storage Duration: 90 days
• Retrieval Fee: $0.03/GB (varies by region)
• Features:
◦ Lowest cost storage with milliseconds retrieval
◦ Ideal for medical images, news media assets, genomics data
◦ Cheaper than Standard-IA but with similar performance
6. S3 Glacier Flexible Retrieval (formerly S3 Glacier)
• Use Case: Long-term backups and archives
• Availability: 99.99% (after restore)
• Durability: 99.999999999%
• Minimum Storage Duration: 90 days
• Retrieval Options:
◦ Expedited: 1-5 minutes ($0.03/GB plus per-request fees)
◦ Standard: 3-5 hours ($0.01/GB)
◦ Bulk: 5-12 hours ($0.0025/GB)
• Features:
◦ Ideal for data accessed 1-2 times per year
◦ Great for compliance archives, digital preservation
◦ No real-time access (must restore first)
7. S3 Glacier Deep Archive
• Use Case: Rarely accessed long-term data
• Availability: 99.99% (after restore)
• Durability: 99.999999999%
• Minimum Storage Duration: 180 days
• Retrieval Options:
◦ Standard: 12 hours ($0.0025/GB)
◦ Bulk: 48 hours ($0.00099/GB)
• Features:
◦ Lowest cost storage in S3 (∼$0.00099/GB/month)
◦ Designed for data accessed less than once per year
◦ Perfect for regulatory/compliance archives
8. S3 Outposts
• Use Case: On-premises S3 storage
• Features:
◦ Brings S3 API to your data center
◦ Uses the same storage classes as standard S3
◦ Ideal for local data processing with AWS compatibility
Comparison Summary Table
Storage Class | Best For | Availability | Retrieval Time | Min Duration | Cost Structure
Standard | Frequent access | 99.99% | Immediate | None | Higher storage, no retrieval fee
Intelligent-Tiering | Unknown patterns | 99.9% | Immediate | 30 days | Variable, auto-optimized
Standard-IA | Infrequent access | 99.9% | Immediate | 30 days | Lower storage, $0.01/GB retrieval
One Zone-IA | Reproducible data | 99.5% | Immediate | 30 days | Cheaper than Standard-IA
Glacier Instant Retrieval | Archive with ms access | 99.9% | Milliseconds | 90 days | Very low storage, $0.03/GB retrieval
Glacier Flexible Retrieval | Long-term backup | 99.99% | Minutes to hours | 90 days | Very low storage, retrieval fees vary
Glacier Deep Archive | Compliance archives | 99.99% | 12-48 hours | 180 days | Lowest storage, $0.0025/GB retrieval
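A short boto3 sketch of how storage classes are used in practice, uploading straight into Standard-IA and adding a lifecycle rule that steps objects down the tiers (bucket, prefix, and day counts are illustrative):

import boto3

s3 = boto3.client("s3")

# Upload directly into an infrequent-access class
with open("backup.tar.gz", "rb") as body:
    s3.put_object(
        Bucket="my-archive-bucket",
        Key="backups/2024-01-15.tar.gz",
        Body=body,
        StorageClass="STANDARD_IA",
    )

# Lifecycle rule: Glacier Flexible Retrieval after 90 days, Deep Archive after 365,
# then expire after roughly 7 years
s3.put_bucket_lifecycle_configuration(
    Bucket="my-archive-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-backups",
            "Status": "Enabled",
            "Filter": {"Prefix": "backups/"},
            "Transitions": [
                {"Days": 90, "StorageClass": "GLACIER"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 2555},
        }]
    },
)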
Redshift
AWS Redshift is a fully managed, petabyte-scale cloud data warehouse service
designed for online analytical processing (OLAP) and business intelligence workloads.
It's built on a massively parallel processing (MPP) architecture that enables high-
performance analysis of large datasets using standard SQL.
Why Redshift is Important for Data Engineering
1. Performance at Scale:
◦ Columnar storage format (optimized for analytics)
◦ Parallel query execution across multiple nodes
◦ Ability to handle petabytes of data
2. Integration Ecosystem:
◦ Native connectivity with major BI tools (Tableau, Power BI, Looker)
◦ Seamless integration with AWS services (S3, Kinesis, Glue, Lambda)
◦ Supports JDBC/ODBC connections
3. Cost-Effective:
◦ Pay-as-you-go pricing model
◦ Concurrency scaling for handling peak loads
◦ Spectrum for querying data directly in S3
4. Modern Data Architecture:
◦ Enables data lakehouse patterns (combining data warehouse and data lake)
◦ Supports both structured and semi-structured data
◦ Facilitates ELT (Extract-Load-Transform) workflows
5. Security & Compliance:
◦ End-to-end encryption (at rest and in transit)
◦ Fine-grained access control
◦ SOC, HIPAA, PCI DSS compliance
Key Considerations When Creating a Redshift Cluster
1. Cluster Configuration
Node Types:
• RA3 nodes (recommended): Separate compute and storage (auto-scaling storage)
• DC2 nodes: Dense compute for performance-intensive workloads
• DS2 nodes: Dense storage for large datasets
Sizing:
• Start with at least 2 nodes for production (for fault tolerance)
• Use the Redshift sizing calculator for initial estimates
• Consider concurrency scaling for variable workloads
Example Creation Command (AWS CLI; clusters are provisioned through the API/console, not SQL; the identifiers, password, and region below are placeholders):
aws redshift create-cluster \
  --cluster-identifier my-cluster \
  --node-type ra3.xlplus \
  --number-of-nodes 4 \
  --master-username admin \
  --master-user-password 'ChangeMe123!' \
  --iam-roles 'arn:aws:iam::123456789012:role/RedshiftS3Access' \
  --encrypted
# Cross-region snapshot copy is enabled with a separate call:
aws redshift enable-snapshot-copy --cluster-identifier my-cluster --destination-region us-west-2
2. Data Modeling Best Practices
Distribution Styles:
• KEY: For large fact tables (distributes rows based on join columns)
• EVEN: When no clear distribution key exists
• ALL: For small dimension tables (copies to all nodes)
• AUTO: Let Redshift choose (the default when no distribution style is specified)
Sort Keys:
• Compound: Multiple columns (fixed order)
• Interleaved: Equal weight to multiple columns
• Single: One column
Example Table Design:
CREATE TABLE sales (
sale_id BIGINT,
sale_date TIMESTAMP,
product_id INTEGER,
customer_id INTEGER,
amount DECIMAL(18,2)
)
DISTKEY (customer_id)
SORTKEY (sale_date, product_id);
3. Data Loading Considerations
Bulk Loading:
• Use COPY command for efficient loading from S3
• Compress files (GZIP, LZO, or ZSTD) before loading
• Split large files into multiple smaller files (roughly 100 MB-1 GB each)
Example COPY Command:
COPY sales
FROM 's3://my-bucket/sales-data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Access'
FORMAT AS PARQUET
COMPUPDATE OFF;  -- GZIP omitted: Parquet files carry their own compression
Incremental Loading:
• Use staging tables with MERGE operations
• Consider time-based partitioning
• Track high-water marks for incremental loads
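A minimal sketch of this pattern, reusing the sales table from the earlier example with a hypothetical incremental S3 prefix (shown with psycopg2; any Postgres-compatible driver works):
import psycopg2

conn = psycopg2.connect(host="my-cluster.example.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="etl_user", password="***")
cur = conn.cursor()

# 1. High-water mark: the newest sale_date already in the warehouse.
#    In practice this value drives the upstream extract so only newer data lands in S3.
cur.execute("SELECT COALESCE(MAX(sale_date), '1900-01-01') FROM sales;")
high_water_mark = cur.fetchone()[0]
print("Loading data newer than", high_water_mark)

# 2. Stage the new extract.
cur.execute("CREATE TEMP TABLE sales_staging (LIKE sales);")
cur.execute("""
    COPY sales_staging
    FROM 's3://my-bucket/sales-data/incremental/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Access'
    FORMAT AS PARQUET;
""")

# 3. Merge: delete rows being replaced, then insert the staged rows.
cur.execute("DELETE FROM sales USING sales_staging "
            "WHERE sales.sale_id = sales_staging.sale_id;")
cur.execute("INSERT INTO sales SELECT * FROM sales_staging;")
conn.commit()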
4. Performance Optimization
Workload Management (WLM):
• Configure query queues based on priority
• Set memory allocation for different query types
• Implement short query acceleration
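With manual WLM, queues are defined as a JSON document on the cluster's parameter group. A hedged boto3 sketch (the parameter group name and queue settings are illustrative, not a recommendation):
import json
import boto3

redshift = boto3.client("redshift")

# Two queues: a narrow, memory-heavy queue for ETL and a wider queue for dashboards.
wlm_config = [
    {"query_group": ["etl"], "query_concurrency": 3, "memory_percent_to_use": 60},
    {"query_group": ["dashboards"], "query_concurrency": 10, "memory_percent_to_use": 40},
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="my-wlm-params",
    Parameters=[{
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps(wlm_config),
    }],
)
# The cluster must use this parameter group and be rebooted for the change to take effect.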
Vacuum & Analyze:
• Regularly VACUUM to reclaim space
• Run ANALYZE to update statistics
• Schedule maintenance windows
Materialized Views:
• Pre-compute complex aggregations
• Automatic refresh options available
• Especially useful for dashboard queries
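A rough maintenance sketch tying these together, assuming the sales table from earlier and a hypothetical view name (VACUUM must run outside a transaction, hence autocommit):
import psycopg2

conn = psycopg2.connect(host="my-cluster.example.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="etl_user", password="***")
conn.autocommit = True
cur = conn.cursor()

# Reclaim deleted space and re-sort rows, then refresh planner statistics.
cur.execute("VACUUM FULL sales;")
cur.execute("ANALYZE sales;")

# One-time setup: pre-compute a dashboard aggregation as a materialized view.
cur.execute("""
    CREATE MATERIALIZED VIEW daily_sales_mv AS
    SELECT sale_date::date AS day, SUM(amount) AS revenue
    FROM sales
    GROUP BY 1;
""")

# Recurring: refresh after each load (or use AUTO REFRESH where supported).
cur.execute("REFRESH MATERIALIZED VIEW daily_sales_mv;")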
5. Monitoring & Maintenance
Key Metrics to Monitor:
• Cluster CPU utilization
• Query execution times
• Disk space usage
• Queue wait times
Maintenance Tasks:
• Set up automated snapshots
• Implement cross-region replication for DR
• Regularly resize clusters as needed
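These metrics are published to CloudWatch, so alerting and dashboards can be driven from there. A minimal boto3 sketch (cluster identifier is hypothetical):
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

# Average CPU utilization over the last hour, in 5-minute buckets.
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Redshift",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "ClusterIdentifier", "Value": "my-cluster"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1), "%")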
Handling Big Data in Redshift
1. Use Redshift Spectrum:
◦ Query data directly in S3 without loading
◦ Ideal for historical or rarely accessed data
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole';
2. Implement Data Partitioning:
◦ Partition large Spectrum tables by date ranges (see the sketch after this list)
◦ Use partition columns (e.g., year/month/day) for time-series data
3. Leverage Concurrency Scaling:
◦ Automatically adds cluster capacity during peak loads
◦ No need to over-provision for occasional spikes
4. Consider Data Lake Integration:
◦ Store raw data in S3 (data lake)
◦ Use Redshift for processed, analytical data
◦ Implement federated queries when needed
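To illustrate the partitioning idea from item 2 above, a hedged sketch of a partitioned Spectrum external table (schema, table, columns, and S3 paths are hypothetical); the statements are held in Python strings and can be run through any Redshift SQL client:
# External table over the data lake, partitioned by sale date.
create_external_table = """
    CREATE EXTERNAL TABLE spectrum_schema.sales_archive (
        sale_id BIGINT,
        product_id INTEGER,
        amount DECIMAL(18,2)
    )
    PARTITIONED BY (sale_date DATE)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/sales-archive/';
"""

# Register one partition; a daily job would add one of these per new S3 prefix.
add_partition = """
    ALTER TABLE spectrum_schema.sales_archive
    ADD PARTITION (sale_date = '2024-01-01')
    LOCATION 's3://my-bucket/sales-archive/sale_date=2024-01-01/';
"""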
Common Pitfalls to Avoid
1. Overusing DISTSTYLE ALL: Copying large tables to every node leads to storage bloat
2. Ignoring Sort Keys: Results in full table scans
3. Small WLM Queues: Causes query queueing delays
4. Not Compressing Data: Wastes storage and I/O bandwidth
5. Loading During Peak Hours: Impacts query performance
Cost Optimization Tips
1. Use RA3 nodes with managed storage
2. Pause clusters during non-business hours (dev environments)
3. Implement auto-vacuum to reduce storage costs
4. Monitor query performance to identify expensive operations
5. Use reserved instances for predictable workloads
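For point 2, pausing a dev cluster outside business hours is a couple of API calls, typically triggered from a scheduler (the cluster identifier is hypothetical):
import boto3

redshift = boto3.client("redshift")

# Evening: stop paying for compute (managed storage charges continue).
redshift.pause_cluster(ClusterIdentifier="dev-analytics-cluster")

# Morning: bring the cluster back.
redshift.resume_cluster(ClusterIdentifier="dev-analytics-cluster")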
Star Schema vs Snowflake Schema
Star Schema
Characteristics:
1. Central Fact Table surrounded by denormalized dimension tables
2. Denormalized Structure: Each dimension contains all its attributes without
normalization
3. Single Join Path: Dimensions connect directly to the fact table
4. Simpler Queries: Fewer joins needed for most queries
5. Better Performance: Generally faster query performance for analytical workloads
Example:
SALES_FACT (fact table)
|
|-- PRODUCT_DIM (all product attributes)
|-- CUSTOMER_DIM (all customer attributes)
|-- DATE_DIM (all date attributes)
|-- STORE_DIM (all store attributes)
Advantages:
• Query simplicity: Business users find it easier to understand
• Optimized for reads: Excellent for analytical queries
• Better performance: Fewer joins mean faster queries
• Kimball's preferred approach for most implementations
Disadvantages:
• Potential redundancy: Repeated attribute values in dimensions
• Less flexible to changes: May require full dimension reloads for some changes
Snowflake Schema
Characteristics:
1. Normalized Dimensions: Dimension tables are broken into multiple related tables
2. Hierarchical Structure: Some dimensions have "sub-dimensions"
3. Multiple Join Paths: Requires joining through multiple tables to get all attributes
4. More Complex: Resembles a snowflake due to its branching structure
Example:
SALES_FACT (fact table)
|
|-- PRODUCT_DIM
| |-- PRODUCT_CATEGORY_DIM
|-- CUSTOMER_DIM
| |-- CUSTOMER_REGION_DIM
|-- DATE_DIM
|-- STORE_DIM
|-- STORE_REGION_DIM
Advantages:
• Reduced storage: Eliminates some data redundancy
• Easier maintenance: Changes to lookup values affect fewer rows
• More flexible: Can accommodate more complex hierarchies
Disadvantages:
• More complex queries: Additional joins degrade performance
• Harder to understand: Less intuitive for business users
• Not Kimball's preferred approach: He favored star schemas for most scenarios
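To make the join difference concrete, here is an illustrative pair of queries against hypothetical tables matching the diagrams above; the snowflake version needs one extra join just to reach the region attribute:
# Star schema: region lives directly on the customer dimension.
star_query = """
    SELECT c.region, SUM(f.amount) AS revenue
    FROM sales_fact f
    JOIN customer_dim c ON f.customer_id = c.customer_id
    GROUP BY c.region;
"""

# Snowflake schema: region is normalized into its own table, adding a join.
snowflake_query = """
    SELECT r.region_name, SUM(f.amount) AS revenue
    FROM sales_fact f
    JOIN customer_dim c ON f.customer_id = c.customer_id
    JOIN customer_region_dim r ON c.region_id = r.region_id
    GROUP BY r.region_name;
"""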
Kimball's Perspective on the Choice
1. Star Schema is Preferred:
◦ Kimball strongly advocated for star schemas in most cases
◦ The performance benefits outweigh storage savings from normalization
◦ Better aligns with business users' mental models
2. Snowflake Considered When:
◦ Dimension tables become extremely large (millions of rows)
◦ There are clear, natural hierarchies in the data
◦ Specific attributes change frequently independently of others
3. Practical Compromise:
◦ Kimball suggested "snowflaking" only certain dimensions if needed
◦ Most dimensions should remain denormalized in star format
◦ Example: Might snowflake a geographic hierarchy (Country→State→City) but keep other dimensions flat
Key Differences Summary Table
Aspect             | Star Schema                  | Snowflake Schema
Normalization      | Denormalized                 | Partially normalized
Joins Required     | Fewer                        | More
Query Performance  | Faster                       | Slower
Storage            | More (due to redundancy)     | Less (reduced redundancy)
Complexity         | Simpler                      | More complex
Business Users     | Easier to understand         | Harder to understand
Kimball's View     | Preferred for most scenarios | Use sparingly when beneficial
Best For           | Most analytical use cases    | Complex hierarchies, very large dimensions
Kimball's Recommendation in Practice
Kimball would typically recommend:
1. Start with a star schema as the default approach
2. Only consider snowflaking when:
◦ A dimension has >1 million rows AND
◦ The dimension has natural hierarchies AND
◦ Query patterns frequently access different hierarchy levels separately
3. Even when snowflaking, keep the most frequently accessed attributes in the main
dimension table
Example where snowflaking might be justified:
CUSTOMER_DIM (main attributes)
|
|-- CUSTOMER_DEMOGRAPHICS (rarely used details)
|-- CUSTOMER_HIERARCHY (org structure)
In modern data warehouses with columnar storage and compression, the storage benefits of snowflaking are less significant, making star schemas even more attractive in most cases.
Scala
Fundamental Features (Basic Concepts)
1. Object-Oriented Features
• Classes and Objects: Everything is an object in Scala
• Traits: Similar to Java interfaces but can contain concrete methods
• Case Classes: Special classes optimized for pattern matching
• Singleton Objects: Using object keyword instead of class
• Companion Objects: Objects sharing the same name with a class
2. Functional Programming Basics
• First-Class Functions: Functions as values that can be passed around
• Immutability: val for immutable variables
• Higher-Order Functions: Functions that take/return other functions
• Anonymous Functions: Lambda expressions ((x: Int) => x * 2)
• Collections API: Rich set of immutable collections
3. Type System
• Type Inference: Compiler often infers types (val x = 5 infers Int)
• Variance Annotations: + (covariant), - (contravariant)
• Option Type: Alternative to null (Some(value)/None)
4. Basic Syntax Features
• Expression-Oriented: Most constructs return values
• String Interpolation: s"Hello $name"
• Multi-line Strings: Triple quotes """..."""
• Operator Overloading: Methods can be symbols (+, -, ::)
Intermediate Features
1. Advanced Functional Programming
• Pattern Matching: Powerful match expressions
• Partial Functions: Functions only defined for certain inputs
• Currying: Transforming multi-arg functions into chains
• Tail Recursion: @tailrec annotation for optimization
• For-Comprehensions: Syntactic sugar for map/flatMap/filter
2. Type System Enhancements
• Generic Classes: Type parameters class Box[T](x: T)
• Upper/Lower Type Bounds: <:, >: constraints
• Implicit Conversions: Automatic type conversions
• Type Classes: Using implicits for ad-hoc polymorphism
3. Advanced OOP Concepts
• Self Types: this: Type => syntax
• Abstract Types: Type members in traits/classes
• Compound Types: A with B with C
• Structural Types: { def close(): Unit } (duck typing)
4. Concurrency Features
• Futures and Promises: Asynchronous programming
• Akka Integration: Actor model implementation
• Parallel Collections: .par for parallel operations
5. DSL Capabilities
• Infix Notation: object method param syntax
• Implicit Classes: Extension methods pattern
• ByName Parameters: => Type for lazy evaluation
Python
Decorator with Parameters
Decorators are a powerful Python feature that allow you to modify or enhance functions
without changing their actual code. They essentially "wrap" other functions to extend
their behavior.
How Decorators Work
1. A decorator is a function that takes another function as input
2. Returns a modified or enhanced version of that function
3. Uses the @decorator_name syntax for easy application
import functools

def repeat(num_times):
    def decorator_repeat(func):
        @functools.wraps(func)  # preserve the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            for _ in range(num_times):
                result = func(*args, **kwargs)
            return result
        return wrapper
    return decorator_repeat

@repeat(num_times=3)
def greet(name):
    print(f"Hello {name}")

greet("World")  # prints "Hello World" three times
Generator Function
Generators are special functions that produce sequences of values lazily (one at a time)
using the yield keyword, making them memory efficient.
How Generators Work
1. Use yield instead of return to produce values
2. Maintain their state between calls
3. Generate values on-demand rather than all at once
4. Automatically implement the iterator protocol
def fibonacci(limit):
    a, b = 0, 1
    while a < limit:
        yield a
        a, b = b, a + b

# Usage:
for num in fibonacci(100):
    print(num)
Combined Example
from functools import reduce

# Process data using all four concepts
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# 1. Filter even numbers
evens = filter(lambda x: x % 2 == 0, data)

# 2. Square them using map
squared_evens = map(lambda x: x**2, evens)

# 3. Create a generator that yields incremental (running) sums
def sum_generator(numbers):
    total = 0
    for num in numbers:
        total += num
        yield total

# 4. Use reduce to keep only the last running total, i.e. the final sum
result = reduce(lambda _, latest: latest, sum_generator(squared_evens))
print(result)  # Output: 220 (4 + 16 + 36 + 64 + 100)
IAM
EC2
Github Actions
Data Quality
SQL OLAP/OLTP
—————————Bonus————————————
Iceberg
Flink
Kinesis
Terraform
Docker
Prometheus
Grafana