Kafka
Apache Kafka is an open-source distributed event streaming platform originally
developed by LinkedIn and later open-sourced as part of the Apache Software
Foundation. It's designed to handle high-volume, real-time data feeds with low latency
and high throughput.
Why Kafka Was Built
Kafka was created to solve several key challenges in data processing:
1. Decoupling of data producers and consumers - Allows systems to communicate
without being directly connected
2. Real-time data processing - Enables immediate reaction to events as they occur
3. High throughput requirements - Originally built to handle LinkedIn's massive
activity stream (billions of messages/day)
4. Durability and fault-tolerance - Data is persisted and replicated to prevent loss
5. Horizontal scalability - Can scale to handle increasing loads by adding more
nodes
Core Components and Concepts
1. Brokers
• Kafka servers that store data and serve clients
• Form a cluster for scalability and fault tolerance
• Each broker is identified by an ID (integer)
2. Topics
• Categories or feed names to which messages are published
• Divided into partitions for parallel processing
• Messages within a partition are ordered
3. Partitions
• Ordered, immutable sequence of messages
• Each partition can be hosted on a different server
• Enables parallelism and scalability
4. Producers
• Applications that publish (write) messages to topics
• Can choose which partition to write to (round-robin, key-based, or custom)
5. Consumers
• Applications that subscribe to (read) messages
• Organized into consumer groups for parallel processing
6. ZooKeeper
• Manages and coordinates Kafka brokers (being replaced by KRaft in newer versions)
• Stores metadata about brokers, topics, and partitions (modern consumers store offsets in the internal __consumer_offsets topic rather than ZooKeeper)
7. Replicas
• Copies of partitions for fault tolerance
• Leader replica handles all read/write requests
• Follower replicas replicate the leader's data
8. Offsets
• Unique ID assigned to each message within a partition
• Consumers track their position via offsets
9. Log Compaction
• Retention policy that keeps the latest value for each key
• Useful for maintaining current state without full history
Essential Configuration Tweaks for Data Engineers
Broker Configurations
1. num.partitions
◦ Default number of partitions per topic
◦ Recommended: Start with 6-10 partitions per topic
num.partitions=6
2. log.retention.ms
◦ How long to keep messages (default is 7 days)
◦ Alternative: log.retention.hours or log.retention.minutes
log.retention.ms=604800000  # 7 days
3. log.retention.bytes
◦ Maximum size of a partition before deletion
log.retention.bytes=1073741824  # 1GB
4. default.replication.factor
◦ Default number of replicas for each partition
◦ Recommended: 3 for production
default.replication.factor=3
5. unclean.leader.election.enable
◦ Whether to allow out-of-sync replicas to become leaders
◦ Recommended: false for data consistency
unclean.leader.election.enable=false
Producer Configurations
1. acks
◦ Number of acknowledgments required for a successful write
◦ acks=0 (none), acks=1 (leader only), acks=all (all in-sync replicas)
acks=all
2. compression.type
◦ Message compression (none, gzip, snappy, lz4, zstd)
compression.type=snappy
3. retries
◦ Number of retries for failed sends
retries=5
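Putting the producer settings above together, a minimal sketch (assuming the kafka-python client, a broker at localhost:9092, and an illustrative topic name):

from kafka import KafkaProducer

# Assumes a broker at localhost:9092 and an existing "orders" topic (illustrative names)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",                  # wait for all in-sync replicas
    compression_type="snappy",   # requires the python-snappy package
    retries=5,
)

# Key-based partitioning: records with the same key land in the same partition
producer.send("orders", key=b"customer-42", value=b'{"amount": 19.99}')
producer.flush()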
Consumer Configurations
1. group.id
◦ Consumer group identifier
group.id=inventory-consumers
2. auto.offset.reset
◦ What to do when no offset is stored (earliest, latest, none)
auto.offset.reset=earliest
3. enable.auto.commit
◦ Whether to automatically commit offsets
enable.auto.commit=false  # Commit manually after processing for at-least-once delivery
Performance Tuning
1. message.max.bytes
◦ Maximum size of a single message
message.max.bytes=1000000  # 1MB
2. num.io.threads
◦ Number of threads handling disk I/O for requests
num.io.threads=8
3. num.network.threads
◦ Number of threads handling network requests
num.network.threads=3
Key Considerations for Data Engineers
1. Partitioning Strategy:
◦ Choose partition keys carefully to avoid skew
◦ More partitions = more parallelism but also more overhead
2. Replication Factor:
◦ Higher replication improves durability but requires more resources
◦ Minimum of 2 for staging, 3 for production
3. Monitoring:
◦ Track consumer lag (messages behind)
◦ Monitor disk usage and network throughput
4. Security:
◦ Enable SSL for encryption
◦ Configure SASL for authentication
◦ Set up ACLs for authorization
5. Upgrade Considerations:
◦ Plan for rolling upgrades
◦ Test compatibility between client and broker versions
Consumer Groups in Kafka
A consumer group is a set of consumers that work together to consume data from one or
more topics. The key characteristics are:
1. Parallel Processing: Consumers in the same group divide the partitions among
themselves
2. Message Distribution: Each partition is consumed by only one consumer in the
group
3. Offset Management: The group maintains its position (offsets) in the partitions
How Consumers Should Consume Data
Basic Consumption Rules
1. One Partition Per Consumer: Within a consumer group, each partition is assigned
to exactly one consumer
2. Scalable Consumption: You can add more consumers (up to the number of
partitions) to increase throughput
3. Rebalancing: When consumers join or leave, partitions are reassigned
automatically
Consumption Scenarios
1. Single Consumer in Group:
◦ Consumes from all partitions
◦ Example: 1 consumer reading from a topic with 4 partitions
2. Multiple Consumers (≤ Partitions):
◦ Each consumer gets exclusive access to one or more partitions
◦ Example: 4 consumers reading from a topic with 4 partitions (1:1 mapping)
3. More Consumers Than Partitions:
◦ Extra consumers remain idle
◦ Example: 5 consumers for a topic with 4 partitions → 1 idle consumer
Can One Consumer Consume From Multiple Groups or Partitions?
1. One Consumer → Multiple Groups
• Yes, at the process level: an application can run several consumer instances, each belonging to a different group
• This is uncommon, because each group receives its own copy of every message (duplicates per group)
2. One Consumer → Multiple Partitions
• Yes: A single consumer can be assigned multiple partitions from the same topic
• This happens automatically when consumers outnumber partitions
3. One Consumer → Multiple Topics
• Yes: A consumer can subscribe to multiple topics
• The partitions from all topics will be assigned according to group rules
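A minimal consumer-group sketch of the rules above (assuming the kafka-python client and a broker at localhost:9092; running a second copy of this script with the same group_id would split the topic's partitions between the two processes):

from kafka import KafkaConsumer

# Topic, group, and broker address are illustrative
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="inventory-consumers",   # members of this group share the partitions
    auto_offset_reset="earliest",
    enable_auto_commit=False,         # commit manually after processing
)

for message in consumer:
    print(message.partition, message.offset, message.value)
    consumer.commit()                 # advance this group's offset for the partition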
Pulsar
Apache Pulsar: Overview, Features, and Comparison with Kafka
1. What is Apache Pulsar?
Apache Pulsar is a distributed, cloud-native pub-sub messaging and streaming
platform designed for high performance, scalability, and durability. It combines the best
features of traditional messaging systems (like RabbitMQ) and streaming systems (like
Kafka) into a unified platform.
2. Fundamental Features of Apache Pulsar
Core Architecture
• Multi-layer Architecture:
◦ Brokers (stateless, handle message dispatch)
◦ BookKeeper (persistent storage layer for logs)
◦ ZooKeeper (coordination and metadata management)
• Unified Messaging Model:
◦ Supports pub-sub (topic-based), queuing (consumer groups), and streaming in a single system.
Messaging & Streaming Capabilities
• Persistent & Non-Persistent Topics:
◦ Messages can be stored durably (persistent) or kept in memory (non-
persistent).
• Multi-Tenancy:
◦ Supports multiple tenants with isolated namespaces and quotas.
• Geo-Replication:
◦ Built-in cross-datacenter replication for disaster recovery.
• Message Retention & Expiry:
◦ Configurable TTL (Time-to-Live) for messages.
◦ Supports long-term storage with tiered storage (offloading to cheaper
storage like S3).
Consumer Models
• Exclusive, Failover, Shared, Key-Shared subscription modes for different
consumption patterns.
• Consumer Acknowledgement (Ack):
◦ Supports individual, cumulative, and negative acknowledgments.
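A minimal sketch of a shared subscription using the pulsar-client Python library (broker URL, topic, and subscription names are illustrative):

import pulsar

# Assumes a broker at pulsar://localhost:6650; names are illustrative
client = pulsar.Client("pulsar://localhost:6650")
consumer = client.subscribe(
    "persistent://public/default/orders",
    subscription_name="order-workers",
    consumer_type=pulsar.ConsumerType.Shared,  # messages are load-balanced across consumers
)

msg = consumer.receive()
try:
    print(msg.data())
    consumer.acknowledge(msg)           # individual ack
except Exception:
    consumer.negative_acknowledge(msg)  # ask the broker to redeliver later
finally:
    client.close()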
3. Advanced Features of Apache Pulsar
1. Tiered Storage
• Automatically offloads older data to cloud storage (S3, GCS, HDFS) while keeping
recent data in BookKeeper.
2. Pulsar Functions (Serverless Computing)
• Lightweight compute functions for stream processing (similar to Kafka Streams but
simpler).
3. Pulsar SQL (via Presto)
• Query historical data in Pulsar using SQL.
4. Transactions (Exactly-Once Semantics)
• Supports end-to-end exactly-once processing (similar to Kafka).
5. Schema Registry
• Enforces schema compatibility (Avro, JSON, Protobuf) for structured data.
6. Multi-Protocol Support
• Supports Kafka-compatible API, MQTT, and AMQP for easier migration.
7. Kubernetes-Native
• Designed for cloud deployments with Helm charts and K8s operators.
4. Apache Pulsar vs. Kafka: Key Differences
Feature | Apache Pulsar | Apache Kafka
Architecture | Multi-layer (brokers + BookKeeper) | Monolithic (brokers store data)
Scalability | Scales brokers & storage independently | Scaling requires rebalancing partitions
Geo-Replication | Built-in multi-region replication | Requires MirrorMaker or Confluent tools
Message Retention | Tiered storage (S3 offloading) | Limited by local disk (KIP-405 tiered storage in Kafka 3.6+)
Consumer Models | Multiple subscription types (Exclusive, Shared, Key-Shared) | Only consumer groups
Processing | Pulsar Functions (serverless) | Kafka Streams (stateful processing)
Latency | Low-latency (microseconds) | Higher latency (milliseconds)
Multi-Tenancy | Strong isolation (tenants, namespaces) | Limited (depends on RBAC)
Kafka Compatibility | Yes (via Pulsar Proxy) | Native Kafka protocol
Ease of Management | Auto-balancing, no manual partitioning | Manual partition management
5. When to Use Pulsar vs. Kafka?
Choose Pulsar if:
✅ You need unified queuing + streaming in one system.
✅ You want built-in geo-replication & multi-tenancy.
✅ You need long-term storage with tiered offloading.
✅ You prefer cloud-native, Kubernetes-friendly deployments.
Choose Kafka if:
✅ You have an existing Kafka ecosystem (Confluent, Kafka Connect).
✅ You need mature stream processing (Kafka Streams, KSQL).
✅ You prefer simpler architecture (fewer moving parts).
6. Conclusion
• Pulsar is a modern alternative to Kafka with better scalability, multi-tenancy, and
cloud-native features.
• Kafka is more mature and widely adopted, with strong stream processing
capabilities.
• Pulsar is gaining traction in enterprises needing hybrid messaging + streaming
with cloud flexibility.
Spark
Spark's architecture is designed to be efficient, scalable, and fault-tolerant.
1. Driver Program
• The central point of a Spark application
• Contains the main() function and creates the SparkContext
• Converts user code into tasks
• Schedules tasks across executors
• Maintains metadata about all Spark components
• Collects results from workers
2. Cluster Manager
• Allocates resources for the application (Standalone, YARN, Kubernetes, or Mesos)
• The driver requests executors from it when the application starts
3. Worker Nodes
• Machines that run the application code in the cluster
• Host executor processes that perform computations
• Store data for the application
4. Executors
• Processes launched on worker nodes that run tasks
• Store computation results in memory or disk
• Communicate with the driver program
• Each application has its own executor processes
5. Tasks
• Smallest unit of work sent to an executor
• Corresponds to computation on a data partition
• Multiple tasks can run in parallel on an executor
Spark Execution Model
Directed Acyclic Graph (DAG) Scheduler
• Breaks down the logical execution plan (RDD lineage) into stages of tasks
• Optimizes the execution plan (pipelining transformations)
• Determines optimal locations to run tasks based on data locality
Key Architectural Features
1. Lazy Evaluation
• Transformations are only computed when an action is called
• Allows for optimization of the execution plan
2. In-Memory Computation
• Data is kept in memory between operations when possible
• Significantly faster than disk-based systems like MapReduce
3. Resilient Distributed Datasets (RDDs)
• Fundamental data structure of Spark
• Immutable, partitioned collections of records
• Can be reconstructed if partitions are lost (lineage)
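A small PySpark sketch of lazy evaluation (names are illustrative): the transformations only record lineage, and nothing executes until the count() action at the end.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# Transformations only build up lineage; no job runs yet
rdd = spark.sparkContext.parallelize(range(1_000_000))
squared = rdd.map(lambda x: x * x)
evens = squared.filter(lambda x: x % 2 == 0)

# The action triggers the DAG scheduler to build stages and run tasks
print(evens.count())

spark.stop()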
Performance Optimizations in Spark Architecture
1. Memory Management: Tunable memory fractions for execution and storage
2. Broadcast Variables: Efficiently send large read-only values to workers
3. Accumulators: Variables that workers can only add to
4. Data Serialization: Options like Java serialization or Kryo
5. Garbage Collection Tuning: Critical for JVM-based operations
What is Garbage Collection?
Garbage Collection is the automatic memory management process in Java Virtual
Machine (JVM) that reclaims memory occupied by objects that are no longer in use.
Common GC Errors in Spark
1. Long GC Pauses: When GC takes too long (seconds), causing task timeouts
2. Frequent GC Cycles: Excessive GC activity slowing down processing
3. OutOfMemoryError (OOM) due to GC overhead: When >98% of CPU time is
spent on GC
Causes of GC Problems in Spark
• Memory pressure: Too many objects in heap
• Improper memory configuration: Incorrect executor/driver memory settings
• Memory leaks: Objects not being properly released
• Large data structures: Big collections that don't fit in memory
• Excessive caching: Caching more data than necessary
Symptoms of GC Issues
• Tasks failing with "ExecutorLostFailure"
• "java.lang.OutOfMemoryError: GC overhead limit exceeded"
• Slow performance with high GC time in Spark UI
• Tasks timing out due to long GC pauses
What is Heap Space?
The heap is the area of memory where Spark stores objects created during execution.
Common Heap Space Errors
1. java.lang.OutOfMemoryError: Java heap space
2. java.lang.OutOfMemoryError: PermGen space (in older JVMs)
3. Container killed by YARN for exceeding memory limits
Causes of Heap Space Errors
1. Insufficient memory allocation:
◦ Executor memory too small for the workload
◦ Driver memory too small for collected results
2. Data skew:
◦ Uneven distribution causing some tasks to process much more data
3. Improper operations:
◦ Collecting large datasets to the driver
◦ Caching unnecessary RDDs/DataFrames
◦ Broadcasting very large variables
4. Memory leaks:
◦ Accumulators not being cleared
◦ Static collections growing indefinitely
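A hedged configuration sketch for mitigating GC and heap pressure; the values are illustrative and should be sized to the actual cluster and workload:

from pyspark.sql import SparkSession

# Illustrative values only; tune to the workload
spark = (
    SparkSession.builder
    .appName("memory-tuning-sketch")
    .config("spark.executor.memory", "8g")           # JVM heap per executor
    .config("spark.executor.memoryOverhead", "2g")   # off-heap headroom (YARN/Kubernetes)
    .config("spark.memory.fraction", "0.6")          # share of heap for execution + storage
    .config("spark.driver.memory", "4g")             # protects against OOM on collect()
    .getOrCreate()
)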
Flink
Apache Flink: Overview, Features, and Comparison with Spark
1. What is Apache Flink?
Apache Flink is a distributed stream processing framework designed for stateful, low-
latency, and high-throughput data processing. Unlike batch-first systems (like Spark),
Flink treats streaming as the core abstraction, with batch as a special case.
2. Fundamental Features of Apache Flink
Core Architecture
• True Streaming Model: Processes data in real-time (unlike micro-batching in Spark
Streaming).
• Event Time Processing: Handles out-of-order events
using watermarks and event-time semantics.
• Stateful Computations: Maintains intermediate state (e.g., counters, windows)
efficiently.
• Fault Tolerance: Uses distributed snapshots (Chandy-Lamport algorithm) for
recovery.
APIs & Libraries
• DataStream API (for unbounded streams)
• DataSet API (for bounded data; deprecated in favor of running the DataStream and Table APIs in batch mode)
• Table API & SQL (unified relational queries on streams & batches)
• Stateful Functions (serverless event-driven processing)
Deployment Modes
• Standalone Cluster
• YARN / Kubernetes / Mesos
• Docker & Cloud (AWS, GCP, Azure)
3. Advanced Features of Apache Flink
1. Exactly-Once Processing
• Guarantees no duplicates even after failures (via checkpointing).
2. State Backends
• Supports different storage backends:
◦ MemoryStateBackend (fast, but not durable)
◦ FsStateBackend (file system, e.g., HDFS)
◦ RocksDBStateBackend (disk-based, scalable for large state)
3. Windowing & Time Semantics
• Event Time / Processing Time / Ingestion Time
• Tumbling, Sliding, Session, Global Windows
4. Connectors & Ecosystem
• Kafka, Kinesis, RabbitMQ (streaming sources/sinks)
• HDFS, S3, JDBC (batch integrations)
• Elasticsearch, Cassandra (output sinks)
5. Machine Learning & Graph Processing
• FlinkML (machine learning library, though less mature than Spark MLlib)
• Gelly (graph processing)
6. Savepoints
• Allows stateful upgrades & reprocessing without losing progress.
7. Adaptive Scaling
• Dynamic scaling of parallel operators (experimental).
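A minimal PyFlink sketch of checkpoint-based exactly-once processing (the interval and pipeline are illustrative, and end-to-end guarantees also depend on the source and sink):

from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(10_000)  # snapshot the pipeline state every 10 seconds
env.get_checkpoint_config().set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)

# Illustrative keyed, stateful pipeline: count events per key
env.from_collection([("user-1", 1), ("user-2", 1), ("user-1", 1)]) \
   .key_by(lambda record: record[0]) \
   .reduce(lambda a, b: (a[0], a[1] + b[1])) \
   .print()

env.execute("exactly-once-sketch")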
4. Apache Flink vs. Spark: Key Differences
Feature | Apache Flink | Apache Spark
Processing Model | True streaming (per-record) | Micro-batching (small RDD chunks)
Latency | Sub-millisecond | Milliseconds to seconds
State Management | First-class keyed state | Requires external storage (e.g., Redis)
Fault Tolerance | Lightweight snapshots | Lineage recomputation (costly)
Event Time Handling | Native support (watermarks) | Limited (needs Structured Streaming)
Batch Processing | Treats batch as a bounded stream | Optimized for batch-first (RDDs)
SQL Support | Unified SQL for streams & batch | Separate APIs (Spark SQL vs. Structured Streaming)
Machine Learning | FlinkML (less mature) | MLlib (more mature)
Ecosystem | Growing (Kafka, Kinesis, connectors) | Larger ecosystem (Spark SQL, GraphX, etc.)
Resource Mgmt. | Fine-grained (task slots) | Coarse-grained (executors)
5. When to Use Flink vs. Spark?
Choose Flink if:
✅ You need real-time streaming with low latency.
✅ You require stateful event-time processing.
✅ You want exactly-once guarantees without micro-batching.
✅ You prefer unified batch & stream APIs.
Choose Spark if:
✅ You primarily do batch processing & ETL.
✅ You need machine learning (MLlib) or graph processing (GraphX).
✅ You rely on Spark SQL or Databricks ecosystem.
✅ You prefer mature, widely-adopted big data tech.
6. Conclusion
• Flink excels in real-time, stateful stream processing with low latency.
• Spark dominates batch analytics, ML, and SQL with a richer ecosystem.
• Flink is the future for streaming, while Spark remains king for batch.
Feature Storage
Feature Storage in Data Science & Engineering
Feature storage is a centralized system designed to store, manage, and serve machine
learning (ML) features—reusable data attributes used for model training and inference. It
ensures consistency, reproducibility, and low-latency access to features across different
stages of the ML lifecycle.
1. Why Feature Storage?
ML models rely on features (e.g., user embeddings, transaction trends, NLP tokens), but
managing them at scale introduces challenges:
• Inconsistent features between training & inference → model drift
• Repetitive computation (same feature calculated multiple times)
• Slow feature retrieval for real-time inference
• Lack of versioning & governance
A feature store solves these by acting as a single source of truth for features.
2. Key Components of a Feature Store
Component | Description | Example
Offline Store | Stores historical features for batch training (slow, high-capacity) | HDFS, Snowflake
Online Store | Low-latency DB for real-time serving (fast, limited history) | Redis, DynamoDB
Feature Registry | Metadata & schema management (versioning, lineage) | Feast, Tecton
Transformation Engine | Computes features (batch/streaming) | Spark, Flink
Serving Layer | Retrieves features for training/inference | gRPC, REST API
3. Core Features of a Feature Store
A. Feature Reusability
• Avoid recomputing the same feature (e.g., "user_30d_avg_spend") across models.
B. Consistency Between Training & Serving
• Ensures the same transformation logic is applied during training and inference.
C. Point-in-Time Correctness (Time Travel)
• Retrieves features as they were at training time to avoid data leakage.
D. Low-Latency Serving
• Optimized for real-time models (e.g., fraud detection, recommendations).
E. Feature Versioning & Lineage
• Tracks changes to features (e.g., "v1: raw data → v2: normalized").
F. Backfill Support
• Recomputes historical features when logic changes.
4. Popular Feature Store Solutions
Tool | Backers | Key Strengths | Best For
Feast | Open-source (Gojek) | Lightweight, integrates with BigQuery | Startups, GCP users
Tecton | Commercial (Uber alumni) | End-to-end, real-time features | Enterprise ML teams
Hopsworks | Open-source | PySpark/Flink integration | On-prem & hybrid clouds
Vertex AI Feature Store | Google Cloud | Managed service | GCP ecosystems
SageMaker Feature Store | AWS | Integrated with SageMaker | AWS users
5. Feature Store vs. Traditional Data Lakes/Warehouses
Aspect | Feature Store | Data Lake/Warehouse
Purpose | Optimized for ML features | General analytics
Latency | µs-ms (online store) | Seconds to minutes
Consistency | Ensures train/serve parity | No built-in guarantees
Transforms | Embedded computation | External ETL needed
Versioning | Built-in (feature lineage) | Manual tracking
6. When Do You Need a Feature Store?
✅ Multiple ML models sharing features
✅ Real-time serving requirements
✅ Frequent feature backfills (e.g., logic changes)
✅ Large-scale feature management (100s of features)
🚫 Not needed for small projects or ad-hoc models.
7. Example Workflow
1. Compute features (e.g., "user_last_5_purchases") using Spark/Flink.
2. Store in offline (Parquet) and online (Redis) stores.
3. Training: Fetch historical features with point-in-time correctness.
4. Inference: Retrieve latest features in <10ms from the online store.
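A hedged sketch of steps 3 and 4 using Feast (assuming a configured feature repo in the working directory; the feature view and entity names are illustrative):

import pandas as pd
from feast import FeatureStore

# Assumes a Feast repo with a feature view "user_stats" containing
# "user_last_5_purchases" (illustrative names)
store = FeatureStore(repo_path=".")

# Training: point-in-time-correct join against the offline store
entity_df = pd.DataFrame({
    "user_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2024-01-15", "2024-01-20"]),
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_stats:user_last_5_purchases"],
).to_df()

# Inference: low-latency lookup from the online store
online_features = store.get_online_features(
    features=["user_stats:user_last_5_purchases"],
    entity_rows=[{"user_id": 1001}],
).to_dict()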
Conclusion
A feature store standardizes feature pipelines, eliminates training-serving skew, and
accelerates ML deployment. Leading tech companies (Uber, Airbnb, Netflix) rely on them
for production ML.
Kinesis vs MSK
Comparison: AWS Kinesis vs. AWS MSK (Managed Streaming for Kafka)
Both AWS Kinesis and AWS MSK are managed streaming services, but they cater to
different use cases. Below is a detailed comparison:
1. Overview
Feature | AWS Kinesis | AWS MSK (Managed Kafka)
Type | AWS-native streaming service | Fully managed Apache Kafka
Use Case | Real-time analytics, log processing, event sourcing | High-throughput, Kafka-compatible event streaming
Protocol | Kinesis Producer Library (KPL), Kinesis Client Library (KCL) | Kafka protocol (supports Kafka clients)
Scalability | Auto-scaling (Kinesis Data Streams On-Demand) | Manual scaling (brokers, storage)
Pricing Model | Pay-per-shard (Provisioned) or pay-per-throughput (On-Demand) | Pay per broker hour + storage
2. Key Features Comparison
A. Performance & Throughput
Metric | Kinesis | MSK
Max Throughput | ~1 MB/s per shard (Provisioned) or unlimited (On-Demand) | Depends on broker size (scales horizontally)
Latency | ~70-200 ms (with KPL) | ~10-50 ms (Kafka-native)
Partitioning | Fixed shards (resharding required for scaling) | Dynamic partitions (Kafka-native)
Ordering Guarantee | Per-shard ordering | Per-partition ordering
B. Durability & Retention
Feature | Kinesis | MSK
Data Retention | Up to 365 days (configurable) | Configurable (default: 7 days)
Replication | 3x replication (AWS-managed) | Configurable (Kafka replication factor)
Storage Backend | AWS-managed (no direct access) | EBS volumes (user-configurable)
C. Ecosystem & Integrations
Integration | Kinesis | MSK
AWS Services | Lambda, Firehose, EMR, Redshift | Same as Kafka (Lambda, Glue, EMR)
Third-Party Tools | Limited (KCL/KPL-based) | Full Kafka ecosystem (connectors, Schema Registry)
Monitoring | CloudWatch metrics | CloudWatch + Prometheus (for Kafka metrics)
D. Management & Operations
Aspect | Kinesis | MSK
Setup Complexity | Simple (serverless-like) | Requires Kafka knowledge
Scaling | Automatic (On-Demand) or manual (Provisioned) | Manual (add brokers/storage)
Maintenance | Fully managed (AWS handles scaling, patching) | Managed brokers, but user manages topics & clients
Security | IAM policies, KMS encryption | IAM + Kafka ACLs, TLS encryption
3. When to Use Which?
✅ Choose AWS Kinesis If:
• You need a simple, fully managed streaming service.
• You’re already using AWS-native services (Lambda, Firehose, Redshift).
• You want automatic scaling (with On-Demand mode).
• Your use case is real-time analytics, clickstreams, or logs.
✅ Choose AWS MSK If:
• You need Kafka compatibility (existing Kafka apps).
• You require higher throughput & lower latency.
• You rely on the Kafka ecosystem (Kafka Connect, Schema Registry).
• You need custom partitioning and consumer groups.
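For the Kinesis side, a minimal producer sketch with boto3 (region and stream name are illustrative, and the stream must already exist); MSK is used through standard Kafka clients, like the producer example in the Kafka section above:

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

kinesis.put_record(
    StreamName="clickstream",   # illustrative stream name
    Data=json.dumps({"user_id": 42, "page": "/checkout"}).encode("utf-8"),
    PartitionKey="42",          # same key -> same shard, preserving per-shard ordering
)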
4. Pricing Comparison
Service | Pricing Model | Example Cost (Monthly)
Kinesis (Provisioned) | $0.015/shard-hour + $0.01/GB data | 10 shards ≈ $110 + data charges
Kinesis (On-Demand) | $0.04/GB processed | 1 TB data ≈ $40
MSK (Development) | $0.10/broker-hour + EBS storage | 3 brokers (t3.small) ≈ $216 + storage
MSK (Production) | $0.35/broker-hour (m5.large) | 6 brokers ≈ $1,512 + storage
💡 Kinesis is cheaper for low/mid-scale, while MSK is cost-effective for high-scale
Kafka workloads.
5. Summary: Kinesis vs. MSK
Decision Factor | Winner
Ease of Use | Kinesis
Throughput & Latency | MSK
Kafka Compatibility | MSK
AWS Native Integrations | Kinesis
Cost Efficiency (Small Scale) | Kinesis
Cost Efficiency (Large Scale) | MSK
Final Recommendation
• For AWS-centric, simple streaming → Kinesis
• For high-performance, Kafka-based systems → MSK
Parquet, Delta Table, Hudi or Iceberg
Comparison Analysis: Parquet vs. Delta Lake vs. Apache Hudi vs. Apache Iceberg
These storage and table formats are widely used in modern data lakes for structured and semi-structured data. Below is a detailed comparison:
1. Overview
Format | Parquet | Delta Lake | Apache Hudi | Apache Iceberg
Type | Columnar storage format | ACID transactional layer on Parquet | Transactional data lake framework | Table format for data lakes
Developed By | Apache | Databricks | Uber | Netflix
Primary Use Case | Batch analytics | ACID transactions, time travel | Incremental processing, upserts | Schema evolution, large-scale analytics
Storage Layer | File format | Built on Parquet | Built on Parquet | Built on Parquet/ORC
ACID Compliance | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes
2. Key Features Comparison
A. Transaction Support & Concurrency
Feature | Parquet | Delta Lake | Hudi | Iceberg
ACID Transactions | ❌ No | ✅ Optimistic concurrency | ✅ Optimistic concurrency | ✅ Optimistic concurrency
Time Travel | ❌ No | ✅ Yes (versioned data) | ✅ Yes (incremental pulls) | ✅ Yes (snapshots)
Schema Evolution | ❌ Manual | ✅ Automatic (backward compatible) | ✅ Supports schema changes | ✅ Advanced schema evolution
Conflict Resolution | ❌ Not applicable | ✅ Optimistic locking | ✅ Custom merge strategies | ✅ Snapshot isolation
B. Performance & Optimization
Feature | Parquet | Delta Lake | Hudi | Iceberg
Compaction | ❌ No | ✅ Auto-compaction | ✅ Auto-compaction | ✅ Optional compaction
Z-Ordering | ❌ No | ✅ Yes (Databricks) | ❌ No | ✅ Yes
Partition Evolution | ❌ No | ✅ Yes | ✅ Yes | ✅ Best-in-class (hidden partitioning)
Indexing | ❌ No | ✅ Databricks-specific | ✅ Built-in indexing | ✅ Metadata-based
C. Ecosystem & Integrations
Feature | Parquet | Delta Lake | Hudi | Iceberg
Spark Support | ✅ Native | ✅ Native (Databricks) | ✅ Good | ✅ Good
Flink Support | ✅ Yes | ❌ Limited | ✅ Yes | ✅ Yes
Presto/Trino | ✅ Yes | ✅ Limited | ✅ Yes | ✅ Best support
Hive | ✅ Yes | ❌ No | ✅ Yes | ✅ Yes
Cloud Support | ✅ All | ✅ Best on Databricks | ✅ AWS, GCP | ✅ AWS, GCP, Azure
3. Use Case Recommendations
✅ When to Use Parquet?
• Batch processing (no updates/deletes needed).
• Cost-efficient storage (columnar format).
• Compatibility with all query engines (Hive, Presto, Spark).
✅ When to Use Delta Lake?
• Databricks environment (tight integration).
• ACID transactions & time travel needed.
• Frequent upserts & merges (e.g., CDC).
✅ When to Use Apache Hudi?
• Incremental processing (near-real-time pipelines).
• Upserts & deletions at scale (e.g., IoT, CDC).
• Optimized for AWS (EMR, Glue).
✅ When to Use Apache Iceberg?
• Multi-engine access (Spark, Flink, Trino).
• Schema evolution & partition evolution needed.
• Large-scale analytics (Netflix-style workloads).
4. Performance Benchmarks (General Trends)
Metric | Parquet | Delta Lake | Hudi | Iceberg
Write Speed | Fast | Moderate (transaction overhead) | Fast (indexed upserts) | Moderate
Read Speed | Very fast | Fast | Fast | Very fast
Update Efficiency | ❌ Rewrites entire file | ✅ Optimized (copy-on-write) | ✅ Optimized (copy-on-write or merge-on-read) | ✅ Optimized
Metadata Scalability | ❌ Manual | ✅ Good | ✅ Good | ✅ Best
5. Vendor & Community Support
Format | Primary Backer | Cloud Support | Adoption
Parquet | Apache | All | Universal
Delta Lake | Databricks | Best on Databricks | High (Databricks users)
Hudi | Uber | AWS (EMR), GCP | Moderate
Iceberg | Netflix | AWS, GCP, Azure | Growing fast
6. Summary: Which One to Choose?
Decision Factor | Best Choice
Basic batch storage | Parquet
ACID transactions (Databricks) | Delta Lake
Upserts & incremental processing | Hudi
Multi-engine, large-scale analytics | Iceberg
Final Recommendation
• For simple analytics → Parquet
• For Databricks users → Delta Lake
• For real-time upserts (AWS) → Hudi
• For open, multi-engine lakes → Iceberg
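A small PySpark sketch contrasting a plain Parquet write with a Delta write (assuming the delta-spark package is on the classpath; the package version and bucket paths are illustrative):

from pyspark.sql import SparkSession

# Assumes the job was started with e.g. --packages io.delta:delta-spark_2.12:3.1.0
spark = (
    SparkSession.builder
    .appName("format-comparison-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "2024-01-01"), (2, "2024-01-02")], ["id", "dt"])

# Plain Parquet: fast columnar storage, but append-only with no transaction log
df.write.mode("overwrite").parquet("s3://my-bucket/events_parquet/")

# Delta: the same Parquet files underneath, plus an ACID transaction log
df.write.format("delta").mode("overwrite").save("s3://my-bucket/events_delta/")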
EMR
AWS EMR is a managed big data platform that simplifies running distributed processing
frameworks like Apache Spark, Hadoop, Hive, HBase, Flink, and Presto on AWS
infrastructure. It provides a fully managed environment to process vast amounts of data
across dynamically scalable EC2 instances.
Why EMR Was Built
AWS created EMR to address several big data challenges:
1. Infrastructure Complexity: Eliminates the need to manually configure and
maintain Hadoop/Spark clusters
2. Scalability: Allows processing of data at any scale (from GBs to PBs)
3. Cost Optimization: Provides transient clusters that can be terminated after job
completion
4. Integration: Seamlessly works with other AWS services (S3, Glue, Redshift, etc.)
5. Framework Flexibility: Supports multiple processing frameworks in one platform
Key Considerations for Data Engineers
1. Cluster Creation
a. Cluster Configuration:
• Node Types:
◦ Master: Manages the cluster (1 node)
◦ Core: Runs tasks and stores data (1+ nodes)
◦ Task: Pure compute nodes (optional, 0+ nodes)
• Instance Selection:
◦ Memory-optimized for memory-intensive workloads (Spark)
◦ Compute-optimized for CPU-intensive jobs (HBase)
◦ Storage-optimized when using HDFS
b. Software Stack:
• Choose appropriate EMR release version (match with framework versions)
• Select applications (Spark, Hadoop, Hive, etc.)
• Configure bootstrap actions for custom setups
Example CLI Creation:
aws emr create-cluster \
--name "Spark Cluster" \
--release-label emr-6.7.0 \
--applications Name=Spark \
--ec2-attributes KeyName=my-key \
--instance-type m5.xlarge \
--instance-count 3 \
--use-default-roles
2. Job Execution
a. Job Submission Methods:
• Direct API calls (for programmatic control)
• CLI submission (for manual runs)
• Step functions (for workflow orchestration)
• Airflow EMR operators (for pipeline integration)
b. Performance Optimization:
• Partitioning: Ensure input data is properly partitioned
• Memory Settings: Configure executor/driver memory in Spark
• Data Locality: Use S3 select or EMRFS for S3 integration
• Spot Instances: For task nodes to reduce costs
Example Spark Job Submission:
aws emr add-steps \
--cluster-id j-3KXXXXXX9IOK \
--steps Type=Spark,Name="Spark Job",\
ActionOnFailure=CONTINUE,\
Args=[--class,org.apache.spark.examples.SparkPi,\
/user/hadoop/share/lib/spark-examples.jar,100]
3. Cluster Maintenance
a. Monitoring:
• Use CloudWatch metrics for cluster health
• Monitor HDFS/disk utilization (if using HDFS)
• Track YARN resource allocation
b. Logging:
• Enable logging to S3 for debugging
• Configure log encryption and retention
• Access logs via EMR console or S3 directly
c. Scaling:
• Use managed scaling for automatic resizing
• Configure scaling policies based on metrics
• Consider task spot fleets for cost savings
4. Cost Management
a. Instance Selection:
• Use spot instances for task nodes (up to 70-90% savings)
• Reserve instances for long-running clusters
• Right-size instances based on workload
b. Cluster Lifecycle:
• Use transient clusters for batch jobs
• Implement auto-termination after job completion
• Consider EMR Serverless for sporadic workloads
c. Storage Options:
• Prefer S3 over HDFS for persistent storage
• Use EBS volumes for temporary HDFS storage
• Implement lifecycle policies for S3 data
Best Practices for Data Engineers
1. Infrastructure as Code:
◦ Use CloudFormation/Terraform for reproducible clusters
◦ Version control bootstrap scripts and configurations
2. Security:
◦ Enable encryption at rest and in transit
◦ Use IAM roles for fine-grained permissions
◦ Implement VPC endpoints for S3 access
3. Data Processing:
◦ Leverage EMRFS for consistent S3 access
◦ Use Spark's dynamic allocation for resource efficiency
◦ Implement checkpointing for long-running jobs
4. Job Design:
◦ Break large jobs into smaller steps
◦ Implement retry logic for transient failures
◦ Separate transformation and loading steps
5. Integration Patterns:
from airflow.providers.amazon.aws.operators.emr import EmrCreateJobFlowOperator
create_emr_cluster = EmrCreateJobFlowOperator(
task_id='create_emr_cluster',
job_flow_overrides={
'Name': 'Analytics Cluster',
'ReleaseLabel': 'emr-6.7.0',
'Applications': [{'Name': 'Spark'}],
'Instances': {
'InstanceGroups': [
{
'Name': 'Master nodes',
'Market': 'ON_DEMAND',
'InstanceRole': 'MASTER',
'InstanceType': 'm5.xlarge',
'InstanceCount': 1,
},
{
'Name': 'Core nodes',
'Market': 'SPOT',
'InstanceRole': 'CORE',
'InstanceType': 'm5.xlarge',
'InstanceCount': 2,
}
],
'KeepJobFlowAliveWhenNoSteps': False,
'TerminationProtected': False,
},
'VisibleToAllUsers': True,
},
aws_conn_id='aws_default',
emr_conn_id='emr_default',
)
Common Pitfalls to Avoid
1. Over-provisioning clusters: Start small and scale as needed
2. Ignoring spot instances: Miss significant cost savings opportunities
3. Poor data partitioning: Causes skewed workloads and slow jobs
4. Hardcoding configurations: Makes environments inflexible
5. Not setting auto-termination: Leads to runaway costs
Modern EMR Features to Leverage
1. EMR Serverless: For sporadic workloads without cluster management
2. EMR on EKS: Run EMR jobs on Kubernetes
3. Runtime roles: Fine-grained permissions per job
4. Managed scaling: Automatic cluster resizing
5. ACID transactions: With EMR 6+ and Spark 3+
Glue
AWS Glue is a fully managed extract, transform, and load (ETL) service that automates
the time-consuming steps of data preparation. It consists of several components:
1. Data Catalog: Central metadata repository (Hive-compatible)
2. ETL Engine: Serverless Spark environment for transformations
3. Scheduler: Job orchestration with dependency resolution
4. DataBrew: Visual data preparation tool (no-code)
5. Studio: Development interface for authoring ETL jobs
Why AWS Glue Was Built
AWS created Glue to address key data integration challenges:
1. Infrastructure Overhead: Eliminates need to manage ETL servers/clusters
2. Metadata Management: Solves the "data discovery" problem
3. Schema Evolution: Handles changing data structures automatically
4. Serverless Execution: Scales automatically to handle job demands
5. AWS Integration: Native connectivity with S3, RDS, Redshift, etc.
Key Considerations for Data Engineers
1. Job Creation
a. Job Type Selection:
• Spark Script: Full control (Python/Scala)
• Visual Editor: Drag-and-drop (limited to common transforms)
• Python Shell: Lightweight jobs (non-Spark)
b. Configuration Essentials:
# Sample Glue Python job parameters
args = {
'--job-language': 'python',
'--TempDir': 's3://bucket/temp/',
'--job-bookmark-option': 'job-bookmark-enable',
'--enable-metrics': '',
'--enable-continuous-cloudwatch-log': 'true',
'--enable-spark-ui': 'true',
'--spark-event-logs-path': 's3://bucket/spark-logs/'
}
c. Data Source/Target Setup:
• Use Glue Crawlers judiciously (can be expensive for frequent runs)
• Prefer direct source specifications in code over crawlers for known schemas
• Configure connection types properly (JDBC, S3, etc.)
2. Performance Optimization
a. Worker Configuration:
• Standard Workers (G.1X/G.2X): Most common workloads
• G.4X/G.8X: Memory-intensive jobs
• G.025X: Low-volume streaming jobs
b. Partitioning Strategies:
# Repartition before write
datasink = glueContext.write_dynamic_frame.from_options(
frame=dynamic_frame,
connection_type="s3",
connection_options={
"path": output_path,
"partitionKeys": ["year", "month", "day"]
},
format="parquet"
)
c. Job Bookmarks:
• Enable for incremental processing
• Understand bookmark limitations (certain sources/transforms break bookmarks)
• Monitor bookmark state via CloudWatch
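A hedged sketch of the pieces a bookmark needs: the job must call job.init()/job.commit(), and each source needs a transformation_ctx (catalog names here are illustrative):

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)   # bookmark state is tracked per transformation_ctx

source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",           # illustrative catalog names
    table_name="transactions",
    transformation_ctx="source",   # required for the bookmark to record progress
)

# ... transforms and writes ...

job.commit()   # persists the bookmark so the next run only reads new data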
3. Maintenance & Operations
a. Monitoring:
• Set up CloudWatch alerts for job failures
• Track metrics: DPU hours, execution time, memory usage
• Enable continuous logging for debugging
b. Error Handling:
• Implement try-catch blocks for robust jobs
• Configure job retries with exponential backoff
• Use dead-letter queues for error records
c. Cost Control:
• Right-size DPU allocations (start small, scale as needed)
• Use timeouts to prevent runaway jobs
• Clean up temporary files automatically
Best Practices
1. Development Workflow:
◦ Use Glue Studio for prototyping
◦ Develop locally with Glue Docker image
◦ Version control all scripts
2. Schema Management:
◦ Explicitly define schemas when possible
◦ Handle schema evolution gracefully
◦ Use Glue Schema Registry for streaming
3. Job Design:
# Good practice: Modular transforms
from awsglue.transforms import Filter, Map

def clean_data(glue_context, dyf):
# Data quality checks
dyf = Filter.apply(
frame=dyf,
f=lambda x: x["field"] is not None
)
# Transformations
dyf = Map.apply(
frame=dyf,
f=lambda x: {**x, "new_field": x["field"].lower()}
)
return dyf
4. Security:
◦ Encrypt data at rest and in transit
◦ Use IAM roles with least privilege
◦ Enable VPC endpoints for private access
Common Pitfalls to Avoid
1. Overusing Crawlers: Running unnecessary crawls that rebuild entire catalogs
2. DPU Overallocation: Using more workers than needed
3. Ignoring Job Bookmarks: Missing incremental load opportunities
4. Poor Error Handling: Jobs failing silently on bad records
5. Hardcoding Paths: Making jobs inflexible to environment changes
Advanced Features
1. Glue Elastic Views:
◦ Create materialized views across data stores
◦ Automatically handles change data capture
2. Glue DataBrew:
◦ Visual data preparation
◦ 250+ built-in transforms
3. Glue Streaming ETL:
◦ Near real-time processing
◦ Integrates with Kinesis, MSK
4. Glue Flex Jobs:
◦ For non-urgent workloads
◦ Runs on spare capacity (70% cost savings)
Integration Example with Other Services
# Sample job integrating S3, Glue Catalog, and Redshift
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glue_context = GlueContext(sc)
# Read from catalog
datasource = glue_context.create_dynamic_frame.from_catalog(
database="sales_db",
table_name="transactions",
transformation_ctx="datasource"
)
# Transform
cleaned_data = clean_data(glue_context, datasource)
# Write to Redshift
glue_context.write_dynamic_frame.from_jdbc_conf(
frame=cleaned_data,
catalog_connection="redshift-connection",
connection_options={
"dbtable": "processed_transactions",
"database": "analytics_db"
},
redshift_tmp_dir="s3://bucket/temp/"
)
Athena
AWS Athena is an interactive serverless query service that enables SQL analysis of data
directly in Amazon S3 using standard ANSI SQL. It's built on the following core principles:
1. Serverless Architecture: No infrastructure to manage
2. Pay-per-Query Pricing: Costs based on data scanned
3. Standard SQL: Presto/Trino SQL dialect support
4. Direct S3 Access: No data loading required
5. Federated Query Capabilities: Combine data across sources
How Athena is Built
High-Level Architecture
1. Foundation: Built on open-source Presto/Trino distributed SQL engine
2. Compute Layer: Transient clusters spin up per query
3. Metadata Layer: Uses AWS Glue Data Catalog (or legacy Hive metastore)
4. Storage Layer: Directly accesses S3 via optimized connectors
5. Optimization Layer: Includes cost-based optimizer (CBO) and partition pruning
Key Components
• Query Engine: Presto 0.217+ with AWS enhancements
• Metadata Store: Glue Data Catalog (database/table definitions)
• Security: IAM policies + S3 ACLs/bucket policies
• Result Cache: Reuses previous query results when possible
Query Execution: HLD vs LLD
High-Level Design (HLD) Flow
1. Query Submission: SQL via console/JDBC/ODBC/REST API
2. Query Planning:
◦ Syntax parsing → Logical plan → Distributed execution plan
◦ Cost-based optimization (join ordering, predicate pushdown)
3. Execution:
◦ Coordinator distributes work to worker nodes
◦ Workers read S3 data in parallel
4. Result Aggregation: Coordinator combines partial results
5. Output: Returns results to client (S3 for large results)
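Besides the console and JDBC/ODBC drivers, queries can be submitted through the API; a minimal boto3 sketch (database, query, and output bucket are illustrative):

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit the query; Athena writes results to the given S3 location
response = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM sales_data WHERE dt = '2023-01-01'",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Check status (a real job would poll until the state leaves QUEUED/RUNNING)
status = athena.get_query_execution(QueryExecutionId=query_id)
if status["QueryExecution"]["Status"]["State"] == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]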
Low-Level Design (LLD) Details
1. Metadata Resolution:
-- Table definition lookup in the Glue Catalog
DESCRIBE sales_data;
2. Partition Pruning:
-- Only scans partitions matching the filter
SELECT * FROM sales_data WHERE dt BETWEEN '2023-01-01' AND '2023-01-31';
3. File Format Handling:
◦ Columnar formats (Parquet/ORC) enable predicate pushdown
◦ Splittable formats allow parallel reads
4. Execution Stages:
◦ Scan → Filter → Aggregate → Join → Sort → Limit
◦ Each stage processed by worker nodes
5. Resource Allocation:
◦ Memory per node: ~10GB per worker
◦ Max workers: Scales automatically based on query complexity
Data Engineer Considerations
1. Performance Optimization
a. Partitioning Strategy:
-- Partitioned table creation example
CREATE TABLE web_logs (
request_time TIMESTAMP,
ip_address VARCHAR(16),
-- ...other columns
)
PARTITIONED BY (
year INT,
month INT,
day INT
)
STORED AS PARQUET
LOCATION 's3://bucket/logs/';
b. File Format Selection:
• Parquet/ORC: Best for analytics (columnar, compression)
• JSON/CSV: Human-readable but slower
• Avro: Row-oriented with schema evolution
c. Compression:
• Snappy (balance of speed/ratio)
• Gzip (better ratio, slower)
• Zstd (modern alternative)
2. Cost Control
a. Data Scanned Reduction:
-- Bad: Scans entire dataset
SELECT * FROM large_table;
-- Good: Uses partition pruning
SELECT user_id FROM large_table
WHERE account_status = 'active'
AND signup_date > CURRENT_DATE - INTERVAL '30' DAY;
b. Workgroups & Limits:
• Create separate workgroups for different teams
• Set per-query/data-scanned limits
• Enforce query result location
c. Caching:
• Enable result reuse for repeated queries
• Use parameterized queries when possible
3. Schema Design
a. Column Projection:
-- Specify only needed columns
SELECT user_id, email FROM users;
b. Type Handling:
• Prefer TIMESTAMP over VARCHAR for dates
• Use appropriate VARCHAR lengths
• Consider nested types (ARRAY, MAP, STRUCT) for complex data
c. Schema Evolution:
• Use Glue Schema Registry for streaming data
• Handle backward compatibility in ETL
4. Integration Patterns
a. Federated Queries:
-- Query across S3 and RDS
SELECT s.*, r.extra_data
FROM s3_archive.sales s
JOIN postgres.public.customers r ON s.cust_id = r.id;
b. ETL Pipeline:
# Sample Glue job writing Athena-optimized output
glue_context.write_dynamic_frame.from_options(
frame=dynamic_frame,
connection_type="s3",
connection_options={
"path": "s3://output-bucket/",
"partitionKeys": ["year", "month"],
"compression": "snappy"
},
format="parquet"
)
5. Monitoring & Tuning
a. Query Analysis:
• Use Athena query history/metrics
• Review EXPLAIN plans
• Monitor partition usage
b. Performance Tuning:
-- Check partition effectiveness by listing the table's partitions
SHOW PARTITIONS web_logs.access_logs;
c. Error Handling:
• Set up SNS alerts for query failures
• Implement retry logic for throttling
• Monitor S3 LIST operation costs
Advanced Features
CTAS (Create Table As Select):
CREATE TABLE optimized_sales
WITH (
format = 'PARQUET',
parquet_compression = 'SNAPPY',
partitioned_by = ARRAY['region']
) AS
SELECT * FROM raw_sales;
UNLOAD for Large Results:
UNLOAD (
SELECT * FROM large_results
)
TO 's3://results-bucket/path/'
WITH (
format = 'PARQUET'
);
Lambda UDFs:
CREATE FUNCTION parse_udf AS 'com.example.ParseUDF'
USING JAR 's3://bucket/code.jar';
ML Inference:
SELECT ml_function('arn:model', input_data) FROM source_table;
Common Pitfalls to Avoid
1. Full Table Scans: Always filter on partitioned columns
2. Small Files Problem: Combine small files (<128MB) into larger ones
3. Over-partitioning: Too many partitions degrade performance
4. Ignoring Statistics: ANALYZE tables after major changes
5. No Result Pagination: Fetching huge result sets directly
Delta Lake
Delta is an open-source storage layer that brings ACID transactions to data lakes. It's
built on top of existing file formats (like Parquet) and adds reliability and performance
optimizations through:
Core Characteristics:
1. ACID Transactions: Atomicity, Consistency, Isolation, Durability
2. Schema Enforcement: Prevents bad data from being written
3. Time Travel: Versioned data access
4. Upserts/Merges: CRUD operations (not just append)
5. Scalable Metadata: Handles petabyte-scale tables
File Structure:
s3://delta-table/
├── _delta_log/                    # Transaction log (JSON files)
│   ├── 00000000000000000000.json
│   └── 00000000000000000001.json
├── part-00000-xxxx.parquet        # Data files
└── part-00001-yyyy.parquet
How Delta Fundamentally Differs from Other Formats
Feature | Delta | Traditional Parquet/ORC | CSV/JSON
Transactions | Full ACID support | None (append-only) | None
Schema Evolution | Managed (enforcement) | Manual handling | No schema
Updates | Supports MERGE/DELETE | Rewrite entire dataset | Append-only
Time Travel | Built-in versioning | Requires manual snapshots | Impossible
Performance | Optimized metadata | File-level only | Slow scans
Consistency | Strong guarantees | Eventual consistency | No guarantees
Delta Lake
Delta Lake is the open-source framework that implements the Delta protocol. It
provides:
1. Reliable Data Lakes: ACID over cloud storage (S3, ADLS, GCS)
2. Spark Integration: Native support in PySpark/Spark SQL
3. Multi-engine Support: Works with Presto, Trino, Redshift, etc.
4. Open Format: No vendor lock-in (standard Parquet + transaction log)
Key Components:
1. Transaction Log: Single source of truth (ordered JSON records)
2. Optimistic Concurrency Control: Handles concurrent writes
3. Vacuum: Clean up old files (configurable retention)
4. Z-Ordering: Colocate related data for faster queries
Technical Differentiators
1. Transaction Log (The Secret Sauce)
• Records all changes as ordered JSON files
• Enables atomic commits through optimistic concurrency (each commit writes the next numbered log file)
• Example transaction entry (simplified):
{
  "commitInfo": { ... },
  "add": {
    "path": "part-00000-xxx.snappy.parquet",
    "size": 123456,
    "partitionValues": {"date": "2023-01-01"},
    "stats": "{ ... }"
  }
}
2. Schema Enforcement
# Fails if new_data doesn't match delta_table schema
delta_table.alias("target").merge(
new_data.alias("source"),
"target.id = source.id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
3. Time Travel
# Read previous version
df = spark.read.format("delta").option("versionAsOf", 12).load("/delta/table")
# Restore to version 10
delta_table.restoreToVersion(10)
Why Data Engineers Choose Delta Lake
1. Reliable Pipelines:
◦ No more "partial writes" during job failures
◦ Concurrent readers/writers won't corrupt data
2. Simplified Architecture:
# Before Delta (manual handling)
df.write.parquet("output/")  # Risky overwrites
# After Delta
df.write.format("delta").mode("overwrite").save("output/")  # Atomic
3. Performance Optimizations:
◦ File skipping using statistics
◦ Z-ordering for faster predicates
◦ Compaction of small files
4. Unified Batch/Streaming:
# Streaming upserts
(spark.readStream
    .format("delta")
    .load("source")
    .writeStream
    .option("checkpointLocation", "/delta/checkpoints/events")  # needed for recovery
    .foreachBatch(upsert_function)
    .start())
Implementation Example
from delta.tables import DeltaTable

# Create Delta table
df.write.format("delta").save("/delta/events")

# Upsert pattern
delta_table = DeltaTable.forPath(spark, "/delta/events")
(delta_table.alias("target")
.merge(source_df.alias("source"), "target.userId = source.userId")
.whenMatchedUpdate(set = {"lastSeen": "source.timestamp"})
.whenNotMatchedInsert(values = {
"userId": "source.userId",
"firstSeen": "source.timestamp",
"lastSeen": "source.timestamp"
})
.execute())
When Not to Use Delta
1. Extremely High-Frequency Writes: The transaction log can become a bottleneck
2. Tiny Datasets: Overhead exceeds benefits
3. Non-Spark Ecosystems: Limited support in some query engines
Airflow
Apache Airflow is an open-source platform to programmatically author, schedule, and
monitor workflows
Core Principles
1. Dynamic Pipeline Generation: Pipelines are defined as Python code (DAGs)
2. Extensibility: Custom operators, hooks, and executors can be created
3. Scalability: Modular architecture with various executors (Local, Celery, Kubernetes)
4. Declarative: Focus on "what" not "how" - define workflows declaratively
5. Reproducibility: Pipelines are version-controlled and repeatable
Key Components
1. Web Server
• Flask-based UI for pipeline monitoring and management
• Views for DAGs, tasks, execution history, and configuration
• Ability to trigger/stop DAGs manually
2. Scheduler
• Heart of Airflow that triggers workflow execution
• Monitors all tasks and DAGs
• Handles dependencies and scheduling logic
3. Executor
• Mechanism that runs tasks (different types available)
• Examples: LocalExecutor, CeleryExecutor, KubernetesExecutor
4. Metadata Database
• Stores state of all DAGs, tasks, variables, connections
• Typically PostgreSQL or MySQL
• Used by all components to coordinate
5. DAG (Directed Acyclic Graph)
• Collection of tasks with defined dependencies
• Represented as Python code in the dags/ folder
• Defines workflow structure and execution order
6. Operators
• Building blocks that define individual tasks
• Determine what actually gets executed
7. Hooks
• Interfaces to external platforms/databases
• Reusable connections to external systems
8. Pools
• Limit concurrency for specific tasks
• Useful for managing resource-intensive operations
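A minimal DAG sketch tying these components together (ids, schedule, and commands are illustrative):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def transform():
    print("transforming")

# DAG definition: the scheduler picks this up from the dags/ folder
with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract >> transform_task   # extract must finish before transform runs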
Common Operators for Data Engineering
1. Spark Operators
• SparkSubmitOperator: Submit Spark jobs to a cluster
submit_spark_job = SparkSubmitOperator(
    task_id='spark_submit_job',
    application='/path/to/your/spark/app.jar',
    conf={'spark.yarn.queue': 'production'},
    application_args=['arg1', 'arg2'],
    dag=dag
)
• SparkJDBCOperator: Execute Spark SQL to JDBC
2. Database Operators
Redshift:
• RedshiftSQLOperator: Execute SQL on Redshift
redshift_query = RedshiftSQLOperator(
    task_id='redshift_query',
    sql='SELECT * FROM important_table',
    redshift_conn_id='redshift_default',
    dag=dag
)
• S3ToRedshiftOperator: Load data from S3 to Redshift
• RedshiftToS3Operator: Unload data from Redshift to S3
MongoDB:
• MongoToS3Operator: Export data from MongoDB to S3
• S3ToMongoOperator: Import data from S3 to MongoDB
• MongoOperator: Execute MongoDB commands
mongo_op = MongoOperator(
    task_id='mongo_aggregate',
    mongo_conn_id='mongo_default',
    collection='analytics',
    query=[{'$group': {'_id': '$user', 'count': {'$sum': 1}}}],
    dag=dag
)
3. Cloud Operators
AWS:
• S3FileTransformOperator: Transform files in S3
• GlueJobOperator: Run AWS Glue jobs
• EMRCreateJobFlowOperator: Create EMR clusters
GCP:
• BigQueryOperator: Run BigQuery queries
• GCSDownloadOperator: Download from Google Cloud Storage
• DataflowOperator: Run Dataflow jobs
4. ETL Operators
• PythonOperator: Execute Python functions
• BashOperator: Execute bash commands
• DockerOperator: Run Docker containers
• KubernetesPodOperator: Run pods on Kubernetes
5. Transfer Operators
• SFTPOperator: Transfer files via SFTP
• FTPOperator: File transfer via FTP
• SQLToS3Operator: Export SQL results to S3
6. Specialized Operators
• HiveOperator: Execute Hive queries
• PrestoToMySqlOperator: Transfer from Presto to MySQL
• SlackAPIPostOperator: Send Slack notifications
S3
1. Standard
2. S3 Intelligent-Tiering
3. Standard Infrequent-Access
4. S3 One-Zone-Infrequent-Access
5. Glacier Instant-Retrieval
6. Glacier Flexible Retrieval
7. Glacier Deep Archive
8. S3 Outposts
AWS S3 offers several storage classes designed for different use cases based on access
frequency, durability, availability, and cost requirements. Here are all the current S3
storage classes with detailed explanations:
1. S3 Standard (General Purpose)
• Use Case: Frequently accessed data
• Availability: 99.99%
• Durability: 99.999999999% (11 9's)
• Minimum Storage Duration: None
• Retrieval Fee: None
• Features:
◦ Low latency and high throughput
◦ Ideal for content distribution, mobile/gaming applications, big data analytics
◦ Supports SSL for data in transit and encryption at rest
2. S3 Intelligent-Tiering
• Use Case: Data with unknown or changing access patterns
• Availability: 99.9% (all tiers)
• Durability: 99.999999999%
• Minimum Storage Duration: 30 days (for Frequent and Infrequent tiers)
• Retrieval Fee: None
• Features:
◦ Automatically moves objects between access tiers:
1. Frequent Access Tier (same pricing as S3 Standard)
2. Infrequent Access Tier (same pricing as S3 Standard-IA)
3. Archive Instant Access Tier (same pricing as S3 Glacier Instant Retrieval)
4. Optional Archive Access and Deep Archive Access Tiers (Glacier-level pricing for rarely accessed data)
◦ No retrieval fees or operational overhead
◦ Small monthly monitoring/auto-tiering fee ($0.0025 per 1,000 objects)
3. S3 Standard-Infrequent Access (S3 Standard-IA)
• Use Case: Less frequently accessed data requiring rapid retrieval
• Availability: 99.9%
• Durability: 99.999999999%
• Minimum Storage Duration: 30 days
• Retrieval Fee: $0.01/GB
• Features:
◦ Lower storage price than Standard but higher retrieval cost
◦ Ideal for backups, disaster recovery, long-term storage of important data
◦ Data stored across multiple AZs
4. S3 One Zone-Infrequent Access (S3 One Zone-IA)
• Use Case: Infrequently accessed, non-critical data
• Availability: 99.5%
• Durability: 99.999999999% (but only in one AZ)
• Minimum Storage Duration: 30 days
• Retrieval Fee: $0.01/GB
• Features:
◦ 20% cheaper than Standard-IA
◦ Data stored in a single AZ (risk of data loss if AZ is destroyed)
◦ Good for secondary backups or easily reproducible data
5. S3 Glacier Instant Retrieval
• Use Case: Archive data needing millisecond retrieval
• Availability: 99.9%
• Durability: 99.999999999%
• Minimum Storage Duration: 90 days
• Retrieval Fee: $0.03/GB (varies by region)
• Features:
◦ Lowest cost storage with milliseconds retrieval
◦ Ideal for medical images, news media assets, genomics data
◦ Cheaper than Standard-IA but with similar performance
6. S3 Glacier Flexible Retrieval (formerly S3 Glacier)
• Use Case: Long-term backups and archives
• Availability: 99.99% (after restore)
• Durability: 99.999999999%
• Minimum Storage Duration: 90 days
• Retrieval Options:
◦ Expedited: 1-5 minutes ($0.03/GB plus per-request fees)
◦ Standard: 3-5 hours ($0.01/GB)
◦ Bulk: 5-12 hours ($0.0025/GB)
• Features:
◦ Ideal for data accessed 1-2 times per year
◦ Great for compliance archives, digital preservation
◦ No real-time access (must restore first)
7. S3 Glacier Deep Archive
• Use Case: Rarely accessed long-term data
• Availability: 99.99% (after restore)
• Durability: 99.999999999%
• Minimum Storage Duration: 180 days
• Retrieval Options:
◦ Standard: 12 hours ($0.0025/GB)
◦ Bulk: 48 hours ($0.00099/GB)
• Features:
◦ Lowest cost storage in S3 (∼$0.00099/GB/month)
◦ Designed for data accessed less than once per year
◦ Perfect for regulatory/compliance archives
8. S3 Outposts
• Use Case: On-premises S3 storage
• Features:
◦ Brings S3 API to your data center
◦ Uses the same storage classes as standard S3
◦ Ideal for local data processing with AWS compatibility
Comparison Summary Table
Storage Class | Best For | Availability | Retrieval Time | Min Duration | Cost Structure
Standard | Frequent access | 99.99% | Immediate | None | Higher storage, no retrieval fee
Intelligent-Tiering | Unknown patterns | 99.9% | Immediate | 30 days | Variable, auto-optimized
Standard-IA | Infrequent access | 99.9% | Immediate | 30 days | Lower storage, $0.01/GB retrieval
One Zone-IA | Reproducible data | 99.5% | Immediate | 30 days | Cheaper than Standard-IA
Glacier Instant Retrieval | Archive with ms access | 99.9% | Milliseconds | 90 days | Very low storage, $0.03/GB retrieval
Glacier Flexible Retrieval | Long-term backup | 99.99% | Minutes to hours | 90 days | Very low storage, retrieval fees vary
Glacier Deep Archive | Compliance archives | 99.99% | 12-48 hours | 180 days | Lowest storage, $0.0025/GB retrieval
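A short boto3 sketch of how storage classes are used in practice, uploading straight into Standard-IA and adding a lifecycle rule that steps objects down the tiers (bucket, prefix, and day counts are illustrative):

import boto3

s3 = boto3.client("s3")

# Upload directly into an infrequent-access class
with open("backup.tar.gz", "rb") as body:
    s3.put_object(
        Bucket="my-archive-bucket",
        Key="backups/2024-01-15.tar.gz",
        Body=body,
        StorageClass="STANDARD_IA",
    )

# Lifecycle rule: Glacier Flexible Retrieval after 90 days, Deep Archive after 365,
# then expire after roughly 7 years
s3.put_bucket_lifecycle_configuration(
    Bucket="my-archive-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-backups",
            "Status": "Enabled",
            "Filter": {"Prefix": "backups/"},
            "Transitions": [
                {"Days": 90, "StorageClass": "GLACIER"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 2555},
        }]
    },
)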
Redshift
AWS Redshift is a fully managed, petabyte-scale cloud data warehouse service
designed for online analytical processing (OLAP) and business intelligence workloads.
It's built on a massively parallel processing (MPP) architecture that enables high-
performance analysis of large datasets using standard SQL.
Why Redshift is Important for Data Engineering
1. Performance at Scale:
◦ Columnar storage format (optimized for analytics)
◦ Parallel query execution across multiple nodes
◦ Ability to handle petabytes of data
2. Integration Ecosystem:
◦ Native connectivity with major BI tools (Tableau, Power BI, Looker)
◦ Seamless integration with AWS services (S3, Kinesis, Glue, Lambda)
◦ Supports JDBC/ODBC connections
3. Cost-Effective:
◦ Pay-as-you-go pricing model
◦ Concurrency scaling for handling peak loads
◦ Spectrum for querying data directly in S3
4. Modern Data Architecture:
◦ Enables data lakehouse patterns (combining data warehouse and data lake)
◦ Supports both structured and semi-structured data
◦ Facilitates ELT (Extract-Load-Transform) workflows
5. Security & Compliance:
◦ End-to-end encryption (at rest and in transit)
◦ Fine-grained access control
◦ SOC, HIPAA, PCI DSS compliance
Key Considerations When Creating a Redshift Cluster
1. Cluster Configuration
Node Types:
• RA3 nodes (recommended): Separate compute and storage (auto-scaling storage)
• DC2 nodes: Dense compute for performance-intensive workloads
• DS2 nodes: Dense storage for large datasets
Sizing:
• Start with at least 2 nodes for production (for fault tolerance)
• Use the Redshift sizing calculator for initial estimates
• Consider concurrency scaling for variable workloads
Example Creation Command (AWS CLI; clusters are provisioned through the API/console, not SQL; the identifiers, password, and region below are placeholders):
aws redshift create-cluster \
  --cluster-identifier my-cluster \
  --node-type ra3.xlplus \
  --number-of-nodes 4 \
  --master-username admin \
  --master-user-password 'ChangeMe123!' \
  --iam-roles 'arn:aws:iam::123456789012:role/RedshiftS3Access' \
  --encrypted
# Cross-region snapshot copy is enabled with a separate call:
aws redshift enable-snapshot-copy --cluster-identifier my-cluster --destination-region us-west-2
2. Data Modeling Best Practices
Distribution Styles:
• KEY: For large fact tables (distributes rows based on join columns)
• EVEN: When no clear distribution key exists
• ALL: For small dimension tables (copies to all nodes)
• AUTO: Let Redshift choose (the default when no distribution style is specified)
Sort Keys:
• Compound: Multiple columns (fixed order)
• Interleaved: Equal weight to multiple columns
• Single: One column
Example Table Design:
CREATE TABLE sales (
sale_id BIGINT,
sale_date TIMESTAMP,
product_id INTEGER,
customer_id INTEGER,
amount DECIMAL(18,2)
)
DISTKEY (customer_id)
SORTKEY (sale_date, product_id);
3. Data Loading Considerations
Bulk Loading:
• Use COPY command for efficient loading from S3
• Compress files (GZIP, LZO, or ZSTD) before loading
• Split large files into multiple smaller files (roughly 100 MB-1 GB each)
Example COPY Command:
COPY sales
FROM 's3://my-bucket/sales-data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Access'
FORMAT AS PARQUET
COMPUPDATE OFF;  -- GZIP omitted: Parquet files carry their own compression
Incremental Loading:
• Use staging tables with MERGE operations
• Consider time-based partitioning
• Track high-water marks for incremental loads
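A minimal sketch of this pattern, reusing the sales table from the earlier example with a hypothetical incremental S3 prefix (shown with psycopg2; any Postgres-compatible driver works):
import psycopg2

conn = psycopg2.connect(host="my-cluster.example.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="etl_user", password="***")
cur = conn.cursor()

# 1. High-water mark: the newest sale_date already in the warehouse.
#    In practice this value drives the upstream extract so only newer data lands in S3.
cur.execute("SELECT COALESCE(MAX(sale_date), '1900-01-01') FROM sales;")
high_water_mark = cur.fetchone()[0]
print("Loading data newer than", high_water_mark)

# 2. Stage the new extract.
cur.execute("CREATE TEMP TABLE sales_staging (LIKE sales);")
cur.execute("""
    COPY sales_staging
    FROM 's3://my-bucket/sales-data/incremental/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Access'
    FORMAT AS PARQUET;
""")

# 3. Merge: delete rows being replaced, then insert the staged rows.
cur.execute("DELETE FROM sales USING sales_staging "
            "WHERE sales.sale_id = sales_staging.sale_id;")
cur.execute("INSERT INTO sales SELECT * FROM sales_staging;")
conn.commit()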
4. Performance Optimization
Workload Management (WLM):
• Configure query queues based on priority
• Set memory allocation for different query types
• Implement short query acceleration
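With manual WLM, queues are defined as a JSON document on the cluster's parameter group. A hedged boto3 sketch (the parameter group name and queue settings are illustrative, not a recommendation):
import json
import boto3

redshift = boto3.client("redshift")

# Two queues: a narrow, memory-heavy queue for ETL and a wider queue for dashboards.
wlm_config = [
    {"query_group": ["etl"], "query_concurrency": 3, "memory_percent_to_use": 60},
    {"query_group": ["dashboards"], "query_concurrency": 10, "memory_percent_to_use": 40},
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="my-wlm-params",
    Parameters=[{
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps(wlm_config),
    }],
)
# The cluster must use this parameter group and be rebooted for the change to take effect.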
Vacuum & Analyze:
• Regularly VACUUM to reclaim space
• Run ANALYZE to update statistics
• Schedule maintenance windows
Materialized Views:
• Pre-compute complex aggregations
• Automatic refresh options available
• Especially useful for dashboard queries
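A rough maintenance sketch tying these together, assuming the sales table from earlier and a hypothetical view name (VACUUM must run outside a transaction, hence autocommit):
import psycopg2

conn = psycopg2.connect(host="my-cluster.example.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="etl_user", password="***")
conn.autocommit = True
cur = conn.cursor()

# Reclaim deleted space and re-sort rows, then refresh planner statistics.
cur.execute("VACUUM FULL sales;")
cur.execute("ANALYZE sales;")

# One-time setup: pre-compute a dashboard aggregation as a materialized view.
cur.execute("""
    CREATE MATERIALIZED VIEW daily_sales_mv AS
    SELECT sale_date::date AS day, SUM(amount) AS revenue
    FROM sales
    GROUP BY 1;
""")

# Recurring: refresh after each load (or use AUTO REFRESH where supported).
cur.execute("REFRESH MATERIALIZED VIEW daily_sales_mv;")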
5. Monitoring & Maintenance
Key Metrics to Monitor:
• Cluster CPU utilization
• Query execution times
• Disk space usage
• Queue wait times
Maintenance Tasks:
• Set up automated snapshots
• Implement cross-region replication for DR
• Regularly resize clusters as needed
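These metrics are published to CloudWatch, so alerting and dashboards can be driven from there. A minimal boto3 sketch (cluster identifier is hypothetical):
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

# Average CPU utilization over the last hour, in 5-minute buckets.
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Redshift",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "ClusterIdentifier", "Value": "my-cluster"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1), "%")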
Handling Big Data in Redshift
1. Use Redshift Spectrum:
◦ Query data directly in S3 without loading
◦ Ideal for historical or rarely accessed data
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole';
2. Implement Data Partitioning:
◦ Partition large Spectrum tables by date ranges (see the sketch after this list)
◦ Use partition columns (e.g., year/month/day) for time-series data
3. Leverage Concurrency Scaling:
◦ Automatically adds cluster capacity during peak loads
◦ No need to over-provision for occasional spikes
4. Consider Data Lake Integration:
◦ Store raw data in S3 (data lake)
◦ Use Redshift for processed, analytical data
◦ Implement federated queries when needed
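To illustrate the partitioning idea from item 2 above, a hedged sketch of a partitioned Spectrum external table (schema, table, columns, and S3 paths are hypothetical); the statements are held in Python strings and can be run through any Redshift SQL client:
# External table over the data lake, partitioned by sale date.
create_external_table = """
    CREATE EXTERNAL TABLE spectrum_schema.sales_archive (
        sale_id BIGINT,
        product_id INTEGER,
        amount DECIMAL(18,2)
    )
    PARTITIONED BY (sale_date DATE)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/sales-archive/';
"""

# Register one partition; a daily job would add one of these per new S3 prefix.
add_partition = """
    ALTER TABLE spectrum_schema.sales_archive
    ADD PARTITION (sale_date = '2024-01-01')
    LOCATION 's3://my-bucket/sales-archive/sale_date=2024-01-01/';
"""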
Common Pitfalls to Avoid
1. Overusing DISTSTYLE ALL: Copying large tables to every node leads to storage bloat
2. Ignoring Sort Keys: Results in full table scans
3. Small WLM Queues: Causes query queueing delays
4. Not Compressing Data: Wastes storage and I/O bandwidth
5. Loading During Peak Hours: Impacts query performance
Cost Optimization Tips
1. Use RA3 nodes with managed storage
2. Pause clusters during non-business hours (dev environments)
3. Implement auto-vacuum to reduce storage costs
4. Monitor query performance to identify expensive operations
5. Use reserved instances for predictable workloads
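For point 2, pausing a dev cluster outside business hours is a couple of API calls, typically triggered from a scheduler (the cluster identifier is hypothetical):
import boto3

redshift = boto3.client("redshift")

# Evening: stop paying for compute (managed storage charges continue).
redshift.pause_cluster(ClusterIdentifier="dev-analytics-cluster")

# Morning: bring the cluster back.
redshift.resume_cluster(ClusterIdentifier="dev-analytics-cluster")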
Star Schema vs Snowflake Schema
Star Schema
Characteristics:
1. Central Fact Table surrounded by denormalized dimension tables
2. Denormalized Structure: Each dimension contains all its attributes without
normalization
3. Single Join Path: Dimensions connect directly to the fact table
4. Simpler Queries: Fewer joins needed for most queries
5. Better Performance: Generally faster query performance for analytical workloads
Example:
SALES_FACT (fact table)
|
|-- PRODUCT_DIM (all product attributes)
|-- CUSTOMER_DIM (all customer attributes)
|-- DATE_DIM (all date attributes)
|-- STORE_DIM (all store attributes)
Advantages:
• Query simplicity: Business users find it easier to understand
• Optimized for reads: Excellent for analytical queries
• Better performance: Fewer joins mean faster queries
• Kimball's preferred approach for most implementations
Disadvantages:
• Potential redundancy: Repeated attribute values in dimensions
• Less flexible to changes: May require full dimension reloads for some changes
Snowflake Schema
Characteristics:
1. Normalized Dimensions: Dimension tables are broken into multiple related tables
2. Hierarchical Structure: Some dimensions have "sub-dimensions"
3. Multiple Join Paths: Requires joining through multiple tables to get all attributes
4. More Complex: Resembles a snowflake due to its branching structure
Example:
SALES_FACT (fact table)
|
|-- PRODUCT_DIM
| |-- PRODUCT_CATEGORY_DIM
|-- CUSTOMER_DIM
| |-- CUSTOMER_REGION_DIM
|-- DATE_DIM
|-- STORE_DIM
|-- STORE_REGION_DIM
Advantages:
• Reduced storage: Eliminates some data redundancy
• Easier maintenance: Changes to lookup values affect fewer rows
• More flexible: Can accommodate more complex hierarchies
Disadvantages:
• More complex queries: Additional joins degrade performance
• Harder to understand: Less intuitive for business users
• Not Kimball's preferred approach: He favored star schemas for most scenarios
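To make the join difference concrete, here is an illustrative pair of queries against hypothetical tables matching the diagrams above; the snowflake version needs one extra join just to reach the region attribute:
# Star schema: region lives directly on the customer dimension.
star_query = """
    SELECT c.region, SUM(f.amount) AS revenue
    FROM sales_fact f
    JOIN customer_dim c ON f.customer_id = c.customer_id
    GROUP BY c.region;
"""

# Snowflake schema: region is normalized into its own table, adding a join.
snowflake_query = """
    SELECT r.region_name, SUM(f.amount) AS revenue
    FROM sales_fact f
    JOIN customer_dim c ON f.customer_id = c.customer_id
    JOIN customer_region_dim r ON c.region_id = r.region_id
    GROUP BY r.region_name;
"""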
Kimball's Perspective on the Choice
1. Star Schema is Preferred:
◦ Kimball strongly advocated for star schemas in most cases
◦ The performance benefits outweigh storage savings from normalization
◦ Better aligns with business users' mental models
2. Snowflake Considered When:
◦ Dimension tables become extremely large (millions of rows)
◦ There are clear, natural hierarchies in the data
◦ Specific attributes change frequently independently of others
3. Practical Compromise:
◦ Kimball suggested "snowflaking" only certain dimensions if needed
◦ Most dimensions should remain denormalized in star format
◦ Example: Might snowflake a geographic hierarchy (Country→State→City) but keep other dimensions flat
Key Differences Summary Table
Aspect             | Star Schema                  | Snowflake Schema
Normalization      | Denormalized                 | Partially normalized
Joins Required     | Fewer                        | More
Query Performance  | Faster                       | Slower
Storage            | More (due to redundancy)     | Less (reduced redundancy)
Complexity         | Simpler                      | More complex
Business Users     | Easier to understand         | Harder to understand
Kimball's View     | Preferred for most scenarios | Use sparingly when beneficial
Best For           | Most analytical use cases    | Complex hierarchies, very large dimensions
Kimball's Recommendation in Practice
Kimball would typically recommend:
1. Start with a star schema as the default approach
2. Only consider snowflaking when:
◦ A dimension has >1 million rows AND
◦ The dimension has natural hierarchies AND
◦ Query patterns frequently access different hierarchy levels separately
3. Even when snowflaking, keep the most frequently accessed attributes in the main
dimension table
Example where snowflaking might be justified:
CUSTOMER_DIM (main attributes)
|
|-- CUSTOMER_DEMOGRAPHICS (rarely used details)
|-- CUSTOMER_HIERARCHY (org structure)
In modern data warehouses with columnar storage and compression, the storage benefits of snowflaking are less significant, making star schemas even more attractive in most cases.
Scala
Fundamental Features (Basic Concepts)
1. Object-Oriented Features
• Classes and Objects: Everything is an object in Scala
• Traits: Similar to Java interfaces but can contain concrete methods
• Case Classes: Special classes optimized for pattern matching
• Singleton Objects: Using object keyword instead of class
• Companion Objects: Objects sharing the same name with a class
2. Functional Programming Basics
• First-Class Functions: Functions as values that can be passed around
• Immutability: val for immutable variables
• Higher-Order Functions: Functions that take/return other functions
• Anonymous Functions: Lambda expressions ((x: Int) => x * 2)
• Collections API: Rich set of immutable collections
3. Type System
• Type Inference: Compiler often infers types (val x = 5 infers Int)
• Variance Annotations: + (covariant), - (contravariant)
• Option Type: Alternative to null (Some(value)/None)
4. Basic Syntax Features
• Expression-Oriented: Most constructs return values
• String Interpolation: s"Hello $name"
• Multi-line Strings: Triple quotes """..."""
• Operator Overloading: Methods can be symbols (+, -, ::)
Intermediate Features
1. Advanced Functional Programming
• Pattern Matching: Powerful match expressions
• Partial Functions: Functions only defined for certain inputs
• Currying: Transforming multi-arg functions into chains
• Tail Recursion: @tailrec annotation for optimization
• For-Comprehensions: Syntactic sugar for map/flatMap/filter
2. Type System Enhancements
• Generic Classes: Type parameters class Box[T](x: T)
• Upper/Lower Type Bounds: <:, >: constraints
• Implicit Conversions: Automatic type conversions
• Type Classes: Using implicits for ad-hoc polymorphism
3. Advanced OOP Concepts
• Self Types: this: Type => syntax
• Abstract Types: Type members in traits/classes
• Compound Types: A with B with C
• Structural Types: { def close(): Unit } (duck typing)
4. Concurrency Features
• Futures and Promises: Asynchronous programming
• Akka Integration: Actor model implementation
• Parallel Collections: .par for parallel operations
5. DSL Capabilities
• Infix Notation: object method param syntax
• Implicit Classes: Extension methods pattern
• ByName Parameters: => Type for lazy evaluation
Python
Decorator with Parameters
Decorators are a powerful Python feature that allow you to modify or enhance functions
without changing their actual code. They essentially "wrap" other functions to extend
their behavior.
How Decorators Work
1. A decorator is a function that takes another function as input
2. Returns a modified or enhanced version of that function
3. Uses the @decorator_name syntax for easy application
import functools

def repeat(num_times):
    def decorator_repeat(func):
        @functools.wraps(func)  # preserve the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            for _ in range(num_times):
                result = func(*args, **kwargs)
            return result
        return wrapper
    return decorator_repeat

@repeat(num_times=3)
def greet(name):
    print(f"Hello {name}")

greet("World")  # prints "Hello World" three times
Generator Function
Generators are special functions that produce sequences of values lazily (one at a time)
using the yield keyword, making them memory efficient.
How Generators Work
1. Use yield instead of return to produce values
2. Maintain their state between calls
3. Generate values on-demand rather than all at once
4. Automatically implement the iterator protocol
def fibonacci(limit):
    a, b = 0, 1
    while a < limit:
        yield a
        a, b = b, a + b

# Usage:
for num in fibonacci(100):
    print(num)
Combined Example
from functools import reduce

# Process data using all four concepts
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# 1. Filter even numbers
evens = filter(lambda x: x % 2 == 0, data)

# 2. Square them using map
squared_evens = map(lambda x: x**2, evens)

# 3. Create a generator that yields incremental (running) sums
def sum_generator(numbers):
    total = 0
    for num in numbers:
        total += num
        yield total

# 4. Use reduce to keep only the last running total, i.e. the final sum
result = reduce(lambda _, latest: latest, sum_generator(squared_evens))
print(result)  # Output: 220 (4 + 16 + 36 + 64 + 100)
IAM
EC2
Github Actions
Data Quality
SQL OLAP/OLTP
—————————Bonus————————————
Iceberg
Flink
Kinesis
Terraform
Docker
Prometheus
Grafana