System Design - Learn System Design
System Design
📘 Resources
System Design Primer GitHub
YouTube: Gaurav Sen / Tech Dummies Narayan
🛠 Exercises
Pick simple apps: URL Shortener, Chat App
Try whiteboarding: break down their components
Start defining: API contracts, components, data flow
8/7/25, 2:30 AM System Design - Learn system design
📘 Resources
Book: Designing Data-Intensive Applications by Martin Kleppmann
YouTube: TechWorld with Nana (great visual explainers)
🛠 Practice Ideas
Design Instagram feed delivery
Add Redis cache in your URL shortener
Choose SQL or NoSQL: justify your pick
📘 Resources
Grokking the System Design Interview (Educative.io)
"Refactoring UI" (for frontend-minded system folks)
🛠 Challenges
Design a scalable Notification System
Design YouTube: break into microservices
Explain trade-offs: eventual consistency vs strong consistency
🛠 Tools
Draw.io or Excalidraw for diagrams
Postman for API mocking
💡 Tips
Always ask clarifying questions before diving in
Trade-offs are more important than perfect design
Think scalability first, optimize second
To keep things clear and memorable, I’ll explain each topic with:
✅ Simple definition
🧠 Real-world analogy/example
⚠️ Tricky parts to watch out for
🔍 Deep dive when needed
🧩 Mini activity or question to reflect
Let’s start with the foundations of system design.
You’re not coding yet — you’re designing how the system should
behave and scale under real-world use.
🧠 Real-World Analogy:
Think of designing a restaurant:
⚠️ Tricky Concepts:
🧩 Mini Reflection:
Let’s warm up. Imagine you're designing an online food ordering app
(like Zomato or Swiggy):
Try to write 3–4 down (even rough ideas!). We’ll build on that.
Once you reply, I’ll help categorize them into a real system layout — and
then we’ll move into the next foundation: Functional vs Non-Functional Requirements.
Security: Non-functional requirement (auth, rate limiting, etc.)
APIs: Communication interfaces (REST, gRPC, GraphQL)
🧠 Real-Life Analogy:
If you're building an airport:
Functional = "Planes must take off, land, refuel, and board
passengers"
Non-Functional = "Max 5 minutes to get boarding pass", "Handle
100 flights/day", "No plane delay >10 mins"
⚠️ Tricky Parts:
Most interviews focus on non-functional requirements because that’s where
design skill matters.
You’ll design different systems depending on the scale you’re
building for (10 users vs 10M users = different needs).
🧩 Mini Exercise:
Take this scenario:
2 Functional Requirements
2 Non-Functional Requirements
✅ Your Response:
Functional Requirements (What the system must do):
1. User can open and solve questions ✅
2. User can browse a library of questions ✅
🎯 Spot on. These are exactly the kinds of things that will drive your
design decisions.
Vertical scaling: You give your server 128GB RAM and 64 cores
→ Might still choke on I/O or database.
Horizontal scaling: You spin up 20 backend servers behind a load
balancer, all doing the same job
→ Better distribution, fault tolerance.
⚠️ Tricky Bits:
Vertical scaling hits physical limits and becomes expensive.
Horizontal scaling needs:
Load balancing
Stateless servers (or sticky sessions)
Data synchronization (cache, DBs)
You’ll often need both: scale vertically until it hurts, then horizontally.
🧩 Mini Exercise:
Suppose your exam app expects 100k users during a national test
window.
Try answering in your own words — I’ll help refine it next. Then we dive
into Load Balancing (super important!).
✅ Your Answer:
"Upto some level vertical scaling so system doesn't hang out, and
very much horizontal scaling to balance load"
🧠 Refined Take:
Start with vertical scaling: it’s fast, simple, and often enough for
early stages.
As traffic grows, horizontal scaling becomes essential:
Adds fault tolerance (one server can fail)
Shares the load (users get faster responses)
Works well with auto-scaling groups in cloud setups
🧠 Real-life Example:
Imagine a bank with 10 counters (servers). You walk in:
One person at the door (load balancer) tells you:
“Go to Counter 3, they’re free.”
🛠 Key Strategies:
Strategy How it Works Use Case
⚠️ Tricky Bits:
If servers store session data locally, load balancers must keep users
on the same machine (called session affinity or sticky sessions).
Load balancers can be single points of failure — so we add
redundant load balancers (active-passive or active-active).
[ Clients (users) ]
|
[ Load Balancer ]
|
[ Backend Servers ] <--- horizontally scaled
|
[ Database ]
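A minimal sketch of the load balancer box in the diagram above, assuming a simple round-robin strategy with health marking. The class name `RoundRobinBalancer` and the server names are hypothetical; real load balancers (NGINX, HAProxy, cloud LBs) do this with active health checks and more strategies.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backends in rotation, skipping ones marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        # Called when a health check fails (health checking itself is not shown).
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Look at each server at most once per call.
        for _ in range(len(self.servers)):
            server = next(self._ring)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
lb.mark_down("app-2")                    # simulate one failed backend
picks = [lb.pick() for _ in range(4)]    # traffic only reaches app-1 and app-3
```

Because the balancer decides per request, this only works cleanly when the backends are stateless or share session state.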
Try answering in your own words — I’ll help you review. After that, we’ll
explore Databases (SQL vs NoSQL) — a super important and tricky part
of design.
✅ Your Answer:
"It should transfer load to other server. If user data is saved locally, it
must be transferred; if globally, just switch servers."
🔍 Breakdown:
Case What Happens Why It Matters
⚠️ Tricky Concepts:
Perfect — let’s dive deep into Load Balancing and Session Handling,
especially as it applies to systems like your Online Exam Platform.
If the user suddenly switches servers and the new server doesn’t know
their state, bad things happen.
3. Any backend server can fetch session state instantly from Redis.
This makes your system resilient to failures during heavy load, like live
exams.
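To make the shared-session idea concrete, here is a rough sketch in which two backend servers read and write the same session through one store. `SessionStore` is an in-memory stand-in for Redis (an assumption so the example runs without a server), and the key format and field names are hypothetical.

```python
import json


class SessionStore:
    """In-memory stand-in for a shared store such as Redis."""

    def __init__(self):
        self._data = {}

    def save(self, session_id, state):
        self._data[f"session:{session_id}"] = json.dumps(state)

    def load(self, session_id):
        raw = self._data.get(f"session:{session_id}")
        return json.loads(raw) if raw is not None else None


class BackendServer:
    def __init__(self, name, store):
        self.name = name
        self.store = store  # shared, so the server itself keeps no session state

    def answer_question(self, session_id, question_id, answer):
        state = self.store.load(session_id) or {"answers": {}}
        state["answers"][question_id] = answer
        self.store.save(session_id, state)
        return state


store = SessionStore()
server_a = BackendServer("a", store)
server_b = BackendServer("b", store)

server_a.answer_question("s1", "q1", "B")
# The next request lands on a different server, which still sees q1.
state = server_b.answer_question("s1", "q2", "D")
```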
🧩 Mini Review
Quick check! Can you answer:
Type your reasoning, and I’ll review. Then we’ll move on to Databases:
SQL vs NoSQL, where you’ll learn to pick the right storage engine based
on data relationships, read/write patterns, and scale.
Exactly right ✅ — and that's one of the biggest risks in system design.
This allows horizontal scaling + fault tolerance — the two goals you're
aiming for!
Go
🧠 Real-Life Analogy:
Imagine managing a classroom:
SQL is like a spreadsheet: fixed columns — every student must have
name, ID, marks, etc.
NoSQL is like a notebook: you write down what matters for each
student — no strict structure.
📘 Examples:
User accounts → You always need name, email, password
Payments → Transactions must never be lost or half-saved
Product inventory → SKUs, prices, suppliers — all relational
📘 Examples:
Chat messages → Different users send different data
Exam submissions → Store answers as JSON blobs
Logs, sensor data → Append-heavy, no schema
🧩 Quick Check:
Suppose you want to store users' answers during the exam. Each
question can be MCQ, code, paragraph, or match-the-following.
✅ Refined Answer:
Use NoSQL because answers are unstructured or semi-structured,
and can vary in type and format (code, MCQ, paragraph, etc).
You don't want to create rigid columns for each type — better to
store them as JSON documents that adapt per user.
Strong consistency (SQL) vs. eventual consistency (NoSQL, often)
Go deeper
Perfect — you’re asking the right thing at the right time. Let’s go deep
into one of the most misunderstood but critical parts of system design:
But you can’t have all three at once in a real-world distributed system.
Networks fail, so partition tolerance is not optional; when a partition
happens, the system must trade consistency against availability.
🔍 Examples:
System Chosen Pair What It Means
1. Block the user until you're sure the code is saved everywhere (C + P)
2. Accept the code quickly and sync later (A + P)
Spot on! ✅ That’s the right instinct for most real-world systems —
especially high-scale, user-facing ones.
This is how real systems like Google Docs, HackerRank, and WhatsApp
work.
Method Example
You can even switch trade-offs depending on the phase of the app!
📌 Caching — when and how to use it, what’s safe to cache, and
why it breaks systems if done wrong
GO DEEPER IN CAP
Love that — let’s break CAP open even further. We’ll tackle it with:
🧠 2. Real-World Analogies
A. Consistency Example
Two friends ask you the score of a cricket match.
You tell both: “India has 204 runs.”
B. Availability Example
Even if your TV doesn’t have the exact score, it still shows you
something — maybe slightly delayed, but it doesn't go black.
Now suppose:
🧠 Insight:
Great systems often switch priorities depending on the use case:
Try answering and I’ll walk you through quorum consensus, versioning,
and how real databases like Cassandra and DynamoDB handle it.
Perfect reasoning! ✅ You’ve just made the same design decision used
by systems like DynamoDB, Cassandra, and WhatsApp.
Let’s go deeper into what happens during and after that partition —
and introduce you to some of the powerful tools real systems use to
keep data in sync.
Example:
N = 3 (3 replicas)
W = 2 (write to 2 nodes)
R = 2 (read from 2 nodes)
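The usual rule behind numbers like these is that reads see the latest write when R + W > N, because every read set must then share at least one replica with every write set. A small sketch that checks this and how many replica failures each operation can survive (the function name `quorum_properties` is made up for illustration):

```python
def quorum_properties(n, w, r):
    """Summarise a quorum configuration with N replicas, write quorum W, read quorum R."""
    return {
        # Read and write sets always intersect when R + W > N.
        "overlap": r + w > n,
        # An operation still succeeds as long as enough replicas are reachable.
        "write_failures_tolerated": n - w,
        "read_failures_tolerated": n - r,
    }


example = quorum_properties(n=3, w=2, r=2)
# overlap is True here, and both reads and writes survive one replica failure.
```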
✅ 3. Version Vectors
Each change gets a version (strictly, one counter per replica; simpler schemes use a timestamp or hash)
If conflicts arise, system compares versions:
Newest wins (Last Write Wins)
Or keeps both versions (e.g., Git-style merge)
Used when:
Users may edit same document from different places
Exam submissions may differ across regions
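As a rough illustration of how versions are compared, the sketch below assumes each version is a dictionary of per-replica counters. The replica names `eu` and `us` and the function names are hypothetical. A "conflict" result is the case where the system must either pick a winner (Last Write Wins) or keep both versions.

```python
def compare(a, b):
    """Return 'equal', 'a_newer', 'b_newer', or 'conflict' for two version vectors."""
    keys = set(a) | set(b)
    a_ahead = any(a.get(k, 0) > b.get(k, 0) for k in keys)
    b_ahead = any(b.get(k, 0) > a.get(k, 0) for k in keys)
    if a_ahead and b_ahead:
        return "conflict"   # concurrent edits: neither version saw the other
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"


def merge(a, b):
    """Version vector to attach to the value produced after resolving a conflict."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}


newer = compare({"eu": 2, "us": 1}, {"eu": 1, "us": 1})
concurrent = compare({"eu": 2, "us": 1}, {"eu": 1, "us": 2})
```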
🧩 Wrap-Up Check
Let’s say you use a quorum system with:
N=3
W = 1 (only 1 write needed)
R = 3 (all nodes must agree to read)
Try answering this one — it’s subtle but important. After that, I can show
you a real-world diagram of how quorum writes work, or we can move
on to Caching next.
✅ Final Answer:
With W = 1 and R = 3 (so R + W = 4 > N = 3), you’re prioritizing
consistency over read availability.
Why?
Writes are super easy — they only need to go to 1 node.
But reads must hit all 3 nodes — if even one node is down or
partitioned, your read fails.
This setup makes sure the data you read is fully up-to-date, but it’s
more likely to fail under partial outages, so you’re sacrificing
availability.
You’re now thinking like someone who can design a real-world backend
for scale.
✅ Definitions:
N = Total number of replicas for a piece of data
W = Number of replicas that must acknowledge a write before it is
considered successful
R = Number of replicas that must be queried for a read
🔍 Visual Example:
Let’s say N = 3 → 3 copies of each item
Even if 1 node is behind, at least one node overlaps with the latest
write, so the read gets the correct value.
Overlap → [B] ✅
N = 5 friends
You need W = 3 friends to agree to do something (write quorum)
You ask R = 3 of them before acting (read quorum)
As long as at least one person was part of both groups, you’ll know
what the latest decision was.
🧩 Quiz Check:
Let’s say:
N = 5 replicas
W=3
R=3
Think through it, give your answer, and then we’ll jump into Caching
right after.
✅ Final Breakdown:
N = 5 → total replicas
W = 3 → need 3 writes to succeed
R = 3 → need to read from 3 replicas
2 replicas are down → 3 remain
👉 So you just meet the quorum requirement for both reads and
writes.
The system continues to operate with full consistency despite partial
failure.
🧠 Subtle Insight:
Once more than 2 nodes go down, you’re stuck.
Because:
You’ve nailed quorum. It's a pro-level concept, and you’ve built a
rock-solid understanding. 👷‍♂️
✅ Up Next: Caching
Type “go” and we’ll jump into:
What caching is
When to use it
How it supercharges performance — and when it can cause data
inconsistencies or outages if not handled carefully
🧠 Scenario:
2 replicas go down (say, due to network issue or crash). Now you have
only 3 replicas left.
💡 Rule of Thumb:
The stricter your quorum rules, the fewer failures you can tolerate
before the system stops serving.
So, yes:
✅ Choosing consistency means you're OK with some downtime if it
guarantees correctness of data.
🧠 One-Liner Summary:
“Strong consistency needs agreement — but you can’t agree if not
enough people are online.”
Make sense?
If you're happy with this, we’re ready to dive into caching — the
performance booster with its own tricky trade-offs. Ready?
Yes
Perfect — let’s level up with Caching: one of the most powerful and risky
tools in system design.
🧠 Real-Life Analogy:
Imagine a school principal:
Looking at the desk list is faster, but it’s not always updated.
🚀 Benefits
Benefit Why it Matters
If the DB changes but the cache doesn’t, users might see outdated data.
E.g., user submits an answer, DB updates... but cache still shows old
version.
❌ Cache Stampede
If the cache expires for a popular key (e.g., exam homepage)…
Thousands of users may hit the DB at once
System gets DDoS’ed by its own users
❌ Cache Pollution
You cache useless or very rare data → memory wasted.
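These failure modes are easier to see against a concrete read path. Below is a minimal cache-aside sketch: check the cache, fall back to the database on a miss, then populate the cache with a TTL. `TTLCache` is an in-memory stand-in for Redis, and `load_questions_from_db` is a hypothetical placeholder for the real query.

```python
import time


class TTLCache:
    """Tiny in-memory cache with per-key expiry (stand-in for Redis)."""

    def __init__(self, clock=time.monotonic):
        self._items = {}
        self._clock = clock

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]   # expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._items[key] = (value, self._clock() + ttl_seconds)

    def delete(self, key):
        self._items.pop(key, None)


stats = {"db_reads": 0}
cache = TTLCache()


def load_questions_from_db(exam_id):
    stats["db_reads"] += 1   # count how often the database is actually hit
    return [f"{exam_id}-q1", f"{exam_id}-q2"]


def get_exam_questions(exam_id):
    key = f"exam_questions:{exam_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached                        # cache hit
    questions = load_questions_from_db(exam_id)
    cache.set(key, questions, ttl_seconds=3600)
    return questions


first = get_exam_questions("exam123")    # miss: reads the database
second = get_exam_questions("exam123")   # hit: served from the cache
```

Note that nothing here updates the cache when the database changes, which is exactly where stale data comes from.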
🧩 Quick Check:
Imagine you’re caching exam questions for 100k students.
You use cache-aside with Redis.
Try answering. Then I’ll explain how to avoid cache stampedes and
introduce TTL, eviction, and prewarming — essential next caching
topics.
Yes! ✅ That’s exactly what happens — but let’s walk through the full
chain so the risks are clear.
🔍 Real-World Impact:
Database crashes or slows down
Users see lag or errors
System performance tanks at the worst moment
python
use:
python
3. Request Coalescing
If 100 users miss the cache at the same time, only 1 fetches from DB
— the rest wait for that one.
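A possible shape for request coalescing, sketched with threads: the first caller for a key becomes the leader and runs the loader, while later callers wait for its result. The class name `SingleFlight` and the loader are hypothetical, and error handling is deliberately thin (if the loader raises, waiting callers receive `None`).

```python
import threading
import time


class SingleFlight:
    """Lets only one caller per key run the loader; others wait for its result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}   # key -> {"done": Event, "value": result}

    def do(self, key, loader):
        with self._lock:
            entry = self._in_flight.get(key)
            leader = entry is None
            if leader:
                entry = {"done": threading.Event(), "value": None}
                self._in_flight[key] = entry
        if leader:
            try:
                entry["value"] = loader()
            finally:
                with self._lock:
                    del self._in_flight[key]
                entry["done"].set()   # wake up everyone who was waiting
            return entry["value"]
        entry["done"].wait()
        return entry["value"]


db_calls = {"n": 0}


def slow_db_load():
    db_calls["n"] += 1
    time.sleep(0.2)   # simulate a slow query
    return "questions"


flight = SingleFlight()
results = []
results_lock = threading.Lock()


def request():
    value = flight.do("exam_questions:exam123", slow_db_load)
    with results_lock:
        results.append(value)


threads = [threading.Thread(target=request) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All 10 requests get the value; the loader typically runs only once.
```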
🧩 Quick Question:
Let’s say your app caches exam questions for 1 hour.
What happens if the DB is updated during that hour (e.g., a question
is removed)?
Should the cache reflect it? How would you handle that?
Give it a go, and we’ll talk about cache invalidation — the hardest
problem in CS 😉
Exactly! ✅
1. Naming things
2. Cache invalidation
3. Off-by-one errors.”
🔍 Invalidation Strategies
Strategy What It Means Example
python
redis.delete("exam_questions:exam123")
If using write-through:
⚠️ Trade-off Reminder:
Too aggressive invalidation = Cache is often empty → slow
Too lazy invalidation = Cache is stale → users get wrong info
✅ Summary So Far
You’ve now seen:
Next stop:
📌 Message Queues & Async Processing — super useful for things like
result scoring, email notifications, logging, etc.
Queue
Awesome — queues are where your system gets super scalable and
resilient. Let’s break it down step by step:
The sender and receiver don’t need to be online at the same time.
🧠 Real-Life Analogy:
Think of a cafeteria token system:
Tool Notes
[Frontend App]
|
[Queue (RabbitMQ)]
|
[Worker Service]
|
[Database]
This way:
🧩 Quick Challenge:
Let’s say you want to queue code submissions for scoring.
How would you make sure the answer is still processed once it recovers?
Take a guess — I’ll explain retry strategies and how queues guarantee
processing even across failures.
Exactly — that’s the core idea behind a robust queue system: retry on
failure until the message is successfully processed.
4. Idempotency (Important!)
If a message is retried, your worker may process the same job twice
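One common way to make a worker idempotent is to remember which message IDs it has already handled. The sketch below assumes every message carries a unique `message_id` (a hypothetical field); in a real system the processed-ID set would live in a durable store, not in memory.

```python
class ScoringWorker:
    """Applies each message at most once, even if the queue delivers it twice."""

    def __init__(self):
        self.processed_ids = set()
        self.scores = {}

    def handle(self, message):
        msg_id = message["message_id"]
        if msg_id in self.processed_ids:
            return "skipped"   # duplicate delivery: do nothing
        user = message["user_id"]
        self.scores[user] = self.scores.get(user, 0) + message["points"]
        self.processed_ids.add(msg_id)
        return "processed"


worker = ScoringWorker()
job = {"message_id": "m-1", "user_id": 42, "points": 10}
first_result = worker.handle(job)
retry_result = worker.handle(job)   # same message delivered again after a retry
```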
Feature Purpose
🧩 Mini Check
Let’s say a worker retries a scoring job 3 times and keeps failing (maybe
invalid input or bad code).
🧠 Your Answer:
"Save to DLQ and check manually"
Infinite retries would just waste CPU and clog the queue, blocking
healthy messages.
🔍 Logged/alerted: Engineers are notified or system alerts are triggered
In cloud systems (like AWS SQS or Google Pub/Sub), you can configure
DLQ thresholds and monitoring built-in.
json
{
"original_message": {...},
"error": "ExecutionTimeoutError",
"retries": 5,
"timestamp": "2025-07-31T15:00:00Z"
}
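A sketch of how a worker loop might produce a record like the one above: retry a failing job a bounded number of times, then move it to a dead-letter list. `MAX_ATTEMPTS`, the job layout, and the `score` handler are assumptions for illustration; managed queues such as SQS handle this through configuration rather than application code.

```python
from collections import deque

MAX_ATTEMPTS = 3   # assumed limit before a job is dead-lettered


def run_worker(queue, dead_letters, handler):
    """Drain the queue, retrying failures and dead-lettering jobs that keep failing."""
    while queue:
        job = queue.popleft()
        try:
            handler(job["payload"])
        except Exception as exc:
            job["retries"] += 1
            if job["retries"] >= MAX_ATTEMPTS:
                dead_letters.append({
                    "original_message": job["payload"],
                    "error": type(exc).__name__,
                    "retries": job["retries"],
                })
            else:
                queue.append(job)   # back of the line, so healthy jobs are not blocked


processed = []


def score(payload):
    if payload == "bad-code":
        raise ValueError("cannot score")
    processed.append(payload)


queue = deque([
    {"payload": "bad-code", "retries": 0},
    {"payload": "good-code", "retries": 0},
])
dead_letters = []
run_worker(queue, dead_letters, score)
```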
🧠 Real-World Analogy
Imagine 3 people (workers) are picking up tasks (messages) from a line
(queue):
1. Task A goes to Worker 1 (fast)
2. Task B goes to Worker 2 (slow)
3. Task C goes to Worker 3 (medium)
⚠️ Why It Happens
Distributed systems parallelize processing for speed
Queues like Kafka, SQS, or Pub/Sub often shard messages across
different partitions or consumers
Each partition may process independently, causing out-of-order
results
json
{
"userId": 42,
"examId": 101,
"answer": "..."
}
But messages for user 43 may arrive out of sync with 42’s — and that’s
totally fine.
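The per-user ordering above usually comes from hashing the ordering key to pick a partition, so every message for one user lands in the same partition. A minimal sketch, assuming `userId` is the ordering key and using SHA-256 for a hash that is stable across processes (Python's built-in `hash()` is randomized per process for strings):

```python
import hashlib


def partition_for(ordering_key, num_partitions):
    """Map an ordering key to a partition index that is stable across runs."""
    digest = hashlib.sha256(str(ordering_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


partitions = {p: [] for p in range(4)}
events = [(42, "answer-1"), (43, "answer-1"), (42, "answer-2"), (42, "answer-3")]
for user_id, answer in events:
    partitions[partition_for(user_id, 4)].append((user_id, answer))

# Everything for user 42 sits in one partition, in the order it was sent.
user_42 = [a for u, a in partitions[partition_for(42, 4)] if u == 42]
```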
🧠 Summary Table
Without Ordering Key vs. With Ordering Key
✅ Final Tip:
Use ordering keys when:
Order matters within a context (e.g., per user, per session, per file)
But keep partitions small and specific
(e.g., don’t route all traffic to one “global” ordering key — that
breaks scalability)
📌 Pub/Sub vs Queues
Or
📌 Rate Limiting & Throttling?
Pubsubs
In a Publish–Subscribe model:
Publishers send messages (events)
Subscribers listen for and receive messages that match certain
topics or categories
🧠 Real-Life Analogy
Think of YouTube:
You subscribe to a channel (topic)
When the creator publishes a new video (message), all subscribers
get notified
If you’re offline, you can still watch later — depending on settings
(persistent delivery)
[Publisher]
|
[Topic: "exam.submitted"]
|
-----------------------------
| | | |
[Email] [Logger] [Scoring Engine] ← Subscribers
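The fan-out in the diagram can be sketched with a tiny in-process broker. `Broker` and the three subscriber lists are hypothetical; a real broker (Kafka, Google Pub/Sub, SNS) adds persistence, retries, and network delivery.

```python
from collections import defaultdict


class Broker:
    """Delivers every published message to all subscribers of its topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        callbacks = list(self._subscribers[topic])
        for callback in callbacks:
            callback(message)
        return len(callbacks)   # number of subscribers that received it


broker = Broker()
email_inbox, log_lines, scoring_jobs = [], [], []
broker.subscribe("exam.submitted", email_inbox.append)
broker.subscribe("exam.submitted", log_lines.append)
broker.subscribe("exam.submitted", scoring_jobs.append)

delivered = broker.publish("exam.submitted", {"userId": 42, "examId": 101})
```

The publisher does not know who is listening, so a fourth subscriber can be added without touching the publishing code.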
⚠️ Tricky Concepts
1. Message Duplication
Some systems offer at-least-once delivery → messages might arrive
more than once
➡️ Solution: Make subscribers idempotent
2. Delivery Guarantees
Type Meaning
3. Backpressure
If a subscriber is slow, messages pile up
➡️ Use consumer groups, rate limits, or buffer queues behind each
subscriber
🧩 Mini Scenario:
You publish a message: "student submitted exam"
You have 3 subscribers:
Email Service
Logger
Scoring Engine
✅ Your Reasoning:
"Persist the data and retry later"
or
🧠 Real-Life Analogy:
Imagine you run a pizza kitchen.
Customers (publishers) drop in orders (messages).
You have 4 chefs (consumers).
All chefs are part of the same kitchen team (consumer group).
Each order is handled by exactly one chef, not all 4 — that’s efficient.
⚠️ Tricky Parts
1. Partition Limit
If you have 3 partitions and 5 consumers in a group → only 3 will be
active
So your parallelism is limited by partition count
2. Rebalancing Delay
If one consumer dies, group rebalances
During that time, message delivery may pause briefly
3. Offset Management
Each consumer tracks its position (offset) per partition
You must handle this carefully to avoid duplicate or missed
messages
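The partition limit is easy to see in a small assignment sketch. This assumes a plain round-robin assignment inside one consumer group; real brokers such as Kafka use configurable assignors, so treat the function below as an illustration of the constraint, not of any broker's exact algorithm.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer in the group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# 3 partitions, 5 consumers: two consumers end up with nothing to do.
group = assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4", "c5"])
idle = [consumer for consumer, owned in group.items() if not owned]
```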
🧩 Quiz Check:
You have:
4 partitions
4 consumers in Group A (Scoring Service)
2 consumers in Group B (Logging Service)
Try answering — and I’ll confirm or correct. Then we’ll jump into Rate
Limiting or a real system case study — your call.
Exactly right! ✅
💡 Final Answer:
2 total copies of each message will be processed — one by each
group, not each consumer.
Here’s why:
🔍 Breakdown:
Group Consumers Behavior
🧠 Rule to Remember:
A message is delivered to every consumer group,
but to only one consumer within each group.
Consumer groups
Parallelism limits
Offset tracking
Pub/Sub and Queue hybrid behavior
Rate limiting
🧠 Real-Life Analogy:
Imagine a ticket counter with this rule:
Reason Example
API: /submit-answer
Rule: Max 10 submits per minute per user
So you:
1. Track request counts per user
2. On 11th request, return 429 Too Many Requests
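The two steps above can be sketched as a fixed-window counter. The class name, the injected `clock`, and the in-memory dictionary are assumptions for illustration; with several servers the counters would live in a shared store instead.

```python
import time


class FixedWindowLimiter:
    """Counts requests per user per time window and rejects the overflow."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.counters = {}   # (user_id, window index) -> count; old windows are never pruned here

    def check(self, user_id):
        window_index = int(self.clock() // self.window)
        key = (user_id, window_index)
        self.counters[key] = self.counters.get(key, 0) + 1
        # 200 = allowed, 429 = Too Many Requests
        return 200 if self.counters[key] <= self.limit else 429


now = [0.0]   # fake clock so the example does not depend on real time
limiter = FixedWindowLimiter(limit=10, window_seconds=60, clock=lambda: now[0])

statuses = [limiter.check("user:1234") for _ in range(11)]   # 11th call is rejected
now[0] = 60.0                                                # next minute starts
after_reset = limiter.check("user:1234")
```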
🧱 How to Implement
Method Tools
Custom Middleware: Code-level implementation (e.g., Express, FastAPI, etc.)
⚠️ Tricky Parts
1. Distributed Systems
How do you count requests across multiple servers?
Solution: Use a shared store like Redis
2. Race Conditions
Two requests may sneak in at the same time
Solution: Use atomic counters or locks (e.g., Redis INCR with TTL)
🧩 Quick Scenario:
Let’s say you want to allow:
How would you track this limit to make sure it works across all
servers?
Try answering, and I’ll review it with you — and we’ll explore Token
Bucket vs Leaky Bucket, two pro-level strategies.
✅ Your Answer:
"Track for that particular user across a shared store"
plaintext
user:1234:api_count
bash
INCR user:1234:api_count
EXPIRE user:1234:api_count 3600   # only on the first request in the window
➡️ All 5 servers will hit the same Redis store, so they share the count
🧠 Pro Tip:
You can even use INCRBY to weight requests differently:
Normal endpoint → +1
Heavy query → +5
Bucket
Great — let’s dive into Token Bucket vs Leaky Bucket, two of the most
important rate limiting strategies, especially in distributed systems and
API gateways.
Imagine a bucket that gets refilled with tokens at a fixed rate (e.g., 1
token per second).
Each API request spends a token. If there’s a token, the request
proceeds.
🧠 Real-Life Analogy:
You’re allowed to make 1 call per second.
But you didn’t make any for the last 10 seconds — now you can make
up to 10 calls instantly because tokens accumulated.
✅ This allows short bursts, as long as you're under the long-term limit.
🔁 Behavior:
Bucket has a max size (say, 60 tokens)
Tokens refill at a steady rate
Requests are allowed if token is available
If no token → wait or reject
🔧 Example:
Bucket size = 10
Refill rate = 1 token/sec
User makes 5 rapid calls → ✅ allowed
Waits 10 sec → bucket back to 10 tokens
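A minimal token bucket sketch matching this example (bucket size 10, refill 1 token/sec). Time is passed in explicitly as `now` so the behaviour is easy to follow; the class and parameter names are hypothetical.

```python
class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)   # start with a full bucket
        self.last = now

    def allow(self, now):
        # Add the tokens earned since the last call, capped at the bucket size.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(capacity=10, refill_per_second=1.0)
burst = [bucket.allow(now=0.0) for _ in range(5)]   # 5 rapid calls, all allowed
tokens_left = bucket.tokens                         # 5 tokens remain
later = [bucket.allow(now=10.0) for _ in range(11)] # refilled to 10, so the 11th call fails
```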
🧠 Real-Life Analogy:
🔁 Behavior:
Requests enter a queue
The queue is drained at a fixed rate
If queue is full → new requests are rejected
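The three behaviours above can be sketched as a bounded queue drained at a fixed rate. As with any sketch, the names (`LeakyBucket`, `submit`, `tick`) are made up, and time is passed in explicitly instead of using a background timer.

```python
from collections import deque


class LeakyBucket:
    """Queues requests and releases them at a fixed rate; rejects when full."""

    def __init__(self, queue_size, leak_per_second):
        self.queue_size = queue_size
        self.leak_interval = 1.0 / leak_per_second
        self.queue = deque()
        self.processed = []
        self.last_leak = 0.0

    def _drain(self, now):
        # Release one queued request per elapsed interval.
        while self.queue and now - self.last_leak >= self.leak_interval:
            self.processed.append(self.queue.popleft())
            self.last_leak += self.leak_interval
        if not self.queue:
            self.last_leak = now   # idle time does not build up burst credit

    def submit(self, request, now):
        self._drain(now)
        if len(self.queue) >= self.queue_size:
            return False           # queue full: reject
        self.queue.append(request)
        return True

    def tick(self, now):
        self._drain(now)


bucket = LeakyBucket(queue_size=3, leak_per_second=1)
accepted = [bucket.submit(i, now=0.0) for i in range(5)]   # only 3 fit in the queue
bucket.tick(now=2.0)                                       # 2 seconds pass, 2 requests drain
```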
🧩 Mini Check:
Let’s say a user clicks the "Submit" button 10 times in 2 seconds.
Token Bucket = ?
Leaky Bucket = ?
Try answering what would happen in both cases — I’ll confirm and wrap
up with a cheat sheet you can refer to any time.
Perfectly said ✅
🔍 Your Answer:
"Token Bucket allows if there are 10 tokens available. Leaky Bucket
waits or rejects."
✅ Exactly right.
🧠 One-Liner Summary:
Token Bucket = “You're allowed to burst — just don’t exceed your
refill rate long term.”
Leaky Bucket = “We process requests at a strict fixed rate — no
burst allowed.”
Yes apply
Great! Let’s apply Token Bucket and Leaky Bucket to real components
of your Online Exam System so it clicks in context.
Why it fits
quickly
🔧 Config Example:
Bucket size: 5 tokens
Refill rate: 1 token every 10 seconds
🧠 Effect:
→ They can submit answers up to 5 times in a row, but then must wait
as the bucket refills.
🔧 Config Example:
Queue size: 100 jobs
Leak rate: 1 job per second
🧠 Effect:
→ Keeps your CPU from melting down, and gives you predictable
performance under load.
Backpressure is important: don't crash the logger service
✅ Takeaway Map
Your call!
Nice — let’s keep building! You're stacking core components like a real
system architect now. Here are a few next building blocks we can
explore, each with deep dives and real-world context:
Step by step
Perfect — you’re thinking like a real system engineer now. Let’s proceed
step by step.
13. 🚨 Monitoring, Logging & Alerting: Observability stack (Prometheus, Grafana, ELK)
Go
This is the foundation for security and user management in any system
— especially your exam platform, where controlling access to exams,
roles (admin vs student), and sessions is critical.
🔐 Authentication vs Authorization
Term What It Means
🧠 Think of it like:
You show ID at an airport (Authentication),
Then they check your ticket and seat class (Authorization).
Method Notes
Structure:
xxxxx.yyyyy.zzzzz
↓ ↓ ↓
Header Payload Signature
Example Payload:
json
{
"userId": 42,
"role": "student",
"exp": 1722748495
}
Pros:
Stateless (no server memory needed)
Easily stored client-side (an HTTP-only cookie is safer than localStorage, which injected scripts can read)
Encodes user info and roles
Cons:
⚠️ Tricky Parts
1. Token expiry and refresh → Use refresh tokens to extend session
securely
2. Securing tokens → Never expose JWTs in URLs; use HTTPS only
3. Revocation → JWTs are stateless; need extra logic for forced logout
4. Token tampering → Use a strong signing key (HMAC secret or RSA key pair)
json
{
"userId": 101,
"role": "student",
"examId": "EXAM-2025",
"exp": 1722768000
}
What should your server do if this token is expired when the student
tries to submit an answer?
Try answering. I’ll help you refine it and then walk you through refresh
tokens and session expiration strategies.
Exactly right ✅
Server:
Validates the refresh token
Issues a new access token
Optionally rotates the refresh token (for security)
3. User continues without logging in again
Refresh Token: used only to get a new access token; longer lifetime (days); stored in an HTTP-only secure cookie or encrypted storage
🧩 Mini Recap:
What should happen if a refresh token is expired or missing?
Try answering that — then I’ll explain how forced logouts, blacklisting,
and token revocation lists work in secure systems.
🧠 Why?
A missing or expired refresh token means the user’s long-term
session has ended
It protects against:
Stolen refresh tokens
Inactive users lingering forever
Users who changed passwords or revoked access
🧩 Review
Scenario Response
Your call!
Rbac abac
🧠 Real-Life Analogy:
In a college:
Students can take exams
Teachers can create exams
Admins can manage users
student: submit_exam, view_results
admin: create_exam, assign_student, export_scores
reviewer: approve_exam, flag_answers
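The role-to-permission mapping above translates almost directly into a lookup table. A minimal sketch (the function name `is_allowed` is hypothetical; the roles and permissions are the ones listed above):

```python
ROLE_PERMISSIONS = {
    "student": {"submit_exam", "view_results"},
    "admin": {"create_exam", "assign_student", "export_scores"},
    "reviewer": {"approve_exam", "flag_answers"},
}


def is_allowed(role, permission):
    """RBAC check: the decision depends only on the caller's role."""
    return permission in ROLE_PERMISSIONS.get(role, set())


student_can_submit = is_allowed("student", "submit_exam")
student_can_create = is_allowed("student", "create_exam")
```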
🧠 Real-Life Analogy:
You’re allowed to take the exam only if:
You’re a student
You’re assigned to that specific exam
It’s currently between start and end time
Your IP address is in India
🔧 Example Policy:
allow if
user.role == 'student' AND
exam.id in user.assignedExams AND
current_time < exam.deadline
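The policy above can be written as a plain function over attributes of the user, the exam, and the request time. The dictionary shapes and the function name `can_take_exam` are assumptions; only the three conditions come from the policy.

```python
from datetime import datetime, timezone


def can_take_exam(user, exam, current_time):
    """ABAC check: combines user, resource, and environment attributes."""
    return (
        user["role"] == "student"
        and exam["id"] in user["assignedExams"]
        and current_time < exam["deadline"]
    )


user = {"role": "student", "assignedExams": ["EXAM-2025"]}
exam = {"id": "EXAM-2025", "deadline": datetime(2025, 8, 1, 12, 0, tzinfo=timezone.utc)}

before_deadline = can_take_exam(user, exam, datetime(2025, 8, 1, 11, 0, tzinfo=timezone.utc))
after_deadline = can_take_exam(user, exam, datetime(2025, 8, 1, 13, 0, tzinfo=timezone.utc))
```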
⚖️ RBAC vs ABAC
Feature RBAC ABAC
✅ Best Practice
👉 Use RBAC as your base
➕ Add ABAC where necessary, especially when:
Permissions depend on who is accessing what
Time/location matters
Policies must change dynamically
Would you like to try defining ABAC rules for a sample exam system
feature?
Or move on to CDN / Step 10 next?
📌 What Is a CDN?
Instead of hitting your origin server (e.g., AWS, Node backend) for every
request, users get content from a nearby CDN edge server.
🧠 Real-Life Analogy:
You’re selling a book worldwide:
Without a CDN → ship from your home every time
With a CDN → pre-stock copies in local warehouses around the
world
⚠️ Tricky Concepts
1. Cache Invalidation
If a file changes (e.g., new exam version), how do you update the CDN?
Use versioned URLs: exam-v2.pdf
Or set cache-control headers like:
Cache-Control: max-age=3600
2. Content Security
Never cache:
Sensitive data (e.g., user profile JSON)
Private exam papers without auth
Use:
Signed URLs or signed cookies
HTTPS with cache headers
3. Geo Routing
Many CDNs use Anycast IPs or DNS-based geo routing, so user requests
automatically go to a nearby edge server
→ Makes your app feel faster worldwide without extra infra
Would you like to try designing a cache + CDN strategy for your app?
Or shall we move to Step 11: Database Sharding & Partitioning?
Next step
This comes into play when your database grows too big or too hot to
live on a single machine.
📌 What Is Sharding?
Sharding means splitting your data across multiple databases
(shards), each holding a subset of your data.
Spread load
Reduce latency
Handle larger datasets
Avoid single points of failure
🧠 Real-Life Analogy:
You run a library:
With 1 librarian handling 1000 people → long lines
Instead, you create 10 branches, each responsible for certain books
🔍 Sharding vs Partitioning
Partitioning = Splitting data within a single server (logical
separation)
Sharding = Splitting data across multiple servers (physical +
logical)
You usually start with partitioning, and evolve into sharding as scale
increases.
⚙️ Sharding Strategies
Strategy How It Works Pros Cons
🔁 Routing Logic
How do you know which shard to query?
⚠️ Tricky Parts
1. Cross-Shard Joins
Joins across shards are slow and painful
→ Avoid if possible, or join in code
2. Resharding
If a shard gets too big, you need to split/move data
3. Transactions
Hard to do atomic transactions across shards
Use eventual consistency or 2-phase commit (slow)
🧩 Mini Scenario:
Let’s say you want to shard exam responses by examId .
What happens if one exam (like JEE Main) has 10x more students
than all others?
What’s the problem here, and how might you solve it?
Try answering — and I’ll guide you through hotspot shards and load
balancing tricks before we move to Replication & Consistency Models
(Step 12).
Now:
Instead of 1 shard per exam, split JEE Main into multiple virtual
shards
For example:
plaintext
You do:
plaintext
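One way to picture the virtual-shard idea is to append a bucket number, derived from the userId, to the shard key of a hot exam. This is only a sketch: the `HOT_EXAM_BUCKETS` table, the bucket count of 8, and the `examId#bucket` key format are all assumptions for illustration.

```javascript
// Hypothetical config: how many virtual shards each hot exam is split into.
const HOT_EXAM_BUCKETS = { JEE_MAIN: 8 };

// Simple deterministic string hash (illustrative, not cryptographic).
function simpleHash(text) {
  let h = 0;
  for (const ch of String(text)) {
    h = (Math.imul(h, 31) + ch.codePointAt(0)) >>> 0;
  }
  return h;
}

// Build the shard key: hot exams get a userId-derived bucket suffix,
// all other exams keep using examId alone.
function shardKey(examId, userId) {
  const buckets = HOT_EXAM_BUCKETS[examId] || 1;
  if (buckets === 1) return examId;
  return `${examId}#${simpleHash(userId) % buckets}`;
}

console.log(shardKey("NEET", 7));      // "NEET"
console.log(shardKey("JEE_MAIN", 42)); // "JEE_MAIN#<bucket between 0 and 7>"
```

The trade-off is that reading every response for a hot exam now means querying all of its buckets and merging the results.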
3. Application-Aware Routing
In your backend:
Detect when a shard is overloaded (QPS, latency)
Temporarily route traffic to read replicas
Throttle or queue less-critical writes (e.g., auto-saves vs final
submission)
🔐 Rule of Thumb:
When sharding by a field, make sure it doesn’t create extreme
outliers
🧩 Quick Check:
Let’s say you currently shard exam data by userId using:
js
shardIndex = hash(userId) % 4
But now a few users (e.g., cheaters) are generating 10x more traffic than
average.
✅ Absolutely right!
You nailed both the problem and the solution space. Let’s expand on it
clearly:
Now:
🧩 Summary:
Issue Caused By
Primary-replica setups
Leaderless systems like Dynamo
Eventual vs strong consistency
Read replicas, sync/async replication
📌 What Is Replication?
Replication is the process of copying data from one database/server
(the leader) to one or more followers (replicas).
🧠 Why do it?
Improve read performance
Ensure data durability
Add high availability (failover if one node dies)
⚙️ Types of Replication
1. Leader-Follower (Primary-Replica)
👉 Most common setup.
Role Behavior
🧠 Think of it like:
👨 Teacher writes notes → 👩🎓 Students copy notes (followers)
Used in:
🧠 Problem: Write conflicts — two users update the same data at once
from different regions
Used by:
Amazon Dynamo (the internal system described in the Dynamo paper, not the DynamoDB service)
Cassandra
Riak
Feature Description
N Total replicas
W # of nodes needed to ACK write
R # of nodes to read from
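The usual rule of thumb for these three numbers is that a read overlaps the latest acknowledged write when R + W > N. A small sketch of that check follows; the `quorumOverlaps` helper is a name made up for this example, and it assumes strict quorums.

```javascript
// With N replicas, W write ACKs and R read responses, any read set and any
// write set must share at least one node when R + W > N.
function quorumOverlaps({ n, w, r }) {
  if (w < 1 || r < 1 || w > n || r > n) {
    throw new RangeError("W and R must be between 1 and N");
  }
  return r + w > n;
}

console.log(quorumOverlaps({ n: 3, w: 2, r: 2 })); // true
console.log(quorumOverlaps({ n: 3, w: 1, r: 1 })); // false
```

Lower W or R gives faster operations at the cost of possibly reading stale data.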
⚖️ Consistency Models
Type Behavior
⚠️ Tricky Trade-Offs
Goal Use
🧩 Quick Check:
Let’s say:
Type your answer — then we’ll look at read-after-write issues and how
to solve them!
Even though the write succeeded, the read shows stale data.
📦 This Is Called:
Stale Reads or Replica Lag Issues
js
// Read from the primary right after a write; replicas may still be lagging
let readFrom;
if (userJustSubmitted) {
  readFrom = "primary";
} else {
  readFrom = "replica";
}
2. Read-Your-Writes Consistency
Some systems (like Dynamo, Cassandra) can use quorum reads:
Choose R and W so that R + W > N, which ensures that the read includes at
least one node that received the write
🧩 Summary So Far
Problem Solution
Your call!
Step 13
“Is my system working? And if it’s not, will I know before my users
do?”
This is your eyes and ears in production. Without it, you're flying blind
🛩️
⚠️ Tricky Bits
1. Too Many Alerts = Alert Fatigue
→ Prioritize: availability, performance, errors
→ Use thresholds and noise suppression
2. Metrics vs Logs
Metrics = numbers over time (lightweight)
Logs = full story of what happened (verbose, heavy)
3. Correlating Events
A spike in 500s might correlate with:
A new code deployment
High memory on a DB pod
A user abusing the system
Your move 👇
Step 14
Great choice — let’s dive into Step 14: 🔌 Circuit Breakers & Failover, a
critical part of building resilient systems that can survive failure without
taking the whole app down.
This step helps you contain errors, limit blast radius, and recover
gracefully when something fails — which will happen at scale.
🧠 Real-Life Analogy:
Imagine trying to call a friend whose phone is off:
State Behavior
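As a rough sketch, the circuit breaker pattern is usually described with three states: closed (calls pass through), open (calls fail fast), and half-open (a trial call is allowed after a cool-down). The class below is an illustration under those assumptions; the `CircuitBreaker` name, the thresholds, and the injected `now` clock are all invented for the example.

```javascript
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetTimeoutMs = 5000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
    this.state = "CLOSED";
    this.failures = 0;
    this.openedAt = 0;
  }

  async call(fn) {
    if (this.state === "OPEN") {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error("circuit open: failing fast");
      }
      this.state = "HALF_OPEN"; // cool-down elapsed, let a trial call through
    }
    try {
      const result = await fn();
      this.state = "CLOSED"; // success resets the breaker
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      // A failed trial call, or too many failures, opens the circuit.
      if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
        this.state = "OPEN";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}

// Usage (hypothetical downstream call):
// const breaker = new CircuitBreaker();
// const score = await breaker.call(() => callEvaluator(job));
```

This simple version does not limit concurrent trial calls in the half-open state; a production library would normally handle that.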
✅ Benefits
Benefit Why It Matters
Internal microservices ✅ Especially if loosely coupled
Example:
If DB1 goes down, your app connects to DB2 replica
Can be automatic (HAProxy, Aurora)
Or app-level: try primary → fallback
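The app-level variant (try primary, then fall back) could look roughly like this. The `sources` objects with a `name` and an async `run` method are a hypothetical shape chosen for the sketch, not a real database client API.

```javascript
// Try each data source in order and return the first successful result.
async function queryWithFailover(sources, query) {
  const errors = [];
  for (const source of sources) {
    try {
      const rows = await source.run(query);
      return { servedBy: source.name, rows };
    } catch (err) {
      errors.push(`${source.name}: ${err.message}`);
    }
  }
  throw new Error(`all sources failed (${errors.join("; ")})`);
}

// Usage with stand-in sources (DB1 is down, DB2 is a healthy replica):
const db1 = { name: "DB1", run: async () => { throw new Error("connection refused"); } };
const db2 = { name: "DB2", run: async () => [{ id: 1 }] };
queryWithFailover([db1, db2], "SELECT 1").then((res) => console.log(res.servedBy)); // "DB2"
```

A fallback like this suits reads; a replica may lag behind the primary, and writes generally should not be redirected to a read-only replica.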
☁️ Cloud-Level Tools
Circuit Breaker
Platform Tool
🧩 Scenario Check
Imagine your code evaluator service goes down during an exam:
Your call!
Perfect! ✅ You picked the ideal strategy for a high-stakes system like
an online exam platform.
Stack Option
Step 15
Let’s wrap this system design journey with Step 15: 🧪 Load Testing &
Simulation — the final test of whether your system is actually ready for
real traffic.
You’ve built the house. Now it’s time to shake it, flood it, and stomp
on it… and see if it holds.
Does it crash?
Does latency spike?
Do queues back up?
Where’s the bottleneck?
🛠 Popular Tools
Tool Notes
k6 Modern, developer-friendly load tester (JS scripts)
Questions to Test:
Can /submit-answer handle 100 requests/sec for 5 min?
How does latency change over time?
Do circuit breakers or queues trigger?
Does DB CPU spike?
Does Redis cache hold up?
k6 Sample Script:
js
⚠️ Things to Watch
Metric Sign of Trouble
✅ What to Do After
Fix any failing services
Tune configs (timeouts, pool sizes, retries)
Add autoscaling where needed
Rerun load test until it stabilizes
🎯 Endgame Goal
“Our system can handle [X] users, under [Y] load, for [Z] time… and
fail gracefully if it can’t.”
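To make a statement like that checkable, a load test run can be reduced to a few numbers and compared against targets. The sketch below assumes each request was recorded as `{ latencyMs, ok }` and that the targets (`p95Ms`, `maxErrorRate`) are values you pick yourself; none of these names come from a specific tool.

```javascript
// Nearest-rank percentile over a list of latencies in milliseconds.
function percentile(values, p) {
  if (values.length === 0) throw new RangeError("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarize a run and compare it with the chosen targets.
function evaluateRun(samples, { p95Ms, maxErrorRate }) {
  const p95 = percentile(samples.map((s) => s.latencyMs), 95);
  const errorRate = samples.filter((s) => !s.ok).length / samples.length;
  return { p95, errorRate, passed: p95 <= p95Ms && errorRate <= maxErrorRate };
}

// Usage with made-up samples: latencies 1..100 ms, all successful.
const demo = Array.from({ length: 100 }, (_, i) => ({ latencyMs: i + 1, ok: true }));
console.log(evaluateRun(demo, { p95Ms: 200, maxErrorRate: 0.01 }));
// { p95: 95, errorRate: 0, passed: true }
```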
Would you like a final recap of all 15 system design steps, or want to
now apply everything to a full case study (like your exam system)?
1️⃣5️⃣ Load Testing & Simulation: Test at real scale (latency, QPS, failure handling). Prepares you for real-world usage & stress.
A cheat sheet
1. Architecture Basics
2. Scaling
3. Security
4. Load Balancing
5. Databases
6. CAP Theorem
7. Caching
Re-sharding = expensive
Bonus Principles
Monitor everything
Your System Design Cheat Sheet is ready and saved in the document
titled "System Design Cheatsheet" ✅
You can use this as a quick reference before interviews or while building
real systems.
A case study
🔧 Requirements
Functional
Users can register, log in, and take coding exams
Each exam has questions, a timer, and code submission
Submissions are evaluated and scored
Admins can create/manage exams and view results
Non-Functional
Should handle 10k users concurrently
Submissions must be evaluated within 5s
Secure (no cheating or data leaks)
Handle failures gracefully
👉 Your turn:
Pick one and justify it. Then we’ll sketch the high-level components
and move to DB + queues + caching + etc.
✅ Perfect reasoning!
Service Responsibility
Choose one — and we’ll build it top to bottom like a real system
architect.
Notes:
Rate limiting here (to prevent brute force)
Monitor login failure spikes
Cache tokens with expiry in Redis
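For the rate-limiting note, a fixed-window counter is one of the simplest options. This sketch keeps counters in process memory, which only works for a single app instance; with several instances the counters would need to live in a shared store such as Redis. The `LoginRateLimiter` class and its limits are assumptions for illustration.

```javascript
class LoginRateLimiter {
  constructor({ maxAttempts = 5, windowMs = 60000, now = Date.now } = {}) {
    this.maxAttempts = maxAttempts;
    this.windowMs = windowMs;
    this.now = now;
    this.windows = new Map(); // key -> { start, count }
  }

  // Returns true if this attempt is allowed for the key (e.g. username or IP).
  allow(key) {
    const t = this.now();
    const current = this.windows.get(key);
    if (!current || t - current.start >= this.windowMs) {
      this.windows.set(key, { start: t, count: 1 }); // start a new window
      return true;
    }
    current.count += 1;
    return current.count <= this.maxAttempts;
  }
}

// Usage: reject the login attempt (e.g. with HTTP 429) when allow() is false.
const limiter = new LoginRateLimiter({ maxAttempts: 3 });
console.log(limiter.allow("alice")); // true
```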
DB Tables:
users
exams
user_exam_mapping
Notes:
Cache questions per exam in Redis/CDN
Questions should be signed or hashed to prevent tampering
Timer logic: controlled by backend + frontend sync
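A backend-controlled timer can be sketched as a deadline check against the server clock, with the frontend countdown treated as display only. The attempt shape (`startedAtMs`, `durationMs`) and the 5 second grace period are assumptions for this example.

```javascript
// Deadline derived from the server-recorded start time (hypothetical attempt shape).
function deadlineMs(attempt) {
  return attempt.startedAtMs + attempt.durationMs;
}

// Accept a submission only if the server clock is within the deadline,
// plus a small grace period for network delay.
function canSubmit(attempt, nowMs = Date.now(), graceMs = 5000) {
  return nowMs <= deadlineMs(attempt) + graceMs;
}

// Value the frontend can fetch to resync its countdown display.
function remainingMs(attempt, nowMs = Date.now()) {
  return Math.max(0, deadlineMs(attempt) - nowMs);
}

const attempt = { startedAtMs: 0, durationMs: 60 * 60 * 1000 }; // 1 hour exam
console.log(canSubmit(attempt, 30 * 60 * 1000)); // true
console.log(remainingMs(attempt, 30 * 60 * 1000)); // 1800000
```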
txt
↓
Start Exam
↓
Submit code → Submission Service → Queue → Execution + Scoring
↓
Submit Exam → Lock → Notify Results Service
🧩 Quick Questions:
1. “Where would you enforce exam timing so users can’t fake
timers?”
2. “Where would you cache data and where would you never
cache?”
Your turn — answer these two, and then we’ll move to the Code
Evaluation Pipeline (queue + retry + circuit breaker + failover).
✅ Spot on!
backend.
Shall we?
Yes
🧭 Step-by-Step Pipeline
json
{
"userId": 42,
"examId": "JEE2025",
"questionId": "Q1",
"language": "python",
"code": "print('Hello')"
}
json
{
"jobId": 134,
"code": "...",
"language": "cpp",
"userId": 42,
"questionId": "Q1"
}
🔒 4. Sandbox Isolation
Run code in Docker, Firecracker, or gVisor
Disable network access
Apply:
Memory caps
Execution timeouts
Volume mounts (temp file system)
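If Docker is the sandbox, those limits map fairly directly onto `docker run` flags. The helper below only builds the argument list; the image name and the specific limits are placeholders, and the execution timeout is assumed to be enforced by whatever process launches the container.

```javascript
// Build arguments for `docker run` with networking disabled and resources capped.
function sandboxArgs({ image, memoryMb = 256, cpus = 1, tmpfsMb = 64 }) {
  return [
    "run", "--rm",
    "--network", "none",                   // no network access
    "--memory", `${memoryMb}m`,            // memory cap
    "--cpus", String(cpus),                // CPU cap
    "--pids-limit", "64",                  // limit process count (fork bombs)
    "--read-only",                         // read-only root filesystem
    "--tmpfs", `/tmp:rw,size=${tmpfsMb}m`, // small temp file system for scratch files
    image,
  ];
}

// "runner-python:latest" is a made-up image name for this example.
console.log("docker " + sandboxArgs({ image: "runner-python:latest" }).join(" "));
```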
🎯 5. Scoring Engine
Compare output against expected outputs
Support multiple test cases
Save:
Pass/fail status
Runtime
Memory usage
Final score
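A scoring step along these lines could be as small as the sketch below. It assumes exact output matching after trimming trailing whitespace and equal weight per test case; both are choices made for the example, and the `scoreSubmission` name is hypothetical.

```javascript
// Normalize line endings and trailing whitespace before comparing.
function normalize(text) {
  return String(text ?? "").replace(/\r\n/g, "\n").trimEnd();
}

// testCases: [{ name, expected }], outputs: program output per test case.
function scoreSubmission(testCases, outputs) {
  if (testCases.length === 0) throw new RangeError("no test cases");
  const results = testCases.map((tc, i) => ({
    name: tc.name,
    passed: normalize(outputs[i]) === normalize(tc.expected),
  }));
  const passedCount = results.filter((r) => r.passed).length;
  return { results, passedCount, score: (passedCount / testCases.length) * 100 };
}

const cases = [
  { name: "t1", expected: "Hello" },
  { name: "t2", expected: "42" },
];
console.log(scoreSubmission(cases, ["Hello\n", "41"]).score); // 50
```

Runtime and memory usage would come from the sandbox layer rather than from this comparison step.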
💡 Resilience Tactics
Failure Strategy
User
↓ POST /submit
Submission Service
↓
Kafka Queue
↓
Worker Pool (Code Execution)
↓
Docker Sandbox → Output
↓
Scoring Service
↓
Result DB
↓
User sees score
Take a stab at these — we’ll review and move to results reporting + final
wrap-up!
🔌 Spot on:
Location Why Use Circuit Breaker
⚙️ A. Submission Service
Scaling Method Notes
⚙️ D. Sandboxing Layer
Strategy Reason
⚙️ E. Scoring + Result DB
Scaling Notes
Finish
[Scoring Service]
↓
[Results DB]
↓
[Results Service] → REST API / GraphQL
↓
[User / Admin UI]
🛠 Design Decisions
✅ DB Strategy
Store scores in a results DB (sharded by examId or userId)
Separate analytics DB (OLAP) for large-scale aggregations
Type Examples
userId 101
examId “EXAM2025”
score 72.5
passed true
durationTaken 43 min
submissionTime timestamp
⚡ Caching
Cache leaderboard results, stats, and exam metadata in Redis
Use invalidation on new submissions or scheduled refresh
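Both options (invalidate on new submissions, or let entries expire on a schedule) fit a cache-aside read. In this sketch a small in-process TTL cache stands in for Redis, and `loadFromDb` is a placeholder for the real leaderboard query; all names here are assumptions.

```javascript
// Minimal in-process TTL cache standing in for Redis.
class TtlCache {
  constructor(ttlMs, now = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map(); // key -> { value, expiresAt }
  }
  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired
      return undefined;
    }
    return entry.value;
  }
  set(key, value) {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
  invalidate(key) {
    this.entries.delete(key);
  }
}

// Cache-aside read: serve from cache, otherwise load and store.
function getLeaderboard(cache, examId, loadFromDb) {
  const key = `leaderboard:${examId}`;
  let board = cache.get(key);
  if (board === undefined) {
    board = loadFromDb(examId);
    cache.set(key, board);
  }
  return board;
}

// Call when a new score is saved so the next read is fresh.
function onNewSubmission(cache, examId) {
  cache.invalidate(`leaderboard:${examId}`);
}
```

During a busy exam, invalidating on every submission can defeat the cache, which is where the scheduled refresh option tends to be the better fit.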
📊 Analytics Examples
Query Used By
🔐 Secure Access
Auth middleware on GET /results/:userId
RBAC:
Users see their data only
Admins see everyone’s (with filters)
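The RBAC rule above reduces to a small check that the auth middleware could call before serving GET /results/:userId. The requester shape (`id`, `role`) is an assumption about what the auth layer attaches to the request.

```javascript
// Users may read only their own results; admins may read anyone's.
function canViewResults(requester, targetUserId) {
  if (!requester) return false; // unauthenticated
  if (requester.role === "admin") return true;
  return requester.role === "user" && requester.id === targetUserId;
}

console.log(canViewResults({ id: 42, role: "user" }, 42)); // true
console.log(canViewResults({ id: 42, role: "user" }, 7));  // false
console.log(canViewResults({ id: 1, role: "admin" }, 7));  // true
```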
📉 Monitoring Metrics
Metric Why
[User Login]
↓
[Exam Metadata + Questions] ← Redis cache
↓
[Submit Answer] → Submission Service → Kafka
↓
[Worker] → Sandbox Execution → Scoring Service
↓
[Result DB] ← Save score + output
↓
[Results Service] → /leaderboard /my-score /analytics
Would you like this full case study exported into a PDF or document?
Or want a checklist version to revise from?
Great — let’s dive into a real-world system design case study: Zomato
(or any food delivery platform like Swiggy, UberEats).
We'll walk through this like a backend architect: full system overview,
major components, scaling, caching, and reliability — all step by step.
🧭 High-Level Goals
Zomato allows users to:
It must:
🧱 Core Microservices
Service Description
🗺️ High-Level Architecture
text
2. Add to Cart
Cart Service stores items in Redis (short-lived TTL)
Applies pricing rules (offers, delivery fee, taxes)
3. Place Order
Cart → Order Service (validates availability)
Calls Payment Service to place a hold on (authorize) the amount
On success:
Notifies Delivery Service to assign agent
Pushes job to order status queue (Kafka)
6. Order Updates
Order status is updated via Kafka events:
ORDER_PLACED → RESTAURANT_CONFIRMED → OUT_FOR_DELIVERY
→ DELIVERED
Frontend polls or uses WebSocket for real-time updates
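The status chain above behaves like a small state machine, so a consumer of these events can reject out-of-order updates. This is a sketch under two assumptions: the four statuses listed are the only ones, and a repeated event is treated as a harmless duplicate, since queue consumers commonly see redeliveries.

```javascript
// Allowed next status for each current status.
const NEXT_STATUS = {
  ORDER_PLACED: "RESTAURANT_CONFIRMED",
  RESTAURANT_CONFIRMED: "OUT_FOR_DELIVERY",
  OUT_FOR_DELIVERY: "DELIVERED",
};

// Apply an incoming status event to the current order status.
function applyEvent(currentStatus, eventStatus) {
  if (eventStatus === currentStatus) return currentStatus; // duplicate, ignore
  if (NEXT_STATUS[currentStatus] !== eventStatus) {
    throw new Error(`invalid transition ${currentStatus} -> ${eventStatus}`);
  }
  return eventStatus;
}

console.log(applyEvent("ORDER_PLACED", "RESTAURANT_CONFIRMED")); // "RESTAURANT_CONFIRMED"
```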
7. Post Order
User leaves ratings/review
Updates Review DB (can be sharded by restaurantId)
🧠 Scaling Strategy
Component Scaling Method
Monitoring Prometheus, Grafana, Sentry, Datadog
Metrics Tools