System Design - Learn System Design

The document provides a comprehensive guide to learning system design, structured in stages from basics to advanced topics. It includes key concepts, resources, exercises, and practical applications for each stage, such as load balancing, database types, and real-world case studies. Additionally, it emphasizes the importance of understanding functional and non-functional requirements, as well as scaling strategies in system design.

Uploaded by 22je0013

8/7/25, 2:30 AM System Design - Learn system design

System Design

Great! Here's a full breakdown to learn System Design from scratch — starting from the basics and gradually moving to advanced topics. Plus, I've added tips, exercises, and book/tool recommendations along the way.

🔰 Stage 1: Basics of System Design


✅ Topics to Learn
What is system design?
Types: High-Level Design (HLD) vs Low-Level Design (LLD)
Functional vs Non-functional requirements
Scalability, reliability, availability, latency, throughput

📘 Resources
System Design Primer GitHub
YouTube: Gaurav Sen / Tech Dummies Narayan

🛠 Exercises
Pick simple apps: URL Shortener, Chat App
Try whiteboarding: break down their components
Start defining: API contracts, components, data flow

🧱 Stage 2: Building Blocks (Core Components)


✅ Key Concepts
Load balancers
Caching (Redis, Memcached)
Database types (SQL vs NoSQL)
Message Queues (Kafka, RabbitMQ)
Proxies, CDN, DNS, API Gateways


📘 Resources
Book: Designing Data-Intensive Applications by Martin Kleppmann
YouTube: TechWorld with Nana (great visual explainers)

🛠 Practice Ideas
Design Instagram feed delivery
Add Redis cache in your URL shortener
Choose SQL or NoSQL: justify your pick

📊 Stage 3: Deep Dive into Architecture


✅ Focus Areas
Microservices vs Monolith
Consistent Hashing, Sharding
CAP Theorem, PACELC
Event-Driven Architecture
Data Partitioning and Replication
Leader Election, Consensus (Paxos/Raft)

📘 Resources
Grokking the System Design Interview (Educative.io)
"Refactoring UI" (for frontend-minded system folks)

🛠 Challenges
Design a scalable Notification System
Design YouTube: break into microservices
Explain trade-offs: eventual consistency vs strong consistency

🚦 Stage 4: Real-world Systems


✅ Design Case Studies
WhatsApp / Signal
Netflix Streaming
Uber / Ola Backend
Google Maps
YouTube Recommender

🛠 Tools
Draw.io or Excalidraw for diagrams
Postman for API mocking


Wireshark or network simulators

📈 Stage 5: Interview Preparation


✅ Must-Practice
Think in constraints: latency budget, QPS, peak traffic
Estimate capacity: how much storage, bandwidth
Use formulas: cache hit ratio, DB reads/writes per second
Practice frameworks: FAANG-style 5-step design flow
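The estimation items above come down to simple arithmetic. Here is a back-of-envelope sketch in Python, where every number (10M requests/day, 3x peak factor, 2 KB responses, 80% cache hit ratio) is an invented example, not a prescription:

```python
# Back-of-envelope capacity estimation with made-up traffic numbers.
requests_per_day = 10_000_000
avg_qps = requests_per_day / 86_400          # seconds in a day
peak_qps = avg_qps * 3                        # assume peak is 3x average

response_kb = 2
bandwidth_kb_per_s = peak_qps * response_kb   # outbound bandwidth at peak

cache_hit_ratio = 0.8
db_reads_per_sec = peak_qps * (1 - cache_hit_ratio)  # only cache misses hit the DB

print(f"avg QPS ~ {avg_qps:.0f}, peak QPS ~ {peak_qps:.0f}")
print(f"peak bandwidth ~ {bandwidth_kb_per_s / 1024:.2f} MB/s")
print(f"DB reads/sec at peak ~ {db_reads_per_sec:.0f}")
```

In an interview, stating the assumptions out loud (peak factor, hit ratio) matters more than the exact numbers.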

💡 Tips
Always ask clarifying questions before diving in
Trade-offs are more important than perfect design
Think scalability first, optimize second

🧠 Summary of Learning Path


| Stage | Focus | Deliverables |
|---|---|---|
| 1. Basics | Terminologies, HLD vs LLD | Design URL shortener |
| 2. Components | Infra & tools | Cache layer in app |
| 3. Architecture | Patterns, CAP, EDA | Notification system |
| 4. Case Studies | Large-scale examples | Design Uber backend |
| 5. Interview | Frameworks & QPS | Mock interviews |

Want to start with a specific system or tool (like Kafka, Redis, or designing Twitter)? Just say the word and I'll walk you through that step-by-step too!

Start from the basics, explaining everything in detail, with real-life examples. Define what some tricky things are. Deep dive into each topic. REMEMBER THIS FOR EACH.

Awesome — we're going to take a deep and thoughtful journey through System Design, starting from the absolute basics.

To keep things clear and memorable, I’ll explain each topic with:

✅ Simple definition
🧠 Real-world analogy/example
⚠️ Tricky parts to watch out for
🔍 Deep dive when needed
🧩 Mini activity or question to reflect
Let’s start with the foundations of system design.

📌 1. What is System Design?


✅ Definition:
System Design is the process of defining the architecture, components,
modules, interfaces, and data for a system to satisfy specific
requirements.

You’re not coding yet — you’re designing how the system should
behave and scale under real-world use.

🧠 Real-World Analogy:
Think of designing a restaurant:

You decide where the kitchen, dining tables, and washroom go (like different microservices or modules).
You decide how many chefs you need (workers).
You plan for rush hours (peak load).
You don't cook yet — you plan everything needed for cooking to happen efficiently at scale.

⚠️ Tricky Concepts:


People often skip non-functional parts like scalability, reliability, maintainability — but these are the core of system design.
Overengineering: adding fancy stuff (like Kafka or 10 microservices) when a simple design is better.

🔍 Deep Dive: What You’re Designing


| Component | What it means |
|---|---|
| Architecture | Overall structure (monolith? microservices?) |
| Data flow | How data moves between components |
| Storage | DB choice, schema, backups |
| Networking | Load balancing, caching, routing |
| Scalability plan | Vertical vs horizontal scaling |
| APIs | REST, GraphQL, gRPC |
| Security | Auth, rate limiting, encryption |
| Monitoring | Logs, alerts, health checks |

🧩 Mini Reflection:
Let’s warm up. Imagine you're designing an online food ordering app
(like Zomato or Swiggy):

What are the key components you’d expect in such a system?

Try to write 3–4 down (even rough ideas!). We’ll build on that.

Once you reply, I'll help categorize them into a real system layout — and then we'll move into the next foundation: Functional vs Non-Functional Requirements.

1 - Architecture, 2 - which scaling: horizontal or vertical, 3 - Security, 4 - APIs, etc.

Perfect! You've already named some of the most important pillars of system design. Let's take each and understand their roles — then zoom into our next big foundational block: Requirements.

✅ Your List (Refined in System Design Terms):


| Your Idea | System Design Component |
|---|---|
| Architecture | Overall system structure (monolith vs microservices) |
| Horizontal/vertical scaling | Scalability strategy |
| Security | Non-functional requirement (auth, rate limiting, etc.) |
| APIs | Communication interfaces (REST, gRPC, GraphQL) |

Now, before we design anything, we need to understand this:

📌 2. Functional vs Non-Functional Requirements


✅ Definition:
Functional Requirements = What the system must do
→ Features, endpoints, workflows
Non-Functional Requirements = How the system behaves under
conditions
→ Speed, reliability, security, cost, etc.


🧠 Real-Life Analogy:
If you're building an airport:
Functional = "Planes must take off, land, refuel, and board
passengers"
Non-Functional = "Max 5 minutes to get boarding pass", "Handle
100 flights/day", "No plane delay >10 mins"

🔍 Examples in a Food App


| Functional | Non-Functional |
|---|---|
| Users can place orders | Must support 10k orders per minute |
| Admin can onboard restaurants | Payment must be processed in < 2s |
| Show menu by location | Must handle failures gracefully |
| Track order status | Data should be encrypted in transit |

⚠️ Tricky Parts:
Most interviews focus on non-functional because that’s where
design skill matters.
You’ll design different systems depending on the scale you’re
building for (10 users vs 10M users = different needs).

🧩 Mini Exercise:
Take this scenario:

You’re building an Online Exam System like HackerRank.


Can you list:

2 Functional Requirements
2 Non-Functional Requirements


Once you try, I’ll help validate and explain.


After that — we’ll explore Horizontal vs Vertical Scaling next (since you
mentioned it!).

Functional - user can open and solve questions, and look through a library of questions. Non-functional - code should submit in 5 s, and handle 10k users at the same time for, say, some exam.

Excellent! You're already thinking like a system designer. Let's review your answers together:

✅ Your Response:
Functional Requirements (What the system must do):
1. User can open and solve questions ✅
2. User can browse a library of questions ✅

→ These define core features. Perfect!

Non-Functional Requirements (How well it should do it):


1. Code should submit in 5 seconds ✅
→ This is a performance/latency constraint.
2. System should handle 10k users simultaneously ✅
→ This is a scalability requirement.

🎯 Spot on. These are exactly the kinds of things that will drive your
design decisions.

Let’s now go deeper into the next topic you mentioned:

📌 3. Horizontal vs Vertical Scaling


✅ Definitions:


| Type | What it means | Real-Life Analogy |
|---|---|---|
| Vertical scaling | Add more power (CPU, RAM) to one machine | Upgrade a food stall with a bigger stove |
| Horizontal scaling | Add more machines to share the load | Open more stalls with the same menu |

🧠 Example: Online Exam Submission


If 10k users press "Submit" at once:

Vertical scaling: You give your server 128GB RAM and 64 cores
→ Might still choke on I/O or database.
Horizontal scaling: You spin up 20 backend servers behind a load
balancer, all doing the same job
→ Better distribution, fault tolerance.

⚠️ Tricky Bits:
Vertical scaling hits physical limits and becomes expensive.
Horizontal scaling needs:
Load balancing
Stateless servers (or sticky sessions)
Data synchronization (cache, DBs)

You’ll often need both: scale vertically until it hurts, then horizontally.

🔍 Deep Dive Visual:


Imagine your app is a restaurant:

Vertical scaling = Bigger kitchen and more chefs in one location
Horizontal scaling = Franchising — open the same restaurant in many places

But now: how do customers get served evenly?

Answer: Load Balancers — our next topic soon!


🧩 Mini Exercise:
Suppose your exam app expects 100k users during a national test
window.

Would you choose horizontal or vertical scaling? Why?


What other things would you keep in mind?

Try answering in your own words — I’ll help refine it next. Then we dive
into Load Balancing (super important!).

Up to some level vertical scaling so the system doesn't hang, and very much horizontal scaling to balance the load

Yes! 💯 That's exactly the kind of thinking system designers use.

Let’s break down your reasoning and refine it:

✅ Your Answer:
"Upto some level vertical scaling so system doesn't hang out, and
very much horizontal scaling to balance load"

🧠 Refined Take:
Start with vertical scaling: it’s fast, simple, and often enough for
early stages.
As traffic grows, horizontal scaling becomes essential:
Adds fault tolerance (one server can fail)
Shares the load (users get faster responses)
Works well with auto-scaling groups in cloud setups

📌 4. Load Balancers (Next Concept)


✅ What is a Load Balancer?
A load balancer is like a traffic cop. It receives all incoming user requests
and distributes them evenly across available servers.


🧠 Real-life Example:
Imagine a bank with 10 counters (servers). You walk in:
One person at the door (load balancer) tells you:
“Go to Counter 3, they’re free.”

🛠 Key Strategies:
| Strategy | How it Works | Use Case |
|---|---|---|
| Round-robin | Rotate through servers | Even basic distribution |
| Least connections | Send to server with fewest active sessions | Real-time apps |
| IP hashing | Same user → same server | Sticky sessions (e.g., carts) |

⚠️ Tricky Bits:
If servers store session data locally, load balancers must keep users on the same machine (called session affinity or sticky sessions).
Load balancers can be single points of failure — so we add redundant load balancers (active-passive or active-active).

📌 Visual Stack for Online Exam:


Here’s a basic backend architecture you’re now ready to imagine:

```text
[ Clients (users) ]
        |
[ Load Balancer ]
        |
[ Backend Servers ]  <--- horizontally scaled
        |
[ Database ]
```


🧩 Your Turn: Quick Check


Suppose you're using 5 backend servers and one load balancer.

What happens if one backend server crashes during the exam?


What should the load balancer ideally do?

Try answering in your own words — I’ll help you review. After that, we’ll
explore Databases (SQL vs NoSQL) — a super important and tricky part
of design.

It should transfer load to another server, with user data if it's saved locally; if saved globally, we just need to change the server.

Exactly! 🙌 You’ve got the core idea.

Let’s unpack and reinforce what you said:

✅ Your Answer:
"It should transfer load to other server. If user data is saved locally, it
must be transferred; if globally, just switch servers."

🔍 Breakdown:
| Case | What Happens | Why It Matters |
|---|---|---|
| User data saved locally (in memory) | Switching servers may lose session/state | Needs sticky sessions or session replication |
| User data saved globally (e.g., Redis, DB) | User can go to any server | This is called stateless design → preferred for scalability |

🎯 Most large-scale systems avoid local-only storage and use central stores (like Redis or DB) to keep the servers stateless.

⚠️ Tricky Concepts:


If your backend relies on in-memory data (like exam timer or code editor state), losing a server might disrupt user sessions.
Solutions:
Use a centralized cache (e.g., Redis)
Replicate sessions
Design client to sync regularly (e.g., autosave every 10s)

📌 Up Next: Databases (SQL vs NoSQL)


Unless you want to pause here, let’s move on to understanding how to
store data — and what tradeoffs you’ll face when choosing between
relational and non-relational databases.

Let me know if you're ready, or want to go deeper into load balancing/session handling first.

Go deep first, then we will move ahead to DBs

Perfect — let’s dive deep into Load Balancing and Session Handling,
especially as it applies to systems like your Online Exam Platform.

🧠 Part 1: Deep Dive into Load Balancers


✅ What does a load balancer really do?
Distributes incoming requests across multiple servers to avoid
overloading any single one.
Acts as a reverse proxy (users don’t know which server they're
talking to).
Often provides health checks, SSL termination, routing, and session
management.

📊 Load Balancing Algorithms


| Algorithm | Behavior | Best For |
|---|---|---|
| Round robin | Requests go to each server in order | Simple, uniform traffic |
| Least connections | Chooses server with fewest active users | Long-lived connections (e.g., live exam sessions) |
| Weighted round robin | Some servers get more traffic | Mixed hardware strengths |
| IP hashing | Always sends a user to the same server | Sticky sessions, if no global store |

🧠 In your exam app, you'd likely want Least Connections or Sticky Sessions during test execution.
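The simplest of these strategies fit in a few lines of Python. The server names and connection counts below are made up for illustration:

```python
import itertools

servers = ["s1", "s2", "s3"]  # hypothetical backend pool

# Round robin: cycle through servers in order.
_rr = itertools.cycle(servers)
def round_robin():
    return next(_rr)

# Least connections: pick the server with the fewest active sessions.
active = {"s1": 4, "s2": 1, "s3": 7}
def least_connections():
    return min(active, key=active.get)

# IP hashing: same client IP maps to the same server (sticky within one
# process here; real balancers use a stable hash like CRC or consistent hashing).
def ip_hash(ip):
    return servers[hash(ip) % len(servers)]

print([round_robin() for _ in range(4)])  # ['s1', 's2', 's3', 's1']
print(least_connections())                # 's2'
```

Weighted round robin is the same idea as round robin but with some servers appearing more often in the rotation.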

🔐 Part 2: Session Handling


✅ Why are sessions tricky?
In real-time apps (like exams), a user might:
Be writing code
Answering MCQs
Watching a timer tick down

If the user suddenly switches servers and the new server doesn’t know
their state, bad things happen.

⚠️ Session Storage Options


| Strategy | Description | Good? |
|---|---|---|
| In-memory on server | Each server holds its own user sessions | ❌ Not scalable, not fault-tolerant |
| Sticky sessions | Load balancer routes same user to same server | ⚠️ Better, but fails if server crashes |
| Shared session store (e.g. Redis) | All servers read/write session to same store | ✅ Stateless, scalable, reliable |


🔄 Real-World Flow with Redis (Preferred)


1. User logs in
2. Server stores session in Redis:

```json
session:1234 → {userId: "U42", examId: "E15", lastSaved: "10:12:03"}
```

3. Any backend server can fetch session state instantly from Redis.

This is how scalable systems (like Google Docs, HackerRank, etc.) work.
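A minimal sketch of that flow, with a plain dict standing in for Redis so the example is self-contained (a real deployment would use Redis commands such as SET/GET or HSET/HGETALL; the keys and fields here just mirror the example above):

```python
# Shared session store: one place all backend servers read/write.
session_store = {}  # stand-in for Redis

def save_session(session_id, data):
    session_store[f"session:{session_id}"] = data

def load_session(session_id):
    return session_store.get(f"session:{session_id}")

# Server A handles the login and writes the session...
save_session("1234", {"userId": "U42", "examId": "E15", "lastSaved": "10:12:03"})

# ...and server B (any other backend) can read the same state.
print(load_session("1234")["examId"])  # E15
```

Because no server keeps session state locally, any server can pick up any request, which is exactly what makes horizontal scaling safe.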

🛡 Bonus: Health Checks


Load balancers regularly ping backend servers to see if they're alive:

If a server fails → stop sending traffic to it.

You can define rules like:

```text
ping every 5s → if no response in 3 tries → mark unhealthy
```

This makes your system resilient to failures during heavy load, like live
exams.
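That rule is easy to model. Here is a toy version in Python, using the 3-failure threshold from the rule above; the server names and ping results are invented:

```python
FAIL_LIMIT = 3  # mark unhealthy after 3 missed pings in a row

failures = {"s1": 0, "s2": 0}
healthy = {"s1": True, "s2": True}

def record_ping(server, ok):
    if ok:
        failures[server] = 0          # any success resets the streak
    else:
        failures[server] += 1
        if failures[server] >= FAIL_LIMIT:
            healthy[server] = False   # stop routing traffic here

for _ in range(3):
    record_ping("s2", ok=False)       # s2 misses three pings in a row

print(healthy)  # {'s1': True, 's2': False}
```

Real load balancers add the other direction too: a recovered server must pass several consecutive checks before it is marked healthy again.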

🧩 Mini Review
Quick check! Can you answer:

Why is it risky to store sessions only in a server's local memory during a timed online exam?

Type your reasoning, and I’ll review. Then we’ll move on to Databases:
SQL vs NoSQL, where you’ll learn to pick the right storage engine based
on data relationships, read/write patterns, and scale.

If the server crashes, the data will be lost

Exactly right ✅ — and that's one of the biggest risks in system design.


Let’s sharpen your answer just a bit more:

💡 Final Answer (Refined):


If you store sessions only in local memory, and the server crashes or
restarts, all session data (like answers, code, timers) will be lost.
This means users may lose progress, get logged out, or face
inconsistencies — which is unacceptable in systems like online
exams, banking, or e-commerce.

✅ So, what do we do?


We externalize session state to shared systems like:

Redis (fast, in-memory, great for sessions)
Database (slower, but persistent)
Token-based stateless auth (for login info)

This allows horizontal scaling + fault tolerance — the two goals you're
aiming for!

You're crushing it so far. Ready for the next major topic?

👉 Up next: SQL vs NoSQL — choosing the right database


We’ll start with:

How to pick based on your system's data needs
Real-life examples (like exam submissions vs user accounts)
Tricky parts like normalization, joins, and consistency

Type "go" and we’ll dive right in.

Go

Awesome — let’s dive into Databases in System Design, starting with


the big decision:

📌 5. SQL vs NoSQL — When, Why, and How


✅ What are SQL and NoSQL?


| Type | Full Form | Key Idea |
|---|---|---|
| SQL | Structured Query Language | Uses tables, rows, and fixed schemas |
| NoSQL | Not Only SQL | Uses flexible schemas (key-value, docs, graphs, etc.) |

🧠 Real-Life Analogy:
Imagine managing a classroom:
SQL is like a spreadsheet: fixed columns — every student must have
name, ID, marks, etc.
NoSQL is like a notebook: you write down what matters for each
student — no strict structure.

⚙️ SQL Deep Dive


✅ Use SQL when:
Data has strong relationships (foreign keys, joins)
You need ACID properties (Atomic, Consistent, Isolated, Durable)
Schema is well-defined

📘 Examples:
User accounts → You always need name, email, password
Payments → Transactions must never be lost or half-saved
Product inventory → SKUs, prices, suppliers — all relational

⚙️ NoSQL Deep Dive


✅ Use NoSQL when:
Schema can change often
You want high write speed and scale horizontally
You can tolerate eventual consistency


📘 Examples:
Chat messages → Different users send different data
Exam submissions → Store answers as JSON blobs
Logs, sensor data → Append-heavy, no schema
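For instance, two exam submissions of very different shapes can sit side by side in the same NoSQL collection. The field names below are illustrative, not a fixed schema:

```python
import json

# Differently shaped documents in one collection: the flexibility
# that makes NoSQL a fit for submissions.
submissions = [
    {"userId": "U42", "questionId": "Q1", "type": "mcq", "answer": "B"},
    {"userId": "U42", "questionId": "Q2", "type": "code",
     "answer": {"language": "python", "source": "print('hi')"},
     "autosavedAt": "10:12:03"},
]

print(json.dumps(submissions[1], indent=2))
```

In a relational schema you would need either sparse nullable columns per answer type or one table per type; a document store just stores whatever each answer needs.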

⚠️ Tricky Design Questions:


1. Can I join across collections/tables easily?
SQL: ✅ Yes
NoSQL: ❌ Expensive (no joins in most NoSQL DBs)
2. How fast can I write and scale?
SQL: May bottleneck on write
NoSQL: Optimized for massive write scale
3. Can I afford eventual consistency?
Mission-critical? → SQL
High-volume + tolerant? → NoSQL

🔄 Your Exam System: What Fits Where?


| Component | Best DB | Why? |
|---|---|---|
| User login & auth | SQL (e.g. PostgreSQL) | Structured, secure, relational |
| Exam metadata | SQL | Defined schema: title, duration, questions |
| User submissions | NoSQL (e.g. MongoDB) | Flexible: answer blobs, auto-saves |
| Activity logs | NoSQL (e.g. DynamoDB) | High volume, write-heavy |

🧩 Quick Check:
Suppose you want to store users' answers during the exam. Each
question can be MCQ, code, paragraph, or match-the-following.

Would SQL or NoSQL be better for that?


And why?


Give it a go — then we’ll look at CAP theorem and trade-offs next!

NoSQL, because there is no predefined same type of answer

Exactly! 🔥 That’s the right intuition.

✅ Refined Answer:
Use NoSQL because answers are unstructured or semi-structured,
and can vary in type and format (code, MCQ, paragraph, etc).
You don't want to create rigid columns for each type — better to
store them as JSON documents that adapt per user.

This also makes autosave and versioning much easier.

🧠 Recap of SQL vs NoSQL (Quick Summary):


| SQL | NoSQL |
|---|---|
| Rigid schema | Flexible schema |
| Strong consistency | Eventual consistency (often) |
| Joins supported | Joins hard or impossible |
| Scale vertically | Scale horizontally |
| Good for structured data | Good for messy/fast-changing data |

You’ve now got:


Load balancing 💡
Session handling 🧠
Scaling 🔁
Database strategy ✅

Next up: Should we go into CAP Theorem and consistency trade-offs?


This is where you learn why systems often can't be perfect at everything — and how to choose between consistency, availability, and partition tolerance.

Type "CAP" if you're ready, or let me know if you want to pause or go deeper into DB stuff.

Go deeper

Perfect — you’re asking the right thing at the right time. Let’s go deep
into one of the most misunderstood but critical parts of system design:

📌 6. CAP Theorem — The Triangle of Pain


✅ Definition:
In a distributed system, you can only guarantee two out of the
following three:

| Letter | Stands for | Meaning |
|---|---|---|
| C | Consistency | Every read returns the latest write |
| A | Availability | Every request gets a response, even if it's stale |
| P | Partition Tolerance | System keeps working even if the network breaks between nodes |

🧠 Real-Life Analogy: Pizza Shop Chain


You order pizza from Zomato:
If Zomato always shows real-time kitchen status, that's consistency.
If Zomato always replies, even with slightly outdated info, that's availability.
If Zomato works even when one outlet loses internet, that's partition tolerance.

But you can't have all three perfectly in a real-world distributed system. Networks fail, and systems must make trade-offs.

⚠️ Why is Partition Tolerance non-negotiable?


Because:
Networks can and do fail
Systems must tolerate delays, timeouts, or broken links

So in practice, all distributed systems must tolerate partitions → you choose between:

C + P (Consistency + Partition) → Sacrifice availability
A + P (Availability + Partition) → Sacrifice consistency

🔍 Examples:
| System | Chosen Pair | What It Means |
|---|---|---|
| MongoDB (default) | A+P | Reads/writes work even if data is a bit stale |
| HBase / Bigtable | C+P | Writes won't go through until consistency is guaranteed |
| Banking systems | C+P | Accuracy > response time |
| Shopping carts | A+P | It's OK if your cart lags slightly, but it should not fail |

🧩 Your Exam App:


Imagine a user submits a code answer but a server partition happens.

Which would you rather do?


1. Block the user until you're sure the code is saved everywhere (C + P)
2. Accept the code quickly and sync later (A + P)

👉 Think in terms of: user experience, exam integrity, and scale.

What's your instinct? Let's decide and then look at real-world trade-offs.

The 2nd one, I think, because we can save the answer and check or sync it later, and let the user continue, maybe

Spot on! ✅ That’s the right instinct for most real-world systems —
especially high-scale, user-facing ones.

🎯 Let’s Lock in the Reasoning:


You picked Option 2: A + P (Availability + Partition) — Accept the
answer immediately and sync later.

💡 Why this makes sense:

In a live exam, the worst UX is "Submit button stuck" or "Server error".
You prioritize responsiveness, even if it means the answer isn't fully consistent yet.
Behind the scenes, you can:
Write to a local node
Queue it for eventual replication
Use versioning to resolve sync conflicts if needed

This is how real systems like Google Docs, HackerRank, and WhatsApp work.

⚠️ But Here’s the Tricky Part:


💥 What if two versions of an answer get saved due to conflict?

You need a conflict resolution strategy:


| Method | Example |
|---|---|
| Last write wins | Timestamped answers: latest one kept |
| Manual merge | Alert admin if two different answers exist |
| Store all versions | Like Google Docs history (safe but heavy) |

In design interviews, being able to talk about these trade-offs is 🔥 gold.

🧩 Quick Summary for CAP in Your Exam System:


| Goal | CAP Pair | Reason |
|---|---|---|
| Let user submit fast | A+P | Prioritize UX over strict accuracy |
| Score accurately post-exam | C+P | Sync everything and verify later |

You can even switch trade-offs depending on the phase of the app!

You’re doing great. Up next, we’ll hit:

📌 Caching — when and how to use it, what’s safe to cache, and
why it breaks systems if done wrong

Ready to go? Type “cache” and we’ll jump in!

GO DEEPER IN CAP

Love that — let’s break CAP open even further. We’ll tackle it with:

1. ✅ Super-clear breakdown of each component


2. 🧠 Visual real-world examples
3. ⚠️ Tricky misconceptions
4. 🔍 Deep dive into behavior under network failure
5. 🧩 Design-level insights for your own systems


✅ 1. What Exactly Are C, A, and P?


| Letter | What It Means in Practice |
|---|---|
| C — Consistency | All clients see the same data, no matter which server they talk to |
| A — Availability | Every request gets a non-error response; the system never says "can't serve you" |
| P — Partition Tolerance | The system keeps working even if some servers can't talk to others |

🧠 2. Real-World Analogies
A. Consistency Example
Two friends ask you the score of a cricket match.
You tell both: “India has 204 runs.”

That’s consistency — no matter who asks, same correct data.

B. Availability Example
Even if your TV doesn’t have the exact score, it still shows you
something — maybe slightly delayed, but it doesn't go black.

That’s availability — always answering.

C. Partition Tolerance Example

You're texting on WhatsApp. Suddenly, your Wi-Fi dies. You switch to mobile data — messages go through anyway.

That's partition tolerance — the system adapts to broken connections.


🔥 3. The Tricky Truth

Partition tolerance is not optional in distributed systems. So the real decision is:

"In a network partition, do you want to serve stale data (A + P) or block until consistent (C + P)?"

🔍 4. What Happens During a Partition?

Let's say your system has 3 servers: A, B, C. Now suppose:

A and B lose connection to C (partitioned)
User tries to read from A and write to B

You have two options:

| Option | What Happens | Result |
|---|---|---|
| Favor consistency (C+P) | Don't respond until partition heals | Reliable but users face delays/errors |
| Favor availability (A+P) | Allow A & B to operate independently | Fast, but might return outdated/incomplete data |

⚠️ 5. Misconceptions to Watch For


1. ❌ CAP doesn't say "you can't have all three ever" — it says during a partition, you must pick C or A.
2. ❌ Availability ≠ uptime — it's about returning responses, even if they're stale.
3. ❌ Consistency ≠ correctness always — it's data agreement across nodes.

📦 6. How to Choose in Real Systems


| System | Chooses | Why |
|---|---|---|
| Banking | C+P | Must be accurate; don't allow inconsistent balances |
| Social media posts | A+P | OK to see a post a few seconds late |
| Online exam | A+P (during exam), C+P (post exam) | Must not block writes, but final scoring must be consistent |

🧠 Insight:
Great systems often switch priorities depending on the use case:

Write fast (A+P), then sync and verify later (C+P)
Or give users read-after-write guarantees using techniques like quorum reads/writes

🧩 Your Turn — Thinking Exercise


Let’s say you’re building a multi-region online exam system:
Users in India and US
Servers in Mumbai and New York
A network partition isolates both for 2 mins

Should Indian users still be able to submit?


What should happen when partition heals?

Try answering and I’ll walk you through quorum consensus, versioning,
and how real databases like Cassandra and DynamoDB handle it.

Yes, they should be able to submit; once the partition heals, sync with the real data.

Perfect reasoning! ✅ You’ve just made the same design decision used
by systems like DynamoDB, Cassandra, and WhatsApp.


Let’s go deeper into what happens during and after that partition —
and introduce you to some of the powerful tools real systems use to
keep data in sync.

📌 7. What Happens After Partition Heals?


Scenario:
Indian users submit answers to Mumbai servers
US users can’t reach Mumbai (partition)
After 2 minutes, the network heals

Now your system needs to answer:

“Which copy of the data is correct?”

🔍 How Real Systems Handle This:


✅ 1. Eventual Consistency
Once partition heals, nodes exchange data
System resolves conflicts (based on timestamps, versioning, etc.)
Final state becomes consistent eventually

✅ 2. Quorum-Based Consensus (used in Cassandra, Dynamo)

Instead of waiting for all nodes:

You define N: total number of replicas
You define R: min nodes to read from
You define W: min nodes to write to

If R + W > N, you can guarantee strong consistency!

Example:

N = 3 (3 replicas)
W = 2 (write to 2 nodes)
R = 2 (read from 2 nodes)

→ Any read sees at least 1 copy with latest write
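You can verify that rule by brute force: with W + R > N, every possible write set shares at least one replica with every possible read set. A small Python check, using the replica counts from the example above:

```python
from itertools import combinations

def quorums_overlap(n, w, r):
    """True if every size-w write set intersects every size-r read set."""
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

print(quorums_overlap(3, 2, 2))  # True:  W + R > N, reads always see the write
print(quorums_overlap(3, 1, 2))  # False: W + R = N, a read can miss the write
```

The second case shows the failure mode: a write lands on replica 0, the read queries replicas 1 and 2, and the latest value is missed.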


✅ 3. Version Vectors
Each change gets a version (e.g., timestamp or hash)
If conflicts arise, system compares versions:
Newest wins (Last Write Wins)
Or keeps both versions (e.g., Git-style merge)

Used when:
Users may edit same document from different places
Exam submissions may differ across regions
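Here is a minimal last-write-wins resolver, assuming each saved answer carries a timestamp; the answers and timestamps below are invented for illustration:

```python
# Two conflicting copies of the same answer, saved in different regions.
v_mumbai = {"answer": "def solve(): ...", "ts": "2025-08-07T10:12:03"}
v_newyork = {"answer": "def solve(x): ...", "ts": "2025-08-07T10:12:41"}

def last_write_wins(a, b):
    # ISO-8601 timestamps compare correctly as strings.
    return a if a["ts"] >= b["ts"] else b

winner = last_write_wins(v_mumbai, v_newyork)
print(winner["ts"])  # 2025-08-07T10:12:41
```

The well-known catch with last write wins: clocks across regions drift, which is why systems like Dynamo track causality with version vectors instead of trusting wall-clock timestamps alone.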

✅ 4. Conflict Resolution Strategies


| Strategy | Example |
|---|---|
| Last write wins | Keep the version with latest timestamp |
| Custom merge logic | Combine partial answers or select longest |
| User intervention | Rare: notify admins or users |
| Multi-version support | Like Google Docs' version history |

🧠 Real World Example — DynamoDB (A+P System)


Prioritizes availability and partition tolerance
Uses vector clocks to track versions
Application code decides how to resolve conflicts

🧩 Wrap-Up Check
Let’s say you use a quorum system with:

N=3
W = 1 (only 1 write needed)
R = 3 (all nodes must agree to read)

👉 What trade-off are you making?


Try answering this one — it’s subtle but important. After that, I can show
you a real-world diagram of how quorum writes work, or we can move
on to Caching next.

Sacrificing availability

Yes! 💯 Nailed it — and let’s lock it in:

✅ Final Answer:
With W = 1 and R = 3, you’re prioritizing consistency over
availability.

Why?
Writes are super easy — they only need to go to 1 node.
But reads must hit all 3 nodes — if even one node is down or
partitioned, your read fails.

This setup makes sure the data you read is fully up-to-date, but it’s
more likely to fail under partial outages, so you’re sacrificing
availability.

📊 Quick Review: CAP + Quorum Cheat Table


| Goal | W, R setting | Trade-off |
| --- | --- | --- |
| High write availability | Low W | Writes succeed fast |
| High read availability | Low R | Reads succeed even if some nodes are down |
| Strong consistency | High W + R (W + R > N) | Ensures overlap between writers/readers |
| High availability | Low W + Low R | Risk of stale reads, but always responsive |


You’re now thinking like someone who can design a real-world backend
for scale.

Want to keep going?


Next stop:
📌 Caching — what it is, when to use it, and why it can break your
system if done wrong.

Type “cache” and we’ll dive deep.

Explain qourum one last time deeper then we


will move to caching

Absolutely — let’s go deep into quorum-based consistency one last


time so it really sticks. This is one of the trickiest but most powerful
concepts in distributed system design.

📌 What is Quorum in System Design?


In a replicated system with multiple copies of data (N nodes),
quorum ensures that reads and writes overlap on at least one node
so you don’t miss the latest data.

✅ Definitions:
N = Total number of replicas for a piece of data
W = Number of replicas that must acknowledge a write before it is
considered successful
R = Number of replicas that must be queried for a read

🧠 The magic rule:


If W + R > N, then at least one node will have both read and write.
That gives you strong consistency.

🔍 Visual Example:
Let’s say N = 3 → 3 copies of each item

Write goes to W = 2 nodes


Read queries R = 2 nodes

Even if 1 node is behind, at least one node overlaps with the latest
write, so the read gets the correct value.

text

Write   → [A] [B]      ([C] missed)
Read    →     [B] [C]
Overlap →     [B] ✅

⚠️ Trade-offs by Tweaking W and R


W=1, R=N: fast writes; reads are strongly consistent but slow under failure, so reads slow down
W=N, R=1: reads are fast, but all writes must succeed, so writes become the bottleneck
W=2, R=2: a balance of both; consistent and moderately fast, good for most systems
W=1, R=1: max speed, but risk of stale data; the weakest consistency

🧠 Real-Life Analogy: Voting


Think of it like making a group decision:

N = 5 friends
You need W = 3 friends to agree to do something (write quorum)
You ask R = 3 of them before acting (read quorum)

As long as at least one person was part of both groups, you’ll know
what the latest decision was.
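The rule and the analogy above boil down to two small checks; here is a minimal Python sketch (the function names are mine, not a library API):

```python
def overlap_guaranteed(n, w, r):
    """Strong consistency: every read quorum intersects every write quorum."""
    return w + r > n

def can_serve(n, w, r, failed_nodes):
    """Reads and writes both succeed only while enough replicas are alive."""
    alive = n - failed_nodes
    return alive >= w and alive >= r
```

With N=3, W=2, R=2 the overlap holds; with N=5, W=3, R=3 the system keeps serving with 2 failures but not with 3.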


🧩 Quiz Check:
Let’s say:

N = 5 replicas
W=3
R=3

What happens if 2 replicas go offline?


Can you still read and write?

Think through it, give your answer, and then we’ll jump into Caching
right after.

Yes we can, as R and W are both 3 and we will have 3 remaining replicas

Exactly! 🔥 That’s a perfect answer.

✅ Final Breakdown:
N = 5 → total replicas
W = 3 → need 3 writes to succeed
R = 3 → need to read from 3 replicas
2 replicas are down → 3 remain

👉 So you just meet the quorum requirement for both reads and
writes.
The system continues to operate with full consistency despite partial
failure.

🧠 Subtle Insight:
Once more than 2 nodes go down, you’re stuck.
Because:

You won’t meet W = 3


You won’t meet R = 3

So availability drops — that’s the trade-off for consistency.

This kind of resilience threshold is why designers often choose quorum


levels based on expected failure tolerance.


You’ve nailed quorum. It's a pro-level concept, and you’ve built a rock-
solid understanding. 👷‍♂️

✅ Up Next: Caching
Type “go” and we’ll jump into:
What caching is
When to use it
How it supercharges performance — and when it can cause data
inconsistencies or outages if not handled carefully

So availability drops — that’s the trade-off for consistency? Why?

Awesome question — this is where system design becomes real


thinking, not just memorizing. Let’s break it down slowly.

✅ Why Availability Drops When You Choose Strong


Consistency
Imagine again:
N=5
W = 3 (writes must hit 3 replicas)
R = 3 (reads must query 3 replicas)

🧠 Scenario:
2 replicas go down (say, due to network issue or crash). Now you have
only 3 replicas left.

If even 1 more node goes down — now:

You have only 2 nodes left


But your config says:
Need 3 to write
Need 3 to read


👉 Now you can do neither read nor write.


The system becomes unavailable even though 2 nodes are still healthy.

📊 Why this is the trade-off:


| Choice | What You Get | What You Lose |
| --- | --- | --- |
| Prioritize Consistency (W + R > N) | Data is always accurate and up-to-date | Can’t respond if enough nodes aren’t available → availability drops |
| Prioritize Availability (W or R < majority) | System stays responsive even if nodes are missing | Risk of users seeing stale or conflicting data (weaker consistency) |

💡 Rule of Thumb:
The stricter your quorum rules, the fewer failures you can tolerate
before the system stops serving.

So, yes:
✅ Choosing consistency means you're OK with some downtime if it
guarantees correctness of data.

🧠 One-Liner Summary:
“Strong consistency needs agreement — but you can’t agree if not
enough people are online.”

Make sense?

If you're happy with this, we’re ready to dive into caching — the
performance booster with its own tricky trade-offs. Ready?

Yes

Perfect — let’s level up with Caching: one of the most powerful and risky
tools in system design.


📌 8. Caching — Speed Comes at a Price


✅ What is a Cache?
A cache is a temporary, fast-access storage layer that holds copies of
frequently-used data to reduce load on slower backend systems (like
databases).

🧠 Real-Life Analogy:
Imagine a school principal:

Keeps a shortlist of top students on his desk (cache)


The full student records are in a dusty cabinet in the office (DB)

Looking at the desk list is faster, but it’s not always updated.

⚙️ How Caching Works in Systems


Users make a request (e.g., get question paper)
System checks cache first (e.g., Redis, Memcached)
If found → return immediately (super fast)
If not found → fetch from DB, store in cache, return to user

🚀 Benefits
| Benefit | Why it Matters |
| --- | --- |
| 🔥 Speed | Cache reads are often 10x to 100x faster than DB |
| 💸 Cost | Fewer DB queries → lower infra cost |
| 📈 Scale | Offloads pressure from your primary database |

⚠️ But… What’s Tricky?


❌ Cache Inconsistency


If the DB changes but the cache doesn’t, users might see outdated data.

E.g., user submits an answer, DB updates... but cache still shows old
version.

❌ Cache Stampede
If the cache expires for a popular key (e.g., exam homepage)…
Thousands of users may hit the DB at once
System gets DDOS’ed by its own users

❌ Cache Pollution
You cache useless or very rare data → memory wasted.

🔍 Common Caching Strategies


| Strategy | How it Works | Example Use |
| --- | --- | --- |
| Read-through | App reads from cache; if miss, fetch + store | User profile pages |
| Write-through | Write to cache + DB at the same time | Leaderboards |
| Write-back / Lazy | Write to cache first, then DB after a delay | Risky but fast writes |
| Cache-aside | App checks cache → if miss, queries DB + updates cache | Flexible, most common |

🧠 Where to Cache in Your Exam System


| Data | Cache? | Why |
| --- | --- | --- |
| User profiles | ✅ Yes | Rarely changes, high reuse |
| Question bank | ✅ Yes | Static data, improves load time |
| Live code answers | ⚠️ Maybe | Must be synced carefully |
| Scoring results | ❌ No | Must be accurate, real-time only |

🧩 Quick Check:
Imagine you’re caching exam questions for 100k students.
You use cache-aside with Redis.

What happens if the cache expires just before exam starts?

Try answering. Then I’ll explain how to avoid cache stampedes and
introduce TTL, eviction, and prewarming — essential next caching
topics.

Have to update the cache by fetching data from the DB, maybe?

Yes! ✅ That’s exactly what happens — but let’s walk through the full
chain so the risks are clear.

🧠 What You Said (Refined):


If the cache expires just before the exam, and students start the
exam, the app will miss the cache and must fetch data from the
database, then update the cache.

That’s the cache-aside pattern in action.

⚠️ Here’s Where It Gets Risky: The Cache Stampede


If 100,000 students open the exam at once and the cache has expired:
Every student hits the backend
Cache is cold
The database gets flooded with requests


👉 This is called a cache stampede or thundering herd problem

🔍 Real-World Impact:
Database crashes or slows down
Users see lag or errors
System performance tanks at the worst moment

✅ Strategies to Prevent This


1. Pre-warming the Cache

Reload important cache keys before they expire


🛠 Scheduled jobs (cron) or warmup triggers

2. Staggered Expiry / TTL

Avoid expiring all keys at once

E.g., instead of:

python

redis.set("exam_questions", data, ex=3600)

use:

python

redis.set("exam_questions", data, ex=random.randint(3500, 3700))

3. Request Coalescing

If 100 users miss the cache at the same time, only 1 fetches from DB
— the rest wait for that one.

Some systems use locking tokens or queueing to manage this.

4. Background Refresh / Lazy Expiry


Serve the stale value briefly, while refreshing cache in background


(Used in stale-while-revalidate systems like CDNs)

🔧 Common Cache Tools:


| Tool | Used For |
| --- | --- |
| Redis | Fast key-value in-memory cache, supports expiry |
| Memcached | Simpler, lightweight cache |
| CDN (e.g. Cloudflare) | Caching static assets (images, PDFs) |

🧩 Quick Question:
Let’s say your app caches exam questions for 1 hour.
What happens if the DB is updated during that hour (e.g., a question
is removed)?

Should the cache reflect it? How would you handle that?

Give it a go, and we’ll talk about cache invalidation — the hardest
problem in CS 😉

We should expire the cache if, say, a question is removed

Exactly! ✅

If the underlying data changes — like a question is removed or updated


— the cache must be invalidated or refreshed. Otherwise, users will see
stale or incorrect data.

📌 Cache Invalidation — the “Hardest Problem in CS”


You’ve probably heard the joke:


“There are only two hard things in computer science:

1. Naming things
2. Cache invalidation
3. Off-by-one errors.”

Cache invalidation is hard because:

You need to know exactly when the data becomes outdated


And then act immediately to evict/update the cached copy
Without accidentally deleting the wrong thing, or too late

🔍 Invalidation Strategies
| Strategy | What It Means | Example |
| --- | --- | --- |
| Time-based (TTL) | Auto-expire after a set time | exam_questions → TTL 1 hour |
| Write-through | Cache is updated at the same time as DB | Safe, consistent |
| Manual/explicit | Code tells cache to delete or update | DELETE exam_questions after DB update |
| Event-based | DB emits an event; cache listens and evicts | Used in Kafka, message queues |

🧠 In Your Exam System:


Let’s say a professor removes a question from the DB:

What can you do?

If using manual invalidation:


After DB deletion, call:

python

redis.delete("exam_questions:exam123")

If using write-through:


Write goes to cache and DB at once


If using event-based:
DB sends update to message queue
Cache listener invalidates or refreshes affected keys

⚠️ Trade-off Reminder:
Too aggressive invalidation = Cache is often empty → slow
Too lazy invalidation = Cache is stale → users get wrong info

Finding the balance is a key skill in high-scale system design

✅ Summary So Far
You’ve now seen:

What caching is and why it’s useful


How it introduces speed vs consistency trade-offs
How to prevent cache stampedes
How to invalidate safely when data changes

Want to keep going?

Next stop:
📌 Message Queues & Async Processing — super useful for things like
result scoring, email notifications, logging, etc.

Type “queue” and we’ll begin that deep dive.

Queue

Awesome — queues are where your system gets super scalable and
resilient. Let’s break it down step by step:

📌 9. Message Queues — Backbone of Async Systems


✅ What Is a Queue in System Design?
A message queue is a system that:


Lets components send messages asynchronously


Temporarily stores the messages
Delivers them later to consumers for processing

The sender and receiver don’t need to be online at the same time.

🧠 Real-Life Analogy:
Think of a cafeteria token system:

You pay for a coffee and get a token (message)


The barista later picks it up and makes your drink
→ You’re not both busy at the same time

That’s exactly what a message queue does.

🧪 Why Use Queues?


| Benefit | Why It Matters |
| --- | --- |
| ✅ Decoupling | Sender and processor can scale independently |
| ✅ Retry & durability | If a service crashes, the message isn’t lost |
| ✅ Load leveling | Smooths out sudden spikes in traffic |
| ✅ Async performance | Lets users move on immediately |

🔧 Common Message Queue Tools


| Tool | Notes |
| --- | --- |
| RabbitMQ | Feature-rich, supports retry, delay, priorities |
| Kafka | Distributed, super-fast, used for streams and logs |
| SQS (AWS) | Serverless, managed message queue |
| Redis Streams | Simple queueing with Redis if infra is minimal |

🔍 In Your Exam System


Here’s where you might use queues:

| Feature | Why Queue? |
| --- | --- |
| Auto-saving answers | Queue the writes to DB so UI stays fast |
| Code scoring | Don’t block user — score async in background |
| Email notifications | Send “exam submitted” or “score ready” later |
| Audit logging | Write logs without slowing user flow |

⚠️ Tricky Parts of Queues


1. At-Least-Once Delivery → Might process the same message twice
→ Solution: idempotent consumers
2. Order guarantees → Not always preserved unless explicitly
configured
→ Use partitioning and ordering keys
3. Retry logic → What happens if a task fails?
→ Some systems use dead-letter queues to capture failed jobs

🧠 How It Connects with Other Parts


Think of this simple architecture:

text


[Frontend App]
|
[Queue (RabbitMQ)]
|
[Worker Service]
|
[Database]

This way:

The user doesn’t wait for the DB


The worker pulls tasks as fast as it can handle
Your backend becomes much more scalable and fault-tolerant

🧩 Quick Challenge:
Let’s say you want to queue code submissions for scoring.

What should happen if the scoring worker crashes halfway through?

How would you make sure the answer is still processed once it recovers?

Take a guess — I’ll explain retry strategies and how queues guarantee
processing even across failures.

It should auto-retry after some time

Exactly — that’s the core idea behind a robust queue system: retry on
failure until the message is successfully processed.

Let’s go deeper and look at how that works in real systems:

✅ What Happens if the Worker Crashes?


1. Message is sent to the queue:
e.g., {"userId": 42, "examId": 12, "code": "..."}
2. Worker starts processing, but crashes midway (e.g., server crash or
bug)
3. Queue detects that the message was not acknowledged
4. Message is requeued after a short delay → retried by same or
different worker


This is called at-least-once delivery — a message will be processed, but


might happen more than once.

🔧 Key Concepts for Reliability


1. Acknowledgement
The worker must send an ACK (acknowledge) when processing
succeeds
If no ACK is received → queue assumes failure

2. Retries & Backoff


Queue retries the message
May use exponential backoff: wait 1s → 2s → 4s → 8s…

3. Dead Letter Queues (DLQ)

What if a message fails 5 times in a row?

✅ Move it to a dead-letter queue (DLQ):


A special queue for failures that need human/manual inspection or
alerting.

4. Idempotency (Important!)

If a message is retried, your worker may process the same job twice

So your worker logic must be idempotent:


Same input → same output
No side effects on second run

🧠 Example: Check if result already exists in DB before scoring again
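A sketch of that idempotency check, with a dict standing in for the results table and a placeholder for the real scoring logic:

```python
results = {}  # submission_id -> score; a DB table in real life

def score_submission(submission_id, code):
    # Idempotent: a retried message must not score the same submission twice.
    if submission_id in results:
        return results[submission_id]   # already processed, no side effects
    score = len(code)                   # placeholder "scoring" logic
    results[submission_id] = score
    return score
```

Processing the same message twice now yields one stored result, not two.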

🔍 Quick Recap: What Makes a Queue-Based System


Strong?


| Feature | Purpose |
| --- | --- |
| ACK & retry | Prevent dropped messages |
| DLQ | Isolate and inspect failures |
| Idempotent logic | Prevent duplication bugs |
| Metrics & alerts | Know when backlog is growing or failing |

🧩 Mini Check
Let’s say a worker retries a scoring job 3 times and keeps failing (maybe
invalid input or bad code).

Should it retry forever? Why or why not?


What would you do instead?

Try answering — then we’ll explore Pub/Sub vs Queues, or go into Rate


Limiting / Throttling if you're ready for that next level.

Save to DLQ and check manually

Exactly! ✅ That’s the professional move.

🧠 Your Answer:
"Save to DLQ and check manually"

✅ Why That’s the Right Call:


If something keeps failing after multiple retries, it's likely:
A bad message (e.g., corrupt data, invalid input)
A bug in your code
A problem that won't go away by just trying again


Infinite retries would just waste CPU and clog the queue, blocking
healthy messages.

📦 Dead Letter Queue (DLQ) Strategy


| Step | Action |
| --- | --- |
| 🎯 Retry limit hit | Message is moved to DLQ |
| 🔍 Logged/alerted | Engineers or system alerts are triggered |
| 🛠 Manual inspection | Fix root issue or requeue once patched |

In cloud systems (like AWS SQS or Google Pub/Sub), you can configure
DLQ thresholds and monitoring built-in.
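The retry-then-DLQ flow fits in a few lines; a real worker would also sleep with exponential backoff between attempts, and the names here are illustrative:

```python
MAX_RETRIES = 5
dead_letter_queue = []

def process(message, handler):
    """Try the handler up to MAX_RETRIES times, then park the message in the DLQ."""
    last_error = None
    for _ in range(MAX_RETRIES):
        try:
            return handler(message)
        except Exception as exc:
            last_error = exc        # a real worker would back off here
    dead_letter_queue.append({
        "original_message": message,
        "error": type(last_error).__name__,
        "retries": MAX_RETRIES,
    })
    return None
```

Healthy messages return on the first attempt; poison messages end up tagged in the DLQ instead of looping forever.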

🧩 Bonus Consideration: Tag the Error


When moving a message to DLQ, include:

json

{
"original_message": {...},
"error": "ExecutionTimeoutError",
"retries": 5,
"timestamp": "2025-07-31T15:00:00Z"
}

Helps for dashboards, logs, and automated recovery tools.

✅ You Now Know:


What queues are and why they matter
How retries and ACKs work
What DLQs are for and how to use them
Why idempotent logic is critical in async systems


Want to go next into:

→ Pub/Sub vs Queues (multiple consumers, event-based systems)


or
→ Rate Limiting and Throttling (to protect systems from abuse or
overload)?

Pick one and we’ll go deep!

Order guarantees → Not always preserved unless explicitly configured
→ Use partitioning and ordering keys... explain this before moving to the next part

Great call — this is a subtle but very important part of queue-based


systems. Let's break it down clearly before we move on.

📌 Order Guarantees in Queues


✅ What Does It Mean?
When you push messages into a queue, do they come out in the
same order you sent them?

Short answer: Not always.


Most distributed queues do not guarantee ordering by default,
especially when you have multiple workers or partitions.

🧠 Real-World Analogy
Imagine 3 people (workers) are picking up tasks (messages) from a line
(queue):
1. Task A goes to Worker 1 (fast)
2. Task B goes to Worker 2 (slow)
3. Task C goes to Worker 3 (medium)

➡️ Even if tasks were enqueued as A → B → C, they might get


processed as A → C → B


⚠️ Why It Happens
Distributed systems parallelize processing for speed
Queues like Kafka, SQS, or Pub/Sub often shard messages across
different partitions or consumers
Each partition may process independently, causing out-of-order
results

🔍 Solution: Partitioning with Ordering Keys


To preserve order within a group of related messages, you use a
concept called:

➤ Ordering Key or Partition Key


Messages with the same key are always routed to the same
partition
That partition processes messages in order
But across different keys → order is not guaranteed

🧪 Example: Online Exam Submissions


Let’s say you’re queuing autosave events for users:

json

{
"userId": 42,
"examId": 101,
"answer": "..."
}

If each message has an orderingKey = userId :

All messages for user 42 go to same partition


They’ll be processed in order: save1 → save2 → save3

But messages for user 43 may arrive out of sync with 42’s — and that’s
totally fine.
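The routing rule behind ordering keys is usually just a deterministic hash modulo the partition count; a sketch (CRC32 is one common stand-in for the hash a Kafka-style client would use):

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(ordering_key):
    # Deterministic hash: the same key always lands on the same partition,
    # so that partition sees the key's messages in publish order.
    return zlib.crc32(ordering_key.encode()) % NUM_PARTITIONS
```

Every autosave for user 42 routes through `partition_for("user-42")` and therefore stays ordered relative to user 42's other events.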

🧠 Summary Table


Without
Ordering Key With Ordering Key

Messages may Messages with same key


be processed processed in order
out of order

Faster, but risky Slower but reliable


for stateful tasks

Used for logs, Used for user sessions,


metrics, large versioned updates
fan-out

✅ Final Tip:
Use ordering keys when:

Order matters within a context (e.g., per user, per session, per file)
But keep partitions small and specific
(e.g., don’t route all traffic to one “global” ordering key — that
breaks scalability)

You’ve mastered this nuance. Want to move to:

📌 Pub/Sub vs Queues
Or
📌 Rate Limiting & Throttling?

Type your choice and we’ll roll!

Pub/Sub

Awesome — let’s dive into Pub/Sub (Publish–Subscribe) systems. These


are the foundation for event-driven architecture, real-time notifications,
logs, and much more.

📌 10. Pub/Sub — Event-Based Communication


✅ What Is Pub/Sub?

50/139
8/7/25, 2:30 AM System Design - Learn system design

In a Publish–Subscribe model:
Publishers send messages (events)
Subscribers listen for and receive messages that match certain
topics or categories

👉 Unlike queues (where one consumer processes a message), multiple


subscribers can all get the same message.

🧠 Real-Life Analogy
Think of YouTube:
You subscribe to a channel (topic)
When the creator publishes a new video (message), all subscribers
get notified
If you’re offline, you can still watch later — depending on settings
(persistent delivery)

🔍 Pub/Sub vs Queue: Key Differences


| Feature | Queue | Pub/Sub |
| --- | --- | --- |
| Delivery | 1 message → 1 consumer | 1 message → many subscribers |
| Coupling | Tightly coupled to worker logic | Loosely coupled (event-based) |
| Use Case | Tasks, jobs, pipelines | Notifications, logs, real-time updates |
| Examples | RabbitMQ, SQS, Celery | Kafka, Google Pub/Sub, Redis Streams, NATS |

⚙️ Where to Use Pub/Sub in Your Exam System


| Feature | Why Pub/Sub? |
| --- | --- |
| Email notifications | Publish event: “Exam submitted” → notify all interested systems (email service, log service, analytics) |
| Real-time dashboards | Send event stream to teacher’s panel: “User started/ended exam” |
| Audit trails | Push all user activity into a logging pipeline |
| Score calculation | Score engine subscribes to “submission received” topic |

✅ Anatomy of a Pub/Sub System


text

[Publisher]
|
[Topic: "exam.submitted"]
|
-----------------------------
| | | |
[Email] [Logger] [Scoring Engine] ← Subscribers

Publisher sends a message to a topic


All subscribers to that topic get a copy of the message
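The fan-out itself is simple to model: a topic is just a list of subscriber callbacks, and publishing invokes each one. A toy in-memory sketch, not a real broker (no durability, no network):

```python
subscribers = {}   # topic -> list of callback functions

def subscribe(topic, callback):
    subscribers.setdefault(topic, []).append(callback)

def publish(topic, message):
    # Fan-out: every subscriber of the topic gets its own copy of the message.
    for callback in subscribers.get(topic, []):
        callback(message)
```

Subscribing the email service, logger, and scoring engine to "exam.submitted" means one publish call delivers three copies.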

🔁 Message Durability Options


| Type | What Happens |
| --- | --- |
| Transient | Only online subscribers receive messages |
| Persistent | Messages are stored (e.g., Kafka offset logs) — offline consumers can catch up |
| Replayable | Kafka allows replaying past events from a point in time |


⚠️ Tricky Concepts
1. Message Duplication
Some systems offer at-least-once delivery → messages might arrive
more than once
➡️ Solution: Make subscribers idempotent

2. Delivery Guarantees
| Type | Meaning |
| --- | --- |
| At most once | Fastest, but messages may be lost |
| At least once | Guaranteed delivery, may duplicate |
| Exactly once | Hardest to implement, expensive |

3. Backpressure
If a subscriber is slow, messages pile up
➡️ Use consumer groups, rate limits, or buffer queues behind each
subscriber

🧩 Mini Scenario:
You publish a message: "student submitted exam"
You have 3 subscribers:

Email Service
Logger
Scoring Engine

If the scoring engine is down, what happens to its copy of the


message?


How the system behaves depends on the tool (Kafka, RabbitMQ,


Pub/Sub) and configuration.

👉 What do you think should happen?


Then I’ll explain consumer groups and offset tracking — essential for
reliable processing.

Persist the data and retry later

Boom 💥 — that’s the perfect answer.

✅ Your Reasoning:
"Persist the data and retry later"

This is exactly what reliable Pub/Sub systems like Kafka, Google


Pub/Sub, and Redis Streams are built for.

Let’s now explain how that works:

📦 1. Persistent Storage (Kafka-style)


When the publisher sends the message (e.g., "student submitted
exam" ), it's written to disk in an append-only log (like a file).

This means the message is safe — even if no subscribers are


currently online.

🔁 2. Offset Tracking (Per Subscriber)


Each subscriber tracks a pointer, called an offset, which says:

“What’s the last message I’ve processed?”

So if the Scoring Engine is down:


Its offset doesn’t move
When it comes back up, it says:
“What messages were published after my last offset?”
And it catches up, processing messages it missed
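Offset tracking fits in a few lines: the topic is an append-only list, and each subscriber just remembers how far it has read (a toy model of Kafka-style consumer offsets):

```python
log = []        # append-only topic log, Kafka-style
offsets = {}    # subscriber name -> index of the next unread message

def publish(event):
    log.append(event)

def poll(subscriber):
    """Hand a subscriber everything since its last offset, then advance it."""
    start = offsets.get(subscriber, 0)
    events = log[start:]
    offsets[subscriber] = len(log)
    return events
```

A subscriber that goes offline misses nothing: its offset stays put, and the next poll returns everything published in the meantime.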


👥 3. Consumer Groups (Optional)


Sometimes, you want only one instance of a service to process a
message (e.g., 1 out of 3 scoring workers).

In that case, subscribers form a consumer group, and each message is


delivered to only one member of the group.

➡️ This blends Pub/Sub + Queuing


Kafka, Pub/Sub, and Redis Streams support this hybrid model.

🛠 Summary of Your Ideal Setup


For your exam scoring system, you’d want:
Persistent Pub/Sub (e.g., Kafka)
Durable storage of events (no data loss)
Offset tracking (resume if a subscriber goes down)
Possibly a consumer group (for parallel scoring workers)

✅ Quick Recap: Pub/Sub vs Queue


| Feature | Queue | Pub/Sub |
| --- | --- | --- |
| Message goes to | 1 consumer | Many subscribers |
| Durable? | Optional | Usually yes |
| Good for | Background tasks | Notifications, logs, data streams |
| Downstream scaling | Worker pools | Event consumers, analytics, pipelines |

You’re totally ready for the next part.

Would you like to move on to:

📌 Rate Limiting & Throttling (protect APIs from abuse)

or


📌 Design Case Studies (like designing YouTube, WhatsApp, or


Scoring System end-to-end)?

Type your pick and we’ll continue.

Explain consumer groups

Absolutely — consumer groups are a core feature of message systems


like Kafka, and they’re super useful when you need to scale processing
while preserving efficiency.

Let’s break it down:

📌 What is a Consumer Group?


A consumer group is a set of consumers that work together to process
messages from a topic — but each message is processed by only one
consumer in the group.

👉 It combines the broadcast nature of Pub/Sub with the load


balancing of queues.

🧠 Real-Life Analogy:
Imagine you run a pizza kitchen.
Customers (publishers) drop in orders (messages).
You have 4 chefs (consumers).
All chefs are part of the same kitchen team (consumer group).

Each order is handled by exactly one chef, not all 4 — that’s efficient.

🔍 Behavior of Consumer Groups


| Scenario | What Happens |
| --- | --- |
| Multiple consumers in one group | Messages are divided among them (parallel processing) |
| Multiple groups | Each group gets its own full copy of all messages |
| Consumer dies | Group rebalances — others pick up its load |
| More consumers than partitions | Some consumers go idle (no messages to process) |

📦 Example with Kafka


Let’s say:
Topic: exam.submitted
6 partitions
Consumer Group: scorers
You spin up 3 scoring workers (consumers)

➡️ Kafka assigns 2 partitions to each worker


Messages in each partition are processed in order, but across partitions
it’s parallelized
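That partition-to-consumer assignment can be sketched as a round-robin; real Kafka uses pluggable assignors, but the effect for this example is the same:

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment; consumers beyond len(partitions) sit idle."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment
```

With 6 partitions and 3 workers, each worker gets 2 partitions; with 3 partitions and 5 workers, 2 workers end up with nothing to do, which is the parallelism limit described above.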

🔄 Scaling with Consumer Groups


| You Want To... | Do This |
| --- | --- |
| Increase processing speed | Add more consumers to the group |
| Isolate different consumers (e.g., logger, emailer) | Create separate consumer groups |
| Replay data for debugging | Use a new consumer group with offset = 0 |

⚠️ Tricky Parts


1. Partition Limit
If you have 3 partitions and 5 consumers in a group → only 3 will be
active
So your parallelism is limited by partition count

2. Rebalancing Delay
If one consumer dies, group rebalances
During that time, message delivery may pause briefly

3. Offset Management
Each consumer tracks its position (offset) per partition
You must handle this carefully to avoid duplicate or missed
messages

🧩 Quiz Check:
You have:
4 partitions
4 consumers in Group A (Scoring Service)
2 consumers in Group B (Logging Service)

How many total copies of each message will be processed?

Try answering — and I’ll confirm or correct. Then we’ll jump into Rate
Limiting or a real system case study — your call.

2 — one copy per consumer group

Exactly right! ✅

💡 Final Answer:
2 total copies of each message will be processed — one by each
group, not each consumer.

Here’s why:


🔍 Breakdown:
| Group | Consumers | Behavior |
| --- | --- | --- |
| Group A (Scoring) | 4 | Messages are split across consumers — each message goes to one scoring worker |
| Group B (Logging) | 2 | Same message is also sent to this group, and one logger processes it |

➡️ So for each message published:


It is consumed once by Group A (Scoring)
And once by Group B (Logging)
= 2 total deliveries

✅ Even though there are 6 consumers total, only 2 “logical consumers”


(groups) are interested in this message.

🧠 Rule to Remember:
A message is delivered once per consumer group,
but only once within each group.

You’ve now got a solid grasp of:

Consumer groups
Parallelism limits
Offset tracking
Pub/Sub and Queue hybrid behavior

Ready to move on?


Pick your path:
1. 📌 Rate Limiting / Throttling — protect APIs from overload or
abuse
2. 🛠 Case Study Design — full-system walkthrough (like exam
scoring, WhatsApp, YouTube)


Type 1 or 2 and we’ll continue from there.

Rate limiting

Excellent choice — Rate Limiting is a critical concept for protecting


your system from:
API abuse
Bot attacks
Cost spikes
Infrastructure overload

Let’s break it down like a pro:

📌 11. Rate Limiting — Guardrails for Your System


✅ What Is Rate Limiting?
Rate limiting is the practice of restricting how many requests a user
or system can make to an API or service in a fixed time window.

🧠 Real-Life Analogy:
Imagine a ticket counter with this rule:

“You can only ask 5 questions per hour.”

You’re allowed to interact, but within fair limits


It prevents people from hogging the system or overwhelming it

🔍 Why You Need It


| Reason | Example |
| --- | --- |
| ⛔ Abuse Prevention | Stop users from brute-forcing login |
| 💰 Cost Control | Prevent someone from spamming a paid API |
| ⚖️ Fairness | Ensure each user gets fair access |
| 🛡 System Protection | Avoid backend overload during traffic spikes |

⚙️ Common Rate Limiting Strategies


| Strategy | Description | Use Case |
| --- | --- | --- |
| Fixed Window | Allow N requests per time window (e.g., 100/min) | Simple APIs |
| Sliding Window | More accurate: count N requests in a rolling window | Smoother limits |
| Token Bucket | Give tokens at a fixed rate; spend tokens to make requests | Bursty but limited traffic |
| Leaky Bucket | Queue requests; process at a fixed rate | Evening out spikes |
| Concurrency Limit | Limit number of in-flight requests | DB or function throttling |

🛠 Example: Exam System


Let’s say:

API: /submit-answer
Rule: Max 10 submits per minute per user

You don’t want users accidentally spamming submit → overloading the backend.

So you:
1. Track request counts per user
2. On 11th request, return 429 Too Many Requests

🧱 How to Implement

| Method | Tools |
|---|---|
| In-Memory Counters | Redis, Memcached (fast, volatile) |
| Distributed Rate Limits | Redis with Lua scripts or Google Cloud Endpoints |
| API Gateways | NGINX, Kong, Envoy — built-in rate limiting |
| Custom Middleware | Code-level implementation (e.g., Express, FastAPI, etc.) |

⚠️ Tricky Parts
1. Distributed Systems
How do you count requests across multiple servers?
Solution: Use a shared store like Redis

2. Race Conditions
Two requests may sneak in at the same time
Solution: Use atomic counters or locks (e.g., Redis INCR with TTL)

3. Handling Limits Gracefully


Don’t just drop the request — return:

http

429 Too Many Requests
Retry-After: 60

Let the client know when to retry

🧩 Quick Scenario:
Let’s say you want to allow:


100 API requests per user per hour


Across 5 web servers

How would you track this limit to make sure it works across all
servers?

Try answering, and I’ll review it with you — and we’ll explore Token
Bucket vs Leaky Bucket, two pro-level strategies.

For that particular user across a shared store

Perfect! ✅ That’s the correct and scalable answer.

✅ Your Answer:
"Track for that particular user across a shared store"

And here’s the detailed breakdown:

🔍 How It Works in Practice:


Use a centralized, fast-access store like Redis
For each user, store a key like:

plaintext

user:1234:api_count

When a request comes in:


Do an atomic INCR (increment)
Set a TTL (Time-To-Live) of 1 hour when the key is created

Example Redis Commands:

bash

INCR user:1234:api_count
EXPIRE user:1234:api_count 3600   # only on the first request in the window

➡️ All 5 servers will hit the same Redis store, so they share the count
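The same counting logic can be sketched in Python. This is a toy fixed-window limiter: a plain dict stands in for the shared Redis store (in production, `INCR` + `EXPIRE` on Redis would replace it), and the limit and window values are illustrative:

```python
import time

class FixedWindowLimiter:
    """Toy fixed-window rate limiter. In production the counts dict
    would live in Redis so all servers share one view."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # (user_id, window_index) -> request count

    def allow(self, user_id, now=None):
        now = time.time() if now is None else now
        window_index = int(now // self.window)   # which window this request falls in
        key = (user_id, window_index)
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] <= self.limit    # False -> respond with HTTP 429

limiter = FixedWindowLimiter(limit=3, window_seconds=60)
print([limiter.allow("user:1234", now=100 + i) for i in range(4)])
# [True, True, True, False]
```

Old windows are never cleaned up in this sketch; Redis handles that automatically via the TTL.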


🛡 Bonus: Throttling Feedback


If the count exceeds 100:

http

HTTP 429 Too Many Requests
Retry-After: 3600

Let the user/client back off without crashing or panicking

🧠 Pro Tip:
You can even use INCRBY to weight requests differently:

Normal endpoint → +1
Heavy query → +5

Ready to level up?


Let’s now go into:

📦 Token Bucket vs Leaky Bucket — two advanced and often interview-tested strategies for fair and smooth rate limiting

Type “bucket” to continue, or ask for a recap if you want to review anything first.

Bucket

Great — let’s dive into Token Bucket vs Leaky Bucket, two of the most
important rate limiting strategies, especially in distributed systems and
API gateways.

These strategies sound similar, but they serve slightly different purposes. Let’s break them down clearly.

📦 Token Bucket — Flexible and Bursty


✅ What is it?


Imagine a bucket that gets refilled with tokens at a fixed rate (e.g., 1
token per second).
Each API request spends a token. If there’s a token, the request
proceeds.

🧠 Real-Life Analogy:
You’re allowed to make 1 call per second.
But you didn’t make any for the last 10 seconds — now you can make
up to 10 calls instantly because tokens accumulated.

✅ This allows short bursts, as long as you're under the long-term limit.

🔁 Behavior:
Bucket has a max size (say, 60 tokens)
Tokens refill at a steady rate
Requests are allowed if token is available
If no token → wait or reject

🔧 Example:
Bucket size = 10
Refill rate = 1 token/sec
User makes 5 rapid calls → ✅ allowed
Waits 10 sec → bucket back to 10 tokens
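A minimal in-memory sketch of this refill behavior. The capacity and refill rate match the illustrative numbers above, and time is passed in explicitly so the behavior is easy to test:

```python
class TokenBucket:
    """Toy token bucket; capacity and refill rate are illustrative values."""

    def __init__(self, capacity, refill_per_sec, now=0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)  # bucket starts full
        self.last = now

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1  # spend one token for this request
            return True
        return False

bucket = TokenBucket(capacity=10, refill_per_sec=1)
print(all(bucket.allow(now=0.0) for _ in range(10)))  # True: burst of 10 allowed
print(bucket.allow(now=0.0))                          # False: bucket empty
print(bucket.allow(now=10.0))                         # True: refilled after 10s
```

Note how the burst is allowed up front, but the long-term rate is bounded by the refill rate.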

🪣 Leaky Bucket — Smoother and Stricter


✅ What is it?
Imagine a bucket with a small hole at the bottom, leaking water at a
fixed rate.
You can pour water (requests) in any time, but if the bucket
overflows, the extra is discarded or delayed.

🧠 Real-Life Analogy:


You're calling customer service.


Only 1 call every 2 seconds can go through.
If too many people call at once, some are put on hold or dropped.

✅ This produces a steady, predictable outflow.

🔁 Behavior:
Requests enter a queue
The queue is drained at a fixed rate
If queue is full → new requests are rejected
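The queue-and-drain behavior can be sketched the same way; queue size and leak rate are illustrative, and time is again explicit for testability:

```python
from collections import deque

class LeakyBucket:
    """Toy leaky bucket: requests queue up and drain at a fixed rate."""

    def __init__(self, queue_size, leak_per_sec):
        self.queue = deque()
        self.queue_size = queue_size
        self.leak_per_sec = leak_per_sec
        self.last = 0.0

    def offer(self, request, now):
        # Drain: process as many queued requests as elapsed time allows.
        drained = int((now - self.last) * self.leak_per_sec)
        if drained:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()  # handled at the fixed outflow rate
            self.last = now
        if len(self.queue) >= self.queue_size:
            return False              # bucket overflow: reject the request
        self.queue.append(request)
        return True

bucket = LeakyBucket(queue_size=3, leak_per_sec=1)
print([bucket.offer(i, now=0.0) for i in range(5)])  # [True, True, True, False, False]
print(bucket.offer(5, now=2.0))                      # True: two drained meanwhile
```

No matter how fast requests arrive, the outflow rate never exceeds `leak_per_sec`.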

⚔️ Token vs Leaky: Side-by-Side

| Feature | Token Bucket | Leaky Bucket |
|---|---|---|
| Allows bursts? | ✅ Yes | ❌ No |
| Smooth flow? | ⚠️ No | ✅ Yes |
| Use case | APIs with bursty but fair access | Background jobs, logs, I/O shaping |
| Example | User clicks 10x fast | Limit log writes to disk/sec |
| Reject behavior | Reject only when bucket is empty | Reject when queue is full |

🧠 In Your Exam System

| Task | Bucket Type | Why |
|---|---|---|
| Submitting answers (user bursty clicks) | Token Bucket | Allow fast re-clicks, just rate-limited |
| Scoring jobs on backend (controlled speed) | Leaky Bucket | Prevent overload, smooth out traffic |
| Logging user actions (or Kafka) | Leaky Bucket | Limit write rate to disk/network |

🧩 Mini Check:
Let’s say a user clicks the "Submit" button 10 times in 2 seconds.
Token Bucket = ?
Leaky Bucket = ?

Try answering what would happen in both cases — I’ll confirm and wrap
up with a cheat sheet you can refer to any time.

Token bucket allows if there are 10 tokens available but leaky bucket waits or rejects

Perfectly said ✅

Let’s lock in your understanding with a clean summary:

🔍 Your Answer:
"Token Bucket allows if there are 10 tokens available. Leaky Bucket
waits or rejects."

✅ Exactly right.

📊 Final Behavior Recap:

| Case | What Happens |
|---|---|
| Token Bucket | User clicks 10x in 2 sec → All succeed immediately if 10 tokens are available (tokens built up over time). Allows burst. |
| Leaky Bucket | User clicks 10x in 2 sec → Only 1 request every X seconds is processed. Others are delayed or dropped if buffer fills. Smooth flow. |

🧠 One-Liner Summary:
Token Bucket = “You're allowed to burst — just don’t exceed your
refill rate long term.”
Leaky Bucket = “We process requests at a strict fixed rate — no
burst allowed.”

Let me know if you'd like to:


Apply this to a specific system design (like login rate limiting,
leaderboard updates, etc)
Or move on to the next system design building block or case
study

You’re doing fantastic.

Yes apply

Great! Let’s apply Token Bucket and Leaky Bucket to real components
of your Online Exam System so it clicks in context.

🎯 Use Case 1: User Submitting Answers


Problem:
Some students get anxious and click "Submit" multiple times in panic.

✅ Use Token Bucket

Why it fits:
Users may burst (spam clicks), but you want to allow some flexibility
If they’ve been calm, let them submit a few times quickly
But don’t let them flood your backend repeatedly

🔧 Config Example:
Bucket size: 5 tokens
Refill rate: 1 token every 10 seconds

🧠 Effect:
→ They can submit answers up to 5 times in a row, but then must wait
as the bucket refills.

🎯 Use Case 2: Scoring Engine


Problem:
Scoring is CPU-heavy and spikes can overload backend.

✅ Use Leaky Bucket

Why it fits:
You want constant, smooth throughput — not spikes
If 500 students submit at once, score answers at a fixed rate
Let excess requests wait in queue, or reject if full

🔧 Config Example:
Queue size: 100 jobs
Leak rate: 1 job per second

🧠 Effect:
→ Keeps your CPU from melting down, and gives you predictable
performance under load.


🎯 Use Case 3: Login Attempts


Problem:
Prevent brute-force login attacks.

✅ Use Token Bucket (per user + per IP)

Why it fits:
Login attempts often come in bursts (e.g. auto-fill or bot attacks)
You want to allow a few retries, but block the rest
Optionally combine with IP-level blocking or CAPTCHA after threshold

🎯 Use Case 4: Real-Time Activity Logs


Problem:
Log system (disk or DB) can't handle unbounded writes.

✅ Use Leaky Bucket or Kafka-like stream

Why it fits:
You need throttled writes to file/network/disk
Backpressure is important — don't crash the logger service
Leaky bucket ensures controlled write speed

✅ Takeaway Map

| Component | Bucket Type | Goal |
|---|---|---|
| Answer Submit | Token | Allow user-side burst, limit abuse |
| Score Engine | Leaky | Process at controlled backend rate |
| Login Attempts | Token | Allow retries, prevent brute force |
| Logs / Events | Leaky | Avoid overloading disk or queue |

Would you like to move to:


A full system design case study next (e.g. “Design the full Exam
Platform”)
Or explore another building block like Authentication, CDN,
Sharding, etc.?

Your call!

Move to another building block

Nice — let’s keep building! You're stacking core components like a real
system architect now. Here are a few next building blocks we can
explore, each with deep dives and real-world context:

🔧 Choose Your Next Topic:

| Option | Topic | Why It Matters |
|---|---|---|
| 1️⃣ | Authentication & Authorization | Secure login, session management, role-based access |
| 2️⃣ | CDN (Content Delivery Network) | Speed up delivery of static content worldwide |
| 3️⃣ | Database Sharding & Partitioning | Handle massive data by splitting it smartly |
| 4️⃣ | Caching Patterns (Advanced) | Write-behind, refresh-ahead, cache eviction tuning |
| 5️⃣ | Monitoring & Alerting | Know when your system fails, spikes, or misbehaves |
| 6️⃣ | Circuit Breakers & Failover | Prevent cascading failures, keep your app resilient |

Which one do you want to explore next?


Type a number (e.g., 3 ) and I’ll guide you through it step by step —
same detailed, example-driven approach as before.

Step by step

Perfect — you’re thinking like a real system engineer now. Let’s proceed
step by step.

We’ll go in a structured flow that mirrors how production systems evolve as they scale. Here's the roadmap:

🧱 System Design Building Blocks — Step-by-Step Plan


✅ Already Covered:
1. Architecture Basics
2. Functional vs Non-Functional Requirements
3. Scaling: Vertical & Horizontal
4. Load Balancing
5. Databases (SQL vs NoSQL, CAP Theorem, Quorum)
6. Caching (strategies, stampede, invalidation)


7. Messaging Queues & Pub/Sub


8. Rate Limiting & Throttling

🟦 Next Steps — Here’s the Ideal Progression:

| Step | Topic | What You’ll Learn |
|---|---|---|
| 9 | 🔐 Authentication & Authorization | Login vs access control, tokens, session handling |
| 10 | 🌍 CDN & Static Content Delivery | Speeding up global access, cache headers |
| 11 | 📦 Sharding & Partitioning | Splitting databases smartly at scale |
| 12 | 🔁 Replication & Consistency Models | Master-slave, leaderless, consistency tuning |
| 13 | 🚨 Monitoring, Logging & Alerting | Observability stack (Prometheus, Grafana, ELK) |
| 14 | 🔌 Circuit Breakers & Fallbacks | Preventing cascading failures |
| 15 | 🧪 Testing & Load Simulation | Ensuring systems hold under real traffic |

Let's Begin with Step 9: 🔐 Authentication & Authorization

Unless you want to skip ahead, we’ll start here.

Type “go auth” to dive into step 9, or let me know if you want to jump directly to any later step.


Go

Great — let's dive into Step 9: 🔐 Authentication & Authorization.

This is the foundation for security and user management in any system
— especially your exam platform, where controlling access to exams,
roles (admin vs student), and sessions is critical.

🔐 Authentication vs Authorization

| Term | What It Means |
|---|---|
| Authentication ("AuthN") | Verifying who the user is (login, identity) |
| Authorization ("AuthZ") | Verifying what that user is allowed to do (permissions, roles) |

🧠 Think of it like:
You show ID at an airport (Authentication),
Then they check your ticket and seat class (Authorization).

✅ Auth in Your Exam System


Authentication (Login/Identity)
Student signs in with email & password
System verifies credentials and creates a session/token

Authorization (Access Control)


Student can:
Access their exam only
View their own results
Admin can:
Create exams
See everyone's results

🛠 Common Authentication Methods

| Method | Notes |
|---|---|
| Session-based | Server stores a session ID → cookie-based |
| Token-based (JWT) | Serverless, stateless auth → all data in the token |
| OAuth2 / SSO | "Login with Google" → third-party identity |
| Multi-Factor Auth (MFA) | Adds OTP/email/SMS for high-security apps |

🔍 JWT (JSON Web Token) – Most Common in Modern Apps
JWT is a compact, self-contained token used to verify identity.

Structure:

plaintext

xxxxx.yyyyy.zzzzz
  ↓      ↓      ↓
Header Payload Signature

Example Payload:

json

{
"userId": 42,
"role": "student",
"exp": 1722748495
}

Pros:
Stateless (no server memory needed)
Easily stored in browser/localStorage
Encodes user info and roles

Cons:


Harder to revoke (must track blacklisted tokens or short TTL)

⚠️ Tricky Parts
1. Token expiry and refresh → Use refresh tokens to extend session
securely
2. Securing tokens → Never expose JWTs in URLs; use HTTPS only
3. Revocation → JWTs are stateless; need extra logic for forced logout
4. Token tampering → Use strong signing secret (HMAC or RSA)

🧩 Mini Scenario Check:


Let’s say a student logs in and gets a JWT token. The token contains:

json

{
"userId": 101,
"role": "student",
"examId": "EXAM-2025",
"exp": 1722768000
}

What should your server do if this token is expired when the student
tries to submit an answer?

Try answering. I’ll help you refine it and then walk you through refresh
tokens and session expiration strategies.

Should refresh the token using refresh token

Exactly right ✅

When the access token (JWT) is expired, your system should:

🔁 Use a Refresh Token Flow


✅ Here's How It Works:
1. On login:
Issue two tokens:


Access Token (short-lived, e.g. 15 mins)


Refresh Token (longer-lived, e.g. 7 days)
2. When the access token expires:
Client sends the refresh token to a special endpoint (e.g.,
/auth/refresh )

Server:
Validates the refresh token
Issues a new access token
Optionally rotates the refresh token (for security)
3. User continues without logging in again
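The steps above can be sketched as follows. This is a toy flow: an in-memory dict stands in for the refresh-token store (a DB or Redis in practice), and the TTLs and function names are illustrative assumptions, not a real framework's API:

```python
import secrets

ACCESS_TTL = 15 * 60           # short-lived access token (illustrative)
REFRESH_TTL = 7 * 24 * 3600    # longer-lived refresh token (illustrative)

refresh_store = {}  # refresh_token -> (user_id, expires_at)

def login(user_id, now):
    refresh = secrets.token_urlsafe(32)
    refresh_store[refresh] = (user_id, now + REFRESH_TTL)
    return {"user_id": user_id, "access_expires": now + ACCESS_TTL, "refresh": refresh}

def refresh_access(refresh_token, now):
    entry = refresh_store.pop(refresh_token, None)  # rotation: old token is now invalid
    if entry is None or entry[1] < now:
        return None  # expired or unknown -> force the user to log in again
    user_id, _ = entry
    return login(user_id, now)  # issue a fresh access + refresh pair

session = login("user-101", now=0)
renewed = refresh_access(session["refresh"], now=ACCESS_TTL + 1)
print(renewed is not None)                                     # True
print(refresh_access(session["refresh"], now=ACCESS_TTL + 2))  # None (rotated out)
```

The `pop` implements rotation: reusing an old refresh token fails, which is exactly the "flag as compromise" behavior described later.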

🧠 Why Two Tokens?

| Token Type | Purpose | Lifespan | Where Stored |
|---|---|---|---|
| Access Token | Used to access protected routes (like submit answer) | Short (15–30 min) | Memory/localStorage |
| Refresh Token | Used only to get a new access token | Longer (days) | HTTP-only secure cookie or encrypted storage |

🔐 Security Best Practices


Refresh tokens should never be exposed to JavaScript (use
httpOnly cookies)

Use rate limits on refresh endpoint


Revoke refresh tokens on logout or password change

🧩 Mini Recap:
What should happen if a refresh token is expired or missing?

Try answering that — then I’ll explain how forced logouts, blacklisting,
and token revocation lists work in secure systems.

Logout and let user login again maybe


Yes — that’s the right and practical approach ✅

Let’s lock it in:

🔚 If Refresh Token is Expired or Invalid:


✅ What You Should Do:
Logout the user
Prompt them to log in again
Clear any tokens stored in the client (access + refresh)
Optionally redirect to /login with a message like “Session expired,
please log in again”

🧠 Why?
A missing or expired refresh token means the user’s long-term
session has ended
It protects against:
Stolen refresh tokens
Inactive users lingering forever
Users who changed passwords or revoked access

🔐 Bonus: How to Handle Manual Logouts or Account Changes
If a user logs out or resets their password, you might want to:
1. Invalidate the refresh token (immediately)
2. Store blacklisted tokens or session IDs
3. Use refresh token rotation:
On every refresh, issue a new refresh token
If someone tries to reuse an old one → flag as compromise

🧩 Review

| Scenario | Response |
|---|---|
| 🔓 Access token expired | Use refresh token to get a new one |
| ❌ Refresh token expired | Log out and ask user to log in again |
| 🚨 Suspicious token use | Invalidate session or all tokens tied to the user |

You’ve now mastered:


Access vs refresh tokens
Session expiration logic
Secure logout and invalidation

Would you like to now go deeper into:


🔑 Role-based access control (RBAC vs ABAC)?
🌍 CDN & content delivery (Step 10)?
Or jump to a case study where we apply all this?

Your call!

Rbac abac

Excellent — let’s explore 🔑 RBAC vs ABAC, two powerful strategies for authorization (i.e., what a user is allowed to do).

This is especially important in systems like your exam platform, where:

Students shouldn’t access admin features
Admins might have different permission levels (e.g., content creator vs reviewer)

📌 RBAC (Role-Based Access Control)


✅ What Is It?
You assign users to roles, and roles are allowed to do certain actions.

Simple, fast, and widely used.


🧠 Real-Life Analogy:
In a college:
Students can take exams
Teachers can create exams
Admins can manage users

➡️ Roles: student , teacher , admin

Each role has defined permissions

🔧 Example in an Exam System:

| Role | Permissions |
|---|---|
| `student` | `submit_exam`, `view_results` |
| `admin` | `create_exam`, `assign_student`, `export_scores` |
| `reviewer` | `approve_exam`, `flag_answers` |

You check the user’s role at runtime:

js

if (user.role === 'admin') {
  allow('create_exam')
}

📌 ABAC (Attribute-Based Access Control)


✅ What Is It?
Authorization decisions are based on user attributes, resource
attributes, and context.

More flexible, powerful — but more complex.


🧠 Real-Life Analogy:
You’re allowed to take the exam only if:
You’re a student
You’re assigned to that specific exam
It’s currently between start and end time
Your IP address is in India

➡️ ABAC handles this by evaluating policies using attributes.

🔧 Example Policy:
plaintext

allow if
  user.role == 'student' AND
  exam.id in user.assignedExams AND
  current_time < exam.deadline

Now you’re not just using "role" — you’re using:


User fields
Exam fields
Time
Location, etc.
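Such a policy can be expressed directly as a predicate over attributes. This is a toy check mirroring the policy above; the field names (`role`, `assigned_exams`, `deadline`) are illustrative, not from any real schema:

```python
from datetime import datetime, timezone

def can_take_exam(user, exam, now):
    # Every condition is an attribute check: user role, resource
    # assignment, and request context (the current time).
    return (
        user["role"] == "student"
        and exam["id"] in user["assigned_exams"]
        and now < exam["deadline"]
    )

user = {"role": "student", "assigned_exams": {"EXAM-2025"}}
exam = {"id": "EXAM-2025",
        "deadline": datetime(2025, 8, 1, tzinfo=timezone.utc)}

print(can_take_exam(user, exam, datetime(2025, 7, 1, tzinfo=timezone.utc)))  # True
print(can_take_exam(user, exam, datetime(2025, 9, 1, tzinfo=timezone.utc)))  # False
```

Real ABAC engines (e.g. policy languages like Rego or Cedar) externalize these rules so they can change without redeploying code.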

⚖️ RBAC vs ABAC

| Feature | RBAC | ABAC |
|---|---|---|
| Simplicity | ✅ Easy | ❌ Complex |
| Flexibility | ❌ Fixed roles | ✅ Highly dynamic |
| Granularity | Role-level | Attribute-level |
| Scale | Great for few roles | Best for fine-grained, complex orgs |
| Use Case | “Admins can create exams” | “Students can access exam X if before deadline” |


🧩 In Your Exam Platform

| Feature | Use | Why |
|---|---|---|
| Admin panel access | RBAC | Simple: only admins go in |
| Exam access per student | ABAC | Depends on student, exam, time |
| Flag suspicious IPs or late logins | ABAC | Context-aware rules |

✅ Best Practice
👉 Use RBAC as your base
➕ Add ABAC where necessary, especially when:
Permissions depend on who is accessing what
Time/location matters
Policies must change dynamically

Would you like to try defining ABAC rules for a sample exam system
feature?
Or move on to CDN / Step 10 next?

Move to next system design topic

Great! Let's move on to Step 10: 🌍 CDN (Content Delivery Network) — a key part of building fast and reliable global systems.

This is especially useful for:

Online exam platforms with embedded PDFs/images


Any web app with static assets (CSS, JS, fonts, logos)
Real-time systems that need low latency access to content

📌 What Is a CDN?


A Content Delivery Network (CDN) is a globally distributed network of servers that cache and deliver static content from locations closest to the user.

Instead of hitting your origin server (e.g., AWS, Node backend) for every
request, users get content from a nearby CDN edge server.

🧠 Real-Life Analogy:
You’re selling a book worldwide:
Without a CDN → ship from your home every time
With a CDN → pre-stock copies in local warehouses around the
world

Result: Faster delivery, lower cost, fewer bottlenecks

✅ What Can a CDN Serve?

| Content Type | Examples |
|---|---|
| Static assets | HTML, CSS, JS, images, fonts |
| Media files | PDFs, videos, audio |
| APIs (sometimes) | Can cache certain public GET responses |
| Docs, questions | Exam PDFs, problem statements, etc. |

⚙️ How CDNs Work (Simplified Flow)


1. User requests exam.pdf
2. CDN checks:
Does it have a cached copy?
✅ Yes → serve instantly
❌ No → fetch from origin → cache → serve


3. Future users get cached version from nearest PoP (Point of


Presence)

📦 Popular CDN Providers

| Provider | Notes |
|---|---|
| Cloudflare | Free tier, DDoS protection, DNS |
| AWS CloudFront | Deep integration with S3, Lambda |
| Fastly | Real-time purging, developer-friendly |
| Akamai | Legacy leader, high enterprise focus |

⚠️ Tricky Concepts
1. Cache Invalidation
If a file changes (e.g., new exam version), how do you update the CDN?
Use versioned URLs: exam-v2.pdf
Or set cache-control headers like:

http

Cache-Control: max-age=3600

2. Content Security
Never cache:
Sensitive data (e.g., user profile JSON)
Private exam papers without auth

Use:
Signed URLs or signed cookies
HTTPS with cache headers
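Signed URLs are typically just an HMAC over the path and an expiry timestamp. A rough sketch under those assumptions (the path format, parameter names, and secret are made up for illustration; CDN products have their own signing schemes):

```python
import hashlib
import hmac

SECRET = b"server-side-secret"  # placeholder; load from config in real systems

def sign_url(path, expires_at):
    msg = f"{path}|{expires_at}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{path}?expires={expires_at}&sig={sig}"

def verify(path, expires_at, sig, now):
    msg = f"{path}|{expires_at}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    # Constant-time comparison, plus an expiry check.
    return hmac.compare_digest(sig, expected) and now < expires_at

exp = 1_700_000_000
url = sign_url("/exams/exam-v2.pdf", exp)
sig = url.split("sig=")[1]
print(verify("/exams/exam-v2.pdf", exp, sig, now=exp - 60))         # True
print(verify("/exams/exam-v2.pdf", exp, sig, now=exp + 60))         # False (expired)
print(verify("/exams/exam-v2.pdf", exp, "tampered", now=exp - 60))  # False
```

Since only the server knows the secret, nobody can forge a link to a private exam PDF or extend its expiry.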


3. Geo Routing
CDNs use Anycast IPs — user requests automatically go to nearest
server
→ Makes your app feel faster worldwide without extra infra

🧩 In Your Exam System:

| Asset | Should Use CDN? | Why |
|---|---|---|
| Logo, CSS, JS | ✅ Yes | Global assets, rarely change |
| Question PDFs | ✅ Yes (with signed URLs) | Faster load, protect origin |
| Code submissions | ❌ No | Dynamic and private |
| User profile pics | ✅ Yes (with auth) | Speed + bandwidth saving |

Would you like to try designing a cache + CDN strategy for your app?
Or shall we move to Step 11: Database Sharding & Partitioning?

Next step

Awesome — let’s move into Step 11: 📦 Database Sharding & Partitioning, a major milestone in scaling any backend system.

This comes into play when your database grows too big or too hot to
live on a single machine.

📌 What Is Sharding?
Sharding means splitting your data across multiple databases
(shards), each holding a subset of your data.

The goal is to:


Spread load
Reduce latency
Handle larger datasets
Avoid single points of failure

🧠 Real-Life Analogy:
You run a library:
With 1 librarian handling 1000 people → long lines
Instead, you create 10 branches, each responsible for certain books

Now, users go to the right branch (shard) based on the book ID or category.

🔍 Sharding vs Partitioning
Partitioning = Splitting data within a single server (logical
separation)
Sharding = Splitting data across multiple servers (physical +
logical)

You usually start with partitioning, and evolve into sharding as scale
increases.

⚙️ Sharding Strategies

| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Range-based | Shard by ID range (e.g. userId 1–100k) | Simple, predictable | Hot shards if data isn’t uniform |
| Hash-based | Shard by hash of key (e.g. hash(userId) % 4) | Load is spread evenly | Harder to add/remove shards |
| Geo-based | Shard by region or country | Locality benefits | Uneven distribution |
| Entity-based | Shard by entity (e.g. examId) | Keeps related data together | Cross-shard queries harder |


🛠 Example in Your Exam System


Imagine you have:
10 million students
1000 exams

To distribute the load, you could:


Shard by userId → All of a user’s data lives on 1 shard
Or shard by examId → All responses for 1 exam go to the same DB

🧠 Depends on your read/write pattern:


Read results per student? → shard by user
Grade all for an exam? → shard by exam

🔁 Routing Logic
How do you know which shard to query?

App layer handles it with:


Hash function
Routing table (e.g., examId 102 → DB2 )
Shard proxy (like Vitess, Citus, or Custom Middleware)
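The hash-function approach can be sketched in a few lines. Shard names are hypothetical; note the use of a stable hash rather than Python's built-in `hash()`, which is randomized per process, so every app server routes the same key to the same shard:

```python
import hashlib

SHARDS = ["db0", "db1", "db2", "db3"]  # hypothetical shard names

def shard_for(user_id):
    # md5 is used only as a cheap, stable hash here (not for security).
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for(101) == shard_for(101))  # True: deterministic routing
print({u: shard_for(u) for u in (101, 102, 103)})
```

The `% len(SHARDS)` is the weak spot: changing the shard count remaps almost every key, which is why consistent hashing (mentioned below) is preferred when shards come and go.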

⚠️ Tricky Parts
1. Cross-Shard Joins
Joins across shards are slow and painful
→ Avoid if possible, or join in code

2. Resharding
If a shard gets too big, you need to split/move data

Requires downtime, complexity


Use consistent hashing to reduce movement

3. Transactions
Hard to do atomic transactions across shards
Use eventual consistency or 2-phase commit (slow)


🧩 Mini Scenario:
Let’s say you want to shard exam responses by examId .

What happens if one exam (like JEE Main) has 10x more students
than all others?

What’s the problem here, and how might you solve it?

Try answering — and I’ll guide you through hotspot shards and load
balancing tricks before we move to Replication & Consistency Models
(Step 12).

We have to balance load on the DB that has more students

Exactly! ✅ You're thinking like a system designer now.

Let’s break down the situation and solution in depth:

🧠 Problem: Hot Shard / Load Imbalance


You’ve sharded by examId . That works fine for most exams.
But then... JEE Main shows up with 10x more students than other exams.

Now:

The shard for JEE Main is under heavy load


Other shards are barely used
You get skewed performance, query bottlenecks, and possibly
downtime

This is called a hotspot shard.

✅ Solution: Spread the Load — Options


1. Sub-sharding the Hot Shard

Instead of 1 shard per exam, split JEE Main into multiple virtual
shards


For example:

JEE Main → 3 physical shards


Use hash(userId) % 3 to split users across them

Now instead of:

plaintext

examId: JEE-MAIN → DB3

You do:

plaintext

examId: JEE-MAIN + hash(userId) % 3 → DB3A / DB3B / DB3C
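That routing rule can be sketched like this, with hypothetical shard names (`DB3A`…`DB3C` for the hot exam) standing in for real connection targets:

```python
import hashlib

HOT_EXAM_SHARDS = {"JEE-MAIN": ["DB3A", "DB3B", "DB3C"]}  # hypothetical sub-shards
EXAM_SHARDS = {"EXAM-2025": "DB1", "EXAM-2026": "DB2"}    # normal exams: 1 shard each

def shard_for(exam_id, user_id):
    if exam_id in HOT_EXAM_SHARDS:
        # Hot exam: spread its users across sub-shards via a stable hash.
        subs = HOT_EXAM_SHARDS[exam_id]
        h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
        return subs[h % len(subs)]
    return EXAM_SHARDS.get(exam_id, "DB0")  # default shard for anything else

print(shard_for("EXAM-2025", 7))                              # DB1
print(shard_for("JEE-MAIN", 7) in {"DB3A", "DB3B", "DB3C"})   # True
```

Only the routing table changes when an exam turns hot; the application code stays the same.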

2. Weighted Load Distribution


Some systems (e.g., Elasticsearch, Redis Cluster) support weighted
sharding:

Assign more capacity (replicas, CPU, memory) to busy shards


Not always trivial, but possible with tools like Vitess or Citus

3. Application-Aware Routing
In your backend:
Detect when a shard is overloaded (QPS, latency)
Temporarily route traffic to read replicas
Throttle or queue less-critical writes (e.g., auto-saves vs final
submission)

🔐 Rule of Thumb:
When sharding by a field, make sure it doesn’t create extreme
outliers

Use a combination of fields if needed:


Shard by examId + userId
Or hash userId inside a single exam to sub-shard busy exams


🧩 Quick Check:
Let’s say you currently shard exam data by userId using:

js

shardIndex = hash(userId) % 4

But now a few users (e.g., cheaters) are generating 10x more traffic than
average.

What kind of problem might this cause?


How might you handle this without changing the shard key?

Give it a go — then we’ll move to Step 12: Replication & Consistency


Models.

Bottleneck. Or throttling, queueing the requests, or routing temporarily?

✅ Absolutely right!

You nailed both the problem and the solution space. Let’s expand on it
clearly:

⚠️ Problem: Skewed Write/Read Load Due to Heavy Users
Even though sharding by userId evenly distributes data, the traffic
load becomes uneven:

95% of users make 10 requests/day


But user #999 makes 10,000 requests/hour

Now:

That shard (say, DB3) gets overloaded


Other shards stay idle
This creates a performance bottleneck

✅ Solutions (as you said):


🔁 1. Queue or Buffer Requests


Temporarily store requests in a queue:


Use RabbitMQ, Kafka, or Redis Streams
Smooths out bursty traffic
Doesn’t drop important writes

🚦 2. Throttle the User


Set per-user rate limits:
E.g., max 100 requests/minute
Use Token Bucket or Leaky Bucket strategy
Return 429 Too Many Requests if exceeded

This protects your shard and other users on it.

🧭 3. Route to Read Replicas (for GETs)


If the heavy traffic is read-based:
Add replicas to that shard
Use the primary for writes
Serve most reads from replicas

🧠 This is where Replication (our next topic) comes in!

🧠 Bonus Option: Smart Caching


If the heavy user is reloading same data, use:
Redis to cache recent responses
CDN (if public, static content)
Application-level caching (like lru_cache or memory maps)
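For example, Python's standard-library `functools.lru_cache` gives application-level caching in one line (the result payload and counter here are made up for illustration):

```python
from functools import lru_cache

db_queries = {"count": 0}

@lru_cache(maxsize=1024)
def get_exam_result(user_id):
    db_queries["count"] += 1  # stands in for a slow DB call
    return (user_id, 87)      # illustrative, immutable result payload

get_exam_result(101)
get_exam_result(101)          # second call is served from the in-process cache
print(db_queries["count"])    # 1
```

This only caches within one process; the Redis and CDN options above are what you reach for when multiple servers need to share the cache.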

🧩 Summary:

| Issue | Caused By |
|---|---|
| Bottleneck on 1 shard | Uneven traffic per user |
| Fixes | Throttling, Queueing, Replication, Caching |


Ready for Step 12: 🔁 Replication & Consistency Models?

This is where we discuss:

Primary-replica setups
Leaderless systems like Dynamo
Eventual vs strong consistency
Read replicas, sync/async replication

Type "go" and I’ll take you through it!

go

Let’s go! 🚀 Welcome to Step 12: 🔁 Replication & Consistency Models — the backbone of high availability, fault tolerance, and read scalability in distributed systems.

📌 What Is Replication?
Replication is the process of copying data from one database/server
(the leader) to one or more followers (replicas).

🧠 Why do it?
Improve read performance
Ensure data durability
Add high availability (failover if one node dies)

⚙️ Types of Replication
1. Leader-Follower (Primary-Replica)
👉 Most common setup.
| Role | Behavior |
|---|---|
| Leader (Primary) | Handles all writes |
| Followers (Replicas) | Copy data from leader; handle reads |

🧠 Think of it like:
👨 Teacher writes notes → 👩‍🎓 Students copy notes (followers)

🔁 Sync vs Async Replication

| Type | What Happens | Pros | Cons |
|---|---|---|---|
| Synchronous | Leader waits for replica ACK before confirming write | Strong consistency | Slower writes |
| Asynchronous | Leader returns success immediately; replica updates later | Fast | Can cause stale reads or data loss if leader crashes |

✅ Example in Your Exam System


Writes like: submitAnswer(userId, examId) go to leader
Reads like: getExamResult(userId) go to replicas

→ This offloads read traffic and improves performance

🔁 Multi-Leader Replication (Less Common)


You have multiple nodes that can accept writes

Used in:

Geo-distributed apps (multi-region writes)


Conflict-tolerant systems

🧠 Problem: Write conflicts — two users update the same data at once
from different regions

☁️ Leaderless Replication (Dynamo-Style)


Used by:

Amazon DynamoDB
Cassandra
Riak

Any node can accept reads/writes → uses quorum-based consensus

| Feature | Description |
|---|---|
| N | Total replicas |
| W | # of nodes needed to ACK a write |
| R | # of nodes to read from |

If W + R > N → Strong consistency
Else → Eventual consistency
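The quorum rule reduces to a one-line check:

```python
def is_strongly_consistent(n, w, r):
    # W + R > N guarantees every read quorum overlaps every write quorum
    # in at least (W + R - N) replicas, so a read always sees the write.
    return w + r > n

print(is_strongly_consistent(n=3, w=2, r=2))  # True  (quorums must overlap)
print(is_strongly_consistent(n=3, w=1, r=1))  # False (stale reads possible)
```

Tuning W and R is the knob: W=N, R=1 favors fast reads; W=1, R=N favors fast writes; W=R=2 with N=3 balances both.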

⚖️ Consistency Models

| Type | Behavior |
|---|---|
| Strong consistency | Always see the latest write |
| Eventual consistency | Reads might be stale, but will converge over time |
| Causal consistency | Preserves logical ordering (A before B) |
| Read-your-writes | You’ll always see your own latest writes |

⚠️ Tricky Trade-Offs

| Goal | Use |
|---|---|
| Fast write, okay with staleness | Async or eventual consistency |
| Critical data accuracy (e.g. payments) | Strong consistency (sync + quorum) |
| Global performance | Multi-region with leader election or quorum reads |

🧠 In Your Exam App:

| Feature | Strategy |
|---|---|
| Submitting answers | Leader (strong consistency) |
| Fetching results | From replicas (eventual is OK) |
| Admin dashboards | Read from replica if near real-time is fine |
| Code evaluation queue | Use leader or quorum (consistency + speed) |

🧩 Quick Check:
Let’s say:

You have 1 primary DB + 2 replicas


A user submits an answer, and immediately reloads the page

If the replicas are replicating asynchronously, what might the user see?

Type your answer — then we’ll look at read-after-write issues and how
to solve them!

User might not see his answer

✅ Correct — and that’s a classic read-after-write inconsistency!


🧠 Why This Happens:


The write goes to the primary
The user’s reload triggers a read from a replica
But the replica hasn’t caught up yet due to async replication

👉 So the user doesn’t see the answer they just submitted

Even though the write succeeded, the read shows stale data.

📦 This Is Called:
Stale Reads or Replica Lag Issues

✅ Ways to Handle It:


1. Read from Primary after Writes
For critical post-write reads (e.g., show answer just submitted):

js

if (user_just_submitted) {
read_from = "primary"
} else {
read_from = "replica"
}

➡️ Always read from the primary when freshness matters

2. Read-Your-Writes Consistency
Some systems (like Dynamo, Cassandra) use quorum reads:
Ensure that the read includes at least one node that received the
write

This guarantees the user sees their own changes.

3. Sticky Sessions or Write-Through Cache

After a write, cache the response locally or on the client
Show the cached version immediately
Eventually sync with replica DB

4. Delay Reads Slightly (last resort)


Introduce a small delay after writing before showing results
👉 Not ideal, but simple fallback if consistency cannot be guaranteed
immediately
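The first two options can be combined: route a user's reads to the primary for a short window after their last write. A minimal sketch — the 5-second window is an assumed upper bound on replication lag, and `chooseReadTarget` is a hypothetical helper, not a real driver API:

```javascript
const REPLICA_LAG_WINDOW_MS = 5000;  // assumed upper bound on replication lag
const lastWriteAt = new Map();       // userId -> timestamp of their last write

function recordWrite(userId, now = Date.now()) {
  lastWriteAt.set(userId, now);
}

// Read-your-writes: recent writers read from the primary; everyone else
// can safely hit a replica.
function chooseReadTarget(userId, now = Date.now()) {
  const t = lastWriteAt.get(userId);
  return t !== undefined && now - t < REPLICA_LAG_WINDOW_MS ? 'primary' : 'replica';
}
```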

🧩 Summary So Far

Problem | Solution
Stale reads after write | Read from primary
Replica lag | Monitor + alert, or switch to quorum reads
Global latency + consistency | Use leaderless or multi-region quorum systems

Would you like to move on to:

Step 13: Monitoring, Logging & Alerting


Or design a full exam backend now using all the concepts so far?

Your call!

Step 13

Perfect — let’s move on to Step 13: 🚨 Monitoring, Logging & Alerting


— the system design layer that answers:

“Is my system working? And if it’s not, will I know before my users
do?”

This is your eyes and ears in production. Without it, you're flying blind
🛩️


📌 Why Monitoring Matters


Building a system is half the battle. Keeping it healthy, fast, and bug-free is the other half.

✅ Monitoring helps you:


Track system health (CPU, memory, QPS, error rates)
Debug outages
Spot anomalies before users do
Send alerts to the right people or tools

🧠 Monitoring vs Logging vs Alerting


Layer | What It Does
Monitoring | Tracks metrics over time (e.g., requests/sec, DB latency)
Logging | Records detailed events (e.g., “User X failed login”)
Alerting | Triggers notifications based on metrics or logs (e.g., “error rate > 5%”)

🛠 Core Components You’ll Use


Tool Type | Examples | Purpose
Metrics | Prometheus, Datadog, CloudWatch | Track health over time
Dashboards | Grafana, Kibana | Visualize trends and KPIs
Log Storage | ELK Stack (Elasticsearch, Logstash, Kibana), Loki | Centralized log access
Alerting | PagerDuty, Opsgenie, Slack alerts | Notify humans when things break

📦 What Should You Monitor?


Layer | Examples
Frontend | JS errors, load time, HTTP status
API | Latency, QPS, 5xx errors, auth failures
Database | Query time, connection pool usage, deadlocks
Queue | Message lag, dead-letter count
Infra | CPU, RAM, disk, node availability

🧩 Example from Your Exam System


Component | Metric to Monitor | Alert Condition
/submit API | Latency (p95, p99) | >2s latency for 5 mins
Scoring queue | Message backlog | >1000 unprocessed jobs
DB | Query time | avg > 500ms for any SELECT
Auth | Failed logins | Spike in 401s from same IP
Redis cache | Hit ratio | Drop below 80%
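The p95/p99 latency numbers above come from raw request samples. A minimal sketch of the nearest-rank percentile computation (not tied to any particular monitoring library):

```javascript
// Nearest-rank percentile: sort the samples, take the value at rank ceil(p/100 * n).
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b); // numeric ascending sort
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const latenciesMs = [120, 95, 300, 110, 2500, 105, 130, 98, 101, 99];
console.log(percentile(latenciesMs, 50)); // the median request
console.log(percentile(latenciesMs, 95)); // tail latency — what alerts key on
```

Note how one slow outlier dominates p95 while barely moving the median; that is why alerting on tail percentiles catches problems that averages hide.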


⚠️ Tricky Bits
1. Too Many Alerts = Alert Fatigue
→ Prioritize: availability, performance, errors
→ Use thresholds and noise suppression

2. Metrics vs Logs
Metrics = numbers over time (lightweight)
Logs = full story of what happened (verbose, heavy)

Use metrics for monitoring, logs for debugging

3. Correlating Events
A spike in 500s might correlate with:
A new code deployment
High memory on a DB pod
A user abusing the system

Use tracing tools (like Jaeger or OpenTelemetry) to connect dots across services.

🚦 Common Alert Examples (Real Systems)


Alert | Threshold | Action
API 5xx rate | >2% for 5 mins | Notify backend on-call
Queue lag | >10k pending | Scale workers / restart
Auth failure spike | 10x baseline | Trigger fraud detection
DB replica lag | >10s | Reroute reads or alert DBA

Would you like to:


Dive into Step 14: Circuit Breakers & Failover


Or try building a monitoring + alerting plan for your exam app?

Your move 👇

Step 14

Great choice — let’s dive into Step 14: 🔌 Circuit Breakers & Failover, a
critical part of building resilient systems that can survive failure without
taking the whole app down.

This step helps you contain errors, limit blast radius, and recover
gracefully when something fails — which will happen at scale.

📌 What Is a Circuit Breaker?


A circuit breaker is a protective layer around an external call (like a
DB, API, or queue) that detects failures and stops calling the broken
service temporarily.

Just like in electrical systems — it breaks the flow to prevent overload or cascading failure.

🧠 Real-Life Analogy:
Imagine trying to call a friend whose phone is off:

First time: Ring once → fail


Try again: Ring → fail
Circuit breaker says: “Stop calling for 30 seconds”
After cooldown → try once to see if it works

⚙️ How It Works (3 States)


State | Behavior
🔵 Closed | Everything is working; calls pass through normally
🔴 Open | Too many failures; stop all calls for a period
🟡 Half-Open | Try a few calls to see if recovery happened; if successful, go back to closed

🔧 Example Use Case: External Scoring Service


If your exam scoring service goes down:
Circuit breaker opens after 5 consecutive failures
Subsequent API calls fail fast instead of hanging
After 30 seconds, allow 1 test request (half-open)
If it works → close the breaker
If it fails → stay open
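The three states fit in a few dozen lines. A minimal sketch — the class name, thresholds, and the injectable clock are illustrative, not from any particular library:

```javascript
class CircuitBreaker {
  constructor({ failureThreshold = 5, cooldownMs = 30000, now = () => Date.now() } = {}) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
    this.failures = 0;
    this.state = 'CLOSED';
    this.openedAt = 0;
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      if (this.now() - this.openedAt >= this.cooldownMs) {
        this.state = 'HALF_OPEN';              // cooldown elapsed: allow one probe
      } else {
        throw new Error('circuit open — failing fast');
      }
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.state = 'CLOSED';                   // success closes the breaker
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN';                   // trip: stop calling the broken service
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

Wrapping the scoring-service client in `breaker.call(...)` gives exactly the behavior described above: fail fast while open, then probe once after the cooldown.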

✅ Benefits
Benefit | Why It Matters
🛡 Prevent cascading failures | Don't overload dependent services
⚡ Improve latency under failure | Fail fast instead of waiting
🧘 Improve user experience | Show fallback or retry gracefully

🚦 When to Use Circuit Breakers

Use Case | Circuit Breaker?
External APIs (payment, email, scoring) | ✅ Yes
Internal DB with high QPS | ⚠️ Maybe (use with connection pool)
Caching layers (e.g. Redis) | Optional
Internal microservices | ✅ Especially if loosely coupled

🔁 Related: Retry Logic


Don’t retry immediately on failure:
Use exponential backoff: try in 1s, then 2s, then 4s…
Limit to 3–5 attempts
Combine with circuit breaker for safety
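The backoff schedule above can be sketched as a small helper (the attempt cap and base delay are the illustrative values from the bullets; the injectable `sleep` only exists to make it testable):

```javascript
// Retry with exponential backoff: wait 1s, 2s, 4s... between attempts.
async function retryWithBackoff(fn, { maxAttempts = 4, baseDelayMs = 1000, sleep } = {}) {
  const wait = sleep || (ms => new Promise(res => setTimeout(res, ms)));
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;        // out of attempts: give up
      await wait(baseDelayMs * 2 ** (attempt - 1)); // 1s, 2s, 4s, ...
    }
  }
}
```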

🔄 What About Failover?


Failover is the ability to switch to a backup system automatically if
the primary one fails.

Example:
If DB1 goes down, your app connects to DB2 replica
Can be automatic (HAProxy, Aurora)
Or app-level: try primary → fallback

☁️ Cloud-Level Tools
Platform | Circuit Breaker Tool
Netflix OSS | Hystrix (legacy)
Spring Cloud | Resilience4J
AWS | API Gateway throttling, Aurora failover
Kubernetes | Pod health checks + service fallbacks

🧩 Scenario Check
Imagine your code evaluator service goes down during an exam:

Would you rather:

1. Keep retrying every second?


2. Fail fast and show “We’re retrying in a few seconds…”?
3. Reroute traffic to a backup evaluator?

What would your circuit breaker + failover strategy be?

Try answering — then we’ll either:

Explore Step 15: Load Testing & Simulation


Or do a final review / full system walkthrough

Your call!

Fail fast, reroute to backup

Perfect! ✅ You picked the ideal strategy for a high-stakes system like
an online exam platform.

💡 Final Answer Breakdown:


Fail fast + reroute to backup evaluator

✅ Why This Is Smart:


1. Fail Fast (Circuit Breaker open)


Prevents your app from hanging or wasting time retrying a


dead service
Returns instantly with a message like:
"Scoring is temporarily unavailable — retrying soon"

2. Reroute to Backup (Failover)


Keeps users moving without losing submissions
You could:
Queue the request for later evaluation
Or hit a hot standby scoring service

🧠 Pro-Level Design Flow


Step | Action
🚫 Primary scorer fails | Circuit breaker opens after N failures
⚡ Fail fast | Don't retry immediately
🔁 Reroute to backup | Secondary scorer or async queue
📬 Inform user | "Submission received, scoring in progress..."
🔄 Retry logic | Recheck primary later with half-open breaker

✅ Tools That Help


Stack | Option
Backend (Java, Node, Python) | Resilience4J, Polly, Tenacity, custom
Service mesh (K8s) | Istio or Linkerd for circuit breaking + retries
Queues | RabbitMQ, Kafka to store failed requests
Monitoring | Prometheus alert: “scorer unavailable”

You’ve now mastered:


Circuit breakers (resilience at runtime)
Failover (redundancy & high availability)
How to think like a backend lead during outages

Would you like to:


1. ✅ Move to Step 15: Load Testing & Simulation
2. 🧪 Try a full exam system walkthrough, applying all components
end to end?
3. 📎 Review all system design steps so far in one compact sheet?
Let me know how you'd like to wrap up or continue!

Step 15

Let’s wrap this system design journey with Step 15: 🧪 Load Testing &
Simulation — the final test of whether your system is actually ready for
real traffic.

You’ve built the house. Now it’s time to shake it, flood it, and stomp
on it… and see if it holds.

📌 What Is Load Testing?


Load testing simulates real-world traffic (or worse) to see how your
system behaves under pressure:

Does it crash?
Does latency spike?
Do queues back up?
Where’s the bottleneck?


✅ Why You Need It


Goal | Why It Matters
🧠 Find bottlenecks | Before real users do
🧪 Validate capacity | Can it handle 10k concurrent students?
🔥 Test resilience | Circuit breakers, failovers, autoscaling
🛠 Verify infra costs | See where you need more/less infra

🔍 Types of Performance Testing


Type | Purpose
Load testing | Check normal-to-peak traffic
Stress testing | Push beyond peak to find limits
Soak testing | Run for hours/days to find memory leaks
Spike testing | Sudden surges (e.g., all students click “submit” at once)

🛠 Popular Tools
Tool | Notes
k6 | Modern, developer-friendly load tester (JS scripts)
JMeter | Classic GUI-based Apache tool
Locust | Python-based; write user behavior scripts
Artillery | Fast, simple CLI tool for APIs
Gatling | High-performance Scala-based tool
wrk / hey | Lightweight HTTP benchmarks

🧩 Load Testing Your Exam App (Example)


Scenario:
You expect 10k students to submit answers in 5 minutes.

Questions to Test:
Can /submit-answer handle 100 requests/sec for 5 min?
How does latency change over time?
Do circuit breakers or queues trigger?
Does DB CPU spike?
Does Redis cache hold up?

k6 Sample Script:

js

import http from 'k6/http';
import { sleep } from 'k6';

export let options = {
  vus: 200, // 200 concurrent users
  duration: '2m',
};

export default function () {
  http.post('https://yourapp.com/api/submit-answer', { answer: "A", userId: 42 });
  sleep(1); // simulate thinking time
}

➡️ Run and track:


Error rates
Latency (p95, p99)
Server CPU/memory
Queue size
DB connection pool usage

⚠️ Things to Watch
Metric | Sign of Trouble
5xx errors | Overload, crash
p99 latency >2s | Backpressure, slow DB
CPU >90% for long | Need autoscaling
Cache hit rate <70% | Misconfigured caching
Message backlog > threshold | Scale queue workers

✅ What to Do After
Fix any failing services
Tune configs (timeouts, pool sizes, retries)
Add autoscaling where needed
Rerun load test until it stabilizes

🎯 Endgame Goal


You should be able to confidently say:

“Our system can handle [X] users, under [Y] load, for [Z] time… and
fail gracefully if it can’t.”

Would you like a final recap of all 15 system design steps, or want to
now apply everything to a full case study (like your exam system)?

You're right at the finish line!

Yes a final recap

Absolutely — here’s your 🔁 Final System Design Recap — compact,


structured, and review-ready.

✅ 15-Step System Design Journey — Full Recap


# | Topic | Core Idea | Why It Matters
1️⃣ | Architecture Basics | Start with clear functional & non-functional requirements | Foundation of design decisions
2️⃣ | Scaling | 🧱 Vertical (bigger machine) vs ↔️ Horizontal (more machines) | Horizontal = scalable + fault-tolerant
3️⃣ | Security | HTTPS, rate limiting, token handling, authZ/N | Critical in user-facing systems
4️⃣ | Load Balancing | Spread requests across servers | Avoid bottlenecks and single points of failure
5️⃣ | Databases | SQL (structured, strict) vs NoSQL (flexible, fast) | Pick based on data shape & access patterns
6️⃣ | CAP Theorem | Can't have all 3: Consistency, Availability, Partition Tolerance | Guide for trade-offs in DB systems
7️⃣ | Caching | Redis, CDN, app-level; avoid repeated computation | Boosts speed + reduces DB load
8️⃣ | Message Queues & Pub/Sub | Async, decoupled communication between services | Smoother traffic spikes, isolation of failures
9️⃣ | Auth: RBAC/ABAC | Role-based vs Attribute-based permissions | Control who can do what, flexibly
🔟 | CDN | Serve static files from nearby servers | Global speedup + origin protection
1️⃣1️⃣ | Sharding/Partitioning | Split large DBs by key (e.g. userId, examId) | Scales storage + balances load
1️⃣2️⃣ | Replication & Consistency | Leader-follower, quorum, eventual consistency | Redundancy, durability, and failover
1️⃣3️⃣ | Monitoring & Alerting | Metrics (Prometheus), logs (ELK), dashboards (Grafana) | Know what's breaking before users do
1️⃣4️⃣ | Circuit Breakers & Failover | Protect systems from cascading failures | Ensures graceful degradation
1️⃣5️⃣ | Load Testing & Simulation | Test at real scale: latency, QPS, failure handling | Prepares you for real-world usage & stress

🧩 Bonus: Design Principles You Mastered


🧠 Trade-offs everywhere — nothing is free in system design
📊 Metrics guide decisions — build, measure, adjust


🪢 Loose coupling, strong contracts — keep systems resilient and modular
🎯 Design for failure — assume parts will break, and plan for it
🧪 Test under pressure — simulate users before they come

Would you like a:

✍️ PDF export / cheat sheet?


🏗️ Full exam system case study design (from scratch)?
❓ Quick-fire quiz or flashcards for revision?
You’ve completed the full stack — now tell me how you'd like to
reinforce or apply it!

A cheat sheet

System Design Cheatsheet

System Design Cheat Sheet: 15 Core Building Blocks

1. Architecture Basics

Define: Functional vs Non-Functional requirements

Examples: "User can submit answers" vs "Handle 10k


users within 5s latency"

2. Scaling

Vertical: Bigger machine

Horizontal: More machines (preferred at scale)

3. Security

HTTPS, TLS, HSTS

OAuth2, JWT tokens

RBAC/ABAC for access control

4. Load Balancing

Distribute traffic (round-robin, least connections, IP-hash)


Tools: NGINX, HAProxy, AWS ELB

5. Databases

SQL: Relational, ACID, structured

NoSQL: Flexible schema, faster writes, eventual


consistency

6. CAP Theorem

Pick 2 of: Consistency, Availability, Partition Tolerance

Quorum: R + W > N for strong consistency

7. Caching

Types: Client-side, CDN, Redis, in-memory

Eviction: LRU, TTL

Prevent stampede: locking, queuing, stale-while-


revalidate

8. Message Queues / Pub-Sub

Async processing: Kafka, RabbitMQ, SQS

Dead Letter Queues for failed messages

Consumer groups for scaling

9. Auth & Permissions

RBAC: Role-based (admin, student, etc.)

ABAC: Attribute-based (time, location, etc.)

Use refresh tokens for re-authentication

10. CDN (Content Delivery Network)

Serve static files closer to users

Examples: Cloudflare, CloudFront, Fastly

Use signed URLs and cache headers

11. Sharding / Partitioning

Range-based, hash-based, geo-based

Choose key that avoids hotspots

Re-sharding = expensive


12. Replication & Consistency

Leader-Follower: Strong consistency, write bottlenecks

Leaderless (Dynamo): Eventual consistency, quorum-


based

Sync vs Async replication

13. Monitoring & Logging

Metrics: Prometheus, Datadog

Logs: ELK stack, Loki

Alerts: Slack, PagerDuty, based on error rates/latency

14. Circuit Breakers & Failover

States: Closed, Open, Half-Open

Tools: Resilience4J, Istio

Reroute or queue if service is down

15. Load Testing & Simulation

Tools: k6, JMeter, Locust, Artillery

Test: latency, throughput, resource usage, failover


behavior

Validate: Can system handle expected peak + fail


gracefully?

Bonus Principles

Design for failure

Loose coupling, strong interfaces

Monitor everything

Prefer eventual consistency when speed > accuracy

Use queues to decouple and scale


Use this cheat sheet to guide high-level system design

Your System Design Cheat Sheet is ready and saved in the document
titled "System Design Cheatsheet" ✅

You can use this as a quick reference before interviews or while building
real systems.

Would you like:

A printable version (PDF)?


To add visuals/diagrams to this cheat sheet?
Or jump into a case study to apply everything you’ve learned?

Your call — you're in the final boss mode now.

A case study

Perfect choice. Let’s apply everything you’ve learned to a real-world case study — step-by-step — just like you'd face in a system design interview or real-world planning doc.

📚 Case Study: Design an Online Coding Exam Platform


Think of something like HackerRank, LeetCode, or Codeforces, but with
your own custom exam environment.

🔧 Requirements
Functional
Users can register, log in, and take coding exams
Each exam has questions, a timer, and code submission
Submissions are evaluated and scored
Admins can create/manage exams and view results

Non-Functional
Should handle 10k users concurrently
Submissions must be evaluated within 5s
Secure (no cheating or data leaks)
Handle failures gracefully


Responsive across the globe

✅ Let’s build it step by step, your style.


Step 1: Architecture Overview
What style of architecture do we want?

Monolith, Modular Monolith, or Microservices?

👉 Your turn:
Pick one and justify it. Then we’ll sketch the high-level components
and move to DB + queues + caching + etc.

What would you go for?

Microservices, as there are different features in my app

✅ Perfect reasoning!

Since your platform handles distinct concerns (auth, exams, code execution, scoring, results, admin), a microservices architecture offers:

Benefit | Why It Fits
🧱 Modular scaling | Scale code execution separately from user auth
🔒 Isolation | Keep scoring secure + sandboxed
🔁 Independent deployments | Push updates to exam UI without breaking code evaluator
🔄 Easier resilience patterns | Retry, failover, queueing per service

🧩 Step 2: High-Level Components (Microservices Map)


Let’s outline the major microservices you'd need:

Service | Responsibility
Auth Service | Login, registration, token management (JWT/refresh)
User Profile Service | Handles profile, settings, roles
Exam Service | Create exams, assign to users, timer mgmt
Question Service | Manage questions, tags, difficulty
Submission Service | Accept submissions, status updates
Code Execution Service | Runs code in secure sandbox (Docker, Firecracker)
Scoring Service | Grades answers, stores scores
Results Service | Shows performance reports
Admin Panel Service | Admin CRUDs, dashboards
Notification Service | Email, SMS, push updates
Gateway / API Gateway | Entry point with rate limiting, routing
Monitoring & Alerting Layer | Tracks health, errors, latency
Cache Layer (Redis) | Caching questions, tokens, exam metadata
Queue System (Kafka/RabbitMQ) | Buffer submissions → score async
DB Layer | Separate DB per service (or shared if justified)

✅ Let’s Design Deeply, Step-by-Step


Would you like to:

1. 🎯 Design the code submission + evaluation pipeline (includes


queueing, sandboxing, caching, DB writes, failover)
2. 🧾 Start with the user-to-exam flow (auth → assigned exam →
question delivery → submission)
3. 🔐 Zoom into security & anti-cheating layer (timer sync, session
pinning, copy-paste blocking)

Choose one — and we’ll build it top to bottom like a real system
architect.

User to exam flow

Nice — we’ll now design the User → Exam → Question → Submission flow, just like a real-world system walkthrough.

🎯 Goal: Let a user take a coding exam smoothly


(from login → loading questions → submitting answers)

🧭 Step-by-Step System Flow


Let’s break this into logical stages:

Stage | What Happens
1️⃣ Login / Auth | User logs in, gets tokens
2️⃣ Exam Assignment | Exam metadata is loaded
3️⃣ Start Exam | Questions, timer, rules loaded
4️⃣ Answer Questions | User submits code/answers
5️⃣ Submit Exam | Answers pushed to backend

🔒 1. Login / Auth Service


Flow:
1. User hits POST /login
2. Auth Service verifies credentials
3. Returns:
accessToken (short-lived JWT)

refreshToken (stored securely, long-lived)

Notes:
Rate limiting here (to prevent brute force)
Monitor login failure spikes
Cache tokens with expiry in Redis

📋 2. Exam Assignment (Exam Service)


Flow:
1. User hits GET /my-exams
2. Exam Service fetches active + upcoming exams for userId
3. Returns list with:
Exam ID, title, start time, duration, attempt status

DB Tables:
users
exams

user_exam_mapping


🧭 3. Start Exam (Question Service + Timer)


Flow:
1. User clicks “Start Exam” → frontend calls POST /start-exam
2. Backend:
Marks exam as “in progress”
Records timestamp
Returns:
Full question list
Timer duration
Exam metadata (title, section info, etc.)

Notes:
Cache questions per exam in Redis/CDN
Questions should be signed or hashed to prevent tampering
Timer logic: controlled by backend + frontend sync

🧠 Question Delivery (Cache + Shard Aware)


Questions are:

Cached (CDN/Redis) for read performance


Read from correct DB shard if needed (based on examId)

💻 4. User Submits Answers (Submission Service)


For each question, code is submitted to POST /submit
Submission is saved and pushed to message queue for evaluation
Response is immediate: {"status": "submitted", "jobId": 123}

📬 5. Submit Full Exam


When time ends (or user clicks “Submit Exam”), backend:
Marks exam status = completed
Locks submissions (read-only)
Notifies Scoring Service to finalize grades

🔁 Data Flow Map



txt

User → Auth → Exam Service → Question Service


↓ ↓ ↓
Token Exam Metadata Question List (cached)


Start Exam

Submit code → Submission Service → Queue → Execution + Scoring

Submit Exam → Lock → Notify Results Service

🧩 Quick Questions:
1. “Where would you enforce exam timing so users can’t fake
timers?”

2. “Where would you cache data and where would you never
cache?”

Your turn — answer these two, and then we’ll move to the Code
Evaluation Pipeline (queue + retry + circuit breaker + failover).

Exam timing in backend, and cache data for question sets but not for answers

✅ Spot on!

Let’s break down your answers for clarity:

🕐 1. Exam Timing → Backend-Controlled


✅ Correct: Never trust frontend clocks
Timer starts when backend receives POST /start-exam
Track actual duration on server:
Save start_time , duration
Reject submissions after start_time + duration

Frontend timer is for UX only — actual enforcement lives in the backend.
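A minimal sketch of that server-side enforcement — the in-memory session store and field names are assumptions for illustration, not a prescribed schema:

```javascript
const sessions = new Map(); // sessionId -> { startTime, durationMs }

// POST /start-exam — record the authoritative start time on the server
function startExam(sessionId, durationMs, now = Date.now()) {
  sessions.set(sessionId, { startTime: now, durationMs });
}

// Called on every POST /submit — the frontend timer is never trusted
function isWithinTime(sessionId, now = Date.now()) {
  const s = sessions.get(sessionId);
  if (!s) return false; // exam was never started: reject
  return now <= s.startTime + s.durationMs;
}
```

Submissions that arrive after `start_time + duration` are rejected regardless of what the client's clock claimed.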

📦 2. Caching → Questions ✅, Answers ❌


What to Cache | Why
✅ Questions | Same for all users → ideal for Redis/CDN caching
✅ Exam Metadata | Titles, time limits, rule text — rarely changes
❌ Answers / Submissions | User-specific, sensitive, needs durability
❌ Scores / Results | Often calculated, not worth caching until finalized

🧠 Always avoid caching anything write-heavy, user-specific, or security-critical.

Ready to move into the Code Evaluation Pipeline?


This is where we tie in queues, workers, retries, circuit breakers, and
scoring microservices.

Shall we?

Yes

Awesome — let’s build your Code Evaluation Pipeline step-by-step,


applying everything you’ve learned:

🎯 Goal: Handle 10k+ concurrent code submissions


Efficiently, securely, and reliably

🧭 Step-by-Step Pipeline

Step | What Happens
1️⃣ Submit Code | User submits answer to backend
2️⃣ Save & Enqueue | Submission is saved + added to queue
3️⃣ Worker Picks Up | Background service processes jobs
4️⃣ Run in Sandbox | Code runs securely (Docker, Firecracker, etc.)
5️⃣ Scoring | Output is compared against expected
6️⃣ Save Result | Score/status is saved and user notified

🔧 1. POST /submit → Submission Service

json

{
  "userId": 42,
  "examId": "JEE2025",
  "questionId": "Q1",
  "language": "python",
  "code": "print('Hello')"
}

✅ Store raw submission in DB


✅ Push job to queue (e.g., Kafka, SQS)

📬 2. Message Queue (Kafka/RabbitMQ)


Queue stores job like:

json

{
  "jobId": 134,
  "code": "...",
  "language": "cpp",
  "userId": 42,
  "questionId": "Q1"
}

Ensures async decoupling


Handles retry, throttling, DLQ on failure

🛠 3. Worker: Code Executor Service


Picks up job
Creates a secure sandbox (container/vm) for execution
Sets resource limits (CPU, RAM, timeout)
Compiles and runs code
Captures stdout, stderr, exit code

🧠 You can parallelize this with multiple workers

🔒 4. Sandbox Isolation
Run code in Docker, Firecracker, or gVisor
Disable network access
Apply:
Memory caps
Execution timeouts
Volume mounts (temp file system)

Prevents abuse, infinite loops, and security breaches

🎯 5. Scoring Engine
Compare output against expected outputs
Support multiple test cases
Save:
Pass/fail status
Runtime
Memory usage
Final score


💾 6. Save & Notify


Save results to DB
Notify user via:
WebSocket / long polling
GET /submissions polling endpoint

Update “My Submissions” UI

💡 Resilience Tactics
Failure | Strategy
Worker crashes | Retry from queue, track job status
Evaluator down | Circuit breaker + backup instance
Code takes too long | Timeout + mark as failed
Malformed code | Return compile error status
Evaluation fails 3x | Push to Dead Letter Queue → manual check
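The retry-then-dead-letter strategy can be sketched as a small piece of the worker loop — in-memory arrays stand in for real Kafka topics, and `handleJob`/`evaluate` are hypothetical names:

```javascript
const MAX_ATTEMPTS = 3;
const deadLetterQueue = []; // failed jobs parked here for manual review

// Process one job: retry on failure, dead-letter after MAX_ATTEMPTS.
function handleJob(job, evaluate) {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      return { jobId: job.jobId, result: evaluate(job) };
    } catch (err) {
      if (attempt === MAX_ATTEMPTS) {
        deadLetterQueue.push({ job, error: err.message }); // don't drop the data
        return { jobId: job.jobId, result: null };
      }
      // otherwise: loop and retry (in production, with backoff)
    }
  }
}
```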

🔁 Example Flow (Visual)


text

User
  ↓ POST /submit
Submission Service
  ↓
Kafka Queue
  ↓
Worker Pool (Code Execution)
  ↓
Docker Sandbox → Output
  ↓
Scoring Service
  ↓
Result DB
  ↓
User sees score

🧩 Quick Design Decisions:


1. “Where would you place a circuit breaker in this flow?”

2. “How would you scale the code evaluation service?”

Take a stab at these — we’ll review and move to results reporting + final
wrap-up!

In submission endpoint and in evaluation endpoint... sharding and horizontal/vertical scaling and other things?

✅ Brilliant — you're clearly thinking like a system designer now.

Let’s break down your answer, then deepen it into a production-ready


design:

✅ 1. Circuit Breaker Placement


You said:

Submission endpoint and evaluation endpoint

🔌 Spot on:
Location | Why Use Circuit Breaker
POST /submit → Queue | If Kafka/SQS is down, don't hang — fail fast, retry later
Evaluator (worker → sandbox) | If container runtime is failing (e.g., Docker daemon), skip jobs temporarily to prevent queue backlog
Scoring Service | If scoring API is overloaded, reroute or degrade gracefully

✅ 2. Scaling Code Evaluation


Let's go deeper into scaling each part of the pipeline:

⚙️ A. Submission Service

Scaling Method | Notes
Horizontal scaling | Stateless → spin up more instances behind a load balancer
Cache | Token/session info, question metadata
Circuit breaker | Protect against queue overload / write failures

⚙️ B. Kafka / Queue Layer

Scaling Method | Notes
Partitions | More partitions = more concurrency for consumers
Consumer groups | Scale workers to process multiple partitions
DLQ + retries | Handle failures gracefully without dropping data

⚙️ C. Code Executor Workers

Scaling Method | Notes
Horizontal scaling | More containers = more parallel evaluations
Node autoscaling | Based on CPU/memory, queue size metrics
Sharding | Assign workers to specific partitions or topics if needed
Vertical scaling | Rare — if each container needs more RAM/CPU, upgrade instance types

🧠 If code runs in Docker:

Use a worker pool manager (like Nomad or K8s Jobs)
Limit max concurrent containers per node to avoid overload

⚙️ D. Sandboxing Layer

Strategy | Reason
Use microVMs (Firecracker) | Faster startup, better isolation
Resource quotas | Avoid infinite loops / fork bombs
Ephemeral disks | Temp data only — no leakage

⚙️ E. Scoring + Result DB

Scaling | Notes
Scoring Service | Stateless → horizontally scale with load
Result DB | Use write-optimized DB, async flush to analytics DB
Partitioning | By examId or userId to avoid hotspots

🧩 Full Scaling Summary

Layer | Scaling Strategy
Frontend + API | Horizontal pods + CDN + caching
Auth | Stateless + Redis-backed token store
Queue | Kafka with partitioning and consumer scaling
Worker Pool | Horizontal scaling based on queue size
DB | Shard, replicate, and monitor query performance
Cache | Redis with eviction + TTLs
Monitoring | Autoscale + alert on latency, queue lag, memory, 5xx

You’ve completed the user-to-exam flow + evaluation pipeline, with resilience, scaling, and fail-safes built-in.

Would you like to:

✅ Finish with results reporting & analytics


🧪 Try a mini system design mock based on this?
📤 Export this case study as a doc?
You're 99% done — tell me how you'd like to wrap up!


Finish

Perfect — let’s wrap up with the final piece:

🏁 Results Reporting & Analytics


This is where users (and admins) see how they performed and how the
system is doing overall.

📦 What Do We Need to Handle?


User Flow | Admin Flow
View my scores | View leaderboard
Check passed/failed | See aggregate stats
See code output/errors | Track cheating or replays
Review submissions | Export CSVs, track usage trends

🧭 Results Pipeline Overview


text

[Scoring Service]
  ↓
[Results DB]
  ↓
[Results Service] → REST API / GraphQL
  ↓
[User / Admin UI]

🛠 Design Decisions
✅ DB Strategy
Store scores in a results DB (sharded by examId or userId)
Separate analytics DB (OLAP) for large-scale aggregations

| Type | Examples |
| --- | --- |
| OLTP | PostgreSQL, MySQL → fast reads for single user |
| OLAP | ClickHouse, BigQuery → for trends, stats, exports |

🧠 What to Store in Results Table

| Field | Example |
| --- | --- |
| userId | 101 |
| examId | "EXAM2025" |
| score | 72.5 |
| passed | true |
| durationTaken | 43 min |
| submissionTime | timestamp |

⚡ Caching
Cache leaderboard results, stats, and exam metadata in Redis
Use invalidation on new submissions or scheduled refresh
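The cache-plus-invalidation pattern can be sketched in-process (a stand-in for Redis; the TTL, key format, and fake DB are illustrative assumptions):

```python
import time

class TTLCache:
    """Tiny stand-in for the Redis GET/SETEX/DEL calls a leaderboard uses."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]      # lazy expiry, as Redis does
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def invalidate(self, key):
        self._store.pop(key, None)

def leaderboard(cache, db_fetch, exam_id):
    """Read-through: serve cached scores, rebuild from the DB on a miss."""
    cached = cache.get(f"leaderboard:{exam_id}")
    if cached is not None:
        return cached
    fresh = db_fetch(exam_id)
    cache.set(f"leaderboard:{exam_id}", fresh, ttl_seconds=30)
    return fresh
```

A new submission calls `invalidate(...)`, so the next read rebuilds the leaderboard; between submissions, the TTL keeps the expensive aggregation off the hot path.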

📊 Analytics Examples

| Query | Used By |
| --- | --- |
| Avg score per question | Admin QA |
| Time taken per user | Cheating detection |
| Most failed questions | Exam improvement |
| Submission spikes by hour | Load tuning |

🔐 Secure Access
Auth middleware on GET /results/:userId
RBAC:
Users see their data only
Admins see everyone’s (with filters)
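The RBAC rule behind that endpoint fits in one function (the role names and function shape are illustrative, not a specific framework's API):

```python
def can_view_results(requester_id: int, requester_role: str,
                     target_user_id: int) -> bool:
    """Authorization check behind GET /results/:userId.

    Users may read only their own results; admins may read anyone's.
    """
    if requester_role == "admin":
        return True
    return requester_role == "user" and requester_id == target_user_id
```

Middleware would run this after token validation and return 403 when it is false, so the data-only-yours rule is enforced server-side rather than in the UI.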

📉 Monitoring Metrics

| Metric | Why |
| --- | --- |
| Score calculation time | Evaluate scoring load |
| Result fetch latency | Detect DB hot shards |
| API 5xx or empty results | Sign of DB lag or bugs |

✅ End-to-End System Flow Recap

```text
[User Login]
      ↓
[Exam Metadata + Questions] ← Redis cache
      ↓
[Submit Answer] → Submission Service → Kafka
      ↓
[Worker] → Sandbox Execution → Scoring Service
      ↓
[Result DB] ← Save score + output
      ↓
[Results Service] → /leaderboard /my-score /analytics
```

You’ve now designed:

✅ A fully modular microservice system
✅ With caching, queueing, circuit breakers
✅ Sharding, replication, load balancing
✅ Monitoring, security, and failover
✅ All the way from login → exam → results reporting

Would you like this full case study exported into a PDF or document?
Or want a checklist version to revise from?

You've completed the full arc — brilliant work 👏

Okay, now moving to a real-world case study, say Zomato. Explain to me, do not ask questions now.

Great — let’s dive into a real-world system design case study: Zomato
(or any food delivery platform like Swiggy, UberEats).

We'll walk through this like a backend architect: full system overview,
major components, scaling, caching, and reliability — all step by step.

🍽️ System Design Case Study: Zomato / Food Delivery Platform

🧭 High-Level Goals

Zomato allows users to:

Browse restaurants and dishes
Place orders
Track delivery
Leave reviews
Make payments

It must:

Serve millions of users & partners
Maintain real-time inventory & location
Ensure reliability, low latency, and fault tolerance

🧱 Core Microservices

| Service | Description |
| --- | --- |
| User Service | Login, profiles, addresses |
| Restaurant Service | Menus, availability, pricing |
| Search Service | Location-based discovery, filters |
| Cart Service | Add/remove items, pricing rules |
| Order Service | Order placement & updates |
| Delivery Service | Assign delivery partner, live tracking |
| Payment Service | Initiate and confirm payments |
| Rating & Review Service | Star ratings, comments |
| Notification Service | SMS, push, email |
| Admin Panel | Restaurant onboarding, config, dashboards |

🗺️ High-Level Architecture

```text
[User App / Web]
        ↓
[API Gateway / BFF]
   ↙️       ↓       ↘️
[Search]  [Order]  [User/Auth]
   ↓        ↓        ↓
[Restaurant] [Cart] [Reviews]
   ↓        ↓        ↓
[Inventory] [Delivery] [Notifications]
        ↓
[Tracking + Map APIs]
```

⚙️ Data Flows (Critical Use Cases)


1. Browse Restaurants
User location → Search Service → Geo-indexed restaurants (using
ElasticSearch, Redis)
Menus cached (Redis, CDN for images)
Rating, delivery time, offers fetched asynchronously

2. Add to Cart
Cart Service stores items in Redis (short-lived TTL)
Applies pricing rules (offers, delivery fee, taxes)
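A minimal sketch of the pricing-rule step, assuming a fixed evaluation order of offer, then delivery fee, then tax (the rates, the ordering, and the item shape are all illustrative assumptions):

```python
def cart_total(items, offer_pct=0.0, delivery_fee=30.0, tax_rate=0.05):
    """Compute the payable total for a cart.

    items: list of (unit_price, quantity) pairs.
    Rule order: percentage offer on items, add delivery fee,
    then tax on the whole amount.
    """
    subtotal = sum(price * qty for price, qty in items)
    discounted = subtotal * (1 - offer_pct)
    taxable = discounted + delivery_fee
    return round(taxable * (1 + tax_rate), 2)
```

Keeping this logic in the Cart Service (not the client) ensures offers and fees cannot be tampered with before order placement.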

3. Place Order
Cart → Order Service (validates availability)
Calls Payment Service to block amount
On success:
Notifies Delivery Service to assign agent
Pushes job to order status queue (Kafka)

4. Assign Delivery Partner


Delivery Service uses:
Location of restaurant + customer
Live delivery agent locations (from mobile apps)
ETA + load balancing algorithm (e.g., greedy or ML model)
Assigns delivery agent

Tracks in real time
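The greedy option can be sketched as "nearest available agent by haversine distance" (only the distance step; the ETA/ML weighting mentioned above is out of scope here, and the agent tuple shape is an assumption):

```python
import math

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def assign_agent(restaurant, agents):
    """Greedy baseline: pick the nearest available agent.

    agents: list of (agent_id, (lat, lon), available).
    Returns the chosen agent_id, or None if nobody is free.
    """
    candidates = [(haversine_km(restaurant, loc), agent_id)
                  for agent_id, loc, available in agents if available]
    if not candidates:
        return None
    return min(candidates)[1]
```

A production dispatcher would also consider the customer's location, each agent's current load, and predicted ETA, but the greedy distance pick is the usual starting baseline.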

5. Payment & Confirmation


Payment gateway API (Razorpay/Stripe)
Retry logic with circuit breakers
Confirm only if payment is acknowledged
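The retry-with-circuit-breaker idea can be sketched as a minimal breaker wrapping the gateway call (thresholds are illustrative; libraries like resilience4j or pybreaker provide the hardened version):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker around a flaky payment-gateway call.

    After max_failures consecutive errors the circuit opens and calls
    fail fast for reset_seconds, then one trial call is allowed again.
    """
    def __init__(self, max_failures=3, reset_seconds=30.0):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_seconds:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # half-open: allow one trial
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                  # success closes the circuit
        return result
```

Failing fast while the gateway is down protects the Order Service's threads and gives Razorpay/Stripe time to recover instead of hammering it with retries.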

6. Order Updates
Order status is updated via Kafka events:
ORDER_PLACED → RESTAURANT_CONFIRMED → OUT_FOR_DELIVERY → DELIVERED
Frontend polls or uses WebSocket for real-time updates
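That event sequence can be enforced with a tiny state machine (a sketch; a real flow would also model CANCELLED, refunds, and restaurant rejection):

```python
# Valid forward transitions for the Kafka-driven order lifecycle.
TRANSITIONS = {
    "ORDER_PLACED": {"RESTAURANT_CONFIRMED"},
    "RESTAURANT_CONFIRMED": {"OUT_FOR_DELIVERY"},
    "OUT_FOR_DELIVERY": {"DELIVERED"},
    "DELIVERED": set(),
}

def apply_event(current: str, event: str) -> str:
    """Advance an order's status, rejecting out-of-order or replayed
    events (which Kafka consumers must be prepared for)."""
    if event not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {event}")
    return event
```

Rejecting illegal transitions makes the consumer idempotent under replays and reorderings, so a duplicated DELIVERED event cannot corrupt the order record.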

7. Post Order
User leaves ratings/review
Updates Review DB (can be sharded by restaurantId)

🧠 Scaling Strategy

| Component | Scaling Method |
| --- | --- |
| Search | Use ElasticSearch with geo-indexing; horizontally scale |
| Menus | Cache aggressively (Redis + CDN) |
| Orders | Kafka + stateless Order processors |
| Payments | Isolated microservice with retry + fallbacks |
| Delivery | Location updates via streaming (WebSocket, gRPC) |
| DBs | Shard by userId, restaurantId, city, etc. |
| Static Content | Use CDN for menus, banners, logos |

🔐 Security & Reliability

| Layer | Feature |
| --- | --- |
| Auth | OAuth2, JWT, refresh tokens |
| PII Protection | Encrypted addresses, contact info |
| Circuit Breakers | For payment, delivery assignment |
| Retry Queues | For order failures, delivery retries |
| Rate Limiting | Prevent API abuse per user/device |
| Monitoring | Prometheus, Grafana, Sentry, Datadog |
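Per-user rate limiting is commonly implemented as a token bucket; here is a minimal in-process sketch (production would keep the bucket state in Redis, keyed by user or device, with illustrative numbers):

```python
import time

class TokenBucket:
    """Per-user limiter: steady `rate` tokens/sec, bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, never past capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The capacity absorbs legitimate bursts (opening the app fires several API calls at once) while the refill rate caps sustained abuse.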

📊 Analytics & Monitoring

| Metrics | Tools |
| --- | --- |
| Active delivery agents | Redis or time-window counters |
| Order success/failure | Kafka consumers + dashboard |
| Abandon cart | Event pipeline (Snowflake, Redshift) |
| Location accuracy | Comparison with Google APIs |
| Conversion rates | Custom tracking events |

🧩 Real-World Design Challenges

| Challenge | Solution |
| --- | --- |
| Geo-based latency | Local edge caches for menus |
| High peak loads | Auto-scaling Kubernetes pods + queue-based backpressure |
| Real-time tracking | Mobile SDKs + WebSocket infra |
| Fraudulent ratings | ML-based anomaly detection |
| Delivery failures | Manual reassignments + compensation logic |

This system needs to be modular, real-time, resilient, and optimized for scale and locality.

Would you like to go deeper into any subcomponent like:

Real-time delivery tracking?
Search and ranking?
Payment flow with failure handling?
Recommendation systems?

You’re ready to dive into any of them.
