Common HLD System Design Interview Terms You Should Know: A Complete Guide for Engineering Interviews

Introduction

Picture this: You’re sitting in a system design interview, and the interviewer asks, “How would you design Instagram?” You start confidently, but then they interrupt: “What about your load balancing strategy? How would you handle cache invalidation? What’s your approach to database sharding?” Suddenly, you realize that knowing data structures and algorithms isn’t enough—you need to speak the language of distributed systems.

Why This Topic Matters

System design interviews have become a crucial component of software engineering interviews, particularly for mid-level to senior-level positions. Unlike coding interviews that test your algorithmic thinking, system design interviews evaluate your ability to architect large-scale systems that can handle millions of users, process terabytes of data, and remain resilient under failure conditions.

The challenge isn’t just understanding these concepts—it’s being able to articulate them clearly, explain trade-offs confidently, and demonstrate that you’ve thought about real-world constraints like scalability, reliability, and performance. When you use precise terminology, you signal to the interviewer that you understand the building blocks of distributed systems and can reason about them systematically.

Real-World Scenarios Where It’s Useful

Every major tech company—Google, Amazon, Netflix, Uber, Airbnb—relies on these architectural patterns to serve billions of requests daily. When Netflix streams video to 200 million subscribers simultaneously, when Amazon handles millions of transactions during Prime Day, or when Twitter processes 500 million tweets per day, they’re using the exact concepts we’ll discuss in this guide.

Understanding these terms isn’t just about passing interviews. As a software engineer, you’ll encounter these patterns in your daily work: deciding between SQL and NoSQL databases, implementing caching strategies to reduce latency, designing APIs that can scale horizontally, or choosing the right message queue for asynchronous processing.

Why Interviewers Ask About It

Interviewers use system design questions to assess multiple dimensions of your engineering maturity:

  • Technical breadth: Are you familiar with the standard tools and patterns commonly used in the industry?
  • Trade-off analysis: Can you weigh the pros and cons of different approaches?
  • Scalability thinking: Do you consider how systems behave under increasing load?
  • Real-world pragmatism: Can you balance theoretical perfection with practical constraints?
  • Communication skills: Can you explain complex systems clearly to both technical and non-technical stakeholders?

When you demonstrate fluency in system design terminology, you’re showing that you can contribute to architectural decisions from day one, not just write isolated pieces of code.

Let’s dive deep into the essential terms that will transform you from someone who can code to someone who can architect.


The Foundation: Understanding System Design Components

Before we explore individual components, let’s understand how they fit together in a typical high-level design.

A typical architecture combines a load balancer, a CDN, caching layers, databases, an API gateway, and message queues into a simplified but realistic whole. Every component has a specific purpose, and understanding each one is crucial for system design interviews. Let’s break them down systematically.


1. Load Balancer: The Traffic Director

What It Is

A load balancer is like a sophisticated traffic cop at a busy intersection. Imagine you’re at a mall with multiple checkout counters. If everyone lines up at counter one while counters two and three sit empty, you’d have inefficiency and frustrated customers. A load balancer solves this problem for web servers—it distributes incoming requests across multiple servers to ensure no single server gets overwhelmed while others sit idle.

In technical terms, a load balancer sits between clients and servers, receiving all incoming traffic and distributing it across a pool of backend servers based on specific algorithms. This distribution ensures optimal resource utilization, maximizes throughput, minimizes response time, and avoids overloading any single server.

Load Balancing Algorithms

Different scenarios call for different distribution strategies:

Round Robin: This is the simplest approach. If you have three servers (A, B, C), the first request goes to A, the second to B, the third to C, then back to A. It’s like dealing cards in a game—everyone gets one in turn. This works well when all servers have similar capabilities and requests have similar processing requirements.

Least Connections: This is smarter—send the next request to the server currently handling the fewest active connections. Imagine you’re at an airport security line, and you choose the shortest queue. This works better when requests have varying processing times, as it prevents one server from getting backed up with long-running requests while others sit relatively idle.

Weighted Round Robin: Not all servers are created equal. Maybe you have three servers, but one is a beast with 32 cores while the others have 8 cores. Weighted round robin lets you assign weights (like 4:1:1), so the powerful server gets four times as many requests. It’s like having a checkout counter that can process customers faster—you should send more people there.

IP Hash: This algorithm computes a hash of the client’s IP address to determine which server handles the request. This is crucial when you need session persistence (sticky sessions). Think of it like always going to the same bank teller who knows your account history—consistency matters for certain operations.
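
To make the selection logic concrete, here is a minimal sketch of round robin and IP hash selection in Python. The server list is hypothetical, and the itertools-based rotation and hashlib-based bucketing are illustrative choices, not the internals of any particular load balancer.

import hashlib
import itertools

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical backend pool

round_robin = itertools.cycle(servers)

def pick_round_robin():
    # Hand out servers in a fixed rotation: A, B, C, A, B, C, ...
    return next(round_robin)

def pick_ip_hash(client_ip):
    # The same client IP always maps to the same server (sticky sessions)
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]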

flowchart LR
    Client[Client Requests] --> LB{Load Balancer}
    LB -->|Request 1| S1[Server 1<br/>20% Load]
    LB -->|Request 2| S2[Server 2<br/>40% Load]
    LB -->|Request 3| S3[Server 3<br/>80% Load]
    
    LB -.->|Next request goes<br/>to Server 1| S1
    
    style S3 fill:#ffcccc
    style S1 fill:#ccffcc
    style S2 fill:#ffffcc

Layer 4 vs Layer 7 Load Balancing

This distinction confuses many candidates, but it’s simpler than it sounds:

Layer 4 (Transport Layer): These load balancers make routing decisions based on IP addresses and TCP/UDP ports. They’re fast because they don’t inspect packet contents—they just look at network information. It’s like a mail sorting facility that only reads zip codes, not the actual letters inside. Examples: AWS Network Load Balancer, HAProxy in TCP mode.

Layer 7 (Application Layer): These load balancers can inspect HTTP headers, cookies, and even the content of requests. They can route based on URL paths, headers, or HTTP methods. For example, you could route all /api/user/* requests to user service servers and /api/payment/* to payment servers. It’s like a mail sorter who opens letters to determine the best handling. Examples: AWS Application Load Balancer, NGINX, HAProxy in HTTP mode.

Real-World Application

At Netflix, load balancers handle millions of requests per second. When you click “play” on Stranger Things, your request hits a load balancer that routes it to an available API server. If that server is overloaded, the load balancer sends your request to a healthier one. This is why Netflix can serve 200 million subscribers without melting down.

🚨 Common Mistake: Candidates often forget about load balancer failure. What if the load balancer itself fails? In production systems, you need multiple load balancers in an active-passive or active-active configuration, often using DNS round-robin or floating IPs to provide redundancy.


2. CDN (Content Delivery Network): The Speed Amplifier

What It Is

Think of a CDN as a network of convenience stores instead of one massive warehouse. If you lived in New York and Amazon had only one warehouse in Seattle, every package would take days to arrive. Instead, Amazon has fulfillment centers everywhere—you get same-day delivery because the warehouse is nearby. A CDN does this for digital content.

A CDN is a geographically distributed network of proxy servers and their data centers. The goal is to provide high availability and performance by distributing content spatially relative to end users. Instead of every user connecting to your origin server in Virginia, users in Tokyo connect to a CDN edge server in Tokyo, users in London connect to one in London, and so on.

How It Works: The Complete Flow

sequenceDiagram
    participant User as User in Tokyo
    participant Edge as CDN Edge Tokyo
    participant Origin as Origin Server (Virginia)
    
    User->>Edge: Request image.jpg
    Edge->>Edge: Check local cache
    
    alt Cache HIT
        Edge-->>User: Return image.jpg (5ms latency)
    else Cache MISS
        Edge->>Origin: Request image.jpg
        Origin-->>Edge: Send image.jpg (200ms)
        Edge->>Edge: Store in cache
        Edge-->>User: Return image.jpg (205ms first time)
    end
    
    Note over Edge: Subsequent requests<br/>served from cache (5ms)

Let me walk you through what happens:

  1. User Request: A user in Tokyo requests https://cdn.example.com/image.jpg
  2. DNS Resolution: DNS routes the user to the nearest CDN edge server (Tokyo)
  3. Cache Check: The Tokyo edge server checks if it has the file cached
  4. Cache Hit: If yes, return it immediately (maybe 5-10ms latency)
  5. Cache Miss: If no, fetch from origin server (200ms latency across the Pacific)
  6. Cache Population: Store the file in Tokyo edge server’s cache
  7. Return to User: Send the file to the user and also serve future requests from cache

What Content Goes on a CDN?

Static Assets: Images, CSS files, JavaScript bundles, fonts, videos—anything that doesn’t change frequently. YouTube doesn’t serve every video from California; they use CDNs to cache popular videos close to viewers.

Dynamic Content with Caching: Even some API responses can be CDN-cached with appropriate headers. If your API returns the same data for 5 minutes, CDN edge servers can cache and serve those responses.

Streaming Media: Netflix uses a specialized CDN called Open Connect. When Squid Game premiered, Netflix didn’t stream every episode from their Virginia data center to Korea—they pre-positioned copies on edge servers worldwide.

Cache Invalidation Strategy

This is where it gets tricky. How do you update content that’s cached on 200 edge servers worldwide?

Time-Based (TTL): Set a Time To Live, like 24 hours. After 24 hours, the CDN refetches from origin. Simple but means users might see stale content for up to 24 hours.

Version-Based: Instead of logo.png, use logo-v2.png or logo.png?v=12345. When you update the logo, you change the filename or version parameter. This is bulletproof—no stale content issues.

Purge/Invalidation API: CDN providers offer APIs to manually purge content. When you update logo.png, you call the CDN API to invalidate it across all edge servers. This is fast but costs API calls and can be complex to manage.
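
As a concrete illustration of the version-based approach, here is a minimal sketch that derives the version parameter from a content hash at build time, so the URL changes whenever the asset changes. The CDN domain and file path are hypothetical.

import hashlib

def versioned_url(asset_path):
    # Hash the file contents; any change to the asset produces a new URL
    with open(asset_path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()[:8]
    return f"https://cdn.example.com/{asset_path}?v={digest}"

# versioned_url("logo.png") might return "https://cdn.example.com/logo.png?v=3a1f9c2b"
# (the exact value depends on the file contents)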

💡 Interview Tip: When discussing CDNs, always mention the trade-off between freshness and performance. Shorter TTLs mean fresher content but more origin requests (higher cost, higher latency). Longer TTLs mean better performance but potentially stale content.


3. Caching: The Performance Multiplier

What It Is and Why It Matters

Caching is the practice of storing copies of data in a fast-access location so subsequent requests can be served more quickly. It’s like keeping your most-used tools on your workbench instead of walking to the garage every time you need a screwdriver.

Here’s a mind-blowing statistic: accessing data from RAM is about 100,000 times faster than accessing it from a hard drive, and about 100 times faster than accessing it from an SSD. For network-based storage like databases, the difference is even more dramatic when you factor in network latency.

Cache Hierarchy in a Typical System

flowchart TD
    Request[User Request] --> L1[L1: Browser Cache<br/>~1ms]
    L1 -->|Cache Miss| L2[L2: CDN Cache<br/>~10-50ms]
    L2 -->|Cache Miss| L3[L3: Application Cache<br/>Redis/Memcached<br/>~1-5ms]
    L3 -->|Cache Miss| L4[L4: Database Query Cache<br/>~10-50ms]
    L4 -->|Cache Miss| DB[(Database<br/>~100-500ms)]
    
    style L1 fill:#d4f4dd
    style L2 fill:#b3e6ff
    style L3 fill:#ffe6b3
    style L4 fill:#ffd4d4
    style DB fill:#ff9999

Each layer provides progressively slower access but broader coverage:

Browser Cache: Fastest, but only helps repeat visits by the same user on the same device. Your browser cached that Netflix thumbnail, so it loads instantly next time.

CDN Cache: Helps all users in a geographic region. Once one user in Tokyo loads an image, all Tokyo users benefit from the cached copy.

Application Cache (Redis/Memcached): Sits in your data center, shared across all application servers. This is your primary weapon against database load.

Database Query Cache: Some databases like MySQL have internal query caches, but these are being phased out in modern databases due to complexity and limited benefit.

Cache Eviction Policies

Your cache can’t store everything—it has limited memory. When it’s full and you need to cache something new, what do you remove? This decision is governed by eviction policies:

LRU (Least Recently Used): Remove the item that hasn’t been accessed for the longest time. Like cleaning out your closet—if you haven’t worn that shirt in two years, maybe it’s time to donate it. This is the most common policy because it exploits temporal locality—if you accessed something recently, you’ll probably access it again soon.

LFU (Least Frequently Used): Remove the item accessed least often. Good for scenarios where some data is consistently popular (trending topics, popular products), but can be unfair to new items that haven’t had a chance to accumulate access counts yet.

FIFO (First In, First Out): Remove the oldest item. Simple but not very smart—it doesn’t consider usage patterns at all.

Random Replacement: Randomly pick something to evict. Sounds dumb, but for some workloads, it performs surprisingly close to LRU with less overhead.
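
To make LRU concrete, here is a minimal in-memory sketch built on Python’s OrderedDict. Production caches such as Redis implement eviction internally, so this is purely illustrative.

from collections import OrderedDict

class LRUCache:
    """Tiny in-memory cache that evicts the least recently used key when full."""
    
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()
    
    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]
    
    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry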

Cache Strategies: When and How to Cache

Cache-Aside (Lazy Loading):

def get_user(user_id):
    # Try cache first
    user = cache.get(f"user:{user_id}")
    
    if user is None:  # Cache miss
        # Fetch from database
        user = database.query("SELECT * FROM users WHERE id = ?", user_id)
        
        # Store in cache for future requests
        cache.set(f"user:{user_id}", user, ttl=3600)  # 1 hour TTL
    
    return user

This is the most common pattern. The application checks the cache first. On a miss, it queries the database and populates the cache. It’s called “lazy” because data is only loaded into the cache when requested, not proactively.

Pros: Only requested data is cached (no wasted space), resilient to cache failures (system still works, just slower).

Cons: Initial request is slow (cache miss), potential for stale data, extra code complexity.

Write-Through:

def update_user(user_id, new_data):
    # Update database
    database.update("UPDATE users SET ... WHERE id = ?", user_id, new_data)
    
    # Immediately update cache
    cache.set(f"user:{user_id}", new_data, ttl=3600)

Every write goes to both the cache and the database. This ensures cache is always fresh but adds latency to writes.

Pros: Cache is always consistent with database, no stale reads.

Cons: Every write is slower (two operations), wasted writes if data is never read again.

Write-Behind (Write-Back):

Writes go to cache first, and the database is updated asynchronously later. This is fast for writes but risky—if the cache crashes before syncing to the database, you lose data.

Pros: Extremely fast writes, can batch multiple writes together for efficiency.

Cons: Risk of data loss, complex to implement, consistency challenges.
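
A minimal write-behind sketch, assuming an in-process queue and a background flusher thread, and reusing the placeholder cache and database objects from the examples above. Real implementations add batching, retries, and durability that this deliberately omits.

import queue
import threading

write_queue = queue.Queue()

def update_user(user_id, new_data):
    # Write to the cache immediately; the database write is deferred
    cache.set(f"user:{user_id}", new_data, ttl=3600)
    write_queue.put((user_id, new_data))

def flush_worker():
    # Background thread that drains queued writes into the database
    while True:
        user_id, data = write_queue.get()
        database.update("UPDATE users SET ... WHERE id = ?", user_id, data)
        write_queue.task_done()

threading.Thread(target=flush_worker, daemon=True).start()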

The Cache Invalidation Problem

Phil Karlton famously said, “There are only two hard things in Computer Science: cache invalidation and naming things.” He wasn’t wrong. When you update data in the database, how do you ensure the cache reflects the change?

Time-Based Invalidation (TTL): Simplest approach—every cached item expires after a fixed time. The downside is you might serve stale data for the entire TTL period.

Event-Based Invalidation: When you update the database, explicitly delete the corresponding cache entry. This ensures freshness but requires careful code coordination.

Write-Through: As mentioned earlier, update both database and cache simultaneously.
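
A minimal sketch of event-based invalidation paired with cache-aside, reusing the placeholder cache and database objects from earlier: the write path deletes the cached entry so the next read repopulates it from the database.

def update_user(user_id, new_data):
    # 1. Write to the source of truth first
    database.update("UPDATE users SET ... WHERE id = ?", user_id, new_data)
    
    # 2. Invalidate the cached copy; the next get_user() call repopulates it
    cache.delete(f"user:{user_id}")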

🎯 Pro Insight: In practice, most systems use a combination: cache-aside with TTL for reads, and explicit invalidation for critical writes. For example, user profile changes might invalidate the cache immediately, while less critical data uses TTL-based expiration.


4. Database Concepts: SQL vs NoSQL, Sharding, and Replication

SQL vs NoSQL: The Great Divide

This isn’t really a versus situation—they’re different tools for different jobs. Choosing between SQL and NoSQL is like choosing between a Swiss Army knife and a specialized tool. Sometimes you need the versatility; sometimes you need optimization for a specific task.

SQL (Relational Databases)

SQL databases like PostgreSQL, MySQL, and Oracle store data in tables with predefined schemas. Think of them as strict filing cabinets where every folder has the same structure:

Users Table:
| id (int) | name (varchar) | email (varchar) | created_at (timestamp) |
|----------|----------------|-----------------|------------------------|
| 1        | Alice          | alice@email.com | 2024-01-15 10:30:00   |
| 2        | Bob            | bob@email.com   | 2024-01-16 14:22:00   |

Strengths:

  • ACID compliance: Atomicity, Consistency, Isolation, Durability. Your bank account transfer either completes entirely or not at all—no half-transfers.
  • Complex queries: JOINs, aggregations, subqueries—SQL handles them elegantly.
  • Data integrity: Foreign keys, constraints, and transactions prevent data corruption.
  • Mature ecosystem: Decades of tooling, expertise, and optimization.

Weaknesses:

  • Schema rigidity: Adding a new column to a billion-row table can lock the table for hours.
  • Vertical scaling limitations: Eventually, you can’t just buy a bigger server.
  • Sharding complexity: Splitting data across multiple SQL servers is painful.

When to use SQL: Financial transactions, complex reporting, applications where data relationships are central (social networks with friends/followers), any scenario requiring strict consistency.

NoSQL Databases

NoSQL is an umbrella term for databases that don’t use the traditional relational model. There are several types:

Document Stores (MongoDB, CouchDB):

Store data as JSON-like documents:

{
  "_id": "user123",
  "name": "Alice",
  "email": "alice@email.com",
  "addresses": [
    {"type": "home", "city": "NYC"},
    {"type": "work", "city": "SF"}
  ],
  "preferences": {
    "theme": "dark",
    "notifications": true
  }
}

No rigid schema—each document can have different fields. This is like having folders where each one can contain different types of papers.

Key-Value Stores (Redis, DynamoDB):

Simplest model—just key-value pairs, like a giant HashMap:

user:123 -> {"name": "Alice", "email": "alice@email.com"}
session:xyz789 -> {"user_id": 123, "expires": 1735789200}

Column-Family Stores (Cassandra, HBase):

Optimized for writing and reading large amounts of data across distributed clusters. Think of it as tables that can have billions of rows and be split across hundreds of machines.

Graph Databases (Neo4j, Amazon Neptune):

Optimized for storing and querying relationships. Perfect for social networks, recommendation engines, fraud detection—anywhere connections between entities matter as much as the entities themselves.

graph LR
    Alice -->|FOLLOWS| Bob
    Alice -->|LIKES| Post1
    Bob -->|LIKES| Post1
    Bob -->|FOLLOWS| Charlie
    Charlie -->|WORKS_AT| Company1
    Alice -->|WORKS_AT| Company1
    
    style Alice fill:#ffcccc
    style Bob fill:#ccffcc
    style Charlie fill:#ccccff

NoSQL Strengths:

  • Horizontal scalability: Add more servers easily.
  • Flexible schema: No migrations needed when adding fields.
  • High performance for specific use cases: Redis can handle millions of operations per second.
  • Better for unstructured data: Logs, user-generated content, sensor data.

NoSQL Weaknesses:

  • Limited query flexibility: No complex JOINs across collections.
  • Eventual consistency: In distributed setups, reads might not reflect the latest write immediately.
  • Less mature transaction support: Though modern NoSQL databases are improving here.

When to use NoSQL: High-volume data with simple access patterns (time-series data, logs), flexible/evolving schemas, applications requiring massive horizontal scale (IoT devices, real-time analytics), caching layers.

Database Sharding: Horizontal Scaling

Sharding is the process of splitting your database across multiple machines. Imagine you have a library with 10 million books. One building can’t hold them all, so you create multiple library branches, each holding a subset of books.

flowchart TD
    App[Application] --> Router{Shard Router}
    Router -->|Users A-M| Shard1[(Shard 1<br/>Users A-M)]
    Router -->|Users N-Z| Shard2[(Shard 2<br/>Users N-Z)]
    Router -->|Hash-based| Shard3[(Shard 3<br/>Hash 0-999)]
    Router -->|Hash-based| Shard4[(Shard 4<br/>Hash 1000-1999)]
    
    style Shard1 fill:#ffe6e6
    style Shard2 fill:#e6f3ff
    style Shard3 fill:#e6ffe6
    style Shard4 fill:#fff0e6

Sharding Strategies:

Range-Based Sharding: Split by ranges. Users A-M go to Shard 1, N-Z to Shard 2. Simple but can create hotspots if data isn’t evenly distributed (what if most users have names starting with ‘S’?).

Hash-Based Sharding: Compute a hash of the shard key (like user ID) and use modulo to determine the shard: shard = hash(user_id) % num_shards. This distributes data evenly but makes range queries difficult (finding all users in a city requires checking all shards).

Geographic Sharding: EU users in EU shard, US users in US shard. Great for latency and compliance (GDPR requires EU data to stay in EU), but some features requiring global data become complex.

Directory-Based Sharding: Maintain a lookup table that maps keys to shards. Flexible but the lookup table itself becomes a potential bottleneck.
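
To illustrate the hash-based strategy in code, here is a minimal routing sketch; the shard hostnames are hypothetical. Notice that changing NUM_SHARDS remaps almost every key, which is exactly why resharding (discussed below) is painful unless you use techniques like consistent hashing.

import hashlib

NUM_SHARDS = 4
SHARDS = [f"db-shard-{i}.internal" for i in range(NUM_SHARDS)]  # hypothetical hosts

def shard_for(user_id):
    # Stable hash so the same user always lands on the same shard
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % NUM_SHARDS]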

Challenges with Sharding:

  • Cross-shard queries: Finding all orders for a product when orders are sharded by user ID requires querying all shards.
  • Resharding: If you need to add more shards, you must redistribute data—a massive undertaking.
  • Transaction complexity: Distributed transactions across shards are slow and complex.

🚨 Common Mistake: Candidates often suggest sharding too early. Sharding adds enormous complexity. Modern single-server databases can handle terabytes of data. Only shard when a single database genuinely can’t handle your load, which is much later than most people think.

Database Replication: Reliability and Read Scaling

Replication means maintaining copies of your database on multiple servers. It’s like having backup keys for your house—if you lose one, you’re not locked out.

sequenceDiagram
    participant App
    participant Primary DB
    participant Replica 1
    participant Replica 2
    
    App->>Primary DB: WRITE: Update user
    Primary DB->>Replica 1: Replicate changes
    Primary DB->>Replica 2: Replicate changes
    Primary DB-->>App: Write confirmed
    
    Note over Replica 1,Replica 2: Replication lag: 0-100ms
    
    App->>Replica 1: READ: Get user
    Replica 1-->>App: Return user data
    
    App->>Replica 2: READ: Get user
    Replica 2-->>App: Return user data

Primary-Replica (Master-Slave) Replication:

One database is the primary (master), and others are replicas (slaves). All writes go to the primary, which then replicates changes to replicas. Reads can be distributed across replicas, reducing load on the primary.

Advantages:

  • Read scaling: Add more replicas to handle more read traffic.
  • Backup: If the primary fails, promote a replica to primary.
  • Analytics: Run heavy analytical queries on a replica without affecting production traffic.

Challenges:

  • Replication lag: Replicas might be slightly behind the primary (0-100ms typically). If you write data and immediately read from a replica, you might not see your write yet.
  • Write bottleneck: All writes still go to one server.
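
To make the read-scaling idea concrete, here is a minimal read/write splitting sketch at the application layer. It assumes placeholder connection objects primary_db and replica_dbs (a list of replica connections); real drivers and proxies do this more robustly.

import random

def run_query(sql, params=()):
    # Crude routing: statements that modify data go to the primary,
    # everything else goes to a randomly chosen read replica
    if sql.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
        return primary_db.execute(sql, params)
    return random.choice(replica_dbs).execute(sql, params)

Because of replication lag, a read issued immediately after a write may not see it; one common mitigation is to route a user’s reads to the primary for a few seconds after they write (see read-after-write consistency later in this guide).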

Multi-Primary (Master-Master) Replication:

Multiple databases can accept writes, each replicating to the others. More complex but eliminates the single write bottleneck.

Advantages:

  • Write scaling: Distribute writes across multiple primaries.
  • Geographic distribution: EU users write to EU primary, US users to US primary.

Challenges:

  • Conflict resolution: What if two users update the same record on different primaries simultaneously? Last-write-wins? Custom merge logic? This gets complicated fast.
  • Increased complexity: Much harder to reason about consistency.

5. API Gateway: The Front Door

An API Gateway is the single entry point for all client requests to your backend microservices. Think of it as a hotel lobby—you don’t go directly to hotel rooms; you go through the lobby where the staff can direct you, handle your check-in, verify you’re a guest, etc.

What Problems Does It Solve?

Without an API Gateway:

flowchart LR
    Mobile[Mobile App] --> Auth[Auth Service]
    Mobile --> User[User Service]
    Mobile --> Order[Order Service]
    Mobile --> Payment[Payment Service]
    
    Web[Web App] --> Auth
    Web --> User
    Web --> Order
    Web --> Payment
    
    Partner[Partner API] --> Auth
    Partner --> User
    Partner --> Order
    Partner --> Payment

Every client must know about every service, handle authentication separately with each service, and deal with different protocols/formats. It’s chaos.

With an API Gateway:

flowchart LR
    Mobile[Mobile App] --> Gateway[API Gateway]
    Web[Web App] --> Gateway
    Partner[Partner API] --> Gateway
    
    Gateway --> Auth[Auth Service]
    Gateway --> User[User Service]
    Gateway --> Order[Order Service]
    Gateway --> Payment[Payment Service]
    
    style Gateway fill:#ffd700

One entry point. The gateway handles routing, authentication, rate limiting, and more.

Key Responsibilities

Request Routing: The gateway examines the request path and routes it to the appropriate backend service:

  • /api/auth/* → Auth Service
  • /api/users/* → User Service
  • /api/orders/* → Order Service

Authentication & Authorization: Instead of every service implementing authentication, the gateway handles it once:

def handle_request(request):
    # Validate JWT token
    token = request.headers.get('Authorization')
    user = validate_jwt(token)
    
    if not user:
        return {"error": "Unauthorized"}, 401
    
    # Add user context to request for downstream services
    request.headers['X-User-ID'] = user.id
    request.headers['X-User-Role'] = user.role
    
    # Route to appropriate service
    return route_to_service(request)

Rate Limiting: Prevent abuse by limiting requests per user/IP:

def check_rate_limit(user_id):
    key = f"rate_limit:{user_id}"
    current = redis.incr(key)
    
    if current == 1:
        redis.expire(key, 60)  # 60 second window
    
    if current > 100:  # Max 100 requests per minute
        raise RateLimitExceeded()

Request/Response Transformation: Convert between different formats. Maybe your mobile app needs JSON, but your legacy service returns XML.

Caching: Cache responses at the gateway level to avoid hitting backend services:

def get_user_profile(user_id):
    cache_key = f"profile:{user_id}"
    cached = cache.get(cache_key)
    
    if cached:
        return cached
    
    # Fetch from User Service
    profile = user_service.get_profile(user_id)
    cache.set(cache_key, profile, ttl=300)  # 5 minutes
    
    return profile

Service Discovery & Load Balancing: The gateway knows where your services are running and can load balance between multiple instances.

Monitoring & Logging: Centralized place to log all requests, measure latency, track errors—crucial for observability.

API Gateway vs Load Balancer

This confuses many candidates. Here’s the distinction:

Load Balancer: Works at network/transport layer, distributes traffic across multiple identical instances of the same service. Doesn’t understand application logic.

API Gateway: Works at application layer, routes different requests to different services based on path/headers, understands HTTP, can transform requests, apply business logic.

You often use both: API Gateway → Load Balancer → Service Instances.

💡 Interview Tip: When discussing API Gateway, mention potential bottlenecks. The gateway is a single point of failure and can become a performance bottleneck. In practice, you run multiple gateway instances behind a load balancer, just like any other service.


6. Message Queues: Asynchronous Communication

Message queues enable asynchronous communication between services. Instead of Service A calling Service B directly and waiting for a response, Service A puts a message in a queue, and Service B processes it whenever it’s ready.

The Restaurant Analogy

Think of a restaurant:

Synchronous (without queue): You order food, then stand at the kitchen door waiting until the chef finishes cooking. You can’t do anything else. If the chef is slow or the kitchen is busy, you just wait. This blocks you and creates congestion.

Asynchronous (with queue): You order food, get a number, and sit down. The kitchen has a queue of orders. Chefs work through them at their own pace. When your food is ready, they call your number. Meanwhile, you can browse your phone, talk to friends, or relax. The system is decoupled.

sequenceDiagram
    participant Client
    participant API Server
    participant Queue
    participant Worker
    participant Email Service
    
    Client->>API Server: POST /register
    API Server->>Database: Create user account
    Database-->>API Server: User created
    API Server->>Queue: Enqueue welcome email job
    API Server-->>Client: 200 OK (Registration successful)
    
    Note over Client: User sees success<br/>immediately
    
    Queue->>Worker: Dequeue welcome email job
    Worker->>Email Service: Send welcome email
    Email Service-->>Worker: Email sent
    Worker-->>Queue: Job completed

Why Use Message Queues?

Decoupling: Services don’t need to know about each other. The producer doesn’t care who consumes the message; the consumer doesn’t care who produced it. This is huge for system flexibility.

Handling Load Spikes: During Black Friday, your e-commerce site might receive 10,000 orders per minute, but your payment processor can only handle 1,000 per minute. Without a queue, 9,000 requests fail. With a queue, they all queue up, and workers process them as fast as possible. Users see “Order received” immediately; actual payment happens within a few minutes.

Reliability: If a worker crashes while processing a message, the message queue can requeue it for another worker. Without a queue, the request is just lost.

Async Processing: Some tasks are slow and don’t need to block the user. When you upload a video to YouTube, you see “Upload successful” immediately, but transcoding (generating different resolutions) happens asynchronously. Your browser doesn’t sit there waiting 20 minutes for transcoding.

Retry Logic: Messages can be retried with exponential backoff if processing fails. This is harder to implement with direct service-to-service calls.
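
To ground the pattern before looking at specific brokers, here is a minimal producer/worker sketch using a Redis list as the queue (LPUSH to enqueue, blocking BRPOP to dequeue). The queue name, job shape, and send_email helper are assumptions; a dedicated broker adds acknowledgments, retries, and dead-lettering on top of this idea.

import json
import redis

r = redis.Redis()

def enqueue_welcome_email(user_id, email):
    # Producer: push the job and return immediately
    job = {"type": "welcome_email", "user_id": user_id, "email": email}
    r.lpush("email_jobs", json.dumps(job))

def worker_loop():
    # Consumer: block until a job arrives, then process it
    while True:
        _, raw = r.brpop("email_jobs")
        job = json.loads(raw)
        send_email(job["email"], template="welcome")  # hypothetical email helper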

Popular Message Queue Systems

RabbitMQ: Full-featured message broker with complex routing capabilities. Supports multiple protocols, various exchange types (direct, topic, fanout), and message acknowledgments.

Apache Kafka: Not just a message queue—it’s a distributed event streaming platform. Designed for high throughput (millions of messages per second), durability, and replay capability. Used for event sourcing, log aggregation, real-time analytics.

Amazon SQS: Managed queue service by AWS. Simple, scalable, but fewer features than RabbitMQ or Kafka. Great if you want something that just works without operational overhead.

Redis (as a queue): Can function as a simple queue using lists/streams. Fast, but less reliable than dedicated message queues (data might be lost if Redis crashes).

Queue Patterns

Work Queue: Multiple workers consume from a single queue. Each message is processed by exactly one worker. Used for distributing tasks among workers.

Pub/Sub (Publish/Subscribe): Messages are broadcasted to multiple subscribers. When a new user registers, you might want to:

  • Send a welcome email (Email Service subscribes)
  • Update analytics (Analytics Service subscribes)
  • Create a sample project (Onboarding Service subscribes)

All three services subscribe to the “user_registered” event. One event, multiple actions.

graph LR
    Publisher[Order Service] -->|order.created| Exchange{Message Exchange}
    Exchange -->|Route| Q1[Email Queue]
    Exchange -->|Route| Q2[Analytics Queue]
    Exchange -->|Route| Q3[Inventory Queue]
    
    Q1 --> W1[Email Worker]
    Q2 --> W2[Analytics Worker]
    Q3 --> W3[Inventory Worker]
    
    style Exchange fill:#ffd700

Dead Letter Queue: If a message fails processing repeatedly (e.g., malformed data, external service down), after N attempts, move it to a “dead letter queue” for manual inspection. This prevents one bad message from blocking the entire queue.

Message Queue Guarantees

At-most-once delivery: Message might be lost but never delivered twice. Fast but unreliable.

At-least-once delivery: Message definitely gets delivered, but might be delivered multiple times if the worker crashes after processing but before acknowledging. Workers must be idempotent (handling the same message twice produces the same result).

Exactly-once delivery: Holy grail, but expensive and complex to achieve. Kafka supports it with significant overhead.
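
Because at-least-once delivery can hand the same message to a worker twice, consumers need to deduplicate. A minimal sketch, assuming each message carries a unique message_id and using a Redis SET with NX (set only if absent) as the deduplication record; process() stands in for the actual business logic.

def handle_message(message):
    # SET NX succeeds only the first time this message_id is seen
    first_time = redis.set(f"processed:{message['message_id']}", 1, nx=True, ex=86400)
    if not first_time:
        return  # duplicate delivery: already handled, safely ignore
    
    process(message)  # hypothetical business logic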

🎯 Pro Insight: In interviews, when you suggest using a message queue, always discuss what happens if messages pile up faster than workers can process them. Options include:

  • Adding more workers (horizontal scaling)
  • Prioritizing critical messages
  • Throttling producers
  • Setting queue size limits with backpressure

7. CAP Theorem: The Fundamental Trade-off

CAP Theorem is one of the most important concepts in distributed systems, yet it’s frequently misunderstood. Let me break it down clearly.

The Three Properties

Consistency (C): Every read receives the most recent write. If I update my profile picture, every subsequent read (from any server, anywhere) should see the new picture, not the old one. It’s like having one source of truth that everyone sees identically.

Availability (A): Every request receives a response (success or failure), without guarantee that it contains the most recent write. The system is always responsive, even if some nodes are down.

Partition Tolerance (P): The system continues operating despite arbitrary message loss or failure of part of the system. Network partitions happen—cables get cut, switches fail, data centers go down. The system must handle these scenarios.

The Theorem: Pick Two

CAP Theorem states you can only achieve two of these three properties simultaneously in a distributed system. But here’s the crucial point that most candidates miss: partition tolerance is not optional. Network failures happen in the real world. You can’t just assume perfect network connectivity. So in practice, the choice is really between:

  • CP (Consistency + Partition Tolerance): Sacrifice availability during partitions
  • AP (Availability + Partition Tolerance): Sacrifice consistency during partitions

graph TD
    CAP[CAP Theorem]
    CAP --> CP[CP Systems<br/>Consistent + Partition Tolerant]
    CAP --> AP[AP Systems<br/>Available + Partition Tolerant]
    
    CP --> CP_Ex["Examples:<br/>• MongoDB (with default settings)<br/>• HBase<br/>• Redis Cluster<br/>• ZooKeeper"]
    
    AP --> AP_Ex["Examples:<br/>• Cassandra<br/>• DynamoDB<br/>• Couchbase<br/>• Riak"]
    
    style CP fill:#ffcccc
    style AP fill:#ccffcc

Real-World Example: Banking vs Social Media

Banking (needs CP):

Imagine you have $100 in your account. You try to withdraw $100 from two different ATMs simultaneously. A CP system ensures that only one withdrawal succeeds—even if that means one ATM shows an error message (sacrificing availability). Consistency is non-negotiable; you can’t create money out of thin air.

If there’s a network partition between your bank’s data centers, the system might refuse some operations to avoid inconsistency. You’d rather have an ATM say “Service temporarily unavailable” than have duplicate withdrawals.

Social Media (can be AP):

You post a photo on Instagram. Due to a network partition, some users see your post immediately, while others don’t see it for a few seconds. That’s okay—eventual consistency is acceptable. Instagram prioritizes availability (users can always post and view content) over strict consistency.

If there’s a partition, both sides of the partition continue serving requests. They’ll reconcile differences once the partition heals. A few seconds of stale data doesn’t break the user experience.

Consistency Models: It’s a Spectrum

In reality, consistency isn’t binary. There are different levels:

Strong Consistency: What CAP’s “C” refers to. Every read sees the latest write. Requires coordination, which adds latency.

Eventual Consistency: Weaker guarantee. Given enough time without new updates, all replicas converge to the same value. Reads might return stale data temporarily.

Causal Consistency: If operation A causally preceded operation B, everyone sees A before B. But concurrent operations can be seen in different orders by different users. For example, in a chat app, everyone sees messages in the same order within a conversation, but messages across different conversations might arrive in varying orders.

Read-Your-Writes Consistency: After you write something, you immediately see your write in subsequent reads, but others might see stale data temporarily. For example, you update your profile picture and immediately see the new one, but your friends might still see the old picture for a few seconds.

🚨 Common Mistake: Candidates often say “I’ll use a CP system” without considering the implications. In interviews, discuss the trade-offs: “For financial transactions, I’d choose a CP system like PostgreSQL to ensure consistency, even if it means some requests might timeout during network issues. For a social media feed, I’d choose an AP system like Cassandra for better availability, accepting that users might see slightly stale data.”


8. Microservices vs Monolith: Architectural Decisions

This debate is central to modern system design. Let’s explore both approaches without bias, because both are valid depending on context.

Monolithic Architecture

A monolith is a single deployable unit—one codebase, one database, one application. Think of it as a house where all rooms are connected internally.

graph TD
    LB[Load Balancer] --> M1[Monolith Instance 1]
    LB --> M2[Monolith Instance 2]
    LB --> M3[Monolith Instance 3]
    
    M1 --> DB[(Shared Database)]
    M2 --> DB
    M3 --> DB
    
    subgraph "Monolith Application"
        M1
        M2
        M3
        Auth[Auth Module]
        User[User Module]
        Order[Order Module]
        Payment[Payment Module]
    end
    
    style DB fill:#b3e0ff

Advantages:

Simplicity: One codebase to understand, one deployment pipeline, one thing to monitor. For small teams, this is a huge advantage.

Performance: No network calls between components. Calling a function in the same process is microseconds; calling another microservice is milliseconds. 1000x difference.

Transactions: Easy to maintain ACID properties across the entire application. If you need to update users, orders, and inventory together, just wrap it in a database transaction.

Easier to develop locally: Run the entire application on your laptop. No orchestrating 15 microservices.

Disadvantages:

Scaling limitations: If your order processing is slow but user management is fast, you can’t scale them independently—you have to scale the entire monolith.

Deployment risk: Deploying any change requires deploying the entire application. A bug in a minor feature can take down the whole system.

Technology lock-in: The entire application is in one language/framework. Can’t easily use Python for machine learning and Java for business logic.

Coordination overhead: As teams grow, multiple developers working on the same codebase leads to merge conflicts and coordination overhead.

Microservices Architecture

Decompose the application into small, independently deployable services, each responsible for a specific business capability.

graph TD
    Gateway[API Gateway] --> Auth[Auth Service]
    Gateway --> User[User Service]
    Gateway --> Order[Order Service]
    Gateway --> Payment[Payment Service]
    
    Auth --> DB1[(Auth DB)]
    User --> DB2[(User DB)]
    Order --> DB3[(Order DB)]
    Order --> Queue[Message Queue]
    Queue --> Inventory[Inventory Service]
    Payment --> DB4[(Payment DB)]
    Payment --> External[External Payment API]
    
    style Gateway fill:#ffd700
    style Queue fill:#ff99cc

Advantages:

Independent scaling: Scale the order service to 50 instances during Black Friday while keeping user service at 5 instances.

Technology diversity: Use the best tool for each job. Use Go for high-performance services, Python for ML services, Node.js for real-time services.

Fault isolation: If the recommendation service crashes, the rest of the application continues working. In a monolith, one bug can bring down everything.

Organizational scalability: Different teams can own different services. The payment team doesn’t need to understand the recommendation algorithm.

Independent deployment: Deploy the order service ten times a day without touching the user service. This enables rapid iteration.

Disadvantages:

Operational complexity: Instead of monitoring one application, you’re monitoring 20+ services, each with its own logs, metrics, and failure modes. You need sophisticated DevOps practices, service mesh, distributed tracing, centralized logging…

Network latency: Every inter-service communication is a network call. What was a function call (microseconds) is now an HTTP request (milliseconds). This adds up fast.

Data consistency challenges: No single database to maintain ACID properties. If you need to update user, order, and inventory atomically, you need distributed transactions (complex) or eventual consistency with compensation logic (even more complex).

Testing complexity: How do you test the order service when it depends on user, inventory, and payment services? You need integration tests, contract testing, and more sophisticated test strategies.

Debugging nightmares: A request spans 5 services. Which one is slow? Where did the error originate? You need distributed tracing (Jaeger, Zipkin) to follow request flows.

When to Choose Each

Start with a monolith if:

  • Small team (< 10 developers)
  • MVP or early-stage product
  • Unclear requirements (microservices premature when domain boundaries are unclear)
  • Don’t have DevOps expertise for managing distributed systems

Move to microservices when:

  • Clear domain boundaries have emerged
  • Different parts of the system have very different scaling needs
  • Multiple teams working independently
  • Parts of the system need different technologies
  • Deployment frequency is constrained by coupling

🎯 Pro Insight: In interviews, many candidates immediately jump to microservices for every problem. This reveals inexperience. Microservices are not a silver bullet—they trade complexity for flexibility. A better answer: “I’d start with a modular monolith with clear boundaries. If we need to extract certain services later due to scaling or organizational needs, the clear boundaries make that migration easier.” This shows pragmatism and understanding of trade-offs.


9. Rate Limiting: Protecting Your System

Rate limiting is the practice of controlling how many requests a user or client can make to your API within a given time window. It’s like a bouncer at a nightclub—only let in a certain number of people per hour to avoid overcrowding.

Why Rate Limit?

Prevent abuse: Without rate limiting, a malicious user or bot can overwhelm your system with requests, causing a denial-of-service. Even accidental abuse matters—a buggy client stuck in a retry loop can take down your API.

Fair resource allocation: Ensure one heavy user doesn’t consume all resources, degrading experience for others.

Cost control: Many third-party APIs charge per request. Rate limiting your users protects your budget.

Manage scale: If your downstream database can handle 10,000 queries per second, you need to ensure incoming traffic doesn’t exceed that.

Rate Limiting Algorithms

Fixed Window:

Simplest approach. Count requests in fixed time windows.

def is_allowed_fixed_window(user_id):
    current_minute = int(time.time() / 60)  # Current minute
    key = f"rate_limit:{user_id}:{current_minute}"
    
    count = redis.incr(key)
    redis.expire(key, 60)  # Expire after 60 seconds
    
    if count > 100:  # Max 100 requests per minute
        return False
    return True

Problem: Boundary issue. If a user makes 100 requests at 10:59:59 and another 100 at 11:00:01, they’ve made 200 requests in 2 seconds, even though both are within limits for their respective windows.

Sliding Window Log:

Keep a log of all request timestamps. Count requests in the past N seconds.

def is_allowed_sliding_window_log(user_id):
    now = time.time()
    window = 60  # 60 second window
    max_requests = 100
    
    key = f"rate_limit:{user_id}"
    
    # Remove old timestamps
    redis.zremrangebyscore(key, 0, now - window)
    
    # Count current requests in window
    current_count = redis.zcard(key)
    
    if current_count >= max_requests:
        return False
    
    # Add current timestamp
    redis.zadd(key, {str(now): now})
    redis.expire(key, window)
    
    return True

Pros: Accurate, no boundary issues.

Cons: Memory intensive—you’re storing timestamps of every request.

Sliding Window Counter:

Hybrid approach. Estimate current window based on previous and current fixed windows.

Current rate = (prev_window_count * overlap) + current_window_count

Where overlap = (window_size - time_elapsed_in_current_window) / window_size

Pros: Memory efficient like fixed window, more accurate like sliding log.

Cons: Slightly more complex to implement.
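
A minimal sketch of the sliding window counter estimate above, keeping one counter per time window in Redis (following the bare redis client convention used in the other examples); the limit, window size, and key names are illustrative.

import time

def is_allowed_sliding_window_counter(user_id, limit=100, window=60):
    now = time.time()
    current_window = int(now // window)
    elapsed = now - current_window * window
    
    current_count = int(redis.get(f"rate:{user_id}:{current_window}") or 0)
    prev_count = int(redis.get(f"rate:{user_id}:{current_window - 1}") or 0)
    
    # Weight the previous window by how much of it still overlaps the sliding window
    overlap = (window - elapsed) / window
    if prev_count * overlap + current_count >= limit:
        return False
    
    redis.incr(f"rate:{user_id}:{current_window}")
    redis.expire(f"rate:{user_id}:{current_window}", window * 2)
    return True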

Token Bucket:

Start with a bucket of tokens. Each request consumes a token. Tokens refill at a fixed rate. If no tokens available, request is rejected.

def is_allowed_token_bucket(user_id):
    capacity = 100  # Max tokens
    refill_rate = 10  # Tokens per second
    
    key = f"rate_limit:{user_id}"
    
    # Get current bucket state
    current = redis.get(key)
    
    if current is None:
        tokens = capacity
        last_refill = time.time()
    else:
        tokens, last_refill = map(float, current.split(':'))
        
        # Refill tokens based on time elapsed
        now = time.time()
        elapsed = now - last_refill
        tokens_to_add = elapsed * refill_rate
        tokens = min(capacity, tokens + tokens_to_add)
        last_refill = now
    
    if tokens < 1:
        return False
    
    # Consume one token
    tokens -= 1
    redis.set(key, f"{tokens}:{last_refill}", ex=3600)
    
    return True

Pros: Allows bursts (if tokens have accumulated), smooth rate limiting.

Cons: More complex to understand and implement.

Leaky Bucket:

Requests are added to a queue (bucket). The queue is processed at a constant rate. If the queue is full, new requests are rejected.

Pros: Smooths out bursts, constant outgoing rate.

Cons: Can add latency (requests wait in queue), less flexible than token bucket.

Distributed Rate Limiting

In a distributed system with multiple API servers, you need shared state. That’s where Redis comes in:

flowchart TD
    Client --> Server1[API Server 1]
    Client --> Server2[API Server 2]
    Client --> Server3[API Server 3]
    
    Server1 --> Redis[(Redis<br/>Shared Rate Limit State)]
    Server2 --> Redis
    Server3 --> Redis
    
    Redis --> Decision{Rate Limit<br/>Exceeded?}
    Decision -->|No| Allow[Allow Request]
    Decision -->|Yes| Reject[Reject: 429<br/>Too Many Requests]

Every server checks the shared Redis store before allowing a request. This ensures consistency across all servers.

💡 Interview Tip: When discussing rate limiting, mention different levels:

  • User-level: Limit per user/account
  • IP-level: Limit per IP address (catch anonymous abusers)
  • Global: Limit total traffic to protect backend
  • Per-endpoint: Some endpoints are expensive (like search); limit them more aggressively

10. Consistency Patterns: Strong vs Eventual

This is about data consistency in distributed systems—a common interview topic that confuses many candidates.

Strong Consistency

After a write completes, all subsequent reads (from any replica, anywhere) return the new value. It’s like updating a whiteboard—once you erase something and write something new, everyone looking at the whiteboard sees the new content immediately.

Implementation: Typically achieved through synchronous replication. The write doesn’t complete until all replicas have been updated.

sequenceDiagram
    participant Client
    participant Primary
    participant Replica1
    participant Replica2
    
    Client->>Primary: Write: balance = $500
    Primary->>Replica1: Replicate write
    Primary->>Replica2: Replicate write
    Replica1-->>Primary: Ack
    Replica2-->>Primary: Ack
    Primary-->>Client: Write confirmed
    
    Note over Primary,Replica2: All replicas updated<br/>before confirming to client
    
    Client->>Replica1: Read balance
    Replica1-->>Client: $500 (consistent)

Use cases: Banking, inventory management, booking systems—anywhere inconsistency causes real problems.

Trade-offs: Higher latency (must wait for all replicas), lower availability (if a replica is down, writes might fail).

Eventual Consistency

After a write, replicas might temporarily have different values, but given enough time without new writes, all replicas converge to the same value. It’s like gossip in an office—information spreads gradually, but eventually everyone knows.

Implementation: Asynchronous replication. Write completes on primary immediately; replicas catch up later.

sequenceDiagram
    participant Client
    participant Primary
    participant Replica1
    participant Replica2
    
    Client->>Primary: Write: New post
    Primary-->>Client: Write confirmed (immediately)
    
    Note over Primary: Asynchronous replication
    
    Primary->>Replica1: Replicate write
    Primary->>Replica2: Replicate write
    
    Note over Replica1,Replica2: 10-100ms delay
    
    Client->>Replica1: Read posts
    Replica1-->>Client: Old data (temporarily)
    
    Note over Replica1,Replica2: After replication completes...
    
    Client->>Replica1: Read posts
    Replica1-->>Client: New post (eventually consistent)

Use cases: Social media feeds, content delivery, analytics—where slight staleness is acceptable.

Trade-offs: Lower latency, higher availability, but potential for stale reads.

Read-After-Write Consistency

A middle ground. After you write something, you immediately see your own writes, but others might see stale data temporarily. This is common in social media—when you post a tweet, you see it immediately, but it might take a few seconds to appear for your followers.

Implementation:

  • Route reads to the primary for a short period after the user writes
  • Use client-side timestamps to track writes
  • Use sticky sessions (always route a user’s requests to the same server)
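
A minimal sketch of the first option, routing a user’s reads to the primary for a short window after they write. The 10-second window, the last_write tracking key, and the primary_db/replica_dbs placeholders are assumptions for illustration.

import random
import time

READ_YOUR_WRITES_WINDOW = 10  # seconds; tune to the observed replication lag

def record_write(user_id):
    # Call this after every write by the user
    redis.set(f"last_write:{user_id}", time.time(), ex=READ_YOUR_WRITES_WINDOW)

def choose_read_db(user_id):
    # Users who wrote recently read from the primary so they see their own writes
    if redis.get(f"last_write:{user_id}") is not None:
        return primary_db
    return random.choice(replica_dbs)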

🚨 Common Mistake: Candidates confuse “eventual consistency” with “inconsistent.” Eventual consistency is still consistent—it just takes time to propagate. The system doesn’t have conflicting values permanently; it temporarily lags but ultimately converges.


Bringing It All Together: A Complete System

Let’s design a simplified Twitter-like system to see how these concepts integrate:

graph TD
    Users[Users] --> CDN[CDN<br/>Static Assets]
    Users --> LB[Load Balancer]
    
    LB --> API1[API Server 1]
    LB --> API2[API Server 2]
    
    API1 --> Cache[(Redis Cache<br/>User feeds)]
    API2 --> Cache
    
    API1 --> TweetDB[(Tweet Database<br/>Cassandra - AP)]
    API2 --> TweetDB
    
    API1 --> UserDB[(User Database<br/>PostgreSQL - CP)]
    API2 --> UserDB
    
    API1 --> MQ[Message Queue<br/>Kafka]
    API2 --> MQ
    
    MQ --> FeedWorker[Feed Generation<br/>Workers]
    MQ --> NotificationWorker[Notification<br/>Workers]
    
    FeedWorker --> Cache
    FeedWorker --> TweetDB
    
    NotificationWorker --> NotificationDB[(Notifications<br/>PostgreSQL)]
    
    TweetDB --> S1[(Shard 1<br/>Users 0-999M)]
    TweetDB --> S2[(Shard 2<br/>Users 1B-1.999B)]
    
    UserDB --> Replica1[(Read Replica 1)]
    UserDB --> Replica2[(Read Replica 2)]

Component breakdown:

  1. CDN: Serves static assets (images, profile pictures, CSS, JS)
  2. Load Balancer: Distributes traffic across API servers
  3. API Servers: Handle requests, enforce rate limiting, authenticate users
  4. Redis Cache: Stores pre-computed user feeds for fast access
  5. Cassandra (Tweets): AP system for high write throughput, eventual consistency okay for tweets
  6. PostgreSQL (Users): CP system for strong consistency on user accounts/authentication
  7. Kafka: Message queue for asynchronous operations
  8. Workers: Process async tasks like generating feeds, sending notifications
  9. Sharding: Tweets sharded by user ID to handle billions of tweets
  10. Replication: User database replicated for read scaling

Request flows:

Posting a tweet:

  1. User sends POST request through load balancer to API server
  2. API server validates authentication, checks rate limit
  3. Write tweet to Cassandra (eventually consistent, high throughput)
  4. Publish event to Kafka: “tweet_created”
  5. Return success to user immediately (async processing happens later)
  6. Workers consume Kafka events, generate feeds for followers, cache in Redis

Reading timeline:

  1. User requests a timeline through the load balancer to the API server
  2. API server checks Redis cache for pre-computed feed
  3. If cache hit, return immediately (5ms latency)
  4. If cache miss, query Cassandra for recent tweets, cache result, return (200ms latency)

This architecture handles millions of users, scales horizontally, tolerates failures, and optimizes for a read-heavy workload (typical for social media).


Interview Preparation: Answering Common Questions

Question: “Design a URL shortener like bit.ly”

Strong answer:

“I’d approach this in layers. First, let’s discuss the core functionality: we need to take long URLs, generate short codes, and redirect users quickly.

Database choice: I’d use a SQL database like PostgreSQL for storing URL mappings. We need strong consistency—if I generate a short code, no one else should get the same code. SQL’s ACID properties guarantee this. Schema would be simple: id, short_code, long_url, created_at, click_count.

Generating short codes: Base62 encoding (a-z, A-Z, 0-9) gives us 62^7 = 3.5 trillion possible codes with 7 characters. I’d use an auto-incrementing database ID, convert to base62. Example: ID 12345 becomes ‘dnh’. This is deterministic, collision-free, and doesn’t require checking for duplicates.
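
For reference, here is a minimal base62 encoder matching that scheme (alphabet a-z, then A-Z, then 0-9); it is a sketch, and other alphabet orderings are equally valid.

ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def encode_base62(n):
    # Convert an auto-incrementing integer ID into a short code
    if n == 0:
        return ALPHABET[0]
    code = []
    while n > 0:
        n, remainder = divmod(n, 62)
        code.append(ALPHABET[remainder])
    return "".join(reversed(code))

# encode_base62(12345) == "dnh"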

Scaling reads: URL redirects are read-heavy. I’d use Redis caching with the short code as the key. 99% of redirects would be cache hits (<1ms latency). Set a TTL based on access patterns—popular links cached longer.

Load balancing: Multiple API servers behind a load balancer, all reading from the same PostgreSQL primary and Redis cache. For higher read throughput, add PostgreSQL read replicas.

Sharding: Only needed at massive scale. If we have billions of URLs, shard by hash of short_code modulo number of shards.

Analytics: Track clicks asynchronously. When redirecting, publish a click event to Kafka. Background workers consume events, aggregate statistics, store in a separate analytics database (could use Cassandra for high write throughput).
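
A minimal sketch of that asynchronous click pipeline, assuming kafka-python (the topic name, consumer group, and in-memory aggregation are illustrative; periodically persisting the aggregates to the analytics store is omitted):

python

import json
from collections import Counter

from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-host:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def record_click(short_code):
    # Called from the redirect handler: fire-and-forget so the redirect stays fast
    producer.send("url_clicks", {"short_code": short_code})

def run_click_aggregator():
    # Background worker: consume click events and aggregate counts in memory;
    # in practice the counts would be flushed periodically to the analytics store
    consumer = KafkaConsumer(
        "url_clicks",
        bootstrap_servers=["kafka-host:9092"],
        group_id="click-aggregators",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    counts = Counter()
    for message in consumer:
        counts[message.value["short_code"]] += 1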

CDN: Ideally, cache redirect responses at CDN edge servers for the most popular links. A click from Tokyo for a popular link could be redirected directly from a Tokyo edge server without hitting our data center.

This design handles millions of redirects per second, provides <10ms latency for cached links, and scales horizontally.”

Question: “How would you design rate limiting for your API?”

Strong answer:

“I’d implement a multi-layered rate limiting strategy:

Layer 1: API Gateway level – Token bucket algorithm implemented using Redis. Each user gets 1000 tokens that refill at 100 per hour. This allows bursts (if you haven’t used your API in a while, you can make many requests quickly) while enforcing long-term limits.

Layer 2: Per-endpoint – Different endpoints have different limits. Read operations (GET requests) are cheaper, so allow 1000/hour. Write operations (POST/PUT/DELETE) are more expensive, so limit to 100/hour. Search endpoints are very expensive, so limit to 50/hour.

Layer 3: Global limits – Even when individual users are within their limits, apply a global throttle to protect the backend. If the database can handle 10K queries/second, ensure total incoming traffic doesn’t exceed that.

Implementation:

python

def check_rate_limit(user_id, endpoint):
    # token_bucket_check, get_endpoint_limit, and fixed_window_check are
    # Redis-backed helpers (one possible token_bucket_check is sketched below)

    # User-level check
    user_key = f'rate_limit:user:{user_id}'
    if not token_bucket_check(user_key, capacity=1000, rate=100/3600):
        return False, 'User rate limit exceeded'

    # Endpoint-level check
    endpoint_key = f'rate_limit:user:{user_id}:endpoint:{endpoint}'
    endpoint_limit = get_endpoint_limit(endpoint)
    if not token_bucket_check(endpoint_key, **endpoint_limit):
        return False, f'{endpoint} rate limit exceeded'

    # Global throttle
    global_key = 'rate_limit:global'
    if not fixed_window_check(global_key, limit=10000, window=1):
        return False, 'System capacity reached'

    return True, None
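
One possible Redis-backed sketch of the token_bucket_check helper used above (the key layout is illustrative, and the read-modify-write is deliberately left non-atomic for brevity; a production version would wrap it in a Lua script):

python

import time

import redis

r = redis.Redis(host="redis-host", port=6379, decode_responses=True)

def token_bucket_check(key, capacity, rate):
    """Sketch of a Redis-backed token bucket; `rate` is tokens added per second."""
    now = time.time()
    bucket = r.hgetall(key)  # {} if the key does not exist yet
    tokens = float(bucket.get("tokens", capacity))
    last_refill = float(bucket.get("last_refill", now))

    # Refill proportionally to elapsed time, never exceeding capacity
    tokens = min(capacity, tokens + (now - last_refill) * rate)

    allowed = tokens >= 1
    if allowed:
        tokens -= 1

    # NOTE: this read-modify-write is not atomic across API servers; a real
    # implementation would run the whole check inside a Redis Lua script
    r.hset(key, mapping={"tokens": tokens, "last_refill": now})
    return allowed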

Response headers: Include rate limit information in response headers:

X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 234
X-RateLimit-Reset: 1609459200

HTTP status: Return 429 Too Many Requests with a Retry-After header.

Different tiers: Premium users get higher limits. Implement this by checking user tier and applying different capacities.

Distributed consideration: Use Redis to share rate limit state across all API servers. This ensures consistency—a user can’t bypass limits by hitting different servers.

This approach balances fairness, protects backend resources, provides flexibility for different use cases, and communicates clearly with API consumers.”


Conclusion

System design interviews are fundamentally about demonstrating that you understand how to build software that scales, performs, and remains reliable under real-world conditions. The terms we’ve covered—load balancers, CDNs, caching, databases, API gateways, message queues, CAP theorem, microservices, rate limiting, and consistency patterns—are the vocabulary of distributed systems engineering.

But knowing these terms isn’t enough. What distinguishes strong candidates from average ones is the ability to:

  1. Explain trade-offs: Nothing is free. More availability might mean less consistency. Better performance might mean higher costs. Simpler architecture might mean scaling limitations. Always articulate what you’re gaining and what you’re sacrificing.
  2. Justify decisions: Don’t just say “I’ll use Redis for caching.” Say “I’ll use Redis for caching because we need sub-millisecond read latency, and the cache data is ephemeral—if Redis crashes, we can rebuild from the database. The performance gain justifies the operational complexity.”
  3. Consider real-world constraints: Theoretical perfection doesn’t exist. Systems must be built within budgets, timelines, and team capabilities. Sometimes the “worse” technical solution is the right choice given organizational constraints.
  4. Think about failure: Systems fail. Networks partition. Servers crash. Disks fill up. Good designs anticipate failures and handle them gracefully. Always ask yourself: “What breaks if X fails?”
  5. Start simple, evolve: Don’t jump to microservices and sharding on day one. Start with a monolith, add caching when you measure performance issues, shard when a single database genuinely can’t handle your load. Premature optimization is real.

Your Action Plan

To truly master these concepts:

  1. Build something: Theory only goes so far. Build a simple Twitter clone or URL shortener. Deploy it. Watch it break. Fix it. Scale it. You’ll learn more from one failed deployment than from reading ten blog posts.
  2. Study real architectures: Read engineering blogs from Netflix, Uber, Airbnb, Twitter. How did they evolve from small startups to handling billions of requests? What problems did they encounter? How did they solve them?
  3. Practice system design questions: Use platforms like educative.io’s “Grokking the System Design Interview” or pick problems from “System Design Interview” books by Alex Xu. Practice explaining your designs out loud—articulation matters as much as knowledge.
  4. Understand the why: Don’t memorize that “CAP theorem says you can only have two out of three.” Understand why this trade-off exists, when it matters, and how real systems navigate it.
  5. Learn from production: If you work on a production system, study its architecture. Why was this database chosen? Why this caching strategy? What has failed before, and how was it fixed? Production systems are the best teachers.

When you walk into your next system design interview and the interviewer asks, “Design Instagram,” you won’t panic. You’ll think: “Photo storage needs a CDN for global distribution. User authentication requires strong consistency, so PostgreSQL. Feeds are read-heavy and tolerate eventual consistency, so Cassandra with Redis caching. Image uploads should be async through a message queue…” The pieces fit together naturally because you understand not just the terms, but the principles behind them.

You’re now equipped with the vocabulary and mental models to have sophisticated conversations about distributed systems architecture. The difference between junior and senior engineers often comes down to this: juniors know how to code; seniors know how to architect systems that scale. You’re now firmly on the path to the latter.

The terms we’ve covered aren’t just interview preparation—they’re the foundation of building the next generation of internet-scale applications. Go build something amazing.


🎯 Final Interview Tip: When the interviewer asks, “Are there any improvements you’d make?” always have 2-3 ready: “We could add monitoring with Prometheus/Grafana for better observability. We could implement distributed tracing with Jaeger to debug cross-service issues. We could add automated failover for the database primary. We could implement chaos engineering to test failure scenarios.” This shows you’re thinking beyond the happy path and considering operational maturity.
