Overview

Design a key-value cache that saves the results of the most recent web server queries for a search engine. This problem explores cache design, LRU (Least Recently Used) eviction policies, distributed caching, and handling billions of queries per month.

Step 1: Use Cases and Constraints

Use Cases

In Scope

  • User sends a search request resulting in a cache hit
  • User sends a search request resulting in a cache miss
  • Service has high availability

Constraints and Assumptions

Assumptions:
  • Traffic is not evenly distributed
    • Popular queries should almost always be in cache
    • Need to determine how to expire/refresh
  • Serving from cache requires fast lookups
  • Low latency between machines
  • Limited memory in cache
    • Need to determine what to keep/remove
    • Need to cache millions of queries
  • 10 million users
  • 10 billion queries per month

Usage Calculations

The cache stores an ordered list of key: query, value: results. Size per entry:
  • query - 50 bytes
  • title - 20 bytes
  • snippet - 200 bytes
  • Total: 270 bytes
Storage:
  • 2.7 TB of cache data per month if all 10 billion queries are unique
    • 270 bytes × 10 billion queries per month
  • Limited memory requires expiration strategy
Throughput:
  • 4,000 requests per second
Conversion guide:
  • 2.5 million seconds per month
  • 1 request per second = 2.5 million requests per month
  • 40 requests per second = 100 million requests per month
  • 400 requests per second = 1 billion requests per month
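The back-of-the-envelope numbers above can be checked with a few lines of Python:

```python
SECONDS_PER_MONTH = 2.5 * 10**6   # ~30 days, rounded for estimation
QUERIES_PER_MONTH = 10 * 10**9    # 10 billion queries per month
BYTES_PER_ENTRY = 270             # query + title + snippet

# Throughput: queries per month spread across the seconds in a month
requests_per_second = QUERIES_PER_MONTH / SECONDS_PER_MONTH
print(requests_per_second)        # 4000.0

# Worst-case storage if every query is unique
storage_tb = QUERIES_PER_MONTH * BYTES_PER_ENTRY / 10**12
print(storage_tb)                 # 2.7 (TB per month)
```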

Step 2: High Level Design

Query Cache High Level Design

Step 3: Core Components

Use Case: User Search Results in Cache Hit

1. The Client sends a request to the Web Server, running as a reverse proxy
2. The Web Server forwards the request to the Query API server
3. The Query API server:
  • Parses query (remove markup, tokenize, fix typos, normalize, convert to boolean)
  • Checks Memory Cache for matching content
  • If cache hit:
    • Updates entry position to front of LRU list
    • Returns cached contents
  • If cache miss:
    • Uses Reverse Index Service to find matching documents
    • Uses Document Service to return titles and snippets
    • Updates Memory Cache with results at front of LRU list

Cache Implementation

The cache uses a doubly-linked list backed by a hash table for O(1) operations:
  • New items are added to the head
  • Expiring items are removed from the tail
  • The hash table provides fast lookups to the linked list nodes
class QueryApi(object):

    def __init__(self, memory_cache, reverse_index_service):
        self.memory_cache = memory_cache
        self.reverse_index_service = reverse_index_service

    def parse_query(self, query):
        """Remove markup, break text into terms, deal with typos,
        normalize capitalization, convert to use boolean operations.
        """
        ...

    def process_query(self, query):
        query = self.parse_query(query)
        results = self.memory_cache.get(query)
        if results is None:
            results = self.reverse_index_service.process_search(query)
            self.memory_cache.set(query, results)
        return results
class Node(object):

    def __init__(self, query, results):
        self.query = query
        self.results = results
        self.prev = None  # doubly-linked list pointers for LRU ordering
        self.next = None
class LinkedList(object):

    def __init__(self):
        self.head = None
        self.tail = None

    def _unlink(self, node):
        # Detach node, repairing its neighbors' (or head/tail) links
        if node.prev:
            node.prev.next = node.next
        else:
            self.head = node.next
        if node.next:
            node.next.prev = node.prev
        else:
            self.tail = node.prev

    def move_to_front(self, node):
        self._unlink(node)
        self.append_to_front(node)

    def append_to_front(self, node):
        node.prev, node.next = None, self.head
        if self.head:
            self.head.prev = node
        self.head = node
        if self.tail is None:
            self.tail = node

    def remove_from_tail(self):
        if self.tail is not None:
            self._unlink(self.tail)
class Cache(object):

    def __init__(self, MAX_SIZE):
        self.MAX_SIZE = MAX_SIZE
        self.size = 0
        self.lookup = {}  # key: query, value: node
        self.linked_list = LinkedList()

    def get(self, query):
        """Get the stored query result from the cache.

        Accessing a node updates its position to the front of the LRU list.
        """
        node = self.lookup.get(query)  # .get avoids a KeyError on a miss
        if node is None:
            return None
        self.linked_list.move_to_front(node)
        return node.results

    def set(self, query, results):
        """Set the result for the given query key in the cache.

        When updating an entry, updates its position to the front of the LRU list.
        If the entry is new and the cache is at capacity, removes the oldest entry
        before the new entry is added.
        """
        node = self.lookup.get(query)  # .get avoids a KeyError on a new key
        if node is not None:
            # Key exists in cache, update the value
            node.results = results
            self.linked_list.move_to_front(node)
        else:
            # Key does not exist in cache
            if self.size == self.MAX_SIZE:
                # Remove the oldest entry from the linked list and lookup
                self.lookup.pop(self.linked_list.tail.query, None)
                self.linked_list.remove_from_tail()
            else:
                self.size += 1
            # Add the new key and value
            new_node = Node(query, results)
            self.linked_list.append_to_front(new_node)
            self.lookup[query] = new_node
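For reference, the same LRU behavior can be sketched more compactly with Python's standard-library OrderedDict. This is a shortcut for experimentation, not a replacement for the explicit linked-list design above:

```python
from collections import OrderedDict

class LRUCache(object):
    """Compact LRU cache: front of the OrderedDict is most recent."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.cache = OrderedDict()

    def get(self, query):
        if query not in self.cache:
            return None
        self.cache.move_to_end(query, last=False)  # mark as most recent
        return self.cache[query]

    def set(self, query, results):
        if query not in self.cache and len(self.cache) == self.max_size:
            self.cache.popitem(last=True)  # evict the least recently used
        self.cache[query] = results
        self.cache.move_to_end(query, last=False)

cache = LRUCache(max_size=2)
cache.set('foo', ['result1'])
cache.set('bar', ['result2'])
cache.get('foo')               # 'foo' is now the most recent entry
cache.set('baz', ['result3'])  # evicts 'bar', the least recently used
print(cache.get('bar'))        # None
print(cache.get('foo'))        # ['result1']
```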

When to Update the Cache

Update cache when:
  • Page contents change
  • Page is removed or new page added
  • Page rank changes
TTL (Time To Live): The most straightforward approach is to set a maximum time that a cached entry can live before it must be refreshed.

Pattern: This describes the cache-aside pattern, where the application checks the cache first and falls back to the backing services on a miss.

Step 4: Scale the Design

Query Cache Scaled Design
Important: Take an iterative approach:
  1. Benchmark/Load Test
  2. Profile for bottlenecks
  3. Address bottlenecks
  4. Repeat

Scaling Components

DNS

Route users to nearest data center

Load Balancer

Distribute traffic across web servers

Web Servers

Horizontal scaling as reverse proxies

API Servers

Application layer for query processing

Memory Cache

Distributed caching with sharding strategies

Expanding Memory Cache to Multiple Machines

To handle heavy load and large memory requirements, scale the cache horizontally. There are three main options:

Option 1: Each machine in the cache cluster has its own independent cache
Pros:
  • Simple implementation
Cons:
  • Low cache hit rate
  • Same query might be cached on multiple machines

Option 2: Each machine in the cache cluster has a full copy of the cache
Pros:
  • Simple implementation
  • Higher hit rate than Option 1
Cons:
  • Inefficient use of memory
  • Memory constraints limit scalability

Option 3: Shard the cache across the machines in the cluster
Consistent Hashing: Distributes keys across cache servers while minimizing remapping when servers are added or removed.
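A consistent-hash ring can be sketched in a few lines; the server names and replica count below are illustrative. Each server is hashed at many points ("virtual nodes") on a ring, and a key is assigned to the first server clockwise from the key's hash:

```python
import bisect
import hashlib

class HashRing(object):
    """Minimal consistent-hash ring sketch (illustrative, not production)."""

    def __init__(self, servers, replicas=100):
        self.replicas = replicas  # virtual nodes smooth the distribution
        self.ring = []            # sorted list of (hash, server)
        for server in servers:
            self.add(server)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, server):
        for i in range(self.replicas):
            point = self._hash('%s:%d' % (server, i))
            bisect.insort(self.ring, (point, server))

    def get_server(self, key):
        idx = bisect.bisect(self.ring, (self._hash(key), ''))
        if idx == len(self.ring):
            idx = 0  # wrap around the ring
        return self.ring[idx][1]

ring = HashRing(['cache-1', 'cache-2', 'cache-3'])
server = ring.get_server('some search query')
```

Adding a server only remaps the keys that land on the new server's virtual nodes; every other key keeps its existing assignment, which is the property that makes this attractive for a distributed cache.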

Implementation Reference

Python Implementation

View the complete Python implementation including LRU cache logic.

SQL Scaling Patterns

  • Read replicas
  • Federation
  • Sharding
  • Denormalization
  • SQL Tuning

NoSQL Options

  • Key-value store
  • Document store
  • Wide column store
  • Graph database

Caching Strategies

  • Cache-aside
  • Write-through
  • Write-behind
  • Refresh ahead

Asynchronous Processing

  • Message queues
  • Task queues
  • Back pressure
  • Microservices

Key Takeaways

  • LRU eviction policy keeps most popular queries cached
  • Doubly-linked list + hash table provides O(1) operations
  • Cache-aside pattern updates cache on misses
  • TTL (Time To Live) handles cache freshness
  • Consistent hashing enables distributed caching
  • Sharding across machines most efficient for scale
  • Handle 4,000 requests/second with distributed architecture
  • Memory Cache (Redis/Memcached) handles unevenly distributed traffic
