
Overview

Design Amazon’s sales ranking system that calculates and displays the most popular products by category. This problem explores MapReduce for batch processing, real-time analytics, ranking algorithms, and scaling to handle billions of transactions per month.

Step 1: Use Cases and Constraints

Use Cases

In Scope

  • Service calculates the past week’s most popular products by category
  • User views the past week’s most popular products by category
  • Service has high availability

Out of Scope

  • The general e-commerce site
    • Design components only for calculating sales rank

Constraints and Assumptions

Assumptions:
  • Traffic is not evenly distributed
  • Items can be in multiple categories
  • Items cannot change categories
  • No subcategories (e.g., foo/bar/baz)
  • Results must be updated hourly
    • More popular products might need more frequent updates
  • 10 million products
  • 1,000 categories
  • 1 billion transactions per month
  • 100 billion read requests per month
  • 100:1 read to write ratio

Usage Calculations

Size per transaction:
  • created_at - 5 bytes
  • product_id - 8 bytes
  • category_id - 4 bytes
  • seller_id - 8 bytes
  • buyer_id - 8 bytes
  • quantity - 4 bytes
  • total_price - 5 bytes
  • Total: ~40 bytes
Storage:
  • 40 GB of new transaction content per month
    • 40 bytes per transaction × 1 billion transactions per month
  • 1.44 TB of new transaction content in 3 years
Throughput:
  • 400 transactions per second on average
  • 40,000 read requests per second on average
Conversion guide:
  • 2.5 million seconds per month
  • 1 request per second = 2.5 million requests per month
  • 40 requests per second = 100 million requests per month
  • 400 requests per second = 1 billion requests per month
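The estimates above follow from simple arithmetic; a quick sanity check in Python:

```python
# Reproduce the back-of-the-envelope numbers from the usage calculations.
transactions_per_month = 1_000_000_000
bytes_per_transaction = 40
seconds_per_month = 2_500_000          # ~30 days, rounded for mental math
read_to_write_ratio = 100

new_content_gb = transactions_per_month * bytes_per_transaction / 1e9
# 40 GB of new transaction content per month

three_year_tb = new_content_gb * 36 / 1000
# 1.44 TB of new transaction content in 3 years

writes_per_second = transactions_per_month / seconds_per_month
# 400 transactions per second on average

reads_per_second = read_to_write_ratio * transactions_per_month / seconds_per_month
# 40,000 read requests per second on average
```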

Step 2: High Level Design

Sales Rank High Level Design

Step 3: Core Components

Use Case: Service Calculates Weekly Rankings

Store the raw Sales API server log files in an Object Store such as Amazon S3, rather than managing our own distributed file system.

Log File Format

Sample log entry (tab delimited):
timestamp   product_id  category_id    qty     total_price   seller_id    buyer_id
t1          product1    category1      2       20.00         1            1
t2          product1    category2      2       20.00         2            2
t2          product1    category2      1       10.00         2            3
t3          product2    category1      3        7.00         3            4
t4          product3    category2      7        2.00         4            5
t5          product4    category1      1        5.00         5            6

MapReduce Implementation

The Sales Rank Service uses a multi-step MapReduce:
  1. Transform and Aggregate - Transform the data to (category, product_id), sum(quantity)
  2. Distributed Sort - Perform a distributed sort to rank products within categories
  3. Write to Database - Insert the sorted results into the sales_rank table in the SQL Database
from mrjob.job import MRJob


class SalesRanker(MRJob):

    def within_past_week(self, timestamp):
        """Return True if timestamp is within past week, False otherwise."""
        ...

    def mapper(self, _, line):
        """Parse each log line, extract and transform relevant lines.

        Emit key value pairs of the form:

        (category1, product1), 2
        (category2, product1), 2
        (category2, product1), 1
        (category1, product2), 3
        (category2, product3), 7
        (category1, product4), 1
        """
        timestamp, product_id, category_id, quantity, total_price, seller_id, \
            buyer_id = line.split('\t')
        if self.within_past_week(timestamp):
            # quantity is parsed from the log line as a string;
            # cast it so the reducer can sum the values
            yield (category_id, product_id), int(quantity)

    def reducer(self, key, values):
        """Sum values for each key.

        (category1, product1), 2
        (category2, product1), 3
        (category1, product2), 3
        (category2, product3), 7
        (category1, product4), 1
        """
        yield key, sum(values)

    def mapper_sort(self, key, value):
        """Construct key to ensure proper sorting.

        Transform key and value to the form:

        (category1, 2), product1
        (category2, 3), product1
        (category1, 3), product2
        (category2, 7), product3
        (category1, 1), product4

        The shuffle/sort step of MapReduce will then do a
        distributed sort on the keys, resulting in:

        (category1, 1), product4
        (category1, 2), product1
        (category1, 3), product2
        (category2, 3), product1
        (category2, 7), product3
        """
        category_id, product_id = key
        quantity = value
        yield (category_id, quantity), product_id

    def reducer_identity(self, key, value):
        yield key, value

    def steps(self):
        """Run the map and reduce steps."""
        return [
            self.mr(mapper=self.mapper,
                    reducer=self.reducer),
            self.mr(mapper=self.mapper_sort,
                    reducer=self.reducer_identity),
        ]
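The two steps above can be simulated in a single process with plain Python, using the sample log entries from the Log File Format section, to confirm the expected output:

```python
from collections import defaultdict

# Sample log entries (tab delimited), as in the Log File Format section.
log_lines = [
    "t1\tproduct1\tcategory1\t2\t20.00\t1\t1",
    "t2\tproduct1\tcategory2\t2\t20.00\t2\t2",
    "t2\tproduct1\tcategory2\t1\t10.00\t2\t3",
    "t3\tproduct2\tcategory1\t3\t7.00\t3\t4",
    "t4\tproduct3\tcategory2\t7\t2.00\t4\t5",
    "t5\tproduct4\tcategory1\t1\t5.00\t5\t6",
]

# Step 1: map to ((category, product), quantity), reduce with sum().
totals = defaultdict(int)
for line in log_lines:
    _, product_id, category_id, quantity, *_ = line.split('\t')
    totals[(category_id, product_id)] += int(quantity)

# Step 2: re-key to ((category, quantity), product) and sort the keys,
# mirroring what MapReduce's shuffle/sort step does in a distributed way.
ranked = sorted(((category, qty), product)
                for (category, product), qty in totals.items())
```

This produces the same sorted list shown in the MapReduce Output section.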

MapReduce Output

Sorted list ready for insertion into sales_rank table:
(category1, 1), product4
(category1, 2), product1
(category1, 3), product2
(category2, 3), product1
(category2, 7), product3

Database Schema

sales_rank table:
id int NOT NULL AUTO_INCREMENT
category_id int NOT NULL
total_sold int NOT NULL
product_id int NOT NULL
PRIMARY KEY(id)
FOREIGN KEY(category_id) REFERENCES Categories(id)
FOREIGN KEY(product_id) REFERENCES Products(id)
Indexes on id, category_id, and product_id speed up lookups (log time instead of a full table scan) and help keep the hot data in memory.
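A quick sketch of the table and a ranked lookup, using SQLite for illustration (the actual schema targets a MySQL-style SQL database; foreign keys to Categories and Products are omitted here for brevity):

```python
import sqlite3

# Illustrative in-memory sketch of the sales_rank table and its
# category index; values mirror the sample REST API response.
conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE sales_rank (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        category_id INT NOT NULL,
        total_sold INT NOT NULL,
        product_id INT NOT NULL
    )
""")
conn.execute("CREATE INDEX idx_category ON sales_rank(category_id)")

rows = [(1234, 100000, 50), (1234, 90000, 200), (1234, 80000, 3)]
conn.executemany(
    "INSERT INTO sales_rank (category_id, total_sold, product_id) "
    "VALUES (?, ?, ?)", rows)

# Most popular products in a category, best sellers first.
top = conn.execute("""
    SELECT product_id, total_sold
    FROM sales_rank
    WHERE category_id = ?
    ORDER BY total_sold DESC
    LIMIT 10
""", (1234,)).fetchall()
```

The category_id index lets this query scan only the rows for one category instead of the whole table.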
Use Case: User Views Weekly Rankings

  1. The Client sends a request to the Web Server (reverse proxy)
  2. The Web Server forwards the request to the Read API server
  3. The Read API server reads from the SQL Database sales_rank table

REST API:
curl https://amazon.com/api/v1/popular?category_id=1234
Response:
[
  {
    "id": "100",
    "category_id": "1234",
    "total_sold": "100000",
    "product_id": "50"
  },
  {
    "id": "53",
    "category_id": "1234",
    "total_sold": "90000",
    "product_id": "200"
  },
  {
    "id": "75",
    "category_id": "1234",
    "total_sold": "80000",
    "product_id": "3"
  }
]
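A client might consume this endpoint as follows (a sketch using the sample payload above; note that the API returns numeric fields as JSON strings, so they must be cast before doing arithmetic):

```python
import json

# Sample response body from the /api/v1/popular endpoint above.
response_body = '''
[
  {"id": "100", "category_id": "1234", "total_sold": "100000", "product_id": "50"},
  {"id": "53",  "category_id": "1234", "total_sold": "90000",  "product_id": "200"},
  {"id": "75",  "category_id": "1234", "total_sold": "80000",  "product_id": "3"}
]
'''

products = json.loads(response_body)
# total_sold is a string in the payload; cast before comparing.
best_seller = max(products, key=lambda p: int(p['total_sold']))
```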

Step 4: Scale the Design

Sales Rank Scaled Design
Important: Take an iterative approach:
  1. Benchmark/Load Test
  2. Profile for bottlenecks
  3. Address bottlenecks
  4. Repeat

Scaling Components

DNS

Route users to nearest data center

CDN

Serve static content with low latency

Load Balancer

Distribute traffic across servers

Web Servers

Horizontal scaling as reverse proxies

API Servers

Separate Read/Write APIs

Memory Cache

Handle 40,000 average reads/second
  • Cache popular product rankings
  • Handle unevenly distributed traffic
  • Handle traffic spikes
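The Memory Cache would typically sit in front of the database using a cache-aside pattern. A minimal sketch (the dict stands in for Memcached/Redis, and fetch_rankings_from_db is a hypothetical stand-in for the sales_rank query):

```python
# Cache-aside sketch: check the cache first, fall back to the
# database on a miss, then populate the cache for later reads.
cache = {}
db_calls = 0  # counter to show the DB is hit only once per key


def fetch_rankings_from_db(category_id):
    """Hypothetical stand-in for the sales_rank SQL lookup."""
    global db_calls
    db_calls += 1
    return [(50, 100000), (200, 90000), (3, 80000)]


def get_rankings(category_id):
    if category_id in cache:               # cache hit: no DB round trip
        return cache[category_id]
    result = fetch_rankings_from_db(category_id)   # cache miss
    cache[category_id] = result            # populate for subsequent reads
    return result


first = get_rankings(1234)
second = get_rankings(1234)   # served from the cache
```

In production the cache entries would carry a TTL aligned with the hourly ranking refresh, so stale rankings expire on their own.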

SQL Database

Master-Slave replication with read replicas

Object Store

Store transaction log files (S3)
  • Handles 40 GB/month easily

Database Scaling Challenges

Read Challenge: 40,000 average read requests per second may overwhelm the SQL Read Replicas on cache misses.

Write Challenge: 400 average writes per second may be tough for a single SQL Write Master-Slave.
SQL Scaling Patterns:
  • Federation - Split databases by function
  • Sharding - Distribute data across databases
  • Denormalization - Optimize read performance
  • SQL Tuning - Optimize queries and indexes
  • NoSQL - Move appropriate data to NoSQL
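As a concrete illustration of sharding, rankings could be hash-partitioned by category so each category's reads and writes consistently land on one of N database shards (a sketch; shard count and hashing scheme are assumptions):

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count for illustration


def shard_for(category_id):
    # Hash the category id so the same category always routes to the
    # same shard, spreading load across NUM_SHARDS databases.
    digest = hashlib.md5(str(category_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```

Since a ranking query only ever touches one category, each request hits a single shard; the trade-off is that rebalancing becomes harder if shard counts change, which consistent hashing can mitigate.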

Storage Optimization

Solution: Use a data warehouse such as Amazon Redshift or Google BigQuery.
Benefit: Better suited for analytics workloads than transactional databases.

Strategy: Store only a limited time period of data in the database.
Archive: Keep the rest in the data warehouse or Object Store.
Capacity: S3 easily handles the 40 GB/month constraint.

Implementation Reference

Python Implementation

View the complete Python implementation including MapReduce ranking logic.

NoSQL Options

  • Key-value store
  • Document store
  • Wide column store
  • SQL vs NoSQL tradeoffs

Caching Strategies

  • Cache-aside
  • Write-through
  • Write-behind
  • Refresh ahead

Asynchronous Processing

  • Message queues
  • Task queues
  • Back pressure
  • Microservices

Communication Patterns

  • REST for external APIs
  • RPC for internal services
  • Service discovery

Key Takeaways

  • MapReduce processes billions of transactions efficiently
  • Two-step MapReduce: aggregate then sort
  • Object Store (S3) stores raw transaction logs
  • Memory Cache handles read-heavy workload (100:1 ratio)
  • SQL Database stores pre-computed rankings
  • Hourly updates balance freshness with computational cost
  • Analytics Database (Redshift/BigQuery) better for aggregations
  • Distributed sort in MapReduce ranks products by category
  • Consider SQL scaling patterns for high read/write loads
