Overview
Lichess serves millions of daily active users from a lean infrastructure. The production architecture emphasizes horizontal scaling, caching, and geographic distribution to handle massive traffic while maintaining low latency.
High-Level Architecture
Server Infrastructure
Application Servers (lila)
Multiple Scala application servers handle HTTP requests.
Configuration:
- Count: 4-6 instances (varies with load)
- CPU: 8-16 cores per instance
- Memory: 16-32 GB RAM
- JVM: Java 21+ with optimized GC settings
WebSocket Servers (lila-ws)
Configuration:
- Count: 3-5 instances
- CPU: 4-8 cores
- Memory: 8-16 GB
- Connections: 100k+ per instance
Load balancing:
- nginx with sticky sessions (optional)
- Round-robin with failover
- Client auto-reconnect on node failure
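The round-robin-with-failover behavior listed above can be sketched in a few lines. This is an illustration of the selection logic only, not nginx's implementation; the node names and the `is_healthy` callback are hypothetical.

```python
from typing import Callable, List, Optional


class RoundRobinBalancer:
    """Rotate through backends, skipping any that fail a health check."""

    def __init__(self, backends: List[str]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._next = 0  # index of the backend to try first on the next call

    def pick(self, is_healthy: Callable[[str], bool]) -> Optional[str]:
        """Return the next healthy backend, or None if every backend is down."""
        for offset in range(len(self._backends)):
            index = (self._next + offset) % len(self._backends)
            candidate = self._backends[index]
            if is_healthy(candidate):
                # Resume after the chosen backend so load keeps rotating.
                self._next = (index + 1) % len(self._backends)
                return candidate
        return None


balancer = RoundRobinBalancer(["ws1", "ws2", "ws3"])  # hypothetical node names
down = {"ws2"}  # pretend this node failed its health check
picks = [balancer.pick(lambda node: node not in down) for _ in range(4)]
print(picks)  # ['ws1', 'ws3', 'ws1', 'ws3']
```

A client that loses its connection to a failed node would reconnect and be handed one of the remaining healthy nodes by the same selection step.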
Nginx Configuration
Frontend load balancer and reverse proxy.
Database Infrastructure
MongoDB Cluster
Topology: Replica set with 3+ nodes
Configuration:
- Storage: 4+ TB SSD per node (4.7B+ games)
- RAM: 64-128 GB (for working set cache)
- Write Concern: w:1 (acknowledged by the primary)
- Read Preference: primaryPreferred (read from the primary; fall back to secondaries only when the primary is unavailable)
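As a rough illustration, the write concern and read preference listed above can be expressed as standard MongoDB connection string options (`w`, `readPreference`, `replicaSet`). The host names, database name, and replica set name below are placeholders, not values taken from the Lichess configuration.

```python
from typing import List
from urllib.parse import urlencode


def mongo_uri(hosts: List[str], database: str, replica_set: str) -> str:
    """Build a replica-set connection string with w:1 and primaryPreferred."""
    options = urlencode(
        {
            "replicaSet": replica_set,
            "w": 1,  # write is acknowledged by the primary only
            "readPreference": "primaryPreferred",
        }
    )
    return f"mongodb://{','.join(hosts)}/{database}?{options}"


# Hypothetical hosts and names, for illustration only.
uri = mongo_uri(["db1:27017", "db2:27017", "db3:27017"], "lichess", "rs0")
print(uri)
# mongodb://db1:27017,db2:27017,db3:27017/lichess?replicaSet=rs0&w=1&readPreference=primaryPreferred
```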
Redis Cluster
Purpose:
- WebSocket pub/sub (lila ↔ lila-ws)
- Session storage
- Rate limiting
- Temporary caching
Configuration:
- Count: 3 nodes (master + 2 replicas)
- Memory: 16-32 GB per node
- Persistence: RDB snapshots + AOF
- Eviction: LRU for cache entries
High availability:
- Redis Sentinel monitors cluster health
- Automatic failover to replica on master failure
- Sub-second failover time
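To make the LRU eviction policy mentioned above concrete, here is a minimal exact-LRU cache. It only illustrates the idea: Redis itself uses an approximated LRU based on sampling, and the class name and sizes here are hypothetical.

```python
from collections import OrderedDict
from typing import Optional


class LruCache:
    """Keep at most max_entries items, evicting the least recently used one."""

    def __init__(self, max_entries: int) -> None:
        self._max_entries = max_entries
        self._data: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: str, value: str) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._max_entries:
            self._data.popitem(last=False)  # drop the least recently used entry


cache = LruCache(max_entries=2)
cache.put("a", "1")
cache.put("b", "2")
cache.get("a")       # "a" is now more recent than "b"
cache.put("c", "3")  # exceeds the limit, so "b" is evicted
print(cache.get("b"), cache.get("a"), cache.get("c"))  # None 1 3
```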
Elasticsearch
Purpose: Full-text search for games, studies, forums
Configuration:
- Count: 3 nodes (for redundancy)
- Storage: 1+ TB per node
- Indexes: game, study, forum, team
Indexing:
- MongoDB change streams trigger Elasticsearch updates
- Async indexing (eventual consistency)
- Periodic full reindex for consistency
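The asynchronous indexing flow above implies that the search index can briefly lag behind the database. The sketch below models that with an in-memory queue and a dictionary standing in for the index; the `AsyncIndexer` class and its event shape are hypothetical and do not reflect the real change stream or Elasticsearch APIs.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple

# (operation, document id, document body); a simplified, hypothetical event shape.
ChangeEvent = Tuple[str, str, Optional[dict]]


class AsyncIndexer:
    """Buffer database change events and apply them to the index later."""

    def __init__(self) -> None:
        self.pending: Deque[ChangeEvent] = deque()
        self.index: Dict[str, dict] = {}  # stand-in for the search index

    def on_change(self, operation: str, doc_id: str, body: Optional[dict] = None) -> None:
        # Called when the database reports a change; returns immediately.
        self.pending.append((operation, doc_id, body))

    def drain(self) -> int:
        """Apply all buffered events in order and return how many were applied."""
        applied = 0
        while self.pending:
            operation, doc_id, body = self.pending.popleft()
            if operation == "delete":
                self.index.pop(doc_id, None)
            else:
                self.index[doc_id] = body or {}
            applied += 1
        return applied


indexer = AsyncIndexer()
indexer.on_change("insert", "study1", {"name": "Sicilian lines"})
stale = "study1" in indexer.index  # False: the index has not caught up yet
indexer.drain()
fresh = "study1" in indexer.index  # True once the backlog is applied
print(stale, fresh)  # False True
```

A periodic full reindex would rebuild the index from the database and correct any events that were lost along the way.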
CDN and Asset Delivery
Fastly CDN
Lichess uses Fastly for global content delivery.
Cached Assets:
- JavaScript bundles (/compiled/*.js)
- CSS stylesheets
- Images (board themes, pieces)
- Fonts
- Static resources
Benefits:
- Global edge network: 70+ POPs worldwide
- Low latency: Assets served from nearby edge
- Origin shield: Reduces backend load
- Instant purge: Purge cache on deployment
Asset Versioning
Content-hashed URLs enable aggressive caching:
- ui/build generates hashed filenames
- Manifest maps logical names to hashed URLs
- Server injects correct URLs in HTML
- Browsers cache assets until hash changes
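The steps above can be sketched as follows. The naming scheme (an 8-character SHA-256 prefix inserted before the extension) and the function names are assumptions for illustration; the actual ui/build output format may differ.

```python
import hashlib
from pathlib import PurePosixPath
from typing import Dict


def hashed_name(logical_name: str, content: bytes, digest_len: int = 8) -> str:
    """Insert a short content digest before the file extension."""
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    path = PurePosixPath(logical_name)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


def build_manifest(assets: Dict[str, bytes]) -> Dict[str, str]:
    """Map each logical asset name to its content-hashed file name."""
    return {name: hashed_name(name, content) for name, content in assets.items()}


before = build_manifest({"compiled/site.js": b"console.log('v1')"})
after = build_manifest({"compiled/site.js": b"console.log('v2')"})

# The logical name is stable, but the served URL changes with the content,
# so a cached copy of the old file is never reused for the new one.
print(before["compiled/site.js"] != after["compiled/site.js"])  # True
```

Because a given URL always refers to the same bytes, such files can be served with long cache lifetimes.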
Deployment Process
Build and Package
Deployment Strategy
Rolling deployment with zero downtime.
Systemd Service
Configuration Management
Production configuration in /etc/lila/application.conf.
Monitoring and Observability
Metrics Collection
Kamon exports metrics to InfluxDB:
- CPU, memory, disk usage
- JVM heap, GC pauses
- HTTP request rate, latency
- Database query time
- Redis operations
- WebSocket connection count
Grafana Dashboards
Real-time monitoring dashboards:
- System Overview: CPU, memory, network across all nodes
- Application: Request rate, latency percentiles, error rate
- Games: Active games, moves/second, game starts
- Users: Online count, registrations, logins
- Database: Query time, connection pool, replication lag
- WebSocket: Connections, message rate, disconnects
Alerting
PagerDuty integration for critical alerts:
- HTTP 5xx error rate > threshold
- Response time p99 > threshold
- Database replica lag > threshold
- Redis disconnections
- Disk space < threshold
- Application crashes
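Threshold-based alerting of this kind boils down to comparing each metric sample against a rule. The sketch below shows that comparison; the metric names and threshold values are invented placeholders, since the source does not state the real ones, and no PagerDuty API is involved.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Rule:
    metric: str
    threshold: float
    fire_when_above: bool = True  # False for metrics where lower is worse

    def breached(self, value: float) -> bool:
        if self.fire_when_above:
            return value > self.threshold
        return value < self.threshold


def evaluate(rules: List[Rule], sample: Dict[str, float]) -> List[str]:
    """Return the names of metrics whose current value breaches its rule."""
    return [
        rule.metric
        for rule in rules
        if rule.metric in sample and rule.breached(sample[rule.metric])
    ]


# Placeholder rules; real thresholds would come from operational experience.
rules = [
    Rule("http_5xx_rate", 0.01),
    Rule("p99_latency_ms", 500),
    Rule("replica_lag_s", 10),
    Rule("disk_free_ratio", 0.10, fire_when_above=False),
]
sample = {
    "http_5xx_rate": 0.002,
    "p99_latency_ms": 820,
    "replica_lag_s": 3,
    "disk_free_ratio": 0.05,
}
firing = evaluate(rules, sample)
print(firing)  # ['p99_latency_ms', 'disk_free_ratio']
```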
Logging
Structured logging with Logback:
- Centralized log collection (e.g., ELK stack)
- Structured JSON logs
- Search by game ID, user ID, error type
- Alert on specific log patterns
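To show what a structured JSON log line looks like and why it is searchable by game ID or error type, here is a small sketch using Python's standard `logging` module. It illustrates the log shape only; the production services use Logback on the JVM, and the field names below are assumptions.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    FIELDS = ("game_id", "user_id", "error_type")  # hypothetical field names

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy over any structured fields passed through `extra`.
        for field in self.FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                entry[field] = value
        return json.dumps(entry)


stream = io.StringIO()  # stands in for a file or log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("round")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("move rejected", extra={"game_id": "abcd1234", "error_type": "illegal_move"})
print(stream.getvalue().strip())
```

Because each field is a separate JSON key, a log store can index `game_id` and `error_type` directly instead of parsing free text.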
Scaling Considerations
Horizontal Scaling
Application Servers (lila)
Easy to scale: Stateless application servers. No coordination needed between instances.
WebSocket Servers (lila-ws)
Easy to scale: Independent connection handlers. Redis pub/sub coordinates message delivery.
Database (MongoDB)
- Vertical scaling: Increase CPU/RAM on existing nodes
- Read scaling: Add read replicas for read-heavy workloads
- Write scaling: Consider sharding if write load exceeds single primary capacity (not yet needed)
Redis
- Vertical scaling: Increase memory for more cache
- Read scaling: Add replicas and serve read commands from them
- Partitioning: Separate Redis instances for different use cases (pub/sub vs. cache)
Performance Optimization
Connection pooling:
- HTTP client connection pools
- Database connection pools
- Redis connection pools
Caching:
- In-memory (Scaffeine) for hot data
- Redis for distributed cache
- CDN for static assets
Async processing:
- All I/O operations non-blocking
- Akka Streams for backpressure
- Queue background jobs
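The in-memory and distributed cache tiers listed above are typically consulted in order before falling back to the database. A minimal read-through sketch, with plain dictionaries standing in for the in-process cache and Redis (the `TieredCache` class and key format are hypothetical):

```python
from typing import Callable, Dict, List


class TieredCache:
    """Check the in-process cache, then the shared cache, then the loader."""

    def __init__(
        self,
        local: Dict[str, str],
        shared: Dict[str, str],
        load: Callable[[str], str],
    ) -> None:
        self.local = local    # stand-in for an in-process cache
        self.shared = shared  # stand-in for a distributed cache
        self.load = load      # slow path, e.g. a database query

    def get(self, key: str) -> str:
        if key in self.local:
            return self.local[key]
        if key in self.shared:
            value = self.shared[key]
        else:
            value = self.load(key)
            self.shared[key] = value  # populate the shared tier for other nodes
        self.local[key] = value       # populate the fast tier for this node
        return value


loads: List[str] = []


def load_from_database(key: str) -> str:
    loads.append(key)  # record each slow-path call
    return f"value-for-{key}"


cache = TieredCache(local={}, shared={}, load=load_from_database)
cache.get("user:alice")  # misses both tiers and calls the loader
cache.get("user:alice")  # served from the in-process tier
print(loads)  # ['user:alice']
```

This sketch omits expiry and invalidation, which a real deployment needs so that stale entries are eventually refreshed.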
Security
TLS/HTTPS
- Certificate: Let’s Encrypt with auto-renewal
- Protocols: TLS 1.2, TLS 1.3 only
- Ciphers: Modern ciphers (AEAD, forward secrecy)
- HSTS: Strict-Transport-Security header
Rate Limiting
- Login attempts
- API requests
- Game creation
- Chat messages
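The source does not say which algorithm is used for these limits. One common choice is a token bucket, sketched here with an explicit clock so the behavior is easy to follow; the capacity and refill rate are illustrative values, not Lichess settings.

```python
class TokenBucket:
    """Allow bursts up to `capacity`, refilling at a steady rate."""

    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity  # start full
        self.updated_at = now

    def allow(self, now: float) -> bool:
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Illustrative limit: bursts of 3 chat messages, refilling 1 message per second.
bucket = TokenBucket(capacity=3, refill_per_second=1.0)
burst = [bucket.allow(now=0.0) for _ in range(4)]
later = [bucket.allow(now=2.0) for _ in range(3)]
print(burst)  # [True, True, True, False]
print(later)  # [True, True, False]
```

In practice one bucket would be kept per key, such as per IP address or per user and action.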
DDoS Protection
- CDN: Fastly provides DDoS mitigation
- nginx: Connection limits, request rate limits
- Application: Per-IP rate limiting
Proxy Detection
IP2Proxy database detects proxies/VPNs:
- Flag suspicious IPs
- Additional verification for proxied users
- Anti-cheat measures
Disaster Recovery
Backups
MongoDB:
- Daily snapshots retained for 30 days
- Oplog backup for point-in-time recovery
- Geo-distributed backups
Redis:
- RDB snapshots every 5 minutes
- AOF append-only file for durability
Configuration:
- Version controlled in Git
- Encrypted secrets management
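The 30-day retention of daily snapshots can be expressed as a simple date comparison. The function name is hypothetical, and the choice to keep a snapshot that is exactly 30 days old is an assumption of this sketch.

```python
from datetime import date, timedelta
from typing import Iterable, List


def expired_snapshots(
    snapshot_dates: Iterable[date], today: date, retention_days: int = 30
) -> List[date]:
    """Return snapshot dates older than the retention window, oldest first."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in snapshot_dates if d < cutoff)


snapshots = [date(2024, 2, 28), date(2024, 3, 1), date(2024, 3, 30)]
to_delete = expired_snapshots(snapshots, today=date(2024, 3, 31))
print(to_delete)  # [datetime.date(2024, 2, 28)]
```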
Recovery Procedures
Database restore:
Cost Optimization
Lichess operates on donations and minimizes costs:
- No Kubernetes: Simple systemd services reduce overhead
- Bare metal: Owned servers vs. cloud reduces costs
- Efficient encoding: Game compression saves storage
- CDN offloading: Reduces origin bandwidth
- Donated infrastructure: Some servers donated by community
See Also
- Backend Architecture - Application structure
- Frontend Architecture - UI build and deployment
- Database Architecture - Data storage details
- WebSocket Architecture - Real-time infrastructure

