High Availability Overview
Availability Targets
| Tier | Uptime % | Downtime/Year | Configuration |
|---|---|---|---|
| Standard | 99.0% | 3.65 days | Single node with backups |
| High | 99.9% | 8.76 hours | Multi-node with replication |
| Very High | 99.99% | 52.56 minutes | Full HA cluster with redundancy |
| Mission Critical | 99.999% | 5.26 minutes | Geo-redundant, active-active |
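The downtime column follows directly from the uptime percentage. The arithmetic can be sketched as below, assuming a 365-day year; the function name `downtime_per_year` is illustrative and not part of any tool described in this guide.

```python
from datetime import timedelta


def downtime_per_year(uptime_percent: float, days_per_year: float = 365.0) -> timedelta:
    """Return the yearly downtime budget implied by an uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    # Fraction of the year during which the service may be unavailable.
    unavailable_fraction = (100.0 - uptime_percent) / 100.0
    return timedelta(days=days_per_year * unavailable_fraction)


# Print the budget for each tier listed in the table.
for tier, pct in [
    ("Standard", 99.0),
    ("High", 99.9),
    ("Very High", 99.99),
    ("Mission Critical", 99.999),
]:
    print(f"{tier}: {downtime_per_year(pct)}")
```

Each additional "nine" cuts the budget by a factor of ten, which is why the higher tiers call for automated failover rather than manual recovery.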
Failure Scenarios
Component Failures:
- Backend API server crash
- Database server failure
- Elasticsearch node failure
- Network connectivity loss
- Disk failure
- Agent communication loss
Recovery Objectives:
- RTO (Recovery Time Objective): how quickly service must be restored after a failure
- RPO (Recovery Point Objective): how much data loss, measured in time, is acceptable
Targets:
- RTO: < 5 minutes
- RPO: < 1 minute
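To make the two objectives concrete, the sketch below checks a single incident against the targets above. It assumes three timestamps are available from incident records: when the failure occurred, when service was restored, and when the last write was safely replicated. The function name and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

RTO_TARGET = timedelta(minutes=5)  # target above: RTO < 5 minutes
RPO_TARGET = timedelta(minutes=1)  # target above: RPO < 1 minute


def meets_objectives(
    failure_at: datetime,
    restored_at: datetime,
    last_replicated_at: datetime,
) -> tuple[bool, bool]:
    """Return (rto_met, rpo_met) for one incident."""
    # Recovery time: how long the service was unavailable.
    recovery_time = restored_at - failure_at
    # Data-loss window: writes made after the last replicated point are lost.
    data_loss_window = failure_at - last_replicated_at
    return recovery_time < RTO_TARGET, data_loss_window < RPO_TARGET


# Hypothetical incident: 3 minutes of downtime, 20 seconds of unreplicated writes.
failure = datetime(2024, 1, 1, 12, 0, 0)
print(meets_objectives(
    failure_at=failure,
    restored_at=failure + timedelta(minutes=3),
    last_replicated_at=failure - timedelta(seconds=20),
))
```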
High Availability Architecture
Load Balancer Configuration
HAProxy with Keepalived
HAProxy Configuration (/etc/haproxy/haproxy.cfg):
Keepalived Configuration (/etc/keepalived/keepalived.conf):
Database High Availability
PostgreSQL Streaming Replication
Primary Server Configuration (postgresql.conf):
Replication Access (pg_hba.conf):
Standby Server Configuration (postgresql.conf):
Standby Signal (standby.signal file):
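With streaming replication in place, the gap between primary and standby can be expressed in bytes of WAL. PostgreSQL prints log sequence numbers (LSNs) as two hexadecimal numbers separated by a slash, representing the high and low 32 bits of a 64-bit position. The sketch below converts such strings and computes the lag. It assumes the two LSNs have already been fetched, for example from `pg_current_wal_lsn()` on the primary and `pg_last_wal_replay_lsn()` on the standby; the helper names and sample values are illustrative.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN string such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    # High part holds the upper 32 bits, low part the lower 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn: str, standby_lsn: str) -> int:
    """Return how many bytes of WAL the standby is behind the primary."""
    # Clamp at zero in case the two values were sampled at slightly different times.
    return max(0, lsn_to_bytes(primary_lsn) - lsn_to_bytes(standby_lsn))


# Made-up sample values: the standby is 0x60 bytes behind.
print(replication_lag_bytes("0/3000060", "0/3000000"))
```

A lag that keeps growing suggests the standby cannot keep up, which puts the RPO target at risk if the primary fails.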
Automatic Failover with Patroni
Patroni Configuration (/etc/patroni/patroni.yml):
Elasticsearch High Availability
Cluster Configuration
3-Node Cluster Setup:
Master Node (elasticsearch.yml):
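Three nodes is the smallest cluster size that still holds a majority of master-eligible nodes after losing one. The arithmetic is sketched below, assuming every node in the cluster is master-eligible (the source does not state this); the helper names are illustrative.

```python
def majority(master_eligible_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if master_eligible_nodes < 1:
        raise ValueError("at least one master-eligible node is required")
    return master_eligible_nodes // 2 + 1


def tolerated_failures(master_eligible_nodes: int) -> int:
    """How many nodes can fail while a majority remains."""
    return master_eligible_nodes - majority(master_eligible_nodes)


for nodes in (1, 2, 3, 5):
    print(nodes, majority(nodes), tolerated_failures(nodes))
```

Note that a two-node cluster tolerates no failures, the same as a single node, which is why the setup uses three.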
Redis High Availability
Redis Sentinel
Sentinel Configuration (/etc/redis/sentinel.conf):
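In Sentinel, the quorum given on the `sentinel monitor` line is only the number of Sentinels that must agree the master is unreachable. Carrying out the failover additionally requires a majority of all Sentinels to authorize a leader. The sketch below models both conditions; the function and parameter names are hypothetical and the model ignores timing details.

```python
def sentinel_can_failover(total_sentinels: int, agreeing_sentinels: int, quorum: int) -> bool:
    """Return True if a failover can both be triggered and authorized.

    agreeing_sentinels: Sentinels that can reach each other and see the master as down.
    """
    if not 0 <= agreeing_sentinels <= total_sentinels:
        raise ValueError("agreeing_sentinels must be between 0 and total_sentinels")
    if quorum < 1:
        raise ValueError("quorum must be at least 1")
    majority = total_sentinels // 2 + 1
    # Condition 1: enough Sentinels agree to mark the master as objectively down.
    objectively_down = agreeing_sentinels >= quorum
    # Condition 2: enough Sentinels are available to elect a failover leader.
    leader_electable = agreeing_sentinels >= majority
    return objectively_down and leader_electable


print(sentinel_can_failover(total_sentinels=3, agreeing_sentinels=2, quorum=2))
```

Setting the quorum below the majority therefore does not allow a minority partition to fail over on its own.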
Application Layer HA
Stateless Backend Services
Docker Swarm (alternative to manual setup):
Health Checks
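A health check ultimately decides whether an instance should keep receiving traffic. For Spring Boot services, the Actuator health endpoint (by default `/actuator/health`) returns JSON with a top-level `status` field such as `UP` or `DOWN`. The sketch below interprets such a response body. It assumes the body has already been fetched over HTTP, and the function name is illustrative.

```python
import json


def is_instance_healthy(health_body: str) -> bool:
    """Return True only if an Actuator-style health response reports status UP."""
    try:
        payload = json.loads(health_body)
    except json.JSONDecodeError:
        # An unparseable response is treated as unhealthy.
        return False
    return isinstance(payload, dict) and payload.get("status") == "UP"


print(is_instance_healthy('{"status": "UP"}'))
print(is_instance_healthy('{"status": "DOWN"}'))
```

Treating anything other than an explicit `UP` as unhealthy is the conservative choice: a stateless backend instance can be removed from rotation and replaced at little cost.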
Spring Boot Actuator:
Disaster Recovery
Backup Strategy
Automated Backup Script:
Recovery Procedures
PostgreSQL Recovery:
Monitoring HA Status
Grafana Dashboard Alerts:
Next Steps
- Horizontal Scaling: Scale for higher capacity
- Performance Tuning: Optimize for best performance
- System Architecture: Review the complete architecture
- Data Storage: Understand the storage design