Failover and High Availability
Gate provides comprehensive failover mechanisms at multiple levels to ensure your Minecraft network remains available even during infrastructure failures, maintenance, or unexpected issues.
Failover Architecture
Gate implements failover at three distinct layers:
Layer 1: Frontend Gate Instance Failover
Ensure continuous proxy availability when individual Gate instances fail.
Method 1: Multiple Gate Instances with Connect
The simplest approach uses Gate Connect’s built-in redundancy:
config.yml (All instances)
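A minimal sketch of what each instance's config.yml might contain; the endpoint name `my-network` is an example:

```yaml
# config.yml -- run the same config on every Gate instance.
# All instances sharing the same Connect endpoint name join the
# same network and back each other up automatically.
config:
  # ... your servers, try list, and other settings ...
connect:
  enabled: true
  name: my-network   # example endpoint name; use your own
```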
Failover Behavior:
- Automatic health checking by Connect network
- Failed instances removed from rotation instantly
- No player-facing downtime
- Automatic recovery when instance comes back online
Benefits:
- Zero-configuration failover
- No external load balancer required
- Works anywhere (home servers, cloud, containers)
- Built-in health monitoring
Method 2: Kubernetes with Replica Sets
For cloud-native deployments, leverage Kubernetes’ self-healing:
gate-deployment.yaml
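An illustrative sketch of such a manifest, reflecting the features described below; the image tag, ports, and probe endpoints are assumptions to adjust for your build:

```yaml
# gate-deployment.yaml -- illustrative sketch, not a drop-in manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gate
spec:
  replicas: 3                      # three instances always running
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1            # keep capacity during deploys
  selector:
    matchLabels:
      app: gate
  template:
    metadata:
      labels:
        app: gate
    spec:
      containers:
        - name: gate
          image: ghcr.io/minekube/gate:latest   # example image reference
          ports:
            - containerPort: 25565
          livenessProbe:
            tcpSocket:
              port: 25565
            periodSeconds: 10
          readinessProbe:
            tcpSocket:
              port: 25565
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: gate
spec:
  selector:
    app: gate
  ports:
    - port: 25565
      targetPort: 25565
  sessionAffinity: ClientIP        # keep a player on the same instance
```

Note that `sessionAffinity` is set on the Service, not the Deployment, since it controls how kube-proxy routes client connections.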
Replica Sets
replicas: 3 ensures three instances are always running; Kubernetes automatically recreates failed pods.
Health Probes
Liveness and readiness probes detect unhealthy instances and remove them from service.
Rolling Updates
Zero-downtime deployments with maxUnavailable: 1 ensure continuous availability.
Session Affinity
sessionAffinity: ClientIP keeps players connected to the same instance across reconnects.
Method 3: HAProxy with Health Checks
For traditional infrastructure, configure HAProxy for automatic failover:
haproxy.cfg
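A minimal TCP-mode sketch using the health-check options explained below; the backend addresses are examples:

```
# haproxy.cfg -- minimal failover sketch; addresses are examples
frontend minecraft
    bind *:25565
    mode tcp
    default_backend gate_instances

backend gate_instances
    mode tcp
    balance roundrobin
    option tcp-check
    default-server inter 2s fall 3 rise 2 on-marked-down shutdown-sessions
    server gate1 10.0.0.1:25565 check
    server gate2 10.0.0.2:25565 check
    server gate3 10.0.0.3:25565 check
```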
- inter 2s: Check every 2 seconds
- fall 3: Mark down after 3 failed checks (6 seconds)
- rise 2: Mark up after 2 successful checks (4 seconds)
- on-marked-down shutdown-sessions: Gracefully close connections when the backend fails
Layer 2: Backend Server Failover
Gate automatically handles backend server failures and reconnects players to available servers.
Try List Configuration
The try list defines the order in which Gate attempts servers when connecting players:
config.yml
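A sketch of the corresponding configuration; server names and addresses are examples:

```yaml
# config.yml -- Gate attempts servers in try-list order
config:
  servers:
    lobby-1: 10.0.0.10:25565
    lobby-2: 10.0.0.11:25565
    fallback: 10.0.0.12:25565
  try:
    - lobby-1    # attempted first
    - lobby-2    # attempted next
    - fallback   # last resort
```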
Automatic Reconnection on Disconnect
Gate can automatically reconnect players when they’re kicked from a server:
config.yml
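A minimal sketch enabling this behavior; the server names are examples:

```yaml
# config.yml -- reconnect players to the next try-list server when a
# backend disconnects them unexpectedly
config:
  failoverOnUnexpectedServerDisconnect: true
  try:
    - lobby-1
    - lobby-2
```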
Failover triggers when:
- Server crashes (connection lost)
- Server becomes unreachable (network failure)
- Server sends disconnect without reason
- Server times out
Failover does not trigger when:
- Server kicks player with explicit reason (“You are banned”)
- Player uses the /server command
- Server shuts down with a proper kick message
Lite Mode Backend Failover
In Lite mode, Gate tries backends in order automatically:
config.yml
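A sketch of a Lite route with multiple backends; the hostname and addresses are examples, and the fallback fields may vary by Gate version:

```yaml
# config.yml -- Lite mode with ordered backends and a fallback message
config:
  lite:
    enabled: true
    routes:
      - host: play.example.com
        backend:
          - 10.0.0.10:25565   # tried first
          - 10.0.0.11:25565   # tried next on connection failure
        fallback:
          motd: |
            §cAll servers are offline.
            §ePlease try again later.
```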
- Automatic backend health checking
- Instant failover on connection failure
- Custom fallback messages when all backends down
- Works for both player connections and status pings
Layer 3: Connection-Level Failover
Gate handles individual connection failures gracefully.
Connection Timeouts
Configure aggressive timeouts for fast failure detection:
config.yml
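A sketch of a fast-failover profile for vanilla backends; the values are shown in milliseconds, which you should verify against the configuration reference for your Gate version:

```yaml
# config.yml -- fast-failover timeout profile (vanilla servers)
config:
  connectionTimeout: 3000   # give up on an unreachable backend after ~3s
  readTimeout: 30000        # drop stalled connections after ~30s
```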
| Scenario | connectionTimeout | readTimeout |
|---|---|---|
| Fast failover (vanilla) | 2-3s | 15-30s |
| Modded servers (Forge/Fabric) | 5-10s | 60-90s |
| High-latency networks | 5-10s | 45-60s |
Rate Limiting and Protection
Protect against connection floods that could prevent legitimate failover:
config.yml
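A sketch of connection rate limiting; the field names and units below are assumptions, so verify them against the configuration reference:

```yaml
# config.yml -- connection quota sketch; verify field names/units
# against the configuration reference for your Gate version
config:
  quota:
    connections:
      enabled: true
      ops: 5      # sustained new connections allowed per second per IP
      burst: 10   # short bursts permitted above the sustained rate
```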
- Prevents connection exhaustion during failures
- Ensures resources available for failover operations
- Protects against DoS attacks during incidents
High-Availability Deployment Patterns
Pattern 1: Active-Active (Recommended)
Multiple Gate instances actively serving traffic:
- All instances actively handling connections
- Load distributed evenly
- Maximum resource utilization
- Survives n-1 failures (can lose any single instance)
- No idle resources
- Better performance under normal load
- Graceful degradation (performance decreases gradually)
Pattern 2: Active-Passive (Cost-Optimized)
Primary instance serves traffic, backup stands ready:
- Backup only receives traffic when primary fails
- Lower resource costs (backup can be smaller)
- Slower failover (backup must suddenly absorb the full load)
Best for:
- Cost-sensitive deployments
- Predictable traffic patterns
- Development/staging environments
Pattern 3: Multi-Region (Geo-Distributed)
Gate instances across multiple geographic regions:
config.yml
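An illustrative per-region sketch: each region runs its own Gate with region-local backends, and DNS (for example, latency-based routing) directs players to the nearest region. All names and addresses are examples:

```yaml
# config.yml (us-east instance) -- each region keeps its own try list
# of region-local backends; other regions run the same shape of config
config:
  servers:
    lobby-us-1: 10.1.0.10:25565
    lobby-us-2: 10.1.0.11:25565
  try:
    - lobby-us-1
    - lobby-us-2
```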
Benefits:
- Survives entire region failures
- Lower latency for global players
- Compliance with data residency requirements
Trade-offs:
- More complex configuration
- Higher operational costs
- Cross-region data synchronization
Disaster Recovery
Backup and Restore Procedures
Recovery Time Objectives (RTO)
Set and measure recovery objectives:

| Scenario | Target RTO | Achieved Through |
|---|---|---|
| Single Gate instance failure | < 30s | Active-active with health checks |
| Backend server failure | < 5s | Try list + failoverOnUnexpectedServerDisconnect |
| Entire datacenter failure | < 5min | Multi-region with DNS failover |
| Configuration corruption | < 30min | Automated backups + restore procedure |
Monitoring and Alerting
Critical Metrics to Monitor
Gate Instance Health
- Process uptime
- Memory usage
- CPU utilization
- Connection count
Backend Server Status
- Reachability (ping/status)
- Response time
- Player count
- Failed connection attempts
Failover Events
- Failover frequency
- Recovery time
- Affected players
- Root cause (if known)
Player Experience
- Login success rate
- Connection errors
- Unexpected disconnects
- Average connection time
Example Monitoring with Prometheus
If Gate exposes metrics (via HTTP API or custom exporter):
prometheus-alerts.yml
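An illustrative set of alert rules; the metric names (`up`, `gate_connected_players`) are assumptions that depend on your scrape config and exporter:

```yaml
# prometheus-alerts.yml -- illustrative rules; metric names are
# assumptions and depend on your exporter
groups:
  - name: gate-availability
    rules:
      - alert: GateInstanceDown
        expr: up{job="gate"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Gate instance {{ $labels.instance }} is down"
      - alert: NoConnectedPlayers
        expr: sum(gate_connected_players) == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No players connected to any Gate instance"
```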
Log-Based Monitoring
Monitor Gate logs for failover events such as backend connection failures and server switches.
Testing Failover
Regularly test failover mechanisms to ensure they work when needed:
Test 1: Backend Server Failover
Verify Failover
- Attempt player connection
- Confirm connection to backup server
- Check Gate logs for failover messages
Test 2: Gate Instance Failover
Verify Load Balancer Failover
- Monitor load balancer logs
- Confirm traffic routed to healthy instances
- Test new player connections
Test 3: Chaos Engineering
Use chaos engineering tools for comprehensive testing:
Best Practices
Defense in Depth
Implement failover at multiple layers (frontend, Gate, backend) for maximum resilience.
Test Regularly
Schedule monthly failover drills to verify mechanisms work correctly.
Monitor Continuously
Set up alerts for failures and track failover events over time.
Document Procedures
Maintain runbooks for common failure scenarios and recovery procedures.
Automate Recovery
Use orchestration tools (Kubernetes, Systemd) for automatic instance recovery.
Optimize Timeouts
Balance fast failover with avoiding false positives from temporary issues.
Troubleshooting
Players disconnected during failover
Expected Behavior: Players connected to the failed instance will disconnect.
Mitigation:
- Use failoverOnUnexpectedServerDisconnect: true for automatic backend failover
- Implement session affinity on the frontend load balancer
- Use multiple Gate instances to minimize impact radius
Failover loops (continuous switching)
Possible Causes:
- All backends unhealthy but health checks inconsistent
- Timeout values too aggressive
- Network issues causing intermittent failures
Solutions:
- Increase connectionTimeout and health check intervals
- Implement exponential backoff for failed backends
- Fix underlying network/server issues
Backup server never receives traffic
Check:
- Backup server configured correctly in try list or as HAProxy backup
- Primary server is actually failing (not just slow)
- Health checks detecting primary failure properly
Slow failover (> 30 seconds)
Possible Causes:
- Long timeout values
- Slow health check intervals
- Resource exhaustion (CPU/memory)
Solutions:
- Reduce connectionTimeout to 2-5 seconds
- Increase health check frequency
- Scale up instance resources
Related Resources
Load Balancing
Configure load balancing strategies for optimal distribution
Gate Connect
Use Connect for automatic failover and load balancing
Kubernetes Deployment
Deploy Gate on Kubernetes with built-in self-healing
Configuration Reference
Complete configuration options documentation

