Architecture Design
Cluster Sizing
Node Pool Strategy
Use dedicated node pools for game servers. Benefits:
- Predictable resource allocation
- Isolation from system workloads
- Easier capacity planning
- Better node autoscaling
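The dedicated-pool setup can be sketched in a GameServer's pod template. The `pool: game-servers` label and matching taint are hypothetical names you would apply when creating the node pool:

```yaml
# Sketch: pin game server pods onto a dedicated node pool.
# Assumes the pool's nodes carry a hypothetical "pool=game-servers" label and taint.
apiVersion: agones.dev/v1
kind: GameServer
metadata:
  name: my-game
spec:
  ports:
    - name: default
      containerPort: 7654
  template:
    spec:
      nodeSelector:
        pool: game-servers        # schedule only onto the dedicated pool
      tolerations:
        - key: pool
          value: game-servers
          effect: NoSchedule      # tolerate the pool's taint
      containers:
        - name: game-server
          image: example.com/my-game:1.0   # placeholder image
```

The taint keeps system and batch workloads off the pool; the nodeSelector keeps game servers off everything else.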
Node Capacity Planning
Calculate nodes needed from per-GameServer requests and per-node allocatable capacity. For example, an 8-vCPU node with roughly 0.5 vCPU reserved for overhead fits about 15 GameServers requesting 500m CPU each. Account for:
- System daemons (kubelet, kube-proxy)
- Monitoring agents (node-exporter)
- Logging agents (fluent-bit)
- CNI overhead
Resource Management
- GameServer Resources
- SDK Sidecar Resources
- Controller Resources
Always set resource requests and limits:
Set limits to 2x requests to allow bursts while preventing resource hogging. Monitor actual usage and adjust accordingly.
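A minimal sketch of the request/limit pattern inside a GameServer's pod template; the values are placeholders you would replace with your profiled numbers:

```yaml
# Sketch: requests from profiled usage, limits at ~2x requests.
apiVersion: agones.dev/v1
kind: GameServer
metadata:
  name: my-game
spec:
  ports:
    - name: default
      containerPort: 7654
  template:
    spec:
      containers:
        - name: game-server
          image: example.com/my-game:1.0   # placeholder image
          resources:
            requests:
              cpu: "250m"       # ~95th percentile observed usage
              memory: "256Mi"
            limits:
              cpu: "500m"       # 2x request: allows bursts, caps hogging
              memory: "512Mi"
```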
Fleet Configuration
Autoscaling Strategy
Choose buffer size
Buffer = ready GameServers available for immediate allocation. Sizing guidelines:
- Small games (< 100 CCU): buffer = 5-10
- Medium games (100-1000 CCU): buffer = 10-20
- Large games (> 1000 CCU): buffer = 20-50 or use percentage
Set appropriate min/max
- minReplicas: Cover baseline load (e.g., internal testing, monitoring)
- maxReplicas: Set to node capacity × GameServers per node
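A buffer-based FleetAutoscaler tying these numbers together might look like this; the fleet name and counts are placeholders:

```yaml
apiVersion: autoscaling.agones.dev/v1
kind: FleetAutoscaler
metadata:
  name: my-game-autoscaler
spec:
  fleetName: my-game        # must match an existing Fleet
  policy:
    type: Buffer
    buffer:
      bufferSize: 20        # ready servers held for immediate allocation
      minReplicas: 10       # covers baseline load (testing, monitoring)
      maxReplicas: 400      # e.g. 20 nodes x 20 GameServers per node
```

`bufferSize` can also be a percentage (e.g. `"10%"`) for larger fleets, per the sizing guidelines above.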
Health Checks
Configure robust health checking.
Health check tuning
Aggressive (fast failure detection):
- periodSeconds: 3
- failureThreshold: 2
- initialDelaySeconds: 5
- Use for: Session-based games, quick matches
Conservative (slow failure detection, tolerant of transient hiccups):
- periodSeconds: 10
- failureThreshold: 5
- initialDelaySeconds: 30
- Use for: Persistent worlds, long sessions
Balanced (recommended starting point):
- periodSeconds: 5
- failureThreshold: 3
- initialDelaySeconds: 10-15
- Use for: Most game types
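In a GameServer spec these knobs live under `spec.health`. A sketch of the balanced profile:

```yaml
apiVersion: agones.dev/v1
kind: GameServer
metadata:
  name: my-game
spec:
  health:
    disabled: false
    initialDelaySeconds: 10   # balanced profile: grace period at startup
    periodSeconds: 5          # server must report healthy within each period
    failureThreshold: 3       # missed periods before marked Unhealthy
  ports:
    - name: default
      containerPort: 7654
  template:
    spec:
      containers:
        - name: game-server
          image: example.com/my-game:1.0   # placeholder image
```

The game server process must call the SDK's `Health()` ping at least once per `periodSeconds`, or Agones will count a failure.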
Networking
Port Allocation
- Dynamic Ports (Recommended)
- Passthrough (Advanced)
- Static (Not Recommended)
Let Agones assign ports from a range, and configure that range at install time. Benefits:
- No port conflicts
- Higher density (more GameServers per node)
- Simpler configuration
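A sketch combining a Dynamic port policy with the Helm-configured allocation range; the second document is a Helm values fragment, not a cluster manifest, and the range values are the chart defaults:

```yaml
# GameServer: let Agones pick the host port.
apiVersion: agones.dev/v1
kind: GameServer
metadata:
  name: my-game
spec:
  ports:
    - name: default
      portPolicy: Dynamic     # Agones assigns a host port from the range
      containerPort: 7654
      protocol: UDP
  template:
    spec:
      containers:
        - name: game-server
          image: example.com/my-game:1.0   # placeholder image
---
# Helm values fragment: the host-port range Agones allocates from.
gameservers:
  minPort: 7000
  maxPort: 8000
```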
Firewall Rules
Ensure game ports are accessible through your cloud provider's firewall:
- GKE
- EKS
- AKS
High Availability
Controller HA
Set via Helm:
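A values-file sketch; the parameter names assume a recent Agones Helm chart, so verify them against your chart version:

```yaml
# Helm values fragment (sketch; confirm keys for your Agones chart version).
agones:
  controller:
    replicas: 3           # run multiple controller replicas for HA
    pdb:
      minAvailable: 2     # keep a majority available during node drains
```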
Multi-Zone Deployment
Monitoring and Observability
Essential Metrics
Monitor these key metrics:
GameServer Health
Allocation Performance
Fleet Capacity
Node Utilization
Alerting Rules
Critical alerts for production (prometheus-rules.yaml):
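A sketch of such rules using the Prometheus Operator's `PrometheusRule` resource. The metric names come from Agones' Prometheus integration and the thresholds are placeholders; verify both against your version:

```yaml
# prometheus-rules.yaml (sketch; verify metric names for your Agones version).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: agones-alerts
spec:
  groups:
    - name: agones
      rules:
        - alert: FleetBufferExhausted
          expr: agones_fleets_replicas_count{type="ready"} < 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Fleet has fewer than 5 ready GameServers"
        - alert: UnhealthyGameServers
          expr: sum(agones_gameservers_count{type="Unhealthy"}) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "GameServers are stuck in the Unhealthy state"
```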
Security
RBAC Configuration
Follow the principle of least privilege.
Network Policies
Restrict network access to game server pods.
Operational Procedures
Deployment Strategy
Canary Deployment
Test new game server versions with a small subset of capacity. Monitor canary metrics before a full rollout.
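One way to sketch this is a small second Fleet running the candidate image alongside the stable one; the names, labels, image tag, and replica count are placeholders:

```yaml
# Sketch: canary Fleet running the new version next to the stable Fleet.
apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: my-game-canary
  labels:
    track: canary
spec:
  replicas: 5                 # small subset vs. the stable Fleet
  template:                   # GameServer template
    spec:
      ports:
        - name: default
          containerPort: 7654
      template:               # Pod template
        spec:
          containers:
            - name: game-server
              image: example.com/my-game:1.1   # candidate version
```

Allocations can then be steered between the two tracks, for example with ordered selectors in a GameServerAllocation.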
Backup and Disaster Recovery
Cost Optimization
Right-size Resources
- Profile actual resource usage
- Set requests to 95th percentile usage
- Set limits to 2x requests
- Use Vertical Pod Autoscaler for recommendations
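If the Vertical Pod Autoscaler is installed in the cluster, a recommendation-only sketch looks like this. `updateMode: "Off"` never evicts running game servers; whether VPA can target a Fleet directly depends on your VPA setup, so treat the `targetRef` as an assumption to validate:

```yaml
# Sketch: VPA in recommendation-only mode (requires VPA installed).
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-game-vpa
spec:
  targetRef:
    apiVersion: agones.dev/v1
    kind: Fleet
    name: my-game           # placeholder Fleet name
  updatePolicy:
    updateMode: "Off"       # recommend only; never evict running servers
```

Read the recommendations from the VPA status and fold them back into the Fleet's resource requests manually.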
Use Spot/Preemptible Nodes
- 60-80% cost savings
- Requires graceful shutdown handling
- Use for non-critical game modes
- Mix with on-demand nodes for stability
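A sketch of steering a non-critical Fleet onto a spot pool; the `node-type=spot` label and taint are hypothetical names you would apply to your spot node pool:

```yaml
# Sketch: run a non-critical game mode on spot/preemptible nodes.
# Assumes a hypothetical "node-type=spot" label and taint on the spot pool.
apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: casual-mode
spec:
  replicas: 50
  template:                 # GameServer template
    spec:
      ports:
        - name: default
          containerPort: 7654
      template:             # Pod template
        spec:
          nodeSelector:
            node-type: spot
          tolerations:
            - key: node-type
              value: spot
              effect: NoSchedule
          containers:
            - name: game-server
              image: example.com/my-game:1.0   # placeholder image
```

Pair this with SDK shutdown handling so sessions end gracefully when the node is reclaimed.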
Scale to Zero Off-Peak
Optimize Node Size
- Larger nodes = fewer nodes = less overhead
- But reduces scheduling flexibility
- Balance based on GameServer size
- Test different node types
Testing
Load Testing
Chaos Testing
Checklist
Before going to production:
- Resource requests/limits configured
- Health checks tuned
- Autoscaling configured and tested
- Monitoring and alerting set up
- Firewall rules configured
- Multi-zone deployment enabled
- Backup procedures documented
- Rollback procedures tested
- Load testing completed
- Runbooks created for common issues
- On-call rotation established
- Disaster recovery plan documented
Next Steps
Monitoring
Set up comprehensive monitoring
Troubleshooting
Learn to debug common issues
Upgrades
Plan for safe upgrades
Multi-Cluster
Deploy across multiple regions
