Overview
This guide covers performance optimization techniques for running Agones at scale, including controller tuning, resource management, and cluster optimization.
API Server QPS Tuning
The Agones controller can be configured to adjust its rate of requests to the Kubernetes API server:
helm install agones agones/agones \
--set agones.controller.apiServerQPS=400 \
--set agones.controller.apiServerQPSBurst=500
Default values are QPS=400 and Burst=500. Increase these for larger clusters with thousands of game servers.
From the allocator source code (cmd/allocator/main.go:99-100):
viper.SetDefault(apiServerSustainedQPSFlag, 400)
viper.SetDefault(apiServerBurstQPSFlag, 500)
Worker Queue Configuration
Agones uses multiple specialized worker queues for different operations:
# Controller deployment configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agones-controller
spec:
  template:
    spec:
      containers:
        - name: agones-controller
          env:
            # Number of workers for general operations
            - name: NUM_WORKERS
              value: "100"
            # Separate workers for creation operations
            - name: CREATION_WORKERS
              value: "50"
            # Separate workers for deletion operations
            - name: DELETION_WORKERS
              value: "50"
Increasing workers improves parallelism but also increases API server load. Balance based on your cluster capacity.
Allocation Batch Processing
The allocator batches allocation requests to improve throughput:
helm install agones agones/agones \
--set agones.allocator.allocationBatchWaitTime=500ms
From cmd/allocator/main.go:110:
viper.SetDefault(allocationBatchWaitTime, 500*time.Millisecond)
Lower values decrease latency but reduce batching efficiency. Higher values increase throughput but add latency.
Resource Optimization
Controller Resources
Optimize controller resource allocation based on cluster size:
# For small clusters (< 100 game servers)
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
# For medium clusters (100-1000 game servers)
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi
# For large clusters (1000+ game servers)
resources:
  requests:
    cpu: 1000m
    memory: 1Gi
  limits:
    cpu: 2000m
    memory: 2Gi
Helm configuration:
helm install agones agones/agones \
--set agones.controller.resources.requests.cpu=1000m \
--set agones.controller.resources.requests.memory=1Gi \
--set agones.controller.resources.limits.cpu=2000m \
--set agones.controller.resources.limits.memory=2Gi
Sidecar Resource Tuning
The SDK sidecar runs alongside every game server. Optimize its resources:
helm install agones agones/agones \
--set agones.sdkServer.sidecar.resources.requests.cpu=50m \
--set agones.sdkServer.sidecar.resources.requests.memory=64Mi \
--set agones.sdkServer.sidecar.resources.limits.cpu=100m \
--set agones.sdkServer.sidecar.resources.limits.memory=128Mi
For minimal overhead:
sidecar:
  resources:
    requests:
      cpu: 30m     # Minimum viable
      memory: 32Mi # Minimum viable
    limits:
      cpu: 50m
      memory: 64Mi
SDK Rate Limiting
Limit SDK request rate to prevent sidecar overload:
helm install agones agones/agones \
--set agones.sdkServer.sidecar.requestsRateLimit=100
This sets a limit of 100 requests per second per sidecar.
Port Range Configuration
From pkg/portallocator/portallocator.go:64-84, the port allocator manages dynamic port assignment:
func New(portRanges map[string]PortRange,
	kubeInformerFactory informers.SharedInformerFactory,
	agonesInformerFactory externalversions.SharedInformerFactory) Interface {
	return newAllocator(portRanges, kubeInformerFactory, agonesInformerFactory)
}

type PortRange struct {
	MinPort int32
	MaxPort int32
}
Optimize port ranges for your workload:
# Default range (1001 ports)
helm install agones agones/agones \
--set agones.gameservers.minPort=7000 \
--set agones.gameservers.maxPort=8000
# Large deployment (10001 ports)
helm install agones agones/agones \
--set agones.gameservers.minPort=7000 \
--set agones.gameservers.maxPort=17000
Each node can support hundreds of game servers with the right port range. Since the range is inclusive, calculate: (MaxPort - MinPort + 1) / PortsPerGameServer = Max GameServers per Node.
Static Port Policy
Use Static port policy to skip dynamic allocation:
apiVersion: agones.dev/v1
kind: GameServer
spec:
  ports:
    - name: default
      portPolicy: Static # No dynamic allocation overhead
      hostPort: 7654
      containerPort: 7654
Benefits:
- No port allocator overhead
- Predictable port numbers
- Faster GameServer creation
Drawbacks:
- Manual port management
- Port conflicts possible
- Less flexible scaling
Pod Network Optimization
Use host networking for maximum performance:
apiVersion: agones.dev/v1
kind: GameServer
spec:
  template:
    spec:
      hostNetwork: true # Bypass pod network overlay
      dnsPolicy: ClusterFirstWithHostNet
Host networking limits one GameServer per port per node and has security implications. Use cautiously.
Bypass kube-proxy
For latency-sensitive workloads, use PortPolicy None to bypass kube-proxy:
apiVersion: agones.dev/v1
kind: GameServer
spec:
  ports:
    - name: game
      portPolicy: None # No hostPort; clients use the containerPort directly
      containerPort: 7654
      protocol: UDP
Clients connect directly to the pod IP and containerPort, with no host port mapping in between. This requires pod IPs that are routable from game clients.
Allocation Strategy
Choose the right scheduling strategy for your use case:
apiVersion: allocation.agones.dev/v1
kind: GameServerAllocation
spec:
  scheduling: Packed # Bin-packing for cloud (default)
From pkg/apis/scheduling.go:18-30:
const (
	// Packed scheduling strategy will prioritise allocating GameServers
	// on Nodes with the most Allocated, and then Ready GameServers
	// to bin pack as many Allocated GameServers on a single node.
	// This is most useful for dynamic Kubernetes clusters - such as on Cloud Providers.
	Packed SchedulingStrategy = "Packed"

	// Distributed scheduling strategy will prioritise allocating GameServers
	// on Nodes with the least Allocated, and then Ready GameServers
	// to distribute Allocated GameServers across many nodes.
	// This is most useful for statically sized Kubernetes clusters - such as on physical hardware.
	Distributed SchedulingStrategy = "Distributed"
)
Packed (Cloud environments):
- Maximizes node utilization
- Enables aggressive scale-down
- Reduces infrastructure costs
Distributed (On-premises/bare metal):
- Spreads load across all nodes
- Better fault tolerance
- More consistent performance
Allocation Caching
The allocator maintains a local cache of Ready game servers so allocations do not have to list game servers on every request. For multi-cluster allocation, also tune the remote allocation timeouts:
helm install agones agones/agones \
--set agones.allocator.remoteAllocationTimeout=10s \
--set agones.allocator.totalRemoteAllocationTimeout=30s
From cmd/allocator/main.go:107-108:
viper.SetDefault(remoteAllocationTimeoutFlag, 10*time.Second)
viper.SetDefault(totalRemoteAllocationTimeoutFlag, 30*time.Second)
Buffer Size Optimization
Maintain a buffer of Ready game servers:
apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: game-fleet
spec:
  replicas: 100
  # Aim for roughly 20% Ready headroom:
  # 80 Allocated + 20 Ready = 100 total
With autoscaling:
apiVersion: autoscaling.agones.dev/v1
kind: FleetAutoscaler
metadata:
  name: game-fleet-autoscaler
spec:
  fleetName: game-fleet
  policy:
    type: Buffer
    buffer:
      bufferSize: 20    # Keep 20 Ready servers
      minReplicas: 10   # Never scale below 10
      maxReplicas: 1000 # Never scale above 1000
Rolling Update Strategy
Optimize Fleet updates:
apiVersion: agones.dev/v1
kind: Fleet
spec:
  replicas: 100
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%       # Create 25 new before deleting old
      maxUnavailable: 25% # Allow 25 to be unavailable during update
For zero-downtime updates:
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 100%     # Double capacity during rollout
    maxUnavailable: 0% # Never reduce capacity
Metrics and Monitoring
Enable Prometheus Metrics
helm install agones agones/agones \
--set agones.metrics.prometheusEnabled=true \
--set agones.metrics.prometheusServiceDiscovery=true
Key metrics to monitor:
# Controller queue depth
workqueue_depth{name="gameservers"}
# Allocation latency
agones_gameserver_allocations_duration_seconds
# GameServer state distribution
agones_gameservers_count{type="Ready"}
agones_gameservers_count{type="Allocated"}
# Fleet desired vs current replicas
agones_fleets_replicas_count
agones_gameservers_count{fleet_name="my-fleet"}
# Node utilization
agones_nodes_count
agones_gameservers_node_count
Enable pprof for the controller:
env:
  - name: ENABLE_PPROF
    value: "true"
Access profiling endpoints:
# CPU profile
kubectl port-forward -n agones-system deploy/agones-controller 6060:6060
curl http://localhost:6060/debug/pprof/profile > cpu.prof
go tool pprof cpu.prof
# Memory profile
curl http://localhost:6060/debug/pprof/heap > mem.prof
go tool pprof mem.prof
# Goroutine profile
curl http://localhost:6060/debug/pprof/goroutine > goroutine.prof
Cluster-Level Optimization
Node Configuration
Optimize nodes for game server workloads:
# Node labels for game server placement
kubectl label nodes <node-name> \
agones.dev/gameserver=true \
node.kubernetes.io/instance-type=c5.2xlarge
Use taints to dedicate nodes:
kubectl taint nodes <node-name> \
agones.dev/gameserver=true:NoSchedule
Then configure GameServers with tolerations:
spec:
  template:
    spec:
      tolerations:
        - key: agones.dev/gameserver
          operator: Equal
          value: "true"
          effect: NoSchedule
Cluster Autoscaling
Configure cluster autoscaler for game server nodes:
# GKE example
gcloud container node-pools create game-servers \
--cluster=my-cluster \
--enable-autoscaling \
--min-nodes=3 \
--max-nodes=100 \
--machine-type=c2-standard-4 \
--node-labels=agones.dev/gameserver=true
Set appropriate scale-down delay:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler
data:
  scale-down-delay-after-add: "10m"
  scale-down-unneeded-time: "10m"
etcd Tuning
For large Agones deployments, tune etcd on the Kubernetes control plane:
# Increase etcd quota (Kubernetes control plane)
--quota-backend-bytes=8589934592 # 8GB (default is 2GB)
# Enable etcd metrics
--metrics=extensive
Monitor etcd health:
ETCDCTL_API=3 etcdctl endpoint status --cluster
ETCDCTL_API=3 etcdctl endpoint health --cluster
Load Testing Allocations
#!/bin/bash
# Stress test: fire 1000 allocation requests in parallel
for i in {1..1000}; do
  kubectl create -f - <<EOF &
apiVersion: allocation.agones.dev/v1
kind: GameServerAllocation
metadata:
  generateName: load-test-
spec:
  selectors:
    - matchLabels:
        agones.dev/fleet: game-fleet
EOF
done
wait
Measure Allocation Latency
time kubectl create -f gameserverallocation.yaml
Fleet Scale Testing
# Scale to 1000 game servers
kubectl scale fleet game-fleet --replicas=1000
# Measure time to Ready
watch kubectl get fleet game-fleet
Best Practices
- Right-size Controller: Set appropriate CPU/memory based on cluster size.
- Tune API QPS: Increase QPS limits for large clusters (>1000 game servers).
- Optimize Sidecar: Minimize sidecar resources while maintaining stability.
- Choose Strategy: Use Packed for cloud, Distributed for on-premises.
- Buffer Sizing: Maintain an adequate Ready buffer for instant allocations.
- Monitor Metrics: Set up Prometheus and alert on queue depth and allocation latency.
- Cluster Autoscaling: Configure node autoscaling with appropriate delays.
- Load Test: Test allocation throughput before production launch.