Overview

This guide covers performance optimization techniques for running Agones at scale, including controller tuning, resource management, and cluster optimization.

Controller Performance

API Server QPS Tuning

The Agones controller can be configured to adjust its rate of requests to the Kubernetes API server:
helm install agones agones/agones \
  --set agones.controller.apiServerQPS=400 \
  --set agones.controller.apiServerQPSBurst=500
Default values are QPS=400 and Burst=500. Increase these for larger clusters with thousands of game servers.
The allocator service uses the same defaults (cmd/allocator/main.go:99-100):
viper.SetDefault(apiServerSustainedQPSFlag, 400)
viper.SetDefault(apiServerBurstQPSFlag, 500)

Worker Queue Configuration

Agones uses multiple specialized worker queues for different operations:
# Controller deployment configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agones-controller
spec:
  template:
    spec:
      containers:
      - name: agones-controller
        env:
        # Number of workers for general operations
        - name: NUM_WORKERS
          value: "100"
        # Separate workers for creation operations
        - name: CREATION_WORKERS
          value: "50"
        # Separate workers for deletion operations  
        - name: DELETION_WORKERS
          value: "50"
Increasing workers improves parallelism but also increases API server load. Balance based on your cluster capacity.

Allocation Batch Processing

The allocator batches allocation requests to improve throughput:
helm install agones agones/agones \
  --set agones.allocator.allocationBatchWaitTime=500ms
From cmd/allocator/main.go:110:
viper.SetDefault(allocationBatchWaitTime, 500*time.Millisecond)
Lower values decrease latency but reduce batching efficiency. Higher values increase throughput but add latency.

Resource Optimization

Controller Resources

Optimize controller resource allocation based on cluster size:
# For small clusters (< 100 game servers)
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

# For medium clusters (100-1000 game servers)
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi

# For large clusters (1000+ game servers)
resources:
  requests:
    cpu: 1000m
    memory: 1Gi
  limits:
    cpu: 2000m
    memory: 2Gi
Helm configuration:
helm install agones agones/agones \
  --set agones.controller.resources.requests.cpu=1000m \
  --set agones.controller.resources.requests.memory=1Gi \
  --set agones.controller.resources.limits.cpu=2000m \
  --set agones.controller.resources.limits.memory=2Gi

Sidecar Resource Tuning

The SDK sidecar runs alongside every game server. Optimize its resources:
helm install agones agones/agones \
  --set agones.sdkServer.sidecar.resources.requests.cpu=50m \
  --set agones.sdkServer.sidecar.resources.requests.memory=64Mi \
  --set agones.sdkServer.sidecar.resources.limits.cpu=100m \
  --set agones.sdkServer.sidecar.resources.limits.memory=128Mi
For minimal overhead:
sidecar:
  resources:
    requests:
      cpu: 30m      # Minimum viable
      memory: 32Mi  # Minimum viable
    limits:
      cpu: 50m
      memory: 64Mi

SDK Rate Limiting

Limit SDK request rate to prevent sidecar overload:
helm install agones agones/agones \
  --set agones.sdkServer.sidecar.requestsRateLimit=100
This sets a limit of 100 requests per second per sidecar.

Port Allocation Performance

Port Range Configuration

From pkg/portallocator/portallocator.go:64-84, the port allocator manages dynamic port assignment:
func New(portRanges map[string]PortRange,
    kubeInformerFactory informers.SharedInformerFactory,
    agonesInformerFactory externalversions.SharedInformerFactory) Interface {
    return newAllocator(portRanges, kubeInformerFactory, agonesInformerFactory)
}

type PortRange struct {
    MinPort int32
    MaxPort int32
}
Optimize port ranges for your workload:
# Default range (1000 ports)
helm install agones agones/agones \
  --set agones.gameservers.minPort=7000 \
  --set agones.gameservers.maxPort=8000

# Large deployment (10000 ports)
helm install agones agones/agones \
  --set agones.gameservers.minPort=7000 \
  --set agones.gameservers.maxPort=17000
Each node can support hundreds of game servers with the right port range. Calculate: (MaxPort - MinPort) / PortsPerGameServer = Max GameServers per Node.

Static Port Policy

Use Static port policy to skip dynamic allocation:
apiVersion: agones.dev/v1
kind: GameServer
spec:
  ports:
  - name: default
    portPolicy: Static  # No dynamic allocation overhead
    hostPort: 7654
    containerPort: 7654
Benefits:
  • No port allocator overhead
  • Predictable port numbers
  • Faster GameServer creation
Drawbacks:
  • Manual port management
  • Port conflicts possible
  • Less flexible scaling

Network Performance

Pod Network Optimization

Use host networking for maximum performance:
apiVersion: agones.dev/v1
kind: GameServer
spec:
  template:
    spec:
      hostNetwork: true  # Bypass pod network overlay
      dnsPolicy: ClusterFirstWithHostNet
Host networking limits one GameServer per port per node and has security implications. Use cautiously.

Bypass kube-proxy

For latency-sensitive workloads, use PortPolicy None to bypass kube-proxy:
apiVersion: agones.dev/v1
kind: GameServer
spec:
  ports:
  - name: game
    portPolicy: None  # No hostPort, direct to containerPort
    containerPort: 7654
    protocol: UDP
Clients connect directly to the pod IP, bypassing NodePort overhead.

Allocation Performance

Allocation Strategy

Choose the right scheduling strategy for your use case:
apiVersion: allocation.agones.dev/v1
kind: GameServerAllocation
spec:
  scheduling: Packed  # Bin-packing for cloud (default)
From pkg/apis/scheduling.go:18-30:
const (
    // Packed scheduling strategy will prioritise allocating GameServers
    // on Nodes with the most Allocated, and then Ready GameServers
    // to bin pack as many Allocated GameServers on a single node.
    // This is most useful for dynamic Kubernetes clusters - such as on Cloud Providers.
    Packed SchedulingStrategy = "Packed"

    // Distributed scheduling strategy will prioritise allocating GameServers
    // on Nodes with the least Allocated, and then Ready GameServers
    // to distribute Allocated GameServers across many nodes.
    // This is most useful for statically sized Kubernetes clusters - such as on physical hardware.
    Distributed SchedulingStrategy = "Distributed"
)
Packed (Cloud environments):
  • Maximizes node utilization
  • Enables aggressive scale-down
  • Reduces infrastructure costs
Distributed (On-premises/bare metal):
  • Spreads load across all nodes
  • Better fault tolerance
  • More consistent performance

Allocation Caching

The allocator serves requests from a cache of Ready game servers; for multi-cluster setups, tune the remote allocation timeouts:
helm install agones agones/agones \
  --set agones.allocator.remoteAllocationTimeout=10s \
  --set agones.allocator.totalRemoteAllocationTimeout=30s
From cmd/allocator/main.go:107-108:
viper.SetDefault(remoteAllocationTimeoutFlag, 10*time.Second)
viper.SetDefault(totalRemoteAllocationTimeoutFlag, 30*time.Second)

Fleet Scaling Performance

Buffer Size Optimization

Maintain a buffer of Ready game servers:
apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: game-fleet
spec:
  replicas: 100
  # Keep 20% Ready for instant allocation
  # 80 Allocated + 20 Ready = 100 total
With autoscaling:
apiVersion: autoscaling.agones.dev/v1
kind: FleetAutoscaler
metadata:
  name: game-fleet-autoscaler
spec:
  fleetName: game-fleet
  policy:
    type: Buffer
    buffer:
      bufferSize: 20      # Keep 20 Ready servers
      minReplicas: 10     # Never scale below 10
      maxReplicas: 1000   # Never scale above 1000

Rolling Update Strategy

Optimize Fleet updates:
apiVersion: agones.dev/v1
kind: Fleet
spec:
  replicas: 100
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # Create 25 new before deleting old
      maxUnavailable: 25%  # Allow 25 to be unavailable during update
For zero-downtime updates:
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 100%      # Double capacity during rollout
    maxUnavailable: 0%  # Never reduce capacity

Metrics and Monitoring

Enable Prometheus Metrics

helm install agones agones/agones \
  --set agones.metrics.prometheusEnabled=true \
  --set agones.metrics.prometheusServiceDiscovery=true
Key metrics to monitor:
# Controller queue depth
workqueue_depth{name="gameservers"}

# Allocation latency
allocation_duration_seconds

# GameServer state distribution
agones_gameservers_count{state="Ready"}
agones_gameservers_count{state="Allocated"}

# Fleet desired vs current
agones_fleet_replicas_total
agones_gameservers_total{fleet="my-fleet"}

# Node utilization
agones_nodes_count
agones_gameservers_node_count

Performance Profiling

Enable pprof for the controller:
env:
- name: ENABLE_PPROF
  value: "true"
Access profiling endpoints:
# CPU profile
kubectl port-forward -n agones-system deploy/agones-controller 6060:6060
curl http://localhost:6060/debug/pprof/profile > cpu.prof
go tool pprof cpu.prof

# Memory profile
curl http://localhost:6060/debug/pprof/heap > mem.prof
go tool pprof mem.prof

# Goroutine profile
curl http://localhost:6060/debug/pprof/goroutine > goroutine.prof

Cluster-Level Optimization

Node Configuration

Optimize nodes for game server workloads:
# Node labels for game server placement
kubectl label nodes <node-name> \
  agones.dev/gameserver=true \
  node.kubernetes.io/instance-type=c5.2xlarge
Use taints to dedicate nodes:
kubectl taint nodes <node-name> \
  agones.dev/gameserver=true:NoSchedule
Then configure GameServers with tolerations:
spec:
  template:
    spec:
      tolerations:
      - key: agones.dev/gameserver
        operator: Equal
        value: "true"
        effect: NoSchedule

Cluster Autoscaling

Configure cluster autoscaler for game server nodes:
# GKE example
gcloud container node-pools create game-servers \
  --cluster=my-cluster \
  --enable-autoscaling \
  --min-nodes=3 \
  --max-nodes=100 \
  --machine-type=c2-standard-4 \
  --node-labels=agones.dev/gameserver=true
Set appropriate scale-down delay:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler
data:
  scale-down-delay-after-add: "10m"
  scale-down-unneeded-time: "10m"

etcd Performance

For large Agones deployments, tune etcd:
# Increase etcd quota (Kubernetes control plane)
--quota-backend-bytes=8589934592  # 8GB (default is 2GB)

# Enable etcd metrics
--metrics=extensive
Monitor etcd health:
ETCDCTL_API=3 etcdctl endpoint status --cluster
ETCDCTL_API=3 etcdctl endpoint health --cluster

Performance Testing

Load Testing Allocations

#!/bin/bash
# Stress test allocations
for i in {1..1000}; do
  kubectl create -f - <<EOF &
apiVersion: allocation.agones.dev/v1
kind: GameServerAllocation
metadata:
  generateName: load-test-
spec:
  selectors:
  - matchLabels:
      agones.dev/fleet: game-fleet
EOF
done
wait

Measure Allocation Latency

time kubectl create -f gameserverallocation.yaml

Fleet Scale Testing

# Scale to 1000 game servers
kubectl scale fleet game-fleet --replicas=1000

# Measure time to Ready
watch kubectl get fleet game-fleet

Performance Checklist

1. Right-size Controller: set appropriate CPU/memory based on cluster size.
2. Tune API QPS: increase QPS limits for large clusters (>1000 game servers).
3. Optimize Sidecar: minimize sidecar resources while maintaining stability.
4. Choose Strategy: use Packed for cloud, Distributed for on-premises.
5. Buffer Sizing: maintain an adequate Ready buffer for instant allocations.
6. Monitor Metrics: set up Prometheus and alert on queue depth and allocation latency.
7. Cluster Autoscaling: configure node autoscaling with appropriate delays.
8. Load Test: test allocation throughput before production launch.