This guide covers best practices for running Agones in production, based on real-world deployments and lessons learned.

Architecture Design

Cluster Sizing

Node Pool Strategy

Use dedicated node pools for game servers:
# Taint game server nodes
kubectl taint nodes <node-name> \
  agones.dev/gameserver=true:NoSchedule

# GameServer tolerations
spec:
  template:
    spec:
      tolerations:
      - key: agones.dev/gameserver
        operator: Equal
        value: "true"
        effect: NoSchedule
Benefits:
  • Predictable resource allocation
  • Isolation from system workloads
  • Easier capacity planning
  • Better node autoscaling
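Note that a toleration only permits GameServers to land on the tainted pool; it does not keep them there. Pairing the taint with a node label and a nodeSelector pins them. A sketch, assuming the pool's nodes are also labeled `agones.dev/gameserver=true` (the label name here is illustrative):

```yaml
spec:
  template:
    spec:
      nodeSelector:
        agones.dev/gameserver: "true"   # label applied to the dedicated pool
      tolerations:
      - key: agones.dev/gameserver
        operator: Equal
        value: "true"
        effect: NoSchedule
```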

Node Capacity Planning

Calculate nodes needed:
GameServers per node = 
  (Node CPU - System overhead) / GameServer CPU request

Example:
Node: 4 vCPU, 16GB RAM
System overhead: 0.5 vCPU, 2GB RAM
GameServer: 0.5 vCPU, 1GB RAM

Capacity: (4 - 0.5) / 0.5 = 7 GameServers/node
(memory would allow (16 - 2) / 1 = 14, so CPU is the binding constraint)
Account for:
  • System daemons (kubelet, kube-proxy)
  • Monitoring agents (node-exporter)
  • Logging agents (fluent-bit)
  • CNI overhead
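The arithmetic above can be checked with a quick shell snippet, working in millicores and MiB to stay in integers. The node figures are the example's, not a recommendation:

```shell
# Capacity is bounded by whichever resource runs out first.
node_cpu_m=4000;   overhead_cpu_m=500;   gs_cpu_m=500      # millicores
node_mem_mi=16384; overhead_mem_mi=2048; gs_mem_mi=1024    # MiB

by_cpu=$(( (node_cpu_m - overhead_cpu_m) / gs_cpu_m ))
by_mem=$(( (node_mem_mi - overhead_mem_mi) / gs_mem_mi ))
capacity=$(( by_cpu < by_mem ? by_cpu : by_mem ))
echo "$capacity GameServers/node"   # → 7 (CPU-bound; memory would allow 14)
```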

Resource Management

Always set resource requests and limits:
apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: production-fleet
spec:
  template:
    spec:
      template:
        spec:
          containers:
          - name: game-server
            image: my-game:v1.0.0
            resources:
              requests:
                cpu: "500m"      # Guaranteed CPU
                memory: "1Gi"    # Guaranteed memory
              limits:
                cpu: "1000m"     # Max CPU (2x request)
                memory: "2Gi"    # Max memory (2x request)
Set limits to 2x requests to allow bursts while preventing resource hogging. Monitor actual usage and adjust accordingly.

Fleet Configuration

Autoscaling Strategy

1. Choose buffer size

Buffer = ready GameServers available for immediate allocation
apiVersion: autoscaling.agones.dev/v1
kind: FleetAutoscaler
metadata:
  name: production-autoscaler
spec:
  fleetName: production-fleet
  policy:
    type: Buffer
    buffer:
      bufferSize: 10      # Absolute count
      minReplicas: 10     # Minimum capacity
      maxReplicas: 100    # Maximum capacity
Sizing guidelines:
  • Small games (< 100 CCU): buffer = 5-10
  • Medium games (100-1000 CCU): buffer = 10-20
  • Large games (> 1000 CCU): buffer = 20-50 or use percentage
buffer:
  bufferSize: "20%"  # 20% of allocated should be ready
  minReplicas: 10
  maxReplicas: 200
2. Set appropriate min/max

  • minReplicas: Cover baseline load (e.g., internal testing, monitoring)
  • maxReplicas: Set to node capacity × GameServers per node
# Example: 20 nodes, 7 GameServers per node
maxReplicas: 140  # 20 × 7
3. Configure update and scale-down behavior

When a Fleet scales down, Agones removes Ready GameServers first and leaves Allocated ones running until they finish. The rolling update strategy controls how many replicas are replaced at once during fleet updates:

apiVersion: agones.dev/v1
kind: Fleet
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  template:
    spec:
      sdkServer:
        logLevel: Info
        grpcPort: 9357
        httpPort: 9358

Health Checks

Configure robust health checking:
apiVersion: agones.dev/v1
kind: GameServer
spec:
  health:
    disabled: false
    periodSeconds: 5          # Check every 5 seconds
    failureThreshold: 3       # Mark unhealthy after 3 failures
    initialDelaySeconds: 10   # Wait 10s after start
Aggressive (fast failure detection):
  • periodSeconds: 3
  • failureThreshold: 2
  • initialDelaySeconds: 5
  • Use for: Session-based games, quick matches
Conservative (stable operation):
  • periodSeconds: 10
  • failureThreshold: 5
  • initialDelaySeconds: 30
  • Use for: Persistent worlds, long sessions
Balanced (recommended):
  • periodSeconds: 5
  • failureThreshold: 3
  • initialDelaySeconds: 10-15
  • Use for: Most game types
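As a concrete example, the aggressive profile drops straight into the same `health` block shown above:

```yaml
health:
  disabled: false
  periodSeconds: 3         # aggressive: check every 3 seconds
  failureThreshold: 2      # mark unhealthy after 2 missed checks
  initialDelaySeconds: 5   # short boot window for fast-starting servers
```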

Networking

Port Allocation

Agones assigns each GameServer a host port from a configurable range (7000-8000 by default, set via the Helm values agones.gameservers.minPort and agones.gameservers.maxPort), so every port in that range must be reachable by players.

Firewall Rules

Ensure the full game port range is accessible:
# Allow UDP traffic to game port range
gcloud compute firewall-rules create game-server-firewall \
  --allow udp:7000-8000 \
  --target-tags game-server \
  --source-ranges 0.0.0.0/0

# Tag nodes
gcloud compute instances add-tags <node-name> \
  --tags game-server \
  --zone <zone>

High Availability

Controller HA

# Deploy multiple controller replicas
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agones-controller
  namespace: agones-system
spec:
  replicas: 3  # Run 3 replicas
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: agones
                  component: controller
              topologyKey: kubernetes.io/hostname
Set via Helm:
helm install agones agones/agones \
  --set agones.controller.replicas=3 \
  --namespace agones-system

Multi-Zone Deployment

apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: multi-zone-fleet
spec:
  template:
    spec:
      template:
        spec:
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      agones.dev/fleet: multi-zone-fleet
                  topologyKey: topology.kubernetes.io/zone
This spreads GameServers across availability zones for resilience.
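If a soft preference is not enough, Kubernetes topologySpreadConstraints can cap zone imbalance instead of merely discouraging it. A sketch, allowing at most one extra GameServer per zone:

```yaml
spec:
  template:
    spec:
      template:
        spec:
          topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: ScheduleAnyway   # DoNotSchedule for a hard constraint
            labelSelector:
              matchLabels:
                agones.dev/fleet: multi-zone-fleet
```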

Monitoring and Observability

Essential Metrics

Monitor these key metrics:

GameServer Health

# Ready GameServer count
agones_gameservers_count{type="Ready"}

# Error rate
rate(agones_gameservers_total{type="Error"}[5m])

# Unhealthy rate
rate(agones_gameservers_total{type="Unhealthy"}[5m])

Allocation Performance

# Average latency
sum(rate(agones_gameserver_allocations_duration_seconds_sum[5m])) /
sum(rate(agones_gameserver_allocations_duration_seconds_count[5m]))

# Success rate
sum(rate(agones_gameserver_allocations_duration_seconds_count{status="Allocated"}[5m])) /
sum(rate(agones_gameserver_allocations_duration_seconds_count[5m]))

Fleet Capacity

# Allocation percentage
(agones_fleets_replicas_count{type="allocated"} /
 agones_fleets_replicas_count{type="total"}) * 100

# Available capacity
agones_fleets_replicas_count{type="ready"}

Node Utilization

# Node saturation
agones_nodes_count{empty="false"} /
sum(agones_nodes_count) * 100

# Max GameServers per node
histogram_quantile(1.0,
  sum(rate(agones_gameservers_node_count_bucket[1m])) by (le)
)

Alerting Rules

Critical alerts for production:
prometheus-rules.yaml
groups:
  - name: agones_critical
    interval: 30s
    rules:
      - alert: AgonesControllerDown
        expr: up{job="agones-controller"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Agones controller is down"

      - alert: LowReadyGameServers
        expr: |
          (agones_gameservers_count{type="Ready"} /
           agones_fleets_replicas_count{type="desired"}) < 0.2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Less than 20% GameServers ready"

      - alert: HighAllocationLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(agones_gameserver_allocations_duration_seconds_bucket[5m])) by (le)
          ) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "99th percentile allocation latency > 5s"

      - alert: AllocationFailures
        expr: |
          sum(rate(agones_gameserver_allocations_duration_seconds_count{status!="Allocated"}[5m])) /
          sum(rate(agones_gameserver_allocations_duration_seconds_count[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Allocation failure rate > 10%"

      - alert: FleetAtMaxCapacity
        expr: |
          agones_fleet_autoscalers_limited == 1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Fleet {{ $labels.name }} at max capacity for 30m"

Security

RBAC Configuration

Follow the principle of least privilege:
# Service account for game backend
apiVersion: v1
kind: ServiceAccount
metadata:
  name: game-backend
  namespace: default
---
# Role for allocation only
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: game-backend-allocator
  namespace: default
rules:
- apiGroups: ["allocation.agones.dev"]
  resources: ["gameserverallocations"]
  verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: game-backend-allocator-binding
  namespace: default
subjects:
- kind: ServiceAccount
  name: game-backend
  namespace: default
roleRef:
  kind: Role
  name: game-backend-allocator
  apiGroup: rbac.authorization.k8s.io

Network Policies

Restrict network access:
# Allow game traffic only
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: gameserver-network-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      agones.dev/role: gameserver
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector: {}
    ports:
    - protocol: UDP
      port: 7654
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: agones-system
    ports:
    - protocol: TCP
      port: 443  # API server
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: UDP
      port: 53   # DNS

Operational Procedures

Deployment Strategy

1. Canary Deployment

Test new game server versions on a small subset of capacity:
# Production fleet (90% traffic)
apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: game-production
spec:
  replicas: 90
  template:
    spec:
      template:
        spec:
          containers:
          - name: game-server
            image: my-game:v1.0.0
---
# Canary fleet (10% traffic)
apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: game-canary
spec:
  replicas: 10
  template:
    spec:
      template:
        spec:
          containers:
          - name: game-server
            image: my-game:v1.1.0  # New version
Monitor canary metrics before full rollout.
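Error rates can then be compared per fleet before promoting. A sketch, assuming the fleet_name label that Agones attaches to its GameServer metrics:

```promql
# Canary error rate
rate(agones_gameservers_total{type="Error",fleet_name="game-canary"}[5m])

# Production error rate, for comparison
rate(agones_gameservers_total{type="Error",fleet_name="game-production"}[5m])
```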
2. Blue-Green Deployment

# Create green fleet with new version
kubectl apply -f fleet-green.yaml

# Wait for all GameServers ready
kubectl wait --for=jsonpath='{.status.readyReplicas}'=50 \
  fleet/game-green --timeout=600s

# Switch allocation to green
kubectl apply -f - <<EOF
apiVersion: allocation.agones.dev/v1
kind: GameServerAllocation
spec:
  required:
    matchLabels:
      version: green
EOF

# Scale down blue fleet
kubectl scale fleet game-blue --replicas=0
3. Gradual Rollout

Use Fleet rolling update:
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 10%  # Slow, safe rollout
    maxSurge: 10%

Backup and Disaster Recovery

#!/bin/bash
# Backup script

BACKUP_DIR="backups/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"

# Backup Agones configuration
helm get values agones -n agones-system > "$BACKUP_DIR/helm-values.yaml"

# Backup all Agones resources
kubectl get gameservers,fleets,fleetautoscalers,gameserversets \
  --all-namespaces -o yaml > "$BACKUP_DIR/agones-resources.yaml"

# Backup allocation policies
kubectl get gameserverallocationpolicy --all-namespaces -o yaml \
  > "$BACKUP_DIR/allocation-policies.yaml"

# Backup Agones CRDs
kubectl get crd -o name | grep 'agones.dev' \
  | xargs kubectl get -o yaml > "$BACKUP_DIR/crds.yaml"

tar czf "$BACKUP_DIR.tar.gz" "$BACKUP_DIR"
echo "Backup saved to $BACKUP_DIR.tar.gz"

Cost Optimization

Right-size Resources

  • Profile actual resource usage
  • Set requests to 95th percentile usage
  • Set limits to 2x requests
  • Use Vertical Pod Autoscaler for recommendations

Use Spot/Preemptible Nodes

  • 60-80% cost savings
  • Requires graceful shutdown handling
  • Use for non-critical game modes
  • Mix with on-demand nodes for stability
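The graceful-shutdown piece boils down to trapping SIGTERM, draining, and only then exiting. In a real server the trap would stop matchmaking, wait for the current session to end, and call the SDK's Shutdown() before the node's grace period expires; the sketch below demonstrates only the trap mechanics:

```shell
#!/usr/bin/env bash
drained=false
on_term() {
  # Real server: stop accepting players, let the match finish,
  # then call the Agones SDK Shutdown() before the grace period expires.
  drained=true
}
trap on_term TERM

kill -TERM $$               # simulate the preemption notice
echo "drained=$drained"     # → drained=true
```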

Scale to Zero Off-Peak

# Scale down during maintenance
kubectl scale fleet my-fleet --replicas=0

# Or use scheduled autoscaling
# (requires custom controller)

Optimize Node Size

  • Larger nodes = fewer nodes = less overhead
  • But reduces scheduling flexibility
  • Balance based on GameServer size
  • Test different node types

Testing

Load Testing

# Simulate allocation load
for i in {1..100}; do
  kubectl apply -f - <<EOF
apiVersion: allocation.agones.dev/v1
kind: GameServerAllocation
metadata:
  generateName: load-test-
spec:
  required:
    matchLabels:
      agones.dev/fleet: test-fleet
EOF
  sleep 0.1
done

# Monitor allocation latency
kubectl port-forward -n agones-system svc/agones-controller 8080:8080
watch 'curl -s http://localhost:8080/metrics | grep allocation_duration'

Chaos Testing

# Install chaos-mesh
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm install chaos-mesh chaos-mesh/chaos-mesh -n chaos-mesh --create-namespace

# Test GameServer pod failure
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: gameserver-pod-failure
spec:
  action: pod-failure
  mode: one
  duration: "30s"
  selector:
    namespaces:
      - default
    labelSelectors:
      agones.dev/role: gameserver
EOF

# Verify GameServers recover
kubectl get gs -w

Checklist

Before going to production:
  • Resources requests/limits configured
  • Health checks tuned
  • Autoscaling configured and tested
  • Monitoring and alerting set up
  • Firewall rules configured
  • Multi-zone deployment enabled
  • Backup procedures documented
  • Rollback procedures tested
  • Load testing completed
  • Runbooks created for common issues
  • On-call rotation established
  • Disaster recovery plan documented

Next Steps

Monitoring

Set up comprehensive monitoring

Troubleshooting

Learn to debug common issues

Upgrades

Plan for safe upgrades

Multi-Cluster

Deploy across multiple regions
