This guide covers best practices for running Agones in production, based on real-world deployments and lessons learned.

Architecture Design

Cluster Sizing

Node Pool Strategy

Use dedicated node pools for game servers:
# Taint game server nodes
kubectl taint nodes <node-name> \
  agones.dev/gameserver=true:NoSchedule

# GameServer tolerations
spec:
  template:
    spec:
      tolerations:
      - key: agones.dev/gameserver
        operator: Equal
        value: "true"
        effect: NoSchedule
Benefits:
  • Predictable resource allocation
  • Isolation from system workloads
  • Easier capacity planning
  • Better node autoscaling
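Note that a toleration only permits GameServers to land on the tainted pool; it does not keep them there. Pairing the taint with a node label and a nodeSelector pins them. A sketch, assuming the pool's nodes are also labeled `agones.dev/gameserver=true` (the label name here is illustrative):

```yaml
spec:
  template:
    spec:
      nodeSelector:
        agones.dev/gameserver: "true"   # label applied to the dedicated pool
      tolerations:
      - key: agones.dev/gameserver
        operator: Equal
        value: "true"
        effect: NoSchedule
```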

Node Capacity Planning

Calculate nodes needed:
GameServers per node = 
  (Node CPU - System overhead) / GameServer CPU request

Example:
Node: 4 vCPU, 16GB RAM
System overhead: 0.5 vCPU, 2GB RAM
GameServer: 0.5 vCPU, 1GB RAM

Capacity: (4 - 0.5) / 0.5 = 7 GameServers/node
(memory would allow (16 - 2) / 1 = 14, so CPU is the binding constraint)
Account for:
  • System daemons (kubelet, kube-proxy)
  • Monitoring agents (node-exporter)
  • Logging agents (fluent-bit)
  • CNI overhead
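The arithmetic above can be checked with a quick shell snippet, working in millicores and MiB to stay in integers. The node figures are the example's, not a recommendation:

```shell
# Capacity is bounded by whichever resource runs out first.
node_cpu_m=4000;   overhead_cpu_m=500;   gs_cpu_m=500      # millicores
node_mem_mi=16384; overhead_mem_mi=2048; gs_mem_mi=1024    # MiB

by_cpu=$(( (node_cpu_m - overhead_cpu_m) / gs_cpu_m ))
by_mem=$(( (node_mem_mi - overhead_mem_mi) / gs_mem_mi ))
capacity=$(( by_cpu < by_mem ? by_cpu : by_mem ))
echo "$capacity GameServers/node"   # → 7 (CPU-bound; memory would allow 14)
```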

Resource Management

Always set resource requests and limits:
apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: production-fleet
spec:
  template:
    spec:
      template:
        spec:
          containers:
          - name: game-server
            image: my-game:v1.0.0
            resources:
              requests:
                cpu: "500m"      # Guaranteed CPU
                memory: "1Gi"    # Guaranteed memory
              limits:
                cpu: "1000m"     # Max CPU (2x request)
                memory: "2Gi"    # Max memory (2x request)
Set limits to 2x requests to allow bursts while preventing resource hogging. Monitor actual usage and adjust accordingly.

Fleet Configuration

Autoscaling Strategy

1. Choose buffer size

Buffer = ready GameServers available for immediate allocation
apiVersion: autoscaling.agones.dev/v1
kind: FleetAutoscaler
metadata:
  name: production-autoscaler
spec:
  fleetName: production-fleet
  policy:
    type: Buffer
    buffer:
      bufferSize: 10      # Absolute count
      minReplicas: 10     # Minimum capacity
      maxReplicas: 100    # Maximum capacity
Sizing guidelines:
  • Small games (< 100 CCU): buffer = 5-10
  • Medium games (100-1000 CCU): buffer = 10-20
  • Large games (> 1000 CCU): buffer = 20-50 or use percentage
buffer:
  bufferSize: "20%"  # 20% of allocated should be ready
  minReplicas: 10
  maxReplicas: 200
2. Set appropriate min/max

  • minReplicas: Cover baseline load (e.g., internal testing, monitoring)
  • maxReplicas: Set to node capacity × GameServers per node
# Example: 20 nodes, 7 GameServers per node
maxReplicas: 140  # 20 × 7
3. Configure update and scale-down behavior

When a Fleet scales down, Agones removes Ready GameServers first and leaves Allocated ones running until they finish. The rolling update strategy controls how many replicas are replaced at once during fleet updates:

apiVersion: agones.dev/v1
kind: Fleet
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  template:
    spec:
      sdkServer:
        logLevel: Info
        grpcPort: 9357
        httpPort: 9358

Health Checks

Configure robust health checking:
apiVersion: agones.dev/v1
kind: GameServer
spec:
  health:
    disabled: false
    periodSeconds: 5          # Check every 5 seconds
    failureThreshold: 3       # Mark unhealthy after 3 failures
    initialDelaySeconds: 10   # Wait 10s after start
Aggressive (fast failure detection):
  • periodSeconds: 3
  • failureThreshold: 2
  • initialDelaySeconds: 5
  • Use for: Session-based games, quick matches
Conservative (stable operation):
  • periodSeconds: 10
  • failureThreshold: 5
  • initialDelaySeconds: 30
  • Use for: Persistent worlds, long sessions
Balanced (recommended):
  • periodSeconds: 5
  • failureThreshold: 3
  • initialDelaySeconds: 10-15
  • Use for: Most game types
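As a concrete example, the aggressive profile drops straight into the same `health` block shown above:

```yaml
health:
  disabled: false
  periodSeconds: 3         # aggressive: check every 3 seconds
  failureThreshold: 2      # mark unhealthy after 2 missed checks
  initialDelaySeconds: 5   # short boot window for fast-starting servers
```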

Networking

Port Allocation

Agones assigns each GameServer a host port from a configurable range (7000-8000 by default, set via the Helm values agones.gameservers.minPort and agones.gameservers.maxPort), so every port in that range must be reachable by players.

Firewall Rules

Ensure the full game port range is accessible:
# Allow UDP traffic to game port range
gcloud compute firewall-rules create game-server-firewall \
  --allow udp:7000-8000 \
  --target-tags game-server \
  --source-ranges 0.0.0.0/0

# Tag nodes
gcloud compute instances add-tags <node-name> \
  --tags game-server \
  --zone <zone>

High Availability

Controller HA

# Deploy multiple controller replicas
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agones-controller
  namespace: agones-system
spec:
  replicas: 3  # Run 3 replicas
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: agones
                  component: controller
              topologyKey: kubernetes.io/hostname
Set via Helm:
helm install agones agones/agones \
  --set agones.controller.replicas=3 \
  --namespace agones-system

Multi-Zone Deployment

apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: multi-zone-fleet
spec:
  template:
    spec:
      template:
        spec:
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      agones.dev/fleet: multi-zone-fleet
                  topologyKey: topology.kubernetes.io/zone
This spreads GameServers across availability zones for resilience.
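If a soft preference is not enough, Kubernetes topologySpreadConstraints can cap zone imbalance instead of merely discouraging it. A sketch, allowing at most one extra GameServer per zone:

```yaml
spec:
  template:
    spec:
      template:
        spec:
          topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: ScheduleAnyway   # DoNotSchedule for a hard constraint
            labelSelector:
              matchLabels:
                agones.dev/fleet: multi-zone-fleet
```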

Monitoring and Observability

Essential Metrics

Monitor these key metrics:

GameServer Health

# Ready GameServer count
agones_gameservers_count{type="Ready"}

# Error rate
rate(agones_gameservers_total{type="Error"}[5m])

# Unhealthy rate
rate(agones_gameservers_total{type="Unhealthy"}[5m])

Allocation Performance

# Average latency
sum(rate(agones_gameserver_allocations_duration_seconds_sum[5m])) /
sum(rate(agones_gameserver_allocations_duration_seconds_count[5m]))

# Success rate
sum(rate(agones_gameserver_allocations_duration_seconds_count{status="Allocated"}[5m])) /
sum(rate(agones_gameserver_allocations_duration_seconds_count[5m]))

Fleet Capacity

# Allocation percentage
(agones_fleets_replicas_count{type="allocated"} /
 agones_fleets_replicas_count{type="total"}) * 100

# Available capacity
agones_fleets_replicas_count{type="ready"}

Node Utilization

# Node saturation
agones_nodes_count{empty="false"} /
sum(agones_nodes_count) * 100

# Max GameServers per node
histogram_quantile(1.0,
  sum(rate(agones_gameservers_node_count_bucket[1m])) by (le)
)

Alerting Rules

Critical alerts for production:
prometheus-rules.yaml
groups:
  - name: agones_critical
    interval: 30s
    rules:
      - alert: AgonesControllerDown
        expr: up{job="agones-controller"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Agones controller is down"

      - alert: LowReadyGameServers
        expr: |
          (agones_gameservers_count{type="Ready"} /
           agones_fleets_replicas_count{type="desired"}) < 0.2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Less than 20% GameServers ready"

      - alert: HighAllocationLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(agones_gameserver_allocations_duration_seconds_bucket[5m])) by (le)
          ) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "99th percentile allocation latency > 5s"

      - alert: AllocationFailures
        expr: |
          sum(rate(agones_gameserver_allocations_duration_seconds_count{status!="Allocated"}[5m])) /
          sum(rate(agones_gameserver_allocations_duration_seconds_count[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Allocation failure rate > 10%"

      - alert: FleetAtMaxCapacity
        expr: |
          agones_fleet_autoscalers_limited == 1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Fleet {{ $labels.name }} at max capacity for 30m"

Security

RBAC Configuration

Follow the principle of least privilege:
# Service account for game backend
apiVersion: v1
kind: ServiceAccount
metadata:
  name: game-backend
  namespace: default
---
# Role for allocation only
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: game-backend-allocator
  namespace: default
rules:
- apiGroups: ["allocation.agones.dev"]
  resources: ["gameserverallocations"]
  verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: game-backend-allocator-binding
  namespace: default
subjects:
- kind: ServiceAccount
  name: game-backend
  namespace: default
roleRef:
  kind: Role
  name: game-backend-allocator
  apiGroup: rbac.authorization.k8s.io

Network Policies

Restrict network access:
# Allow game traffic only
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: gameserver-network-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      agones.dev/role: gameserver
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector: {}
    ports:
    - protocol: UDP
      port: 7654
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: agones-system
    ports:
    - protocol: TCP
      port: 443  # API server
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: UDP
      port: 53   # DNS

Operational Procedures

Deployment Strategy

1. Canary Deployment

Test new game server versions on a small subset of capacity:
# Production fleet (90% traffic)
apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: game-production
spec:
  replicas: 90
  template:
    spec:
      template:
        spec:
          containers:
          - name: game-server
            image: my-game:v1.0.0
---
# Canary fleet (10% traffic)
apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: game-canary
spec:
  replicas: 10
  template:
    spec:
      template:
        spec:
          containers:
          - name: game-server
            image: my-game:v1.1.0  # New version
Monitor canary metrics before full rollout.
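Error rates can then be compared per fleet before promoting. A sketch, assuming the fleet_name label that Agones attaches to its GameServer metrics:

```promql
# Canary error rate
rate(agones_gameservers_total{type="Error",fleet_name="game-canary"}[5m])

# Production error rate, for comparison
rate(agones_gameservers_total{type="Error",fleet_name="game-production"}[5m])
```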
2. Blue-Green Deployment

# Create green fleet with new version
kubectl apply -f fleet-green.yaml

# Wait for all GameServers ready
kubectl wait --for=jsonpath='{.status.readyReplicas}'=50 \
  fleet/game-green --timeout=600s

# Switch allocation to green
kubectl apply -f - <<EOF
apiVersion: allocation.agones.dev/v1
kind: GameServerAllocation
spec:
  required:
    matchLabels:
      version: green
EOF

# Scale down blue fleet
kubectl scale fleet game-blue --replicas=0
3. Gradual Rollout

Use Fleet rolling update:
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 10%  # Slow, safe rollout
    maxSurge: 10%

Backup and Disaster Recovery

#!/bin/bash
# Backup script

BACKUP_DIR="backups/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"

# Backup Agones configuration
helm get values agones -n agones-system > "$BACKUP_DIR/helm-values.yaml"

# Backup all Agones resources
kubectl get gameservers,fleets,fleetautoscalers,gameserversets \
  --all-namespaces -o yaml > "$BACKUP_DIR/agones-resources.yaml"

# Backup allocation policies
kubectl get gameserverallocationpolicy --all-namespaces -o yaml \
  > "$BACKUP_DIR/allocation-policies.yaml"

# Backup Agones CRDs
kubectl get crd -o name | grep 'agones.dev' \
  | xargs kubectl get -o yaml > "$BACKUP_DIR/crds.yaml"

tar czf "$BACKUP_DIR.tar.gz" "$BACKUP_DIR"
echo "Backup saved to $BACKUP_DIR.tar.gz"

Cost Optimization

Right-size Resources

  • Profile actual resource usage
  • Set requests to 95th percentile usage
  • Set limits to 2x requests
  • Use Vertical Pod Autoscaler for recommendations

Use Spot/Preemptible Nodes

  • 60-80% cost savings
  • Requires graceful shutdown handling
  • Use for non-critical game modes
  • Mix with on-demand nodes for stability
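The graceful-shutdown piece boils down to trapping SIGTERM, draining, and only then exiting. In a real server the trap would stop matchmaking, wait for the current session to end, and call the SDK's Shutdown() before the node's grace period expires; the sketch below demonstrates only the trap mechanics:

```shell
#!/usr/bin/env bash
drained=false
on_term() {
  # Real server: stop accepting players, let the match finish,
  # then call the Agones SDK Shutdown() before the grace period expires.
  drained=true
}
trap on_term TERM

kill -TERM $$               # simulate the preemption notice
echo "drained=$drained"     # → drained=true
```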

Scale to Zero Off-Peak

# Scale down during maintenance
kubectl scale fleet my-fleet --replicas=0

# Or use scheduled autoscaling
# (requires custom controller)

Optimize Node Size

  • Larger nodes = fewer nodes = less overhead
  • But reduces scheduling flexibility
  • Balance based on GameServer size
  • Test different node types

Testing

Load Testing

# Simulate allocation load
for i in {1..100}; do
  kubectl apply -f - <<EOF
apiVersion: allocation.agones.dev/v1
kind: GameServerAllocation
metadata:
  generateName: load-test-
spec:
  required:
    matchLabels:
      agones.dev/fleet: test-fleet
EOF
  sleep 0.1
done

# Monitor allocation latency
kubectl port-forward -n agones-system svc/agones-controller 8080:8080
watch 'curl -s http://localhost:8080/metrics | grep allocation_duration'

Chaos Testing

# Install chaos-mesh
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm install chaos-mesh chaos-mesh/chaos-mesh -n chaos-mesh --create-namespace

# Test GameServer pod failure
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: gameserver-pod-failure
spec:
  action: pod-failure
  mode: one
  duration: "30s"
  selector:
    namespaces:
      - default
    labelSelectors:
      agones.dev/role: gameserver
EOF

# Verify GameServers recover
kubectl get gs -w

Checklist

Before going to production:
  • Resources requests/limits configured
  • Health checks tuned
  • Autoscaling configured and tested
  • Monitoring and alerting set up
  • Firewall rules configured
  • Multi-zone deployment enabled
  • Backup procedures documented
  • Rollback procedures tested
  • Load testing completed
  • Runbooks created for common issues
  • On-call rotation established
  • Disaster recovery plan documented

Next Steps

Monitoring

Set up comprehensive monitoring

Troubleshooting

Learn to debug common issues

Upgrades

Plan for safe upgrades

Multi-Cluster

Deploy across multiple regions
