Monitoring Agones

Monitoring your Agones infrastructure is critical for maintaining healthy game server deployments. Agones provides built-in support for Prometheus metrics and pre-configured Grafana dashboards.

Overview

Agones exports detailed metrics about:

GameServer lifecycle and state transitions
Fleet replica counts and autoscaling behavior
Allocation latency and success rates
Node utilization and GameServer distribution
Controller and SDK performance

Enabling Metrics

Enable Prometheus Metrics

Prometheus metrics are enabled by default in Agones: both settings below default to true. To set them explicitly during Helm installation:

helm install agones agones/agones \
  --set agones.metrics.prometheusEnabled=true \
  --set agones.metrics.prometheusServiceDiscovery=true \
  --namespace agones-system \
  --create-namespace

Configure Prometheus

Add Agones as a scrape target in your Prometheus configuration:

prometheus.yml

scrape_configs:
  - job_name: 'agones'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - agones-system
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_agones_dev_role]
        action: keep
        regex: controller
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        action: keep
        regex: http

The Agones controller exposes metrics on port 8080 at the /metrics endpoint.
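The two `keep` rules in the scrape configuration above act as successive filters on discovered targets. The following is a minimal sketch of that filtering, assuming Prometheus's behavior of matching the regex against the whole label value. The `keep` helper and the pod entries are hypothetical and only illustrate the idea; they are not part of Agones or Prometheus.

```python
import re

def keep(targets, source_label, regex):
    """Mimic a relabel rule with `action: keep`.

    A target survives only if the value of `source_label` matches the
    regex in full. A missing label is treated as an empty string.
    """
    pattern = re.compile(regex)
    return [t for t in targets if pattern.fullmatch(t.get(source_label, ""))]

ROLE = "__meta_kubernetes_pod_label_agones_dev_role"
PORT = "__meta_kubernetes_pod_container_port_name"

# Hypothetical discovery results: one entry per pod container port.
targets = [
    {"target": "controller:http", ROLE: "controller", PORT: "http"},
    {"target": "controller:webhooks", ROLE: "controller", PORT: "webhooks"},
    {"target": "allocator:http", ROLE: "allocator", PORT: "http"},
    {"target": "unlabelled:http", PORT: "http"},
]

kept = keep(targets, ROLE, "controller")  # first rule: controller pods only
kept = keep(kept, PORT, "http")           # second rule: the port named http
kept_names = [t["target"] for t in kept]
print(kept_names)
```

Only the controller's `http` port survives both rules, which is why a pod without the `agones.dev/role` label never shows up as a target.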

Deploy Grafana Dashboards

Agones includes pre-built Grafana dashboards in the source repository:

kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/main/build/grafana.yaml
kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/main/build/grafana-frontend.yaml

Import dashboard ConfigMaps:

kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/main/build/grafana/dashboard-gameservers.yaml
kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/main/build/grafana/dashboard-allocations.yaml
kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/main/build/grafana/dashboard-autoscalers.yaml
kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/main/build/grafana/dashboard-controller-usage.yaml

Available Dashboards

Agones provides several specialized Grafana dashboards:

GameServers Dashboard

GameServer count by state
Fleet replica counts (total, allocated, ready, desired)
GameServers per node distribution
State transition rates
Node availability

Allocations Dashboard

Average allocation latency
Allocation error rates
Latency percentiles (50th, 90th, 99th)
Allocation rate by status
Per-fleet allocation metrics

Autoscalers Dashboard

Fleet allocation percentage
Buffer size and limits
Current vs desired replicas
Scaling capability status
Autoscaler policy metrics

Controller Usage

Controller CPU and memory usage
API server request rates
Work queue depth and latency
Cache sync performance

Key Metrics to Monitor

GameServer State Metrics

# Current count of GameServers by state
agones_gameservers_count{type="Ready", fleet_name="my-fleet", namespace="default"}

# Rate of GameServers transitioning to each state
rate(agones_gameservers_total{type="Error"}[5m])

# Average time GameServers spend in each state
sum(rate(agones_gameserver_state_duration_sum[5m])) by (type) /
sum(rate(agones_gameserver_state_duration_count[5m])) by (type)
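The average-duration query divides the growth of a histogram's `_sum` counter by the growth of its `_count` counter over the same window. A small worked sketch of that arithmetic follows, ignoring the extrapolation and counter-reset handling that `rate()` performs. The function name and the sample numbers are illustrative assumptions.

```python
def average_duration(sum_start, sum_end, count_start, count_end):
    """Average observed value over a window, from _sum and _count readings.

    Both counters cover the same window, so the window length cancels
    out of the division, just as it does between the two rate() calls.
    """
    observations = count_end - count_start
    if observations <= 0:
        return None  # nothing observed in the window, or a counter reset
    return (sum_end - sum_start) / observations

# Illustrative readings taken five minutes apart.
avg = average_duration(sum_start=120.0, sum_end=180.0, count_start=40, count_end=60)
print(avg)  # 3.0
```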

Fleet Metrics

# Fleet replica counts by type
agones_fleets_replicas_count{name="my-fleet", type="allocated"}
agones_fleets_replicas_count{name="my-fleet", type="ready"}
agones_fleets_replicas_count{name="my-fleet", type="desired"}

# Fleet rollout progress
sum(agones_fleets_rollout_percent{name="my-fleet", type="current_replicas"}) /
sum(agones_fleets_rollout_percent{name="my-fleet", type="desired_replicas"}) * 100

Allocation Metrics

# Average allocation latency
sum(rate(agones_gameserver_allocations_duration_seconds_sum[5m])) /
sum(rate(agones_gameserver_allocations_duration_seconds_count[5m]))

# Rate of allocation requests (per second) that did not end in the Allocated status
sum(rate(agones_gameserver_allocations_duration_seconds_count{status!="Allocated"}[5m]))

# 99th percentile allocation latency
histogram_quantile(0.99, sum(rate(agones_gameserver_allocations_duration_seconds_bucket[5m])) by (le))
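Percentiles from `histogram_quantile` are estimates: the function finds the bucket that contains the requested rank and interpolates linearly inside it. The sketch below reproduces that idea for classic cumulative buckets, as an approximation of the behavior rather than the Prometheus implementation. The bucket boundaries and counts are made-up values.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The list must include the (math.inf, total) bucket. The lower bound
    of the first bucket is assumed to be 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            in_bucket = count - prev_count
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan

# Made-up cumulative counts for allocation latency buckets, in seconds.
latency_buckets = [(0.5, 80), (1.0, 95), (2.5, 99), (5.0, 100), (math.inf, 100)]
p50 = histogram_quantile(0.5, latency_buckets)
p99 = histogram_quantile(0.99, latency_buckets)
print(p50, p99)
```

Because the result depends on bucket boundaries, a reported 99th percentile can never be more precise than the width of the bucket it falls in.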

Node Metrics

# Count of nodes with/without GameServers
agones_nodes_count{empty="false"}  # Nodes with GameServers
agones_nodes_count{empty="true"}   # Empty nodes

# 99th percentile of GameServers per node
histogram_quantile(0.99, sum(rate(agones_gameservers_node_count_bucket[1m])) by (le))

Autoscaler Metrics

# Current vs desired replicas
agones_fleet_autoscalers_current_replicas_count{name="my-autoscaler"}
agones_fleet_autoscalers_desired_replicas_count{name="my-autoscaler"}

# Autoscaler status
agones_fleet_autoscalers_able_to_scale{name="my-autoscaler"}
agones_fleet_autoscalers_limited{name="my-autoscaler"}

# Buffer configuration
agones_fleet_autoscalers_buffer_size{name="my-autoscaler", type="count"}
agones_fleet_autoscalers_buffer_limits{name="my-autoscaler", type="max"}

Setting Up Alerts

Recommended Prometheus alerting rules:

groups:
  - name: agones_gameservers
    interval: 30s
    rules:
      - alert: HighGameServerErrorRate
        expr: |
          rate(agones_gameservers_total{type="Error"}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High GameServer error rate"
          description: "{{ $labels.namespace }}/{{ $labels.fleet_name }} has error rate of {{ $value | humanize }} per second"

      - alert: LowReadyGameServers
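        expr: |
          # Both sides use agones_fleets_replicas_count so their label sets match;
          # ignoring(type) pairs the "ready" and "desired" series of the same fleet.
          (agones_fleets_replicas_count{type="ready"}
            / ignoring(type) agones_fleets_replicas_count{type="desired"}) < 0.3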
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low ready GameServer capacity"
          description: "Fleet {{ $labels.name }} has only {{ $value | humanizePercentage }} ready GameServers"

      - alert: HighAllocationLatency
        expr: |
          histogram_quantile(0.99, 
            sum(rate(agones_gameserver_allocations_duration_seconds_bucket[5m])) by (le)
          ) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High allocation latency"
          description: "99th percentile allocation latency is {{ $value }}s"

      - alert: AllocationFailures
        expr: |
          rate(agones_gameserver_allocations_duration_seconds_count{status!="Allocated"}[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GameServer allocation failures detected"
          description: "Allocation error rate is {{ $value | humanize }} per second"

      - alert: FleetAutoscalerLimited
        expr: |
          agones_fleet_autoscalers_limited == 1
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "FleetAutoscaler is limited"
          description: "Autoscaler {{ $labels.name }} has been limited for 15 minutes"

Adjust alert thresholds based on your workload patterns and SLOs. Start with conservative values and tune based on observed behavior.
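When tuning, the `for` duration matters as much as the threshold: an alert stays pending until its expression has been true at every evaluation for that long. The following sketch simulates this for a single series, assuming a fixed evaluation interval. The function, the sample values, and the shortened 120-second hold are illustrative assumptions, and features such as `keep_firing_for` are not modeled.

```python
def alert_states(values, threshold, for_seconds, interval_seconds):
    """Return "inactive", "pending", or "firing" for each evaluation."""
    states = []
    active_since = None  # time at which the condition first became true
    for i, value in enumerate(values):
        now = i * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states

# Error rate sampled every 30s against a 0.1/s threshold and a 120s hold.
rates = [0.05, 0.2, 0.2, 0.2, 0.2, 0.2, 0.05, 0.2]
states = alert_states(rates, threshold=0.1, for_seconds=120, interval_seconds=30)
print(states)
```

A single evaluation below the threshold resets the timer, so a longer `for` trades detection speed for fewer alerts on brief spikes.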

Monitoring with Stackdriver

On Google Cloud, Agones can also export metrics to Cloud Monitoring (formerly Stackdriver). Enable it during Helm installation:

helm install agones agones/agones \
  --set agones.metrics.stackdriverEnabled=true \
  --set agones.metrics.stackdriverProjectID=YOUR_PROJECT_ID \
  --namespace agones-system

Stackdriver metrics are exported with the agones prefix and include all Prometheus metrics with custom resource labels.

Stackdriver has a minimum reporting period of 60 seconds, whereas Prometheus is commonly scraped every 15 seconds. The coarser resolution can hide short spikes and delay alerts, so take it into account for near-real-time monitoring.

Health Checks

Monitor Agones component health:

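# Controller health (selects the agones.dev/role=controller pod label used in the scrape configuration)
kubectl get pods -n agones-system -l agones.dev/role=controller

# Check controller logs for errors
kubectl logs -n agones-system -l agones.dev/role=controller --tail=100

# Metrics endpoint health: forward the controller port in one terminal
kubectl port-forward -n agones-system deployment/agones-controller 8080:8080

# Then query the endpoint from a second terminal
curl http://localhost:8080/metrics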
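The `/metrics` response is plain text in the Prometheus exposition format, so it can be checked with a few lines of code as well as by eye. The sketch below parses a small sample and totals `agones_gameservers_count` by state. The sample text (including the HELP wording and the values) is invented for illustration, and the parser handles only simple lines without timestamps.

```python
import re
from collections import defaultdict

SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

def parse_metrics(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match:
            labels = dict(LABEL.findall(match.group("labels") or ""))
            samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples

# Invented sample of what the endpoint might return.
SAMPLE = """\
# HELP agones_gameservers_count Illustrative help text
# TYPE agones_gameservers_count gauge
agones_gameservers_count{fleet_name="my-fleet",namespace="default",type="Ready"} 7
agones_gameservers_count{fleet_name="my-fleet",namespace="default",type="Allocated"} 3
agones_gameservers_count{fleet_name="other",namespace="default",type="Ready"} 2
"""

by_type = defaultdict(float)
for name, labels, value in parse_metrics(SAMPLE):
    if name == "agones_gameservers_count":
        by_type[labels["type"]] += value
print(dict(by_type))
```

To run it against a live controller, replace `SAMPLE` with the text returned by the `curl` command above.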

Best Practices

Set appropriate scrape intervals

Use 15-30 second scrape intervals for Agones metrics
Avoid scraping more frequently than every 10 seconds
Match scrape interval to your alerting requirements

Monitor metric cardinality

Agones metrics include labels for fleet, namespace, and state
High fleet/namespace counts increase cardinality
Consider using metric relabeling to reduce dimensionality
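To get a feel for how labels multiply, count the distinct label combinations a metric can produce. The sketch below does this for a hypothetical setup of 20 fleets, 2 namespaces, and 3 states, then shows the count with one label removed. It only counts label sets; it does not model how Prometheus treats series that become identical after a label is dropped.

```python
def series_count(label_sets, drop=()):
    """Number of distinct series once the labels in `drop` are removed."""
    return len({
        tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        for labels in label_sets
    })

# Hypothetical label combinations for a single metric.
label_sets = [
    {"fleet_name": f"fleet-{i}", "namespace": ns, "type": state}
    for i in range(20)
    for ns in ("default", "staging")
    for state in ("Ready", "Allocated", "Shutdown")
]

full = series_count(label_sets)
without_fleet = series_count(label_sets, drop=("fleet_name",))
print(full, without_fleet)  # 120 6
```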

Use recording rules for complex queries

Create recording rules for frequently-used queries to improve dashboard performance:

groups:
  - name: agones_aggregations
    interval: 30s
    rules:
      - record: fleet:allocations:ratio
        expr: |
          sum by (name, namespace) (agones_fleets_replicas_count{type="allocated"}) /
          sum by (name, namespace) (agones_fleets_replicas_count{type="total"})
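The recording rule groups replica counts by fleet name and namespace and divides allocated by total. The same computation is sketched below on in-memory samples to show what the recorded series contains. The sample data is invented, and fleets with a zero total are skipped here, whereas PromQL would return NaN for them.

```python
from collections import defaultdict

def allocation_ratio(samples):
    """Compute allocated / total per (name, namespace).

    `samples` holds (labels, value) pairs for agones_fleets_replicas_count.
    """
    allocated = defaultdict(float)
    total = defaultdict(float)
    for labels, value in samples:
        key = (labels["name"], labels["namespace"])
        if labels["type"] == "allocated":
            allocated[key] += value
        elif labels["type"] == "total":
            total[key] += value
    # Keep only fleets present on both sides with a non-zero total.
    return {k: allocated[k] / total[k] for k in allocated if k in total and total[k] > 0}

# Invented samples for two fleets.
samples = [
    ({"name": "my-fleet", "namespace": "default", "type": "allocated"}, 3),
    ({"name": "my-fleet", "namespace": "default", "type": "total"}, 10),
    ({"name": "idle", "namespace": "default", "type": "allocated"}, 0),
    ({"name": "idle", "namespace": "default", "type": "total"}, 0),
]
ratios = allocation_ratio(samples)
print(ratios)
```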

Retain historical data

Keep at least 30 days of metrics for trend analysis
Use remote write to long-term storage (e.g., Thanos, Cortex)
Archive dashboard snapshots before major changes

Troubleshooting

If metrics are not appearing:

Verify controller is running:
```
kubectl get pods -n agones-system
```

Check metrics endpoint:

kubectl port-forward -n agones-system deployment/agones-controller 8080:8080
# In a second terminal:
curl -s http://localhost:8080/metrics | grep agones_gameservers_count

Verify Prometheus scraping:

Open the Prometheus targets page and confirm the agones job is listed
Check that its targets are discovered in the agones-system namespace and are healthy
Look for authentication or permission errors in the Prometheus logs

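Check RBAC permissions. Pod discovery requires that Prometheus can get, list, and watch pods in agones-system. The prometheus role name below depends on how Prometheus was installed: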

kubectl get clusterrolebinding | grep prometheus
kubectl get clusterrole prometheus -o yaml

Next Steps

Metrics Reference

Troubleshooting
