Monitoring your Agones infrastructure is critical for maintaining healthy game server deployments. Agones provides built-in support for Prometheus metrics and pre-configured Grafana dashboards.

Overview

Agones exports detailed metrics about:
  • GameServer lifecycle and state transitions
  • Fleet replica counts and autoscaling behavior
  • Allocation latency and success rates
  • Node utilization and GameServer distribution
  • Controller and SDK performance

Enabling Metrics

1. Enable Prometheus Metrics

Prometheus metrics are enabled by default in Agones. To set the options explicitly, pass them during Helm installation:
helm install agones agones/agones \
  --set agones.metrics.prometheusEnabled=true \
  --set agones.metrics.prometheusServiceDiscovery=true \
  --namespace agones-system \
  --create-namespace

2. Configure Prometheus

Add Agones as a scrape target in your Prometheus configuration:
prometheus.yml
scrape_configs:
  - job_name: 'agones'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - agones-system
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_agones_dev_role]
        action: keep
        regex: controller
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        action: keep
        regex: http
The Agones controller exposes metrics on port 8080 at the /metrics endpoint.

3. Deploy Grafana Dashboards

Agones includes pre-built Grafana dashboards in the source repository:
kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/main/build/grafana.yaml
kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/main/build/grafana-frontend.yaml
Import dashboard ConfigMaps:
kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/main/build/grafana/dashboard-gameservers.yaml
kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/main/build/grafana/dashboard-allocations.yaml
kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/main/build/grafana/dashboard-autoscalers.yaml
kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/main/build/grafana/dashboard-controller-usage.yaml

Available Dashboards

Agones provides several specialized Grafana dashboards:

GameServers Dashboard

  • GameServer count by state
  • Fleet replica counts (total, allocated, ready, desired)
  • GameServers per node distribution
  • State transition rates
  • Node availability

Allocations Dashboard

  • Average allocation latency
  • Allocation error rates
  • Latency percentiles (50th, 90th, 99th)
  • Allocation rate by status
  • Per-fleet allocation metrics

Autoscalers Dashboard

  • Fleet allocation percentage
  • Buffer size and limits
  • Current vs desired replicas
  • Scaling capability status
  • Autoscaler policy metrics

Controller Usage

  • Controller CPU and memory usage
  • API server request rates
  • Work queue depth and latency
  • Cache sync performance

Key Metrics to Monitor

GameServer State Metrics

# Current count of GameServers by state
agones_gameservers_count{type="Ready", fleet_name="my-fleet", namespace="default"}

# Rate of GameServers transitioning to each state
rate(agones_gameservers_total{type="Error"}[5m])

# Average time GameServers spend in each state
sum(rate(agones_gameserver_state_duration_sum[5m])) by (type) /
sum(rate(agones_gameserver_state_duration_count[5m])) by (type)
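
When Prometheus is not yet wired up, the same per-state counts can be approximated straight from a raw scrape of the controller's /metrics endpoint. The following is a minimal sketch, assuming the scrape is in the standard Prometheus text format and that samples carry no trailing timestamp; the function name `gameservers_by_state` is hypothetical.

```shell
# Sum agones_gameservers_count samples by their "type" label.
# Reads Prometheus text exposition format on stdin.
# Assumption: the sample value is the last field on each line.
gameservers_by_state() {
  awk '
    /^agones_gameservers_count[{]/ {
      # Pull out the value of the type label, wherever it appears.
      if (match($0, /type="[^"]*"/)) {
        state = substr($0, RSTART + 6, RLENGTH - 7)
        counts[state] += $NF
      }
    }
    END { for (s in counts) printf "%s %d\n", s, counts[s] }
  ' | sort
}

# Usage, with the metrics port forwarded to localhost:
#   curl -s http://localhost:8080/metrics | gameservers_by_state
```

This totals across all fleets and namespaces, so it is closer to `sum by (type) (agones_gameservers_count)` than to the single-fleet selector shown above.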

Fleet Metrics

# Fleet replica counts by type
agones_fleets_replicas_count{name="my-fleet", type="allocated"}
agones_fleets_replicas_count{name="my-fleet", type="ready"}
agones_fleets_replicas_count{name="my-fleet", type="desired"}

# Fleet rollout progress
sum(agones_fleet_rollout_percent{name="my-fleet", type="current_replicas"}) /
sum(agones_fleet_rollout_percent{name="my-fleet", type="desired_replicas"}) * 100

Allocation Metrics

# Average allocation latency
sum(rate(agones_gameserver_allocations_duration_seconds_sum[5m])) /
sum(rate(agones_gameserver_allocations_duration_seconds_count[5m]))

# Allocation error rate
sum(rate(agones_gameserver_allocations_duration_seconds_count{status!="Allocated"}[5m]))

# 99th percentile allocation latency
histogram_quantile(0.99, sum(rate(agones_gameserver_allocations_duration_seconds_bucket[5m])) by (le))
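
The average-latency query divides the histogram's `_sum` series by its `_count` series. The sketch below applies the same division to a raw scrape, which can help confirm that the histogram is being populated at all. It assumes standard Prometheus text format on stdin; the function name `allocation_latency_avg` is hypothetical. Unlike the `rate(...[5m])` query, it yields the average over the exporter's whole lifetime, not the last five minutes.

```shell
# Lifetime average allocation latency (seconds) from a raw scrape.
# Adds up every _sum and _count sample of the allocation histogram,
# then divides, mirroring the structure of the PromQL query.
allocation_latency_avg() {
  awk '
    /^agones_gameserver_allocations_duration_seconds_sum[{ ]/   { sum += $NF }
    /^agones_gameserver_allocations_duration_seconds_count[{ ]/ { n += $NF }
    END {
      if (n > 0) printf "%.3f\n", sum / n
      else print "no allocations recorded"
    }
  '
}

# Usage, with the metrics port forwarded to localhost:
#   curl -s http://localhost:8080/metrics | allocation_latency_avg
```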

Node Metrics

# Count of nodes with/without GameServers
agones_nodes_count{empty="false"}  # Nodes with GameServers
agones_nodes_count{empty="true"}   # Empty nodes

# GameServers per node distribution
histogram_quantile(0.99, sum(rate(agones_gameservers_node_count_bucket[1m])) by (le))

Autoscaler Metrics

# Current vs desired replicas
agones_fleet_autoscalers_current_replicas_count{name="my-autoscaler"}
agones_fleet_autoscalers_desired_replicas_count{name="my-autoscaler"}

# Autoscaler status
agones_fleet_autoscalers_able_to_scale{name="my-autoscaler"}
agones_fleet_autoscalers_limited{name="my-autoscaler"}

# Buffer configuration
agones_fleet_autoscalers_buffer_size{name="my-autoscaler", type="count"}
agones_fleet_autoscalers_buffer_limits{name="my-autoscaler", type="max"}

Setting Up Alerts

Recommended Prometheus alerting rules:
groups:
  - name: agones_gameservers
    interval: 30s
    rules:
      - alert: HighGameServerErrorRate
        expr: |
          rate(agones_gameservers_total{type="Error"}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High GameServer error rate"
          description: "{{ $labels.namespace }}/{{ $labels.fleet_name }} has error rate of {{ $value | humanize }} per second"

      - alert: LowReadyGameServers
        expr: |
          (agones_fleets_replicas_count{type="ready"} / ignoring(type)
           agones_fleets_replicas_count{type="desired"}) < 0.3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low ready GameServer capacity"
          description: "Fleet {{ $labels.name }} has only {{ $value | humanizePercentage }} ready GameServers"

      - alert: HighAllocationLatency
        expr: |
          histogram_quantile(0.99, 
            sum(rate(agones_gameserver_allocations_duration_seconds_bucket[5m])) by (le)
          ) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High allocation latency"
          description: "99th percentile allocation latency is {{ $value }}s"

      - alert: AllocationFailures
        expr: |
          rate(agones_gameserver_allocations_duration_seconds_count{status!="Allocated"}[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GameServer allocation failures detected"
          description: "Allocation error rate is {{ $value | humanize }} per second"

      - alert: FleetAutoscalerLimited
        expr: |
          agones_fleet_autoscalers_limited == 1
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "FleetAutoscaler is limited"
          description: "Autoscaler {{ $labels.name }} has been limited for 15 minutes"
Adjust alert thresholds based on your workload patterns and SLOs. Start with conservative values and tune based on observed behavior.
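
Before settling on a threshold for the ready-capacity alert, it can help to see the current ready-to-desired ratio per fleet. The sketch below computes it from a raw scrape of `agones_fleets_replicas_count`. It assumes each sample carries `name`, `namespace`, and `type` labels and that the scrape is in standard Prometheus text format; `fleet_ready_ratio` is a hypothetical helper name.

```shell
# Print "<namespace>/<fleet> <ready/desired>" for every fleet in a scrape.
# Fleets with zero desired replicas are skipped to avoid dividing by zero.
fleet_ready_ratio() {
  awk '
    # Return the value of label `key` in a sample line, or "" if absent.
    function label(line, key,   re) {
      re = "[{,]" key "=\"[^\"]*\""
      if (match(line, re))
        return substr(line, RSTART + length(key) + 3, RLENGTH - length(key) - 4)
      return ""
    }
    /^agones_fleets_replicas_count[{]/ {
      fleet = label($0, "namespace") "/" label($0, "name")
      t = label($0, "type")
      if (t == "ready")   ready[fleet]   = $NF + 0
      if (t == "desired") desired[fleet] = $NF + 0
    }
    END {
      for (f in desired)
        if (desired[f] > 0) printf "%s %.2f\n", f, ready[f] / desired[f]
    }
  ' | sort
}

# Usage, with the metrics port forwarded to localhost:
#   curl -s http://localhost:8080/metrics | fleet_ready_ratio
```

A fleet that routinely sits near the chosen threshold under normal load is a sign the threshold, or the fleet's buffer, needs adjusting.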

Monitoring with Stackdriver

For Google Cloud Platform deployments, enable Stackdriver (now Google Cloud Monitoring) export:
helm install agones agones/agones \
  --set agones.metrics.stackdriverEnabled=true \
  --set agones.metrics.stackdriverProjectID=YOUR_PROJECT_ID \
  --namespace agones-system
Stackdriver metrics are exported with the agones prefix and include all Prometheus metrics with custom resource labels.
Stackdriver has a minimum reporting period of 60 seconds, whereas Prometheus is typically scraped every 15 seconds, so Stackdriver data is less suited to near-real-time monitoring.

Health Checks

Monitor Agones component health:
# Controller health (same agones.dev/role label the Prometheus relabel rule matches on)
kubectl get pods -n agones-system -l agones.dev/role=controller

# Check controller logs for errors
kubectl logs -n agones-system -l agones.dev/role=controller --tail=100

# Metrics endpoint health (port-forward blocks, so run curl in a second terminal)
kubectl port-forward -n agones-system deploy/agones-controller 8080:8080
curl http://localhost:8080/metrics
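
The endpoint check can be turned into a pass/fail probe by looking for a few expected metric families in the scrape. This is a minimal sketch: the function name `check_agones_metrics` and the list of required metrics are assumptions, and some families (for example fleet metrics) may legitimately be absent on a cluster with no Fleets.

```shell
# Read a /metrics scrape on stdin and report any expected metric family
# that has no samples. Returns non-zero if something is missing.
# Assumption: this list is illustrative; adjust it to what your cluster runs.
check_agones_metrics() {
  scrape=$(cat)
  missing=0
  for metric in agones_gameservers_count agones_fleets_replicas_count agones_nodes_count; do
    if ! printf '%s\n' "$scrape" | grep -q "^${metric}[{ ]"; then
      echo "missing: ${metric}"
      missing=1
    fi
  done
  return "$missing"
}

# Usage, with the metrics port forwarded to localhost:
#   curl -s http://localhost:8080/metrics | check_agones_metrics && echo "metrics OK"
```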

Best Practices

Scrape Interval

  • Use 15-30 second scrape intervals for Agones metrics
  • Avoid scraping more frequently than every 10 seconds
  • Match scrape interval to your alerting requirements

Label Cardinality

  • Agones metrics include labels for fleet, namespace, and state
  • High fleet/namespace counts increase cardinality
  • Consider using metric relabeling to reduce dimensionality

Recording Rules

Create recording rules for frequently used queries to improve dashboard performance:
groups:
  - name: agones_aggregations
    interval: 30s
    rules:
      - record: fleet:allocations:ratio
        expr: |
          sum by (name, namespace) (agones_fleets_replicas_count{type="allocated"}) /
          sum by (name, namespace) (agones_fleets_replicas_count{type="total"})

Retention

  • Keep at least 30 days of metrics for trend analysis
  • Use remote write to long-term storage (e.g., Thanos, Cortex)
  • Archive dashboard snapshots before major changes
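
To judge whether label cardinality is becoming a problem, one rough approach is to count how many series each Agones metric contributes in a single scrape. The sketch below does that with awk; the function name `series_per_metric` is hypothetical, and it assumes standard Prometheus text format on stdin. Histogram `_bucket`, `_sum`, and `_count` series are counted under their own names.

```shell
# Count series per agones_* metric name in a scrape, largest first.
series_per_metric() {
  awk '
    /^agones_/ {
      name = $1
      sub(/[{].*/, "", name)   # strip the label set, keep the bare name
      series[name]++
    }
    END { for (m in series) printf "%d %s\n", series[m], m }
  ' | sort -rn
}

# Usage, with the metrics port forwarded to localhost:
#   curl -s http://localhost:8080/metrics | series_per_metric | head
```

Metrics at the top of this list are the first candidates for relabeling or for dropping labels that dashboards do not use.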

Troubleshooting

If metrics are not appearing:
  1. Verify controller is running:
    kubectl get pods -n agones-system
    
  2. Check metrics endpoint:
    kubectl port-forward -n agones-system deploy/agones-controller 8080:8080
    # In a second terminal, since port-forward blocks:
    curl -s http://localhost:8080/metrics | grep agones_gameservers_count
    
  3. Verify Prometheus scraping:
    • Check Prometheus targets page
    • Look for agones-system service discovery
    • Verify no authentication errors in Prometheus logs
  4. Check RBAC permissions:
    kubectl get clusterrolebinding | grep prometheus
    kubectl get clusterrole prometheus -o yaml
    

Next Steps

Metrics Reference

Complete list of available metrics

Troubleshooting

Debug common monitoring issues
