Monitoring your Agones infrastructure is critical for maintaining healthy game server deployments. Agones provides built-in support for Prometheus metrics and pre-configured Grafana dashboards.
Overview
Agones exports detailed metrics about:
GameServer lifecycle and state transitions
Fleet replica counts and autoscaling behavior
Allocation latency and success rates
Node utilization and GameServer distribution
Controller and SDK performance
Enabling Metrics
Enable Prometheus Metrics
Metrics are enabled by default in Agones. To set the options explicitly during Helm installation:
helm install agones agones/agones \
  --set agones.metrics.prometheusEnabled=true \
  --set agones.metrics.prometheusServiceDiscovery=true \
  --namespace agones-system \
  --create-namespace
Configure Prometheus
Add Agones as a scrape target in your Prometheus configuration:
scrape_configs:
  - job_name: 'agones'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - agones-system
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_agones_dev_role]
        action: keep
        regex: controller
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        action: keep
        regex: http
The Agones controller exposes metrics on port 8080 at the /metrics endpoint.
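To make the /metrics output concrete, here is a minimal Python sketch that parses a few lines of Prometheus text exposition and totals agones_gameservers_count by state. The sample payload, its HELP text, and the helper name count_by_state are illustrative assumptions, not captured controller output. The parser is deliberately simple: it skips lines with timestamps and does not handle escaped quotes in label values.

```python
import re
from collections import defaultdict

# Hypothetical sample of a /metrics response; the values are made up.
SAMPLE = '''\
# HELP agones_gameservers_count Hypothetical help text
# TYPE agones_gameservers_count gauge
agones_gameservers_count{fleet_name="my-fleet",namespace="default",type="Ready"} 7
agones_gameservers_count{fleet_name="my-fleet",namespace="default",type="Allocated"} 3
agones_gameservers_count{fleet_name="other",namespace="default",type="Ready"} 2
'''

# metric_name{label="value",...} value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def count_by_state(text, metric="agones_gameservers_count"):
    """Sum the samples of `metric` grouped by their `type` label."""
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        totals[labels.get("type", "")] += float(match.group("value"))
    return dict(totals)


if __name__ == "__main__":
    print(count_by_state(SAMPLE))
```

In practice the text would come from the port-forwarded endpoint rather than a hard-coded string.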
Deploy Grafana Dashboards
Agones includes pre-built Grafana dashboards in the source repository:
kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/main/build/grafana.yaml
kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/main/build/grafana-frontend.yaml
Import dashboard ConfigMaps:
kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/main/build/grafana/dashboard-gameservers.yaml
kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/main/build/grafana/dashboard-allocations.yaml
kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/main/build/grafana/dashboard-autoscalers.yaml
kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/main/build/grafana/dashboard-controller-usage.yaml
Available Dashboards
Agones provides several specialized Grafana dashboards:
GameServers Dashboard
GameServer count by state
Fleet replica counts (total, allocated, ready, desired)
GameServers per node distribution
State transition rates
Node availability
Allocations Dashboard
Average allocation latency
Allocation error rates
Latency percentiles (50th, 90th, 99th)
Allocation rate by status
Per-fleet allocation metrics
Autoscalers Dashboard
Fleet allocation percentage
Buffer size and limits
Current vs desired replicas
Scaling capability status
Autoscaler policy metrics
Controller Usage
Controller CPU and memory usage
API server request rates
Work queue depth and latency
Cache sync performance
Key Metrics to Monitor
GameServer State Metrics
# Current count of GameServers by state
agones_gameservers_count{type="Ready", fleet_name="my-fleet", namespace="default"}
# Rate of GameServers transitioning to each state
rate(agones_gameservers_total{type="Error"}[5m])
# Average time GameServers spend in each state
sum(rate(agones_gameserver_state_duration_sum[5m])) by (type) /
sum(rate(agones_gameserver_state_duration_count[5m])) by (type)
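The sum-over-count query above is an average: the increase in total observed seconds divided by the increase in the number of observations over the same window. A minimal Python sketch of that arithmetic follows. It assumes two raw counter samples per series and ignores the counter resets and extrapolation that Prometheus's rate() handles; the function name and sample values are hypothetical.

```python
def average_duration(sum_first, sum_last, count_first, count_last):
    """Average observed duration between two scrapes of a _sum/_count pair.

    The window length cancels out when dividing two rates taken over the
    same window, so only the deltas matter.
    """
    delta_count = count_last - count_first
    if delta_count <= 0:
        return None  # no new observations in the window
    return (sum_last - sum_first) / delta_count


if __name__ == "__main__":
    # Hypothetical samples: 10 new observations totalling 300 seconds.
    print(average_duration(1200.0, 1500.0, 40, 50))
```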
Fleet Metrics
# Fleet replica counts by type
agones_fleets_replicas_count{name="my-fleet", type="allocated"}
agones_fleets_replicas_count{name="my-fleet", type="ready"}
agones_fleets_replicas_count{name="my-fleet", type="desired"}
# Fleet rollout progress
sum(agones_fleet_rollout_percent{name="my-fleet", type="current_replicas"}) /
sum(agones_fleet_rollout_percent{name="my-fleet", type="desired_replicas"}) * 100
Allocation Metrics
# Average allocation latency
sum(rate(agones_gameserver_allocations_duration_seconds_sum[5m])) /
sum(rate(agones_gameserver_allocations_duration_seconds_count[5m]))
# Allocation error rate
sum(rate(agones_gameserver_allocations_duration_seconds_count{status!="Allocated"}[5m]))
# 99th percentile allocation latency
histogram_quantile(0.99, sum(rate(agones_gameserver_allocations_duration_seconds_bucket[5m])) by (le))
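For readers unsure what histogram_quantile returns, the following Python sketch approximates its behavior for classic cumulative buckets: it finds the bucket containing the target rank and interpolates linearly inside it. This is a simplified illustration, not the Prometheus implementation, and the bucket bounds and counts are made-up values rather than the buckets Agones actually exports.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must contain at least one finite bound plus a math.inf bucket.
    """
    if not 0 < q < 1:
        raise ValueError("q must be strictly between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need finite buckets and a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank lands in the +Inf bucket: cap at the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for allocation latency, in seconds.
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]

if __name__ == "__main__":
    print(histogram_quantile(0.95, BUCKETS))
    print(histogram_quantile(0.99, BUCKETS))
```

With these sample buckets the 99th percentile falls in the +Inf bucket, so the estimate is capped at the highest finite bound (1.0). This is why bucket boundaries limit how precisely a latency alert threshold can be evaluated.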
Node Metrics
# Count of nodes with/without GameServers
agones_nodes_count{empty="false"} # Nodes with GameServers
agones_nodes_count{empty="true"} # Empty nodes
# GameServers per node distribution
histogram_quantile(0.99, sum(rate(agones_gameservers_node_count_bucket[1m])) by (le))
Autoscaler Metrics
# Current vs desired replicas
agones_fleet_autoscalers_current_replicas_count{name="my-autoscaler"}
agones_fleet_autoscalers_desired_replicas_count{name="my-autoscaler"}
# Autoscaler status
agones_fleet_autoscalers_able_to_scale{name="my-autoscaler"}
agones_fleet_autoscalers_limited{name="my-autoscaler"}
# Buffer configuration
agones_fleet_autoscalers_buffer_size{name="my-autoscaler", type="count"}
agones_fleet_autoscalers_buffer_limits{name="my-autoscaler", type="max"}
Setting Up Alerts
Recommended Prometheus alerting rules:
groups:
  - name: agones_gameservers
    interval: 30s
    rules:
      - alert: HighGameServerErrorRate
        expr: |
          rate(agones_gameservers_total{type="Error"}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High GameServer error rate"
          description: "{{ $labels.namespace }}/{{ $labels.fleet_name }} has error rate of {{ $value | humanize }} per second"
      - alert: LowReadyGameServers
        # Both sides use agones_fleets_replicas_count so their labels match;
        # ignoring(type) drops the one label that differs between them.
        expr: |
          (agones_fleets_replicas_count{type="ready"} / ignoring(type)
            agones_fleets_replicas_count{type="desired"}) < 0.3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low ready GameServer capacity"
          description: "Fleet {{ $labels.name }} has only {{ $value | humanizePercentage }} ready GameServers"
      - alert: HighAllocationLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(agones_gameserver_allocations_duration_seconds_bucket[5m])) by (le)
          ) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High allocation latency"
          description: "99th percentile allocation latency is {{ $value }}s"
      - alert: AllocationFailures
        expr: |
          rate(agones_gameserver_allocations_duration_seconds_count{status!="Allocated"}[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GameServer allocation failures detected"
          description: "Allocation error rate is {{ $value | humanize }} per second"
      - alert: FleetAutoscalerLimited
        expr: |
          agones_fleet_autoscalers_limited == 1
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "FleetAutoscaler is limited"
          description: "Autoscaler {{ $labels.name }} has been limited for 15 minutes"
Adjust alert thresholds based on your workload patterns and SLOs. Start with conservative values and tune based on observed behavior.
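When tuning the `for` durations in the rules above, it helps to remember that an alert only fires after its expression has stayed true continuously for that long; a single false evaluation resets the timer. The Python sketch below models that behavior in simplified form. The function name, the 30-second evaluation cadence, and the sample timeline are assumptions for illustration, not Prometheus internals.

```python
def firing_time(evaluations, hold_seconds):
    """Return the timestamp at which an alert would fire, or None.

    `evaluations` is a time-ordered list of (timestamp_seconds, condition_true).
    The alert fires once the condition has been true without interruption
    for at least `hold_seconds`.
    """
    pending_since = None
    for timestamp, active in evaluations:
        if not active:
            pending_since = None  # any false evaluation resets the timer
            continue
        if pending_since is None:
            pending_since = timestamp
        if timestamp - pending_since >= hold_seconds:
            return timestamp
    return None


if __name__ == "__main__":
    # Hypothetical timeline: evaluated every 30 s, with one dip at t=150.
    timeline = [(t, t != 150) for t in range(0, 601, 30)]
    print(firing_time(timeline, hold_seconds=300))
```

In this timeline the dip at t=150 restarts the pending period, so the alert fires 300 seconds after t=180 rather than 300 seconds after t=0.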
Monitoring with Stackdriver
For Google Cloud Platform deployments, enable Stackdriver monitoring:
helm install agones agones/agones \
  --set agones.metrics.stackdriverEnabled=true \
  --set agones.metrics.stackdriverProjectID=YOUR_PROJECT_ID \
  --namespace agones-system
Stackdriver metrics are exported with the agones prefix and include all Prometheus metrics with custom resource labels.
Stackdriver has a minimum reporting period of 60 seconds, compared to 15 seconds for Prometheus, so Stackdriver dashboards and alerts react more slowly to short-lived changes.
Health Checks
Monitor Agones component health:
# Controller health
kubectl get pods -n agones-system -l app=agones,component=controller
# Check controller logs for errors
kubectl logs -n agones-system -l app=agones,component=controller --tail=100
# Metrics endpoint health (port-forward blocks, so run curl from a second terminal)
kubectl port-forward -n agones-system svc/agones-controller 8080:8080
curl http://localhost:8080/metrics
Best Practices
Set appropriate scrape intervals
Use 15-30 second scrape intervals for Agones metrics
Avoid scraping more frequently than every 10 seconds
Match scrape interval to your alerting requirements
Monitor metric cardinality
Agones metrics include labels for fleet, namespace, and state
High fleet/namespace counts increase cardinality
Consider using metric relabeling to reduce dimensionality
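As a rough way to reason about cardinality, the worst-case number of series for a per-state GameServer metric is the number of fleets (across all namespaces) multiplied by the number of distinct states. The Python sketch below does that arithmetic. The fleet counts and the state count of 10 are hypothetical inputs, and histogram metrics multiply the result again by their bucket count.

```python
def max_series(fleets_per_namespace, state_count):
    """Upper bound on series for a metric labelled by fleet, namespace, and state."""
    return sum(fleets_per_namespace.values()) * state_count


if __name__ == "__main__":
    # Hypothetical cluster: 5 fleets in one namespace, 3 in another.
    fleets = {"default": 5, "staging": 3}
    print(max_series(fleets, state_count=10))
```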
Use recording rules for complex queries
Create recording rules for frequently-used queries to improve dashboard performance:
groups:
  - name: agones_aggregations
    interval: 30s
    rules:
      - record: fleet:allocations:ratio
        expr: |
          sum by (name, namespace) (agones_fleets_replicas_count{type="allocated"}) /
          sum by (name, namespace) (agones_fleets_replicas_count{type="total"})
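To show what the fleet:allocations:ratio rule computes, here is a small Python sketch that performs the same grouping and division on in-memory samples. The sample data and the function name are hypothetical. Unlike PromQL, which would yield NaN when dividing 0 by 0, this sketch simply omits fleets whose total is zero.

```python
def allocation_ratio(samples):
    """Allocated/total replica ratio keyed by (fleet name, namespace).

    `samples` is a list of dicts with keys name, namespace, type, value,
    mirroring the labels on agones_fleets_replicas_count.
    """
    allocated, total = {}, {}
    for sample in samples:
        key = (sample["name"], sample["namespace"])
        if sample["type"] == "allocated":
            allocated[key] = allocated.get(key, 0.0) + sample["value"]
        elif sample["type"] == "total":
            total[key] = total.get(key, 0.0) + sample["value"]
    # Keep only fleets present on both sides with a non-zero total.
    return {k: allocated[k] / total[k] for k in allocated if total.get(k)}


# Hypothetical samples for two fleets.
SAMPLES = [
    {"name": "my-fleet", "namespace": "default", "type": "allocated", "value": 3},
    {"name": "my-fleet", "namespace": "default", "type": "total", "value": 12},
    {"name": "idle", "namespace": "default", "type": "allocated", "value": 0},
    {"name": "idle", "namespace": "default", "type": "total", "value": 0},
]

if __name__ == "__main__":
    print(allocation_ratio(SAMPLES))
```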
Plan for metric retention
Keep at least 30 days of metrics for trend analysis
Use remote write to long-term storage (e.g., Thanos, Cortex)
Archive dashboard snapshots before major changes
Troubleshooting
If metrics are not appearing:
Verify controller is running:
kubectl get pods -n agones-system
Check metrics endpoint (run curl from a second terminal while the port-forward is active):
kubectl port-forward -n agones-system svc/agones-controller 8080:8080
curl http://localhost:8080/metrics | grep agones_gameservers_count
Verify Prometheus scraping:
Check Prometheus targets page
Look for agones-system service discovery
Verify no authentication errors in Prometheus logs
Check RBAC permissions:
kubectl get clusterrolebinding | grep prometheus
kubectl get clusterrole prometheus -o yaml
Next Steps
Metrics Reference: complete list of available metrics
Troubleshooting: debug common monitoring issues