Health checking is a critical feature that allows Agones to monitor whether your game server is running correctly. If health checks fail, Agones marks the server as Unhealthy and automatically terminates and replaces it.

How Health Checking Works

Agones uses a streaming health check model where your game server sends periodic health pings to the sidecar:

1. SDK Establishes Stream: When your game server connects to the SDK, it opens a gRPC health check stream to the sidecar.

2. Regular Health Pings: Your game server sends Health() calls at regular intervals (typically every 2-5 seconds).

3. Sidecar Monitors: The sidecar tracks the time since the last health ping. If no ping is received within the configured period, it reports a failure.

4. Failure Threshold: After a configured number of consecutive failures, Agones marks the GameServer as Unhealthy and terminates it.

The Health() Method

All Agones SDKs provide a Health() method that sends a health ping to the sidecar:
func doHealth(ctx context.Context, sdk *sdk.SDK) {
    tick := time.Tick(2 * time.Second)
    for {
        log.Printf("Health Ping")
        if err := sdk.Health(); err != nil {
            // Log but keep pinging: exiting here would stop health
            // reporting entirely and guarantee an Unhealthy transition.
            log.Printf("Could not send health ping: %v", err)
        }
        select {
        case <-ctx.Done():
            log.Print("Stopped health pings")
            return
        case <-tick:
        }
    }
}

// Start health checking in background
func main() {
    s, err := sdk.NewSDK()
    if err != nil {
        log.Fatalf("Could not connect to SDK: %v", err)
    }
    ctx, cancel := context.WithCancel(context.Background())

    go doHealth(ctx, s)

    // Your game logic...

    // Stop health checking when shutting down
    cancel()
}
Critical: Health pings must be sent continuously once started. If you stop sending health pings, your server will be marked as Unhealthy and terminated.

Health Configuration

Configure health checking in your GameServer manifest under spec.health:
gameserver.yaml
apiVersion: "agones.dev/v1"
kind: GameServer
metadata:
  name: "my-game-server"
spec:
  health:
    # Enable or disable health checking
    disabled: false
    
    # How often to check for health pings (seconds)
    # If no ping received within this period, it's a failure
    periodSeconds: 5
    
    # Number of consecutive failures before marking Unhealthy
    failureThreshold: 3
    
    # Initial delay before starting health checks (seconds)
    # Gives your server time to initialize
    initialDelaySeconds: 5
  
  template:
    spec:
      containers:
      - name: game-server
        image: gcr.io/my-project/my-game-server:latest

Configuration Options

disabled
boolean, default: false
Whether health checking is disabled. If true, the server will never be marked as Unhealthy due to missing health pings.
Disabling health checks means Agones cannot detect crashed or frozen servers. Use only for testing.
periodSeconds
integer, default: 5
How often (in seconds) to check if a health ping was received. If no ping is received within this period, it counts as one failure.

Recommendation: Set this to 2-3x your health ping interval. If you send pings every 2 seconds, use periodSeconds: 5 or periodSeconds: 6.
failureThreshold
integer, default: 3
Number of consecutive failed health checks before marking the GameServer as Unhealthy.

Total grace period = periodSeconds × failureThreshold

Example: With periodSeconds: 5 and failureThreshold: 3, your server has 15 seconds to send a health ping before being marked Unhealthy.
initialDelaySeconds
integer, default: 5
How long (in seconds) to wait after the container starts before beginning health checks. This gives your server time to initialize.

Important: If your server doesn’t start sending health pings within initialDelaySeconds + (periodSeconds × failureThreshold) seconds of container start, it will be marked Unhealthy.

Configuration Examples

Fast Health Checking

For servers that initialize quickly and need fast failure detection:
spec:
  health:
    periodSeconds: 3
    failureThreshold: 2
    initialDelaySeconds: 5
  • Health pings expected every 3 seconds
  • Marked Unhealthy after 2 failures (6 seconds)
  • 5 seconds grace period at startup
  • Total grace time: 5 + (3 × 2) = 11 seconds

Slow Initialization

For servers that take time to load assets or initialize:
spec:
  health:
    periodSeconds: 5
    failureThreshold: 3
    initialDelaySeconds: 30
  • Health pings expected every 5 seconds
  • Marked Unhealthy after 3 failures (15 seconds)
  • 30 seconds grace period at startup
  • Total grace time: 30 + (5 × 3) = 45 seconds

Relaxed Health Checking

For stable servers where occasional delays are acceptable:
spec:
  health:
    periodSeconds: 10
    failureThreshold: 5
    initialDelaySeconds: 10
  • Health pings expected every 10 seconds
  • Marked Unhealthy after 5 failures (50 seconds)
  • 10 seconds grace period at startup
  • Total grace time: 10 + (10 × 5) = 60 seconds

Disabled Health Checking

Only for testing or special cases. Not recommended for production.
spec:
  health:
    disabled: true
When disabled:
  • Your server never needs to send health pings
  • It will never be marked Unhealthy
  • You’re responsible for detecting and handling server failures

Best Practices

Send Regular Pings

Send health pings at regular intervals (every 2-5 seconds). Don’t wait for the health check period to elapse.

Start Early

Start health checking immediately after SDK connection, even before calling Ready().

Use Background Tasks

Run health checking in a separate thread/goroutine/task so it doesn’t block your game logic.

Handle Errors

Log errors when health pings fail. This can indicate SDK connection issues.
Rule of thumb: Send health pings at an interval that’s 40-50% of your periodSeconds value to provide a safety buffer.

Advanced Patterns

Conditional Health Checking

Only send health pings when the server is in a healthy state:
type HealthChecker struct {
    sdk       *sdk.SDK
    healthy   bool
    mu        sync.RWMutex
}

func (h *HealthChecker) SetHealthy(healthy bool) {
    h.mu.Lock()
    defer h.mu.Unlock()
    h.healthy = healthy
}

func (h *HealthChecker) Start(ctx context.Context) {
    tick := time.Tick(2 * time.Second)
    for {
        h.mu.RLock()
        healthy := h.healthy
        h.mu.RUnlock()
        
        if healthy {
            err := h.sdk.Health()
            if err != nil {
                log.Printf("Health ping failed: %v", err)
            }
        } else {
            log.Print("Server unhealthy, skipping health ping")
        }
        
        select {
        case <-ctx.Done():
            return
        case <-tick:
        }
    }
}

// Usage
func main() {
    s, _ := sdk.NewSDK()
    checker := &HealthChecker{sdk: s, healthy: true}
    
    ctx, cancel := context.WithCancel(context.Background())
    go checker.Start(ctx)
    
    // If a critical error occurs
    checker.SetHealthy(false) // Server will be marked Unhealthy
}
This pattern is useful when you want to intentionally fail health checks to trigger server replacement on critical errors.

Health Checking with Retries

Retry failed health pings before giving up:
func healthWithRetry(sdk *sdk.SDK, maxRetries int) error {
    var err error
    for i := 0; i < maxRetries; i++ {
        err = sdk.Health()
        if err == nil {
            return nil
        }
        log.Printf("Health ping failed (attempt %d/%d): %v", i+1, maxRetries, err)
        time.Sleep(100 * time.Millisecond)
    }
    return fmt.Errorf("health ping failed after %d retries: %w", maxRetries, err)
}

func doHealthWithRetry(ctx context.Context, sdk *sdk.SDK) {
    tick := time.Tick(2 * time.Second)
    for {
        err := healthWithRetry(sdk, 3)
        if err != nil {
            log.Printf("All health ping retries failed: %v", err)
        }
        
        select {
        case <-ctx.Done():
            return
        case <-tick:
        }
    }
}

Monitoring Health Check Status

Track health check success/failure metrics:
type HealthMetrics struct {
    successCount int64
    failureCount int64
    lastSuccess  time.Time
    lastFailure  time.Time
    mu           sync.RWMutex
}

func (m *HealthMetrics) RecordSuccess() {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.successCount++
    m.lastSuccess = time.Now()
}

func (m *HealthMetrics) RecordFailure() {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.failureCount++
    m.lastFailure = time.Now()
}

func (m *HealthMetrics) GetStats() (success, failure int64, lastSuccess, lastFailure time.Time) {
    m.mu.RLock()
    defer m.mu.RUnlock()
    return m.successCount, m.failureCount, m.lastSuccess, m.lastFailure
}

func doHealthWithMetrics(ctx context.Context, sdk *sdk.SDK, metrics *HealthMetrics) {
    tick := time.Tick(2 * time.Second)
    for {
        err := sdk.Health()
        if err != nil {
            metrics.RecordFailure()
            log.Printf("Health ping failed: %v", err)
        } else {
            metrics.RecordSuccess()
        }
        
        select {
        case <-ctx.Done():
            return
        case <-tick:
        }
    }
}

Troubleshooting

Symptoms: Server transitions to Unhealthy right after starting.

Causes:
  • initialDelaySeconds is too short
  • Server initialization takes longer than expected
  • Health pings not starting before Ready() call
Solutions:
  • Increase initialDelaySeconds to give more startup time
  • Start health checking immediately after SDK connection
  • Ensure health check loop starts before long initialization tasks
spec:
  health:
    initialDelaySeconds: 30  # Increase from default
    periodSeconds: 5
    failureThreshold: 3
Symptoms: Server occasionally becomes Unhealthy but recovers.

Causes:
  • Health ping interval too slow for configured periodSeconds
  • Network delays between container and sidecar
  • Game logic blocking health check thread
Solutions:
  • Increase periodSeconds to allow more time
  • Increase failureThreshold for more tolerance
  • Send health pings more frequently
  • Ensure health checking runs in separate thread/task
spec:
  health:
    periodSeconds: 10  # More lenient timing
    failureThreshold: 5  # More tolerance for delays
Symptoms: Health() calls return errors.

Causes:
  • SDK not connected to sidecar
  • Sidecar container not running
  • Network issues in pod
  • gRPC stream closed
Solutions:
  • Verify SDK connection before starting health checks
  • Check that GameServer manifest includes Agones sidecar
  • Examine pod logs for sidecar errors
  • Implement retry logic for failed health pings
s, err := sdk.NewSDK()
if err != nil {
    log.Fatalf("Could not connect to SDK: %v", err)
}
log.Print("SDK connected successfully")
Symptoms: Crashed server stays in Ready/Allocated state.

Causes:
  • Health checking disabled in spec
  • Health check loop continues after crash (unlikely)
  • Container doesn’t exit on crash
Solutions:
  • Ensure spec.health.disabled is false or omitted
  • Configure container to exit on critical errors
  • Add liveness probes if needed
spec:
  health:
    disabled: false  # Ensure health checking is enabled

Health Checking vs Kubernetes Probes

Agones health checking is separate from Kubernetes liveness and readiness probes: Agones receives health pings over the SDK's gRPC stream inside the pod, while Kubernetes probes are run by the kubelet against the container. You typically don’t need Kubernetes liveness probes when using Agones health checking, as Agones handles game server health monitoring.

Next Steps

Lifecycle Management

Learn about Ready, Shutdown, and state transitions

SDK Overview

Explore other SDK features

Troubleshooting

Debug common integration issues

Metrics

Monitor server health with metrics
