Health checking is a critical feature that allows Agones to monitor whether your game server is running correctly. If health checks fail, Agones marks the server as Unhealthy and automatically terminates and replaces it.

How Health Checking Works

Agones uses a streaming health check model where your game server sends periodic health pings to the sidecar:

1. SDK Establishes Stream: When your game server connects to the SDK, it opens a gRPC health check stream to the sidecar.

2. Regular Health Pings: Your game server sends Health() calls at regular intervals (typically every 2-5 seconds).

3. Sidecar Monitors: The sidecar tracks the time since the last health ping. If no ping is received within the configured period, it reports a failure.

4. Failure Threshold: After a configured number of consecutive failures, Agones marks the GameServer as Unhealthy and terminates it.

The Health() Method

All Agones SDKs provide a Health() method that sends a health ping to the sidecar:
func doHealth(ctx context.Context, sdk *sdk.SDK) {
    tick := time.Tick(2 * time.Second)
    for {
        log.Printf("Health Ping")
        if err := sdk.Health(); err != nil {
            // Log but keep pinging: exiting here would stop health
            // reporting entirely and guarantee an Unhealthy transition.
            log.Printf("Could not send health ping: %v", err)
        }
        select {
        case <-ctx.Done():
            log.Print("Stopped health pings")
            return
        case <-tick:
        }
    }
}

// Start health checking in background
func main() {
    s, err := sdk.NewSDK()
    if err != nil {
        log.Fatalf("Could not connect to SDK: %v", err)
    }
    ctx, cancel := context.WithCancel(context.Background())

    go doHealth(ctx, s)

    // Your game logic...

    // Stop health checking when shutting down
    cancel()
}
Critical: Health pings must be sent continuously once started. If you stop sending health pings, your server will be marked as Unhealthy and terminated.

Health Configuration

Configure health checking in your GameServer manifest under spec.health:
gameserver.yaml
apiVersion: "agones.dev/v1"
kind: GameServer
metadata:
  name: "my-game-server"
spec:
  health:
    # Enable or disable health checking
    disabled: false
    
    # How often to check for health pings (seconds)
    # If no ping received within this period, it's a failure
    periodSeconds: 5
    
    # Number of consecutive failures before marking Unhealthy
    failureThreshold: 3
    
    # Initial delay before starting health checks (seconds)
    # Gives your server time to initialize
    initialDelaySeconds: 5
  
  template:
    spec:
      containers:
      - name: game-server
        image: gcr.io/my-project/my-game-server:latest

Configuration Options

disabled
boolean, default: false
Whether health checking is disabled. If true, the server will never be marked as Unhealthy due to missing health pings.
Disabling health checks means Agones cannot detect crashed or frozen servers. Use only for testing.
periodSeconds
integer, default: 5
How often (in seconds) to check if a health ping was received. If no ping is received within this period, it counts as one failure.

Recommendation: Set this to 2-3x your health ping interval. If you send pings every 2 seconds, use periodSeconds: 5 or periodSeconds: 6.
failureThreshold
integer, default: 3
Number of consecutive failed health checks before marking the GameServer as Unhealthy.

Total grace period = periodSeconds × failureThreshold

Example: With periodSeconds: 5 and failureThreshold: 3, your server has 15 seconds to send a health ping before being marked Unhealthy.
initialDelaySeconds
integer, default: 5
How long (in seconds) to wait after the container starts before beginning health checks. This gives your server time to initialize.

Important: If your server doesn’t start sending health pings within initialDelaySeconds + (periodSeconds × failureThreshold) seconds of container start, it will be marked Unhealthy.

Configuration Examples

Fast Health Checking

For servers that initialize quickly and need fast failure detection:
spec:
  health:
    periodSeconds: 3
    failureThreshold: 2
    initialDelaySeconds: 5
  • Health pings expected every 3 seconds
  • Marked Unhealthy after 2 failures (6 seconds)
  • 5 seconds grace period at startup
  • Total grace time: 5 + (3 × 2) = 11 seconds

Slow Initialization

For servers that take time to load assets or initialize:
spec:
  health:
    periodSeconds: 5
    failureThreshold: 3
    initialDelaySeconds: 30
  • Health pings expected every 5 seconds
  • Marked Unhealthy after 3 failures (15 seconds)
  • 30 seconds grace period at startup
  • Total grace time: 30 + (5 × 3) = 45 seconds

Relaxed Health Checking

For stable servers where occasional delays are acceptable:
spec:
  health:
    periodSeconds: 10
    failureThreshold: 5
    initialDelaySeconds: 10
  • Health pings expected every 10 seconds
  • Marked Unhealthy after 5 failures (50 seconds)
  • 10 seconds grace period at startup
  • Total grace time: 10 + (10 × 5) = 60 seconds

Disabled Health Checking

Only for testing or special cases. Not recommended for production.
spec:
  health:
    disabled: true
When disabled:
  • Your server never needs to send health pings
  • It will never be marked Unhealthy
  • You’re responsible for detecting and handling server failures

Best Practices

Send Regular Pings

Send health pings at regular intervals (every 2-5 seconds). Don’t wait for the health check period to elapse.

Start Early

Start health checking immediately after SDK connection, even before calling Ready().

Use Background Tasks

Run health checking in a separate thread/goroutine/task so it doesn’t block your game logic.

Handle Errors

Log errors when health pings fail. This can indicate SDK connection issues.
Rule of thumb: Send health pings at an interval that’s 40-50% of your periodSeconds value to provide a safety buffer.

Advanced Patterns

Conditional Health Checking

Only send health pings when the server is in a healthy state:
type HealthChecker struct {
    sdk       *sdk.SDK
    healthy   bool
    mu        sync.RWMutex
}

func (h *HealthChecker) SetHealthy(healthy bool) {
    h.mu.Lock()
    defer h.mu.Unlock()
    h.healthy = healthy
}

func (h *HealthChecker) Start(ctx context.Context) {
    tick := time.Tick(2 * time.Second)
    for {
        h.mu.RLock()
        healthy := h.healthy
        h.mu.RUnlock()
        
        if healthy {
            err := h.sdk.Health()
            if err != nil {
                log.Printf("Health ping failed: %v", err)
            }
        } else {
            log.Print("Server unhealthy, skipping health ping")
        }
        
        select {
        case <-ctx.Done():
            return
        case <-tick:
        }
    }
}

// Usage
func main() {
    s, _ := sdk.NewSDK()
    checker := &HealthChecker{sdk: s, healthy: true}
    
    ctx, cancel := context.WithCancel(context.Background())
    go checker.Start(ctx)
    
    // If a critical error occurs
    checker.SetHealthy(false) // Server will be marked Unhealthy
}
This pattern is useful when you want to intentionally fail health checks to trigger server replacement on critical errors.

Health Checking with Retries

Retry failed health pings before giving up:
func healthWithRetry(sdk *sdk.SDK, maxRetries int) error {
    var err error
    for i := 0; i < maxRetries; i++ {
        err = sdk.Health()
        if err == nil {
            return nil
        }
        log.Printf("Health ping failed (attempt %d/%d): %v", i+1, maxRetries, err)
        time.Sleep(100 * time.Millisecond)
    }
    return fmt.Errorf("health ping failed after %d retries: %w", maxRetries, err)
}

func doHealthWithRetry(ctx context.Context, sdk *sdk.SDK) {
    tick := time.Tick(2 * time.Second)
    for {
        err := healthWithRetry(sdk, 3)
        if err != nil {
            log.Printf("All health ping retries failed: %v", err)
        }
        
        select {
        case <-ctx.Done():
            return
        case <-tick:
        }
    }
}

Monitoring Health Check Status

Track health check success/failure metrics:
type HealthMetrics struct {
    successCount int64
    failureCount int64
    lastSuccess  time.Time
    lastFailure  time.Time
    mu           sync.RWMutex
}

func (m *HealthMetrics) RecordSuccess() {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.successCount++
    m.lastSuccess = time.Now()
}

func (m *HealthMetrics) RecordFailure() {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.failureCount++
    m.lastFailure = time.Now()
}

func (m *HealthMetrics) GetStats() (success, failure int64, lastSuccess, lastFailure time.Time) {
    m.mu.RLock()
    defer m.mu.RUnlock()
    return m.successCount, m.failureCount, m.lastSuccess, m.lastFailure
}

func doHealthWithMetrics(ctx context.Context, sdk *sdk.SDK, metrics *HealthMetrics) {
    tick := time.Tick(2 * time.Second)
    for {
        err := sdk.Health()
        if err != nil {
            metrics.RecordFailure()
            log.Printf("Health ping failed: %v", err)
        } else {
            metrics.RecordSuccess()
        }
        
        select {
        case <-ctx.Done():
            return
        case <-tick:
        }
    }
}

Troubleshooting

Symptoms: Server transitions to Unhealthy right after starting.

Causes:
  • initialDelaySeconds is too short
  • Server initialization takes longer than expected
  • Health pings not starting before Ready() call
Solutions:
  • Increase initialDelaySeconds to give more startup time
  • Start health checking immediately after SDK connection
  • Ensure health check loop starts before long initialization tasks
spec:
  health:
    initialDelaySeconds: 30  # Increase from default
    periodSeconds: 5
    failureThreshold: 3
Symptoms: Server occasionally becomes Unhealthy but recovers.

Causes:
  • Health ping interval too slow for configured periodSeconds
  • Network delays between container and sidecar
  • Game logic blocking health check thread
Solutions:
  • Increase periodSeconds to allow more time
  • Increase failureThreshold for more tolerance
  • Send health pings more frequently
  • Ensure health checking runs in separate thread/task
spec:
  health:
    periodSeconds: 10  # More lenient timing
    failureThreshold: 5  # More tolerance for delays
Symptoms: Health() calls return errors.

Causes:
  • SDK not connected to sidecar
  • Sidecar container not running
  • Network issues in pod
  • gRPC stream closed
Solutions:
  • Verify SDK connection before starting health checks
  • Check that GameServer manifest includes Agones sidecar
  • Examine pod logs for sidecar errors
  • Implement retry logic for failed health pings
s, err := sdk.NewSDK()
if err != nil {
    log.Fatalf("Could not connect to SDK: %v", err)
}
log.Print("SDK connected successfully")
Symptoms: Crashed server stays in Ready/Allocated state.

Causes:
  • Health checking disabled in spec
  • Health check loop continues after crash (unlikely)
  • Container doesn’t exit on crash
Solutions:
  • Ensure spec.health.disabled is false or omitted
  • Configure container to exit on critical errors
  • Add liveness probes if needed
spec:
  health:
    disabled: false  # Ensure health checking is enabled

Health Checking vs Kubernetes Probes

Agones health checking is separate from Kubernetes liveness and readiness probes: Agones receives health pings over the SDK's gRPC stream inside the pod, while Kubernetes probes are run by the kubelet against the container. You typically don’t need Kubernetes liveness probes when using Agones health checking, as Agones handles game server health monitoring.

Next Steps

Lifecycle Management

Learn about Ready, Shutdown, and state transitions

SDK Overview

Explore other SDK features

Troubleshooting

Debug common integration issues

Metrics

Monitor server health with metrics
