
Overview

The ClusterService provides cluster-wide operations that span multiple nodes. Unlike the MachineService, which operates on individual nodes, ClusterService methods coordinate actions across the cluster.
service ClusterService {
  rpc HealthCheck(HealthCheckRequest) returns (stream HealthCheckProgress);
}

Health Check

HealthCheck

Performs a comprehensive health check across the entire cluster, validating that all components are functioning correctly.
| Field | Type | Description |
| --- | --- | --- |
| wait_timeout | Duration | Maximum time to wait for the cluster to become healthy. If not specified, the current status is returned immediately. |
| cluster_info | ClusterInfo | Optional cluster information to validate against. |

ClusterInfo

| Field | Type | Description |
| --- | --- | --- |
| control_plane_nodes | string[] | List of control plane node addresses. |
| worker_nodes | string[] | List of worker node addresses. |
| force_endpoint | string | Override the cluster endpoint for validation. |

Response

The HealthCheck method returns a stream of progress messages as the health check proceeds.
| Field | Type | Description |
| --- | --- | --- |
| metadata | Metadata | Standard response metadata, including the hostname of the reporting node. |
| message | string | Human-readable progress message describing the current check. |

Health Check Stages

The health check validates multiple aspects of the cluster:

1. Node Connectivity

Verifies that all nodes are reachable via the Talos API:
Checking node connectivity...
Node 10.0.0.1: connected
Node 10.0.0.2: connected
Node 10.0.0.3: connected

2. Node Readiness

Checks that all nodes report as ready:
Checking node readiness...
Node 10.0.0.1: ready
Node 10.0.0.2: ready
Node 10.0.0.3: ready

3. etcd Health

Validates etcd cluster health on control plane nodes:
Checking etcd cluster health...
Node 10.0.0.1: etcd member healthy
Node 10.0.0.2: etcd member healthy
Node 10.0.0.3: etcd member healthy
etcd cluster has quorum

4. Kubernetes API Server

Verifies Kubernetes API server is accessible and responding:
Checking Kubernetes API server...
Kubernetes API server is reachable
All Kubernetes API server endpoints are healthy

5. Control Plane Components

Validates control plane components are running:
Checking control plane components...
kube-apiserver: running
kube-controller-manager: running
kube-scheduler: running

6. Node Health

Verifies nodes are healthy from Kubernetes perspective:
Checking Kubernetes node health...
Node worker-1: Ready
Node worker-2: Ready
Node control-plane-1: Ready

7. Static Pods

Checks that all static pods are running:
Checking static pods...
kube-apiserver-control-plane-1: Running
kube-controller-manager-control-plane-1: Running
kube-scheduler-control-plane-1: Running
etcd-control-plane-1: Running

Usage Examples

Basic Health Check

Perform an immediate health check:
import (
    "context"
    "fmt"
    "io"
    "github.com/siderolabs/talos/pkg/machinery/api/cluster"
)

// Assumes client is an established Talos API client and ctx a valid context.
stream, err := client.HealthCheck(ctx, &cluster.HealthCheckRequest{})
if err != nil {
    return err
}

for {
    msg, err := stream.Recv()
    if err == io.EOF {
        break
    }
    if err != nil {
        return err
    }
    
    if msg.Metadata.Error != "" {
        fmt.Printf("[%s] ERROR: %s\n", msg.Metadata.Hostname, msg.Metadata.Error)
    } else {
        fmt.Printf("[%s] %s\n", msg.Metadata.Hostname, msg.Message)
    }
}

Health Check with Timeout

Wait up to 5 minutes for the cluster to become healthy:
import (
    "google.golang.org/protobuf/types/known/durationpb"
    "time"
)

stream, err := client.HealthCheck(ctx, &cluster.HealthCheckRequest{
    WaitTimeout: durationpb.New(5 * time.Minute),
})

Health Check with Cluster Info

Validate specific cluster topology:
stream, err := client.HealthCheck(ctx, &cluster.HealthCheckRequest{
    ClusterInfo: &cluster.ClusterInfo{
        ControlPlaneNodes: []string{
            "10.0.0.1",
            "10.0.0.2",
            "10.0.0.3",
        },
        WorkerNodes: []string{
            "10.0.1.1",
            "10.0.1.2",
        },
    },
})

Using talosctl

The talosctl health command wraps the HealthCheck RPC:
# Basic health check
talosctl health

# Wait for cluster to become healthy
talosctl health --wait-timeout 10m

# Check specific control plane nodes
talosctl health --control-plane-nodes 10.0.0.1,10.0.0.2,10.0.0.3

Error Handling

Health check errors are reported in the metadata:
for {
    msg, err := stream.Recv()
    if err == io.EOF {
        break
    }
    if err != nil {
        // gRPC level error
        return fmt.Errorf("health check failed: %w", err)
    }
    
    if msg.Metadata.Error != "" {
        // Node-level error
        fmt.Printf("Node %s: %s\n", msg.Metadata.Hostname, msg.Metadata.Error)
    }
    
    if msg.Metadata.Status != nil && msg.Metadata.Status.Code != 0 {
        // gRPC status error
        fmt.Printf("Node %s: %s (code %d)\n",
            msg.Metadata.Hostname,
            msg.Metadata.Status.Message,
            msg.Metadata.Status.Code)
    }
}

Common Error Conditions

Node Unreachable

[10.0.0.2] ERROR: rpc error: code = Unavailable desc = connection refused
Resolution: Check network connectivity, firewall rules, and that the Talos API (apid, port 50000) is running on the node.

etcd Cluster Without Quorum

[10.0.0.1] etcd cluster lost quorum
Resolution: Ensure at least (N/2)+1 control plane nodes are running and healthy.
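The quorum arithmetic above can be computed directly. A minimal sketch (quorumSize is an illustrative name, not part of the Talos API):

```go
package main

import "fmt"

// quorumSize returns the minimum number of healthy etcd members
// required for an n-member cluster to maintain quorum: (n/2)+1.
func quorumSize(n int) int {
	return n/2 + 1
}

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("%d-member cluster: quorum requires %d, tolerates %d failures\n",
			n, quorumSize(n), n-quorumSize(n))
	}
}
```

This is why control planes are typically sized at 3 or 5 nodes: a 3-member cluster tolerates one failed member, a 5-member cluster tolerates two.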

Node Not Ready

[10.0.0.3] Node is not ready: KubeletNotReady
Resolution: Check kubelet logs and node status:
talosctl logs kubelet -n 10.0.0.3
kubectl describe node <node-name>

API Server Unreachable

Kubernetes API server is not reachable: connection timeout
Resolution: Verify:
  • Control plane nodes are running
  • API server pods are healthy
  • Load balancer (if used) is functioning
  • Firewall rules allow port 6443

Wait Semantics

When wait_timeout is specified, the health check will:
  1. Perform initial validation
  2. If unhealthy, retry checks every 5 seconds
  3. Continue until either:
    • All checks pass (returns successfully)
    • Timeout is reached (returns with error)
    • Context is cancelled
// Wait up to 10 minutes for cluster to be healthy
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
defer cancel()

stream, err := client.HealthCheck(ctx, &cluster.HealthCheckRequest{
    WaitTimeout: durationpb.New(10 * time.Minute),
})

// This will block until healthy or timeout
for {
    msg, err := stream.Recv()
    if err == io.EOF {
        fmt.Println("Cluster is healthy!")
        break
    }
    if err != nil {
        return fmt.Errorf("cluster did not become healthy: %w", err)
    }
    fmt.Println(msg.Message)
}

Use Cases

Pre-Upgrade Validation

Before upgrading the cluster, validate everything is healthy:
talosctl health --wait-timeout 5m || exit 1
talosctl upgrade --nodes 10.0.0.1 --image ghcr.io/siderolabs/installer:v1.7.0

Post-Bootstrap Validation

After bootstrapping a new cluster, wait for it to become healthy:
talosctl bootstrap -n 10.0.0.1
talosctl health --wait-timeout 10m

Continuous Monitoring

Periodically check cluster health:
func monitorClusterHealth(ctx context.Context, client *client.Client) {
    ticker := time.NewTicker(1 * time.Minute)
    defer ticker.Stop()
    
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            stream, err := client.HealthCheck(ctx, &cluster.HealthCheckRequest{})
            if err != nil {
                log.Printf("Health check failed: %v", err)
                continue
            }
            
            healthy := true
            for {
                msg, err := stream.Recv()
                if err == io.EOF {
                    break
                }
                if err != nil || msg.Metadata.Error != "" {
                    healthy = false
                    break
                }
            }
            
            if healthy {
                log.Println("Cluster is healthy")
            } else {
                log.Println("Cluster has issues")
                // Send alert, trigger remediation, etc.
            }
        }
    }
}

CI/CD Integration

Validate cluster health in CI/CD pipelines:
# GitLab CI example
deploy:
  script:
    - kubectl apply -f manifests/
    - talosctl health --wait-timeout 5m
  only:
    - main

Performance Considerations

Health checks validate multiple cluster components and may take time to complete:
  • Small cluster (3 nodes): ~5-10 seconds
  • Medium cluster (10 nodes): ~15-30 seconds
  • Large cluster (100+ nodes): ~1-3 minutes
For large clusters, consider:
  1. Increase timeout: Allow more time for validation
  2. Targeted checks: Validate specific node subsets
  3. Caching: Don’t run health checks too frequently

Best Practices

Always Set a Timeout

Use appropriate timeouts based on cluster size:
// Small cluster
timeout := 2 * time.Minute

// Large cluster
timeout := 10 * time.Minute

stream, err := client.HealthCheck(ctx, &cluster.HealthCheckRequest{
    WaitTimeout: durationpb.New(timeout),
})

Handle Partial Failures

Some nodes may be healthy while others are not:
healthyNodes := []string{}
unhealthyNodes := []string{}

for {
    msg, err := stream.Recv()
    if err == io.EOF {
        break
    }
    if err != nil {
        return err
    }
    
    if msg.Metadata.Error != "" {
        unhealthyNodes = append(unhealthyNodes, msg.Metadata.Hostname)
    } else {
        healthyNodes = append(healthyNodes, msg.Metadata.Hostname)
    }
}

if len(unhealthyNodes) > 0 {
    fmt.Printf("Unhealthy nodes: %v\n", unhealthyNodes)
}

Use Context Cancellation

Allow users to cancel long-running health checks:
ctx, cancel := context.WithCancel(context.Background())

// Handle Ctrl+C
go func() {
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, os.Interrupt)
    <-sigCh
    cancel()
}()

stream, err := client.HealthCheck(ctx, &cluster.HealthCheckRequest{
    WaitTimeout: durationpb.New(10 * time.Minute),
})
