Overview
The ClusterService provides cluster-wide operations that span multiple nodes. Unlike the MachineService, which operates on individual nodes, ClusterService methods coordinate actions across the cluster.
service ClusterService {
rpc HealthCheck(HealthCheckRequest) returns (stream HealthCheckProgress);
}
Health Check
HealthCheck
Performs a comprehensive health check across the entire cluster, validating that all components are functioning correctly.
HealthCheckRequest
wait_timeout: Maximum time to wait for the cluster to become healthy. If not specified, the current status is returned immediately.
cluster_info: Optional cluster information to validate against.
ClusterInfo
control_plane_nodes: List of control plane node addresses.
worker_nodes: List of worker node addresses.
force_endpoint: Override the cluster endpoint for validation.
Response
The HealthCheck method returns a stream of progress messages as the health check proceeds.
metadata: Standard response metadata, including the hostname of the reporting node.
message: Human-readable progress message describing the current check.
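Based on the fields described above, the request and progress messages plausibly look like the following sketch. The field names and numbers here mirror the Talos machinery API but are assumptions, not confirmed by this page:

```proto
message HealthCheckRequest {
  google.protobuf.Duration wait_timeout = 1;
  ClusterInfo cluster_info = 2;
}

message ClusterInfo {
  repeated string control_plane_nodes = 1;
  repeated string worker_nodes = 2;
  string force_endpoint = 3;
}

message HealthCheckProgress {
  common.Metadata metadata = 1;
  string message = 2;
}
```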
Health Check Stages
The health check validates multiple aspects of the cluster:
1. Node Connectivity
Verifies that all nodes are reachable via the Talos API:
Checking node connectivity...
Node 10.0.0.1: connected
Node 10.0.0.2: connected
Node 10.0.0.3: connected
2. Node Readiness
Checks that all nodes report as ready:
Checking node readiness...
Node 10.0.0.1: ready
Node 10.0.0.2: ready
Node 10.0.0.3: ready
3. etcd Health
Validates etcd cluster health on control plane nodes:
Checking etcd cluster health...
Node 10.0.0.1: etcd member healthy
Node 10.0.0.2: etcd member healthy
Node 10.0.0.3: etcd member healthy
etcd cluster has quorum
4. Kubernetes API Server
Verifies Kubernetes API server is accessible and responding:
Checking Kubernetes API server...
Kubernetes API server is reachable
All Kubernetes API server endpoints are healthy
5. Control Plane Components
Validates control plane components are running:
Checking control plane components...
kube-apiserver: running
kube-controller-manager: running
kube-scheduler: running
6. Node Health
Verifies nodes are healthy from Kubernetes perspective:
Checking Kubernetes node health...
Node worker-1: Ready
Node worker-2: Ready
Node control-plane-1: Ready
7. Static Pods
Checks that all static pods are running:
Checking static pods...
kube-apiserver-control-plane-1: Running
kube-controller-manager-control-plane-1: Running
kube-scheduler-control-plane-1: Running
etcd-control-plane-1: Running
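Consumers that want to group these progress lines by stage can key off the `Checking ...` banner each stage emits. A minimal sketch, assuming the message format shown in the sample output above:

```go
package main

import (
	"fmt"
	"strings"
)

// stage extracts the stage name from a "Checking <stage>..." banner line.
// It returns false for per-node detail lines such as "Node 10.0.0.1: connected".
func stage(msg string) (string, bool) {
	if strings.HasPrefix(msg, "Checking ") && strings.HasSuffix(msg, "...") {
		return strings.TrimSuffix(strings.TrimPrefix(msg, "Checking "), "..."), true
	}
	return "", false
}

func main() {
	lines := []string{
		"Checking etcd cluster health...",
		"Node 10.0.0.1: etcd member healthy",
		"Checking static pods...",
	}
	current := ""
	for _, l := range lines {
		if s, ok := stage(l); ok {
			current = s
			continue
		}
		fmt.Printf("[%s] %s\n", current, l)
	}
}
```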
Usage Examples
Basic Health Check
Perform an immediate health check:
import (
"context"
"fmt"
"io"
"github.com/siderolabs/talos/pkg/machinery/api/cluster"
)
stream, err := client.HealthCheck(ctx, &cluster.HealthCheckRequest{})
if err != nil {
return err
}
for {
msg, err := stream.Recv()
if err == io.EOF {
break
}
if err != nil {
return err
}
if msg.Metadata.Error != "" {
fmt.Printf("[%s] ERROR: %s\n", msg.Metadata.Hostname, msg.Metadata.Error)
} else {
fmt.Printf("[%s] %s\n", msg.Metadata.Hostname, msg.Message)
}
}
Health Check with Timeout
Wait up to 5 minutes for the cluster to become healthy:
import (
"time"

"google.golang.org/protobuf/types/known/durationpb"
)
stream, err := client.HealthCheck(ctx, &cluster.HealthCheckRequest{
WaitTimeout: durationpb.New(5 * time.Minute),
})
Health Check with Cluster Info
Validate specific cluster topology:
stream, err := client.HealthCheck(ctx, &cluster.HealthCheckRequest{
ClusterInfo: &cluster.ClusterInfo{
ControlPlaneNodes: []string{
"10.0.0.1",
"10.0.0.2",
"10.0.0.3",
},
WorkerNodes: []string{
"10.0.1.1",
"10.0.1.2",
},
},
})
Using talosctl
The talosctl health command wraps the HealthCheck RPC:
# Basic health check
talosctl health
# Wait for cluster to become healthy
talosctl health --wait-timeout 10m
# Check specific control plane nodes
talosctl health --control-plane-nodes 10.0.0.1,10.0.0.2,10.0.0.3
Error Handling
Health check errors are reported in the metadata:
for {
msg, err := stream.Recv()
if err == io.EOF {
break
}
if err != nil {
// gRPC level error
return fmt.Errorf("health check failed: %w", err)
}
if msg.Metadata.Error != "" {
// Node-level error
fmt.Printf("Node %s: %s\n", msg.Metadata.Hostname, msg.Metadata.Error)
}
if msg.Metadata.Status != nil && msg.Metadata.Status.Code != 0 {
// gRPC status error
fmt.Printf("Node %s: %s (code %d)\n",
msg.Metadata.Hostname,
msg.Metadata.Status.Message,
msg.Metadata.Status.Code)
}
}
Common Error Conditions
Node Unreachable
[10.0.0.2] ERROR: rpc error: code = Unavailable desc = connection refused
Resolution: Check network connectivity and firewall rules, and verify that the Talos API is running on the node.
etcd Cluster Without Quorum
[10.0.0.1] etcd cluster lost quorum
Resolution: Ensure at least (N/2)+1 control plane nodes are running and healthy.
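The (N/2)+1 rule can be made concrete with a small helper; the numbers below also show how many member failures each cluster size tolerates:

```go
package main

import "fmt"

// quorum returns the minimum number of healthy members required for an
// etcd cluster of n members: floor(n/2) + 1.
func quorum(n int) int {
	return n/2 + 1
}

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("%d members: quorum %d, tolerates %d failure(s)\n",
			n, quorum(n), n-quorum(n))
	}
}
```

This is why odd cluster sizes are preferred: a 4-member cluster needs 3 healthy members for quorum, so it tolerates no more failures than a 3-member cluster.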
Node Not Ready
[10.0.0.3] Node is not ready: KubeletNotReady
Resolution: Check kubelet logs and node status:
talosctl logs kubelet -n 10.0.0.3
kubectl describe node <node-name>
API Server Unreachable
Kubernetes API server is not reachable: connection timeout
Resolution: Verify:
- Control plane nodes are running
- API server pods are healthy
- Load balancer (if used) is functioning
- Firewall rules allow port 6443
Wait Semantics
When wait_timeout is specified, the health check will:
- Perform initial validation
- If unhealthy, retry checks every 5 seconds
- Continue until either:
- All checks pass (returns successfully)
- Timeout is reached (returns with error)
- Context is cancelled
// Wait up to 10 minutes for cluster to be healthy
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
defer cancel()
stream, err := client.HealthCheck(ctx, &cluster.HealthCheckRequest{
WaitTimeout: durationpb.New(10 * time.Minute),
})
// This will block until healthy or timeout
for {
msg, err := stream.Recv()
if err == io.EOF {
fmt.Println("Cluster is healthy!")
break
}
if err != nil {
return fmt.Errorf("cluster did not become healthy: %w", err)
}
fmt.Println(msg.Message)
}
Use Cases
Pre-Upgrade Validation
Before upgrading the cluster, validate everything is healthy:
talosctl health --wait-timeout 5m || exit 1
talosctl upgrade --nodes 10.0.0.1 --image ghcr.io/siderolabs/installer:v1.7.0
Post-Bootstrap Validation
After bootstrapping a new cluster, wait for it to become healthy:
talosctl bootstrap -n 10.0.0.1
talosctl health --wait-timeout 10m
Continuous Monitoring
Periodically check cluster health:
func monitorClusterHealth(ctx context.Context, c *client.Client) {
ticker := time.NewTicker(1 * time.Minute)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return
case <-ticker.C:
stream, err := c.HealthCheck(ctx, &cluster.HealthCheckRequest{})
if err != nil {
log.Printf("Health check failed: %v", err)
continue
}
healthy := true
for {
msg, err := stream.Recv()
if err == io.EOF {
break
}
if err != nil || msg.Metadata.Error != "" {
healthy = false
break
}
}
if healthy {
log.Println("Cluster is healthy")
} else {
log.Println("Cluster has issues")
// Send alert, trigger remediation, etc.
}
}
}
}
CI/CD Integration
Validate cluster health in CI/CD pipelines:
# GitLab CI example
deploy:
script:
- kubectl apply -f manifests/
- talosctl health --wait-timeout 5m
only:
- main
Performance
Health checks validate multiple cluster components and may take time to complete:
- Small cluster (3 nodes): ~5-10 seconds
- Medium cluster (10 nodes): ~15-30 seconds
- Large cluster (100+ nodes): ~1-3 minutes
For large clusters, consider:
- Increase timeout: Allow more time for validation
- Targeted checks: Validate specific node subsets
- Caching: Don’t run health checks too frequently
Best Practices
Always Set a Timeout
Use appropriate timeouts based on cluster size:
// Small cluster
timeout := 2 * time.Minute
// Large cluster
timeout := 10 * time.Minute
stream, err := client.HealthCheck(ctx, &cluster.HealthCheckRequest{
WaitTimeout: durationpb.New(timeout),
})
Handle Partial Failures
Some nodes may be healthy while others are not:
healthyNodes := []string{}
unhealthyNodes := []string{}
for {
msg, err := stream.Recv()
if err == io.EOF {
break
}
if err != nil {
return err
}
if msg.Metadata.Error != "" {
unhealthyNodes = append(unhealthyNodes, msg.Metadata.Hostname)
} else {
healthyNodes = append(healthyNodes, msg.Metadata.Hostname)
}
}
if len(unhealthyNodes) > 0 {
fmt.Printf("Unhealthy nodes: %v\n", unhealthyNodes)
}
Use Context Cancellation
Allow users to cancel long-running health checks:
ctx, cancel := context.WithCancel(context.Background())
// Handle Ctrl+C
go func() {
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, os.Interrupt)
<-sigCh
cancel()
}()
stream, err := client.HealthCheck(ctx, &cluster.HealthCheckRequest{
WaitTimeout: durationpb.New(10 * time.Minute),
})
See Also