General Debugging Approach
When troubleshooting Talos clusters, follow this systematic approach:
Identify the affected component
Determine which layer is experiencing issues:
- Node/OS level (Talos)
- etcd cluster
- Kubernetes control plane
- Application/workload level
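One way to narrow down the layer is to probe each one from the bottom up. The following is a minimal sketch, not an official procedure: `triage_layers` is a hypothetical helper name, the argument is a placeholder control plane node IP, and it assumes `talosctl` and `kubectl` are already configured for the cluster.

```shell
# Hypothetical helper: probe each layer from the bottom up.
# $1 is the IP of a control plane node (placeholder).
triage_layers() {
  node="$1"
  echo "== Node/OS level (Talos services) =="
  talosctl -n "$node" services
  echo "== etcd cluster =="
  talosctl -n "$node" etcd status
  echo "== Kubernetes control plane =="
  kubectl get nodes
  echo "== Application/workload level =="
  kubectl get pods --all-namespaces
}
```

Call it as `triage_layers 10.0.0.2`; the first layer that reports errors is usually the place to start.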
Accessing Logs
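The three log sources covered in this section can all be read with `talosctl`. The sketch below wraps each one in a small function; the function names are hypothetical, and the node IP and container ID arguments are placeholders.

```shell
# Hypothetical helpers; $1 is a node IP (placeholder).

# Logs of a Talos system service, e.g. kubelet, etcd, apid, machined.
service_logs() { talosctl -n "$1" logs "$2"; }

# Kernel messages (dmesg) of the node.
kernel_logs() { talosctl -n "$1" dmesg; }

# Logs of a Kubernetes container; $2 is a container ID as listed by
# `talosctl containers -k`.
k8s_container_logs() { talosctl -n "$1" logs -k "$2"; }
```

For example, `service_logs 10.0.0.2 kubelet` prints the kubelet log of that node.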
Service Logs
View logs for a system service (for example kubelet or etcd) with talosctl logs.
Kernel Logs
View kernel messages and system events with talosctl dmesg.
Container Logs
View logs for Kubernetes containers with talosctl logs -k.
Node Issues
Node Not Joining Cluster
Symptoms: New node doesn’t appear in kubectl get nodes
Diagnosis:
- Network connectivity: Verify network configuration
- Incorrect configuration: Verify control plane endpoint
- Certificate issues: Check if certificates are valid
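The three checks above can be sketched as follows. `check_node_join` is a hypothetical helper and the node IP is a placeholder; the sketch assumes the new node is reachable through the Talos API. Note that the machine configuration output contains secrets, so avoid pasting it into public channels.

```shell
# Hypothetical helper: run the join checks against a new node.
# $1 is the IP of the node that fails to join (placeholder).
check_node_join() {
  node="$1"
  # Network connectivity: assigned addresses and routes
  talosctl -n "$node" get addresses
  talosctl -n "$node" get routes
  # Control plane endpoint: look for cluster.controlPlane.endpoint
  talosctl -n "$node" get machineconfig -o yaml
  # Certificates: kubelet logs usually surface TLS and bootstrap errors
  talosctl -n "$node" logs kubelet
}
```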
Node NotReady Status
Symptoms: kubectl get nodes shows the node as NotReady
Diagnosis:
- Restart kubelet
- Check disk pressure
- Check memory pressure
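A minimal sketch of these steps, assuming both the node's IP (for `talosctl`) and its Kubernetes node name (for `kubectl`) are known. The function names are hypothetical. The inspection is kept separate from the restart so that the node's conditions can be read before anything changes.

```shell
# Hypothetical helpers; $1 is the node IP, $2 the Kubernetes node name.

# Show node conditions (DiskPressure, MemoryPressure), mounts and memory.
inspect_notready_node() {
  kubectl describe node "$2"
  talosctl -n "$1" mounts
  talosctl -n "$1" memory
}

# Restart the kubelet service on the node.
restart_kubelet() {
  talosctl -n "$1" service kubelet restart
}
```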
High Resource Usage
Check CPU and memory usage on the node, for example with talosctl memory and talosctl processes.
etcd Issues
etcd Member Unhealthy
Diagnosis:
- NOSPACE alarm: Database full
- Network partitions: Check connectivity
- Quorum lost: Restore from backup (see disaster recovery)
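The first two cases can be inspected with the `talosctl etcd` subcommands. This is a sketch under the assumption that the target is a control plane node that still answers on the Talos API; the function names are hypothetical. Disarming the alarm only helps once space has been freed, for example by defragmenting.

```shell
# Hypothetical helpers; $1 is a control plane node IP (placeholder).

# List etcd members and any active alarms (such as NOSPACE).
check_etcd_health() {
  talosctl -n "$1" etcd members
  talosctl -n "$1" etcd alarm list
}

# Clear alarms after the underlying cause has been fixed.
disarm_etcd_alarms() {
  talosctl -n "$1" etcd alarm disarm
}
```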
etcd Performance Issues
Symptoms: Slow API responses, increased latency
Diagnosis:
- Defragment database
- Check for large database
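A sketch of both steps, assuming a list of control plane node IPs is passed in. `defrag_etcd` is a hypothetical helper. It handles one node at a time because defragmentation is resource-intensive, and it prints the status before and after so the database size can be compared.

```shell
# Hypothetical helper: defragment etcd on each given control plane node,
# one node at a time. Arguments are node IPs (placeholders).
defrag_etcd() {
  for node in "$@"; do
    echo "== $node =="
    talosctl -n "$node" etcd status   # database size before
    talosctl -n "$node" etcd defrag
    talosctl -n "$node" etcd status   # database size after
  done
}
```

Usage: `defrag_etcd 10.0.0.2 10.0.0.3 10.0.0.4`.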
Kubernetes Control Plane Issues
API Server Not Responding
Diagnosis:
- Restart static pod (by removing manifests temporarily)
- Check certificates
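Before restarting anything, it usually helps to see whether the control plane containers are running at all. The sketch below only inspects state and does not cover the manifest removal or certificate checks listed above. `inspect_control_plane` is a hypothetical helper, and the `staticpods` resource name is an assumption based on recent Talos releases.

```shell
# Hypothetical helper; $1 is a control plane node IP (placeholder).
inspect_control_plane() {
  node="$1"
  # Static pod definitions rendered by Talos (resource name is an assumption)
  talosctl -n "$node" get staticpods
  # Running Kubernetes containers, including kube-apiserver
  talosctl -n "$node" containers -k
  # kubelet is responsible for starting the static pods
  talosctl -n "$node" logs kubelet
}
```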
Scheduler or Controller Manager Issues
Check the status of the kube-scheduler and kube-controller-manager pods in the kube-system namespace.
Networking Issues
Pods Can’t Communicate
Diagnosis:
- Restart CNI pods
- Check routing
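A sketch of both steps. The function names are hypothetical, and the default DaemonSet name `kube-flannel` is an assumption that only holds for a Flannel-based setup; pass the real DaemonSet name for another CNI.

```shell
# Hypothetical helpers.

# Restart the CNI DaemonSet in kube-system.
# $1 is the DaemonSet name (assumed default: kube-flannel).
restart_cni() {
  cni_ds="${1:-kube-flannel}"
  kubectl -n kube-system rollout restart daemonset "$cni_ds"
  kubectl -n kube-system get pods -o wide
}

# Show addresses and routes on a node; $1 is the node IP (placeholder).
show_routing() {
  talosctl -n "$1" get addresses
  talosctl -n "$1" get routes
}
```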
DNS Resolution Issues
Diagnosis:
- Restart CoreDNS
- Check CoreDNS configuration
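Both steps map onto standard `kubectl` commands. The sketch assumes CoreDNS runs as the `coredns` Deployment in `kube-system` with its configuration in the `coredns` ConfigMap, which is the usual layout; the function names are hypothetical.

```shell
# Hypothetical helpers; assume the usual coredns Deployment and ConfigMap.

# Restart CoreDNS and list its pods.
restart_coredns() {
  kubectl -n kube-system rollout restart deployment coredns
  kubectl -n kube-system get pods -l k8s-app=kube-dns
}

# Print the Corefile stored in the coredns ConfigMap.
show_coredns_config() {
  kubectl -n kube-system get configmap coredns -o yaml
}
```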
Configuration Issues
Invalid Configuration Applied
Symptoms: Node becomes unreachable after a configuration change
Recovery:
- Wait for try-mode timeout (if using try mode): the configuration automatically rolls back after the timeout
- Access via maintenance mode
- Use staged mode for safer changes
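The three recovery paths correspond to different ways of calling `talosctl apply-config`. This is a sketch with hypothetical function names; the node IP, config file path and the two-minute timeout are placeholders.

```shell
# Hypothetical helpers; $1 is the node IP, $2 the machine config file.

# Try mode: the change is rolled back automatically after the timeout.
apply_try() {
  talosctl -n "$1" apply-config -f "$2" --mode=try --timeout=2m
}

# Staged mode: the change is stored and applied on the next reboot.
apply_staged() {
  talosctl -n "$1" apply-config -f "$2" --mode=staged
}

# Maintenance mode: push a config to a node without client certificates.
apply_maintenance() {
  talosctl -n "$1" apply-config -f "$2" --insecure
}
```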
Certificate Errors
Symptoms: “certificate has expired” or “certificate signed by unknown authority”
Solutions:
- Regenerate certificates
- Use insecure mode for recovery
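A sketch of one possible approach, assuming the cluster secrets bundle was saved when the cluster was created. The file name `secrets.yaml`, the cluster name and endpoint arguments, and the function names are all placeholders or hypothetical. The insecure call only works while the node is in maintenance mode.

```shell
# Hypothetical helpers.

# Regenerate a client talosconfig (with a fresh client certificate) from
# the saved secrets bundle. $1 = cluster name, $2 = cluster endpoint URL.
regen_talosconfig() {
  talosctl gen config "$1" "$2" \
    --with-secrets secrets.yaml \
    --output-types talosconfig \
    --output talosconfig
}

# Query a node in maintenance mode without client certificates.
# $1 is the node IP (placeholder).
probe_insecure() {
  talosctl -n "$1" version --insecure
}
```

An expired-looking certificate can also come from clock skew, so comparing the node clock with `talosctl time` is a cheap first check.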
Diagnostic Commands Reference
System Information
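A minimal sketch for this category; `show_system_info` is a hypothetical helper and the node IP is a placeholder.

```shell
# Hypothetical helper; $1 is a node IP (placeholder).
show_system_info() {
  talosctl -n "$1" version       # client and server versions
  talosctl -n "$1" get members   # cluster members discovered by Talos
  talosctl -n "$1" mounts        # mounted filesystems and usage
}
```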
Service Status
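A minimal sketch for this category; `show_service_status` is a hypothetical helper, and the service name defaults to kubelet purely as an example.

```shell
# Hypothetical helper; $1 is a node IP, $2 an optional service name.
show_service_status() {
  svc="${2:-kubelet}"
  talosctl -n "$1" services               # all services and their health
  talosctl -n "$1" service "$svc" status  # details for one service
}
```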
Container Information
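A minimal sketch for this category; `show_containers` is a hypothetical helper and the node IP is a placeholder.

```shell
# Hypothetical helper; $1 is a node IP (placeholder).
show_containers() {
  talosctl -n "$1" containers      # Talos system containers
  talosctl -n "$1" containers -k   # Kubernetes containers
}
```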
Resource Monitoring
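A minimal sketch for this category; `show_resources` is a hypothetical helper and the node IP is a placeholder.

```shell
# Hypothetical helper; $1 is a node IP (placeholder).
show_resources() {
  talosctl -n "$1" memory      # memory usage summary
  talosctl -n "$1" processes   # running processes with CPU and memory
  talosctl -n "$1" stats -k    # per-container stats for Kubernetes
}
```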
Getting Support
When seeking help, collect diagnostic information:
Generate Support Bundle
The support bundle includes:
- Service logs
- System information
- Configuration (sanitized)
- Resource states
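A bundle like this can be produced with `talosctl support`. The sketch below uses a hypothetical helper name; the node list and the output file name `support.zip` are placeholders. Review the archive before sharing it.

```shell
# Hypothetical helper: collect a support bundle.
# $1 is a comma-separated list of node IPs (placeholder),
# $2 an optional output file name.
collect_support_bundle() {
  out_file="${2:-support.zip}"
  talosctl -n "$1" support --output "$out_file"
}
```

Usage: `collect_support_bundle 10.0.0.2,10.0.0.5`.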
Community Support
- GitHub Issues: Report bugs at https://github.com/siderolabs/talos/issues
- Slack: Join the Talos community Slack for questions
- Documentation: https://www.talos.dev/docs/
When asking for help, include:
- Talos version (talosctl version)
- Kubernetes version
- Infrastructure (cloud provider, bare metal, etc.)
- Steps to reproduce
- Relevant logs and error messages