Overview

This guide provides systematic troubleshooting approaches for common vCluster issues. Use the diagnostic commands and solutions to quickly identify and resolve problems.

General Troubleshooting Approach

1. Identify the Issue

Clearly define what’s not working:
  • What were you trying to do?
  • What happened instead?
  • When did it start?
  • Has it ever worked?
2. Gather Information

Collect diagnostic data:
vcluster debug collect my-vcluster --namespace production
3. Check Basics

Verify fundamental components:
vcluster list
kubectl get pods -n production -l release=my-vcluster
kubectl logs -n production -l app=vcluster,release=my-vcluster --tail=100
4. Review Changes

What changed recently?
  • Configuration updates
  • Version upgrades
  • Infrastructure changes
  • New deployments
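The checks above can be scripted; a minimal sketch, assuming a Helm-based install with release name `my-vcluster` in namespace `production` (adjust both to your environment):

```shell
# Sketch: surface recent changes around a vCluster.
# Assumptions: Helm release "my-vcluster" in namespace "production".
review_changes() {
  local ns="${1:-production}" release="${2:-my-vcluster}"
  helm history "$release" -n "$ns"                          # chart/config revisions
  kubectl rollout history statefulset "$release" -n "$ns"   # workload revisions
  kubectl get events -n "$ns" --sort-by='.lastTimestamp' | tail -20
}
# Usage (requires cluster access):
# review_changes production my-vcluster
```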
5. Isolate the Problem

Test components individually:
  • Host cluster connectivity
  • Virtual cluster API server
  • Resource syncing
  • Network policies
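One way to test these components in order is a layered check that stops at the first failing layer; a sketch, assuming a vCluster named `my-vcluster` in namespace `production`:

```shell
# Sketch: test each layer in order and stop at the first failure.
# Assumptions: vCluster "my-vcluster" in namespace "production".
isolate_problem() {
  local ns="${1:-production}" name="${2:-my-vcluster}"
  kubectl cluster-info || return 1                          # 1. host cluster reachable?
  kubectl get pods -n "$ns" -l "release=$name" || return 1  # 2. control plane running?
  vcluster connect "$name" -n "$ns" -- \
    kubectl get --raw /healthz || return 1                  # 3. virtual API server healthy?
  kubectl get networkpolicy -n "$ns"                        # 4. policies that may block traffic
}
# isolate_problem production my-vcluster
```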
6. Apply Solution

Try fixes from most to least invasive:
  1. Configuration changes
  2. Pod restarts
  3. Resource recreation
  4. Full restoration from backup
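The escalation ladder above, sketched as shell helpers (the namespace, release name, and backup location are assumptions; run one step at a time and re-test after each):

```shell
# Escalation helpers, least to most invasive.
# Assumptions: vCluster "my-vcluster" in namespace "production";
# the backup location is hypothetical.
ns=production name=my-vcluster

# 1. Configuration change: edit vcluster.yaml, then re-deploy it
apply_config()   { vcluster create "$name" -n "$ns" --upgrade -f vcluster.yaml; }

# 2. Pod restart
restart_pods()   { kubectl rollout restart statefulset "$name" -n "$ns"; }

# 3. Resource recreation (the StatefulSet recreates the pod)
recreate_pod()   { kubectl delete pod "${name}-0" -n "$ns"; }

# 4. Full restoration from backup
restore_backup() { vcluster restore "$name" oci://ghcr.io/my-org/backups:latest -n "$ns"; }
```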

Common Issues and Solutions

Connection and Access Issues

Cannot connect to the vCluster

Symptoms:
  • vcluster connect hangs or times out
  • Connection refused errors
  • Authentication failures
Diagnosis:
# Check if vCluster pods are running
kubectl get pods -n production -l release=my-vcluster

# Check service endpoints
kubectl get svc -n production -l release=my-vcluster
kubectl get endpoints -n production

# Check port-forward manually
kubectl port-forward -n production svc/my-vcluster 8443:443
Solutions:
  1. Restart vCluster pods:
    kubectl rollout restart statefulset -n production my-vcluster
    
  2. Check service account permissions:
    kubectl get sa -n production
    kubectl describe sa vc-my-vcluster -n production
    
  3. Verify network policies:
    kubectl get networkpolicy -n production
    kubectl describe networkpolicy -n production
    
  4. Check for resource constraints:
    kubectl describe pod -n production -l release=my-vcluster
    kubectl top pods -n production
    
  5. Try reconnecting with verbose output:
    vcluster connect my-vcluster --namespace production --debug
    
Timeouts and intermittent connectivity

Symptoms:
  • Commands hang indefinitely
  • Timeouts after several seconds
  • Intermittent connectivity
Diagnosis:
# Test API server health
kubectl get --raw /healthz
kubectl get --raw /readyz

# Check API server logs
kubectl logs -n production -l app=vcluster,release=my-vcluster | grep -i error

# Test specific operations
time kubectl get nodes
time kubectl get pods
Solutions:
  1. Increase timeout:
    kubectl get pods --request-timeout=60s
    
  2. Check etcd performance:
    # Access vCluster pod
    kubectl exec -it -n production my-vcluster-0 -- sh
    
    # Inside pod, check etcd
    ETCDCTL_API=3 etcdctl --endpoints=https://localhost:2379 \
      --cert=/pki/etcd/tls.crt \
      --key=/pki/etcd/tls.key \
      --cacert=/pki/etcd/ca.crt \
      endpoint health
    
  3. Reduce cluster load:
    • Scale down non-critical workloads
    • Check for resource-intensive operations
    • Review API server logs for high-frequency requests
  4. Increase API server resources:
    # vcluster.yaml
    controlPlane:
      statefulSet:
        resources:
          requests:
            cpu: 200m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 2Gi
    
Authentication and authorization errors

Symptoms:
  • “Unauthorized” or “Forbidden” errors
  • Permission denied messages
  • RBAC violations
Diagnosis:
# Check current user
kubectl auth whoami

# Test permissions
kubectl auth can-i get pods
kubectl auth can-i create deployments
kubectl auth can-i '*' '*' --all-namespaces

# View role bindings
kubectl get rolebindings,clusterrolebindings -o wide
Solutions:
  1. Verify kubeconfig context:
    kubectl config current-context
    kubectl config view
    
  2. Reconnect to vCluster:
    vcluster disconnect
    vcluster connect my-vcluster --namespace production
    
  3. Check certificate validity:
    # View certificate details
    kubectl config view --raw -o jsonpath='{.users[0].user.client-certificate-data}' | \
      base64 -d | openssl x509 -text -noout
    
  4. Grant necessary permissions:
    # Create role binding
    kubectl create rolebinding dev-admin \
      --clusterrole=admin \
      --user=<user-email> \
      --namespace=default
    

Resource Syncing Issues

Resources not syncing to the host

Symptoms:
  • Resources created in vCluster don’t appear in host namespace
  • Pods stay pending indefinitely
  • Services not accessible from host
Diagnosis:
# Check syncer logs
kubectl logs -n production -l app=vcluster,release=my-vcluster -c syncer

# Compare resources
vcluster connect my-vcluster --namespace production
kubectl get pods -o wide
vcluster disconnect
kubectl get pods -n production

# Check sync configuration
vcluster describe my-vcluster --config-only
Solutions:
  1. Verify sync configuration:
    # vcluster.yaml
    sync:
      toHost:
        pods:
          enabled: true
        services:
          enabled: true
        persistentVolumeClaims:
          enabled: true
    
  2. Restart syncer:
    kubectl delete pod -n production -l app=vcluster,release=my-vcluster
    
  3. Check service account permissions:
    kubectl auth can-i create pods \
      --as=system:serviceaccount:production:vc-my-vcluster \
      -n production
    
  4. Review resource quotas:
    kubectl get resourcequota -n production
    kubectl describe resourcequota -n production
    
  5. Check for naming conflicts:
    # Synced resources are renamed in the host, typically <name>-x-<namespace>-x-<vcluster>
    kubectl get pods -n production -o name | grep -- '-x-'
    
Pods stuck in Pending

Symptoms:
  • Pods created but never start
  • Status remains “Pending”
  • Containers not running
Diagnosis:
# Check pod status and events
kubectl describe pod <pod-name>
kubectl get events --sort-by='.lastTimestamp'

# Check node resources
kubectl top nodes
kubectl describe nodes

# Check PVC status
kubectl get pvc
kubectl describe pvc <pvc-name>
Solutions:
  1. Insufficient resources:
    # Scale down other workloads or add nodes
    kubectl scale deployment other-app --replicas=0
    
  2. PVC not bound:
    # Check storage class
    kubectl get storageclass
    
    # Create PVC manually if needed
    kubectl apply -f pvc.yaml
    
  3. Image pull failures:
    # Check image pull secrets
    kubectl get secrets
    kubectl describe pod <pod-name> | grep -A 10 "Events"
    
    # Test image accessibility by running a short-lived pod
    # (--dry-run=client only validates the spec; it never pulls the image)
    kubectl run test-pull --image=<image> --restart=Never --command -- true
    kubectl describe pod test-pull | grep -A 5 "Events"
    
  4. Node selectors/taints:
    # Check node selectors and taints
    kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
    
    # Remove taint if needed
    kubectl taint nodes <node-name> key:NoSchedule-
    
Services not reachable

Symptoms:
  • Services created but not reachable
  • Connection refused or timeout
  • DNS resolution failures
Diagnosis:
# Check service and endpoints
kubectl get svc,endpoints
kubectl describe svc <service-name>

# Test DNS resolution
kubectl run test-dns --image=busybox --rm -it -- nslookup <service-name>

# Check network policies
kubectl get networkpolicy
kubectl describe networkpolicy
Solutions:
  1. Verify service sync:
    # vcluster.yaml
    sync:
      toHost:
        services:
          enabled: true
    
  2. Check endpoints:
    kubectl get endpoints <service-name>
    # Should show pod IPs
    
  3. Test connectivity:
    # From within cluster
    kubectl run test-curl --image=curlimages/curl --rm -it -- \
      curl http://<service-name>:<port>
    
  4. Review network policies:
    # Back up policies before temporarily removing them to test
    kubectl get networkpolicy -o yaml > netpol-backup.yaml
    kubectl delete networkpolicy --all
    # Test connectivity, then restore:
    kubectl apply -f netpol-backup.yaml
    

Performance Issues

High resource usage

Symptoms:
  • Slow response times
  • OOMKilled pods
  • Throttling warnings
Diagnosis:
# Check current usage
kubectl top pods -n production -l release=my-vcluster
kubectl top nodes

# Get resource limits
kubectl get pod -n production -l release=my-vcluster \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}'

# Check for memory leaks
kubectl exec -it -n production my-vcluster-0 -- top -b -n 1
Solutions:
  1. Increase resource limits:
    # vcluster.yaml
    controlPlane:
      statefulSet:
        resources:
          limits:
            cpu: 2000m
            memory: 4Gi
          requests:
            cpu: 500m
            memory: 1Gi
    
  2. Enable resource limiting in virtual cluster:
    policies:
      resourceQuota:
        enabled: true
        quota:
          requests.cpu: "10"
          requests.memory: 20Gi
          limits.cpu: "20"
          limits.memory: 40Gi
    
  3. Optimize workload placement:
    # Use node affinity to spread load
    kubectl patch deployment app \
      -p '{"spec":{"template":{"spec":{"affinity":{"podAntiAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"app","operator":"In","values":["app"]}]},"topologyKey":"kubernetes.io/hostname"}}]}}}}}}'
    
  4. Profile and optimize:
    # Enable profiling
    kubectl port-forward -n production my-vcluster-0 6060:6060
    # Access http://localhost:6060/debug/pprof/
    
Slow API responses

Symptoms:
  • kubectl commands take a long time to complete
  • API timeouts
  • Unresponsive control plane
Diagnosis:
# Measure API latency
time kubectl get nodes
time kubectl get pods --all-namespaces

# Check API server metrics
kubectl get --raw /metrics | grep apiserver_request_duration

# Check etcd performance
kubectl logs -n production my-vcluster-0 | grep -i "etcd"
Solutions:
  1. Increase API server resources (see above)
  2. Optimize etcd:
    # vcluster.yaml
    controlPlane:
      backingStore:
        etcd:
          embedded:
            enabled: true
            migrateFromDeployedEtcd: true
        # Or use an external database instead of etcd:
        # database:
        #   external:
        #     enabled: true
        #     dataSource: postgres://...
    
  3. Reduce API load:
    # Find high-frequency API callers
    kubectl logs -n production my-vcluster-0 | \
      grep "requestInfo" | \
      awk '{print $NF}' | sort | uniq -c | sort -rn | head -20
    
  4. Enable API priority and fairness (on by default in recent Kubernetes; the extraArgs path may vary by vCluster version):
    controlPlane:
      distro:
        k8s:
          apiServer:
            extraArgs:
            - --enable-priority-and-fairness=true
    

Stability Issues

Pods crash looping

Symptoms:
  • Pods restarting repeatedly
  • CrashLoopBackOff status
  • High restart count
Diagnosis:
# Check restart count
kubectl get pods -n production -l release=my-vcluster

# View crash logs
kubectl logs -n production my-vcluster-0 --previous

# Check for OOM kills
kubectl describe pod -n production my-vcluster-0 | grep -A 5 "Last State"

# Review events
kubectl get events -n production --sort-by='.lastTimestamp' | grep my-vcluster
Solutions:
  1. OOM kills - increase memory:
    controlPlane:
      statefulSet:
        resources:
          limits:
            memory: 4Gi
    
  2. Liveness probe too aggressive:
    controlPlane:
      statefulSet:
        probes:
          livenessProbe:
            initialDelaySeconds: 60
            periodSeconds: 20
            failureThreshold: 5
    
  3. Application errors - check logs:
    kubectl logs -n production my-vcluster-0 --previous | tail -100
    
  4. Resource contention:
    # Check node pressure
    kubectl describe nodes | grep -A 5 "Conditions"
    
Data loss and state inconsistencies

Symptoms:
  • Resources disappearing
  • Configuration resets
  • State inconsistencies
Diagnosis:
# Check PVC status
kubectl get pvc -n production
kubectl describe pvc -n production

# Verify backup storage
kubectl logs -n production my-vcluster-0 | grep -i "backup\|snapshot\|etcd"

# Check for volume mount issues
kubectl describe pod -n production my-vcluster-0 | grep -A 10 "Volumes"
Solutions:
  1. Restore from backup:
    vcluster restore my-vcluster \
      oci://ghcr.io/my-org/backups:latest \
      --namespace production
    
  2. Fix PVC issues:
    # Check storage class
    kubectl get storageclass
    kubectl describe storageclass
    
    # Recreate PVC if corrupted
    kubectl delete pvc data-my-vcluster-0 -n production
    kubectl rollout restart statefulset my-vcluster -n production
    
  3. Enable persistent storage:
    controlPlane:
      backingStore:
        etcd:
          embedded:
            enabled: true
      statefulSet:
        persistence:
          volumeClaim:
            enabled: true
            size: 10Gi
            storageClass: fast-ssd
    

Advanced Debugging

Enable Debug Logging

# vcluster.yaml
controlPlane:
  statefulSet:
    env:
    - name: DEBUG
      value: "true"
    - name: LOG_LEVEL
      value: "debug"

Interactive Debugging Shell

# Shell into control plane pod
vcluster debug shell my-vcluster --namespace production

# Or use kubectl directly
kubectl exec -it -n production my-vcluster-0 -- /bin/sh

Network Debugging

# Deploy debug pod
kubectl run debug --image=nicolaka/netshoot -it --rm -- /bin/bash

# Inside debug pod:
ping <service-name>
nslookup <service-name>
curl http://<service-name>:<port>
traceroute <service-name>

Collect Comprehensive Debug Info

# Generate debug bundle
vcluster debug collect my-vcluster \
  --namespace production \
  --output-filename debug-$(date +%Y%m%d-%H%M%S).tar.gz

# Extract and review
tar -xzf debug-*.tar.gz
cd debug/
ls -R

Getting Help

If you’re still experiencing issues:

GitHub Issues

Search or create an issue: github.com/loft-sh/vcluster/issues

Slack Community

Join the community: vcluster.com/slack

Documentation

Browse docs: vcluster.com/docs

Support

Enterprise support: Contact your account team

When Reporting Issues

Include:
  1. Environment details:
    • vCluster version
    • Kubernetes version (host and virtual)
    • Cloud provider/platform
    • Installation method (Helm, Platform, etc.)
  2. Reproduction steps:
    • What you did
    • What you expected
    • What actually happened
  3. Debug information:
    • Output of vcluster debug collect
    • Relevant logs
    • Configuration files (sanitized)
    • Error messages
  4. Attempted solutions:
    • What you’ve tried
    • Results of each attempt
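A sketch that gathers the environment details and debug information requested above into one place (the output filename, namespace, and vCluster name are assumptions):

```shell
# Sketch: collect the details requested in a bug report.
# Assumptions: vCluster "my-vcluster" in namespace "production".
report_info() {
  local ns="${1:-production}" name="${2:-my-vcluster}" out="report-info.txt"
  {
    vcluster version     # vCluster CLI version
    kubectl version      # host (and, when connected, virtual) Kubernetes version
    vcluster list        # installed virtual clusters
  } > "$out" 2>&1
  vcluster debug collect "$name" --namespace "$ns"   # full debug bundle
  echo "Attach $out and the debug bundle to your report"
}
# report_info production my-vcluster
```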

Next Steps

Monitoring

Set up monitoring to catch issues early

Managing vClusters

Return to general management operations
