Severity: P1
Estimated time: 15-25 minutes
Use this when one replica’s PVC is corrupted and that replica must be rebuilt from the current primary.
Symptoms
- A single replica pod repeatedly crashes or fails readiness.
- Pod events/logs show storage or filesystem errors.
status.instancesStatus shows one replica as disconnected/stale.
Prerequisites
The corrupted instance MUST be a replica (not the current primary).
- At least one healthy primary is available.
- Shell variables:
export NS=<rediscluster-namespace>
export CLUSTER=<rediscluster-name>
Diagnosis
Identify current primary
kubectl get rediscluster "$CLUSTER" -n "$NS" -o jsonpath='{.status.currentPrimary}{"\n"}'
Inspect per-instance status
kubectl get rediscluster "$CLUSTER" -n "$NS" -o go-template='{{range $name,$s := .status.instancesStatus}}{{printf "%s role=%s connected=%t replOffset=%d masterLink=%s\n" $name $s.role $s.connected $s.replicationOffset $s.masterLinkStatus}}{{end}}'
Inspect suspected broken replica pod
export BROKEN_REPLICA=<replica-pod-name>
kubectl describe pod "$BROKEN_REPLICA" -n "$NS"
kubectl logs "$BROKEN_REPLICA" -n "$NS" --tail=200
If BROKEN_REPLICA equals status.currentPrimary, stop and first run Manual Failover.
Recovery Steps
Derive PVC name for the broken replica
export INDEX="${BROKEN_REPLICA##*-}"
export BROKEN_PVC="${CLUSTER}-data-${INDEX}"
Delete the broken replica pod first
kubectl delete pod "$BROKEN_REPLICA" -n "$NS" --wait=true
Delete only the corrupted PVC
This will permanently delete data on this PVC. Ensure you have identified the correct corrupted PVC.
kubectl delete pvc "$BROKEN_PVC" -n "$NS" --wait=true
Trigger reconciliation so the operator recreates PVC and pod immediately
kubectl annotate rediscluster "$CLUSTER" -n "$NS" \
runbooks.redis.io/rebuild-replica-ts="$(date +%s)" --overwrite
Watch for PVC and pod recreation
kubectl get pvc -n "$NS" -l redis.io/cluster="$CLUSTER" -w
kubectl get pods -n "$NS" -l redis.io/cluster="$CLUSTER",redis.io/workload=data -w
Verification
kubectl get pod "$BROKEN_REPLICA" -n "$NS"
kubectl get rediscluster "$CLUSTER" -n "$NS" -o go-template='{{range $name,$s := .status.instancesStatus}}{{printf "%s role=%s connected=%t masterLink=%s\n" $name $s.role $s.connected $s.masterLinkStatus}}{{end}}'
kubectl get rediscluster "$CLUSTER" -n "$NS" -o jsonpath='{.status.phase}{"\n"}'
Expected:
- New PVC exists for the rebuilt replica.
- Recreated pod becomes ready.
- Rebuilt instance reports
role=slave and masterLinkStatus=up.
- Cluster returns to
Healthy.
Escalation
- If rebuilt replica does not catch up, inspect primary health and network path between pods.
- If multiple PVCs are corrupted, treat as broader storage incident and escalate to platform/storage team.