Stuck Reconciler Debugging

Severity: P1
Estimated time: 10-15 minutes

Use this when a RedisCluster is not converging and reconcile loops are not making visible progress.

Symptoms

- status.phase remains unchanged (for example Degraded) for multiple reconcile intervals.
- Expected actions (pod recreate, service selector update, PVC creation) are not happening.
- Cluster events show repeated failures or no recent reconcile activity.

Prerequisites

kubectl access to the cluster and operator namespace.
Shell variables:

export NS=<rediscluster-namespace>
export CLUSTER=<rediscluster-name>

Diagnosis

Check cluster status snapshot

kubectl get rediscluster "$CLUSTER" -n "$NS" -o yaml

Check recent events for the RedisCluster

kubectl get events -n "$NS" \
  --field-selector involvedObject.kind=RedisCluster,involvedObject.name="$CLUSTER" \
  --sort-by=.lastTimestamp

Locate operator deployment and namespace

kubectl get deploy -A -l app.kubernetes.io/name=redis-operator

Set the operator variables from the namespace and name shown in that output:

export OP_NS=<operator-namespace>
export OP_DEPLOY=<operator-deployment-name>

Check leader election lease and operator logs

kubectl get lease redis-operator-leader -n "$OP_NS" -o yaml
kubectl logs -n "$OP_NS" deploy/"$OP_DEPLOY" --tail=300
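
To judge whether the lease is stale, compare its spec.renewTime with the current time. The helper below is a minimal sketch rather than part of the operator: lease_age_seconds is a hypothetical name, it assumes GNU date (date -d), and it reuses the lease name redis-operator-leader from the command above.

```shell
# Print how many seconds have passed since an RFC 3339 timestamp.
# Assumption: GNU date is available (on macOS/BSD, use gdate from coreutils).
lease_age_seconds() {
  # Drop fractional seconds (Lease renewTime carries microseconds).
  renew_time="$(printf '%s' "$1" | sed 's/\.[0-9]*Z$/Z/')"
  # Optional second argument overrides "now" (epoch seconds), useful for testing.
  now_epoch="${2:-$(date -u +%s)}"
  renew_epoch="$(date -u -d "$renew_time" +%s)" || return 1
  echo $(( now_epoch - renew_epoch ))
}

# Usage (requires kubectl access):
#   renew="$(kubectl get lease redis-operator-leader -n "$OP_NS" \
#     -o jsonpath='{.spec.renewTime}')"
#   lease_age_seconds "$renew"
```

An age well beyond the lease's spec.leaseDurationSeconds suggests that no operator replica is currently renewing the lease; spec.holderIdentity shows which replica last held it.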

Recovery Steps

Force re-enqueue by touching an annotation on the RedisCluster

kubectl annotate rediscluster "$CLUSTER" -n "$NS" \
  runbooks.redis.io/force-reconcile-ts="$(date +%s)" --overwrite

Watch events and object status for one or two reconcile intervals (~30-60s)

kubectl get rediscluster "$CLUSTER" -n "$NS" -w
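
For an unattended check, the interactive watch can be replaced by a bounded poll. This is a sketch under assumptions: wait_for_phase_change is a hypothetical helper, the default timeout and interval are illustrative, and a changed status.phase is used as a rough proxy for reconcile progress.

```shell
# Poll .status.phase until it differs from its starting value or the timeout expires.
# Returns 0 if the phase changed, 1 on timeout. Uses $CLUSTER and $NS from Prerequisites.
wait_for_phase_change() {
  timeout_s="${1:-60}"   # total seconds to wait (illustrative default)
  interval_s="${2:-5}"   # seconds between polls (illustrative default)
  start_phase="$(kubectl get rediscluster "$CLUSTER" -n "$NS" -o jsonpath='{.status.phase}')"
  elapsed=0
  while [ "$elapsed" -lt "$timeout_s" ]; do
    sleep "$interval_s"
    elapsed=$(( elapsed + interval_s ))
    phase="$(kubectl get rediscluster "$CLUSTER" -n "$NS" -o jsonpath='{.status.phase}')"
    if [ "$phase" != "$start_phase" ]; then
      echo "phase changed: $start_phase -> $phase after ${elapsed}s"
      return 0
    fi
  done
  echo "phase still $start_phase after ${timeout_s}s"
  return 1
}

# Usage: wait_for_phase_change 60 5 || echo "still stuck, consider restarting the operator"
```

A non-zero exit only means the phase did not change within the window; other status fields or events may still show progress, so check them before restarting the operator.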

If still stuck, restart operator deployment

kubectl rollout restart deploy/"$OP_DEPLOY" -n "$OP_NS"
kubectl rollout status deploy/"$OP_DEPLOY" -n "$OP_NS"

Re-check operator logs for new reconcile attempts

kubectl logs -n "$OP_NS" deploy/"$OP_DEPLOY" --tail=300

Verification

kubectl get rediscluster "$CLUSTER" -n "$NS" -o jsonpath='{.status.phase}{"\n"}'
kubectl get rediscluster "$CLUSTER" -n "$NS" -o jsonpath='{.status.readyInstances}{"\n"}'
kubectl get events -n "$NS" \
  --field-selector involvedObject.kind=RedisCluster,involvedObject.name="$CLUSTER" \
  --sort-by=.lastTimestamp

Expected:

- Reconcile activity resumes in logs/events.
- Status fields begin changing again.
- Cluster trends back to Healthy.

Escalation

If reconcile restarts but repeatedly fails with the same error, escalate with:
- RedisCluster YAML
- recent events
- operator logs around the failure
If the leader election lease is not being renewed, investigate controller-manager health and RBAC before retrying.
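
The escalation artifacts listed above can be gathered in one pass. The function below is a sketch, not an official tool: collect_escalation_bundle and the output file names are hypothetical, and it reuses the commands and variables from the earlier sections, adding the lease YAML for the leader election case.

```shell
# Write the RedisCluster YAML, recent events, operator logs, and lease into one directory.
# Uses $CLUSTER, $NS, $OP_NS, and $OP_DEPLOY as set earlier in this runbook.
collect_escalation_bundle() {
  # Default directory name is timestamped so repeated runs do not overwrite each other.
  out_dir="${1:-stuck-reconciler-$(date -u +%Y%m%dT%H%M%SZ)}"
  mkdir -p "$out_dir" || return 1

  kubectl get rediscluster "$CLUSTER" -n "$NS" -o yaml > "$out_dir/rediscluster.yaml"

  kubectl get events -n "$NS" \
    --field-selector involvedObject.kind=RedisCluster,involvedObject.name="$CLUSTER" \
    --sort-by=.lastTimestamp > "$out_dir/events.txt"

  # --timestamps makes it easier to line up log lines with events.
  kubectl logs -n "$OP_NS" deploy/"$OP_DEPLOY" --tail=300 --timestamps > "$out_dir/operator.log"

  kubectl get lease redis-operator-leader -n "$OP_NS" -o yaml > "$out_dir/lease.yaml"

  # Print the directory so the caller can archive or attach it.
  echo "$out_dir"
}

# Usage: bundle="$(collect_escalation_bundle)" && tar czf "$bundle.tgz" "$bundle"
```

Review the bundle for secrets or customer data before attaching it to a ticket.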
