Manual Failover

Severity: P1
Estimated time: 10-20 minutes Use this when the operator is unavailable or not making failover progress, but Redis data pods are still running.

Symptoms

Current primary pod is unreachable or unhealthy.
RedisCluster remains degraded and does not fail over automatically.
Operator deployment is down, crashlooping, or otherwise unavailable.

Prerequisites

Requires cluster-admin or equivalent permissions to patch RedisCluster (including status) and Services.

A healthy replica pod that can be promoted.
Shell variables:

export NS=<rediscluster-namespace>
export CLUSTER=<rediscluster-name>

Diagnosis

Inspect cluster status and current primary

kubectl get rediscluster "$CLUSTER" -n "$NS" -o yaml
kubectl get rediscluster "$CLUSTER" -n "$NS" -o jsonpath='{.status.currentPrimary}{"\n"}'

Inspect pod health and instance statuses

kubectl get pods -n "$NS" -l redis.io/cluster="$CLUSTER",redis.io/workload=data -o wide
kubectl get rediscluster "$CLUSTER" -n "$NS" -o go-template='{{range $name,$s := .status.instancesStatus}}{{printf "%s role=%s connected=%t replOffset=%d masterLink=%s\n" $name $s.role $s.connected $s.replicationOffset $s.masterLinkStatus}}{{end}}'

Choose target pods

CURRENT_PRIMARY: failed/stale primary pod name
CANDIDATE: connected replica with the highest replication offset

Recovery Steps

Set working variables

export CURRENT_PRIMARY=<failed-primary-pod>
export CANDIDATE=<replica-to-promote>

Fence the former primary first

Always fence before promoting to prevent split-brain.

kubectl patch rediscluster "$CLUSTER" -n "$NS" --type=merge \
  -p "{\"metadata\":{\"annotations\":{\"redis.io/fencedInstances\":\"[\\\"$CURRENT_PRIMARY\\\"]\"}}}"

Promote the candidate directly through its instance-manager API

Terminal A:

kubectl port-forward -n "$NS" "pod/$CANDIDATE" 18080:8080

Terminal B:

curl -sS -X POST http://127.0.0.1:18080/v1/promote

Patch cluster status to the new primary

kubectl patch rediscluster "$CLUSTER" -n "$NS" --subresource=status --type=merge \
  -p "{\"status\":{\"currentPrimary\":\"$CANDIDATE\",\"phase\":\"FailingOver\"}}"

Patch the leader Service selector

kubectl patch service "$CLUSTER-leader" -n "$NS" --type=merge \
  -p "{\"spec\":{\"selector\":{\"redis.io/cluster\":\"$CLUSTER\",\"redis.io/instance\":\"$CANDIDATE\"}}}"

Clear fencing and force the former primary to cold-start as a replica

kubectl annotate rediscluster "$CLUSTER" -n "$NS" redis.io/fencedInstances-
kubectl delete pod "$CURRENT_PRIMARY" -n "$NS"

Verification

kubectl get rediscluster "$CLUSTER" -n "$NS" -o jsonpath='{.status.currentPrimary}{"\n"}'
kubectl get service "$CLUSTER-leader" -n "$NS" -o jsonpath='{.spec.selector.redis\.io/instance}{"\n"}'
kubectl get rediscluster "$CLUSTER" -n "$NS" -o go-template='{{range $name,$s := .status.instancesStatus}}{{printf "%s role=%s connected=%t masterLink=%s\n" $name $s.role $s.connected $s.masterLinkStatus}}{{end}}'
kubectl get rediscluster "$CLUSTER" -n "$NS" -o jsonpath='{.status.phase}{"\n"}'

Expected:

status.currentPrimary equals CANDIDATE.
-leader Service points to CANDIDATE.
Former primary reports as replica (role=slave).
Cluster returns to Healthy.

Escalation

If promotion endpoint fails or pod is unreachable, choose a different replica candidate.
If no connected replica exists, treat as data-loss/disaster recovery and escalate immediately.
If split-brain persists, follow Split-Brain Recovery.

Get Started

Core Concepts

Configuration

Operations

Runbooks

Symptoms

Prerequisites

Diagnosis

Recovery Steps

Verification

Escalation

Build docs developers (and LLMs) love

Get Started

Core Concepts

Configuration

Operations

Runbooks

​Symptoms

​Prerequisites

​Diagnosis

​Recovery Steps

​Verification

​Escalation

Build docs developers (and LLMs) love

Symptoms

Prerequisites

Diagnosis

Recovery Steps

Verification

Escalation