Severity: P0/P1
Estimated time: 15-30 minutes
Use this runbook when two or more data pods appear to be primary (role=master) at the same time.
Symptoms
- status.instancesStatus shows more than one role=master.
- Clients report inconsistent reads/writes.
- status.currentPrimary does not match the observed write leader.
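As a quick triage aid, the per-pod role summary produced by the go-template under Diagnosis can be scanned for duplicate masters. The status lines below are a hypothetical sample (pod names and offsets are illustrative, not from a real cluster):

```shell
# Hypothetical role summary, one line per pod, in the format emitted by the
# go-template under Diagnosis. More than one role=master is the split-brain signal.
status_lines='redis-data-0 role=master connected=true replOffset=1024 masterLink=down
redis-data-1 role=master connected=true replOffset=990 masterLink=down
redis-data-2 role=slave connected=true replOffset=990 masterLink=up'

# Count how many pods report role=master; anything above 1 confirms the symptom.
masters=$(printf '%s\n' "$status_lines" | grep -c 'role=master')
echo "masters=$masters"
```

In practice, pipe the live go-template output into the same `grep -c` instead of the sample variable.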
Prerequisites
- Permissions to patch RedisCluster metadata and status, and to patch Services.
- A chosen authoritative primary pod.
- Shell variables:
export NS=<rediscluster-namespace>
export CLUSTER=<rediscluster-name>
Diagnosis
Capture current status
kubectl get rediscluster "$CLUSTER" -n "$NS" -o yaml
kubectl get rediscluster "$CLUSTER" -n "$NS" -o go-template='{{range $name,$s := .status.instancesStatus}}{{printf "%s role=%s connected=%t replOffset=%d masterLink=%s\n" $name $s.role $s.connected $s.replicationOffset $s.masterLinkStatus}}{{end}}'
Identify authoritative primary
- Prefer the primary with the highest replicationOffset.
- If application owners confirm a different source of truth, use that.
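Assuming the highest-replicationOffset rule above, the candidate can be picked mechanically from the role summary lines. The sample data is hypothetical; with live output, replace the variable with the go-template result from Diagnosis:

```shell
# Hypothetical role summary lines in the Diagnosis go-template format.
status_lines='redis-data-0 role=master connected=true replOffset=1024 masterLink=down
redis-data-1 role=master connected=true replOffset=990 masterLink=down'

# Keep only masters, strip the replOffset= prefix so field 4 is numeric,
# sort descending by offset, and take the pod name from the top line.
candidate=$(printf '%s\n' "$status_lines" \
  | grep 'role=master' \
  | sed 's/replOffset=//' \
  | sort -k4,4nr \
  | awk 'NR==1 {print $1}')
echo "candidate=$candidate"
```

The result is only a suggestion; confirm with application owners before exporting it as AUTHORITATIVE.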
Set working variables
export AUTHORITATIVE=<authoritative-primary-pod>
export STALE_PRIMARY=<stale-primary-pod>
Recovery Steps
Fence the stale primary first
Always fence the stale primary before redirecting traffic.
kubectl patch rediscluster "$CLUSTER" -n "$NS" --type=merge \
-p "{\"metadata\":{\"annotations\":{\"redis.io/fencedInstances\":\"[\\\"$STALE_PRIMARY\\\"]\"}}}"
Point cluster status and leader service to the authoritative primary
kubectl patch rediscluster "$CLUSTER" -n "$NS" --subresource=status --type=merge \
-p "{\"status\":{\"currentPrimary\":\"$AUTHORITATIVE\",\"phase\":\"FailingOver\"}}"
kubectl patch service "$CLUSTER-leader" -n "$NS" --type=merge \
-p "{\"spec\":{\"selector\":{\"redis.io/cluster\":\"$CLUSTER\",\"redis.io/instance\":\"$AUTHORITATIVE\"}}}"
Clear fencing and force stale primary cold start
kubectl annotate rediscluster "$CLUSTER" -n "$NS" redis.io/fencedInstances-
kubectl delete pod "$STALE_PRIMARY" -n "$NS"
On cold start, the split-brain guard consults status.currentPrimary and starts this pod as a replica.
Verification
kubectl get rediscluster "$CLUSTER" -n "$NS" -o jsonpath='{.status.currentPrimary}{"\n"}'
kubectl get service "$CLUSTER-leader" -n "$NS" -o jsonpath='{.spec.selector.redis\.io/instance}{"\n"}'
kubectl get rediscluster "$CLUSTER" -n "$NS" -o go-template='{{range $name,$s := .status.instancesStatus}}{{printf "%s role=%s connected=%t masterLink=%s\n" $name $s.role $s.connected $s.masterLinkStatus}}{{end}}'
kubectl get rediscluster "$CLUSTER" -n "$NS" -o jsonpath='{.status.phase}{"\n"}'
Expected:
- Exactly one pod reports role=master.
- status.currentPrimary and the $CLUSTER-leader Service selector point to $AUTHORITATIVE.
- The former stale primary reports role=slave and masterLinkStatus=up.
- status.phase returns to Healthy.
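The expected end state can also be checked mechanically against the role summary. The sample below is a hypothetical healthy result (a master normally reports no upstream link, shown here as an empty masterLink):

```shell
# Hypothetical post-recovery role summary: one master, replicas with masterLink=up.
status_lines='redis-data-0 role=master connected=true masterLink=
redis-data-1 role=slave connected=true masterLink=up
redis-data-2 role=slave connected=true masterLink=up'

# Exactly one master expected.
masters=$(printf '%s\n' "$status_lines" | grep -c 'role=master')
# Count replicas whose master link is not up; expect zero after recovery.
broken=$(printf '%s\n' "$status_lines" | awk '/role=slave/ && !/masterLink=up/ {n++} END {print n+0}')
echo "masters=$masters broken=$broken"
```

Any other combination (masters != 1, or broken > 0) means recovery is incomplete; re-run the Verification commands or escalate.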
Escalation
- If both primaries diverged significantly and the authoritative source is unclear, pause writes and escalate to the incident commander/data owner.
- If stale pod repeatedly returns as master after restart, keep it fenced and escalate with full status/event/log capture.