Overview

Redis Operator implements a fencing-first failover strategy inspired by CloudNativePG. The core principle:
Always fence the old primary before promoting a new one.
This prevents split-brain scenarios where two primaries accept writes simultaneously, leading to data divergence and loss.

Fencing-First Failover Sequence

When the controller detects the primary is unreachable, it executes the following steps in strict order:

Step 1: Detect Primary Failure

The controller polls each instance manager via GET http://<pod-ip>:9121/v1/status at regular intervals (default: every 5 seconds). Failure conditions:
  • HTTP request timeout (default: 2 seconds)
  • HTTP 5xx error
  • Connection refused (pod not running)
  • Pod marked for deletion (metadata.deletionTimestamp set)
Status tracking (internal/controller/cluster/status.go:12-14):
status.instancesStatus[podName] = InstanceStatus{
    Connected: false,
    LastSeenAt: &metav1.Time{Time: time.Now()},
}
If the current primary’s Connected field is false for two consecutive reconcile cycles, the controller initiates failover.
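The two-consecutive-cycles rule can be sketched as a small state update. This is an illustration only: `ConsecutiveFails` and `recordPoll` are hypothetical names, not fields or functions from the operator's actual status code.

```go
package main

// InstanceStatus is a reduced sketch of the status-tracking struct
// shown above, with a hypothetical failure counter added.
type InstanceStatus struct {
	Connected        bool
	ConsecutiveFails int
}

// recordPoll updates an instance's status after one poll cycle and
// reports whether failover should start: the current primary has
// been unreachable for two consecutive reconcile cycles.
func recordPoll(s *InstanceStatus, reachable bool, isPrimary bool) bool {
	s.Connected = reachable
	if reachable {
		// Any successful poll resets the failure streak.
		s.ConsecutiveFails = 0
		return false
	}
	s.ConsecutiveFails++
	return isPrimary && s.ConsecutiveFails >= 2
}
```

A single missed poll never triggers failover; only a second consecutive miss does, which filters out transient network blips.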

Step 2: Fence the Former Primary

Before promoting any replica, the controller sets a fencing annotation on the RedisCluster resource:
metadata:
  annotations:
    redis.io/fencedInstances: '["example-0"]'
Fencing effect (internal/instance-manager/reconciler/reconciler.go): The instance manager watches for this annotation. When a pod’s name appears in redis.io/fencedInstances:
  1. The instance manager immediately stops redis-server (sends SIGTERM, waits for graceful shutdown)
  2. The instance manager refuses to restart Redis until the annotation is cleared
  3. Kubernetes liveness probe fails → pod is marked unhealthy
  4. Kubernetes readiness probe fails → pod is removed from Service endpoints
Fencing persists until the operator clears the annotation. A fenced pod will not accept reads or writes, even if it recovers network connectivity.
See internal/controller/cluster/fencing.go:30.
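The annotation check the instance manager performs can be sketched as follows. `isFenced` is a hypothetical helper; the annotation key and its JSON-list value match the example above.

```go
package main

import "encoding/json"

// isFenced reports whether podName appears in the
// redis.io/fencedInstances annotation, which holds a JSON list of
// pod names (e.g. '["example-0"]').
func isFenced(annotations map[string]string, podName string) bool {
	raw, ok := annotations["redis.io/fencedInstances"]
	if !ok {
		return false
	}
	var fenced []string
	if err := json.Unmarshal([]byte(raw), &fenced); err != nil {
		// A malformed annotation is treated as "not fenced" here;
		// a real implementation would surface the error.
		return false
	}
	for _, name := range fenced {
		if name == podName {
			return true
		}
	}
	return false
}
```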

Step 3: Select a Replica to Promote

The controller selects the replica with the smallest replication lag to minimize data loss. Selection criteria (internal/controller/cluster/fencing.go):
  1. Reachable: status.instancesStatus[podName].Connected == true
  2. Is a replica: status.instancesStatus[podName].Role == "slave"
  3. Lowest lag: Smallest status.instancesStatus[podName].ReplicaLagBytes
  4. Stable ordinal: If multiple replicas have the same lag, prefer the lowest ordinal (e.g., example-1 over example-2)
Example status:
status:
  instancesStatus:
    example-0:  # Current primary (unreachable)
      connected: false
      role: master
    example-1:  # Best candidate (lowest lag)
      connected: true
      role: slave
      replicationOffset: 45678
      replicaLagBytes: 120
    example-2:  # Second choice (higher lag)
      connected: true
      role: slave
      replicationOffset: 45000
      replicaLagBytes: 798
The controller selects example-1 for promotion.
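The four selection criteria can be sketched in one function. This is a simplified illustration: field names follow the status example above, and the tie-break uses lexicographic pod-name comparison, which matches ordinal order for single-digit ordinals.

```go
package main

import "sort"

// InstanceStatus holds only the fields used for candidate selection.
type InstanceStatus struct {
	Name            string
	Connected       bool
	Role            string
	ReplicaLagBytes int64
}

// selectPromotionCandidate returns the reachable replica with the
// smallest replication lag, breaking ties by lowest pod name.
// The boolean is false if no eligible replica exists.
func selectPromotionCandidate(instances []InstanceStatus) (InstanceStatus, bool) {
	var candidates []InstanceStatus
	for _, s := range instances {
		// Criteria 1 and 2: reachable, and currently a replica.
		if s.Connected && s.Role == "slave" {
			candidates = append(candidates, s)
		}
	}
	if len(candidates) == 0 {
		return InstanceStatus{}, false
	}
	// Criteria 3 and 4: lowest lag, then lowest pod name.
	sort.Slice(candidates, func(i, j int) bool {
		if candidates[i].ReplicaLagBytes != candidates[j].ReplicaLagBytes {
			return candidates[i].ReplicaLagBytes < candidates[j].ReplicaLagBytes
		}
		return candidates[i].Name < candidates[j].Name
	})
	return candidates[0], true
}
```

Feeding in the example status above yields example-1, the replica with 120 bytes of lag.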

Step 4: Promote the Selected Replica

The controller calls the instance manager HTTP API on the selected replica’s pod IP (not through a Service):
POST http://<example-1-pod-ip>:9121/v1/promote
Instance manager promotion logic (internal/instance-manager/webserver/server.go):
  1. Executes REPLICAOF NO ONE via Redis connection
  2. Waits for Redis to confirm promotion (INFO replication shows role:master)
  3. Updates redis.conf to remove replicaof directive
  4. Returns HTTP 200 on success
Promotion time: Typically 1 second (Redis command execution + confirmation). See internal/controller/cluster/fencing.go:52.
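The promotion sequence can be sketched against a minimal interface. The `redisClient` interface and `fakeClient` stub are hypothetical, introduced here so the two-step logic (command, then confirmation) is visible; the real instance manager talks to Redis directly.

```go
package main

import "errors"

// redisClient abstracts the two Redis interactions used during
// promotion (hypothetical interface for illustration).
type redisClient interface {
	// ReplicaofNoOne issues REPLICAOF NO ONE.
	ReplicaofNoOne() error
	// RoleIsMaster reports whether INFO replication shows role:master.
	RoleIsMaster() (bool, error)
}

// promote executes the promotion sequence described above:
// REPLICAOF NO ONE, then confirm the role change before reporting success.
func promote(c redisClient) error {
	if err := c.ReplicaofNoOne(); err != nil {
		return err
	}
	isMaster, err := c.RoleIsMaster()
	if err != nil {
		return err
	}
	if !isMaster {
		return errors.New("promotion not confirmed: role is not master")
	}
	return nil
}

// fakeClient is a stub standing in for a real Redis connection.
type fakeClient struct{ failConfirm bool }

func (f fakeClient) ReplicaofNoOne() error      { return nil }
func (f fakeClient) RoleIsMaster() (bool, error) { return !f.failConfirm, nil }
```

Returning success only after the role change is confirmed is what lets the controller safely proceed to the Service update in Step 5.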

Step 5: Update Services and Status

The controller updates Kubernetes resources to reflect the new topology.
Service selector update (internal/controller/cluster/services.go):
apiVersion: v1
kind: Service
metadata:
  name: example-leader
spec:
  selector:
    redis.io/instance: example-1  # Changed from example-0
    redis.io/role: primary
Status update:
status:
  currentPrimary: example-1  # Changed from example-0
  phase: Healthy             # Changed from FailingOver
Condition update:
conditions:
- type: PrimaryAvailable
  status: "True"
  reason: PromotionComplete
  message: "Promoted example-1 to primary"
  lastTransitionTime: "2026-02-28T10:15:30Z"

Step 6: Remove Fencing

The controller clears the fencing annotation:
metadata:
  annotations:
    redis.io/fencedInstances: '[]'  # Empty list
Former primary recovery (internal/instance-manager/run/run.go:63-66):
  1. The fenced pod (e.g., example-0) is no longer prevented from starting
  2. The instance manager reads status.currentPrimary from the RedisCluster CR
  3. Sees status.currentPrimary == "example-1" (not example-0)
  4. Boot-time guard activates: starts Redis with REPLICAOF <example-1-ip> 6379
  5. Redis performs partial resync (PSYNC) or full sync (SYNC) as needed
  6. Any writes the former primary accepted after failover are discarded
See internal/controller/cluster/fencing.go:57-58.

Step 7: Reconfigure Other Replicas

The controller updates all remaining replicas to follow the new primary.
For each replica (internal/controller/cluster/pods.go):
  1. Send REPLICAOF <new-primary-ip> 6379 via the instance manager HTTP API
  2. Wait for INFO replication to show master_link_status:up
  3. Update status.instancesStatus[podName].MasterLinkStatus = "up"
Replication reconvergence time: Typically 5-15 seconds (depends on data size and network latency).

Split-Brain Prevention

Redis Operator uses three layers of defense against split-brain scenarios.

Layer 1: Fencing-First Failover

Prevents: A recovering former primary from continuing to accept writes during failover.
How it works:
  1. Controller sets fence annotation before promoting a replica
  2. Instance manager stops Redis on the fenced pod
  3. Pod is removed from Service endpoints → clients can’t reach it
  4. New primary is promoted
  5. Fence is cleared; former primary restarts as replica
Race condition protection:
  • Scenario: Former primary recovers network connectivity after Step 2 (fencing) but before Step 4 (promotion)
  • Outcome: The former primary is already fenced → Redis is stopped → no writes are accepted
  • Result: No split-brain; new primary is promoted safely

Layer 2: Boot-Time Role Guard

Prevents: A pod from self-electing as primary on startup, regardless of local data state.
How it works (internal/instance-manager/run/run.go:63-66): On every cold start, before redis-server is launched:
if os.Getenv("POD_NAME") == cluster.Status.CurrentPrimary {
    // This pod is the designated primary
    // Start Redis without REPLICAOF directive
} else {
    // This pod is NOT the designated primary
    // Start Redis with REPLICAOF <currentPrimary-ip> 6379
    // Discard any local data that diverges from the primary
}
Hard invariant: The split-brain guard in internal/instance-manager/run/run.go must fire before redis-server starts. If POD_NAME != status.currentPrimary, always issue REPLICAOF first, regardless of local data.
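The guard's decision can be reduced to the replication arguments passed to redis-server. `replicaofArgs` is a hypothetical helper name, but the rule it encodes is the invariant above: only the designated primary starts without REPLICAOF.

```go
package main

// replicaofArgs returns the extra redis-server arguments the
// boot-time guard would apply, given this pod's name, the designated
// primary from status.currentPrimary, and that primary's IP.
func replicaofArgs(podName, currentPrimary, primaryIP string) []string {
	if podName == currentPrimary {
		// Designated primary: start without a REPLICAOF directive.
		return nil
	}
	// Every other pod starts as a replica of the designated primary,
	// regardless of its local data state.
	return []string{"--replicaof", primaryIP, "6379"}
}
```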
Data loss trade-off:
  • Lost: Any writes the former primary accepted after network partition (before fencing)
  • Preserved: All writes accepted by the new primary after promotion
  • Philosophy: Matches CloudNativePG’s pg_rewind behavior — prefer consistency over preserving isolated writes

Layer 3: Runtime Primary Isolation Detection

Prevents: An isolated primary (one that can’t reach the API server or its peers) from continuing to accept writes.
How it works (internal/instance-manager/webserver/server.go): The liveness probe (GET /healthz) on primary pods includes additional checks:
  1. Kubernetes API reachability: Can the instance manager reach the Kubernetes API server?
  2. Peer reachability: Can the instance manager reach other instance manager pods?
If both checks fail (primary is isolated):
  • Liveness probe returns HTTP 503 (Service Unavailable)
  • Kubernetes marks the pod as unhealthy
  • After livenessProbe.failureThreshold consecutive failures, Kubernetes restarts the pod
  • On restart, the boot-time guard (Layer 2) ensures the pod starts as a replica
This protection is configurable via spec.primaryIsolation.enabled (default: true). Disable it only in non-production environments.
See api/v1/rediscluster_types.go:260-275.
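The liveness decision can be sketched as follows (hypothetical names): only the case where both reachability checks fail returns 503, matching the behavior described above.

```go
package main

// isolationStatus captures the two reachability checks run by the
// primary's liveness handler.
type isolationStatus struct {
	APIServerReachable bool
	PeerReachable      bool
}

// healthzCode returns the HTTP status the liveness probe reports:
// 503 only when the primary is fully isolated (neither the API
// server nor any peer instance manager is reachable).
func healthzCode(s isolationStatus) int {
	if !s.APIServerReachable && !s.PeerReachable {
		return 503
	}
	return 200
}
```

Requiring both checks to fail avoids restarting a healthy primary during a transient API-server outage: as long as peers are reachable, the pod stays up.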

Failover Timeline Example

Real-world failover scenario:
T+0s:  Primary pod (example-0) loses network connectivity
T+2s:  Controller status poll times out (example-0 unreachable)
T+5s:  Second consecutive poll fails → controller initiates failover
T+5s:  Controller sets fence annotation: redis.io/fencedInstances=["example-0"]
T+6s:  Instance manager on example-0 stops redis-server (graceful SIGTERM)
T+7s:  Controller selects example-1 (smallest lag: 120 bytes)
T+7s:  Controller calls POST http://<example-1-ip>:9121/v1/promote
T+8s:  Instance manager on example-1 executes REPLICAOF NO ONE
T+8s:  Redis on example-1 confirms role:master
T+9s:  Controller updates -leader Service selector to example-1
T+9s:  Controller updates status.currentPrimary to example-1
T+10s: Controller clears fence annotation
T+15s: Former primary (example-0) recovers network, restarts
T+16s: Instance manager on example-0 reads status.currentPrimary=example-1
T+16s: Boot-time guard activates → starts with REPLICAOF example-1-ip 6379
T+20s: Redis on example-0 completes PSYNC (partial resync)
T+20s: Cluster fully reconverged, all replicas following example-1
Total failover time: ~10 seconds (T+5s detection + T+5s execution).

Configuration Options

Status Poll Interval

Controls how frequently the controller polls instance managers for status. Controller flag:
--status-poll-interval=5s
Trade-offs:
  • Shorter interval (e.g., 2s): Faster failure detection, higher API server load
  • Longer interval (e.g., 10s): Slower failure detection, lower API server load

HTTP Timeout

Controls how long the controller waits for instance manager HTTP responses. Controller flag:
--http-timeout=2s
Trade-offs:
  • Shorter timeout (e.g., 1s): Faster failure detection, more false positives during pod startup
  • Longer timeout (e.g., 5s): Slower failure detection, fewer false positives

Synchronous Replication

Require a minimum number of replicas to acknowledge writes before the primary confirms success. Spec configuration:
spec:
  minSyncReplicas: 1  # Require at least 1 replica to acknowledge writes
  maxSyncReplicas: 2  # Wait for up to 2 replicas (if available)
Redis configuration applied:
min-replicas-to-write 1
min-replicas-max-lag 10
Setting minSyncReplicas > 0 reduces availability: if no replicas are reachable, the primary will reject writes.
See api/v1/rediscluster_types.go:122-130.

Primary Isolation Detection

Enable runtime isolation checks in the primary’s liveness probe. Spec configuration:
spec:
  primaryIsolation:
    enabled: true
    apiServerTimeout: 5s  # Timeout for API server reachability
    peerTimeout: 5s       # Timeout for peer instance-manager reachability
Behavior:
  • If enabled and the primary can’t reach the API server and can’t reach any peer instance managers, the liveness probe fails
  • After livenessProbe.failureThreshold failures (default: 3), Kubernetes restarts the pod
  • On restart, the boot-time guard ensures the pod starts as a replica
This is a defense-in-depth measure. Most split-brain scenarios are prevented by fencing-first failover (Layer 1) and the boot-time guard (Layer 2).

Manual Failover

You can trigger a manual failover by deleting the current primary pod. Example:
# Delete the current primary pod
kubectl delete pod example-0

# The controller detects the primary is gone and promotes a replica
Controlled switchover (zero downtime):
  1. The controller promotes a replica before deleting the old primary
  2. The old primary is deleted only after the new primary is confirmed healthy
  3. Clients experience zero write downtime (reads are always served by replicas)
See Upgrades for details on controlled switchovers during upgrades.

Sentinel Mode Failover

In Sentinel mode (spec.mode: sentinel), Sentinel handles failover, not the operator. Sentinel failover flow:
  1. Sentinels monitor the primary via PING (default: every 1 second)
  2. When a quorum of Sentinels (default: 2 of 3) agrees the primary is down, they elect a new primary
  3. Sentinels run REPLICAOF NO ONE on the selected replica
  4. Sentinels reconfigure other replicas to follow the new primary
  5. The operator detects the change via Sentinel status polling
  6. The operator updates status.currentPrimary and the -leader Service selector
Sentinel failover does not use fencing. Sentinel relies on quorum-based leader election to prevent split-brain.
Failover time: Typically 5-15 seconds (faster than operator-managed failover because Sentinels monitor continuously). See Cluster Modes for Sentinel configuration.

Comparison with Other Operators

See Comparison with Other Redis Operators for how Redis Operator’s failover approach compares to alternatives like OpsTree Redis Operator. Key differentiators:
  • Fencing-first: Redis Operator fences the old primary before promoting a new one
  • Boot-time guard: Ensures a restarting pod can never self-elect as primary
  • Pod IP targeting: Failover commands target specific pods, not load-balanced Services
  • Controlled switchover: Rolling updates promote a replica first, then delete the old primary (zero downtime)

Next Steps

  • Architecture — Understand the split control/data plane
  • Cluster Modes — Standalone vs Sentinel failover behavior
  • Upgrades — Zero-downtime primary upgrades via rolling updates
