
Failover Procedures

Gitaly Cluster provides automatic failover to maintain availability when primary Gitaly nodes fail. Praefect detects failures, promotes healthy secondaries to primary, and schedules replication to restore redundancy.

Failover Overview

When a primary Gitaly node becomes unavailable, Praefect automatically:
  1. Detects the failure through health checks or error tracking
  2. Selects a new primary from healthy secondary nodes
  3. Promotes the secondary to serve as the new primary
  4. Updates routing to direct traffic to the new primary
  5. Schedules replication to restore data to other secondaries
Failover happens per repository (with per_repository election strategy) or per virtual storage (with sql election strategy).
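The sequence above can be sketched as a minimal Python model. This is an illustration only, not Praefect's actual implementation; the `Node` and `Repository` types and the `fail_over` helper are hypothetical:

```python
# Hypothetical sketch of the per-repository failover sequence (not Praefect code).
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    healthy: bool = True
    pending_replication_jobs: int = 0  # proxy for replication lag

@dataclass
class Repository:
    path: str
    primary: Node
    secondaries: list = field(default_factory=list)

def fail_over(repo: Repository) -> Node:
    """Promote the healthiest, most up-to-date secondary to primary."""
    # Unhealthy nodes are excluded from the candidate pool.
    candidates = [n for n in repo.secondaries if n.healthy]
    if not candidates:
        raise RuntimeError("no healthy secondary available")
    # Prefer the candidate with the fewest pending replication jobs.
    new_primary = min(candidates, key=lambda n: n.pending_replication_jobs)
    repo.secondaries.remove(new_primary)
    repo.secondaries.append(repo.primary)  # old primary demoted to secondary
    repo.primary = new_primary             # routing now targets the new primary
    return new_primary
```

In the real cluster, the final step (scheduling replication to the remaining secondaries) runs asynchronously after routing is updated.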

Failover Strategies

Praefect supports multiple failover strategies.

Per-Repository Elector

Each repository can have a different primary node, providing optimal availability and load distribution:
[failover]
enabled = true
election_strategy = "per_repository"
Benefits:
  • Best availability: one node failure doesn’t affect repositories on other nodes
  • Better load distribution across the cluster
  • Granular failover: only affected repositories fail over
  • Fastest recovery time
Requirements:
  • PostgreSQL database for storing primary assignments
  • All Praefect nodes must have database access

SQL Elector

All repositories in a virtual storage share the same primary node:
[failover]
enabled = true
election_strategy = "sql"
SQL elector uses PostgreSQL to coordinate between Praefect nodes:
  1. Each Praefect stores its view of node health in the database
  2. When a primary fails, Praefects vote on a replacement
  3. The candidate with the least replication lag wins
  4. A majority of Praefects must agree on the new primary
The sql election strategy is deprecated and scheduled for removal in GitLab 14.0. Migrate to per_repository following the migration guide.
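The voting steps above can be illustrated with a short sketch. The `elect_primary` helper is hypothetical and stands in for Praefect's actual SQL-based coordination:

```python
# Illustrative majority vote among Praefect nodes (hypothetical helper,
# not Praefect's actual SQL queries).
from collections import Counter
from typing import Dict, Optional

def elect_primary(votes: Dict[str, str]) -> Optional[str]:
    """`votes` maps each Praefect node to the Gitaly candidate it prefers
    (e.g. the node it observes with the least replication lag).
    A candidate wins only with a strict majority of votes."""
    if not votes:
        return None
    candidate, count = Counter(votes.values()).most_common(1)[0]
    # No strict majority means the election fails and is retried later.
    return candidate if count > len(votes) / 2 else None
```

For example, two of three Praefects voting for `gitaly-2` produces a winner, while a 1–1 split produces no result.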

Local Elector (Development Only)

Each Praefect makes independent failover decisions without coordination:
[failover]
enabled = true
election_strategy = "local"
Characteristics:
  • No database required
  • No coordination between Praefect instances
  • Useful for local development environments
  • Not suitable for production: different Praefects may disagree on primary
Never use local elector in production. Without coordination, different Praefect instances can route traffic to different primaries, causing split-brain scenarios.

Failover Disabled

Disable automatic failover for testing or troubleshooting:
[failover]
enabled = false
With failover disabled:
  • Virtual storage becomes unavailable if the configured primary fails
  • Manual intervention required to restore service
  • Useful for debugging or performing controlled maintenance

Health Monitoring

Praefect continuously monitors Gitaly node health to detect failures.

Error Tracking

Praefect can also trigger failover based on error rates:
[failover]
enabled = true
error_threshold_window = "1m"
read_error_threshold_count = 10
write_error_threshold_count = 5
How it works:
  • Track errors per node within a time window
  • Mark a node unhealthy if errors exceed the threshold
  • Read errors are weighted separately from write errors
  • Unhealthy nodes are excluded from primary election
Error thresholds help detect “soft failures” where a node is responding but returning errors. This complements health checks which only detect “hard failures” (node completely down).
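The windowed counting described above can be sketched as follows. The `ErrorTracker` class and its method names are hypothetical; Praefect's internal bookkeeping differs:

```python
# Sketch of sliding-window error tracking (hypothetical, for illustration).
from collections import deque

class ErrorTracker:
    def __init__(self, window_seconds=60, read_threshold=10, write_threshold=5):
        # Mirrors error_threshold_window, read/write_error_threshold_count.
        self.window = window_seconds
        self.read_threshold = read_threshold
        self.write_threshold = write_threshold
        self.read_errors = deque()   # timestamps of recent read errors
        self.write_errors = deque()

    def record(self, kind, now):
        (self.read_errors if kind == "read" else self.write_errors).append(now)

    def _prune(self, errors, now):
        # Drop errors that have aged out of the window.
        while errors and now - errors[0] > self.window:
            errors.popleft()

    def healthy(self, now):
        self._prune(self.read_errors, now)
        self._prune(self.write_errors, now)
        # Reads and writes are tracked against separate thresholds.
        return (len(self.read_errors) < self.read_threshold and
                len(self.write_errors) < self.write_threshold)
```

With the configuration shown above, five write errors within one minute would mark the node unhealthy, while the same errors spread across several minutes would not.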

Health Checks

With the per_repository election strategy, Praefect uses active health checks:
[failover]
bootstrap_interval = "1s"    # Initial check interval at startup
monitor_interval = "3s"      # Regular check interval during operation
Health checks verify:
  • Gitaly node is reachable
  • gRPC connection is healthy
  • Node can respond to basic RPCs
Unhealthy nodes are removed from the candidate pool during election.
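A simplified model of the check loop, showing how the two intervals interact, might look like this (the `monitor` helper is hypothetical; the real checks run over gRPC):

```python
# Sketch of a periodic health-check loop with bootstrap and monitor intervals
# (hypothetical helper; not Praefect's implementation).
import time

def monitor(check, bootstrap_interval=1.0, monitor_interval=3.0,
            rounds=5, sleep=time.sleep):
    """Run `check()` on a tight interval at startup, then at the
    steady-state cadence. Returns the health history for an elector."""
    history = []
    for i in range(rounds):
        history.append(bool(check()))
        # bootstrap_interval until the first result, then monitor_interval.
        sleep(bootstrap_interval if i == 0 else monitor_interval)
    return history
```

Injecting a fake `sleep` makes the cadence testable: the first wait uses the bootstrap interval, all later waits the monitor interval.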

Primary Selection

When electing a new primary, Praefect considers:

Eligibility Requirements

  1. Health: Node must be currently healthy
  2. Connectivity: Node must be reachable by a majority of Praefects
  3. Data consistency: Node must not be significantly behind

Selection Criteria

Among eligible candidates, Praefect selects based on:
  1. Replication lag: Prefer nodes with fewer pending replication jobs
  2. Data recency: Choose the node with the most up-to-date data
  3. Position: If all else equal, select based on configuration order
Praefect prioritizes minimizing data loss over load distribution. The most up-to-date node becomes primary even if it’s already serving many repositories.
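The ordering of these criteria can be expressed as a single sort key. This is a sketch with hypothetical field names (`pending_jobs`, `generation`, `config_index`), not Praefect's code:

```python
# Sketch of the selection order: replication lag, then data recency,
# then configuration order (hypothetical field names).
def pick_primary(candidates):
    """Each candidate is a dict with 'name', 'pending_jobs' (replication
    lag), 'generation' (data recency; higher is newer), and
    'config_index' (position in the configuration file)."""
    healthy = [c for c in candidates if c.get("healthy", True)]
    # Tuple ordering: fewer pending jobs wins, then the newest data
    # (negated so higher generation sorts first), then earliest config slot.
    return min(healthy,
               key=lambda c: (c["pending_jobs"], -c["generation"], c["config_index"]))
```

Note how a node with zero pending jobs but older data still loses to a zero-lag node holding newer data, matching the priority on minimizing data loss.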

Read-Only Mode

After failing over to an outdated node, affected repositories enter read-only mode.

Why Read-Only?

Read-only mode prevents data conflicts: if the new primary accepted writes immediately, those commits might conflict with unreplicated data from the old primary.

Recovery from Read-Only

Praefect automatically removes read-only mode when:
  1. All missing data is replicated from another node, or
  2. Administrator accepts data loss using accept-dataloss
Automatic recovery:
# Praefect's reconciler automatically schedules replication
# No manual intervention required if a complete copy exists
Manual recovery:
# When complete data is unrecoverable, designate authoritative version
praefect -config /path/to/config.toml accept-dataloss \
  -virtual-storage default \
  -relative-path @hashed/ab/cd/abcd1234.git \
  -authoritative-storage gitaly-2
Using accept-dataloss permanently discards any data that exists only on failed nodes. Only use this when the failed primary cannot be recovered and you’ve determined acceptable data loss.

Identifying Data Loss

After a failover, check for repositories with potential data loss:
# Check all virtual storages
praefect -config /path/to/config.toml dataloss

# Check specific virtual storage
praefect -config /path/to/config.toml dataloss -virtual-storage default
Example output:
Virtual storage: default
  Repositories:
    @hashed/ab/cd/abcd1234.git:
      Primary: gitaly-2 (1 behind)
      Outdated secondaries:
        gitaly-1: 1 change(s) behind, last healthy at 2021-01-15 10:30:00
This indicates:
  • Repository’s current primary is gitaly-2
  • gitaly-1 is missing 1 change from when it was primary
  • gitaly-1 was last seen healthy at the shown timestamp

Failover Testing

Regularly test failover to verify configuration:

Planned Failover Test

  1. Verify current state:
    # Check which node is primary
    praefect -config /path/to/config.toml dataloss
    
  2. Stop a Gitaly node:
    systemctl stop gitaly@gitaly-1
    
  3. Monitor Praefect logs:
    tail -f /var/log/gitlab/praefect/current | grep -i failover
    
  4. Verify failover occurred:
    praefect -config /path/to/config.toml dataloss
    
  5. Test repository access:
    git clone git@<gitlab-host>:group/project.git
    
  6. Restart the node:
    systemctl start gitaly@gitaly-1
    
  7. Verify replication:
    # Check replication queue clears
    curl -s http://praefect:10101/metrics | grep replication_queue_depth
    

Unplanned Failure Simulation

Test handling of abrupt failures:
# Simulate network partition
iptables -A INPUT -s <gitaly-node-ip> -j DROP
iptables -A OUTPUT -d <gitaly-node-ip> -j DROP

# Wait for failover (monitor logs)
# Then restore connectivity
iptables -D INPUT -s <gitaly-node-ip> -j DROP
iptables -D OUTPUT -d <gitaly-node-ip> -j DROP
Perform failover tests in a non-production environment first. Even with proper configuration, unexpected edge cases can occur.

Monitoring Failover Events

Track failover through logs and metrics:

Key Log Messages

level=info msg="primary node changed" virtual_storage=default repository=@hashed/... old_primary=gitaly-1 new_primary=gitaly-2
level=warn msg="node marked unhealthy" virtual_storage=default node=gitaly-1 reason="health check failed"
level=info msg="node marked healthy" virtual_storage=default node=gitaly-1

Key Metrics

# Node health status (1=healthy, 0=unhealthy)
gitaly_praefect_node_health{virtual_storage="default",node="gitaly-1"}

# Number of repositories in read-only mode
gitaly_praefect_read_only_repositories{virtual_storage="default"}

# Primary node changes
rate(gitaly_praefect_primary_elections_total[5m])

Best Practices

  1. Use per-repository election: Provides best availability and granularity
  2. Configure error thresholds: Catch soft failures before they impact users
  3. Monitor replication lag: Large lag increases data loss risk during failover
  4. Test regularly: Verify failover works before you need it
  5. Plan for data loss: Document procedures for handling accept-dataloss scenarios
  6. Size appropriately: Ensure remaining nodes can handle load after failover

Recovering Failed Nodes

When a failed node comes back online:
  1. Praefect detects recovery through health checks
  2. Node marked healthy and eligible for primary election
  3. Replication scheduled to bring node up-to-date
  4. Node rejoins cluster once replication completes
No manual intervention is typically required. Monitor the replication queue to verify that recovery is progressing:
curl -s http://praefect:10101/metrics | grep 'replication_queue.*gitaly-1'

Next Steps

HA Overview

Review high availability concepts

Praefect Configuration

Configure failover settings
