Failover Procedures
Gitaly Cluster provides automatic failover to maintain availability when a primary Gitaly node fails. Praefect detects failures, promotes a healthy secondary to primary, and schedules replication to restore redundancy.

Failover Overview

When a primary Gitaly node becomes unavailable, Praefect automatically:

- Detects the failure through health checks or error tracking
- Selects a new primary from healthy secondary nodes
- Promotes the secondary to serve as the new primary
- Updates routing to direct traffic to the new primary
- Schedules replication to restore data to other secondaries
Failover happens per repository (with the per_repository election strategy) or per virtual storage (with the sql election strategy).

Failover Strategies

Praefect supports multiple failover strategies:

Per-Repository Elector (Recommended)
Each repository can have a different primary node, providing optimal availability and load distribution:

- Best availability: one node failure doesn’t affect repositories on other nodes
- Better load distribution across the cluster
- Granular failover: only affected repositories fail over
- Fastest recovery time

This strategy requires:

- A PostgreSQL database for storing primary assignments
- Database access from all Praefect nodes
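As a sketch, in an Omnibus GitLab deployment the election strategy and the Praefect database connection are configured in /etc/gitlab/gitlab.rb. The key names and values below are illustrative and can differ between GitLab versions; check the Praefect configuration reference for your release:

```ruby
# /etc/gitlab/gitlab.rb on each Praefect node (example values).
# 'per_repository' is the recommended strategy and requires the Praefect
# PostgreSQL database to be reachable from every Praefect node.
praefect['failover_election_strategy'] = 'per_repository'

# Connection to the database that stores primary assignments.
praefect['database_host']   = 'praefect-db.internal'
praefect['database_port']   = 5432
praefect['database_user']   = 'praefect'
praefect['database_dbname'] = 'praefect_production'
```

After editing the file, apply the change with sudo gitlab-ctl reconfigure.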
SQL Elector

All repositories in a virtual storage share the same primary node:

- Each Praefect stores its view of node health in the database
- When a primary fails, Praefects vote on a replacement
- The candidate with the least replication lag wins
- A majority of Praefects must agree on the new primary
Local Elector (Development Only)

Each Praefect makes independent failover decisions without coordination:

- No database required
- No coordination between Praefect instances
- Useful for local development environments
- Not suitable for production: different Praefects may disagree on primary
Failover Disabled

Disable automatic failover for testing or troubleshooting:

- The virtual storage becomes unavailable if the configured primary fails
- Manual intervention required to restore service
- Useful for debugging or performing controlled maintenance
Health Monitoring

Praefect continuously monitors Gitaly node health to detect failures.

Error Tracking

Praefect can trigger failover based on error rates:

- Track errors per node within a time window
- Mark a node unhealthy if errors exceed the threshold
- Read errors are weighted separately from write errors
- Unhealthy nodes are excluded from primary election
Error thresholds help detect “soft failures” where a node is responding but returning errors. This complements health checks, which only detect “hard failures” (the node is completely down).
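A sketch of error-threshold settings for an Omnibus deployment follows; the key names and values are illustrative assumptions and should be verified against the Praefect reference for your GitLab version:

```ruby
# /etc/gitlab/gitlab.rb (illustrative names and values).
praefect['failover_enabled'] = true

# Count errors per node over a sliding window (seconds).
praefect['failover_error_threshold_window'] = 10

# Read and write errors are weighted separately: writes are
# more serious, so their threshold is lower.
praefect['failover_read_error_threshold_count'] = 1000
praefect['failover_write_error_threshold_count'] = 100
```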
Health Checks
With the per_repository election strategy, Praefect uses active health checks to verify that:

- Gitaly node is reachable
- gRPC connection is healthy
- Node can respond to basic RPCs
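Node health can also be checked manually. A sketch for an Omnibus installation, using the Omnibus default paths (adjust for source installations):

```shell
# Run Praefect's built-in checks, which include verifying that the
# configured Gitaly nodes are reachable.
sudo /opt/gitlab/embedded/bin/praefect \
  -config /var/opt/gitlab/praefect/config.toml check
```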
Primary Selection

When electing a new primary, Praefect considers:

Eligibility Requirements
- Health: Node must be currently healthy
- Connectivity: Node must be reachable by a majority of Praefects
- Data consistency: Node must not be significantly behind
Selection Criteria

Among eligible candidates, Praefect selects based on:

- Replication lag: Prefer nodes with fewer pending replication jobs
- Data recency: Choose the node with the most up-to-date data
- Position: If all else equal, select based on configuration order
Praefect prioritizes minimizing data loss over load distribution. The most up-to-date node becomes primary even if it’s already serving many repositories.
Read-Only Mode

After failing over to an outdated node, affected repositories enter read-only mode.

Why Read-Only?

Read-only mode prevents data conflicts: if the new primary accepted writes immediately, those commits might conflict with unreplicated data from the old primary.

Recovery from Read-Only
Praefect automatically removes read-only mode when:

- All missing data is replicated from another node, or
- An administrator accepts data loss using accept-dataloss
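Accepting data loss can be sketched with the Praefect accept-dataloss subcommand; the virtual storage name, repository path, and storage name below are examples:

```shell
# Accept data loss for one repository, making the chosen storage
# authoritative (Omnibus default paths; values are examples).
sudo /opt/gitlab/embedded/bin/praefect \
  -config /var/opt/gitlab/praefect/config.toml \
  accept-dataloss \
  -virtual-storage default \
  -repository @hashed/ab/cd/abcd1234.git \
  -authoritative-storage gitaly-2
```

This discards any changes that exist only on the outdated copies, so document and review the decision before running it.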
Identifying Data Loss

After a failover, check for repositories with potential data loss. For example, the output may show that:

- The repository’s current primary is gitaly-2
- gitaly-1 is missing 1 change from when it was primary
- gitaly-1 was last seen healthy at the shown timestamp
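A sketch of running the check with Praefect’s dataloss subcommand (Omnibus default paths; “default” is an example virtual storage name):

```shell
# List repositories whose current primary is missing changes
# from a previous primary.
sudo /opt/gitlab/embedded/bin/praefect \
  -config /var/opt/gitlab/praefect/config.toml \
  dataloss -virtual-storage default
```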
Failover Testing

Regularly test failover to verify configuration:

Planned Failover Test

1. Verify current state
2. Stop a Gitaly node
3. Monitor Praefect logs
4. Verify failover occurred
5. Test repository access
6. Restart the node
7. Verify replication
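The steps above can be sketched as shell commands for an Omnibus deployment. Host roles, storage names, and paths are examples; run each command on the node indicated in the comments:

```shell
# 1. Verify current state: list primaries and check for pending data loss
#    (run on a Praefect node).
sudo /opt/gitlab/embedded/bin/praefect \
  -config /var/opt/gitlab/praefect/config.toml \
  dataloss -virtual-storage default

# 2. Stop a Gitaly node cleanly (run on the current primary).
sudo gitlab-ctl stop gitaly

# 3. Monitor Praefect logs for failover activity (run on a Praefect node).
sudo gitlab-ctl tail praefect

# 4./5. Verify failover occurred and test access: clone or fetch a
#       repository hosted on the affected storage through GitLab as usual.

# 6. Restart the stopped node.
sudo gitlab-ctl start gitaly

# 7. Verify replication has restored redundancy: the dataloss report
#    should eventually show no outdated repositories.
sudo /opt/gitlab/embedded/bin/praefect \
  -config /var/opt/gitlab/praefect/config.toml \
  dataloss -virtual-storage default
```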
Unplanned Failure Simulation

Test how the cluster handles abrupt failures, not just clean shutdowns.

Perform failover tests in a non-production environment first. Even with proper configuration, unexpected edge cases can occur.
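For example, an abrupt failure can be simulated by killing the Gitaly process rather than stopping it cleanly; this is a sketch, and the process name may differ in your deployment:

```shell
# Kill the Gitaly process without a clean shutdown (run on the node
# under test); Praefect should detect the failure via health checks.
sudo pkill -KILL -x gitaly
```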
Monitoring Failover Events

Track failover through logs and metrics:

Key Log Messages

Key Metrics
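As a sketch, the available Praefect metrics can be inspected directly from its Prometheus endpoint; localhost:9652 is the Omnibus default listen address and may differ in your deployment:

```shell
# Dump Praefect's Prometheus metrics and skim the Praefect-specific
# series (run on a Praefect node).
curl --silent http://localhost:9652/metrics | grep praefect | head
```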
Best Practices
- Use per-repository election: Provides best availability and granularity
- Configure error thresholds: Catch soft failures before they impact users
- Monitor replication lag: Large lag increases data loss risk during failover
- Test regularly: Verify failover works before you need it
- Plan for data loss: Document procedures for handling accept-dataloss scenarios
- Size appropriately: Ensure remaining nodes can handle load after failover
Recovering Failed Nodes

When a failed node comes back online:

- Praefect detects recovery through health checks
- Node marked healthy and eligible for primary election
- Replication scheduled to bring node up-to-date
- Node rejoins cluster once replication completes
Next Steps

- HA Overview: Review high availability concepts
- Praefect Configuration: Configure failover settings