Introduction
This article provides guidance on identifying and recovering from a Consul split-brain scenario. Split-brain occurs when Consul servers lose connectivity with each other, causing inconsistencies in the cluster state.
Problem
Consul servers in a cluster become unsynchronized due to network issues or misconfigurations, leading to split-brain scenarios where multiple servers attempt to assume the leader role simultaneously. This can result in inconsistent data and operational disruptions.
Prerequisites
- Consul version compatible with snapshots and Raft-based election configurations.
- Administrative access to Consul servers and their log files.
- Knowledge of the Consul configuration and data directories.
Cause
A split-brain scenario can occur due to:
- Network partitions or connectivity issues between Consul servers.
- Misconfigured bootstrap settings.
- Improper handling of Consul data or logs.
Common symptoms include:
- Servers repeatedly entering the Candidate state.
- Logs indicating mismatched election terms among servers.
- Errors like "Failed to make RequestVote RPC."
Example Log Output
Below are sample log entries that indicate a potential split-brain scenario:
2020/10/19 16:21:23 [INFO] raft: Node at 10.90.168.42:8300 [Candidate] entering Candidate state in term 3732
2020/10/19 16:21:23 [DEBUG] raft: Votes needed: 2
2020/10/19 16:21:23 [DEBUG] raft: Vote granted from foobar in term 3732. Tally: 1
2020/10/19 16:28:53 [WARN] raft: Election timeout reached, restarting election
2020/10/19 16:28:53 [INFO] raft: Node at 00.00.000.00:8300 [Candidate] entering Candidate state in term 992
2020/10/19 16:28:53 [DEBUG] raft: Votes needed: 2
2020/10/19 16:28:53 [DEBUG] raft: Vote granted from foobar2 in term 992. Tally: 1
2020/10/19 16:28:53 [ERR] raft: Failed to make RequestVote RPC to {Voter <Voter ID>}
2020/10/19 16:29:04 [WARN] raft: Election timeout reached, restarting election
2020/10/19 16:29:04 [INFO] raft: Node at 00.00.000.00:8300 [Candidate] entering Candidate state in term 989
2020/10/19 16:29:04 [DEBUG] raft: Votes needed: 2
2020/10/19 16:29:04 [DEBUG] raft: Vote granted from <ID> in term 989. Tally: 1
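In this output, the candidates are campaigning in terms 3732, 992, and 989. Servers in a healthy cluster converge on the same election term, so terms that differ between servers mean the servers are out of sync, which indicates a split-brain scenario.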
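When logs from several servers are collected in one place, the term comparison can be automated. The following Python sketch is one possible approach, assuming the log lines follow the format of the sample output above. The names `latest_candidate_terms` and `terms_diverge` are illustrative helpers and are not part of Consul.

```python
import re

# Matches Raft log lines such as:
# "... raft: Node at 10.90.168.42:8300 [Candidate] entering Candidate state in term 3732"
CANDIDATE_RE = re.compile(
    r"Node at (?P<addr>\S+) \[Candidate\] entering Candidate state in term (?P<term>\d+)"
)

# Lines copied from the sample log output in this article.
SAMPLE_LINES = [
    "2020/10/19 16:21:23 [INFO] raft: Node at 10.90.168.42:8300 [Candidate] entering Candidate state in term 3732",
    "2020/10/19 16:21:23 [DEBUG] raft: Votes needed: 2",
    "2020/10/19 16:28:53 [INFO] raft: Node at 00.00.000.00:8300 [Candidate] entering Candidate state in term 992",
    "2020/10/19 16:29:04 [INFO] raft: Node at 00.00.000.00:8300 [Candidate] entering Candidate state in term 989",
]


def latest_candidate_terms(log_lines):
    """Return {node address: most recently logged candidate term}."""
    terms = {}
    for line in log_lines:
        match = CANDIDATE_RE.search(line)
        if match:
            # Later lines overwrite earlier ones, so the last entry per node wins.
            terms[match.group("addr")] = int(match.group("term"))
    return terms


def terms_diverge(terms):
    """Return True when at least two nodes report different terms."""
    return len(set(terms.values())) > 1


if __name__ == "__main__":
    terms = latest_candidate_terms(SAMPLE_LINES)
    for addr, term in sorted(terms.items()):
        print(f"{addr}: term {term}")
    print("terms diverge:", terms_diverge(terms))
```

A small difference in terms can appear briefly during a normal election, so treat a positive result as a prompt to inspect the logs rather than as proof of split-brain.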
Solution
- Verify that you have a recent, valid Consul snapshot. If snapshots were never set up or enabled, expect data loss.
- Check the Consul configuration files and make sure that bootstrap_expect is set to at least 3. In plain terms, setting bootstrap_expect to 3 means that each server waits until it has received a ping from and joined with 2 other Consul servers before it starts an election. To learn more about how Raft elections work, see https://raft.github.io/
- On each server, go to the Consul data directory and delete, move, or rename it.
- Perform a rolling restart of all Consul servers.
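The bootstrap check in the steps above can be scripted when there are many servers to audit. The sketch below is a minimal example, assuming the server configuration is written in JSON (Consul also accepts HCL, which this sketch does not parse). `check_bootstrap_expect` is a hypothetical helper, not a Consul API, and the sample configuration values are placeholders.

```python
import json


def check_bootstrap_expect(config_text, minimum=3):
    """Return a list of problems found in a JSON Consul server configuration."""
    config = json.loads(config_text)
    problems = []

    # Consul documents the legacy bootstrap option as incompatible with
    # bootstrap_expect, so flag configurations that enable it.
    if config.get("bootstrap") is True:
        problems.append("bootstrap is true; use bootstrap_expect instead")

    expect = config.get("bootstrap_expect")
    if expect is None:
        problems.append("bootstrap_expect is not set")
    elif not isinstance(expect, int) or expect < minimum:
        problems.append(f"bootstrap_expect is {expect!r}; expected at least {minimum}")

    return problems


if __name__ == "__main__":
    # Placeholder configuration for illustration only.
    sample_config = '{"server": true, "bootstrap_expect": 1, "data_dir": "/opt/consul"}'
    for problem in check_bootstrap_expect(sample_config):
        print("WARNING:", problem)
```

An empty list means the file passed both checks. Run the check against every server, because one server with a different bootstrap setting is enough to cause trouble.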
Outcome
The cluster should resume normal operation with consistent election terms and no split-brain behavior. Verify by:
- Running consul operator raft list-peers to check Raft peer consistency.
- Observing stable leader election in logs.
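The peer check can also be scripted. The following sketch parses the text printed by `consul operator raft list-peers` and confirms that exactly one server reports itself as leader. It assumes the first five whitespace-separated columns are Node, ID, Address, State, and Voter, which may vary between Consul versions. The node names, IDs, and addresses in the sample are hypothetical, and `parse_raft_peers` and `has_single_leader` are illustrative names.

```python
def parse_raft_peers(output):
    """Parse `consul operator raft list-peers` text output into a list of dicts."""
    lines = [line for line in output.splitlines() if line.strip()]
    peers = []
    for line in lines[1:]:  # the first non-empty line is the header row
        fields = line.split()
        if len(fields) < 5:
            continue  # skip rows that do not match the assumed layout
        peers.append({
            "node": fields[0],
            "id": fields[1],
            "address": fields[2],
            "state": fields[3],
            "voter": fields[4] == "true",
        })
    return peers


def has_single_leader(peers):
    """Return True when exactly one peer is in the leader state."""
    return sum(1 for peer in peers if peer["state"] == "leader") == 1


# Hypothetical output for a three-server cluster.
SAMPLE_OUTPUT = """\
Node      ID    Address        State     Voter  RaftProtocol
server-1  id-1  10.0.0.1:8300  leader    true   3
server-2  id-2  10.0.0.2:8300  follower  true   3
server-3  id-3  10.0.0.3:8300  follower  true   3
"""

if __name__ == "__main__":
    peers = parse_raft_peers(SAMPLE_OUTPUT)
    print("servers listed:", len(peers))
    print("single leader:", has_single_leader(peers))
```

Run the command on each server and compare the results: every server should list the same peers and the same leader.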
If the issue persists, contact HashiCorp support or consult related documentation.