Identifying consul split-brain
- Retrieve the consul server logs from each server by running
consul monitor -log-level=debug
- Wait for the line that states the election term. It should look something like this:
2020/10/19 16:21:23 [INFO] raft: Node at 10.90.168.42:8300 [Candidate] entering Candidate state in term 3732
2020/10/19 16:21:23 [DEBUG] raft: Votes needed: 2
2020/10/19 16:21:23 [DEBUG] raft: Vote granted from foobar in term 3732. Tally: 1
2020/10/19 16:28:53 [WARN] raft: Election timeout reached, restarting election
2020/10/19 16:28:53 [INFO] raft: Node at 00.00.000.00:8300 [Candidate] entering Candidate state in term 992
2020/10/19 16:28:53 [DEBUG] raft: Votes needed: 2
2020/10/19 16:28:53 [DEBUG] raft: Vote granted from foobar2 in term 992. Tally: 1
2020/10/19 16:28:53 [ERR] raft: Failed to make RequestVote RPC to {Voter <Voter ID>}
2020/10/19 16:29:04 [WARN] raft: Election timeout reached, restarting election
2020/10/19 16:29:04 [INFO] raft: Node at 00.00.000.00:8300 [Candidate] entering Candidate state in term 989
2020/10/19 16:29:04 [DEBUG] raft: Votes needed: 2
2020/10/19 16:29:04 [DEBUG] raft: Vote granted from <ID> in term 989. Tally: 1
- If the election terms are not the same across all of your consul servers, the servers are out of sync and you are likely in a split-brain scenario.
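The term comparison above can be scripted once each server's log output has been saved to a file. The following is a minimal sketch, not a definitive tool: the helper names latest_term and terms_agree are hypothetical, and it assumes the log lines have the "entering Candidate state in term N" shape shown in the sample output.

```shell
# latest_term: read consul server log lines on stdin and print the term from
# the most recent "entering Candidate state in term N" line.
# (Hypothetical helper; assumes the log format shown in the sample above.)
latest_term() {
  sed -n 's/.*entering Candidate state in term \([0-9][0-9]*\).*/\1/p' | tail -n 1
}

# terms_agree: succeed only if every term passed as an argument is identical.
# (Hypothetical helper.)
terms_agree() {
  first="$1"
  for t in "$@"; do
    [ "$t" = "$first" ] || return 1
  done
  return 0
}
```

For example, if the logs were captured to server1.log, server2.log and server3.log (illustrative file names), then terms_agree "$(latest_term < server1.log)" "$(latest_term < server2.log)" "$(latest_term < server3.log)" returns non-zero when the servers disagree on the term. Because consul monitor streams continuously, capture its output to a file for a while first instead of piping it directly into latest_term.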
Recovering from a consul split-brain
- Verify that you have a recent, valid consul snapshot. If consul snapshots aren't set up or enabled at all, expect data loss.
- Check the consul config files and make sure that bootstrap_expect is set to at least 3. In layman's terms, setting bootstrap_expect to 3 means that each individual server waits until it has received a ping from and joined with 2 other consul servers before it starts an election. To learn more about how consul elections are done, see https://raft.github.io/
- Locate the consul data directory (the data_dir setting in the config) and delete, move, or rename it.
- Perform a rolling restart of all consul servers.
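The pre-checks and the data directory step above could be scripted roughly as follows. This is a sketch under stated assumptions rather than an official procedure: the helper names snapshot_is_recent, bootstrap_expect_of and set_aside_data_dir are hypothetical, the config is assumed to set bootstrap_expect on a single line (HCL or JSON), and find is assumed to support -mmin (GNU and BSD find do).

```shell
# snapshot_is_recent FILE MINUTES
# Succeeds if FILE exists and was modified less than MINUTES minutes ago.
# (Hypothetical helper; relies on find's -mmin, which is not strictly POSIX.)
snapshot_is_recent() {
  [ -f "$1" ] && [ -n "$(find "$1" -mmin "-$2")" ]
}

# bootstrap_expect_of FILE
# Prints the first bootstrap_expect value found in a consul config file,
# accepting HCL (bootstrap_expect = 3) or JSON ("bootstrap_expect": 3).
# (Hypothetical helper; assumes the key and value sit on one line.)
bootstrap_expect_of() {
  sed -n 's/.*bootstrap_expect"\{0,1\}[[:space:]]*[=:][[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# set_aside_data_dir DIR
# Renames DIR to DIR.bak.<epoch seconds> rather than deleting it, and prints
# the new name so the old data can still be inspected later.
# (Hypothetical helper.)
set_aside_data_dir() {
  backup="$1.bak.$(date +%s)"
  mv "$1" "$backup" && echo "$backup"
}
```

With illustrative paths, snapshot_is_recent /var/backups/consul/latest.snap 60 checks the snapshot's age, and consul snapshot inspect on the same file checks that it is readable. bootstrap_expect_of /etc/consul.d/server.hcl should print 3 or more, and set_aside_data_dir /opt/consul moves the data directory out of the way. All three paths are assumptions; substitute the locations used in your environment. Renaming instead of deleting is chosen here so the old data stays available if the recovery needs to be revisited.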