Introduction
Consul uses Raft as its consensus protocol, which is used during leader elections. At a predefined interval, Consul checks the health of the nodes within the cluster; if the leader node becomes unavailable, a new leadership election begins. Generally, the leadership election process is quick and does not run into issues; however, there are times when the election is unable to move forward.
Problem
If you suspect that your cluster does not have a leader, you can run the command consul operator raft list-peers, which will return the following error:
Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)
Generally, when the election is unable to complete, there will be indicators in the agent logs that help you identify the root cause of the failure. As a good first step, review the state of the cluster with the command consul members, which should give you an idea of which servers are part of the cluster.
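The output will look roughly like the following (the node names, addresses, and versions here are hypothetical, and the exact columns vary slightly between Consul versions). Server agents report Type server, and healthy members report Status alive:
consul members
Node     Address        Status  Type    Build   Protocol  DC   Segment
server1  10.1.0.1:8301  alive   server  1.15.2  2         dc1  <all>
server2  10.1.0.2:8301  alive   server  1.15.2  2         dc1  <all>
server3  10.1.0.3:8301  failed  server  1.15.2  2         dc1  <all>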
Possible Causes & Solutions
Bootstrap Expect has not yet been met
The server attempts to start up and the following is seen in the logs:
[WARN] agent: bootstrap_expect > 0: expecting 3 servers
[WARN] agent.auto_config: bootstrap_expect > 0: expecting 3 servers
Solution/Troubleshooting
- The Consul server agents are expecting X servers, as defined in the Consul configuration via the bootstrap_expect parameter.
- The bootstrap_expect setting is only referenced when Consul's configured data_dir location does not contain any existing Consul directories or data (i.e., it is empty).
- Ensure the required number of Consul server agents are started and attempting to connect to each other.
- Review Consul's configuration to verify that retry_join is set with the appropriate servers (see the sample configuration after this list).
- If using Cloud Auto-Joining, make sure that the server instances are tagged according to the provider-specific configuration.
- Ensure all required Consul ports are open for traffic between the servers on the network.
- Review the logs of each agent to identify whether it is running into issues connecting to the other nodes.
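As a reference point, below is a minimal sketch of a server agent configuration fragment showing bootstrap_expect and retry_join together, written as a shell heredoc so it can be dropped into place. The file path, addresses, and data_dir are assumptions to adapt to your environment:
# Hypothetical server configuration fragment; adjust the path, addresses,
# and data_dir to match your environment.
cat > /etc/consul.d/server.json <<'EOF'
{
  "server": true,
  "bootstrap_expect": 3,
  "retry_join": ["10.1.0.1", "10.1.0.2", "10.1.0.3"],
  "data_dir": "/opt/consul"
}
EOF
With bootstrap_expect set to 3, the election will not proceed until three server agents have joined (via retry_join or Cloud Auto-Join) and can reach each other on the required ports.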
Corrupt Raft Database
The server attempts to start up and the following is seen in the logs:
panic: invalid freelist page: XXXXXX
- This error is typically due to a corrupted Raft database, which is generally caused by storage issues on the Consul server.
- For example, Consul's storage backend ran out of space, causing the leader node to enter an unstable state.
Solution/Troubleshooting
- Stop the Consul service.
- Move the Raft directory to a larger volume.
- Update the Consul configuration file (data_dir) to use the new path.
- Restart Consul (see the sketch after these steps).
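A minimal sketch of these steps, assuming a systemd-managed Consul service and hypothetical paths. Because the Raft data lives inside the configured data_dir, the whole data directory is moved here:
systemctl stop consul
# Move the data directory (which contains the raft/ sub-directory) to a
# volume with enough free space; both paths are assumptions.
mv /opt/consul /mnt/consul-data
# Update data_dir in the agent configuration to the new location,
# e.g. "data_dir": "/mnt/consul-data", then restart the agent.
systemctl start consul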
Unstable Leader
Follower nodes might not become the leader after the previous leader becomes unstable, and log entries similar to the following may appear:
consul[14326]: {"@level":"warn","@message":"unable to get address for server, using fallback address","error":"Could not find address for server id "
{"@level":"error","@message":"failed to make requestVote RPC","error":"dial tcp :8300: connect: connection refused"
Solution/Troubleshooting
- Stop all remaining servers by executing the consul leave command. If the leave exits with an error, kill the agent forcibly.
- Go to the data-dir of each Consul server.
- Inside the data-dir, there will be a raft/ sub-directory. Create a raft/peers.json file.
- For Raft protocol version 3 and later, this should be formatted as a JSON array containing the node ID, address:port, and suffrage information of each Consul server in the cluster, like the below example:
[
{
"id": "adf4238a-882b-9ddc-4a9d-5b6758e4159e",
"address": "10.1.0.1:8300",
"non_voter": false
},
{
"id": "8b6dda82-3103-11e7-93ae-92361f002671",
"address": "10.1.0.2:8300",
"non_voter": false
},
{
"id": "97e17742-3103-11e7-93ae-92361f002671",
"address": "10.1.0.3:8300",
"non_voter": false
}
]
- Run this command to quickly generate the data using Raft protocol version 3:
curl -s localhost:8500/v1/agent/members | jq '[ .[] | select(.Tags.role == "consul" and .Status == 1) | {id: .Tags.id, address: "\(.Addr):\(.Tags.port)", non_voter: false} ]'
CAUTION: Please double-check that the content returned is accurate before redirecting it to the <data_dir>/raft/peers.json file.
- If the server is configured for Raft protocol version 2 or earlier, the peers.json file should be formatted as a JSON array containing the address and port of each Consul server in the cluster, as shown in the example below:
["10.1.0.1:8300", "10.1.0.2:8300", "10.1.0.3:8300"]
- Run this command to quickly generate the data using Raft protocol version 2:
curl -s localhost:8500/v1/agent/members | jq '[ .[] | select(.Tags.role == "consul" and .Status == 1) | "\(.Addr):\(.Tags.port)" ]'
CAUTION: Please double-check that the content returned is accurate before redirecting it to the <data_dir>/raft/peers.json file.
- Ensure that this file is the same across all remaining server nodes.
- Now restart all the remaining servers.
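A brief sketch of the restart and verification, assuming a systemd-managed Consul service:
# On each remaining server, once its raft/peers.json file is in place:
systemctl start consul
# After the servers are back up, confirm that a leader has been elected
# and that the peer set matches the peers.json you created:
consul operator raft list-peers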
Only attempt if the peers.json method fails
- The bootstrap_expect value can be adjusted to 1 on the available nodes.
- Make a backup of the raft directory, then delete the raft directory on the non-working "Leader" node and restart the Consul service.
- Ensure that the retry_join parameter is set in your Consul configuration.
- The former leader node should come up and sync its newly created Raft database with another cluster member: it joins the cluster as a follower of another node, which becomes the new leader after seeing that the former leader wants to join the cluster and requests a database sync.
- An X-node cluster should be up and running at this point.
- Additional steps to complete the restoration of a 3-node cluster:
- Add the third node back in with a bootstrap_expect value of 1.
- After a soak period to allow the node to sync with the leader, change bootstrap_expect back to 3 on all three nodes and perform a rolling restart of the three nodes to apply the new value (see the sketch below). Note: a new leader election will occur when the leader node is restarted; this is expected behavior.
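A hedged sketch of that rolling restart, assuming a systemd-managed Consul service and a hypothetical configuration path; run it on one server at a time and wait for a healthy leader before moving to the next node:
# Restore the expected server count in the configuration (the path and
# original value below are assumptions), then restart this server:
sed -i 's/"bootstrap_expect": 1/"bootstrap_expect": 3/' /etc/consul.d/server.json
systemctl restart consul
# Confirm a leader is present before moving on to the next server:
consul operator raft list-peers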
Additional Information
- Consul Outage Recovery
- Bootstrap Options -bootstrap-expect
- DNS and Domain Options -retry-join
- Consul Required Ports