Problem
- The active node in a Vault HA cluster using Integrated Storage (Raft) fails, but failover does not trigger: no available standby node is promoted to be the new leader.
- Restarting the standby nodes does not trigger a leader election.
Prerequisites
- A Vault HA cluster is only available through Vault Enterprise. In addition, this article assumes the Integrated Storage (Raft) backend.
Cause
- The active node can fail for several common reasons, such as insufficient disk space, network problems that disrupt communication between nodes, or poor performance of the Vault storage.
- For failover to trigger and a new leader to be elected, enough healthy standby nodes must remain to reach a quorum, that is, a majority of the voting nodes in the Raft cluster (for example, 2 of 3 or 3 of 5).
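The quorum requirement can be checked with simple arithmetic. The following is a minimal sketch, assuming every node in the cluster is a voter; the function names are hypothetical helpers for illustration and are not part of the Vault CLI.

```shell
# Quorum is a strict majority of the voting nodes.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Number of voting nodes that can fail while a quorum is still reachable.
failure_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

# Succeeds (exit 0) if the healthy nodes are enough to elect a leader.
# Arguments: total voting nodes, currently healthy voting nodes.
can_elect_leader() {
  [ "$2" -ge "$(quorum_size "$1")" ]
}
```

For example, `can_elect_leader 3 1` fails: with only one healthy node left out of three, the remaining node cannot elect a leader on its own, which matches the symptom described in the Problem section.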
Solutions
- First, identify why the active node failed. Review the Vault operational logs, starting from the time the issue began. The common causes are outlined in the Cause section above.
- Review the logs of the other nodes in the Vault cluster and confirm that they can communicate with each other; otherwise, they cannot reach a quorum and no leader election can occur.
- Note that it may be possible to join or remove peers using the vault operator raft commands. If some nodes are unhealthy, it may be possible to add a new node to the Raft cluster and remove the existing unhealthy node.
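As an illustration of replacing an unhealthy peer, the sketch below wraps the `vault operator raft remove-peer` and `vault operator raft join` subcommands in two small functions. The function names, the node ID, and the address in the usage comments are hypothetical placeholders. Both commands are handled by the active node, so this assumes the cluster still has, or has regained, a leader.

```shell
# Hypothetical helper: remove an unhealthy peer from the Raft cluster.
# Argument: the node ID as shown by `vault operator raft list-peers`.
remove_unhealthy_peer() {
  vault operator raft remove-peer "$1"
}

# Hypothetical helper: run on the NEW node to join it to the cluster.
# Argument: the API address of a node that is already in the cluster.
join_cluster() {
  vault operator raft join "$1"
}

# Usage (placeholder values; requires a valid VAULT_ADDR and token):
#   remove_unhealthy_peer vault_3
#   join_cluster https://vault-1.example.com:8200
```

Adding the replacement node before removing the unhealthy one keeps the number of healthy voters as high as possible during the change.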
Outcome
- The vault operator raft list-peers command shows whether there is an active node: the active node is listed with the state leader.
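The check on the list-peers output can be scripted. This is a minimal sketch, assuming the default table output with the columns Node, Address, State, and Voter; `has_leader` is a hypothetical helper name, not a Vault command.

```shell
# Reads `vault operator raft list-peers` table output on stdin and
# succeeds (exit 0) only if some peer reports the state "leader".
# Assumes the default table layout: Node, Address, State, Voter.
has_leader() {
  awk 'NR > 2 && $3 == "leader" { found = 1 } END { exit !found }'
}

# Usage against a live cluster (requires a valid VAULT_ADDR and token):
#   vault operator raft list-peers | has_leader && echo "active node present"
```

If the command itself fails and prints no table, `has_leader` receives no input and reports that no leader was found.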