Vault HA Cluster Failover Issues with Integrated Storage/Raft

Jason Peng

December 20, 2024 17:27
Updated

Problem

When the active node in a Vault HA cluster using Integrated Storage (Raft) fails, failover is not triggered: no available standby node is promoted to become the new leader.
Restarting the standby nodes does not trigger a leader election either.

Prerequisites

This article assumes a Vault HA cluster that uses the Integrated Storage (Raft) backend. HA with Integrated Storage is available in both Vault Community Edition and Vault Enterprise.

Cause

The active node can fail for several common reasons, such as insufficient disk space, network problems that prevent communication between nodes, or poor performance of the Vault storage.
For failover to trigger and a new leader to be elected, enough healthy standby nodes must remain to reach quorum, that is, a majority of the voting nodes in the Raft cluster.
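The quorum requirement can be made concrete with a little arithmetic. The following is a minimal sketch of the standard Raft majority rule; the function names are illustrative and are not part of Vault.

```python
def quorum_size(voters: int) -> int:
    """Smallest number of voting nodes that forms a strict majority."""
    if voters < 1:
        raise ValueError("a cluster needs at least one voter")
    return voters // 2 + 1


def failure_tolerance(voters: int) -> int:
    """Number of voters that can be lost while a majority still exists."""
    return voters - quorum_size(voters)


def can_elect_leader(voters: int, healthy: int) -> bool:
    """True if the healthy voters still form a majority of all configured voters."""
    return healthy >= quorum_size(voters)
```

For example, a 3-node cluster has a quorum of 2 and tolerates one failure. If the active node is down and one standby is also unhealthy or unreachable, the single remaining node cannot win an election, no matter how often it is restarted.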

Solutions

First, identify why the active node failed. Review the Vault operational logs, starting from the time the issue began. The common causes are outlined in the Cause section above.
Next, review the logs of the other nodes in the Vault cluster. The remaining nodes must be able to communicate with each other; otherwise they cannot reach quorum and no leader election can occur.
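Alongside the logs, a basic TCP reachability test between nodes can help rule out networking problems. The sketch below assumes the default Vault cluster port 8201; a successful connection shows only that the port is reachable, not that Raft traffic is healthy.

```python
import socket

CLUSTER_PORT = 8201  # Vault's default cluster port; change it if your cluster_addr uses another


def can_reach(host: str, port: int = CLUSTER_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def unreachable_peers(peers: list[str], port: int = CLUSTER_PORT) -> list[str]:
    """Return the subset of peers that could not be reached on the given port."""
    return [host for host in peers if not can_reach(host, port)]
```

Run it from each surviving node against the others, for example unreachable_peers(["vault-2.example.internal", "vault-3.example.internal"]), where the host names are hypothetical placeholders. A non-empty result points to a firewall, routing, or DNS problem to resolve before an election can succeed.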

Note that peers can be joined or removed with the vault operator raft commands (vault operator raft join and vault operator raft remove-peer). If some nodes are unhealthy, it may be possible to add a new node to the Raft cluster and remove the unhealthy node.
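As an illustration of that replacement sequence, the sketch below only builds the two command lines and does not execute anything. The helper name, the leader address, and the node ID are hypothetical; check the exact arguments against the Vault CLI documentation for your version.

```python
import shlex


def replace_node_commands(leader_api_addr: str, unhealthy_node_id: str) -> list[str]:
    """Build the CLI commands for swapping an unhealthy Raft peer for a new node."""
    # Run on the NEW node so that it joins the existing cluster.
    join = ["vault", "operator", "raft", "join", leader_api_addr]
    # Run against the cluster to drop the unhealthy peer from the Raft configuration.
    remove = ["vault", "operator", "raft", "remove-peer", unhealthy_node_id]
    return [shlex.join(join), shlex.join(remove)]
```

Calling replace_node_commands("https://vault-1.example.internal:8200", "vault-3") returns the join command followed by the remove-peer command. Joining the new node first avoids shrinking the set of healthy voters any further while the cluster is already degraded.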

Outcome

The vault operator raft list-peers command shows whether the cluster currently has an active node (leader).
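When this check needs to be scripted, the JSON form of the output can be inspected instead of the table. The sketch below assumes the payload produced by vault operator raft list-peers -format=json has the shape shown in SAMPLE (servers under data.config.servers, each with node_id and leader fields); the sample values are made up, so verify the shape against your Vault version.

```python
import json
from typing import Optional

# Illustrative payload only; node IDs and addresses are invented.
SAMPLE = """
{
  "data": {
    "config": {
      "servers": [
        {"node_id": "vault-1", "address": "10.0.0.1:8201", "leader": false, "voter": true},
        {"node_id": "vault-2", "address": "10.0.0.2:8201", "leader": true, "voter": true},
        {"node_id": "vault-3", "address": "10.0.0.3:8201", "leader": false, "voter": true}
      ]
    }
  }
}
"""


def find_leader(list_peers_json: str) -> Optional[str]:
    """Return the node_id of the peer marked as leader, or None if there is none."""
    payload = json.loads(list_peers_json)
    servers = payload.get("data", {}).get("config", {}).get("servers", [])
    for server in servers:
        if server.get("leader"):
            return server.get("node_id")
    return None
```

A return value of None means no peer is marked as leader, which corresponds to the failed-failover state described in the Problem section.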

Additional Information
