Introduction
Problem
New Raft nodes fail to join the cluster
Typical symptoms: After an attempt to restore Raft DB node(s), the restored nodes do not join the cluster after being restarted and unsealed. Ref: https://developer.hashicorp.com/vault/tutorials/raft/raft-lost-quorum
The following errors are noted in the logs:
Aug 30 14:46:09 vault-server[3329]: 2023-08-30T14:46:09.175Z [INFO] core: security barrier not initialized
Aug 30 14:46:09 vault-server[3329]: 2023-08-30T14:46:09.175Z [INFO] core.autoseal: seal configuration missing, but cannot check old path as core is sealed: seal_type=recovery
Aug 30 14:46:10 vault-server[3329]: 2023-08-30T14:46:10.908Z [INFO] core: security barrier not initialized
Aug 30 14:46:10 vault-server[3329]: 2023-08-30T14:46:10.908Z [INFO] core: attempting to join possible raft leader node: leader_addr=https://vault-server.url:8200
Aug 30 14:46:10 vault-server[3329]: 2023-08-30T14:46:10.908Z [INFO] core: attempting to join possible raft leader node: leader_addr=https://vault-server.url:8200
Aug 30 14:46:10 vault-server[3329]: 2023-08-30T14:46:10.908Z [INFO] core: attempting to join possible raft leader node: leader_addr=https://vault-server.url:8200
Aug 30 14:46:10 vault-server[3329]: 2023-08-30T14:46:10.908Z [INFO] core: attempting to join possible raft leader node: leader_addr=https://vault-server.url:8200
Aug 30 14:46:10 vault-server[3329]: 2023-08-30T14:46:10.981Z [ERROR] core: failed to retry join raft cluster: retry=2s
Aug 30 14:46:12 vault-server[3329]: 2023-08-30T14:46:12.982Z [INFO] core: security barrier not initialized
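To check how often this signature occurs, the log text can be filtered for the retry-join error. The following is a minimal sketch; the function name count_join_failures is hypothetical, and the sample input is two lines taken from the excerpt above.

```shell
#!/bin/sh
# Count retry-join failures in Vault log text read from stdin.
count_join_failures() {
    grep -c 'failed to retry join raft cluster'
}

# Example using two lines copied from the log excerpt above.
printf '%s\n' \
  '2023-08-30T14:46:10.908Z [INFO] core: attempting to join possible raft leader node: leader_addr=https://vault-server.url:8200' \
  '2023-08-30T14:46:10.981Z [ERROR] core: failed to retry join raft cluster: retry=2s' \
  | count_join_failures
```

On a systemd host the input could come from journalctl instead of printf; the unit name depends on the deployment and is not given in this article.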
However, running the vault status command shows Vault as unsealed:
server# vault status
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    5
Threshold                3
Version                  1.10.3+ent
Storage Type             raft
Cluster Name             vault-cluster
Cluster ID               c7q1234ke-fdbe-e805-df49-a3aedfa8ks
HA Enabled               true
HA Cluster               https://vault-server.url:8201
HA Mode                  active
Active Since             2023-08-30T13:36:13.376802075Z
Raft Committed Index     708726011
Raft Applied Index       708726022
Last WAL                 413797054
Prerequisites (if applicable)
- Running a Vault version prior to 1.12.9, 1.13.5, or 1.14.0
- Running integrated storage (Raft)
Cause
- Detailed description of issue: Prior to v1.12.9, v1.13.5, and v1.14.0, a newly initialized node added to a Raft cluster whose Raft Applied Index exceeded the Raft Autopilot max_trailing_logs value (default 1000) would attempt to join the cluster and begin its bootstrap, but would be left in an incomplete state.
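As a rough illustration of the condition above, the following shell sketch compares an applied index against max_trailing_logs. The function name join_likely_affected is hypothetical; the example values are the Raft Applied Index from the status output earlier in this article and the default of 1000.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) when the applied index
# exceeds max_trailing_logs, the condition under which joins were left incomplete.
join_likely_affected() {
    applied_index="$1"
    max_trailing_logs="$2"
    [ "$applied_index" -gt "$max_trailing_logs" ]
}

# Applied index from the status output above, compared with the default of 1000.
if join_likely_affected 708726022 1000; then
    echo "affected: applied index exceeds max_trailing_logs"
else
    echo "not affected"
fi
```

On a live cluster the two inputs could be read from `vault status -format=json` and `vault operator raft autopilot get-config`; the exact JSON field names should be confirmed against the Vault version in use.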
Workaround details
Overview of possible solutions (if applicable)
Solutions:
- Workaround: On versions prior to v1.12.9, v1.13.5, and v1.14.0, increase max_trailing_logs to a value in excess of the Raft Applied Index, re-initialize any nodes that were left in an incomplete state, and retry joining them to the cluster.
  - Caution is advised with this approach for Committed Indexes over 100,000,000. We have observed that setting this value to very high numbers (e.g., 10,000,000,000) can have a deleterious effect on the cluster: the active node can become unresponsive to client requests while still sending Raft heartbeats, which prevents an election from occurring and thereby causes a partial outage.
- Upgrade to an unaffected version.
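The workaround option can be sketched as a dry run that only prints the commands involved. The helper names and the headroom added on top of the applied index are assumptions made for illustration; the index values and leader address are the ones shown earlier in this article. Re-initializing the incomplete nodes is deployment-specific and is not shown.

```shell
#!/bin/sh
# Dry-run sketch of the workaround: it only prints commands, it does not run them.

# Hypothetical helper: pick a max_trailing_logs value above the applied index.
# The headroom (default 1,000,000 entries) is an assumption, not a recommendation.
target_trailing_logs() {
    applied_index="$1"
    headroom="${2:-1000000}"
    echo $((applied_index + headroom))
}

# Hypothetical helper: succeeds when the committed index is above the
# 100,000,000 mark at which the article advises caution.
needs_caution() {
    [ "$1" -gt 100000000 ]
}

committed=708726011                     # Raft Committed Index from the status output
applied=708726022                       # Raft Applied Index from the status output
leader="https://vault-server.url:8200"  # leader_addr from the log excerpt
target=$(target_trailing_logs "$applied")

if needs_caution "$committed"; then
    echo "# caution: committed index above 100,000,000, raise max_trailing_logs carefully"
fi
echo "# on the active node:"
echo "vault operator raft autopilot set-config -max-trailing-logs=${target}"
echo "# on each re-initialized node:"
echo "vault operator raft join ${leader}"
echo "# verify membership:"
echo "vault operator raft list-peers"
```

Printing the commands instead of running them keeps the sketch safe to try; each command should be reviewed and run by hand against the real cluster.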
Outcome
Resolved in versions 1.12.9, 1.13.5, and 1.14.0.