Problem statement
Vault Enterprise introduced the Automated Upgrades feature starting with version 1.11.0. However, we have seen that many customers use their own upgrade strategy (AMI upgrades, VM recreation, etc.) instead. When such customers perform a rolling upgrade from Vault 1.11.0 to 1.11.x (>0), the cluster loses quorum and remains down until the active node is upgraded to 1.11.x (>0). The reason is that the automated upgrade migrations feature is enabled by default in 1.11.x: a standby node running a newer version than the active node joins the cluster as a NON-voter and is not allowed to vote while the active node is still on the lower version, so upgrading the standbys first brings the cluster into a quorum-lost situation.
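Before and during a manual rolling upgrade, it helps to verify which version each node is actually running. A minimal check below, run once per node; the address is illustrative (taken from the test cluster in this article) and assumes a non-TLS listener on port 8200:
$ VAULT_ADDR=http://192.168.50.40:8200 vault status -format=json | grep '"version"'
#   "version": "1.11.3",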
I reproduced this with a 3-node cluster, upgrading Vault from 1.11.0 to 1.11.3. After upgrading one standby node from 1.11.0 to 1.11.3, I ran the below command on the active node:
$ vault operator raft autopilot state
# Output:
#   Upgrade Info:
#     Status: await-new-voters
#     Target Version: 1.11.3
#     Target Version Voters:
#     Target Version Non-Voters: node_vaults0
#     Other Version Voters: node_vaults1, node_vaults2
#     Other Version Non-Voters:
The standby node node_vaults0 appeared as a non-voter. When I then upgraded the second standby node, quorum was lost and all nodes went into standby state until I upgraded the last remaining node, the active one, to the target version (1.11.3 in my case). With both standbys demoted to non-voters, the active node was the only voter left, which is below the two-voter quorum a 3-node cluster requires.
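The non-voter status can also be confirmed from the peer list. A minimal sketch below; the node names come from the test cluster above, and the exact output is illustrative:
$ vault operator raft list-peers
# Node            Address               State       Voter
# ----            -------               -----       -----
# node_vaults1    192.168.50.41:8201    leader      true
# node_vaults2    192.168.50.42:8201    follower    true
# node_vaults0    192.168.50.40:8201    follower    false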
At this point, the following errors were visible on the last active Vault node:
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.111Z [DEBUG] core.cluster-listener: creating rpc dialer: address=192.168.50.42:8201 alpn=raft_storage_v1 host=raft-023bf9ba-536f-a7ee-14f1-16339682d2b0
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.112Z [ERROR] storage.raft: failed to heartbeat to: peer=192.168.50.42:8201 error="dial tcp 192.168.50.42:8201: connect: connection refused"
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.284Z [WARN] storage.raft: failed to contact: server-id=node_vaults2 time=2.524339216s
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.284Z [WARN] storage.raft: failed to contact quorum of nodes, stepping down
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.284Z [INFO] storage.raft: entering follower state: follower="Node at 192.168.50.41:8201 [Follower]" leader-address= leader-id=
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.285Z [INFO] storage.raft: aborting pipeline replication: peer="{Voter node_vaults0 192.168.50.40:8201}"
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.285Z [WARN] core: leadership lost, stopping active operation
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.285Z [ERROR] replication.index.perf: failed to persist checkpoint: error="failed to persist: leadership lost while committing log"
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.285Z [INFO] core: pre-seal teardown starting
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.285Z [TRACE] core.snapshotmgr: shutting down automatic snapshots
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.285Z [DEBUG] storage.raft.autopilot: state update routine is now stopped
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.286Z [DEBUG] storage.raft.autopilot: autopilot is now stopped
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.286Z [ERROR] replication.index.local: failed to persist checkpoint: error="failed to persist: node is not the leader"
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.286Z [ERROR] replication.index.periodic: failed to save checkpoint:
Sep 15 10:04:52 vaults1 vault[3943]: error=
Sep 15 10:04:52 vaults1 vault[3943]: | 2 errors occurred:
Sep 15 10:04:52 vaults1 vault[3943]: | * failed to persist checkpoint: failed to persist: leadership lost while committing log
Sep 15 10:04:52 vaults1 vault[3943]: | * failed to persist checkpoint: failed to persist: node is not the leader
Sep 15 10:04:52 vaults1 vault[3943]: |
Sep 15 10:04:52 vaults1 vault[3943]:
Solution
When upgrading Vault from a version lower than 1.11.x to 1.11.x, we should disable the default automated upgrade migrations feature of Vault before starting the upgrade.
Below is the output when the automated upgrade migrations feature is enabled; note the Disable Upgrade Migration key, which defaults to false.
$ vault operator raft autopilot get-config
# Key                                   Value
# ---                                   -----
# Cleanup Dead Servers                  false
# Last Contact Threshold                10s
# Dead Server Last Contact Threshold    24h0m0s
# Server Stabilization Time             10s
# Min Quorum                            0
# Max Trailing Logs                     1000
# Disable Upgrade Migration             false
To disable this feature, run the below command on the active node:
$ vault operator raft autopilot set-config -disable-upgrade-migration=true
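Alternatively, the same setting can be applied through the Autopilot configuration API instead of the CLI. A minimal sketch, assuming Vault is reachable on 127.0.0.1:8200 without TLS and a token with sufficient privileges is exported in VAULT_TOKEN; on success the endpoint returns an empty 204 response:
$ curl -s \
    --header "X-Vault-Token: $VAULT_TOKEN" \
    --request POST \
    --data '{"disable_upgrade_migration": true}' \
    http://127.0.0.1:8200/v1/sys/storage/raft/autopilot/configuration
Either way, verify the change: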
$ vault operator raft autopilot get-config
# Key                                   Value
# ---                                   -----
# Cleanup Dead Servers                  false
# Last Contact Threshold                10s
# Dead Server Last Contact Threshold    24h0m0s
# Server Stabilization Time             10s
# Min Quorum                            0
# Max Trailing Logs                     1000
# Disable Upgrade Migration             true
The Disable Upgrade Migration parameter now reads true, which disables the default feature; with it disabled, the quorum-lost issue no longer occurs during the rolling upgrade.
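Once every node in the cluster is running the target version, the feature can be turned back on so that future upgrades can take advantage of Automated Upgrades again; this is simply the same flag set back to its default:
$ vault operator raft autopilot set-config -disable-upgrade-migration=false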