Problem statement
Vault Enterprise introduced the Automated Upgrades feature starting with version 1.11.0. However, we have seen that many customers use their own upgrade strategy (AMI upgrades, VM recreation, etc.) instead. When such customers perform a rolling upgrade from Vault 1.11.0 to 1.11.x (>0), the cluster loses quorum and remains down until the active node is upgraded to 1.11.x (>0). The reason is that the automated upgrade migrations feature is enabled by default in 1.11.x: a standby node running a newer version than the active node joins the cluster as a NON-voter and is not allowed to vote while the active node is still on the lower version, so upgrading the standbys first brings the cluster into a quorum-lost situation.
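Before and during a manual rolling upgrade, it helps to verify which version each node is actually running. A minimal check below, run once per node; the address is illustrative (taken from the test cluster in this article) and assumes a non-TLS listener on port 8200:
$ VAULT_ADDR=http://192.168.50.40:8200 vault status -format=json | grep '"version"'
#   "version": "1.11.3",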
I reproduced this with a 3-node cluster, upgrading Vault from 1.11.0 to 1.11.3. After upgrading one standby node from 1.11.0 to 1.11.3, I ran the below command on the active node:
$ vault operator raft autopilot state
# Output:
#   Upgrade Info:
#     Status: await-new-voters
#     Target Version: 1.11.3
#     Target Version Voters:
#     Target Version Non-Voters: node_vaults0
#     Other Version Voters: node_vaults1, node_vaults2
#     Other Version Non-Voters:
The standby node node_vaults0 appeared as a non-voter. When I then upgraded the second standby node, quorum was lost and all nodes went into standby state until I upgraded the last remaining node, the active one, to the target version (1.11.3 in my case). With both standbys demoted to non-voters, the active node was the only voter left, which is below the two-voter quorum a 3-node cluster requires.
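The non-voter status can also be confirmed from the peer list. A minimal sketch below; the node names come from the test cluster above, and the exact output is illustrative:
$ vault operator raft list-peers
# Node            Address               State       Voter
# ----            -------               -----       -----
# node_vaults1    192.168.50.41:8201    leader      true
# node_vaults2    192.168.50.42:8201    follower    true
# node_vaults0    192.168.50.40:8201    follower    false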
At this point, the following errors were visible on the last active Vault node:
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.111Z [DEBUG] core.cluster-listener: creating rpc dialer: address=192.168.50.42:8201 alpn=raft_storage_v1 host=raft-023bf9ba-536f-a7ee-14f1-16339682d2b0
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.112Z [ERROR] storage.raft: failed to heartbeat to: peer=192.168.50.42:8201 error="dial tcp 192.168.50.42:8201: connect: connection refused"
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.284Z [WARN] storage.raft: failed to contact: server-id=node_vaults2 time=2.524339216s
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.284Z [WARN] storage.raft: failed to contact quorum of nodes, stepping down
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.284Z [INFO] storage.raft: entering follower state: follower="Node at 192.168.50.41:8201 [Follower]" leader-address= leader-id=
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.285Z [INFO] storage.raft: aborting pipeline replication: peer="{Voter node_vaults0 192.168.50.40:8201}"
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.285Z [WARN] core: leadership lost, stopping active operation
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.285Z [ERROR] replication.index.perf: failed to persist checkpoint: error="failed to persist: leadership lost while committing log"
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.285Z [INFO] core: pre-seal teardown starting
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.285Z [TRACE] core.snapshotmgr: shutting down automatic snapshots
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.285Z [DEBUG] storage.raft.autopilot: state update routine is now stopped
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.286Z [DEBUG] storage.raft.autopilot: autopilot is now stopped
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.286Z [ERROR] replication.index.local: failed to persist checkpoint: error="failed to persist: node is not the leader"
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.286Z [ERROR] replication.index.periodic: failed to save checkpoint:
Sep 15 10:04:52 vaults1 vault[3943]: error=
Sep 15 10:04:52 vaults1 vault[3943]: | 2 errors occurred:
Sep 15 10:04:52 vaults1 vault[3943]: | * failed to persist checkpoint: failed to persist: leadership lost while committing log
Sep 15 10:04:52 vaults1 vault[3943]: | * failed to persist checkpoint: failed to persist: node is not the leader
Sep 15 10:04:52 vaults1 vault[3943]: |
Sep 15 10:04:52 vaults1 vault[3943]:
Solution
When upgrading Vault from a version lower than 1.11.x to 1.11.x, we should disable the default automated upgrade migrations feature of Vault before starting the upgrade.
Below is the output when the automated upgrade migrations feature is enabled; note the Disable Upgrade Migration key, which defaults to false.
$ vault operator raft autopilot get-config
# Key                                   Value
# ---                                   -----
# Cleanup Dead Servers                  false
# Last Contact Threshold                10s
# Dead Server Last Contact Threshold    24h0m0s
# Server Stabilization Time             10s
# Min Quorum                            0
# Max Trailing Logs                     1000
# Disable Upgrade Migration             false
To disable this feature, run the below command on the active node:
$ vault operator raft autopilot set-config -disable-upgrade-migration=true
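Alternatively, the same setting can be applied through the Autopilot configuration API instead of the CLI. A minimal sketch, assuming Vault is reachable on 127.0.0.1:8200 without TLS and a token with sufficient privileges is exported in VAULT_TOKEN; on success the endpoint returns an empty 204 response:
$ curl -s \
    --header "X-Vault-Token: $VAULT_TOKEN" \
    --request POST \
    --data '{"disable_upgrade_migration": true}' \
    http://127.0.0.1:8200/v1/sys/storage/raft/autopilot/configuration
Either way, verify the change: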
$ vault operator raft autopilot get-config
# Key                                   Value
# ---                                   -----
# Cleanup Dead Servers                  false
# Last Contact Threshold                10s
# Dead Server Last Contact Threshold    24h0m0s
# Server Stabilization Time             10s
# Min Quorum                            0
# Max Trailing Logs                     1000
# Disable Upgrade Migration             true
The Disable Upgrade Migration parameter now reads true, which disables the default feature; with it disabled, the quorum-lost issue no longer occurs during the rolling upgrade.
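Once every node in the cluster is running the target version, the feature can be turned back on so that future upgrades can take advantage of Automated Upgrades again; this is simply the same flag set back to its default:
$ vault operator raft autopilot set-config -disable-upgrade-migration=false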