Problem statement
Vault Enterprise provides automated version upgrades through the autopilot feature when using Integrated Storage. This feature, introduced in v1.11.0, lets you start new Vault nodes on the target version alongside the older ones; autopilot automatically promotes the new nodes and switches over to them once they can form quorum. However, there are scenarios where adding new nodes is not desired or possible, and customers instead upgrade Vault with their own strategy, such as an AMI upgrade or upgrading Vault in place on the same VM. When upgrading Vault this way, the cluster loses quorum, causing downtime until the active node is upgraded.
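Before settling on an in-place upgrade strategy, it is worth confirming the version each node is running and how autopilot is currently configured. A minimal sketch using standard Vault CLI commands, assuming VAULT_ADDR and a valid token are set for the node you are checking:

$ vault status                                 # shows the node's Version and HA Mode
$ vault operator raft autopilot get-config     # shows the current autopilot settings, including Disable Upgrade Migration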
For example, when testing with a 3-node cluster and upgrading Vault from 1.11.0 to 1.11.3: after one standby node is upgraded from 1.11.0 to 1.11.3, checking the autopilot state as shown below reveals that the upgraded standby node node_vaults0 rejoined as a non-voter.
$ vault operator raft autopilot state
#Output:
#Upgrade Info:
#Status: await-new-voters
#Target Version: 1.11.3
#Target Version Voters:
#Target Version Non-Voters: node_vaults0
#Other Version Voters: node_vaults1, node_vaults2
#Other Version Non-Voters:
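The same information is available over the HTTP API at sys/storage/raft/autopilot/state, which can be convenient for watching the upgrade status from a script. A minimal sketch, assuming jq is installed and VAULT_ADDR/VAULT_TOKEN are exported; the exact JSON field name for the upgrade section (shown here as upgrade_info) may vary by version, so verify it against your own output:

$ vault read -format=json sys/storage/raft/autopilot/state | jq '.data.upgrade_info'
# Or directly against the API:
$ curl -s -H "X-Vault-Token: $VAULT_TOKEN" "$VAULT_ADDR/v1/sys/storage/raft/autopilot/state" | jq '.data.upgrade_info'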
When the disable_upgrade_migration flag is false (the default), a node newly added to the cluster joins as a non-voter and the autopilot upgrade status changes to await-new-voters. Since the upgraded nodes in this scenario rejoin the cluster as non-voters, you lose quorum until the leader is upgraded. Even though quorum is lost during the process, you should have a functioning cluster once all nodes are upgraded.
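You can watch each node's voter status while this happens with the raft list-peers command. The output below is illustrative for the example cluster above, not captured from it:

$ vault operator raft list-peers
# Node          Address              State     Voter
# ----          -------              -----     -----
# node_vaults1  192.168.50.41:8201   leader    true
# node_vaults2  192.168.50.42:8201   follower  true
# node_vaults0  192.168.50.40:8201   follower  false    <- upgraded node rejoined as a non-voter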
You will see logs similar to the following on the last active Vault node when quorum is lost:
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.111Z [DEBUG] core.cluster-listener: creating rpc dialer: address=192.168.50.42:8201 alpn=raft_storage_v1 host=raft-023bf9ba-536f-a7ee-14f1-16339682d2b0
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.112Z [ERROR] storage.raft: failed to heartbeat to: peer=192.168.50.42:8201 error="dial tcp 192.168.50.42:8201: connect: connection refused"
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.284Z [WARN] storage.raft: failed to contact: server-id=node_vaults2 time=2.524339216s
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.284Z [WARN] storage.raft: failed to contact quorum of nodes, stepping down
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.284Z [INFO] storage.raft: entering follower state: follower="Node at 192.168.50.41:8201 [Follower]" leader-address= leader-id=
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.285Z [INFO] storage.raft: aborting pipeline replication: peer="{Voter node_vaults0 192.168.50.40:8201}"
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.285Z [WARN] core: leadership lost, stopping active operation
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.285Z [ERROR] replication.index.perf: failed to persist checkpoint: error="failed to persist: leadership lost while committing log"
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.285Z [INFO] core: pre-seal teardown starting
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.285Z [TRACE] core.snapshotmgr: shutting down automatic snapshots
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.285Z [DEBUG] storage.raft.autopilot: state update routine is now stopped
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.286Z [DEBUG] storage.raft.autopilot: autopilot is now stopped
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.286Z [ERROR] replication.index.local: failed to persist checkpoint: error="failed to persist: node is not the leader"
Sep 15 10:04:52 vaults1 vault[3943]: 2022-09-15T10:04:52.286Z [ERROR] replication.index.periodic: failed to save checkpoint:
Sep 15 10:04:52 vaults1 vault[3943]: error=
Sep 15 10:04:52 vaults1 vault[3943]: | 2 errors occurred:
Sep 15 10:04:52 vaults1 vault[3943]: | * failed to persist checkpoint: failed to persist: leadership lost while committing log
Sep 15 10:04:52 vaults1 vault[3943]: | * failed to persist checkpoint: failed to persist: node is not the leader
Sep 15 10:04:52 vaults1 vault[3943]: |
Sep 15 10:04:52 vaults1 vault[3943]:
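While the cluster is without quorum it cannot serve requests, so it is useful to watch each node's health from outside during the upgrade. A minimal sketch polling the unauthenticated sys/health endpoint on every node; the addresses and the HTTPS listener on port 8200 are assumptions based on the example cluster, and the HTTP status codes follow Vault's documented defaults (200 = unsealed and active, 429 = unsealed standby, 503 = sealed):

#!/usr/bin/env bash
# Poll each node's health endpoint; adjust addresses/scheme/port for your environment.
for node in 192.168.50.40 192.168.50.41 192.168.50.42; do
  code=$(curl -sk -o /dev/null -w '%{http_code}' "https://$node:8200/v1/sys/health")
  echo "$node -> HTTP $code"
done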
Solution: Disable automated upgrades in autopilot
When the disable_upgrade_migration flag is set to true, a node that is newly added to the cluster, or a Vault node that is upgraded in place, rejoins the cluster as a voter, so you will not lose quorum while upgrading the standby nodes one by one. To upgrade your cluster this way, do the following:
- Set the disable_upgrade_migration flag to true on the leader node using the CLI:
$ vault operator raft autopilot set-config -disable-upgrade-migration=true
$ vault operator raft autopilot get-config
#Key Value
#--- -----
#Cleanup Dead Servers false
#Last Contact Threshold 10s
#Dead Server Last Contact Threshold 24h0m0s
#Server Stabilization Time 10s
#Min Quorum 0
#Max Trailing Logs 1000
#Disable Upgrade Migration true
- Upgrade the standby Vault nodes one by one, and make sure each node rejoins the cluster as a voter before moving on to the next node.
- Upgrade the active node last. It is recommended to step down the leader, let another node take over leadership, and then perform the upgrade on this last node (see the example sequence below).