Introduction
Problem
Vault fails to unseal after a Vault snapshot restore. In this particular use case the snapshot was taken from a different Vault cluster, and the restore was performed on a newly initialized three-node Vault cluster.
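For context, a restore of a snapshot taken on a different cluster is typically performed with the Raft snapshot CLI. A minimal sketch, assuming the CLI is pointed at a node of the target cluster and the snapshot file is named cluster-a.snap (both the address and the file name are placeholders):

# Point the CLI at the newly initialized target cluster (placeholder address)
export VAULT_ADDR="https://192.168.0.45:8200"
# -force is needed when the snapshot comes from a different cluster,
# since its keyring/autoseal data will not match the target cluster
vault operator raft snapshot restore -force cluster-a.snap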
Prerequisites (if applicable)
- Vault Enterprise Edition
- Integrated storage (Raft) backend
- Vault Snapshots
Cause
The Vault operational logs for the Vault active node show the following relevant entries:
[DEBUG] core.cluster-listener: creating rpc dialer: address=192.168.0.39:8201 alpn=raft_storage_v1 host=raft-2b6373f6-48d7-ac09-b671-039d9b81d540
[ERROR] storage.raft: failed to heartbeat to: peer=192.168.0.39:8201 backoff time=40ms error="dial tcp 192.168.0.39:8201: connect: connection refused"
[DEBUG] core.cluster-listener: creating rpc dialer: address=192.168.0.44:8201 alpn=raft_storage_v1 host=raft-2b6373f6-48d7-ac09-b671-039d9b81d540
[ERROR] storage.raft: failed to heartbeat to: peer=192.168.0.44:8201 backoff time=40ms error="dial tcp 192.168.0.44:8201: connect: connection refused"
[WARN] storage.raft: failed to contact: server-id=vaultnode3 time=2.906055356s
[WARN] storage.raft: failed to contact: server-id=vaultnode2 time=2.500476109s
[WARN] storage.raft: failed to contact quorum of nodes, stepping down
[INFO] storage.raft: entering follower state: follower="Node at 192.168.0.45:8201 [Follower]" leader-address= leader-id=
[WARN] core: leadership lost, stopping active operation
[DEBUG] core.cluster-listener: creating rpc dialer: address=192.168.0.39:8201 alpn=raft_storage_v1 host=raft-2b6373f6-48d7-ac09-b671-039d9b81d540
[ERROR] storage.raft: failed to appendEntries to: peer="{Nonvoter vaultnode3 192.168.0.39:8201}" error="dial tcp 192.168.0.39:8201: connect: connection refused"
[DEBUG] core.cluster-listener: creating rpc dialer: address=192.168.0.44:8201 alpn=raft_storage_v1 host=raft-2b6373f6-48d7-ac09-b671-039d9b81d540
[ERROR] storage.raft: failed to heartbeat to: peer=192.168.0.44:8201 backoff time=80ms error="dial tcp 192.168.0.44:8201: connect: connection refused"
[DEBUG] core.cluster-listener: creating rpc dialer: address=192.168.0.44:8201 alpn=raft_storage_v1 host=raft-2b6373f6-48d7-ac09-b671-039d9b81d540
[ERROR] storage.raft: failed to appendEntries to: peer="{Nonvoter vaultnode2 192.168.0.44:8201}" error="dial tcp 192.168.0.44:8201: connect: connection refused"
[INFO] core: running post snapshot restore invalidations
[DEBUG] sealwrap: unwrapping entry: key=core/master
[TRACE] UNWRAP exact key ID match decrypt success: sealWrapper.Name=awskms sealWrapper.Wrapper.KeyId=arn:aws:kms:us-east-2:**********:key/********-****-****-****-********942d
[TRACE] decrypted value using seal: seal_name=awskms
[INFO] core: failed to perform key upgrades, reloading using auto seal
[TRACE] UNWRAP exact key ID match decrypt success: sealWrapper.Name=awskms sealWrapper.Wrapper.KeyId=arn:aws:kms:us-east-2:**********:key/********-****-****-****-********942d
[TRACE] decrypted value using seal: seal_name=awskms
[DEBUG] sealwrap: unwrapping entry: key=core/keyring
[TRACE] UNWRAP exact key ID match decrypt success: sealWrapper.Name=awskms sealWrapper.Wrapper.KeyId=arn:aws:kms:us-east-2:**********:key/********-****-****-****-********942d
[TRACE] decrypted value using seal: seal_name=awskms
[INFO] core: done reloading root key using auto seal
[ERROR] core: cluster setup failed: error="node is not the leader"
[INFO] core: marked as sealed
[DEBUG] core: clearing forwarding clients
[DEBUG] core: done clearing forwarding clients
[DEBUG] core: finished triggering standbyStopCh for runStandby
[DEBUG] core: closed perf standby
[DEBUG] core: closed periodic license compare stop channel
[DEBUG] core: shutting down periodic key rotation checker
[DEBUG] core: shutting down periodic leader refresh
[DEBUG] core: shutting down periodic metrics
[DEBUG] core: shutting down leader elections
[INFO] core: pre-seal teardown starting
This is the relevant snippet:
cluster setup failed: error="node is not the leader"
The active node stepped down because it could not contact a quorum of the peers recorded in the restored snapshot.
The Vault operational logs for the Vault (Performance) Standby nodes show the following relevant entries:
[DEBUG] core: parsing information for new active node: active_cluster_addr=https://192.168.0.5:8201 active_redirect_addr=https://192.168.0.5:8200
[DEBUG] core: refreshing forwarding connection: clusterAddr=https://192.168.0.5:8201
[DEBUG] core: clearing forwarding clients
[DEBUG] core: forwarding: stopping heartbeating
[DEBUG] core: done clearing forwarding clients
[DEBUG] core: done refreshing forwarding connection: clusterAddr=https://192.168.0.5:8201
[DEBUG] core: new leader found, triggering new leader channel
[DEBUG] core.cluster-listener: creating rpc dialer: address=192.168.0.5:8201 alpn=req_fw_sb-act_v1 host=fw-fa6454ce-ef2c-a040-3095-6acb751d0c75
[DEBUG] core.cluster-listener: performing client cert lookup
[INFO] core: waiting to become performance standby
[ERROR] core: failed to elect as performance standby: error="rpc error: code = FailedPrecondition desc = node is not in HA cluster membership"
[DEBUG] core: forwarding: error sending echo request to active node: error="rpc error: code = FailedPrecondition desc = node is not in HA cluster membership"
[ERROR] core: shutting down core: error="node has been removed from the HA cluster"
[DEBUG] core: shutdown called
[INFO] core: marked as sealed
[DEBUG] core: clearing forwarding clients
[DEBUG] core: forwarding: stopping heartbeating
[DEBUG] core: done clearing forwarding clients
[DEBUG] core: finished triggering standbyStopCh for runStandby
[DEBUG] core: closed perf standby
[DEBUG] core: closed periodic license compare stop channel
[DEBUG] core: shutting down periodic key rotation checker
[DEBUG] core: shutting down periodic leader refresh
[DEBUG] core: shutting down periodic metrics
[DEBUG] core: shutting down leader elections
[DEBUG] core: runStandby done
[DEBUG] core.cluster-listener: performing server cert lookup
[DEBUG] core.cluster-listener: error handshaking cluster connection: error="tls: no certificates configured"
[ERROR] storage.raft.raft-net: failed to accept connection: error="Raft RPC layer closed"
[INFO] core: stopping cluster listeners
[INFO] core.cluster-listener: forwarding rpc listeners stopped
[ERROR] storage.raft.raft-net: failed to decode incoming command: error="transport shutdown"
[WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=["raft_storage_v1"]
[DEBUG] core.cluster-listener: error handshaking cluster connection: error="unsupported protocol"
[INFO] core.cluster-listener: rpc listeners successfully shut down
[INFO] core: cluster listeners successfully shut down
[DEBUG] core: sealing barrier
[INFO] core: vault is sealed
This is the relevant snippet:
failed to elect as performance standby: error="rpc error: code = FailedPrecondition desc = node is not in HA cluster membership"
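If the cluster is reachable, the Raft peer set can be inspected to confirm that the restored snapshot carried over the source cluster's membership. A minimal check:

vault operator raft list-peers

If the output still lists the source cluster's nodes (for example, the source active node at 192.168.0.5:8201 seen in the logs above) rather than this cluster's members, the restore has carried the source cluster's membership with it.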
Overview of possible solutions (if applicable)
Solutions:
In this particular use case, the Vault snapshot was created on cluster A and the restore test was performed on cluster B while both clusters were up, running, and able to communicate with each other. As a result, the Vault Performance Standby nodes in cluster B attempted to join cluster A instead of cluster B, on which the snapshot was restored. Cluster B was therefore unable to form a quorum and could not unseal.
Vault tries to reach peer nodes at the IP addresses stored in the integrated storage (Raft) backend, and a snapshot carries the source cluster's peer addresses with it.
Restore and upgrade tests should therefore be performed on network-isolated hosts, to avoid issues such as irrevocable leases or Vault instances attempting to join the source cluster. One way to enforce this isolation is sketched below.
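As one possible approach, outbound traffic to the source cluster's cluster port (8201 by default) can be blocked on the test hosts before performing the restore. A minimal sketch using iptables, with the source node address taken from the logs above as a placeholder:

# Reject cluster-port traffic from this test host to the source cluster's active node
# (placeholder address; repeat for each source cluster node, or block the whole source subnet)
iptables -A OUTPUT -d 192.168.0.5 -p tcp --dport 8201 -j REJECT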
Outcome
When isolated Vault clusters are used for restore or upgrade tests, this issue does not occur and the Vault snapshot restore completes successfully.
Additional Information
- Vault Documentation: Integrated storage (Raft) backend
- Vault Documentation: Restore a Vault snapshot
- Vault Documentation: Upgrade Vault