Introduction
Expected Outcome
As part of a disaster recovery process you may need to restore a snapshot to a Vault cluster running on Kubernetes. The Standard Procedure for Restoring a Vault Cluster guide is a useful starting point; however, additional steps may be required to perform the restore on Kubernetes, which is where this guide can assist.
Prerequisites
- The existing prerequisites in the Standard Procedure for Restoring a Vault Cluster guide must be satisfied.
- Access to a Vault snapshot.
- Access to either the same HSM / Cloud unseal method used by the snapshot, or the recovery keys in order to perform a seal migration.
- Access to the unseal keys if using Shamir seal.
- A minimum of one Vault instance/pod in an operational state (initialised, unsealed, responding to commands).
- The Vault instance/pod must be the active/leader node.
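As a quick sketch, and assuming the Helm-default vault namespace and a pod named vault-0, the operational and leadership state of the instance can be confirmed with vault status:

# Initialized should be true, Sealed should be false and HA Mode should be active
kubectl exec -n vault vault-0 -- vault status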
Considerations prior to restore
- The strategy used for your specific scenario should be considered, as this process can be applied to multiple situations; however, this document is written with two scenarios in mind:
- 1: Recovering from a scenario where the Vault cluster is in an unhealthy and unrecoverable state.
- If targeting the first scenario of restoring to an unhealthy cluster, the process requires scaling the StatefulSet down to zero replicas in order to remove the unhealthy instances. Once this has been completed, the PVCs associated with all Vault pods should be deleted, enabling the new replicas to start with fresh data volumes when scaling back up. If the StorageClass / storage provider in use supports snapshots, consider taking a snapshot of the PVs before deletion. Once the PVCs have been deleted, the StatefulSet can be scaled back up to the desired number of replicas. At this point one Vault replica will need to be initialised, following which a login using the newly generated root token should be performed. Next, follow the main Procedure section further down this guide. A sketch of the kubectl commands for this reset is shown below.
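The following is a minimal sketch of that reset, assuming a three-replica cluster installed with the official Helm chart defaults (StatefulSet and namespace named vault, PVCs named data-vault-0 to data-vault-2); adjust the names and replica count to match your environment:

# scale the StatefulSet down to zero to remove the unhealthy instances
kubectl scale statefulset vault -n vault --replicas=0

# once the pods have terminated, delete the PVCs so new pods start with fresh data volumes
# (take storage-level snapshots of the PVs first if your storage provider supports them)
kubectl delete pvc data-vault-0 data-vault-1 data-vault-2 -n vault

# scale back up to the desired number of replicas
kubectl scale statefulset vault -n vault --replicas=3

# initialise one replica, unseal it if required by your seal type, then log in with the new root token
kubectl exec -n vault -it vault-0 -- vault operator init
kubectl exec -n vault -it vault-0 -- vault login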
- 2: Vault is in a healthy state; however, you wish to restore a snapshot in order to recover deleted data.
- If targeting the second scenario of restoring a snapshot to a healthy Vault cluster, and it has been confirmed that all replicas are in sync (the Raft Applied Index value in vault status matches, or is very close, on every replica) and that communication between each replica is functioning (the Last Echo time in the output of vault operator members is current for all replicas), there is no need to scale down the replica count - restoring a snapshot to the active/leader Vault instance will result in the snapshot being distributed to all other nodes in the cluster. An example check is shown below.
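As a rough sketch of that check, assuming the Helm-default vault namespace and pods vault-0 to vault-2:

# compare the Raft Applied Index reported by each replica
for i in 0 1 2; do
  kubectl exec -n vault vault-$i -- vault status | grep "Raft Applied Index"
done

# check the Last Echo value for every node (requires a valid token on the pod)
kubectl exec -n vault vault-0 -- vault operator members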
- Understand how you will get access to the snapshot file used for restore, i.e. does it need to be transferred from an S3 bucket directly onto the Vault pod, or can you restore from a snapshot that is present on your local workstation?
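For example, if the snapshot is held in an S3 bucket it could first be copied to your local workstation with the AWS CLI; the bucket and object names below are placeholders only:

# download the snapshot from S3 to the local workstation
aws s3 cp s3://example-vault-backups/backup.snap /tmp/backup.snap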
- Examine the Vault configuration file to help define the plan for additional instances joining the cluster once the restore is completed - i.e. is retry_join configured, allowing automatic joins, or do manual join commands need to be issued? An example retry_join configuration is shown below.
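When automatic joins are configured, the storage "raft" stanza of the configuration file will contain retry_join blocks similar to the following; the addresses shown are only an illustration based on the internal service names created by the official Helm chart:

/ $ cat /vault/config/extraconfig-from-values.hcl
...
storage "raft" {
  path = "/vault/data"
  retry_join {
    leader_api_addr = "http://vault-0.vault-internal:8200"
  }
  retry_join {
    leader_api_addr = "http://vault-1.vault-internal:8200"
  }
  retry_join {
    leader_api_addr = "http://vault-2.vault-internal:8200"
  }
}
...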
- Consider the network connectivity between where you are performing the restore from and the Kubernetes cluster:
- Will a restore from your local workstation to the Kubernetes cluster route through an ingress controller / reverse proxy that imposes limits, such as session time length or maximum file size, that the snapshot transfer may breach?
- If this is true, then copying the snapshot file directly to the instance/pod via kubectl cp in order to perform the restore 'locally' may be required.
- You may still need to adjust configuration parameters on the Vault server - refer below.
- If the snapshot file is large and you are restoring the file from your local workstation, how long will it take for that transfer to complete?
- You should be aware of the following Vault configuration parameters that may require adjustment:
- VAULT_CLIENT_TIMEOUT - Environment variable that controls the Vault client-side timeout and defaults to 60 seconds. Run export VAULT_CLIENT_TIMEOUT=600s or similar to temporarily increase this on the machine from which you will perform the restore.
- http_read_timeout - Configuration parameter for the Vault server, defined within the listener stanza of the Vault configuration file; it is not present out of the box and uses a default value of 30 seconds. If restores fail with an i/o timeout after exactly 30 seconds, this will need to be manually set to a higher value, for example:
/ $ cat /vault/config/extraconfig-from-values.hcl
...
listener "tcp" {
  address = "[::]:8200"
  http_read_timeout = "600s"
  ...
}
- The http_read_timeout value is read from the Vault configuration file once at start-up, meaning that if it has been defined for the first time or modified in order to perform a restore, the pod must be deleted and recreated for the change to take effect (see the example below).
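For example, assuming the Helm-default vault namespace, the pod can be deleted and left for the StatefulSet controller to recreate, at which point the updated listener configuration is read:

# delete the pod; the StatefulSet controller recreates it with the new configuration
kubectl delete pod vault-0 -n vault

# confirm the pod has restarted and is ready before continuing
kubectl get pods -n vault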
Procedure
- Once you have decided on the method of restore (remote or local) and made any configuration changes required, you can move on to the restore process.
- Obtain a copy of the snapshot file that you will restore to the cluster.
- Confirm only one pod is present and that it is the leader/active node by checking that the value of HA Mode is active in the output of vault status.
- Make a note of the value for Raft Applied Index in the vault status output.
- If performing a 'local' restore, i.e. directly on the Vault pod, copy the snapshot to the pod using kubectl cp, for example: kubectl cp /tmp/backup.snap vault/vault-0:/vault/backup.snap will copy the file backup.snap located in /tmp on your local workstation to the vault-0 pod within the vault namespace and write it as /vault/backup.snap in the pod. See the Additional Information section below for a link to the kubectl cp documentation for further examples.
- Login to Vault using a sufficiently privileged token.
- Restore the snapshot to the Vault cluster, for example: vault operator raft snapshot restore -force /vault/backup.snap. A consolidated sketch of the restore and verification commands is shown after this list.
- Once complete, run vault status and confirm the value for Sealed is false. If Vault is still sealed it must be unsealed using the vault operator unseal command.
- Run vault status once unsealed and compare the current value of Raft Applied Index to the value previously observed in order to confirm the restore was successful. Additionally, confirm that the HA Mode is active.
- Perform relevant tests to confirm desired data and configurations are present, i.e. confirm KV entries are present, authenticate to Vault using a previously functioning authentication method (OIDC, LDAP etc.).
- If not already done, scale the StatefulSet up to the desired number of replicas.
- Confirm the new pods have joined the cluster by checking the output of vault operator members (Vault 1.10 or newer) or vault operator raft list-peers if on an older version of Vault. If you are not utilising retry_join functionality you will likely need to tell the new pods to join the cluster using the vault operator raft join command - see the documentation linked in the Additional Information section.
- If you adjusted any configuration, such as http_read_timeout or VAULT_CLIENT_TIMEOUT, these can now be unset / restored to their default values.
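The commands below are a minimal sketch of a 'local' restore and the follow-up verification, assuming the Helm-default vault namespace, pods named vault-0 to vault-2 and a snapshot at /tmp/backup.snap on the workstation; adjust names, addresses and paths to match your environment:

# copy the snapshot onto the active pod
kubectl cp /tmp/backup.snap vault/vault-0:/vault/backup.snap

# log in with a sufficiently privileged token, then restore the snapshot
kubectl exec -n vault -it vault-0 -- vault login
kubectl exec -n vault -it vault-0 -- vault operator raft snapshot restore -force /vault/backup.snap

# confirm Sealed is false, HA Mode is active and Raft Applied Index matches expectations
kubectl exec -n vault vault-0 -- vault status

# only if retry_join is not configured: join the additional pods manually
# (the leader address is an example based on the Helm chart's internal service name)
kubectl exec -n vault vault-1 -- vault operator raft join http://vault-0.vault-internal:8200
kubectl exec -n vault vault-2 -- vault operator raft join http://vault-0.vault-internal:8200

# confirm all pods have joined the cluster
kubectl exec -n vault vault-0 -- vault operator raft list-peers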
Additional Information
- Vault Documentation: Integrated Storage / Raft retry_join
- Vault Documentation: Integrated Storage / Raft vault operator raft join
- Vault Documentation: VAULT_CLIENT_TIMEOUT
- Vault Documentation: http_read_timeout
- Kubernetes Documentation: kubectl cp