Introduction
Expected Outcome
As part of a disaster recovery process you may need to restore a snapshot to a Vault cluster running on Kubernetes. The Standard Procedure for Restoring a Vault Cluster guide is a useful starting point; however, additional steps may be required to perform the restore on Kubernetes, which is where this guide can assist.
Prerequisites
- The existing prerequisites in the Standard Procedure for Restoring a Vault Cluster guide must be satisfied.
- Access to a Vault snapshot.
- Access to either the same HSM / Cloud unseal method used by the snapshot, or the recovery keys in order to perform a seal migration.
- Access to the unseal keys if using Shamir seal.
- A minimum of one Vault instance/pod in an operational state (initialised, unsealed, responding to commands).
- The Vault instance/pod must be the active/leader node.
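As a quick sketch, and assuming the Helm-default vault namespace and a pod named vault-0, the operational and leadership state of the instance can be confirmed with vault status:

# Initialized should be true, Sealed should be false and HA Mode should be active
kubectl exec -n vault vault-0 -- vault status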
Considerations prior to restore
- The strategy used for your specific scenario should be considered, as this process can be applied to multiple situations; however, this document is written with two scenarios in mind:
- 1: Recovering from a scenario where the Vault cluster is in an unhealthy and unrecoverable state.
- If targeting the first scenario of restoring to an unhealthy cluster, the process requires scaling the StatefulSet down to zero replicas in order to remove the unhealthy instances. Once this has been completed, the PVCs associated with all Vault pods should be deleted, enabling the new replicas to start with fresh data volumes when scaling back up. If the StorageClass / storage provider in use supports snapshots, consider taking a snapshot of the PVs before deletion. Once the PVCs have been deleted, the StatefulSet can be scaled back up to the desired number of replicas. At this point one Vault replica will need to be initialised, following which a login using the newly generated root token should be performed. Next, follow the main Procedure section further down this guide. A sketch of the kubectl commands for this reset is shown below.
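The following is a minimal sketch of that reset, assuming a three-replica cluster installed with the official Helm chart defaults (StatefulSet and namespace named vault, PVCs named data-vault-0 to data-vault-2); adjust the names and replica count to match your environment:

# scale the StatefulSet down to zero to remove the unhealthy instances
kubectl scale statefulset vault -n vault --replicas=0

# once the pods have terminated, delete the PVCs so new pods start with fresh data volumes
# (take storage-level snapshots of the PVs first if your storage provider supports them)
kubectl delete pvc data-vault-0 data-vault-1 data-vault-2 -n vault

# scale back up to the desired number of replicas
kubectl scale statefulset vault -n vault --replicas=3

# initialise one replica, unseal it if required by your seal type, then log in with the new root token
kubectl exec -n vault -it vault-0 -- vault operator init
kubectl exec -n vault -it vault-0 -- vault login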
- 2: Vault is in a healthy state; however, you wish to restore a snapshot in order to recover deleted data.
- If targeting the second scenario of restoring a snapshot to a healthy Vault cluster, and it has been confirmed that all replicas are in sync (the Raft Applied Index value in vault status matches, or is very close, on every replica) and that communication between each replica is functioning (the Last Echo time in the output of vault operator members is current for all replicas), there is no need to scale down the replica count - restoring a snapshot to the active/leader Vault instance will result in the snapshot being distributed to all other nodes in the cluster. An example check is shown below.
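As a rough sketch of that check, assuming the Helm-default vault namespace and pods vault-0 to vault-2:

# compare the Raft Applied Index reported by each replica
for i in 0 1 2; do
  kubectl exec -n vault vault-$i -- vault status | grep "Raft Applied Index"
done

# check the Last Echo value for every node (requires a valid token on the pod)
kubectl exec -n vault vault-0 -- vault operator members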
- Understand how you will get access to the snapshot file used for restore, i.e. does it need to be transferred from an S3 bucket directly onto the Vault pod, or can you restore from a snapshot that is present on your local workstation?
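For example, if the snapshot is held in an S3 bucket it could first be copied to your local workstation with the AWS CLI; the bucket and object names below are placeholders only:

# download the snapshot from S3 to the local workstation
aws s3 cp s3://example-vault-backups/backup.snap /tmp/backup.snap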
- Examine the Vault configuration file to help define the plan for additional instances joining the cluster once the restore is completed - i.e. is retry_join configured, allowing automatic joins, or do manual join commands need to be issued? An example retry_join configuration is shown below.
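When automatic joins are configured, the storage "raft" stanza of the configuration file will contain retry_join blocks similar to the following; the addresses shown are only an illustration based on the internal service names created by the official Helm chart:

/ $ cat /vault/config/extraconfig-from-values.hcl
...
storage "raft" {
  path = "/vault/data"
  retry_join {
    leader_api_addr = "http://vault-0.vault-internal:8200"
  }
  retry_join {
    leader_api_addr = "http://vault-1.vault-internal:8200"
  }
  retry_join {
    leader_api_addr = "http://vault-2.vault-internal:8200"
  }
}
...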
- Consider the network connectivity between where you are performing the restore from and the Kubernetes cluster:
- Will a restore from your local workstation to the Kubernetes cluster route through an ingress controller / reverse proxy that imposes limits, such as session time length or maximum file size, that the snapshot transfer may breach?
- If this is true, then copying the snapshot file directly to the instance/pod via kubectl cp in order to perform the restore 'locally' may be required.
- You may still need to adjust configuration parameters on the Vault server - refer below.
- If the snapshot file is large and you are restoring the file from your local workstation, how long will it take for that transfer to complete?
- You should be aware of the following Vault configuration parameters that may require adjustment:
- VAULT_CLIENT_TIMEOUT - Environment variable that controls the Vault client-side timeout and defaults to 60 seconds. Run export VAULT_CLIENT_TIMEOUT=600s or similar to temporarily increase this on the machine from which you will perform the restore.
- http_read_timeout - Configuration parameter for the Vault server, defined within the listener stanza of the Vault configuration file; it is not present out of the box and uses a default value of 30 seconds. If restores fail with an i/o timeout after exactly 30 seconds, this will need to be manually set to a higher value, for example:
/ $ cat /vault/config/extraconfig-from-values.hcl
...
listener "tcp" {
  address = "[::]:8200"
  http_read_timeout = "600s"
  ...
}
- The http_read_timeout value is read from the Vault configuration file once at start-up, meaning that if it has been defined for the first time or modified in order to perform a restore, the pod must be deleted and recreated for the change to take effect (see the example below).
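For example, assuming the Helm-default vault namespace, the pod can be deleted and left for the StatefulSet controller to recreate, at which point the updated listener configuration is read:

# delete the pod; the StatefulSet controller recreates it with the new configuration
kubectl delete pod vault-0 -n vault

# confirm the pod has restarted and is ready before continuing
kubectl get pods -n vault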
Procedure
- Once you have decided on the method of restore (remote or local) and made any configuration changes required, you can move on to the restore process.
- Obtain a copy of the snapshot file that you will restore to the cluster.
- Confirm only one pod is present and that it is the leader/active node by checking that the value of HA Mode is active in the output of vault status.
- Make a note of the value for Raft Applied Index in the vault status output.
- If performing a 'local' restore, i.e. directly on the Vault pod, copy the snapshot to the pod using kubectl cp, for example: kubectl cp /tmp/backup.snap vault/vault-0:/vault/backup.snap will copy the file backup.snap located in /tmp on your local workstation to the vault-0 pod within the vault namespace and write it as /vault/backup.snap in the pod. See the Additional Information section below for a link to the kubectl cp documentation for further examples.
- Login to Vault using a sufficiently privileged token.
- Restore the snapshot to the Vault cluster, for example: vault operator raft snapshot restore -force /vault/backup.snap. A consolidated sketch of the restore and verification commands is shown after this list.
- Once complete, run vault status and confirm the value for Sealed is false. If Vault is still sealed it must be unsealed using the vault operator unseal command.
- Run vault status once unsealed and compare the current value of Raft Applied Index to the value previously observed in order to confirm the restore was successful. Additionally, confirm that the HA Mode is active.
- Perform relevant tests to confirm desired data and configurations are present, i.e. confirm KV entries are present, authenticate to Vault using a previously functioning authentication method (OIDC, LDAP etc.).
- If not already done, scale the StatefulSet up to the desired number of replicas.
- Confirm the new pods have joined the cluster by checking the output of vault operator members (Vault 1.10 or newer) or vault operator raft list-peers if on an older version of Vault. If you are not utilising retry_join functionality you will likely need to tell the new pods to join the cluster using the vault operator raft join command - see the documentation linked in the Additional Information section.
- If you adjusted any configuration, such as http_read_timeout or VAULT_CLIENT_TIMEOUT, these can now be unset / restored to their default values.
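The commands below are a minimal sketch of a 'local' restore and the follow-up verification, assuming the Helm-default vault namespace, pods named vault-0 to vault-2 and a snapshot at /tmp/backup.snap on the workstation; adjust names, addresses and paths to match your environment:

# copy the snapshot onto the active pod
kubectl cp /tmp/backup.snap vault/vault-0:/vault/backup.snap

# log in with a sufficiently privileged token, then restore the snapshot
kubectl exec -n vault -it vault-0 -- vault login
kubectl exec -n vault -it vault-0 -- vault operator raft snapshot restore -force /vault/backup.snap

# confirm Sealed is false, HA Mode is active and Raft Applied Index matches expectations
kubectl exec -n vault vault-0 -- vault status

# only if retry_join is not configured: join the additional pods manually
# (the leader address is an example based on the Helm chart's internal service name)
kubectl exec -n vault vault-1 -- vault operator raft join http://vault-0.vault-internal:8200
kubectl exec -n vault vault-2 -- vault operator raft join http://vault-0.vault-internal:8200

# confirm all pods have joined the cluster
kubectl exec -n vault vault-0 -- vault operator raft list-peers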
Additional Information
- Vault Documentation: Integrated Storage / Raft retry_join
- Vault Documentation: Integrated Storage / Raft vault operator raft join
- Vault Documentation: VAULT_CLIENT_TIMEOUT
- Vault Documentation: http_read_timeout
- Kubernetes Documentation: kubectl cp