Introduction
Expected Outcome
As part of a disaster recovery process, you may need to restore a snapshot to a Vault cluster that is running on Kubernetes. The Standard Procedure for Restoring a Vault Cluster guide is a useful starting point; however, additional steps may be required to perform the restore, which is where this guide can assist.
Prerequisites
- The existing prerequisites on the Standard Procedure for Restoring a Vault Cluster guide must be satisfied.
- Access to a Vault snapshot.
- If you are restoring to an existing cluster, we recommend scaling down the Deployment/StatefulSet to one instance/pod (see the example after this list).
- One Vault instance/pod in an operational state (initialised, unsealed, responding to commands).
- The Vault instance/pod must be the active/leader node.
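For reference, the scale-down can be performed with kubectl. The following is a minimal sketch assuming the default Vault Helm chart names (a StatefulSet called `vault` in the `vault` namespace) - adjust to match your environment:

```
# Scale the Vault StatefulSet down to a single pod
# (StatefulSet name, namespace and label are assumptions based on the default Helm chart)
kubectl scale statefulset vault --replicas=1 -n vault

# Confirm only one Vault pod remains
kubectl get pods -n vault -l app.kubernetes.io/name=vault
```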
Considerations prior to restore
- Understand how you will get access to the snapshot file used for the restore, i.e. does it need to be transferred from an S3 bucket directly onto the Vault pod, or can you restore from a snapshot that is present on your local workstation?
- Examine the Vault configuration file to help define the plan for additional instances joining the cluster once the restore is completed, i.e. is `retry_join` configured, allowing automatic joins, or do manual join commands need to be issued?
- Consider the network connectivity between where you are performing the restore from and the Kubernetes cluster:
  - Will a restore from your local workstation to a Kubernetes cluster route through an ingress controller / reverse proxy that imposes limits such as session time length, maximum file size, etc. that the snapshot transfer may breach?
    - If this is true, then copying the snapshot file directly to the instance/pod via `kubectl cp` in order to perform the restore 'locally' may be required.
      - You may still need to adjust configuration parameters on the Vault server - refer below.
  - If the snapshot file is large and you are restoring the file from your local workstation, how long will it take for that transfer to complete?
- You should be aware of the following Vault configuration parameters that may require adjustment:
  - `VAULT_CLIENT_TIMEOUT` - Environment variable that controls the Vault client side timeout and defaults to 60 seconds. Run `export VAULT_CLIENT_TIMEOUT=600s` or similar to temporarily increase this on the machine from which you will perform the restore.
  - `http_read_timeout` - Configuration parameter for the Vault server, defined within the `listener` stanza in the Vault configuration file, that is not present out of the box and uses a default value of 30 seconds. If restores fail with an `i/o timeout` after exactly 30 seconds, this will need to be manually set to a higher value, for example:
```
/ $ cat /vault/config/extraconfig-from-values.hcl
...
listener "tcp" {
  address = "[::]:8200"
  http_read_timeout = "600s"
  ...
}
```
- The `http_read_timeout` value is read from the Vault configuration file once at start-up, meaning that if it has been manually defined for the first time or modified in order to perform a restore, the pod must be deleted and recreated for the new value to take effect (see the example below).
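As a combined illustration of both adjustments, assuming the pod is named `vault-0` in the `vault` namespace and is managed by a StatefulSet that will recreate it automatically:

```
# Client side: temporarily raise the Vault client timeout on the machine
# that will perform the restore (only applies to the current shell session)
export VAULT_CLIENT_TIMEOUT=600s

# Server side: after adding or raising http_read_timeout in the listener stanza,
# delete the pod so it is recreated and re-reads the configuration file.
# Pod and namespace names are assumptions; the recreated pod may need to be
# unsealed again before continuing (unless auto-unseal is in use).
kubectl delete pod vault-0 -n vault
```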
Procedure
- Once you have decided on the method of restore (remote or local) and made any configuration changes required, you can move on to the restore process.
- Obtain a copy of the snapshot file that you will restore to the cluster.
- Confirm only one pod is present and that it is the leader/active node by checking that the value of `HA Mode` is `active` in the output of `vault status`.
- Make a note of the value for `Raft Applied Index` in the `vault status` output.
- If performing a 'local' restore, i.e. directly on the Vault pod, copy the snapshot to the pod using `kubectl cp`, for example: `kubectl cp /tmp/backup.snap vault/vault-0:/vault/backup.snap` will copy the file backup.snap located in /tmp on your local workstation to the vault-0 pod located within the vault namespace and write it as /vault/backup.snap in the pod. See the Additional Information section below for a link to the `kubectl cp` documentation for further examples.
- Login to Vault using a sufficiently privileged token.
- Restore the snapshot to the Vault cluster, an example command is as follows: `vault operator raft snapshot restore -force /vault/backup.snap` (a worked sketch is included after this list).
- Once complete, run `vault status` and compare the current value of `Raft Applied Index` to the value previously observed in order to confirm the restore was successful. Additionally, confirm that the `HA Mode` is `active`.
- Perform relevant tests to confirm desired data and configurations are present, i.e. confirm KV entries are present, authenticate to Vault using a previously functioning authentication method (OIDC, LDAP, etc.).
- Scale up the Deployment/StatefulSet to the desired number of pods.
- Confirm the new pods have joined the cluster by checking the output of `vault operator members` (Vault 1.10 or newer) or `vault operator raft list-peers` if on an older version of Vault. If you are not utilising `retry_join` functionality, you will likely need to tell the new pods to join the cluster using `vault operator raft join` syntax - see https://developer.hashicorp.com/vault/docs/commands/operator/raft#join (see the second sketch after this list).
- If you adjusted any configuration, such as `http_read_timeout` or `VAULT_CLIENT_TIMEOUT`, these can now be unset / restored to their default values.
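The following is a minimal sketch of the core restore steps above for a 'local' restore, assuming a pod named `vault-0` in the `vault` namespace and a snapshot file called backup.snap (all names are assumptions to be adapted to your environment):

```
# Confirm the remaining pod is the active node and note the Raft Applied Index
kubectl exec -n vault vault-0 -- vault status

# Copy the snapshot from the local workstation onto the pod
kubectl cp /tmp/backup.snap vault/vault-0:/vault/backup.snap

# Log in with a sufficiently privileged token, then restore the snapshot
kubectl exec -it -n vault vault-0 -- vault login
kubectl exec -n vault vault-0 -- vault operator raft snapshot restore -force /vault/backup.snap

# Verify Raft Applied Index has changed and HA Mode is still active
kubectl exec -n vault vault-0 -- vault status
```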
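A second sketch covers the scale-up and join checks. The manual `vault operator raft join` is only needed when `retry_join` is not configured, and the leader API address shown is an assumption based on the default Helm chart internal service name:

```
# Scale the StatefulSet back up to the desired number of pods
kubectl scale statefulset vault --replicas=3 -n vault

# If retry_join is not configured, instruct each new pod to join the leader
# (the leader API address below is an assumption; new pods may also need to be unsealed)
kubectl exec -n vault vault-1 -- vault operator raft join http://vault-0.vault-internal:8200

# Confirm the new pods have joined the cluster (Vault 1.10+)
kubectl exec -n vault vault-0 -- vault operator members
```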
Additional Information
- Integrated Storage / Raft `retry_join`: https://developer.hashicorp.com/vault/docs/configuration/storage/raft#retry_join
- Integrated Storage / Raft `vault operator raft join`: https://developer.hashicorp.com/vault/docs/commands/operator/raft#join
- `kubectl cp`: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#cp
- `VAULT_CLIENT_TIMEOUT`: https://developer.hashicorp.com/vault/docs/commands#vault_client_timeout
- `http_read_timeout`: https://developer.hashicorp.com/vault/docs/configuration/listener/tcp#http_read_timeout