During an incident, it can be hard to decide what data to collect for post-incident analysis. This guide is a starting point that lists the key information to gather.
Data collection has two main areas of focus: the Vault process itself, and the state of the underlying infrastructure running the Vault process. If the Vault process is hung or unresponsive, data from the underlying infrastructure becomes key to root cause analysis of the acute incident.
tl;dr - During an acute incident, if the Vault API is responsive, please capture a vault debug (authenticate with a sudo access token, or this policy example), along with the infrastructure checks below. If the Vault API is unresponsive, please capture a stack trace, along with the requested data from the underlying infrastructure, including server operational logs.
Please remember to collect this data before performing any restoration activities, e.g. restarting the Vault process.
Vault Debug
The vault debug command is the standard and preferred way to gather the relevant data for troubleshooting the Vault process. Run it whenever the Vault API is responsive. Please remember to authenticate with a sudo-level token, or a token with a policy similar to the policy example, before running the vault debug command.
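As a rough illustration, a capture could be wrapped in a small shell function like the sketch below. The function name capture_vault_debug, the /tmp default, and the file naming scheme are assumptions made for this example; -duration, -interval, and -output are existing vault debug flags, shown here with the default duration and interval spelled out.

```shell
# Hypothetical helper (not part of Vault): capture a debug bundle with a
# host- and time-stamped file name. Assumes VAULT_ADDR and a sudo-level
# VAULT_TOKEN are already set in the environment.
capture_vault_debug() {
  out_dir="${1:-/tmp}"
  stamp="$(date -u +%Y-%m-%dT%H-%M-%SZ)"
  out_file="${out_dir}/vault-debug-$(uname -n)-${stamp}.tar.gz"
  # 2m and 30s are the vault debug defaults, written out for clarity.
  # Progress output is sent to stderr so stdout carries only the path.
  vault debug -duration=2m -interval=30s -output="${out_file}" >&2 || return 1
  # Print the bundle path on stdout so callers can pick it up.
  printf '%s\n' "${out_file}"
}
```

Calling capture_vault_debug /var/tmp would write the bundle under /var/tmp and print its path, which can then be attached to the support ticket.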
If the Vault API is not responsive, there is no need to run vault debug, as it would likely fail like the other API commands being issued. Instead, please capture a stack trace from the active/hung node(s), along with server operational logs.
To determine if the Vault API is responsive:
- Does vault status return output? If vault commands are hung or timing out, it's unlikely the API is responsive enough for vault debug to gather the needed information.
- Does sys/health return output? To check the API with a direct call, for example: curl https://ip.of.api.addr:8200/v1/sys/health
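The two checks above can be approximated in a script. The following is a minimal sketch, assuming curl is installed and VAULT_ADDR points at the node being checked; the function name vault_api_responsive and the 10-second default timeout are illustrative choices, not part of Vault.

```shell
# Hypothetical helper: succeed if the Vault API answers sys/health at all.
# sys/health reports node state through the HTTP status (for example 200 for
# an active node, 429 for a standby, 503 when sealed), so any HTTP status
# means the API answered. Only a timeout or connection failure is treated
# as unresponsive.
vault_api_responsive() {
  addr="${VAULT_ADDR:-https://127.0.0.1:8200}"
  timeout_s="${1:-10}"
  code="$(curl --silent --max-time "${timeout_s}" --output /dev/null \
    --write-out '%{http_code}' "${addr}/v1/sys/health")" || return 1
  # curl reports 000 when no HTTP response was received.
  [ "${code}" != "000" ]
}
```

If vault_api_responsive returns success, running vault debug is worthwhile; otherwise move on to the stack trace and infrastructure data. Note that a TLS verification failure (for example, a private CA that curl does not trust) also makes this check fail, so curl may need --cacert in such environments.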
Data Collection from the OS / Infrastructure
Collecting data from the underlying infrastructure is especially important if a vault debug cannot be taken. The most common infrastructures are:
Linux (systemd)
Please collect the following information to provide context and detail about the infrastructure running the Vault process.
From the active node and any other problematic node:
# Commands to gather OS-level (systemd) data
# Collect Linux processes and real-time system information with top. The -b flag runs top in
# batch mode (needed to redirect to a file) and the -n flag sets the number of iterations.
# With top's default 3-second delay, 5 iterations take roughly 12 seconds.
top -b -n 5 > top-output.txt
# Check memory usage
free -ht > free-output.txt
# Check file system disk space usage
df -h > df-output.txt
# Check listening TCP sockets and their owning processes. Requires sudo or root access
sudo ss -tnlp > ss-output.txt
# Check for open files. Requires sudo or root access. lsof might not be installed;
# there is no tidy equivalent.
sudo lsof -p $(pidof vault) > lsof-vault-output.txt
# If lsof cannot be installed, try the following instead.
# Get the PID of vault
pidof vault
# List the file descriptors for the vault process. Requires sudo or root access.
# Replace <pidof_vault> with the output from the pidof / pgrep command.
sudo ls -l /proc/<pidof_vault>/fd > proc-vault-ls-output.txt
# Collect vault server operational logs (journald example). date -u keeps the Z (UTC) suffix accurate
journalctl -b --no-pager -u vault | gzip -9 > /tmp/"$(hostname)-$(date -u +%Y-%m-%dT%H-%M-%SZ)-vault.log.gz"
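During an incident, one failing or missing tool should not stop the rest of the collection. The sketch below shows one way to run the commands above so that each output lands in a single directory and failures are recorded rather than fatal. The names run_capture, OUT_DIR, and capture-status.txt are hypothetical, chosen for this example.

```shell
# Hypothetical wrapper: run one command, save stdout and stderr to a file in
# OUT_DIR, and keep going even if the command fails or is not installed.
OUT_DIR="${OUT_DIR:-/tmp/vault-incident-$(uname -n)-$(date -u +%Y-%m-%dT%H-%M-%SZ)}"
mkdir -p "${OUT_DIR}"

run_capture() {
  name="$1"
  shift
  rc=0
  "$@" > "${OUT_DIR}/${name}" 2>&1 || rc=$?
  # Record the exit status so gaps in the data are visible later.
  echo "${name}: exit ${rc}" >> "${OUT_DIR}/capture-status.txt"
  return 0
}

# Usage with the commands listed above:
#   run_capture top-output.txt top -b -n 5
#   run_capture free-output.txt free -ht
#   run_capture df-output.txt df -h
#   run_capture ss-output.txt sudo ss -tnlp
```

Recording the exit status next to the outputs makes it clear afterwards whether a file is empty because nothing was found or because the command could not run.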
Kubernetes
In Kubernetes environments, a pod with a hung or unresponsive Vault process is less likely, because of liveness probes and similar mechanisms. If the Vault process is hung in the pod, please collect a stack trace and the Vault server operational logs. Additional pod OS-level data to gather:
From the active node and any other problematic node:
# Commands to gather OS-level (Alpine Linux) data
# Collect Linux processes and real-time system information with top. The -b flag runs top in
# batch mode (needed to redirect to a file) and the -n flag sets the number of iterations.
# The run time depends on top's refresh delay (5 seconds by default in BusyBox top, so roughly 20 seconds).
top -b -n 5 > top-output.txt
# Check memory usage
free -ht > free-output.txt
# Check file system disk space usage
df -h > df-output.txt
# Check listening TCP sockets and their owning processes
netstat -tlpn > netstat-output.txt
# Check for open files. lsof is usually installed in Alpine Linux;
# there is no tidy equivalent.
lsof -p $(pidof vault) > lsof-vault-output.txt
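Pod-level context can also be gathered from outside the container with kubectl. The following is a minimal sketch under stated assumptions: the function name collect_pod_data is hypothetical, and the namespace vault and pod name vault-0 are placeholder defaults that will differ per deployment. The kubectl logs and kubectl describe subcommands and the --timestamps and --previous flags are standard kubectl.

```shell
# Hypothetical helper: gather pod-level context with kubectl. The namespace
# and pod name defaults are placeholders; adjust them to the deployment.
collect_pod_data() {
  ns="${1:-vault}"
  pod="${2:-vault-0}"
  out_dir="${3:-.}"
  # Current container logs, with timestamps.
  kubectl logs "${pod}" -n "${ns}" --timestamps \
    > "${out_dir}/${pod}-logs.txt" 2>&1 || true
  # Logs of the previous container instance, useful if a liveness probe
  # has already restarted the container. Fails harmlessly if none exists.
  kubectl logs "${pod}" -n "${ns}" --timestamps --previous \
    > "${out_dir}/${pod}-logs-previous.txt" 2>&1 || true
  # Pod events, restart counts, and probe configuration.
  kubectl describe pod "${pod}" -n "${ns}" \
    > "${out_dir}/${pod}-describe.txt" 2>&1 || true
}
```

The OS-level commands listed above still need to be run inside the container itself, for example through kubectl exec.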
Next Steps
When engaging with Support, please upload the files generated by vault debug, along with the infrastructure data, server operational logs, and stack traces you collected, to the Zendesk ticket via SendSafely.
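If the collected files sit in one directory, packing them into a single archive can make the upload simpler. The sketch below is one possible approach; the function name bundle_incident_files and the archive naming are assumptions for illustration only.

```shell
# Hypothetical helper: pack a directory of collected files into one
# compressed archive placed next to the directory, and print its path.
bundle_incident_files() {
  src_dir="${1%/}"
  archive="${2:-${src_dir}.tar.gz}"
  # -C switches to the parent directory so the archive contains relative paths.
  tar -czf "${archive}" -C "$(dirname "${src_dir}")" "$(basename "${src_dir}")" || return 1
  printf '%s\n' "${archive}"
}
```

For example, bundle_incident_files /tmp/vault-incident-data would create /tmp/vault-incident-data.tar.gz and print that path.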
Further Reading
too many open files - troubleshooting