Vault operator diagnose is a command that helps troubleshoot common issues with Vault server startup and configuration. It performs a series of checks on the environment, configuration, storage, listeners, and other components of Vault and reports any errors or warnings. It can also provide guidance on how to fix the detected problems.
Usage
The vault operator diagnose command can be run with the same configuration file and environment variables as the vault server command. It can be used before starting the server or while the server is running. However, some checks may not be applicable or may return false results if the server is already operational.
The command has the following syntax:
vault operator diagnose [options] [config_file]
The options are:
- -format (string: "table") - Print the output in the given format. Valid formats are "table", "json", or "yaml". This can also be specified via the VAULT_FORMAT environment variable.
- -config (string: "") - The path to the vault configuration file used by the vault server on startup.
- -debug (bool: false) - Dump all information collected by Diagnose.
- -skip (string: "") - Skip the health checks named as arguments. May be 'listener', 'storage', or 'autounseal'.
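As a rough illustration of how these options combine, the sketch below wraps the command in a small shell function. The function name run_diagnose and the VAULT_BIN variable are hypothetical conveniences and not part of Vault. The -config and -skip flags are the ones listed above, and the -skip=listener form assumes the usual flag=value syntax.

```shell
# Hypothetical helper: run Diagnose against a config file, passing any
# extra options (for example -skip=listener) straight through.
# VAULT_BIN is an assumed override for the binary path; it defaults to "vault".
run_diagnose() {
  config_file=$1
  shift
  "${VAULT_BIN:-vault}" operator diagnose -config="$config_file" "$@"
}

# Example call (commented out so the snippet has no side effects):
# run_diagnose /etc/vault/server.hcl -skip=listener
```

Keeping the binary path overridable makes it easier to point the same helper at a specific Vault build when comparing versions.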
Output
The vault operator diagnose command will output a series of lines with prefixes indicating the status of each check. The prefixes are:
- [ success ] - Denotes that the check was successful.
- [ warning ] - Denotes that the check has passed, but there may be potential issues to look into that relate to the problems Vault is experiencing. Diagnose warns frequently. These warnings are meant to serve as starting points in the debugging process.
- [ failure ] - Denotes that the check has failed. Failures are critical issues in the eyes of the diagnose command.
- [ skipped ] - Denotes that the check was not run, for example because the component it tests is not configured.
In addition to these prefixed lines, there may be output lines that are not prefixed, but are color-coded purple. These are advice lines from Diagnose, and are meant to offer general guidance on how to go about fixing potential warnings or failures that may arise.
Warning or failure prefixes in nested checks bubble up to the parent check if they supersede the parent's prefix. Failure supersedes warning, and warning supersedes success. For example, if the TLS checks under the Storage check fail, the [ failure ] prefix will bubble up to the Storage check.
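The precedence rule can be sketched as a small filter over the table output. This is only an illustration: the function name worst_status is hypothetical, and it assumes the output is plain text with the prefixes shown above (no color escape codes), possibly indented for nested checks.

```shell
# Hypothetical helper: read Diagnose table output on stdin and print the
# most severe prefix found, using the order failure > warning > success.
worst_status() {
  awk '
    /^[ \t]*\[ failure \]/ { fail = 1 }
    /^[ \t]*\[ warning \]/ { warn = 1 }
    END {
      if (fail)      print "failure"
      else if (warn) print "warning"
      else           print "success"
    }
  '
}

# Example call (commented out; requires a Vault binary and config file):
# vault operator diagnose -config=server.hcl | worst_status
```

A filter like this only summarizes the prefixes. The advice lines and the individual check messages still need to be read to find the cause.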
Examples
Here are some examples of using vault operator diagnose with different scenarios and outputs.
Example 1: Successful diagnose
In this example, vault operator diagnose is run with a valid configuration file and no errors are detected.
vault-diagnose:~# vault operator diagnose -config /etc/vault/server.hcl
Vault v1.13.2 (b9b773f1628260423e6cc9745531fd903cae853f), built 2023-04-25T13:02:50Z
Results:
[ warning ] Vault Diagnose: HCP link check will not run on OSS Vault.
[ success ] Check Operating System
[ success ] Check Open File Limits: Open file limits are set to 1048576.
[ success ] Check Disk Usage: /etc/resolv.conf usage ok.
[ success ] Check Disk Usage: /vault/file usage ok.
[ success ] Check Disk Usage: /etc/hosts usage ok.
[ success ] Check Disk Usage: /dev/termination-log usage ok.
[ success ] Check Disk Usage: /etc/hostname usage ok.
[ success ] Check Disk Usage: /vault/logs usage ok.
[ success ] Check Disk Usage: /opt/instruqt/bootstrap usage ok.
[ success ] Parse Configuration
[ warning ] Check Telemetry: Telemetry is using default configuration
By default only Prometheus and JSON metrics are available. Ignore this warning if you are using telemetry or are using these
metrics and are satisfied with the default retention time and gauge period.
[ success ] Check Storage
[ success ] Create Storage Backend
[ success ] Check Raft Folder Permissions: Raft BoltDB file has correct set of permissions.
[ success ] Check For Raft Quorum: Voter quorum exists: 5 voters.
[ skipped ] Check Service Discovery: No service registration configured.
[ success ] Create Vault Server Configuration Seals
[ skipped ] Check Transit Seal TLS: No transit seal found in seal configuration.
[ success ] Create Core Configuration
[ success ] Initialize Randomness for Core
[ success ] HA Storage
[ success ] Create HA Storage Backend
[ skipped ] Check HA Consul Direct Storage Access: No HA storage stanza is configured.
[ success ] Determine Redirect Address
[ success ] Check Cluster Address: Cluster address is logically valid and can be found.
[ success ] Check Core Creation
[ skipped ] Check For Autoloaded License: License check will not run on OSS Vault.
[ warning ] Start Listeners
[ warning ] Check Listener TLS: Listener at address 127.0.0.1:8200: TLS is disabled in a listener config stanza.
[ success ] Create Listeners
[ skipped ] Check Autounseal Encryption: Skipping barrier encryption test. Only supported for auto-unseal.
[ success ] Check Server Before Runtime
[ success ] Finalize Shamir Seal
Example 2: Failed storage check
In this example, vault operator diagnose is run with a configuration file whose raft storage path does not exist.
vault-diagnose:~# vault operator diagnose -config=server.hcl
Vault v1.13.0+ent (d07983ca207de9fd561c24390ee2d84379739a32), built 2023-03-01T11:09:12Z
Results:
[ failure ] Vault Diagnose
[ success ] Check Operating System
[ success ] Check Open File Limits: Open file limits are set to 65535.
[ success ] Check Disk Usage: / usage ok.
[ success ] Parse Configuration
[ warning ] Check Telemetry: Telemetry is using default configuration
By default only Prometheus and JSON metrics are available. Ignore this warning
if you are using telemetry or are using these metrics and are satisfied with the
default retention time and gauge period.
[ failure ] Check Storage: Diagnose could not initialize storage backend.
[ failure ] Create Storage Backend: Error initializing storage of type raft: failed to create fsm: failed to open bolt file: open /opt/vault/data_not_exist/vault.db: no such file or directory
Example 3: Failed permission storage check
In this example, vault operator diagnose reports a permission denied error on the /var/vault/data/vault.db file under the Check Storage section.
vault-diagnose:~# vault operator diagnose -config=server.hcl
Vault v1.13.2 (b9b773f1628260423e6cc9745531fd903cae853f), built 2023-04-25T13:02:50Z
Results:
[ failure ] Vault Diagnose
[ success ] Check Operating System
[ success ] Check Open File Limits: Open file limits are set to 1048576.
[ success ] Check Disk Usage: /etc/resolv.conf usage ok.
[ success ] Check Disk Usage: /vault/file usage ok.
[ success ] Check Disk Usage: /etc/hosts usage ok.
[ success ] Check Disk Usage: /dev/termination-log usage ok.
[ success ] Check Disk Usage: /etc/hostname usage ok.
[ success ] Check Disk Usage: /vault/logs usage ok.
[ success ] Check Disk Usage: /opt/instruqt/bootstrap usage ok.
[ success ] Parse Configuration
[ warning ] Check Telemetry: Telemetry is using default configuration
By default only Prometheus and JSON metrics are available. Ignore this warning
if you are using telemetry or are using these metrics and are satisfied with the
default retention time and gauge period.
[ failure ] Check Storage: Diagnose could not initialize storage backend.
[ failure ] Create Storage Backend: Error initializing storage of type raft: failed to create fsm: failed to open bolt file:
error checking raft FSM db file "/var/vault/data/vault.db": stat /var/vault/data/vault.db: permission denied
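One way to confirm this kind of failure is to check whether the user running Diagnose can read the file at all. The sketch below is a minimal example. The function name check_readable is hypothetical, and the check is only meaningful when it is run as the same user that runs the Vault server.

```shell
# Hypothetical helper: report whether the current user can read a file.
# Prints a short message and returns non-zero when the file is not readable
# (including when it does not exist or a parent directory blocks access).
check_readable() {
  if [ -r "$1" ]; then
    echo "readable: $1"
  else
    echo "not readable: $1"
    return 1
  fi
}

# Example call (commented out so the snippet has no side effects):
# check_readable /var/vault/data/vault.db
```

If the file is reported as not readable, listing the file and its parent directory with ls -ld usually shows which owner or mode bits need to change.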
Example 4: Running diagnose on a running server
In this example, vault operator diagnose is run while the vault server is already running. Some checks may not be applicable or may return false results in this case. For example, the storage check may fail because the storage backend is already in use by the running server.
vault-diagnose:~# vault operator diagnose -config=server.hcl
Vault v1.13.2+ent (ca5ea376af76440caa5e82e9b659b9e7c09268b4), built 2023-04-25T19:30:45Z
Results:
[ failure ] Vault Diagnose
[ success ] Check Operating System
[ success ] Check Open File Limits: Open file limits are set to 65535.
[ success ] Check Disk Usage: / usage ok.
[ success ] Parse Configuration
[ warning ] Check Telemetry: Telemetry is using default configuration
By default only Prometheus and JSON metrics are available. Ignore this warning
if you are using telemetry or are using these metrics and are satisfied with the
default retention time and gauge period.
[ failure ] Check Storage: Diagnose could not initialize storage backend.
[ failure ] Create Storage Backend: Error initializing storage of type raft: failed to create fsm: failed to open bolt file: timeout
Stop the running Vault process so that Diagnose can open the storage backend, then run the command again:
vault-diagnose:~# pkill -9 vault
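Before running Diagnose, it can help to check whether a Vault server process is still running on the host. The sketch below is a minimal example. It assumes pgrep is available and that the server process is named vault, and the function name is_running is hypothetical.

```shell
# Hypothetical helper: succeed if a process with exactly the given name exists.
# If pgrep is not installed, this also returns non-zero, so treat a negative
# result as "not detected" rather than proof that nothing is running.
is_running() {
  pgrep -x "$1" >/dev/null 2>&1
}

if is_running vault; then
  echo "vault is still running; stop it before running diagnose" >&2
fi
```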
Example 5: Failed listener check
In this example, vault operator diagnose is run with a configuration file that specifies an expired TLS certificate for the listener.
vault-diagnose:~# vault operator diagnose -config=server.hcl
Vault v1.13.0+ent (d07983ca207de9fd561c24390ee2d84379739a32), built 2023-03-01T11:09:12Z
Results:
[ failure ] Vault Diagnose
[ success ] Check Operating System
[ success ] Check Open File Limits: Open file limits are set to 65535.
[ success ] Check Disk Usage: / usage ok.
[ success ] Parse Configuration
[ warning ] Check Telemetry: Telemetry is using default configuration
By default only Prometheus and JSON metrics are available. Ignore this warning
if you are using telemetry or are using these metrics and are satisfied with the
default retention time and gauge period.
[ warning ] Check Storage
[ success ] Create Storage Backend
[ warning ] Check Raft Folder Permissions: too many permissions: perms are drwxr-xr-x
[ warning ] Check For Raft Quorum: Only one server node found. Vault is not running in high availability mode.
[ skipped ] Check Service Discovery: No service registration configured.
[ success ] Create Vault Server Configuration Seals
[ skipped ] Check Transit Seal TLS: No transit seal found in seal configuration.
[ success ] Create Core Configuration
[ success ] Initialize Randomness for Core
[ success ] HA Storage
[ success ] Create HA Storage Backend
[ skipped ] Check HA Consul Direct Storage Access: No HA storage stanza is configured.
[ success ] Determine Redirect Address
[ success ] Check Cluster Address: Cluster address is logically valid and can be found.
[ success ] Check Core Creation
[ success ] Check For Autoloaded License
[ failure ] Start Listeners
[ failure ] Check Listener TLS: 0.0.0.0:8200: Failed to verify primary provided leaf certificate: x509: certificate has expired or is not yet valid: current time 2023-05-10T20:32:39Z is after 2023-04-29T21:03:27Z.
[ warning ] 0.0.0.0:8200: Leaf certificate 177880144847789704766300930146131327546232323996 is expired or near expiry. Time to expire is: 983h29m12.721186498s.
[ success ] Create Listeners
[ skipped ] Check Autounseal Encryption: Skipping barrier encryption test. Only supported for auto-unseal.
[ success ] Check Server Before Runtime
[ success ] Finalize Shamir Seal
Troubleshooting
If you encounter any issues with using vault operator diagnose, you can try the following steps:
- Make sure you are using the same configuration file and environment variables as the vault server command.
- Make sure you have the latest version of Vault installed.
- Check the Vault documentation and learn guides for more information on how to configure and operate Vault.
- Contact HashiCorp support or join the Vault community forum for further assistance.
References:
(1) Announcing HashiCorp Vault 1.8. https://www.hashicorp.com/blog/vault-1-8.
(2) operator diagnose - Command | Vault | HashiCorp Developer. https://developer.hashicorp.com/vault/docs/commands/operator/diagnose.
(3) Diagnose Server Issues | Vault - HashiCorp Learn. https://learn.hashicorp.com/tutorials/vault/diagnose-startup-issues?in=vault/monitoring.
(4) Vault Community hours. https://www.youtube.com/watch?v=P8maAN9s3s4.