Introduction
Expected Outcome
Benchmarking a Vault cluster helps you understand how the cluster is expected to behave under load, in particular scenarios, with its current configuration.
Benchmarks are also a good way to measure the time taken by particular secrets and authentication requests. This article is an addendum to other articles on performance and tuning (example), which can help in identifying cluster sizing and appropriate hardware resource allocation.
Prerequisites
- Do not benchmark your production cluster
- Separate Vault cluster for benchmarking, or a development environment
- Benchmark tools
- Telemetry enabled
- Audit logging enabled
- Trace-level operational logging enabled
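As a rough illustration of verifying one of these prerequisites, the sketch below lists the enabled audit devices through the `GET /v1/sys/audit` endpoint. It assumes the device map is returned under the `data` key of the response and that a token with sufficient privileges is available. The helper names and the sample payload are hypothetical and exist only for this example.

```python
import json
import urllib.request


def enabled_audit_devices(payload: dict) -> list[str]:
    """Return the mount paths of enabled audit devices from a sys/audit response.

    Assumes the device map sits under the "data" key of the response body.
    """
    devices = payload.get("data") or {}
    return sorted(devices)


def fetch_audit_devices(addr: str, token: str) -> list[str]:
    """Query GET /v1/sys/audit on the benchmark cluster (needs a privileged token)."""
    request = urllib.request.Request(
        f"{addr}/v1/sys/audit", headers={"X-Vault-Token": token}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return enabled_audit_devices(json.load(response))


# Hypothetical response body, for illustration only.
sample = {
    "data": {
        "file/": {"type": "file", "options": {"file_path": "/var/log/vault_audit.log"}}
    }
}
print(enabled_audit_devices(sample))
```

An empty list would suggest that no audit device is enabled yet on the benchmark cluster.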
Procedure
Benchmark tools
The vault-guides repository provides scripts for testing a cluster, as well as a Terraform template for deploying a cluster in AWS. These are not supported by HashiCorp; treat them as suggested starting points to be tailored to your use case.
Guides on the usage of scripts are available here:
Cluster configuration
Integrated Storage backend
For this exercise, it is best to test against a cluster of the same size and configuration as the production cluster. At a minimum, 3 nodes are needed.
Consul backend
For a Consul-backed cluster, a minimum of 5 Consul nodes and 3 Vault nodes is recommended.
Single-node deployments will not give an accurate depiction of how a cluster will operate under load. Additionally, any TLS or other configuration-specific items that would be deployed on the production cluster should be applied here.
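For an Integrated Storage cluster, the node-count guidance above can be checked before a benchmark starts. The following is a minimal sketch that assumes the raft configuration (for example, from `GET /v1/sys/storage/raft/configuration`) has the shape `data.config.servers`, with a boolean `voter` field per server. The function names and sample payload are hypothetical.

```python
MIN_VOTERS = 3  # minimum node count given in the text for Integrated Storage


def count_voters(raft_config: dict) -> int:
    """Count voting peers in a raft configuration payload.

    Assumes the shape {"data": {"config": {"servers": [{"voter": bool, ...}]}}}.
    """
    servers = raft_config.get("data", {}).get("config", {}).get("servers", [])
    return sum(1 for server in servers if server.get("voter"))


def meets_minimum(raft_config: dict, minimum: int = MIN_VOTERS) -> bool:
    """Return True when the cluster has at least `minimum` voting nodes."""
    return count_voters(raft_config) >= minimum


# Hypothetical three-node cluster, for illustration only.
sample_config = {
    "data": {
        "config": {
            "servers": [
                {"node_id": "vault-1", "leader": True, "voter": True},
                {"node_id": "vault-2", "leader": False, "voter": True},
                {"node_id": "vault-3", "leader": False, "voter": True},
            ]
        }
    }
}
print(meets_minimum(sample_config))
```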
Running the benchmark
When running the benchmark, it is best to run it over a period of time rather than just once. The script run_tests.sh defaults to 6 hours. This can be changed by setting the -d flag to whichever time interval is desired. To see how much load the cluster can take, increase the values for each script as detailed in the README.md.
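The idea of running a workload over a time window rather than once can be sketched independently of the vault-guides scripts. The example below is not how run_tests.sh works; it is a hypothetical loop that repeats an arbitrary operation until a duration elapses and then reports a nearest-rank percentile of the observed latencies. The `operation` callable stands in for whatever Vault request is being measured.

```python
import math
import time
from typing import Callable


def run_for_duration(
    operation: Callable[[], None],
    duration_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> list[float]:
    """Call `operation` repeatedly until `duration_seconds` has elapsed.

    Returns the latency of each call, in the clock's units (seconds by default).
    """
    latencies = []
    start = clock()
    while clock() - start < duration_seconds:
        began = clock()
        operation()
        latencies.append(clock() - began)
    return latencies


def percentile(values: list[float], fraction: float) -> float:
    """Nearest-rank percentile; `fraction` is in (0, 1], e.g. 0.95 for p95."""
    ordered = sorted(values)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # time.sleep stands in for a real Vault request in this illustration.
    samples = run_for_duration(lambda: time.sleep(0.001), duration_seconds=0.05)
    print(f"{len(samples)} calls, p95 = {percentile(samples, 0.95):.4f}s")
```

Injecting the clock keeps the loop testable without waiting for real time to pass.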
Investigating the results
When investigating the results of the benchmark test, it is best to use telemetry and the Vault logs to understand how the cluster responds under load. The full list of telemetry metrics is available on the telemetry documentation page. Some suggested metrics are listed below, but they are only a minimal starting set. Depending on the benchmark that was performed, it is also recommended to check the metrics for the specific secrets engines and authentication methods that were created and used.
Suggested metrics:
vault.runtime.sys_bytes
vault.runtime.alloc_bytes
vault.expire.revoke
vault.expire.renew
vault.token.create
vault.token.revoke
vault.merkle.flushDirty
vault.merkle.flushDirty.outstanding_pages
vault.ha.rpc.client.forward
vault.ha.rpc.client.forward.errors
vault.consul.* or vault.raft.* - for operations against the backend, such as get
vault.raft.state.leader
vault.raft.state.candidate
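To narrow a metrics dump down to the names suggested above, something like the following sketch could be used. It assumes the metrics were retrieved as JSON (for example, from `GET /v1/sys/metrics`) with `Gauges`, `Counters`, and `Samples` lists whose entries carry a `Name`, and that `disable_hostname = true` is set in the telemetry configuration so that names are not prefixed with the host. The function name and sample payload are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns taken from the suggested metrics list in this article.
SUGGESTED = [
    "vault.runtime.sys_bytes",
    "vault.runtime.alloc_bytes",
    "vault.expire.revoke",
    "vault.expire.renew",
    "vault.token.create",
    "vault.token.revoke",
    "vault.merkle.flushDirty",
    "vault.merkle.flushDirty.outstanding_pages",
    "vault.ha.rpc.client.forward",
    "vault.ha.rpc.client.forward.errors",
    "vault.consul.*",
    "vault.raft.*",
]


def select_metrics(payload: dict, patterns: list[str] = SUGGESTED) -> list[dict]:
    """Return metric entries whose Name matches any of the given patterns.

    Assumes "Gauges", "Counters" and "Samples" lists whose entries carry a "Name".
    """
    selected = []
    for section in ("Gauges", "Counters", "Samples"):
        for metric in payload.get(section) or []:
            name = metric.get("Name", "")
            if any(fnmatchcase(name, pattern) for pattern in patterns):
                selected.append(metric)
    return selected


# Hypothetical payload, for illustration only.
sample_metrics = {
    "Gauges": [{"Name": "vault.runtime.alloc_bytes", "Value": 1234}],
    "Counters": [{"Name": "vault.raft.state.leader", "Count": 1}],
    "Samples": [
        {"Name": "vault.token.create", "Count": 10},
        {"Name": "vault.core.handle_request", "Count": 50},
    ],
}
for metric in select_metrics(sample_metrics):
    print(metric["Name"])
```

Wildcard patterns such as `vault.raft.*` also cover the more specific names like `vault.raft.state.leader`, so those do not need to be listed separately in this sketch.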
Additionally, review the operational and audit logs for request and connection errors, as well as any instability in leadership, to surface potential areas of concern.
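A first pass over the audit log can be automated. The sketch below assumes a file audit device writing one JSON object per line, where response entries have `"type": "response"`, a non-empty `error` field on failure, and the request path under `request.path`. The function name and the sample lines are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def count_audit_errors(lines: Iterable[str]) -> Counter:
    """Count failed responses in audit log lines, keyed by request path."""
    errors = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines
        if not isinstance(entry, dict):
            continue
        # Only response entries are counted, so a failure is not counted twice.
        if entry.get("type") == "response" and entry.get("error"):
            path = (entry.get("request") or {}).get("path", "<unknown>")
            errors[path] += 1
    return errors


# Hypothetical audit log lines, for illustration only.
sample_lines = [
    '{"type": "request", "request": {"path": "secret/data/app"}}',
    '{"type": "response", "error": "permission denied", "request": {"path": "secret/data/app"}}',
    '{"type": "response", "request": {"path": "auth/approle/login"}}',
    '{"type": "response", "error": "invalid secret id", "request": {"path": "auth/approle/login"}}',
    "not json",
]
for path, count in count_audit_errors(sample_lines).most_common():
    print(path, count)
```

In practice the lines would come from the audit log file itself, for example by passing an open file handle to `count_audit_errors`.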