Introduction
Expected Outcome
Benchmarking a Vault cluster is an important activity that helps in understanding how the cluster, with its current configuration, is expected to behave under load in particular scenarios.
Performing benchmarks can also be a good measure of the time taken for particular secrets and authentication requests. This is an addendum to other articles on performance and tuning (example), which can help in identifying cluster sizing and appropriate hardware resource allocation.
Prerequisites
- Do not benchmark your production cluster.
- Separate Vault cluster for benchmarking or a development environment.
- Benchmark tools
- Telemetry enabled
- Audit logging enabled
- Trace level operational logging enabled (a sample configuration covering telemetry, audit logging, and trace logging follows this list)
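As a minimal sketch of the last three prerequisites, assuming a statsd-compatible collector on localhost and placeholder file paths, the relevant configuration and commands might look like this:

```shell
# Add telemetry and trace-level logging to the Vault server configuration
# (placeholder config path and statsd address - adjust for your environment).
cat >> /etc/vault.d/vault.hcl <<'EOF'
telemetry {
  statsd_address   = "127.0.0.1:8125"
  disable_hostname = true
}

log_level = "trace"
EOF

# After restarting Vault, enable a file audit device (placeholder log path).
vault audit enable file file_path=/var/log/vault_audit.log
```

Trace-level logging is very verbose, so it is typically left enabled only for the duration of the benchmark.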
Procedure
Benchmark tools
The vault-guides repository provides scripts for testing a cluster, as well as a Terraform template to deploy a cluster in AWS. These are not supported by HashiCorp and are suggestive guides to be tailored further to your use cases.
Guides on the usage of the scripts are available here:
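As a rough sketch, the repository and its benchmarking scripts can be retrieved and located as follows (the directory layout may change over time, so verify against the repository README files):

```shell
# Clone the guides repository and locate the benchmarking script
# referenced later in this article (run_tests.sh).
git clone https://github.com/hashicorp/vault-guides.git
cd vault-guides
find . -type f -name run_tests.sh
```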
Cluster configuration
Integrated Storage backend
For this exercise it is best to test against a cluster of the same size and configuration as production; at a minimum, 3 Vault nodes are needed.
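For illustration, a single node in a three-node Integrated Storage cluster might be configured along the lines of the sketch below; the node IDs, hostnames, and TLS paths are placeholders and should mirror the production configuration:

```shell
# Example server configuration for one node of a 3-node Raft cluster
# (placeholder hostnames and certificate paths).
cat > /etc/vault.d/vault.hcl <<'EOF'
storage "raft" {
  path    = "/opt/vault/data"
  node_id = "vault-node-1"

  retry_join {
    leader_api_addr = "https://vault-node-2.example.com:8200"
  }
  retry_join {
    leader_api_addr = "https://vault-node-3.example.com:8200"
  }
}

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/etc/vault.d/tls/vault.crt"
  tls_key_file  = "/etc/vault.d/tls/vault.key"
}

api_addr     = "https://vault-node-1.example.com:8200"
cluster_addr = "https://vault-node-1.example.com:8201"
EOF
```

Once all three nodes are joined and unsealed, `vault operator raft list-peers` can be used to confirm the expected cluster size before starting the benchmark.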
Consul backend
For a Consul-backed cluster, a minimum of 5 Consul nodes and 3 Vault nodes is recommended.
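For reference, the Vault storage stanza for a Consul backend is sketched below; the local Consul client agent address and KV path are placeholders, and the 5-node Consul server quorum is provisioned separately:

```shell
# Example Consul storage stanza, pointing Vault at the local Consul client agent
# (placeholder address and KV path); use instead of the "raft" stanza above.
cat >> /etc/vault.d/vault.hcl <<'EOF'
storage "consul" {
  address = "127.0.0.1:8500"
  path    = "vault/"
}
EOF
```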
Single-node deployments will not give an accurate depiction of how a cluster will operate under load. Additionally, any TLS or other configuration-specific items that would be deployed on the production cluster should be applied here.
Running the benchmark
When running the benchmark, it is best to run it over a period of time rather than just once. The run_tests.sh script defaults to a duration of 6 hours; this can be changed by setting the -d flag to whichever time interval is desired. To see how much load the cluster can take, increase the values for each script as detailed in the README.md.
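An illustrative invocation is shown below; the exact flags and the duration value format should be confirmed against the README.md in the repository:

```shell
# Run the benchmark with the script's default 6-hour duration...
./run_tests.sh

# ...or pass -d to use a different interval (value format per the README.md).
./run_tests.sh -d <duration>
```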
Investigating the results
When investigating the results of the benchmark test, it is best to use telemetry and the Vault logs to understand how the cluster responds under load. The full list of telemetry metrics is available on the telemetry documentation page. Some suggested metrics are listed below, but these are only a minimal subset of what is worth examining; based on the benchmark that was performed, it is also recommended to check the metrics for the specific secrets engines and auth methods that were created and used. A sketch of one way to collect these metrics follows the list.
Suggested metrics:
vault.runtime.sys_bytes
vault.runtime.alloc_bytes
vault.expire.revoke
vault.expire.renew
vault.token.create
vault.token.revoke
vault.merkle.flushDirty
vault.merkle.flushDirty.outstanding_pages
vault.ha.rpc.client.forward
vault.ha.rpc.client.forward.errors
vault.consul.*
vault.raft.* - for operations against the backend, such as get
vault.raft.state.leader
vault.raft.state.candidate
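One way to collect these metrics during a run, assuming the telemetry stanza shown earlier (with prometheus_retention_time set if the Prometheus format is wanted) and a token permitted to read sys/metrics, is to poll the metrics endpoint:

```shell
# Poll Vault's metrics endpoint while the benchmark is running
# (omit ?format=prometheus to receive the default JSON output; the Prometheus
#  format requires prometheus_retention_time in the telemetry stanza).
export VAULT_ADDR="https://vault-node-1.example.com:8200"   # placeholder address
curl --silent \
  --header "X-Vault-Token: $VAULT_TOKEN" \
  "$VAULT_ADDR/v1/sys/metrics?format=prometheus"
```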
Additionally, reviewing the operational and audit logs for request and connection errors, as well as for any instability in leadership, is very helpful for spotting items that may be of concern.
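As a simple starting point, assuming a systemd-managed Vault service and the file audit device from the earlier sketch, the logs can be scanned along these lines:

```shell
# Scan the operational log for errors and leadership changes during the run
# (placeholder service name and paths - adjust for your environment).
journalctl -u vault --since "6 hours ago" | grep -iE "error|leader|standby|active"

# Count audit entries that contain an error field
# (the audit log format may vary between Vault versions).
grep -ci '"error"' /var/log/vault_audit.log
```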