If your total lease count begins to climb rapidly and unexpectedly, you may want better insight into where and how those leases are being generated. The telemetry metrics that Vault emits are a simple way to gain this insight, and they also let you set up threshold alerting to avoid potential performance issues in the future.
A Vault operator should be able to identify which auth method or secrets engine is generating leases unexpectedly.
Prerequisites
- Vault telemetry is enabled and is being ingested into tooling such as Splunk or Grafana.
Vault performance is beginning to degrade, or the Vault lease count is growing rapidly outside the patterns expected from normal Vault usage.
Metrics of Interest
- `vault.expire.num_leases` (shown as `vault_expire_num_leases` in Prometheus-format output) - This metric tracks the total number of leases that are eligible for eventual expiration.
- `vault.secret.lease.creation` - This metric counts the number of leases generated by secrets engines. It can be split by metadata such as namespace, mount point, and creation TTL.
- `vault.token.creation` - This metric counts the number of tokens that have been created. It can be split by metadata such as namespace, mount point, creation TTL, and token type.
Using the metrics above, you can gain insight into where leases are coming from. Checking the total number of leases eligible for expiration over a time period reveals spikes in lease creation. These spikes can come either from token generation (auth logins) or from dynamic secret generation.
Let's look at an example workflow in which an application logs in to Vault via AppRole authentication and then generates a dynamic secret for a MySQL database. Thinking through the steps required, we can generally expect 2 leases to be generated per workflow execution: 1 lease for the token generated upon successful authentication against the AppRole auth method, and 1 lease for the dynamically generated MySQL credential.
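The lease arithmetic for this example workflow can be sketched in a few lines of Python. This is only an illustrative model, not a Vault API: the names `LEASES_PER_RUN`, `expected_lease_delta`, and `unexplained_leases` are hypothetical, and the sketch assumes that no leases expire or are revoked during the observation window.

```python
# Hypothetical model of the example workflow; these names are illustrative
# and are not part of Vault or any Vault client library.
LEASES_PER_RUN = {
    "approle_login_token": 1,       # token lease created by the AppRole login
    "mysql_dynamic_credential": 1,  # lease for the dynamic MySQL credential
}


def expected_lease_delta(runs: int) -> int:
    """Return how many new leases `runs` executions of the workflow should create."""
    return runs * sum(LEASES_PER_RUN.values())


def unexplained_leases(observed_delta: int, runs: int) -> int:
    """Return the observed lease increase that the workflow does not account for.

    Assumes no leases expired or were revoked during the window.
    """
    return observed_delta - expected_lease_delta(runs)


# 10 executions of the workflow should account for 20 new leases.
print(expected_lease_delta(10))
```

A clearly positive result from `unexplained_leases` suggests that something other than the known workflow is creating leases in that window.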
If we run this workflow, say, 10 times within 1 minute, we can expect `vault.expire.num_leases` to increase by 20. Now suppose someone developing a similar workflow misconfigures something along the way, and the application starts requesting database credentials in a loop. We would see `vault.expire.num_leases` start to climb unexpectedly. We can find the time frame in which the increase began and correlate the climb with the other metrics mentioned. Looking at the `vault.secret.lease.creation` metric, split by mount point, gives a count of how many leases were generated in that time frame. With these data points, we can see whether any mount point was generating an abnormal number of leases and begin investigating with the Vault consumers.
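The per-mount correlation described above can be sketched as follows. This is a minimal illustration that assumes the `vault.secret.lease.creation` data points for the time frame have already been exported from your monitoring tool as `(labels, count)` pairs. The sample values, the `expected` baseline, the `tolerance` factor, and the function names are all hypothetical; only the idea of splitting by the mount point label comes from the article.

```python
from collections import Counter


def leases_by_mount(samples):
    """Sum lease-creation counts per mount point.

    `samples` is an iterable of (labels, count) pairs, where `labels` is a
    dict that is assumed to carry a "mount_point" key.
    """
    totals = Counter()
    for labels, count in samples:
        totals[labels.get("mount_point", "unknown")] += count
    return totals


def abnormal_mounts(totals, expected, tolerance=2.0):
    """Return mounts whose count exceeds `tolerance` times the expected baseline.

    Mounts missing from `expected` have a baseline of 0, so any lease
    creation on them is flagged.
    """
    return sorted(
        mount
        for mount, count in totals.items()
        if count > tolerance * expected.get(mount, 0)
    )


# Hypothetical data points for the time frame in which the climb began.
samples = [
    ({"namespace": "root", "mount_point": "database/"}, 10),
    ({"namespace": "root", "mount_point": "database/"}, 450),
    ({"namespace": "root", "mount_point": "aws/"}, 12),
]
expected = {"database/": 10, "aws/": 10}  # assumed normal counts per window

print(abnormal_mounts(leases_by_mount(samples), expected))  # ['database/']
```

The mounts flagged here are only a starting point for the conversation with the Vault consumers that use them. A suitable baseline and tolerance depend on your own usage patterns.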