Introduction
Problem
When a cluster is under heavy load caused by write operations, data consistency issues can arise on the Performance Standby nodes. While this is rare, if this behavior is present and it is not simply an eventual-consistency delay, checking the items below will help diagnose the issue.
Prerequisites
- A Vault cluster with an enterprise license.
Cause
- Dirty page flushing issues due to heavy write operations.
- Merkle Sync issues on the replication cluster.
Solutions:
- Investigate and reduce the number of write operations performed.
- Upgrade your cluster to version 1.12 (Integrated Storage) or 1.13 (Consul storage).
How to investigate excessive writes:
Metrics
This list is not exhaustive; these are some of the more frequently used metrics for identifying these issues. Please see the links at the bottom of the page for other related items. A sketch of how to pull these counters follows the list.
- vault.merkle.flushDirty - This metric is generally used to show issues with cluster-to-cluster communication; however, Performance Standby nodes rely on some of the same underlying mechanisms. It is also a good metric for uncovering heavy writes to the cluster.
- vault.secret.lease.creation - This metric is helpful in understanding where certain secrets may be generating an excessive number of leases. If the TTL seems normal, yet an application is creating too many leases, that application will need to be adjusted to write to the Vault database less frequently.
- vault.identity.entity.creation - In some instances, third-party plugins might be configured to update the entity and identity associated with a specific secret. If this is not necessary, reducing the entity change frequency can help. If it is necessary due to requirements, then reducing the secret retrieval frequency can help instead.
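If telemetry is enabled on the cluster, these counters can be pulled straight from the sys/metrics endpoint and filtered. The following is a minimal sketch rather than an exact procedure: it assumes prometheus_retention_time is set in the server's telemetry stanza, that VAULT_ADDR and VAULT_TOKEN are already exported, and that the underscore-separated names below are how the dotted metric names appear in the Prometheus output.
# Pull current counters and filter for the metrics discussed above:
curl -s --header "X-Vault-Token: $VAULT_TOKEN" \
  "$VAULT_ADDR/v1/sys/metrics?format=prometheus" \
  | grep -E 'vault_merkle_flushDirty|vault_secret_lease_creation|vault_identity_entity_creation'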
BoltDB inspection
- Note - This must be done on a node that has been taken offline, or on a snapshot restored outside of your production cluster.
- Listing the keys can help you understand which keys exist in which namespaces. Depending on how the various applications are connecting, this may be useful for finding anomalies.
- Diving into the pages can help you understand where various leases might be getting generated; see the sketch after this list.
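As a rough sketch of what an offline inspection can look like, assuming a copy of the Raft backend's vault.db has been placed at /tmp/vault.db and that the bbolt CLI from go.etcd.io/bbolt is installed (the path and the "data" bucket name are examples; list the buckets first to confirm what is present):
# Install the bbolt CLI (one-time):
go install go.etcd.io/bbolt/cmd/bbolt@latest
# List the buckets in the copied database file:
bbolt buckets /tmp/vault.db
# List the keys in a bucket (for example, "data") to look for anomalies such as unexpected lease paths:
bbolt keys /tmp/vault.db data | less
# Show database and page statistics to see where space, and therefore write activity, is concentrated:
bbolt stats /tmp/vault.db
bbolt pages /tmp/vault.db | head -40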
Logging
- In your Vault operational logs, you may see many occurrences of the following messages on the Performance Standby nodes. This occurs when the cluster is under heavy writes and communication between the nodes is strained (a quick way to count these errors is sketched after the example messages):
[ERROR] core: error during forwarded RPC request: error="rpc error: code = Canceled desc = context canceled"
[ERROR] core: forward request error: error="error during forwarding RPC request"
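To get a quick sense of how frequently these forwarding errors occur, the operational log can be searched and bucketed. A minimal sketch, assuming Vault runs under systemd as the vault unit and, alternatively, that it logs to /var/log/vault/vault.log (adjust both to your environment):
# Count forwarding errors seen in the last 24 hours:
journalctl -u vault --since "24 hours ago" | grep -c "error during forwarded RPC request"
# If Vault logs to a file, bucket the same errors by hour to see when they spike:
grep "error during forwarded RPC request" /var/log/vault/vault.log \
  | awk '{print substr($1, 1, 13)}' | sort | uniq -c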
- In your audit logs, it is also possible to see how many requests each application is making and how often, as sketched below.
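A minimal sketch of that kind of analysis, assuming a file audit device writing JSON entries to /var/log/vault/audit.log (the path is an example), using jq to group requests by path and by the token's display name:
# Top request paths by volume:
jq -r 'select(.type == "request") | .request.path' /var/log/vault/audit.log \
  | sort | uniq -c | sort -rn | head -20
# Requests per client, using the display name attached to the token:
jq -r 'select(.type == "request") | .auth.display_name // "unknown"' /var/log/vault/audit.log \
  | sort | uniq -c | sort -rn | head -20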
Merkle Diff errors
In versions prior to 1.12 for Integrated Storage backed clusters, and prior to 1.13 for Consul backed clusters, there is a known issue with cluster-to-cluster communication. To identify whether this is an issue, please see this document for the various items that can be checked to determine whether the cluster is affected. If so, it can have an unintended effect on Performance Standby nodes. If this issue is present and the cluster is not on version 1.12 or 1.13, please upgrade as soon as possible. If using version 1.12 with Integrated Storage, you will also need to set the VAULT_REPLICATION_USE_UNDO_LOGS=true environment variable; it is enabled by default in 1.13.
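A minimal sketch of setting that variable when Vault runs under systemd (the unit name vault and the drop-in file name are assumptions; any mechanism that places the variable in the Vault server's environment works):
# Create a systemd drop-in that adds the environment variable to the Vault service:
sudo mkdir -p /etc/systemd/system/vault.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/vault.service.d/undo-logs.conf
[Service]
Environment=VAULT_REPLICATION_USE_UNDO_LOGS=true
EOF
# Reload systemd and restart Vault on each node, one at a time, to pick up the change:
sudo systemctl daemon-reload
sudo systemctl restart vault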
Outcome
Investigating this issue is rather complex, but diving into how applications are using your Vault cluster will go a long way toward maintaining a set of healthy and happy clusters.