This is a write-up prepared by Vault Engineering to help consumers of Vault Enterprise better understand replication mechanics and the common scenarios observed with merkle sync loops. It was written for Vault Enterprise version 1.13.2, but much of it holds true across versions.
The following is an overview of the Vault replication system, the merkle sync issue observed by customers, and remediation steps.
A merkle tree is an in-memory hash tree representing all of the encrypted data in Vault, where each leaf of the tree contains a hash of a key and that key's encrypted value. Vault uses merkle trees to determine whether two clusters are in sync. Each Vault cluster maintains two merkle trees: shared and local. The shared merkle tree contains replicated data, while the local merkle tree contains only data local to that cluster. Disaster recovery replication uses both trees; performance replication uses only the shared merkle tree.
When data is modified, Vault first updates its physical storage with the change and then updates its merkle trees. When a merkle tree leaf changes, each parent of that leaf must be updated to reflect the new hash, all the way up to the merkle root.
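The hash-up-to-the-root behavior can be illustrated with a toy binary merkle tree (a sketch for intuition only; the function names and layout here are hypothetical and not Vault's actual implementation):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def leaf_hash(key: bytes, encrypted_value: bytes) -> bytes:
    # Each leaf hashes the key together with that key's encrypted value.
    return h(key + encrypted_value)

def root_hash(leaves: list[bytes]) -> bytes:
    # Recompute every parent level up to the root. Updating one leaf
    # therefore changes every hash on its path to the merkle root.
    level = leaves
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # duplicate last node on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```

Changing a single leaf's value produces a different root hash, which is what lets two clusters compare a single root to decide whether they are in sync.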
Vault stores its merkle trees in physical storage, and they are loaded into memory during startup. Changes made to a merkle tree at runtime reside in memory as dirty pages until they are periodically flushed to physical storage (2% of dirty pages are flushed every 200ms). Because dirty pages reside in memory and could be lost in the event of a node failure, Vault uses the WALs stored in physical storage to recover the merkle tree if not all pages were flushed to storage.
Write-ahead logs (WALs) encapsulate changes to data within Vault and are used for replication and crash recovery. Each WAL records what the merkle root hash was before the included changes were applied, along with up to 62 batched changes. WALs are kept both in an in-memory buffer called the logshipper and in storage, for tree recovery after a restart.
Each log contains:
- Monotonically increasing index
- Array of entries:
  - Key (path in storage)
  - Transaction operation: put or delete
  - Encrypted value of that key
  - Value hash
  - Whether the data is seal wrapped
  - Existing value (for undo logs)
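The fields above might be modeled like this (a hypothetical sketch of the shape of a WAL, not Vault's actual wire format; the class and field names are illustrative):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Entry:
    key: str                                 # path in storage
    op: str                                  # "put" or "delete"
    encrypted_value: bytes
    value_hash: bytes
    seal_wrapped: bool
    previous_value: Optional[bytes] = None   # existing value, kept for undo logs

@dataclass
class WAL:
    index: int                               # monotonically increasing
    previous_root: bytes                     # merkle root hash before these changes apply
    entries: list = field(default_factory=list)  # up to 62 batched changes
```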
The logshipper is an in-memory buffer containing the last N WALs and is responsible for shipping WALs to secondary clusters. As the Vault primary cluster inserts data into its merkle tree, a corresponding WAL is appended to the logshipper buffer. Once the buffer is full, the oldest WAL in the buffer is removed as new WAL entries are added.
The logshipper has a configurable length that determines how many WALs it can hold at any given time; by default it can hold 16384 WAL entries. Additionally, the logshipper's total size is capped to avoid consuming too much memory on a system: the default is 10% of the total memory available on the server, and the two logshippers (one for disaster recovery replication, one for performance replication) split the configured size.
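The fixed-length, evict-the-oldest behavior described above can be sketched with a small ring buffer (a toy model under assumed names; Vault's real logshipper also enforces the byte-size cap, which is omitted here):

```python
from collections import deque, namedtuple

# Minimal stand-in for a WAL entry; field names are illustrative.
WAL = namedtuple("WAL", ["index", "previous_root", "entries"])

class LogShipper:
    """Toy fixed-length WAL buffer: appending beyond max_len evicts the oldest WAL."""
    def __init__(self, max_len: int = 16384):
        self.buf = deque(maxlen=max_len)

    def append(self, wal: WAL) -> None:
        self.buf.append(wal)

    def find(self, root_hash: bytes):
        # Locate the WAL whose pre-change root matches the secondary's root.
        # Returning None models "no matching WALs available": the secondary
        # has fallen too far behind and must enter merkle diff.
        for wal in self.buf:
            if wal.previous_root == root_hash:
                return wal
        return None
```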
When a secondary enters merkle diff, it asks the primary cluster to create a snapshot of the primary's merkle tree. The secondary then compares the two trees and builds a list of any pages that differ between them. If differences are found, Vault enters merkle sync mode; if none are found, Vault enters stream-wals mode.
Merkle sync is the process of requesting and applying the differences found in the secondary's merkle tree during the merkle diff process. The primary cluster pulls each requested key from storage and sends over the corresponding data. Once all conflicts have been resolved, the secondary's merkle root hash should correspond to a WAL entry in the primary's logshipper.
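The diff-then-sync flow can be sketched as follows (a toy model with pages and storage reduced to plain dicts; the function names are assumptions, not Vault APIs):

```python
def diff_pages(primary_pages: dict, secondary_pages: dict) -> list:
    # Merkle diff: list every page whose hash differs between the two snapshots.
    keys = set(primary_pages) | set(secondary_pages)
    return sorted(k for k in keys
                  if primary_pages.get(k) != secondary_pages.get(k))

def merkle_sync(conflicting_keys, primary_storage: dict, secondary_storage: dict) -> None:
    # Merkle sync: pull each conflicting key from primary storage and apply it
    # on the secondary; keys absent on the primary are deleted on the secondary.
    for k in conflicting_keys:
        if k in primary_storage:
            secondary_storage[k] = primary_storage[k]
        else:
            secondary_storage.pop(k, None)
```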
When a secondary cluster enters stream-wals, it sends its current merkle root hash to the primary cluster. The primary finds that merkle root hash in the logshipper and plays forward all changes in the logshipper from that point. If the secondary requests WALs for a merkle root hash that no longer exists in the logshipper, the secondary enters merkle diff mode. When this happens, the following log lines appear in the secondary's server log:
2023-04-04T15:26:12.929Z [INFO] core: non-matching guard, exiting
2023-04-04T15:26:12.929Z [INFO] replication: no matching WALs available
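The replay-or-fall-back decision described above can be sketched like this (a toy model under an assumed function name, not Vault's actual code):

```python
def stream_wals(secondary_root: bytes, logshipper_wals: list):
    """Find the WAL whose pre-change root matches the secondary's current
    root, then play forward everything from that point. Returning None
    models the "no matching WALs available" case, which sends the
    secondary into merkle diff."""
    for i, wal in enumerate(logshipper_wals):
        if wal["previous_root"] == secondary_root:
            return logshipper_wals[i:]   # changes to apply, in order
    return None
```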
Merkle Sync Loops
Sometimes a Vault secondary can enter what is known as a merkle sync loop, where the cluster is unable to stay in stream-wals. At the time of writing there are four known sources of merkle sync loops:
- The logshipper is misconfigured for the scale at which Vault is operating
- Clients are generating an unnecessary volume of changes
- Merkle tree corruption
- A race within merkle sync (addressed with undo logs in 1.13+)
Logshipper Length and Size
Recall from above that the logshipper is a fixed-length in-memory buffer. New WALs are appended to the buffer, shifting existing WALs toward the end. Once the oldest WAL reaches the end of the buffer, it is removed from the logshipper.
If the volume of WAL being generated on the primary is high, it is possible that the secondary cannot keep up and requests WALs for a merkle root that is no longer in the logshipper. When this happens, the secondary enters merkle diff as detailed above.
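A back-of-envelope calculation makes the failure mode concrete. The WAL rate used below is an assumed example figure, not a measured one: with the default buffer length of 16384 entries and a primary generating, say, 2000 WALs per second, the buffer only covers about 8 seconds of history, so any secondary stall longer than that forces a merkle diff.

```python
def buffer_coverage_seconds(buffer_len: int, wal_rate_per_sec: float) -> float:
    # Rough window a lagging secondary has before the merkle root it
    # would request ages out of the logshipper.
    return buffer_len / wal_rate_per_sec
```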
To tune the logshipper, see the Vault documentation on replication parameters and logshipper tuning (linked below). Please reach out to Support for guidance with logshipper tuning.
In some situations, clients can generate a lot of WAL to be replicated and may cause the secondary to lag behind until it eventually needs to resynchronize. The following examples may help clients optimize how they use Vault in order to decrease the amount of WAL that needs to be replicated.
Token Creation, Renewals and Revocations
Clients generating tokens very frequently (for example, several times a second) can create a large number of replicated changes, because each token has a lease associated with it. As leases expire or are renewed, they generate even more changes to the merkle tree, which can make it difficult for secondaries to apply all the changes on busy systems. Additionally, once a token lease expires it is revoked, which causes still more changes that need to be replicated. When possible, clients should reuse tokens and renew the lease associated with the token rather than creating new tokens. A good rule of thumb is to renew a token lease at roughly 80% of the lease's time-to-live (TTL).
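The 80% rule of thumb is simple to compute; the helper name below is illustrative, and leaving the final 20% as headroom for retries is the usual rationale:

```python
def next_renewal_delay(lease_ttl_seconds: float, fraction: float = 0.8) -> float:
    # Schedule the renewal at roughly 80% of the lease TTL, leaving
    # headroom to retry before the lease expires and is revoked.
    return lease_ttl_seconds * fraction
```

For a one-hour lease this schedules the renewal 48 minutes in.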
Merkle Tree Corruption
Recall from above that a merkle tree is an in-memory hash tree representation of the encrypted data in storage. Since storage and the merkle tree are separate, the merkle tree can become out of sync with storage. This can happen either from a bug in Vault or from modifying storage directly.
Detecting merkle tree corruption on the primary cluster can be difficult. When inspecting secondary debug logs, you may see merkle diffs that result in no conflicting keys, or merkle syncs that repeatedly sync the same keys.
The process of fixing a corrupted merkle tree is known as reindexing. Vault creates a new in-memory merkle tree by listing and reading all keys from storage. Once the new tree is created, Vault compares the replication merkle tree against it and replaces any tree pages that differ.
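The rebuild-and-replace idea can be sketched as follows (a toy model where the "tree" is flattened to per-key hashes; the function names and representation are assumptions, not Vault's reindex implementation):

```python
import hashlib

def page_hashes(storage: dict) -> dict:
    # Rebuild a fresh tree by listing and hashing every key/value in storage.
    return {k: hashlib.sha256(k.encode() + v).hexdigest()
            for k, v in storage.items()}

def reindex(replication_tree: dict, storage: dict) -> int:
    # Compare the replication tree against the freshly built one and
    # replace any entries that differ; returns how many were repaired.
    fresh = page_hashes(storage)
    repaired = 0
    for k, h in fresh.items():
        if replication_tree.get(k) != h:
            replication_tree[k] = h
            repaired += 1
    for k in list(replication_tree):
        if k not in fresh:            # stale entry with no backing storage
            del replication_tree[k]
            repaired += 1
    return repaired
```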
An overview of reindexing can be found at Monitor replication - Reindex. Please reach out to Support for any guidance regarding reindexing.
Merkle Sync Race Condition
Recall from above that merkle diff and merkle sync are done over multiple gRPC calls. When the primary creates a snapshot of its merkle tree, there is a chance that the data changes between the time of the snapshot and the syncing of the data on the secondary. The pathological case: if the merkle diff and merkle sync process takes a long time while one or more keys are updated at a quicker cadence, this issue can be hit on every merkle sync and the cluster will never enter the stream-wals state. To fix this issue, Vault introduced undo logs in 1.13.
Undo Logs - Vault 1.13
When the primary creates a snapshot of its merkle tree for the secondary, there is a chance the data changes between the time of the snapshot and the syncing of the data. To fix this, the primary now notices when the secondary requests keys that have changed since the snapshot was taken and instead pulls the snapshot-time values from the logshipper. This avoids a situation where the computed merkle root hash never matches a WAL entry in the primary's logshipper, which would result in another merkle diff and/or sync. Please consult with Support regarding undo logs.
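The undo-log lookup can be sketched like this (a toy model built on the earlier observation that each WAL entry records the key's existing value; the function name and dict layout are assumptions, and the logs are assumed sorted by ascending index):

```python
def value_for_sync(key: str, snapshot_index: int, storage: dict, undo_logs: list):
    """If the key changed after the snapshot was taken, return the
    pre-change value recorded in the first post-snapshot WAL entry, so
    the secondary converges to a root that exists in the logshipper.
    Otherwise fall back to the current value in storage."""
    for wal in undo_logs:                       # ascending index order assumed
        if wal["index"] > snapshot_index:
            for entry in wal["entries"]:
                if entry["key"] == key:
                    return entry["previous_value"]
    return storage.get(key)
```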
References
- Replication parameters in Vault Configuration
- Troubleshoot and tune enterprise replication - logshipper
- Vault Replication stuck in merkle-sync
- Monitor replication - Reindex