HashiCorp Vault is a powerful tool for managing secrets and protecting sensitive data, and its replication feature provides high availability and disaster recovery. However, syncing problems can arise when setting up and maintaining replication between Vault clusters. One common issue is the data in the primary and secondary clusters becoming out of sync, resulting in inconsistencies and potential data loss.
Identification of the merkle-sync/merkle-diff loop issue:
- The general characteristic of the merkle loop is that the state on the secondary Vault cluster never fully reconciles to the "stream-wals" state. "stream-wals" indicates that the secondary is in sync with the primary and is able to replicate data from it.
- The replication status will show a state transition to "stream-wals" followed immediately by a transition back to "merkle-sync" or "merkle-diff" (see the status check sketch after this list).
- The logs on the primary and secondary Vault clusters will show "conflicting_keys"; analyzing these keys will help in understanding how to proceed further.
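To confirm the loop, you can poll the replication status on the secondary and watch the reported state flap instead of settling. This is only a sketch; the exact fields returned depend on your Vault version and replication mode:
vault read -format=json sys/replication/status
A healthy secondary reports and keeps the "stream-wals" state; in the merkle loop the state repeatedly falls back to "merkle-diff" or "merkle-sync".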
In such cases, re-indexing the primary cluster may be a solution, but it can take a significant amount of time and may not resolve the issue. If time constraints rule out re-indexing, or if it did not work, disabling replication on the primary and performing a full data sync is an alternative. There are other methods to solve this problem, but in this article we will proceed with disabling and re-enabling replication.
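If you want to try re-indexing first, it is triggered through the primary cluster's replication API. A minimal sketch, assuming you are authenticated with a sufficiently privileged token:
vault write -f sys/replication/reindex
Re-indexing rebuilds the replication Merkle tree from the underlying storage and can take a long time on large datasets, so plan for it to run during a low-traffic window.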
In the example below, we'll assume we are re-enabling DR replication on a Vault cluster that uses Consul storage.
On the DR Primary:
- Disable replication on the DR primary
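Disabling DR primary replication is done through Vault's replication API. A minimal sketch, run against the active node of the DR primary with a sufficiently privileged token:
vault write -f sys/replication/dr/primary/disable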
On the DR Secondary:
- Stop the Vault service on all Vault cluster nodes (service stop/start commands are sketched after this list)
- Stop the Consul agent on all Vault nodes
- Take a final snapshot from the leader server to back up your data:
consul snapshot save backup.snap
Use consul snapshot save -stale backup.snap to create a potentially stale snapshot from any available server. This is useful for situations where a cluster is in a degraded state and no leader is available.
- Delete the Vault data from the Consul nodes with the command
consul kv delete -recurse -token=$CONSUL_TOKEN vault/
or clear the data folder on the Consul nodes. The vault/ path depends on the storage path defined in the storage stanza of the Vault config file.
- Restart the Consul service on the Consul nodes (optional)
- Start the Consul agents on the Vault nodes
- Start the Vault service on the Vault nodes
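For the service stop/start steps above, the exact commands depend on how Vault and Consul are installed. A minimal sketch, assuming both run as systemd units named vault and consul (adjust the unit names to your environment):
sudo systemctl stop vault
sudo systemctl stop consul
sudo systemctl start consul
sudo systemctl start vault
The start commands are run only after the Vault data has been deleted from Consul.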
On the DR Primary:
- Re-enable DR primary replication and generate a new secondary token
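A minimal sketch of re-enabling the DR primary and generating a secondary token, run against the active node of the primary; the id value is just an illustrative label:
vault write -f sys/replication/dr/primary/enable
vault write sys/replication/dr/primary/secondary-token id=dr-secondary
The second command returns a wrapping token, which is needed when enabling replication on the secondary.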
On the DR Secondary:
- Enable DR secondary replication using the token generated on the primary
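A minimal sketch of activating the secondary with the wrapping token obtained from the primary (shown as a placeholder value below). Note that after its storage has been cleared, the secondary may first need to be initialized and unsealed before this call can be made:
vault write sys/replication/dr/secondary/enable token="<wrapping_token_from_primary>"
Once this is accepted, the secondary clears its own data and begins a full sync from the primary.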
By following these steps, the data in the primary and secondary clusters will be completely synced, and replication can be re-enabled. It's important to note that this process will result in a temporary loss of availability on the secondary cluster while the Vault and Consul services are stopped. The primary Vault cluster is expected to stay up and running while performing this process. It's also important to test the replication thoroughly afterwards to ensure that it is working as expected.
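To verify, check the DR replication status on both clusters once the secondary has finished its initial sync; a minimal sketch:
vault read sys/replication/dr/status
On a healthy setup the secondary settles at "stream-wals" and stays there instead of falling back into "merkle-diff" or "merkle-sync", and its last_remote_wal keeps tracking the primary's last_wal.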
Additionally, it's recommended to have a backup of the data before proceeding, in case anything goes wrong during the deletion step.
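If you do need to roll back, the snapshot taken earlier can be restored to the Consul cluster; a minimal sketch, assuming the backup.snap file created above:
consul snapshot restore backup.snap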
It's also important to mention that the process above only works if you are using Consul as the storage backend. If you are using any other storage backend, the steps will differ.
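For example, with Vault's integrated (Raft) storage there is no Consul data to clear, and the backup step would instead use Vault's own snapshot command; a hedged sketch:
vault operator raft snapshot save backup.snap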