At some point during Vault operation with DR replication, you may encounter an issue where the DR primary and DR secondary clusters are no longer in sync with one another.
Replication is monitored via the `sys/replication/status` endpoint, which returns various parameters to check. When troubleshooting synchronization between the DR primary and secondary clusters, the most important of these is the `state` parameter of the DR secondary.
This parameter indicates the state of synchronization between the two clusters. The value `stream-wals` indicates normal streaming; this is the value you want to observe. The value `idle` indicates a generic issue and is not discussed in this article. Finally, an issue that can occur is that the `state` parameter oscillates between the values `merkle-diff` and `merkle-sync` while never reaching `stream-wals`.
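For example, the `state` value can be checked from the DR secondary with the Vault CLI. The `jq` filter and the JSON path below are assumptions about the response layout and may need adjusting for your Vault version:

```shell
# On the DR secondary: read the replication status and extract the DR state.
# The .data.dr.state path assumes the JSON layout returned by recent Vault versions.
vault read -format=json sys/replication/status | jq -r '.data.dr.state'

# Healthy output:
#   stream-wals
```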
Often, this particular issue will be accompanied by an error in the DR secondary's operational logs that resembles the following:
2022-12-25T08:11:11.065-0500 [ERROR] replication: encountered error, applying backoff: backoff=2s error="state is still irreconcilable after reindex, try reindexing primary cluster"
There are two common issues here. Each will be discussed below along with recommended remediation efforts.
- There may exist some number of bad pages in the primary cluster's data set. As the error message above suggests, a reindex of the primary cluster can be executed to repair these bad pages and ultimately restore replication. The reindex can take a long time depending on the number and size of objects in the data store; setting the `skip_flush` parameter to `true` can improve its performance. You can monitor the progress of the reindex by watching the primary cluster's Vault operational logs (see the example reindex command after this list).
- A trickier issue to diagnose is a deficiency in how the log shipper is tuned; specifically, it may be necessary to increase the log shipper buffer length. A symptom of this issue is the DR secondary spending a very long time attempting to reindex, followed by the replication error shown above in the Vault operational logs. To determine whether tuning the log shipper buffer is necessary, first set the `log_level` of the DR secondary cluster to `debug` and watch the logs for a message similar to the following: `[DEBUG] replication: starting merkle sync: num_conflict_keys=129387`. By default, the log shipper buffer length is 16364 entries. A good rule of thumb is to set `logshipper_buffer_length` to a number greater than the number of conflicting keys, so given the example log message above we could set `logshipper_buffer_length = 130000`. Note that this value must be configured on the upstream cluster shipping the WALs (that is, the primary cluster), and a restart of the Vault servers in that cluster is required for the change to take effect (a configuration sketch follows this list).
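As a sketch of the reindex described in the first item, the operation is triggered against the primary cluster's `sys/replication/reindex` endpoint; the endpoint path and the exact `skip_flush` behavior should be confirmed against the API documentation for your Vault version:

```shell
# Run against the *primary* cluster with a sufficiently privileged token.
# skip_flush=true is the performance option discussed above; the reindex
# itself may still take a long time on large data stores.
vault write sys/replication/reindex skip_flush=true

# Progress can be followed in the primary cluster's operational logs,
# for example on systemd-based hosts:
#   journalctl -u vault -f
```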
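For the log shipper tuning in the second item, a minimal configuration sketch is shown below. The placement of `logshipper_buffer_length` inside a `replication` stanza is an assumption that should be verified against the configuration documentation for your Vault version; the value is sized from the `num_conflict_keys` figure seen in the DEBUG log. (On the DR secondary, debug logging can be enabled with the `log_level = "debug"` configuration setting or the equivalent startup flag.)

```hcl
# Excerpt from the Vault server configuration on each node of the
# *primary* (upstream) cluster; a restart is required to apply it.
replication {
  # Larger than the ~129387 conflicting keys observed on the secondary.
  logshipper_buffer_length = 130000
}
```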
Note that both of these resolutions will require downtime of the Vault service. You can monitor the `state` parameter and watch for it to return to `stream-wals`, which indicates replication is working as expected.