At some point during Vault operation with DR replication, you may encounter an issue where the DR primary and DR secondary clusters are no longer in sync with one another.
Replication is monitored via the `sys/replication/status` endpoint, which returns various parameters to check. When troubleshooting synchronization between the DR primary and secondary clusters, the most important of these is the `state` parameter of the DR secondary.
This parameter indicates the state of synchronization between the two clusters. The value `stream-wals` indicates normal streaming; this is the value you want to observe. The value `idle` indicates a generic issue and is not discussed in this article. Finally, an issue that can occur is that the `state` parameter oscillates between the values `merkle-diff` and `merkle-sync` while never reaching `stream-wals`.
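For example, the `state` value can be checked from the DR secondary with the Vault CLI. The `jq` filter and the JSON path below are assumptions about the response layout and may need adjusting for your Vault version:

```shell
# On the DR secondary: read the replication status and extract the DR state.
# The .data.dr.state path assumes the JSON layout returned by recent Vault versions.
vault read -format=json sys/replication/status | jq -r '.data.dr.state'

# Healthy output:
#   stream-wals
```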
Often, this particular issue will be accompanied by an error in the DR secondary's operational logs that resembles the following:
2022-12-25T08:11:11.065-0500 [ERROR] replication: encountered error, applying backoff: backoff=2s error="state is still irreconcilable after reindex, try reindexing primary cluster"
There are two common issues here. Each will be discussed below along with recommended remediation efforts.
- There may exist some number of bad pages in the primary cluster's data set. As the error message above suggests, a reindex of the primary cluster can be executed to repair these bad pages and ultimately restore replication. The reindex can take a long time depending on the number and size of objects in the data store; setting the `skip_flush` parameter to `true` can improve its performance. You can monitor the progress of the reindex by watching the primary cluster's Vault operational logs (see the example reindex command after this list).
- A trickier issue to diagnose is a deficiency in how the log shipper is tuned; specifically, it may be necessary to increase the log shipper buffer length. A symptom of this issue is the DR secondary spending a very long time attempting to reindex, followed by the replication error shown above in the Vault operational logs. To determine whether tuning the log shipper buffer is necessary, first set the `log_level` of the DR secondary cluster to `debug` and watch the logs for a message similar to the following: `[DEBUG] replication: starting merkle sync: num_conflict_keys=129387`. By default, the log shipper buffer length is 16364 entries. A good rule of thumb is to set `logshipper_buffer_length` to a number greater than the number of conflicting keys, so given the example log message above we could set `logshipper_buffer_length = 130000`. Note that this value must be configured on the upstream cluster shipping the WALs (that is, the primary cluster), and a restart of the Vault servers in that cluster is required for the change to take effect (a configuration sketch follows this list).
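As a sketch of the reindex described in the first item, the operation is triggered against the primary cluster's `sys/replication/reindex` endpoint; the endpoint path and the exact `skip_flush` behavior should be confirmed against the API documentation for your Vault version:

```shell
# Run against the *primary* cluster with a sufficiently privileged token.
# skip_flush=true is the performance option discussed above; the reindex
# itself may still take a long time on large data stores.
vault write sys/replication/reindex skip_flush=true

# Progress can be followed in the primary cluster's operational logs,
# for example on systemd-based hosts:
#   journalctl -u vault -f
```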
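For the log shipper tuning in the second item, a minimal configuration sketch is shown below. The placement of `logshipper_buffer_length` inside a `replication` stanza is an assumption that should be verified against the configuration documentation for your Vault version; the value is sized from the `num_conflict_keys` figure seen in the DEBUG log. (On the DR secondary, debug logging can be enabled with the `log_level = "debug"` configuration setting or the equivalent startup flag.)

```hcl
# Excerpt from the Vault server configuration on each node of the
# *primary* (upstream) cluster; a restart is required to apply it.
replication {
  # Larger than the ~129387 conflicting keys observed on the secondary.
  logshipper_buffer_length = 130000
}
```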
Note that both of these resolutions will require downtime of the Vault service. You can monitor the `state` parameter and watch for it to return to `stream-wals`, which indicates replication is working as expected.