Introduction
A leader election in the primary cluster can break replication to secondary clusters.
Scenario
During a leadership change on a primary cluster, two RPC clients on a secondary cluster (WAL streaming and another activity, such as heartbeating) can race to authenticate with a new token. The non-WAL-streaming RPC authenticates first, but a bug in the WAL stream error handling can then cause the new auth token to be wiped, leaving the primary cluster believing that the connection still has a token. The end result is a secondary cluster that cannot maintain any replication activities until either:
1) Replication is restarted on either the primary or secondary cluster
2) A leadership change happens on either cluster
Recommendation
There are 3 workarounds for this issue:
1) Perform a leadership election (step-down) in either cluster.
2) If an election does not fix replication between clusters, send a POST request to the sys/replication/recover endpoint.
3) The most involved workaround: follow the update-primary procedure to re-establish replication between clusters.
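The first two workarounds can be sketched as plain HTTP calls. The following is a minimal sketch, not a definitive procedure: it assumes the API is reachable under /v1/ at the cluster address, that the token is passed in the X-Vault-Token header, and that sys/step-down and sys/replication/recover accept POST as described in the linked API documentation. The helper names (vault_request, step_down_request, recover_request, status_request, send) are hypothetical and exist only for this illustration.

```python
import json
import urllib.request


def vault_request(addr, token, path, method="POST"):
    """Build (but do not send) a request against the Vault HTTP API."""
    url = f"{addr.rstrip('/')}/v1/{path.lstrip('/')}"
    # Write operations are sent with an empty body; GET carries no body.
    data = None if method == "GET" else b""
    return urllib.request.Request(
        url, data=data, method=method, headers={"X-Vault-Token": token}
    )


def step_down_request(addr, token):
    """Workaround 1: ask the active node to step down, forcing an election."""
    return vault_request(addr, token, "sys/step-down")


def recover_request(addr, token):
    """Workaround 2: attempt replication recovery."""
    return vault_request(addr, token, "sys/replication/recover")


def status_request(addr, token):
    """Read replication status to check whether a workaround helped."""
    return vault_request(addr, token, "sys/replication/status", method="GET")


def send(req, timeout=10):
    """Send a prepared request; return parsed JSON, or None for an empty body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = resp.read()
        return json.loads(body) if body else None
```

A typical sequence would be to call send(step_down_request(addr, token)) against the active node of either cluster, inspect send(status_request(addr, token)), and only fall back to send(recover_request(addr, token)) if replication has not resumed. The third workaround (update-primary) requires generating and exchanging activation tokens between clusters and is best followed from the linked documentation rather than scripted from this sketch.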
Additional Information
- https://developer.hashicorp.com/vault/docs/commands/operator/step-down
- https://developer.hashicorp.com/vault/api-docs/system/replication#attempt-recovery
- https://developer.hashicorp.com/vault/api-docs/system/replication/replication-dr#update-dr-secondary-s-primary
- https://developer.hashicorp.com/vault/api-docs/system/replication/replication-performance#update-performance-secondary-s-primary