Introduction
This article documents workarounds for a known defect in Vault Enterprise Replication where a secondary cluster, Disaster Recovery (DR) and/or Performance Replication (PR), is unable to re-establish a connection with its configured primary cluster.
While the issue can occur on any platform, it's commonly observed when Vault is deployed on AWS using Auto Scaling Groups (ASGs).
Update:
A fix for this issue is included in Vault Enterprise versions 1.13.3, 1.12.7, and 1.11.11.
Problem
A secondary DR / PR cluster keeps a list of the node IPs of its configured primary cluster (known_primary_cluster_addrs). The replication mechanism on the secondary cluster uses this list to connect to the primary cluster.
When replication has been set up for the first time and the leader node in the primary cluster is removed (typical when scaling or refreshing nodes in AWS ASGs - terminate, recreate), the secondary cluster loses its connection to the primary cluster and retries against one of the other nodes in its known_primary_cluster_addrs list. Once the connection has been restored, it updates the known_primary_cluster_addrs list.
The problem arises when a second or subsequent loss of connection to the newly active leader node on the primary cluster occurs. The expectation is that the secondary cluster attempts to re-establish a connection to the newly added primary nodes in its known_primary_cluster_addrs list. Instead, the secondary cluster attempts to connect to the original primary leader node, which no longer exists. This connection eventually fails with a timeout and replication is never restored.
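The list the secondary is currently holding can be inspected from the replication status endpoint on the secondary's active node. A minimal sketch (the exact output fields can vary by version):

$ vault read -format=json sys/replication/status
# Or scope the query to the replication type in use:
$ vault read -format=json sys/replication/dr/status
$ vault read -format=json sys/replication/performance/status

On a secondary, the output is expected to contain a known_primary_cluster_addrs field listing the primary node addresses the secondary will attempt to reconnect to.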
Prerequisites (if applicable)
- Vault Enterprise, all versions prior to 1.13.3, 1.12.7, and 1.11.11
- Vault Enterprise Replication
Cause
The behaviour is caused by a known defect in the upstream go-gRPC library (see the GitHub issue linked under Additional Information), which results in the loss of retry attempts when the connection to the initial leader node fails with a connection timeout.
This can be observed by inspecting trace-level operational logs on the leader node in the secondary cluster:
[DEBUG] core.cluster-listener: creating rpc dialer: address=172.31.36.75:8201 alpn=replication_dr_v1 host=47dcd914-0a07-6a97-b157-00476f1c9f1e
[TRACE] core: replication: error sending echo request to primary: error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" addrs=["https://172.31.0.183:8201", "https://172.31.38.52:8201", "https://172.31.17.58:8201"]
In the log example, the initial leader node on the primary (which no longer exists) had the IP address 172.31.36.75. The known_primary_cluster_addrs list contains the addresses 172.31.0.183, 172.31.38.52, and 172.31.17.58, which correspond to the current three nodes in the primary cluster.
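To capture these messages, the secondary's active node needs to emit trace-level logs. A minimal sketch, assuming CLI access to that node (vault monitor streams logs at the requested level without changing the server's configured log level):

$ vault monitor -log-level=trace

Alternatively, set log_level = "trace" in the server's configuration file and restart the node, then inspect the operational logs as described in the Vault Server Logs tutorial linked below.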
Overview of possible solutions
To restore replication between a secondary DR / PR cluster and its primary, the following solution may be used:
- Re-establish the connection by updating the assigned primaries (see the example commands after this list):
- Disaster Recovery - Update the assigned primary
- Performance Replication - Update Performance Secondary's Primary
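A hedged sketch of this workflow for a DR secondary (the id values, token placeholders, and primary_api_addr below are illustrative; the exact parameters are covered in the linked tutorial and API documentation):

# On the current PRIMARY active node: generate a new secondary activation token
$ vault write sys/replication/dr/primary/secondary-token id="dr-secondary-1"

# On the DR SECONDARY active node: point the secondary at the current primary.
# DR secondary operations require a DR operation token.
$ vault write sys/replication/dr/secondary/update-primary \
    dr_operation_token="<dr-operation-token>" \
    token="<wrapped-activation-token-from-previous-step>" \
    primary_api_addr="https://primary.vault.example.com:8200"

The Performance Replication equivalent uses the corresponding performance paths (sys/replication/performance/primary/secondary-token and sys/replication/performance/secondary/update-primary).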
Alternatively, it's recommended to place a Load Balancer (LB) in front of each cluster. This simplifies the workflow, as Vault Enterprise Replication can be configured to point at a single target (the LB). If this solution is pursued instead, the following considerations apply (see the example configuration after this list):
- When enabling replication on the primary cluster (both DR & PR), the primary_cluster_addr value needs to match the address of the LB that's in front of that primary cluster.
- The LB should be configured with health checks and route all incoming replication connections (TCP port 8201 for RPC) to the active node in the primary cluster, as replication only occurs between the active leaders of the two clusters.
- The LB should not offload TLS for RPC (8201). Vault uses mTLS for cluster-to-cluster replication, which breaks when TLS is offloaded.
- It's advised to set the resolver_discover_servers replication parameter to false when using an LB in combination with replication. If this is not set, issues with DR promotion or demotion may be experienced.
- If each of the clusters is already fronted by an LB, consider adding an additional LB with the sole purpose of routing replication traffic to the active node. This makes it easy to remove the added LBs once a fix for the defect is released.
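For illustration only (the LB hostnames are placeholders, and parameter availability should be confirmed against the Replication Parameters documentation for the version in use), enabling DR replication through an LB might look like this:

# On the PRIMARY active node: advertise the replication LB as the primary cluster address
$ vault write sys/replication/dr/primary/enable \
    primary_cluster_addr="https://primary-repl-lb.example.com:8201"
$ vault write sys/replication/dr/primary/secondary-token id="dr-secondary-1"

# On the SECONDARY active node: activate replication against the LB and keep using it
# rather than discovering individual primary node addresses
$ vault write sys/replication/dr/secondary/enable \
    token="<wrapped-activation-token>" \
    resolver_discover_servers=false

For the LB health checks, the /v1/sys/health endpoint can be used to identify the active node, since by default it returns HTTP 200 only on the active node and non-200 status codes on standby, performance standby, and DR secondary nodes.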
Outcome
Vault Enterprise Replication successfully connects and replicates data between the primary and secondary clusters.
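A quick way to confirm this on the secondary's active node (field names as reported by the replication status endpoint; exact output varies by version):

$ vault read -format=json sys/replication/status
# On a healthy secondary, expect a state of "stream-wals", a connection_state of "ready",
# and a known_primary_cluster_addrs list that reflects the current primary nodes.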
Additional Information
- Vault Tutorial: Vault Server Logs
- Vault Tutorial: Update the assigned primary
- Vault API: Update Performance Secondary's Primary
- Vault API: /sys/health
- Vault API: Enable Performance Primary Replication
- Vault API: Enable DR Primary Replication
- Vault Documentation: Replication Parameters
- Go-gRPC GitHub Issue: pickfirst: improve behavior when the first address is a blackhole #5701