DR replication configuration fails due to "error fetching secondary bootstrap package" – HashiCorp Help Center

Introduction

Once a Disaster Recovery (DR) Secondary token is generated on the DR Primary cluster, the replication setup and configuration process consists of two parts. The first part authenticates the DR Secondary cluster to the DR Primary cluster using the generated token. This occurs via an API call to the DR Primary cluster (default TCP port 8200). The second part establishes cluster-to-cluster communication on the cluster_addr port (default TCP port 8201) of the active node in each cluster. This second, bootstrapping part can fail for several reasons. This article walks through four example errors and how to resolve them.

Problem

Enabling replication on the DR Secondary cluster results in an error message after providing the secondary activation token, causing the replication configuration to fail. The error is typically encountered while trying to write the secondary replication token to the sys/replication/dr/secondary/enable endpoint or clicking "Enable Replication" in the UI.
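For illustration, the write described above can be sketched in Python as follows. This is a minimal sketch that only builds the request to the sys/replication/dr/secondary/enable endpoint on the DR Secondary cluster and does not send it. The function name build_enable_request, the host name, and the token values are placeholders invented for this example.

```python
import json
import urllib.request


def build_enable_request(vault_addr, vault_token, activation_token):
    """Build (but do not send) the POST that enables DR replication on a secondary.

    vault_addr       -- API address of the DR Secondary cluster (placeholder)
    vault_token      -- Vault token valid on the DR Secondary cluster (placeholder)
    activation_token -- secondary activation token generated on the DR Primary
    """
    url = vault_addr.rstrip("/") + "/v1/sys/replication/dr/secondary/enable"
    body = json.dumps({"token": activation_token}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "X-Vault-Token": vault_token,
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Hypothetical values; replace them with the real address and tokens.
    req = build_enable_request(
        "https://dr-secondary.example.com:8200",
        "vault-token-placeholder",
        "activation-token-placeholder",
    )
    print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would send it. When the bootstrap fails, the call is expected to return an HTTP error whose body carries one of the messages shown in the Cause section.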

Prerequisites

Vault Enterprise Replication configured on the DR Primary cluster.

Cause

The errors listed below are examples of what can be encountered when the replication configuration fails. They are usually caused by network communication problems between the primary and secondary clusters. Note that they all look similar and share the same * error fetching secondary bootstrap package message; only the last portion differs.

A DR Secondary cluster attempted a connection to the DR Primary cluster_addr, but:

Example 1:

Did not receive a response in a timely fashion, causing the connection to timeout:

1 error occurred: * error fetching secondary bootstrap package: 
error authing for bootstrap: rpc error: code = Unavailable desc = connection error: 
desc = "transport: Error while dialing dial tcp x.x.x.x:8201: i/o timeout"

Example 2:

The connection was refused:

1 error occurred: * error fetching secondary bootstrap package: 
error authing for bootstrap: rpc error: code = Unavailable desc = connection error: 
desc = "transport: Error while dialing dial tcp x.x.x.x:8201: connect: connection refused"

Example 3:

The connection failed due to a TLS handshake failure:

1 error occurred: * error fetching secondary bootstrap package: 
error authing for bootstrap: rpc error: code = Unavailable desc = connection error: 
desc = "transport: Error while dialing remote error: tls: handshake failure"

Example 4:

The connection failed due to a TLS unrecognized name error:

1 error occurred: * error fetching secondary bootstrap package: 
error authing for bootstrap: rpc error: code = Unavailable desc = connection error: 
desc = "transport: Error while dialing remote error: tls: unrecognized name"
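Because the four messages differ only in their final portion, they can be told apart by matching on that tail. The following Python sketch shows one way to do so. The names COMMON_PREFIX, KNOWN_TAILS, LIKELY_CAUSE, and classify_bootstrap_error are hypothetical helpers for this article and not part of Vault. The cause hints simply restate the scenarios described in this article and may not cover every environment.

```python
# Text shared by all four example errors.
COMMON_PREFIX = "error fetching secondary bootstrap package"

# (example number, distinguishing tail) pairs taken from the examples above.
KNOWN_TAILS = [
    (1, "i/o timeout"),
    (2, "connection refused"),
    (3, "tls: handshake failure"),
    (4, "tls: unrecognized name"),
]

# Likely causes as described in this article; hints only, not a diagnosis.
LIKELY_CAUSE = {
    1: "no timely response from the primary cluster port; check return traffic",
    2: "connection to the primary cluster port was refused; check blocked traffic",
    3: "load balancer may be forwarding to a non-Vault server",
    4: "load balancer may be forwarding to a different Vault cluster",
}


def classify_bootstrap_error(message):
    """Return the matching example number (1-4), or None if nothing matches."""
    if COMMON_PREFIX not in message:
        return None
    for example, tail in KNOWN_TAILS:
        if tail in message:
            return example
    return None


if __name__ == "__main__":
    sample = (
        "1 error occurred: * error fetching secondary bootstrap package: "
        "error authing for bootstrap: rpc error: code = Unavailable "
        "desc = connection error: desc = \"transport: Error while dialing "
        "dial tcp x.x.x.x:8201: i/o timeout\""
    )
    number = classify_bootstrap_error(sample)
    print(number, LIKELY_CAUSE.get(number))
```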

Overview of Possible Solutions

Ensure that network communication is possible between the DR Primary and DR Secondary clusters for both the API and cluster ports in both directions (default TCP port 8200 and TCP port 8201).
- In example 1, traffic was only allowed from the DR Secondary cluster to the DR Primary cluster. Returning traffic was blocked, resulting in an i/o timeout.
- In example 2, traffic from the DR Secondary cluster to the DR Primary cluster was blocked, resulting in a connection refused error.

Ensure that the proper primary_cluster_addr is specified on the DR Primary cluster when enabling replication, and that the address is reachable from the DR Secondary cluster when the clusters are located on separate networks.

Ensure that the load balancer in front of the DR Primary cluster forwards traffic to the correct Vault cluster.
- In example 3, the primary cluster sits behind a load balancer. The primary_cluster_addr on the DR Primary was set to the load balancer address, but the load balancer was configured to forward traffic to a non-Vault server, resulting in a TLS handshake failure.
- In example 4, the primary cluster sits behind a load balancer. The primary_cluster_addr on the DR Primary was set to the load balancer address, and the load balancer was configured to forward traffic to a Vault server belonging to a different Vault cluster. This causes a mismatch between the CN/SAN names in the TLS certificate and the requested server name, resulting in the tls: unrecognized name error.
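The reachability checks for examples 1 and 2 can be approximated with a plain TCP connection attempt from a DR Secondary node. The Python sketch below is an illustration only; the function name probe_tcp is hypothetical. It reports whether a connect attempt succeeds, is refused, or times out. It does not perform a TLS handshake, so it cannot detect the load balancer misrouting described in examples 3 and 4.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt one TCP connection and describe the outcome.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The attempt was actively rejected, or nothing is listening on the port.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No response arrived within the timeout.
        return "timeout"
    except OSError as exc:
        # Anything else, such as a name resolution failure or unreachable network.
        return "error: " + str(exc)
```

For example, calling probe_tcp with the address configured as primary_cluster_addr and port 8201, and again with the primary API address and port 8200, shows which of the two default ports is reachable from the node running the script. An "open" result through a load balancer only confirms that the load balancer accepted the connection, not that a Vault node in the correct cluster is behind it.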

Additional Information

HashiCorp Learn: Disaster Recovery Replication Setup
Vault API doc: primary_cluster_addr
Article: Troubleshooting Replication: Problems During Initial Bootstrap