Introduction
Once a Disaster Recovery (DR) Secondary token is generated on the DR Primary cluster, replication setup proceeds in two parts. First, the DR Secondary cluster authenticates to the DR Primary cluster using the generated token. This occurs via an API call to the DR Primary cluster (default TCP port 8200). Second, cluster-to-cluster communication is established on the cluster_addr port (default TCP port 8201) of the active node in each cluster. This second, bootstrapping part can fail for several reasons. This article covers a few example errors and how to resolve them.
Problem
Enabling replication on the DR Secondary cluster results in an error message after providing the secondary activation token, causing the replication configuration to fail. The error is typically encountered while trying to write the secondary replication token to the sys/replication/dr/secondary/enable endpoint or clicking "Enable Replication" in the UI.
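For reference, the API write that triggers the error can be sketched as follows. This is a minimal sketch, not a definitive client: the function name, the host name, and the token values are hypothetical placeholders, and the request is only built, not sent. It assumes the standard /v1 API prefix, the X-Vault-Token header, and a JSON body carrying the activation token in the token field.

```python
import json
import urllib.request


def build_enable_request(secondary_api_addr, vault_token, activation_token):
    """Build (but do not send) the POST that enables DR secondary replication.

    secondary_api_addr: API address of the DR Secondary cluster (hypothetical value below).
    vault_token: a Vault token allowed to write to the endpoint.
    activation_token: the secondary activation token generated on the DR Primary.
    """
    url = secondary_api_addr.rstrip("/") + "/v1/sys/replication/dr/secondary/enable"
    body = json.dumps({"token": activation_token}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"X-Vault-Token": vault_token, "Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Placeholder values for illustration only; nothing is sent over the network.
    req = build_enable_request(
        "https://dr-secondary.example.com:8200",
        "<vault-token>",
        "<secondary-activation-token>",
    )
    print(req.get_method(), req.full_url)
```

Sending the request (for example with urllib.request.urlopen) is left out on purpose. When the bootstrap fails, the response body is expected to carry an error of the kind shown in the Cause section.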
Prerequisites (if applicable)
- Vault Enterprise Replication configured on the DR Primary cluster.
Cause
The errors listed below are examples of what can be encountered when the replication configuration fails. They are usually caused by network communication problems between the primary and secondary clusters. Note that they all look similar and share the same "* error fetching secondary bootstrap package" message; only the last portion differs.
A DR Secondary cluster attempted a connection to the DR Primary cluster_addr, but:
Example 1:
Did not receive a response in a timely fashion, causing the connection to time out:
1 error occurred: * error fetching secondary bootstrap package:
error authing for bootstrap: rpc error: code = Unavailable desc = connection error:
desc = "transport: Error while dialing dial tcp x.x.x.x:8201: i/o timeout"
Example 2:
The connection was refused:
1 error occurred: * error fetching secondary bootstrap package:
error authing for bootstrap: rpc error: code = Unavailable desc = connection error:
desc = "transport: Error while dialing dial tcp x.x.x.x:8201: connect: connection refused"
Example 3:
The connection failed due to a TLS handshake failure:
1 error occurred: * error fetching secondary bootstrap package:
error authing for bootstrap: rpc error: code = Unavailable desc = connection error:
desc = "transport: Error while dialing remote error: tls: handshake failure"
Example 4:
The connection failed due to a TLS unrecognized name error:
1 error occurred: * error fetching secondary bootstrap package:
error authing for bootstrap: rpc error: code = Unavailable desc = connection error:
desc = "transport: Error while dialing remote error: tls: unrecognized name"
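Because the four messages differ only in their last portion, triage can be sketched as a simple substring match. The following is a minimal, hypothetical helper (the function name, labels, and hint wording are illustrative and not part of Vault); the hints summarize the causes described in this article for each example and are not an exhaustive diagnosis.

```python
# Distinguishing tail of each example error, mapped to a short label and the
# cause this article associates with it.
KNOWN_TAILS = [
    ("i/o timeout", "timeout",
     "No timely response from the primary cluster port; check for blocked return traffic."),
    ("connection refused", "refused",
     "Traffic from the DR secondary to the DR primary cluster port is blocked."),
    ("tls: handshake failure", "tls-handshake",
     "A load balancer may be forwarding cluster traffic to a non-Vault server."),
    ("tls: unrecognized name", "tls-name",
     "A load balancer may be forwarding cluster traffic to a different Vault cluster."),
]


def classify_bootstrap_error(message):
    """Return (label, hint) for a bootstrap error message, or None if unrelated."""
    if "error fetching secondary bootstrap package" not in message:
        return None
    for tail, label, hint in KNOWN_TAILS:
        if tail in message:
            return label, hint
    return "unknown", "Bootstrap error with an unrecognized cause."


if __name__ == "__main__":
    sample = (
        "1 error occurred: * error fetching secondary bootstrap package: "
        "error authing for bootstrap: rpc error: code = Unavailable desc = connection error: "
        'desc = "transport: Error while dialing dial tcp x.x.x.x:8201: i/o timeout"'
    )
    print(classify_bootstrap_error(sample))
```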
Overview of Possible Solutions
- Ensure that network communication is possible between the DR primary and DR secondary clusters for both the API and cluster ports in both directions (default TCP port 8200 and TCP port 8201).
  - In example 1, traffic was only allowed from the DR secondary cluster to the DR primary cluster. Returning traffic was blocked, resulting in an i/o timeout.
  - In example 2, traffic from the DR secondary cluster to the DR primary cluster was blocked, resulting in a connection refused error.
- Ensure that the proper primary_cluster_addr is specified on the DR primary cluster when enabling replication and that the address is reachable from the DR secondary cluster when the clusters are located on separate networks.
- Ensure that the load balancer facing the DR Primary cluster forwards traffic to the correct Vault cluster.
  - In example 3, the primary cluster is fronted by a load balancer. The primary_cluster_addr on the DR primary was set to the load balancer address, but the load balancer was configured to forward traffic to a non-Vault server, resulting in a tls handshake failure.
  - In example 4, the primary cluster is fronted by a load balancer. The primary_cluster_addr on the DR primary was set to the load balancer address, and the load balancer was configured to forward traffic to a Vault server belonging to a different Vault cluster. This causes a CN/SAN name mismatch between the data in the TLS certificate and the server name URI.
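The network checks above can be sketched with a small TCP probe. This is a minimal sketch under stated assumptions: the function name is hypothetical, the ports are the defaults mentioned in this article, and the result only reflects reachability from the machine running the probe to the target, so it should be run from a node in each cluster toward the other. It does not perform a TLS handshake, so it cannot reproduce examples 3 and 4.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt a plain TCP connection and report how it ended.

    Returns "open", "timeout", "refused", or "error: <details>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No answer in time, comparable to the i/o timeout in example 1.
        return "timeout"
    except ConnectionRefusedError:
        # Actively rejected, comparable to the connection refused in example 2.
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar conditions.
        return f"error: {exc}"


def probe_default_ports(host, ports=(8200, 8201)):
    """Probe the default API (8200) and cluster (8201) ports of one host."""
    return {port: probe_tcp(host, port) for port in ports}
```

For example, calling probe_default_ports with the address configured as primary_cluster_addr (a value specific to each environment) from a DR secondary node gives a quick view of both ports. A result of "open" through a load balancer only shows that something accepted the connection, not that it is the intended Vault cluster.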
Additional Information
- HashiCorp Learn: Disaster Recovery Replication Setup
- Vault API doc: primary_cluster_addr
- Article: Troubleshooting Replication: Problems During Initial Bootstrap