Introduction
While setting up Performance or Disaster Recovery (DR) replication clusters, you may encounter problematic edge cases. This guide captures such edge cases, details the symptoms that identify them, and provides possible solutions to the underlying problems.
Prerequisites
This guide covers failures that occur during the initial replication configuration/bootstrap steps, so it is only relevant if the following prerequisites are met:
- Familiarity with the general Vault Enterprise Replication docs, as well as the API docs for the relevant replication endpoints
- Familiarity with the Setting up Performance Replication or Disaster Recovery Replication Setup documentation, as applicable
- Both the existing Vault cluster (the planned replication primary) and the new, empty Vault cluster (the planned replication secondary) are running the same version of Vault Enterprise (or the new cluster is running a more recent version than the existing cluster)
- Replication has been successfully enabled on the primary, and a bootstrap token has been generated
- An attempt has been made to enable replication on the secondary, but the step fails (and/or errors are seen in the logs of the secondary or primary during this step)
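As a rough sketch, the steps assumed above look like the following (DR endpoints shown; the performance-replication endpoints are analogous, and the token id is an arbitrary label chosen for illustration):

```
# On the primary cluster: enable DR primary replication
vault write -f sys/replication/dr/primary/enable

# On the primary cluster: generate a bootstrap (secondary activation) token
vault write sys/replication/dr/primary/secondary-token id="dr-secondary"

# On the new, empty secondary cluster: attempt the bootstrap with that token.
# This is the step whose failures are diagnosed below.
vault write sys/replication/dr/secondary/enable token="<wrapped-activation-token>"
```

These commands require a `vault` CLI authenticated against each cluster; they are shown only to fix which step "the secondary bootstrap" refers to.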
Diagnosing Bootstrap Failures
Here are some common error messages seen in various failures of the secondary bootstrap step, the situations that cause them, and how to resolve them:
Error:
* error fetching secondary bootstrap package:
error authing for bootstrap:
rpc error: code = Unavailable desc = all SubConns are in TransientFailure,
latest connection error: connection error:
desc = "transport: Error while dialing remote error: tls: internal error"
Likely Scenario:
Typically this error is seen when the "primary_cluster_addr" used is set to a load balancer configured to route requests to any member of the primary Vault cluster. This is a common configuration for API communications, since Vault standby nodes can service read requests and forward write requests to the active node. However, the bootstrap process communicates over RPC (on a different port), and a standby will not forward bootstrap requests to the active node in the primary cluster. This causes the bootstrap to fail with the ambiguous TLS error message seen above.
Solution:
To maintain the benefits of the HA cluster's ability to service read requests from all nodes, a separate load balancer can be set up specifically for the cross-cluster RPC communications, one that only forwards requests to the active node in the cluster. Typically this is done by configuring the load balancer's health check to hit the /sys/health endpoint of the Vault API and only treating a 200 response code as "healthy": https://www.vaultproject.io/api/system/health#200
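As an illustration, with HAProxy (used here purely as an example; the backend name, addresses, and node count are assumptions for your environment), such a backend might look like:

```
# Hypothetical HAProxy backend for cross-cluster RPC traffic (port 8201).
# The health check runs against the API port (8200); only the active node
# returns HTTP 200 from /v1/sys/health, so standby nodes are marked down.
backend vault_rpc_active
    mode tcp                           # network-level passthrough, no TLS termination
    option httpchk GET /v1/sys/health
    http-check expect status 200
    server vault1 10.0.0.1:8201 check port 8200 check-ssl verify none
    server vault2 10.0.0.2:8201 check port 8200 check-ssl verify none
    server vault3 10.0.0.3:8201 check port 8200 check-ssl verify none
```

The key design point is that traffic is forwarded on the RPC port in plain TCP mode while the health check alone speaks HTTPS to the API port, so only the active node ever receives bootstrap traffic.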
Once you have the separate load balancer for cluster communications, you can specify it with the "primary_cluster_addr" parameter when enabling replication on the primary via the "/sys/replication/dr/primary/enable" endpoint. Note: this means you might need to disable the previously enabled replication and generate a new token for the secondary.
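A sketch of that re-configuration (the load balancer address is an assumption; substitute your RPC load balancer's address and port):

```
# Disable the previously enabled replication, then re-enable it with the
# dedicated RPC load balancer as the cluster address (DR shown):
vault write -f sys/replication/dr/primary/disable
vault write sys/replication/dr/primary/enable \
    primary_cluster_addr="https://vault-rpc.example.com:8201"

# Any previously generated secondary token is tied to the old configuration;
# generate a new one for the secondary to bootstrap with:
vault write sys/replication/dr/primary/secondary-token id="dr-secondary"
```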
Error:
* error fetching secondary bootstrap package:
error authing for bootstrap:
rpc error: code = Unavailable desc = connection error:
desc = "transport: Error while dialing remote error: tls: handshake failure"
Likely Scenario(s):
This error message indicates that the RPC communication between this secondary cluster and the primary cluster is failing when initiating an mTLS connection. The server logs should have more details; usually the cause is one of the following:
- Improper forwarding by a load balancer of the cluster port (8201 by default) to the api port (8200 by default), resulting in a protocol mismatch (HTTPS vs RPC) and subsequent TLS failure
- An application-level load balancer (as opposed to a network-level load balancer) that attempts to use HTTPS instead of the RPC protocol, resulting in a protocol mismatch and subsequent TLS failure
- The load balancer attempting to offer its own certificate (as with TLS termination) on the RPC port, which will conflict with the internal self-signed certificates generated by Vault for cluster communications
Solution:
Ensure that the load balancers used for cross-cluster communication are:
- Forwarding RPC connections to the correct port (the cluster port, 8201 by default) on the Vault servers listening for RPC communications
- Operating at the network level (and not the application level) to prevent protocol mismatch
- Not offering up their own certificates for RPC communications between the clusters
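One way to sanity-check the last point (the hostname below is an assumption) is to inspect what is actually presented on the RPC port through the load balancer. A plain TLS probe against a Vault node's cluster port is expected to fail (Vault requires its own handshake for cluster traffic), so if the probe instead succeeds and shows the load balancer's certificate, the load balancer is terminating TLS; if an HTTP response comes back, there is a protocol or port mismatch:

```
# Does the load balancer present its own certificate on the RPC port?
openssl s_client -connect vault-rpc.example.com:8201 </dev/null 2>/dev/null \
    | openssl x509 -noout -subject -issuer

# Does the RPC port answer HTTPS (i.e., is it forwarded to the API port
# or being spoken to at the application level)?
curl -sk https://vault-rpc.example.com:8201/v1/sys/health
```

If both probes fail with connection or handshake errors, the load balancer is likely passing the RPC traffic through untouched, which is the desired behavior.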