Introduction
While setting up Performance or Disaster Recovery (DR) replication clusters, you may encounter problematic edge cases. This guide captures such edge cases, details the symptoms that identify them, and provides possible solutions to the underlying problems.
Prerequisites
This guide covers failures that occur during the initial replication configuration/bootstrap steps, so it is only relevant if the following prerequisites are met:
- Familiarity with the general Vault Enterprise Replication docs, as well as the API docs for the relevant replication endpoints
- Familiarity with the Setting up Performance Replication or Disaster Recovery Replication Setup documentation, as applicable
- Both the existing Vault cluster (the planned replication primary) and the new, empty Vault cluster (the planned replication secondary) are running the same version of Vault Enterprise (or the new cluster is running a more recent version than the existing cluster)
- Replication has been successfully enabled on the primary, and a bootstrap token has been generated
- An attempt has been made to enable replication on the secondary, but the step fails (and/or errors are seen in the logs of the secondary or primary during this step)
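As a rough sketch, the steps assumed above look like the following (DR endpoints shown; the performance-replication endpoints are analogous, and the token id is an arbitrary label chosen for illustration):

```
# On the primary cluster: enable DR primary replication
vault write -f sys/replication/dr/primary/enable

# On the primary cluster: generate a bootstrap (secondary activation) token
vault write sys/replication/dr/primary/secondary-token id="dr-secondary"

# On the new, empty secondary cluster: attempt the bootstrap with that token.
# This is the step whose failures are diagnosed below.
vault write sys/replication/dr/secondary/enable token="<wrapped-activation-token>"
```

These commands require a `vault` CLI authenticated against each cluster; they are shown only to fix which step "the secondary bootstrap" refers to.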
Diagnosing Bootstrap Failures
Here are some common error messages seen in various failures of the secondary bootstrap step, the situations that cause them, and how to resolve them:
Error:
* error fetching secondary bootstrap package:
error authing for bootstrap:
rpc error: code = Unavailable desc = all SubConns are in TransientFailure,
latest connection error: connection error:
desc = "transport: Error while dialing remote error: tls: internal error"
Likely Scenario:
Typically this error is seen when the "primary_cluster_addr" used is set to a load balancer configured to route requests to any member of the primary Vault cluster. This is a common configuration for API communications, since Vault standby nodes can service read requests and forward write requests to the active node. However, the bootstrap process communicates over RPC (on a different port), and a standby will not forward bootstrap requests to the active node in the primary cluster. This causes the bootstrap to fail with the ambiguous TLS error message seen above.
Solution:
To maintain the benefits of the HA cluster's ability to service read requests from all nodes, a separate load balancer can be set up specifically for the cross-cluster RPC communications, one that only forwards requests to the active node in the cluster. Typically this is done by configuring the load balancer's health check to hit the /sys/health endpoint of the Vault API and only treating a 200 response code as "healthy": https://www.vaultproject.io/api/system/health#200
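As an illustration, with HAProxy (used here purely as an example; the backend name, addresses, and node count are assumptions for your environment), such a backend might look like:

```
# Hypothetical HAProxy backend for cross-cluster RPC traffic (port 8201).
# The health check runs against the API port (8200); only the active node
# returns HTTP 200 from /v1/sys/health, so standby nodes are marked down.
backend vault_rpc_active
    mode tcp                           # network-level passthrough, no TLS termination
    option httpchk GET /v1/sys/health
    http-check expect status 200
    server vault1 10.0.0.1:8201 check port 8200 check-ssl verify none
    server vault2 10.0.0.2:8201 check port 8200 check-ssl verify none
    server vault3 10.0.0.3:8201 check port 8200 check-ssl verify none
```

The key design point is that traffic is forwarded on the RPC port in plain TCP mode while the health check alone speaks HTTPS to the API port, so only the active node ever receives bootstrap traffic.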
Once you have the separate load balancer for cluster communications, you can specify it with the "primary_cluster_addr" parameter when enabling replication on the primary via the "/sys/replication/dr/primary/enable" endpoint. Note: this means you might need to disable the previously enabled replication and generate a new token for the secondary.
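A sketch of that re-configuration (the load balancer address is an assumption; substitute your RPC load balancer's address and port):

```
# Disable the previously enabled replication, then re-enable it with the
# dedicated RPC load balancer as the cluster address (DR shown):
vault write -f sys/replication/dr/primary/disable
vault write sys/replication/dr/primary/enable \
    primary_cluster_addr="https://vault-rpc.example.com:8201"

# Any previously generated secondary token is tied to the old configuration;
# generate a new one for the secondary to bootstrap with:
vault write sys/replication/dr/primary/secondary-token id="dr-secondary"
```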
Error:
* error fetching secondary bootstrap package:
error authing for bootstrap:
rpc error: code = Unavailable desc = connection error:
desc = "transport: Error while dialing remote error: tls: handshake failure"
Likely Scenario(s):
This error message indicates that the RPC communication between this secondary cluster and the primary cluster is failing when initiating an mTLS connection. The server logs should have more details; usually the cause is one of the following:
- Improper forwarding by a load balancer of the cluster port (8201 by default) to the api port (8200 by default), resulting in a protocol mismatch (HTTPS vs RPC) and subsequent TLS failure
- An application-level load balancer (as opposed to a network-level load balancer) that attempts to use HTTPS instead of the RPC protocol, resulting in a protocol mismatch and subsequent TLS failure
- The load balancer attempting to offer its own certificate (as with TLS termination) on the RPC port, which will conflict with the internal self-signed certificates generated by Vault for cluster communications
Solution:
Ensure that the load balancers used for cross-cluster communication are:
- Forwarding RPC connections to the correct port (the cluster port, 8201 by default) on the Vault servers listening for RPC communications
- Operating at the network level (and not the application level) to prevent protocol mismatch
- Not offering up their own certificates for RPC communications between the clusters
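One way to sanity-check the last point (the hostname below is an assumption) is to inspect what is actually presented on the RPC port through the load balancer. A plain TLS probe against a Vault node's cluster port is expected to fail (Vault requires its own handshake for cluster traffic), so if the probe instead succeeds and shows the load balancer's certificate, the load balancer is terminating TLS; if an HTTP response comes back, there is a protocol or port mismatch:

```
# Does the load balancer present its own certificate on the RPC port?
openssl s_client -connect vault-rpc.example.com:8201 </dev/null 2>/dev/null \
    | openssl x509 -noout -subject -issuer

# Does the RPC port answer HTTPS (i.e., is it forwarded to the API port
# or being spoken to at the application level)?
curl -sk https://vault-rpc.example.com:8201/v1/sys/health
```

If both probes fail with connection or handshake errors, the load balancer is likely passing the RPC traffic through untouched, which is the desired behavior.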