Introduction
Load balancing with replication is one of the more complex Vault topics and a common source of the misconfiguration, architecture, and environment issues that customers report.
Because of this complexity, a large number of resources are scattered across HashiCorp's official documentation and knowledge base, which makes the topic even more daunting for those not already familiar with it.
This article aims to curate these resources and offer a convenient point of reference for related information. It is intended to be updated by support as additional documentation becomes available.
Vault and Load Balancers
The following links provide information on various aspects of load balancing traffic to and from Vault.
- Load Balancer Recommendations - Recommendations around how you can use the /sys/health endpoint and TLS termination
- Best Practices - AWS NLB Configuration for Vault - Details the requirements for setting up a Network Load Balancer in AWS to work with Vault. Similar principles apply to other cloud providers
- The /sys/health endpoint - Critical for load balancers to measure the health of Vault nodes and connections. Includes important status codes returned by Vault
- Network Connectivity with Vault - Details the port requirements and their uses. It is important to note that Vault requires port 443 inbound, and ports 8200 & 8201 bidirectionally to establish and maintain a healthy relationship
- Port Traffic Consideration with Load Balancer - Explains how Vault handles internal certificates for internal cluster communications, the use of the cluster port 8201, the use of the primary_cluster_addr parameter, and considerations for choosing a load balancer, including why we generally recommend a layer 4 load balancer over an application layer (layer 7) load balancer
- PROXY Protocol Support - Details on how to help Vault learn the true client IP when clients are connecting via a load balancer or proxy
- Recommended Architecture - Touches briefly on how to architect a load balancer with HA (note that High Availability and Replication are two separate Vault features with different architectural requirements)
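As an illustration of how a load balancer health check might interpret /sys/health responses, the following is a minimal sketch rather than a definitive implementation. The status codes are Vault's documented defaults for this endpoint; the function name classify_health, the HealthVerdict type, and the allow_perf_standby flag are hypothetical names introduced only for this example. It assumes the load balancer should send traffic to the active node only, unless performance standbys are explicitly allowed.

```python
from typing import NamedTuple


class HealthVerdict(NamedTuple):
    state: str           # human-readable node state
    route_traffic: bool  # whether the load balancer should send requests here


# Default /sys/health status codes (Vault defaults; they can be altered with
# query parameters such as standbyok or perfstandbyok on the health check URL).
DEFAULT_CODES = {
    200: HealthVerdict("initialized, unsealed, active", True),
    429: HealthVerdict("unsealed, standby", False),
    472: HealthVerdict("disaster recovery secondary, active", False),
    473: HealthVerdict("performance standby", False),
    501: HealthVerdict("not initialized", False),
    503: HealthVerdict("sealed", False),
}


def classify_health(status_code: int, allow_perf_standby: bool = False) -> HealthVerdict:
    """Map a /sys/health HTTP status code to a routing decision.

    allow_perf_standby is a hypothetical switch for this sketch: when True,
    performance standby nodes (473) are treated as eligible targets.
    """
    verdict = DEFAULT_CODES.get(status_code, HealthVerdict("unknown", False))
    if status_code == 473 and allow_perf_standby:
        return HealthVerdict(verdict.state, True)
    return verdict
```

In practice most load balancers express this as a list of "healthy" status codes on the health check itself; the sketch only makes the decision table explicit.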
Load Balancing and Replication
Most enterprise use cases involving load balancers also utilise Vault's Replication features. As such, this section covers some basic scenarios to be aware of in conjunction with the above.
The following topics should be understood in order to effectively configure, deploy and manage load balanced Vault with replication enabled.
- Multi-Cluster Architecture - Covers the fundamentals of how to architect multiple clusters with replication
- Enabling Replication with primary_cluster_addr and primary_api_addr as needed - Covers critically important aspects of how clusters communicate with each other when they are not directly accessible due to a load balancer
- Monitoring Vault Replication - Explains how to monitor replication effectively
- Troubleshooting Replication Problems During Initial Bootstrap - Misconfiguration is one of the top causes of errors when setting up replication. This part of the documentation highlights what can break and how it can break
- "Intelligently" load balancing traffic based on geolocation - One issue encountered by many customers utilising Performance Replication is that some data (tokens, leases) is not replicated from primary to secondary. With round-robin style load balancing, a request can land on a cluster that does not hold the client's token, resulting in invalid or missing token errors, among other things. A commonly used solution is to deploy a Global Traffic Management (GTM) load balancer, or an equivalent, to distribute requests based on geolocation and ensure that each session is sent to its designated cluster. Note that configuration of the load balancer itself is outside the scope of HashiCorp support
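The geolocation-based routing idea in the last item can be sketched as follows. This is only an illustration of the decision logic, not a GTM configuration: the region codes, cluster URLs, and the pick_cluster function are all hypothetical and assumed for the example. The point it shows is that a session which has already been assigned to a cluster stays there, because tokens and leases issued by that cluster are not replicated to the others.

```python
from typing import Optional

# Hypothetical mapping of client regions to Vault cluster addresses.
REGION_TO_CLUSTER = {
    "eu": "https://vault-eu.example.internal:8200",
    "us": "https://vault-us.example.internal:8200",
}

# Hypothetical fallback for clients whose region is not mapped.
DEFAULT_CLUSTER = "https://vault-us.example.internal:8200"


def pick_cluster(client_region: str, session_cluster: Optional[str] = None) -> str:
    """Choose the Vault cluster for a request.

    If the session is already pinned to a cluster, keep it there so that
    locally issued tokens and leases remain valid. Otherwise route by the
    client's region, falling back to the default cluster.
    """
    if session_cluster is not None:
        return session_cluster
    return REGION_TO_CLUSTER.get(client_region.lower(), DEFAULT_CLUSTER)
```

For example, a client in the "eu" region is first routed to the EU cluster, and subsequent requests that carry that cluster as session_cluster continue to go there even if the client's apparent region changes.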
An Important Note
As the broad range of resources in this article shows, implementing Vault with replication and load balancing is not a one-size-fits-all operation. The above should be considered within the scope of each individual deployment.
Lastly, configuration of Vault replication via a load balancer is one of the more complex Vault administration tasks, and one of the most commonly misconfigured functionalities. For enterprise customers, it is highly recommended that prior to any production-grade implementation, you consult your Customer Success Manager for a review of your deployment plan.