Introduction
HashiCorp Vault is a leading tool for secrets management, often deployed in multi-cluster environments to ensure high availability and disaster recovery. When setting up a Vault multi-cluster environment, proper configuration of proxy settings is crucial.
Proxy misconfigurations can lead to issues such as TLS handshake timeout errors when secondary clusters attempt to communicate with the primary cluster during initial bootstrap. This article explains the role of the NO_PROXY environment variable and how it can be used to resolve Vault cluster-to-cluster TLS handshake timeouts.
Purpose
NO_PROXY is an environment variable that specifies which addresses should bypass a pre-configured proxy server. It becomes particularly important in a Vault multi-cluster environment where internal cluster communications from primary to secondary must be direct and not routed through an external proxy, which can cause delays, security concerns, or connectivity issues. For instance, NO_PROXY can be used to ensure direct communication between primary and secondary over Vault's cluster port 8201 without proxy interference.
Proxy Types
Forward Proxy:
- typically referred to simply as "proxy"
- usually configured on the same node as the service it is proxying (Vault in this case)
- proxies all requests leaving the node (unless the destination is listed in Vault's NO_PROXY, in this case)
- example: VPN
Reverse Proxy:
- usually referred to as "load balancer" (although load balancing is just one of its features)
- usually configured on a separate server to receive traffic
- proxies all incoming requests and forwards them accordingly
- example: AWS NLB/ALB
Example Scenario:
Consider an AWS application load balancer (ALB) configured for Vault's secondary cluster. ALBs will always terminate TLS, which is problematic for Vault replication because communication between clusters is encrypted with mutually-authenticating TLS sessions (mTLS). Thus, any premature TLS termination will result in a trust violation and throw a TLS-related error.
Instead of reconfiguring your ALB, you can set NO_PROXY to the address of the proxy itself to disable it. This is because Go (the language Vault is written in) will automatically disable the proxy if it detects NO_PROXY set to the actual proxy address (IP/domain/wildcard).
In this case, setting NO_PROXY on the secondary to the address of the ALB effectively disabled it, which allowed traffic to be routed directly to the primary and resolved the TLS issue.
Observed Errors
The following are the relevant errors observed in the above scenario. Note that a TLS handshake timeout is a good indication of a proxy in place between clusters. If this proxy is an AWS ALB (as it was here), then the error is most likely due to the proxy terminating the mTLS session on cluster port 8201.
From Secondary CLI:
vault write -f sys/replication/performance/secondary/update-primary token=$REPLICATION_TOKEN primary_api_addr=<REPLICATION_PRIMARY>
# resulting error message
Error writing data to sys/replication/performance/secondary/update-primary: Error making API request
URL: PUT https://.../v1/sys/replication/performance/secondary/update-primary
Code: 500. Errors:
* 1 error occurred:
* error response unwrapping secondary token: status code is 403, message is "403 Forbidden"
vault write -f sys/replication/performance/secondary/update-primary token=$REPLICATION_TOKEN primary_api_addr=<primary's address>
# resulting error message
Error writing data to sys/replication/performance/secondary/update-primary: Error making API request
URL: PUT https://.../v1/sys/replication/performance/secondary/update-primary
Code: 500. Errors:
* 1 error occurred:
* error unwrapping secondary token: Post "https://.../v1/sys/wrapping/unwrap": net/http: TLS handshake timeout
Secondary UI after attempting to update primary:
error response unwrapping secondary token: status code is 504, message is "504 Gateway Timeout"
error unwrapping secondary token: Post "https://.../v1/sys/wrapping/unwrap": Moved Temporarily
# under replication -> details: connection_state lists transient_failure and this:
There has been some transient failure.
Your cluster will eventually switch back to
connection and try to establish a connection again
Configuration
- To recap, NO_PROXY should primarily be used to exclude specific addresses from being routed to the configured proxy.
- However, as a quick way to bypass the proxy for all traffic, set NO_PROXY to the IP/DNS of the proxy.
- Ensure that NO_PROXY is correctly configured to include all relevant addresses. Verify that no proxy server is being used for internal cluster communication.
- For multi-node clusters, set NO_PROXY via the CLI or via Vault's unit file on all nodes of each cluster, restart the standbys first, then step down the active node and restart its service.
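As one way to apply the unit-file approach, the sketch below writes a systemd drop-in that sets NO_PROXY for the Vault service. It assumes the unit is named vault.service. The helper name write_no_proxy_dropin, the file name no-proxy.conf, and the host vault-primary.example.com are hypothetical and only serve as illustration.

```shell
# Write a systemd drop-in that sets NO_PROXY for the Vault service.
# Assumptions: the unit is called vault.service, and the caller passes the
# drop-in directory (normally /etc/systemd/system/vault.service.d).
write_no_proxy_dropin() {
  dropin_dir=$1
  no_proxy_value=$2
  mkdir -p "$dropin_dir"
  printf '[Service]\nEnvironment="NO_PROXY=%s"\n' "$no_proxy_value" \
    > "$dropin_dir/no-proxy.conf"
}

# Typical sequence on each node (run as root; shown as comments, not executed):
#   write_no_proxy_dropin /etc/systemd/system/vault.service.d 'vault-primary.example.com'
#   systemctl daemon-reload
#   systemctl restart vault      # standby nodes first
#   vault operator step-down     # on the active node, then restart its service
```

A drop-in keeps the change separate from the packaged unit file, so a package upgrade is less likely to overwrite it.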
NO_PROXY can take as its definition the following:
- an IP address prefix (1.2.3.4)
- an IP address prefix in CIDR notation (1.2.3.4/8)
- a domain name, or a special DNS label (*)
- any of the above with a literal port number (1.2.3.4:80)
- a comma-separated list of any of the above
Here is an example configuration for a Unix-like environment:
# wildcard notation, set to a load balancer to disable
export NO_PROXY='*.example.lb.com'
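When several addresses need to bypass the proxy, the comma-separated form can be built up one entry at a time. The following is a minimal sketch. The function name add_no_proxy and the example addresses are hypothetical.

```shell
# Append an entry to NO_PROXY unless it is already listed.
add_no_proxy() {
  entry=$1
  case ",${NO_PROXY:-}," in
    *",${entry},"*) ;;  # already present, nothing to do
    *) NO_PROXY="${NO_PROXY:+${NO_PROXY},}${entry}" ;;
  esac
  export NO_PROXY
}

# Example usage (hypothetical addresses, shown as comments):
#   add_no_proxy 'vault-primary.example.com'
#   add_no_proxy '10.0.0.0/8'
```

Wrapping both the existing value and the new entry in commas before comparing avoids false matches on partial host names.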
Other Common Causes and Solutions
A TLS handshake timeout error indicates that the TLS handshake process is not completing within the expected time frame. If the proxy is determined to not be the cause, here are some general troubleshooting steps, and other possible causes:
- Increase Logging Verbosity -- Set Vault's log level to debug to capture detailed information about the TLS handshake process.
- Verify Proxy Settings -- Double-check that NO_PROXY is correctly set and that no proxy is being used for internal cluster communications: env | grep NO_PROXY
- Check Network Connectivity -- Use ping, traceroute, and other network tools to ensure there are no network issues.
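Keep in mind that env | grep NO_PROXY only reports the current shell's environment. On Linux, the environment a running process was started with can be read from /proc/<pid>/environ, which may differ if Vault was started by systemd. The sketch below assumes a Linux host. The helper name proxy_vars_from_environ is hypothetical.

```shell
# Print every variable whose NAME contains "proxy" (case-insensitive)
# from a NUL-delimited environ file such as /proc/<pid>/environ.
proxy_vars_from_environ() {
  tr '\0' '\n' < "$1" | grep -i -E '^[^=]*proxy[^=]*='
}

# Example usage against a running Vault process (needs permission to read
# the file; assumes the process is named "vault"; shown as a comment):
#   proxy_vars_from_environ "/proc/$(pgrep -o -x vault)/environ"
```

If nothing is printed, the process was started without any proxy-related variables, including NO_PROXY.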
Network Latency or Congestion:
High latency or network congestion can delay the TLS handshake. Use network diagnostic tools to check for issues:
ping primary-vault.example.com
traceroute primary-vault.example.com
Certificate Issues:
Invalid or incorrectly configured certificates can prevent a successful TLS handshake. Verify that the certificates are properly configured and trusted by both the primary and secondary clusters.
Firewall and Security Groups:
Ensure that all necessary ports (typically 8200 for the Vault API and 8201 for cluster-to-cluster traffic) are open and that there are no firewall rules or security groups blocking communication between clusters.
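A quick reachability check from the secondary can help separate a blocked port from a proxy problem. The sketch below assumes bash and the coreutils timeout command are available. The function name port_open and the host vault-primary.example.com are hypothetical.

```shell
# Return 0 if a TCP connection to host:port opens within 3 seconds.
# Relies on bash's /dev/tcp redirection, so it is run through bash explicitly.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example usage (hypothetical host, shown as comments):
#   for port in 8200 8201; do
#     if port_open vault-primary.example.com "$port"; then
#       echo "port $port reachable"
#     else
#       echo "port $port blocked or closed"
#     fi
#   done
```

This only confirms that a TCP connection can be opened. It does not validate the TLS handshake itself.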
DNS Resolution:
Slow or incorrect DNS resolution can cause timeouts. Verify that DNS settings are correct and that the domain names resolve quickly to the correct IP addresses.
Additional Resources
- Vault documentation on NO_PROXY
- Vault documentation on using NO_PROXY with VAULT_HTTP_PROXY and VAULT_PROXY_ADDR
- Go http client logic to disable proxy if NO_PROXY is set to IP/domain of proxy
- Go NO_PROXY environment variable definition
- Vault documentation explaining replication mTLS and not to terminate at LB