Introduction
When Terraform Enterprise is deployed in Active/Active operational mode, its internally-managed Vault is deployed in High Availability (HA) mode. In HA mode, one Vault node becomes the active node and handles all requests while the other nodes are passive (standby), ready to take over if the active node fails. Standby nodes will forward requests to the active node, which advertises its address via the encrypted storage, allowing clients to connect to any node in the cluster.
Problem
The Vault request forwarding procedure can fail in Terraform Enterprise for various reasons, resulting in an unhealthy node on which Vault repeatedly logs the following error, with more specific error logs following it.
[ERROR] core: forward request error: error="error during forwarding RPC request"
Prerequisites
- Terraform Enterprise is deployed in Active/Active operational mode.
- Terraform Enterprise is not configured to use external Vault.
Cause
This issue can have several causes, ranging from network configurations to misconfigured Terraform Enterprise application settings. Several common and known causes are outlined in the solutions below.
Solutions
Solution 1: Correct the Vault Cluster Address
A misconfiguration of the Vault cluster address may result in one of the following errors.
[ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: remote error: tls: internal error\""
[ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: remote error: tls: no application protocol\""
In Terraform Enterprise Flexible Deployment Options, the Vault cluster address is configured with the TFE_VAULT_CLUSTER_ADDRESS setting. By default, it is set to http://{{ GetPrivateIP }}:8201, a go-sockaddr template that resolves to the first routable private IP address. The appropriate value depends on the deployment platform.
- Docker and Podman: TFE_VAULT_CLUSTER_ADDRESS should generally be set to the routable private IP address of the underlying host, as GetPrivateIP will resolve to the container's private IP address, which is not routable across nodes.
- Kubernetes: The default value should generally be used, as it will resolve to the pod's private IP address, which is routable from other pods.
- Replicated: The Vault cluster address is set to http://<HOST_PRIVATE_IP>:8201 and is not configurable.
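For Docker or Podman, the value can be assembled from the host's private IP address before the container starts. The following is a minimal sketch: the helper name vault_cluster_address is hypothetical, and the IP address is borrowed from the sample logs in this article. Substitute the address that the other Terraform Enterprise nodes use to reach this host.

```shell
# Build a Vault cluster address from a host IP (hypothetical helper).
# Port 8201 is the default cluster port described in this article.
vault_cluster_address() {
  printf 'http://%s:8201\n' "$1"
}

# Example: compute the value and export it for later use.
TFE_VAULT_CLUSTER_ADDRESS="$(vault_cluster_address 10.0.171.59)"
export TFE_VAULT_CLUSTER_ADDRESS
echo "$TFE_VAULT_CLUSTER_ADDRESS"
```

The exported value still has to reach the container, for example through the environment section of the Compose file or an --env flag on the run command.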
Solution 2: Publish Required Network Ports
For Docker, Podman, or Nomad installations, the Terraform Enterprise container must publish port 8201. This is the port on which Vault listens for server-to-server cluster requests by default. If the port is not published, Terraform Enterprise's internal Vault server cannot be reached from outside the container network.
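One way to check whether the port is published is to ask the container runtime for its port mapping. The sketch below wraps the port subcommand of the docker and podman CLIs; the helper name published_port is hypothetical, and it assumes that an unpublished port yields either an error or empty output.

```shell
# Print the host mapping for a container port, or fail if none is reported.
# $1 = container CLI (docker or podman), $2 = container name, $3 = port
published_port() {
  mapping="$("$1" port "$2" "$3" 2>/dev/null)" || mapping=""
  [ -n "$mapping" ] && printf '%s\n' "$mapping"
}

# Example (replace <TFE_CONTAINER> with the real container name):
# published_port docker '<TFE_CONTAINER>' 8201 || echo "port 8201 is not published"
```

If no mapping is reported, publish the port when the container is created, for example with a ports entry in the Compose file or a --publish flag.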
Solution 3: Verify Firewall Rules
The Terraform Enterprise nodes must be able to communicate with one another over port 8201, per the network traffic requirements. If a firewall drops or rejects packets on the network, Vault may log one of the following errors.
[ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.0.171.59:8201: i/o timeout\""
[ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.0.97.91:8200: connect: connection refused\""
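Connectivity between nodes can be checked from one Terraform Enterprise host to another with a plain TCP connection attempt. This is a minimal sketch, assuming bash (with /dev/tcp support) and GNU timeout are available on the host; the helper name can_connect is hypothetical, and the address in the example comes from the sample log above.

```shell
# Return 0 if a TCP connection to host:port succeeds within the timeout.
# $1 = host, $2 = port, $3 = timeout in seconds (default 3)
can_connect() {
  timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Example, run from a different node than the one being tested:
# can_connect 10.0.171.59 8201 && echo "reachable" || echo "blocked or closed"
```

An exit status of 124 means timeout gave up waiting, which is consistent with dropped packets and the i/o timeout log. An immediate failure is consistent with a rejected connection or a port that nothing is listening on, as in the connection refused log.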
Solution 4: Exclude Traffic from Service Mesh
If a service mesh such as Istio or Linkerd is deployed in the same cluster as Terraform Enterprise, its proxying between Vault nodes may cause certificate verification errors.
[ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: tls: failed to verify certificate: x509: certificate is not valid for any names, but wanted to match fw-f457c844-ecc5-0f57-6b76-7d2f11aa49a8\""
Port 8201 uses mutual TLS (mTLS), so both the client and the server have pre-existing expectations of the TLS certificate they will see, and those expectations break when the service mesh proxies the connection. To resolve this, exclude inbound and outbound traffic on port 8201 in the service mesh configuration. For an example, see the Istio-specific annotations documentation.
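For Istio, the exclusion is typically expressed as pod annotations. The sketch below prints a merge patch containing the traffic.sidecar.istio.io/excludeInboundPorts and traffic.sidecar.istio.io/excludeOutboundPorts annotations. The helper name mesh_exclude_patch and the placeholders in the usage comment are hypothetical, and other service meshes use different annotation names.

```shell
# Print a merge patch that adds Istio port-exclusion annotations to a
# workload's pod template. $1 = port to exclude (8201 for Vault clustering).
mesh_exclude_patch() {
  printf '{"spec":{"template":{"metadata":{"annotations":{'
  printf '"traffic.sidecar.istio.io/excludeInboundPorts":"%s",' "$1"
  printf '"traffic.sidecar.istio.io/excludeOutboundPorts":"%s"' "$1"
  printf '}}}}}\n'
}

mesh_exclude_patch 8201

# Example (placeholders are hypothetical; adjust to your deployment):
# kubectl patch deployment <TFE_DEPLOYMENT> -n <TFE_NAMESPACE> \
#   --type merge -p "$(mesh_exclude_patch 8201)"
```

If Terraform Enterprise is installed with Helm, setting the same annotations through the chart's values is likely the more durable option, since a manual patch may be overwritten on the next upgrade.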
Solution 5: Adjust for non-RFC 1918 CIDR Ranges
When TFE_VAULT_CLUSTER_ADDRESS is left at its default value (http://{{ GetPrivateIP }}:8201), the private IP address may resolve to an empty or unexpected value, potentially leading to the following error.
[ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: remote error: tls: internal error\""
This can occur in Kubernetes clusters with a pod CIDR range that is not a standard private IP address range (e.g., 240.0.0.0/4). Google Kubernetes Engine, for example, recommends using non-RFC 1918 IP address ranges.
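As a quick first check, a pod IP address can be tested against the three RFC 1918 ranges (10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16). This is a minimal sketch that assumes a well-formed dotted-quad IPv4 address; the helper name is_rfc1918 is hypothetical. A negative result is only a hint that this solution may apply, not proof that GetPrivateIP failed.

```shell
# Return 0 if the IPv4 address in $1 falls inside an RFC 1918 range.
is_rfc1918() {
  # Only the first two octets matter for these three ranges.
  o1="${1%%.*}"
  rest="${1#*.}"
  o2="${rest%%.*}"
  case "$o1" in
    10) return 0 ;;
    172) [ "$o2" -ge 16 ] && [ "$o2" -le 31 ] ;;
    192) [ "$o2" -eq 168 ] ;;
    *) return 1 ;;
  esac
}

# Example: an address from 240.0.0.0/4 is not an RFC 1918 address.
is_rfc1918 240.1.2.3 || echo "240.1.2.3 is outside RFC 1918"
```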
To confirm whether this is the cause, run the appropriate command below to view the status of the Vault node.
- Replicated/Docker
$ docker exec <TFE_CONTAINER> bash -c '. /var/run/terraform-enterprise/atlas/atlas-env && wait-for-token -- vault status'
- Podman
$ podman exec <TFE_CONTAINER> bash -c '. /var/run/terraform-enterprise/atlas/atlas-env && wait-for-token -- vault status'
- Kubernetes
$ kubectl exec -n <TFE_NAMESPACE> <TFE_POD> -- bash -c '. /var/run/terraform-enterprise/atlas/atlas-env && wait-for-token -- vault status'
The command output will resemble the following.
Key                    Value
---                    -----
Seal Type              shamir
Initialized            true
Sealed                 false
Total Shares           1
Threshold              1
Version                1.16.3
Build Date             2024-05-29T14:28:42Z
Storage Type           postgresql
Cluster Name           vault-cluster-7937f3d5
Cluster ID             e9e952e5-3c06-ac98-7ef8-98b36765c651
HA Enabled             true
HA Cluster             https://10.0.171.59:8201
HA Mode                standby
Active Node Address    http://127.0.0.1:8200
If the HA Cluster address has no IP address in the host portion of the URL, the GetPrivateIP function likely failed to find a valid IP address.
HA Cluster https://:8201
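This check can be scripted by piping the vault status output through a small filter. The sketch below is an assumption-laden helper, not part of Terraform Enterprise: the name check_ha_cluster is hypothetical, it expects the table layout shown above on standard input, and it handles IPv4 addresses and hostnames only.

```shell
# Read `vault status` output on stdin and report whether the HA Cluster
# address contains a host. Returns 0 if present, 1 if empty, 2 if the row
# is missing.
check_ha_cluster() {
  # Pull the URL from the "HA Cluster" row; "HA Enabled" and "HA Mode" are skipped.
  url="$(awk '$1 == "HA" && $2 == "Cluster" { print $3 }')"
  if [ -z "$url" ]; then
    echo "HA Cluster row not found" >&2
    return 2
  fi
  # Strip the scheme, then everything from the first ":" or "/" onward.
  host="$(printf '%s\n' "$url" | sed -e 's|^[a-zA-Z]*://||' -e 's|[:/].*$||')"
  if [ -z "$host" ]; then
    echo "HA Cluster address has no host: $url"
    return 1
  fi
  echo "HA Cluster host: $host"
}

# Example: append "| check_ha_cluster" to one of the vault status commands above.
```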
In these cases, if a non-RFC 1918 pod CIDR range is a requirement, you can set the Vault cluster address using a slightly modified go-sockaddr template as a workaround.
TFE_VAULT_CLUSTER_ADDRESS: http://{{GetAllInterfaces | include \"type\" \"ip\" | include \"flags\" \"up\" | exclude \"flags\" \"loopback\" | sort \"default,type,size\" | include \"RFC\" \"6890\" | attr \"address\" }}:8201