Introduction
When Terraform Enterprise is deployed in Active/Active operational mode, its internally-managed Vault is deployed in High Availability (HA) mode. In HA mode, one Vault node becomes the active node and handles all requests while the other nodes are passive (standby), ready to take over if the active node fails. Standby nodes will forward requests to the active node, which advertises its address via the encrypted storage, allowing clients to connect to any node in the cluster.
Problem
The Vault request forwarding procedure can fail in Terraform Enterprise for various reasons, resulting in an unhealthy node on which Vault is repeatedly logging the error below, with more specific error logs following it.
2024-07-10T18:00:09.682Z [ERROR] core: forward request error: error="error during forwarding RPC request"
Prerequisites
- Terraform Enterprise is deployed in Active/Active operational mode
- Terraform Enterprise is not configured to use external Vault
Causes and Solutions
This issue can have several causes, ranging from configurations in the network to misconfigured Terraform Enterprise application settings. Several common and known causes are outlined below.
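As a rough triage aid, the error signatures quoted in the sections below can be matched against a log line to suggest which section to read first. This is a minimal sketch, not an official tool: the function name and the category labels are hypothetical, and the matching covers only the messages shown in this article.

```shell
# Hypothetical helper: map one Vault log line to the likely cause
# described in this article, based only on the messages quoted here.
classify_vault_forwarding_error() {
  case "$1" in
    *"x509: certificate is not valid for any names"*)
      echo "service-mesh" ;;
    *"i/o timeout"*|*"connection refused"*)
      echo "firewall" ;;
    *"tls: internal error"*|*"tls: no application protocol"*)
      # Covers both a misconfigured cluster address and the CIDR range case.
      echo "cluster-address" ;;
    *)
      echo "unknown" ;;
  esac
}

# Example: classify_vault_forwarding_error "<one line copied from the Vault log>"
```

A result of "unknown" only means the line does not match a message listed here; it does not rule out the causes described below.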
Misconfigured Vault Cluster Address
In Terraform Enterprise Flexible Deployment Options, the Vault cluster address is configurable via the TFE_VAULT_CLUSTER_ADDRESS setting, because the correct value depends on the platform on which Terraform Enterprise is deployed. By default, it is set to http://{{ GetPrivateIP }}:8201, a go-sockaddr template which resolves to the first routable private IP address in the list of available network interfaces. A misconfiguration of this setting may result in one of the following errors.
2024-06-18T22:08:11.294Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: remote error: tls: internal error\""
2024-07-22T16:21:49.748Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: remote error: tls: no application protocol\""
In Docker and Podman deployments, TFE_VAULT_CLUSTER_ADDRESS should generally be set to the routable private IP address of the underlying host, because GetPrivateIP resolves to the private IP address of the container itself (i.e., the IP address assigned to the container by its bridge network), which is not routable across nodes.
In Kubernetes deployments, the default value should generally be used, as it will resolve to the pod's private IP address, an address which is routable from other pods (each a Terraform Enterprise node).
In Replicated deployments, the Vault cluster address is set to http://<HOST_PRIVATE_IP>:8201 and is not configurable.
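For Docker and Podman deployments, the value can be composed from the host's private IP address. The following is a minimal sketch: the function name is hypothetical, the scheme and port are copied from the default value quoted above, and how the host IP is discovered is left to the operator because it varies by environment.

```shell
# Hypothetical helper: build a TFE_VAULT_CLUSTER_ADDRESS value from a host IP.
# Refuses an empty IP so a blank address is not silently configured.
vault_cluster_address() {
  ip="$1"
  if [ -z "$ip" ]; then
    echo "host private IP is required" >&2
    return 1
  fi
  printf 'http://%s:8201\n' "$ip"
}

# Example with an illustrative host IP (replace with the node's own address):
#   TFE_VAULT_CLUSTER_ADDRESS="$(vault_cluster_address 10.0.171.59)"
```

Each node needs its own value, since the address must point at the host that node runs on.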
Firewall
The Terraform Enterprise nodes must be able to communicate with one another over port 8201, per the network traffic requirements. If packets are dropped or rejected by a firewall on the network, Vault may log one of the following errors.
2024-07-10T18:00:09.684Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.0.171.59:8201: i/o timeout\""
2024-07-22T16:24:28.254Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.0.97.91:8200: connect: connection refused\""
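Connectivity on port 8201 can be spot-checked from one node against another. The sketch below assumes bash and the coreutils timeout command are available where it runs; the function name is hypothetical, and a successful TCP connection only shows that the port is reachable, not that Vault forwarding is healthy.

```shell
# Hypothetical helper: attempt a TCP connection to HOST:PORT with a time limit.
# Uses bash's /dev/tcp redirection, so bash must be installed.
check_vault_cluster_port() {
  host="$1"
  port="${2:-8201}"
  wait_s="${3:-3}"
  if timeout "$wait_s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
    echo "reachable: $host:$port"
  else
    echo "unreachable: $host:$port"
    return 1
  fi
}

# Example: check_vault_cluster_port <OTHER_NODE_PRIVATE_IP> 8201
```

A timeout typically points at dropped packets, while an immediate failure typically points at a rejected connection or nothing listening on the port.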
Service Mesh (Istio, Linkerd)
If a service mesh such as Istio or Linkerd is deployed in the same cluster as Terraform Enterprise, its proxying of traffic between Vault nodes can cause certificate verification errors, such as the following.
2024-11-26T14:00:31.122Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: tls: failed to verify certificate: x509: certificate is not valid for any names, but wanted to match fw-f457c844-ecc5-0f57-6b76-7d2f11aa49a8\""
Port 8201 uses mutual TLS (mTLS), so both the client and the server have pre-existing expectations of which TLS certificate they will see, and those expectations break when the service mesh tries to proxy the connection. To resolve this, exclude inbound and outbound traffic on port 8201 in the service mesh configuration (see the Istio documentation for the annotations specific to Istio).
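For Istio, the exclusion is expressed with the traffic.sidecar.istio.io/excludeInboundPorts and traffic.sidecar.istio.io/excludeOutboundPorts pod annotations. The sketch below only generates a merge patch that sets them on a pod template; the function name is hypothetical, and the commented kubectl command assumes Terraform Enterprise runs as a Deployment whose name you supply.

```shell
# Hypothetical helper: print a JSON merge patch that adds the Istio
# port-exclusion annotations to a workload's pod template.
istio_exclude_port_patch() {
  port="${1:-8201}"
  printf '{"spec":{"template":{"metadata":{"annotations":{"traffic.sidecar.istio.io/excludeInboundPorts":"%s","traffic.sidecar.istio.io/excludeOutboundPorts":"%s"}}}}}\n' "$port" "$port"
}

# Example (placeholders are assumptions, not values from this article):
#   kubectl patch deployment <TFE_DEPLOYMENT> -n <TFE_NAMESPACE> \
#     --type merge -p "$(istio_exclude_port_patch 8201)"
```

If the workload is managed by Helm or another tool, a manual patch may be overwritten on the next release, so setting the same annotations through that tool's pod annotation values is likely the more durable choice.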
CIDR Range
In some cases, when using the default value for TFE_VAULT_CLUSTER_ADDRESS (http://{{ GetPrivateIP }}:8201), the private IP address may resolve to an empty or unexpected value, potentially leading to the following error.
2024-06-18T22:08:11.294Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: remote error: tls: internal error\""
This has been known to manifest in Kubernetes clusters with a pod CIDR range which is not a standard private IP address range (e.g., 240.0.0.0/4). Google Kubernetes Engine, specifically, recommends the use of non-RFC 1918 IP address ranges to prevent IP address exhaustion in the VPC network.
To confirm if this is the cause, run the following command to view the status of the Vault node on the instance.
Replicated/Docker
docker exec <TFE_CONTAINER> bash -c '. /var/run/terraform-enterprise/atlas/atlas-env && wait-for-token -- vault status'
Podman
podman exec <TFE_CONTAINER> -- bash -c '. /var/run/terraform-enterprise/atlas/atlas-env && wait-for-token -- vault status'
Kubernetes
kubectl exec -n <TFE_NAMESPACE> <TFE_POD> -- bash -c '. /var/run/terraform-enterprise/atlas/atlas-env && wait-for-token -- vault status'
The output of the command will resemble that shown below.
Key                    Value
---                    -----
Seal Type              shamir
Initialized            true
Sealed                 false
Total Shares           1
Threshold              1
Version                1.16.3
Build Date             2024-05-29T14:28:42Z
Storage Type           postgresql
Cluster Name           vault-cluster-7937f3d5
Cluster ID             e9e952e5-3c06-ac98-7ef8-98b36765c651
HA Enabled             true
HA Cluster             https://10.0.171.59:8201
HA Mode                standby
Active Node Address    http://127.0.0.1:8200
The HA Mode key in the output indicates whether the Vault node is active or standby. The HA Cluster key should contain the routable IP address of the active Vault node as the host portion of its URL. If the host is missing from the HA Cluster URL, as in the example below, it is possible that the GetPrivateIP template failed to find a valid IP address.
HA Cluster https://:8201
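The check described above can be scripted by piping the vault status output through a small filter. This is a minimal sketch with hypothetical function names; it assumes the plain-text table format shown earlier and an IPv4 or hostname value, and it does not handle bracketed IPv6 addresses.

```shell
# Hypothetical helper: read `vault status` output on stdin and print the
# host portion of the HA Cluster URL (empty if the host is missing).
ha_cluster_host() {
  awk '$1 == "HA" && $2 == "Cluster" { print $3 }' |
    sed -e 's|^[a-z]*://||' -e 's|:[0-9]*$||'
}

# Hypothetical helper: fail when the HA Cluster host is empty or absent.
check_ha_cluster() {
  host="$(ha_cluster_host)"
  if [ -z "$host" ]; then
    echo "HA Cluster host is empty or missing"
    return 1
  fi
  echo "HA Cluster host: $host"
}

# Example: append `| check_ha_cluster` to one of the vault status commands above.
```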
In these cases, if a non-RFC 1918 pod CIDR range is a requirement, the Vault cluster address can be set, as a workaround, using the slightly modified go-sockaddr template shown below.
TFE_VAULT_CLUSTER_ADDRESS: http://'{{GetAllInterfaces | include \"type\" \"ip\" | include \"flags\" \"up\" | exclude \"flags\" \"loopback\" | sort \"default,type,size\" | include \"RFC\" \"6890\" | attr \"address\" }}':8201