Terraform Enterprise (Active/Active) In Unhealthy State With "forward request error" Error in Vault – HashiCorp Help Center

Introduction

When Terraform Enterprise is deployed in Active/Active operational mode, its internally-managed Vault is deployed in High Availability (HA) mode. In HA mode, one Vault node becomes the active node and handles all requests while the other nodes are passive (standby), ready to take over if the active node fails. Standby nodes will forward requests to the active node, which advertises its address via the encrypted storage, allowing clients to connect to any node in the cluster.

Problem

The Vault request forwarding procedure can fail in Terraform Enterprise for various reasons, resulting in an unhealthy node on which Vault is repeatedly logging the error below, with more specific error logs following it.

2024-07-10T18:00:09.682Z [ERROR] core: forward request error: error="error during forwarding RPC request"

Prerequisites

Terraform Enterprise is deployed in Active/Active operational mode
Terraform Enterprise is not configured to use external Vault

Causes and Solutions

This issue can have several causes, ranging from configurations in the network to misconfigured Terraform Enterprise application settings. Several common and known causes are outlined below.

Misconfigured Vault Cluster Address

In Terraform Enterprise Flexible Deployment Options, the Vault cluster address is configurable via the TFE_VAULT_CLUSTER_ADDRESS setting, given the variety of platforms on which it can be deployed. By default, it is set to http://{{ GetPrivateIP }}:8201, a go-sockaddr template which resolves to the first routable private IP address in the list of available network interfaces. A misconfiguration of this setting may result in one of the following errors.

2024-06-18T22:08:11.294Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: remote error: tls: internal error\""

2024-07-22T16:21:49.748Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: remote error: tls: no application protocol\""

In Docker and Podman deployments, TFE_VAULT_CLUSTER_ADDRESS should generally be set to the routable private IP address of the underlying host. Inside a container, GetPrivateIP resolves to the private IP address of the container itself (i.e., the IP address assigned to the container by its bridge network), which is not routable across nodes.
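As a minimal sketch of the Docker and Podman guidance above, the helper below builds the cluster address from a host IP address. The function name vault_cluster_address and the example IP address are hypothetical, and the http:// scheme and port 8201 are assumed only because they mirror the default value quoted earlier.

```shell
#!/usr/bin/env bash
# Build a Vault cluster address from the host's routable private IP address.
# Assumption: scheme and port mirror the default http://{{ GetPrivateIP }}:8201.
vault_cluster_address() {
  local host_ip="$1"
  if [ -z "$host_ip" ]; then
    echo "usage: vault_cluster_address <HOST_PRIVATE_IP>" >&2
    return 1
  fi
  printf 'http://%s:8201\n' "$host_ip"
}

# Example with a hypothetical host IP address:
#   TFE_VAULT_CLUSTER_ADDRESS="$(vault_cluster_address 10.0.171.59)"
#
# To see the container-internal address that GetPrivateIP would pick up instead:
#   docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <TFE_CONTAINER>
```

If the address printed by docker inspect matches the host portion of the configured cluster address, other nodes are unlikely to be able to reach it.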

In Kubernetes deployments, the default value should generally be used, as it will resolve to the pod's private IP address, an address which is routable from other pods (each a Terraform Enterprise node).

In Replicated deployments, the Vault cluster address is set to http://<HOST_PRIVATE_IP>:8201 and it is not configurable.

Published Ports

For Docker, Podman, or Nomad installations, the Terraform Enterprise container must publish port 8201, the port on which Vault listens for server-to-server cluster requests by default. If the port is not published, Terraform Enterprise's internal Vault server cannot be reached from outside the container network.
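One way to check this on a Docker host is to inspect the container's port mappings. The sketch below assumes the usual `docker port` output format (for example, `8201/tcp -> 0.0.0.0:8201`); the helper name is_port_published is hypothetical.

```shell
#!/usr/bin/env bash
# Read `docker port <container>` output on stdin and succeed only if the
# given container port (default 8201) has a TCP mapping to the host.
is_port_published() {
  local port="${1:-8201}"
  grep -q "^${port}/tcp -> "
}

# Example:
#   docker port <TFE_CONTAINER> | is_port_published 8201 \
#     && echo "8201 is published" \
#     || echo "8201 is NOT published"
```

If the port is missing, it is typically added where the container is defined, for example with `-p 8201:8201` on `docker run` or an equivalent ports entry in a Compose file.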

Firewall

The Terraform Enterprise nodes must be able to communicate with one another over port 8201, per the network traffic requirements. If packets are dropped or rejected by a firewall on the network, Vault may log one of the following errors.

2024-07-10T18:00:09.684Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.0.171.59:8201: i/o timeout\""

2024-07-22T16:24:28.254Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.0.97.91:8200: connect: connection refused\""
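To reproduce these symptoms outside of Vault, a TCP probe can be run from one Terraform Enterprise node against another. This is a rough sketch that assumes bash and the coreutils `timeout` command are available on the node; the helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Attempt a TCP connection to a peer node's Vault cluster port (default 8201).
# Exit status: 0 = connected, 124 = timed out, anything else = refused/unreachable.
probe_port() {
  local host="$1" port="${2:-8201}"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Translate the probe's exit status into a likely cause.
describe_probe() {
  case "$1" in
    0)   echo "open" ;;
    124) echo "timeout" ;;   # packets are probably being dropped
    *)   echo "refused" ;;   # rejected, or nothing is listening on the port
  esac
}

# Example with the peer address from the log line above:
#   probe_port 10.0.171.59 8201; describe_probe $?
```

A "timeout" result lines up with the i/o timeout error, while "refused" lines up with the connection refused error.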

Service Mesh (Istio, Linkerd)

If a service mesh such as Istio or Linkerd is deployed in the same cluster as Terraform Enterprise, its proxying of traffic between Vault nodes may cause certificate verification errors, such as the following.

2024-11-26T14:00:31.122Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: tls: failed to verify certificate: x509: certificate is not valid for any names, but wanted to match fw-f457c844-ecc5-0f57-6b76-7d2f11aa49a8\""

Port 8201 uses mutual TLS (mTLS), so both the client and the server expect to see a specific TLS certificate from the other side. That expectation breaks when the service mesh proxies the connection. To resolve this, exclude the inbound and outbound traffic on port 8201 in the service mesh configuration (see the following documentation for the annotations specific to Istio).
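For Istio, the exclusion is usually expressed with the traffic.sidecar.istio.io/excludeInboundPorts and traffic.sidecar.istio.io/excludeOutboundPorts pod annotations. The sketch below generates a merge patch that sets both; the helper name and the <TFE_DEPLOYMENT> placeholder are hypothetical, and Linkerd uses different annotations that are not shown here.

```shell
#!/usr/bin/env bash
# Print a JSON merge patch that adds Istio port-exclusion annotations
# to a workload's pod template (default port: 8201).
istio_exclude_patch() {
  local port="${1:-8201}"
  cat <<EOF
{"spec":{"template":{"metadata":{"annotations":{"traffic.sidecar.istio.io/excludeInboundPorts":"${port}","traffic.sidecar.istio.io/excludeOutboundPorts":"${port}"}}}}}
EOF
}

# Example (placeholders are hypothetical):
#   kubectl patch deployment <TFE_DEPLOYMENT> -n <TFE_NAMESPACE> \
#     --type merge -p "$(istio_exclude_patch 8201)"
```

If the deployment is managed by Helm, a manual patch may be overwritten on the next upgrade, so the annotations are better set through the chart's values where it supports pod annotations.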

CIDR Range

In some cases, when using the default value for TFE_VAULT_CLUSTER_ADDRESS (http://{{ GetPrivateIP }}:8201), the private IP address may resolve to an empty or unexpected value, potentially leading to the following error.

2024-06-18T22:08:11.294Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: remote error: tls: internal error\""

This has been known to manifest in Kubernetes clusters with a pod CIDR range which is not a standard private IP address range (e.g., 240.0.0.0/4). Google Kubernetes Engine, specifically, recommends the use of non-RFC 1918 IP address ranges to prevent IP address exhaustion in the VPC network.
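As a quick preliminary check, the pod's IP address can be tested against the RFC 1918 ranges. This is only a rough sketch with a hypothetical helper name: the filter applied by GetPrivateIP is not identical to an RFC 1918 test, so a negative result only flags a pod worth inspecting further.

```shell
#!/usr/bin/env bash
# Succeed if the IPv4 address falls within 10.0.0.0/8, 172.16.0.0/12,
# or 192.168.0.0/16 (the RFC 1918 private ranges).
is_rfc1918() {
  local a b
  a="${1%%.*}"                 # first octet
  b="${1#*.}"; b="${b%%.*}"    # second octet
  case "$a" in
    10)  return 0 ;;
    172) [ "$b" -ge 16 ] && [ "$b" -le 31 ] ;;
    192) [ "$b" -eq 168 ] ;;
    *)   return 1 ;;
  esac
}

# Example:
#   pod_ip="$(kubectl get pod <TFE_POD> -n <TFE_NAMESPACE> -o jsonpath='{.status.podIP}')"
#   is_rfc1918 "$pod_ip" || echo "pod IP $pod_ip is outside the RFC 1918 ranges"
```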

To confirm whether this is the cause, run the command for your platform below to view the status of the Vault node on the instance.

Replicated/Docker

docker exec <TFE_CONTAINER> bash -c '. /var/run/terraform-enterprise/atlas/atlas-env && wait-for-token -- vault status'

Podman

podman exec <TFE_CONTAINER> bash -c '. /var/run/terraform-enterprise/atlas/atlas-env && wait-for-token -- vault status'

Kubernetes

kubectl exec -n <TFE_NAMESPACE> <TFE_POD> -- bash -c '. /var/run/terraform-enterprise/atlas/atlas-env && wait-for-token -- vault status'

The output of the command will resemble that shown below.

Key                    Value
---                    -----
Seal Type              shamir
Initialized            true
Sealed                 false
Total Shares           1
Threshold              1
Version                1.16.3
Build Date             2024-05-29T14:28:42Z
Storage Type           postgresql
Cluster Name           vault-cluster-7937f3d5
Cluster ID             e9e952e5-3c06-ac98-7ef8-98b36765c651
HA Enabled             true
HA Cluster             https://10.0.171.59:8201
HA Mode                standby
Active Node Address    http://127.0.0.1:8200

The HA Mode key in the output indicates whether the Vault node is active or standby. The HA Cluster key should contain the routable IP address of the active Vault node as the host portion of its URL. If the host portion of the HA Cluster URL is empty, as in the following example, it is possible that GetPrivateIP failed to find a valid IP address.

HA Cluster             https://:8201
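The check can be scripted by extracting the host portion of the HA Cluster value from the `vault status` output. The helper name ha_cluster_host is hypothetical, and the sketch assumes the tabular output format shown above and an IPv4 address or hostname.

```shell
#!/usr/bin/env bash
# Read `vault status` output on stdin and print the host portion of the
# "HA Cluster" URL. Prints an empty line when the host is missing.
ha_cluster_host() {
  awk '$1 == "HA" && $2 == "Cluster" { print $3 }' \
    | sed -E 's#^[a-z]+://##; s#:[0-9]+$##'
}

# Example (Kubernetes):
#   host="$(kubectl exec -n <TFE_NAMESPACE> <TFE_POD> -- bash -c \
#     '. /var/run/terraform-enterprise/atlas/atlas-env && wait-for-token -- vault status' \
#     | ha_cluster_host)"
#   [ -n "$host" ] || echo "HA Cluster address has no host"
```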

In these cases, if a non-RFC 1918 pod CIDR range is a requirement, the Vault address can be set using a slightly modified go-sockaddr template shown below as a workaround.

TFE_VAULT_CLUSTER_ADDRESS: http://{{GetAllInterfaces | include \"type\" \"ip\" | include \"flags\" \"up\" | exclude \"flags\" \"loopback\" |  sort \"default,type,size\" | include \"RFC\" \"6890\" | attr \"address\" }}:8201

Additional Information