Problem
In a Terraform Enterprise active-active configuration, a secondary node fails to start. The logs show the error `local node not active but active cluster node not found`.
Prerequisites
- Terraform Enterprise configured for active-active operational mode.
- A Replicated or Flexible Deployment Options (FDO) installation using Docker or Podman.
Cause
A time drift between the nodes in the cluster can cause this startup failure. The internal Vault component is unable to create a new token and fails to discover other active nodes in the cluster if the system clocks are not synchronized.
Procedure
Follow these steps to diagnose and resolve the issue.
Step 1: Diagnose the Issue
Check the Terraform Enterprise container logs for the specific Vault token creation error. The command varies by version.
For versions v202308-1 and older:

```shell
$ docker logs tfe-vault
```

For versions v202309-1 and newer (note: your container name may differ):

```shell
$ docker exec -it terraform-enterprise-tfe-1 more /var/log/terraform-enterprise/vault.log
```
Look for the following error message in the output:
```
+ Retrying to create vault token
Error creating token: Error making API request.

URL: POST http://tfe-vault:8200/v1/auth/token/create
Code: 500. Errors:

* local node not active but active cluster node not found
```
Check the status of the internal Vault instance to confirm it is in `standby` mode.

For versions v202308-1 and older:

```shell
$ docker exec -it tfe-vault vault status
```

For versions v202309-1 and newer (note: your container name may differ):

```shell
$ docker exec -it terraform-enterprise bash -c 'VAULT_ADDR=http://127.0.0.1:8200 vault status'
```
The output should indicate that the node is in standby mode and has not found an active node.
```
...
HA Enabled              true
HA Cluster              n/a
HA Mode                 standby
Active Node Address     <none>
```
Check the system time on all Terraform Enterprise nodes to identify any drift.
```shell
$ date
```
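Running `date` by hand on every node is error-prone. The sketch below compares each peer's clock against the local node over SSH; the `PEERS` host names and passwordless SSH access are assumptions for illustration, not part of the official procedure.

```shell
#!/usr/bin/env bash
# Sketch: report clock drift between this node and each peer.
# PEERS and passwordless SSH access are assumptions; substitute your node names.
PEERS="${PEERS:-tfe-node-2 tfe-node-3}"

drift_seconds() {
  # Absolute difference between two epoch timestamps.
  local d=$(( $1 - $2 ))
  (( d < 0 )) && d=$(( -d ))
  echo "$d"
}

for peer in $PEERS; do
  # BatchMode and a short timeout keep the loop from hanging on unreachable peers.
  remote=$(ssh -o BatchMode=yes -o ConnectTimeout=3 "$peer" date +%s) || continue
  echo "$peer drift: $(drift_seconds "$(date +%s)" "$remote")s"
done
```

Any reported drift beyond a few seconds is worth correcting before proceeding to Step 2.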
Step 2: Resolve the Issue
- Correct the time drift on the affected nodes. The specific commands depend on your operating system, but most systems use a Network Time Protocol (NTP) service to synchronize clocks.
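As an illustrative sketch for a systemd-based Linux host using chrony (an assumption — substitute your distribution's NTP tooling):

```shell
$ timedatectl set-ntp true    # enable NTP synchronization via the configured daemon
$ chronyc makestep            # step the clock immediately instead of slewing slowly
$ chronyc tracking            # verify the reported offset is close to zero
```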
Verify that the Vault cluster members can communicate over port `8201`. Run these commands from different nodes.

On an unhealthy node, start a listener on port `8201`:

```shell
$ nc -l $PRIVATE_IP_OF_UNHEALTHY_NODE 8201
```

From a healthy node, attempt to connect to the unhealthy node's listener:

```shell
$ nc -vz $PRIVATE_IP_OF_UNHEALTHY_NODE 8201
```

A successful connection produces the following output:

```
Connection to $PRIVATE_IP_OF_UNHEALTHY_NODE 8201 port [tcp/*] succeeded!
```
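If `nc` is not installed on a node, bash's built-in `/dev/tcp` pseudo-device can serve as a rough substitute. This is an assumption-level fallback sketch, not part of the official procedure:

```shell
#!/usr/bin/env bash
# Sketch: TCP reachability check without nc, using bash's /dev/tcp redirection.
check_port() {
  local host=$1 port=$2
  # Open (and immediately discard) a TCP connection; time out after 3 seconds.
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# Host is a placeholder; substitute the unhealthy node's private IP.
check_port "${PRIVATE_IP_OF_UNHEALTHY_NODE:-127.0.0.1}" 8201
```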
Perform a rolling restart of the Terraform Enterprise application, starting with the healthy node.
First, stop the application on all nodes.
For Replicated deployments:

```shell
## On the healthy node
$ tfe-admin node-drain
$ replicatedctl app stop

## On the unhealthy node(s)
$ replicatedctl app stop -f
```

For Flexible Deployment Options with Docker:

```shell
$ docker compose -f /path/to/docker-compose.yaml down
```

For Flexible Deployment Options with Podman:

```shell
$ podman kube down /path/to/podman_kube.yaml
```
Start the Terraform Enterprise application, beginning with the healthy node. After starting, confirm the time is synchronized across all nodes using the `date` command.

For Replicated deployments:

```shell
$ replicatedctl app start
```
For Flexible Deployment Options with Docker:

```shell
$ docker compose -f /path/to/docker-compose.yaml up --detach
```

For Flexible Deployment Options with Podman:

```shell
$ podman play kube /path/to/podman_kube.yaml
```
- Continue the startup process for the remaining nodes.
Outcome
After resolving the time drift and restarting the application, all nodes in the Terraform Enterprise cluster should start successfully.
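To confirm recovery, you can re-run the Vault status check from Step 1 on each node (shown here for v202309-1 and newer; your container name may differ):

```shell
$ docker exec -it terraform-enterprise bash -c 'VAULT_ADDR=http://127.0.0.1:8200 vault status' | grep -E 'HA Mode|Active Node'
```

Exactly one node should report `HA Mode active`, and the standby nodes should list that node's address under `Active Node Address` instead of `<none>`.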
Additional Information
For more details on Terraform Enterprise architecture, refer to the official documentation on active-active installations.