Problem
Terraform Enterprise Active/Active secondary nodes fail to start with error: local node not active but active cluster node not found
Prerequisites
- Terraform Enterprise Active/Active deployment
Cause
- Time drift between nodes causes a startup failure as Vault is unable to create a new token and fails to discover other nodes in the cluster. Please check the Docker container logs:
# v202308-1 or older
docker logs tfe-vault
# v202309-1 or newer. NOTE: container name might differ from example
docker exec -it terraform-enterprise-tfe-1 more /var/log/terraform-enterprise/vault.log
# Error message
+ Retrying to create vault token
Error creating token: Error making API request.
URL: POST http://tfe-vault:8200/v1/auth/token/create
Code: 500. Errors:
* local node not active but active cluster node not found
- Check the output of the vault status command:
# v202308-1 or older
docker exec -it tfe-vault vault status
# v202309-1 or newer. NOTE: container name might differ from example
docker exec -it terraform-enterprise bash -c 'VAULT_ADDR=http://127.0.0.1:8200 vault status'
# Output
Key                    Value
---                    -----
Seal Type              shamir
Initialized            true
Sealed                 false
Total Shares           1
Threshold              1
Version                1.12.3
Build Date             2023-02-02T09:07:27Z
Storage Type           postgresql
Cluster Name           vault-cluster-aa3acd6e
Cluster ID             16e7772d-8e76-e9d0-4f1c-db1398b86d22
HA Enabled             true
HA Cluster             n/a
HA Mode                standby
Active Node Address    <none>
- Check the output of the date command on all nodes to confirm whether time drift has occurred.
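Comparing date output by eye is error-prone when the drift is small, so the check can be scripted. The following is a minimal sketch, not part of the original procedure: the function name clock_drift_seconds and the hostname tfe-node-2 are hypothetical, and it assumes SSH access between nodes. It compares epoch seconds (date +%s) rather than formatted timestamps.

```shell
# Sketch: print the absolute difference, in seconds, between two epoch
# timestamps as produced by `date +%s`.
# clock_drift_seconds is an illustrative name, not a Terraform Enterprise tool.
clock_drift_seconds() {
  local diff=$(( $1 - $2 ))
  # Strip a leading minus sign so the result is always non-negative.
  echo "${diff#-}"
}

# Example usage (hypothetical hostname; requires SSH access to the other node):
#   local_epoch="$(date +%s)"
#   remote_epoch="$(ssh tfe-node-2 date +%s)"
#   echo "drift: $(clock_drift_seconds "$local_epoch" "$remote_epoch")s"
```

SSH round-trip time adds a small error to the remote reading, so a difference of a second or two measured this way does not by itself indicate drift.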
Solution
- Fix the time drift. The instructions will be specific to your operating system, but most will have an implementation of the Network Time Protocol (NTP) that will allow you to synchronize your system's clocks.
- Perform a connectivity test between Vault cluster nodes on port 8201 to ensure proper communication:
# From an Unhealthy node
nc -vz $PRIVATE_IP_OF_HEALTHY_NODE 8201
# From an Unhealthy node use netcat to listen on port 8201
nc -l $PRIVATE_IP_OF_UNHEALTHY_NODE 8201
# From the remaining nodes run
nc -vz $PRIVATE_IP_OF_UNHEALTHY_NODE 8201
# Expected output
Connection to $PRIVATE_IP 8201 port [tcp/*] succeeded!
- Stop the Terraform Enterprise application on all nodes and perform a reboot. Begin with the healthy node, that is, the node where the Terraform Enterprise application starts successfully.
# For Replicated deployments
# Healthy node
tfe-admin node-drain
replicatedctl app stop
# Unhealthy node(s)
replicatedctl app stop -f
# For Flexible Deployment Options
# Docker
docker compose -f /path/to/docker-compose.yaml down
# Podman
podman kube down /path/to/podman_kube.yaml
- Start the Terraform Enterprise app. Begin with the healthy node and check the output of the date command to confirm that time is in sync across the nodes.
- Continue with the startup process for the remaining nodes.
# For Replicated deployments
replicatedctl app start
# For Flexible Deployment Options
# Docker
docker compose -f /path/to/docker-compose.yaml up --detach
# Podman
podman play kube /path/to/podman_kube.yaml
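After the restart, one way to confirm that Vault has recovered is to re-run vault status and check that it no longer shows the symptom from the Cause section (HA Mode standby with Active Node Address <none>). The following is a minimal sketch under that assumption: vault_has_active_node is a hypothetical helper name, and it only parses the text fields that appear in the sample output above.

```shell
# Sketch: read `vault status` text output on stdin and succeed (exit 0) if the
# node is active or knows the address of an active node.
# vault_has_active_node is an illustrative name, not a Vault or TFE command.
vault_has_active_node() {
  awk '
    /^HA Mode/             { mode = $NF }
    /^Active Node Address/ { addr = $NF; seen = 1 }
    END {
      # Healthy: this node is active, or it is a standby that reports an
      # active node address other than "<none>".
      if (mode == "active" || (seen && addr != "<none>")) exit 0
      exit 1
    }
  '
}

# Example usage (v202309-1 or newer; container name might differ):
#   docker exec terraform-enterprise bash -c \
#     'VAULT_ADDR=http://127.0.0.1:8200 vault status' | vault_has_active_node \
#     && echo "active node found" || echo "no active node"
```

The usage example omits the -it flags because the output is piped; allocating a TTY can add carriage returns that would interfere with the field comparison.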
Outcome
The Terraform Enterprise application starts successfully on all Active/Active nodes.