Problem
In a Terraform Enterprise active-active configuration, a secondary node fails to start. The logs show the error `local node not active but active cluster node not found`.
Prerequisites
- Terraform Enterprise configured for active-active operational mode.
- A Replicated or Flexible Deployment Options (FDO) installation using Docker or Podman.
Cause
A time drift between the nodes in the cluster can cause this startup failure. The internal Vault component is unable to create a new token and fails to discover other active nodes in the cluster if the system clocks are not synchronized.
Procedure
Follow these steps to diagnose and resolve the issue.
Step 1: Diagnose the Issue
Check the Terraform Enterprise container logs for the specific Vault token creation error. The command varies by version.
For versions v202308-1 and older:

```shell
$ docker logs tfe-vault
```

For versions v202309-1 and newer (note: your container name may differ):

```shell
$ docker exec -it terraform-enterprise-tfe-1 more /var/log/terraform-enterprise/vault.log
```
Look for the following error message in the output:
```
+ Retrying to create vault token
Error creating token: Error making API request.

URL: POST http://tfe-vault:8200/v1/auth/token/create
Code: 500. Errors:

* local node not active but active cluster node not found
```
Check the status of the internal Vault instance to confirm it is in `standby` mode.

For versions v202308-1 and older:

```shell
$ docker exec -it tfe-vault vault status
```

For versions v202309-1 and newer (note: your container name may differ):

```shell
$ docker exec -it terraform-enterprise bash -c 'VAULT_ADDR=http://127.0.0.1:8200 vault status'
```
The output should indicate that the node is in standby mode and has not found an active node.
```
...
HA Enabled              true
HA Cluster              n/a
HA Mode                 standby
Active Node Address     <none>
```
Check the system time on all Terraform Enterprise nodes to identify any drift.
```shell
$ date
```
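Running `date` by hand on every node is error-prone. The sketch below compares each peer's clock against the local node over SSH; the `PEERS` host names and passwordless SSH access are assumptions for illustration, not part of the official procedure.

```shell
#!/usr/bin/env bash
# Sketch: report clock drift between this node and each peer.
# PEERS and passwordless SSH access are assumptions; substitute your node names.
PEERS="${PEERS:-tfe-node-2 tfe-node-3}"

drift_seconds() {
  # Absolute difference between two epoch timestamps.
  local d=$(( $1 - $2 ))
  (( d < 0 )) && d=$(( -d ))
  echo "$d"
}

for peer in $PEERS; do
  # BatchMode and a short timeout keep the loop from hanging on unreachable peers.
  remote=$(ssh -o BatchMode=yes -o ConnectTimeout=3 "$peer" date +%s) || continue
  echo "$peer drift: $(drift_seconds "$(date +%s)" "$remote")s"
done
```

Any reported drift beyond a few seconds is worth correcting before proceeding to Step 2.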
Step 2: Resolve the Issue
- Correct the time drift on the affected nodes. The specific commands depend on your operating system, but most systems use a Network Time Protocol (NTP) service to synchronize clocks.
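As an illustrative sketch for a systemd-based Linux host using chrony (an assumption — substitute your distribution's NTP tooling):

```shell
$ timedatectl set-ntp true    # enable NTP synchronization via the configured daemon
$ chronyc makestep            # step the clock immediately instead of slewing slowly
$ chronyc tracking            # verify the reported offset is close to zero
```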
Verify that the Vault cluster members can communicate over port `8201`. Run these commands from different nodes.

On an unhealthy node, start a listener on port `8201`:

```shell
$ nc -l $PRIVATE_IP_OF_UNHEALTHY_NODE 8201
```

From a healthy node, attempt to connect to the unhealthy node's listener:

```shell
$ nc -vz $PRIVATE_IP_OF_UNHEALTHY_NODE 8201
```

A successful connection produces the following output:

```
Connection to $PRIVATE_IP_OF_UNHEALTHY_NODE 8201 port [tcp/*] succeeded!
```
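If `nc` is not installed on a node, bash's built-in `/dev/tcp` pseudo-device can serve as a rough substitute. This is an assumption-level fallback sketch, not part of the official procedure:

```shell
#!/usr/bin/env bash
# Sketch: TCP reachability check without nc, using bash's /dev/tcp redirection.
check_port() {
  local host=$1 port=$2
  # Open (and immediately discard) a TCP connection; time out after 3 seconds.
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# Host is a placeholder; substitute the unhealthy node's private IP.
check_port "${PRIVATE_IP_OF_UNHEALTHY_NODE:-127.0.0.1}" 8201
```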
Perform a rolling restart of the Terraform Enterprise application, starting with the healthy node.
First, stop the application on all nodes.
For Replicated deployments:

```shell
## On the healthy node
$ tfe-admin node-drain
$ replicatedctl app stop

## On the unhealthy node(s)
$ replicatedctl app stop -f
```

For Flexible Deployment Options with Docker:

```shell
$ docker compose -f /path/to/docker-compose.yaml down
```

For Flexible Deployment Options with Podman:

```shell
$ podman kube down /path/to/podman_kube.yaml
```
Start the Terraform Enterprise application, beginning with the healthy node. After starting, confirm the time is synchronized across all nodes using the `date` command.

For Replicated deployments:

```shell
$ replicatedctl app start
```
For Flexible Deployment Options with Docker:

```shell
$ docker compose -f /path/to/docker-compose.yaml up --detach
```

For Flexible Deployment Options with Podman:

```shell
$ podman play kube /path/to/podman_kube.yaml
```
- Continue the startup process for the remaining nodes.
Outcome
After resolving the time drift and restarting the application, all nodes in the Terraform Enterprise cluster should start successfully.
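To confirm recovery, you can re-run the Vault status check from Step 1 on each node (shown here for v202309-1 and newer; your container name may differ):

```shell
$ docker exec -it terraform-enterprise bash -c 'VAULT_ADDR=http://127.0.0.1:8200 vault status' | grep -E 'HA Mode|Active Node'
```

Exactly one node should report `HA Mode active`, and the standby nodes should list that node's address under `Active Node Address` instead of `<none>`.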
Additional Information
For more details on Terraform Enterprise architecture, refer to the official documentation on active-active installations.