Problem
In a Terraform Enterprise Active/Active installation, both embedded Vault nodes report standby as their High Availability (HA) mode, and the application fails to start on one of the nodes.
The Terraform Enterprise Vault container log shows a connection error.
[ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing read tcp 172.22.0.12:36792->10.4.1.1:8201: read: connection reset by peer\""
[ERROR] core: forward request error: error="error during forwarding RPC request"
Running vault status inside the tfe-vault container on an affected node shows HA Mode as standby.
$ docker exec -ti tfe-vault vault status
Key             Value
---             -----
Seal Type       shamir
Initialized     true
Sealed          false
Total Shares    1
Threshold       1
Version         1.10.3
Storage Type    postgresql
Cluster Name    vault-cluster-3dcccDca8e
Cluster ID      2410f422-1183-7444-d043-abc4n35b1
HA Enabled      true
HA Cluster      https://10.10.10.10:8201
HA Mode         standby
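To check the mode from a script rather than by eye, the HA Mode row can be pulled out of the status table. The following is a minimal sketch, assuming the default table output shown above; the helper name ha_mode is hypothetical and not part of Vault or Terraform Enterprise.

```shell
#!/bin/sh
# ha_mode: read `vault status` table output on stdin and print the HA Mode value.
# Hypothetical helper; assumes the default table layout ("HA Mode    <value>").
ha_mode() {
  awk '$1 == "HA" && $2 == "Mode" { print $3 }'
}

# Example usage on a node (commented out so the sketch has no side effects).
# The -t flag is omitted because a TTY can add carriage returns to piped output.
# docker exec tfe-vault vault status | ha_mode
```

On an affected node this would be expected to print standby; an empty result suggests the HA Mode row was not present in the output.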
Cause
On startup, the nodes in the Active/Active cluster can fail to acquire the Vault HA leader lock. When no node holds the lock, no node becomes the active leader, and both remain in standby mode.
Solution
The vault operator step-down command forces a Vault node in an HA cluster to step down from active duty. When you execute this command against a standby node, the request is forwarded to the active node. This action allows the cluster to re-elect a leader, and the nodes can successfully acquire the lock on their next attempt.
Procedure
Perform the following steps on each affected Terraform Enterprise node.
- Connect to the tfe-vault container.
  $ docker exec -it tfe-vault sh
- Run the step-down command to force a leader re-election.
  $ vault operator step-down
- Exit the container shell and gracefully drain the Terraform Enterprise node.
  $ exit
  $ tfe-admin node-drain
- Stop the application on the node.
  $ replicatedctl app stop
- Monitor the application status until it reports a stopped state.
  $ watch replicatedctl app status
- Start the application.
  $ replicatedctl app start
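The stop, monitor, and start steps can also be scripted instead of watched by hand. The following is a rough sketch, not an official procedure: wait_for_state is a hypothetical helper, and it assumes that the output of replicatedctl app status contains the word stopped once the application has stopped.

```shell
#!/bin/sh
# wait_for_state: poll a status command until its output contains a target word.
# Usage: wait_for_state WORD MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper; returns 0 when WORD is seen, 1 if the tries run out.
wait_for_state() {
  want=$1
  tries=$2
  delay=$3
  shift 3
  i=0
  while [ "$i" -lt "$tries" ]; do
    # Run the status command; a failed call simply counts as "not there yet".
    if "$@" 2>/dev/null | grep -qi "$want"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example usage after the node has been drained (commented out on purpose):
# replicatedctl app stop
# wait_for_state stopped 60 5 replicatedctl app status && replicatedctl app start
```

The try count and delay in the usage example are illustrative values; adjust them to how long the application normally takes to stop in your environment.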
Outcome Validation
After completing the procedure, check the Vault status again. The output should show HA Mode as active on one node and standby on the other.
$ docker exec -it tfe-vault vault status
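Once the HA Mode value has been read from each node, the expected outcome can be expressed as a small check. This is a sketch only; check_ha_pair is a hypothetical helper that takes the two values as arguments, and collecting them from each node is left to the operator.

```shell
#!/bin/sh
# check_ha_pair: succeed only when one node is active and the other is standby.
# Usage: check_ha_pair MODE_NODE1 MODE_NODE2
# Hypothetical helper; the arguments are the HA Mode values read from each node.
check_ha_pair() {
  case "$1:$2" in
    active:standby|standby:active)
      echo "ok: one active, one standby"
      return 0
      ;;
    *)
      echo "unexpected HA modes: $1 / $2" >&2
      return 1
      ;;
  esac
}

# Example usage with values copied from each node's `vault status` output:
# check_ha_pair active standby
```

A pair of standby values means the original problem is still present and the procedure may need to be repeated on the affected node.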
Additional Information
For more details on the Vault commands used in this guide, please refer to the official documentation: