Problem
In a Terraform Enterprise Active/Active deployment, the internal Vault cluster reports an HA Mode of standby on both nodes, and the application fails to start on one node. The Terraform Enterprise Vault container (tfe-vault) logs the following errors, and vault status reports standby:
2023-04-24T22:32:33.684748000Z 2023-04-24T22:32:33.684Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing read tcp 172.22.0.12:36792->10.4.1.1:8201: read: connection reset by peer\""
2023-04-24T22:32:33.685001000Z 2023-04-24T22:32:33.684Z [ERROR] core: forward request error: error="error during forwarding RPC request"
$ docker exec -ti tfe-vault vault status
Key Value
--- -----
Seal Type shamir
Initialized true
Sealed false
Total Shares 1
Threshold 1
Version 1.10.3
Storage Type postgresql
Cluster Name vault-cluster-3dcccDca8e
Cluster ID 2410f422-1183-7444-d043-abc4n35b1
HA Enabled true
HA Cluster https://10.10.10.10:8201
HA Mode standby
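To compare nodes quickly, the HA Mode field can be extracted from the status output. The following is a minimal sketch, assuming the plain-text table layout shown above; the function name ha_mode is hypothetical and is not part of Vault or Terraform Enterprise.

```shell
# ha_mode: read `vault status` table output on stdin and print the HA Mode value.
# Hypothetical helper; assumes the "HA Mode <value>" row format shown above.
ha_mode() {
  awk '$1 == "HA" && $2 == "Mode" { print $3; exit }'
}

# Example usage on a node (requires the tfe-vault container to be running):
#   docker exec tfe-vault vault status | ha_mode
```

If this prints standby on every node, the cluster is in the state described in this article.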
Cause
On occasion, the nodes in the Active/Active group fail to acquire the Vault HA leader lock at startup, so no node becomes active. To resolve this, run the command vault operator step-down; the Terraform Enterprise nodes should then acquire the lock on their next attempt.
Solution
Running vault operator step-down forces the Vault node within an HA cluster to step down from active duty. When executed against a non-active node (a standby or performance standby node), the request is forwarded to the active node.
##### Perform the same steps on all nodes ######
# Connect to tfe-vault container
docker exec -it tfe-vault sh
# Run the step-down command inside the container
vault operator step-down
# Shut down the TFE node gracefully.
tfe-admin node-drain
# Stop the application on both nodes
replicatedctl app stop
# Monitor the status
watch replicatedctl app status
# Start the application
replicatedctl app start
# Check Vault status
docker exec -it tfe-vault vault status
Outcome
Running vault status should show HA Mode as active on one node and standby on the other.
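The expected outcome can be checked with a small script once the HA Mode value has been read from each node. This is only a sketch: check_ha_modes is a hypothetical helper, and it assumes the values are passed in exactly as vault status prints them (active or standby).

```shell
# check_ha_modes: succeed only if exactly one of the given HA Mode values is "active".
# Hypothetical helper; pass one value per node, as read from `vault status`.
check_ha_modes() {
  active=0
  for mode in "$@"; do
    if [ "$mode" = "active" ]; then
      active=$((active + 1))
    fi
  done
  if [ "$active" -eq 1 ]; then
    echo "OK: one active node"
  else
    echo "PROBLEM: $active active nodes" >&2
    return 1
  fi
}

# Example: check_ha_modes active standby
```

A non-zero return with both values set to standby indicates the nodes are still in the state described under Problem.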