Introduction
When starting Terraform Enterprise, the application may fail to start and report an error starting the container ptfe_base_startup (in Terraform Enterprise 202205-1 and later, this container is called tfe-base-startup).
Problem
When the ptfe_base_startup container is launched, it waits for a period of time for dependent services to become available before continuing. One of its tasks is to retrieve a token created by the Vault service to facilitate encryption of sensitive values. If the Vault service is unable to decrypt its own unseal keys, or doesn't complete its startup within the timeout period, ptfe_base_startup will report an error and fail the startup process.
If enabled, the installation dashboard will report that the container failed.
Alternatively, replicatedctl app status will report:
[
  {
    "AppID": "cf2420c1fb6c43957c238b0bec5255e0",
    "Sequence": 576,
    "PatchSequence": 0,
    "State": "stopped",
    "DesiredState": "started",
    "Error": "Container ptfe_base_startup failed: Container 72b4ef0621d5acd05a325d0e00f88f335af925a4eefc07ed7e1ca2ab85f425ff exited with non-zero exit status 1: ",
    "IsCancellable": false,
    "IsTransitioning": false,
    "LastModifiedAt": "2021-10-22T04:26:14.687941937Z"
  }
]
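This check can also be scripted. The following is a minimal sketch in Python, assuming the JSON array above has already been captured as a string. The function name find_startup_failure is hypothetical, and only the State, DesiredState, and Error fields seen in the sample output are used.

```python
import json

# Container names taken from this article: the older name and the 202205-1+ name.
STARTUP_CONTAINERS = ("ptfe_base_startup", "tfe-base-startup")


def find_startup_failure(status_json: str):
    """Return the Error string if the startup container failed, else None.

    Hypothetical helper: expects the JSON array printed by the status command.
    """
    for app in json.loads(status_json):
        error = app.get("Error", "")
        # The application wants to be started but is currently stopped.
        stopped_unexpectedly = (
            app.get("State") == "stopped" and app.get("DesiredState") == "started"
        )
        if stopped_unexpectedly and any(name in error for name in STARTUP_CONTAINERS):
            return error
    return None
```

A None result only means this particular failure was not reported; it does not confirm that the application is healthy.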
Check the logs for the ptfe_base_startup container with docker logs ptfe_base_startup (or, for TFE 202205-1 or later, docker logs tfe-base-startup) to confirm that a timeout has occurred, for example:
INFO: Vault token retrieval timeout not yet reached
INFO: Vault token retrieval timeout not yet reached
INFO: Vault token retrieval timeout not yet reached
INFO: Vault token retrieval timeout not yet reached
INFO: Vault token retrieval timeout not yet reached
INFO: Vault token retrieval timeout not yet reached
INFO: Vault token retrieval timeout not yet reached
INFO: Vault token retrieval timeout not yet reached
ERROR: Operation timed out waiting for vault token
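When the log is long, the same confirmation can be done programmatically. This is a small sketch, assuming the container log has been saved to a string. The helper name summarize_startup_log is hypothetical, and the two marker strings are copied from the example output above.

```python
# Marker strings copied from the example log output in this article.
WAITING_MARKER = "INFO: Vault token retrieval timeout not yet reached"
TIMEOUT_MARKER = "ERROR: Operation timed out waiting for vault token"


def summarize_startup_log(log_text: str) -> dict:
    """Count the waiting messages and report whether the timeout error appears."""
    lines = log_text.splitlines()
    return {
        "waiting_messages": sum(1 for line in lines if WAITING_MARKER in line),
        "timed_out": any(TIMEOUT_MARKER in line for line in lines),
    }
```

A result with timed_out set to True corresponds to the failure described here; the next step is still to inspect the Vault container logs for the cause.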
Check the logs for the internal Vault service, ptfe_vault, with docker logs ptfe_vault (or, for TFE 202205-1 or later, docker logs tfe-vault) to determine the cause of the issue.
Cause 1
Failures during the unseal process are reported with an error message similar to:
get unseal: could not decrypt unseal key: crypto: could not decrypt ciphertext: chacha20poly1305: message authentication failed
This indicates that the value present in the configuration for the enc_password attribute isn't the one that was used when the instance was initially installed.
Solution 1
The enc_password value must be restored to the value that was used when the instance was initially installed. Even if the instance has been re-created, the unseal keys stored in the database can only be retrieved with the correct value.
Cause 2
Failures to retrieve the token required by ptfe_base_startup will not log an error in the ptfe_vault container logs, so we need to compare the timestamps of the operations in both logs.
ptfe_vault output:
...
2022-01-28T01:39:54.970Z [INFO] identity: groups restored
2022-01-28T01:39:54.985Z [INFO] expiration: lease restore complete
2022-01-28T01:39:54.991Z [INFO] core: usage gauge collection is disabled
2022-01-28T01:39:54.995Z [INFO] core: post-unseal setup complete
+ Retrying to create vault token
+ Successfully created vault token
ptfe_base_startup output:
...
2022/01/28 01:36:47 execing command; /usr/bin/wait-for-token [-- true]
INFO: Vault token retrieval timeout not yet reached
...
INFO: Vault token retrieval timeout not yet reached
ERROR: Operation timed out waiting for vault token
ptfe_base_startup will wait up to 60 seconds for the required token to become available. If we compare when ptfe_base_startup began waiting, 01:36:47, to when ptfe_vault finished, 01:39:54, we can see that the token was not available by 01:37:47 (60 seconds after the wait began), so ptfe_base_startup timed out.
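The timestamp comparison can be worked through in code. The sketch below assumes both logs use the same time zone (the Vault timestamps end in Z, so UTC is assumed for both) and uses the 60 second wait stated above. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Wait period stated in this article.
TOKEN_TIMEOUT = timedelta(seconds=60)


def parse_startup_time(stamp: str) -> datetime:
    """Parse a startup-container timestamp such as '2022/01/28 01:36:47'.

    The log line carries no zone information; UTC is an assumption.
    """
    return datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S").replace(tzinfo=timezone.utc)


def parse_vault_time(stamp: str) -> datetime:
    """Parse a Vault log timestamp such as '2022-01-28T01:39:54.995Z'."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


def token_wait_exceeded(wait_started: str, vault_ready: str) -> bool:
    """Return True if Vault finished after the startup container's deadline."""
    deadline = parse_startup_time(wait_started) + TOKEN_TIMEOUT
    return parse_vault_time(vault_ready) > deadline
```

With the values from the example logs, token_wait_exceeded("2022/01/28 01:36:47", "2022-01-28T01:39:54.995Z") returns True: Vault completed its post-unseal setup a little over three minutes after the wait began, well outside the 60 second window.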
Solution 2
Insufficient disk throughput can cause excessive I/O contention while the application starts up, resulting in timeouts.
The resolution depends on your operating environment and the types of disk available. It is recommended to use SSD-based storage and to monitor IOPS and disk throughput while the application is starting up. Burstable throughput may not always be available, which can cause this issue to occur intermittently.
For more details please refer to: Capacity and Performance: Disk I/O
Cause 3
There is a known issue where Dynatrace OneAgent prevents Terraform Enterprise from starting up successfully. Dynatrace scans files related to TFE's startup process, which causes the files to be in a busy state when certain functions of the startup workflow are executed.
Solution 3
Disable or uninstall Dynatrace OneAgent.