Problem
All Terraform Enterprise runs are stuck in a queued status and do not proceed.
Prerequisites
- Terraform Enterprise version v202302-1 or newer using the agent-based run pipeline.
- The agents functionality is enabled in the admin settings at https://$YOUR_TFE_FQDN/app/admin/settings.
- The Terraform Enterprise application is configured to use the default agent image. You can confirm this with the following steps.
For Replicated deployments, this command returns an empty value.
$ replicatedctl app-config export --template '{{.custom_agent_image_tag.Value}}'
For Flexible Deployment Options, the TFE_RUN_PIPELINE_IMAGE environment variable is unset. For more information, refer to the documentation for TFE_RUN_PIPELINE_IMAGE.
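For Flexible Deployment Options, the check can also be scripted. The following is a minimal sketch, assuming it runs in a shell that shares the environment of the Terraform Enterprise container (for example, inside the container itself). The function name check_default_agent_image is illustrative and is not part of any HashiCorp tooling.

```shell
# Illustrative helper (hypothetical name): reports whether
# TFE_RUN_PIPELINE_IMAGE is unset or empty in the current environment.
# Assumption: this shell sees the same environment as the TFE container.
check_default_agent_image() {
  if [ -z "${TFE_RUN_PIPELINE_IMAGE:-}" ]; then
    echo "default agent image in use"
  else
    echo "custom agent image configured: ${TFE_RUN_PIPELINE_IMAGE}"
    return 1
  fi
}
```

The function returns a non-zero status when a custom image is configured, so it can gate the rest of a script.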
Cause
The hashicorp/tfe-agent:latest image has been overwritten or is corrupt. The tfe-task-worker container logs or the log file at /var/log/terraform-enterprise/task-worker.log show the following error.
2023-06-01T13:28:07.327116291Z {"@level":"info","@message":"Error in configuration: unrecognized environment variables found:","@module":"tfe-task-worker.executor.task-output","@timestamp":"2023-06-01T13:28:07.325925Z","id":"b2aca9f9-5853-4ba3-9505-c1d5f348e241","name":"agent-run","stream":"stderr"}
Solutions
This issue can be resolved by removing the corrupt local agent image and allowing Terraform Enterprise to pull a fresh copy. The procedure varies based on your deployment type.
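Before applying either solution, it can help to confirm that the error described under Cause is actually present in the log. The following is a minimal sketch, assuming the log file path given in the Cause section. The function name has_agent_config_error is illustrative.

```shell
# Illustrative helper (hypothetical name): succeeds when the task-worker log
# contains the configuration error described under Cause.
# The default path is the log file named in this article; pass another path
# as the first argument if your logs are stored elsewhere.
has_agent_config_error() {
  log_file="${1:-/var/log/terraform-enterprise/task-worker.log}"
  # -F matches the message as a fixed string, -q suppresses output.
  grep -F -q 'Error in configuration: unrecognized environment variables found' "$log_file"
}
```

The function prints nothing and reports the result through its exit status, so it can be used directly in an if statement.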
Solution 1: For Replicated Deployments
Connect to the Terraform Enterprise instance over SSH and execute the following commands to restart the application and refresh the agent image.
- Stop the Terraform Enterprise application.
$ replicatedctl app stop
- Confirm the application is fully stopped before proceeding.
$ replicatedctl app status
- Delete the local tfe-agent image, including all associated tags.
$ docker rmi $(docker images --filter reference='hashicorp/tfe-agent' --quiet=true)
- Start the Terraform Enterprise application. This will automatically pull a new agent image.
$ replicatedctl app start
- Verify that the new image has been created.
$ docker images | grep hashicorp/tfe-agent
## The output should be similar to the following:
## hashicorp/tfe-agent latest f78028f6be16 7 minutes ago 387MB
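The step that confirms the application has stopped can be scripted as a polling loop. The following is a sketch only: wait_until and app_is_stopped are hypothetical helper names, and the pattern matched by app_is_stopped assumes that the status output contains a State field reading "stopped". Check that assumption against the actual output on your instance before relying on it.

```shell
# Hypothetical helper: retries a command until it succeeds or the attempt
# limit is reached. Usage: wait_until <attempts> <delay_seconds> <command...>
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Hypothetical predicate. ASSUMPTION: the output of `replicatedctl app status`
# includes a field of the form "State": "stopped" once the app has stopped.
app_is_stopped() {
  replicatedctl app status | grep -q '"State": *"stopped"'
}
```

For example, wait_until 30 10 app_is_stopped checks up to 30 times at 10-second intervals and returns a non-zero status if the application never reports as stopped.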
Solution 2: For Flexible Deployment Options (FDO)
Connect to the Terraform Enterprise instance over SSH and follow the commands for your specific container runtime.
Docker
- Stop and remove the running containers.
$ docker compose -f /path/to/docker-compose.yaml down
- Delete the local tfe-agent image.
$ docker rmi $(docker images --filter reference='hashicorp/tfe-agent' --quiet=true)
- Start the application. This will pull a new agent image.
$ docker compose -f /path/to/docker-compose.yaml up -d
Podman
- Stop and remove the running pods.
$ podman kube down /path/to/podman.yaml
- Delete the local tfe-agent image.
$ podman rmi $(podman images --filter reference='hashicorp/tfe-agent' --quiet=true)
- Start the application. This will pull a new agent image.
$ podman kube play /path/to/podman.yaml
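The image removal step in the Docker and Podman procedures passes the result of a command substitution straight to rmi. That fails with an argument error when no matching image exists, and the list may contain the same ID more than once when several tags point at one image. The following is a minimal sketch of a guarded variant. The function name remove_agent_images is hypothetical, and the runtime (docker or podman) is passed as a parameter.

```shell
# Hypothetical helper: removes all local hashicorp/tfe-agent images using the
# given container runtime (defaults to docker). It skips the removal when no
# image matches and de-duplicates IDs shared by several tags.
remove_agent_images() {
  runtime="${1:-docker}"
  ids="$("$runtime" images --filter reference='hashicorp/tfe-agent' --quiet | sort -u)"
  if [ -z "$ids" ]; then
    echo "no hashicorp/tfe-agent images found"
    return 0
  fi
  # Intentionally unquoted so that each ID becomes a separate argument.
  "$runtime" rmi $ids
}
```

Call it as remove_agent_images docker or remove_agent_images podman after the application has been stopped.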
Kubernetes
- Uninstall the Terraform Enterprise Helm release.
$ helm uninstall terraform-enterprise -n <TFE_NAMESPACE>
- On each node, remove the cached tfe-agent image using crictl.
$ crictl rmi hashicorp/tfe-agent:latest
- Reinstall the Terraform Enterprise Helm release.
$ helm install terraform-enterprise hashicorp/terraform-enterprise -n <TFE_NAMESPACE> --values <OVERRIDES_FILE>
Outcome
After the corrupt image is replaced and the application is restarted, Terraform Enterprise runs will proceed from the queue as expected.
Additional Information
- For a related issue, please see the article TFE: The runs are stuck in Plan Queued.