Problem
After a system restart or maintenance, Terraform runs are stuck across all workspaces in remote execution mode. No new agent containers are being created for the new Terraform jobs. This issue may occur across all Terraform Enterprise Flexible Deployment Options.
Cause
When Terraform Enterprise is restarted without allowing active jobs to complete or terminate gracefully, the agent containers running those jobs may not shut down properly. Such a container remains active but is no longer managed by Terraform Enterprise. It becomes orphaned and keeps its container name, which causes name conflicts when new agent containers are created in remote execution mode.
The error messages can be found in /var/log/terraform-enterprise/task-worker.log. They show that the system cannot create new containers because of name conflicts with existing containers that were not properly terminated.
err: create container: Error response from daemon: Conflict. The container name tfe-agent-xxxx is already in use by container xxx
You have to remove (or rename) that container to be able to reuse that name.
error occurred: Init error removing container ": Error response from daemon: page not found
error executing task.
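To see which container names are involved, the conflict lines can be filtered out of the log. The following is a minimal sketch, not an official procedure: the function name conflicting_names and the variable TASK_WORKER_LOG are hypothetical, and the pattern assumes the message wording shown in the excerpt above. It also strips quotes and a leading slash in case the raw log prints the name that way.

```shell
#!/usr/bin/env bash
# Sketch: print the unique container names reported in
# "already in use" conflict errors. Reads log text on stdin.
# conflicting_names is a hypothetical helper, not part of Terraform Enterprise.
conflicting_names() {
  grep -o 'The container name [^ ]* is already in use' \
    | awk '{print $4}' \
    | tr -d '"' \
    | sed 's#^/##' \
    | sort -u
}

# Usage (TASK_WORKER_LOG is an assumed variable holding the path to task-worker.log):
#   conflicting_names < "$TASK_WORKER_LOG"
```

Each name printed corresponds to an existing container that blocks creation of a new agent container with the same name.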
Solutions
To resolve the issue, take the following steps:
- Cancel all jobs that are not progressing from the web UI.
- Terminate all orphaned agent containers that might be causing naming conflicts with the following command:
docker rm -f <container_id>
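When several containers are orphaned, looking up each ID by hand is tedious. The sketch below lists and removes containers by name filter. It assumes the orphaned containers share the tfe-agent name prefix seen in the log excerpt and that the Docker CLI is available on the host. The function names are hypothetical. Review the list before removing anything, because the filter also matches agent containers that are still running healthy jobs.

```shell
#!/usr/bin/env bash
# Sketch: list and remove agent containers by name prefix.
# Assumption: orphaned containers are named with the prefix "tfe-agent",
# as in the log excerpt; pass another prefix as $1 if yours differs.

# Show ID, name, and status of every matching container (running or stopped).
list_agent_containers() {
  local prefix="${1:-tfe-agent}"
  docker ps -a --filter "name=${prefix}" --format '{{.ID}} {{.Names}} {{.Status}}'
}

# Force-remove every matching container, one ID at a time.
remove_agent_containers() {
  local prefix="${1:-tfe-agent}"
  local id
  for id in $(docker ps -aq --filter "name=${prefix}"); do
    docker rm -f "$id"
  done
}

# Usage:
#   list_agent_containers      # inspect first
#   remove_agent_containers    # then remove
```

Running the listing step first makes it possible to confirm that only stale containers will be removed.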
Outcome
- Test by launching new plan jobs from different workspaces in remote execution mode.
- A new agent container should be created. You can verify this from the command line:
docker ps
- Terraform plans should be processed without any errors.
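Because an agent container only exists while its job runs, it can be easy to miss with a single check. As a rough sketch, the check can be repeated until a container appears or a timeout expires. The function name wait_for_agent, the tfe-agent prefix, and the default 60-second timeout are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: poll until a running container matching the name prefix appears.
# $1 = name prefix (assumed default "tfe-agent"), $2 = timeout in seconds.
wait_for_agent() {
  local prefix="${1:-tfe-agent}"
  local timeout="${2:-60}"
  local waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if [ -n "$(docker ps -q --filter "name=${prefix}")" ]; then
      # Print the name and status of each running match, then succeed.
      docker ps --filter "name=${prefix}" --format '{{.Names}} {{.Status}}'
      return 0
    fi
    sleep 2
    waited=$((waited + 2))
  done
  echo "no ${prefix} container appeared within ${timeout}s" >&2
  return 1
}

# Usage: queue a plan in the web UI, then run
#   wait_for_agent tfe-agent 120
```

A zero exit status indicates that at least one agent container was running during the polling window.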