Problem
Runs are stuck in queued state across all workspaces in Terraform Enterprise. The tfe-task-worker logs contain errors indicating it is unable to find the hashicorp/tfe-agent image:
{"@level":"info","@message":"{\"errorDetail\":{\"message\":\"repository docker.io/hashicorp/tfe-agent not found: does not exist or no pull access\"},\"error\":\"repository docker.io/hashicorp/tfe-agent not found: does not exist or no pull access\"}","@module":"tfe-task-worker.executor","@timestamp":"2023-07-29T00:03:14.529587Z","image":"hashicorp/tfe-agent:latest"}
The tfe-agent-setup container logs a permission denied error when attempting to access the Docker daemon socket:
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/images/json?filters=%7B%22label%22%3A%7B%22com.hashicorp.container-type%3Dtfe-agent%22%3Atrue%7D%7D": dial unix /var/run/docker.sock: connect: permission denied
"docker inspect" requires at least 1 argument.
See 'docker inspect --help'.
Usage: docker inspect [OPTIONS] NAME|ID [NAME|ID...]
Return low-level information on Docker objects
Base image not available
Prerequisites
- Terraform Enterprise on Replicated releases v202302-1 through v202308-1
- run_pipeline_mode setting is set to agent and using the default agent image
- Docker configured with the --selinux-enabled option
- SELinux enabled and enforcing
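The two SELinux-related prerequisites can be checked from a shell on the Terraform Enterprise host. The following is a minimal sketch, not an official diagnostic: the function name selinux_prereqs_met is hypothetical, and it assumes that getenforce prints "Enforcing" in enforcing mode and that docker info lists name=selinux among its security options when the daemon runs with --selinux-enabled.

```shell
# Hypothetical helper: succeed only if SELinux is enforcing AND the Docker
# daemon reports SELinux support.
# Assumption: `docker info` includes "name=selinux" in .SecurityOptions
# when the daemon was started with --selinux-enabled.
selinux_prereqs_met() {
  mode="$(getenforce)"
  opts="$(docker info --format '{{json .SecurityOptions}}')"

  # SELinux must be in enforcing mode for this issue to apply.
  [ "$mode" = "Enforcing" ] || return 1

  # Docker must have been started with SELinux support.
  case "$opts" in
    *name=selinux*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Running selinux_prereqs_met && echo "affected configuration" as root reports whether the host matches these two conditions; the release and run_pipeline_mode conditions still need to be confirmed separately.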
Cause
When Terraform Enterprise starts, an ephemeral container called tfe-agent-setup builds the default image used for the tfe-agent container at run time. This involves accessing the Docker socket to list the Docker images on the system, an action that SELinux denies.
The denial is recorded in /var/log/audit/audit.log:
type=AVC msg=audit(1692033693.534:19021): avc: denied { connectto } for pid=19093 comm="docker" path="/run/docker.sock" scontext=system_u:system_r:container_t:s0:c501,c877 tcontext=system_u:system_r:container_runtime_t:s0 tclass=unix_stream_socket permissive=1
Because the tfe-agent-setup container is unable to build the hashicorp/tfe-agent:latest image, the tfe-task-worker cannot find the image locally when it attempts to start agents to perform runs, and it then tries, unsuccessfully, to pull the image from external registries.
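To confirm that this is the cause on a given host, the audit log can be searched for denials involving the Docker socket. This is a rough sketch: the function name find_docker_sock_denials is hypothetical, the default path is the audit log location quoted above, and reading that file normally requires root.

```shell
# Hypothetical helper: print AVC denial records that mention the Docker socket.
# Usage: find_docker_sock_denials [path-to-audit-log]
find_docker_sock_denials() {
  log_file="${1:-/var/log/audit/audit.log}"

  # "avc: *denied" tolerates one or more spaces after "avc:",
  # since the spacing can differ between raw and copied log lines.
  grep 'avc: *denied' "$log_file" | grep 'docker\.sock'
}
```

A matching record with comm="docker" and tclass=unix_stream_socket, like the one shown above, points to the tfe-agent-setup container being blocked.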
Solution
This is a bug that is fixed in Terraform Enterprise v202309-1; upgrading to that release or later resolves the issue. If an upgrade is not immediately feasible, an alternative is to run SELinux in permissive mode or to start Docker without the --selinux-enabled option. If neither option is acceptable for your security posture, SELinux can be set to permissive mode temporarily before starting Terraform Enterprise and returned to enforcing mode once startup completes. This allows the tfe-agent-setup container to build the hashicorp/tfe-agent:latest image once, so that the tfe-task-worker can use it to spawn the ephemeral TFC agents that perform runs.
- Set SELinux to permissive mode:
setenforce 0
- Stop Terraform Enterprise:
replicatedctl app stop
- Wait until the application is in a stopped state, then start Terraform Enterprise:
replicatedctl app start
- Ensure the application is in a started state and the hashicorp/tfe-agent:latest image exists in the local repository by running docker image ls.
- Set SELinux back to enforcing mode:
setenforce 1
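The last two steps can be combined so that enforcing mode is restored only after the image has actually been built. The following is a sketch under stated assumptions: both function names are hypothetical, and the check relies on the standard docker image ls --format template fields .Repository and .Tag.

```shell
# Hypothetical helper: succeed if hashicorp/tfe-agent:latest exists locally.
agent_image_present() {
  images="$(docker image ls --format '{{.Repository}}:{{.Tag}}')"
  # -x requires a whole-line match, so similarly named tags are not counted.
  printf '%s\n' "$images" | grep -x 'hashicorp/tfe-agent:latest' >/dev/null
}

# Hypothetical helper: return SELinux to enforcing mode only when the
# agent image is present; otherwise leave it permissive and report why.
restore_enforcing_if_ready() {
  if agent_image_present; then
    setenforce 1
  else
    echo "hashicorp/tfe-agent:latest not found; leaving SELinux permissive" >&2
    return 1
  fi
}
```

If restore_enforcing_if_ready fails, the application has likely not finished starting; rerun it once Terraform Enterprise reports a started state.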
Note that this workaround would need to be performed again under the following conditions:
1. The hashicorp/tfe-agent:latest image is removed from the local repository, such as when rebuilding Terraform Enterprise nodes or through any manual actions which would remove it (docker rmi, docker image prune, etc.).
2. Changes are made to the CA bundle setting or the Terraform Enterprise certificates are changed.
Additional information
If you continue to experience issues, please contact HashiCorp Support.