Problem
Runs are stuck in a queued state across all workspaces in Terraform Enterprise. The tfe-task-worker logs contain errors indicating that it is unable to find the hashicorp/tfe-agent image:
{"@level":"info","@message":"{\"errorDetail\":{\"message\":\"repository docker.io/hashicorp/tfe-agent not found: does not exist or no pull access\"},\"error\":\"repository docker.io/hashicorp/tfe-agent not found: does not exist or no pull access\"}","@module":"tfe-task-worker.executor","@timestamp":"2023-07-29T00:03:14.529587Z","image":"hashicorp/tfe-agent:latest"}
The tfe-agent-setup container logs a permission denied error when attempting to access the Docker daemon socket:
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/images/json?filters=%7B%22label%22%3A%7B%22com.hashicorp.container-type%3Dtfe-agent%22%3Atrue%7D%7D": dial unix /var/run/docker.sock: connect: permission denied
"docker inspect" requires at least 1 argument.
See 'docker inspect --help'.
Usage: docker inspect [OPTIONS] NAME|ID [NAME|ID...]
Return low-level information on Docker objects
Base image not available
Prerequisites
- Terraform Enterprise on Replicated, releases v202302-1 through v202308-1
- run_pipeline_mode setting is set to agent and using the default agent image
- Docker configured with the --selinux-enabled option
- SELinux enabled and enforcing
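The two SELinux-related conditions in this list can be checked from a shell on the Terraform Enterprise host. The following is a minimal sketch, not HashiCorp tooling: the function names are illustrative, and it assumes that `docker info` reports `name=selinux` among its security options when Docker was started with --selinux-enabled. It does not check the release version or the run_pipeline_mode setting.

```shell
#!/bin/sh
# Sketch: report whether this host matches the SELinux conditions above.
# Function names are illustrative assumptions, not part of any HashiCorp tool.

# Succeeds when SELinux is in enforcing mode.
selinux_is_enforcing() {
  [ "$(getenforce 2>/dev/null)" = "Enforcing" ]
}

# Succeeds when the Docker daemon lists selinux among its security options.
# Assumption: the daemon reports this as "name=selinux".
docker_selinux_enabled() {
  docker info --format '{{.SecurityOptions}}' 2>/dev/null | grep -q 'name=selinux'
}

# Prints a one-line verdict and returns 0 only when both conditions hold.
check_affected_host() {
  if selinux_is_enforcing && docker_selinux_enabled; then
    echo "match: SELinux is enforcing and Docker has SELinux enabled"
    return 0
  fi
  echo "no match: at least one SELinux condition is not met"
  return 1
}
```

Calling check_affected_host after sourcing the file prints the verdict; a non-zero exit status means the host does not meet both conditions.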
Cause
When Terraform Enterprise starts, an ephemeral container called tfe-agent-setup builds the default image used for the tfe-agent container at run time. This involves accessing the Docker socket to list the Docker images on the system, an action that SELinux denies.
The denial is recorded in /var/log/audit/audit.log:
type=AVC msg=audit(1692033693.534:19021): avc: denied { connectto } for pid=19093 comm="docker" path="/run/docker.sock" scontext=system_u:system_r:container_t:s0:c501,c877 tcontext=system_u:system_r:container_runtime_t:s0 tclass=unix_stream_socket permissive=1
Since the tfe-agent-setup container is unable to build the hashicorp/tfe-agent:latest image, the tfe-task-worker is subsequently unable to find the image locally when attempting to start agents to perform runs, and its attempts to find the image in external registries also fail.
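To confirm that a host is hitting this denial, the audit log can be searched for AVC entries in which the docker client was refused access to the Docker socket. This is a rough sketch: the function name is illustrative, and the patterns are based only on the sample entry shown in this section.

```shell
#!/bin/sh
# Sketch: list AVC denials where the docker client was refused the Docker socket.
# The function name is an illustrative assumption.

find_docker_socket_denials() {
  # Default to the standard audit log path when no file is given.
  audit_log="${1:-/var/log/audit/audit.log}"
  grep 'avc: *denied' "$audit_log" \
    | grep 'comm="docker"' \
    | grep 'docker\.sock'
}
```

Run as root, find_docker_socket_denials prints any matching entries; no output suggests this particular denial has not been logged in the current audit log file.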
Solution
This is a bug that has been fixed in Terraform Enterprise v202309-1; upgrading to this version or later will resolve the issue. If an upgrade is not immediately feasible, an alternative is to run SELinux in permissive mode or to start Docker without the --selinux-enabled option. If neither of these options is acceptable for maintaining security posture, SELinux can be set to permissive mode temporarily before starting Terraform Enterprise and reverted to enforcing mode after startup. This allows the tfe-agent-setup container to build the hashicorp/tfe-agent:latest image once, so that it is available to the tfe-task-worker to spawn the ephemeral TFC agents that perform runs.
- Set SELinux to permissive mode:
setenforce 0
- Stop Terraform Enterprise:
replicatedctl app stop
- Wait until the application is in a stopped state, then start Terraform Enterprise:
replicatedctl app start
- Ensure the application is in a started state and that the hashicorp/tfe-agent:latest image exists in the local repository by running docker image ls.
- Set SELinux back to enforcing mode:
setenforce 1
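The steps above can be combined into a single script so that SELinux is returned to enforcing mode even if one of the intermediate steps fails. This is a minimal sketch under stated assumptions, not an official procedure: the helper names are illustrative, and it assumes that the output of `replicatedctl app status` contains a field of the form "State": "started" or "State": "stopped". It checks for the image once, right after the application reports a started state; if the image is built slightly later on a given host, that check may need to be retried.

```shell
#!/bin/sh
# Sketch of the workaround; run as root. Helper names are illustrative.

# Poll until `replicatedctl app status` reports the wanted state.
# Assumption: the output contains a field such as "State": "started".
wait_for_app_state() {
  want="$1"
  tries="${2:-60}"
  delay="${3:-10}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if replicatedctl app status | grep -Eq "\"State\": *\"${want}\""; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out waiting for state: ${want}" >&2
  return 1
}

# Succeeds when hashicorp/tfe-agent:latest exists in the local repository.
agent_image_present() {
  [ -n "$(docker image ls --quiet hashicorp/tfe-agent:latest)" ]
}

# Run the workaround steps in order; always restore enforcing mode at the end.
rebuild_agent_image() {
  rc=0
  setenforce 0 || return 1
  replicatedctl app stop \
    && wait_for_app_state stopped \
    && replicatedctl app start \
    && wait_for_app_state started \
    && agent_image_present || rc=1
  # Restore enforcing mode whether or not the steps above succeeded.
  setenforce 1 || rc=1
  return "$rc"
}
```

Calling rebuild_agent_image returns a non-zero status if any step fails or if the image is still missing afterwards, which makes it straightforward to alert on from a maintenance script.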
Note that this workaround must be performed again under the following conditions:
1. The hashicorp/tfe-agent:latest image is removed from the local repository, such as when rebuilding Terraform Enterprise nodes or through any manual action that removes it (docker rmi, docker image prune, etc.).
2. Changes are made to the CA bundle setting, or the Terraform Enterprise certificates are changed.
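The first condition can be detected by checking whether the image is still present locally, for example from a periodic job. The sketch below uses illustrative function names and covers only the missing-image case; it cannot tell whether the CA bundle or certificates have changed.

```shell
#!/bin/sh
# Sketch: detect when hashicorp/tfe-agent:latest has gone missing.
# Function names are illustrative assumptions.

# Succeeds when the image is NOT present in the local repository.
agent_image_missing() {
  ! docker image inspect hashicorp/tfe-agent:latest >/dev/null 2>&1
}

# Prints a one-line status and returns 1 when the workaround should be repeated.
report_agent_image() {
  if agent_image_missing; then
    echo "hashicorp/tfe-agent:latest is missing: repeat the workaround"
    return 1
  fi
  echo "hashicorp/tfe-agent:latest is present"
}
```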
Additional information
If you continue to experience issues, please contact HashiCorp Support.