Problem
In Terraform Enterprise, runs are stuck in a queued state across all workspaces.
The tfe-task-worker container logs indicate it is unable to find the hashicorp/tfe-agent image.
{
"@level": "info",
"@message": "{\"errorDetail\":{\"message\":\"repository docker.io/hashicorp/tfe-agent not found: does not exist or no pull access\"},\"error\":\"repository docker.io/hashicorp/tfe-agent not found: does not exist or no pull access\"}",
"@module": "tfe-task-worker.executor",
"@timestamp": "2023-07-29T00:03:14.529587Z",
"image": "hashicorp/tfe-agent:latest"
}
The tfe-agent-setup container logs show a permission denied error when attempting to access the Docker daemon socket.
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/images/json?filters=%7B%22label%22%3A%7B%22com.hashicorp.container-type%3Dtfe-agent%22%3Atrue%7D%7D": dial unix /var/run/docker.sock: connect: permission denied
"docker inspect" requires at least 1 argument.
See 'docker inspect --help'.
Usage: docker inspect [OPTIONS] NAME|ID [NAME|ID...]
Return low-level information on Docker objects
Base image not available
Prerequisites
- Terraform Enterprise on Replicated, releases v202302-1 through v202308-1.
- The run_pipeline_mode setting is set to agent and uses the default agent image.
- Docker is configured with the --selinux-enabled option.
- SELinux is enabled and in enforcing mode.
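The two SELinux-related prerequisites can be checked from a shell on the host. The following is a minimal sketch, not an official diagnostic: the function name selinux_prereqs_met is hypothetical, and it assumes that the Docker daemon lists name=selinux among its security options when started with --selinux-enabled.

```shell
# Hypothetical helper: reports whether SELinux is enforcing and whether
# Docker reports SELinux support. Assumes getenforce and docker are on PATH.
selinux_prereqs_met() {
  local mode docker_opts
  mode="$(getenforce 2>/dev/null)"
  docker_opts="$(docker info --format '{{.SecurityOptions}}' 2>/dev/null)"

  if [ "$mode" != "Enforcing" ]; then
    echo "SELinux mode is '${mode:-unknown}', not Enforcing"
    return 1
  fi

  # Assumption: an SELinux-enabled daemon includes "name=selinux" here.
  case "$docker_opts" in
    *name=selinux*) ;;
    *) echo "Docker does not report SELinux support"; return 1 ;;
  esac

  echo "SELinux is enforcing and Docker reports SELinux support"
}
```

If selinux_prereqs_met returns success on an affected release, the host matches the SELinux conditions described above. The run_pipeline_mode setting still needs to be confirmed separately.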
Cause
When Terraform Enterprise starts, an ephemeral container named tfe-agent-setup builds the default image for the tfe-agent container. This process requires accessing the Docker socket at /var/run/docker.sock to list Docker images. SELinux denies this action, as shown in the audit log.
type=AVC msg=audit(1692033693.534:19021): avc: denied { connectto } for pid=19093 comm="docker" path="/run/docker.sock" scontext=system_u:system_r:container_t:s0:c501,c877 tcontext=system_u:system_r:container_runtime_t:s0 tclass=unix_stream_socket permissive=1
Because the tfe-agent-setup container fails to build the hashicorp/tfe-agent:latest image, the tfe-task-worker cannot find the image locally when attempting to start agents for runs. It then unsuccessfully tries to pull the image from external registries, causing runs to remain queued.
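To confirm that a host is hitting this denial, the audit log can be searched for AVC records that mention a connectto denial on the Docker socket. The following is a rough sketch under stated assumptions: the function name docker_sock_denials is hypothetical, and the default path /var/log/audit/audit.log is the usual auditd location, which may differ on your system.

```shell
# Hypothetical helper: prints AVC denials where a process was refused a
# connection to the Docker socket. Pass a log path to override the default.
docker_sock_denials() {
  local audit_log="${1:-/var/log/audit/audit.log}"
  # Separate filters avoid depending on the exact spacing inside the record.
  grep 'type=AVC' "$audit_log" \
    | grep 'denied' \
    | grep 'connectto' \
    | grep 'docker.sock'
}
```

Run as root, docker_sock_denials prints matching records and returns a non-zero status when none are found.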
Solutions
Solution 1: Upgrade Terraform Enterprise
This issue is resolved in Terraform Enterprise version v202309-1 and later. Upgrading to the latest version is the recommended permanent solution.
Solution 2: Apply a Temporary Workaround
If an immediate upgrade is not feasible, you can apply a workaround by temporarily setting SELinux to permissive mode during the Terraform Enterprise startup process. This allows the tfe-agent-setup container to build the agent image successfully.
Follow these steps to apply the workaround.
1. Set SELinux to permissive mode.
   # setenforce 0
2. Stop the Terraform Enterprise application.
   # replicatedctl app stop
3. Wait for the application to reach a stopped state, then start it again.
   # replicatedctl app start
4. Ensure the application is in a started state and verify that the hashicorp/tfe-agent:latest image exists in the local repository.
   # docker image ls
5. Set SELinux back to enforcing mode.
   # setenforce 1
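The steps above can be sketched as a single shell function so that SELinux is returned to enforcing mode even if an intermediate step fails. This is an illustrative sketch, not a supported script. The function names, retry counts, and delays are hypothetical, and it assumes that replicatedctl app status prints JSON containing a "State" field whose value is "stopped" or "started". Adjust the pattern if your output differs.

```shell
# Retry a command up to $1 times, sleeping $2 seconds between attempts.
retry() {
  local attempts="$1" delay="$2" i=0
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Assumption: the status output contains a line such as "State": "started".
app_in_state() {
  replicatedctl app status | grep -q "\"State\": *\"$1\""
}

agent_image_present() {
  docker image inspect hashicorp/tfe-agent:latest >/dev/null 2>&1
}

# Hypothetical wrapper around the workaround steps. Run as root.
rebuild_agent_image() {
  local rc=0
  setenforce 0 || return 1
  {
    replicatedctl app stop &&
    retry 60 10 app_in_state stopped &&
    replicatedctl app start &&
    retry 60 10 app_in_state started &&
    retry 30 10 agent_image_present
  } || rc=$?
  # Always return to enforcing mode, even if a step above failed.
  setenforce 1
  return "$rc"
}
```

Calling rebuild_agent_image performs the stop and start cycle and returns a non-zero status if the application does not reach the expected state or the image does not appear within the retry window. The image check is polled rather than run once because the image may still be building shortly after the application reports a started state.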
Note that this workaround must be repeated under the following conditions:
- The hashicorp/tfe-agent:latest image is removed from the local repository (e.g., when rebuilding nodes or through manual actions like docker rmi or docker image prune).
- You make changes to the CA bundle setting or update the Terraform Enterprise certificates.
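The first condition can be detected with a simple probe, for example from a scheduled job or monitoring check. The following is a minimal sketch. The function name agent_image_status is hypothetical, and it does not detect the second condition (CA bundle or certificate changes).

```shell
# Hypothetical probe: prints "present" or "missing" for the default agent
# image and returns a non-zero status when the image is missing.
agent_image_status() {
  if docker image inspect hashicorp/tfe-agent:latest >/dev/null 2>&1; then
    echo "present"
  else
    echo "missing"
    return 1
  fi
}
```

A "missing" result on an affected release suggests the workaround needs to be applied again before new runs can start.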
Additional Information
- SELinux Support in Terraform Enterprise
- For more information on SELinux modes, refer to the Red Hat documentation on how to run SELinux in permissive mode.