Problem
Terraform Enterprise runs remain queued indefinitely. The tfe-task-worker component, which is responsible for starting ephemeral agents to perform runs, logs the following error.
{
"@level": "error",
"@message": "error running task instance",
"@module": "tfe-task-worker.executor",
"@timestamp": "2023-10-31T22:55:09.529030Z",
"err": "fork/exec /workers/tfe-sentinel-worker: too many open files"
}
Subsequent jobs immediately fail as they are dequeued.
{
"@level": "debug",
"@message": "dequeued task",
"@module": "tfe-task-worker.dequeuer.agent-run",
"@timestamp": "2023-10-31T22:55:30.888186Z",
"id": "dc212e25-3013-4758-8517-7b8c2ad42bce"
}
{
"@level": "debug",
"@message": "executing task",
"@module": "tfe-task-worker.dequeuer.agent-run",
"@timestamp": "2023-10-31T22:55:30.888281Z",
"capacity": 10,
"id": "dc212e25-3013-4758-8517-7b8c2ad42bce",
"running": 1
}
{
"@level": "error",
"@message": "error executing task",
"@module": "tfe-task-worker.dequeuer.agent-run",
"@timestamp": "2023-10-31T22:55:30.888600Z",
"id": "dc212e25-3013-4758-8517-7b8c2ad42bce"
}
Prerequisites
- Terraform Enterprise Replicated or Flexible Deployment Options (FDO) versions v202302-1 through v202311-1.
- Agent pipeline mode (Replicated Deployment).
- TFE_RUN_PIPELINE_DRIVER is set to docker (Flexible Deployment Options).
Cause
A file handle leak in the tfe-task-worker component causes it to reach its nofile limit after significant uptime. This issue typically occurs in Docker installations configured with a lower per-container nofile limit.
For example, the Amazon Linux 2023 Docker package starts the daemon with a lower default ulimit of nofile=32768:65536 (soft limit 32768, hard limit 65536). This default is set by a flag in the /etc/sysconfig/docker file.
## By default we limit the number of open files per container
OPTIONS="--default-ulimit nofile=32768:65536"
Solutions
Solution 1: Upgrade Terraform Enterprise
This issue is resolved in Terraform Enterprise versions v202311-1 and newer. The recommended solution is to upgrade to the latest version to permanently fix the file handle leak.
Solution 2: Apply a Workaround
If an immediate upgrade is not feasible, you can apply a temporary workaround.
- Restart Terraform Enterprise to release the accumulated file handles. After the restart, you may need to cancel and re-trigger any queued runs.
- To prevent recurrence, either remove or increase the default nofile ulimit for the Docker daemon on the host system.
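As a rough sketch of the second step, on a host that sets the limit through the /etc/sysconfig/docker flag described under Cause, the function below rewrites the nofile value in place and keeps a backup copy. The function name set_default_nofile and the value 1048576 are illustrative assumptions, not recommendations from HashiCorp or Docker. Choose a limit that suits your host. The Docker daemon must be restarted for the change to take effect, which also restarts the Terraform Enterprise containers.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace the SOFT:HARD pair in an existing
# "--default-ulimit nofile=SOFT:HARD" flag. A backup is written to <file>.bak.
set_default_nofile() {
  local file="$1" soft="$2" hard="$3"
  sed -E -i.bak \
    "s/(--default-ulimit nofile=)[0-9]+:[0-9]+/\1${soft}:${hard}/" \
    "$file"
}

# Example usage (run as root; the values are illustrative only):
#   set_default_nofile /etc/sysconfig/docker 1048576 1048576
#   systemctl restart docker
```

The helper only edits a flag that is already present. It does not add the flag to a file that lacks it, and it does not cover hosts that configure the daemon another way.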
You can monitor the tfe-task-worker file handles with the following command.
$ ls /proc/$(pidof /usr/bin/tfe-task-worker)/fd | wc -l
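To make the count easier to interpret, it can be compared against the soft nofile limit of the same process. The following is a minimal sketch, assuming a Linux environment with /proc mounted and permission to read the target process. The function names fd_usage and fd_over_threshold and the 80 percent threshold are hypothetical choices for illustration.

```shell
#!/usr/bin/env bash
# Print the number of open file descriptors and the soft nofile limit
# for the given PID, using /proc.
fd_usage() {
  local pid="$1" open soft
  # Each entry in /proc/<pid>/fd is one open descriptor.
  open=$(ls "/proc/${pid}/fd" | wc -l)
  # The "Max open files" row lists the soft limit, then the hard limit.
  soft=$(awk '/^Max open files/ {print $4}' "/proc/${pid}/limits")
  echo "pid=${pid} open=$((open)) soft_limit=${soft}"
}

# Succeed (exit status 0) when open is at or above percent% of soft.
fd_over_threshold() {
  local open="$1" soft="$2" percent="$3"
  [ $(( open * 100 )) -ge $(( soft * percent )) ]
}

# Example usage:
#   fd_usage "$(pidof /usr/bin/tfe-task-worker)"
#   fd_over_threshold 30000 32768 80 && echo "tfe-task-worker is close to its nofile limit"
```

A count that climbs steadily toward the soft limit over days of uptime is consistent with the leak described in the Cause section.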