Problem
Terraform runs remain queued indefinitely and the following error is logged by the tfe-task-worker, which is the Terraform Enterprise component responsible for starting ephemeral agents to perform runs.
2023-10-31T22:55:09.530503000Z {"@level":"error","@message":"error running task instance","@module":"tfe-task-worker.executor","@timestamp":"2023-10-31T22:55:09.529030Z","err":"fork/exec /workers/tfe-sentinel-worker: too many open files"}
Subsequent jobs immediately fail as they are dequeued:
2023-10-31T22:55:30.888483000Z {"@level":"debug","@message":"dequeued task","@module":"tfe-task-worker.dequeuer.agent-run","@timestamp":"2023-10-31T22:55:30.888186Z","id":"dc212e25-3013-4758-8517-7b8c2ad42bce"}
2023-10-31T22:55:30.889163000Z {"@level":"debug","@message":"executing task","@module":"tfe-task-worker.dequeuer.agent-run","@timestamp":"2023-10-31T22:55:30.888281Z","capacity":10,"id":"dc212e25-3013-4758-8517-7b8c2ad42bce","running":1}
2023-10-31T22:55:30.889859000Z {"@level":"error","@message":"error executing task","@module":"tfe-task-worker.dequeuer.agent-run","@timestamp":"2023-10-31T22:55:30.888600Z","id":"dc212e25-3013-4758-8517-7b8c2ad42bce"}
Prerequisites
- Terraform Enterprise Replicated Deployment and Flexible Deployment Options v202302-1 through v202311-1
- Agent pipeline mode (Replicated Deployment)
- TFE_RUN_PIPELINE_DRIVER is set to docker (Flexible Deployment Options)
Cause
This is caused by a file handle leak in the tfe-task-worker, which makes it reach its nofile limit after significant uptime. The issue generally manifests only in Docker installations that are configured with a lower per-container nofile limit. For example, the Amazon Linux 2023 Docker package starts the daemon with a lower default ulimit of nofile=32768:65536, set via a flag in the /etc/sysconfig/docker file:
# By default we limit the number of open files per container
OPTIONS="--default-ulimit nofile=32768:65536"
Solutions
This has been fixed in releases after v202311-1. To resolve the issue on impacted releases, restart Terraform Enterprise to release the leaked file handles, then cancel and re-trigger the affected runs. If an upgrade is not immediately feasible, either remove or increase the default nofile ulimit, and monitor the file handles in use on the system to plan application restarts. The tfe-task-worker's file handles can be monitored with ls /proc/$(pidof /usr/bin/tfe-task-worker)/fd | wc -l.
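To help plan restarts, the open handle count can be compared against the limit that applies to the process. The following is a minimal sketch, not an official HashiCorp tool: the function name fd_usage is hypothetical, and it assumes a Linux /proc filesystem and enough privileges to read the target process's fd directory (typically root, in the same PID namespace as the worker).

```shell
# Print "<open fds> <soft nofile limit>" for a given PID (Linux /proc only).
# fd_usage is a hypothetical helper name used for illustration.
fd_usage() {
    pid="$1"
    # Each entry in /proc/<pid>/fd is one open file handle.
    open=$(ls "/proc/$pid/fd" | wc -l)
    # In /proc/<pid>/limits, the "Max open files" row lists the soft limit
    # in the fourth whitespace-separated field and the hard limit in the fifth.
    soft=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "$open $soft"
}
```

It could be invoked as fd_usage "$(pidof /usr/bin/tfe-task-worker)", reusing the PID lookup from the command above. When the first number approaches the second, a restart is likely due.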
Additional Information
If you continue to experience issues, please contact HashiCorp Support.