Introduction
Problem
Terraform Enterprise is running on OpenShift/Kubernetes, but at some point no new runs start. Terraform Enterprise remains available, yet the runs fail to initiate for an unknown reason.
Nothing in the Terraform Enterprise logs points to an error or issue.
Prerequisites
- Terraform Enterprise earlier than v202504-1
- Deployment platform OpenShift/Kubernetes
Cause
When a run needs to start in a workspace for a plan or apply, Terraform Enterprise translates it into a job in OpenShift/Kubernetes. This job then starts a pod with a Terraform Enterprise Agent, which downloads the code and executes the Terraform binary for the plan or apply. Once completed, the Kubernetes job is marked as successful and cleaned up.
In some cases, this final cleanup step within the OpenShift/Kubernetes environment does not complete properly. The job is marked as successful but is not removed, so completed jobs remain visible, as shown in the example below.
kubectl get jobs -n terraform-enterprise-agents
NAME                                            COMPLETIONS   DURATION   AGE
tfe-task-08613daa-f6a7-403e-8707-6aa5acfa8fb3   1/1           121m       179m
tfe-task-21745fc8-25bf-4fcb-9e1a-fce81d68ba80   1/1           3h11m      6h30m
tfe-task-48e306e9-40c1-468f-909c-b61eb6c8cbad   1/1           3h21m      5h35m
tfe-task-c065d786-3fd9-447d-8deb-9ed482b44f3b   1/1           71m        4h16m
tfe-task-f90862a2-c4f7-4e22-9082-245beaea9812   1/1           3h20m      6h2m
These five jobs still count toward the concurrency limit set in Terraform Enterprise through the TFE_CAPACITY_CONCURRENCY parameter, which defaults to 10. As a result, Terraform Enterprise will only run five concurrent runs instead of the expected 10. If the number of these uncleaned successful jobs reaches the configured concurrency limit, no new runs will start.
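To gauge how close an environment is to this state, the lingering jobs can be counted and compared with the concurrency limit. The following is a minimal sketch, not an official tool: the helper names count_completed_jobs and remaining_slots are hypothetical, and it assumes the default table output of kubectl get jobs, where a finished job shows 1/1 in the COMPLETIONS column. A job that has just finished and is about to be cleaned up normally is counted as well, so a single reading is only an indication.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only.

# Reads `kubectl get jobs` table output on stdin and prints the number of
# tfe-task jobs whose COMPLETIONS column is 1/1.
count_completed_jobs() {
  awk '$1 ~ /^tfe-task-/ && $2 == "1/1" { n++ } END { print n + 0 }'
}

# remaining_slots LIMIT COMPLETED
# Prints how many concurrent runs are still possible, never less than 0.
remaining_slots() {
  local limit="$1" completed="$2"
  local remaining=$(( limit - completed ))
  if (( remaining < 0 )); then
    remaining=0
  fi
  echo "$remaining"
}

# Example against a live cluster (namespace taken from the example above,
# 10 being the default TFE_CAPACITY_CONCURRENCY value):
#   completed=$(kubectl get jobs -n terraform-enterprise-agents | count_completed_jobs)
#   remaining_slots 10 "$completed"
```

If the second command prints 0 while no runs are actually in progress, the environment has most likely reached the situation described above.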
Simply removing the successfully completed jobs using a command like the one below is not sufficient.
kubectl -n terraform-enterprise-agents delete job -l job-name-prefix=tfe-task --field-selector=status.successful=1
Solution
Option 1:
The quickest solution is to restart the Terraform Enterprise pods. This resets job information within the Terraform Enterprise database and the OpenShift/Kubernetes platform.
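One way to restart the pods is a rollout restart. The sketch below assumes that Terraform Enterprise runs as a Deployment named terraform-enterprise in a namespace of the same name; both names, the restart_tfe function, and the TFE_NAMESPACE, TFE_DEPLOYMENT, and KUBECTL variables are assumptions for illustration and need to be adjusted to the actual installation.

```shell
#!/usr/bin/env bash
# Assumed names; replace them with the ones used in your installation.
TFE_NAMESPACE="${TFE_NAMESPACE:-terraform-enterprise}"
TFE_DEPLOYMENT="${TFE_DEPLOYMENT:-terraform-enterprise}"

# Restarts the Terraform Enterprise pods and waits for the rollout to finish.
# KUBECTL can be overridden, for example with `oc` on OpenShift or with
# `echo kubectl` to print the commands instead of running them.
restart_tfe() {
  local kubectl_cmd="${KUBECTL:-kubectl}"
  $kubectl_cmd rollout restart "deployment/${TFE_DEPLOYMENT}" -n "${TFE_NAMESPACE}"
  $kubectl_cmd rollout status "deployment/${TFE_DEPLOYMENT}" -n "${TFE_NAMESPACE}"
}

# Dry run:  KUBECTL='echo kubectl' restart_tfe
# Real run: restart_tfe
```

Running the dry run first shows exactly which commands would be sent to the cluster.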
Option 2:
Upgrade to Terraform Enterprise v202504-1 or later.
Outcome
New runs start as expected.
Additional Information
- Details about the TFE_CAPACITY_CONCURRENCY parameter can be found here