Problem
When a large number of runs are triggered in Terraform Enterprise, such as from a VCS mono-repository, you may observe that many runs become stuck in the Plan queued or Apply queued state indefinitely.
You can confirm this behavior in the Kubernetes event log by checking the events in the agent namespace:
$ kubectl -n terraform-enterprise-agents get events
The output shows a FailedScheduling warning due to insufficient memory, similar to the following:
11m Warning FailedScheduling pod/tfe-task-fe14c45c-83ea-4d7b-9dba-9c54f8057347-5df5z 0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
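You can also list agent pods that are stuck in the Pending state using a standard kubectl field selector. The namespace shown matches the example above and may differ in your installation.
$ kubectl -n terraform-enterprise-agents get pods --field-selector=status.phase=Pending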
Prerequisites
- Terraform Enterprise installed in a Kubernetes environment.
Cause
The stuck runs are typically caused by insufficient memory in the Kubernetes cluster. When a run is triggered, Terraform Enterprise attempts to schedule a new agent pod. If the cluster nodes lack the available memory to meet the pod's request, the pod cannot be scheduled, and the run remains queued.
For example, consider a Kubernetes cluster with a single 16GB node and default Terraform Enterprise capacity settings (10 concurrent runs, 2048MB memory per run). If all 10 runs are triggered simultaneously, the agent pods request a total of 20.48GB (10 x 2048MB) of memory. This alone exceeds the node's capacity, before even accounting for the approximately 5GB consumed by the main Terraform Enterprise application pod and other system processes.
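To compare these requests against what a node can actually provide, you can inspect the node's allocatable memory and the resources already requested by running pods. The node name below is a placeholder.
$ kubectl describe node <node-name>
In the output, the Allocatable section shows the memory the scheduler can hand out, and the Allocated resources section shows how much of it is already requested; the difference is what remains for new agent pods.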
Solutions
Solution 1: Increase Environment Memory
The recommended solution is to increase the memory of your Kubernetes nodes to meet the demands of your workload. In the example above, increasing the node's memory to 32GB would accommodate the 20.48GB required for 10 concurrent runs, plus the additional resources needed for Terraform Enterprise and system processes.
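After resizing, you can confirm the new allocatable memory on each node with a standard kubectl query (this only formats node status fields; no Terraform Enterprise-specific tooling is assumed):
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,ALLOCATABLE_MEMORY:.status.allocatable.memory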
Solution 2: Decrease Concurrent Runs
If you cannot increase the environment's memory, you can decrease the number of concurrent runs to align with your available resources. Using the same example of a 16GB node, approximately 11GB of memory is available for runs after accounting for system and application needs. You would need to lower the concurrency setting from the default of 10 to 5 (5 x 2048MB = 10.24GB) to stay within this limit.
While it is possible to reduce the memory capacity per run below the 2048MB default, this is not advised as it may cause individual runs to fail.
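As a sketch, if you deploy Terraform Enterprise with the official Helm chart, the concurrency setting is controlled by the TFE_CAPACITY_CONCURRENCY environment variable, which is typically set under env.variables in your Helm overrides file. The release name, chart reference, namespace, and file name below are placeholders; adjust them to match your deployment.

env:
  variables:
    # Lower concurrency from the default of 10 to 5 so that
    # 5 x 2048MB = 10.24GB fits within the ~11GB available for runs.
    # TFE_CAPACITY_MEMORY is left at its 2048MB default.
    TFE_CAPACITY_CONCURRENCY: "5"

Apply the change with a Helm upgrade, for example:
$ helm upgrade terraform-enterprise hashicorp/terraform-enterprise -n terraform-enterprise -f overrides.yaml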
Outcome
After correctly sizing your Kubernetes nodes and Terraform Enterprise capacity settings, new runs should execute without becoming stuck in a queued state.
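To confirm, you can watch the agent namespace while new runs are triggered and verify that no further FailedScheduling warnings appear (same namespace assumption as earlier):
$ kubectl -n terraform-enterprise-agents get events -w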
Note: Even in a properly sized environment, intermittent issues may cause a run to become stuck. In these cases, you may need to manually cancel and retry the run.
Additional Information
- For more details on capacity settings, refer to the Terraform Enterprise configuration reference.