Introduction
When, for example, a VCS-backed mono-repo triggers a lot of runs in Terraform Enterprise, you might see many runs getting stuck on "Plan queued" or "Apply queued".
Problem
Many runs on Terraform Enterprise get stuck indefinitely.
In the Kubernetes events log the following line can be observed:
$ kubectl -n terraform-enterprise-agents get events
11m Warning FailedScheduling pod/tfe-task-fe14c45c-83ea-4d7b-9dba-9c54f8057347-5df5z 0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
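To see which agent pods are affected, you can also list the pods that are still waiting to be scheduled. A minimal sketch, assuming the same terraform-enterprise-agents namespace shown in the event output above:
$ kubectl -n terraform-enterprise-agents get pods --field-selector=status.phase=Pending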
Prerequisites
- Terraform Enterprise on Kubernetes
Cause
Your Terraform Enterprise environment is most likely undersized for memory, which prevents the agent pods from starting up and performing the runs.
Example:
Your Kubernetes cluster has a single node with 16GB of memory and the default Terraform Enterprise capacity settings: 10 concurrent runs and 2048MB of memory per run.
This node is undersized, because when all 10 runs are triggered they request 10 x 2048MB of memory, about 20GB in total, while the Terraform Enterprise pod itself needs roughly another 5GB.
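To confirm the node is undersized, you can compare the node's allocatable memory with the resources already requested on it. A minimal sketch (replace <node-name> with the name of your node; the "Allocated resources" section appears near the end of the output):
$ kubectl describe node <node-name> | grep -A 8 "Allocated resources"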
Solutions
- Increase the memory size of your environment.
In the example above, doubling the memory of the Kubernetes node to 32GB would comfortably accommodate the 10 concurrent runs of 2048MB each.
- Decrease the number of concurrent runs.
If you are unable to increase the memory size of your environment, another option is to decrease the number of concurrent runs. Taking the example above again, we have 16GB of memory. With about 5GB needed for Terraform Enterprise and the system itself, roughly 11GB is left for the runs, which means the concurrency needs to drop from the default 10 to 5 (see the sketch after this list).
Although it is possible to reduce the memory per run below the default 2048MB, this is not advised.
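Concurrency is controlled by the TFE_CAPACITY_CONCURRENCY setting (and memory per run by TFE_CAPACITY_MEMORY). A minimal sketch of lowering concurrency to 5, assuming a deployment with the official hashicorp/terraform-enterprise Helm chart where these settings live under the env.variables map; adjust the release name and namespace to your environment:
$ helm upgrade terraform-enterprise hashicorp/terraform-enterprise \
    --namespace terraform-enterprise \
    --reuse-values \
    --set-string env.variables.TFE_CAPACITY_CONCURRENCY=5
Depending on your setup, the Terraform Enterprise pod may need a restart before it picks up the new value.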
Outcome
After sizing your Kubernetes nodes and Terraform Enterprise capacity settings properly, runs will no longer get stuck.
Caveat:
It is always possible that a run gets stuck, even in a properly sized environment. In that case you will need to manually cancel and retry the run. Engineering is looking into a way to detect a failed pod and notify the user about stuck runs.