Problem
In Terraform Enterprise Flexible Deployment Options on Kubernetes, a Terraform run appears stuck in the plan queued state or another non-finalized state. This indicates a failure while creating or executing the run's associated Kubernetes Job.
Cause
Terraform Enterprise performs runs using ephemeral HCP Terraform Agents that are launched within a Kubernetes Job. The task-worker service in Terraform Enterprise creates these jobs. A failure can occur at several points in this process:
- The task-worker service may fail to create the Kubernetes Job.
- Kubernetes may fail to schedule or create the agent pod due to configuration errors, resource limitations, or environmental issues.
- The HCP Terraform Agent container may fail to start or register with the Terraform Enterprise platform.
Solutions
This guide provides a methodical approach to troubleshooting run failures in Terraform Enterprise on Kubernetes.
Solution 1: Perform Initial Diagnostics
Begin by investigating the task-worker logs and the state of the Kubernetes Job.
- Check the task-worker logs.

  Execute the following command on the Terraform Enterprise pod to view the task-worker logs and confirm that it created a Kubernetes Job for the run.

  $ kubectl exec -ti -n <TFE_NAMESPACE> <TFE_POD> -- cat /var/log/terraform-enterprise/task-worker.log

  The expected output shows that the task-worker received, dequeued, and began executing the task.

  {"@level":"info","@message":"request complete","@module":"task-worker.router","duration":3942329,"host":"127.0.0.1:8000","method":"POST","path":"/v1/task/invoke","remote_addr":"127.0.0.1:39582","status_code":201,"trace_id":"83836356-fccd-4c93-b996-d421b41ff0a9"}
  {"@level":"debug","@message":"dequeued task","@module":"task-worker.dequeuer.agent-run","id":"713bf8e0-e61a-496f-8bb2-9c331874df19"}
  {"@level":"debug","@message":"executing task","@module":"task-worker.dequeuer.agent-run","capacity":10,"id":"713bf8e0-e61a-496f-8bb2-9c331874df19","running":1}
- Enable debug mode to inspect failed jobs.

  When a run is canceled or errors, the task-worker immediately deletes the Kubernetes Job. To aid troubleshooting, you can prevent this automatic cleanup by enabling the TFE_RUN_PIPELINE_KUBERNETES_DEBUG_ENABLED setting, which lets you inspect the Job and Pod after the run has been canceled. By default, retained jobs have a time-to-live (TTL) of one day, which you can configure with TFE_RUN_PIPELINE_KUBERNETES_DEBUG_JOBS_TTL. A configuration sketch follows this list.

- Inspect the Kubernetes Job and Pod.
  Describe the Job to confirm it created a pod for the agent. Filter by the run ID.

  $ kubectl describe jobs -n <TFE_NAMESPACE>-agents -l run_id=<RUN_EXTERNAL_ID>

  If the Job created a pod, describe the pod to check for scheduling issues.

  $ kubectl describe pods -n <TFE_NAMESPACE>-agents -l run_id=<RUN_EXTERNAL_ID>
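Where these settings live depends on how you deploy Terraform Enterprise. As a minimal sketch, assuming an installation with the official terraform-enterprise Helm chart, a release named terraform-enterprise, and environment variables passed through the chart's env.variables map, enabling debug mode could look like this:

$ helm upgrade terraform-enterprise hashicorp/terraform-enterprise \
    -n <TFE_NAMESPACE> \
    --reuse-values \
    --set-string env.variables.TFE_RUN_PIPELINE_KUBERNETES_DEBUG_ENABLED=true
# Optionally adjust how long retained jobs are kept by also setting
# env.variables.TFE_RUN_PIPELINE_KUBERNETES_DEBUG_JOBS_TTL (check the Terraform
# Enterprise configuration reference for the expected value format).

Remember to disable debug mode again after troubleshooting so that completed jobs are cleaned up as usual.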
If the agent pod was successfully scheduled, proceed to the other solutions to diagnose specific failures.
Solution 2: Resolve Image Pull Failures
Image pull failures often occur when using a custom agent image. If the kubelet cannot pull the image, it generates events that are visible in the pod description.
$ kubectl describe pods -n <TFE_NAMESPACE>-agents -l run_id=<RUN_EXTERNAL_ID>
## ...
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  30s                default-scheduler  Successfully assigned terraform-enterprise-agents/tfe-task-7e9ef694-3a15-4ab0-ae31-8fd6a384160a-9w584 to ip-10-0-174-139.ec2.internal
  Normal   Pulling    18s (x2 over 30s)  kubelet            Pulling image "quay.io/my-org/custom-agent-test"
  Warning  Failed     18s (x2 over 30s)  kubelet            Failed to pull image "quay.io/my-org/custom-agent-test": failed to pull and unpack image "quay.io/my-org/custom-agent-test:latest": failed to resolve reference "quay.io/my-org/custom-agent-test:latest": unexpected status from HEAD request to https://quay.io/v2/my-org/custom-agent-test/manifests/latest: 401 UNAUTHORIZED
  Warning  Failed     18s (x2 over 30s)  kubelet            Error: ErrImagePull
  Normal   BackOff    6s (x2 over 29s)   kubelet            Back-off pulling image "quay.io/my-org/custom-agent-test"
  Warning  Failed     6s (x2 over 29s)   kubelet            Error: ImagePullBackOff
In this example, the 401 UNAUTHORIZED error indicates an authentication failure. This can be resolved by configuring Terraform Enterprise with an image pull secret using the TFE_RUN_PIPELINE_KUBERNETES_IMAGE_PULL_SECRET_NAME setting.
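As an illustrative sketch, assuming the custom image lives in a private quay.io registry and agent jobs run in the <TFE_NAMESPACE>-agents namespace used in the commands above, you could create a pull secret in that namespace and point Terraform Enterprise at it by name (the secret name tfe-agent-pull-secret and the registry credentials are placeholders):

# Create the pull secret in the namespace where the agent pods run.
$ kubectl create secret docker-registry tfe-agent-pull-secret \
    -n <TFE_NAMESPACE>-agents \
    --docker-server=quay.io \
    --docker-username=<REGISTRY_USERNAME> \
    --docker-password=<REGISTRY_PASSWORD>

# Reference the secret by name in the Terraform Enterprise environment, for example
# through the Helm chart's env.variables map as sketched in Solution 1.
$ helm upgrade terraform-enterprise hashicorp/terraform-enterprise -n <TFE_NAMESPACE> --reuse-values \
    --set-string env.variables.TFE_RUN_PIPELINE_KUBERNETES_IMAGE_PULL_SECRET_NAME=tfe-agent-pull-secret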
Solution 3: Resolve Pod Scheduling Issues
The Kubernetes scheduler may be unable to schedule the agent pod. Describe the pod and review the events to find the cause.
$ kubectl describe pods -n <TFE_NAMESPACE>-agents -l run_id=<RUN_EXTERNAL_ID>
## ...
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  1s (x2 over 5m8s)  default-scheduler  0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
In this example, the pod could not be scheduled because no node had sufficient memory available (Insufficient memory). You may need to add memory to your Kubernetes nodes or add nodes to the cluster.
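To see whether the nodes have headroom for the agent pod's memory request, review each node's allocatable capacity and current allocations, for example:

# Show allocatable capacity and the resources already requested on each node.
$ kubectl describe nodes | grep -A 8 "Allocated resources"

# If the metrics-server add-on is installed, show live node utilization.
$ kubectl top nodes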
In other cases, scheduling is only delayed, for example while the cluster autoscaler provisions a new node, and the task-worker times out waiting for the pod. The default timeout is 60 seconds. If you see a context deadline exceeded error in the task-worker logs, you can extend this timeout with the TFE_RUN_PIPELINE_KUBERNETES_POD_TIMEOUT setting (available in v202404-2 and later).
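To extend the timeout, the same Helm-based approach sketched in Solution 1 applies; only the environment variable changes. The value of 300 below assumes the setting is expressed in seconds, matching the 60-second default:

$ helm upgrade terraform-enterprise hashicorp/terraform-enterprise -n <TFE_NAMESPACE> --reuse-values \
    --set-string env.variables.TFE_RUN_PIPELINE_KUBERNETES_POD_TIMEOUT=300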
Solution 4: Resolve Out of Memory (OOM) Errors
If a Terraform process uses more memory than its container limit, the Linux out-of-memory (OOM) killer terminates it. Because the OOM killer may kill only the terraform child process while the agent container's main process keeps running, Kubernetes may not report an OOMKilled status or record an event for the pod. The run fails with a generic error:
Operation failed: failed running terraform plan (exit 1)
To confirm an OOM error, check the kernel logs on the node where the pod was scheduled.
- SSH to the node and run journalctl.

  $ journalctl -b -k -p err

- If you cannot SSH to the node, start a debug pod on the node and view the dmesg logs.

  $ kubectl debug node/<NODE_NAME> -it --rm --image=busybox
  $ cat host/var/log/dmesg
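Whichever method you use, you can narrow the kernel log output to OOM events by filtering for the OOM killer's messages, for example:

# From an SSH session on the node: only OOM-related kernel messages from the current boot.
$ journalctl -b -k | grep -i -E "out of memory|oom"

# From a node debug pod: filter the same information out of the host's dmesg log.
$ grep -i -E "out of memory|oom" host/var/log/dmesg

If a matching entry names the terraform process, the kernel terminated the run's plan or apply.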
To resolve this, increase the maximum amount of memory a Terraform run can use with the TFE_CAPACITY_MEMORY setting.
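As with the other settings above, one way to apply this is through the deployment's environment variables. A sketch using the assumed Helm release from Solution 1, raising the per-run limit to 4096 MB (TFE_CAPACITY_MEMORY is expressed in megabytes):

$ helm upgrade terraform-enterprise hashicorp/terraform-enterprise -n <TFE_NAMESPACE> --reuse-values \
    --set-string env.variables.TFE_CAPACITY_MEMORY=4096

The agent pod still has to fit on a node, so a higher limit may also require the larger nodes discussed in Solution 3.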
Solution 5: Diagnose Agent Failures
If the Kubernetes Job and pod were created successfully, the HCP Terraform Agent itself may have failed during startup. Agent logs are streamed to the task-worker and are available in the main Terraform Enterprise container logs. You can also view them directly from the agent pod if you have enabled debug mode.
$ kubectl logs -n <TFE_NAMESPACE>-agents -l run_id=<RUN_EXTERNAL_ID>
Review these logs for errors related to the agent's environment setup or registration with Terraform Enterprise.
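Because the agent output is interleaved with other log lines in the main container, it can help to filter by the run's external ID. A sketch, assuming the run ID appears in the relevant log entries:

# Search the main Terraform Enterprise container logs for lines that mention the run.
$ kubectl logs -n <TFE_NAMESPACE> <TFE_POD> | grep <RUN_EXTERNAL_ID>

# With debug mode enabled, you can also follow the agent pod's logs while the run executes.
$ kubectl logs -f -n <TFE_NAMESPACE>-agents -l run_id=<RUN_EXTERNAL_ID>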