Overview
This article covers troubleshooting Terraform Enterprise runs that are stuck in the "plan queued" state when using the Kubernetes run pipeline, by inspecting the task-worker service, the Kubernetes Job it creates, and the TFC Agent pod.
Procedure
Typically, when runs are stuck in the "plan queued" state, something has failed in the task-worker service, the Kubernetes Job, or the TFC Agent. Begin by viewing the task-worker logs to determine whether a Kubernetes Job was successfully created for the run. Run the following command on the Terraform Enterprise pod that is processing the run to isolate the task-worker logs.
kubectl exec -ti -n <TFE_NAMESPACE> <TFE_POD> -- cat /var/log/terraform-enterprise/task-worker.log
The expected output is the following, indicating that the task-worker received a task, dequeued it, and began executing it.
{"@level":"info","@message":"request complete","@module":"task-worker.router","@timestamp":"2024-04-30T18:49:28.864077Z","duration":3942329,"host":"127.0.0.1:8000","method":"POST","path":"/v1/task/invoke","remote_addr":"127.0.0.1:39582","status_code":201,"trace_id":"83836356-fccd-4c93-b996-d421b41ff0a9"}
{"@level":"debug","@message":"dequeued task","@module":"task-worker.dequeuer.agent-run","@timestamp":"2024-04-30T18:49:29.695890Z","id":"713bf8e0-e61a-496f-8bb2-9c331874df19"}
{"@level":"debug","@message":"executing task","@module":"task-worker.dequeuer.agent-run","@timestamp":"2024-04-30T18:49:29.696069Z","capacity":10,"id":"713bf8e0-e61a-496f-8bb2-9c331874df19","running":1}
Assuming the logs from the task-worker are not indicative of an issue, the next place to check is the Kubernetes Job the task-worker will have created for the run.
When a run enters a finalized state (i.e. it is cancelled or errors), the task-worker immediately deletes the Kubernetes Job as part of a cleanup process. When troubleshooting, it is often useful to disable this behavior by setting TFE_RUN_PIPELINE_KUBERNETES_DEBUG_ENABLED to true, which configures the task-worker to skip cleanup and allows operators to inspect the Job and Pod after the run has been cancelled. Jobs are created with a time-to-live of 1 day by default, at which point Kubernetes cleans them up. If desired, this TTL can also be configured via the TFE_RUN_PIPELINE_KUBERNETES_DEBUG_JOBS_TTL setting.
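If Terraform Enterprise was installed with the official Helm chart and environment variables are supplied through the chart's env.variables map, the setting can be applied with an upgrade similar to the following (the release name, chart reference, and namespace below are placeholders to adjust for your installation); TFE_RUN_PIPELINE_KUBERNETES_DEBUG_JOBS_TTL can be set the same way.
helm upgrade terraform-enterprise hashicorp/terraform-enterprise \
  -n <TFE_NAMESPACE> \
  --reuse-values \
  --set-string env.variables.TFE_RUN_PIPELINE_KUBERNETES_DEBUG_ENABLED=true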
Describe the Job created for the run to confirm it was able to create a Pod for the TFC Agent. Find the specific Job by filtering on the label run_id=<RUN_EXTERNAL_ID>.
kubectl describe jobs -n <RELEASE_NAMESPACE>-agents -l run_id=<RUN_EXTERNAL_ID>
There should be events in the output indicating a pod was created for the agent job. Assuming the Pod was created, describe the agent pod and view the events for any issues related to scheduling, etc.
kubectl describe po -n <RELEASE_NAMESPACE>-agents -l run_id=<RUN_EXTERNAL_ID>
If the agent pod was successfully scheduled, check the agent logs for errors (see Agent Issues below). The sections below outline a few common issues that arise with remote runs in Kubernetes and can be used as a point of reference for general troubleshooting.
Image Pull Failures
Image pull failures will typically manifest when using a custom worker image. If the kubelet is unable to pull the custom worker image, Kubernetes events will be created and can be viewed by describing the TFC Agent pod.
kubectl describe po -n terraform-enterprise-agents -l run_id=run-BPHF4KEVKoN3cdNk
Name: tfe-task-7e9ef694-3a15-4ab0-ae31-8fd6a384160a-9w584
Namespace: terraform-enterprise-agents
Priority: 0
Service Account: default
Node: ip-10-0-174-139.ec2.internal/10.0.174.139
Start Time: Tue, 30 Apr 2024 12:20:19 -0400
Labels: app=terraform-enterprise
batch.kubernetes.io/controller-uid=2dcf544b-5934-46d9-be94-e713960da391
batch.kubernetes.io/job-name=tfe-task-7e9ef694-3a15-4ab0-ae31-8fd6a384160a
controller-uid=2dcf544b-5934-46d9-be94-e713960da391
job-name=tfe-task-7e9ef694-3a15-4ab0-ae31-8fd6a384160a
organization_name=nathan-lab
run_id=run-BPHF4KEVKoN3cdNk
run_type=Plan
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 30s default-scheduler Successfully assigned terraform-enterprise-agents/tfe-task-7e9ef694-3a15-4ab0-ae31-8fd6a384160a-9w584 to ip-10-0-174-139.ec2.internal
Normal Pulling 18s (x2 over 30s) kubelet Pulling image "quay.io/my-org/custom-agent-test"
Warning Failed 18s (x2 over 30s) kubelet Failed to pull image "quay.io/my-org/custom-agent-test": failed to pull and unpack image "quay.io/my-org/custom-agent-test:latest": failed to resolve reference "quay.io/my-org/custom-agent-test:latest": unexpected status from HEAD request to https://quay.io/v2/my-org/custom-agent-test/manifests/latest: 401 UNAUTHORIZED
Warning Failed 18s (x2 over 30s) kubelet Error: ErrImagePull
Normal BackOff 6s (x2 over 29s) kubelet Back-off pulling image "quay.io/my-org/custom-agent-test"
Warning Failed 6s (x2 over 29s) kubelet Error: ImagePullBackOff
Image pull failures can have several causes, including network issues, authorization failures, or configuration problems such as an invalid image reference. View the pod events to determine which of these is affecting the run. In the example above, TFE was not configured with a TFE_RUN_PIPELINE_KUBERNETES_IMAGE_PULL_SECRET_NAME, which caused unauthorized errors when the kubelet attempted to pull the image.
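If the failure is an authorization error, as in the example above, one approach (assuming the pull secret must exist in the agents namespace; agent-pull-secret below is just an example name) is to create a Docker registry secret for the registry hosting the image:
kubectl create secret docker-registry agent-pull-secret \
  -n <RELEASE_NAMESPACE>-agents \
  --docker-server=quay.io \
  --docker-username=<USERNAME> \
  --docker-password=<PASSWORD>
Then set TFE_RUN_PIPELINE_KUBERNETES_IMAGE_PULL_SECRET_NAME to agent-pull-secret (for example via the Helm env.variables mechanism shown earlier) and re-queue the run.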
Scheduling Issues
In some cases, the Kubernetes scheduler is unable to schedule the agent pod on a node. There can be several causes for this, such as node readiness or pod resource requests. To determine the cause, describe the pod and view the events, which will indicate why the pod was unschedulable.
kubectl describe po -n terraform-enterprise-agents -l run_id=run-R8skUHXJTBkPgiKe
Name: tfe-task-713bf8e0-e61a-496f-8bb2-9c331874df19-99sfg
Namespace: terraform-enterprise-agents
Priority: 0
Service Account: default
Node: <none>
Labels: app=terraform-enterprise
batch.kubernetes.io/controller-uid=232b1ded-523d-4c74-8d66-f511db0f22e8
batch.kubernetes.io/job-name=tfe-task-713bf8e0-e61a-496f-8bb2-9c331874df19
controller-uid=232b1ded-523d-4c74-8d66-f511db0f22e8
job-name=tfe-task-713bf8e0-e61a-496f-8bb2-9c331874df19
organization_name=nathan-lab
run_id=run-R8skUHXJTBkPgiKe
run_type=Plan
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 1s (x2 over 5m8s) default-scheduler 0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
In the example above, the Kubernetes scheduler was unable to find any nodes with sufficient resources on which to schedule the pod.
In some cases, agent containers are not started within a certain timeout period, for example during an autoscaling event that delays scheduling. This will cause the task-worker to mark the job as errored and delete the Kubernetes Job after one minute (the default timeout).
{"@level":"debug","@message":"dequeued task","@module":"task-worker.dequeuer.agent-run","@timestamp":"2024-03-13T17:38:58.391297Z","id":"ba02248e-9c3e-4cd8-9ea2-fa792677c3bd"}
{"@level":"debug","@message":"executing task","@module":"task-worker.dequeuer.agent-run","@timestamp":"2024-03-13T17:38:58.391324Z","capacity":17,"id":"ba02248e-9c3e-4cd8-9ea2-fa792677c3bd","running":17}
{"@level":"error","@message":"error running task instance","@module":"task-worker.executor","@timestamp":"2024-03-13T17:39:58.422502Z","err":"error waiting for kubernetes container to start :pod container is not ready: context deadline exceeded"}
{"@level":"error","@message":"error executing task","@module":"task-worker.dequeuer.agent-run","@timestamp":"2024-03-13T17:39:58.432169Z","id":"ba02248e-9c3e-4cd8-9ea2-fa792677c3bd"}
In these situations, it can be useful to extend the timeout beyond one minute to accommodate events which delay scheduling via the TFE_RUN_PIPELINE_KUBERNETES_POD_TIMEOUT setting. This setting takes a value in seconds (60 by default) and is available in releases v202404-2 and later.
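For example, assuming the same Helm env.variables convention shown earlier, the following would extend the timeout to five minutes:
helm upgrade terraform-enterprise hashicorp/terraform-enterprise \
  -n <TFE_NAMESPACE> \
  --reuse-values \
  --set-string env.variables.TFE_RUN_PIPELINE_KUBERNETES_POD_TIMEOUT=300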
Out of Memory
In some cases, Terraform's memory usage can exceed the container's memory limits. Because Terraform runs as a child process of the agent container's init process, the Linux oom-killer will kill the Terraform process, and there will be no record of an OOM in the Kubernetes events since the pod itself was not OOM killed. This typically manifests as the following error, which is surfaced on the run page.
Operation failed: failed running terraform plan (exit 1)
To confirm whether this is the cause, SSH onto the node on which the agent pod was scheduled and run the following command.
journalctl -b -k -p err
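OOM kills typically appear as kernel messages that reference the killed process; filtering the output makes them easier to spot:
journalctl -b -k -p err | grep -iE 'out of memory|oom'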
If it is not possible to SSH onto the node, start a debug pod on the node (which mounts the host filesystem at /host) and view the dmesg logs instead.
kubectl debug node/<NODE_NAME> -it --rm --image=busybox
cat /host/var/log/dmesg
To resolve this, increase the maximum amount of memory a Terraform run is allowed to use.
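Assuming the deployment uses the TFE_CAPACITY_MEMORY setting to cap per-run memory (a value in megabytes), one way to do this is to raise it via the Helm env.variables mechanism shown earlier, for example to 4096 MB:
helm upgrade terraform-enterprise hashicorp/terraform-enterprise \
  -n <TFE_NAMESPACE> \
  --reuse-values \
  --set-string env.variables.TFE_CAPACITY_MEMORY=4096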
Agent Issues
If the Job was created and the pod successfully scheduled, check for any failures in the logs from the TFC Agent, which can occur while the agent process sets up its run environment and registers with the Terraform Enterprise platform. These logs are streamed to the task-worker and are available in the logs of the Terraform Enterprise container; however, it is also possible to view them directly by enabling the TFE_RUN_PIPELINE_KUBERNETES_DEBUG_ENABLED setting and viewing the logs of the agent pod.
kubectl logs -n <RELEASE_NAMESPACE>-agents -l run_id=<RUN_EXTERNAL_ID>