Problem
When you run HCP Terraform agents as Kubernetes deployments with single execution mode enabled, the agent pods may enter a CrashLoopBackOff state. This known issue can cause Terraform runs to become stuck, especially when many runs are queued.
For more details on the underlying issue, refer to the article: HCP Terraform Agents on Kubernetes keep restarting with 'Back-off restarting failed container' warnings.
This guide provides a workaround using Kubernetes Jobs instead of Kubernetes deployments to prevent this issue.
Prerequisites
- A Kubernetes cluster where you can deploy resources.
- An HCP Terraform agent token.
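The example Job in this guide targets a namespace named tfc-agent, which must exist before the Job is applied. A minimal sketch of a manifest that creates it, assuming you have not already created the namespace by other means:

```yaml
# Namespace referenced by the example Job's metadata.namespace field.
# The name matches the example in this guide; adjust both together if you rename it.
apiVersion: v1
kind: Namespace
metadata:
  name: tfc-agent
```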
Procedure
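To prevent the CrashLoopBackOff issue, deploy the HCP Terraform agent as a Kubernetes Job. In single execution mode, the agent exits after it completes one run. Pods managed by a deployment must use restartPolicy: Always, so the kubelet restarts the exited container with an increasing back-off delay, which surfaces as CrashLoopBackOff. A Job with restartPolicy: Never instead treats the clean exit as a successful completion and creates a new pod for the next run.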
Below is an example Kubernetes Job configuration for running an HCP Terraform agent in single execution mode.
apiVersion: batch/v1
kind: Job
metadata:
  name: tfc-agent-job
  namespace: tfc-agent
spec:
  ## Set a large number of completions for the job to ensure it continues to create new pods for new runs.
  completions: 1000000
  ## This value determines how many agent pods can run in parallel.
  ## Set this to a value less than or equal to your HCP Terraform organization's agent limit.
  parallelism: 2
  ## (Optional) This TTL mechanism cleans up finished jobs after a specified time (in seconds).
  ## If unset, pods from completed or failed jobs are retained, which can pressure the API server.
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      containers:
        - image: hashicorp/tfc-agent:latest
          name: tfc-agent-job
          env:
            - name: TFC_AGENT_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: TFC_AGENT_TOKEN
              value: "<TFC_AGENT_TOKEN>"
            - name: TFC_AGENT_SINGLE
              value: "true"
            - name: TFC_AGENT_AUTO_UPDATE
              value: "disabled"
          resources:
            requests:
              cpu: '1'
              memory: 2G
            limits:
              cpu: '2'
              memory: 4G
      restartPolicy: Never
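
The example above places the agent token directly in the Job manifest. If you prefer to keep the token out of the manifest, one option is to store it in a Kubernetes Secret and reference it from the container's environment. The following is a sketch only: the Secret name tfc-agent-token and the key token are hypothetical names chosen for illustration and are not part of the original configuration.

```yaml
# Hypothetical Secret holding the agent token.
# The name "tfc-agent-token" and the key "token" are illustrative assumptions.
apiVersion: v1
kind: Secret
metadata:
  name: tfc-agent-token
  namespace: tfc-agent
type: Opaque
stringData:
  token: "<TFC_AGENT_TOKEN>"
```

With that Secret in place, the TFC_AGENT_TOKEN entry in the Job's env list could read the value through secretKeyRef instead of a literal value:

```yaml
# Fragment of the container's env list, not a complete manifest.
- name: TFC_AGENT_TOKEN
  valueFrom:
    secretKeyRef:
      name: tfc-agent-token   # hypothetical Secret name from the sketch above
      key: token              # hypothetical key name from the sketch above
```

The Secret must live in the same namespace as the Job, because a pod can only reference Secrets in its own namespace.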