The information contained in this article has been verified as up-to-date on the date of the original publication of the article. HashiCorp endeavors to keep this information up-to-date and correct, but it makes no representations or warranties of any kind, express or implied, about the ongoing completeness, accuracy, reliability, or suitability of the information provided.
All information contained in this article is for general information purposes only. Any reliance you place on such information as it applies to your use of your HashiCorp product is therefore strictly at your own risk.
Introduction
When HCP Terraform agents run as Kubernetes Deployments, agents with single execution mode enabled can end up in CrashLoopBackOff. In single execution mode the agent exits after handling one run, and the Deployment restarts the exited container with an increasing back-off delay. This is a known issue described in the article: Terraform Cloud Agents on Kubernetes keep restarting with 'Back-off restarting failed container' warnings. When many Terraform runs are queued, this can leave runs stuck. The workaround is to run the agents as a Kubernetes Job instead of a Kubernetes Deployment.
Expected Outcome
HCP Terraform agents with single execution mode enabled in Kubernetes clusters restart successfully after each run without entering CrashLoopBackOff, specifically Back-off restarting failed container.
Solution
Example configuration of a Kubernetes Job that runs tfc-agent in single execution mode:
apiVersion: batch/v1
kind: Job
metadata:
  name: tfc-agent-job
  namespace: tfc-agent
spec:
  completions: 1000000 # Set a large number of completions so that the Job keeps running.
  parallelism: 2 # This value equals the TFC agent count.
  ttlSecondsAfterFinished: 3600 # Delete finished and failed Jobs after 1 hour.
  template:
    spec:
      containers:
        - image: hashicorp/tfc-agent:latest
          name: tfc-agent-job
          env:
            - name: TFC_AGENT_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: TFC_AGENT_TOKEN
              value: "<TFC_AGENT_TOKEN>"
            - name: TFC_AGENT_SINGLE
              value: "single"
            - name: TFC_AGENT_AUTO_UPDATE
              value: "disabled"
          resources:
            requests:
              cpu: '1'
              memory: 2G
            limits:
              cpu: '2'
              memory: 4G
      restartPolicy: Never
- .spec.completions: the number of successful pod completions after which the Job is considered complete. Set it to a very large value, for example 1000000, so that the Job does not finish and stop starting new agent pods, which would leave new runs unable to execute.
- .spec.parallelism: the number of agent pods that run at the same time. If unset, the default is 1. For agent Jobs, set a value that is less than or equal to the HCP Terraform organization's agent limit.
- (Optional) .spec.ttlSecondsAfterFinished: a time-to-live mechanism that cleans up old Jobs that have finished execution. If unset, finished Jobs (either Complete or Failed) and their pods are kept in the system, and keeping too many finished Jobs puts pressure on the API server. There is no single recommended value; it depends on how long you want to retain finished Jobs before they are automatically deleted.
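The example Job passes the agent token as an inline value. As a minimal sketch, the token could instead be stored in a Kubernetes Secret and referenced from the container's env with secretKeyRef. The Secret name tfc-agent-token and the key token are illustrative assumptions, not names taken from this article.

```yaml
# Hypothetical Secret holding the agent token.
# The name "tfc-agent-token" and the key "token" are assumed for illustration.
apiVersion: v1
kind: Secret
metadata:
  name: tfc-agent-token
  namespace: tfc-agent
type: Opaque
stringData:
  token: "<TFC_AGENT_TOKEN>"
---
# Fragment only, not a complete manifest: this entry would replace the
# TFC_AGENT_TOKEN item under the Job's container "env" list.
env:
  - name: TFC_AGENT_TOKEN
    valueFrom:
      secretKeyRef:
        name: tfc-agent-token # must match the Secret name above
        key: token            # must match the key under stringData
```

The Secret must exist in the same namespace as the Job (tfc-agent in the example) before the agent pods start. Otherwise the pods will fail to start their containers until the Secret is created.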