Introduction
For autoscaling, we generally recommend the Terraform Cloud Operator for Kubernetes, which lets you create and manage HCP Terraform agents, agent pools, and tokens through a single Kubernetes custom resource. The operator uses a Custom Resource Definition (CRD) to manage HCP Terraform workspaces.
This article explains how to safely autoscale HCP Terraform agents using the Terraform Cloud Operator for Kubernetes.
Background: How the Cloud Operator for Kubernetes Works
The Cloud Operator determines how many agents are needed from the number of runs in the target workspaces whose status is plan_queued, apply_queued, planning, or applying.
When the kubelet terminates a pod, it first sends SIGTERM to the process in the container. If the process is still running when the grace period expires, the kubelet forces a shutdown with SIGKILL.
The HCP Terraform agent handles SIGTERM by completing its active run(s) before exiting. This ensures that long-running applies are not interrupted by scaling events, as long as they finish within the grace period.
Expected Outcome
Agent pools are autoscaled safely without interrupting any running applies.
Prerequisites
- A running Kubernetes cluster v1.16+ with the Terraform Cloud Operator for Kubernetes installed
- kubectl
Use Case
You need a safe and efficient autoscaling strategy for Terraform agent pools, so that long-running active runs are not affected when Kubernetes terminates agent pods (SIGTERM first, then a forced kill once the grace period expires).
Procedure
1. Enable autoscaling for your agents by setting the minReplicas and maxReplicas fields under the spec.autoscaling configuration in your AgentPool specification. These fields define the number of agents that the operator deploys based on the number of pending Terraform workloads.
2. Open the agentpool.yml file, add the following configuration, and adjust the values of minReplicas and maxReplicas to suit your requirements.
apiVersion: app.terraform.io/v1alpha2
kind: AgentPool
spec:
  ##...
  autoscaling:
    minReplicas: 0
    maxReplicas: 1
    cooldownPeriodSeconds: 300
    targetWorkspaces:
    - name: greetings
For longer applies (expected to run for more than 15 minutes), you can also set terminationGracePeriodSeconds for the agent pods in the PodSpec (agentDeployment.spec) to override the default value of 900 seconds (15 minutes).
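As a minimal sketch of the grace-period override described above, the fragment below extends the same AgentPool spec. It assumes the standard Kubernetes PodSpec field name terminationGracePeriodSeconds. The value of 1800 seconds (30 minutes) is only an illustrative placeholder, and the other PodSpec fields are elided in the same way as the rest of the spec.

```yaml
apiVersion: app.terraform.io/v1alpha2
kind: AgentPool
spec:
  ##...
  agentDeployment:
    spec:
      ##... (other PodSpec fields, such as containers, are elided here)
      # Assumed example value: 30 minutes. Choose a value longer than
      # your longest expected apply.
      terminationGracePeriodSeconds: 1800
  autoscaling:
    minReplicas: 0
    maxReplicas: 1
    cooldownPeriodSeconds: 300
    targetWorkspaces:
    - name: greetings
```

With a setting like this, a pod selected for scale-down would have up to 30 minutes after SIGTERM to finish its active run before the kubelet sends SIGKILL.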
3. Apply the updated AgentPool spec:
kubectl apply -n $NAMESPACE -f agentpool.yml
Additional Information
- GitHub Repository for Terraform Cloud Operator - https://github.com/hashicorp/terraform-cloud-operator/tree/v2.3.0
- API reference - https://github.com/hashicorp/terraform-cloud-operator/blob/main/docs/api-reference.md#agentdeploymentautoscaling