How to Autoscale an HCP Terraform Agent Pool – HashiCorp Help Center

Introduction

For autoscaling, we generally recommend the Terraform Cloud Operator for Kubernetes, which lets you create and manage HCP Terraform agents, agent pools, and agent tokens through a single Kubernetes custom resource. The operator uses Custom Resource Definitions (CRDs) to manage HCP Terraform resources, including agent pools.

This article explains how to safely autoscale HCP Terraform agents using the Terraform Cloud Operator for Kubernetes.

Background: How the Terraform Cloud Operator for Kubernetes Works

The operator determines how many agents are needed from the number of runs in the target workspaces that have a status of plan_queued, apply_queued, planning, or applying.

When the kubelet terminates a pod, it first sends SIGTERM to the process in the container. If the process is still running when the grace period expires, the kubelet forces a shutdown with SIGKILL.

When the HCP Terraform agent receives SIGTERM, it completes its active run(s) before exiting. This ensures that long-running applies are not interrupted by scaling events, provided they finish before the grace period expires.

Expected Outcome

Agent pools are autoscaled safely without interrupting any running applies.

Prerequisites

A running Kubernetes cluster v1.16+ with the Terraform Cloud Operator for Kubernetes installed
kubectl

Use Case

You need a safe and efficient autoscaling strategy for HCP Terraform agent pools, so that long-running active runs are not affected when Kubernetes forcibly destroys a pod after sending SIGTERM.

Procedure

You can enable autoscaling for your agents by setting the minReplicas and maxReplicas fields under spec.autoscaling in your AgentPool specification. These fields define the minimum and maximum number of agents that the operator deploys, based on the number of pending Terraform workloads.
1. Open the agentpool.yml file, add the following configuration, and adjust the values of minReplicas and maxReplicas to suit your requirements.

apiVersion: app.terraform.io/v1alpha2
kind: AgentPool
spec:
  ##...
  autoscaling:
    minReplicas: 0
    maxReplicas: 1
    cooldownPeriodSeconds: 300
    targetWorkspaces:
      - name: greetings

2. For longer applies (expected to run for more than 15 minutes), set terminationGracePeriodSeconds for the agent pods in the PodSpec (agentDeployment.spec) to override the default value of 900 seconds (15 minutes).
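
As a minimal sketch of this override, the fragment below sets a longer grace period on the agent pods. The value of 3600 seconds is only an illustrative assumption; choose a value that exceeds your longest expected apply. Other PodSpec fields (for example, containers) are elided and should stay as they are in your existing spec.

```yaml
apiVersion: app.terraform.io/v1alpha2
kind: AgentPool
spec:
  ##...
  agentDeployment:
    spec:
      ##... (other PodSpec fields, such as containers, unchanged)
      # Illustrative value (assumption): give active runs up to 1 hour
      # to finish after SIGTERM before the kubelet sends SIGKILL.
      terminationGracePeriodSeconds: 3600
```

With a setting like this, a pod selected for scale-down keeps running until its active run completes or the grace period expires, whichever comes first.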

3. Apply the updated AgentPool spec:

kubectl apply -n $NAMESPACE -f agentpool.yml

Additional Information

GitHub Repository for Terraform Cloud Operator - https://github.com/hashicorp/terraform-cloud-operator/tree/v2.3.0
API reference - https://github.com/hashicorp/terraform-cloud-operator/blob/main/docs/api-reference.md#agentdeploymentautoscaling