Problem
Autoscaling in Kubernetes enables dynamic scaling of workloads based on resource utilization. However, for versions of the Terraform Enterprise (TFE) Flexible Deployment Option (FDO) earlier than v202501-1, autoscaling configurations are not officially supported. Enabling autoscaling on these versions can lead to queued Terraform runs and degraded workflow efficiency.
Prerequisites
- Terraform Enterprise (TFE) Flexible Deployment Option (FDO) on Kubernetes, versions prior to v202501-1.
Cause
In TFE FDO versions earlier than v202501-1, cluster scaling events during periods of heavy load can trigger micro-outages at the workspace plan and apply level. These disruptions stem from how these unsupported versions manage resource allocation and workload distribution during scaling.
This can result in the following behaviors:
- Queued Runs: Ongoing and new runs are delayed and enter a queued state.
- Ineffective Node Scaling: Even when Kubernetes deploys additional nodes, TFE cannot efficiently rebalance workloads, preventing the expected performance improvement.
- Resource Allocation Lag: TFE struggles to recognize and utilize newly added resources during autoscaling events, which compounds delays.
Solutions
Solution 1: Upgrade Terraform Enterprise
The recommended solution is to upgrade your TFE instance to version v202501-1 or newer. In these versions, Kubernetes autoscaling is officially supported, and the underlying issues with resource allocation during scaling events have been resolved.
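If you deploy TFE with the official terraform-enterprise Helm chart, the upgrade typically amounts to pinning the new release in your overrides file and rolling it out with `helm upgrade`. The sketch below is illustrative: the repository and name values mirror the chart's defaults, and your overrides file name and registry may differ.

```yaml
# overrides.yaml (illustrative snippet; confirm against your existing values)
image:
  repository: images.releases.hashicorp.com
  name: hashicorp/terraform-enterprise
  tag: v202501-1  # first release with official Kubernetes autoscaling support
```

After updating the tag, run `helm upgrade` with your overrides file and confirm the new pods become healthy before relying on autoscaling.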
Solution 2: Adjust Worker Timeout for Existing Installations
If upgrading immediately is not possible, you can mitigate the issue in an unsupported configuration by adjusting a timeout setting. With autoscaling enabled in your Kubernetes environment, increase the TFE_RUN_PIPELINE_KUBERNETES_WORKER_TIMEOUT setting so that pending run pods can wait out node provisioning instead of timing out, in line with the recommended configuration for cluster autoscaling. This can reduce how often runs queue during scaling events; a sketch follows below.
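The following is a minimal sketch, assuming TFE is deployed with the official terraform-enterprise Helm chart, which passes plain environment variables through the `env.variables` map in your overrides file. The 300-second value is illustrative only; size it to how long your cluster typically takes to provision and ready a new node.

```yaml
# overrides.yaml (illustrative snippet, not a complete values file)
env:
  variables:
    # Allow pending run pods to wait for a new node to come online during
    # autoscaling events instead of timing out. Value is in seconds; 300 is
    # an assumed example, not an official recommendation.
    TFE_RUN_PIPELINE_KUBERNETES_WORKER_TIMEOUT: "300"
```

Apply the change with your usual `helm upgrade ... -f overrides.yaml` invocation. Note that this only reduces the symptom; it does not make autoscaling supported on pre-v202501-1 versions.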