Problem
Autoscaling in Kubernetes (K8s) enables dynamic scaling of workloads based on resource utilization. However, autoscaling configurations are not officially supported for Terraform Enterprise (TFE) Flexible Deployment Option (FDO) versions earlier than v202501-1. Implementing autoscaling on these versions may lead to queued Terraform runs, impacting workflow efficiency.
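For reference, the kind of autoscaling configuration affected here is typically a HorizontalPodAutoscaler targeting the TFE deployment. The sketch below is illustrative only; the deployment name, namespace, and replica/utilization values are assumptions, and running such a configuration is not officially supported on FDO versions earlier than v202501-1.

```yaml
# Illustrative only: an HPA targeting a TFE deployment (names and values are assumptions).
# On FDO versions earlier than v202501-1, this style of autoscaling is not officially
# supported and can contribute to queued runs during scale events.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: terraform-enterprise
  namespace: terraform-enterprise
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: terraform-enterprise
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
```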
Prerequisites
- Kubernetes (K8s) Terraform Enterprise (TFE) Flexible Deployment Option (FDO) versions below v202501-1
Cause
In FDO versions below v202501-1, cluster scaling events during periods of heavy load can trigger micro-outages at the workspace plan/apply level. These brief disruptions occur because of how FDO manages resource allocation and workload distribution in these unsupported versions. As a result:
- Runs Get Queued: Ongoing and new runs are delayed, entering a queued state.
- No Relief from New Nodes: Even when Kubernetes deploys additional nodes, FDO cannot efficiently rebalance workloads, preventing the expected performance improvement.
- Resource Allocation Lag: FDO struggles to recognize and utilize newly added resources during autoscaling events, compounding delays.
Solutions
- Upgrade to v202501-1 or later, where FDO Kubernetes autoscaling is officially supported.
- With autoscaling enabled in the K8s environment, adjust the TFE_RUN_PIPELINE_KUBERNETES_WORKER_TIMEOUT setting in accordance with the recommended configuration (see the sketch after this list).
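As a minimal sketch of the second option, the environment variable can be supplied through the Helm chart's overrides file. The `env.variables` key and the timeout value shown here are assumptions; confirm the exact key and the recommended value against the official FDO documentation for your TFE version.

```yaml
# overrides.yaml -- minimal sketch (key names and value are assumptions; verify
# against the FDO documentation for your TFE version).
env:
  variables:
    # Seconds a run pipeline worker may wait before timing out; raising this helps
    # runs survive brief disruptions while new nodes join during autoscaling events.
    TFE_RUN_PIPELINE_KUBERNETES_WORKER_TIMEOUT: "300"
```

Applying the change is then a standard Helm upgrade, for example `helm upgrade terraform-enterprise hashicorp/terraform-enterprise -f overrides.yaml` (release and chart names shown are assumptions for your environment).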