Introduction
This issue has been reported on Terraform Enterprise (TFE) version v202302-1 after migrating from a standalone TFE installation to an Active/Active installation.
Problem
Runs in the majority of workspaces fail when the "Remote" execution mode is used.
If the workspaces are switched to private/custom agents, their runs no longer fail.
Cause
There is a known issue in TFE version v202302-1 with the agent run pipeline mode, as opposed to the legacy mode.
Overview of possible solutions:
- Switch to legacy workers mode.
Rollback to legacy workers (command for standalone):
$ replicatedctl app-config set run_pipeline_mode --value 'legacy'
$ replicatedctl app apply-config
Rollback to legacy workers (command for active/active):
$ tfe-admin app-config -k <KEY> -v <VALUE>
If falling back to legacy mode does not make runs succeed in all workspaces, and some runs now show as killed, check the support bundle for the following error to confirm that the OS is killing Terraform:
terraform invoked oom-killer
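The check above can be sketched as a recursive search over an unpacked support bundle. This is a minimal sketch, not an official tool: the function name `find_oom_kills` and the idea that the bundle has already been extracted to a local directory are assumptions made for illustration.

```shell
# Hypothetical helper: search an extracted support bundle for OOM-killer events.
# $1 is assumed to be the directory where the support bundle was unpacked.
find_oom_kills() {
  bundle_dir="$1"
  # -r: search recursively, -I: skip binary files, -n: print line numbers
  grep -rIn "invoked oom-killer" "$bundle_dir"
}
```

Calling `find_oom_kills ./support-bundle` (a hypothetical path) prints every matching line with its file name and line number. A non-zero exit status means no match was found.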
- Once confirmed, adjust the worker capacity values to match your system's needs: capacity_concurrency, capacity_cpus, and capacity_memory.
For more details, see the Capacity and Performance Guide.
For Active/Active, run the following commands:
$ tfe-admin app-config -k capacity_memory -v <value>
$ tfe-admin app-config -k capacity_concurrency -v <value>
$ tfe-admin app-config -k capacity_cpus -v <value>
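Before applying new values, it may help to sanity-check that they fit on the host. The sketch below assumes that capacity_memory is a per-run limit in megabytes and that capacity_concurrency is the number of simultaneous runs, so their product approximates the worst-case memory used by runs. The function name `fits_in_memory` and the example numbers are hypothetical, and the check ignores the memory that TFE itself needs.

```shell
# Hypothetical sanity check: do the planned worker settings fit in host memory?
# Assumes capacity_memory is a per-run limit in MB (see the lead-in above).
# $1 = capacity_concurrency, $2 = capacity_memory (MB), $3 = host memory (MB)
fits_in_memory() {
  concurrency="$1"
  memory_mb="$2"
  host_mb="$3"
  # Worst case: every concurrent run reaches its memory limit at the same time.
  needed=$((concurrency * memory_mb))
  [ "$needed" -le "$host_mb" ]
}
```

For example, `fits_in_memory 10 512 16384 && echo "fits"` succeeds because 10 runs at 512 MB need 5120 MB, which is less than 16384 MB. Leave additional headroom for the TFE services themselves.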
- Restart TFE
Outcome
The runs are successful.
NOTE
If needed, a switch to the agent run pipeline mode can be done as follows:
# TFE standalone
$ replicatedctl app-config set run_pipeline_mode --value ''
$ replicatedctl app apply-config
# TFE Active/Active
$ tfe-admin app-config -k run_pipeline_mode -v ''