Introduction
This issue was reported on Terraform Enterprise (TFE) version v202302-1
after a migration from a standalone TFE installation to an Active/Active installation.
Problem
Runs in the majority of workspaces fail when the "Remote" execution mode is used.
When private/custom agents are used instead, the runs no longer fail.
Cause
There is a known issue in TFE version v202302-1 with the agent run pipeline mode (as opposed to the legacy mode).
Overview of possible solutions:
- Switch to legacy workers mode.
Rollback to legacy workers (command for standalone):
$ replicatedctl app-config set runpipelinemode --value 'legacy'
$ replicatedctl app apply-config
Rollback to legacy workers (command for Active/Active, where <KEY> is runpipelinemode and <VALUE> is 'legacy', matching the standalone command):
$ tfe-admin app-config -k <KEY> -v <VALUE>
If falling back to the legacy mode does not result in successful runs for all workspaces and some of them now show killed, check the support bundle for the following error to confirm that Terraform is being killed by the OS:
terraform invoked oom-killer
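The check above can be sketched as a recursive search through the support bundle. This is a minimal sketch that assumes the bundle has already been extracted to a local directory. The function name `find_oom_kills` and the `./support-bundle` path are hypothetical and are not part of TFE.

```shell
# Hypothetical helper: list log lines in an extracted support bundle
# that mention the kernel OOM killer.
find_oom_kills() {
  bundle_dir="$1"
  # -r: search recursively, -n: print line numbers, -i: ignore case
  grep -rni "invoked oom-killer" "$bundle_dir"
}

# Usage (the path is an assumption; point it at the extracted bundle):
# find_oom_kills ./support-bundle
```

Any matching line that names terraform as the invoking process suggests that the run was killed because the host ran out of memory.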
- Once confirmed, adjust the following worker capacity values based on your system needs: capacity_concurrency, capacity_cpus and capacity_memory.
For more details, see the Capacity and Performance Guide.
For Active/Active, run the following commands:
$ tfe-admin app-config -k capacity_memory -v <value>
$ tfe-admin app-config -k capacity_concurrency -v <value>
$ tfe-admin app-config -k capacity_cpus -v <value>
- Restart TFE
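When choosing the capacity values above, it can help to estimate how much memory the workers may need if every run slot is busy. The sketch below assumes that capacity_memory is a per-run limit in megabytes and that capacity_concurrency is the number of concurrent runs. Confirm both assumptions against the Capacity and Performance Guide. The function name `check_capacity` is hypothetical.

```shell
# Hypothetical helper: compare worst-case run memory with host memory.
# Arguments: capacity_concurrency, capacity_memory (assumed MB per run),
#            host memory in MB.
check_capacity() {
  concurrency="$1"
  memory_mb="$2"
  host_mb="$3"
  # Worst case: every concurrent run uses its full memory allowance.
  needed_mb=$((concurrency * memory_mb))
  if [ "$needed_mb" -gt "$host_mb" ]; then
    echo "over: runs may need ${needed_mb} MB, host has ${host_mb} MB"
    return 1
  fi
  echo "ok: runs may need ${needed_mb} MB, host has ${host_mb} MB"
}

# Usage with illustrative numbers:
# check_capacity 10 512 16384
```

The estimate covers only Terraform runs. The host also needs memory for the TFE services themselves, so leave headroom beyond the computed figure.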
Outcome
The runs are successful.
NOTE
If needed, a switch to the agent run pipeline mode can be done as follows:
# TFE standalone
$ replicatedctl app-config set runpipelinemode --value ''
$ replicatedctl app apply-config
# TFE Active/Active
$ tfe-admin app-config -k runpipelinemode -v ''