Recovering from Excessive VCS-Triggered Runs in Terraform Enterprise
Problem
A Version Control System (VCS) connection has triggered an excessive number of runs in Terraform Enterprise, negatively affecting system performance. Runs are stuck in a pending status because the system is flooded with multiple invalid requests from an errored VCS repository connection.
Prerequisites
- Terraform Enterprise installation
- Administrative access to the Terraform Enterprise host machine
Cause
A misconfiguration or error in a VCS repository connection can trigger a runaway process, flooding the run queue with invalid requests.
Solution
This procedure details how to stop the excessive runs and restore system stability.
- In the Terraform Enterprise UI, navigate to the workspace settings for the affected workspace and disable runs triggered by VCS connections.
- Navigate to the TFE Capacity settings page in the admin settings and set the concurrent run limit to 1. This limits how many new runs can start while you perform maintenance.
- Log in to the atlas container on the Terraform Enterprise host machine. The command varies based on your TFE version.
  For Terraform Enterprise v202205-1 (build 619) and newer, run the following command:
    $ sudo docker exec -it tfe-atlas /usr/bin/init.sh /app/scripts/wait-for-token -- bash -i -c 'cd /app && ./bin/rails c'
  For older versions, run the following command:
    $ sudo docker exec -it ptfe_atlas /usr/bin/init.sh /app/scripts/wait-for-token -- bash -i -c 'cd /app && ./bin/rails c'
- In the Rails console, execute the following Ruby script. This script iterates through all workspaces, finds any runs that are currently in a planning state, updates the status of each run and of its plan and apply to errored, and then unlocks the workspace.
    Workspace.find_each { |w|
      # Mark every run still in the planning state, plus its plan and apply, as errored.
      w.runs.planning.each { |r|
        r.update_attribute(:status, Run::ERRORED)
        r.plan.update_attribute(:status, Plan::ERRORED)
        r.apply.update_attribute(:status, Apply::ERRORED)
      }
      # Release the workspace lock.
      w.unlock!
    }
- After the script completes, restart the Terraform Enterprise application. If you use external services like an external PostgreSQL database, restart those as well.
- Once the system is stable, restore the concurrency limit to its original setting. We recommend increasing this number gradually to ensure stability.
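Because the container name differs between Terraform Enterprise versions (tfe-atlas or ptfe_atlas), one way to avoid guessing is to check which name is actually running before opening the Rails console. The following is a minimal sketch, not an official HashiCorp tool: pick_atlas_container and open_rails_console are hypothetical helper names introduced here, and the sketch assumes a bash shell, a grep that supports -m, and that docker ps on the host lists the atlas container under one of the two names given in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Terraform Enterprise).

# Reads container names on stdin, one per line, and prints the first line
# that exactly matches one of the two atlas container names from this article.
# Returns a non-zero status when neither name is present.
pick_atlas_container() {
  grep -E -x -m1 'tfe-atlas|ptfe_atlas'
}

# Opens the Rails console in whichever atlas container is running.
# Defined here but not called; run it manually on the host.
open_rails_console() {
  local container
  container="$(sudo docker ps --format '{{.Names}}' | pick_atlas_container)" || {
    echo "No atlas container found (expected tfe-atlas or ptfe_atlas)" >&2
    return 1
  }
  sudo docker exec -it "$container" /usr/bin/init.sh /app/scripts/wait-for-token -- bash -i -c 'cd /app && ./bin/rails c'
}
```

After sourcing the file on the Terraform Enterprise host, running open_rails_console should be equivalent to running the matching docker exec command from the login step by hand.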
Outcome
After completing the procedure, log in to the admin console and verify that the count of pending runs has returned to a normal level. The system should now be functioning correctly.
Additional Information
For more details on managing a Terraform Enterprise instance, refer to the official administration documentation.