VCS triggered excessive runs in Terraform Enterprise and recovery is necessary

April 18, 2023 03:46
Updated

Problem

A VCS job has errored and triggered an excessive number of runs in Terraform Enterprise (TFE), which is degrading system performance.

Prerequisites (if applicable)
- Terraform Enterprise

Cause
- VCS triggered a runaway process.

Solution

Runs are stuck in pending status because the system has been flooded with invalid requests from an errored VCS repository. To recover:

Disable triggered runs in the affected workspace
Set concurrent runs to 1
- https://TFE_HOSTNAME:8800/settings#capacity
Log in to the Atlas container and open a Rails console:

# Terraform Enterprise versions older than v202205-1(619)
$ sudo docker exec -it ptfe_atlas /usr/bin/init.sh /app/scripts/wait-for-token -- bash -i -c 'cd /app && ./bin/rails c'

# Terraform Enterprise v202205-1(619) and newer
$ sudo docker exec -it tfe-atlas /usr/bin/init.sh /app/scripts/wait-for-token -- bash -i -c 'cd /app && ./bin/rails c'

Locate the runs in the planning state, set them to ERRORED, and unlock each workspace.

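# Walk every workspace in batches.
Workspace.find_each { |w|
  # Mark each run in the planning state, and its plan and apply, as errored.
  w.runs.planning.each { |r|
    r.update_attribute(:status, Run::ERRORED)
    r.plan.update_attribute(:status, Plan::ERRORED)
    r.apply.update_attribute(:status, Apply::ERRORED)
  }
  # Release the workspace lock so new runs can be queued.
  w.unlock!
}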
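
Before running the loop above, it may help to see how many runs it would touch. The following is a minimal read-only sketch, not part of the official procedure. The helper name count_planning_runs is hypothetical, and it assumes each workspace responds to name and that runs.planning returns a countable collection, as the loop above implies.

```ruby
# Hypothetical dry-run helper: counts runs in the planning state per workspace
# without modifying anything. Assumes each workspace responds to `name` and
# `runs.planning`, as the update loop above implies.
def count_planning_runs(workspaces)
  counts = {}
  workspaces.each do |w|
    n = w.runs.planning.count
    # Only report workspaces that actually have affected runs.
    counts[w.name] = n if n > 0
  end
  counts
end

# In the Rails console this could be called as (assumption):
#   count_planning_runs(Workspace.find_each)
```

The result is a hash of workspace names to run counts, which gives a rough idea of how long the update loop will take and which workspaces are affected.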

Restart the TFE system once the Rails commands have completed. This includes external services, such as RDS (Postgres), if applicable.
Reset the concurrency to its original setting. It is recommended to increase this number gradually rather than restoring it in one step.

Outcome

Log in to the admin console and view the count of pending runs. The system should be functioning properly now.
