Problem
Agents registered to a Terraform Enterprise agent pool receive 500
errors when contacting Terraform Enterprise on the /api/agent/jobs
endpoint, while all other agent pools work as expected. Registering new agents to the affected pool does not resolve the issue.
Cause
This is the result of a bug in the job cancellation logic. The bug occurs when a run has been canceled but remains queued as the next job in the agent pool. Terraform Enterprise then tries to handle that job and fails, because the run's status is canceled. The database records for the job are not updated properly, which blocks subsequent runs from succeeding.
Therefore, runs stay in an invalid state. They have a canceled_at
value set but remain in an active state consistent with an apply.
Solution
Connect to the Terraform Enterprise host via SSH and launch the Rails Console. Once at the Rails Console prompt, run the following commands:
# Find Run records in an invalid state, that were canceled but are not "final"
runs = Run.not_final.where.not(canceled_at: nil).load

# If runs were found, force-cancel them. This will un-block other queued runs in the associated workspaces.
# Replace with the TFE user's email address to whom to attribute the force-cancel
me = User.find_by_email 'tfe-admin-email-HERE@domain.com'

# Force-cancel each run
runs.each{ |r| puts r.external_id; r.force_cancel!(user: me, comment: "Correction by TFE admin") }

# Find AgentJob records where the underlying Run was canceled, but the AgentJob is still active.
agent_jobs = AgentJob.ready.select{ |aj| aj.workload&.run&.canceled? }

# If AgentJobs were found, mark them completed. This will un-block other AgentJob items that are queued.
agent_jobs.each { |aj| aj.complete! }
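
Before force-canceling anything, it may help to list which runs the query matched. The following is a minimal sketch, not part of Terraform Enterprise: summarize_stuck_runs and FakeRun are hypothetical names introduced here for illustration. The helper only reads external_id and canceled_at, the two attributes the commands above already rely on. The stand-in records let the sketch run outside the Rails Console.

```ruby
# Hypothetical helper (an assumption, not a Terraform Enterprise API):
# build one summary line per record without modifying anything.
def summarize_stuck_runs(runs)
  runs.map { |r| "#{r.external_id} canceled_at=#{r.canceled_at}" }
end

# Stand-in record type so the sketch is runnable on its own.
# In the Rails Console, real Run records would be passed instead.
FakeRun = Struct.new(:external_id, :canceled_at)

sample = [FakeRun.new("run-EXAMPLE1", "2024-01-01 00:00:00 UTC")]
puts summarize_stuck_runs(sample)
```

In the Rails Console, the same helper could presumably be called on the real query result, for example puts summarize_stuck_runs(Run.not_final.where.not(canceled_at: nil)), to review the list before running force_cancel!.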