Problem
Agents registered to a specific Terraform Enterprise agent pool encounter 500 errors when contacting the /api/agent/jobs endpoint. All other agent pools function as expected, and registering new agents to the affected pool does not resolve the issue. This effectively halts all Terraform runs in that pool.
Prerequisites
- Administrative SSH access to the Terraform Enterprise host.
- An administrative user account within Terraform Enterprise.
Cause
This issue is caused by a bug in the job cancellation logic. When a run is canceled, it can remain queued as the next job for the agent pool in an invalid state. The database record for the job is not updated correctly, which blocks all subsequent runs from executing in that pool.
The affected runs have a canceled_at timestamp but remain in an active state, preventing the agent pool from processing new jobs.
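The blocking behavior described above can be illustrated with a toy model. This is a sketch in plain Ruby, not Terraform Enterprise code: ToyRun, ToyPool, the final flag, and the rule that the pool always dispatches the head of its queue are all assumptions made for illustration.

```ruby
# Toy model only: plain Ruby stand-ins, not Terraform Enterprise classes.
ToyRun = Struct.new(:external_id, :canceled_at, :final) do
  # Inconsistent record: a cancellation time is set, but the run is not final.
  def stuck?
    !canceled_at.nil? && !final
  end
end

# Hypothetical pool that always hands out the run at the head of its queue.
class ToyPool
  def initialize(runs)
    @queue = runs.dup
  end

  # Returns the next run to execute, nil if the queue is empty,
  # or raises if the run at the head of the queue is stuck.
  def next_job
    @queue.shift while @queue.first&.final # finished runs leave the queue
    head = @queue.first
    return nil if head.nil?
    raise "invalid job state for #{head.external_id}" if head.stuck?
    @queue.shift
  end
end

stuck = ToyRun.new("run-aaa", Time.now, false) # canceled but still active
healthy = ToyRun.new("run-bbb", nil, false)    # waiting behind the stuck run
pool = ToyPool.new([stuck, healthy])

begin
  pool.next_job
rescue RuntimeError => e
  puts "pool blocked: #{e.message}"
end

# Once the stuck run is marked final, the healthy run can be dispatched.
stuck.final = true
puts pool.next_job.external_id
```

In this model the healthy run is never reached until the stuck record is corrected, which mirrors why registering new agents to the pool does not help.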
Solution
The solution involves accessing the Terraform Enterprise Rails console to manually correct the database records for the stuck runs and agent jobs.
- Connect to the Terraform Enterprise host via SSH and launch the Rails console.
    $ tfe-admin console
- Find all run records that are in an invalid state (canceled but not marked as final). Execute the following command at the Rails console prompt.
    runs = Run.not_final.where.not(canceled_at: nil).load
- If the previous command found invalid runs, define an administrative user to attribute the correction to. Replace tfe-admin-email@example.com with the email address of a valid TFE administrator.
    me = User.find_by_email 'tfe-admin-email@example.com'
- Force-cancel each invalid run to unblock the associated workspaces.
    runs.each { |r| puts r.external_id; r.force_cancel!(user: me, comment: "Correction by TFE admin") }
- Find any AgentJob records where the underlying run was canceled but the job itself is still active.
    agent_jobs = AgentJob.ready.select { |aj| aj.workload&.run&.canceled? }
- Mark these agent jobs as complete. This action will unblock other queued jobs in the agent pool.
    agent_jobs.each { |aj| aj.complete! }
After completing these steps, the agent pool should resume processing new runs.
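Before running the commands that modify records, it may help to preview what they would touch. The following is a minimal sketch of read-only helpers. The helper names and the Fake* stand-in objects are hypothetical and exist only so the snippet runs outside the Rails console. The only methods assumed on real records are the ones already used in the steps above (external_id, canceled_at, workload, run, canceled?).

```ruby
# Read-only helpers: they call only reader methods and never save a record.
# The helper names are hypothetical and not part of Terraform Enterprise.

# One line per run, showing its external ID and cancellation time.
def stuck_run_report(runs)
  runs.map { |r| "#{r.external_id} canceled_at=#{r.canceled_at.inspect}" }
end

# Agent jobs whose underlying run was canceled (same predicate as the steps above).
def jobs_with_canceled_runs(agent_jobs)
  agent_jobs.select { |aj| aj.workload&.run&.canceled? }
end

# Stand-in objects so the sketch runs outside the Rails console.
FakeRun = Struct.new(:external_id, :canceled_at) do
  def canceled?
    !canceled_at.nil?
  end
end
FakeWorkload = Struct.new(:run)
FakeJob = Struct.new(:id, :workload)

canceled_run = FakeRun.new("run-abc123", Time.now)
active_run = FakeRun.new("run-def456", nil)
jobs = [
  FakeJob.new(1, FakeWorkload.new(canceled_run)),
  FakeJob.new(2, FakeWorkload.new(active_run)),
  FakeJob.new(3, nil), # a job without a workload is skipped by safe navigation
]

puts stuck_run_report([canceled_run])
puts jobs_with_canceled_runs(jobs).map(&:id).inspect
```

At the Rails console, the helpers could be given the same queries used in the steps, for example stuck_run_report(Run.not_final.where.not(canceled_at: nil)) and jobs_with_canceled_runs(AgentJob.ready). Running them again after the correction should return empty results if the stuck records were cleared.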
Additional Information
- For more details on accessing the console, refer to the guide on How To Access the Terraform Enterprise Rails Console.