Problem
Your Terraform Enterprise (TFE) instance experiences incidents where Terraform plans become stuck in a queued state and are not executed. This requires manual intervention to resolve and delays critical infrastructure operations. You may observe log entries related to the WorkspaceDestroyWorker job indicating database deadlocks.
An example of a relevant error message in the Sidekiq logs includes:
2025-07-16T17:24:28.274464000Z {"component":"sidekiq","log":"2025-07-16 17:24:27 [ERROR] record_type=Workspace record_id=15449 exception=PG::TRDeadlockDetected: ERROR: deadlock detected"}
2025-07-16T17:24:28.274538000Z {"component":"sidekiq","log":"DETAIL: Process 810204 waits for ShareLock on transaction 72024777; blocked by process 810199."}
2025-07-16T17:24:28.274613000Z {"component":"sidekiq","log":"Process 810199 waits for ShareLock on transaction 72024482; blocked by process 810204."}
2025-07-16T17:24:28.274811000Z {"component":"sidekiq","log":" msg=WorkspaceDestroyWorker failure worker=WorkspaceDestroyWorker"}
2025-07-16T17:24:28.274911000Z {"component":"sidekiq","log":"2025-07-16 17:24:27 [INFO] workspace_id=15449 elapsed_time=550.586048422s msg=Worker finish worker=WorkspaceDestroyWorker"}
Prerequisites
This issue is most prevalent in versions of TFE released before v202406-1; that release and later ones include architectural improvements to background job processing.
Cause
The root cause of this issue is a large number of workspace deletion jobs being sent to TFE in a short period. When multiple workspaces that share tags are deleted concurrently, it can lead to deadlocks in the PostgreSQL database, specifically on the tags relation.
These database deadlocks cause the WorkspaceDestroyWorker jobs in Sidekiq to fail, time out, and retry. The volume of these failing and retrying jobs overwhelms the Sidekiq queue, effectively creating a gridlock that prevents other higher-priority jobs, such as Terraform plans and applies, from being processed.
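For intuition, the following minimal Python sketch reproduces the mechanism in isolation: two concurrent transactions lock the same shared rows in opposite order, and PostgreSQL resolves the cycle by aborting one of them, which is what surfaces as PG::TRDeadlockDetected in the Sidekiq logs above. This is not TFE's actual schema; the tags table, row IDs, and connection string are hypothetical stand-ins for workspaces that share tags.
```python
# Minimal deadlock illustration: two transactions lock the same two shared
# rows in opposite order. NOT TFE's schema; the "tags" table, row IDs, and
# DSN are hypothetical stand-ins for workspaces that share tags.
import threading
import time

import psycopg2

DSN = "dbname=demo user=demo"  # hypothetical local database


def delete_shared_tags(first_id, second_id):
    """Lock two shared tag rows in the given order, as a bulk delete might."""
    conn = psycopg2.connect(DSN)
    try:
        with conn, conn.cursor() as cur:
            cur.execute("SELECT id FROM tags WHERE id = %s FOR UPDATE", (first_id,))
            time.sleep(1)  # let the other transaction grab the opposite row
            cur.execute("SELECT id FROM tags WHERE id = %s FOR UPDATE", (second_id,))
    except psycopg2.errors.DeadlockDetected as exc:
        # PostgreSQL aborts one transaction; in TFE this surfaces as
        # PG::TRDeadlockDetected and a failed WorkspaceDestroyWorker job.
        print("deadlock detected:", exc)
    finally:
        conn.close()


a = threading.Thread(target=delete_shared_tags, args=(1, 2))
b = threading.Thread(target=delete_shared_tags, args=(2, 1))
a.start(); b.start(); a.join(); b.join()
```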
Solutions
There are two approaches to addressing this issue: immediate mitigation for an active incident and long-term preventative measures.
Solution 1: Mitigate an Active Incident with HashiCorp Support
Warning: The steps to resolve an active gridlock incident are high-risk and can cause further issues if not performed correctly. Do not attempt these actions without direct guidance from HashiCorp Support.
- Contact HashiCorp Support: If you observe symptoms of plan queuing and suspect a gridlock, immediately open an Urgent support ticket. The support team has an internal Standard Operating Procedure (SOP) for this specific scenario.
- Gather Diagnostics: As soon as possible, capture monitoring diagnostics and generate a TFE support bundle. This information is critical for the support and engineering teams to analyze the state of your instance.
- Follow Guided Intervention: HashiCorp Support will guide you through the necessary steps to resolve the gridlock. These actions may include:
- Temporarily disabling ingress traffic from the load balancer.
- Draining currently executing runs.
- Canceling queued jobs to clear the backlog (a hedged API sketch follows this list).
- Carefully pausing and clearing specific Sidekiq queues.
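As a reference for the "canceling queued jobs" step, the sketch below shows one way to cancel queued runs for a single workspace through TFE's public Runs API. Run anything like this only under direct HashiCorp Support guidance during an active incident; the host, token, workspace ID, and the set of statuses treated as "queued" are placeholders and assumptions, not values taken from this article.
```python
# Hedged sketch: cancel runs that appear queued for one workspace via the
# public Runs API. Only under HashiCorp Support guidance.
import os

import requests

TFE_HOST = os.environ["TFE_HOST"]          # placeholder, e.g. "tfe.example.com"
TOKEN = os.environ["TFE_TOKEN"]            # API token with sufficient access
WORKSPACE_ID = os.environ["WORKSPACE_ID"]  # placeholder, e.g. "ws-abc123"

HEADERS = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/vnd.api+json",
}
# Assumption: confirm with HashiCorp Support which statuses to treat as queued.
QUEUED_STATUSES = {"pending", "plan_queued"}


def queued_runs(workspace_id):
    """Yield IDs of runs in the workspace whose status looks queued."""
    url = f"https://{TFE_HOST}/api/v2/workspaces/{workspace_id}/runs"
    params = {"page[size]": 100}
    while url:
        resp = requests.get(url, headers=HEADERS, params=params)
        resp.raise_for_status()
        body = resp.json()
        for run in body["data"]:
            if run["attributes"]["status"] in QUEUED_STATUSES:
                yield run["id"]
        url = body.get("links", {}).get("next")
        params = None  # the "next" link already carries paging parameters


def cancel_run(run_id):
    """Ask TFE to cancel a single run."""
    url = f"https://{TFE_HOST}/api/v2/runs/{run_id}/actions/cancel"
    requests.post(url, headers=HEADERS).raise_for_status()


if __name__ == "__main__":
    for run_id in queued_runs(WORKSPACE_ID):
        print("canceling", run_id)
        cancel_run(run_id)
```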
Solution 2: Implement Preventative and Corrective Measures
To prevent future occurrences of this issue, you should adopt a set of preventative and corrective actions.
- Upgrade Terraform Enterprise: Plan an upgrade to a recent version of TFE (v202406-1 or later). Newer versions include enhancements to how Sidekiq handles bulk jobs at scale, which helps mitigate the risk of gridlock.
- Limit Bulk Workspace Deletion: When you must perform bulk deletions, limit the batch size to 25 workspaces or fewer at a time. This reduces the likelihood of database contention and deadlocks (a batching sketch follows this list).
- Monitor Sidekiq Queues During Deletions: After running a deletion batch, monitor the Sidekiq queues in the admin dashboard (provided by HashiCorp Support). Ensure the queue sizes are decreasing and approaching zero before you execute the next batch. This confirms that the system is processing the jobs successfully.
- Use Maintenance Windows: Schedule any large-scale workspace management activities, such as bulk deletions, during a planned maintenance window to minimize the impact on users.
- Enable Proactive Monitoring: Implement monitoring for your Terraform Enterprise instance's Sidekiq queues and Redis performance. This allows you to identify potential issues early and provides valuable data for investigation if problems arise (a queue-depth monitoring sketch follows this list).
- Open a Proactive Support Ticket: Before you perform planned bulk deletion activities, notify HashiCorp Support by opening a proactive ticket. Specify the maintenance window and describe the planned activity.
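As a starting point for the batching guidance above, here is a minimal sketch that deletes workspaces in batches of 25 or fewer through the public Workspaces API and pauses between batches so the Sidekiq queues can drain. The host, token, organization, workspace names, and pause length are placeholders; treat the pause as a floor and confirm in the admin dashboard that queues are shrinking before starting the next batch.
```python
# Hedged sketch: delete workspaces in small batches via the Workspaces API,
# pausing between batches to let Sidekiq queues drain.
import os
import time

import requests

TFE_HOST = os.environ["TFE_HOST"]  # placeholder, e.g. "tfe.example.com"
TOKEN = os.environ["TFE_TOKEN"]
ORG = os.environ["TFE_ORG"]

HEADERS = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/vnd.api+json",
}
BATCH_SIZE = 25       # keep each batch at 25 workspaces or fewer
PAUSE_SECONDS = 300   # assumption: tune to how quickly your queues drain


def delete_workspace(name):
    """Delete a single workspace by name (irreversible)."""
    url = f"https://{TFE_HOST}/api/v2/organizations/{ORG}/workspaces/{name}"
    requests.delete(url, headers=HEADERS).raise_for_status()


def delete_in_batches(workspace_names):
    for start in range(0, len(workspace_names), BATCH_SIZE):
        batch = workspace_names[start:start + BATCH_SIZE]
        for name in batch:
            print("deleting", name)
            delete_workspace(name)
        # Before the next batch, confirm in the Sidekiq admin dashboard that
        # queue sizes are shrinking toward zero; this sleep is only a floor.
        time.sleep(PAUSE_SECONDS)


if __name__ == "__main__":
    delete_in_batches(["example-ws-1", "example-ws-2"])  # placeholder names
```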
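For the proactive monitoring item, one lightweight option is sampling Sidekiq queue depth directly from Redis, since Sidekiq stores each queue as a Redis list under queue:<name> and tracks queue names in the queues set. Whether direct Redis access is appropriate for your deployment, the connection details, and the exact queue names (e.g., cleanup vs. low) are assumptions to confirm with HashiCorp Support; the sketch below is illustrative only.
```python
# Hedged sketch: sample Sidekiq queue depths from Redis at a fixed interval.
import time

import redis  # pip install redis

# Placeholder connection details; TFE's Redis may require auth/TLS and may
# prefix Sidekiq keys with a namespace, in which case adjust the key names.
r = redis.Redis(host="localhost", port=6379)


def sidekiq_queue_depths(client):
    """Return a {queue_name: depth} mapping from Sidekiq's standard Redis keys."""
    names = sorted(name.decode() for name in client.smembers("queues"))
    return {name: client.llen(f"queue:{name}") for name in names}


if __name__ == "__main__":
    while True:
        print(time.strftime("%H:%M:%S"), sidekiq_queue_depths(r))
        # Alert (or pause bulk deletions) if a queue keeps growing instead of
        # draining toward zero.
        time.sleep(30)
```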
Outcome
By upgrading TFE and implementing the preventative measures for bulk workspace deletions, your instance will be more resilient to this type of incident. The risk of Sidekiq queue gridlock will be significantly reduced, ensuring that Terraform plans and other critical jobs are processed in a timely manner.
Additional Information
- Sidekiq Queue Names: Some TFE versions introduced changes to Sidekiq queue names (e.g., cleanup is now low). The fundamental principles of managing and monitoring these queues remain the same.
- Pausing Queues: Pausing Sidekiq queues is a dangerous operation that can cause a gridlock situation if not performed with extreme care. This action should only be taken under the direct guidance of HashiCorp Support during an active incident.