Problem
Your Terraform Enterprise instance experiences incidents where Terraform plans become stuck in a queued state and are not executed. This situation requires manual intervention to resolve and delays critical infrastructure operations.
The Sidekiq logs may show entries for the WorkspaceDestroyWorker job that indicate database deadlocks. For example:
{"component":"sidekiq","log":"[ERROR] record_type=Workspace record_id=15449 exception=PG::TRDeadlockDetected: ERROR: deadlock detected"}
{"component":"sidekiq","log":"DETAIL: Process 810204 waits for ShareLock on transaction 72024777; blocked by process 810199."}
{"component":"sidekiq","log":"Process 810199 waits for ShareLock on transaction 72024482; blocked by process 810204."}
{"component":"sidekiq","log":" msg=WorkspaceDestroyWorker failure worker=WorkspaceDestroyWorker"}
{"component":"sidekiq","log":"[INFO] workspace_id=15449 elapsed_time=550.586048422s msg=Worker finish worker=WorkspaceDestroyWorker"}Prerequisites
Prerequisites
This issue is most prevalent in versions of Terraform Enterprise released before v202406-1. That release and later versions include architectural improvements to background job processing that reduce the risk of this issue.
Cause
The root cause of this issue is a large number of workspace deletion jobs being sent to Terraform Enterprise in a short period. When multiple workspaces that share tags are deleted concurrently, it can lead to deadlocks in the PostgreSQL database, specifically on the tags relation.
These database deadlocks cause the WorkspaceDestroyWorker jobs in Sidekiq to fail, time out, and retry. The volume of these failing and retrying jobs overwhelms the Sidekiq queue, effectively creating a gridlock that prevents other higher-priority jobs, such as Terraform plans and applies, from being processed.
Solutions
There are two approaches to addressing this issue: immediate mitigation for an active incident and long-term preventative measures.
Solution 1: Mitigate an Active Incident
Warning: The steps to resolve an active gridlock incident are high-risk and can cause further issues if not performed correctly. Do not attempt these actions without direct guidance from your designated support provider.
- Engage HashiCorp support: If you observe symptoms of plan queuing and suspect a gridlock, immediately contact your support provider for assistance. They can provide guidance based on established procedures for this specific scenario.
- Gather diagnostics: As soon as possible, capture monitoring diagnostics and generate a Terraform Enterprise support bundle. This information is critical for analyzing the state of your instance.
- Follow guided intervention: HashiCorp support will guide you through the necessary steps to resolve the gridlock. These actions may include:
  - Temporarily disabling ingress traffic from the load balancer.
  - Draining currently executing runs.
  - Canceling queued jobs to clear the backlog (a hedged example of canceling queued runs through the API appears after this list).
  - Carefully pausing and clearing specific Sidekiq queues.
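For illustration only, and only when your support provider directs you to clear queued work: the sketch below shows one way to cancel a workspace's queued runs through the standard Runs API. The hostname, API token, workspace ID, and the exact run statuses treated as queued are assumptions to confirm for your environment and Terraform Enterprise version.

```python
import requests

# Assumed values: replace with your instance hostname, an API token with
# sufficient permissions, and the workspace you were told to clear.
TFE_HOST = "tfe.example.com"
TOKEN = "REPLACE_ME"
WORKSPACE_ID = "ws-XXXXXXXX"
HEADERS = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/vnd.api+json",
}
# Run states treated as "queued" in this sketch; verify the exact states with support.
QUEUED_STATES = {"pending", "plan_queued", "apply_queued"}

def cancel_queued_runs(workspace_id):
    """Request cancellation of runs in the given workspace that are still waiting to execute."""
    url = f"https://{TFE_HOST}/api/v2/workspaces/{workspace_id}/runs"
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    for run in response.json()["data"]:
        if run["attributes"]["status"] in QUEUED_STATES:
            cancel_url = f"https://{TFE_HOST}/api/v2/runs/{run['id']}/actions/cancel"
            requests.post(cancel_url, headers=HEADERS, timeout=30).raise_for_status()
            print(f"Requested cancel for run {run['id']}")

if __name__ == "__main__":
    cancel_queued_runs(WORKSPACE_ID)
```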
Solution 2: Implement Preventative and Corrective Measures
To prevent future occurrences of this issue, adopt the following preventative and corrective actions.
- Upgrade Terraform Enterprise: Plan an upgrade to a recent version of Terraform Enterprise (minimum version v202406-1 or later). Newer versions include enhancements to how Sidekiq handles bulk jobs at scale, which helps mitigate the risk of gridlock.
- Limit bulk workspace deletion: When you must perform bulk deletions, limit the batch size to 25 workspaces or fewer at a time. This reduces the likelihood of database contention and deadlocks. A sketch of a batched deletion script appears after this list.
- Monitor Sidekiq queues during deletions: After running a deletion batch, monitor the Sidekiq queues in the admin dashboard. Ensure the queue sizes are decreasing and approaching zero before you execute the next batch. This confirms that the system is processing the jobs successfully; a queue-monitoring sketch also appears after this list.
- Use maintenance windows: Schedule any large-scale workspace management activities, such as bulk deletions, during a planned maintenance window to minimize the impact on users.
- Enable proactive monitoring: Implement monitoring for your Terraform Enterprise instance's Sidekiq queues and Redis performance. This allows you to identify potential issues early and provides valuable data for investigation if problems arise.
- Notify HashiCorp support: Before you perform planned bulk deletion activities, notify your support provider. Specify the maintenance window and describe the planned activity.
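As a sketch of the batch-size guidance above, the following Python example deletes workspaces through the standard Workspaces API in groups of 25 or fewer and pauses between groups. The hostname, API token, organization name, workspace list, and pause duration are assumptions; confirm queue health (see the next sketch) before starting each new batch.

```python
import time
import requests

# Assumed values: replace with your instance hostname, API token, and organization.
TFE_HOST = "tfe.example.com"
TOKEN = "REPLACE_ME"
ORG = "my-org"
HEADERS = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/vnd.api+json",
}
BATCH_SIZE = 25      # keep batches at 25 workspaces or fewer
PAUSE_SECONDS = 300  # wait between batches; confirm queues drain before continuing

def delete_workspaces_in_batches(workspace_names):
    """Delete workspaces in small batches to reduce database contention."""
    for start in range(0, len(workspace_names), BATCH_SIZE):
        batch = workspace_names[start:start + BATCH_SIZE]
        for name in batch:
            url = f"https://{TFE_HOST}/api/v2/organizations/{ORG}/workspaces/{name}"
            response = requests.delete(url, headers=HEADERS, timeout=30)
            print(f"DELETE {name}: HTTP {response.status_code}")
        if start + BATCH_SIZE < len(workspace_names):
            # Give Sidekiq time to work through the deletion jobs before the next batch.
            time.sleep(PAUSE_SECONDS)

if __name__ == "__main__":
    # Hypothetical list of workspaces slated for deletion.
    delete_workspaces_in_batches(["ws-to-delete-1", "ws-to-delete-2"])
```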
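If you have access to the Redis instance that backs Sidekiq, you can also watch queue depth directly: Sidekiq stores each queue as a Redis list named queue:&lt;name&gt;, so its length reports the number of pending jobs. This minimal sketch assumes hypothetical connection details and the default and low queue names, which vary by Terraform Enterprise version and deployment (see Additional Information below).

```python
import time
import redis  # requires the redis-py package

# Assumed connection details for the Redis instance backing Sidekiq.
REDIS_HOST = "redis.internal.example.com"
REDIS_PORT = 6379
REDIS_PASSWORD = None
# Queue names differ between versions; see the note in Additional Information.
QUEUES = ["default", "low"]

def report_queue_depths(client):
    """Print the current length of each Sidekiq queue (stored as a Redis list)."""
    for queue in QUEUES:
        depth = client.llen(f"queue:{queue}")
        print(f"queue:{queue} -> {depth} pending jobs")

if __name__ == "__main__":
    client = redis.Redis(host=REDIS_HOST, port=REDIS_PORT, password=REDIS_PASSWORD)
    # Poll periodically; queue depths should trend toward zero between deletion batches.
    while True:
        report_queue_depths(client)
        time.sleep(60)
```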
Outcome
By upgrading Terraform Enterprise and implementing preventative measures for bulk workspace deletions, your instance will be more resilient to this type of incident. The risk of Sidekiq queue gridlock will be significantly reduced, ensuring that Terraform plans and other critical jobs are processed in a timely manner.
Additional Information
- Sidekiq Queue Names: Some Terraform Enterprise versions introduced changes to Sidekiq queue names (for example, the cleanup queue is now named low). The fundamental principles of managing and monitoring these queues remain the same.
- Pausing Queues: Pausing Sidekiq queues is a high-risk operation that can itself cause a gridlock situation if not performed with extreme care. Only take this action under the direct guidance of your support provider during an active incident.
- For more details on monitoring your instance, refer to the official Terraform Enterprise documentation on administration and monitoring.