Introduction
Problem
At scale, Terraform Enterprise can have trouble moving jobs through run phases and can show general application slowness. Specifically:
- Jobs are enqueued but move slowly through the run stages (speculative plan, plan, sentinel, apply).
- The Terraform Enterprise user interface loads slowly, or queries through the Ruby console take an uncharacteristically long time.
Prerequisites
- A large-scale deployment of Terraform Enterprise
- Terraform Enterprise using Aurora PostgreSQL
Cause
- By default, two PostgreSQL parameters that interact with MVCC (multiversion concurrency control) are set too low for large-scale deployments with high concurrent utilization.
Identifying log lines:
2025-09-07 23:05:05 [ERROR] error=Sidekiq::Limiter::OverLimit msg=ERROR! AgentJobManager#_real_evaluate raised an exception organization_id=org-KnZRYP8ufuGbZxj5 worker=EvaluateAgentJobsJob
2025-09-05T18:37:37.311Z pid=908 tid=64zg class=TBMCancelWorkspaceRunWorker jid=c7cd14a65bff5943116214ce queue=runs uniquejobs=client until_executed=uniquejobs:35daa3a2a6f16045949d807aee91b825 WARN: {:message=>"SidekiqUniqueJobs Lock failed", :worker=>"ArchiveConfigVersionsWorker", :args=>["org-KnZRYP8ufuGbZxj5", 100], :lock_args=>["org-KnZRYP8ufuGbZxj5", 2025-09-05 18:40:00 UTC], :queue=>"cleanup"}
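To check how often these signatures appear, an exported log file can be scanned with a small shell helper. This is a minimal sketch: the function name count_symptom_lines is hypothetical, and the two fixed strings are taken from the sample log lines above.

```shell
# Hypothetical helper: count lines in a log file that contain either
# signature from the sample log lines above.
count_symptom_lines() {
  local log_file="$1"
  # -F treats patterns as fixed strings; -c prints the number of matching lines.
  # grep exits non-zero when nothing matches, so "|| true" keeps the helper
  # usable under "set -e" while still printing 0.
  grep -c -F \
    -e 'Sidekiq::Limiter::OverLimit' \
    -e 'SidekiqUniqueJobs Lock failed' \
    "$log_file" || true
}
```

Calling count_symptom_lines with the path to an exported log prints the number of matching lines. A count that keeps growing while runs are slow is consistent with the behavior described here, but it is not proof of the cause on its own.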
Identifying Behaviors:
Check for: deadlocks, autovacuum failures, and database object fragmentation.
Observable Parameter:
In the AWS RDS dashboard, look for LWLock:MultiXact. If this metric shows any utilization, consider increasing the following parameters.
Parameters to change:
- multixact_offset_buffers
- multixact_member_buffers
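The LWLock:MultiXact observation can also be cross-checked from inside the database. The sketch below assumes psql is installed and that a hypothetical DATABASE_URL variable holds a connection string for the Terraform Enterprise database. It groups sessions in pg_stat_activity by MultiXact-related LWLock wait events. The PSQL override is an illustrative addition so the command can be dry-run with echo.

```shell
# Hypothetical helper: list sessions currently waiting on MultiXact LWLocks.
# Assumes DATABASE_URL is a connection string supplied by the caller.
check_multixact_waits() {
  local sql
  sql="SELECT wait_event, count(*) AS sessions
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock'
  AND wait_event ILIKE 'MultiXact%'
GROUP BY wait_event
ORDER BY sessions DESC;"
  # -X skips ~/.psqlrc; -d accepts a connection URI; -c runs a single command.
  "${PSQL:-psql}" -X -d "${DATABASE_URL:?set DATABASE_URL first}" -c "$sql"
}
```

An empty result only means that no session was waiting at that instant, so it is worth sampling several times while the slowness is occurring.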
Overview of possible solutions
Solutions:
Adjust the offset and member buffer sizes, in pages, as needed until performance improves:
multixact_offset_buffers = 128
multixact_member_buffers = 256
These settings will need to be implemented in a scheduled maintenance window.
Workaround -- to alleviate pressure until the maintenance window:
- Implement a cron job to execute VACUUM FREEZE VERBOSE; every 6 hours.
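One way to implement this workaround is a small wrapper function invoked from a cron-scheduled script. This is a sketch under stated assumptions: psql is installed on the host running cron, a hypothetical DATABASE_URL variable holds a connection string for the Terraform Enterprise database, and the script and log paths in the crontab comment are placeholders.

```shell
# Hypothetical wrapper: run the workaround statement once.
# Assumes DATABASE_URL is a connection string supplied by the caller.
run_vacuum_freeze() {
  # ON_ERROR_STOP=1 makes psql stop at the first SQL error instead of continuing.
  "${PSQL:-psql}" -X -v ON_ERROR_STOP=1 \
    -d "${DATABASE_URL:?set DATABASE_URL first}" \
    -c 'VACUUM FREEZE VERBOSE;'
}

# Example crontab entry (minute 0 of every 6th hour). The script path is a
# placeholder; the script is assumed to define and then call run_vacuum_freeze.
# 0 */6 * * * /usr/local/bin/tfe-vacuum-freeze.sh >> /var/log/tfe-vacuum-freeze.log 2>&1
```

Redirecting output to a log file keeps the VERBOSE report from each run available for review.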
Outcome
Application performance is restored, jobs progress through the run stages, and deadlocks are resolved.