Agent pool shows IDLE agents while runs are waiting in the queue

Introduction

Problem

Runs are waiting in the queue to be executed while, at the same time, agents in the agent pool show as IDLE.

Example

There is an agent pool with 120 agents. In Terraform Enterprise, 120 workspaces are connected through VCS to a monorepo. A change is made that triggers a run in each of the 120 workspaces to apply the change. Each run should take around 3 to 4 minutes to complete.

The result looks something like the following when viewing the runs:

after  30 seconds:  23 running - 10 on-hold -  0 completed
after  60 seconds:  49 running - 25 on-hold -  0 completed
after  90 seconds:  66 running - 39 on-hold -  0 completed
after 120 seconds:  86 running - 34 on-hold -  0 completed
after 150 seconds: 106 running - 13 on-hold -  1 completed
after 180 seconds: 108 running -  5 on-hold -  7 completed
after 210 seconds:  93 running -  0 on-hold - 27 completed

It takes a long time before the number of simultaneous runs gets close to 100, and it never reaches 120 within a minute.
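To make the gap concrete, the short sketch below takes the sample figures from this example and works out how many agents sat idle at each point while runs were still on hold. It assumes one run per agent and a pool of exactly 120 agents; the function names are illustrative only and are not part of Terraform Enterprise.

```python
POOL_SIZE = 120  # agents in the pool, taken from the example

# (seconds elapsed, running, on hold, completed), copied from the sample above
SAMPLES = [
    (30, 23, 10, 0),
    (60, 49, 25, 0),
    (90, 66, 39, 0),
    (120, 86, 34, 0),
    (150, 106, 13, 1),
    (180, 108, 5, 7),
    (210, 93, 0, 27),
]


def idle_with_work_waiting(running, on_hold, pool_size=POOL_SIZE):
    """Return (idle agents, on-hold runs that an idle agent could have taken)."""
    idle = pool_size - running  # assumption: each agent executes one run at a time
    return idle, min(idle, on_hold)


def summarize(samples=SAMPLES):
    """Return one (seconds, idle agents, missed runs) tuple per sample."""
    return [(t, *idle_with_work_waiting(r, h)) for t, r, h, _ in samples]


if __name__ == "__main__":
    for seconds, idle, missed in summarize():
        print(f"{seconds:>3}s: {idle:>3} idle agents, {missed:>2} on-hold runs that could have started")
```

At 60 seconds, for instance, 71 agents are idle while 25 runs are on hold, so every one of those 25 runs could already have been running.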

Cause

By default, each agent in an agent pool checks every 30 seconds whether there is a run on the queue for it to take. The 120 agents do not all check at the same moment. When an agent finds a run, it tries to dequeue that run in order to execute it. This attempt can fail because another agent has already dequeued the same run. The agent that failed to dequeue the run waits another 30 seconds before retrying.

The more agents there are, the higher the chance that an agent fails to dequeue a run, and the 30-second wait before each retry becomes a problem.
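The effect described above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a description of how Terraform Enterprise actually schedules runs: agents poll at a fixed interval with a random starting offset, all agents that poll within the same 100 ms tick target the same run, exactly one of them wins, and the losers wait a full interval before retrying. Runs are assumed not to finish within the simulated window, so only the ramp-up is modelled. All names and the 100 ms tick size are hypothetical.

```python
import random

TICKS_PER_SECOND = 10  # assumption: polls landing in the same 100 ms tick collide


def simulate(num_agents, num_runs, poll_interval_s, horizon_s, phases=None, seed=0):
    """Return a list holding the number of runs started by the end of each second."""
    interval = poll_interval_s * TICKS_PER_SECOND
    if phases is None:
        # Each agent starts polling at a random offset within one interval.
        rng = random.Random(seed)
        phases = [rng.randrange(interval) for _ in range(num_agents)]
    next_poll = list(phases)  # tick at which each idle agent polls next
    idle = set(range(num_agents))
    queued, started, per_second = num_runs, 0, []
    for tick in range(horizon_s * TICKS_PER_SECOND):
        pollers = sorted(a for a in idle if next_poll[a] == tick)
        if pollers and queued > 0:
            # Assumption: every poller in this tick goes for the same run; one wins.
            idle.remove(pollers[0])
            queued -= 1
            started += 1
            pollers = pollers[1:]
        for agent in pollers:
            # Losers (and agents that found an empty queue) wait a full interval.
            next_poll[agent] = tick + interval
        if (tick + 1) % TICKS_PER_SECOND == 0:
            per_second.append(started)
    return per_second


def seconds_until_all_started(per_second, num_runs):
    """Return the first second by which all runs have started, or None."""
    for second, started in enumerate(per_second, start=1):
        if started >= num_runs:
            return second
    return None


if __name__ == "__main__":
    for interval in (30, 5):
        ramp = simulate(120, 120, interval, 240)
        print(f"interval {interval:>2}s: started after 30s = {ramp[29]}, "
              f"all started after = {seconds_until_all_started(ramp, 120)}s")
```

In this model, two agents that collide on the same run cost the loser one full polling interval, so a shorter interval shortens every retry. The absolute numbers it prints depend entirely on the assumptions above and should not be read as predictions for a real agent pool.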

Solutions

Within Terraform Enterprise, the interval at which agents check the queue can be lowered from the default of 30 seconds. An agent that failed to dequeue a run will then retry sooner.

Make the following change to lower the agent polling interval:

Log in to Terraform Enterprise
Go to Admin settings --> Settings
Set the polling interval to 5 seconds
Save the settings

Outcome

With a lower polling interval, agents that fail to dequeue a run retry sooner. More agents execute runs at the same time, and the number of busy agents gets closer to the size of the pool.

Additional Information

For more about the agent settings, see the Terraform Enterprise documentation.