Introduction
As organizational usage of Terraform Enterprise grows, the number of allocated CPU cores may become insufficient. This article details how to identify and resolve CPU resource-related issues in Terraform Enterprise.
Problem
CPU resource contention may cause one or more of the following symptoms:
- The Terraform Enterprise web interface is slow to respond.
- Terraform runs take longer than usual to complete.
- Network packets are intermittently dropped, or network requests time out.
Cause
To confirm whether CPU contention is the cause, connect to the affected server over SSH and perform the following checks to calculate the CPU load percentage.
Step 1: Check CPU Load Average
Check the /proc/loadavg file to view the system's load average over the last 1, 5, and 15 minutes.
$ cat /proc/loadavg
Example output:
0.20 0.18 0.12 1/80 11206
The first three numbers are the load averages over the last 1, 5, and 15 minutes: the average number of tasks that were running, waiting for a CPU, or blocked in uninterruptible disk I/O. The fourth field shows the number of currently runnable tasks over the total number of tasks, and the fifth is the most recently assigned process ID (PID). For this diagnosis, focus on the first three numbers and select the one that best matches the time of your observation. For example, if a slow Terraform run occurred 5 minutes ago, use the second number (the 5-minute load average).
Step 2: Count CPU Cores
Run the following command to count the number of CPU cores in the system.
$ grep -c '^processor' /proc/cpuinfo
Step 3: Calculate CPU Load Percentage
To determine the percentage load, divide the relevant load average from Step 1 by the number of CPU cores from Step 2.
For example, if the 5-minute load average was 4.0 and the system has 8 CPU cores, the CPU is at 50% load (4 / 8 = 0.5). If the load average was 10.0 with 8 cores, the CPU is at 125% load (10 / 8 = 1.25), indicating that the allocated CPU resources are insufficient and should be increased.
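The three steps above can be combined into a short script. This is a minimal sketch, not part of Terraform Enterprise: the function name load_percent is hypothetical, and the live check assumes a Linux host where /proc/loadavg and /proc/cpuinfo are readable.

```shell
#!/usr/bin/env bash
# Sketch: express a load average as a percentage of the available CPU cores.
# load_percent is a hypothetical helper name used only for this illustration.
load_percent() {
  # $1 = load average, $2 = number of CPU cores
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.0f\n", (load / cores) * 100 }'
}

# Worked examples from the text above.
load_percent 4.0 8    # prints 50
load_percent 10.0 8   # prints 125

# On a live Linux host, use the 5-minute load average (second field of /proc/loadavg).
if [ -r /proc/loadavg ] && [ -r /proc/cpuinfo ]; then
  read -r _ five_min _ < /proc/loadavg
  cores=$(grep -c '^processor' /proc/cpuinfo)
  if [ "$cores" -gt 0 ]; then
    echo "5-minute load: $(load_percent "$five_min" "$cores")% of $cores cores"
  fi
fi
```

A result that stays above 100 during the period you are investigating points to CPU contention.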
Solutions
Solution 1: Increase CPU Resources
The most effective long-term solution is to increase the CPU resources allocated to the Terraform Enterprise instance. As a guideline, use the highest load average number observed during peak usage as the minimum number of CPU cores to allocate. This ensures the system load does not consistently exceed 100%.
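Because cores are allocated in whole numbers, a fractional peak load average has to be rounded up to follow this guideline. The sketch below illustrates that arithmetic only. The helper name min_cores is hypothetical, and the peak value must come from your own monitoring.

```shell
#!/usr/bin/env bash
# Sketch: round a peak load average up to a whole number of CPU cores.
# min_cores is a hypothetical helper name used only for this illustration.
min_cores() {
  # $1 = highest load average observed during peak usage
  awk -v load="$1" 'BEGIN { c = int(load); if (c < load) c++; print c }'
}

min_cores 10.0   # prints 10
min_cores 10.3   # prints 11
```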
Solution 2: Reduce Run Concurrency (Temporary Measure)
If you cannot immediately add CPU resources, you can reduce the number of concurrent runs to lower the overall load. This can be useful if a single, CPU-intensive workspace must be deployed before you can schedule downtime for an upgrade. The default value of capacity_concurrency is 10.
For Standalone Deployments
Set a new concurrency limit.
$ replicatedctl app-config set capacity_concurrency <number_of_runs>
For Active/Active Deployments
Set a new concurrency limit. This setting syncs across all active nodes.
$ tfe-admin app-config set capacity_concurrency <number_of_runs>
Solution 3: Limit CPU Resources Per Run (Temporary Measure)
For Terraform Enterprise v202109-1 or later, you can limit the CPU resources allocated to each run. This can help CPU-heavy workspaces complete successfully, though it may increase their runtime. If the capacity_cpus setting is 0, resource usage is unlimited.
For Standalone Deployments
- Check the current value.
$ replicatedctl app-config get capacity_cpus
- Set a new CPU core limit per run.
$ replicatedctl app-config set capacity_cpus <number_of_cores>
For Active/Active Deployments
- Check the current value.
$ tfe-admin app-config get capacity_cpus
- Set a new CPU core limit per run.
$ tfe-admin app-config set capacity_cpus <number_of_cores>
Additional Information
Why do low CPU resources sometimes cause intermittent networking issues?
Terraform Enterprise uses Docker as a runtime environment, which relies on a virtual in-kernel network interface for communication. When CPU load exceeds 100%, the kernel must wait for running tasks to finish before it can process I/O operations in the network queue. This condition is known as I/O Wait (iowait).
The top man page describes iowait (the wa field) as time spent waiting for I/O completion. If requests wait in the I/O queue for more than a few seconds, network timeouts can occur within Terraform Enterprise.
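To see how much CPU time a host has spent in iowait, you can read the aggregate cpu line in /proc/stat. This is a minimal sketch under stated assumptions: the helper name iowait_percent is hypothetical, and the field order (user, nice, system, idle, iowait, irq, softirq, steal) is the standard Linux layout. The counters are cumulative since boot, so the result is a long-run average, not a current reading.

```shell
#!/usr/bin/env bash
# Sketch: compute the share of CPU time spent in iowait from a /proc/stat "cpu" line.
# iowait_percent is a hypothetical helper name used only for this illustration.
iowait_percent() {
  # $1 = aggregate "cpu" line; fields 2-9 are user nice system idle iowait irq softirq steal.
  # The guest fields (10 and 11) are skipped because they are already counted in user and nice.
  echo "$1" | awk '{
    total = 0
    for (i = 2; i <= 9 && i <= NF; i++) total += $i
    if (total > 0) printf "%.1f\n", ($6 / total) * 100
  }'
}

# Fixed sample line: 50 of 1000 ticks were spent in iowait.
iowait_percent "cpu 100 0 50 800 50 0 0 0 0 0"   # prints 5.0

# On a live Linux host, the aggregate line is the first line of /proc/stat.
if [ -r /proc/stat ]; then
  echo "iowait since boot: $(iowait_percent "$(head -n 1 /proc/stat)")%"
fi
```

For a current reading, take two samples a few seconds apart and compare the differences between them.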
For related performance issues, refer to the guide on Diagnosing Disk I/O Issues.