Diagnosing CPU Load Issues

Introduction

As organizational usage of Terraform Enterprise grows, the number of CPU cores allocated to the instance may no longer be sufficient. This knowledge base article explains how to identify and resolve CPU-resource issues.

Problem

CPU resource issues are typically accompanied by one or more of the following symptoms:

- The web interface may be slow to respond or sluggish

- Terraform runs may take longer than typical to complete

- Network packets may be intermittently dropped or time out

Cause

The easiest way to confirm that CPU resources are being contested is to SSH into the affected server and run the following two commands:

1. cat /proc/loadavg

Example:

0.20 0.18 0.12 1/80 11206

This file reports the system load average over the last 1, 5, and 15 minutes, followed by the number of currently runnable tasks over the total number of tasks on the system, and finally the most recently assigned PID. On Linux, the load average counts tasks that are running or waiting to run, as well as tasks in uninterruptible sleep (typically waiting on disk I/O). For our purposes, only the first three numbers matter. Pick the one most relevant to the period being measured: for example, if a Terraform run occurred about 5 minutes ago, use the second number from the file.
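As a minimal sketch of picking the relevant number, the helper below extracts the 1-, 5-, or 15-minute value from a line in /proc/loadavg format. The function name pick_loadavg is hypothetical and used only for illustration.

```shell
# pick_loadavg WINDOW LINE
# Prints the load average for WINDOW (1, 5, or 15 minutes) from LINE,
# where LINE is in /proc/loadavg format, e.g. "0.20 0.18 0.12 1/80 11206".
# pick_loadavg is a hypothetical helper name, not part of any tool.
pick_loadavg() {
  window="$1"
  line="$2"
  case "$window" in
    1)  field=1 ;;
    5)  field=2 ;;
    15) field=3 ;;
    *)  echo "window must be 1, 5, or 15" >&2; return 1 ;;
  esac
  printf '%s\n' "$line" | awk -v f="$field" '{ print $f }'
}

# Example: print the 5-minute load average of the current host, if available.
if [ -r /proc/loadavg ]; then
  pick_loadavg 5 "$(cat /proc/loadavg)"
fi
```

Against the example output shown earlier, pick_loadavg 5 "0.20 0.18 0.12 1/80 11206" selects the second field.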

2. grep -c '^processor' /proc/cpuinfo

This command counts the processor entries in /proc/cpuinfo, one per logical CPU core visible to the system, and outputs the total.

With these two numbers, you can estimate the load on the CPU as a percentage by dividing the output of command #1 by the output of command #2 and multiplying by 100. For example, a load average of 4 on 8 CPU cores is 50% load. A load average of 10 on the same 8 CPU cores is 125% load, which means more work is queued than the CPUs can service, and the allocated CPU resources should probably be increased.
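The division described above can be sketched as a small shell function. This is an illustrative helper, not part of Terraform Enterprise; the name load_percent is hypothetical, and the example at the end assumes a Linux host where /proc/loadavg and /proc/cpuinfo are readable.

```shell
# load_percent LOAD CORES
# Prints LOAD divided by CORES as a whole-number percentage.
# load_percent is a hypothetical helper name used for illustration.
load_percent() {
  awk -v load="$1" -v cores="$2" 'BEGIN {
    if (cores <= 0) { exit 1 }          # avoid dividing by zero
    printf "%.0f\n", (load / cores) * 100
  }'
}

# Example: report the 5-minute load of the current host as a percentage.
if [ -r /proc/loadavg ] && [ -r /proc/cpuinfo ]; then
  load=$(awk '{ print $2 }' /proc/loadavg)
  cores=$(grep -c '^processor' /proc/cpuinfo)
  echo "5-minute load: $(load_percent "$load" "$cores")% of $cores cores"
fi
```

With the figures from the text, load_percent 4 8 prints 50 and load_percent 10 8 prints 125.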

Solution

Adding CPU resources should resolve the issue. To choose an appropriate number of CPU cores, treat the load average from command #1 above, rounded up to a whole number, as the minimum number of CPU cores, so that load does not continue to exceed 100%.
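Because the load average is rarely a whole number, one way to turn it into a core count is to round it up. The sketch below does this under that assumption; min_cores is a hypothetical helper name, and the result is a floor for sizing rather than a recommendation that accounts for growth or headroom.

```shell
# min_cores LOAD
# Prints the smallest whole number of CPU cores that keeps LOAD at or
# below 100%, that is, LOAD rounded up, with a floor of 1.
# min_cores is a hypothetical helper name used for illustration.
min_cores() {
  awk -v load="$1" 'BEGIN {
    n = int(load)            # truncate toward zero
    if (n < load) n++        # round up when there is a fractional part
    if (n < 1) n = 1         # a host always needs at least one core
    print n
  }'
}

# Example: size from the 15-minute load average of the current host.
if [ -r /proc/loadavg ]; then
  min_cores "$(awk '{ print $3 }' /proc/loadavg)"
fi
```

For the earlier example, min_cores 10 prints 10, so an 8-core host with a load average of 10 would need at least 10 cores.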

Adding CPU resources requires downtime. Until the Terraform Enterprise instance can be stopped so that CPU resources can be upgraded, operators have two stopgap measures available: reducing the number of concurrent runs Terraform Enterprise is allowed to start, or reducing the amount of CPU resources allocated to each run (only in Terraform Enterprise v202109-1 and later).

By reducing the configured concurrency capacity, an operator allows fewer runs to execute at the same time, so that each run can use more CPU resources. This can be useful if a single, more CPU-intensive workspace must be deployed before CPU resources can be upgraded.

If Terraform Enterprise v202109-1 or later is deployed, an operator also has the option of configuring the CPU resources allocated to each Terraform run. This can help CPU-heavy workspaces complete reliably, at the cost of a potentially longer run time.

To configure these options on an active Terraform Enterprise instance, SSH into the affected instance and follow the instructions below:

On Terraform Enterprise Standalone

To configure fewer concurrent runs:

replicatedctl app-config set capacity_concurrency <number_of_runs>

By default, this is set to 10.

If Terraform Enterprise v202109-1 or later is installed, it's worth ensuring that capacity_cpus is also set appropriately. You can retrieve the current value with replicatedctl app-config get capacity_cpus. If it's set to 0, resource usage is unlimited.

To configure CPU resource limits you can use replicatedctl app-config set capacity_cpus <number_of_cores>.

On Terraform Enterprise Active/Active

Note that Terraform Enterprise's settings are synced across the active nodes, so limiting the CPU usage of one node will also limit the other.

To configure fewer concurrent runs:

tfe-admin app-config set capacity_concurrency <number_of_runs>

By default, this is set to 10.

If Terraform Enterprise v202109-1 or later is installed, it's worth ensuring that capacity_cpus is also set appropriately. You can retrieve the current value with tfe-admin app-config get capacity_cpus. If it's set to 0, resource usage is unlimited.

To configure CPU resource limits you can use tfe-admin app-config set capacity_cpus <number_of_cores>.

Additional Information

Why do low CPU resources sometimes exhibit themselves as intermittent networking issues?

Terraform Enterprise uses Docker as its runtime environment. By default, Docker uses a virtual in-kernel network interface to facilitate container-to-container, host-to-container, and container-to-host communication. When the CPU load is over 100%, the kernel must wait for the currently running tasks to yield before it can process any I/O operations waiting in the network queue.

This is related to I/O wait (iowait), the percentage of time that the CPU was idle while the system had outstanding I/O requests. The top man page gives this simple explanation: “I/O wait = time waiting for I/O completion.” In other words, the presence of I/O wait tells us that the system is idle at a time when it could be processing outstanding requests.

If the time spent waiting in the I/O queue exceeds a few seconds, network timeouts can occur in Terraform Enterprise. Disk I/O and IOPS are discussed in this Knowledge Base article.
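To check how much I/O wait a host has accumulated, one option is to read the aggregate cpu line in /proc/stat. The sketch below assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); the name iowait_percent is hypothetical. These counters are cumulative since boot, so the result is a long-run average; a current figure would require sampling twice and comparing the difference.

```shell
# iowait_percent LINE
# Prints iowait as a percentage of total CPU time, given the aggregate
# "cpu" line from /proc/stat. Assumes the Linux field order:
#   cpu user nice system idle iowait irq softirq steal [guest guest_nice]
# iowait_percent is a hypothetical helper name used for illustration.
iowait_percent() {
  printf '%s\n' "$1" | awk '{
    total = 0
    # Sum user through steal (fields 2-9); guest time is already
    # included in user, so it is not added again.
    for (i = 2; i <= 9 && i <= NF; i++) total += $i
    if (total == 0) exit 1
    printf "%.1f\n", ($6 / total) * 100
  }'
}

# Example: cumulative iowait percentage since boot on the current host.
if [ -r /proc/stat ]; then
  iowait_percent "$(head -n 1 /proc/stat)"
fi
```

For a sample line such as "cpu  100 0 100 700 100 0 0 0 0 0", the function prints 10.0.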