Introduction
As an organization's usage of Terraform Enterprise grows, the allocated disk I/O may become insufficient. This article details how to identify and resolve disk-resource-related performance issues.
Problem
Insufficient disk I/O resources can cause one or more of the following symptoms:
- The web interface or API is slow to respond.
- Terraform runs take longer than usual to complete.
- Terraform runs fail with "invalid configuration" or "invalid run parameters" errors.
- Disk operations fail outright.
Cause
To determine whether disk I/O operations are under contention, use system utilities or your cloud provider's monitoring metrics.
Using iostat for Diagnosis
Connect to the affected server instance over SSH and run the iostat command. This program is typically available in the sysstat package of your Linux distribution.
Execute the following command to get an instantaneous readout of disk I/O usage. (The sample output below shows the default report; add the -x flag for extended per-device statistics such as %util.)
$ iostat
Example output:
Linux 5.12.8-arch1-1 (upstairs-pc)  11/01/2021  _x86_64_  (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.78    0.00    1.38    0.11    0.00   95.74

Device    tps  kB_read/s  kB_wrtn/s  kB_dscd/s  kB_read  kB_wrtn  kB_dscd
dm-0     3.63       8.57      33.28       0.00  2383184  9259848        0
dm-1     0.00       0.01       0.00       0.00     2348        0        0
dm-2     3.62       8.55      33.92       0.00  2379909  9437524        0
nvme0n1  2.56       8.61      33.28       0.00  2394550  9259849        0
Key metrics to observe include:
- %iowait: The percentage of time the CPU(s) were idle while the system had an outstanding disk I/O request. A consistently high value (for example, above 10%) indicates that the CPU is waiting on the disk.
- tps: Transfers per second issued to the device (roughly, IOPS). Compare this number to your disk's provisioned IOPS limit.
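To watch these metrics over time rather than reading a single report, a short awk filter over the avg-cpu section is enough. The check_iowait helper below and the 10% threshold are illustrative, not part of iostat; the field position assumes the avg-cpu header order shown above.

```shell
# Sketch: flag a high %iowait in captured iostat output. %iowait is
# field 4 of the line that follows the "avg-cpu" header in the layout
# shown above. The 10% threshold is a rule of thumb, not an official limit.
check_iowait() {
  awk -v limit="$1" '/avg-cpu/ { getline
    if ($4 + 0 > limit) print "WARN %iowait=" $4
    else                print "OK %iowait=" $4 }'
}

# Analyze a live reading or a saved snapshot:
#   iostat | check_iowait 10
```

Saving periodic snapshots (for example, `iostat 300 >> /var/log/iostat.log`) and filtering them this way gives a crude history when no monitoring agent is installed.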
Identifying Disk Contention on Cloud Platforms
If you are using a cloud provider, you can use their monitoring services to view historical I/O data, which can help identify resource contention over time. Typically, IOPS is the limiting factor.
- AWS: You can use CloudWatch metrics for EBS volumes. For more information, see Amazon EBS I/O characteristics and monitoring.
- Azure: Azure provides disk metrics under the Metrics area of the virtual machine details. For more information, see Disk IO, throughput, and queue depth metrics.
- GCP: You can use Cloud Monitoring to review persistent disk metrics. For more information, see Reviewing persistent disk performance metrics.
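As a worked example for AWS, CloudWatch reports EBS VolumeReadOps and VolumeWriteOps as totals per reporting period, so average IOPS is the datapoint's Sum divided by the period length. A minimal sketch, with made-up numbers standing in for values returned by `aws cloudwatch get-metric-statistics`:

```shell
# Convert a CloudWatch VolumeReadOps datapoint (a total count over the
# reporting period) into average read IOPS. The values below are
# hypothetical examples, not real measurements.
ops_sum=450000   # Sum statistic for one datapoint
period=300       # reporting period, in seconds
iops=$((ops_sum / period))
echo "average read IOPS: $iops"
```

Compare the resulting figure against the IOPS provisioned for the volume type to see how close to the limit you are running.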
Solutions
There are two primary solutions to resolve disk I/O contention. The first is a permanent hardware upgrade, and the second is a temporary configuration change to reduce load.
Solution 1: Upgrade Disk Hardware
The most effective long-term solution is to provision a faster disk with higher IOPS and throughput capabilities. Cross-reference the data from your monitoring tools against the disk's expected performance to ensure that there are sufficient resources available.
For example, if %iowait is consistently above 10% and tps is approaching 2000, replace the disk with one provisioned for more than 2000 IOPS.
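To put a number on that sizing decision, one simple approach is to add a headroom margin on top of the observed peak. The 30% margin below is an illustrative assumption, not a HashiCorp recommendation:

```shell
# Size the replacement disk from the observed peak tps plus headroom.
# peak_tps is the highest sustained value seen in iostat; the 30%
# headroom margin is an illustrative assumption.
peak_tps=2000
headroom_pct=30
required_iops=$(( peak_tps * (100 + headroom_pct) / 100 ))
echo "provision at least: $required_iops IOPS"
```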
Refer to the Terraform Enterprise documentation on disk I/O and consider the load profile of your workspaces when selecting new hardware.
Solution 2: Reduce Run Concurrency (Temporary)
Upgrading disk resources may require downtime. As a temporary measure, you can reduce the number of concurrent runs that Terraform Enterprise is allowed to start. This reduces contention for disk I/O resources by allowing fewer runs to compete at the same time.
To configure these options, connect to the Terraform Enterprise instance over SSH and follow the instructions for your deployment type.
On Terraform Enterprise Standalone
Set the concurrency capacity. The default value is 10.
$ replicatedctl app-config set capacity_concurrency <number_of_runs>
On Terraform Enterprise Active/Active
Set the concurrency capacity. The default value is 10. Terraform Enterprise syncs this setting across all active nodes.
$ tfe-admin app-config set capacity_concurrency <number_of_runs>