Introduction
As organizational usage of Terraform Enterprise grows, the allocated disk I/O may no longer be sufficient. This knowledge base article details how to identify and resolve disk-resource-related issues.
Problem
Typically, disk resource issues are accompanied by one or more of the following symptoms:
- The web interface or API may be slow to respond
- Terraform runs may take longer than typical to complete
- Terraform runs may fail with errors unrelated to the configuration or run parameters
- Disk operations may fail outright
Cause
If disk I/O contention is suspected, the easiest way to confirm is to SSH into the affected server and run the following commands:
1. iostat -x
This program is likely available in the sysstat package of your Linux distribution.
Example output:
Linux 5.12.8-arch1-1 (upstairs-pc)    11/01/2021    _x86_64_    (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.78    0.00    1.38    0.11    0.00   95.74

Device       tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
dm-0        3.63         8.57        33.28         0.00    2383184    9259848          0
dm-1        0.00         0.01         0.00         0.00       2348          0          0
dm-2        3.62         8.55        33.92         0.00    2379909    9437524          0
nvme0n1     2.56         8.61        33.28         0.00    2394550    9259849          0
The first line states the kernel version, hostname, date, CPU architecture, and CPU core count.
The avg-cpu line details average CPU utilization. The relevant field here is %iowait.
I/O Wait shows the percentage of time that the CPU or CPUs were idle during which the system had an outstanding I/O request.
The next line is the table header for the rest of the data. This contains the device name, the transactions per second (IOPS), read throughput in kB/s, write throughput in kB/s, the throughput of discards (typically TRIM), number of kB read, number of kB written, and finally the number of kB discarded.
Please note that this gives you an instantaneous readout of the current disk I/O usage, not a historical reading.
Alternatively, if a cloud provider is in use, its metrics and monitoring may contain historical I/O data that can help determine which disk I/O resource is being limited. Typically, IOPS is the limiting factor.
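Because a single iostat invocation only reports averages since boot, sampling at intervals before turning to cloud metrics can confirm contention locally. Below is a minimal sketch (check_iowait is a hypothetical helper, assuming the sysstat package and awk are available; the 10% cutoff is an illustrative starting point, not an official limit) that flags every sample whose %iowait exceeds the threshold:

```shell
# check_iowait: read iostat output on stdin and print a warning for
# every avg-cpu sample whose %iowait (the 4th percentage column in the
# default layout) exceeds 10 -- an illustrative threshold, tune it for
# your workload.
check_iowait() {
    awk '
        # The line following "avg-cpu:" carries the percentage values.
        prev ~ /avg-cpu/ { if ($4 + 0 > 10) print "high iowait:", $4 }
        { prev = $0 }
    '
}
```

Sampling every 5 seconds for a minute, e.g. iostat -x 5 12 | check_iowait, surfaces sustained I/O wait rather than a one-off spike.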
AWS
CloudWatch metrics can be utilized for EBS volumes; see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-io-characteristics.html#ebs-io-metrics
Azure
Azure keeps disk metrics under the Metrics area of the VM details. More information can be found at https://docs.microsoft.com/en-us/azure/virtual-machines/disks-metrics#disk-io-throughput-and-queue-depth-metrics
GCP
Cloud Monitoring can be utilized for monitoring persistent disk information in Google Cloud; see https://cloud.google.com/compute/docs/disks/review-disk-metrics
Solution
This data can be cross-referenced against the disk's expected throughput and IOPS to ensure that there are sufficient resources available. For example, if `%iowait` rises above 10% while tps shows 2,000+ transactions per second, the disk should be replaced with one provisioned for more than 2,000 IOPS.
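To obtain the tps figure for that comparison, one option is to record the peak transactions per second observed for the data device over a sampling window. A rough sketch (peak_tps is a hypothetical helper; the device name is passed as an argument):

```shell
# peak_tps: read iostat device lines on stdin and print the highest
# tps (2nd column of the default iostat device table) observed for the
# named device across all samples.
peak_tps() {
    awk -v dev="$1" '
        $1 == dev && $2 + 0 > max { max = $2 + 0 }
        END { print max }
    '
}
```

Feeding it a sampling run, e.g. iostat 5 12 | peak_tps nvme0n1 (nvme0n1 being an example device name), yields a peak tps to compare against the disk's provisioned IOPS.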
Adding a faster disk, or one more capable of handling concurrent I/O tasks, should resolve the issue. Make sure to refer to our documentation on disk I/O and consider the load profile of the workspaces.
Because adding disk I/O resources requires downtime, operators of Terraform Enterprise have a stopgap measure available until the instance can be stopped: reducing the number of concurrent runs Terraform Enterprise is allowed to start.
By reducing the configured concurrency capacity, an operator allows fewer runs at a time, reducing contention for disk I/O resources.
To configure these options on an active Terraform Enterprise instance, SSH into the affected instance and follow the instructions below:
On Terraform Enterprise Standalone
To configure fewer concurrent runs:
replicatedctl app-config set capacity_concurrency <number_of_runs>
By default, this is set to 10.
On Terraform Enterprise Active/Active
Note that Terraform Enterprise's settings are synced across the active nodes, so limiting the number of runs on one node will also limit the others.
To configure fewer concurrent runs:
tfe-admin app-config set capacity_concurrency <number_of_runs>
By default, this is set to 10.