When running Terraform Enterprise (TFE), it is important to configure monitoring and alerting to proactively detect anomalous incidents and performance degradation, and to capture utilization trends.
Performance metrics and log details can be exported from a TFE instance to a number of tools for analysis, including Amazon CloudWatch, Azure Monitor, Google Cloud Operations, and Prometheus.
Metrics can be exported from a TFE instance in either Prometheus format or JSON.
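As an illustration of consuming the Prometheus format, the following is a minimal sketch of a parser for the Prometheus text exposition format. The metric names and label values in `SAMPLE` are hypothetical and are not taken from TFE output. The parser is deliberately simplified: it ignores timestamps and does not handle label values that contain `}`.

```python
import re

# Matches: metric_name{optional="labels"} value
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
# Matches a single key="value" label pair, allowing escaped characters.
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_prometheus_text(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # Not a sample line this simplified parser understands.
        labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Hypothetical payload, for illustration only.
SAMPLE = """\
# HELP example_container_memory_used_bytes Hypothetical gauge
# TYPE example_container_memory_used_bytes gauge
example_container_memory_used_bytes{name="example-container"} 524288000
example_container_memory_limit_bytes{name="example-container"} 1073741824
"""

if __name__ == "__main__":
    for name, labels, value in parse_prometheus_text(SAMPLE):
        print(name, labels, value)
```

In practice, a Prometheus server or a monitoring agent would scrape and parse this format for you; a hand-written parser like this is mainly useful for ad hoc checks.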
Establishing Metric Baselines
When monitoring TFE, it is important to establish baselines for metrics in order to surface resource utilization patterns and set appropriate alert thresholds. As a rule of thumb, collect metrics for one to two weeks before setting alert thresholds. The baseline can also indicate whether hosts or containers have been under-provisioned for your workloads, in which case the underlying resources should be adjusted before a new baseline is established.
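One way to turn a baseline into an alert threshold is sketched below, assuming you have already collected a series of samples for a single metric (for example, CPU utilization percentages over the baseline period). The "mean plus three standard deviations" rule and the function name `baseline_threshold` are illustrative assumptions, not a TFE recommendation; percentile-based thresholds are an equally reasonable choice.

```python
import statistics


def baseline_threshold(samples, sigmas=3.0):
    """Summarize baseline samples and derive a simple alert threshold.

    The threshold is mean + sigmas * sample standard deviation, which is
    an assumed rule for illustration only.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a baseline")
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples)
    return {
        "mean": mean,
        "stdev": stdev,
        "threshold": mean + sigmas * stdev,
    }


if __name__ == "__main__":
    # Hypothetical CPU utilization samples (percent) from a baseline period.
    cpu_samples = [40.0, 42.0, 38.0, 41.0, 39.0]
    print(baseline_threshold(cpu_samples))
```

If the computed threshold sits close to the resource limit itself, that suggests the host or container is under-provisioned and should be resized before the baseline is taken again.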
Container Level Metrics
TFE is deployed as a collection of Docker containers. Performance metrics can be exported from each container to monitor resource consumption at the container level. These include (but are not limited to):
- CPU usage at the kernel and user space level
- Memory usage and the configured memory limit
- Disk IOPS and byte counts
More details on container level metrics and instructions for setup can be found here.
Host Level Metrics
In addition to container level metrics, it is important to monitor host level metrics to determine whether the baseline resource allocation is appropriate and whether any resource limit is being exceeded. The metrics worth monitoring may differ by operational mode (External Services, Active/Active, Mounted Disk). The following metrics are useful to collect from the TFE instance host and should serve as a good minimum requirement:
- CPU utilization
- Memory utilization
- Disk space
- Disk IOPS (read/write)
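For a quick spot check of the host metrics listed above, the following sketch uses only the Python standard library. It assumes a Linux host (it reads `/proc/meminfo` and calls `os.getloadavg()`), reports the 1-minute load average per CPU as a rough proxy for CPU utilization rather than true utilization, and omits disk IOPS, which needs two readings over time. The function names are hypothetical; in production, a monitoring agent would normally collect these values.

```python
import os
import shutil


def read_meminfo(path="/proc/meminfo"):
    """Return MemTotal and MemAvailable in kB (Linux only)."""
    values = {}
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            if key in ("MemTotal", "MemAvailable"):
                values[key] = int(rest.split()[0])
    return values


def host_snapshot(disk_path="/"):
    """Collect a point-in-time snapshot of basic host metrics."""
    disk = shutil.disk_usage(disk_path)
    snapshot = {
        # Share of the filesystem at disk_path that is in use.
        "disk_used_pct": 100.0 * disk.used / disk.total,
        # Load average is only a proxy for CPU utilization.
        "load_avg_1m_per_cpu": os.getloadavg()[0] / (os.cpu_count() or 1),
    }
    if os.path.exists("/proc/meminfo"):
        mem = read_meminfo()
        if "MemTotal" in mem and "MemAvailable" in mem:
            used_kb = mem["MemTotal"] - mem["MemAvailable"]
            snapshot["memory_used_pct"] = 100.0 * used_kb / mem["MemTotal"]
    return snapshot


if __name__ == "__main__":
    for metric, value in host_snapshot().items():
        print(f"{metric}: {value:.1f}")
```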
NOTE: For TFE instances deployed in External Services or Active/Active mode, metrics should also be collected for the following services, in addition to the TFE instance and containers:
- Database (PostgreSQL, RDS, etc.)
- Redis (Active/Active mode only)
Links to relevant metrics can be found in the Appendix.
Appendix II (Examples)
- Capacity and Performance
- Terraform Enterprise Metrics
- Container Metrics
- Terraform Enterprise and Postgres Utilization