Introduction
Problem
A memory/caching issue in certain versions of Terraform caused runs to fail with varying symptoms, including intermittent or consistent:
- terraform init failing to complete
- terraform plan failing to start with errors during refresh
- Terraform runs failing with Out of Memory (OOM) issues
- Terraform Enterprise UI page loads slowing down
Prerequisites (if applicable)
- Terraform versions v1.5.3 to v1.6.0
- Terraform AWS provider versions v4.67.0 to v5.20.0
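As a quick way to triage a workspace against the ranges above, the check can be sketched in Python. This is a minimal sketch, not an official tool: the function names are hypothetical, and it assumes each range runs from the first listed version up to, but not including, the release that carries the fix.

```python
import re

# Assumed reading of the ranges in this article: affected from the first listed
# version up to, but not including, the release that carries the fix.
AFFECTED_TERRAFORM = ((1, 5, 3), (1, 6, 0))
AFFECTED_AWS_PROVIDER = ((4, 67, 0), (5, 20, 0))


def parse_version(text):
    """Turn 'v1.5.7' or '1.5.7' into (1, 5, 7); pre-release suffixes are ignored."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"not a semantic version: {text!r}")
    return tuple(int(part) for part in match.groups())


def in_range(version, bounds):
    """True when low <= version < high."""
    low, high = bounds
    return low <= parse_version(version) < high


def affected_components(terraform_version, aws_provider_version):
    """List which of the two components fall inside the assumed affected ranges."""
    hits = []
    if in_range(terraform_version, AFFECTED_TERRAFORM):
        hits.append("terraform")
    if in_range(aws_provider_version, AFFECTED_AWS_PROVIDER):
        hits.append("aws provider")
    return hits


print(affected_components("v1.5.7", "v5.10.0"))  # ['terraform', 'aws provider']
print(affected_components("v1.6.0", "v5.20.0"))  # []
```

The version strings can be taken from the output of terraform version for the workspace in question.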
Cause
- The main problem is peak memory usage:
  - this peak occurs when Terraform makes calls to configured providers to load their resource/data source schemas,
  - the Terraform protocol contains a single RPC which asks for all schemas regardless of the specific resources configured,
  - this means that as the Terraform AWS provider grows, the memory requirements for using it also grow,
  - some specific resources such as QuickSight and WAFv2 have extremely large nested schemas which can have an outsized effect on memory
    - for example, quicksight resources are the largest contributors to the memory jump, and they were added in v4.67.0 and v5.1.0
- The memory requirements vary based on the particular resource configured.
- The problem occurred more frequently on Terraform configurations with multiple AWS providers configured or a history of several versions of a provider.
- The pressure on memory requirements has resulted in OOM errors in some cases.
- Issue 31722 investigated how the increasing size of the provider, combined with the addition of resources with deep and complex schemas, has significantly increased the peak memory requirements of using the provider.
During terraform init:
- Terraform locates and installs the Terraform providers used within the configuration, including those used by the child modules it calls.
- Terraform Cloud and Terraform Enterprise install providers as part of every run.
- Terraform CLI finds and installs providers when initializing a working directory. It can automatically download providers from a Terraform registry, or load them from a local mirror or cache. If you are using a persistent working directory, you must reinitialize whenever you change a configuration's providers.
To save time and bandwidth, Terraform CLI supports an optional plugin cache. You can enable the cache using the plugin_cache_dir setting in the CLI configuration file.
https://developer.hashicorp.com/terraform/language/providers#provider-installation
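Besides the plugin_cache_dir setting, the cache location can also be supplied through the TF_PLUGIN_CACHE_DIR environment variable. The sketch below shows one way a wrapper script might prepare that environment before invoking terraform init. The helper name and the example path are illustrative assumptions; the directory is created up front because Terraform does not create the cache directory itself.

```python
import os
from pathlib import Path


def plugin_cache_env(cache_dir):
    """Create the cache directory if needed and return an environment pointing Terraform at it."""
    path = Path(cache_dir).expanduser()
    # Terraform expects the cache directory to exist already.
    path.mkdir(parents=True, exist_ok=True)
    env = dict(os.environ)
    env["TF_PLUGIN_CACHE_DIR"] = str(path)
    return env


# Example use (requires terraform on PATH; the path is only an example):
#   import subprocess
#   subprocess.run(["terraform", "init"],
#                  env=plugin_cache_env("~/.terraform.d/plugin-cache"),
#                  check=True)
```

Passing the variable per invocation keeps the setting scoped to the wrapper and leaves the user's CLI configuration file untouched.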
Overview of possible solutions (if applicable)
Solutions:
- Most users will see a significant decrease in memory footprint by upgrading to:
  - Terraform v1.6.0 and newer,
  - Terraform AWS provider v5.20.0 and newer,
  - other providers may also be affected and may also require updates.
Outcome
- Changes included from Terraform v1.6.0 onwards added new functionality that allows a cached provider schema to be used rather than obtaining another copy, which significantly reduces memory consumption for configurations that include multiple instances of the same provider. Additionally, a regex cache was added to the Terraform AWS provider (released in v5.14.0), which in testing appears to have a significant impact on memory consumption.
Additional Information
On Terraform Enterprise, running out of memory generally impacts Terraform operations more than CPU constraints do:
The required CPU resources for an individual Terraform run vary considerably, but in general they are a much more minor factor than memory due to Terraform mostly waiting on IO from APIs to return.
https://developer.hashicorp.com/terraform/enterprise/system-overview/capacity#cpu
Some memory issues present as SIC-001 errors. They occur when oom-killer events happen on the Linux OS. Messages are written to the dmesg log when this happens and are included in the support bundle at <host>/default/commands/dmesg/stdout.
The SIC-001 (Source Ingress Controller) error is a generic failure to process a Terraform slug. A slug is a blob of data which contains the current state of the Terraform configuration files. Terraform Enterprise uses slug services to pull in VCS information and to extract, merge, and process Terraform configuration files. After a slug is ingressed and processed, it is uploaded to blob storage via archivist.
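To confirm whether the OOM killer was involved, the dmesg output from a support bundle can be scanned for the relevant kernel messages. The following is a minimal sketch: the function name is hypothetical, the sample log lines are illustrative rather than taken from a real bundle, and the pattern assumes the kernel's usual "invoked oom-killer" and "Out of memory" phrasings.

```python
import re

# Matches the two kernel log phrasings commonly seen around an OOM kill (assumed).
OOM_PATTERN = re.compile(r"oom-killer|out of memory", re.IGNORECASE)


def find_oom_events(dmesg_text):
    """Return the dmesg lines that mention the OOM killer."""
    return [line for line in dmesg_text.splitlines() if OOM_PATTERN.search(line)]


# Illustrative sample only, not taken from a real support bundle.
sample = """\
[1234.5] terraform invoked oom-killer: gfp_mask=0x100cca, order=0
[1234.6] Out of memory: Killed process 4321 (terraform)
[1300.0] eth0: link up
"""

for line in find_oom_events(sample):
    print(line)
```

In practice the text would be read from the dmesg stdout file inside the extracted support bundle and passed to the function in place of the sample.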
Related to the Terraform AWS provider:
- memory consumption increase since v4.67.0: https://github.com/hashicorp/terraform-provider-aws/issues/31722
- memory allocation: https://github.com/hashicorp/terraform-provider-aws/issues/33553
- monitoring of memory usage in providers: https://github.com/hashicorp/terraform-provider-aws/issues/32289
Related to the fix made:
- Updates to Terraform were released in v1.6.0, noting that updates are also required in the aws providers for the dependencies to take effect.
  - core: Terraform will now skip requesting the (possibly very large) provider schema from providers which indicate during handshake that they don't require that for correct behavior, in situations where Terraform Core itself does not need the schema. (#33486)
    https://github.com/hashicorp/terraform/blob/v1.6.0-beta1/CHANGELOG.md#160-august-31-2023