Introduction
When a run starts in Terraform Enterprise, it spawns a worker or agent container from an image, which can be either the default worker/agent image or a custom worker/agent image. The work manager (or the tfe task worker, for agents), which is responsible for spawning the container, then executes a series of commands in the container, including Terraform CLI operations such as the plan command.
Use Case
Terraform plan operations can sometimes fail to complete. This document describes common cases in which a plan operation ends in a failed state and how to recover from each of them.
Container runs out of memory (worker/agent)
Each worker container is allocated a preset amount of memory, as described in the Capacity and Performance document. When a Terraform plan operation demands more memory than this preset amount during execution, the run fails. The error message varies: the output may end with Killed as its last line, or one of connection is shut down, unexpected EOF, or Error: rpc error: code = Canceled desc = context canceled may appear towards the end of the log output.
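When triaging a failed plan, it can help to check the saved log for the messages listed above. The following is a minimal sketch, assuming the plan output has been saved as plain text; the function name `find_oom_hints` and the `tail_lines` parameter are hypothetical, and a match only suggests memory exhaustion, since these messages can have other causes.

```python
# Signatures taken from the error messages described in this section.
OOM_SIGNATURES = (
    "connection is shut down",
    "unexpected EOF",
    "rpc error: code = Canceled desc = context canceled",
)


def find_oom_hints(log_text, tail_lines=20):
    """Return the possible out-of-memory signatures found near the end of a plan log."""
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    if not lines:
        return []
    hints = []
    # "Killed" is only treated as a hint when it is the very last line.
    if lines[-1] == "Killed":
        hints.append("Killed")
    # The other messages are searched for in the last few lines only.
    tail = "\n".join(lines[-tail_lines:])
    hints.extend(sig for sig in OOM_SIGNATURES if sig in tail)
    return hints
```

For example, `find_oom_hints(open("plan.log").read())` returns an empty list when none of the messages appear near the end of the log (the file name is illustrative).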
Solutions
- Terraform CLI 0.12.29 and above can significantly decrease the memory footprint because of a change in how it stores resource graph objects. If you are not already on this version, HashiCorp Support encourages you to upgrade to Terraform CLI 0.12.29 or above when possible.
- Increase the memory capacity of the worker container where upgrading is not an option or does not resolve the symptom. The maximum memory per run is set in the installer dashboard at
https://TFE_HOSTNAME:8800
under Settings > Capacity. If the initial change does not resolve the issue, you may need to experiment by raising the memory allocation incrementally.
- When you increase the maximum memory per run, the total memory the host machine needs to operate Terraform Enterprise also increases as a side effect. Adjust the memory capacity of the host machine in accordance with the guideline.
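The effect of a higher per-run limit on the host can be estimated with simple arithmetic. This is a rough sketch, not the official sizing guideline: it assumes the host may execute several runs at once, and the `concurrent_runs` and `reserved_mb` parameters (memory set aside for the operating system and other Terraform Enterprise services) are hypothetical inputs you would take from your own environment.

```python
def estimate_run_memory_mb(max_memory_per_run_mb, concurrent_runs):
    """Peak memory (MB) that run containers could claim at the same time."""
    if max_memory_per_run_mb <= 0 or concurrent_runs <= 0:
        raise ValueError("both values must be positive")
    return max_memory_per_run_mb * concurrent_runs


def host_has_headroom(host_memory_mb, max_memory_per_run_mb, concurrent_runs, reserved_mb):
    """True if the host can cover peak run memory after the reserved amount is set aside.

    reserved_mb is a hypothetical allowance for everything that is not a run container.
    """
    available = host_memory_mb - reserved_mb
    return available >= estimate_run_memory_mb(max_memory_per_run_mb, concurrent_runs)
```

With these assumptions, doubling the maximum memory per run doubles the peak memory that runs can claim, which is why the host capacity should be reviewed together with this setting.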
Symlinks to files or directories outside source code repository
Terraform Enterprise does not allow symlinks to files or directories outside the source code repository. Such symlinks are prohibited for security reasons. When Terraform Enterprise detects one during initialization of the worker container, the operation fails and displays the error message Setup failed: failed unpacking terraform config: Invalid symlink ("<<SOURCE>> -> <<DESTINATION>>) has absolute target
on the plan output panel.
Solutions
Currently, the only fix is to remove all such symlinks from the source code repository.
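To locate the offending symlinks before pushing, a local checkout can be scanned. The following is a minimal sketch that approximates the rule described above rather than reproducing Terraform Enterprise's own check: it flags symlinks whose target is an absolute path or resolves outside the repository root. The function name `find_disallowed_symlinks` is hypothetical.

```python
import os


def find_disallowed_symlinks(repo_root):
    """Return (link_path, target, reason) for symlinks that are absolute or leave repo_root."""
    root = os.path.realpath(repo_root)
    findings = []
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Skip version-control metadata.
        if ".git" in dirnames:
            dirnames.remove(".git")
        # Symlinks to directories are listed in dirnames, symlinks to files in filenames.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.readlink(path)
            if os.path.isabs(target):
                findings.append((path, target, "absolute target"))
                continue
            # Resolve the relative target against the directory that holds the link.
            resolved = os.path.realpath(os.path.join(dirpath, target))
            if os.path.commonpath([root, resolved]) != root:
                findings.append((path, target, "points outside repository"))
    return findings
```

Running `find_disallowed_symlinks(".")` from the repository root lists each link to remove, together with the reason it was flagged.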
Plan timeout
Terraform Enterprise allows a Site Administrator to configure the timeout for plan operations. The default timeout is 2 hours; when it is exceeded, a message similar to the one below appears in the plan output (this example shows a configured limit of 30 minutes). As the message states, review the output to determine why the run exceeded its timeout, because this symptom can have various causes.
------------ Terraform Enterprise System Message ------------
WARNING: This plan has timed out and will now terminate!
Terraform Enterprise enforces a 30m0s maximum run time for this operation. Please
review the logs above to determine why the run has exceeded its timeout. You
can re-run this operation by queueing a new plan in Terraform Enterprise.
-------------------------------------------------------------
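When reviewing many failed runs, it can be useful to confirm which limit was actually enforced. The following is a minimal sketch that extracts the duration from a message shaped like the one above; it assumes the duration uses the hours/minutes/seconds form shown in the example (such as 30m0s), and the function name `enforced_timeout_seconds` is hypothetical.

```python
import re

# Matches the sentence from the system message shown above.
_MESSAGE = re.compile(r"enforces a (\S+) maximum run time")
# Assumed duration shape: optional hours, minutes, and whole seconds, e.g. "30m0s".
_DURATION = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?$")


def enforced_timeout_seconds(log_text):
    """Return the enforced timeout in seconds, or None if no timeout message is found."""
    match = _MESSAGE.search(log_text)
    if match is None:
        return None
    parts = _DURATION.match(match.group(1))
    if parts is None or not any(parts.groups()):
        return None
    hours, minutes, seconds = (int(p) if p else 0 for p in parts.groups())
    return hours * 3600 + minutes * 60 + seconds
```

Comparing the returned value with the timeout you expect shows whether the run hit the default or a customized limit.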
Solutions
-
A plan may time out because Terraform must communicate with the providers through their APIs for each resource in the workspace, so plan time grows with the number of resources. Limiting the number of resources in a workspace to an appropriate amount helps reduce bottlenecks during the plan operation.
-
Increase the plan timeout. This option is straightforward and applies when you cannot optimize the Terraform configuration any further. To increase the timeout, click the user profile icon in the right corner and select “Site Admin”, or navigate directly to this URL:
https://TFE_HOSTNAME/app/admin/settings
-
Increase the number of parallel operations. By default, a plan operation runs with 10 parallel operations. This can be changed by setting the
TFE_PARALLELISM
environment variable on the workspace, as described in the special environment variables documentation.
-
Upgrade the TFE host instance class. The higher the CPU count, the faster TFE will be able to process the configuration. Note that external factors, such as the speed at which provider API calls are processed, may limit the effectiveness of this option.
-
Upgrade disk IOPS. With larger configurations or increased concurrency, TFE may begin to push hard on the disk I/O. Particularly when running in cloud environments, it may be necessary to increase the allocated IOPS of the disk.
-
Refactor the configuration to reduce the processing time of the Terraform graph.
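For the parallelism option above, TFE_PARALLELISM is an ordinary workspace environment variable and can be set in the workspace's Variables page. If you prefer to script it, the sketch below builds a request body in the shape used by the Workspace Variables API; it does not send anything, the function name is hypothetical, and you should confirm the payload and endpoint against the API documentation for your Terraform Enterprise version.

```python
import json


def parallelism_variable_payload(parallelism):
    """Build a request body that defines TFE_PARALLELISM as a workspace environment variable."""
    if isinstance(parallelism, bool) or not isinstance(parallelism, int) or parallelism < 1:
        raise ValueError("parallelism must be a positive integer")
    return {
        "data": {
            "type": "vars",
            "attributes": {
                "key": "TFE_PARALLELISM",
                "value": str(parallelism),  # variable values are sent as strings
                "category": "env",          # environment variable, not a Terraform variable
                "hcl": False,
                "sensitive": False,
            },
        }
    }


if __name__ == "__main__":
    # Print the body for a value of 20; 20 is only an illustrative number.
    print(json.dumps(parallelism_variable_payload(20), indent=2))
```

Higher parallelism puts more concurrent load on the provider APIs, so raise the value gradually and watch for rate limiting.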
Additional information
It is important to understand the consequences of any attempt to shorten the processing time of a plan operation, because a plan consists of multiple sub-processes: graph walk, diff, and refresh. Each of the changes suggested in this document has potential consequences; for example, increasing the amount of memory allocated to each run may exhaust system memory if not accounted for.
If you continue to experience issues after following this guide, please contact HashiCorp Support for further assistance.