Problem
The Terraform Enterprise application becomes unresponsive or crashes due to an Out of Memory (OOM) Killer event terminating a critical process.
Prerequisites
- Terraform Enterprise v202307-1 to v202503-1
- Replicated or Flexible Deployment Options (FDO) with Docker
Cause
The Terraform Enterprise application may become unavailable due to unbounded memory growth in the puma worker processes inside the tfe-atlas container. This growth can continue until the host's Linux kernel invokes the OOM Killer, which terminates processes to prevent a system-wide crash.
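Unbounded growth of a single process's resident set is easiest to catch before the OOM Killer fires. The sketch below is a hypothetical helper (not part of Terraform Enterprise) that flags any process in `ps aux`-style output whose RSS exceeds a threshold; the sample input is invented for illustration.

```shell
# Hypothetical helper: flag any process in `ps aux`-style output whose
# RSS (field 6, reported in kB) exceeds a threshold given in MB.
flag_high_rss() {
  local threshold_mb="$1"
  awk -v t="$threshold_mb" 'int($6/1024) > t { print int($6/1024) "MB", $11 }'
}

# Invented sample of `ps aux` output (values are illustrative):
sample='tfe  101  5.0  1.2  500000  204800 ?  Sl  10:00  1:00 puma: cluster worker 0
tfe  102  0.1  0.1  10000   8192   ?  S   10:00  0:01 sshd'

printf '%s\n' "$sample" | flag_high_rss 100
```

In practice you would feed it live output, for example `docker exec <container> ps auxww | flag_high_rss 1024`.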
You can identify this issue by observing the following indicators.
- OOM Killer Event in System Logs

  Review the kernel logs for an Out of Memory event targeting a Terraform Enterprise process.

  ```shell
  $ sudo journalctl -k -p err -b
  -- Logs begin at Thu 2024-01-11 18:21:01 UTC, end at Sat 2024-01-13 00:08:00 UTC. --
  Jan 11 20:50:21 ip-10-118-128-187.us-east-2.compute.internal kernel: Out of memory: Kill process 6269 (terraform-state) score 224 or sacrifice child
  Jan 11 20:50:21 ip-10-118-128-187.us-east-2.compute.internal kernel: Killed process 6269 (terraform-state) total-vm:14121672kB, anon-rss:7247892kB, file-rss:0kB, shmem-rss:0kB
  ```
- Terraform Enterprise Application Logs

  The application logs show an unexpected process termination caused by a SIGKILL signal.

  ```shell
  2024-01-11 18:22:36,947 INFO success: terraform-state-parser entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
  2024-01-11 18:22:36,947 INFO success: tfe-health-check entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
  2024-01-11 18:22:36,947 INFO success: vault entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
  2024-01-11 20:50:20,331 INFO exited: terraform-state-parser (terminated by SIGKILL; not expected)
  2024-01-11 20:50:20,823 INFO spawned: 'terraform-state-parser' with pid 7253
  2024-01-11 20:50:21,686 INFO success: terraform-state-parser entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
  ```
- High Memory Consumption by Puma Workers

  Executing `ps` inside the container reveals high memory usage by puma processes.

  ```shell
  $ sudo docker exec -it terraform-enterprise-tfe-1 ps auxww | awk '{mem[$11]+=int($6/1024)}; {cpuper[$11]+=$3};{memper[$11]+=$4}; END {for (i in mem) {print cpuper[i]"% ",memper[i]"% ",mem[i]"MB ",i}}' | sort -k3nr | head -n 10
  ```

  Example output showing high memory usage.

  ```shell
  187.3% 71.6% 22720MB puma:
  ```
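For quick triage, the kernel's `Killed process` line can be parsed to pull out the victim PID, process name, and resident memory. This is a minimal sketch assuming a journalctl line in the format shown in the first indicator; the awk field positions are based on that sample.

```shell
# Parse a kernel "Killed process" log line (as captured from journalctl)
# and report the victim PID, process name, and anon-rss converted to MB.
oom_line='Jan 11 20:50:21 ip-10-118-128-187.us-east-2.compute.internal kernel: Killed process 6269 (terraform-state) total-vm:14121672kB, anon-rss:7247892kB, file-rss:0kB, shmem-rss:0kB'

parsed=$(printf '%s\n' "$oom_line" | awk '{
  for (i = 1; i <= NF; i++) {
    if ($i == "process") { pid = $(i + 1); name = $(i + 2); gsub(/[()]/, "", name) }
    if ($i ~ /^anon-rss:/) { v = $i; gsub(/anon-rss:|kB,?/, "", v); rss_mb = int(v / 1024) }
  }
  print pid, name, rss_mb "MB"
}')
echo "$parsed"
```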
Solutions
Select the solution that corresponds to your version of Terraform Enterprise.
Solution 1: Restart Puma Workers via Script (v202308-2 and newer)
This solution uses a script to perform a rolling restart of the puma worker processes, which allows active connections to close gracefully.
- Create a file named `puma_restart.sh` with the following content.

  ```shell
  #!/bin/bash
  ## This script issues a rolling restart on puma worker processes. Worker
  ## processes wait for active connections to close before proceeding. More
  ## information on Puma signals can be found here:
  ## https://github.com/puma/puma/blob/master/docs/signals.md#puma-signals

  ## Function to find the primary Puma process ID
  get_puma_pid() {
    local container_id="$1"
    docker exec "$container_id" ps -ax | grep "puma 6" | grep -v grep | awk '{print $1}'
  }

  ## Prompt for container name or ID
  read -p "Enter Container Name or ID: " container_id

  ## Check if container ID is empty
  if [[ -z "$container_id" ]]; then
    echo "Error: Please enter a container ID."
    exit 1
  fi

  ## Find the Puma process ID
  puma_pid=$(get_puma_pid "$container_id")

  ## Check if puma_pid is empty
  if [[ -z "$puma_pid" ]]; then
    echo "Error: Couldn't find Puma process in container $container_id."
    exit 1
  fi

  ## Signal Puma for restart with informative message
  if docker exec -u 0 "$container_id" kill -USR1 "$puma_pid"; then
    echo "Sent USR1 signal to Puma process $puma_pid in container $container_id. Initiating rolling restart."
  fi
  ```

- Make the script executable.

  ```shell
  $ chmod +x /path/to/puma_restart.sh
  ```

- Execute the script and provide the Terraform Enterprise container name or ID when prompted.

  ```shell
  $ ./puma_restart.sh
  Enter Container Name or ID: terraform-enterprise
  ```
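The script identifies the Puma master by matching `puma 6` in the container's process list. If you want to verify that match before signalling anything, the same grep/awk pipeline can be exercised against captured `ps` output; the sample process lines below are invented for illustration.

```shell
# Mirrors the script's get_puma_pid() pipeline, but reads ps output from
# stdin instead of `docker exec`, so the match can be checked safely.
find_puma_pid() {
  grep "puma 6" | grep -v grep | awk '{print $1}'
}

# Invented sample of `ps -ax` output (PIDs and columns are illustrative):
sample_ps='  129 ?  Sl  12:34 puma 6.4.0 (tcp://0.0.0.0:9292) [atlas]
 1966 ?  Sl   3:21 puma: cluster worker 0: 129 [atlas]'

puma_pid=$(printf '%s\n' "$sample_ps" | find_puma_pid)
echo "$puma_pid"
```

Only the master process line contains the `puma 6` version string; the cluster workers show as `puma: cluster worker ...` and are intentionally excluded, since USR1 must be sent to the master.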
Solution 2: Restart the Container (v202307-1 to v202308-1)
For older versions of Terraform Enterprise, restart the tfe-atlas container directly to resolve the issue.
```shell
$ docker restart tfe-atlas
```
Outcome Validation
After applying Solution 1, you can confirm that the puma processes have restarted by monitoring the Terraform Enterprise container's logs.
```shell
$ docker logs --timestamps --details terraform-enterprise
```
Look for log entries indicating a phased worker restart.
```shell
...
2024-03-27T17:06:36.967157153Z {"log":"[129] - Starting phased worker restart, phase: 3","component":"atlas"}
2024-03-27T17:06:36.967210934Z {"log":"[129] + Changing to /app","component":"atlas"}
2024-03-27T17:06:36.967220474Z {"log":"[129] - Stopping 1966 for phased upgrade...","component":"atlas"}
2024-03-27T17:06:36.967227114Z {"log":"[129] - TERM sent to 1966...","component":"atlas"}
2024-03-27T17:06:36.967233055Z {"log":"2024-03-27 17:06:36 [DEBUG] Shutting down background worker","component":"atlas"}
2024-03-27T17:06:36.967726387Z {"log":"2024-03-27 17:06:36 [DEBUG] Killing session flusher","component":"atlas"}
```
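The same check can be scripted: grep the container logs for the phased-restart marker and treat a non-zero match count as confirmation. A minimal sketch, run here against captured sample lines rather than live `docker logs` output.

```shell
# Count phased-restart markers; against a live system you would pipe in
# real output, e.g.:
#   docker logs terraform-enterprise 2>&1 | grep -c "phased worker restart"
sample_logs='{"log":"[129] - Starting phased worker restart, phase: 3","component":"atlas"}
{"log":"[129] - Stopping 1966 for phased upgrade...","component":"atlas"}'

restarts=$(printf '%s\n' "$sample_logs" | grep -c "phased worker restart")
echo "$restarts"
```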