Problem
Terraform Enterprise application crashes due to OOM Killer
Prerequisites
- Terraform Enterprise v202307-1 through the current release (v202402-1 at the time of writing)
Cause
The Terraform Enterprise application may become unavailable due to unbounded memory growth in the Puma processes inside the container. Under normal use, container memory consumption grows, plateaus, falls, and repeats. Sporadically, however, memory growth continues until the Linux kernel triggers the Out of Memory Killer (OOM Killer) to protect the system as available memory runs low.
- Example of an OOM event:
sudo journalctl -k -p err -b
-- Logs begin at Thu 2024-01-11 18:21:01 UTC, end at Sat 2024-01-13 00:08:00 UTC. --
Jan 11 20:50:21 ip-10-118-128-187.us-east-2.compute.internal kernel: Out of memory: Kill process 6269 (terraform-state) score 224 or sacrifice child
Jan 11 20:50:21 ip-10-118-128-187.us-east-2.compute.internal kernel: Killed process 6269 (terraform-state) total-vm:14121672kB, anon-rss:7247892kB, file-rss:0kB, shmem-rss:0kB
- Terraform Enterprise log:
2024-01-11 18:22:36,947 INFO success: terraform-state-parser entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-01-11 18:22:36,947 INFO success: tfe-health-check entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-01-11 18:22:36,947 INFO success: vault entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-01-11 20:50:20,331 INFO exited: terraform-state-parser (terminated by SIGKILL; not expected) <-- HERE
2024-01-11 20:50:20,823 INFO spawned: 'terraform-state-parser' with pid 7253
2024-01-11 20:50:21,686 INFO success: terraform-state-parser entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
- Evidence of high resource consumption by the Puma workers:
sh-4.2$ sudo docker exec -it terraform-enterprise-tfe-1 ps aux | awk '{mem[$11]+=int($6/1024)}; {cpuper[$11]+=$3};{memper[$11]+=$4}; END {for (i in mem) {print cpuper[i]"% ",memper[i]"% ",mem[i]"MB ",i}}' | sort -k3nr | head -n 10
187.3% 71.6% 22720 MB puma:
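The `awk` aggregation above sums `%CPU` (field 3), `%MEM` (field 4), and RSS in MB (field 6, reported in kB) per command name (field 11). To see how the totals are computed, the same one-liner can be run against a captured `ps` sample; a minimal sketch with fabricated numbers (the sample data below is illustrative, not real Terraform Enterprise output):

```shell
# Write a fabricated ps-aux-style sample (USER PID %CPU %MEM VSZ RSS TTY
# STAT START TIME COMMAND). The numbers are made up for illustration.
cat <<'EOF' > /tmp/ps_sample.txt
root 100 10.0  5.0 100000 2048000 ? S 10:00 0:01 puma:
root 101 20.0  6.0 100000 3072000 ? S 10:00 0:02 puma:
root 102  1.0  0.5  50000  102400 ? S 10:00 0:00 vault
EOF
# Sum %CPU, %MEM and RSS (converted from kB to MB) per command name,
# then sort by the memory column, largest first.
awk '{mem[$11]+=int($6/1024)}; {cpuper[$11]+=$3}; {memper[$11]+=$4};
     END {for (i in mem) {print cpuper[i]"% ",memper[i]"% ",mem[i]"MB ",i}}' \
    /tmp/ps_sample.txt | sort -k3nr | head -n 10
```

Here the two `puma:` rows roll up into a single line totaling 5000 MB of RSS, which surfaces the aggregate footprint of all Puma workers rather than any single process.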
Solution
- Create a puma_restart.sh file with the following contents and save it:
#!/bin/bash
# This script issues a rolling restart on puma worker processes. Worker
# processes wait for active connections to close before proceeding. More
# information on Puma signals can be found here:
# https://github.com/puma/puma/blob/master/docs/signals.md#puma-signals
# Prompt for container ID
read -p "Enter Container Name or ID: " CONTAINER_ID
# Check if CONTAINER_ID is empty
if [[ -z "$CONTAINER_ID" ]]; then
echo "Error: Please enter a container ID."
exit 1
fi
# Isolate the primary Puma process id
puma_pid=$(docker exec "$CONTAINER_ID" ps -ax | grep "puma 6" | sed -r 's/\s*([0-9]*).*/\1/')
# Check if puma_pid is empty
if [[ -z "$puma_pid" ]]; then
echo "Error: Couldn't find Puma process in container $CONTAINER_ID."
exit 1
fi
# Issue the USR1 signal
docker exec "$CONTAINER_ID" kill -USR1 $puma_pid
echo "Sent USR1 signal to Puma process $puma_pid in container $CONTAINER_ID. Rolling restart initiated."
- Make the file executable:
chmod +x /path/to/puma_restart.sh
- Execute the script and enter the container name or ID. For example:
$ ./puma_restart.sh
Enter Container Name or ID: terraform-enterprise
- For Terraform Enterprise v202307-1 and v202308-1, restart the tfe-atlas container instead:
$ docker restart tfe-atlas
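If memory growth recurs on a predictable interval, the restart can be scheduled. A hedged sketch of a crontab entry, assuming the container is named terraform-enterprise and the script is installed at /usr/local/bin/puma_restart.sh (the container name, script path, and log path are all assumptions to adjust for your environment); the script's interactive prompt is satisfied by piping the container name in:

```shell
# Hypothetical crontab entry: every Sunday at 02:00, pipe the container
# name into the script's read prompt and append output to a log file.
0 2 * * 0 echo "terraform-enterprise" | /usr/local/bin/puma_restart.sh >> /var/log/puma_restart.log 2>&1
```

Because the USR1 signal performs a phased restart, workers finish their active connections before recycling, so scheduling this during a low-traffic window minimizes disruption.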
Outcome
Confirm the script has restarted the process by tailing the Terraform Enterprise container's logs:
$ docker logs --timestamps --details terraform-enterprise
...
2024-03-27T17:06:36.967157153Z {"log":"[129] - Starting phased worker restart, phase: 3","component":"atlas"}
2024-03-27T17:06:36.967210934Z {"log":"[129] + Changing to /app","component":"atlas"}
2024-03-27T17:06:36.967220474Z {"log":"[129] - Stopping 1966 for phased upgrade...","component":"atlas"}
2024-03-27T17:06:36.967227114Z {"log":"[129] - TERM sent to 1966...","component":"atlas"}
2024-03-27T17:06:36.967233055Z {"log":"2024-03-27 17:06:36 [DEBUG] Shutting down background worker","component":"atlas"}
2024-03-27T17:06:36.967726387Z {"log":"2024-03-27 17:06:36 [DEBUG] Killing session flusher","component":"atlas"}
Additional Information
- Please submit a support request if the issue persists.