Problem
Terraform Enterprise application crashes due to OOM Killer
Prerequisites
- Terraform Enterprise v202307-1 through present (v202406-1)
Cause
The Terraform Enterprise application may become unavailable due to unbounded memory growth in the puma processes inside the container. When monitoring container resource utilization, memory consumption normally grows, plateaus, and falls again as the application is used. Sporadically, however, memory growth continues until the Linux kernel triggers the Out of Memory Killer (OOM Killer) to prevent the system from crashing when memory runs low.
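To observe this pattern over time, container memory can be sampled periodically with docker stats. The loop below is a minimal sketch; the container name terraform-enterprise and the 60-second interval are assumptions and should be adjusted to the environment.
# Sample container memory usage once per minute (container name is an assumption)
while true; do
  docker stats --no-stream --format '{{.Name}}: {{.MemUsage}} ({{.MemPerc}})' terraform-enterprise
  sleep 60
done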
- Example of an OOM event:
sudo journalctl -k -p err -b
-- Logs begin at Thu 2024-01-11 18:21:01 UTC, end at Sat 2024-01-13 00:08:00 UTC. --
Jan 11 20:50:21 ip-10-118-128-187.us-east-2.compute.internal kernel: Out of memory: Kill process 6269 (terraform-state) score 224 or sacrifice child
Jan 11 20:50:21 ip-10-118-128-187.us-east-2.compute.internal kernel: Killed process 6269 (terraform-state) total-vm:14121672kB, anon-rss:7247892kB, file-rss:0kB, shmem-rss:0kB
- Terraform Enterprise log:
2024-01-11 18:22:36,947 INFO success: terraform-state-parser entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-01-11 18:22:36,947 INFO success: tfe-health-check entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-01-11 18:22:36,947 INFO success: vault entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-01-11 20:50:20,331 INFO exited: terraform-state-parser (terminated by SIGKILL; not expected) <-- HERE
2024-01-11 20:50:20,823 INFO spawned: 'terraform-state-parser' with pid 7253
2024-01-11 20:50:21,686 INFO success: terraform-state-parser entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
- Evidence of high resource consumption by the Puma workers:
sh-4.2$ sudo docker exec -it terraform-enterprise-tfe-1 ps aux | awk '{mem[$11]+=int($6/1024)}; {cpuper[$11]+=$3};{memper[$11]+=$4}; END {for (i in mem) {print cpuper[i]"% ",memper[i]"% ",mem[i]"MB ",i}}' | sort -k3nr | head -n 10
187.3% 71.6% 22720 MB puma:
Solution
- Create a puma_restart.sh file with the following contents and save it:
#!/bin/bash
# This script issues a rolling restart on puma worker processes. Worker
# processes wait for active connections to close before proceeding. More
# information on Puma signals can be found here:
# https://github.com/puma/puma/blob/master/docs/signals.md#puma-signals
# Function to find the primary Puma process ID
get_puma_pid() {
  local container_id="$1"
  docker exec "$container_id" ps -ax | grep "puma 6" | grep -v grep | awk '{print $1}'
}
# Prompt for container name or ID
read -p "Enter Container Name or ID: " container_id
# Check if container ID is empty
if [[ -z "$container_id" ]]; then
  echo "Error: Please enter a container ID."
  exit 1
fi
# Find the Puma process ID
puma_pid=$(get_puma_pid "$container_id")
# Check if puma_pid is empty
if [[ -z "$puma_pid" ]]; then
  echo "Error: Couldn't find Puma process in container $container_id."
  exit 1
fi
# Signal Puma for restart with informative message
if docker exec -u 0 "$container_id" kill -USR1 "$puma_pid"; then
  echo "Sent USR1 signal to Puma process $puma_pid in container $container_id. Initiating rolling restart."
fi
- Make the file executable:
chmod +x /path/to/puma_restart.sh
- Execute the script and enter the container name or ID, for example:
$ ./puma_restart.sh
Enter Container Name or ID: terraform-enterprise
- For Terraform Enterprise v202307-1 and v202308-1, restart the tfe-atlas container instead:
$ docker restart tfe-atlas
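- For later releases, if saving a script is not practical, the same phased restart can be issued manually. The commands below are a minimal sketch that mirrors the script above; the container name terraform-enterprise and the "puma 6" process title are assumptions.
# Find the primary Puma process ID inside the container (assumes the process title contains "puma 6")
puma_pid=$(docker exec terraform-enterprise ps -ax | grep "puma 6" | grep -v grep | awk '{print $1}')
# Send SIGUSR1 to trigger a phased (rolling) restart of the Puma workers
docker exec -u 0 terraform-enterprise kill -USR1 "$puma_pid"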
Outcome
Confirm that the script has restarted the Puma workers by tailing the Terraform Enterprise container's logs:
$ docker logs --timestamps --details terraform-enterprise
...
2024-03-27T17:06:36.967157153Z {"log":"[129] - Starting phased worker restart, phase: 3","component":"atlas"}
2024-03-27T17:06:36.967210934Z {"log":"[129] + Changing to /app","component":"atlas"}
2024-03-27T17:06:36.967220474Z {"log":"[129] - Stopping 1966 for phased upgrade...","component":"atlas"}
2024-03-27T17:06:36.967227114Z {"log":"[129] - TERM sent to 1966...","component":"atlas"}
2024-03-27T17:06:36.967233055Z {"log":"2024-03-27 17:06:36 [DEBUG] Shutting down background worker","component":"atlas"}
2024-03-27T17:06:36.967726387Z {"log":"2024-03-27 17:06:36 [DEBUG] Killing session flusher","component":"atlas"}
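Optionally, re-check memory consumption after the rolling restart completes to confirm that the Puma workers' resident memory has dropped. A sketch, assuming the container is named terraform-enterprise:
$ docker stats --no-stream terraform-enterprise
Alternatively, re-run the ps aux aggregation shown in the Cause section and compare the puma memory figure against the earlier reading.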