Problem
Terraform Enterprise application crashes due to OOM Killer
Prerequisites
- Terraform Enterprise v202307-1 through present (v202406-1)
Cause
The Terraform Enterprise application may become unavailable due to unbounded memory growth in the puma processes inside the container. When monitoring container resource utilization, memory consumption normally grows, plateaus, and falls again as the application is used. Sporadically, however, memory growth continues until the Linux kernel triggers the Out of Memory Killer (OOM Killer) to prevent the system from crashing when memory runs low.
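To observe this pattern over time, container memory can be sampled periodically with docker stats. The loop below is a minimal sketch; the container name terraform-enterprise and the 60-second interval are assumptions and should be adjusted to the environment.
# Sample container memory usage once per minute (container name is an assumption)
while true; do
  docker stats --no-stream --format '{{.Name}}: {{.MemUsage}} ({{.MemPerc}})' terraform-enterprise
  sleep 60
done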
- Example of an OOM event:
sudo journalctl -k -p err -b
-- Logs begin at Thu 2024-01-11 18:21:01 UTC, end at Sat 2024-01-13 00:08:00 UTC. --
Jan 11 20:50:21 ip-10-118-128-187.us-east-2.compute.internal kernel: Out of memory: Kill process 6269 (terraform-state) score 224 or sacrifice child
Jan 11 20:50:21 ip-10-118-128-187.us-east-2.compute.internal kernel: Killed process 6269 (terraform-state) total-vm:14121672kB, anon-rss:7247892kB, file-rss:0kB, shmem-rss:0kB
- Terraform Enterprise log:
2024-01-11 18:22:36,947 INFO success: terraform-state-parser entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-01-11 18:22:36,947 INFO success: tfe-health-check entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-01-11 18:22:36,947 INFO success: vault entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-01-11 20:50:20,331 INFO exited: terraform-state-parser (terminated by SIGKILL; not expected) <-- HERE
2024-01-11 20:50:20,823 INFO spawned: 'terraform-state-parser' with pid 7253
2024-01-11 20:50:21,686 INFO success: terraform-state-parser entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
- Evidence of high resource consumption by the Puma workers:
sh-4.2$ sudo docker exec -it terraform-enterprise-tfe-1 ps aux | awk '{mem[$11]+=int($6/1024)}; {cpuper[$11]+=$3};{memper[$11]+=$4}; END {for (i in mem) {print cpuper[i]"% ",memper[i]"% ",mem[i]"MB ",i}}' | sort -k3nr | head -n 10
187.3% 71.6% 22720 MB puma:
Solution
- Create a puma_restart.sh file with the following contents and save it:
#!/bin/bash
# This script issues a rolling restart on puma worker processes. Worker
# processes wait for active connections to close before proceeding. More
# information on Puma signals can be found here:
# https://github.com/puma/puma/blob/master/docs/signals.md#puma-signals
# Function to find the primary Puma process ID
get_puma_pid() {
  local container_id="$1"
  docker exec "$container_id" ps -ax | grep "puma 6" | grep -v grep | awk '{print $1}'
}
# Prompt for container name or ID
read -p "Enter Container Name or ID: " container_id
# Check if container ID is empty
if [[ -z "$container_id" ]]; then
  echo "Error: Please enter a container ID."
  exit 1
fi
# Find the Puma process ID
puma_pid=$(get_puma_pid "$container_id")
# Check if puma_pid is empty
if [[ -z "$puma_pid" ]]; then
  echo "Error: Couldn't find Puma process in container $container_id."
  exit 1
fi
# Signal Puma for restart with informative message
if docker exec -u 0 "$container_id" kill -USR1 "$puma_pid"; then
  echo "Sent USR1 signal to Puma process $puma_pid in container $container_id. Initiating rolling restart."
fi
- Make the file executable:
chmod +x /path/to/puma_restart.sh
- Execute the script and enter the container name or ID, for example:
$ ./puma_restart.sh
Enter Container Name or ID: terraform-enterprise
- For Terraform Enterprise v202307-1 and v202308-1, restart the tfe-atlas container instead:
$ docker restart tfe-atlas
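- For later releases, if saving a script is not practical, the same phased restart can be issued manually. The commands below are a minimal sketch that mirrors the script above; the container name terraform-enterprise and the "puma 6" process title are assumptions.
# Find the primary Puma process ID inside the container (assumes the process title contains "puma 6")
puma_pid=$(docker exec terraform-enterprise ps -ax | grep "puma 6" | grep -v grep | awk '{print $1}')
# Send SIGUSR1 to trigger a phased (rolling) restart of the Puma workers
docker exec -u 0 terraform-enterprise kill -USR1 "$puma_pid"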
Outcome
Confirm that the script has restarted the Puma workers by tailing the Terraform Enterprise container's logs:
$ docker logs --timestamps --details terraform-enterprise
...
2024-03-27T17:06:36.967157153Z {"log":"[129] - Starting phased worker restart, phase: 3","component":"atlas"}
2024-03-27T17:06:36.967210934Z {"log":"[129] + Changing to /app","component":"atlas"}
2024-03-27T17:06:36.967220474Z {"log":"[129] - Stopping 1966 for phased upgrade...","component":"atlas"}
2024-03-27T17:06:36.967227114Z {"log":"[129] - TERM sent to 1966...","component":"atlas"}
2024-03-27T17:06:36.967233055Z {"log":"2024-03-27 17:06:36 [DEBUG] Shutting down background worker","component":"atlas"}
2024-03-27T17:06:36.967726387Z {"log":"2024-03-27 17:06:36 [DEBUG] Killing session flusher","component":"atlas"}
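Optionally, re-check memory consumption after the rolling restart completes to confirm that the Puma workers' resident memory has dropped. A sketch, assuming the container is named terraform-enterprise:
$ docker stats --no-stream terraform-enterprise
Alternatively, re-run the ps aux aggregation shown in the Cause section and compare the puma memory figure against the earlier reading.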