Problem
The tfectl node drain command hangs indefinitely and queued runs continue to be executed.
root@terraform-enterprise:~# docker compose exec terraform-enterprise tfectl node drain
Starting node drain activity. This process runs in the background. Please monitor its progress before proceeding with a complete application shutdown.
stopping service: service=sidekiq
waiting for command to finish execution on node 650f8cfa0dac
successfully stopped service: service=sidekiq
stopping service: service=task-worker
waiting for command to finish execution on node 650f8cfa0dac
waiting for command to finish execution on node 650f8cfa0dac
waiting for command to finish execution on node 650f8cfa0dac
waiting for command to finish execution on node 650f8cfa0dac
waiting for command to finish execution on node 650f8cfa0dac
waiting for command to finish execution on node 650f8cfa0dac
waiting for command to finish execution on node 650f8cfa0dac
waiting for command to finish execution on node 650f8cfa0dac
waiting for command to finish execution on node 650f8cfa0dac
waiting for command to finish execution on node 650f8cfa0dac
^Croot@terraform-enterprise:~#
The output of supervisorctl status in the Terraform Enterprise container shows the tfe:task-worker service is stuck in a STOPPING state.
terraform-enterprise@75b9f4daceb8:/# supervisorctl status
fluent-bit RUNNING pid 45, uptime 0:02:21
postgres STOPPED Not started
redis RUNNING pid 49, uptime 0:02:20
terraform-enterprise RUNNING pid 25, uptime 0:02:24
tfe:archivist RUNNING pid 75, uptime 0:02:19
tfe:atlas RUNNING pid 76, uptime 0:02:19
tfe:backup-restore RUNNING pid 77, uptime 0:02:19
tfe:licensing RUNNING pid 82, uptime 0:02:19
tfe:metrics RUNNING pid 87, uptime 0:02:19
tfe:nginx RUNNING pid 91, uptime 0:02:19
tfe:outbound-http-proxy RUNNING pid 98, uptime 0:02:19
tfe:sidekiq STOPPED Sep 17 04:08 PM
tfe:slug-ingress RUNNING pid 103, uptime 0:02:19
tfe:task-worker STOPPING Sep 17 04:08 PM
tfe:terraform-registry-api RUNNING pid 123, uptime 0:02:19
tfe:terraform-registry-worker RUNNING pid 124, uptime 0:02:19
tfe:terraform-state-parser RUNNING pid 130, uptime 0:02:19
tfe:tfe-health-check RUNNING pid 141, uptime 0:02:19
tfe:vault RUNNING pid 144, uptime 0:02:18
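To check a single service without reading the whole listing, the state column can be filtered out of the status output. The following is a minimal sketch: service_state is a hypothetical helper name, not part of Terraform Enterprise or supervisor, and it assumes the output keeps the whitespace-separated "name state ..." layout shown above.

```shell
# Hypothetical helper: print the state column (e.g. RUNNING, STOPPING)
# for one service, reading `supervisorctl status` output on stdin.
# Assumes the first field is the service name and the second is its state.
service_state() {
  awk -v svc="$1" '$1 == svc { print $2 }'
}

# Example usage inside the Terraform Enterprise container:
#   supervisorctl status | service_state tfe:task-worker
```

With the listing above as input, the helper prints STOPPING for tfe:task-worker, which indicates the node drain is affected by this issue.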
Prerequisites
- Terraform Enterprise v202404-2 to v202409-1
- Docker and Podman deployments
Cause
When a node drain command is executed, the two services that make up the run pipeline, sidekiq and task-worker, are gracefully stopped so that in-flight jobs complete and no new jobs are enqueued. In Terraform Enterprise releases v202404-2 through v202409-1, a bug prevents the task-worker process from shutting down during the node drain. As a result, the drain never completes and the task-worker continues to execute queued runs.
Solution
Upgrade to v202409-2 for a permanent fix. As a temporary workaround, use the following command in place of the tfectl node drain command.
- Docker
docker exec -u 0 <TFE_CONTAINER> bash -c 'supervisorctl stop tfe:sidekiq && TTW_PIDS=$(pgrep -f /usr/local/bin/task-worker); for pid in $TTW_PIDS; do kill -s TERM $pid && echo "tfe:task-worker: stopped ($pid)"; done'
- Podman
podman exec -u 0 <TFE_CONTAINER> bash -c 'supervisorctl stop tfe:sidekiq && TTW_PIDS=$(pgrep -f /usr/local/bin/task-worker); for pid in $TTW_PIDS; do kill -s TERM $pid && echo "tfe:task-worker: stopped ($pid)"; done'