Problem
When running the tfe-admin node-drain command on certain versions of Terraform Enterprise, the command does not correctly stop queued runs from being processed. This issue affects environments where consolidated services are disabled.
The command output varies by Terraform Enterprise release.
For versions v202306-1 to v202309-1, the command appears to succeed but does not stop the correct containers.
# tfe-admin node-drain
Running node-drain (localhost)
[INFO] draining node: node=localhost
[INFO] stopping sidekiq
[INFO] successfully stopped sidekiq: output="tfe-sidekiq"
[INFO] stopping build_manager and build_worker
[INFO] successfully stopped build_manager and build_worker: output="tfe-build-managertfe-build-worker"
For versions v202310-1 to v202312-1, the command fails with an error because the legacy containers it tries to stop no longer exist.
# tfe-admin node-drain
Running node-drain (localhost)
[INFO] draining node: node=localhost
[INFO] stopping sidekiq
[INFO] successfully stopped sidekiq: output="tfe-sidekiq"
[INFO] stopping build_manager and build_worker
[ERROR] error stopping build_manager and build_worker: error="exit status 1"
[ERROR] Error response from daemon: No such container: tfe-build-managerError response from daemon: No such container: tfe-build-worker: error="exit status 1"
error draining node: error stopping build_manager and build_worker: exit status 1
Prerequisites
This issue affects Replicated deployments of Terraform Enterprise that meet the following criteria:
- Version is between v202306-1 and v202312-1.
- Consolidated services mode is disabled:
  - consolidated_services = 0 (for versions v202306-1 to v202308-1)
  - consolidated_services_enabled = 0 (for versions v202309-1 to v202312-1)
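The criteria above can be expressed as a small check. The following is a minimal sketch, assuming you already know the release string and the value of the relevant consolidated services setting. The function name is_affected and its output strings are hypothetical and are not part of any Terraform Enterprise tooling.

```shell
# Sketch only: is_affected VERSION SETTING
#   VERSION  release string in the vYYYYMM-N format, for example v202307-1
#   SETTING  value of consolidated_services (v202306-1 to v202308-1) or
#            consolidated_services_enabled (v202309-1 to v202312-1)
# Prints "affected" or "not affected". Returns 2 for an unrecognized version.
is_affected() {
  version=$1
  setting=$2

  # Accept only strings that look like vYYYYMM-N.
  case $version in
    v[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9]*) ;;
    *) echo "unrecognized version: $version" >&2; return 2 ;;
  esac

  month=${version#v}   # strip the leading "v"      -> 202307-1
  patch=${month#*-}    # keep the part after "-"    -> 1
  month=${month%%-*}   # keep the part before "-"   -> 202307

  # Combine into one integer so releases compare numerically,
  # for example v202307-1 becomes 20230701. Assumes the patch number is below 100.
  key=$(( month * 100 + patch ))

  # Affected range taken from the criteria above: v202306-1 through v202312-1,
  # and only when consolidated services is disabled (0).
  if [ "$key" -ge 20230601 ] && [ "$key" -le 20231201 ] && [ "$setting" = "0" ]; then
    echo "affected"
  else
    echo "not affected"
  fi
}
```

For example, is_affected v202307-1 0 prints "affected", while is_affected v202307-1 1 prints "not affected".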
Cause
In version v202306-1, support for the legacy pipeline was removed. This change unintentionally removed a configuration option that directed the tfe-admin node-drain command to stop the tfe-task-worker container. As a result, the command attempts to stop containers from the legacy pipeline instead of the correct container for the agent pipeline (tfe-task-worker).
In version v202310-1, the legacy tfe-build-worker and tfe-build-manager containers were removed entirely, causing the command to fail with an error.
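Because the command never stops tfe-task-worker on affected releases, one way to confirm the behavior is to look for that container after a drain. The following is a minimal sketch, assuming container names are supplied one per line on standard input. The function name report_drain_state and its messages are hypothetical.

```shell
# Sketch only: reads container names on stdin, one per line.
# Returns 1 and prints a warning if tfe-task-worker is among them.
report_drain_state() {
  # -x matches the whole line, so names that merely contain the string do not count.
  if grep -qx 'tfe-task-worker'; then
    echo "tfe-task-worker still running: node is not fully drained"
    return 1
  fi
  echo "tfe-task-worker not running"
}
```

On a live node, the list of running container names can be piped in with docker ps --format '{{.Names}}' | report_drain_state.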
Solutions
There are two approaches to address this issue. The recommended solution is to enable consolidated services. If that is not immediately possible, you can use the manual workaround.
Solution 1: Enable Consolidated Services (Recommended)
Enabling consolidated services mode resolves the issue by aligning the instance with the current architecture, which corrects the behavior of the node-drain command.
- Enable consolidated services. The command varies by version.
  For versions v202306-1 to v202308-1, run the following command.
  # replicatedctl app-config set consolidated_services --value 1
  For versions v202309-1 to v202312-1, run the following command.
  # replicatedctl app-config set consolidated_services_enabled --value 1
- Apply the configuration change to restart the application.
  # replicatedctl app apply-config
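Since the setting name depends on the release, the choice can be sketched as a helper that prints the commands for review instead of running them. This is an illustrative sketch only. The function names consolidated_key and print_enable_commands are hypothetical, and the version ranges are the ones listed above.

```shell
# Sketch only: consolidated_key VERSION
# Prints the setting name that applies to VERSION (vYYYYMM-N format).
consolidated_key() {
  case $1 in
    v[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9]*) ;;
    *) echo "unrecognized version: $1" >&2; return 2 ;;
  esac

  month=${1#v}         # strip the leading "v"
  month=${month%%-*}   # keep YYYYMM only

  if [ "$month" -ge 202306 ] && [ "$month" -le 202308 ]; then
    echo consolidated_services
  elif [ "$month" -ge 202309 ] && [ "$month" -le 202312 ]; then
    echo consolidated_services_enabled
  else
    echo "version $1 is outside the affected range" >&2
    return 1
  fi
}

# Sketch only: print_enable_commands VERSION
# Prints, but does not execute, the two commands from the steps above.
print_enable_commands() {
  key=$(consolidated_key "$1") || return 1
  echo "replicatedctl app-config set $key --value 1"
  echo "replicatedctl app apply-config"
}
```

Printing the commands rather than executing them keeps the restart under the operator's control, because applying the configuration restarts the application.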
Solution 2: Manually Stop the Task Worker Container
If you cannot enable consolidated services immediately, you can manually stop the tfe-task-worker container as a workaround.
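Run the following command immediately after running tfe-admin node-drain. The -t 86400 option gives the container up to 86400 seconds (24 hours) to exit before Docker kills it.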
# docker stop -t 86400 tfe-task-worker
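The two commands can be chained in a small wrapper. The following is a minimal sketch, assuming that on v202310-1 to v202312-1 the node-drain command exits non-zero, as the error output shown earlier suggests, so the wrapper continues to the docker stop step instead of aborting. The function names run and drain_node and the DRY_RUN variable are hypothetical.

```shell
# Sketch only: run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Sketch only: drain the node, then stop the task worker container.
drain_node() {
  # node-drain is assumed to fail on v202310-1 to v202312-1, so a non-zero
  # exit is reported on stderr but does not stop the workaround.
  run tfe-admin node-drain || echo "node-drain exited non-zero; continuing" >&2

  # Same command as above: wait up to 86400 seconds before Docker kills the container.
  run docker stop -t 86400 tfe-task-worker
}
```

Setting DRY_RUN=1 before calling drain_node prints both commands without executing them, which is a low-risk way to review the sequence first.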