Introduction
This article addresses an issue observed when running Nomad jobs with nomad-podman-driver version 0.6.2, where application containers are not properly cleaned up and become orphaned after a Nomad agent restart on a worker node. This leads to unexpected duplicate container instances.
Problem
When the Nomad agent on a worker node is stopped and later restarted, Podman-based application containers whose allocations were re-scheduled to another healthy worker node in the meantime are unexpectedly left running on the original node. As a result, the same application container runs simultaneously on two different nodes, leaving orphaned, unmanaged resources behind.
Prerequisites
To reproduce and understand this issue, you will need:
- A running Nomad cluster with at least two worker nodes.
- nomad-podman-driver version 0.6.2 installed (this is the affected version).
- Podman installed on the worker nodes.
Verification Steps:
- Verify Nomad Node Health:
$ nomad node status
ID        Node Pool  DC   Name        Class   Drain  Eligibility  Status
15448551  default    dc1  dc1-cli-02  <none>  false  eligible     ready
99c7d516  default    dc1  dc1-cli-01  <none>  false  eligible     ready
- Confirm nomad-podman-driver Version: Check the Nomad client's operational logs during service start-up for the detected plugin version:
[INFO] agent: detected plugin: name=podman type=driver plugin_version=0.6.2
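If the Nomad client runs under systemd and logs to the journal (an assumption about your environment, as is the nomad unit name), you can filter for this line directly:
$ journalctl -u nomad --no-pager | grep "detected plugin"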
Cause
This behavior is caused by a bug in nomad-podman-driver version 0.6.2 that prevents proper container cleanup when the Nomad agent restarts after its allocations have been re-scheduled to another node.
Solution
The issue has been identified and fixed in a pre-release version of nomad-podman-driver
and will be officially resolved in version 0.6.3.
Recommended Action:
Upgrade your nomad-podman-driver
to version 0.6.3 (or a later stable release) as soon as it becomes available.
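Once 0.6.3 is published, a rough upgrade sketch might look like the following; it assumes the driver runs as an external plugin binary in the Nomad client's plugin_dir, and the binary name and paths shown are illustrative only, so adjust them to your installation:
# Stop the Nomad client before swapping the plugin binary (illustrative paths).
sudo systemctl stop nomad
sudo install -m 0755 ./nomad-driver-podman /opt/nomad/plugins/nomad-driver-podman
sudo systemctl start nomad
# Re-check the detected plugin version in the client logs, as in the
# verification step above; it should now report a version newer than 0.6.2.
journalctl -u nomad --no-pager | grep "detected plugin"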
Outcome
Upgrading to nomad-podman-driver version 0.6.3 or later resolves the bug that causes orphaned Podman containers after a Nomad agent restart. Nomad will correctly clean up containers on the original node when an allocation is moved to another healthy worker node, ensuring that each application container runs only on the node that currently manages its allocation.
Additional Information
Steps to Reproduce (for verification or further debugging):
- Run a sample Podman-based Nomad job: Save the following HCL to a file (e.g., nginx-job.nomad):
job "nginx-podman-job" {
  datacenters = ["dc1"]
  type        = "service"

  group "nginx-group" {
    count = 1

    task "nginx-task" {
      driver = "podman"

      config {
        image = "docker.io/library/nginx:latest"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
Submit the job:
nomad job run nginx-job.nomad
- Identify the Worker Node: Locate the worker node where the nginx-podman-job allocation is placed (e.g., dc1-cli-01); see the command sketch after these steps.
- Check Container Status and Stop Nomad Agent: SSH into the identified worker node (dc1-cli-01) and verify the running container:
[dc1-cli-01 ~]$ sudo podman ps
CONTAINER ID  IMAGE                           COMMAND               CREATED      STATUS      PORTS   NAMES
47abe75da52a  docker.io/library/nginx:latest  nginx -g daemon o...  2 hours ago  Up 2 hours  80/tcp  nginx-task-9c86bdab-472b-2544-3c57-0b78ee8d0d28
Then, stop the Nomad agent service on this node:
[dc1-cli-01 ~]$ sudo systemctl stop nomad
- Observe Allocation Re-scheduling: Nomad will re-schedule the workload to another healthy worker node (e.g., dc1-cli-02). Verify it's running on the new node:
[dc1-cli-02 ~]$ sudo podman ps
CONTAINER ID  IMAGE                           COMMAND               CREATED         STATUS         PORTS   NAMES
193d1fecf152  docker.io/library/nginx:latest  nginx -g daemon o...  29 seconds ago  Up 29 seconds  80/tcp  nginx-task-636dc91e-e44e-f447-5971-f98da7639cb6
- Restart Nomad Agent and Observe Issue: Start the Nomad agent service on the original node (dc1-cli-01) again:
[dc1-cli-01 ~]$ sudo systemctl start nomad
Now, check the Podman containers on both nodes. You will observe the orphaned container on dc1-cli-01:
[dc1-cli-01 ~]$ sudo podman ps
CONTAINER ID  IMAGE                           COMMAND               CREATED      STATUS      PORTS   NAMES
47abe75da52a  docker.io/library/nginx:latest  nginx -g daemon o...  3 hours ago  Up 3 hours  80/tcp  nginx-task-9c86bdab-472b-2544-3c57-0b78ee8d0d28
[dc1-cli-02 ~]$ sudo podman ps
CONTAINER ID  IMAGE                           COMMAND               CREATED         STATUS         PORTS   NAMES
193d1fecf152  docker.io/library/nginx:latest  nginx -g daemon o...  14 minutes ago  Up 14 minutes  80/tcp  nginx-task-636dc91e-e44e-f447-5971-f98da7639cb6
This demonstrates the same application container unexpectedly running on both nodes at once.
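For the "Identify the Worker Node" step, and to confirm the re-scheduling from the cluster side, the Nomad CLI can show where the allocation lands. This is a minimal sketch assuming the example job name above; <alloc-id> is a placeholder you copy from the job status output:
# The Allocations table lists each allocation together with its Node ID.
nomad job status nginx-podman-job
# Shows the Node Name currently hosting that allocation; replace <alloc-id>
# with an ID copied from the table above.
nomad alloc status <alloc-id>
Re-running nomad job status after the original agent has been stopped should show a replacement allocation placed on another node.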