Introduction
This article addresses an issue observed when running Nomad jobs with nomad-podman-driver version 0.6.2, where application containers are not properly cleaned up and become orphaned after a Nomad agent restart on a worker node. This leads to unexpected duplicate container instances.
Problem
When the Nomad agent on a worker node is stopped and later restarted, Podman-based application containers whose allocations were re-scheduled to another healthy worker node in the meantime are unexpectedly left running on the original node. As a result, the same application container runs simultaneously on two different nodes, leaving orphaned, unmanaged resources behind.
Prerequisites
To reproduce and understand this issue, you will need:
- A running Nomad cluster with at least two worker nodes.
- nomad-podman-driver version 0.6.2 installed (this is the affected version).
- Podman installed on the worker nodes.
Verification Steps:
- Verify Nomad Node Health:
$ nomad node status
ID        Node Pool  DC   Name        Class   Drain  Eligibility  Status
15448551  default    dc1  dc1-cli-02  <none>  false  eligible     ready
99c7d516  default    dc1  dc1-cli-01  <none>  false  eligible     ready
- Confirm nomad-podman-driver Version: Check the Nomad client's operational logs during service start-up for the detected plugin version:
[INFO] agent: detected plugin: name=podman type=driver plugin_version=0.6.2
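If the Nomad client runs under systemd and logs to the journal (an assumption about your environment, as is the nomad unit name), you can filter for this line directly:
$ journalctl -u nomad --no-pager | grep "detected plugin"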
Cause
This behavior is caused by a bug in nomad-podman-driver version 0.6.2 that prevents proper container cleanup when the Nomad agent restarts after its allocations have been re-scheduled to another node.
Solution
The issue has been identified and fixed in a pre-release version of nomad-podman-driver
and will be officially resolved in version 0.6.3.
Recommended Action:
Upgrade your nomad-podman-driver
to version 0.6.3 (or a later stable release) as soon as it becomes available.
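Once 0.6.3 is published, a rough upgrade sketch might look like the following; it assumes the driver runs as an external plugin binary in the Nomad client's plugin_dir, and the binary name and paths shown are illustrative only, so adjust them to your installation:
# Stop the Nomad client before swapping the plugin binary (illustrative paths).
sudo systemctl stop nomad
sudo install -m 0755 ./nomad-driver-podman /opt/nomad/plugins/nomad-driver-podman
sudo systemctl start nomad
# Re-check the detected plugin version in the client logs, as in the
# verification step above; it should now report a version newer than 0.6.2.
journalctl -u nomad --no-pager | grep "detected plugin"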
Outcome
Upgrading to nomad-podman-driver version 0.6.3 or later resolves the bug that causes orphaned Podman containers after a Nomad agent restart. Nomad will correctly clean up containers on the original node when an allocation is moved to another healthy worker node, ensuring that each application container runs only on the node that currently manages its allocation.
Additional Information
Steps to Reproduce (for verification or further debugging):
- Run a sample Podman-based Nomad job: Save the following HCL to a file (e.g., nginx-job.nomad):
job "nginx-podman-job" {
  datacenters = ["dc1"]
  type        = "service"

  group "nginx-group" {
    count = 1

    task "nginx-task" {
      driver = "podman"

      config {
        image = "docker.io/library/nginx:latest"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
Submit the job:
nomad job run nginx-job.nomad
- Identify the Worker Node: Locate the worker node where the nginx-podman-job allocation is placed (e.g., dc1-cli-01); see the command sketch after these steps.
- Check Container Status and Stop Nomad Agent: SSH into the identified worker node (dc1-cli-01) and verify the running container:
[dc1-cli-01 ~]$ sudo podman ps
CONTAINER ID  IMAGE                           COMMAND               CREATED      STATUS      PORTS   NAMES
47abe75da52a  docker.io/library/nginx:latest  nginx -g daemon o...  2 hours ago  Up 2 hours  80/tcp  nginx-task-9c86bdab-472b-2544-3c57-0b78ee8d0d28
Then, stop the Nomad agent service on this node:
[dc1-cli-01 ~]$ sudo systemctl stop nomad
- Observe Allocation Re-scheduling: Nomad will re-schedule the workload to another healthy worker node (e.g., dc1-cli-02). Verify it's running on the new node:
[dc1-cli-02 ~]$ sudo podman ps
CONTAINER ID  IMAGE                           COMMAND               CREATED         STATUS         PORTS   NAMES
193d1fecf152  docker.io/library/nginx:latest  nginx -g daemon o...  29 seconds ago  Up 29 seconds  80/tcp  nginx-task-636dc91e-e44e-f447-5971-f98da7639cb6
- Restart Nomad Agent and Observe Issue: Start the Nomad agent service on the original node (dc1-cli-01) again:
[dc1-cli-01 ~]$ sudo systemctl start nomad
Now, check the Podman containers on both nodes. You will observe the orphaned container on dc1-cli-01:
[dc1-cli-01 ~]$ sudo podman ps
CONTAINER ID  IMAGE                           COMMAND               CREATED      STATUS      PORTS   NAMES
47abe75da52a  docker.io/library/nginx:latest  nginx -g daemon o...  3 hours ago  Up 3 hours  80/tcp  nginx-task-9c86bdab-472b-2544-3c57-0b78ee8d0d28
[dc1-cli-02 ~]$ sudo podman ps
CONTAINER ID  IMAGE                           COMMAND               CREATED         STATUS         PORTS   NAMES
193d1fecf152  docker.io/library/nginx:latest  nginx -g daemon o...  14 minutes ago  Up 14 minutes  80/tcp  nginx-task-636dc91e-e44e-f447-5971-f98da7639cb6
This demonstrates the same application container unexpectedly running on both nodes at once.
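For the "Identify the Worker Node" step, and to confirm the re-scheduling from the cluster side, the Nomad CLI can show where the allocation lands. This is a minimal sketch assuming the example job name above; <alloc-id> is a placeholder you copy from the job status output:
# The Allocations table lists each allocation together with its Node ID.
nomad job status nginx-podman-job
# Shows the Node Name currently hosting that allocation; replace <alloc-id>
# with an ID copied from the table above.
nomad alloc status <alloc-id>
Re-running nomad job status after the original agent has been stopped should show a replacement allocation placed on another node.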