Introduction
The Nomad UI and CLI both let you review and monitor the current status of any job. In some scenarios, however, these status views can become inconsistent, for example for terminal (dead) jobs.
This guide shows how to make the status view consistent again using native Nomad commands, which helps when fine-tuning the monitoring and observation of jobs.
Lab Setup (to reproduce the issue)
Note: This KB uses the Nomad and Consul versions below, but feel free to use any compatible version that supports periodic jobs, garbage collection, and the system reconcile summaries operation.
Nomad Version: 1.7.2+ent
Consul Version: 1.17.3+ent
- Run both Consul and Nomad in dev mode with Connect enabled.
consul_config.hcl
ports {
  grpc = 8502
}
connect {
  enabled = true
}
$ consul agent -dev -config-file=./consul_config.hcl
nomad_config.hcl
server {
  # license_path is required for Nomad Enterprise as of Nomad v1.1.1+
  license_path = "/etc/nomad.d/license.hclic"
  enabled      = true
}
acl {
  enabled = true
}
$ nomad agent -dev-connect -dev-consul -config=./nomad_config.hcl
- Run the following periodic job in the default namespace.
job "docs-1" {
  # The dev agent uses datacenter dc1, as the job status output shows
  datacenters = ["dc1"]
  namespace   = "default"
  type        = "batch"

  periodic {
    cron             = "*/15 * * * * * *"
    prohibit_overlap = true
  }

  group "example" {
    network {
      port "http" {
        static = "5678"
      }
    }

    task "server" {
      driver = "docker"

      config {
        # Intentionally invalid image name so that the job fails
        image = "hashicorp/http-echodgdgdg"
        ports = ["http"]
        args = [
          "-listen", ":5678",
          "-text", "hello world",
        ]
      }
    }
  }
}
[ec2-user@ip-172-31-44-235 ~]$ nomad job status docs-1
ID = docs-1
Name = docs-1
Submit Date = 2024-05-24T06:35:26Z
Type = batch
Priority = 50
Datacenters = dc1
Namespace = default
Node Pool = default
Status = running
Periodic = true
Parameterized = false
Next Periodic Launch = 2024-05-24T07:11:15Z (14s from now)
Children Job Summary
Pending Running Dead
0 1 9
Previously Launched Jobs
ID Status
docs-1/periodic-1716533340 dead
docs-1/periodic-1716533475 dead
docs-1/periodic-1716533610 dead
docs-1/periodic-1716533745 dead
docs-1/periodic-1716533880 dead
docs-1/periodic-1716534015 dead
docs-1/periodic-1716534150 dead
docs-1/periodic-1716534285 dead
docs-1/periodic-1716534420 dead
docs-1/periodic-1716534555 running
- Once the allocations reach the dead state, run the command nomad system gc to clean up the terminal jobs and allocations. Afterwards, the Children Job Summary in the job status still reports 9 dead children, but those jobs no longer appear in the Previously Launched Jobs section. The Nomad UI shows the same inconsistency.
[ec2-user@ip-172-31-44-235 ~]$ nomad job status docs-1
ID = docs-1
Name = docs-1
Submit Date = 2024-05-24T06:35:26Z
Type = batch
Priority = 50
Datacenters = dc1
Namespace = default
Node Pool = default
Status = running
Periodic = true
Parameterized = false
Next Periodic Launch = 2024-05-24T07:12:00Z (14s from now)
Children Job Summary
Pending Running Dead
0 1 9
Previously Launched Jobs
ID Status
docs-1/periodic-1716534690 running
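The mismatch between the two sections can also be detected mechanically. The following is a minimal sketch, assuming the human-readable layout of nomad job status shown in this article. The function name check_children_summary is hypothetical and not part of Nomad. It reads the status output on standard input, compares the Dead count from Children Job Summary with the number of dead jobs listed under Previously Launched Jobs, and returns a non-zero exit code when the two differ.

```shell
# Hypothetical helper (not part of Nomad). Reads `nomad job status <job>`
# output on stdin and compares the two views of dead child jobs.
check_children_summary() {
  awk '
    /^Children Job Summary/     { section = "summary";  next }
    /^Previously Launched Jobs/ { section = "launched"; next }
    # The data row of the summary holds three integers: Pending Running Dead
    section == "summary"  && $1 ~ /^[0-9]+$/ { summary_dead = $3 }
    # Each child job row ends with its status
    section == "launched" && $NF == "dead"   { listed_dead++ }
    END {
      printf "summary_dead=%d listed_dead=%d\n", summary_dead, listed_dead
      exit (summary_dead + 0 == listed_dead + 0 ? 0 : 1)
    }
  '
}
```

For example, nomad job status docs-1 | check_children_summary prints summary_dead=9 listed_dead=0 for the output above and exits with status 1, which flags the inconsistency.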
Solution to the issue
The expectation is that once nomad system gc has cleaned up the terminal allocations of a periodic job, the job status and history are consistent in both the UI and the CLI.
To achieve this, run the nomad system reconcile summaries command after nomad system gc. It reconciles the summaries of all registered jobs.
https://developer.hashicorp.com/nomad/docs/commands/system/reconcile-summaries
https://developer.hashicorp.com/nomad/docs/commands/system
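The two commands can be chained so that the reconcile step is never forgotten after a manual garbage collection. This is a minimal sketch rather than an official procedure: the function name gc_and_reconcile is hypothetical, and it assumes the nomad binary is on the PATH and that NOMAD_ADDR and, with ACLs enabled as in this lab, NOMAD_TOKEN are already exported.

```shell
# Hypothetical helper (not part of Nomad): trigger garbage collection, then
# reconcile the job summaries so the Children Job Summary is recalculated.
gc_and_reconcile() {
  # Stop at the first failing command so a failed gc is not masked.
  nomad system gc || return 1
  nomad system reconcile summaries || return 1
}
```

After calling gc_and_reconcile, run nomad job status docs-1 again to confirm the counts. Because nomad system gc is asynchronous, it may be necessary to repeat the reconcile step if the summary still shows stale dead children.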
[ec2-user@ip-172-31-44-235 ~]$ nomad job status docs-1
ID = docs-1
Name = docs-1
Submit Date = 2024-05-24T06:35:26Z
Type = batch
Priority = 50
Datacenters = dc1
Namespace = default
Node Pool = default
Status = running
Periodic = true
Parameterized = false
Next Periodic Launch = 2024-05-24T07:13:45Z (8s from now)
Children Job Summary
Pending Running Dead
0 1 0
Previously Launched Jobs
ID Status
docs-1/periodic-1716534690 running
Conclusion
Nomad garbage collects jobs, evaluations, allocations, and nodes periodically, and nomad system gc triggers the same collection on demand. This is an asynchronous operation.
The exact garbage collection logic varies by object, but in general Nomad only permanently deletes objects once they are terminal and no longer needed for future scheduling decisions.
Although garbage collection cleans up these objects, a consistent view in the UI and CLI also requires the nomad system reconcile summaries operation, which reconciles the summaries of all registered jobs.