Introduction
The Nomad UI and CLI offer powerful tools for monitoring job status. However, discrepancies can arise, particularly for terminal or dead jobs, leading to confusion and potentially inaccurate assessments. This guide explores native Nomad commands to enforce consistency between the UI and CLI job status views, ultimately improving monitoring accuracy and streamlining job observation.
Expected Outcome
By following the steps outlined in this guide, you will be able to:
- Achieve consistent job status reporting: Eliminate discrepancies between the Nomad UI and CLI, ensuring both interfaces reflect the accurate state of your jobs, even for terminal or dead jobs.
- Improve monitoring accuracy: Gain a reliable and unified view of your job statuses, facilitating accurate assessments and informed decision-making.
- Enhance troubleshooting efficiency: Quickly identify and address job-related issues by relying on consistent status information across both the UI and CLI.
- Streamline job observation: Simplify the monitoring process with a clear and consistent understanding of job states, regardless of the interface used.
Ultimately, this guide empowers you to gain greater control and confidence in managing your Nomad jobs by ensuring consistency and accuracy in status reporting.
Prerequisites
Nomad and Consul Compatibility:
This knowledge base (KB) article uses specific versions of Nomad and Consul. However, you can use any compatible versions that support periodic jobs, garbage collection, and the system reconcile summaries operation.
Consul Mesh/Connect Mode:
This guide demonstrates the use of Consul with mesh/Connect mode, which is integrated with Nomad. However, the Consul agent is used here only to reproduce the issue.
Nomad Version: 1.7.2+ent
Consul Version: 1.17.3+ent
Procedure
Reproduction setup
- Run both Consul and Nomad in dev mode with Connect enabled.

consul_config.hcl

ports {
  grpc = 8502
}

connect {
  enabled = true
}

$ consul agent -dev -config-file=./consul_config.hcl
nomad_config.hcl

server {
  # license_path is required for Nomad Enterprise as of Nomad v1.1.1+
  license_path = "/etc/nomad.d/license.hclic"
  enabled      = true
}

acl {
  enabled = true
}

$ nomad agent -dev-connect -dev-consul -config=./nomad_config.hcl
- Run the following periodic job in the default namespace.
job "docs-1" {
  datacenters = ["ns1"]
  namespace   = "default"
  type        = "batch"

  periodic {
    cron             = "*/15 * * * * * *"
    prohibit_overlap = true
  }

  group "example" {
    network {
      port "http" {
        static = "5678"
      }
    }

    task "server" {
      driver = "docker"

      config {
        image = "hashicorp/http-echodgdgdg" # Intentionally passed wrong image for the job to fail
        ports = ["http"]
        args = [
          "-listen",
          ":5678",
          "-text",
          "hello world",
        ]
      }
    }
  }
}
[ec2-user@ip-172-31-44-235 ~]$ nomad job status docs-1
ID                   = docs-1
Name                 = docs-1
Submit Date          = 2024-05-24T06:35:26Z
Type                 = batch
Priority             = 50
Datacenters          = dc1
Namespace            = default
Node Pool            = default
Status               = running
Periodic             = true
Parameterized        = false
Next Periodic Launch = 2024-05-24T07:11:15Z (14s from now)

Children Job Summary
Pending  Running  Dead
0        1        9

Previously Launched Jobs
ID                          Status
docs-1/periodic-1716533340  dead
docs-1/periodic-1716533475  dead
docs-1/periodic-1716533610  dead
docs-1/periodic-1716533745  dead
docs-1/periodic-1716533880  dead
docs-1/periodic-1716534015  dead
docs-1/periodic-1716534150  dead
docs-1/periodic-1716534285  dead
docs-1/periodic-1716534420  dead
docs-1/periodic-1716534555  running
- When child jobs enter a "dead" state, the nomad system gc command should clean up terminal jobs and allocations. However, after garbage collection an inconsistency exists between the job status summary and the list of previously launched jobs.

Observed Discrepancy:
- Children Job Summary: still reports 9 dead child jobs.
- Previously Launched Jobs: no longer lists those dead child jobs.
[ec2-user@ip-172-31-44-235 ~]$ nomad job status docs-1
ID                   = docs-1
Name                 = docs-1
Submit Date          = 2024-05-24T06:35:26Z
Type                 = batch
Priority             = 50
Datacenters          = dc1
Namespace            = default
Node Pool            = default
Status               = running
Periodic             = true
Parameterized        = false
Next Periodic Launch = 2024-05-24T07:12:00Z (14s from now)

Children Job Summary
Pending  Running  Dead
0        1        9

Previously Launched Jobs
ID                          Status
docs-1/periodic-1716534690  running
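The mismatch above can also be checked programmatically. A minimal sketch, assuming the "Children" counts returned by Nomad's job summary API (Pending/Running/Dead) and a list of child job IDs such as those returned by a prefix job listing; the function name and argument shapes are illustrative, not part of any Nomad client library:

```python
def children_counts_match(children_summary, launched_jobs):
    """Return True when the Children Job Summary agrees with the
    list of previously launched child jobs.

    children_summary: dict with "Pending", "Running", and "Dead" counts,
        as in the "Children" block of a periodic job's summary.
    launched_jobs: list of child job IDs (e.g. docs-1/periodic-*).
    """
    expected = (children_summary.get("Pending", 0)
                + children_summary.get("Running", 0)
                + children_summary.get("Dead", 0))
    # Consistent state: the summary counts exactly the children listed.
    return expected == len(launched_jobs)
```

With the post-gc output above, the summary still counts 9 dead children while only one child job remains listed, so a check like this returns False until the summaries are reconciled.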
Solution
Ensuring Job Status Consistency After Garbage Collection
Running nomad system gc to clean up terminal allocations from periodic jobs should ideally result in consistent job status and history across both the UI and CLI. However, one additional step is required to achieve this consistency:
- After performing nomad system gc, execute the nomad system reconcile summaries command. This command reconciles the summaries of all registered jobs, ensuring that the UI and CLI reflect the accurate and updated state of your Nomad jobs.
[ec2-user@ip-172-31-44-235 ~]$ nomad job status docs-1
ID                   = docs-1
Name                 = docs-1
Submit Date          = 2024-05-24T06:35:26Z
Type                 = batch
Priority             = 50
Datacenters          = dc1
Namespace            = default
Node Pool            = default
Status               = running
Periodic             = true
Parameterized        = false
Next Periodic Launch = 2024-05-24T07:13:45Z (8s from now)

Children Job Summary
Pending  Running  Dead
0        1        0

Previously Launched Jobs
ID                          Status
docs-1/periodic-1716534690  running
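For automation, the same two steps can be driven through Nomad's HTTP API, which exposes nomad system gc as PUT /v1/system/gc and nomad system reconcile summaries as PUT /v1/system/reconcile/summaries. The sketch below uses only the Python standard library; the address constant and token handling are assumptions for a local dev-mode agent:

```python
import urllib.request

NOMAD_ADDR = "http://127.0.0.1:4646"  # assumption: default dev-mode agent address


def build_request(path, token=None):
    """Build a PUT request for a Nomad system endpoint."""
    req = urllib.request.Request(f"{NOMAD_ADDR}{path}", method="PUT")
    if token:
        # ACLs are enabled in this setup, so pass a management token.
        req.add_header("X-Nomad-Token", token)
    return req


def gc_and_reconcile(token=None):
    # Step 1: garbage-collect terminal jobs, evaluations, and allocations.
    urllib.request.urlopen(build_request("/v1/system/gc", token))
    # Step 2: reconcile job summaries so the UI and CLI agree.
    urllib.request.urlopen(build_request("/v1/system/reconcile/summaries", token))
```

Calling gc_and_reconcile(token="...") after a batch of periodic children goes terminal performs the cleanup and immediately restores a consistent Children Job Summary.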
Conclusion
Nomad's garbage collection mechanism (nomad system gc) plays a vital role in cleaning up terminal jobs, evaluations, allocations, and nodes. While this asynchronous process effectively removes unnecessary objects, it doesn't always guarantee a consistent view of job status across the UI and CLI.
To achieve this consistency, it's crucial to perform a system reconciliation after garbage collection. By executing nomad system reconcile summaries, you ensure that Nomad reconciles the summaries of all registered jobs, providing an accurate and unified view across both interfaces.
This additional step bridges the gap between garbage collection and consistent reporting, enabling you to confidently monitor and manage your Nomad jobs with reliable and up-to-date information.