Overview
This guide will walk you through creating and running a job that demonstrates Nomad’s job anti-affinity rules and, in clusters with memory-limited Nomad clients, node filtering based on resource exhaustion.
Sample Environment
- One Nomad Server Node
- Three Nomad Client Nodes
  - 768 MB RAM total (providing 761 MB RAM in nomad node-status -self)
  - Docker installed
Process
Create the sample job by running nomad init.
$ nomad init
Example job file written to example.nomad
Optionally, you can filter out all of the default job file’s commentary with the following command:
$ sed -i.bak -e '/\s*#.*$/d' -e '/^\s*$/d' example.nomad
The nomad init command creates the sample job used in the Nomad getting started guide, which spins up a Docker container running Redis.
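For reference, here is a trimmed version of the stanzas that matter for this guide. The exact contents generated by nomad init depend on your Nomad version, and the stripped file also contains additional stanzas (such as the service block) that are omitted below.
$ cat example.nomad
job "example" {
  datacenters = ["dc1"]
  type        = "service"
  group "cache" {
    count = 1
    task "redis" {
      driver = "docker"
      config {
        image = "redis:3.2"
        port_map {
          db = 6379
        }
      }
      resources {
        cpu    = 500
        memory = 256
        network {
          mbps = 10
          port "db" {}
        }
      }
    }
  }
}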
We will want to change the count to a number higher than your Nomad client count. My sample cluster has three (3) client nodes, so I am going to set count = 5. This can be done with a text editor, but I will include a sed one-liner that will make the change.
$ sed -i.bak 's/count = 1/count = 5/g' example.nomad
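You can verify the change with a quick grep; the indentation shown here assumes the default file layout:
$ grep count example.nomad
    count = 5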
Plan the job with nomad plan.
$ nomad plan example.nomad
+ Job: "example"
+ Task Group: "cache" (5 create)
+ Task: "redis" (forces create)
Scheduler dry-run:
- All tasks successfully allocated.
Job Modify Index: 0
To submit the job with version verification run:
nomad run -check-index 0 example.nomad
When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
This plan indicates that 5 allocations will be scheduled when we run the job. Let’s run nomad run example.nomad to execute it in our cluster.
$ nomad run example.nomad
==> Monitoring evaluation "0b415d85"
Evaluation triggered by job "example"
Allocation "19640ba9" created: node "05129072", group "cache"
Allocation "2ea73fda" created: node "1dabfc7d", group "cache"
Allocation "3f8ae2ea" created: node "ab58ba15", group "cache"
Allocation "5782372b" created: node "ab58ba15", group "cache"
Allocation "69b40fa5" created: node "1dabfc7d", group "cache"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "0b415d85" finished with status "complete"
As the output indicates, we have 5 new allocations: two nodes, “ab58ba15” and “1dabfc7d”, each have two allocations of the job, while “05129072” has one.
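You can check which node each allocation landed on at any time with nomad status; the Allocations section of its output maps allocation IDs to node IDs. The table below is abbreviated, and the exact columns depend on your Nomad version.
$ nomad status example
...
Allocations
ID        Node ID   Task Group  Desired  Status
19640ba9  05129072  cache       run      running
2ea73fda  1dabfc7d  cache       run      running
3f8ae2ea  ab58ba15  cache       run      running
5782372b  ab58ba15  cache       run      running
69b40fa5  1dabfc7d  cache       run      running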
We can observe the anti-affinity rule’s impact on scoring when considering the allocations that are colocated on a single node. For the following commands we will discuss allocation “3f8ae2ea”.
Running the nomad alloc-status command with the -verbose flag will provide the scoring information in the Placement Metrics section. For example:
$ nomad alloc-status -verbose 3f8ae2ea
ID = 3f8ae2ea-8d62-f31d-1c91-b64e8ee92595
Eval ID = 0b415d85-741f-ea5b-e2d5-f44632b334a4
Name = example.cache[2]
Node ID = ab58ba15-6591-2b37-f9e8-4720bd07189a
Job ID = example
Client Status = running
Client Description = <none>
Desired Status = run
Desired Description = <none>
Created At = 08/01/17 10:48:16 EDT
Evaluated Nodes = 2
Filtered Nodes = 0
Exhausted Nodes = 0
Allocation Time = 21.649µs
Failures = 0
Task "redis" is "running"
Task Resources
CPU Memory Disk IOPS Addresses
3/500 MHz 988 KiB/256 MiB 300 MiB 0 db: 10.0.0.23:30265
Recent Events:
Time Type Description
08/01/17 10:48:17 EDT Started Task started by client
08/01/17 10:48:16 EDT Task Setup Building Task Directory
08/01/17 10:48:16 EDT Received Task received by client
Placement Metrics
* Score "ab58ba15-6591-2b37-f9e8-4720bd07189a.binpack" = 8.269191
* Score "1dabfc7d-a92f-00f2-1cb6-0be3f000e542.binpack" = 12.803662
* Score "1dabfc7d-a92f-00f2-1cb6-0be3f000e542.job-anti-affinity" = -20.000000
While scheduling this allocation, Nomad considered two client nodes: ab58ba15 and 1dabfc7d. The node’s full UUID is the first element of each dotted attribute name. Node 1dabfc7d’s score was reduced because it was already running a copy of the job. The anti-affinity modifier is -20 because the binpack scoring algorithm has a 0–18 scoring range; by subtracting 20, we can guarantee that any node already running the job will have a lower score than one that is not.
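Putting rough numbers to that for allocation “3f8ae2ea”, adding the metrics listed above for each node:
ab58ba15:  8.269191                =  8.269191  (not yet running the job)
1dabfc7d: 12.803662 + (-20.000000) = -7.196338  (already running one copy)
Even though 1dabfc7d had the better binpack score, the anti-affinity penalty drops its combined score well below ab58ba15’s, so the allocation is placed on ab58ba15.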
Comparing the Placement Metrics of all of the running allocations:
Allocation "19640ba9"
Placement Metrics
* Score "ab58ba15-6591-2b37-f9e8-4720bd07189a.binpack" = 12.803662
* Score "ab58ba15-6591-2b37-f9e8-4720bd07189a.job-anti-affinity" = -20.000000
* Score "05129072-6258-4ea6-79bf-03bd31418ac7.binpack" = 8.269191
Allocation "2ea73fda"
Placement Metrics
* Score "1dabfc7d-a92f-00f2-1cb6-0be3f000e542.binpack" = 8.269191
* Score "05129072-6258-4ea6-79bf-03bd31418ac7.binpack" = 8.269191
Allocation "3f8ae2ea"
Placement Metrics
* Score "ab58ba15-6591-2b37-f9e8-4720bd07189a.binpack" = 8.269191
* Score "1dabfc7d-a92f-00f2-1cb6-0be3f000e542.binpack" = 12.803662
* Score "1dabfc7d-a92f-00f2-1cb6-0be3f000e542.job-anti-affinity" = -20.000000
Allocation "5782372b"
Placement Metrics
* Resources exhausted on 1 nodes
* Dimension "memory exhausted" exhausted on 1 nodes
* Score "ab58ba15-6591-2b37-f9e8-4720bd07189a.binpack" = 12.803662
* Score "ab58ba15-6591-2b37-f9e8-4720bd07189a.job-anti-affinity" = -20.000000
* Score "05129072-6258-4ea6-79bf-03bd31418ac7.binpack" = 12.803662
* Score "05129072-6258-4ea6-79bf-03bd31418ac7.job-anti-affinity" = -20.000000
Allocation "69b40fa5"
Placement Metrics
* Score "1dabfc7d-a92f-00f2-1cb6-0be3f000e542.binpack" = 12.803662
* Score "1dabfc7d-a92f-00f2-1cb6-0be3f000e542.job-anti-affinity" = -20.000000
* Score "05129072-6258-4ea6-79bf-03bd31418ac7.binpack" = 12.803662
* Score "05129072-6258-4ea6-79bf-03bd31418ac7.job-anti-affinity" = -20.000000
In the case of allocation “5782372b”, Nomad also determined that one of the nodes no longer had sufficient free memory to be a viable target for this allocation and filtered the node out before scoring.
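The memory arithmetic behind that filtering, assuming the 761 MB of schedulable RAM from the sample environment and the job’s 256 MB memory reservation, looks like this:
761 MB - (2 x 256 MB already reserved) = 249 MB free
249 MB < 256 MB required, so the node is filtered on the memory dimension before scoring.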
Resource Exhaustion and the Scheduler
If resource exhaustion causes an allocation of the job to fail to be scheduled, this will be noted in the output of nomad run and nomad status for your job.
For example, if we attempt to run 7 copies of the example job in my sample cluster, there will not be enough uncommitted RAM to place a third copy of the job on any cluster node. The scheduler knows this and will wait to place that allocation until sufficient resources are available.
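To try this yourself, raise the count with the same sed pattern as before (or edit the file by hand) and resubmit the job; the tail of the run output will then report the placement failure:
$ sed -i.bak 's/count = 5/count = 7/g' example.nomad
$ nomad run example.nomad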
Evaluation "a23ca152" finished with status "complete" but failed to place all allocations:
Task Group "cache" (failed to place 1 allocation):
* Resources exhausted on 3 nodes
* Dimension "memory exhausted" exhausted on 3 nodes
Evaluation "f3ee745a" waiting for additional capacity to place remainder
or in the output of nomad status «job-id»:
$ nomad status example
...
Placement Failure
Task Group "cache":
* Resources exhausted on 3 nodes
* Dimension "memory exhausted" exhausted on 3 nodes
...
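When you are finished experimenting, you can stop the sample job to release its reserved resources:
$ nomad stop example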
Summary
During this guide, we:
- created a sample job using nomad init,
- discussed binpacking and job anti-affinity,
- examined how resource exhaustion is handled by Nomad’s scheduler, and
- ran commands to inspect the job state and scheduler scoring information.
Resources
- Nomad Scheduling
- BestFit v3 - Found in Scheduling a Large DataCenter. Slide 13.