Problem
After restarting a few nodes in your cluster, you notice that Nomad jobs are no longer being deployed.
This is the outward symptom of one of several possible underlying issues.
The relevant error message in the Nomad client logs is:
[ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters:
Cause
This error is usually caused by one of the following:
- Consul and Nomad are running on the same node and the Consul process is not running.
- The configured port is not open or listening.
- Consul is running on another node and connectivity to that node is broken.
- The TLS certificate on Consul or Nomad has expired.
- The Nomad node is unable to join the servers in the cluster.
- ACL settings on the nodes are mismatched.
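The first three causes in this list can often be told apart with a plain TCP connection test. The following is a minimal sketch, not an official tool: it assumes the default ports (Consul HTTP API on 8500; Nomad HTTP on 4646, RPC on 4647, and Serf on 4648), and the names port_is_open and check_ports are hypothetical helpers introduced here for illustration. Replace the ports and the address with the values from your own agent configuration.

```python
import socket

# Default ports; these are assumptions and must be adjusted if your
# agents are configured with non-default values.
DEFAULT_PORTS = {
    "consul-http": 8500,
    "nomad-http": 4646,
    "nomad-rpc": 4647,
    "nomad-serf": 4648,
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host, ports=DEFAULT_PORTS):
    """Probe each named port on host and report which ones accept connections."""
    return {name: port_is_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    # 127.0.0.1 assumes Consul and Nomad run on this node; use the
    # remote node's address when Consul runs elsewhere.
    for name, ok in check_ports("127.0.0.1").items():
        print(f"{name}: {'open' if ok else 'closed or unreachable'}")
```

A closed port on the local address suggests that the process is not running or is listening on a different port or interface. A closed port on a remote address can also indicate a firewall or routing problem.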
Solutions
Because this error has several possible causes, work through the following checks:
- Verify that the configured IP addresses are reachable and that the ports are open and listening.
- Verify that the TLS certificates on all nodes are valid and have not expired.
- Verify that the Nomad and Consul clusters are healthy (for example, with nomad server members and consul members).
- Verify that the ACL settings are consistent for both products.
- Restart the Nomad service.
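The TLS certificate check in the list above can be approximated with the Python standard library. This is a sketch under stated assumptions: the agent presents a certificate signed by a CA whose file you have locally, the agent does not require a client certificate for the handshake, and hostname checking is switched off because agent certificates are often issued for role-based names and not for the address being dialed. The function names days_until_expiry and peer_cert_days_left are hypothetical.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days from `now` to a certificate's notAfter string (negative if expired)."""
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def peer_cert_days_left(host, port, ca_file, timeout=5.0):
    """Handshake with host:port, trusting ca_file, and return days until expiry.

    Raises ssl.SSLCertVerificationError if the certificate has already
    expired or is not signed by the given CA.
    """
    ctx = ssl.create_default_context(cafile=ca_file)
    # Assumption: the certificate name does not match the dialed address,
    # so only the chain and validity period are verified here.
    ctx.check_hostname = False
    # If the agent requires a client certificate, load one first, e.g.:
    # ctx.load_cert_chain(certfile=..., keyfile=...)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock) as tls:
            return days_until_expiry(tls.getpeercert()["notAfter"])
```

For example, calling peer_cert_days_left with a node address, the agent's TLS-enabled port, and the path to your CA file returns the number of days remaining. A verification error that reports an expired certificate points to the TLS cause listed earlier.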
Outcome
The first sign that the fix worked is that Nomad jobs are deployed correctly again.
After restarting Nomad, check the nomad monitor logs to confirm that the error no longer appears.
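Checking the logs for the error can be scripted. The sketch below assumes that the log output has been saved to a text file and searches it for the substring quoted in the Problem section. The names ERROR_MARKER and discovery_errors are hypothetical, and the script takes the file path as its only argument.

```python
import sys

# Substring taken from the error message quoted in this article.
ERROR_MARKER = "client.consul: unable to query Consul datacenters"


def discovery_errors(lines):
    """Yield each log line that contains the Consul discovery error."""
    for line in lines:
        if ERROR_MARKER in line:
            yield line.rstrip("\n")


if __name__ == "__main__" and len(sys.argv) > 1:
    # The path to a saved log file is passed as the first argument.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        hits = list(discovery_errors(log_file))
    for hit in hits:
        print(hit)
    print(f"{len(hits)} matching line(s) found")
```

Zero matching lines in logs captured after the restart suggests that the client can query Consul again.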