Summary:
This article discusses a common issue that arises during an upgrade from Consul OSS to Consul Enterprise, where Nomad server nodes experience intermittent health check failures. The problem was traced to missing namespaces in Consul following the upgrade, and a temporary resolution was implemented by manually creating the default namespace and admin partition.
Overview
Upgrading Consul from OSS to Enterprise should, by default, create a default namespace and a default admin partition and attach all existing objects to them. However, in some cases this process does not complete correctly, leading to issues such as intermittent health check failures for Nomad server nodes. This article outlines a situation where such a failure occurs, explains step by step how it was temporarily resolved, and recommends a permanent solution to prevent future issues.
Key Points to Note About the Consul OSS to Enterprise Upgrade:
- Automatic Namespace and Partition Creation: After upgrading to Consul Enterprise, a default namespace and a default admin partition are created automatically, and all existing objects should be mapped to these defaults.
- Incomplete Upgrade Risks: If the upgrade process fails or is interrupted at any stage, Consul may not register these defaults correctly. The missing namespace and partition then break the health checks of integrated services such as Nomad: the Nomad servers intermittently fail to register their health checks because they attempt to map them to a namespace or partition that does not exist. A quick way to verify the defaults is shown below.
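A quick way to check whether the upgrade registered the defaults is to list the namespaces and admin partitions the cluster knows about. The following is a minimal sketch, assuming the Consul Enterprise CLI is available on a server node and a token with sufficient privileges is exported in CONSUL_HTTP_TOKEN:

```shell
# List namespaces; a correctly upgraded Enterprise cluster should include "default".
consul namespace list

# List admin partitions; a correctly upgraded Enterprise cluster should include "default".
consul partition list
```

If either list is empty or returns an error, the defaults were not registered and the symptoms described below are likely to appear.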
Error
- Log Examination: Scrutinize the server logs for errors on the RPC methods tied to configuration entries in Consul. The key log message to look for is:
  [TRACE] agent.server: rpc_server_call: method=ConfigEntry.Get errored=true request_type=read rpc_type=net/rpc leader=false allow_stale=true blocking=false target_datacenter=tbs locality=local
- Health Check Re-registration: Observe whether the Nomad server health checks are frequently re-registering themselves in the Consul catalog, which may indicate a deeper issue. One way to watch for both symptoms is sketched below.
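As a minimal sketch of that diagnosis, assuming the local agent's HTTP API is reachable and that the Nomad servers register under the default service name nomad (an assumption; use your actual service name):

```shell
# Stream trace-level logs from the local agent and filter for failing ConfigEntry RPCs.
consul monitor -log-level=trace | grep "method=ConfigEntry"

# Watch the health checks for the Nomad server service; frequent churn in the output
# suggests the checks are being deregistered and re-registered repeatedly.
consul watch -type=checks -service=nomad
```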
Root Cause
In this case, the root cause was an incomplete upgrade from Consul OSS to Enterprise. Specifically, the default namespace and the default admin partition, which should have been created automatically during the upgrade, were missing. As a result, Nomad could not map its health checks to a valid namespace and partition in Consul and failed to register them correctly.
Temporary Workaround
To immediately resolve the issue, manually create the following (a sketch of the commands follows this list):
- A default namespace.
- A default admin partition.
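A minimal sketch of the manual creation, assuming the Consul Enterprise CLI is available and a management token is exported in CONSUL_HTTP_TOKEN (the exact commands may differ slightly between Consul versions):

```shell
# Manually create the default namespace that the upgrade should have created.
consul namespace create -name default

# Manually create the default admin partition.
consul partition create -name default
```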
After these manual creations:
- Errors related to configuration entries in the logs should cease.
- Nomad health checks should stabilize.
- The newly created namespace and partition should be correctly reflected in Consul snapshots and the UI.
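To confirm the workaround took effect, the checks used during diagnosis can be re-run. A short sketch, again assuming the Nomad servers register under the service name nomad and that the agent listens on the default HTTP port 8500:

```shell
# The default namespace and admin partition should now be listed.
consul namespace list
consul partition list

# The Nomad server health checks should report passing and stop churning.
curl -s http://127.0.0.1:8500/v1/health/checks/nomad
```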
Recommended Permanent Solution
While manually creating namespaces and partitions is an effective short-term fix, it is not the ideal long-term solution. To ensure the integrity of the Consul cluster, follow these steps:
- Save a snapshot of the current cluster state, if a recent one does not already exist.
- Stop the Consul cluster.
- Clear the data directory on each Consul server.
- Restart the Consul cluster.
- Restore the saved snapshot.
This process ensures a clean upgrade and helps prevent issues that may arise from incomplete upgrades.
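As a sketch of that procedure on a systemd-managed cluster, assuming the data directory is /opt/consul/data and the snapshot file is named backup.snap (the paths and service name are assumptions; adjust them to your environment):

```shell
# 1. Save a snapshot of the current state (skip if a good snapshot already exists).
consul snapshot save backup.snap

# 2. Stop the Consul service (repeat on every server node).
sudo systemctl stop consul

# 3. Clear the data directory on each server.
sudo rm -rf /opt/consul/data/*

# 4. Restart the Consul service and wait for the cluster to elect a leader.
sudo systemctl start consul

# 5. Restore the saved snapshot once the cluster is healthy.
consul snapshot restore backup.snap
```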
Conclusion
By following the above steps, users can resolve health check failures in Nomad server nodes caused by an incomplete Consul upgrade. Ensuring a clean and complete upgrade process is crucial to maintaining cluster stability and preventing future issues.