Summary:
This article discusses a common issue that arises during an upgrade from Consul OSS to Consul Enterprise, where Nomad server nodes experience intermittent health check failures. The problem was traced to missing namespaces in Consul following the upgrade, and a temporary resolution was implemented by manually creating the default namespace and admin partition.
Overview
Upgrading Consul from OSS to Enterprise should, by default, create a default namespace and a default admin partition and attach all existing objects to them. However, in some cases this process does not complete correctly, leading to issues such as intermittent health check failures for Nomad server nodes. This article outlines a situation where such a failure occurs, explains step by step how it was temporarily resolved, and recommends a permanent solution to prevent future issues.
Key Points to Note About the Consul OSS to Enterprise Upgrade:
- Automatic Namespace and Partition Creation: After upgrading to Consul Enterprise, a default namespace and a default admin partition are created automatically, and all existing objects should be mapped to these defaults.
- Incomplete Upgrade Risks: If the upgrade process fails or is interrupted at any stage, Consul may not register these defaults correctly. The missing namespace and partition then break the health checks of integrated services such as Nomad: the Nomad servers intermittently fail to register their health checks because they attempt to map them to a namespace or partition that does not exist. A quick way to verify the defaults is shown below.
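A quick way to check whether the upgrade registered the defaults is to list the namespaces and admin partitions the cluster knows about. The following is a minimal sketch, assuming the Consul Enterprise CLI is available on a server node and a token with sufficient privileges is exported in CONSUL_HTTP_TOKEN:

```shell
# List namespaces; a correctly upgraded Enterprise cluster should include "default".
consul namespace list

# List admin partitions; a correctly upgraded Enterprise cluster should include "default".
consul partition list
```

If either list is empty or returns an error, the defaults were not registered and the symptoms described below are likely to appear.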
Error
- Log Examination: Scrutinize the server logs for errors on the RPC methods tied to configuration entries in Consul. The key log message to look for is:
  [TRACE] agent.server: rpc_server_call: method=ConfigEntry.Get errored=true request_type=read rpc_type=net/rpc leader=false allow_stale=true blocking=false target_datacenter=tbs locality=local
- Health Check Re-registration: Observe whether the Nomad server health checks are frequently re-registering themselves in the Consul catalog, which may indicate a deeper issue. One way to watch for both symptoms is sketched below.
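As a minimal sketch of that diagnosis, assuming the local agent's HTTP API is reachable and that the Nomad servers register under the default service name nomad (an assumption; use your actual service name):

```shell
# Stream trace-level logs from the local agent and filter for failing ConfigEntry RPCs.
consul monitor -log-level=trace | grep "method=ConfigEntry"

# Watch the health checks for the Nomad server service; frequent churn in the output
# suggests the checks are being deregistered and re-registered repeatedly.
consul watch -type=checks -service=nomad
```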
Root Cause
In this case, the root cause was an incomplete upgrade from Consul OSS to Enterprise. Specifically, the default namespace and the default admin partition, which should have been created automatically during the upgrade, were missing. As a result, Nomad could not map its health checks to a valid namespace and partition in Consul and failed to register them correctly.
Temporary Workaround
To immediately resolve the issue, manually create the following (a sketch of the commands follows this list):
- A default namespace.
- A default admin partition.
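A minimal sketch of the manual creation, assuming the Consul Enterprise CLI is available and a management token is exported in CONSUL_HTTP_TOKEN (the exact commands may differ slightly between Consul versions):

```shell
# Manually create the default namespace that the upgrade should have created.
consul namespace create -name default

# Manually create the default admin partition.
consul partition create -name default
```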
After these manual creations:
- Errors related to configuration entries in the logs should cease.
- Nomad health checks should stabilize.
- The newly created namespace and partition should be correctly reflected in Consul snapshots and the UI.
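To confirm the workaround took effect, the checks used during diagnosis can be re-run. A short sketch, again assuming the Nomad servers register under the service name nomad and that the agent listens on the default HTTP port 8500:

```shell
# The default namespace and admin partition should now be listed.
consul namespace list
consul partition list

# The Nomad server health checks should report passing and stop churning.
curl -s http://127.0.0.1:8500/v1/health/checks/nomad
```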
Recommended Permanent Solution
While manually creating namespaces and partitions is an effective short-term fix, it is not the ideal long-term solution. To ensure the integrity of the Consul cluster, follow these steps:
- Save a snapshot of the current cluster state, if a recent one does not already exist.
- Stop the Consul cluster.
- Clear the data directory on each Consul server.
- Restart the Consul cluster.
- Restore the saved snapshot.
This process ensures a clean upgrade and helps prevent issues that may arise from incomplete upgrades.
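As a sketch of that procedure on a systemd-managed cluster, assuming the data directory is /opt/consul/data and the snapshot file is named backup.snap (the paths and service name are assumptions; adjust them to your environment):

```shell
# 1. Save a snapshot of the current state (skip if a good snapshot already exists).
consul snapshot save backup.snap

# 2. Stop the Consul service (repeat on every server node).
sudo systemctl stop consul

# 3. Clear the data directory on each server.
sudo rm -rf /opt/consul/data/*

# 4. Restart the Consul service and wait for the cluster to elect a leader.
sudo systemctl start consul

# 5. Restore the saved snapshot once the cluster is healthy.
consul snapshot restore backup.snap
```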
Conclusion
By following the above steps, users can resolve health check failures in Nomad server nodes caused by an incomplete Consul upgrade. Ensuring a clean and complete upgrade process is crucial to maintaining cluster stability and preventing future issues.