Introduction
This article addresses a federation issue in Nomad Enterprise that occurs after a staggered upgrade of a multi-region cluster. Specifically, it details the symptoms and resolution of ACL replication failures and UI errors that appear after upgrading the authoritative region (e.g., from v1.7.x to v1.8.x) while other federated regions remain on an older version.
Problem
After upgrading the Nomad servers in the authoritative region, the other, non-upgraded federated regions become impaired. The primary symptoms include:
- Federated clusters are unable to retrieve expected ACL objects (policies, tokens) from the authoritative region.
- When a user attempts to switch between regions in the Nomad UI, the interface returns a 500 error (reproducible outside the UI; see the sketch after this list) with the message: The backend responded with an error. rpc error: rpc: can't find method Job.Statuses
- Nomad server logs may contain the following error: [ERROR] nomad.fsm: DeregisterVaultAccessor failed: error="accessor delete failed: not found"
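For troubleshooting outside the browser, the 500 response can usually be reproduced directly against the HTTP API. The following is an illustrative sketch only: NOMAD_ADDR and NOMAD_TOKEN are assumed to point at a v1.8.x server in the authoritative region with a valid ACL token, and <older-region-name> is a placeholder for a region still on v1.7.x.

```shell
# Ask an upgraded (v1.8.x) server to forward the request to a region still on
# v1.7.x. The older region has no Job.Statuses RPC, so the forwarded call is
# expected to come back as the same 500 / "can't find method Job.Statuses"
# error the UI reports.
curl -s -i \
  -H "X-Nomad-Token: ${NOMAD_TOKEN}" \
  "${NOMAD_ADDR}/v1/jobs/statuses?region=<older-region-name>"
```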
Prerequisites
This issue occurs under the following conditions:
- A multi-region federated Nomad Enterprise environment is in use.
- A partial upgrade has been performed, resulting in a version mismatch between Nomad servers across different regions. For example, the authoritative region is running Nomad v1.8.15+ent while other federated regions are still on v1.7.10+ent.
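A quick way to confirm you are in this scenario is to enumerate the federated regions and compare the build reported by the servers in each one. This is a sketch only; it assumes NOMAD_ADDR and NOMAD_TOKEN are set for a reachable server.

```shell
# List the regions that are federated together.
curl -s -H "X-Nomad-Token: ${NOMAD_TOKEN}" "${NOMAD_ADDR}/v1/regions"

# On a server node in each region, check the locally installed build.
nomad version
```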
Cause
The root cause is a version incompatibility related to a new API endpoint introduced in Nomad v1.8.0.
- The /v1/jobs/statuses API endpoint and its corresponding internal RPC (Job.Statuses) were added in Nomad v1.8.0 to enhance the UI's job index page.
- When a user accesses the UI served by a v1.8.x server, it attempts to call the Job.Statuses RPC on all federated peers to gather information.
- Servers running an older version (v1.7.x) do not have this RPC method. When they receive the request, they correctly report that the method cannot be found.
- This failure disrupts the UI and can interfere with other federated operations, such as ACL replication, that rely on stable cross-region communication.
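To see the cause in isolation, the new endpoint can be requested directly from a server in a not-yet-upgraded region. This is a hedged sketch: the server address and OLD_REGION_TOKEN are placeholders, and the exact failure code may vary, but on v1.7.x the route is simply not registered.

```shell
# Hitting /v1/jobs/statuses on a v1.7.x server directly: the route (and the
# Job.Statuses RPC behind it) does not exist in that version, so the request
# fails rather than returning job status data.
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "X-Nomad-Token: ${OLD_REGION_TOKEN}" \
  "https://<server-in-older-region>:4646/v1/jobs/statuses"
```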
Solution
To resolve the issue and restore stable federation, all Nomad servers across all federated regions must be upgraded to the same version.
- Verify the Nomad version on all server nodes in every federated region to confirm the mismatch (see the verification sketch after this list).
- Proceed with the upgrade plan for any regions still running the older version.
- Ensure all server nodes within the federation are running the identical version of Nomad Enterprise (e.g., v1.8.15+ent).
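A convenient way to perform the first step is the server members view, which spans the whole federation because all federated servers share the same gossip pool. A minimal sketch, assuming NOMAD_ADDR and NOMAD_TOKEN point at any reachable server:

```shell
# Every server in every federated region appears here; compare the Build column.
# It should show the same version (e.g. 1.8.15+ent) on all rows once the
# upgrade is complete.
nomad server members

# The same information is available over the HTTP API if CLI access is not handy.
curl -s -H "X-Nomad-Token: ${NOMAD_TOKEN}" "${NOMAD_ADDR}/v1/agent/members"
```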
Outcome
Once all Nomad servers in the federation are running the same version, Job.Statuses RPC calls will succeed between all peers. This will resolve the 500 error in the UI, and critical federation functions, including ACL replication and cross-region visibility, will be restored.
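As a post-upgrade check, the same request that previously failed can be repeated; once every region is on the same build it should return a success status. As before, the variables and region name are placeholders for your environment.

```shell
# Expect a 200 now that the target region understands the Job.Statuses RPC,
# instead of the earlier 500 / "can't find method" error.
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "X-Nomad-Token: ${NOMAD_TOKEN}" \
  "${NOMAD_ADDR}/v1/jobs/statuses?region=<previously-older-region>"
```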
Additional Information
- Trigger: This behavior is most consistently triggered by user activity in the Nomad UI, specifically when viewing the Jobs page and switching between regions.
- Upgrade Process Consideration: While Nomad supports rolling upgrades, this scenario highlights that new RPCs required for UI functionality can create temporary incompatibilities during a staggered multi-region upgrade. This can challenge upgrade strategies where different production clusters are upgraded at different times.
- Temporary Workaround: If completing the upgrade across all regions is not immediately possible, users can be instructed to bypass the federated UI. They can manage each cluster by logging in directly to its region-specific URL with an appropriate ACL token retrieved from the corresponding Vault namespace. In this mode, federation will remain broken, but direct cluster management is possible.
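A minimal sketch of this workaround, assuming direct network access to each region's servers; the address and token values below are placeholders for your environment.

```shell
# Point the CLI (and any API calls) straight at one region's servers instead of
# going through the federated UI, so nothing depends on cross-region forwarding.
export NOMAD_ADDR="https://nomad.<region-name>.example.internal:4646"
export NOMAD_TOKEN="<acl-token-for-that-region>"

# Normal, single-region management continues to work.
nomad job status
nomad node status
```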