Overview
When using HashiCorp Nomad in a multi-region cluster, you may encounter a scenario where deployments become stuck in a blocked state and never complete. This issue is most commonly observed when using the canary parameter in the job’s update stanza. This article explains the conditions under which this occurs, how to identify the problem, recommended workarounds, and the current status of a fix.
Affected Versions
Nomad 1.8.14+ent (other versions, both earlier and later than 1.8.x, may also be affected)
Multi-region clusters with jobs using the canary parameter
Symptoms
Deployments to multiple regions (e.g., india and us) become stuck.
Both (or all) regions show a blocked deployment status.
No region acts as the "unblocker" to complete the deployment.
The issue is intermittent and most likely to occur when a canary is used.
Ref. multiregion block in the job specification | Nomad | HashiCorp Developer
Multi-region federation operational considerations | Nomad | HashiCorp Developer
Federate multi-region clusters | Nomad | HashiCorp Developer
Example:
Latest Deployment
ID          = 916be283
Status      = blocked
Description = Deployment is complete but waiting for peer region
Multiregion Deployment
Region  ID        Status
india   916be283  blocked
us      20e5d31e  blocked
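The symptom shown above (every region blocked at once) can be checked mechanically. The following is a minimal sketch, assuming the status text contains a "Multiregion Deployment" table with whitespace-separated Region, ID, and Status columns as in the example. The names parse_multiregion_table and all_regions_blocked are hypothetical helpers, not part of Nomad.

```python
# Hypothetical helpers for spotting the "all regions blocked" symptom.
# Assumes the table layout shown in the example output above.

EXAMPLE_OUTPUT = """\
Latest Deployment
ID          = 916be283
Status      = blocked
Description = Deployment is complete but waiting for peer region
Multiregion Deployment
Region  ID        Status
india   916be283  blocked
us      20e5d31e  blocked
"""


def parse_multiregion_table(status_text):
    """Return {region: (deployment_id, status)} from the multiregion table."""
    rows = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields == ["Multiregion", "Deployment"]:
            in_table = True
            continue
        if not in_table or fields == ["Region", "ID", "Status"]:
            continue
        if len(fields) != 3:
            # A blank or differently shaped line ends the table.
            if rows:
                break
            continue
        region, deployment_id, status = fields
        rows[region] = (deployment_id, status)
    return rows


def all_regions_blocked(rows):
    """True only when at least one region is listed and every one is blocked."""
    return bool(rows) and all(status == "blocked" for _, status in rows.values())


if __name__ == "__main__":
    table = parse_multiregion_table(EXAMPLE_OUTPUT)
    print(table)
    print("all blocked:", all_regions_blocked(table))
```

A single blocked region is expected while it waits for its peers. Only the case where every region is blocked at the same time matches the symptom described in this article.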
Background & Workflow
How Multi-Region Deployments Work
When a multi-region job is submitted, each region begins in a pending state.
Up to max_parallel regions (default: all) deploy concurrently.
When a region completes its deployment, it enters a blocked state, waiting for the last region (as defined in the job spec) to finish.
The last region in the multiregion block acts as the "unblocker" and should trigger completion for all regions.
Note:
The max_parallel setting means different things in different blocks. In the strategy block it defines how many regions start their deployments concurrently, while in the update block it defines how many allocations are updated at a time.
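The region-level rules described in this section can be illustrated with a small model. This is a simplified sketch, not Nomad's implementation: it assumes regions are started in the order they are listed, and that an unset or zero max_parallel means "all regions". The function names are hypothetical.

```python
# Simplified model of the region-level workflow described above.
# Assumptions (not taken from Nomad's source): regions start in listed
# order, and max_parallel of None or 0 means "start every region".


def split_by_max_parallel(regions, max_parallel=None):
    """Return (started, still_pending) for the initial round of deployments."""
    if not max_parallel or max_parallel <= 0:
        return list(regions), []
    return list(regions[:max_parallel]), list(regions[max_parallel:])


def unblocker_region(regions):
    """The last region listed in the multiregion block acts as the unblocker."""
    return regions[-1]


if __name__ == "__main__":
    regions = ["india", "us"]
    print(split_by_max_parallel(regions))     # no strategy block
    print(split_by_max_parallel(regions, 1))  # strategy { max_parallel = 1 }
    print(unblocker_region(regions))
```

With the region order used in this article, us is the region expected to unblock the others, which is why the order of the region blocks matters.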
The Issue with Canaries
When a canary is specified in the update stanza, a race condition or logic bug can cause all regions to become blocked, with no region acting as the unblocker.
This does not occur when the canary parameter is omitted; deployments complete as expected.
Example Job Spec
job "example" {
  update {
    max_parallel     = 1
    min_healthy_time = "3s"
    healthy_deadline = "2m"
    canary           = 1
    auto_promote     = true
  }
  multiregion {
    region "india" {
      count = 3
    }
    region "us" {
      count = 1
    }
  }
  group "cache" {
    # ... task definitions ...
  }
}
Ref. multiregion block in the job specification | Nomad | HashiCorp Developer
Configure multi-region deployments | Nomad | HashiCorp Developer
Workarounds
Avoid Using Canaries in Multi-Region Jobs
If possible, do not use the canary parameter in multi-region deployments until this issue is resolved.
Use a Strategy Block to Serialize Deployments
Adding a strategy block with max_parallel = 1 in the multiregion stanza can reduce the likelihood of hitting this bug, but does not guarantee avoidance if canaries are used.
multiregion {
  strategy {
    max_parallel = 1
    on_failure   = "fail_all"
  }
  region "india" { count = 3 }
  region "us" { count = 1 }
}
Manual Unblock (Not Recommended for Automation)
If a deployment is stuck, you can manually promote or unblock the deployment using the Nomad CLI:
nomad deployment promote -region=<region> <deployment_id>
nomad deployment unblock -region=<region> <deployment_id>
https://developer.hashicorp.com/nomad/commands/deployment/promote
https://developer.hashicorp.com/nomad/commands/deployment/unblock
Note: This is a manual workaround and not suitable for automated workflows.
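Because each region has its own deployment ID, it can help to generate the unblock commands for an operator to review before running them by hand. The sketch below only builds and prints command lines and does not execute anything. The helper name build_unblock_commands is hypothetical, and the region names and IDs are the ones from the example output in this article.

```python
# Dry-run helper: builds the unblock command for each blocked region so an
# operator can review it first. Nothing is executed here.
import shlex


def build_unblock_commands(blocked):
    """blocked maps region name -> deployment ID in that region."""
    return [
        ["nomad", "deployment", "unblock", f"-region={region}", deployment_id]
        for region, deployment_id in blocked.items()
    ]


if __name__ == "__main__":
    # Region names and IDs taken from the example output in this article.
    for argv in build_unblock_commands({"india": "916be283", "us": "20e5d31e"}):
        print(shlex.join(argv))
```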
Recommendations
Design jobspecs carefully:
Be aware that the order of regions in the multiregion block is operationally significant. The last region acts as the unblocker.
Monitor deployments:
Watch for deployments stuck in a blocked state, especially after updates with canaries.
Plan for manual intervention:
If you must use canaries, be prepared to manually unblock deployments if necessary.
Status
This is a known bug in Nomad’s multi-region deployment logic when using canaries. HashiCorp is aware of the issue, and a fix is planned for a future release. Please monitor the Nomad Changelog and GitHub Issues for updates.
Summary
If you are using Nomad multi-region deployments with canaries, you may encounter deployments stuck in a blocked state due to a known bug. Avoid using canaries in multi-region jobs or use serial deployment strategies as a temporary workaround. A fix is planned for a future Nomad release.
References
multiregion block in the job specification | Nomad | HashiCorp Developer
Create and run multi-region deployments | Nomad | HashiCorp Developer