Overview
When using HashiCorp Nomad in a multi-region cluster, you may encounter a scenario where deployments become stuck in a blocked state and never complete. This issue is most commonly observed when using the canary parameter in the job’s update stanza. This article explains the conditions under which this occurs, how to identify the problem, recommended workarounds, and the current status of a fix.
Affected Versions
Nomad 1.8.14+ent (earlier and later versions may also be affected)
Multi-region clusters with jobs using the canary parameter
Symptoms
Deployments to multiple regions (e.g., india and us) become stuck.
Both (or all) regions show a blocked deployment status.
No region acts as the "unblocker" to complete the deployment.
The issue is intermittent and most likely to occur when a canary is used.
Ref. multiregion block in the job specification | Nomad | HashiCorp Developer
Multi-region federation operational considerations | Nomad | HashiCorp Developer
Federate multi-region clusters | Nomad | HashiCorp Developer
Example:
Latest Deployment
ID          = 916be283
Status      = blocked
Description = Deployment is complete but waiting for peer region
Multiregion Deployment
Region  ID        Status
india   916be283  blocked
us      20e5d31e  blocked
Background & Workflow
How Multi-Region Deployments Work
When a multi-region job is submitted, each region begins in a pending state.
Up to max_parallel regions (default: all) deploy concurrently.
When a region completes its deployment, it enters a blocked state, waiting for the last region (as defined in the job spec) to finish.
The last region in the multiregion block acts as the "unblocker" and should trigger completion for all regions.
Note: The max_parallel setting means different things depending on where it appears. In the strategy block it defines how many regions start their deployments concurrently, while in the update block it defines how many allocations are updated at a time.
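The two meanings of max_parallel can be seen side by side in a single job spec. The fragment below is a minimal sketch that reuses the job and region names from this article; the values are illustrative assumptions, not recommendations.

```hcl
job "example" {
  multiregion {
    strategy {
      # Region-level: at most one region runs its deployment at a time.
      max_parallel = 1
    }

    region "india" {
      count = 3
    }

    region "us" {
      count = 1
    }
  }

  update {
    # Allocation-level: within a region, update two allocations at a time.
    max_parallel = 2
  }

  group "cache" {
    # ... task definitions ...
  }
}
```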
The Issue with Canaries
When a canary is specified in the update stanza, a race condition or logic bug can cause all regions to become blocked, with no region acting as the unblocker. This does not occur when the canary parameter is omitted; deployments complete as expected.
Example Job Spec
job "example" {
  update {
    max_parallel     = 1
    min_healthy_time = "3s"
    healthy_deadline = "2m"
    canary           = 1
    auto_promote     = true
  }
  multiregion {
    region "india" {
      count = 3
    }
    region "us" {
      count = 1
    }
  }
  group "cache" {
    # ... task definitions ...
  }
}
Ref. multiregion block in the job specification | Nomad | HashiCorp Developer
Configure multi-region deployments | Nomad | HashiCorp Developer
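For comparison, the update block from the example job spec might look as follows with canaries removed. This is a sketch of the variant the article describes as unaffected, not a tested configuration. auto_promote is dropped as well, on the assumption that it is only meaningful when canaries are in use.

```hcl
update {
  max_parallel     = 1
  min_healthy_time = "3s"
  healthy_deadline = "2m"
  # canary and auto_promote are intentionally omitted. canary defaults to 0,
  # so allocations are replaced by a plain rolling update with no canary phase.
}
```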
Workarounds
Avoid Using Canaries in Multi-Region Jobs
If possible, do not use the canary parameter in multi-region deployments until this issue is resolved.
Use a Strategy Block to Serialize Deployments
Adding a strategy block with max_parallel = 1 in the multiregion stanza can reduce the likelihood of hitting this bug, but does not guarantee avoidance if canaries are used.
multiregion {
  strategy {
    max_parallel = 1
    on_failure   = "fail_all"
  }
  region "india" { count = 3 }
  region "us" { count = 1 }
}
Manual Unblock (Not Recommended for Automation)
If a deployment is stuck, you can manually promote or unblock the deployment using the Nomad CLI:
nomad deployment promote -region=<region> <deployment_id>
nomad deployment unblock -region=<region> <deployment_id>
https://developer.hashicorp.com/nomad/commands/deployment/promote
https://developer.hashicorp.com/nomad/commands/deployment/unblock
Note: This is a manual workaround and not suitable for automated workflows.
Recommendations
Design jobspecs carefully: Be aware that the order of regions in the multiregion block is operationally significant. The last region acts as the unblocker.
Monitor deployments: Watch for deployments stuck in a blocked state, especially after updates with canaries.
Plan for manual intervention: If you must use canaries, be prepared to manually unblock deployments if necessary.
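To make the ordering point concrete, the fragment below annotates a multiregion block with the role each region is expected to play. It is only a sketch that reuses the region names and counts from this article; swapping the two region blocks would make india the unblocker instead.

```hcl
multiregion {
  # Enters the blocked state once its own deployment completes.
  region "india" {
    count = 3
  }

  # Listed last: expected to act as the unblocker for all regions.
  region "us" {
    count = 1
  }
}
```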
Status
This is a known bug in Nomad’s multi-region deployment logic when using canaries. HashiCorp is aware of the issue, and a fix is planned for a future release. Please monitor the Nomad Changelog and GitHub Issues for updates.
Summary
If you are using Nomad multi-region deployments with canaries, you may encounter deployments stuck in a blocked state due to a known bug. Avoid using canaries in multi-region jobs or use serial deployment strategies as a temporary workaround. A fix is planned for a future Nomad release.
References
multiregion block in the job specification | Nomad | HashiCorp Developer
Create and run multi-region deployments | Nomad | HashiCorp Developer