To perform an upgrade in Nomad, follow these steps:
- Replace Binary Files: Update the Nomad binary on each node.
- Perform a Rolling Restart: Restart the nodes in the following order:
  - Follower nodes
  - Leader node
  - Client nodes
  This order minimizes disruptions and ensures continuity during the upgrade. For detailed guidance on the upgrade process, refer to the official documentation: Nomad Upgrade Process.
- Check for Breaking Changes: Review version-specific updates and breaking changes here: Nomad Releases.
- Follow the Upgrade Instructions: Follow the detailed steps for each version: Upgrade Instructions | Nomad.
- Avoid Large Version Jumps: Use the recommended upgrade path to prevent issues associated with skipping versions. Guidance on upgrading specific versions is available here: Upgrading Specific Versions | Nomad.

Upgrading in stages (servers first, then clients) reduces the risk of downtime and minimizes disruptions. Be sure to follow the recommended upgrade path and validate the cluster's functionality after each upgrade phase.
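The restart order above can be sketched as a small script. This is only an illustration, not an official procedure: the hostnames, the systemd service name, and the `NOMAD_APPLY` guard are all assumptions. With `NOMAD_APPLY` unset, the script only prints the commands it would run.

```shell
# Dry-run sketch of the rolling-restart order: followers, then the
# leader, then clients. Set NOMAD_APPLY=1 to actually execute.
run() {
  if [ "${NOMAD_APPLY:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi
}

rolling_restart() {
  for s in server-2 server-3; do             # placeholder follower hostnames
    run ssh "$s" systemctl restart nomad     # restart each follower first
  done
  run ssh server-1 systemctl restart nomad   # leader last among the servers
  for c in client-1 client-2; do             # placeholder client hostnames
    run ssh "$c" systemctl restart nomad     # clients after all servers
  done
}

rolling_restart
```

Use `nomad server members` to identify which server currently holds leadership before starting.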
While Nomad aims to maintain backward compatibility for at least two point releases, following the recommended step-by-step upgrade path is highly advised. Here’s why:
- Compatibility Risks from Version Jumps: Skipping multiple versions can lead to incompatibility with internal state, APIs, or features that changed in between. Since Nomad’s backward compatibility only covers recent releases, large version jumps increase the risk of operational issues.
- Managing Breaking Changes: New releases may introduce breaking changes, deprecations, or required updates that need incremental handling. Following the upgrade path ensures these changes are managed properly, reducing the likelihood of unexpected disruptions.
For version-specific updates and breaking changes, refer to the release notes: Nomad Releases.
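The "no large version jumps" rule can be illustrated with a small helper that lists the intermediate minor releases between two versions. This is purely illustrative; the actual required stops come from the Upgrading Specific Versions guide, not from arithmetic on version numbers.

```shell
# List the intermediate minor releases between a current and target
# version, e.g. 1.5 -> 1.8 passes through 1.6 and 1.7. Illustrative
# only; consult the official upgrade guide for mandatory stops.
upgrade_path() {
  major=${1%%.*}
  cur=${1#*.};  cur=${cur%%.*}    # current minor version
  tgt=${2#*.};  tgt=${tgt%%.*}    # target minor version
  m=$((cur + 1))
  while [ "$m" -le "$tgt" ]; do
    echo "$major.$m"
    m=$((m + 1))
  done
}

upgrade_path 1.5 1.8   # prints 1.6, 1.7, 1.8 one per line
```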
Q: What steps should I take if I need to downgrade my Nomad cluster? Is it safe to downgrade both clients and servers?
A: Nomad Downgrade Limitations
Nomad does not currently support downgrades. If a downgrade is necessary:
- Clients: You must drain allocations and remove the data directory.
- Servers: A safe downgrade requires re-provisioning the entire cluster.
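The client-side steps above can be sketched as a dry-run script. The data directory path, binary locations, node ID, and the `NOMAD_APPLY` guard are all illustrative assumptions; with `NOMAD_APPLY` unset it only prints what it would do.

```shell
# Dry-run sketch of a client downgrade: drain, stop, wipe the data
# directory, install the older binary, restart. Paths and the node ID
# are placeholders.
run() { if [ "${NOMAD_APPLY:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi; }

downgrade_client() {
  node_id=$1
  run nomad node drain -enable -yes "$node_id"     # migrate allocations away
  run systemctl stop nomad
  run rm -rf /opt/nomad/data                       # placeholder data_dir
  run install /tmp/nomad-old /usr/local/bin/nomad  # placeholder old binary
  run systemctl start nomad
  run nomad node drain -disable -yes "$node_id"    # mark eligible again
}

downgrade_client 9d7f1e2a   # placeholder node ID
```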
For more information, refer to the official documentation.
Pre-Upgrade Best Practices
- Review Release Notes: Always check the release notes for the new Nomad version to identify any known issues, limitations, or breaking changes. This helps you anticipate potential problems and plan for mitigation.
- Test in a Staging Environment: Before upgrading in production, test the new version in a non-production or staging environment. This ensures compatibility with your current configurations and workloads, reducing the risk of unexpected issues in production.
For details on breaking changes and version-specific updates, refer to: Nomad Releases.
Q: How can I minimize upgrade risks?
A: To reduce instability risks, HashiCorp recommends upgrading Nomad in a staging or non-production environment before deploying to production. This lets you validate that all workloads function properly. Patch releases, which mainly deliver bug fixes and security updates, typically carry lower risk than major or minor version upgrades.
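A pre-production check can be sketched as below. The staging address and job file name are placeholders, and with `NOMAD_APPLY` unset the script only prints what it would run.

```shell
# Dry-run sketch: validate and plan a job against a staging cluster
# before touching production. Address and file names are placeholders.
run() { if [ "${NOMAD_APPLY:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi; }

staging_check() {
  export NOMAD_ADDR=https://staging.example.internal:4646  # placeholder
  run nomad job validate example.nomad.hcl  # syntax and semantic checks
  run nomad job plan example.nomad.hcl      # dry-run placement diff
  run nomad job status                      # confirm workloads stay healthy
}

staging_check
```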
Q: How does Nomad integrate with Consul, and what is the version compatibility between various Nomad and Consul releases during an upgrade?
Troubleshooting Nomad Upgrade Issues
Upgrading Nomad can sometimes lead to unexpected issues. Below are common problems encountered during the upgrade process, along with their respective error messages and suggested workarounds.
| Issue | Error Summary |
| --- | --- |
| 1. Nil Pointer Error During Upgrade | Panic: invalid memory address or nil pointer dereference |
| 2. Reserved Port Collision Error | Task group failed to place due to network port collision (http8=80 exhausted) |
| 3. CPU and Memory Allocations Show as Zero | CPU and memory allocations reported as zero after upgrading from Nomad 1.0.4 to 1.7.3 |
| 4. Task Group Placement Failures | Task group failed to place allocations due to topology requirements not met (3 nodes excluded by filter) |
1. Nil Pointer Error During Upgrade
Error:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1c8c1b7]
goroutine 9215 [running]:
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*DriverHandle).Exec(0x0, 0xdf8475800, {0xc001545240, 0x1d}, {0xc001553f50, 0x1, 0xc0048a8480?})
github.com/hashicorp/nomad/client/allocrunner/taskrunner/driver_handle.go:70 +0xf7
github.com/hashicorp/nomad/client/allocrunner/taskrunner/template.(*TaskTemplateManager).processScript(0xc002a65950, 0xc0013eb300, 0xc0026dca20?)
github.com/hashicorp/nomad/client/allocrunner/taskrunner/template/template.go:597 +0x104
created by github.com/hashicorp/nomad/client/allocrunner/taskrunner/template.(*TaskTemplateManager).handleChangeModeScript in goroutine 3975
github.com/hashicorp/nomad/client/allocrunner/taskrunner/template/template.go:563 +0x47
Workaround: Refer to GitHub issue #24051 for details. The fix is included in the upcoming Nomad 1.9.0 release, with back-porting to Nomad Enterprise following shortly after.
2. Reserved Port Collision Error
After upgrading one of the cluster nodes to agent version 1.3.1, the Nomad job fails with the following error:
$ nomad job run -verbose -check-index 2318283 job.nomad
==> 2022-07-04T16:28:06+05:30: Monitoring evaluation "59c7211d-6159-a4e9-1e5a-7b2cb16dbc73"
    2022-07-04T16:28:06+05:30: Evaluation triggered by job "iptables-test"
    2022-07-04T16:28:06+05:30: Evaluation within deployment: "268048f3-40f4-2698-e76e-b5d363a412b3"
    2022-07-04T16:28:06+05:30: Evaluation status changed: "pending" -> "complete"
==> 2022-07-04T16:28:06+05:30: Evaluation "59c7211d-6159-a4e9-1e5a-7b2cb16dbc73" finished with status "complete" but failed to place all allocations:
    2022-07-04T16:28:06+05:30: Task Group "iptables-test" (failed to place 1 allocation):
      * Constraint "${attr.unique.hostname} == eu-pl-02": 1 nodes excluded by filter
      * Resources exhausted on 1 nodes
      * Dimension "network: reserved port collision http8=80" exhausted on 1 nodes
    2022-07-04T16:28:06+05:30: Evaluation "34fb2838-ec48-d4ed-8b50-1449f46b776f" waiting for additional capacity to place remainder
Workaround: The job is attempting to reserve port 80, which may already be in use. Consider using dynamic ports instead. For more information, refer to the Dynamic Ports Documentation.
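A job can avoid reserving a fixed host port by requesting a dynamic one instead. The sketch below shows the difference in a group network block; the group and port label names are illustrative, not taken from the failing job.

```hcl
group "iptables-test" {
  network {
    # Reserved (static) port: collides if port 80 is already taken on the node.
    # port "http" { static = 80 }

    # Dynamic port: Nomad picks a free host port and exposes it to the
    # task via the NOMAD_PORT_http environment variable.
    port "http" {}
  }
}
```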
3. CPU and Memory Allocations Show as Zero After Upgrade
After upgrading from Nomad 1.0.4 to 1.7.3, CPU and memory allocations are reported as zero. The mount output below shows the host is still on cgroups v1:
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type
cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/blkio type
cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type
cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/net_cls,net_prio type
cgroup (rw,nosuid,nodev,noexec,relatime,net_prio,net_cls)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
Workaround: This issue may arise from a mismatch in the required cgroup version. Nomad 1.7.3 may require cgroups v2, which is not supported by CentOS 7 (which uses cgroups v1). To resolve this, check the cgroups mounted on your system and consider upgrading to a newer OS version that supports cgroups v2, such as CentOS Stream or a RHEL 8-based distribution. For additional context, refer to the Nomad Changelog.
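One way to check which cgroup version a host mounts is to inspect the filesystem type of /sys/fs/cgroup. This is a general Linux check, not a Nomad command; on a v2 (unified) hierarchy the type is cgroup2fs, while v1 hosts mount a tmpfs with per-controller cgroup mounts underneath.

```shell
# Report whether the host uses cgroups v1 or v2, based on the
# filesystem type mounted at /sys/fs/cgroup (GNU stat -f).
cgroup_version() {
  case "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)" in
    cgroup2fs) echo v2 ;;      # unified hierarchy (cgroups v2)
    tmpfs)     echo v1 ;;      # legacy per-controller mounts (cgroups v1)
    *)         echo unknown ;; # non-Linux host or stat variant
  esac
}

cgroup_version
```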
4. Task Group Placement Failures After Upgrade
Error:
2022-11-23T15:58:04+01:00: Task Group "mariadb" (failed to place 1 allocation):
  * Class "public": 3 nodes excluded by filter
  * Constraint "did not meet topology requirement": 3 nodes excluded by filter
2022-11-23T15:58:04+01:00: Evaluation "fac96ff2" waiting for additional capacity to place remainder
==> 2022-11-23T15:58:04+01:00: Monitoring deployment "a1048165"
Workaround: This issue was caused by the CSI plugin update. Downgrading the CSI plugin resolved the scheduling issue for the job.
Additional information:
- Advanced Node Draining in HashiCorp Nomad
- Nomad Command: node drain
- Nomad Command: node eligibility
- Nomad Command: operator snapshot restore
- Nomad Command: operator snapshot save
- https://developer.hashicorp.com/nomad/docs/upgrade#upgrade-process
- Nomad Workload Migration
- Best Practices: Nomad Server and Client Host Reboot