To perform an upgrade in Nomad, follow these steps:
- Replace Binary Files: Update the Nomad binary on each node.
- Perform a Rolling Restart: Restart the nodes in the following order:
  - Follower nodes
  - Leader node
  - Client nodes
  This order minimizes disruptions and ensures continuity during the upgrade. For detailed guidance on the upgrade process, refer to the official documentation: Nomad Upgrade Process.
- Check for Breaking Changes: Review version-specific updates and breaking changes here: Nomad Releases.
- Follow the Upgrade Instructions: Follow the detailed steps for each version: Upgrade Instructions | Nomad.
- Avoid Large Version Jumps: Use the recommended upgrade path to prevent issues associated with skipping versions. Guidance on upgrading specific versions is available here: Upgrading Specific Versions | Nomad.

Upgrading in stages (servers first, then clients) reduces the risk of downtime and minimizes disruptions. Be sure to follow the recommended upgrade path and validate the cluster's functionality after each upgrade phase.
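The restart order above can be sketched as a small script. This is only an illustration, not an official procedure: the hostnames, the systemd service name, and the `NOMAD_APPLY` guard are all assumptions. With `NOMAD_APPLY` unset, the script only prints the commands it would run.

```shell
# Dry-run sketch of the rolling-restart order: followers, then the
# leader, then clients. Set NOMAD_APPLY=1 to actually execute.
run() {
  if [ "${NOMAD_APPLY:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi
}

rolling_restart() {
  for s in server-2 server-3; do             # placeholder follower hostnames
    run ssh "$s" systemctl restart nomad     # restart each follower first
  done
  run ssh server-1 systemctl restart nomad   # leader last among the servers
  for c in client-1 client-2; do             # placeholder client hostnames
    run ssh "$c" systemctl restart nomad     # clients after all servers
  done
}

rolling_restart
```

Use `nomad server members` to identify which server currently holds leadership before starting.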
While Nomad aims to maintain backward compatibility for at least two point releases, following the recommended step-by-step upgrade path is highly advised. Here’s why:
- Compatibility Risks from Version Jumps: Skipping multiple versions can lead to incompatibility with internal state, APIs, or features that changed in between. Since Nomad’s backward compatibility only covers recent releases, large version jumps increase the risk of operational issues.
- Managing Breaking Changes: New releases may introduce breaking changes, deprecations, or required updates that need incremental handling. Following the upgrade path ensures these changes are managed properly, reducing the likelihood of unexpected disruptions.
For version-specific updates and breaking changes, refer to the release notes: Nomad Releases.
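The "no large version jumps" rule can be illustrated with a small helper that lists the intermediate minor releases between two versions. This is purely illustrative; the actual required stops come from the Upgrading Specific Versions guide, not from arithmetic on version numbers.

```shell
# List the intermediate minor releases between a current and target
# version, e.g. 1.5 -> 1.8 passes through 1.6 and 1.7. Illustrative
# only; consult the official upgrade guide for mandatory stops.
upgrade_path() {
  major=${1%%.*}
  cur=${1#*.};  cur=${cur%%.*}    # current minor version
  tgt=${2#*.};  tgt=${tgt%%.*}    # target minor version
  m=$((cur + 1))
  while [ "$m" -le "$tgt" ]; do
    echo "$major.$m"
    m=$((m + 1))
  done
}

upgrade_path 1.5 1.8   # prints 1.6, 1.7, 1.8 one per line
```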
Q: What steps should I take if I need to downgrade my Nomad cluster? Is it safe to downgrade both clients and servers?
A: Nomad Downgrade Limitations
Nomad does not currently support downgrades. If a downgrade is necessary:
- Clients: You must drain allocations and remove the data directory.
- Servers: A safe downgrade requires re-provisioning the entire cluster.
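The client-side steps above can be sketched as a dry-run script. The data directory path, binary locations, node ID, and the `NOMAD_APPLY` guard are all illustrative assumptions; with `NOMAD_APPLY` unset it only prints what it would do.

```shell
# Dry-run sketch of a client downgrade: drain, stop, wipe the data
# directory, install the older binary, restart. Paths and the node ID
# are placeholders.
run() { if [ "${NOMAD_APPLY:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi; }

downgrade_client() {
  node_id=$1
  run nomad node drain -enable -yes "$node_id"     # migrate allocations away
  run systemctl stop nomad
  run rm -rf /opt/nomad/data                       # placeholder data_dir
  run install /tmp/nomad-old /usr/local/bin/nomad  # placeholder old binary
  run systemctl start nomad
  run nomad node drain -disable -yes "$node_id"    # mark eligible again
}

downgrade_client 9d7f1e2a   # placeholder node ID
```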
For more information, refer to the official documentation.
Pre-Upgrade Best Practices
- Review Release Notes: Always check the release notes for the new Nomad version to identify any known issues, limitations, or breaking changes. This helps you anticipate potential problems and plan for mitigation.
- Test in a Staging Environment: Before upgrading in production, test the new version in a non-production or staging environment. This ensures compatibility with your current configurations and workloads, reducing the risk of unexpected issues in production.
For details on breaking changes and version-specific updates, refer to: Nomad Releases.
Q: How can I minimize upgrade risks?
A: To reduce instability risks, HashiCorp recommends upgrading Nomad in a staging or non-production environment before deploying to production. This lets you validate that all workloads function properly. Patch releases, which mainly deliver bug fixes and security updates, typically carry lower risk than major or minor version upgrades.
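A pre-production check can be sketched as below. The staging address and job file name are placeholders, and with `NOMAD_APPLY` unset the script only prints what it would run.

```shell
# Dry-run sketch: validate and plan a job against a staging cluster
# before touching production. Address and file names are placeholders.
run() { if [ "${NOMAD_APPLY:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi; }

staging_check() {
  export NOMAD_ADDR=https://staging.example.internal:4646  # placeholder
  run nomad job validate example.nomad.hcl  # syntax and semantic checks
  run nomad job plan example.nomad.hcl      # dry-run placement diff
  run nomad job status                      # confirm workloads stay healthy
}

staging_check
```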
Q: How does Nomad integrate with Consul, and what is the version compatibility between various Nomad and Consul releases during an upgrade?
Troubleshooting Nomad Upgrade Issues
Upgrading Nomad can sometimes lead to unexpected issues. Below are common problems encountered during the upgrade process, along with their respective error messages and suggested workarounds.
| Issue | Error Summary |
| --- | --- |
| 1. Nil Pointer Error During Upgrade | Panic: invalid memory address or nil pointer dereference |
| 2. Reserved Port Collision Error | Task group failed to place due to network port collision (http8=80 exhausted) |
| 3. CPU and Memory Allocations Show as Zero | CPU and memory allocations reported as zero after upgrading from Nomad 1.0.4 to 1.7.3 |
| 4. Task Group Placement Failures | Task group failed to place allocations due to topology requirements not met (3 nodes excluded by filter) |
1. Nil Pointer Error During Upgrade
Error:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1c8c1b7]
goroutine 9215 [running]:
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*DriverHandle).Exec(0x0, 0xdf8475800, {0xc001545240, 0x1d}, {0xc001553f50, 0x1, 0xc0048a8480?})
github.com/hashicorp/nomad/client/allocrunner/taskrunner/driver_handle.go:70 +0xf7
github.com/hashicorp/nomad/client/allocrunner/taskrunner/template.(*TaskTemplateManager).processScript(0xc002a65950, 0xc0013eb300, 0xc0026dca20?)
github.com/hashicorp/nomad/client/allocrunner/taskrunner/template/template.go:597 +0x104
created by github.com/hashicorp/nomad/client/allocrunner/taskrunner/template.(*TaskTemplateManager).handleChangeModeScript in goroutine 3975
github.com/hashicorp/nomad/client/allocrunner/taskrunner/template/template.go:563 +0x47
Workaround: Refer to GitHub issue #24051 for details. The fix is included in the upcoming Nomad 1.9.0 release, with back-porting to Nomad Enterprise following shortly after.
2. Reserved Port Collision Error
After upgrading one of the cluster nodes to agent version 1.3.1, the Nomad job fails with the following error:
$ nomad job run -verbose -check-index 2318283 job.nomad
==> 2022-07-04T16:28:06+05:30: Monitoring evaluation "59c7211d-6159-a4e9-1e5a-7b2cb16dbc73"
    2022-07-04T16:28:06+05:30: Evaluation triggered by job "iptables-test"
    2022-07-04T16:28:06+05:30: Evaluation within deployment: "268048f3-40f4-2698-e76e-b5d363a412b3"
    2022-07-04T16:28:06+05:30: Evaluation status changed: "pending" -> "complete"
==> 2022-07-04T16:28:06+05:30: Evaluation "59c7211d-6159-a4e9-1e5a-7b2cb16dbc73" finished with status "complete" but failed to place all allocations:
    2022-07-04T16:28:06+05:30: Task Group "iptables-test" (failed to place 1 allocation):
      * Constraint "${attr.unique.hostname} == eu-pl-02": 1 nodes excluded by filter
      * Resources exhausted on 1 nodes
      * Dimension "network: reserved port collision http8=80" exhausted on 1 nodes
    2022-07-04T16:28:06+05:30: Evaluation "34fb2838-ec48-d4ed-8b50-1449f46b776f" waiting for additional capacity to place remainder
Workaround: The job is attempting to reserve port 80, which may already be in use. Consider using dynamic ports instead. For more information, refer to the Dynamic Ports Documentation.
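A job can avoid reserving a fixed host port by requesting a dynamic one instead. The sketch below shows the difference in a group network block; the group and port label names are illustrative, not taken from the failing job.

```hcl
group "iptables-test" {
  network {
    # Reserved (static) port: collides if port 80 is already taken on the node.
    # port "http" { static = 80 }

    # Dynamic port: Nomad picks a free host port and exposes it to the
    # task via the NOMAD_PORT_http environment variable.
    port "http" {}
  }
}
```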
3. CPU and Memory Allocations Show as Zero After Upgrade
After upgrading from Nomad 1.0.4 to 1.7.3, CPU and memory allocations are reported as zero. The mount output below shows the host is still on cgroups v1:
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type
cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/blkio type
cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type
cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/net_cls,net_prio type
cgroup (rw,nosuid,nodev,noexec,relatime,net_prio,net_cls)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
Workaround: This issue may arise from a mismatch in the required cgroup version. Nomad 1.7.3 may require cgroups v2, which is not supported by CentOS 7 (which uses cgroups v1). To resolve this, check the cgroups mounted on your system and consider upgrading to a newer OS version that supports cgroups v2, such as CentOS Stream or a RHEL 8-based distribution. For additional context, refer to the Nomad Changelog.
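One way to check which cgroup version a host mounts is to inspect the filesystem type of /sys/fs/cgroup. This is a general Linux check, not a Nomad command; on a v2 (unified) hierarchy the type is cgroup2fs, while v1 hosts mount a tmpfs with per-controller cgroup mounts underneath.

```shell
# Report whether the host uses cgroups v1 or v2, based on the
# filesystem type mounted at /sys/fs/cgroup (GNU stat -f).
cgroup_version() {
  case "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)" in
    cgroup2fs) echo v2 ;;      # unified hierarchy (cgroups v2)
    tmpfs)     echo v1 ;;      # legacy per-controller mounts (cgroups v1)
    *)         echo unknown ;; # non-Linux host or stat variant
  esac
}

cgroup_version
```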
4. Task Group Placement Failures After Upgrade
Error:
2022-11-23T15:58:04+01:00: Task Group "mariadb" (failed to place 1 allocation):
  * Class "public": 3 nodes excluded by filter
  * Constraint "did not meet topology requirement": 3 nodes excluded by filter
2022-11-23T15:58:04+01:00: Evaluation "fac96ff2" waiting for additional capacity to place remainder
==> 2022-11-23T15:58:04+01:00: Monitoring deployment "a1048165"
Workaround: This issue was caused by the CSI plugin update. Downgrading the CSI plugin resolved the scheduling issue for the job.
Additional information:
- Advanced Node Draining in HashiCorp Nomad
- Nomad Command: node drain
- Nomad Command: node eligibility
- Nomad Command: operator snapshot restore
- Nomad Command: operator snapshot save
- https://developer.hashicorp.com/nomad/docs/upgrade#upgrade-process
- Nomad Workload Migration
- Best Practices: Nomad Server and Client Host Reboot