Introduction
This article describes how to troubleshoot and resolve startup failures of the HashiCorp Nomad client agent when a fatal panic occurs due to corruption of the local BoltDB state store. This condition prevents the Nomad client service from starting, typically after an ungraceful shutdown or disk event.
Problem
The Nomad agent (client) repeatedly fails to start, exiting with a fatal panic indicating corruption in the BoltDB-backed client state store. The logs show a message similar to:
Jul 21 10:27:27 nomad-client nomad[281786]: panic: invalid freelist page: 2957762210125020946, page type is branch
Jul 21 10:27:27 nomad-client nomad[281786]: goroutine 1 [running]:
Jul 21 10:27:27 nomad-client nomad[281786]: go.etcd.io/bbolt.(*freelist).read(0x0?, 0x7feb5072e000)
Jul 21 10:27:27 nomad-client nomad[281786]: go.etcd.io/bbolt@v1.3.9/freelist.go:267 +0x20e
Jul 21 10:27:27 nomad-client nomad[281786]: go.etcd.io/bbolt.(*DB).loadFreelist.func1()
Jul 21 10:27:27 nomad-client nomad[281786]: go.etcd.io/bbolt@v1.3.9/db.go:420 +0xb7
Jul 21 10:27:27 nomad-client nomad[281786]: sync.(*Once).doSlow(0x19e6ae0?, 0xc0006fe650?)
Jul 21 10:27:27 nomad-client nomad[281786]: sync/once.go:76 +0xb4
Jul 21 10:27:27 nomad-client nomad[281786]: sync.(*Once).Do(...)
Jul 21 10:27:27 nomad-client nomad[281786]: sync/once.go:67
Jul 21 10:27:27 nomad-client nomad[281786]: go.etcd.io/bbolt.(*DB).loadFreelist(0xc0006fe488?)
Jul 21 10:27:27 nomad-client nomad[281786]: go.etcd.io/bbolt@v1.3.9/db.go:413 +0x3b
Jul 21 10:27:27 nomad-client nomad[281786]: go.etcd.io/bbolt.Open({0xc00016ec60, 0x20}, 0x180, 0xc000953098)
Jul 21 10:27:27 nomad-client nomad[281786]: go.etcd.io/bbolt@v1.3.9/db.go:295 +0x430
Jul 21 10:27:27 nomad-client nomad[281786]: github.com/hashicorp/nomad/helper/boltdd.Open({0xc00016ec60?, 0x20?}, 0x0?, 0x0?)
Jul 21 10:27:27 nomad-client nomad[281786]: github.com/hashicorp/nomad/helper/boltdd/boltdd.go:55 +0x18
Jul 21 10:27:27 nomad-client nomad[281786]: github.com/hashicorp/nomad/client/state.NewBoltStateDB({0x3cafdd0, 0xc000a652f0}, {0xc000862150, 0x17})
Jul 21 10:27:27 nomad-client nomad[281786]: github.com/hashicorp/nomad/client/state/db_bolt.go:187 +0x125
Jul 21 10:27:27 nomad-client nomad[281786]: github.com/hashicorp/nomad/client.(*Client).init(0xc000a0f508)
Jul 21 10:27:27 nomad-client nomad[281786]: github.com/hashicorp/nomad/client/client.go:670 +0x275
Jul 21 10:27:27 nomad-client systemd[1]: nomad.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Nomad exits with a status code of 2/INVALIDARGUMENT, causing systemd or other process managers to continually attempt (and fail) to restart the service.
Prerequisites
- This article applies to HashiCorp Nomad Client agents (not servers).
- Nomad versions using BoltDB (go.etcd.io/bbolt) for local client state.
- Environments where Nomad client nodes may experience ungraceful shutdowns, power loss, or filesystem issues.
- Access to the Nomad client node’s filesystem and logs.
Cause
This error is typically encountered after an unexpected shutdown, power outage, hardware fault, or underlying disk issue that corrupts the BoltDB database (state.db) that Nomad uses to maintain local client state.
Symptoms:
- Nomad client fails to start.
- Logs contain a Go panic trace including:
  - The text: panic: invalid freelist page: <number>, page type is branch
  - A stack trace referencing go.etcd.io/bbolt.(*freelist).read and/or go.etcd.io/bbolt.Open
- The service manager reports exit code 2/INVALIDARGUMENT.
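The log symptom above can be checked with a small shell filter. This is a minimal sketch: has_freelist_panic is a hypothetical helper name (not a Nomad command), and the usage line assumes a systemd host where the unit is named nomad.

```shell
#!/usr/bin/env bash
# Sketch: detect the bbolt freelist panic described in this article.
# has_freelist_panic is a hypothetical helper; it reads log text on stdin
# and exits 0 only if the panic line is present.
has_freelist_panic() {
  # -E: extended regular expressions; -q: no output, exit status only.
  grep -Eq 'panic: invalid freelist page: [0-9]+, page type is branch'
}

# Example usage (assumes the systemd unit is named "nomad"):
#   journalctl -u nomad --no-pager | has_freelist_panic \
#     && echo "corrupted client state.db suspected"
```

A match only indicates that the panic text appears in the logs; it does not by itself confirm which file is damaged.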
Overview of Possible Solutions
- Non-destructive recovery is possible for Nomad client nodes because their state can be rebuilt from the cluster.
- The corrupted BoltDB file must be removed or replaced for Nomad to start.
Solutions
Solution: Remove the Corrupted State Directory
- Stop the Nomad client service:
  sudo systemctl stop nomad
- Identify the Nomad client data directory (default path: /opt/nomad/data/client), or use the path specified in your client configuration.
- Remove the state.db file, or remove the corrupted client folder:
  sudo rm -rf /path/to/nomad/client/data/client
- Restart the Nomad client service:
  sudo systemctl start nomad
- The client will rejoin the cluster and retrieve workload state from the Nomad servers.
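As a more cautious variant of the removal step, the client folder can be moved aside rather than deleted, so the corrupted file remains available for later inspection. The following is a sketch only: quarantine_client_state is a hypothetical helper, the .corrupt.<timestamp> suffix is an arbitrary naming choice, and the usage lines assume the default client path mentioned above and a systemd unit named nomad.

```shell
#!/usr/bin/env bash
# Sketch: move the Nomad client state folder aside instead of deleting it.
# quarantine_client_state is a hypothetical helper, not part of Nomad.
# Argument: the client data directory (the folder that contains state.db).
quarantine_client_state() {
  client_dir=$1
  if [ ! -d "$client_dir" ]; then
    echo "no client directory at $client_dir" >&2
    return 1
  fi
  # The timestamp suffix keeps repeated runs from overwriting earlier backups.
  backup="$client_dir.corrupt.$(date +%Y%m%d%H%M%S)"
  # Print the backup path only if the move succeeded.
  mv "$client_dir" "$backup" && echo "$backup"
}

# Example usage, run from a root shell with the service stopped
# (assumes the default client path and a systemd unit named "nomad"):
#   systemctl stop nomad
#   quarantine_client_state /opt/nomad/data/client
#   systemctl start nomad
```

Once the node has rejoined the cluster and is healthy, the backup folder can be deleted to reclaim disk space.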
Outcome
If the above steps are successful, Nomad will start without panics and the node will successfully rejoin the cluster and resume normal operations. If the problem persists:
- Double-check you have removed the correct directory/file.
- Review additional Nomad agent logs for other errors.
- Consider running filesystem health checks.
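To double-check which directory the agent actually uses, the configured data_dir can be read from the agent configuration. This is a rough sketch under stated assumptions: find_data_dir is a hypothetical helper, it only understands simple one-line HCL assignments, and /etc/nomad.d in the usage line is a commonly used configuration location that may differ on your system.

```shell
#!/usr/bin/env bash
# Sketch: print data_dir values found in Nomad HCL configuration files.
# find_data_dir is a hypothetical helper; it handles only simple one-line
# assignments such as:  data_dir = "/opt/nomad/data"
# Argument: a directory containing *.hcl configuration files.
find_data_dir() {
  config_dir=$1
  # Print just the quoted value from each matching data_dir line.
  sed -n 's/^[[:space:]]*data_dir[[:space:]]*=[[:space:]]*"\([^"]*\)".*$/\1/p' \
    "$config_dir"/*.hcl
}

# Example usage (assumes configuration lives in /etc/nomad.d):
#   find_data_dir /etc/nomad.d
```

The client state folder is the client subdirectory of the printed path, which matches the default of /opt/nomad/data/client given in the solution steps.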
If the issue still persists, please contact HashiCorp support at support@hashicorp.com.