Nomad Client Startup Failure Due to BoltDB Corruption – Troubleshooting and Recovery – HashiCorp Help Center

Introduction

This article describes how to troubleshoot and resolve startup failures of the HashiCorp Nomad client agent when a fatal panic occurs due to corruption of the local BoltDB state store. This condition prevents the Nomad client service from starting, typically after an ungraceful shutdown or disk event.

Problem

The Nomad agent (client) repeatedly fails to start, exiting with a fatal panic indicating corruption in the BoltDB-backed client state store. The logs show a message similar to:

Jul 21 10:27:27 nomad-client nomad[281786]: panic: invalid freelist page: 2957762210125020946, page type is branch
Jul 21 10:27:27 nomad-client nomad[281786]: goroutine 1 [running]:
Jul 21 10:27:27 nomad-client nomad[281786]: go.etcd.io/bbolt.(*freelist).read(0x0?, 0x7feb5072e000)
Jul 21 10:27:27 nomad-client nomad[281786]: go.etcd.io/bbolt@v1.3.9/freelist.go:267 +0x20e
Jul 21 10:27:27 nomad-client nomad[281786]: go.etcd.io/bbolt.(*DB).loadFreelist.func1()
Jul 21 10:27:27 nomad-client nomad[281786]: go.etcd.io/bbolt@v1.3.9/db.go:420 +0xb7
Jul 21 10:27:27 nomad-client nomad[281786]: sync.(*Once).doSlow(0x19e6ae0?, 0xc0006fe650?)
Jul 21 10:27:27 nomad-client nomad[281786]: sync/once.go:76 +0xb4
Jul 21 10:27:27 nomad-client nomad[281786]: sync.(*Once).Do(...)
Jul 21 10:27:27 nomad-client nomad[281786]: sync/once.go:67
Jul 21 10:27:27 nomad-client nomad[281786]: go.etcd.io/bbolt.(*DB).loadFreelist(0xc0006fe488?)
Jul 21 10:27:27 nomad-client nomad[281786]: go.etcd.io/bbolt@v1.3.9/db.go:413 +0x3b
Jul 21 10:27:27 nomad-client nomad[281786]: go.etcd.io/bbolt.Open({0xc00016ec60, 0x20}, 0x180, 0xc000953098)
Jul 21 10:27:27 nomad-client nomad[281786]: go.etcd.io/bbolt@v1.3.9/db.go:295 +0x430
Jul 21 10:27:27 nomad-client nomad[281786]: github.com/hashicorp/nomad/helper/boltdd.Open({0xc00016ec60?, 0x20?}, 0x0?, 0x0?)
Jul 21 10:27:27 nomad-client nomad[281786]: github.com/hashicorp/nomad/helper/boltdd/boltdd.go:55 +0x18
Jul 21 10:27:27 nomad-client nomad[281786]: github.com/hashicorp/nomad/client/state.NewBoltStateDB({0x3cafdd0, 0xc000a652f0}, {0xc000862150, 0x17})
Jul 21 10:27:27 nomad-client nomad[281786]: github.com/hashicorp/nomad/client/state/db_bolt.go:187 +0x125
Jul 21 10:27:27 nomad-client nomad[281786]: github.com/hashicorp/nomad/client.(*Client).init(0xc000a0f508)
Jul 21 10:27:27 nomad-client nomad[281786]: github.com/hashicorp/nomad/client/client.go:670 +0x275
Jul 21 10:27:27 nomad-client systemd[1]: nomad.service: main process exited, code=exited, status=2/INVALIDARGUMENT

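Nomad exits with status 2, the code the Go runtime uses for an unrecovered panic (systemd displays it as 2/INVALIDARGUMENT). Because the corrupted file is still present on every attempt, systemd or another process manager keeps restarting the service, and each restart fails the same way.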
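Before changing anything on disk, you can confirm that a failing client matches this signature by scanning the unit's recent journal output. The sketch below assumes the agent runs as the systemd unit nomad and logs to journald. The function name is illustrative and not part of Nomad.

```shell
# Sketch: detect the bbolt freelist-corruption panic in Nomad log output.
# Assumption: Nomad runs as the systemd unit "nomad" and logs to journald.

# Reads log text on stdin; returns 0 if it contains the panic message or the
# bbolt freelist frame from the stack trace shown in this article.
is_bolt_corruption_panic() {
  grep -qE 'panic: invalid freelist page: [0-9]+, page type is [a-z]+|go\.etcd\.io/bbolt\.\(\*freelist\)\.read'
}

# Example usage on a client node:
#   journalctl -u nomad --no-pager -n 200 | is_bolt_corruption_panic \
#     && echo "BoltDB client state corruption detected"
```

A match only tells you the panic happened; it does not say which file is damaged, which is why the later steps locate the state directory explicitly.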

Prerequisites

This article applies to HashiCorp Nomad client agents (not servers).
Nomad versions using BoltDB (go.etcd.io/bbolt) for local client state.
Environments where Nomad client nodes may experience ungraceful shutdowns, power loss, or filesystem issues.
Access to the Nomad client node’s filesystem and logs.

Cause

This error is typically encountered after an unexpected shutdown, power outage, hardware fault, or underlying disk issue that corrupts the BoltDB database (state.db) Nomad uses to maintain local client state.
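To find which state.db is affected, you can derive its directory from the agent configuration. This minimal sketch assumes HCL files with simple one-line assignments such as data_dir = "/opt/nomad/data" and, optionally, state_dir inside the client block. It is a naive text scan rather than an HCL parser, it does not handle JSON configuration, and the function name is hypothetical.

```shell
# Sketch: print the directory expected to contain Nomad's client state.db.
# Assumptions: HCL config with one-line assignments like
#   data_dir  = "/opt/nomad/data"
#   state_dir = "/some/dir"      (optional, inside the client block)
# Naive text scan only; not an HCL parser, JSON configs are not handled.

# Usage: nomad_client_state_dir CONFIG_FILE...
nomad_client_state_dir() {
  local data_dir state_dir
  # Keep the last assignment seen across the files, in the order given.
  data_dir=$(sed -nE 's/^[[:space:]]*data_dir[[:space:]]*=[[:space:]]*"([^"]+)".*/\1/p' "$@" | tail -n 1)
  state_dir=$(sed -nE 's/^[[:space:]]*state_dir[[:space:]]*=[[:space:]]*"([^"]+)".*/\1/p' "$@" | tail -n 1)
  if [ -n "$state_dir" ]; then
    # An explicit client state_dir takes precedence.
    printf '%s\n' "$state_dir"
  elif [ -n "$data_dir" ]; then
    # Otherwise the client state lives in the "client" subdirectory of data_dir.
    printf '%s/client\n' "$data_dir"
  else
    echo "no data_dir or state_dir found" >&2
    return 1
  fi
}

# Example: ls -l "$(nomad_client_state_dir /etc/nomad.d/*.hcl)/state.db"
```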

Symptoms:

Nomad client fails to start.
Logs contain a Go panic trace that includes:
- panic: invalid freelist page: <number>, page type is branch
- A stack trace referencing go.etcd.io/bbolt.(*freelist).read and/or go.etcd.io/bbolt.Open
The service manager reports exit code 2/INVALIDARGUMENT.
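The recorded exit status can be read back from systemd. The hedged sketch below assumes the unit is named nomad. Exit status 2 is consistent with a Go panic, but other failures can also produce it, so treat it as a hint to combine with the log signature rather than proof on its own. The function name is illustrative.

```shell
# Sketch: interpret the main-process exit status systemd recorded for Nomad.
# Assumption: the unit is named "nomad". The Go runtime exits with status 2
# on an unrecovered panic; systemd merely labels status 2 as INVALIDARGUMENT.

# Usage: describe_nomad_exit_status STATUS
describe_nomad_exit_status() {
  case "$1" in
    0)  echo "clean exit" ;;
    2)  echo "exit 2: consistent with a Go panic; check the journal for a bbolt trace" ;;
    '') echo "no exit status recorded" ;;
    *)  echo "exit $1: not the panic signature described in this article" ;;
  esac
}

# On a client node:
#   describe_nomad_exit_status "$(systemctl show -p ExecMainStatus --value nomad)"
```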

Overview of Possible Solutions

Recovery does not require restoring any data: the Nomad servers hold the authoritative job and allocation state, so a client's local state can be rebuilt from the cluster.
The corrupted BoltDB file must be removed or replaced for Nomad to start.

Solutions

Solution: Remove the Corrupted State Directory

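Stop the Nomad client service so systemd stops restarting it:
sudo systemctl stop nomad
Identify the client state directory. It is the client subdirectory of the agent's data_dir (typically /opt/nomad/data/client) unless state_dir is set in the client block of your agent configuration.
Move the corrupted state out of the way. Either rename only the state.db file in that directory, or move the whole directory aside. Moving rather than deleting keeps a copy for later inspection; use sudo rm -rf instead only if you do not need one:
sudo mv <state_dir>/state.db <state_dir>/state.db.corrupt
or
sudo mv <state_dir> <state_dir>.corrupt
Moving the whole directory also discards the node's client-id and secret-id files, so the node registers with a new node ID.
Start the Nomad client service:
sudo systemctl start nomad
The client recreates its state store, rejoins the cluster, and retrieves its allocations from the Nomad servers.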
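The steps above can be scripted. This sketch moves the corrupted state aside with a timestamp instead of deleting it, assumes the systemd unit nomad and a state directory you have already identified, and supports DRY_RUN=1 to print the commands before running them. The helper names and the state-db-only mode are hypothetical, not Nomad features.

```shell
# Sketch of the recovery steps, moving state aside instead of deleting it.
# Assumptions: systemd unit "nomad"; STATE_DIR is the client state directory
# identified earlier; DRY_RUN=1 prints commands instead of running them.

run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Usage: recover_nomad_client_state STATE_DIR [state-db-only]
recover_nomad_client_state() {
  local state_dir=${1%/} mode=${2:-whole-dir} stamp
  if [ -z "$state_dir" ]; then
    echo "usage: recover_nomad_client_state STATE_DIR [state-db-only]" >&2
    return 2
  fi
  stamp=$(date +%Y%m%d%H%M%S)

  # 1. Stop the agent so the restart loop ends while files are moved.
  run sudo systemctl stop nomad || return 1

  # 2. Keep the corrupted data for later inspection.
  if [ "$mode" = state-db-only ]; then
    # Only the BoltDB file moves; client-id and secret-id stay in place.
    run sudo mv "$state_dir/state.db" "$state_dir/state.db.corrupt-$stamp" || return 1
  else
    # The whole directory moves, including the node identity files.
    run sudo mv "$state_dir" "$state_dir.corrupt-$stamp" || return 1
  fi

  # 3. Start the agent; it recreates its state and re-syncs from the servers.
  run sudo systemctl start nomad
}
```

For example, DRY_RUN=1 recover_nomad_client_state /opt/nomad/data/client prints the three commands without running them; run it again without DRY_RUN to apply them.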

Outcome

If the above steps are successful, Nomad will start without panics and the node will successfully rejoin the cluster and resume normal operations. If the problem persists:

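Double-check that you moved or removed the correct file or directory.
Review the Nomad agent logs for other errors.
Check the underlying disk and filesystem (for example, with fsck or SMART diagnostics); repeated corruption can point to a storage fault.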
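A few of these post-recovery checks can also be scripted. This sketch assumes the systemd unit nomad, journald logging, and a nomad CLI on the node that can reach the local agent. The helper names and the kernel-log patterns are illustrative and not exhaustive.

```shell
# Sketch: post-recovery checks for a Nomad client.
# Assumptions: systemd unit "nomad", journald logging, nomad CLI available.

# Reads log text on stdin; succeeds if no Go panic line appears.
no_panic_in_logs() {
  ! grep -q 'panic: '
}

# Reads kernel log text on stdin; prints lines that hint at storage faults.
# The patterns are examples only and will not catch every driver's wording.
storage_error_lines() {
  grep -iE 'i/o error|ext4-fs error|xfs.*corrupt'
}

# Example usage on the client node:
#   systemctl is-active --quiet nomad && echo "nomad is running"
#   journalctl -u nomad --no-pager --since "10 min ago" | no_panic_in_logs && echo "no panics"
#   nomad node status -self
#   sudo dmesg | storage_error_lines
```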

If the issue still persists, contact HashiCorp Support at support@hashicorp.com.

Additional Information