The information contained in this article has been verified as up-to-date on the date of the original publication of the article. HashiCorp endeavors to keep this information up-to-date and correct, but it makes no representations or warranties of any kind, express or implied, about the ongoing completeness, accuracy, reliability, or suitability of the information provided.
All information contained in this article is for general information purposes only. Any reliance you place on such information as it applies to your use of your HashiCorp product is therefore strictly at your own risk.
Introduction
This article provides a step-by-step guide to troubleshooting and resolving an issue where a Nomad server fails to start due to a corrupted or empty keystore file. We will identify the symptoms, understand the cause, and safely remove the corrupted file, allowing Nomad to automatically restore it from other healthy servers in the cluster.
Problem
A Nomad server agent fails to start and logs an "unexpected end of JSON input" error, and the Nomad service restarts repeatedly without joining the cluster. This typically happens when a .aead.nks.json file in the data/server/keystore directory is 0 bytes in size.
==> Error starting agent: server setup failed: could not load key file /hull/nomad/data/server/keystore/<key-id>.aead.nks.json from keystore: unexpected end of JSON input
Prerequisites
Before proceeding, you must confirm the following:
- This applies to Nomad v1.8.x and older. Beginning in v1.9.x, the key is stored in Raft.
- All other Nomad servers in the cluster are healthy and have correct keyring entries.
- The affected server is not the only server in the cluster.
Cause
Nomad uses JSON files to store encryption keys for its keyring, located in the server keystore directory. If one of these files becomes corrupted or is truncated to 0 bytes, the Nomad server is unable to parse the file, causing the startup process to fail.
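Because a zero-byte file contains no JSON at all, an affected key file can be spotted by size alone. The following is a minimal sketch of one way to list such files. The function name list_empty_keys is hypothetical, and the default path is only the one shown in the example error message above, so substitute the keystore path under your own data directory.

```shell
#!/bin/sh
# list_empty_keys [KEYSTORE_DIR]
# Prints every *.aead.nks.json file under KEYSTORE_DIR that is zero bytes long.
# Hypothetical helper: the default path is taken from this article's example
# error message and will differ on your servers.
list_empty_keys() {
    keystore_dir="${1:-/hull/nomad/data/server/keystore}"
    # -size 0 matches only files that contain no data at all.
    find "$keystore_dir" -type f -name '*.aead.nks.json' -size 0
}
```

Running list_empty_keys with no output suggests that no key file is empty. A file that is truncated but not empty would not be caught by this check and would still need to be identified from the path in the startup error.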
Overview of possible solutions
The primary solution involves removing the corrupted file. When the Nomad server restarts, it will automatically replicate the missing key from other healthy servers in the cluster. An alternative, though unnecessary, solution is to manually copy a healthy keystore file from another server to the affected server.
Solutions:
- Verify the corrupted file: Use the ls -l command to list the files in the keystore directory and identify the file with a size of 0 bytes.
  ls -l /hull/nomad/data/server/keystore/*.aead.nks.json
- Stop the Nomad service: Halt the Nomad service on the affected server.
  sudo systemctl stop nomad
- Remove the corrupted file: Delete the identified 0-byte file.
  sudo rm /hull/nomad/data/server/keystore/<key-id>.aead.nks.json
- Restart the Nomad server: Start the Nomad service to allow it to rejoin the cluster and replicate the missing key.
  sudo systemctl start nomad
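Since the removal step above deletes key material, it can be worth guarding against a mistyped file name. The sketch below wraps the rm step in a check that the target really is a regular, empty file. The function name remove_empty_key is hypothetical and not part of Nomad; it only illustrates the guard.

```shell
#!/bin/sh
# remove_empty_key FILE
# Deletes FILE only if it is a regular file of zero bytes.
# Hypothetical helper: refuses to act on anything else, so a healthy
# key file is never removed by accident.
remove_empty_key() {
    key_file="$1"
    if [ ! -f "$key_file" ]; then
        echo "not a regular file: $key_file" >&2
        return 1
    fi
    # -s is true when the file exists and has a size greater than zero.
    if [ -s "$key_file" ]; then
        echo "refusing to remove non-empty key file: $key_file" >&2
        return 1
    fi
    rm -- "$key_file"
}
```

The function would be called between the stop and start steps, from a shell with permission to modify the keystore directory, passing the full path of the 0-byte file identified in the first step.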
Outcome
After following the steps, the Nomad server should successfully start, rejoin the cluster, and automatically restore the missing keystore file.
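One way to confirm this outcome is to wait for the removed key file to reappear with a non-zero size. The sketch below polls for that condition. The function name wait_for_key and the 60-second default timeout are assumptions chosen for illustration; this article does not state how long replication takes.

```shell
#!/bin/sh
# wait_for_key FILE [TIMEOUT_SECONDS]
# Returns 0 as soon as FILE exists and is non-empty, or non-zero if that has
# not happened within TIMEOUT_SECONDS (default 60, an arbitrary assumption).
wait_for_key() {
    key_file="$1"
    timeout="${2:-60}"
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # -s is true when the file exists and has a size greater than zero.
        if [ -s "$key_file" ]; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # Final check so the return status reflects the state at the deadline.
    [ -s "$key_file" ]
}
```

A non-empty file only shows that something was written back. It does not prove the contents are valid, so the server logs should still be reviewed for keyring or keystore errors.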
Additional Information
- Post-Resolution Verification:
  - Confirm the node has rejoined the cluster by running:
    nomad server members
  - Check logs for any remaining keyring or keystore errors.
- Prevention:
  - Maintain a healthy cluster with multiple Nomad servers to ensure key replication.
  - Regularly back up your Nomad data directory.
  - Investigate recurring keystore corruption, as it may indicate underlying disk or file system issues.
  - Follow official Nomad key management procedures for key rotation.