Introduction
The consul snapshot agent subcommand starts a dedicated process that captures snapshots of your Consul server state and stores them either locally or in a configured remote storage service. This functionality is available in Consul Enterprise versions 0.7.1 and later.
You can operate the snapshot agent in two ways:
- Long-running Daemon: Run the agent as a persistent background process that automatically takes snapshots at regular intervals. This mode includes a leader election mechanism for high availability, ensuring seamless failover in case of agent failures.
- One-shot Mode: Execute the agent for a single snapshot operation, ideal for integration with batch jobs or scripts. This mode is selected with the -interval argument (example invocations for both modes follow this list).
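For reference, both modes use the same subcommand; the invocations below are illustrative sketches, and <path-to-config> is a placeholder for your own configuration file:

# Long-running daemon: snapshots are taken on a recurring interval
consul snapshot agent -config-file=<path-to-config>

# One-shot mode: an interval of 0 takes a single snapshot and exits, which suits cron or batch jobs
consul snapshot agent -config-file=<path-to-config> -interval=0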
NOTE: The Consul snapshot agent is a separate process from the Consul agent itself. Therefore, the snapshot agent leader is not the same as the Consul cluster leader.
Expected Outcome
By following this guide, users will be able to:
- Deploy and configure the Consul snapshot agent.
- Understand how the snapshot agent works in a highly available manner.
- Troubleshoot common issues related to snapshot failures and leadership transitions.
Prerequisites
- Consul Enterprise 0.7.1 or later
- Running Consul cluster with multiple servers
- Sufficient disk space for snapshot storage
- Access to the Consul API for health checks and debugging
Use Case
The Consul snapshot agent provides crucial functionality for maintaining the health and recoverability of your Consul cluster. Its primary roles include:
- Regular Backups: It automatically creates backups of the Consul cluster's state, enabling efficient disaster recovery in case of unforeseen events (a brief restore example follows this list).
- Seamless Failover: In highly available Consul deployments, the agent facilitates smooth leadership transitions if the current leader node experiences issues.
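Because the agent writes standard Consul snapshots, the usual consul snapshot subcommands can be used to examine and restore them during disaster recovery; the file name below is only an illustrative example:

# Inspect a saved snapshot before restoring it (file name is illustrative)
consul snapshot inspect consul-1586445306987972055.snap

# Restore the snapshot into a running cluster
consul snapshot restore consul-1586445306987972055.snap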
Monitoring Agent Health and Status:
The agent registers itself as a Consul service with two key health checks:
- "Consul Snapshot Agent Alive": This check confirms that the agent is running and responsive.
- "Consul Snapshot Agent Saving Snapshots": This check, present only on the leader node, verifies that the agent is actively creating snapshots.
These health checks allow operators to easily monitor the status of snapshot agents across the cluster and to quickly identify:
- Live and Standby Agents: Agents with only the "Alive" check are running but on standby, ready to take over if needed.
- Leader Agent: The agent with both health checks is the current leader and actively saving snapshots.
This clear visibility into agent status simplifies monitoring and management of your Consul snapshot infrastructure.
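For example, assuming the default service registration name consul-snapshot (visible in the output later in this guide), the checks can be listed cluster-wide through the Consul HTTP API; the address and jq filter are illustrative:

curl -s http://127.0.0.1:8500/v1/health/checks/consul-snapshot | jq '.[] | {Node, Name, Status}'

In the output, the leader is the node whose "Consul Snapshot Agent Saving Snapshots" check is passing.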
Procedure
Step 1: Start the Snapshot Agent
Run the following command to start the agent:
consul snapshot agent -config-file=<path-to-config>
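For reference, a minimal configuration for local snapshot storage might look like the sketch below. The file path, storage directory, interval, and retention count are all illustrative examples, and the keys shown are a subset of the snapshot agent options documented for Consul Enterprise:

# All paths and values below are examples only
cat > /etc/consul-snapshot.json <<'EOF'
{
  "snapshot_agent": {
    "snapshot": {
      "interval": "1h",
      "retain": 30
    },
    "local_storage": {
      "path": "/opt/consul/snapshots"
    }
  }
}
EOF

consul snapshot agent -config-file=/etc/consul-snapshot.json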
Step 2: Verify Leadership Election
Each saved snapshot is reported in the agent's log output. Check the logs to confirm that a leader has been elected and snapshots are being saved:
2020/04/09 21:21:13 [INFO] Snapshot agent running
2020/04/09 21:21:13 [INFO] Waiting to obtain leadership...
2020/04/09 21:21:13 [INFO] Obtained leadership
2020/04/09 21:21:13 [INFO] Saved snapshot 1479360073448728784
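If the agent runs under systemd, you can follow these messages with journalctl; the unit name consul-snapshot is an assumption and should match however your environment names the service:

# Unit name is an assumption; adjust to your environment
journalctl -u consul-snapshot -f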
Step 3: Handle Snapshot Failures
When an agent fails to save a snapshot for any reason, it retries until it reaches the number of failures specified by -max-failures
(default: 3). After that, the snapshot agent gives up leadership. In a highly available deployment with multiple snapshot agents, this gives another agent a chance to take over when one agent is experiencing issues, such as running out of disk space for snapshots. Here is an example of a snapshot agent failing to write to disk and losing leadership:
2020/04/09 15:14:56 [DEBUG] Taking a snapshot...
2020/04/09 15:14:56 [ERR] Snapshot failed (will retry at next interval): failed to create snapshot: open /tmp/snaps/consul-1586445306987972055.tmp: no such file or directory
2020/04/09 15:15:01 [DEBUG] Taking a snapshot...
2020/04/09 15:15:01 [ERR] Snapshot failed (will retry at next interval): failed to create snapshot: open /tmp/snaps/consul-1586445306987972055.tmp: no such file or directory
2020/04/09 15:15:06 [DEBUG] Taking a snapshot...
2020/04/09 15:15:06 [ERR] Snapshot failed (will retry at next interval): failed to create snapshot: open /tmp/snaps/consul-1586445306987972055.tmp: no such file or directory
2020/04/09 15:15:06 [WARN] Too many snapshot failures (will give up leadership)
2020/04/09 15:15:16 [INFO] Waiting to obtain leadership...
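If an agent should tolerate more transient failures before giving up leadership, the threshold can be raised with the same flag; the value 5 below is only an example:

# Allow five consecutive snapshot failures before giving up leadership (value is an example)
consul snapshot agent -config-file=<path-to-config> -max-failures=5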
Step 4: Monitor Health Checks
Ensure the snapshot agent registers itself with Consul and is marked as healthy. Here is an example of the failed health check on the node that just lost leadership:
user@consul03 ~ $ curl http://127.0.0.1:8500/v1/agent/checks | jq
[output trimmed for brevity]
"consul-snapshot:c3645410-2aad-eabd-418c-81dc01b609b6:snapshot-ttl": {
  "Node": "consul03",
  "CheckID": "consul-snapshot:c3645410-2aad-eabd-418c-81dc01b609b6:snapshot-ttl",
  "Name": "Consul Snapshot Agent Saving Snapshots",
  "Status": "critical",
  "Notes": "This check is periodically updated as long as the leader is successfully taking snapshots.",
  "Output": "TTL expired",
  "ServiceID": "consul-snapshot:c3645410-2aad-eabd-418c-81dc01b609b6",
  "ServiceName": "consul-snapshot",
  "ServiceTags": [],
  "Definition": {},
  "CreateIndex": 0,
  "ModifyIndex": 0
},
[output trimmed for brevity]
After that, another snapshot leader is elected on another node.
Problem & Cause
Even if the new snapshot leader succeeds in creating a snapshot, the node that used to be the leader will still be marked as failed.
That means that you can have the following situation:
- An operational snapshot agent deployment that is otherwise working correctly
- A new snapshot leader that is saving snapshots successfully
- An old leader that is still marked as failed
This is intentional: it draws attention to why snapshotting failed in the first place, which might be a symptom of a larger issue rather than an isolated failure of a single agent. It is strongly advised to investigate why snapshotting failed on the old leader node.
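For the failure shown in Step 3, a reasonable first check is that the snapshot directory exists and has free space; the path /tmp/snaps comes from that example error and should be replaced with your configured local storage path:

# Path is taken from the example error above; substitute your own snapshot directory
ls -ld /tmp/snaps
df -h /tmp/snaps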
Solution
There are two approaches to clear the failed state of the health check.
- Restart the Consul snapshot agent service on the node where it failed (see the example after this list)
- Wait until the node becomes a leader again and successfully saves a snapshot
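If the snapshot agent runs as a systemd service, the first option might look like the sketch below; the unit name consul-snapshot is an assumption and should be replaced with whatever name your environment uses:

# Unit name is an assumption; adjust to your environment
sudo systemctl restart consul-snapshot

# Re-check the snapshot agent health checks on this node afterwards
curl -s http://127.0.0.1:8500/v1/agent/checks | jq '.[] | select(.ServiceName == "consul-snapshot") | {Name, Status}'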
Additional Information