Introduction
The consul snapshot agent subcommand starts a dedicated process that captures snapshots of your Consul server state and stores them either locally or in a configured remote storage service. This functionality is available in Consul Enterprise versions 0.7.1 and later.
You can operate the snapshot agent in two ways:
- Long-running Daemon: Run the agent as a persistent background process that automatically takes snapshots at regular intervals. This mode includes a leader election mechanism for high availability, ensuring seamless failover in case of agent failures.
- One-shot Mode: Execute the agent for a single snapshot operation, ideal for integration with batch jobs or scripts. This mode is selected with the -interval argument (example invocations for both modes follow this list).
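For reference, both modes use the same subcommand; the invocations below are illustrative sketches, and <path-to-config> is a placeholder for your own configuration file:

# Long-running daemon: snapshots are taken on a recurring interval
consul snapshot agent -config-file=<path-to-config>

# One-shot mode: an interval of 0 takes a single snapshot and exits, which suits cron or batch jobs
consul snapshot agent -config-file=<path-to-config> -interval=0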
NOTE: The Consul snapshot agent is a separate process from the Consul agent itself. Therefore, the snapshot agent leader is not the same as the Consul cluster leader.
Expected Outcome
By following this guide, users will be able to:
- Deploy and configure the Consul snapshot agent.
- Understand how the snapshot agent works in a highly available manner.
- Troubleshoot common issues related to snapshot failures and leadership transitions.
Prerequisites
- Consul Enterprise 0.7.1 or later
- Running Consul cluster with multiple servers
- Sufficient disk space for snapshot storage
- Access to the Consul API for health checks and debugging
Use Case
The Consul snapshot agent provides crucial functionality for maintaining the health and recoverability of your Consul cluster. Its primary roles include:
- Regular Backups: It automatically creates backups of the Consul cluster's state, enabling efficient disaster recovery in case of unforeseen events (a brief restore example follows this list).
- Seamless Failover: In highly available Consul deployments, the agent facilitates smooth leadership transitions if the current leader node experiences issues.
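Because the agent writes standard Consul snapshots, the usual consul snapshot subcommands can be used to examine and restore them during disaster recovery; the file name below is only an illustrative example:

# Inspect a saved snapshot before restoring it (file name is illustrative)
consul snapshot inspect consul-1586445306987972055.snap

# Restore the snapshot into a running cluster
consul snapshot restore consul-1586445306987972055.snap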
Monitoring Agent Health and Status:
The agent registers itself as a Consul service with two key health checks:
- "Consul Snapshot Agent Alive": This check confirms that the agent is running and responsive.
- "Consul Snapshot Agent Saving Snapshots": This check, present only on the leader node, verifies that the agent is actively creating snapshots.
These health checks allow operators to easily monitor the status of snapshot agents across the cluster and to quickly identify:
- Live and Standby Agents: Agents with only the "Alive" check are running but on standby, ready to take over if needed.
- Leader Agent: The agent with both health checks is the current leader and actively saving snapshots.
This clear visibility into agent status simplifies monitoring and management of your Consul snapshot infrastructure.
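For example, assuming the default service registration name consul-snapshot (visible in the output later in this guide), the checks can be listed cluster-wide through the Consul HTTP API; the address and jq filter are illustrative:

curl -s http://127.0.0.1:8500/v1/health/checks/consul-snapshot | jq '.[] | {Node, Name, Status}'

In the output, the leader is the node whose "Consul Snapshot Agent Saving Snapshots" check is passing.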
Procedure
Step 1: Start the Snapshot Agent
Run the following command to start the agent:
consul snapshot agent -config-file=<path-to-config>
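For reference, a minimal configuration for local snapshot storage might look like the sketch below. The file path, storage directory, interval, and retention count are all illustrative examples, and the keys shown are a subset of the snapshot agent options documented for Consul Enterprise:

# All paths and values below are examples only
cat > /etc/consul-snapshot.json <<'EOF'
{
  "snapshot_agent": {
    "snapshot": {
      "interval": "1h",
      "retain": 30
    },
    "local_storage": {
      "path": "/opt/consul/snapshots"
    }
  }
}
EOF

consul snapshot agent -config-file=/etc/consul-snapshot.json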
Step 2: Verify Leadership Election
Each saved snapshot is reported in the agent's log output. Check the logs to confirm that a leader has been elected and snapshots are being saved:
2020/04/09 21:21:13 [INFO] Snapshot agent running
2020/04/09 21:21:13 [INFO] Waiting to obtain leadership...
2020/04/09 21:21:13 [INFO] Obtained leadership
2020/04/09 21:21:13 [INFO] Saved snapshot 1479360073448728784
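If the agent runs under systemd, you can follow these messages with journalctl; the unit name consul-snapshot is an assumption and should match however your environment names the service:

# Unit name is an assumption; adjust to your environment
journalctl -u consul-snapshot -f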
Step 3: Handle Snapshot Failures
When an agent fails to save a snapshot for any reason, it retries until it reaches the number of failures specified by -max-failures
(default: 3). After that, the snapshot agent gives up leadership. In a highly available deployment with multiple snapshot agents, this gives another agent a chance to take over when one agent is experiencing issues, such as running out of disk space for snapshots. Here is an example of a snapshot agent failing to write to disk and losing leadership:
2020/04/09 15:14:56 [DEBUG] Taking a snapshot...
2020/04/09 15:14:56 [ERR] Snapshot failed (will retry at next interval): failed to create snapshot: open /tmp/snaps/consul-1586445306987972055.tmp: no such file or directory
2020/04/09 15:15:01 [DEBUG] Taking a snapshot...
2020/04/09 15:15:01 [ERR] Snapshot failed (will retry at next interval): failed to create snapshot: open /tmp/snaps/consul-1586445306987972055.tmp: no such file or directory
2020/04/09 15:15:06 [DEBUG] Taking a snapshot...
2020/04/09 15:15:06 [ERR] Snapshot failed (will retry at next interval): failed to create snapshot: open /tmp/snaps/consul-1586445306987972055.tmp: no such file or directory
2020/04/09 15:15:06 [WARN] Too many snapshot failures (will give up leadership)
2020/04/09 15:15:16 [INFO] Waiting to obtain leadership...
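If an agent should tolerate more transient failures before giving up leadership, the threshold can be raised with the same flag; the value 5 below is only an example:

# Allow five consecutive snapshot failures before giving up leadership (value is an example)
consul snapshot agent -config-file=<path-to-config> -max-failures=5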
Step 4: Monitor Health Checks
Ensure the snapshot agent registers itself with Consul and is marked as healthy. Here is an example of the failed health check on the node that just lost leadership:
user@consul03 ~ $ curl http://127.0.0.1:8500/v1/agent/checks | jq
[output trimmed for brevity]
"consul-snapshot:c3645410-2aad-eabd-418c-81dc01b609b6:snapshot-ttl": {
  "Node": "consul03",
  "CheckID": "consul-snapshot:c3645410-2aad-eabd-418c-81dc01b609b6:snapshot-ttl",
  "Name": "Consul Snapshot Agent Saving Snapshots",
  "Status": "critical",
  "Notes": "This check is periodically updated as long as the leader is successfully taking snapshots.",
  "Output": "TTL expired",
  "ServiceID": "consul-snapshot:c3645410-2aad-eabd-418c-81dc01b609b6",
  "ServiceName": "consul-snapshot",
  "ServiceTags": [],
  "Definition": {},
  "CreateIndex": 0,
  "ModifyIndex": 0
},
[output trimmed for brevity]
After that, another snapshot leader is elected on another node.
Problem & Cause
Even if the new snapshot leader succeeds in creating a snapshot, the node that used to be the leader will still be marked as failed.
That means that you can have the following situation:
- An operational snapshot agent deployment that is otherwise working correctly
- A new snapshot leader that is saving snapshots successfully
- An old leader that is still marked as failed
This is intentional: it draws attention to why snapshotting failed in the first place, which might be a symptom of a larger issue rather than an isolated failure of a single agent. It is strongly advised to investigate why snapshotting failed on the old leader node.
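For the failure shown in Step 3, a reasonable first check is that the snapshot directory exists and has free space; the path /tmp/snaps comes from that example error and should be replaced with your configured local storage path:

# Path is taken from the example error above; substitute your own snapshot directory
ls -ld /tmp/snaps
df -h /tmp/snaps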
Solution
There are two approaches to clear the failed state of the health check.
- Restart the Consul snapshot agent service on the node where it failed (see the example after this list)
- Wait until the node becomes a leader again and successfully saves a snapshot
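If the snapshot agent runs as a systemd service, the first option might look like the sketch below; the unit name consul-snapshot is an assumption and should be replaced with whatever name your environment uses:

# Unit name is an assumption; adjust to your environment
sudo systemctl restart consul-snapshot

# Re-check the snapshot agent health checks on this node afterwards
curl -s http://127.0.0.1:8500/v1/agent/checks | jq '.[] | select(.ServiceName == "consul-snapshot") | {Name, Status}'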
Additional Information