Introduction
Problem
Consul snapshots taken by the snapshot agent fail intermittently when snapshot agents are running on all Consul server nodes, both leader and non-leader.
Cause
- Snapshot agents were configured to run on both leader and non-leader nodes. The snapshot agent fails intermittently with the error message: `[ERROR] snapshot: Snapshot failed (will retry at next interval): error="snapshot verification failed: failed to read snapshot file: failed to read or write snapshot data: unexpected EOF"`
This error usually occurs when the snapshot agent runs on a non-leader server. From our documentation: "By default, all snapshots are taken using consistent mode where requests are forwarded to the leader which verifies that it is still in power before taking the snapshot. Snapshots will not be saved if the datacenter is degraded or if no leader is available. To reduce the burden on the leader, it is possible to run the snapshot on any non-leader server using stale consistency mode.
This spreads the load across nodes at the possible expense of losing full consistency guarantees. Typically this means that a very small number of recent writes may not be included. The omitted writes are typically limited to data written in the last 100ms or less from the recovery point. This is usually suitable for disaster recovery. However, the system can't guarantee how stale this may be if executed against a partitioned server."
Solutions:
- Use `stale:true` in the snapshot agent configuration. Example scenario: suppose we have a 3-node Consul cluster and want to run the snapshot agent. We can choose any of the three server nodes, because the snapshot agent can be run from any host in the datacenter. We can deploy it on a machine hosting the Consul server agent, or on a machine that has a client agent.
Because the snapshot agent uses consistent mode by default, the leader is ultimately responsible for generating the snapshot and sending it back to the agent to which the client is connected. Setting `stale:true` allows a non-leader to generate and return the snapshot.
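As a minimal sketch of this setting, the fragment below enables stale mode in a snapshot agent JSON configuration file. It assumes the `snapshot_agent` / `snapshot` layout shown in the snapshot agent documentation. The address and interval are illustrative values, not taken from the affected environment, so check the key names against the documentation for your Consul version before using it.

```json
{
  "snapshot_agent": {
    "http_addr": "127.0.0.1:8500",
    "snapshot": {
      "interval": "1h",
      "stale": true
    }
  }
}
```

With `stale` set to true, the server that the agent connects to can generate the snapshot itself instead of forwarding the request to the leader.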
Other:
Is there a way to distribute the Consul snapshot agent across Consul server nodes without requiring `stale:true`?
It is possible to run multiple instances of the snapshot agent. The agents use a session and lock (https://www.consul.io/commands/snapshot/agent#lock-key) to elect a leader among themselves, and the elected agent takes the snapshot from its configured Consul server address. This address can point to localhost so that the agent communicates with the local server.
The snapshot itself is generated by the leader, which may or may not be the server the snapshot agent is communicating with, unless `stale` is set to true.
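A hedged sketch of this multi-instance setup: every server node runs a snapshot agent with the same configuration, pointing at its local server and sharing one lock key. The layout again assumes the documented `snapshot_agent` configuration format. The `lock_key` value shown is believed to be the documented default, and the address and interval are illustrative assumptions.

```json
{
  "snapshot_agent": {
    "http_addr": "127.0.0.1:8500",
    "snapshot": {
      "interval": "1h",
      "lock_key": "consul-snapshot/lock",
      "stale": false
    }
  }
}
```

Because all instances contend for the same lock key, only the agent holding the lock takes snapshots at each interval. With `stale` left at false, the request is still forwarded to the Consul leader.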