Introduction
The metrics described in this article provide visibility into the snapshot process and Raft replication.
The snapshot process is used to back up and restore a Consul cluster. Visibility into this process makes it easier to confirm that backups and restores behave as expected and to manage the cluster day to day.
Problem
Several situations may prompt you to review your snapshot metrics. The questions below are typical, and each one needs additional data to answer:
- Is the snapshot process running as frequently as it was configured?
- How long does each snapshot take to complete?
- Have any snapshots failed in the recent past?
- Has an externally provided snapshot been restored successfully?
Solutions
Most snapshot-related metrics are reported as durations in milliseconds. They confirm that an action completed and show how long it took.
A prolonged duration for any of these events can indicate network or storage issues and warrants further investigation.
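As a rough illustration of spotting a prolonged duration, the sketch below compares observed timings against per-metric limits. The threshold values and the `find_slow_events` helper are hypothetical and are not part of Consul. Derive real limits from your own cluster's baseline.

```python
from typing import Dict, List

# Hypothetical thresholds in milliseconds; tune them to your own baseline.
THRESHOLDS_MS: Dict[str, float] = {
    "consul.raft.snapshot.persist": 5000.0,
    "consul.raft.fsm.restore": 10000.0,
}


def find_slow_events(durations_ms: Dict[str, float],
                     thresholds_ms: Dict[str, float] = THRESHOLDS_MS) -> List[str]:
    """Return the names of metrics whose duration exceeds their threshold."""
    slow = []
    for name, duration in durations_ms.items():
        limit = thresholds_ms.get(name)
        # Metrics without a configured threshold are skipped, not flagged.
        if limit is not None and duration > limit:
            slow.append(name)
    return sorted(slow)


if __name__ == "__main__":
    observed = {
        "consul.raft.snapshot.persist": 7200.0,  # slower than the 5000 ms limit
        "consul.raft.fsm.restore": 850.0,        # well within its limit
    }
    print(find_slow_events(observed))  # prints ['consul.raft.snapshot.persist']
```

A flagged metric is only a prompt to look at disk and network health on that server. It is not a diagnosis on its own.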
Key Metrics
Metric | Usage |
---|---|
consul.raft.fsm.lastRestoreDuration | Shows how long the most recent FSM restore took, whether from a local snapshot at agent startup or from a snapshot installed by the leader. In most cases, the last restore happened when the server was last started. |
consul.raft.snapshot.create | Measures the time taken to initialize the snapshot process. |
consul.raft.snapshot.persist | Measures the time taken to write the current snapshot taken by the Consul agent to disk. |
consul.raft.snapshot.takeSnapshot | Measures the total time involved in taking the current snapshot (creating one and persisting it) by the Consul agent. This is usually the sum of the create and persist metrics. |
consul.raft.fsm.snapshot | Measures the time taken by the FSM to record the current state for the snapshot. |
consul.raft.fsm.restore | Measures the time taken by the FSM to restore its state from a snapshot. |
The metrics above, when present in the response from /v1/agent/metrics, provide timings for each step of the snapshot process. These timings are useful indicators when investigating potential network or storage issues.
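One way to pull these metrics out of the /v1/agent/metrics response is sketched below. It assumes the agent's HTTP API is reachable at the default local address `http://127.0.0.1:8500`, that no ACL token is required, and that the JSON response groups metrics into `Gauges` and `Samples` arrays whose entries carry a `Name` field. The helper names are illustrative only.

```python
import json
import urllib.request
from typing import Any, Dict

# Metric names taken from the Key Metrics table in this article.
SNAPSHOT_METRICS = {
    "consul.raft.fsm.lastRestoreDuration",
    "consul.raft.snapshot.create",
    "consul.raft.snapshot.persist",
    "consul.raft.snapshot.takeSnapshot",
    "consul.raft.fsm.snapshot",
    "consul.raft.fsm.restore",
}


def extract_snapshot_metrics(payload: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Keep only the snapshot-related entries from a metrics payload."""
    found: Dict[str, Dict[str, Any]] = {}
    # Assumption: gauges and timer samples live under these two keys.
    for section in ("Gauges", "Samples"):
        for entry in payload.get(section) or []:
            name = entry.get("Name")
            if name in SNAPSHOT_METRICS:
                found[name] = entry
    return found


def fetch_metrics(base_url: str = "http://127.0.0.1:8500") -> Dict[str, Any]:
    """Fetch the raw metrics document from a Consul agent (assumed address)."""
    with urllib.request.urlopen(f"{base_url}/v1/agent/metrics", timeout=5) as resp:
        return json.load(resp)
```

Calling `extract_snapshot_metrics(fetch_metrics())` on a server returns only the entries that the agent currently reports. A metric missing from the result may simply mean that the corresponding event has not occurred recently on that agent.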
Additional Information
- The Snapshot Inspect command is used to inspect an atomic, point-in-time snapshot of the state of the Consul servers which includes key/value entries, service catalog, prepared queries, sessions, and ACLs. The snapshot is read from the given file.
- All of Consul's metrics, along with their explanations, can be found on the Agent Telemetry documentation page.