Introduction
Consul uses a gossip protocol to manage membership and broadcast messages to the cluster. Specifically, Consul uses a LAN gossip pool and a WAN gossip pool to perform different functions, both of which are provided by an embedded Serf library.
Gossip is done over UDP with a configurable but fixed fanout and interval. This ensures that network usage per node is constant with regard to the number of nodes. Complete state exchanges with a random node are done periodically over TCP, but much less often than gossip messages. This increases the likelihood that the membership list converges properly, since the full state is exchanged and merged. The interval between full state exchanges is configurable, and full state exchanges can be disabled entirely.
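To make the constant-usage claim concrete, the following back-of-the-envelope sketch computes how many gossip messages a single node sends per second. The function names and the sample values (a fanout of 3 and a 200 ms interval) are illustrative assumptions, not settings taken from this article.

```python
def per_node_gossip_rate(fanout: int, interval_ms: int) -> float:
    """Outbound gossip messages per second sent by one node."""
    return fanout * 1000 / interval_ms


def cluster_gossip_rate(nodes: int, fanout: int, interval_ms: int) -> float:
    """Total gossip messages per second summed over all nodes."""
    return nodes * per_node_gossip_rate(fanout, interval_ms)


if __name__ == "__main__":
    # Assumed example values: fanout of 3, one gossip round every 200 ms.
    for nodes in (10, 100, 1000):
        print(nodes, per_node_gossip_rate(3, 200), cluster_gossip_rate(nodes, 3, 200))
```

The per-node rate depends only on the fanout and the interval, so it stays the same as the cluster grows. Only the cluster-wide total scales with the number of nodes.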
Problem
The gossip protocol can sometimes fail, leading to nodes being incorrectly marked as dead. This can cause problems with service discovery and configuration.
Prerequisite
Before you can troubleshoot the gossip protocol, you need a basic understanding of how it works. From the point of view of the gossip protocol, each node is in one of three states: alive, suspected of failure, or dead.
Cause
Failure detection relies on periodic, randomized probes. If a node doesn't acknowledge a probe within a configurable timeframe (based on round-trip time), indirect probing is initiated. This involves a configurable number of randomly selected nodes also probing the target node. Only if both the initial probe and all indirect probes fail within a set time is the node marked "suspicious" and this status gossiped to the cluster.
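The probe sequence described above can be sketched as follows. This is a simplified illustration, not the Serf or memberlist implementation: the function name, the callback parameters, and the default of three indirect checks are assumptions, and the indirect probes run one after another here for readability.

```python
import random
from typing import Callable, Optional, Sequence


def probe_target(
    target: str,
    peers: Sequence[str],
    direct_probe: Callable[[str], bool],
    indirect_probe: Callable[[str, str], bool],
    indirect_checks: int = 3,  # assumed value, for illustration only
    rng: Optional[random.Random] = None,
) -> str:
    """Return "alive" if any probe is acknowledged, otherwise "suspect"."""
    # Step 1: direct probe. An acknowledgement ends the check.
    if direct_probe(target):
        return "alive"

    # Step 2: ask a random subset of other nodes to probe the target.
    rng = rng or random.Random()
    candidates = [p for p in peers if p != target]
    helpers = rng.sample(candidates, min(indirect_checks, len(candidates)))
    for helper in helpers:
        if indirect_probe(helper, target):
            return "alive"

    # Step 3: the direct probe and every indirect probe failed.
    return "suspect"
```

A target is reported as suspect only when nothing reaches it, which is why a single lossy network path between two nodes does not by itself cause a suspicion.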
There are a number of possible causes for gossip protocol failures. These include:
- Network connectivity issues
- Firewall rules that are blocking traffic between Consul nodes
- Misconfigured gossip parameters
Log Example
- Log from consul1:
[INFO] memberlist: Suspect consul2 has failed, no acks received
- This means that consul2 did not respond to any direct or indirect ping messages, and is being suspected of failure.
- Log from consul2:
[WARN] memberlist: Refuting a suspect message (from: consul1)
- This means consul2 is refuting a suspect message originated by consul1.
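When many nodes are involved, it can help to count these messages per node. The sketch below matches only the two log formats shown above. The function name and the regular expressions are assumptions for illustration, and other Consul versions may word these messages differently.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Patterns derived from the two example log lines in this article.
SUSPECT_RE = re.compile(r"memberlist: Suspect (\S+) has failed")
REFUTE_RE = re.compile(r"memberlist: Refuting a suspect message \(from: ([^)]+)\)")


def summarize_gossip_events(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count suspected nodes and the origins of refuted suspect messages."""
    suspected: Counter = Counter()      # node name -> times it was suspected
    refuted_from: Counter = Counter()   # accusing node -> times it was refuted
    for line in lines:
        match = SUSPECT_RE.search(line)
        if match:
            suspected[match.group(1)] += 1
            continue
        match = REFUTE_RE.search(line)
        if match:
            refuted_from[match.group(1)] += 1
    return suspected, refuted_from
```

A node that is suspected often but keeps refuting is usually alive and suffering from intermittent packet loss rather than a real outage.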
Solution
A node designated as "suspicious" is not immediately removed from the cluster. It is given a configurable grace period to refute this status. If it fails to do so, the node is then considered "dead," and this change is communicated to the cluster via gossip. In many cases, this situation arises from UDP routing issues that prevent other nodes from reaching consul2 (or consul2 from responding) in the example above.
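The lifecycle of a suspected node can be modeled as a small state machine. This is a minimal sketch under stated assumptions: the Member class, the function names, and the explicit grace_seconds parameter are hypothetical and only mirror the alive, suspect, and dead states described in this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Member:
    name: str
    state: str = "alive"                 # "alive", "suspect", or "dead"
    suspected_at: Optional[float] = None  # time the suspicion started


def mark_suspect(member: Member, now: float) -> None:
    """Start the grace period for a node whose probes all failed."""
    if member.state == "alive":
        member.state = "suspect"
        member.suspected_at = now


def refute(member: Member) -> None:
    """The node answered in time, so it is considered alive again."""
    if member.state == "suspect":
        member.state = "alive"
        member.suspected_at = None


def expire_suspicion(member: Member, now: float, grace_seconds: float) -> None:
    """Declare the node dead if it did not refute within the grace period."""
    if (
        member.state == "suspect"
        and member.suspected_at is not None
        and now - member.suspected_at >= grace_seconds
    ):
        member.state = "dead"
```

In the log example above, consul2 follows the refute path: it learns of the suspicion and answers before the grace period ends, so it stays in the cluster.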
To troubleshoot gossip protocol failures, you can:
- Check the logs of the Consul nodes for error messages.
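- Verify that the required ports are open between all Consul nodes (by default, 8301 for LAN gossip and 8302 for WAN gossip, over both TCP and UDP).
- Check for firewall interference.
- Use the Consul CLI (for example, consul members) to check the membership status that each node reports.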
- Tune the LAN and WAN gossip parameters. Exercise caution, as improper tuning can negatively impact Consul's stability.
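As a starting point for the port check, the sketch below tests whether a TCP connection to a gossip port succeeds. The helper name and the host names in the usage comment are hypothetical. Note that this covers TCP only: gossip itself uses UDP, which cannot be verified with a simple connect, so a successful result does not rule out a UDP-level block.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


# Example usage (host names are placeholders; 8301 is the default LAN gossip port):
#
#   for host in ("consul1.example.internal", "consul2.example.internal"):
#       print(host, tcp_port_open(host, 8301))
```

Run the check from each Consul node toward every other node, because firewall rules are often asymmetric.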
Additional Information