Problem
Operating a large Consul datacenter with a high churn rate (e.g., frequently adding and removing thousands of client nodes) can lead to a scenario where new clients are unable to join the cluster.
Prerequisites
- Consul v1.7.0 through v1.10.7
- A large datacenter that consists of thousands of client nodes
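As a quick way to check whether a given agent falls inside the affected range, the version bounds above can be expressed as a small comparison. This is a minimal sketch: the helper names `parse_version` and `is_affected` are hypothetical, and it assumes the version string contains a plain `major.minor.patch` triple (any suffix such as `+ent` is ignored).

```python
import re

# Bounds taken from the prerequisites above (inclusive on both ends).
AFFECTED_MIN = (1, 7, 0)
AFFECTED_MAX = (1, 10, 7)


def parse_version(text):
    """Extract (major, minor, patch) from strings such as 'v1.10.7' or '1.9.5+ent'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no semantic version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(version_text):
    """Return True if the version lies in the affected range v1.7.0 through v1.10.7."""
    return AFFECTED_MIN <= parse_version(version_text) <= AFFECTED_MAX
```

Tuples compare element by element, so `(1, 10, 7)` sorts after `(1, 9, 5)` as expected, which a plain string comparison would get wrong.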
Cause
The issue stems from a change in Consul's underlying Memberlist library (version 1.7.0), which manages cluster membership and failure detection. Previously, Consul only tracked stateDead nodes, which were automatically removed after 72 hours. This update introduced a new stateLeft state for nodes gracefully leaving the cluster. However, these nodes are never removed, leading to continuous growth in the member list. This becomes problematic in dynamic environments with high churn, as the list eventually exceeds Memberlist's maxPushStateBytes limit. This, in turn, breaks the push/pull state synchronization mechanism, causing cluster instability.
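The growth described above can be illustrated with a back-of-the-envelope model. This is only a sketch under stated assumptions: the 20 MiB value used for the limit and the 200-byte average per node are assumptions, not figures from this article, and both should be checked against the Memberlist source and real payloads for the version in use. The function names are hypothetical.

```python
# Assumed value of Memberlist's maxPushStateBytes; verify against your version.
MAX_PUSH_STATE_BYTES = 20 * 1024 * 1024


def estimated_state_bytes(alive_nodes, left_nodes, bytes_per_node=200):
    """Rough push/pull payload size; bytes_per_node is a hypothetical average."""
    return (alive_nodes + left_nodes) * bytes_per_node


def days_until_limit(alive_nodes, left_per_day, bytes_per_node=200,
                     limit=MAX_PUSH_STATE_BYTES):
    """Days of churn until the estimated payload exceeds the limit.

    Assumes every departing node stays in the list forever (the behaviour
    described above) and that the alive population stays constant.
    Returns None if no nodes leave, since the list then never grows.
    """
    if left_per_day <= 0:
        return None
    max_nodes = limit // bytes_per_node   # largest list that still fits
    headroom = max_nodes - alive_nodes    # number of left entries that still fit
    if headroom < 0:
        return 0                          # already over the limit
    return headroom // left_per_day + 1   # first day the limit is exceeded
```

The point of the model is that the estimate grows linearly with the number of departed nodes, so any non-zero churn rate eventually crosses the limit regardless of how many nodes are alive at a given moment.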
If you encounter this bug, you'll likely notice the following symptoms:
- Your Consul servers will experience a significant increase in CPU usage, accompanied by a flood of the following error message in the logs. This signals that the servers can no longer manage the growing member list or synchronize state effectively.
agent.server.memberlist.lan: memberlist: Too many pending push/pull requests
- New client nodes will be blocked from joining the cluster. They'll encounter errors like the two below during the join process, indicating that the member list has grown beyond its capacity.
[ERR] memberlist: failed to receive: Remote node state is larger than limit ({LARGENUMBER})
memberlist: Push/Pull with {HOSTNAME} failed: Remote node state is larger than limit ({LARGENUMBER})
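To confirm these symptoms in collected logs, the messages quoted above can be counted with a short scan. This is a minimal sketch: the function and category names are hypothetical, and the patterns match only the fixed message text shown in this article, not the surrounding timestamp or log-level prefix, which may differ between Consul versions.

```python
import re
from collections import Counter

# Fixed message fragments taken from the log lines quoted in this article.
PATTERNS = {
    "pending_push_pull": re.compile(
        r"memberlist: Too many pending push/pull requests"),
    "state_too_large": re.compile(
        r"Remote node state is larger than limit"),
}


def scan_log_lines(lines):
    """Count how many lines match each known symptom message."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Passing an open file object works because files iterate line by line, for example `scan_log_lines(open(path))` for a log file at a path of your choosing. A non-zero `state_too_large` count on clients together with a high `pending_push_pull` count on servers matches the pattern described here.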
Solution
Temporary Workaround
To temporarily alleviate the issue while you plan for a full upgrade, a rolling restart of your Consul servers can help. This reduces the size of the push/pull member list, providing some breathing room for the cluster. However, this is a short-term solution, and the issue will eventually resurface.
Permanent Solution
We've addressed this bug in Consul v1.10.8. To fully resolve it, we strongly recommend upgrading all your server nodes to this version first, followed by the client nodes. While upgrading a single server may offer temporary relief, upgrading all servers ensures complete resolution and long-term stability.
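The servers-first ordering can be planned from the text output of `consul members`. The sketch below assumes that output is a whitespace-separated table whose header includes `Node`, `Status`, `Type`, and `Build` columns. The helper names are hypothetical, and the script only produces an ordered list of node names; it does not perform any upgrade.

```python
import re

FIXED_VERSION = (1, 10, 8)  # first release containing the fix, per this article


def parse_members(output):
    """Turn `consul members` text output into a list of dicts keyed by column name."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def build_tuple(build):
    """Extract (major, minor, patch) from a Build value such as '1.10.7' or '1.10.7+ent'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", build)
    if match is None:
        raise ValueError(f"unrecognised build string: {build!r}")
    return tuple(int(part) for part in match.groups())


def upgrade_order(rows, fixed=FIXED_VERSION):
    """List alive nodes still below the fixed version, servers before clients."""
    pending = [
        row for row in rows
        if row.get("Status") == "alive" and build_tuple(row["Build"]) < fixed
    ]
    # False sorts before True, so rows with Type == "server" come first.
    pending.sort(key=lambda row: (row.get("Type") != "server", row["Node"]))
    return [row["Node"] for row in pending]
```

The output text could be captured with `subprocess.run(["consul", "members"], capture_output=True, text=True).stdout` and passed to `parse_members`. Nodes that have already left or failed are skipped, since only running agents can be upgraded.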