Problem
If you are running a large Consul datacenter and regularly churn thousands of client nodes at once, you may run into an issue where new client nodes are unable to join the datacenter.
Prerequisites
- Consul v1.7.0+
- A large datacenter consisting of thousands of client nodes
Cause
- The bug is a side effect of a change to the memberlist library (which Consul uses to manage cluster membership and detect member failures via a gossip-based protocol) that shipped with Consul v1.7.0. That release introduced a new memberlist state, stateLeft: when memberlist processes a dead message, it checks which node reported it, and if the reporting node is the same as the reported node, Consul knows that node intends to leave gracefully. Unlike nodes in the stateDead state, which are eventually reaped after 72 hours, nodes in the stateLeft state were never reaped. They therefore accumulate permanently in the node list and remain part of the push/pull state (the mechanism memberlist uses to periodically sync full state so that all nodes stay in agreement with one another). In a large datacenter with heavy churn, this state eventually exceeds the memberlist maxPushStateBytes limit and the push/pull mechanism stops working completely. A simplified sketch of this reaping behavior follows the symptom list below.
- Those hitting this bug will see the following symptoms:
- A sharp increase in CPU usage on the server nodes, along with the following error message in the server logs:
agent.server.memberlist.lan: memberlist: Too many pending push/pull requests
- The following error messages on new client nodes attempting to join the datacenter:
[ERR] memberlist: failed to receive: Remote node state is larger than limit ({LARGENUMBER})
memberlist: Push/Pull with {HOSTNAME} failed: Remote node state is larger than limit ({LARGENUMBER})
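To make the failure mode concrete, below is a minimal, self-contained sketch of the reaping behavior described above. It is illustrative only and is not the actual memberlist source: the node struct, state names, and reap function are simplified stand-ins, and the 72-hour interval mirrors the reap window mentioned above.

```go
package main

import (
	"fmt"
	"time"
)

type nodeState int

const (
	stateAlive nodeState = iota
	stateDead
	stateLeft // state added in the memberlist version shipped with Consul v1.7.0
)

type node struct {
	name      string
	state     nodeState
	stateTime time.Time // when the node entered its current state
}

// reap removes dead nodes whose state is older than the reap interval.
// Before the fix, nodes that left gracefully (stateLeft) were skipped
// entirely, so they accumulated in the push/pull state indefinitely.
func reap(nodes []node, reapInterval time.Duration, now time.Time) []node {
	kept := make([]node, 0, len(nodes))
	for _, n := range nodes {
		expired := now.Sub(n.stateTime) > reapInterval
		if n.state == stateDead && expired {
			continue // dead node reaped
		}
		// Bug: expired stateLeft nodes fall through and are always kept.
		kept = append(kept, n)
	}
	return kept
}

func main() {
	now := time.Now()
	old := now.Add(-100 * time.Hour) // older than the 72h reap interval
	nodes := []node{
		{"client-1", stateAlive, now},
		{"client-2", stateDead, old},
		{"client-3", stateLeft, old},
	}
	for _, n := range reap(nodes, 72*time.Hour, now) {
		// client-1 and client-3 remain; client-3 is never reaped.
		fmt.Println(n.name)
	}
}
```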
Solution
- To mitigate the impact, perform a rolling restart of the servers to reduce the push/pull memberlist size. A restarted server rebuilds its member list from a snapshot, which is smaller, and uses that smaller state for push/pull syncs with the other nodes, restoring cluster health for a while. This does not solve the underlying issue, but it buys time to roll out the fix. To gauge how many gracefully departed nodes are still being held in the member list, see the sketch after this list.
- We have fixed this behavior with the release of Consul v1.10.8. While upgrading a single server node to this version has been seen to alleviate some of the symptoms, we recommend upgrading all server nodes to v1.10.8 first, followed by the client nodes.
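One way to gauge how many gracefully departed nodes are still being held in the member list is to count the members the agent reports in the left state. The sketch below is a minimal illustration using the official Go API client (github.com/hashicorp/consul/api) against a local agent; the numeric status value for "left" (3) follows Serf's member status codes and is an assumption of this example, not something stated in this article.

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

// Assumed mapping from Serf member status codes; 3 corresponds to "left".
const statusLeft = 3

func main() {
	// Connect to the local agent using the default configuration
	// (address and token taken from the standard CONSUL_* environment variables).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// false = LAN members, the pool affected by the push/pull state growth.
	members, err := client.Agent().Members(false)
	if err != nil {
		log.Fatal(err)
	}

	left := 0
	for _, m := range members {
		if m.Status == statusLeft {
			left++
		}
	}
	fmt.Printf("%d of %d members are in the left state\n", left, len(members))
}
```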