Problem
If you are running a large Consul datacenter and regularly churn thousands of client nodes at once, you may run into an issue where new client nodes are unable to join the datacenter.
Prerequisites
- Consul v1.7.0+
- A large datacenter consisting of thousands of client nodes
Cause
- The bug is a side effect of a change to the memberlist library (which Consul uses to manage cluster membership and detect member failures via a gossip-based protocol) that shipped with Consul v1.7.0. That release introduced a new memberlist state, stateLeft: when memberlist processes a dead message, it checks which node reported it, and if the reporting node is the same as the reported node, Consul knows that node intends to leave gracefully. Unlike nodes in the stateDead state, which are eventually reaped after 72 hours, nodes in the stateLeft state were never reaped. They therefore accumulate permanently in the node list and remain part of the push/pull state (the mechanism memberlist uses to periodically sync full state so that all nodes stay in agreement with one another). In a large datacenter with heavy churn, this state eventually exceeds the memberlist maxPushStateBytes limit and the push/pull mechanism stops working completely. A simplified sketch of this reaping behavior follows the symptom list below.
- Those hitting this bug will see the following symptoms:
- A sharp increase in CPU usage on the server nodes, along with the following error message in the server logs:
agent.server.memberlist.lan: memberlist: Too many pending push/pull requests
- The following error messages on new client nodes attempting to join the datacenter:
[ERR] memberlist: failed to receive: Remote node state is larger than limit ({LARGENUMBER})
memberlist: Push/Pull with {HOSTNAME} failed: Remote node state is larger than limit ({LARGENUMBER})
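To make the failure mode concrete, below is a minimal, self-contained sketch of the reaping behavior described above. It is illustrative only and is not the actual memberlist source: the node struct, state names, and reap function are simplified stand-ins, and the 72-hour interval mirrors the reap window mentioned above.

```go
package main

import (
	"fmt"
	"time"
)

type nodeState int

const (
	stateAlive nodeState = iota
	stateDead
	stateLeft // state added in the memberlist version shipped with Consul v1.7.0
)

type node struct {
	name      string
	state     nodeState
	stateTime time.Time // when the node entered its current state
}

// reap removes dead nodes whose state is older than the reap interval.
// Before the fix, nodes that left gracefully (stateLeft) were skipped
// entirely, so they accumulated in the push/pull state indefinitely.
func reap(nodes []node, reapInterval time.Duration, now time.Time) []node {
	kept := make([]node, 0, len(nodes))
	for _, n := range nodes {
		expired := now.Sub(n.stateTime) > reapInterval
		if n.state == stateDead && expired {
			continue // dead node reaped
		}
		// Bug: expired stateLeft nodes fall through and are always kept.
		kept = append(kept, n)
	}
	return kept
}

func main() {
	now := time.Now()
	old := now.Add(-100 * time.Hour) // older than the 72h reap interval
	nodes := []node{
		{"client-1", stateAlive, now},
		{"client-2", stateDead, old},
		{"client-3", stateLeft, old},
	}
	for _, n := range reap(nodes, 72*time.Hour, now) {
		// client-1 and client-3 remain; client-3 is never reaped.
		fmt.Println(n.name)
	}
}
```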
Solution
- To mitigate the impact, perform a rolling restart of the servers to reduce the push/pull memberlist size. A restarted server rebuilds its member list from a snapshot, which is smaller, and uses that smaller state for push/pull syncs with the other nodes, restoring cluster health for a while. This does not solve the underlying issue, but it buys time to roll out the fix. To gauge how many gracefully departed nodes are still being held in the member list, see the sketch after this list.
- We have fixed this behavior with the release of Consul v1.10.8. While upgrading a single server node to this version has been seen to alleviate some of the symptoms, we recommend upgrading all server nodes to v1.10.8 first, followed by the client nodes.
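One way to gauge how many gracefully departed nodes are still being held in the member list is to count the members the agent reports in the left state. The sketch below is a minimal illustration using the official Go API client (github.com/hashicorp/consul/api) against a local agent; the numeric status value for "left" (3) follows Serf's member status codes and is an assumption of this example, not something stated in this article.

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

// Assumed mapping from Serf member status codes; 3 corresponds to "left".
const statusLeft = 3

func main() {
	// Connect to the local agent using the default configuration
	// (address and token taken from the standard CONSUL_* environment variables).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// false = LAN members, the pool affected by the push/pull state growth.
	members, err := client.Agent().Members(false)
	if err != nil {
		log.Fatal(err)
	}

	left := 0
	for _, m := range members {
		if m.Status == statusLeft {
			left++
		}
	}
	fmt.Printf("%d of %d members are in the left state\n", left, len(members))
}
```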