Introduction
Problem
A non-voter server is unable to become a voter because the raft snapshot is very large and the server cannot catch up to the leader.
Prerequisites
- This article applies to Consul versions earlier than 1.10.0
Cause
If the issue is happening on every new follower agent that comes up, it may be that enough changes are occurring to generate a new snapshot before a new agent can come up and load the previous snapshot. In that scenario, the new agent ends up requiring raft logs that have already been truncated when the new snapshot was created. This situation can be mitigated by tweaking the `raft_trailing_logs` config setting:
This controls how many log entries are left in the log store on disk after a snapshot is made. This should only be adjusted when followers cannot catch up to the leader due to a very large snapshot size and high write throughput causing log truncation before a snapshot can be fully installed on a follower.
Overview of possible solutions
Possible next steps:
a) Wait the process out on the node and avoid sending more raft writes through it
b) Add the `raft_trailing_logs` field to the Consul server config with an appropriate value (~100k) and do a rolling restart of all servers (without erasing the disks in the process)
Basically, if it takes X seconds to send the snapshot to another server and get it applied, and the cluster is processing writes at a rate of Y per second, then `raft_trailing_logs` should be at least X * Y plus some buffer to absorb spikes. The default of 10,000 is fine in all but the most extreme cases (see https://www.consul.io/docs/agent/options#raft_trailing_logs).
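As a rough illustration of that formula (the numbers below are assumptions, not measurements from any particular cluster): if installing the snapshot on a follower takes about 90 seconds and the cluster processes about 1,000 writes per second, the setting needs to cover at least 90,000 entries, which is why a value around 100k leaves a reasonable buffer.

```shell
# Hypothetical sizing calculation for raft_trailing_logs.
# X = seconds to transfer and install the snapshot on a follower (assumed)
# Y = raft writes per second (assumed)
X=90
Y=1000
BUFFER_PCT=20   # extra headroom for write spikes

MIN_TRAILING_LOGS=$(( X * Y ))
RECOMMENDED=$(( MIN_TRAILING_LOGS + MIN_TRAILING_LOGS * BUFFER_PCT / 100 ))

echo "minimum raft_trailing_logs: ${MIN_TRAILING_LOGS}"   # 90000
echo "with ${BUFFER_PCT}% buffer:   ${RECOMMENDED}"        # 108000
```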
The `raft_trailing_logs` option needs to be set in the server configuration file, and the server agent must be restarted for the value to take effect, as sketched below.
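A minimal sketch of how the option could be added, assuming a JSON configuration directory at /etc/consul.d/ and a systemd-managed agent (the path, file name, value of 100000, and restart mechanism are illustrative assumptions; adjust them for the environment at hand):

```shell
# Illustrative only: add raft_trailing_logs to the Consul server config.
# /etc/consul.d/ and the value 100000 are assumptions, not requirements.
cat <<'EOF' | sudo tee /etc/consul.d/raft-trailing-logs.json
{
  "raft_trailing_logs": 100000
}
EOF

# Restart the server agent so the new value takes effect.
# Do this one server at a time as a rolling restart.
sudo systemctl restart consul
```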
The solution proposed below is intended to get the Consul cluster into a healthy state without any downtime.
Please note that the Consul data dir must reside on a Linux filesystem that supports file attributes via `chattr` (for example, ext4 or XFS). Additionally, the steps outlined below are only performed on the leader; all the other nodes should continue running as before.
Step 1
- Run `chattr -R +i <data dir>/raft/snapshots`.
This prevents raft from reaping old snapshots, which in turn prevents log compaction (compaction only happens once snapshot creation, including reaping of old snapshots, completes successfully).
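For example, assuming the data directory is /opt/consul (a placeholder for the actual `data_dir` on the leader), the command and a quick verification would look like this:

```shell
# Mark all existing snapshot files as immutable so raft cannot reap them.
# /opt/consul is a placeholder for the leader's actual data_dir.
sudo chattr -R +i /opt/consul/raft/snapshots

# Verify that the immutable ('i') attribute is now set on the snapshot files.
lsattr -R /opt/consul/raft/snapshots
```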
Step 2
- Allow all the servers to become healthy. The lagging followers will still receive a snapshot from the leader as before; however, since log compaction is halted, all the remaining logs will stay available for normal log replication.
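One way to keep an eye on progress is to compare raft indexes across the servers; the commands below are standard Consul CLI commands, but the exact output varies by version:

```shell
# Check that all servers are listed and that voters/non-voters appear as expected.
consul operator raft list-peers

# On a lagging follower, watch the raft indexes converge:
# applied_index should approach the leader's commit_index.
consul info | grep -E 'commit_index|applied_index|last_log_index|last_snapshot_index'
```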
Step 3
- Once the nodes are caught up, the change made in Step 1 must be reversed using `chattr -R -i <data dir>/raft/snapshots`.
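Again using /opt/consul as a placeholder for the data directory, the reversal and a quick check would look like:

```shell
# Remove the immutable attribute so raft can resume snapshot reaping
# and log compaction. /opt/consul is a placeholder for the actual data_dir.
sudo chattr -R -i /opt/consul/raft/snapshots

# Confirm the 'i' flag is gone from the snapshot files.
lsattr -R /opt/consul/raft/snapshots
```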
In addition to the `chattr` file-attribute requirement mentioned above, there should be sufficient disk space to store two extra raft snapshots plus the increased size of the raft.db. Given that an ~800 MB raft.db corresponds to roughly 1.5 minutes worth of data, the recovery process is expected to take at most 10 minutes, which would leave the raft.db at roughly 5 GB in size. Therefore, a liberal estimate of the free disk space required would be about 41 GB (18 GB per extra snapshot, times two, plus 5 GB).
The raft.db will never decrease in size after this and will remain at roughly 5 GB. Once the `chattr` reversal in Step 3 is complete, Consul will go back to storing one raft snapshot, and it will temporarily require space for one more.
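A quick way to check the current raft footprint and the free space on the volume holding the data directory (again, /opt/consul is a placeholder path and the 41 GB figure comes from the estimate above):

```shell
# Current size of the raft state (snapshots plus raft.db) on the leader.
sudo du -sh /opt/consul/raft /opt/consul/raft/snapshots

# Free space on the filesystem holding the data directory;
# aim for roughly 41 GB free per the estimate above.
df -h /opt/consul
```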