Introduction
The information contained in this article has been verified as up-to-date on the date of the original publication of the article. HashiCorp endeavors to keep this information up-to-date and correct, but it makes no representations or warranties of any kind, express or implied, about the ongoing completeness, accuracy, reliability, or suitability of the information provided. All information contained in this article is for general information purposes only. Any reliance you place on such information as it applies to your use of your HashiCorp product is therefore strictly at your own risk.
This guide demonstrates how to resolve a serfLAN merge issue between multiple clusters (let's say dc1 and dc2), which is generally caused by cloud auto-join in retry-join. Instances that join the same serfLAN pool because they share the same tag_key and tag_value cause quorum and leader-election issues in the clusters.
Expected Outcome
Instances on which different Consul clusters are running should not merge, even if these instances share the same tag_key and tag_value, so that serfLAN merge and quorum issues are avoided. We can achieve this outcome by setting a different gossip key, TLS, or ACL configuration on each cluster so that they cannot merge.
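As a reference, the following is a minimal sketch of how distinct gossip encryption keys can be generated per cluster; the example comments are illustrative and the configuration file path used later in this article is assumed:
# Run once for dc1 and once for dc2; each invocation returns a new base64 key.
consul keygen
# Place the dc1 key in the "encrypt" attribute of the dc1 server configuration,
# and the dc2 key in the dc2 server configuration. With different keys the two
# clusters can no longer gossip with each other, so they cannot merge.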
Prerequisites
To reproduce the serfLAN merge issue, we would simply create the following setup:
- Create two different clusters (dc1 and dc2) with 3 AWS EC2 instances in each, running consul server on them.
- In the configuration files on these instances, keep the gossip key, i.e. the encrypt attribute, set to the same value across both clusters.
- Also, ensure that the tag_key and tag_value are the same across all of these server EC2 instances in both clusters.
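For reference, the instances can be given the shared tag from the AWS CLI as sketched below; the instance IDs are placeholders, and the tag key/value pair matches the retry_join configuration used later in this article:
# Apply the same tag key/value pair to the EC2 instances of both clusters
# (i-0123456789abcdef0 etc. are placeholder instance IDs).
aws ec2 create-tags \
  --resources i-0123456789abcdef0 i-0123456789abcdef1 i-0123456789abcdef2 \
  --tags Key=Server,Value=true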
Procedure/Setup
- For the dc1 and dc2 cluster instances, use the following configuration file, make the required changes in the file, and start the Consul agent. Ensure that the encrypt key is kept the same across both clusters.
datacenter = "<datacenter_name>"
data_dir = "/opt/consul"
server = true
bootstrap_expect = 3
client_addr = "<IP_ADDR>"
retry_join = ["provider=aws tag_key=Server tag_value=true"]
ui_config {
  enabled = true
}
bind_addr = "<IP_ADDR>"
license_path = "/etc/consul.d/license.hclic"
encrypt = "EkXpqLKTKhxEIO2yyNddDl8VYMxQM7EMu4fS3EzP6cA="
node_name = "<node_name>"
ports = {
  http = 8500
}
log_level = "TRACE"
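With the configuration in place on each server, the agent can be started as sketched below; the configuration directory /etc/consul.d is an assumption, so use whichever directory holds your server configuration and license file:
# Start the Consul server agent using the configuration directory.
consul agent -config-dir=/etc/consul.d/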
- We can see that these two clusters merged due to the same tag key and value pair on these instances.
[root@ip-192-168-33-29 ec2-user]# consul members
Node            Address              Status  Type    Build       Protocol  DC   Partition  Segment
consul-server1  192.168.33.29:8301   alive   server  1.14.9+ent  2         dc1  default    <all>
consul-server2  192.168.35.49:8301   alive   server  1.14.9+ent  2         dc1  default    <all>
consul-server3  192.168.55.98:8301   alive   server  1.14.9+ent  2         dc1  default    <all>
consul-server4  192.168.40.116:8301  alive   server  1.14.9+ent  2         dc2  default    <all>
consul-server5  192.168.52.187:8301  alive   server  1.14.9+ent  2         dc2  default    <all>
consul-server6  192.168.52.188:8301  alive   server  1.14.9+ent  2         dc2  default    <all>
- Also, the serfLAN merge would lead to frequent node leadership flips and would impact the quorum size.
[root@ip-192-168-33-29 ec2-user]# consul operator raft list-peers
Node            ID                                    Address              State     Voter  RaftProtocol  Commit Index  Trails Leader By
consul-server2  e011ae50-1877-ed40-f080-792fecec38b5  192.168.35.49:8300   follower  true   3             39544         0 commits
consul-server3  d448e69d-f24a-a185-1aab-164bb851eb7a  192.168.55.98:8300   leader    true   3             39544         -
consul-server5  b883d60d-1ecd-1f8f-83d8-819d55657762  192.168.52.187:8300  follower  true   3             39544         0 commits
consul-server4  045f51fc-173d-8fd2-5bdd-06ae93a66356  192.168.40.116:8300  follower  true   3             39544         0 commits
consul-server6  a1f543fa-7e2b-54bb-686f-6d101c1d1db3  192.168.52.188:8300  follower  true   3             39544         0 commits
consul-server1  b2f27b7e-c041-4407-a028-25b335608185  192.168.33.29:8300   follower  true   3             39544         0 commits
[root@ip-192-168-33-29 ec2-user]#
- To segregate these two clusters back, follow the steps below on the dc2 cluster first, followed by dc1 (see the command sketch after these steps):
  - First, ensure that the AWS tag_value is changed on each instance to match its respective cluster.
  - For all Consul servers, stop the Consul agent process and clean the data_dir contents.
  - Restart the servers and let the leader election occur for the respective cluster.
  - Restore the latest working snapshot (using the command consul snapshot restore <snapshot_name>) and let the servers sync.
  - In case you get an ACL error and your initial master token is no longer valid, you might need to do an ACL bootstrap and use the new master SecretID to operate Consul until you restore a snapshot that holds your initially used master token.
  - Monitor the state of the Consul servers.
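As a reference, the per-server steps above look roughly like the sketch below. It assumes Consul runs as a systemd service named consul, uses the data_dir and configuration paths from the configuration shown earlier, and the snapshot file name is a placeholder; adjust all of these to your environment:
# Run on each Consul server of the cluster being segregated (dc2 first, then dc1).

# 1. Stop the Consul agent and clean the data directory.
sudo systemctl stop consul          # or stop the consul agent process directly
sudo rm -rf /opt/consul/*           # data_dir from the configuration above

# 2. Restart the agent and let the cluster elect a leader.
sudo systemctl start consul         # or: consul agent -config-dir=/etc/consul.d/

# 3. Restore the latest working snapshot and let the servers sync.
consul snapshot restore backup-dc2.snap   # placeholder snapshot file name

# 4. Only if ACL errors occur and the original master token is no longer valid,
#    re-bootstrap the ACL system and use the returned SecretID until the
#    restored snapshot brings back the original token.
consul acl bootstrap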
- Follow the above approach for each cluster and restore each one with its respective snapshot. In the end, the two clusters should be segregated, with a leader in each.
[root@ip-192-168-33-29 ec2-user]# consul members
Node            Address             Status  Type    Build       Protocol  DC   Partition  Segment
consul-server1  192.168.33.29:8301  alive   server  1.14.9+ent  2         dc1  default    <all>
consul-server2  192.168.35.49:8301  alive   server  1.14.9+ent  2         dc1  default    <all>
consul-server3  192.168.55.98:8301  alive   server  1.14.9+ent  2         dc1  default    <all>
[root@ip-192-168-33-29 ec2-user]# consul operator raft list-peers
Node            ID                                    Address             State     Voter  RaftProtocol  Commit Index  Trails Leader By
consul-server3  d448e69d-f24a-a185-1aab-164bb851eb7a  192.168.55.98:8300  follower  true   3             40997         0 commits
consul-server1  b2f27b7e-c041-4407-a028-25b335608185  192.168.33.29:8300  follower  true   3             40997         0 commits
consul-server2  e011ae50-1877-ed40-f080-792fecec38b5  192.168.35.49:8300  leader    true   3             40997         -
Conclusion
To make sure that both clusters (dc1 and dc2) do not merge again under any circumstances, you can enable different gossip encryption keys on the two clusters, or enable gossip encryption on only one of them. This will prevent the clusters from merging in the future. Please find additional details regarding gossip encryption below:
https://developer.hashicorp.com/consul/tutorials/security/gossip-encryption-secure
https://developer.hashicorp.com/consul/docs/security/encryption
https://www.youtube.com/watch?v=lAL7ocZQprE&t=1223s
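Once distinct keys are in place, a quick check is to list the gossip keyring on a server in each datacenter and confirm that the installed keys differ between the clusters, for example:
# Run on a server in each cluster; with gossip encryption enabled and distinct
# keys installed, the listed keys for dc1 and dc2 should not match.
consul keyring -list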