The information contained in this article has been verified as up-to-date on the date of the original publication of the article. HashiCorp endeavors to keep this information up-to-date and correct, but it makes no representations or warranties of any kind, express or implied, about the ongoing completeness, accuracy, reliability, or suitability of the information provided.
All information contained in this article is for general information purposes only. Any reliance you place on such information as it applies to your use of your HashiCorp product is therefore strictly at your own risk.
Background
When attempting to remove a Datacenter (DC) from the federation, issues may arise with Consul contacting the removed DC, especially in scenarios where two or more federated DCs exist.
Consul employs a caching mechanism to retain LAN and WAN membership information for agents within each DC connected to the federation.
Typically, addressing this issue involves adjusting the reconnect_timeout_wan parameter. This parameter serves as the WAN equivalent of the reconnect_timeout parameter, which determines the duration it takes for a failed server to be entirely removed from the WAN pool.
Solution
Please note that the warnings/errors mentioned herein are not anticipated to disrupt the functionality of Consul, and no further action is required. By default, the caching mechanism currently implemented automatically refreshes every 72 hours, ensuring that these specific logs will cease to appear.
Nonetheless, if the presence of these messages in the logs triggers unnecessary alerts, such as those related to Consul Health Checks or Monitoring, or if they excessively occupy log space, there are practical measures available to address and mitigate these concerns.
The number of secondary DCs in the environment will determine which options are available.
- In a federated environment with two or more secondary datacenters, there is currently no way to stop the messages immediately. The only option is to shorten the cache time by adding the reconnect_timeout_wan parameter and setting it to a minimum of 8 hours. This is the most viable way to make the messages stop sooner in a federated setup.
- In an environment with only two federated datacenters (one primary and one secondary), if the first option is not adequate, you can intentionally interrupt the connection between the datacenters on port 8302 and then use the consul force-leave command to remove the specified member.
Examples of the Log Messages
2022-11-09T15:15:23.222Z [WARN] agent.server.rpc: RPC request for DC is currently failing as no path was found: datacenter=prod-floky-v1-1 method=Internal.ServiceDump
RPC failed to server: method=Internal.ServiceDump server=100.x.x.x:8300 error="rpc error making call: No path to datacenter"
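To gauge how often these warnings occur and which datacenter they name, the log lines can be summarized with a small filter. This is a minimal sketch, assuming the log format matches the examples above; the function name count_no_path_warnings is hypothetical and not part of Consul.

```shell
# Reads Consul server log lines on stdin and prints "<datacenter> <count>"
# for every datacenter named in a "no path was found" warning.
# Assumption: the warning text matches the example messages in this article.
count_no_path_warnings() {
  awk '
    match($0, /no path was found: datacenter=[^ ]+/) {
      name = substr($0, RSTART, RLENGTH)   # the matched fragment only
      sub(/^.*datacenter=/, "", name)      # keep just the datacenter name
      seen[name]++
    }
    END { for (name in seen) print name, seen[name] }
  ' | sort
}
```

Any log source can be piped into the function, for example a saved log file: `count_no_path_warnings < consul.log` (the file name is illustrative).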
Recommendations
Before implementing this in a production environment, it is highly recommended to conduct testing in a controlled environment such as a sandbox or staging setup. This practice serves to validate seamless functionality and mitigate any potential issues that may be specific to your environment, ensuring a smoother transition to production.
Add the reconnect_timeout_wan Parameter
- Add the reconnect_timeout_wan parameter to the configuration file for all the server nodes in each datacenter (for example, setting it to 8 hours)
reconnect_timeout_wan = "8h"
- Initiate a phased restart of the primary servers, starting with the followers and ending with the leader
- Once one of the followers has rejoined the cluster as a voter, you may transfer leadership to a follower of your choosing
- Note: Starting with Consul v1.15.x, leadership can be transferred to one of the followers using the transfer-leader command
- Follow your process for removing the secondary DC
- Confirm the status of the specified secondary data center by executing the following command to ensure it is indicated as "left" or "failed"
consul members -wan
- Examine the logs after the designated period to ensure that error/warning messages no longer appear
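The status check in the steps above can be scripted. The following is a sketch under stated assumptions: it expects the usual column layout of `consul members -wan` (node name first, status third) and WAN node names of the form `<node>.<datacenter>`. The function name wan_members_not_removed is hypothetical.

```shell
# Reads `consul members -wan` output on stdin and prints the WAN members of the
# given datacenter whose status is neither "left" nor "failed".
# Assumptions: the first line is a header, column 1 is "<node>.<datacenter>",
# and column 3 is the member status.
wan_members_not_removed() {
  dc="$1"
  awk -v dc="$dc" 'NR > 1 {
    n = split($1, parts, ".")   # the datacenter is the last dot-separated part
    if (parts[n] == dc && $3 != "left" && $3 != "failed") print $1, $3
  }'
}
```

Usage: `consul members -wan | wan_members_not_removed dc2`, where dc2 stands in for the name of the removed secondary datacenter. Empty output means every member of that datacenter is shown as "left" or "failed".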
Removing WAN Federation between TWO Consul Clusters
NOTE: WAN traffic will experience disruption during this procedure, but all internal cluster traffic should remain unaffected.
- Use iptables rules to drop all traffic between the two WAN federated clusters. This will cause both clusters to think nodes in the other cluster have failed.
- Run the following with root access:
sudo iptables -A OUTPUT -p tcp --dport 8302 -j REJECT
sudo iptables -A INPUT -p tcp --dport 8302 -j REJECT
sudo iptables -A OUTPUT -p udp --dport 8302 -j REJECT
sudo iptables -A INPUT -p udp --dport 8302 -j REJECT
- If you're working remotely via SSH, you might need to open port 22. The -I flag inserts the rule before all other rules in INPUT:
sudo iptables -I INPUT -p tcp --dport 22 -j ACCEPT
- If your SSH service is listening on another port, use that port instead of 22
- After the clusters have been cleanly separated, remove the retry_join_wan parameter from the configuration file on each Consul node
- Parameter example:
retry_join_wan = ["dc2-server-1", "dc2-server-2", "dc2-server-3"]
- Note: The value can contain IPv4, IPv6, or DNS addresses.
- Reboot each node to update these values
- Run the force-leave CLI command to separate the two WAN federated clusters cleanly
consul force-leave [options] node
- If you have ACLs enabled and need to pass a token, use -token=<value> in the options before specifying the node name
- To re-open port 8302 using iptables, use the same commands with ACCEPT instead of REJECT. Insert them with -I rather than appending with -A, because a rule appended after the earlier REJECT rules would never be reached:
sudo iptables -I INPUT -p tcp --dport 8302 -j ACCEPT
sudo iptables -I INPUT -p udp --dport 8302 -j ACCEPT
sudo iptables -I OUTPUT -p tcp --dport 8302 -j ACCEPT
sudo iptables -I OUTPUT -p udp --dport 8302 -j ACCEPT
- You can also simply remove the REJECT rules by using the -D flag rather than the -A flag
sudo iptables -D INPUT -p tcp --dport 8302 -j REJECT
sudo iptables -D INPUT -p udp --dport 8302 -j REJECT
sudo iptables -D OUTPUT -p tcp --dport 8302 -j REJECT
sudo iptables -D OUTPUT -p udp --dport 8302 -j REJECT
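Because the block and unblock steps differ only in the iptables action flag, the four rules can be generated from a single helper. This is an illustrative sketch, not an official procedure: the function name wan_port_rules is hypothetical, and it only prints the commands so they can be reviewed before anything is run.

```shell
# Prints (does not run) the four iptables commands for WAN gossip port 8302.
# $1 is the iptables action flag: A appends the REJECT rules, D deletes them.
wan_port_rules() {
  action="$1"
  case "$action" in
    A|D) ;;
    *) echo "usage: wan_port_rules A|D" >&2; return 1 ;;
  esac
  for chain in OUTPUT INPUT; do
    for proto in tcp udp; do
      echo "iptables -$action $chain -p $proto --dport 8302 -j REJECT"
    done
  done
}
```

After reviewing the printed commands, they could be applied with `wan_port_rules A | sudo sh` and reverted with `wan_port_rules D | sudo sh`. Deleting with the same rule specification that was appended ensures the exact REJECT rules are the ones removed.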
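For the step in the two-cluster procedure that removes retry_join_wan, it can help to confirm that no configuration file still carries the parameter before restarting the nodes. The sketch below assumes the configuration lives in a single directory; the function name files_with_retry_join_wan and the example path are hypothetical.

```shell
# Prints every file under the given directory that still mentions retry_join_wan.
# Note: this is a plain text match, so commented-out lines are reported too.
files_with_retry_join_wan() {
  grep -rl 'retry_join_wan' "$1"
}
```

Usage: `files_with_retry_join_wan /etc/consul.d` (substitute the configuration directory used in your environment). No output means the parameter was not found in any file.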