Consul deployments are very sensitive to network reachability. The Serf gossip protocol is used to detect the health of Consul agents in the cluster, and closed ports or network connectivity issues can result in server-side errors and, in some situations, complete cluster failure. As such, it's important that all Consul client agents and all Consul server agents can reach each other over the required ports. This document explains some ways to check this and discusses some of the failure modes that occur when ports aren't reachable.
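If you suspect a connectivity problem, a quick first check (assuming you can still reach at least one healthy agent) is to list cluster membership:
consul members
Agents that other members can't reach over the gossip ports will typically show a status of failed rather than alive.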
Potential Failure Modes
There's a variety of ways network communication between two or more agents can be disrupted. Some of the most common include:
- Hardware firewall rules blocking required ports. This is easy to overlook when it's mistakenly assumed that network traffic isn't traversing the firewall when it actually is, so no firewall rules were configured to allow the traffic through.
- Software firewall rules blocking required ports. Typically, these would be iptables (Debian-based systems) or firewalld (RHEL systems) rules. Sometimes base images that have been hardened by a security team have restrictive firewall rules in place that the end user isn't aware of, so no rules were added to allow the traffic. It's also possible that some required ports were simply missed when creating the rules.
- Cloud network security group rules blocking required ports. If you're deploying to a cloud environment like AWS, Azure, or GCP, this could affect you. In many such deployments, all inbound ports are closed and you have to add explicit allow rules to let traffic in. With Consul, most traffic happens on the local network, so these rules often don't come into play; however, if you're using WAN Federation to connect multiple datacenters and the networks those datacenters are in are not peered, then you could run into issues here.
- Packet loss when traversing the network. The gossip protocol primarily happens over UDP, so scenarios where there's packet loss can result in odd behavior and errors on the Consul servers. Ensuring there isn't any packet loss between agents is important, particularly when you have Consul datacenters located in separate physical datacenters.
These failure modes will be discussed later, but before that it's important to determine if there is actually a connectivity issue.
Testing Network Connectivity Between Agents
There are a few ways you can check network connectivity. Depending upon the exact kind of issue, you might be able to discover problems with very basic checks. In other scenarios, more extensive testing may be required.
Checking Port Connectivity with Netcat
The first and most straightforward way to test connectivity is with a command line tool like netcat, which allows you to open a network connection with a remote host. This can pretty clearly indicate whether one host can reach another over a given TCP port (see the Caveats section for why this isn't true for UDP).
Installing Netcat
To begin, you'll want to make sure the package containing Netcat is installed:
RHEL / CentOS systems
sudo yum install nmap-ncat
Debian-based systems
sudo apt install netcat
Basic Port Testing with Netcat
After installing the appropriate package, you can use it to test connectivity over a given TCP port as follows:
RHEL / CentOS
A successful test:
$ nc -z -v 10.0.0.12 8300
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.0.0.12:8300.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.
An unsuccessful test indicating the port is closed or there's a network connectivity issue between hosts:
$ nc -z -v 10.0.0.12 8300
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connection refused.
Debian-based systems
A successful test:
$ nc -z -v 10.0.0.12 8300
Connection to 10.0.0.12 8300 port [tcp/*] succeeded!
An unsuccessful test indicating the port is closed or there's a network connectivity issue between hosts:
$ nc -z -v 10.0.0.12 8300
nc: connect to 10.0.0.12 port 8300 (tcp) failed: Connection refused
Caveats
Although netcat has a UDP mode that can be activated with the -u parameter, this is not reliable for testing connectivity because UDP messages aren't acknowledged in the same fashion TCP messages are. For this reason, it will either yield no response when used or may provide false positives, reporting a successful connection when the port really isn't reachable. As such, basic port testing like this shouldn't be used for testing UDP connections.
Testing UDP Ports with Netcat
This method can be used to definitively see if a given UDP port is open. However, it does require stopping Consul on the receiving side so netcat can bind to the ports we want to test.
To get started, stop Consul on the receiving side of the connection test (if it's a server, be careful that enough other Consul servers remain up and running that stopping this one won't result in you losing quorum). After stopping Consul, run the following command:
nc -v -u -l 8301
You won't see any output at first, but this will cause netcat to bind on 8301/udp and listen for requests. After doing this, on the machine you want to test connectivity with, run the following command:
nc -v -u $IP 8301
where $IP is the IP address of the machine you have netcat listening on. Wait for a second after running the command and you should see output similar to this:
Connection to $IP 8301 port [udp/*] succeeded!
You'll then have a prompt. Type "hello" and press enter. After doing that, switch back to the machine you have netcat listening on. You should see the string "hello" printed, indicating that it received the message over UDP. If you don't see that message printed, then there is likely a connectivity issue over that UDP port.
You can proceed to test all the required UDP ports using this method.
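If you'd prefer not to type the test string interactively, you can also send a single datagram from the sending side. This is a minimal sketch that assumes the listener from the previous step is still bound to 8301/udp:
echo "hello" | nc -u -w1 $IP 8301
Repeat the same listener/sender pair for each required UDP port (for example, 8302 on servers), restarting the listening netcat on the new port each time.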
Resolving the Underlying Cause
Once you've identified that traffic is being blocked, the next step is to determine what's blocking the traffic. As mentioned earlier, there are a variety of potential causes. We'll address them one-by-one here to help you identify the source of the issue.
Hardware firewall rules blocking required ports
Rules configured in hardware firewalls can potentially block required ports. In many situations, Consul clusters are deployed in a single network zone where traffic between agents won't be traversing the firewall. In scenarios like that, sometimes rules aren't configured in the hardware firewall. It can be easy to miss this as a potential problem if additional agents are added down the road where the traffic does actually traverse the firewall.
Unfortunately, checking for these rules typically will require someone from a company's networking team. The easiest way to determine if this is the cause is to rule out the other potential causes first and then demonstrate the lack of connectivity using the network connectivity testing methods described above. At that point, you can engage your networking team and show them that traffic is not reaching the expected destination.
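A packet capture on the destination host can help make that case. As a rough sketch (assuming the agent's network interface is eth0), run the following while performing one of the netcat tests from the other host:
sudo tcpdump -ni eth0 port 8301
If the test traffic never appears in the capture, it's being dropped somewhere upstream of the host.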
Software firewall rules blocking required ports
There are a variety of software firewalls out there, but the most common on Linux server deployments are iptables for Debian-based systems and firewalld for RHEL-based systems. This section discusses how to see if any rules have been set on a server and how to add new rules to allow traffic.
iptables
You can check to see if any iptables rules have been applied on your server by running this command:
sudo iptables -nL
If your server has no rules applied, then the output would look something like this:
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
Notice how the policy for each chain type is ACCEPT. On a more locked down system, that might look something like this:
Chain INPUT (policy DROP)
target prot opt source destination
ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:22
Chain FORWARD (policy DROP)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
Notice how the policy for INPUT and FORWARD is now set to DROP. There is a single rule set up to ACCEPT TCP traffic over the SSH port. If you see DROP policies like that without ACCEPT rules for the Consul ports, then iptables is likely the culprit.
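As a quick convenience check (assuming the default Consul port numbers), you can filter the INPUT chain for existing ACCEPT rules that cover Consul's ports:
sudo iptables -nL INPUT | grep -E '8300|8301|8302|8500|8502|8600'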
You can add rules to allow Consul traffic by running these commands:
# Consul Server Ports
sudo iptables -A INPUT -p tcp -m multiport --dports 8300,8301,8302,8500,8600 -j ACCEPT
sudo iptables -A INPUT -p udp -m multiport --dports 8301,8302,8600 -j ACCEPT
On Consul client machines, you may also want to open port 8502/tcp (used for the gRPC API). It's also worth noting that Consul sidecar proxies use ports in the range of 21000-21255/tcp, so you might also want to include a rule like this:
# Consul Client Ports
sudo iptables -A INPUT -p tcp -m multiport --dports 8502,21000:21255 -j ACCEPT
After you've added the rules, you'll need to save them; otherwise, the next time the server is restarted (or the iptables service is restarted) all the rules will be gone. How you persist these rules will vary based on implementation. Some environments might have custom Chef, Ansible, or Puppet configurations that set iptables rules. If this is the case in your environment, make sure you check whether those systems are already set up to manage firewall rules like this.
On a generic Debian installation, you would need to have the iptables-persistent package installed:
sudo apt install iptables-persistent
You would then save your current iptables rules like this:
sudo sh -c 'iptables-save > /etc/iptables/rules.v4'
The iptables-persistent package will ensure the saved rules are loaded on boot.
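The package also provides a netfilter-persistent helper; if it's available on your system, the same save can be performed with:
sudo netfilter-persistent save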
firewalld
You can check to see if any firewalld rules have been applied on your server by running this command:
sudo firewall-cmd --list-all
If your server has no rules applied, then the output would look something like this:
public (active)
target: default
icmp-block-inversion: no
interfaces: eth0 eth1
sources:
services: dhcpv6-client ssh
ports:
protocols:
masquerade: no
forward-ports:
source-ports:
icmp-blocks:
rich rules:
Notice how the public zone is active and the target is set to default. The ssh service is also allowed. On a more locked down system, that might look something like this:
drop (active)
target: DROP
icmp-block-inversion: no
interfaces: eth0 eth1
sources:
services: ssh
ports:
protocols:
masquerade: no
forward-ports:
source-ports:
icmp-blocks:
rich rules:
Notice how the drop zone is active and the target is now set to DROP. This means traffic is being blocked by default.
You can add rules to allow Consul traffic by running these commands:
# Consul Server Ports
sudo firewall-cmd --add-port={8300,8301,8302,8500,8600}/tcp --permanent
sudo firewall-cmd --add-port={8301,8302,8600}/udp --permanent
On Consul client machines, you may also want to open port 8502/tcp (used for the gRPC API). It's also worth noting that Consul sidecar proxies use ports in the range of 21000-21255/tcp, so you might also want to include a rule like this:
# Consul Client Ports
sudo firewall-cmd --add-port={8502,21000-21255}/tcp --permanent
Using the --permanent flag will ensure the rules are saved and loaded again the next time the server is restarted. However, the rules themselves won't go into effect until you perform a reload:
sudo firewall-cmd --reload
After doing that, you can check to make sure the rules are showing up by listing all rules again:
# sudo firewall-cmd --list-all
drop (active)
target: DROP
icmp-block-inversion: no
interfaces: eth0 eth1
sources:
services: ssh
ports: 8300/tcp 8301/tcp 8302/tcp 8500/tcp 8600/tcp 8301/udp 8302/udp 8600/udp
protocols:
masquerade: no
forward-ports:
source-ports:
icmp-blocks:
rich rules:
You should see all of the ports you've added on the ports line.
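You can also query a single port rather than scanning the full listing; for example, to confirm the Serf LAN UDP port is allowed in the active zone:
sudo firewall-cmd --query-port=8301/udp
This prints yes if the port is allowed and no if it isn't.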
Cloud network security group rules blocking required ports
Cloud service providers like AWS, Azure, and GCP apply a network security group policy to environments that can block or allow network traffic. Depending upon the environment and how your company performs cloud deployments, the default applied policy may block most inbound traffic. This is fine if your cluster is self-contained in a single VPC or private network, but if it spans networks or communicates over the public internet then you likely need to make rule adjustments.
Every cloud provider has its own implementation of network security group rules, but here are links to documentation on configuring them for three of the most widely used providers, which may help you create the necessary rules:
- AWS: Security Groups for your VPC
- Azure: Diagnose a virtual machine network traffic filter problem
- GCP: VPC firewall rules overview
In each case, you'll want to make sure appropriate rules exist to allow communication over Consul's required ports.
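As a rough sketch of what such a rule might look like with the AWS CLI (the security group ID and source CIDR below are placeholders you'd replace with your own values):
# Allow Serf LAN gossip from a peer network (placeholder group ID and CIDR)
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 8301 --cidr 10.1.0.0/16
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol udp --port 8301 --cidr 10.1.0.0/16
You would repeat this for the other required ports; the equivalent concepts exist in Azure network security groups and GCP VPC firewall rules.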
Packet loss when traversing the network
The gossip protocol that Consul uses to communicate cluster membership and availability occurs almost entirely over UDP. There are periodic syncs that happen over TCP, but the vast majority of the communication occurs over UDP. Given that UDP is a lossy protocol, if your network is suffering from packet loss, this can contribute to errors and other unexpected issues in a Consul deployment. Monitoring network latency and packet loss in your network – particularly between datacenters if you're running in different regions or availability zones – can be very helpful to identify potential issues here.
ping
One of the most basic ways to test for packet loss is to simply ping another server. If packet loss is detected, then it will be shown in the report displayed after ending the ping. To perform a quick test, you can run this command:
ping -c 20 $IP
where $IP is the IP address of the machine you're sending pings to. You would typically run this test from a Consul server to another Consul server, from a Consul server to a Consul client, or from a Consul client to a Consul server. That command will perform 20 pings and then print a report. If all is well, you would see something like this:
--- 8.8.8.8 ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 19635ms
rtt min/avg/max/mdev = 21.886/30.795/104.567/17.504 ms
If there were problems, the summary line might look more like this:
20 packets transmitted, 15 received, 5 lost, 25% packet loss
Pretty much any amount of packet loss can be cause for concern and should be investigated – particularly if it's happening on a local network.
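For a more sustained check than 20 pings, you can run a longer test at a shorter interval; for example, this sends a ping every half second for roughly five minutes:
ping -c 600 -i 0.5 $IP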
mtr
The mtr tool is a bit more sophisticated than ping and essentially combines traceroute with ping. It will trace the network route traffic is taking and will measure latency and packet loss at each network hop. This can be very helpful when trying to diagnose packet loss between datacenters where traffic is traversing the public internet, or in very large private networks.
Unlike ping, the mtr tool doesn't come pre-installed on most Linux systems, so you'll need to install it.
RHEL / CentOS
sudo yum install mtr
Debian-based
sudo apt install mtr
Once it's installed, you can perform a trace by running this command:
mtr -r -w -c 20 $IP
The -r flag will cause it to run in report mode, the -w flag will ensure it doesn't truncate hostnames, and the -c parameter tells it to send a specific number of pings to each hop for the test. Output might look something like this:
Start: 2021-06-03T20:04:33+0000
HOST: ubuntu Loss% Snt Last Avg Best Wrst StDev
1.|-- 10.0.2.2 0.0% 20 0.3 0.3 0.2 0.6 0.1
2.|-- 192.168.86.1 0.0% 20 6.2 6.7 5.2 11.4 1.5
3.|-- 82.59.3.1 0.0% 20 13.3 24.6 13.3 76.7 13.2
4.|-- 56.69.3.21 0.0% 20 38.3 56.0 28.8 362.7 72.7
5.|-- 29.175.41.164 0.0% 20 24.2 27.6 15.7 87.9 15.0
6.|-- 29.175.41.46 0.0% 20 26.4 31.3 19.6 75.4 12.9
7.|-- 66.109.1.216 0.0% 20 35.7 34.8 25.1 121.4 21.2
8.|-- 72.14.214.198 0.0% 20 24.6 30.0 21.3 60.1 9.1
9.|-- 108.170.225.174 0.0% 20 30.5 33.8 23.2 126.7 22.1
10.|-- 172.253.78.225 0.0% 20 26.2 30.1 22.2 64.5 9.0
11.|-- 142.250.68.174 0.0% 20 30.7 27.0 19.3 34.7 3.9
Typically, in a packet loss scenario you would start seeing packet loss at one hop and then many or all subsequent hops would show packet loss too. That's generally because the problem lies at the first hop where loss appears, and its effects cascade to subsequent hops (the hop with issues is dropping packets destined for hops further downstream). As such, it's usually best to focus your efforts on the first hop exhibiting packet loss.
Caveats
These methods of packet loss detection use ICMP pings out of the box. Many firewalls and other network devices block ICMP, so you might have trouble getting them to work. If you do, you can try the --tcp flag with mtr or look for other TCP-based ping tools. It may also be possible to add network security rules in some cloud environments to allow ICMP traffic between the VMs on your network. Doing so would allow these tests to work within your network at the very least.
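For example, to run the same style of report against a Consul server's RPC port using TCP probes instead of ICMP (assuming your mtr build supports the --tcp and --port options):
mtr --tcp --port 8300 -r -w -c 20 $IP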