Setting up WAN federation via mesh gateways on virtual machines involves multiple steps. The errors that operators may run into along the way are documented below.
1. Invalid certificate
Problem
By default, the command consul tls cert create -server adds the SANs "server.<dc_name>.consul", "localhost", and "127.0.0.1". As a result, a server with a DNS name like server-123-45-67-891.server.dc1.consul will not be able to complete a successful TLS handshake, because that name is not covered by any SAN in the certificate.
Error Message
ip-144-231-425-590.us-west-2.compute.internal consul[8738]: [ERROR] agent.server.rpc:
RPC failed to server in DC: server=172.31.46.103:8300 datacenter=dc1
method=FederationState.Apply error="rpc error getting client: failed to get conn:
x509: certificate is valid for *.dc2.consul, server.dc1.consul, localhost,
not server-123-45-67-891.server.dc1.consul"
Resolution
New certificates need to be generated with additional SANs and distributed to the datacenters. Use the following format:
$ consul tls cert create -server -domain consul \
    -additional-dnsname="*.server.<dc_name>.consul" \
    -additional-dnsname="*.server.<secondary_dc_name>.consul" \
    -additional-dnsname="*.<dc_name>.consul" \
    -additional-dnsname="*.<secondary_dc_name>.consul"
Each -additional-dnsname flag adds a SAN to the server certificate so that the handshake can succeed. The DNS names "*.server.<dc_name>.consul" and "*.server.<secondary_dc_name>.consul" allow servers with DNS names like server-123-45-67-891.server.dc1.consul to complete the handshake. Wildcards are acceptable, but if stricter security is needed, the DNS name of each server can be added explicitly. For example:
$ consul tls cert create -server -domain consul \
    -additional-dnsname="server-123-45-67-891.server.dc1.consul" \
    -additional-dnsname="server-123-45-67-891.server.dc2.consul" \
    -additional-dnsname="*.dc1.consul" \
    -additional-dnsname="*.dc2.consul"
Once the new certificates have been generated, distribute them to each server and restart the Consul servers one at a time (for example, sudo systemctl restart consul).
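To see why the default certificate is rejected, it helps to recall that a wildcard SAN conventionally covers only a single leftmost DNS label. The sketch below is a simplified illustration of that rule in POSIX shell. The function name san_matches is hypothetical, and it ignores details that real TLS libraries handle (such as case-insensitive comparison and IP SANs), so treat it as a reasoning aid rather than a validator.

```shell
# san_matches PATTERN HOSTNAME  (hypothetical helper, simplified)
# Succeeds when HOSTNAME is covered by the SAN PATTERN, treating a leading
# "*" as a stand-in for exactly one DNS label.
# Note: POSIX sh has no local variables, so these names are global.
san_matches() {
  pattern=$1
  host=$2
  case $pattern in
    '*.'*)
      suffix=${pattern#'*'}      # e.g. ".server.dc1.consul"
      first=${host%%.*}          # leftmost label of the hostname
      rest=${host#"$first"}      # everything after the leftmost label
      [ -n "$first" ] && [ "$rest" = "$suffix" ]
      ;;
    *)
      [ "$pattern" = "$host" ]   # non-wildcard SANs must match exactly
      ;;
  esac
}

# Compare the SANs from the error message, plus one added SAN,
# against the server name that failed the handshake.
host='server-123-45-67-891.server.dc1.consul'
for san in '*.dc2.consul' 'server.dc1.consul' 'localhost' '*.server.dc1.consul'; do
  if san_matches "$san" "$host"; then
    echo "match:    $san"
  else
    echo "no match: $san"
  fi
done
```

Of the four patterns, only "*.server.dc1.consul", the one added with -additional-dnsname, covers the server name.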
2. Permission denied: token with AccessorID 'primary-dc-down' lacks permission
Problem
'primary-dc-down' is the default name for a non-existent token. In this scenario, the mesh gateway service in the secondary datacenter cannot be registered because ACL tokens haven't replicated from primary to secondary.
Error Message
ip-144-231-425-590.us-west-2.compute.internal consul[3484]: Error registering service
"meshgateway": Unexpected response code: 403 (Permission denied: token with AccessorID
'primary-dc-down' lacks permission 'service:write' on "meshgateway")
Resolution
Before the mesh gateway service in the secondary datacenter can be registered, ACL replication must be running. Make sure that primary_gateways is set in the configuration of the secondary datacenter. The secondary datacenter uses these addresses to reach the primary datacenter and join the two datacenters.
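One way to confirm that replication has actually started is to query the /v1/acl/replication HTTP endpoint on a server in the secondary datacenter and look at its Running field. The following is a minimal sketch. The helper name acl_replication_running is hypothetical, the address http://127.0.0.1:8500 and the CONSUL_HTTP_TOKEN variable are assumptions about the local setup, and the grep-based check is a shortcut rather than a full JSON parser.

```shell
# acl_replication_running (hypothetical helper)
# Reads the JSON body returned by /v1/acl/replication on stdin and succeeds
# only when it contains "Running": true.
acl_replication_running() {
  grep -Eq '"Running"[[:space:]]*:[[:space:]]*true'
}

# Example usage on a secondary-datacenter server (assumed address and token):
#   curl -s -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
#     http://127.0.0.1:8500/v1/acl/replication \
#     | acl_replication_running && echo "ACL replication is running"
```

If the check fails, registering the mesh gateway in the secondary datacenter is likely to keep returning the 403 shown above until replication is running.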
3. Missing role from instance for tag-based retry_join (Cloud Auto Join)
Problem
On EC2 instances, the following entry is added to the Consul agent configuration for tag-based retry_join:
retry_join = ["provider=aws tag_key=\"Project\" tag_value=\"consul\""]
However, the instances are unable to join the Consul cluster, and the following error appears in the Consul agent logs.
Error Message
ip-144-231-425-590.us-west-2.compute.internal consul[32363]: | discover-aws:
DescribeInstancesInput failed: NoCredentialProviders: no valid providers in chain.
Deprecated.
ip-192-168-151-143.us-west-2.compute.internal consul[32363]: | For verbose messaging
see aws.Config.CredentialsChainVerboseErrors
Resolution
After confirming that the server instances carry the right tags, make sure that the IAM permission required for discovering the Consul server instances, ec2:DescribeInstances, is granted to the instance, for example through an IAM role attached to the instance.
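As an illustration, an IAM policy granting only this permission might look like the sketch below, written out from a shell heredoc. The file name consul-auto-join-policy.json is hypothetical, and how the policy is attached to the instance (for example, through a role in its instance profile) depends on the environment.

```shell
# Hypothetical output path; adjust as needed.
policy_file="./consul-auto-join-policy.json"

# ec2:DescribeInstances does not support resource-level restrictions,
# so the policy uses "Resource": "*".
cat > "$policy_file" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ec2:DescribeInstances",
      "Resource": "*"
    }
  ]
}
EOF

echo "wrote $policy_file"

# Once the role is attached, the permission can be checked from the instance
# itself (requires the AWS CLI; tag key and value taken from retry_join above):
#   aws ec2 describe-instances --filters "Name=tag:Project,Values=consul"
```

If the describe-instances call succeeds from the instance without explicit credentials, the NoCredentialProviders error from discover-aws should no longer occur.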
4. Primary gateway address already in use.
Problem
Envoy (the mesh gateway) cannot start because its address, or more specifically its port, is already in use. This error appears on the mesh gateway client in the primary datacenter. It can be found with one of the following two commands, depending on how Envoy is started:
$ journalctl -u consul-envoy.service
$ /usr/bin/consul connect envoy -gateway=mesh -register -service "meshgateway" -address "172.31.47.49:8443" -wan-address "172.31.47.49:8443" -expose-servers -token-file /etc/consul.d/tokens/mesh-gateway-dc1.txt -- -l debug
Error Message in Envoy logs
$ journalctl -u consul-envoy.service -f
ip-144-231-425-590.ec2.internal consul[3402]: [warning][config] [source/common/config/new_delta_subscription_state.cc:261]
delta config for type.googleapis.com/envoy.config.listener.v3.Listener rejected: Error adding/updating listener(s)
lan:190.54.25.02:8443: cannot bind '190.54.25.02:8443': Address already in use
Resolution
Check which process is already listening on the port that Envoy needs, in this scenario port 8443.
$ netstat -tnlp
(No info could be read for "-p": geteuid()=1000 but you should be root.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 172.31.41.94:8500       0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.1:8600          0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN      -
tcp        0      0 172.31.41.94:8443       0.0.0.0:*               LISTEN      -
tcp        0      0 172.31.41.94:8300       0.0.0.0:*               LISTEN      -
tcp        0      0 172.31.41.94:8301       0.0.0.0:*               LISTEN      -
tcp        0      0 172.31.41.94:8302       0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      -
tcp6       0      0 :::22                   :::*                    LISTEN      -
tcp6       0      0 :::111                  :::*                    LISTEN      -
The output shows that something is already listening on port 8443 on the private IP. This can happen when the Consul configuration file assigns 8443 to the HTTPS port. For example:
ports {
  http = 8500
  https = 8443
  server = 8300
  serf_lan = 8301
  serf_wan = 8302
  grpc_tls = 8502
}
This can be confirmed with the following command:
$ sudo lsof -i :8443
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
consul 2356 consul 19u IPv6 20027 0t0 TCP *:pcsync-https (LISTEN)
Since Envoy needs port 8443 for its listener, change the Consul HTTPS port to a different value, such as 8501. That should allow Envoy to start.
$ sudo lsof -i :8443
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
envoy 3685 root 28u IPv4 44644 0t0 TCP ip-192-168-139-63.ec2.internal:pcsync-https (LISTEN)
$ netstat -tnlp
(No info could be read for "-p": geteuid()=1000 but you should be root.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 190.54.25.02:8301       0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.1:19000         0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.1:8600          0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN      -
tcp        0      0 190.54.25.02:8443       0.0.0.0:*               LISTEN      -
tcp6       0      0 :::111                  :::*                    LISTEN      -
tcp6       0      0 :::8500                 :::*                    LISTEN      -
tcp6       0      0 :::8502                 :::*                    LISTEN      -
tcp6       0      0 :::22                   :::*                    LISTEN      -
tcp6       0      0 :::8503                 :::*                    LISTEN      -
With the configuration updated, the mesh gateway (Envoy) is able to listen on 8443.
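The port check used in this section can also be scripted. The sketch below filters netstat -tnl style output for listeners on a given port. The helper name port_listeners is hypothetical, and it assumes the local address sits in the fourth column, as in the netstat output shown earlier.

```shell
# port_listeners PORT  (hypothetical helper)
# Reads `netstat -tnl` style output on stdin and prints the local address of
# every LISTEN socket bound to PORT. The port is taken as the text after the
# last ":" so that both IPv4 (1.2.3.4:8443) and IPv6 (:::8443) forms work.
port_listeners() {
  awk -v port="$1" '
    /LISTEN/ {
      n = split($4, parts, ":")
      if (parts[n] == port) print $4
    }'
}

# Example usage on a live host:
#   netstat -tnl | port_listeners 8443
```

An empty result suggests the port is free. To see which process owns a listener, sudo lsof -i :8443 remains the more direct check.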