Introduction
This guide covers common errors that operators encounter while configuring mesh gateways for WAN federation. Each entry gives the error message, the problem, its cause, and a solution.
Problems, Causes, and Solutions
Table of Contents
- Invalid certificate
- Permission denied: token with AccessorID 'primary-dc-down' lacks permission
- A missing role from instance for tag-based retry-join (Cloud Auto Join).
- The primary gateway address is already in use.
Invalid certificate
Error Message
ip-144-231-425-590.us-west-2.compute.internal consul[8738]: [ERROR] agent.server.rpc:
RPC failed to server in DC: server=172.31.46.103:8300 datacenter=dc1
method=FederationState.Apply error="rpc error getting client: failed to get conn:
x509: certificate is valid for *.dc2.consul, server.dc1.consul, localhost,
not server-123-45-67-891.server.dc1.consul"
Problem
The default consul tls cert create -server command adds SANs like "server.<dc_name>.consul", "localhost", and "127.0.0.1".
Cause
Those SANs do not cover a node-specific DNS name like server-123-45-67-891.server.dc1.consul, so a server reached by such a name fails the TLS handshake.
Solution
Generate and distribute new certificates with appropriate SANs using the following command:
$ consul tls cert create -server -domain consul \
    -additional-dnsname="*.server.<dc_name>.consul" \
    -additional-dnsname="*.server.<secondary_dc_name>.consul" \
    -additional-dnsname="*.<dc_name>.consul" \
    -additional-dnsname="*.<secondary_dc_name>.consul"
The -additional-dnsname flag adds SANs to the server certificate so that the handshake can succeed. The DNS names "*.server.<dc_name>.consul" and "*.server.<secondary_dc_name>.consul" cover servers with DNS names like server-123-45-67-891.server.dc1.consul. Wildcards are acceptable, but if stricter security is needed, add the DNS name of each server explicitly. For example:
$ consul tls cert create -server -domain consul \
    -additional-dnsname=server-123-45-67-891.server.dc1.consul \
    -additional-dnsname=server-123-45-67-891.server.dc2.consul \
    -additional-dnsname="*.dc1.consul" \
    -additional-dnsname="*.dc2.consul"
Distribute the new certificates to the servers and restart the Consul server on each machine.
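Before distributing a certificate, it can help to check that its SANs really cover the server name from the error message. The sketch below is only an illustration and is not part of Consul: san_matches is a hypothetical helper that applies the usual TLS rule that a wildcard stands in for exactly one leftmost DNS label. Under that rule "*.dc1.consul" does not cover server-123-45-67-891.server.dc1.consul, while "*.server.dc1.consul" does. The comparison is case-sensitive for simplicity.

```shell
# san_matches HOST PATTERN
# Succeeds (exit status 0) if PATTERN covers HOST.
# PATTERN is either an exact DNS name or a "*." wildcard; the wildcard
# is assumed to match exactly one leftmost label, never several.
san_matches() {
  host=$1
  pattern=$2
  case $pattern in
    '*.'*)
      suffix=${pattern#\*}   # e.g. ".server.dc1.consul"
      label=${host%%.*}      # leftmost label of the host name
      [ -n "$label" ] && [ "$host" = "$label$suffix" ]
      ;;
    *)
      [ "$host" = "$pattern" ]
      ;;
  esac
}

# Example checks against the name from the error message:
#   san_matches server-123-45-67-891.server.dc1.consul '*.server.dc1.consul' && echo covered
#   san_matches server-123-45-67-891.server.dc1.consul '*.dc1.consul' || echo "not covered"
```

The SANs actually present in a generated certificate can be listed with openssl x509 -in <certificate>.pem -noout -text, where <certificate>.pem is whichever file the create command wrote.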
Permission denied: token with AccessorID 'primary-dc-down' lacks permission
Error Message
ip-144-231-425-590.us-west-2.compute.internal consul[3484]: Error registering service
"meshgateway": Unexpected response code: 403 (Permission denied: token with AccessorID
'primary-dc-down' lacks permission 'service:write' on "meshgateway")
Problem
The token 'primary-dc-down' is a placeholder for a missing token.
Cause
This error occurs when the mesh gateway service in the secondary datacenter cannot register because ACL tokens haven't been replicated from the primary datacenter.
Solution
Ensure ACL replication is initiated between the primary and secondary datacenters. Verify that primary_gateways is configured in the secondary datacenter configuration to enable communication.
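One way to see whether replication has started is to query the ACL replication status endpoint (GET /v1/acl/replication) on a server in the secondary datacenter. The following is a minimal sketch rather than an official procedure: acl_replication_running is a hypothetical helper, and the agent address 127.0.0.1:8500 and the CONSUL_HTTP_TOKEN variable in the usage comment are assumptions about the environment.

```shell
# acl_replication_running
# Reads the JSON body returned by GET /v1/acl/replication on stdin and
# succeeds only if the response reports "Running": true.
acl_replication_running() {
  grep -Eq '"Running"[[:space:]]*:[[:space:]]*true'
}

# Usage on a secondary-datacenter server (address and token are assumptions):
#   curl -s -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
#     http://127.0.0.1:8500/v1/acl/replication | acl_replication_running \
#     && echo "ACL replication is running"
```

If the helper fails, the full response is worth reading by hand, since it also names the source datacenter and the replication type.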
A missing role from instance for tag-based retry-join (Cloud Auto Join).
Error Message
ip-144-231-425-590.us-west-2.compute.internal consul[32363]: | discover-aws:
DescribeInstancesInput failed: NoCredentialProviders: no valid providers in chain. Deprecated.
ip-192-168-151-143.us-west-2.compute.internal consul[32363]: | For verbose messaging
see aws.Config.CredentialsChainVerboseErrors
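Problem
On EC2 instances, the following entry is added in the Consul agent configuration, but the instances cannot join the Consul cluster and the error above appears in the Consul agent logs:
retry_join = ["provider=aws tag_key=\"Project\" tag_value=\"consul\""]
Cause
The instance has no IAM role that supplies AWS credentials, so Cloud Auto-join cannot query EC2 for servers carrying the configured tag.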
Solution
- Confirm that the server instances have the correct tags.
- Ensure an IAM role is attached to the instance and that the required IAM permission (ec2:DescribeInstances) is granted to that role. This enables Consul to discover EC2 servers by tag.
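Both checks in the list above can be tried from the affected instance with the AWS CLI, which uses the same instance-role credentials when no others are configured. This is a sketch under assumptions: the AWS CLI is installed, discover_consul_servers is a hypothetical wrapper, and restricting the lookup to running instances is a choice made here for illustration.

```shell
# discover_consul_servers TAG_KEY TAG_VALUE
# Runs an ec2:DescribeInstances lookup for running instances that carry
# the given tag and prints their private IP addresses.
discover_consul_servers() {
  aws ec2 describe-instances \
    --filters "Name=tag:$1,Values=$2" "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].PrivateIpAddress' \
    --output text
}

# Usage with the tag from the retry_join entry:
#   discover_consul_servers Project consul
```

A credentials or authorization error from this call points at the instance role, while an empty result suggests the servers are missing the expected tag.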
The primary gateway address is already in use.
Error Message (in Envoy logs)
$ journalctl -u consul.envoy.service -f
ip-144-231-425-590.ec2.internal consul[3402]: [warning][config] [source/common/config/new_delta_subscription_state.cc:261]
delta config for type.googleapis.com/envoy.config.listener.v3.Listener rejected: Error adding/updating listener(s)
lan:190.54.25.02:8443: cannot bind '190.54.25.02:8443': Address already in use
Problem
Envoy (mesh gateway) fails to start because the specified address or port (e.g., 8443) is already in use. The error appears on the primary datacenter gateway client.
$ journalctl -u consul-envoy.service
$ /usr/bin/consul connect envoy -gateway=mesh -register -service "meshgateway" -address "172.31.47.49:8443" -wan-address "172.31.47.49:8443" -expose-servers -token-file /etc/consul.d/tokens/mesh-gateway-dc1.txt -- -l debug
Solution
Identify the conflicting port using:
$ netstat -tnlp
(No info could be read for "-p": geteuid()=1000 but you should be root.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 172.31.41.94:8500 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:8600 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN -
tcp 0 0 172.31.41.94:8443 0.0.0.0:* LISTEN -
tcp 0 0 172.31.41.94:8300 0.0.0.0:* LISTEN -
tcp 0 0 172.31.41.94:8301 0.0.0.0:* LISTEN -
tcp 0 0 172.31.41.94:8302 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN -
tcp6 0 0 :::22 :::* LISTEN -
tcp6 0 0 :::111 :::* LISTEN -
If 8443 is used by Consul for HTTPS, update the configuration to use a different port (e.g., 8501):
ports {
  http     = 8500
  https    = 8501
  server   = 8300
  serf_lan = 8301
  serf_wan = 8302
  grpc_tls = 8502
}
Confirm which process holds the port using the following command. Before the change, Consul is listening on 8443:
$ sudo lsof -i :8443
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
consul 2356 consul 19u IPv6 20027 0t0 TCP *:pcsync-https (LISTEN)
After the change, Envoy is listening on it:
$ sudo lsof -i :8443
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
envoy 3685 root 28u IPv4 44644 0t0 TCP ip-192-168-139-63.ec2.internal:pcsync-https (LISTEN)
$ netstat -tnlp
(No info could be read for "-p": geteuid()=1000 but you should be root.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 190.54.25.02:8301 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:19000 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:8600 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN -
tcp 0 0 190.54.25.02:8443 0.0.0.0:* LISTEN -
tcp6 0 0 :::111 :::* LISTEN -
tcp6 0 0 :::8500 :::* LISTEN -
tcp6 0 0 :::8502 :::* LISTEN -
tcp6 0 0 :::22 :::* LISTEN -
tcp6 0 0 :::8503 :::* LISTEN -
With the updated configuration, the mesh gateway (Envoy) can now successfully listen on port 8443.
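The netstat listings in this section are long, so a small filter can make the conflicting listener easier to spot. The following is a minimal sketch, not part of Consul or Envoy: listeners_on_port is a hypothetical helper that reads netstat -tnl style output and keeps only the rows whose local address ends in the given port.

```shell
# listeners_on_port PORT
# Reads `netstat -tnl` style output on stdin and prints the TCP rows
# whose local address (4th column) ends in :PORT.
listeners_on_port() {
  awk -v port="$1" '$1 ~ /^tcp/ && $4 ~ (":" port "$")'
}

# Usage (run netstat as root so the PID/Program name column is filled in):
#   sudo netstat -tnlp | listeners_on_port 8443
```

An empty result means nothing is bound to the port, so the mesh gateway should be able to claim it.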
Additional Information
- Cloud Auto-join documents
- Cloud Auto-join Authentication & Precedence