Overview
Consul's service mesh (also known as Connect) ships with Envoy as the default proxy for communication between services. Environmental or configuration issues can sometimes prevent the Envoy proxy from working properly. This article covers common Envoy errors and ways to find their root causes.
Errors
SSLV3_ALERT_CERTIFICATE_EXPIRED
upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268436501:SSL routines:OPENSSL_internal:SSLV3_ALERT_CERTIFICATE_EXPIRED
Problem:
This error can occur if the leaf certificates used by services expire. Leaf certificates are short-lived (roughly 2-3 days) and are rotated automatically by Consul Connect, so a certificate reaching expiration usually means the Consul clients were unable to update the certificates in Envoy for the services.
How to troubleshoot:
Check http://localhost:19000/certs on one of the services to see whether the certificates in Envoy have expired. Under cert_chain, if days_until_expiration shows 0, try restarting the service's sidecar and checking the certificate page again.
If this doesn't make a difference, there may be a disconnect between the Consul client and the sidecar. You can test this by restarting a Consul client that is aware of the service and then checking the certificate page again.
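As a quick command-line check, you can pull the days_until_expiration counters out of the /certs response and count how many are at 0. The sketch below uses a heredoc standing in for the admin API response (field shapes are illustrative); against a live sidecar you would pipe curl into the same grep.

```shell
# Quick check for expired leaf certificates. Against a live sidecar:
#   curl -s http://localhost:19000/certs | grep '"days_until_expiration"'
# The heredoc below stands in for the /certs response so the pipeline
# can be shown end to end (field shapes are illustrative).
cat <<'EOF' | grep -c '"days_until_expiration": "0"'
{
  "certificates": [
    { "cert_chain": [ { "days_until_expiration": "0" } ] },
    { "ca_cert":    [ { "days_until_expiration": "25" } ] }
  ]
}
EOF
```

A count greater than 0 means at least one certificate expires today and the sidecar (and possibly the Consul client) should be restarted.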
gRPC config stream closed: 14
gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure
Problem:
This error can cover a few different problems in Consul, but overall it means the gRPC config stream was closed because metadata needed from Consul, such as an ACL token, wasn't provided, or because there is a protocol or network issue between Consul and Envoy.
How to troubleshoot:
Check whether there is a Consul token in the service's config dump (http://localhost:19000/config_dump); it appears under "key": "x-consul-token". If ACLs are enabled and the value is empty, Consul or Nomad is unable to provide the token to Envoy.
- If Nomad is not present, set the Consul log level to trace, restart the sidecar service, and review the logs.
- If Nomad is present, ensure that the Envoy image used by Nomad and Consul isn't being overridden by a custom one, which can prevent Nomad's prestart hooks from operating properly.
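The token check above can be sketched as a short pipeline. The heredoc stands in for the config dump (the structure is illustrative); against a live sidecar you would pipe curl into the same grep.

```shell
# Check whether Envoy was bootstrapped with a Consul ACL token. Live:
#   curl -s http://localhost:19000/config_dump | grep -A1 '"x-consul-token"'
# The heredoc stands in for the config dump (structure is illustrative).
cat <<'EOF' | grep -A1 '"x-consul-token"'
{
  "initial_metadata": [
    {
      "key": "x-consul-token",
      "value": ""
    }
  ]
}
EOF
```

An empty "value" with ACLs enabled confirms that no token reached Envoy.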
gRPC config for ClusterLoadAssignment rejected
gRPC config for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment rejected: malformed IP address: 00000000000000000000.us-west-2.elb.amazonaws.com. Consider setting resolver_name or setting cluster type to 'STRICT_DNS' or 'LOGICAL_DNS'
Problem:
The address being used for the cluster endpoint cannot be used in Envoy. Consul pushes cluster and endpoint data to Envoy for communication through the mesh, and endpoint addresses are expected to be IPs; here a DNS hostname (a load balancer name) was supplied instead.
How to troubleshoot:
This is most likely caused by a stale load balancer address being served to Envoy from a mesh gateway. Restarting the mesh gateway sidecar (on VMs) or the mesh gateway pods (on Kubernetes) should help if the stale load balancer addresses appear in the Envoy errors.
If this error is shown when tearing down a secondary datacenter, a Consul cluster outage may follow. To properly tear down a WAN-federated secondary datacenter, refer to the following articles:
- Remove WAN Federation between Consul Clusters - Virtual Machines (VMs)
- Remove WAN Federation between Consul Clusters - Kubernetes (K8s)
gRPC config stream closed: 7, permission denied
StreamAggregatedResources gRPC config stream closed: 7, permission denied
Problem:
In this instance, a token is provided to Envoy to authenticate against Consul, but it does not have the privileges required to receive updated resources from the Consul clients.
How to troubleshoot:
If only Consul is in use, check the policy on the token to confirm it has service write permissions so the sidecar can be registered. If Nomad is used for orchestration, check that there is a policy in Consul with the following privileges:
agent_prefix "" { policy = "read" } node_prefix "" { policy = "read" } service_prefix "" { policy = "write" } acl = "write"
- With a token assigned to this policy, add it to Nomad's config file under the consul block's token field.
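In the Nomad agent configuration, this looks roughly like the following sketch (the address and token values are placeholders):

```hcl
# Nomad agent configuration fragment. The token should be one created
# from the policy above; both values here are placeholders.
consul {
  address = "127.0.0.1:8500"
  token   = "<consul-acl-token>"
}
```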
cannot bind '127.0.0.1:1020': Permission denied
Error adding/updating listener(s) <listener name>
cannot bind '127.0.0.1:1020': Permission denied
Problem:
An Envoy proxy configuration (managed through Consul annotations or an HCL file) requires it to listen on port 1020, but binding to this port fails with a 'Permission denied' error. This can occur in environments where Envoy is used alongside Consul for service mesh capabilities, whether it's deployed within a Kubernetes cluster or in a standalone setup.
How to troubleshoot:
Ports below 1024 are privileged and require root permissions to bind, so Envoy won't be allowed to bind a listener to port 1020 without further intervention. The following can solve this:
- Use ports above 1024: by configuring Envoy to listen on ports higher than 1024, you avoid the need for root privileges. This approach is also safer from a security perspective.
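One common place such a port is set is the upstream listener in a Consul service definition; moving local_bind_port above 1024 avoids the privileged-port problem. The service names below are placeholders:

```hcl
service {
  name = "web"

  connect {
    sidecar_service {
      proxy {
        upstreams {
          destination_name = "backend"
          # Bind above 1024 so Envoy can bind without root privileges.
          local_bind_port  = 11020
        }
      }
    }
  }
}
```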
delayed connect error: 111
upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111
Problem:
Error code 111 typically corresponds to the ECONNREFUSED error in Linux, which means the connection was refused by the server (the sidecar). This generally indicates that no process is listening on the IP:port combination the client is trying to connect to.
How to troubleshoot:
In the context of Consul + Envoy, there are two reasons this error could occur:
- This issue can be caused by Connect sidecars being restarted (e.g. during a Consul version upgrade) and continuing to receive requests even after the sidecar stops. Tuning the interval and success_before_passing fields in the check block can help stop traffic to the sidecar in this scenario.
Example

```json
{
  "connect": {
    "sidecar_service": {
      "check": {
        "name": "Sidecar Health Check",
        "tcp": "localhost:1235",
        "success_before_passing": 3,
        "interval": "5s"
      }
    }
  }
}
```

Here success_before_passing requires the check to succeed 3 times before the sidecar is marked as healthy (passing); it defaults to 0. The interval defaults to 10s.
- This can also occur if an orphan pod without an Envoy container is spun up. You can confirm this by checking the pods to see whether both the application and Envoy containers are running:
kubectl get pods -n <namespace> -o=jsonpath='{.items[*].spec.containers[*].name}'
- If the Envoy container isn't there, make sure that connectInject is enabled in the values.yaml file. Once this is confirmed, restarting the pod should allow the Envoy container to be injected.
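The container check can be scripted. The variable below stands in for the kubectl output; the container names are illustrative, and depending on the consul-k8s version the injected sidecar is typically named "envoy-sidecar" or "consul-dataplane".

```shell
# Confirm each pod runs both the application and the injected sidecar.
# In a live cluster you would capture:
#   containers=$(kubectl get pods -n <namespace> \
#     -o=jsonpath='{.items[*].spec.containers[*].name}')
# The value below stands in for that output (names are illustrative).
containers="web consul-dataplane"

case "$containers" in
  *consul-dataplane*|*envoy*) echo "sidecar present" ;;
  *)                          echo "sidecar MISSING" ;;
esac
```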
delayed connect error: 113
upstream reset: reset reason: connection failure, transport failure reason: delayed connect error: 113
Problem:
Error code 113 typically corresponds to the EHOSTUNREACH error in Linux, which means the host you are trying to reach is unreachable. This error occurs at the network level and indicates that a route to the specified host cannot be found, preventing any kind of network connection or communication.
How to troubleshoot:
Envoy's stats and clusters pages are helpful here. In the stats, you'll want to compare how many members of the upstream are healthy against the total number of members:
Healthy members:
curl http://localhost:19000/stats | grep 'consul.membership_healthy'
Total members:
curl http://localhost:19000/stats | grep 'consul.membership_total'
If they don't match, then you'll want to check the clusters page to get the IPs of the unhealthy members:
curl http://localhost:19000/clusters | grep '::health_flags::/failed_eds_health'
Seeing the failed_eds_health flag for an IP can indicate that stale IPs are being connected to; Envoy sidecars shouldn't have stale IPs in a normal scenario. The next step is to check the Consul or consul-k8s changelog for the version you're on. For example, you may see the 113 error in the mesh on one of the consul-k8s 1.1.x versions; upgrading to a consul-k8s version where the fix was backported may resolve the issue.
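The healthy-versus-total comparison can be scripted. The variable below stands in for the stats output (the cluster name and counts are illustrative); in a live environment you would capture the output of the curl commands above instead.

```shell
# Compare healthy vs. total upstream membership from Envoy's stats.
# Live:  curl -s http://localhost:19000/stats | grep 'membership_'
# The value below stands in for that output (name/counts illustrative).
stats='cluster.local_app.membership_healthy: 2
cluster.local_app.membership_total: 3'

healthy=$(printf '%s\n' "$stats" | awk '/membership_healthy/ {print $2}')
total=$(printf '%s\n' "$stats" | awk '/membership_total/ {print $2}')

# A mismatch means some upstream members are unhealthy; check the
# clusters page for their IPs next.
if [ "$healthy" -ne "$total" ]; then
  echo "unhealthy members: $healthy healthy of $total total"
fi
```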