Introduction
To effectively troubleshoot issues with your Envoy dataplane within the Consul service mesh, gathering specific information from the affected sidecar proxies and any related downstream proxies is crucial. This document outlines the necessary prerequisites and the expected outcome of collecting this data.
Expected Outcome
By following the instructions to gather information from the specified Envoy Admin API endpoints and logs, you will provide the necessary data for effective troubleshooting. The expected outcome is a collection of text-based outputs for each relevant sidecar, including:
- Config Dump: The raw JSON output from the `/config_dump` endpoint. This will allow inspection of the Envoy configurations applied by Consul, such as service defaults and resolvers.
- Stats: The text output from the `/stats` endpoint. This will provide insights into resource utilization (e.g., memory, connections) and upstream connection counts, potentially revealing proxy-level bottlenecks or errors.
- Clusters: The text output from the `/clusters` endpoint. This will confirm whether the Envoy proxy has discovered and correctly configured the necessary upstream services managed by Consul.
- Listeners: The text output from the `/listeners` endpoint. This will verify that the Consul XDS server has successfully configured the required listeners, including those on port `20000` for service traffic.
- Certificates: The text output from the `/certs` endpoint. This will allow examination of the TLS certificates being used by the proxy, helping to identify issues like expired or incorrect certificates that could cause connectivity problems (see the expiry-check sketch after this list).
- Envoy Logs: A segment of the Envoy container logs from the relevant timeframe, potentially showing error messages, configuration issues, or connectivity failures. If the log level was temporarily increased, these logs will contain more detailed information.
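For instance, once the `/certs` output is collected, certificate expiry can be checked directly. The following is a minimal sketch, assuming `curl` and `jq` are available and the Admin API is reachable on `localhost:19000` (for example via a port-forward); the field names follow Envoy's `/certs` JSON output:

```shell
# Report days until expiration for each certificate in the proxy's chain.
# Assumes the Admin API is reachable on localhost:19000 (e.g., via port-forward).
curl -s http://localhost:19000/certs \
  | jq -r '.certificates[].cert_chain[] | "\(.path): expires in \(.days_until_expiration) days"'
```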
This comprehensive set of data will provide a detailed snapshot of the Envoy proxy's configuration, runtime state, and recent activity, enabling a more accurate diagnosis of the underlying issue within your Consul service mesh.
Prerequisites
To gather the required Envoy dataplane information, ensure you have the following:
- Access to the Kubernetes/OpenShift Cluster: You need command-line access to your Kubernetes or OpenShift cluster where the Consul service mesh is running. This typically involves having `kubectl` or `oc` configured and authenticated to interact with your cluster.
- Identification of Affected Sidecars: Clearly identify the specific Envoy sidecar proxy experiencing the issue. Additionally, identify any related downstream sidecar proxies, such as ingress gateways or API gateways, that might be involved in the problem. Knowing the pod names and namespaces of these sidecars is essential.
- Network Access to Envoy Admin API: You need a way to access the Envoy Admin API endpoints (on port `19000`) of the identified sidecar containers (see the access sketch after this list). This can be achieved through one of the following methods:
  - Executing commands inside the container: Using tools like `kubectl exec -it <pod-name> -n <namespace> -- sh` to gain a shell inside the container.
  - Port Forwarding: Using `kubectl port-forward <pod-name> <local-port>:19000 -n <namespace>` (or `oc port-forward`) to expose the Admin API on your local machine.
- Tools for Making HTTP Requests: You will need a tool capable of making HTTP GET and POST requests to the Envoy Admin API endpoints. Common tools include `curl`, `wget`, or even a web browser (for GET requests).
- (Optional but Recommended) Familiarity with `kubectl logs`: To collect Envoy logs, familiarity with the `kubectl logs` command (or `oc logs` on OpenShift) for retrieving container logs is beneficial.
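As a quick verification that these prerequisites are in place, here is a minimal sketch; the pod name, namespace, and container name are placeholders for your own values:

```shell
# Find the affected pod (placeholder namespace).
kubectl get pods -n my-namespace

# Option 1: open a shell inside the pod (add -c <container> to target the
# Envoy sidecar container if the pod runs multiple containers).
kubectl exec -it my-pod -n my-namespace -- sh

# Option 2: forward the Envoy Admin API to your local machine, then
# query it from another terminal.
kubectl port-forward my-pod 19000:19000 -n my-namespace
curl -s http://localhost:19000/config_dump
```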
Use Case
The process of gathering Envoy dataplane information as described serves several critical use cases in operating and troubleshooting a Consul service mesh:
- Diagnosing Connectivity Issues: When services within the mesh fail to communicate, collecting the `config_dump`, `clusters`, `listeners`, and `certs` output from the involved Envoy sidecars can reveal misconfigurations, missing upstream clusters, incorrect listener setup, or TLS certificate problems that are preventing connections. For example, a missing cluster for a downstream service in the `clusters` output would immediately point to a service discovery issue.
- Identifying Performance Bottlenecks: High resource usage or excessive connection counts in the `stats` output can indicate performance bottlenecks within a specific Envoy proxy. Examining the stats for upstream connections can also reveal if a particular service is experiencing a high volume of requests or connection errors.
- Troubleshooting Routing Problems: If traffic is not being routed as expected within the mesh, the `config_dump` and `listeners` outputs can be analyzed to verify the virtual service and route configurations applied to the Envoy proxies. Incorrectly configured listeners or routes will be evident in these outputs.
- Debugging 503 Errors: When encountering service unavailable (503) errors, the `stats`, `clusters`, and `certs` outputs can provide valuable clues. High numbers of upstream connection failures in the `stats` output, a missing or unhealthy cluster in the `clusters` output, or certificate validation errors in the `certs` output can all lead to 503 errors (see the stats-filtering sketch after this list).
- Investigating Configuration Propagation Delays: By comparing the `config_dump` output across multiple Envoy proxies, operators can identify if there are delays or inconsistencies in the propagation of Consul service mesh configurations. This can help pinpoint issues with the Consul control plane or the XDS communication (see the comparison sketch after this list).
- Analyzing Runtime Errors: Envoy logs often contain detailed information about errors encountered during runtime, such as issues with configuration parsing, connection establishment failures, or policy enforcement. Collecting logs around the time of an incident can provide crucial context for understanding the sequence of events leading to the problem.
- Verifying Policy Enforcement: When troubleshooting issues related to traffic management policies (e.g., timeouts, retries, circuit breaking) or security policies (e.g., TLS, authorization), the `config_dump` output will show how these policies have been translated into Envoy configurations. Examining the relevant sections of the config dump can confirm if the policies are being applied correctly.
- Providing Data for Support: When seeking assistance from HashiCorp Support, providing the requested Envoy Admin API outputs and logs ensures that support engineers have the necessary information to efficiently diagnose the problem without requiring extensive back-and-forth communication to gather basic data.
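For the 503 scenario above, here is a minimal sketch of filtering the `/stats` output for Envoy's standard upstream failure counters, assuming the Admin API is reachable on `localhost:19000` via a port-forward:

```shell
# Filter /stats for upstream connection failures and 5xx response counters.
# Assumes a port-forward to the affected sidecar on localhost:19000.
curl -s http://localhost:19000/stats \
  | grep -E 'upstream_cx_connect_fail|upstream_cx_connect_timeout|upstream_rq_5xx'
```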
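Similarly, for investigating propagation delays, here is a sketch of comparing normalized `config_dump` output from two proxies. The pod names and namespace are placeholders; it assumes `curl` is available inside the sidecar containers and `jq` locally (if the sidecar image lacks `curl`, collect via port-forwards instead). Note that configurations legitimately differ between services, so this comparison is most meaningful between replicas of the same service:

```shell
# Dump and normalize the configuration of two proxies, then diff them.
# Add -c <container> if the pods run multiple containers.
kubectl exec pod-a -n my-namespace -- curl -s localhost:19000/config_dump > pod-a.json
kubectl exec pod-b -n my-namespace -- curl -s localhost:19000/config_dump > pod-b.json
diff <(jq -S . pod-a.json) <(jq -S . pod-b.json)
```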
In essence, collecting this Envoy dataplane information provides a detailed diagnostic snapshot of the service mesh at the proxy level, empowering operators and support teams to effectively identify, understand, and resolve a wide range of issues.
Procedure
To help us troubleshoot your Envoy dataplane issue in the Consul service mesh, provide the Support team with the following information from:
- The sidecar that is experiencing the issue
- Any related downstream sidecars (e.g., ingress or API gateway)
Envoy Admin API Endpoints
You can gather these using the Envoy Admin API on port 19000 inside the container or via a port-forward. Please provide the output for each relevant sidecar:
- Config Dump
  - Endpoint: `GET http://localhost:19000/config_dump`
  - Purpose: Verify applied configurations from Consul (e.g., service defaults, service resolvers).
- Stats
  - Endpoint: `GET http://localhost:19000/stats`
  - Purpose: Inspect resource usage and upstream connection counts to identify potential proxy-level issues.
- Clusters
  - Endpoint: `GET http://localhost:19000/clusters`
  - Purpose: Confirm that Consul-discovered upstream services are present and configured correctly.
- Listeners
  - Endpoint: `GET http://localhost:19000/listeners`
  - Purpose: Verify that the Consul XDS server has set up all needed listeners (e.g., port 20000).
- Certificates
  - Endpoint: `GET http://localhost:19000/certs`
  - Purpose: Check for any TLS certificate issues (e.g., expired leaf certs) causing connectivity failures or 503 errors.
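To collect all five outputs in one pass, the following is a minimal sketch; the output directory is a placeholder, and it assumes a port-forward to the sidecar is already running on `localhost:19000`:

```shell
# Save each Admin API output to a file for attachment to the support ticket.
OUT_DIR=./envoy-diagnostics   # placeholder output directory
mkdir -p "$OUT_DIR"
for endpoint in config_dump stats clusters listeners certs; do
  curl -s "http://localhost:19000/${endpoint}" > "${OUT_DIR}/${endpoint}.txt"
done
```

Repeat the collection for each affected sidecar, using a separate output directory per pod.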
Envoy Logs
Envoy’s logs can give additional insight, especially if a configuration or connectivity problem occurs during runtime. By default, these logs are typically available in your platform’s standard container logs (e.g., via `kubectl logs` on Kubernetes).
- Capture logs from the relevant Envoy sidecar containers around the timeframe of the issue.
- If necessary, you can adjust Envoy’s log level to capture more detailed logs.
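Here is a minimal sketch of capturing sidecar logs; the pod, namespace, and container names are placeholders, and the sidecar container name varies by deployment (e.g., `consul-dataplane`):

```shell
# Capture the last hour of Envoy sidecar logs around the incident window.
kubectl logs my-pod -n my-namespace -c consul-dataplane --since=1h > envoy.log

# If the container restarted, the previous instance's logs may also help.
kubectl logs my-pod -n my-namespace -c consul-dataplane --previous > envoy-previous.log
```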
Adjust Envoy Log Level via Admin API
You can dynamically change Envoy’s log level to gather more verbose logs. For example, to set the log level to debug on the fly:
`POST http://localhost:19000/logging?level=debug`
- Supported log levels include `trace`, `debug`, `info`, `warning`, `error`, `critical`, and `off`. You can revert to a lower log level when finished collecting the logs (e.g., `POST http://localhost:19000/logging?level=info`).
Note: Setting a very verbose log level (like trace or debug) can result in high log volume, so it’s best done briefly while reproducing the issue.
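Putting it together, here is a minimal sketch of the raise-reproduce-revert cycle using `curl` against a port-forwarded Admin API:

```shell
# Raise the log level, reproduce the issue, then revert to limit log volume.
curl -s -X POST "http://localhost:19000/logging?level=debug"

# ... reproduce the issue and capture the logs as described above ...

curl -s -X POST "http://localhost:19000/logging?level=info"
```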