Introduction
This article is a guide to monitoring Envoy metrics related to Consul services.
Primer
- Envoy: a service proxy that handles communication between applications within a service mesh.
- upstream: the upstream cluster in Envoy that receives requests.
- downstream: the downstream cluster in Envoy that connects to the upstream service.
- xDS: short for "* Discovery Service", the family of Envoy APIs used for dynamic configuration:
  - LDS (Listener Discovery Service)
  - RDS (Route Discovery Service)
  - CDS (Cluster Discovery Service)
  - EDS (Endpoint Discovery Service)
Prerequisite
- Access to Envoy stats via the admin page (by default at localhost:19000/stats)
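The stats page can be scraped programmatically as well as viewed in a browser. Below is a minimal sketch in Python, assuming the default admin address of localhost:19000; the helper name and filter string are illustrative only.

```python
# Minimal sketch: scrape Envoy's admin /stats endpoint and print matching metrics.
# Assumes the sidecar's admin API is on the default localhost:19000.
import urllib.request

def envoy_stats(filter_substring="", admin="http://localhost:19000"):
    """Return {stat_name: value} for stats whose name contains filter_substring."""
    with urllib.request.urlopen(f"{admin}/stats") as resp:
        body = resp.read().decode()
    stats = {}
    for line in body.splitlines():
        name, _, value = line.partition(": ")
        if filter_substring in name and value:
            stats[name] = value
    return stats

if __name__ == "__main__":
    for name, value in envoy_stats("upstream_cx").items():
        print(name, value)
```

The same admin interface also exposes /stats/prometheus if you prefer to scrape the metrics with Prometheus.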
Envoy Metrics
| Name | Type | Description/Example | Why is this helpful to monitor? |
| --- | --- | --- | --- |
| upstream_cx_total | Counter | Total upstream connections | Compare the total number of connections with how many were successful (2xx) or not (5xx). |
| upstream_cx_active | Gauge | Total active connections with the upstream | Useful for monitoring upstream connections in real time with a tool like Grafana. |
| upstream_cx_http1_total | Counter | Total HTTP/1.1 connections (most downstreams will use HTTP/1.1) | Helpful to see if the downstream is making too many HTTP requests. |
| upstream_cx_connect_fail | Counter | Total number of failed connections between services | Helpful to see if services are working properly in the service mesh. If there are 503 responses and this metric is increasing, Envoy logs and a tcpdump should give more insight. |
| upstream_rq_total | Counter | Total number of requests made between services | Good for checking the overall requests the downstream is making to the upstream. |
| default.total.match_count | Counter | Total matches to the upstream (e.g. making a request to "/public_api" increments this counter) | Helpful to confirm that service-router is working as intended. |
| upstream_rq_timeout | Counter | Total requests that timed out waiting for a response | If the downstream service is getting 503 responses, this stat shows whether it is hitting an Envoy timeout. |
| connect_authzrbac.allowed | Counter | Total requests that were allowed based on the RBAC (Role-Based Access Control) policy applied to the connections | Check this metric in the upstream sidecar to confirm that an allow service intention is working as intended. Currently the downstream sidecar doesn't increment this or connect_authzrbac.denied. |
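As a rough illustration of how these counters can be combined, the sketch below compares failed upstream connections and request timeouts against total upstream connections for one cluster. The admin address and the "cluster.counting" prefix are assumptions to adapt to your environment.

```python
# Sketch: estimate an upstream connection failure ratio from Envoy counters.
# Assumptions: admin API on localhost:19000, upstream cluster name starting
# with "counting" (adjust both to your environment).
import urllib.request

ADMIN = "http://localhost:19000"
CLUSTER_PREFIX = "cluster.counting"  # assumption: the upstream cluster's stat prefix

def read_counter(suffix):
    """Sum every counter under CLUSTER_PREFIX whose name ends with the given suffix."""
    with urllib.request.urlopen(f"{ADMIN}/stats") as resp:
        lines = resp.read().decode().splitlines()
    total = 0
    for line in lines:
        name, _, value = line.partition(": ")
        if name.startswith(CLUSTER_PREFIX) and name.endswith(suffix):
            total += int(value)
    return total

cx_total = read_counter("upstream_cx_total")
cx_fail = read_counter("upstream_cx_connect_fail")
rq_timeout = read_counter("upstream_rq_timeout")

print(f"connections: {cx_total}, connect failures: {cx_fail}, request timeouts: {rq_timeout}")
if cx_total:
    print(f"connect failure ratio: {cx_fail / cx_total:.2%}")
```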
xDS server metrics
Consul's xDS management server watches for service proxy changes on Consul clients and Consul servers (via the catalog) so they can be pushed to Envoy. The xDS server exposes a few metrics on Consul servers (not Envoy) that are helpful to monitor.
| Name | Unit | Type | Description/Example | Why is this helpful to monitor? |
| --- | --- | --- | --- | --- |
| consul.xds.server.streams | streams | gauge | Measures the number of active xDS streams handled by the server, split by protocol version. | Good for checking which protocol (envoy_discovery_v2 or envoy_discovery_v3) Envoy requests are using. |
| consul.xds.server.streamsUnauthenticated (added in Consul 1.15) | streams | gauge | Measures the number of active xDS streams handled by the server that are unauthenticated because ACLs are not enabled or ACL tokens were missing. | If Consul or Nomad isn't registering a service properly, it may be due to a missing token. Monitoring this metric can confirm why a service isn't registering. |
| consul.xds.server.idealStreamsMax (added in Consul 1.14) | streams | gauge | The maximum number of xDS streams per server, chosen to achieve a roughly even spread of load across servers. | Helpful to check across servers to understand whether resources are being exhausted on servers (via xDS load balancing). |
| consul.xds.server.streamDrained (added in Consul 1.14) | streams | counter | Counts the number of xDS streams that are drained when rebalancing the load between servers. | Helpful for confirming that streams are properly drained during xDS load balancing on Consul servers. |
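These metrics come from the Consul server's telemetry, and one way to spot-check them is the agent metrics HTTP endpoint. Below is a minimal sketch, assuming the server's HTTP API is reachable on localhost:8500; if ACLs are enabled, an X-Consul-Token header would also be needed.

```python
# Sketch: read consul.xds.server.* metrics from a Consul server's
# /v1/agent/metrics endpoint. Assumes HTTP API on localhost:8500 and no ACL token.
import json
import urllib.request

CONSUL = "http://localhost:8500"

with urllib.request.urlopen(f"{CONSUL}/v1/agent/metrics") as resp:
    metrics = json.load(resp)

# Gauges such as consul.xds.server.streams and idealStreamsMax
for gauge in metrics.get("Gauges", []):
    if gauge["Name"].startswith("consul.xds.server"):
        print(gauge["Name"], gauge["Value"], gauge.get("Labels", {}))

# Counters such as consul.xds.server.streamDrained
for counter in metrics.get("Counters", []):
    if counter["Name"].startswith("consul.xds.server"):
        print(counter["Name"], counter["Count"])
```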
Example:
Let's use the Counting Dashboard services to monitor stats.
Sequence of Events
- Client accesses the dashboard service via browser or curl request
- The dashboard service reaches the counting service through Envoy
- Dashboard's outbound listener 127.0.0.1_5000 reaches the public listener in the counting service (on port 20000)
- The public listener in the counting service reaches local_app, which connects to counting locally (on port 9003)
- local_app in the counting sidecar gets a response from the counting service, which is sent back to counting's public listener
- Counting's public listener sends the response to the dashboard service through its sidecar (the sketch below exercises this flow)
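To observe this sequence in the stats, generate a request to the dashboard and read the relevant listener counter before and after. The ports below are assumptions based on the demo layout (dashboard UI on localhost:9002, dashboard sidecar admin API on localhost:19000); depending on how the dashboard holds its connection to counting, the outbound counter may only move when a new connection to counting is opened.

```python
# Sketch: generate one request to the dashboard, then show how the dashboard
# sidecar's outbound listener counter moves. Assumed ports: dashboard UI on
# localhost:9002, dashboard sidecar admin API on localhost:19000.
import urllib.request

DASHBOARD = "http://localhost:9002"
ADMIN = "http://localhost:19000"
STAT = "listener.127.0.0.1_5000.downstream_cx_total"  # dashboard -> counting outbound listener

def read_stat(name):
    with urllib.request.urlopen(f"{ADMIN}/stats") as resp:
        for line in resp.read().decode().splitlines():
            stat, _, value = line.partition(": ")
            if stat == name:
                return int(value)
    return None

before = read_stat(STAT)
urllib.request.urlopen(DASHBOARD).read()  # client -> dashboard service
after = read_stat(STAT)
print(f"{STAT}: {before} -> {after}")
```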
Dashboard stats to monitor
Listeners in stats:
listener.127.0.0.1_5000 # dashboard's outbound listener used to reach the original destination (counting) specified by the client (dashboard)
listener.127.0.0.1_20000 # dashboard's public listener used for mTLS and to connect to local_app (dashboard)
listener.127.0.0.1_9002 # dashboard's listener that accepts inbound HTTP traffic, which is then directed to the upstream (counting)
Take note that listener.127.0.0.1_5000 and listener.127.0.0.1_9002 are meant to have traffic sent to the same destinations, but in different ways. listener.127.0.0.1_5000 leverages the original_dst filter to act as a transparent proxy, forwarding traffic to its original intended destination without any further processing. On the other hand, listener.127.0.0.1_9002 uses the http_connection_manager filter to manage HTTP connections and provides more sophisticated routing rules. This means it not only routes the incoming HTTP traffic to the appropriate destination based on the request domain and path, but also provides features like HTTP/2 and WebSocket support, HTTP L7 routing, retries, and more, offering a more feature-rich approach to handling traffic.
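The difference between the two listeners can be confirmed from the admin API's /config_dump endpoint, which shows the filters each listener is configured with. Below is a minimal sketch, assuming the dashboard sidecar's admin API on localhost:19000; it only prints filter names, so exact output depends on your Consul and Envoy versions.

```python
# Sketch: list each listener and its filter names from Envoy's /config_dump.
# Assumes the sidecar's admin API is on localhost:19000.
import json
import urllib.request

ADMIN = "http://localhost:19000"

with urllib.request.urlopen(f"{ADMIN}/config_dump") as resp:
    dump = json.load(resp)

for config in dump.get("configs", []):
    if not config.get("@type", "").endswith("ListenersConfigDump"):
        continue
    listeners = [l.get("active_state", {}).get("listener", {})
                 for l in config.get("dynamic_listeners", [])]
    listeners += [l.get("listener", {}) for l in config.get("static_listeners", [])]
    for listener in listeners:
        listener_filters = [f.get("name") for f in listener.get("listener_filters", [])]
        network_filters = [f.get("name")
                           for chain in listener.get("filter_chains", [])
                           for f in chain.get("filters", [])]
        print(listener.get("name"),
              "listener_filters:", listener_filters,
              "filters:", network_filters)
```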
Clusters in stats:
cluster.consul-dataplane # only found in Consul 1.14+; this cluster handles updates from Consul to Envoy
cluster.counting.default.dc1.internal # the upstream cluster
cluster.local_app # local_app in this case is dashboard, as that is what Envoy connects to locally
Example listener metrics:
listener.127.0.0.1_9002.downstream_cx_destroy
This metric (connections destroyed between upstream and downstream) should correlate with how many connections you expect your services to make. For example, if the dashboard service's connection count reaches 400, then roughly 400 connections will eventually be destroyed.
listener.127.0.0.1_20000.downstream_cx_total
The dashboard service reaches Envoy via the public listener (20000), so the number of connections between dashboard (local_app) and Envoy (the public listener) can be monitored with this metric.
Example cluster metrics:
cluster.counting.default.dc1.internal.b07f5740-6c72-36f9-b101-c90a9e878ac3.consul.membership_healthy
This is a gauge metric, which increases based on the number of healthy upstream instances (counting). For example, if the Consul UI shows two instances of counting but Envoy shows only one healthy endpoint in the cluster, that could point to an inconsistent state in Consul.
cluster.counting.default.dc1.internal.b07f5740-6c72-36f9-b101-c90a9e878ac3.consul.ssl.handshake
This is a counter metric, which increases with each SSL handshake initiated by the downstream (dashboard). This metric can confirm that services are reaching each other through the mesh via mTLS.
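Because the counting cluster's stat name embeds a trust-domain UUID, it is easier to match these stats by pattern than by exact name. Below is a minimal sketch; the admin address and the "cluster.counting" prefix are assumptions.

```python
# Sketch: print health and TLS handshake stats for the counting upstream cluster,
# matching by pattern so the trust-domain UUID in the name doesn't matter.
# Assumes the dashboard sidecar's admin API is on localhost:19000.
import re
import urllib.request

ADMIN = "http://localhost:19000"
PATTERN = re.compile(
    r"^cluster\.counting\..*\.(membership_healthy|membership_total|ssl\.handshake)$"
)

with urllib.request.urlopen(f"{ADMIN}/stats") as resp:
    for line in resp.read().decode().splitlines():
        name, _, value = line.partition(": ")
        if PATTERN.match(name):
            print(name, value)
```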
Counting (upstream) stats to monitor
The stats used above can also be used for upstreams like the counting service. One stat that can be used to check that the upstream is reachable is the following:
local_app::127.0.0.1:9003::health_flags::healthy
This stat can be found at the URL http://localhost:19000/clusters. The local_app in this metric refers to the counting service.
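The /clusters output is plain text with "::"-delimited fields, so the health flag can be extracted directly. Below is a minimal sketch, assuming the counting sidecar's admin API is on localhost:19000.

```python
# Sketch: parse Envoy's /clusters output and print each endpoint's health_flags.
# Assumes the counting sidecar's admin API is on localhost:19000.
import urllib.request

ADMIN = "http://localhost:19000"

with urllib.request.urlopen(f"{ADMIN}/clusters") as resp:
    for line in resp.read().decode().splitlines():
        parts = line.split("::")
        # Lines look like: local_app::127.0.0.1:9003::health_flags::healthy
        if len(parts) == 4 and parts[2] == "health_flags":
            cluster, endpoint, _, flags = parts
            print(f"{cluster} {endpoint} -> {flags}")
```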