Introduction
This article is a guide to monitoring Envoy metrics related to Consul services.
Primer
- Envoy: a service proxy that handles communication between applications within a service mesh.
- upstream: the upstream cluster in Envoy that receives requests.
- downstream: the downstream cluster in Envoy that connects to the upstream service.
- xDS: short for "* Discovery Service", the family of Envoy APIs used for dynamic configuration:
  - LDS (Listener Discovery Service)
  - RDS (Route Discovery Service)
  - CDS (Cluster Discovery Service)
  - EDS (Endpoint Discovery Service)
Prerequisite
- Access to Envoy stats via the admin page (by default at localhost:19000/stats)
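The stats page can be scraped programmatically as well as viewed in a browser. Below is a minimal sketch in Python, assuming the default admin address of localhost:19000; the helper name and filter string are illustrative only.

```python
# Minimal sketch: scrape Envoy's admin /stats endpoint and print matching metrics.
# Assumes the sidecar's admin API is on the default localhost:19000.
import urllib.request

def envoy_stats(filter_substring="", admin="http://localhost:19000"):
    """Return {stat_name: value} for stats whose name contains filter_substring."""
    with urllib.request.urlopen(f"{admin}/stats") as resp:
        body = resp.read().decode()
    stats = {}
    for line in body.splitlines():
        name, _, value = line.partition(": ")
        if filter_substring in name and value:
            stats[name] = value
    return stats

if __name__ == "__main__":
    for name, value in envoy_stats("upstream_cx").items():
        print(name, value)
```

The same admin interface also exposes /stats/prometheus if you prefer to scrape the metrics with Prometheus.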
Envoy Metrics
| Name | Type | Description/Example | Why is this helpful to monitor? |
| --- | --- | --- | --- |
| upstream_cx_total | Counter | Total upstream connections | Compare the total number of connections with how many were successful (2xx) or not (5xx). |
| upstream_cx_active | Gauge | Total active connections with the upstream | Useful for monitoring upstream connections in real time with a tool like Grafana. |
| upstream_cx_http1_total | Counter | Total HTTP/1.1 connections (most downstreams will use HTTP/1.1) | Helpful to see if the downstream is making too many HTTP requests. |
| upstream_cx_connect_fail | Counter | Total number of failed connections between services | Helpful to see if services are working properly in the service mesh. If there are 503 responses and this metric is increasing, Envoy logs and a tcpdump should give more insight. |
| upstream_rq_total | Counter | Total number of requests made between services | Good for checking the overall requests the downstream is making to the upstream. |
| default.total.match_count | Counter | Total matches to the upstream (e.g. making a request to "/public_api" increments this counter) | Helpful to confirm that service-router is working as intended. |
| upstream_rq_timeout | Counter | Total requests that timed out waiting for a response | If the downstream service is getting 503 responses, this stat shows whether it is hitting an Envoy timeout. |
| connect_authzrbac.allowed | Counter | Total requests that were allowed based on the RBAC (Role-Based Access Control) policy applied to the connections | Check this metric in the upstream sidecar to confirm that an allow service intention is working as intended. Currently the downstream sidecar doesn't increment this or connect_authzrbac.denied. |
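As a rough illustration of how these counters can be combined, the sketch below compares failed upstream connections and request timeouts against total upstream connections for one cluster. The admin address and the "cluster.counting" prefix are assumptions to adapt to your environment.

```python
# Sketch: estimate an upstream connection failure ratio from Envoy counters.
# Assumptions: admin API on localhost:19000, upstream cluster name starting
# with "counting" (adjust both to your environment).
import urllib.request

ADMIN = "http://localhost:19000"
CLUSTER_PREFIX = "cluster.counting"  # assumption: the upstream cluster's stat prefix

def read_counter(suffix):
    """Sum every counter under CLUSTER_PREFIX whose name ends with the given suffix."""
    with urllib.request.urlopen(f"{ADMIN}/stats") as resp:
        lines = resp.read().decode().splitlines()
    total = 0
    for line in lines:
        name, _, value = line.partition(": ")
        if name.startswith(CLUSTER_PREFIX) and name.endswith(suffix):
            total += int(value)
    return total

cx_total = read_counter("upstream_cx_total")
cx_fail = read_counter("upstream_cx_connect_fail")
rq_timeout = read_counter("upstream_rq_timeout")

print(f"connections: {cx_total}, connect failures: {cx_fail}, request timeouts: {rq_timeout}")
if cx_total:
    print(f"connect failure ratio: {cx_fail / cx_total:.2%}")
```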
xDS server metrics
Consul's xDS management server watches for service proxy changes on Consul clients and Consul servers (via the catalog) so they can be pushed to Envoy. The xDS server exposes a few metrics on Consul servers (not Envoy) that are helpful to monitor.
| Name | Unit | Type | Description/Example | Why is this helpful to monitor? |
| --- | --- | --- | --- | --- |
| consul.xds.server.streams | streams | gauge | Measures the number of active xDS streams handled by the server, split by protocol version. | Good for checking which protocol (envoy_discovery_v2 or envoy_discovery_v3) Envoy requests are using. |
| consul.xds.server.streamsUnauthenticated (added in Consul 1.15) | streams | gauge | Measures the number of active xDS streams handled by the server that are unauthenticated because ACLs are not enabled or ACL tokens were missing. | If Consul or Nomad isn't registering a service properly, it may be due to a missing token. Monitoring this metric can confirm why a service isn't registering. |
| consul.xds.server.idealStreamsMax (added in Consul 1.14) | streams | gauge | The maximum number of xDS streams per server, chosen to achieve a roughly even spread of load across servers. | Helpful to check across servers to understand whether resources are being exhausted on servers (via xDS load balancing). |
| consul.xds.server.streamDrained (added in Consul 1.14) | streams | counter | Counts the number of xDS streams that are drained when rebalancing the load between servers. | Helpful for confirming that streams are properly drained during xDS load balancing on Consul servers. |
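These metrics come from the Consul server's telemetry, and one way to spot-check them is the agent metrics HTTP endpoint. Below is a minimal sketch, assuming the server's HTTP API is reachable on localhost:8500; if ACLs are enabled, an X-Consul-Token header would also be needed.

```python
# Sketch: read consul.xds.server.* metrics from a Consul server's
# /v1/agent/metrics endpoint. Assumes HTTP API on localhost:8500 and no ACL token.
import json
import urllib.request

CONSUL = "http://localhost:8500"

with urllib.request.urlopen(f"{CONSUL}/v1/agent/metrics") as resp:
    metrics = json.load(resp)

# Gauges such as consul.xds.server.streams and idealStreamsMax
for gauge in metrics.get("Gauges", []):
    if gauge["Name"].startswith("consul.xds.server"):
        print(gauge["Name"], gauge["Value"], gauge.get("Labels", {}))

# Counters such as consul.xds.server.streamDrained
for counter in metrics.get("Counters", []):
    if counter["Name"].startswith("consul.xds.server"):
        print(counter["Name"], counter["Count"])
```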
Example:
Let's use the Counting Dashboard services to monitor stats.
Sequence of Events
- Client accesses the dashboard service via browser or curl request
- The dashboard service reaches the counting service through Envoy
- Dashboard's outbound listener 127.0.0.1_5000 reaches the public listener in the counting service (on port 20000)
- The public listener in the counting service reaches local_app, which connects to counting locally (on port 9003)
- local_app in the counting sidecar gets a response from the counting service, which is sent back to counting's public listener
- Counting's public listener sends the response to the dashboard service through its sidecar (the sketch below exercises this flow)
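To observe this sequence in the stats, generate a request to the dashboard and read the relevant listener counter before and after. The ports below are assumptions based on the demo layout (dashboard UI on localhost:9002, dashboard sidecar admin API on localhost:19000); depending on how the dashboard holds its connection to counting, the outbound counter may only move when a new connection to counting is opened.

```python
# Sketch: generate one request to the dashboard, then show how the dashboard
# sidecar's outbound listener counter moves. Assumed ports: dashboard UI on
# localhost:9002, dashboard sidecar admin API on localhost:19000.
import urllib.request

DASHBOARD = "http://localhost:9002"
ADMIN = "http://localhost:19000"
STAT = "listener.127.0.0.1_5000.downstream_cx_total"  # dashboard -> counting outbound listener

def read_stat(name):
    with urllib.request.urlopen(f"{ADMIN}/stats") as resp:
        for line in resp.read().decode().splitlines():
            stat, _, value = line.partition(": ")
            if stat == name:
                return int(value)
    return None

before = read_stat(STAT)
urllib.request.urlopen(DASHBOARD).read()  # client -> dashboard service
after = read_stat(STAT)
print(f"{STAT}: {before} -> {after}")
```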
Dashboard stats to monitor
Listeners in stats:
listener.127.0.0.1_5000 # dashboard's outbound listener used to reach the original destination (counting) specified by the client (dashboard)
listener.127.0.0.1_20000 # dashboard's public listener used for mTLS and to connect to local_app (dashboard)
listener.127.0.0.1_9002 # dashboard's listener that accepts inbound HTTP traffic, which is then directed to the upstream (counting)
Take note that listener.127.0.0.1_5000 and listener.127.0.0.1_9002 are meant to have traffic sent to the same destinations, but in different ways. listener.127.0.0.1_5000 leverages the original_dst filter to act as a transparent proxy, forwarding traffic to its original intended destination without any further processing. On the other hand, listener.127.0.0.1_9002 uses the http_connection_manager filter to manage HTTP connections and provides more sophisticated routing rules. This means it not only routes the incoming HTTP traffic to the appropriate destination based on the request domain and path, but also provides features like HTTP/2 and WebSocket support, HTTP L7 routing, retries, and more, offering a more feature-rich approach to handling traffic.
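The difference between the two listeners can be confirmed from the admin API's /config_dump endpoint, which shows the filters each listener is configured with. Below is a minimal sketch, assuming the dashboard sidecar's admin API on localhost:19000; it only prints filter names, so exact output depends on your Consul and Envoy versions.

```python
# Sketch: list each listener and its filter names from Envoy's /config_dump.
# Assumes the sidecar's admin API is on localhost:19000.
import json
import urllib.request

ADMIN = "http://localhost:19000"

with urllib.request.urlopen(f"{ADMIN}/config_dump") as resp:
    dump = json.load(resp)

for config in dump.get("configs", []):
    if not config.get("@type", "").endswith("ListenersConfigDump"):
        continue
    listeners = [l.get("active_state", {}).get("listener", {})
                 for l in config.get("dynamic_listeners", [])]
    listeners += [l.get("listener", {}) for l in config.get("static_listeners", [])]
    for listener in listeners:
        listener_filters = [f.get("name") for f in listener.get("listener_filters", [])]
        network_filters = [f.get("name")
                           for chain in listener.get("filter_chains", [])
                           for f in chain.get("filters", [])]
        print(listener.get("name"),
              "listener_filters:", listener_filters,
              "filters:", network_filters)
```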
Clusters in stats:
cluster.consul-dataplane # only found in Consul 1.14+; this cluster handles updates from Consul to Envoy
cluster.counting.default.dc1.internal # the upstream cluster
cluster.local_app # local_app in this case is dashboard, as that is what Envoy connects to locally
Example listener metrics:
listener.127.0.0.1_9002.downstream_cx_destroy
This metric (connections destroyed between upstream and downstream) should correlate with how many connections you expect your services to make. For example, if the dashboard service's connection count reaches 400, then roughly 400 connections will eventually be destroyed.
listener.127.0.0.1_20000.downstream_cx_total
The dashboard service reaches Envoy via the public listener (20000), so the number of connections between dashboard (local_app) and Envoy (the public listener) can be monitored with this metric.
Example cluster metrics:
cluster.counting.default.dc1.internal.b07f5740-6c72-36f9-b101-c90a9e878ac3.consul.membership_healthy
This is a gauge metric, which increases based on the number of healthy upstream instances (counting). For example, if the Consul UI shows two instances of counting but Envoy shows only one healthy endpoint in the cluster, that could point to an inconsistent state in Consul.
cluster.counting.default.dc1.internal.b07f5740-6c72-36f9-b101-c90a9e878ac3.consul.ssl.handshake
This is a counter metric, which increases with each SSL handshake initiated by the downstream (dashboard). This metric can confirm that services are reaching each other through the mesh via mTLS.
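Because the counting cluster's stat name embeds a trust-domain UUID, it is easier to match these stats by pattern than by exact name. Below is a minimal sketch; the admin address and the "cluster.counting" prefix are assumptions.

```python
# Sketch: print health and TLS handshake stats for the counting upstream cluster,
# matching by pattern so the trust-domain UUID in the name doesn't matter.
# Assumes the dashboard sidecar's admin API is on localhost:19000.
import re
import urllib.request

ADMIN = "http://localhost:19000"
PATTERN = re.compile(
    r"^cluster\.counting\..*\.(membership_healthy|membership_total|ssl\.handshake)$"
)

with urllib.request.urlopen(f"{ADMIN}/stats") as resp:
    for line in resp.read().decode().splitlines():
        name, _, value = line.partition(": ")
        if PATTERN.match(name):
            print(name, value)
```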
Counting (upstream) stats to monitor
The stats used above can also be used for upstreams like the counting service. One stat that can be used to check that the upstream is reachable is the following:
local_app::127.0.0.1:9003::health_flags::healthy
This stat can be found at the URL http://localhost:19000/clusters. The local_app in this metric refers to the counting service.
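The /clusters output is plain text with "::"-delimited fields, so the health flag can be extracted directly. Below is a minimal sketch, assuming the counting sidecar's admin API is on localhost:19000.

```python
# Sketch: parse Envoy's /clusters output and print each endpoint's health_flags.
# Assumes the counting sidecar's admin API is on localhost:19000.
import urllib.request

ADMIN = "http://localhost:19000"

with urllib.request.urlopen(f"{ADMIN}/clusters") as resp:
    for line in resp.read().decode().splitlines():
        parts = line.split("::")
        # Lines look like: local_app::127.0.0.1:9003::health_flags::healthy
        if len(parts) == 4 and parts[2] == "health_flags":
            cluster, endpoint, _, flags = parts
            print(f"{cluster} {endpoint} -> {flags}")
```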