Introduction:
This article addresses an issue observed in environments where Consul dataplanes run on a Kubernetes cluster while the Consul servers are hosted externally on virtual machines. In this setup, when the external Consul servers are configured as a list of IP addresses and the pods are scaled to multiple replicas, the dataplanes often fail to establish and load-balance gRPC connections to the servers. This results in errors in the server and dataplane logs such as:
envoy.config(16) DeltaAggregatedResources gRPC config stream to consul-dataplane closed since 1503s ago: 8, this server has too many xDS streams open, please try another
Prerequisites:
- 3-node Consul cluster on VMs (TLS enabled)
- Consul Version: 1.21.x
- Helm chart version: <=1.7.x
- consul-k8s running on Kubernetes with externalServers values like the sample below:
global:
  logLevel: debug
  datacenter: dc1
  enabled: false
  enterpriseLicense:
    secretKey: <secret_key>
    secretName: <secret_name>
  tls:
    enabled: true
    enableAutoEncrypt: true
    verify: false
    httpsOnly: false
    caCert:
      secretKey: <secret_key>
      secretName: <secret_name>
    caKey:
      secretKey: <secret_key>
      secretName: <secret_name>
  name: consul
connectInject:
  enabled: true
  envoyExtraArgs: "--component-log-level upstream:debug,http:debug,router:debug,config:debug"
dns:
  enableRedirection: false
  enabled: false
externalServers:
  enabled: true
  hosts: # IPs of external Consul server VMs as a list
    - 192.168.105.xx
    - 192.168.105.yy
    - 192.168.105.zz
  httpsPort: 8501
  grpcPort: 8503
  k8sAuthMethodHost: 192.168.105.xy:6443
- Dataplane-injected application with a single replica.
- No Layer 4/Layer 7 load balancer between the Consul servers and the dataplanes.
Note:
The behaviour shown here is not specific to Helm chart version 1.7.x or Consul version 1.21.x; it exists in lower versions as well.
Environment Setup for reproduction:
- Validate that the servers are in quorum and that there are no pre-existing errors in the Consul server logs:
$ consul members
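Optionally, the Raft peer set can also be checked to confirm a leader has been elected and all servers are voters:
$ consul operator raft list-peers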
- Install Consul on Kubernetes using Helm:
helm install consul hashicorp/consul -n consul --values values.yaml --version 1.7.1 --wait --debug
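Once the release is installed, a quick check that the consul-k8s components are running (using the consul namespace from the command above):
kubectl get pods -n consul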
- Deploy a single-replica application, for example a fake-service backend, using the deployment manifest below:
# Service to expose backend
apiVersion: v1
kind: Service
metadata:
  name: backend
spec:
  selector:
    app: backend
  ports:
    - name: http
      protocol: TCP
      port: 80
      targetPort: 9090
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: backend
---
# Deployment for backend
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
  labels:
    app: backend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
      annotations:
        "consul.hashicorp.com/connect-inject": "true"
    spec:
      serviceAccountName: backend
      containers:
        - name: backend
          image: nicholasjackson/fake-service:v0.26.2
          ports:
            - containerPort: 9090
          env:
            - name: "LISTEN_ADDR"
              value: "0.0.0.0:9090"
            - name: "NAME"
              value: "backend"
            - name: "MESSAGE"
              value: "Response from backend"
kubectl apply -f backend.yaml -n consul
- You now have a healthy application running. Scale the application to 4 replicas:
kubectl scale deployment/backend --replicas=4 -n consul
After scaling, you will notice xDS gRPC load-balancing errors on the server nodes and in the consul-dataplane containers.
Error on server:
[ERROR] agent.envoy: Error handling ADS delta stream: xdsVersion=v3 error="rpc error: code = ResourceExhausted desc = this server has too many xDS streams open, please try another"
Error in the application's consul-dataplane container:
[warning] envoy.config(16) DeltaAggregatedResources gRPC config stream to consul-dataplane closed since 1503s ago: 8, this server has too many xDS streams open, please try another
[warning] envoy.config(16) DeltaAggregatedResources gRPC config stream to consul-dataplane closed since 1559s ago: 14, name resolver error: produced zero addresses
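For reference, these messages can typically be collected with commands along the following lines, assuming the deployment name and namespace from this example and that the Consul servers run under systemd as the consul unit:
# On a Consul server VM
journalctl -u consul | grep "too many xDS streams"
# Dataplane sidecar logs from the injected pod
kubectl logs deployment/backend -c consul-dataplane -n consul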
Cause:
In this setup, with an external Consul cluster of three or more nodes on VMs, consul-dataplane has to load balance gRPC connections across the Consul servers to receive xDS updates. With the current consul-dataplane design, it picks the IP-based SAN of the first server in the externalServers list by default and uses that as the SNI when talking to the other server members. gRPC connections to the other servers are discarded by consul-dataplane because it expects the IP SAN of a different server, while the certificate presented by each other server carries only its own IP in its SAN. This mistrust between consul-dataplane and the other servers breaks gRPC load balancing and produces the errors shown above.
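One way to observe this mismatch is to inspect the certificate each server presents on its gRPC TLS port and compare the SAN entries. The command below is a sketch using the sample IP and gRPC port from the values file above; substitute any server's address:
openssl s_client -connect 192.168.105.xx:8503 -showcerts </dev/null 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"
Each server's certificate lists only its own address under the IP SANs, while the DNS SANs (for example, server.dc1.consul in Consul-generated server certificates) are shared by all servers.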
Solution:
Adding the tlsServerName setting to the externalServers stanza in the values file resolves this issue. The value of tlsServerName must be a DNS SAN that is common to all servers in the Consul cluster. This name can be found by inspecting a server certificate and looking at its DNS SAN entries. With this setting, the SNI sent by the dataplane contains a server name that is common to all servers, so every server can establish trust with the consul-dataplanes and allow communication with them.
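As an illustration, assuming the Consul-generated DNS SAN server.dc1.consul is present in your server certificates (verify this against your own certificates first), the externalServers stanza would look like:
externalServers:
  enabled: true
  hosts:
    - 192.168.105.xx
    - 192.168.105.yy
    - 192.168.105.zz
  httpsPort: 8501
  grpcPort: 8503
  tlsServerName: server.dc1.consul
  k8sAuthMethodHost: 192.168.105.xy:6443
After updating the values file, re-apply the chart (for example, helm upgrade consul hashicorp/consul -n consul --values values.yaml); existing injected pods may need to be restarted to pick up the new SNI.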