Introduction:
This article addresses an issue observed in environments where Consul dataplanes run on a Kubernetes cluster while the Consul servers are hosted externally on virtual machines. In this setup, when the external Consul servers are configured as a list of IP addresses and the pods are scaled to multiple replicas, the dataplanes often fail to establish and load-balance gRPC connections to the servers. This results in errors in the server and dataplane logs such as:
envoy.config(16) DeltaAggregatedResources gRPC config stream to consul-dataplane closed since 1503s ago: 8, this server has too many xDS streams open, please try another
Prerequisites:
- 3-node Consul cluster on VMs (TLS enabled)
- Consul Version: 1.21.x
- Helm chart version: <=1.7.x
- consul-k8s running on Kubernetes with externalServers values like the sample below:
global:
  logLevel: debug
  datacenter: dc1
  enabled: false
  enterpriseLicense:
    secretKey: <secret_key>
    secretName: <secret_name>
  tls:
    enabled: true
    enableAutoEncrypt: true
    verify: false
    httpsOnly: false
    caCert:
      secretKey: <secret_key>
      secretName: <secret_name>
    caKey:
      secretKey: <secret_key>
      secretName: <secret_name>
  name: consul
connectInject:
  enabled: true
  envoyExtraArgs: "--component-log-level upstream:debug,http:debug,router:debug,config:debug"
dns:
  enableRedirection: false
  enabled: false
externalServers:
  enabled: true
  hosts: # IPs of external Consul server VMs as a list
    - 192.168.105.xx
    - 192.168.105.yy
    - 192.168.105.zz
  httpsPort: 8501
  grpcPort: 8503
  k8sAuthMethodHost: 192.168.105.xy:6443
- Dataplane-injected application with a single replica.
- No Layer 4/Layer 7 load balancer between the Consul servers and the dataplanes.
Note:
The behaviour shown here is not specific to Helm chart version 1.7.x or Consul version 1.21.x; it exists in lower versions as well.
Environment Setup for reproduction:
- Validate that the servers are in quorum and that there are no pre-existing errors in the Consul server logs:
$ consul members
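Optionally, the Raft peer set can also be checked to confirm a leader has been elected and all servers are voters:
$ consul operator raft list-peers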
- Install Consul on Kubernetes using Helm:
helm install consul hashicorp/consul -n consul --values values.yaml --version 1.7.1 --wait --debug
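Once the release is installed, a quick check that the consul-k8s components are running (using the consul namespace from the command above):
kubectl get pods -n consul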
- Deploy a single-replica application, for example a fake-service backend, using the deployment manifest below:
# Service to expose backend
apiVersion: v1
kind: Service
metadata:
  name: backend
spec:
  selector:
    app: backend
  ports:
    - name: http
      protocol: TCP
      port: 80
      targetPort: 9090
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: backend
---
# Deployment for backend
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
  labels:
    app: backend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
      annotations:
        "consul.hashicorp.com/connect-inject": "true"
    spec:
      serviceAccountName: backend
      containers:
        - name: backend
          image: nicholasjackson/fake-service:v0.26.2
          ports:
            - containerPort: 9090
          env:
            - name: "LISTEN_ADDR"
              value: "0.0.0.0:9090"
            - name: "NAME"
              value: "backend"
            - name: "MESSAGE"
              value: "Response from backend"
kubectl apply -f backend.yaml -n consul
- You now have a healthy application running. Scale the application to 4 replicas:
kubectl scale deployment/backend --replicas=4 -n consul
After scaling, you will notice xDS gRPC load-balancing errors on the server nodes and in the consul-dataplane containers.
Error on server:
[ERROR] agent.envoy: Error handling ADS delta stream: xdsVersion=v3 error="rpc error: code = ResourceExhausted desc = this server has too many xDS streams open, please try another"
Error in the application's consul-dataplane container:
[warning] envoy.config(16) DeltaAggregatedResources gRPC config stream to consul-dataplane closed since 1503s ago: 8, this server has too many xDS streams open, please try another
[warning] envoy.config(16) DeltaAggregatedResources gRPC config stream to consul-dataplane closed since 1559s ago: 14, name resolver error: produced zero addresses
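For reference, these messages can typically be collected with commands along the following lines, assuming the deployment name and namespace from this example and that the Consul servers run under systemd as the consul unit:
# On a Consul server VM
journalctl -u consul | grep "too many xDS streams"
# Dataplane sidecar logs from the injected pod
kubectl logs deployment/backend -c consul-dataplane -n consul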
Cause:
In this setup, with an external Consul cluster of three or more nodes on VMs, consul-dataplane has to load balance gRPC connections across the Consul servers to receive xDS updates. With the current consul-dataplane design, it picks the IP-based SAN of the first server in the externalServers list by default and uses that as the SNI when talking to the other server members. gRPC connections to the other servers are discarded by consul-dataplane because it expects the IP SAN of a different server, while the certificate presented by each other server carries only its own IP in its SAN. This mistrust between consul-dataplane and the other servers breaks gRPC load balancing and produces the errors shown above.
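One way to observe this mismatch is to inspect the certificate each server presents on its gRPC TLS port and compare the SAN entries. The command below is a sketch using the sample IP and gRPC port from the values file above; substitute any server's address:
openssl s_client -connect 192.168.105.xx:8503 -showcerts </dev/null 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"
Each server's certificate lists only its own address under the IP SANs, while the DNS SANs (for example, server.dc1.consul in Consul-generated server certificates) are shared by all servers.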
Solution:
Adding the tlsServerName setting to the externalServers stanza in the values file resolves this issue. The value of tlsServerName must be a DNS SAN that is common to all servers in the Consul cluster. This name can be found by inspecting a server certificate and looking at its DNS SAN entries. With this setting, the SNI sent by the dataplane contains a server name that is common to all servers, so every server can establish trust with the consul-dataplanes and allow communication with them.
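As an illustration, assuming the Consul-generated DNS SAN server.dc1.consul is present in your server certificates (verify this against your own certificates first), the externalServers stanza would look like:
externalServers:
  enabled: true
  hosts:
    - 192.168.105.xx
    - 192.168.105.yy
    - 192.168.105.zz
  httpsPort: 8501
  grpcPort: 8503
  tlsServerName: server.dc1.consul
  k8sAuthMethodHost: 192.168.105.xy:6443
After updating the values file, re-apply the chart (for example, helm upgrade consul hashicorp/consul -n consul --values values.yaml); existing injected pods may need to be restarted to pick up the new SNI.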