Introduction
This issue was first observed in WAN-federated Consul environments where the primary datacenter runs on Kubernetes with Consul version 1.15.7 and the secondary datacenters run on VMs with Consul version 1.14.9.
Problem
Upgrading the Consul servers from 1.14.9 to 1.15.7 on a secondary datacenter fails, with the connect-injector and mesh-gateway pods stuck in a "CrashLoopBackOff" state.
$ kubectl get pods -n consul
NAME                                      READY   STATUS                  RESTARTS         AGE     IP                NODE
consul-consul-connect-injector-XXXXXXX1   0/1     Running                 95 (51s ago)     5h23m   100.127.240.10    aks-agentpool-33018690-vmss000001
consul-consul-connect-injector-XXXXXXX2   0/1     CrashLoopBackOff        95 (21s ago)     5h23m   100.127.240.65    aks-agentpool-33018690-vmss00000w
consul-consul-mesh-gateway-XXXXXXXXXX     0/1     Init:CrashLoopBackOff   42 (111s ago)    5h23m   100.127.240.12    aks-agentpool-33018690-vmss000001
consul-consul-mesh-gateway-XXXXXXXXXX     0/1     Init:0/1                41 (6m24s ago)   5h23m   100.127.240.132   aks-agentpool-33018690-vmss00000w
consul-consul-server-acl-init-jwqs2       0/1     Completed               0                5h23m   100.127.240.134   aks-agentpool-33018690-vmss00000w
- In the connect-injector pod logs, the following errors are observed:
[DEBUG] consul-server-connection-manager: gRPC resolver failed to update connection address: error="bad resolver state"
[ERROR] consul-server-connection-manager: connection error: error="fetching supported dataplane features: rpc error: code = Unauthenticated desc = ACL system must be bootstrapped before making any requests that require authorization: ACL not found"
Cause
- When the connect-injector pods start, they communicate with the Consul servers over gRPC to authenticate and obtain an ACL token.
- This token is created in the primary datacenter and replicated to the secondary datacenters. It is used to fetch, among other details, the dataplane features that the server supports.
- The "ACL not found" error indicates that the token used by the init container of the connect-injector pod is still unknown to the servers in the secondary datacenter because it has not yet been replicated from the primary datacenter.
- Based on the "startupProbe" parameter, the connect-injector pod automatically restarts after 60 seconds, which starts the whole process over.
- Token replication is typically quick, but network conditions caused by the number of secondary datacenters, or a misconfiguration, can delay it.
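One way to see whether token replication is lagging is to query the ACL replication status endpoint (`/v1/acl/replication`) on a server in the secondary datacenter. The sketch below is illustrative only: it assumes `CONSUL_HTTP_ADDR` holds a full URL including the scheme, that `CONSUL_HTTP_TOKEN` holds a token allowed to read ACLs, and the function names are hypothetical helpers, not part of Consul.

```shell
#!/usr/bin/env bash
# Sketch: check ACL replication status on a secondary-datacenter server.
# Assumption: CONSUL_HTTP_ADDR is a full URL (scheme included) that reaches
# a Consul server in the secondary datacenter.
CONSUL_HTTP_ADDR="${CONSUL_HTTP_ADDR:-http://127.0.0.1:8500}"

# Build the URL of the replication status endpoint (hypothetical helper).
acl_replication_url() {
  printf '%s/v1/acl/replication' "${CONSUL_HTTP_ADDR%/}"
}

# Fetch the replication status as JSON (hypothetical helper).
check_acl_replication() {
  curl --silent --show-error --fail \
    --header "X-Consul-Token: ${CONSUL_HTTP_TOKEN:-}" \
    "$(acl_replication_url)"
}

# Only contact the server when explicitly asked to, e.g. `./script.sh check`.
if [ "${1:-}" = "check" ]; then
  check_acl_replication
fi
```

In the JSON response, fields such as `ReplicationType`, `ReplicatedTokenIndex`, `LastSuccess`, and `LastError` give an indication of whether tokens are being replicated and when replication last succeeded.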
Solution
Tune the RPC connection timeout and the maximum number of connections per client, along with the connect-injector "startupProbe", to allow the token replication to complete.
- Increase rpc_client_timeout from the default 60s to 180s in the primary datacenter and then in the secondary datacenters.
- Increase rpc_max_conns_per_client from 100 to 200 in the primary datacenter and then in the secondary datacenters.
- Increase the connect-injector pods' startupProbe['periodSeconds'] to 5 and startupProbe['failureThreshold'] to 300.
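The settings above could be applied roughly as sketched below. This is an illustration under several assumptions, not a definitive procedure: the two RPC settings are assumed to go in the `limits` block of the server agent configuration, the deployment name `consul-consul-connect-injector` is inferred from the pod names shown earlier, the injector is assumed to be the first container in the pod template, and the helper function names are hypothetical. With these probe values the pod gets up to 5 × 300 = 1500 seconds (25 minutes) to start before Kubernetes restarts it.

```shell
#!/usr/bin/env bash
# Values taken from the steps above.
PERIOD_SECONDS=5
FAILURE_THRESHOLD=300

# Assumptions: namespace and deployment name inferred from the pod listing.
NAMESPACE="${NAMESPACE:-consul}"
DEPLOYMENT="${DEPLOYMENT:-consul-consul-connect-injector}"

# Maximum time the startup probe allows before the container is restarted.
startup_window=$((PERIOD_SECONDS * FAILURE_THRESHOLD))
echo "startup probe window: ${startup_window}s"

# Agent configuration fragment for the Consul servers (assumed to live in
# the "limits" block); apply to the primary datacenter first.
limits_hcl() {
  cat <<'EOF'
limits {
  rpc_client_timeout       = "180s"
  rpc_max_conns_per_client = 200
}
EOF
}

# JSON patch for the startup probe; container index 0 is an assumption.
probe_patch() {
  printf '[{"op":"replace","path":"/spec/template/spec/containers/0/startupProbe/periodSeconds","value":%d},{"op":"replace","path":"/spec/template/spec/containers/0/startupProbe/failureThreshold","value":%d}]' \
    "$PERIOD_SECONDS" "$FAILURE_THRESHOLD"
}

# Only touch the cluster when explicitly asked to, e.g. `./script.sh apply`.
if [ "${1:-}" = "apply" ]; then
  kubectl patch deployment "$DEPLOYMENT" --namespace "$NAMESPACE" \
    --type=json --patch "$(probe_patch)"
fi
```

Running `limits_hcl` prints the fragment so it can be merged into the existing server configuration. A direct `kubectl patch` may be overwritten by a later Helm upgrade, so treat it as a temporary measure.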
Outcome
If the connect-injector pod starts without errors or restarts, the ACL token has been replicated successfully. Once the connect-injector pod comes up, the mesh gateway pod should follow.
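To confirm this from the command line, something like the following sketch could be used. The deployment names are assumptions inferred from the pod names in the listing earlier in this article, the 1500s timeout simply mirrors a 5-second period with a failure threshold of 300, and `wait_for_rollout` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Assumptions: namespace and deployment names inferred from the pod listing.
NAMESPACE="${NAMESPACE:-consul}"
INJECTOR_DEPLOYMENT="${INJECTOR_DEPLOYMENT:-consul-consul-connect-injector}"
MESH_GATEWAY_DEPLOYMENT="${MESH_GATEWAY_DEPLOYMENT:-consul-consul-mesh-gateway}"
ROLLOUT_TIMEOUT="${ROLLOUT_TIMEOUT:-1500s}"

# Block until the given deployment is fully rolled out or the timeout expires.
wait_for_rollout() {
  kubectl rollout status "deployment/$1" \
    --namespace "$NAMESPACE" --timeout "$ROLLOUT_TIMEOUT"
}

# Only contact the cluster when explicitly asked to, e.g. `./script.sh verify`.
if [ "${1:-}" = "verify" ]; then
  # The injector must come up first; the mesh gateway should follow.
  wait_for_rollout "$INJECTOR_DEPLOYMENT" &&
    wait_for_rollout "$MESH_GATEWAY_DEPLOYMENT"
  kubectl get pods --namespace "$NAMESPACE"
fi
```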
References
- See the Kubernetes documentation "Protect slow starting containers with startup probes" for more information.