Introduction
Problem
While upgrading Consul servers from 1.14.9 to 1.15.7, the consul-connect-injector and mesh gateway pods become stuck in a CrashLoopBackOff state. The environment is a WAN-federated setup with a large number of secondary DCs. The primary DC has already been upgraded to 1.15.7+ent.
- The primary DC Consul servers run on a Kubernetes cluster, while all secondary DC Consul servers run on VMs.
PS C:\Users\vt255011> kubectl get pods -n consul
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
consul-consul-connect-injector-XXXXXXX1 0/1 Running 95 (51s ago) 5h23m 100.127.240.10 aks-agentpool-33018690-vmss000001 <none> <none>
consul-consul-connect-injector-XXXXXXXX2 0/1 CrashLoopBackOff 95 (21s ago) 5h23m 100.127.240.65 aks-agentpool-33018690-vmss00000w <none> <none>
consul-consul-mesh-gateway-XXXXXXXXXX 0/1 Init:CrashLoopBackOff 42 (111s ago) 5h23m 100.127.240.12 aks-agentpool-33018690-vmss000001 <none> <none>
consul-consul-mesh-gateway-XXXXXXXXXX 0/1 Init:0/1 41 (6m24s ago) 5h23m 100.127.240.132 aks-agentpool-33018690-vmss00000w <none> <none>
consul-consul-server-acl-init-jwqs2 0/1 Completed 0 5h23m 100.127.240.134 aks-agentpool-33018690-vmss00000w <none> <none>
- The connect-injector logs show these errors:
2024-01-05T07:47:17.472Z [DEBUG] consul-server-connection-manager: gRPC resolver failed to update connection address: error="bad resolver state"
2024-01-05T07:47:17.472Z [ERROR] consul-server-connection-manager: connection error: error="fetching supported dataplane features: rpc error: code = Unauthenticated desc = ACL system must be bootstrapped before making any requests that require authorization: ACL not found"
Prerequisites
- Consul upgrade from v1.14.9+ent to v1.15.7+ent
- Consul-k8s v1.0.9
Cause
- When the sidecar injector (connect-injector) containers start, they use the consul-server-connection-manager library to talk to the Consul servers over gRPC in two steps:
  - First, the injector authenticates to obtain an ACL token.
    - This token is created in the primary DC and is then replicated to the secondary DCs.
  - Then, it uses the token to fetch the dataplane features that the server supports, among other details.
- The Consul server is returning an ACL not found error.
  - This means that the token used by the init container is still unknown to the servers in the secondary DC.
  - Based on the startupProbe parameters, the connect-injector pod can restart after ~60 seconds in this situation.
- A restart of the connect-injector pod provisions a new ACL token.
  - This resets the failure scenario: the secondary's connect-injector must again wait until replication of the new token succeeds.
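The sequence above can be illustrated with a toy model. This is only a sketch: `SecondaryServer`, `injector_becomes_ready`, the retry interval, and the replication delays are hypothetical names and values chosen for illustration, and none of this is the consul-server-connection-manager API.

```python
from dataclasses import dataclass


@dataclass
class SecondaryServer:
    """Toy stand-in for a secondary-DC server (hypothetical, not a Consul API)."""

    replication_delay_s: float  # assumed time for a new token to reach this DC

    def knows_token(self, token_created_at: float, now: float) -> bool:
        # The token is only usable here once replication has caught up.
        return now - token_created_at >= self.replication_delay_s


def injector_becomes_ready(
    server: SecondaryServer,
    startup_window_s: float,
    retry_interval_s: float = 1.0,
) -> bool:
    """Return True if the token replicates before the startup window closes."""
    token_created_at = 0.0  # step 1: the ACL login creates the token in the primary DC
    now = token_created_at
    while now <= startup_window_s:
        # Step 2: use the token to fetch the supported dataplane features.
        if server.knows_token(token_created_at, now):
            return True
        # Until then the server answers "ACL not found"; wait and retry.
        now += retry_interval_s
    # The startup probe gives up, the pod restarts, and a new token is created.
    return False


if __name__ == "__main__":
    slow_dc = SecondaryServer(replication_delay_s=90)
    print(injector_becomes_ready(slow_dc, startup_window_s=60))    # window too short
    print(injector_becomes_ready(slow_dc, startup_window_s=1500))  # window long enough
```

With an assumed replication delay of 90 seconds, a 60-second window never sees the token, and every restart starts the wait over with a fresh token. A longer window lets the same login succeed.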
Overview of possible solutions
Solution
Because there are many secondary datacenters, the pod needs to stay alive longer than the default for replication to complete. Replication is typically quick, but network conditions (usually caused by the setup or an incident) can delay it. The suggestion is to increase the two RPC parameters below and to extend the connect-injector startup probe.
- Increase rpc_client_timeout from the default 60s to 180s in the primary datacenter, and then in the secondary datacenters.
- Increase rpc_max_conns_per_client from 100 to 200 in the primary datacenter, and then in the secondary datacenters.
- Increase the connect-injector startupProbe['periodSeconds'] to 5 and startupProbe['failureThreshold'] to 300.
  - This keeps the connect-injector pod alive longer than the default 60s so that the token from a successful ACL login has time to replicate.
  - See "Protect slow starting containers with startup probes" in the Kubernetes documentation for more information.
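As a rough illustration of the values above, the following sketch computes the startup window implied by the suggested probe settings and prints the two RPC settings as a JSON fragment. It assumes both RPC parameters live under the `limits` block of the Consul agent configuration. The helper name `startup_window_seconds` is hypothetical, and how the fragment is delivered to the servers (VM config file or Helm values) depends on the deployment.

```python
import json


def startup_window_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate time a startup probe tolerates before the container is restarted.

    Kubernetes allows roughly failureThreshold * periodSeconds; any initial
    delay or per-probe timeout is ignored in this estimate.
    """
    return period_seconds * failure_threshold


# Suggested probe values from this article.
suggested_window = startup_window_seconds(period_seconds=5, failure_threshold=300)

# Suggested RPC values from this article, assumed to sit under `limits`.
limits_config = {
    "limits": {
        "rpc_client_timeout": "180s",
        "rpc_max_conns_per_client": 200,
    }
}

if __name__ == "__main__":
    print(f"startup window: {suggested_window}s ({suggested_window // 60} minutes)")
    print(json.dumps(limits_config, indent=2))
```

The suggested probe settings give a window of 1500 seconds (25 minutes), compared with the default of about 60 seconds, which leaves far more room for token replication across many secondary datacenters.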
Outcome
If the connect-injector pod starts without errors or restarts, then that is confirmation that the ACL token is successfully replicated. Once the connect-injector pod comes up, the mesh gateway pod should follow.
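One way to check the outcome is to inspect the pod list as JSON. The sketch below is a minimal example that assumes its input is the parsed output of `kubectl get pods -n consul -o json`. The function name `unready_pods` is hypothetical, and it checks container readiness only, not restart counts.

```python
import json
import sys


def unready_pods(pod_list: dict, name_fragment: str) -> list[str]:
    """Return names of matching pods that have no ready containers or any unready one."""
    result = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        if name_fragment not in name:
            continue
        statuses = pod.get("status", {}).get("containerStatuses", [])
        if not statuses or not all(s.get("ready", False) for s in statuses):
            result.append(name)
    return result


if __name__ == "__main__":
    # Example: kubectl get pods -n consul -o json | python check_pods.py
    pods = json.load(sys.stdin)
    for fragment in ("connect-injector", "mesh-gateway"):
        print(fragment, "not ready:", unready_pods(pods, fragment))
```

An empty list for both name fragments indicates that the connect-injector and mesh gateway pods report all containers as ready.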