WAN Federated Consul Upgrade Failed: CrashLoopBackOff with ACL not found Error – HashiCorp Help Center

Introduction

This issue was first observed in WAN-federated Consul environments where the primary datacenter runs on Kubernetes with Consul version 1.15.7 and the secondary datacenters run on VMs with Consul version 1.14.9.

Problem

Upgrading the Consul servers from 1.14.9 to 1.15.7 in a secondary datacenter fails, leaving the connect-injector and mesh-gateway pods stuck in a "CrashLoopBackOff" state.

$ kubectl get pods -n consul -o wide
NAME                                    READY   STATUS                  RESTARTS        AGE     IP              NODE
consul-consul-connect-injector-XXXXXXX1 0/1     Running                 95 (51s ago)    5h23m   100.127.240.10  aks-agentpool-33018690-vmss000001             
consul-consul-connect-injector-XXXXXXX2 0/1     CrashLoopBackOff        95 (21s ago)    5h23m   100.127.240.65  aks-agentpool-33018690-vmss00000w             
consul-consul-mesh-gateway-XXXXXXXXXX   0/1     Init:CrashLoopBackOff   42 (111s ago)   5h23m   100.127.240.12  aks-agentpool-33018690-vmss000001             
consul-consul-mesh-gateway-XXXXXXXXXX   0/1     Init:0/1                41 (6m24s ago)  5h23m   100.127.240.132 aks-agentpool-33018690-vmss00000w             
consul-consul-server-acl-init-jwqs2     0/1     Completed               0               5h23m   100.127.240.134 aks-agentpool-33018690-vmss00000w

The connect-injector pod logs show the following errors:

[DEBUG] consul-server-connection-manager: gRPC resolver failed to update connection address: error="bad resolver state"
[ERROR] consul-server-connection-manager: connection error: error="fetching supported dataplane features: rpc error: code = Unauthenticated desc = ACL system must be bootstrapped before making any requests that require authorization: ACL not found"

Cause

When the connect-injector pods start, they communicate with the Consul servers over gRPC to authenticate and obtain an ACL token.
This token is created in the primary datacenter and replicated to the secondary datacenters. It is used, among other things, to fetch the dataplane features that the servers support.
The "ACL not found" error indicates that the token used by the init container of the connect-injector pod is still unknown to the servers in the secondary datacenter because it has not yet been replicated from the primary datacenter.
Based on the "startupProbe" parameter, the connect-injector pod restarts automatically after 60 seconds, which starts the whole process over.
Token replication is typically quick, but network conditions, whether caused by the number of secondary datacenters or by a misconfiguration, can delay it.
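
One way to see whether token replication is keeping up is to query the ACL replication status endpoint (GET /v1/acl/replication) on a server in the secondary datacenter. The following is a minimal sketch and not part of the original procedure: the function name acl_replication_healthy is hypothetical, CONSUL_HTTP_ADDR is assumed to hold the full URL of a secondary-datacenter server, and the grep-based matching is a simplification of proper JSON parsing.

```shell
#!/bin/sh
# Hypothetical helper (not a Consul command).
# Reads the JSON body of GET /v1/acl/replication on stdin and succeeds
# only if replication is running and its type is "tokens".
acl_replication_healthy() {
  body=$(cat)
  printf '%s' "$body" | grep -Eq '"Running":[[:space:]]*true' &&
    printf '%s' "$body" | grep -Eq '"ReplicationType":[[:space:]]*"tokens"'
}

# Example usage (assumes CONSUL_HTTP_ADDR includes the scheme,
# for example http://127.0.0.1:8500):
#   curl -s "$CONSUL_HTTP_ADDR/v1/acl/replication" | acl_replication_healthy \
#     && echo "token replication is running" \
#     || echo "token replication is not running or not enabled for tokens"
```

If the check fails, the LastError and LastSuccess fields in the same response are a reasonable next place to look.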

Solution

Tune the RPC client timeout and the maximum number of RPC connections per client, along with the connect-injector "startupProbe", to give token replication time to complete.

Increase rpc_client_timeout (in the limits configuration block) from the default 60s to 180s, first in the primary datacenter and then in the secondary datacenters.
Increase rpc_max_conns_per_client (in the limits configuration block) from the default 100 to 200, first in the primary datacenter and then in the secondary datacenters.
Increase the connect-injector pods' startupProbe['periodSeconds'] to 5 and startupProbe['failureThreshold'] to 300, which lets the probe keep retrying for up to 1500 seconds (5 × 300) before the pod is restarted.
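
The steps above could be scripted roughly as follows. This is a sketch under stated assumptions, not the official procedure: the helper names, the /etc/consul.d/limits.hcl path, and the choice of the first container in the pod template are all hypothetical, and the deployment name consul-consul-connect-injector is inferred from the pod names shown earlier. For servers deployed with the Helm chart, the equivalent limits would normally be passed through the chart's server configuration rather than a file on disk, and a manual patch to a Helm-managed deployment may be overwritten by the next Helm upgrade.

```shell
#!/bin/sh
# Sketch only: helper names, file paths, and resource names are assumptions.

# Write the two RPC limits from the steps above as an HCL fragment.
# $1 is the target file, for example a file in the server's config directory.
write_limits_config() {
  cat > "$1" <<'EOF'
limits {
  rpc_client_timeout       = "180s"
  rpc_max_conns_per_client = 200
}
EOF
}

# Longest time, in seconds, that a startup probe keeps retrying:
# periodSeconds ($1) multiplied by failureThreshold ($2),
# ignoring any initial delay.
startup_window() {
  echo $(( $1 * $2 ))
}

# Print a JSON patch that sets periodSeconds ($1) and failureThreshold ($2)
# on the startupProbe of the first container in a Deployment's pod template.
# Assumes the container already defines a startupProbe.
startup_probe_patch() {
  printf '[{"op":"add","path":"/spec/template/spec/containers/0/startupProbe/periodSeconds","value":%d},' "$1"
  printf '{"op":"add","path":"/spec/template/spec/containers/0/startupProbe/failureThreshold","value":%d}]\n' "$2"
}

# Example usage:
#   write_limits_config /etc/consul.d/limits.hcl    # then restart or reload the server
#   kubectl patch deployment consul-consul-connect-injector -n consul \
#     --type=json -p "$(startup_probe_patch 5 300)"
#   startup_window 5 300                            # prints 1500
```

Keeping the probe values as arguments makes it easy to try a shorter window first and widen it only if replication still does not finish in time.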

Outcome

If the connect-injector pod starts without errors or restarts, the ACL token has been replicated successfully. Once the connect-injector pod is up, the mesh-gateway pod should follow.
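
To confirm the outcome, the pod listing can be filtered for anything that is still not ready. The helper below is a hypothetical sketch (not_ready_pods is not a kubectl feature) and assumes the default column layout of kubectl get pods, with NAME, READY, and STATUS as the first three columns.

```shell
#!/bin/sh
# Hypothetical helper: reads "kubectl get pods" output (with header) on stdin
# and prints the names of pods whose READY count is incomplete, skipping
# pods that have finished normally (STATUS "Completed").
not_ready_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")
    if (ready[1] != ready[2] && $3 != "Completed") print $1
  }'
}

# Example usage:
#   kubectl get pods -n consul | not_ready_pods
# No output means every listed pod is ready or completed.
```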

References

See the Kubernetes documentation "Protect slow starting containers with startup probes" for more information.