Introduction
Consul server pods frequently log errors related to Envoy RPC communication failures and ACL subscription closures. These errors indicate potential disruptions in proxy configuration updates, xDS communication, and ACL token management, which could impact service mesh stability and connectivity.
Problem
The logs show repeated instances of the following errors:
- RPC Error: Error receiving new DeltaDiscoveryRequest; closing request channel
- ACL Subscription Closure: subscription closed by server, ACL change occurred
These issues suggest that the proxy configuration process is being interrupted, leading to potential failures in service mesh operations.
Error logs
2024-12-11T05:34:39.484Z [ERROR] agent.proxycfg.server-data-sources: subscribe call failed: err="subscription closed by server, ACL change occurred" failure_count=1 key=mesh topic=MeshConfig
2024-12-11T05:34:39.484Z [ERROR] agent.proxycfg.server-data-sources: subscribe call failed: err="subscription closed by server, ACL change occurred" failure_count=1 topic=JWTProvider wildcard_subject=true
Reproduction of the issue
- Deploy Consul using the values.yaml file below with the required configuration.
global:
  enabled: true
  name: consul
  logLevel: debug
  image: "hashicorp/consul-enterprise:1.20.1-ent"
  datacenter: dc1
  acls:
    manageSystemACLs: true
    createReplicationToken: true
  tls:
    enabled: true
    enableAutoEncrypt: true
    verify: true
  enableConsulNamespaces: true
  enterpriseLicense:
    secretName: consul-ent-license
    secretKey: key
  gossipEncryption:
    autoGenerate: true
server:
  replicas: 1
  bootstrapExpect: 1
  exposeService:
    enabled: true
    type: LoadBalancer
ui:
  enabled: true
  service:
    type: LoadBalancer
connectInject:
  enabled: true
  default: true
  replicas: 1
  apiGateway:
    manageExternalCRDs: true
    managedGatewayClass:
      enabled: true
      serviceType: LoadBalancer
- Deploy multiple applications, ensuring each includes the annotation consul.hashicorp.com/connect-inject: 'true'. For example:
apiVersion: v1
kind: Service
metadata:
  # This name will be the service name in Consul.
  name: static-server
spec:
  selector:
    app: static-server
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: static-server
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: static-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: static-server
  template:
    metadata:
      name: static-server
      labels:
        app: static-server
      annotations:
        'consul.hashicorp.com/connect-inject': 'true'
    spec:
      containers:
        - name: static-server
          image: hashicorp/http-echo:latest
          args:
            - -text="hello world"
            - -listen=:8080
          ports:
            - containerPort: 8080
              name: http
      # If ACLs are enabled, the serviceAccountName must match the Consul service name.
      serviceAccountName: static-server
- Wait for the application pods to come up and run successfully.
- Delete the application pods that had the connect-inject annotation set to true.
- Check the Consul server logs and observe the same error messages reported earlier.
This sequence of actions reproduced the error messages in the Consul server logs, confirming the issue.
Reproduction of Consul Server Error: DeltaDiscoveryRequest and ACL Change Failures
2024-12-12T08:52:30.752Z [WARN] agent: error getting server health from consul-server-0 error="context deadline exceeded"
2024-12-12T08:52:52.728Z [ERROR] agent.envoy: DeltaDiscoveryRequest error="rpc error: code = Canceled desc = context canceled"
2024-12-12T08:52:53.042Z [ERROR] agent.proxycfg: subscribe call failed, ACL change occurred, failure_count=1 key=mesh topic=MeshConfig
2024-12-12T08:55:49.026Z [ERROR] agent.envoy: DeltaDiscoveryRequest error="rpc error: code = Canceled desc = context canceled"
2024-12-12T08:55:55.151Z [ERROR] agent.proxycfg: subscribe call failed, ACL change occurred, failure_count=1 topic=JWTProvider
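When reviewing server logs for these messages, it can help to tally them by type and pull out fields such as topic and failure_count. The following is a minimal sketch, not an official tool: the labels, the `classify` and `tally` helpers, and the regular expressions are hypothetical and are based only on the log samples shown in this article.

```python
import re
from collections import Counter

# Substrings taken from the log samples in this article; the labels are
# hypothetical names chosen for this sketch.
PATTERNS = {
    "xds_stream_closed": re.compile(r"DeltaDiscoveryRequest"),
    "acl_subscription_closed": re.compile(r"ACL change occurred"),
}

# Matches key=value pairs such as failure_count=1 or topic=MeshConfig.
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def classify(line):
    """Return (label, fields) for a known message, or (None, {}) otherwise."""
    for label, pattern in PATTERNS.items():
        if pattern.search(line):
            fields = {k: v.strip('"') for k, v in FIELD_RE.findall(line)}
            return label, fields
    return None, {}


def tally(lines):
    """Count how many lines match each known pattern."""
    counts = Counter()
    for line in lines:
        label, _ = classify(line)
        if label:
            counts[label] += 1
    return counts


# Three lines copied from the reproduction logs above.
SAMPLE = [
    '2024-12-12T08:52:52.728Z [ERROR] agent.envoy: DeltaDiscoveryRequest error="rpc error: code = Canceled desc = context canceled"',
    '2024-12-12T08:52:53.042Z [ERROR] agent.proxycfg: subscribe call failed, ACL change occurred, failure_count=1 key=mesh topic=MeshConfig',
    '2024-12-12T08:52:30.752Z [WARN] agent: error getting server health from consul-server-0 error="context deadline exceeded"',
]

if __name__ == "__main__":
    print(dict(tally(SAMPLE)))
```

Lines that match neither pattern, such as the server health warning, are left uncounted so they can be reviewed separately.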
Cause
The observed errors occurred due to normal Consul behavior in response to the following conditions:
- Proxy Configuration Lifecycle:
  - When a new service comes online, Consul servers establish watches on required resources to build the proxy config snapshot.
  - If no downstream consumers are actively using the proxy config, the server terminates these watches.
  - If the downstream consumer disconnects from the xDS server, the proxy configuration watches are also removed.
- ACL Token Changes:
  - If the ACL token used by a downstream consumer is updated or deleted, the subscription to configuration updates is closed by the server.
  - This results in log messages indicating that the subscription has been terminated.
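The lifecycle described above can be illustrated with a toy model. This is a simplified sketch for intuition only and is not Consul's implementation: the `WatchRegistry` class, its method names, the identifiers, and the event strings are all hypothetical.

```python
class WatchRegistry:
    """Toy model: watches for a proxy exist only while it has consumers."""

    def __init__(self):
        self.consumers = {}   # proxy id -> set of connected consumer ids
        self.tokens = {}      # consumer id -> ACL token it connected with
        self.events = []      # human-readable record of what happened

    def connect(self, proxy_id, consumer_id, token):
        # First consumer for a proxy: the server sets up its watches.
        if proxy_id not in self.consumers:
            self.consumers[proxy_id] = set()
            self.events.append(f"watches established for {proxy_id}")
        self.consumers[proxy_id].add(consumer_id)
        self.tokens[consumer_id] = token

    def disconnect(self, proxy_id, consumer_id):
        # Consumer leaves (for example, its pod was deleted).
        self.consumers.get(proxy_id, set()).discard(consumer_id)
        self.tokens.pop(consumer_id, None)
        # No consumers left: the watches are torn down.
        if proxy_id in self.consumers and not self.consumers[proxy_id]:
            del self.consumers[proxy_id]
            self.events.append(f"watches removed for {proxy_id}")

    def token_changed(self, token):
        # An updated or deleted token closes every subscription using it.
        for proxy_id, members in list(self.consumers.items()):
            for consumer_id in list(members):
                if self.tokens.get(consumer_id) == token:
                    self.events.append(
                        f"subscription closed for {consumer_id}: ACL change"
                    )
                    self.disconnect(proxy_id, consumer_id)


if __name__ == "__main__":
    registry = WatchRegistry()
    registry.connect("static-server", "envoy-1", token="token-a")
    registry.token_changed("token-a")
    for event in registry.events:
        print(event)
```

In this model, both a consumer disconnect and a token change end with the watches being removed, which mirrors why the two log messages tend to appear together when pods are deleted.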
Outcome
This behavior is normal and does not indicate a critical failure. However, regular monitoring and validation of ACL tokens and proxy consumers will help prevent unexpected disruptions and maintain service mesh stability.
Possible Solution
To minimize manual intervention in pod management, Kubernetes offers automated mechanisms through Deployments and Jobs.
Deployments are ideal for managing stateless applications that require consistent availability. They ensure that a specified number of pod replicas are running at all times and facilitate updates without downtime. For more details, refer to the Kubernetes Deployment documentation.
Link: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
Jobs are suited for finite tasks that need to run to completion, such as batch processing or scheduled operations. A Job creates one or more pods to execute the task and ensures successful completion. For comprehensive information, see the Kubernetes Jobs documentation.
Link: https://kubernetes.io/docs/concepts/workloads/controllers/job/
By leveraging Deployments and Jobs, you can automate pod management effectively, reducing the need for manual interventions.
Additional Information
https://github.com/hashicorp/consul/blob/main/docs/service-mesh/proxycfg.md
https://support.hashicorp.com/hc/en-us/articles/17843632998163-Troubleshooting-Consul-ACL-Issues