Problem
The customer reported intermittent 503 (Service Unavailable) responses from the services registered on the mesh. Multiple restarts of the consul-controller pod and the consul-connect-injector-webhook-deployment pods were also observed.
Versions in use
- Ambassador version: datawire/aes:2.1.2
- Consul version: consul-enterprise:1.10.6-ent
- Controller image: consul-k8s-control-plane:0.38.0
- Helm chart: v0.38.0 (this could happen with other versions as well)
Cause
- Check the controller log for the following message:
2022-02-19T16:47:05.397Z INFO Waited for 1.047657907s due to client-side throttling, not priority and fairness, request: GET:https://172.xx.x.x:443/apis/storage.k8s.io/v1?timeout=32s
This indicates a resource or timeout issue when trying to reach the Kubernetes API server.
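For example, the controller logs can be scanned for that throttling message with something like the following. This is a sketch: the deployment name consul-controller and the namespace service-mesh are assumptions based on the pod names and log paths in this article, so adjust them for your install.
# Scan the controller logs for client-side throttling messages
# (deployment/namespace names are assumptions; adjust for your release):
kubectl logs deploy/consul-controller -n service-mesh | grep -i "client-side throttling"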
- Running kubectl logs consul-controller-xxxxxx --previous shows the error message problem running manager {"error": "leader election lost"}. In the stack trace below, a context deadline exceeded error was observed.
2022-02-20T22:48:21.168Z ERROR error retrieving resource lock service-mesh/consul.hashicorp.com: Get "https://172.xx.x.x:443/api/v1/namespaces/service-mesh/configmaps/consul.hashicorp.com": context deadline exceeded
k8s.io/client-go/tools/leaderelection.(*LeaderElector).renew.func1.1
/go/pkg/mod/k8s.io/client-go@v0.22.2/tools/leaderelection/leaderelection.go:272
k8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:217
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:230
k8s.io/apimachinery/pkg/util/wait.poll
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:577
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:542
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntil
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:533
k8s.io/client-go/tools/leaderelection.(*LeaderElector).renew.func1
/go/pkg/mod/k8s.io/client-go@v0.22.2/tools/leaderelection/leaderelection.go:271
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/leaderelection.(*LeaderElector).renew
/go/pkg/mod/k8s.io/client-go@v0.22.2/tools/leaderelection/leaderelection.go:268
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
/go/pkg/mod/k8s.io/client-go@v0.22.2/tools/leaderelection/leaderelection.go:212
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startLeaderElection.func3
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/manager/internal.go:681
2022-02-20T22:48:21.168Z INFO failed to renew lease service-mesh/consul.hashicorp.com: timed out waiting for the condition
2022-02-20T22:48:21.168Z ERROR setup problem running manager {"error": "leader election lost"}
This context deadline exceeded message means that the controller was unable to reach the Kubernetes API server and get a response in time.
Possible causes of this error include:
- Resource Contention
- Slow I/O
- Network Latency
- Firewall Rules / Cloud Security Rules
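Network reachability and latency to the API server can be spot-checked from inside the cluster by timing a request against it. The sketch below is illustrative: the pod name and image are assumptions, the namespace is taken from the log output above, and the in-cluster address kubernetes.default.svc is used in place of the raw IP shown in the logs.
# Time a request to the Kubernetes API server from a throwaway pod
# running in the controller's namespace (pod name and image are illustrative):
kubectl run api-latency-check --rm -it --restart=Never -n service-mesh \
  --image=curlimages/curl -- \
  curl -sk -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
  https://kubernetes.default.svc/version
A healthy cluster normally responds well under a second; responses taking multiple seconds point to network latency, firewall rules, or an overloaded API server.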
The controller pod restarts by design: when leader election is lost, the process exits and Kubernetes restarts the pod, which effectively retries until the underlying Kubernetes infrastructure is able to serve the pod's requests again.
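The restart count and the reason for the last termination can be confirmed directly. In the sketch below the namespace comes from the log output above, and the pod name is the placeholder used earlier:
# Confirm restarts and the previous termination reason for the controller pod:
kubectl get pods -n service-mesh
kubectl describe pod consul-controller-xxxxxx -n service-mesh | grep -A5 "Last State"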
Solutions:
- Check system monitoring for resource contention on CPU, RAM, or disk I/O. If there is contention, increasing the constrained resource should help (see the sketch after this list).
- If resources look healthy, check network latency between the controller and the Kubernetes API server (for example, using the timing check shown earlier).
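A minimal sketch of the resource check referenced in the first item, assuming the metrics server is installed in the cluster:
# Check node- and pod-level resource usage (requires metrics-server):
kubectl top nodes
kubectl top pods -n service-mesh
# Node conditions such as MemoryPressure or DiskPressure also indicate contention:
kubectl describe nodes | grep -A8 "Conditions:"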