Problem
The customer reported intermittent 503 (Service Unavailable) responses from the services registered on the mesh. Multiple restarts of the consul-controller pod and the consul-connect-injector-webhook-deployment pods were also observed.
Versions in use
- Ambassador version: datawire/aes:2.1.2
- Consul version: consul-enterprise:1.10.6-ent
- Controller image: consul-k8s-control-plane:0.38.0
- Helm chart: v0.38.0 (this could happen with other versions as well)
Cause
- Check the controller log for the following message:
2022-02-19T16:47:05.397Z INFO Waited for 1.047657907s due to client-side throttling, not priority and fairness, request: GET:https://172.xx.x.x:443/apis/storage.k8s.io/v1?timeout=32s
This indicates a resource or timeout issue when trying to reach the Kubernetes API server.
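For example, the controller logs can be scanned for that throttling message with something like the following. This is a sketch: the deployment name consul-controller and the namespace service-mesh are assumptions based on the pod names and log paths in this article, so adjust them for your install.
# Scan the controller logs for client-side throttling messages
# (deployment/namespace names are assumptions; adjust for your release):
kubectl logs deploy/consul-controller -n service-mesh | grep -i "client-side throttling"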
- Running kubectl logs consul-controller-xxxxxx --previous shows the error message problem running manager {"error": "leader election lost"}. In the stack trace below, a context deadline exceeded error was observed.
2022-02-20T22:48:21.168Z ERROR error retrieving resource lock service-mesh/consul.hashicorp.com: Get "https://172.xx.x.x:443/api/v1/namespaces/service-mesh/configmaps/consul.hashicorp.com": context deadline exceeded
k8s.io/client-go/tools/leaderelection.(*LeaderElector).renew.func1.1
/go/pkg/mod/k8s.io/client-go@v0.22.2/tools/leaderelection/leaderelection.go:272
k8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:217
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:230
k8s.io/apimachinery/pkg/util/wait.poll
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:577
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:542
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntil
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:533
k8s.io/client-go/tools/leaderelection.(*LeaderElector).renew.func1
/go/pkg/mod/k8s.io/client-go@v0.22.2/tools/leaderelection/leaderelection.go:271
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/leaderelection.(*LeaderElector).renew
/go/pkg/mod/k8s.io/client-go@v0.22.2/tools/leaderelection/leaderelection.go:268
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
/go/pkg/mod/k8s.io/client-go@v0.22.2/tools/leaderelection/leaderelection.go:212
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startLeaderElection.func3
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/manager/internal.go:681
2022-02-20T22:48:21.168Z INFO failed to renew lease service-mesh/consul.hashicorp.com: timed out waiting for the condition
2022-02-20T22:48:21.168Z ERROR setup problem running manager {"error": "leader election lost"}
This context deadline exceeded message means that the controller was unable to reach the Kubernetes API server and get a response in time.
Possible causes of this error include:
- Resource Contention
- Slow I/O
- Network Latency
- Firewall Rules / Cloud Security Rules
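Network reachability and latency to the API server can be spot-checked from inside the cluster by timing a request against it. The sketch below is illustrative: the pod name and image are assumptions, the namespace is taken from the log output above, and the in-cluster address kubernetes.default.svc is used in place of the raw IP shown in the logs.
# Time a request to the Kubernetes API server from a throwaway pod
# running in the controller's namespace (pod name and image are illustrative):
kubectl run api-latency-check --rm -it --restart=Never -n service-mesh \
  --image=curlimages/curl -- \
  curl -sk -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
  https://kubernetes.default.svc/version
A healthy cluster normally responds well under a second; responses taking multiple seconds point to network latency, firewall rules, or an overloaded API server.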
The controller pod restarts by design: when leader election is lost, the process exits and Kubernetes restarts the pod, which effectively retries until the underlying Kubernetes infrastructure is able to serve the pod's requests again.
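The restart count and the reason for the last termination can be confirmed directly. In the sketch below the namespace comes from the log output above, and the pod name is the placeholder used earlier:
# Confirm restarts and the previous termination reason for the controller pod:
kubectl get pods -n service-mesh
kubectl describe pod consul-controller-xxxxxx -n service-mesh | grep -A5 "Last State"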
Solutions:
- Check system monitoring for resource contention on CPU, RAM, or disk I/O. If there is contention, increasing the constrained resource should help (see the sketch after this list).
- If resources look healthy, check network latency between the controller and the Kubernetes API server (for example, using the timing check shown earlier).
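A minimal sketch of the resource check referenced in the first item, assuming the metrics server is installed in the cluster:
# Check node- and pod-level resource usage (requires metrics-server):
kubectl top nodes
kubectl top pods -n service-mesh
# Node conditions such as MemoryPressure or DiskPressure also indicate contention:
kubectl describe nodes | grep -A8 "Conditions:"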