The information contained in this article has been verified as up-to-date on the date of the original publication of the article. HashiCorp endeavors to keep this information up-to-date and correct, but it makes no representations or warranties of any kind, express or implied, about the ongoing completeness, accuracy, reliability, or suitability of the information provided.
All information contained in this article is for general information purposes only. Any reliance you place on such information as it applies to your use of your HashiCorp product is therefore strictly at your own risk.
Introduction
This article offers guidance for end users and practitioners looking to overcome downtime related to Consul Serf/Raft issues during the upgrade process from older Consul versions.
Applicability
The steps outlined in this article apply only to the versions listed in the table below. This also includes OpenShift UBI and FIPS images, which are covered within the same version scope.
Consul Image | Version
consul-k8s-control-plane | < 1.4.x
consul-dataplane | < 1.4.x
consul, consul-enterprise | < 1.18.x
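As a rough aid for checking a deployment against the table above, the following sketch compares a version string with the listed thresholds. The function names version_lt and in_scope are hypothetical helpers, not part of any HashiCorp tooling. It assumes a sort implementation that supports -V (version sort), and it does not attempt to handle pre-release suffixes.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when version $1 is strictly older than $2.
# Assumes `sort -V` (version sort) is available; a leading "v" is ignored.
version_lt() {
  a="${1#v}"
  b="${2#v}"
  [ "$a" != "$b" ] &&
    [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n 1)" = "$a" ]
}

# Hypothetical helper: succeeds when the image/version pair falls inside
# the version scope listed in the applicability table.
in_scope() {
  case "$1" in
    consul-k8s-control-plane|consul-dataplane) version_lt "$2" "1.4.0" ;;
    consul|consul-enterprise) version_lt "$2" "1.18.0" ;;
    *) echo "unknown image: $1" >&2; return 2 ;;
  esac
}
```

For example, `in_scope consul-k8s-control-plane 1.3.1 && echo "procedure applies"` prints the message, while the same call with v1.4.0 does not.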
Background
- At the time of writing this Knowledge Base (KB) article, the latest supported version of Consul Enterprise was Consul v1.17.x. The discussion and examples therefore assume users are starting the upgrade process from this version or later.
- The steps outlined are motivated by several user-reported experiences while performing upgrades with Consul on Kubernetes.
- The issues posted in this HashiCorp Discuss thread and this related GitHub issue have been resolved and are no longer present in later versions of consul-k8s.
- Details of these improvements are outlined in PR #3000 for consul-k8s, which was officially merged on January 15th, 2024.
- In summary, the PR updates the Consul agent configuration by setting leave_on_terminate to true by default, aligning with the typical Kubernetes pod rescheduling behavior during upgrades. This fix was backported to the following versions of consul-k8s-control-plane: v1.4.0, v1.4.1, v1.4.2, v1.4.3, v1.4.4, v1.5.0, and v1.5.1.
Consul K8s Upgrade Guidance
Prerequisites:
- Kubernetes Command Line Tools (kubectl)
- Familiarity with concepts surrounding the use of Kubernetes Rolling Update Strategies.
- Familiarity with Helm installations and upgrades.
- Environment: Consul on Kubernetes deployed in adherence with the supported Kubernetes or OpenShift versions.
- Users should also familiarize themselves with the official public documentation for Upgrading Consul on Kubernetes components for their particular version of Consul and consul-k8s.
Procedural Steps
These steps should be completed just before performing the Upgrade Consul servers section outlined in the Consul K8s Upgrade documentation. They configure a more Kubernetes-friendly agent leave behavior for the Consul servers and mitigate potential issues when using a Consul Enterprise deployment with the default autopilot features enabled.
WARNING: The initial Helm upgrade in this procedure may result in an extended Raft leadership election period as the changes are applied. Anticipate a 3-10 minute period of unavailability during this first configuration change. Subsequent Helm upgrades should not cause this issue once these changes have been implemented.
- Add the Consul server agent leave behavior and, for Consul Enterprise only, the setting that disables autopilot's upgrade migration to your currently applied Helm values file:
## File: values.yaml - Consul Helm Chart overrides
server:
  enabled: true
  extraConfig: |
    {
      "leave_on_terminate": true,
      "autopilot": {
        "disable_upgrade_migration": true
      }
    }
- Apply the changes to the cluster by running helm upgrade with the updated values file:
$ helm upgrade consul hashicorp/consul --namespace consul --values values.yaml
- Establish exec shell access to any Consul Server pod using kubectl or oc (OpenShift CLI):
$ kubectl --namespace consul exec -it statefulset/consul-server -c consul -- /bin/sh
- Verify that upgrade migration is disabled in the Consul operator autopilot configuration (DisableUpgradeMigration = true) by running consul operator autopilot get-config:
Operator Note: If Consul ACLs are enabled, you'll need to pass in a token secret ID that has operator:read permissions when reading the autopilot configuration. If the configuration setting failed to change, updating it requires operator:write permissions.
$ consul operator autopilot get-config
CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
MinQuorum = 0
ServerStabilizationTime = 10s
RedundancyZoneTag = ""
DisableUpgradeMigration = true
UpgradeVersionTag = ""
- Verify the leave_on_terminate configuration setting is in the Consul Server agent's extra config by describing the consul-server-tmp-extra-config Kubernetes configmap:
$ kubectl --namespace consul describe configmap consul-server-tmp-extra-config
Name:         consul-server-tmp-extra-config
Namespace:    consul
Labels:       app=consul
              app.kubernetes.io/managed-by=Helm
              chart=consul-helm
              component=server
              heritage=Helm
              release=cluster-01
Annotations:  meta.helm.sh/release-name: cluster-01
              meta.helm.sh/release-namespace: consul

Data
====
extra-from-values.json:
----
{
  "leave_on_terminate": true,
  "autopilot": {
    "disable_upgrade_migration": true
  }
}

BinaryData
====

Events:  <none>
- Moving forward, you can resume performing the upgrade process by following the Upgrade Consul servers section of the upgrade documentation, where you'll update the Consul server updatePartition and walk through a stepped upgrade of the Consul servers with the updated image versioning scheme.
- After completing the final Helm upgrade of the updatePartition upgrade method, the Helm chart value overrides added by this procedure can be safely removed, as they're configured by default in the previously mentioned versions of consul-k8s-control-plane.
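The two verification steps in this procedure can also be scripted. The sketch below is a minimal illustration, not an official tool. The function names are hypothetical, and the checks are plain text matches against the command output shown earlier. It assumes the same consul namespace, consul-server statefulset, and consul-server-tmp-extra-config configmap names used in this article, and that any required ACL token is already available to the consul CLI inside the pod.

```shell
#!/bin/sh
# Hypothetical helper: reads `consul operator autopilot get-config` output
# on stdin and succeeds only if DisableUpgradeMigration is reported as true.
autopilot_migration_disabled() {
  grep -Eq '^[[:space:]]*DisableUpgradeMigration[[:space:]]*=[[:space:]]*true[[:space:]]*$'
}

# Hypothetical helper: reads the server extra config JSON on stdin and
# succeeds only if leave_on_terminate is set to true.
leave_on_terminate_enabled() {
  grep -Eq '"leave_on_terminate"[[:space:]]*:[[:space:]]*true'
}

# Hypothetical wrapper that runs both checks against a live cluster.
# Resource names are assumptions taken from the examples in this article.
verify_pre_upgrade() {
  kubectl --namespace consul exec statefulset/consul-server -c consul -- \
    consul operator autopilot get-config | autopilot_migration_disabled ||
    { echo "DisableUpgradeMigration is not true" >&2; return 1; }

  kubectl --namespace consul get configmap consul-server-tmp-extra-config \
    -o jsonpath='{.data.extra-from-values\.json}' | leave_on_terminate_enabled ||
    { echo "leave_on_terminate is not true" >&2; return 1; }

  echo "pre-upgrade settings verified"
}
```

Calling verify_pre_upgrade after the initial helm upgrade gives a single pass/fail result. The two parsing helpers read from stdin, so they can be exercised against saved command output without cluster access.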
References:
- Consul K8s GitHub Issue #1612 (Oct 12, 2022): Consul Server Rollout Restart Causes Downtime
- HashiCorp Discuss Forum: Unstable Deployment on K8s with Helm Chart