Introduction
The Upgrade Runbook should detail the order of operations for executing an upgrade to the Consul installation within your production environment.
This guide details some of the stages that your organization’s Upgrade Runbook should cover.
This is not an exhaustive list; rather, this guide should serve as a starting point for building your own Upgrade Runbook.
Runbook Sections
Information Gathering
At this stage you collect information regarding the target version and any compatibility requirements:
- Review the release notes and changelog of the target version.
- Check the specific version upgrade details for the target version, if any.
- Review the Consul and Envoy compatibility matrices that apply to your setup.
  - For example, if you run Consul on Kubernetes, check the Consul-k8s and Kubernetes versions. You can safely disregard the Envoy version because, starting with Consul-k8s 1.0, we bundle a compatible version of Envoy.
- If you are integrating Consul with Vault, Nomad, and/or Terraform, cross-check the compatibility matrix of the integrating product(s), if available.
  - If a compatibility matrix doesn't exist, plan to do extensive testing in a non-production environment using production-like workloads.
  - Consul Engineering typically tests against the two latest versions of the other products at the time a new version of Consul is released, but not all use cases can be tested.
- If you are using a private image repository such as AWS ECR, upload the required images and binaries to your repository with checksum validation.
- Prepare the upgrade plan according to your setup.
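The checksum validation mentioned above can be scripted before anything is pushed to a private repository. The following is a minimal sketch, assuming you have downloaded a release archive together with a checksum file whose lines follow the common `<hex digest>  <file name>` layout. The function names are illustrative, not part of any Consul tooling.

```python
import hashlib
from pathlib import Path


def parse_sha256sums(text: str) -> dict[str, str]:
    """Map file name -> expected hex digest from SHA256SUMS-style text."""
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            digest, name = parts
            # Some tools prefix the name with "*" to mark binary mode.
            expected[name.lstrip("*")] = digest.lower()
    return expected


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large archives are not loaded into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(archive: Path, sums_file: Path) -> bool:
    """Return True only if the archive's digest matches its listed entry."""
    expected = parse_sha256sums(sums_file.read_text())
    want = expected.get(archive.name)
    return want is not None and sha256_of(archive) == want
```

A runbook step could call `verify()` for each downloaded artifact and stop the upload if it returns `False`.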
There are a few resources we recommend that you reference while building or updating the Upgrade Runbook.
| Resource | Notes |
| --- | --- |
| Consul General Upgrade Process | Some best practices that you should follow when upgrading Consul. |
| Consul Upgrade Instructions | How to upgrade Consul when a new version is released. |
| Upgrade multiple federated Consul datacenters | Instructions related to upgrading Consul servers that are part of a federated environment. |
| Troubleshooting a Consul Upgrade | Some troubleshooting information. |
| Consul Upgrade Checklist | A knowledge base article that gives additional considerations and best practices when planning an upgrade. |
| Upgrading Consul on Kubernetes Components | Information related to upgrading deployments using Kubernetes. |
| Backup Consul Data and State tutorial | A tutorial with information on backing up and restoring Consul. |
Note: When visiting the documentation pages on our website, select the version in the upper-right corner of the page and review the content for both your current version and the version you are upgrading to.
Testing the upgrade
We highly recommend testing all upgrades in a Dev or Test environment before upgrading Production.
The more closely the Dev environment resembles Production, and the more testing you perform in Dev, the higher the probability of an incident-free upgrade in Production.
- Include information for your operator on how to request or access the Dev environment in your organization.
- Confirm the existing Consul environment is healthy:
  - Consul services are running (systemctl/kubectl)
  - Consul servers have a stable leader
  - Consul members are stable
  - KV operations succeed
  - No recurring WARN or ERROR messages in Consul component logs:
    - Servers
    - Clients
    - Dataplane
    - Mesh Gateways
    - Terminating Gateways
    - API Gateways
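The leader and member checks above can be automated against the Consul HTTP API. The following is a minimal sketch, assuming a local agent reachable at `http://127.0.0.1:8500` with no ACL token required, and assuming that a member `Status` value of `1` in the `/v1/agent/members` response means "alive". The function names are illustrative.

```python
import json
import urllib.request

ALIVE = 1  # assumed status code for an "alive" member in /v1/agent/members


def fetch_json(base_url: str, path: str):
    """GET a Consul HTTP API path and decode the JSON body."""
    with urllib.request.urlopen(base_url + path, timeout=5) as resp:
        return json.load(resp)


def evaluate(leader: str, members: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if not leader:
        problems.append("no leader elected")
    for m in members:
        if m.get("Status") != ALIVE:
            problems.append(f"member {m.get('Name')} has status {m.get('Status')}")
    return problems


def check(base_url: str = "http://127.0.0.1:8500") -> list[str]:
    """Query the agent once and evaluate the responses."""
    leader = fetch_json(base_url, "/v1/status/leader")
    members = fetch_json(base_url, "/v1/agent/members")
    return evaluate(leader, members)
```

A single call only shows the state at one moment. To judge whether the leader is stable, you could run `check()` several times over a few minutes and confirm that the leader address does not change between runs.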
Upgrading in Dev/Test
At this stage you detail the steps to perform on the day of the upgrade or in the few hours before it, for example:
- Increase log verbosity to Debug or Trace.
- Take a Consul snapshot and make sure it is valid, uncorrupted, and readily accessible.
- Start the upgrade in the order appropriate to your use case.
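The snapshot step above can be sketched with the `consul snapshot save` and `consul snapshot inspect` commands. This is a minimal example, assuming the `consul` binary is on the `PATH` and that the CLI can reach the cluster (for example through the `CONSUL_HTTP_ADDR` and `CONSUL_HTTP_TOKEN` environment variables). The file name and function names are illustrative.

```python
import subprocess


def snapshot_commands(path: str) -> list[list[str]]:
    """Commands to save a snapshot and then inspect the resulting file."""
    return [
        ["consul", "snapshot", "save", path],
        ["consul", "snapshot", "inspect", path],
    ]


def take_and_verify(path: str) -> None:
    """Run each command in order; check=True raises on a non-zero exit."""
    for cmd in snapshot_commands(path):
        subprocess.run(cmd, check=True)
```

Calling `take_and_verify("pre-upgrade.snap")` fails loudly if either command exits with an error. A successful inspect is only a basic sanity check, so the runbook should still state where the file is stored and who can reach it.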
Post-Upgrade checks
- Confirm the upgraded Consul environment is healthy:
  - Consul services are running (systemctl/kubectl)
  - Consul servers have a stable leader
  - Consul members are stable
  - KV operations succeed
  - No recurring WARN or ERROR messages in Consul component logs:
    - Servers
    - Clients
    - Dataplane
    - Mesh Gateways
    - Terminating Gateways
    - API Gateways
- Detail how to test actual workload scenarios, for example:
  - Service discovery
  - Service-to-service communication inside the mesh
  - External service to mesh service communication through the API Gateway
  - Mesh service to external service communication through the Terminating Gateway
  - Service communication through Mesh Gateways
  - Service failover across federated, peered, or partitioned clusters
  - KV operations
- If you run into an issue during the test upgrade or observe any issues in your testing, raise a ticket with Consul Support.
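The KV operations check listed above can be exercised as a write, read, and delete round trip through the `/v1/kv/` HTTP endpoint. The following is a minimal sketch, assuming an agent reachable at the given base URL with no ACL token required. The `upgrade-check/` key prefix and the function names are hypothetical, and the `call` parameter exists only so the logic can be tried without a live cluster.

```python
import urllib.request
import uuid


def http_call(method: str, url: str, body=None) -> bytes:
    """Send one HTTP request and return the raw response body."""
    req = urllib.request.Request(url, data=body, method=method)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read()


def kv_round_trip(base_url: str, call=http_call) -> bool:
    """Write a throwaway key, read it back, delete it, and compare values."""
    key = f"upgrade-check/{uuid.uuid4()}"  # unique key avoids clobbering data
    value = b"ok"
    url = f"{base_url}/v1/kv/{key}"
    call("PUT", url, value)
    read_back = call("GET", url + "?raw")  # ?raw returns the stored bytes
    call("DELETE", url)
    return read_back == value
```

For example, `kv_round_trip("http://127.0.0.1:8500")` returns `True` when the value read back matches what was written. Any HTTP error raises an exception, which a runbook step can treat as a failed check.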
Preparing for Production upgrade
- Communicate the scheduled upgrade plan and get approvals according to your organization's change management policies.
- Confirm the Consul environment is healthy:
  - Consul services are running (systemctl/kubectl)
  - Consul servers have a stable leader
  - Consul members are stable
  - KV operations succeed
  - No recurring WARN or ERROR messages in Consul component logs:
    - Servers
    - Clients
    - Dataplane
    - Mesh Gateways
    - Terminating Gateways
    - API Gateways
- Raise a low-severity ticket notifying Consul Support of the planned upgrade, including the date, times, and time zone.
  - Provide Support with configuration files and details on your Consul environment, architecture, use case, and the current and target versions.
  - Share the upgrade procedure you plan to follow.
  - The Consul Support team will share any feedback and concerns, then close the ticket.
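The "no recurring WARN or ERROR messages" check in the health list above can be approximated by counting repeated messages in collected logs. The following is a minimal sketch, assuming each log line has the layout `<timestamp> [LEVEL] <component>: <message>`. The threshold of three occurrences is an arbitrary illustration, not a recommended value.

```python
import re
from collections import Counter

# Assumed line layout: "<timestamp> [LEVEL] <component>: <message>"
LINE = re.compile(r"^\S+\s+\[(WARN|ERROR)\]\s+(.*)$")


def recurring_problems(lines, threshold: int = 3) -> dict[str, int]:
    """Count WARN/ERROR messages and keep those seen at least `threshold` times."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line.strip())
        if match:
            level, message = match.groups()
            counts[f"{level} {message}"] += 1
    return {msg: n for msg, n in counts.items() if n >= threshold}
```

The input can be any iterable of lines, such as an open log file or the captured output of `journalctl` or `kubectl logs`. Messages that embed changing values such as request IDs will not group together, so the pattern may need adjusting for your logs.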
Upgrading in Production
At this stage you detail the steps to perform on the day of the upgrade or in the few hours before it, for example:
- Increase log verbosity to Debug or Trace.
- Take a Consul snapshot and make sure it is valid, uncorrupted, and readily accessible.
- Start the upgrade in the order appropriate to your use case.
- If you run into an issue during the upgrade, raise a ticket with Consul Support.
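While working through the upgrade order, it can help to confirm which members have reached the target version before moving to the next group. The following is a minimal sketch, assuming the member list comes from the `/v1/agent/members` endpoint and that each member carries a `build` tag of the form `<version>:<commit>`. The function name is illustrative.

```python
def members_not_on(target: str, members: list[dict]) -> list[str]:
    """Names of members whose reported build does not match the target version."""
    stale = []
    for m in members:
        build = m.get("Tags", {}).get("build", "")
        version = build.split(":", 1)[0]  # drop the commit suffix, if present
        if version != target:
            stale.append(m.get("Name", "<unknown>"))
    return stale
```

The comparison is an exact string match, so `target` must include any suffix your builds report. An empty result suggests every member in the list has been upgraded.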
Rollback
Document the rollback plan to follow if the upgrade is unsuccessful.
- Is there a designated snapshot in case of failure?
- Do you know how the snapshot restoration process works?
- The method and frequency of testing the rollback plan should be covered by the Disaster Recovery Runbook.
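The snapshot restoration step can be sketched with the `consul snapshot restore` command. This is a minimal example, assuming the `consul` binary is on the `PATH`, the CLI can reach the cluster, and the designated snapshot is a local file. The function names are illustrative, and the sketch does not cover reverting the Consul binaries themselves.

```python
import subprocess
from pathlib import Path


def restore_command(snapshot: Path) -> list[str]:
    """Build the restore command, refusing a missing or empty snapshot file."""
    if not snapshot.is_file() or snapshot.stat().st_size == 0:
        raise FileNotFoundError(f"no usable snapshot at {snapshot}")
    return ["consul", "snapshot", "restore", str(snapshot)]


def restore(snapshot: Path) -> None:
    """Run the restore; check=True raises on a non-zero exit."""
    subprocess.run(restore_command(snapshot), check=True)
```

Checking the file before building the command catches a wrong path early, before any restore is attempted against the cluster.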