Overview:
This Knowledge Base article addresses frequently asked questions (FAQs) regarding Consul Autopilot Redundancy Zones, a powerful feature available in Consul Enterprise. Redundancy zones enhance the fault tolerance and scalability of your Consul cluster by distributing Consul servers across availability zones.
Q1: Why should we use the "Consul Autopilot Redundancy Zones" feature?
- Consul's redundancy zones, built on the Enterprise Autopilot feature, provide high availability in the case of server failure.
- Autopilot allows you to add read replicas to your datacenter that will be promoted to voting status if a voting server fails.
- It makes it possible to run one voter and any number of non-voters in each defined zone.
- Please note that there can only be one voter per zone.
- For more details, please refer to this link.
Q2: Do we require the Enterprise binary of Consul to use the "Consul Autopilot Redundancy Zones" feature?
Consul Redundancy Zones is a powerful feature that enhances the fault tolerance and scalability of your Consul deployment. However, it's important to note that this functionality is exclusive to Consul Enterprise. To leverage Redundancy Zones, you'll need a valid Consul Enterprise license and the corresponding licensed Consul binary.
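To confirm that you are running the Enterprise binary with a license loaded, you can run the checks below. This is a minimal sketch; the exact output varies by version and license.
$ consul version       # Enterprise builds report a version ending in "+ent"
$ consul license get   # prints details of the currently loaded license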
Q3: How to quickly set up test infrastructure on AWS manually to test the Consul Autopilot Redundancy Zones scenarios?
If you would like to use Terraform, this reference link can be used for an automated setup. Otherwise, follow the steps below for a quick manual guide to building the required test infrastructure.
Consul Enterprise binary with a 6-node EC2 Consul cluster:
2 in us-east-1a, defined as "zone-a"
2 in us-east-1b, defined as "zone-b"
2 in us-east-1c, defined as "zone-c"
Create custom Launch template:
You can also incorporate a bootstrap script when creating a launch template to automate the installation of necessary Consul packages.
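For illustration, a user-data bootstrap script along these lines could be attached to the launch template. This is a hypothetical sketch: the Consul version, download URL, paths, and license delivery method are assumptions to adapt for your environment.
#!/usr/bin/env bash
# Hypothetical launch-template user-data sketch: install a Consul Enterprise
# binary and prepare its config directories. Version and paths are assumptions.
set -euo pipefail
CONSUL_VERSION="1.16.3+ent"
cd /tmp
curl -sLO "https://releases.hashicorp.com/consul/${CONSUL_VERSION}/consul_${CONSUL_VERSION}_linux_amd64.zip"
unzip -o "consul_${CONSUL_VERSION}_linux_amd64.zip"
sudo install -m 0755 consul /usr/local/bin/consul
sudo mkdir -p /etc/consul.d /opt/consul
# The Enterprise license is assumed to be delivered separately (for example via
# SSM Parameter Store or Secrets Manager) and written to /etc/consul.d/consul.hclic.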
Create ASG (Auto Scaling Group):
Use the previously created launch template when setting up the Auto Scaling Group (ASG).
Based on the desired capacity set for the AWS Auto Scaling Group (ASG), the corresponding EC2 instances will be created within the specified Availability Zones (AZs). In the setup below, 6 nodes have been deployed across the designated AZs.
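As an illustration, the ASG could be created with the AWS CLI roughly as follows. The group name, launch template name, and subnet IDs are placeholders; substitute one subnet per Availability Zone.
$ aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name consul-servers \
    --launch-template LaunchTemplateName=consul-server-template,Version=1 \
    --min-size 6 --max-size 6 --desired-capacity 6 \
    --vpc-zone-identifier "subnet-aaaa1111,subnet-bbbb2222,subnet-cccc3333"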
By default, the Autopilot "RedundancyZoneTag" feature is deactivated. To verify this setting, you can execute the following command:
$ consul operator autopilot get-config
RedundancyZoneTag = ""
Q4: What are the prerequisites for using Redundancy Zones?
- Consul Enterprise License: Redundancy zones are an exclusive feature of Consul Enterprise. You'll need a valid license to utilize them.
- Consul Enterprise Binary: The licensed Consul Enterprise binary is required to leverage redundancy zone functionality.
Q5: How to Enable "Redundancy Zone" in Consul Autopilot?
* Add the following to the agent configuration file on each server, setting the zone value to match that server's own zone (zone-a, zone-b, or zone-c):
{
  "autopilot": {
    "redundancy_zone_tag": "zone"
  },
  "node_meta": {
    "zone": "zone-a"
  }
}
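If you prefer not to hard-code the zone per server, the value can be derived from the instance's Availability Zone at boot time. A hypothetical sketch (the config path and zone naming scheme are assumptions; IMDSv2 would additionally require a session token):
AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
cat <<EOF | sudo tee /etc/consul.d/zone.json
{
  "node_meta": {
    "zone": "${AZ}"
  }
}
EOF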
* Restart the Consul service
$ systemctl stop consul
$ systemctl start consul
$ systemctl status consul
* Enable "RedundancyZoneTag" in autopilot
$ consul operator autopilot set-config -redundancy-zone-tag=zone
Configuration updated!
For more details, please refer to this official document link.
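The same setting can also be applied through the Autopilot configuration HTTP API. A hedged sketch, assuming a local agent with no ACLs; note that the PUT replaces the entire Autopilot configuration, so it is worth fetching the current configuration first:
$ curl -s http://127.0.0.1:8500/v1/operator/autopilot/configuration
$ curl -s -X PUT http://127.0.0.1:8500/v1/operator/autopilot/configuration \
    -d '{"CleanupDeadServers": true, "RedundancyZoneTag": "zone"}'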
Q6: How can one verify the status of the "RedundancyZoneTag" in Autopilot?
$ consul operator autopilot get-config
RedundancyZoneTag = "zone"
Q7: How does the behavior of Consul voters differ before and after activating the "RedundancyZoneTag" in Autopilot?
Before enabling the "Redundancy Zone" flag in Consul Autopilot, all nodes are shown as "voters".
$ consul operator raft list-peers
After enabling the flag, only 3 voters were present: one voter from each zone. Consequently, the cluster transitioned from having 6 voters to just 3 voters, along with 3 non-voters.
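To make the before/after difference easy to confirm, you can count the voters directly. A rough sketch based on the Voter column of the list-peers output (column layout may vary slightly between Consul versions):
$ consul operator raft list-peers | grep -c true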
Q8: What will happen if one AZ's voter node goes down?
Autopilot allows you to add read replicas to your datacenter that will be promoted to the "voting" status in case of voting server failure.
To verify the configuration, stop one of the voters, in this case server-2, and verify that the corresponding non-voter in its redundancy zone is promoted to voter as soon as the server is declared unhealthy.
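One way to run this test is to stop the Consul service on the zone's voter and watch the peer list until its read replica is promoted; a simple sketch:
# On server-2 (the voter being failed):
$ sudo systemctl stop consul
# On any other server, watch for the zone's non-voter to switch to Voter=true:
$ watch -n 5 'consul operator raft list-peers'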
$ consul operator raft list-peers
Node ID Address State Voter RaftProtocol
server-1 738f19c4-8543-eef2-6f83-e20544b863dd X.X.X.X:8300 leader true 3
server-1a 43afac13-f5af-b06f-8a9b-0092244790df X.X.X.X:8300 follower false 3
server-2a a5842855-58a6-197d-694d-b56eab9acc5c X.X.X.X:8300 follower true 3
server-3 6154e025-55ad-89a5-298d-6b7ae6cfb0f8 X.X.X.X:8300 follower true 3
server-3a 96f1d875-d57e-558e-2707-a32a63666bfb X.X.X.X:8300 follower false 3
Once "server-2a" is promoted to voter, you can start Consul on "server-2" again and verify that the one-voter-per-redundancy-zone rule is still respected.
$ consul operator raft list-peers
Node ID Address State Voter RaftProtocol
server-1 738f19c4-8543-eef2-6f83-e20544b863dd 10.20.10.11:8300 leader true 3
server-2 feffe44d-b0c6-809f-97b6-cb7143b5cb9d 10.20.10.12:8300 follower false 3
server-3 6154e025-55ad-89a5-298d-6b7ae6cfb0f8 10.20.10.13:8300 follower true 3
server-1a 43afac13-f5af-b06f-8a9b-0092244790df 10.20.10.21:8300 follower false 3
server-2a a5842855-58a6-197d-694d-b56eab9acc5c 10.20.10.22:8300 follower true 3
server-3a 96f1d875-d57e-558e-2707-a32a63666bfb 10.20.10.23:8300 follower false 3
Q9: How can we disable "Redundancy Zone" in Consul Autopilot?
$ consul operator autopilot set-config -redundancy-zone-tag=""
Configuration updated!
Q10: What changes can be expected in Consul voter nodes once the "Redundancy Zone" flag is disabled again?
Once the "Redundancy Zone" flag is disabled in Consul Autopilot, all 6 servers become voters again.
Q11: What is the advised sequence for upgrades when "RedundancyZoneTag" is enabled?
When upgrading Consul with redundancy zones enabled, it is essential to ensure that the cluster retains quorum throughout and to allow sufficient time for servers to stabilize between steps. While upgrading non-voter nodes first may seem like a prudent approach, the overall upgrade sequence (follower nodes -> leader node -> client nodes) remains unchanged. Ultimately, the upgrade sequence should prioritize stability and failover readiness.
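Before moving on to the next server during an upgrade, it can help to confirm that Autopilot still considers the cluster healthy and that there is spare failure tolerance. A minimal check, assuming a local agent without ACLs:
$ curl -s http://127.0.0.1:8500/v1/operator/autopilot/health | jq '.Healthy, .FailureTolerance'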
Q12: How many voters should we expect there to be when we lose an entire Consul redundancy zone?
If one AZ is lost, one of the non-voters in one of the other AZs will be promoted to voter to bring the cluster back to 3 voters.
Step1:
AZ1- voter1, non-voter1
AZ2- voter2, non-voter2
AZ3- voter3, non-voter3
Step2:
AZ1- Lost; one of the remaining AZs now promotes a non-voter to voter.
AZ2- voter2, non-voter2 -> promoted to voter; this AZ now has two voters and no non-voters
AZ3- voter3, non-voter3
Please note, AZ (Availability Zone) = RZ (Redundancy Zone).
Promoting a non-voter just means we are back to a full 3 voters, which allows for a further single-node failure in either remaining AZ before there is an availability incident. In the scenario above, one of the non-voters in another AZ is promoted, so both nodes in that AZ are now voters. As soon as AZ1 comes back, however, the promoted server in AZ2 steps back down to non-voter and the voters are spread across availability/redundancy zones again.
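You can watch this rebalancing through the Autopilot state endpoint. A hedged sketch, assuming a local agent without ACLs (the exact field set depends on your Consul version):
$ curl -s http://127.0.0.1:8500/v1/operator/autopilot/state | jq '{Voters: .Voters, FailureTolerance: .FailureTolerance}'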
Q13: What are the key commands for managing redundancy zones?
Important Commands-
$ consul operator autopilot state
$ consul operator autopilot --help
$ consul operator autopilot get-config
$ consul operator autopilot set-config < >
$ curl http://127.0.0.1:8500/v1/operator/autopilot/configuration
$ curl http://127.0.0.1:8500/v1/operator/autopilot/state
$ curl localhost:8500/v1/operator/autopilot/health | jq .
Other Official links:
https://developer.hashicorp.com/consul/tutorials/datacenter-operations/autopilot-datacenter-operations
https://developer.hashicorp.com/consul/docs/agent/config/config-files#autopilot
https://developer.hashicorp.com/consul/commands/operator/autopilot
https://developer.hashicorp.com/consul/api-docs/operator/autopilot
https://developer.hashicorp.com/consul/docs/enterprise/redundancy
https://developer.hashicorp.com/consul/tutorials/operate-consul/redundancy-zones
Conclusion:
This FAQ addresses a collection of questions recently raised by Consul users through support tickets. It provides a comprehensive overview of Consul Autopilot Redundancy Zones, empowering you to build a resilient and scalable service discovery solution for your infrastructure.
For further details and advanced configurations, refer to the comprehensive Consul documentation. If you have additional questions, feel free to reach out to HashiCorp support.