The information contained in this article has been verified as up-to-date on the date of the original publication of the article. HashiCorp endeavors to keep this information up-to-date and correct, but it makes no representations or warranties of any kind, express or implied, about the ongoing completeness, accuracy, reliability, or suitability of the information provided.
All information contained in this article is for general information purposes only. Any reliance you place on such information as it applies to your use of your HashiCorp product is therefore strictly at your own risk.
Introduction
This article explains how to use cloud auto-join (Consul version 1.14.x and above) instead of a LoadBalancer for a Consul cluster on Kubernetes to connect with external servers hosted on VMs. Cloud auto-join provides fault tolerance similar to a LoadBalancer by selecting nodes with specific tags in a random sequence, utilizing the go-netaddrs library. Before version 1.14, auto-join used the go-addr library, which selected nodes sequentially; if the first server node was down, the installation of the Kubernetes cluster components would fail.
Use-Case
While this guide focuses on Consul 1.14, the concepts and techniques apply to the latest versions (1.19.x and 1.20.x) as well.
Starting with Consul 1.14, the architecture of Consul on Kubernetes changed, removing the dedicated client agent. This means that when Consul servers are hosted externally, the consul-dataplane sidecar container, which is responsible for service mesh communication, is configured with only one server IP address:
...
- args:
    - -addresses
    - 10.162.34.52
    - -grpc-port=8503
    - -proxy-service-id-path=/consul/connect-inject/proxyid
...
The consul-dataplane container selects the first IP address specified in the values.yaml file:

...
externalServers:
  enabled: true
  hosts: ["10.162.34.52","10.162.34.53","10.162.34.54"]
...
Problem
Relying on a single Consul server address in the consul-dataplane argument introduces a single point of failure. If that server becomes unavailable, your service mesh could experience connectivity issues and application disruptions.
Problem example: versions 1.13.x or older
This example uses AWS:
- Consul version 1.13.9 and Helm chart 0.49.8
- 3 Consul servers on EC2 VMs
- Clients and other components (such as the controller and mesh-gateway) on an EKS cluster
Instead of creating a load balancer, we used cloud auto-join in values.yaml, as shown below, to discover the servers hosted on EC2 that share the same tag_key and tag_value.

- Please ensure proper networking and routing is intact between the EKS cluster and the EC2 instances.
...
externalServers:
  enabled: true
  hosts:
    - 'provider=aws tag_key=Server tag_value=true'
  k8sAuthMethodHost: 'https://16E7A4DAC528AE39C477031B4732DF12.gr7.ap-south-1.eks.amazonaws.com' # Address of the Kubernetes API server
client:
  enabled: true
  join:
    - 'provider=aws tag_key=Server tag_value=true'
  exposeGossipPorts: true
...
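For reference, the following is a minimal sketch of how the EC2 instances hosting the Consul servers could be tagged so that the query above matches them; the instance IDs are placeholders, and the IAM principal performing discovery needs permission to describe instances.

# Tag each Consul server instance with the key/value used in the auto-join query;
# the instance IDs below are placeholders.
aws ec2 create-tags \
  --resources i-0123456789abcdef0 i-0fedcba9876543210 i-0aabbccddeeff0011 \
  --tags Key=Server,Value=true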
The consul-clients relied on cloud auto-join with the go-addr library, which selected servers sequentially from externalServers.hosts. If the first server was unavailable, the entire Kubernetes cluster installation would fail.

Solutions
- Load Balancer (Recommended): Place your Consul servers behind a load balancer and use its IP address or DNS name in the externalServers.hosts Helm chart value. This provides health checks to ensure traffic is always directed to a healthy Consul server. However, it requires configuring CoreDNS to forward Consul domain queries to the load balancer (a CoreDNS sketch follows below).
- Cloud Auto-Join (Consul 1.14+): For a simpler approach, leverage cloud auto-join. This feature allows the consul-dataplane to randomly select a healthy Consul server from a provided list. If the initial connection fails, it automatically tries other servers, providing built-in fault tolerance similar to a load balancer.
By implementing either of these solutions, you can enhance the resilience of your Consul service mesh and minimize the impact of server failures on your applications.
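For the load balancer option, the following is a minimal CoreDNS sketch, not a definitive configuration: it assumes the default consul DNS domain, Consul's default DNS port 8600, and a hypothetical load balancer address of 10.0.0.100. The stanza would typically be added to the coredns ConfigMap (commonly in the kube-system namespace).

# Added to the CoreDNS Corefile; 10.0.0.100 is a hypothetical load balancer address.
consul:53 {
    errors
    cache 30
    # Forward all *.consul queries to the load balancer fronting the Consul servers
    forward . 10.0.0.100:8600
}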
Procedure
Consul 1.14.x introduced a significant change to cloud auto-join, eliminating the reliance on Consul clients and improving fault tolerance. Here's how:
- No More Client Dependency: With the removal of Consul clients, cloud auto-join now operates directly within the consul-dataplane sidecar.
- Randomized Server Selection: By leveraging the go-netaddrs library, cloud auto-join randomizes the order in which it attempts to connect to the Consul servers specified in externalServers.hosts. This ensures that if one server is unavailable, the consul-dataplane can quickly connect to another healthy server.
To experience this enhanced functionality, upgrade your Consul deployment to the latest 1.14.x version. Follow the official guides for server upgrades on VMs and K8s cluster upgrades to Dataplane. This upgrade effectively replaces the older go-addr library with the more robust go-netaddrs for cloud auto-join.
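As a rough illustration of the Kubernetes side of the upgrade, the following is a minimal sketch rather than the official procedure: the release name consul, the consul namespace, and chart version 1.0.5 are assumptions, so consult the Consul K8s compatibility matrix for the chart version that matches your target Consul release.

# Hypothetical release name, namespace, and chart version; adjust to your environment.
helm repo update
helm upgrade consul hashicorp/consul \
  --namespace consul \
  --version 1.0.5 \
  --values values.yaml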
- Modify the cloud auto-join library from go-addr to go-netaddrs by updating values.yaml as follows:
...
externalServers:
  enabled: true
  hosts:
    - 'exec=discover -q addrs provider=aws tag_key=Server tag_value=true'
  k8sAuthMethodHost: 'https://16E7A4DAC528AE39C477031B4732DF12.gr7.ap-south-1.eks.amazonaws.com' # Address of the Kubernetes API server
client:
  enabled: true
  join:
    - 'exec=discover -q addrs provider=aws tag_key=Server tag_value=true'
  exposeGossipPorts: true
...
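- Optionally, verify that the tag query resolves to your server addresses before rolling out the change. The sketch below runs the same discover query that consul-dataplane executes through exec=; it assumes the go-discover CLI and AWS credentials with permission to describe EC2 instances are available on the machine where it runs, and the output shown simply reuses the example addresses from earlier in this article.

# Same query that consul-dataplane runs via 'exec='; requires AWS credentials
# that can call ec2:DescribeInstances.
$ discover -q addrs provider=aws tag_key=Server tag_value=true
10.162.34.52 10.162.34.53 10.162.34.54   # illustrative output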
- To validate the improved fault tolerance, stop the first of the three Consul servers and perform a fresh installation of the consul-k8s cluster.
- The consul-dataplane successfully discovered and connected to the remaining servers, ensuring a successful cluster deployment.
...
consul-dataplane:
  Container ID:  containerd://080705a71b7e9784ccbde18cd6762c2fe9e6c5c757ab02b65c8bad2035f16b82
  Image:         hashicorp/consul-dataplane:1.0.5
  Image ID:      docker.io/hashicorp/consul-dataplane@sha256:b5a7e0f22dec65a90d2c3aff338c661e04d423c0060e77887ad5759f7e2b7b6b
  Port:          <none>
  Host Port:     <none>
  Args:
    -addresses
    exec=discover -q addrs provider=aws tag_key=Server tag_value=true
    -grpc-port=8502
    -proxy-service-id-path=/consul/connect-inject/proxyid
    -log-level=info
...
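The container details above can be pulled from any mesh-enabled pod; the command below is a sketch with placeholder pod and namespace names.

# Inspect the injected consul-dataplane sidecar; replace the placeholders
# with a real application pod and its namespace.
kubectl describe pod <application-pod> -n <namespace>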
Additional Information
- Consul Servers Outside of Kubernetes
- Cloud Auto-join
- Helm Chart Reference: externalServers
- GitHub go-netaddrs