Disclaimer
This article references a third-party vendor library and is not owned or maintained by HashiCorp. While we discuss how this library interacts with Consul, HashiCorp does not provide direct support for the Spring Boot library itself. Due to the library's inefficient use of blocking queries, we do not recommend using this integration in production environments.
Introduction
This article addresses a common issue encountered when using the Consul Spring Boot library for Java-based microservices, particularly in service discovery and Key-Value (KV) store use cases within on-premise and Kubernetes environments. It explains a specific problem related to inefficient blocking queries that can lead to "i/o deadline reached" errors.
Primer:
- Spring Boot: A Java-based framework used to build standalone, production-grade microservices.
- Blocking queries: A feature of the Consul HTTP API that allows a client to wait for a change instead of polling repeatedly.
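To make the blocking-query mechanism concrete, here is a minimal Python sketch of a long-poll loop against Consul's HTTP API. It relies on the documented `index` and `wait` query parameters and the `X-Consul-Index` response header. The function names, the agent address passed as `base_url`, and the timeout values are illustrative assumptions, and this is not how the Spring Boot library itself is implemented.

```python
import json
import urllib.parse
import urllib.request


def blocking_query_url(base_url, path, index, wait="30s"):
    """Build a blocking-query URL: `index` is the last index seen, `wait` caps the hold time."""
    query = urllib.parse.urlencode({"index": index, "wait": wait})
    return f"{base_url}{path}?{query}"


def next_index(previous, header_value):
    """Choose the index for the next request from the X-Consul-Index response header."""
    try:
        current = int(header_value)
    except (TypeError, ValueError):
        return 0  # header missing or malformed: start over
    if current < 1 or current < previous:
        return 0  # index must be positive and must not go backwards
    return current


def watch(base_url, path, on_change, wait="30s", timeout=60.0):
    """Long-poll `path` and call `on_change(body)` whenever the index advances.

    Hypothetical helper: runs until an exception is raised (for example an HTTP 500).
    """
    index = 0
    while True:
        url = blocking_query_url(base_url, path, index, wait)
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.load(resp)
            new_index = next_index(index, resp.headers.get("X-Consul-Index"))
        if new_index != index:
            on_change(body)
        index = new_index
```

Calling `watch("http://127.0.0.1:8500", "/v1/catalog/services", print)` would hold one request open per cycle instead of polling in a tight loop. Each open request still occupies a server-side RPC, which is why many concurrent watchers add up.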
Problem
While the Spring Boot library effectively leverages Consul's HTTP API for blocking queries and watching for changes, its current implementation can introduce significant inefficiencies. A frequent error encountered by users is "i/o deadline reached." This occurs when the Spring Boot client excessively watches the Consul catalog, leading to a timeout as Consul reaches its internal I/O deadline.
Common Error Example:
"Error watching Consul CatalogServices", "thread_name":"catalogWatchTaskScheduler-1", "level":"ERROR", "stack_trace":"com.ecwid.consul.v1.OperationException: OperationException(statusCode=500, statusMessage='Internal Server Error', statusContent='rpc error making call: i/o deadline reached')"
This behavior stems from the library's use of blocking queries to the /v1/catalog/services endpoint. Unfortunately, the Spring Boot library hardcodes logic that forces these queries to always target the Consul leader, even when spring.cloud.consul.discovery.consistency-mode is configured. As a result, the consistency mode setting is overridden and the query load cannot be distributed to followers, which makes this behavior impossible to tune or mitigate effectively through Spring Boot configuration.
Prerequisites
To understand and address this issue, it's beneficial to have:
- A Spring Boot microservice integrated with the Consul Spring Boot library.
- A Consul cluster (on-prem or Kubernetes).
- Access to Consul client configurations.
- Ability to configure Consul agent telemetry.
Cause
The root cause of the "i/o deadline reached" error is the Spring Boot library's hardcoded logic for blocking queries to the /v1/catalog/services endpoint. This logic forces all such queries to be directed to the Consul leader node, regardless of the spring.cloud.consul.discovery.consistency-mode setting. This prevents the distribution of query load to follower nodes, leading to an excessive burden on the leader and eventual timeouts when Consul's internal I/O deadline is reached. The leader: "true" and allow_stale: "false" labels in Consul RPC metrics confirm this behavior.
Solution
As a temporary mitigation to reduce load on the Consul cluster and avoid error spam, the discovery_max_stale configuration should be enabled on all Consul clients that the Spring Boot services interact with. This setting allows clients to accept stale data from follower nodes, thereby reducing pressure on Consul servers and minimizing the likelihood of timeout errors when retrieving catalog data.
How to implement discovery_max_stale:
This configuration needs to be applied to the Consul client agents that the Spring Boot services connect to. The exact method will depend on how your Consul clients are deployed.
Example for Consul Agent Configuration:
# Example for a Consul agent configuration file (HCL)
# ... other agent configurations
discovery_max_stale = "10s" # Example: allows stale data up to 10 seconds old
Kubernetes Helm Override (if deploying Consul via Helm):
Apply the same discovery_max_stale setting to the Consul agents through your Helm values. Because discovery_max_stale is typically a client-side setting, it only takes effect for Spring Boot services that connect through Consul clients/agents that have it enabled.
Note: Enabling discovery_max_stale is only a temporary mitigation. A more permanent solution would likely require an update to the Spring Cloud Consul library itself to allow for proper consistency mode configuration and query distribution.
Outcome
By enabling the discovery_max_stale configuration, you should observe:
- A reduction in "i/o deadline reached" errors originating from Spring Boot services.
- Decreased load on the Consul leader node, as clients are now permitted to read stale data from follower nodes.
- Improved stability and responsiveness of your Consul cluster, particularly under heavy load from Spring Boot clients.
You can further confirm the effectiveness of this mitigation by monitoring Consul telemetry metrics, specifically observing a change in the allow_stale label from false to true for Catalog.ListServices calls.
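As a quick spot check alongside the telemetry metrics, the response headers of a catalog request can also be inspected. The sketch below assumes the agent returns the `X-Consul-Effective-Consistency` header on discovery endpoints when discovery_max_stale is set, and that its value is `stale` for a stale read. Both points, as well as the helper names and the agent address, are assumptions to verify against your Consul version.

```python
import urllib.request


def effective_consistency(headers):
    """Summarize the consistency-related headers of a Consul HTTP response."""
    lowered = {key.lower(): value for key, value in headers.items()}
    mode = lowered.get("x-consul-effective-consistency")
    return {
        "mode": mode,
        "stale_read": mode == "stale",  # assumed header value for a stale read
        "known_leader": lowered.get("x-consul-knownleader"),
        "last_contact_ms": lowered.get("x-consul-lastcontact"),
    }


def check_catalog(base_url="http://127.0.0.1:8500"):
    """Query /v1/catalog/services through a local agent (address is an assumption)."""
    with urllib.request.urlopen(f"{base_url}/v1/catalog/services", timeout=10) as resp:
        return effective_consistency(dict(resp.headers.items()))
```

If `stale_read` stays `False` after the agent has been reloaded with the new setting, the request is likely still being forwarded to the leader.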
Additional Information
Metrics for Confirmation:
Consul telemetry metrics, particularly consul.rpc.server.call, can help confirm the high volume of blocking queries originating from Spring Boot and the impact of the discovery_max_stale setting.
To enable this level of visibility, configure the telemetry block in the Consul agent configuration:
telemetry {
  prefix_filter = ["+consul.rpc.server.call"]
}
Consul K8s Helm Override (for Kubernetes deployments):
global:
  metrics:
    enabled: true
    prefixFilter:
      allowList: [ "consul.rpc.server.call" ]
Once enabled, the /v1/agent/metrics endpoint can be queried to inspect RPC method call activity. For example, the following metric entry shows a call to the Catalog.ListServices method before allow_stale is enabled:
{
  "Name": "consul.rpc.server.call",
  "Count": 30,
  "Rate": 0.005000000074505806,
  "Sum": 0.05000000074505806,
  "Min": 0.05000000074505806,
  "Max": 0.05000000074505806,
  "Mean": 0.05000000074505806,
  "Stddev": 0,
  "Labels": {
    "allow_stale": "false",
    "blocking": "true",
    "errored": "false",
    "leader": "true",
    "locality": "local",
    "method": "Catalog.ListServices",
    "request_type": "read",
    "rpc_type": "net/rpc",
    "target_datacenter": "dc1"
  }
}
Note that:
- "leader": "true" confirms that the call is hitting the Consul leader.
- "allow_stale": "false" confirms that stale reads are not enabled, meaning all reads are directed to the leader rather than follower nodes.
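The label check described above can be automated. The following Python sketch filters the /v1/agent/metrics output for consul.rpc.server.call entries of one RPC method and totals their Count by the leader and allow_stale labels. The helper names and the agent address are hypothetical. Because the sketch does not assume which top-level section of the response holds the entry, it scans every list in the payload.

```python
import json
import urllib.request
from collections import Counter


def rpc_call_entries(metrics, method="Catalog.ListServices"):
    """Return consul.rpc.server.call entries for one RPC method."""
    found = []
    for value in metrics.values():
        if not isinstance(value, list):
            continue  # skip scalar fields such as a timestamp
        for entry in value:
            if not isinstance(entry, dict):
                continue
            labels = entry.get("Labels") or {}
            if entry.get("Name") == "consul.rpc.server.call" and labels.get("method") == method:
                found.append(entry)
    return found


def calls_by_routing(entries):
    """Sum the Count field per (leader, allow_stale) label pair."""
    totals = Counter()
    for entry in entries:
        labels = entry["Labels"]
        totals[(labels.get("leader"), labels.get("allow_stale"))] += entry.get("Count", 0)
    return dict(totals)


def fetch_metrics(base_url="http://127.0.0.1:8500"):
    """Fetch metrics from an agent (address is an assumption; add an ACL token if required)."""
    with urllib.request.urlopen(f"{base_url}/v1/agent/metrics", timeout=10) as resp:
        return json.load(resp)
```

For the entry shown earlier, `calls_by_routing(rpc_call_entries(metrics))` would give `{("true", "false"): 30}`. After the mitigation, the expectation is that the counts shift toward pairs where allow_stale is "true".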
Related Resources:
- Spring Cloud Consul GitHub Repository: https://github.com/spring-cloud/spring-cloud-consul