Introduction
This article provides detailed steps to resolve an issue where Consul blocking queries return stale data after a disaster recovery (DR) operation or snapshot restore. The issue affects clusters running Consul versions 1.15.x through 1.18.x and is fully resolved in version 1.19.0 and later.
The behaviour is token-scoped: a specific ACL token continues to receive outdated responses while queries using a different token immediately return current data. This article includes diagnostic steps, reproduction guidance, temporary mitigations, and the permanent solution through upgrading.
Problem
After a disaster recovery operation or snapshot restore in Consul, blocking queries using the streaming backend may continue to return stale data for specific ACL tokens. The issue persists until the client application restarts or the ACL token is rotated. Switching to a different token immediately returns fresh data, confirming the token-scoped nature of the problem.
This can lead to:
- Applications making decisions based on outdated service health information
- Service discovery returning stale endpoint lists
- Configuration changes not being reflected in real-time
- Inconsistent behaviour across different clients or tokens
Prerequisites
- Consul Enterprise or OSS version 1.15.x through 1.18.x
- Streaming backend enabled (default since Consul 1.10)
- ACL tokens in use for authentication
- Client applications using blocking queries (with `wait` and `index` parameters)
- Access to Consul server logs and API
Cause
This issue occurs because Consul's streaming backend, which powers blocking queries through materialized views, maintains token-scoped subscriptions that are filtered by ACL permissions. During disaster recovery operations—particularly snapshot restores—internal event streams may not be properly terminated and reset, leaving materialized views in a stale state.
1. Stream Termination on Snapshot Restore
Before Consul 1.18.0, internal streams were not properly terminated when a snapshot restore occurred. This left materialized views pointing to pre-restore state, continuing to serve outdated data even after the cluster state had been rolled back.
2. ACL Error Handling at Timeout Boundaries
Before Consul 1.19.0, when blocking queries reached their timeout while encountering ACL errors, the streaming backend did not consistently handle the error condition. This caused subscriptions to become stuck in an inconsistent state, unable to refresh or reconnect properly.
Why Token Rotation or Client Restart Resolves the Issue
- Token rotation: Creates a new materialized view with a fresh subscription to event streams, bypassing the stale view
- Client restart: Drops the existing connection and subscription; the new connection creates a new view from scratch
- Server reload/restart: Clears all in-memory state including materialized views and forces all clients to reconnect
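As an illustration of the token-rotation workaround, here is a minimal sketch; it assumes a `service-read` policy exists and that your application can pick up a new token without a restart (both assumptions, not details from this article):

```bash
# Hypothetical token rotation: a fresh token forces a fresh materialized view.
# Assumes the "service-read" policy exists and jq is installed.
NEW_TOKEN=$(consul acl token create -policy-name=service-read -format=json \
  | jq -r '.SecretID')

# Hand the new token to the application (delivery is environment-specific),
# then delete the old one once clients have switched over:
# consul acl token delete -id <old-accessor-id>
echo "Rotated token: ${NEW_TOKEN}"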
Error Message
You may not see explicit errors in client logs, but server logs might contain:
```
[ERROR] agent.server: stream error: error="snapshot restore interrupted stream"
[WARN]  agent.server: materialized view not updating: token=<token-id>
[ERROR] streaming: subscription error: error="ACL not found"
```
Overview of Possible Solutions
The issue can be permanently resolved by upgrading to Consul 1.19.0 or later, which includes the necessary fixes. Until you can upgrade, temporary mitigations include:
- Using the `consistent=true` parameter for critical reads to bypass the streaming backend
- Implementing client-side staleness detection with automatic reconnection
- Performing rolling server reloads after DR operations to clear stale views
- Implementing periodic ACL token rotation for long-running applications
Solutions
Solution 1: Detect the Issue
Follow these diagnostic steps to confirm you're experiencing this issue:
Step 1: Verify Streaming Backend Is Active
Check the `X-Consul-Query-Backend` response header to confirm the streaming backend is handling your requests:

```bash
curl -v "http://consul:8500/v1/health/service/myservice?index=123&wait=5m" \
  -H "X-Consul-Token: your-token" 2>&1 | grep -i query-backend
```
Expected output:

```
< X-Consul-Query-Backend: streaming
```
If the header shows `blocking-query` instead of `streaming`, the streaming backend is not active and this issue does not apply.
Step 2: Monitor for Index Regression
The `X-Consul-Index` header should monotonically increase or stay the same (during no-change periods). If it decreases, you have stale data:

```bash
for i in {1..5}; do
  echo "Request $i:"
  curl -s -D - "http://consul:8500/v1/health/service/myservice?index=1&wait=5m" \
    -H "X-Consul-Token: your-token" -o /dev/null | grep X-Consul-Index
  sleep 2
done
```

Example of stale behavior:

```
Request 1: X-Consul-Index: 5432
Request 2: X-Consul-Index: 5432  # No updates (normal during wait)
Request 3: X-Consul-Index: 5432  # Still waiting
Request 4: X-Consul-Index: 5420  # Index went backwards - PROBLEM!
Request 5: X-Consul-Index: 5420  # Stuck at old value
```
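This check can be automated. Below is a minimal watcher sketch, assuming `CONSUL_HTTP_TOKEN` is exported and `myservice` stands in for your service name; it exits non-zero on the first regression it sees:

```bash
#!/usr/bin/env bash
# Watch X-Consul-Index across blocking queries; exit 1 on the first regression.
ADDR="${CONSUL_HTTP_ADDR:-http://consul:8500}"
prev=0
while true; do
  idx=$(curl -s -D - -o /dev/null \
      "${ADDR}/v1/health/service/myservice?index=${prev}&wait=1m" \
      -H "X-Consul-Token: ${CONSUL_HTTP_TOKEN}" \
    | awk 'tolower($1) == "x-consul-index:" {print $2}' | tr -d '\r')
  if [ -n "$idx" ] && [ "$idx" -lt "$prev" ]; then
    echo "Index regression detected: ${prev} -> ${idx}" >&2
    exit 1
  fi
  prev="${idx:-$prev}"
done
```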
Step 3: Verify Token-Scoped Behavior
Create a new test token with identical permissions and compare responses:
```bash
# Create a test token with the same policy
consul acl token create -policy-name=service-read -description="Test token for debugging"

# Query with the original token
curl -s http://consul:8500/v1/health/service/myservice \
  -H "X-Consul-Token: original-token" | jq '.[] | {ID: .Service.ID, Checks: [.Checks[].Status]}'

# Query with the new token
curl -s http://consul:8500/v1/health/service/myservice \
  -H "X-Consul-Token: new-test-token" | jq '.[] | {ID: .Service.ID, Checks: [.Checks[].Status]}'
```
If the new token returns current data while the original token returns stale data, this confirms the token-scoped streaming view issue.
Step 4: Check Server Logs
Look for streaming-related errors around the time of the recovery event:
```bash
# On server nodes, search for relevant log entries
grep -E "(SubMatView|streaming|snapshot restore|event publisher)" /var/log/consul/consul.log

# Look for ACL-related streaming errors
grep -i "materialized view" /var/log/consul/consul.log | grep -i error
```
Warning signs include:
- Errors mentioning "SubMatView" or "materialized view"
- Subscription errors that aren't being retried
- Missing "snapshot restore complete" messages
- ACL errors in streaming handlers
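If your servers run under systemd, journalctl can scope the same search to the window around the restore; the timestamps below are placeholders for your own DR window:

```bash
# Search the consul unit's journal during the restore window (example timestamps)
journalctl -u consul --since "2024-06-01 10:00" --until "2024-06-01 11:00" \
  | grep -E "SubMatView|materialized view|snapshot restore"
```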
Solution 2: Apply Temporary Mitigations (Pre-Upgrade)
If you cannot immediately upgrade to Consul 1.19.0 or later, implement one or more of these mitigations:
Use Consistency Mode for Critical Reads
Force critical blocking queries to bypass the streaming backend and read directly from the leader:
curl "http://consul:8500/v1/health/service/myservice?index=123&wait=5m&consistent=true" \ -H "X-Consul-Token: your-token"
Trade-offs:
- Increased load on the Raft leader
- Reduced scalability benefits of the streaming backend
- Higher latency for queries
When to use:
- Critical decision paths where stale data could cause outages
- Health checks that trigger automated failover
- Service discovery for critical infrastructure components
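For the rolling server reload mitigation listed earlier, the sketch below restarts one systemd-managed server at a time and waits for a leader between steps. The hostnames and the 30-second settle time are assumptions to adapt to your environment:

```bash
# Rolling restart after a DR operation to clear stale materialized views.
# Restart one server at a time; verify a leader exists before moving on.
for server in consul-server-1 consul-server-2 consul-server-3; do
  ssh "$server" 'sudo systemctl restart consul'
  until curl -sf http://consul:8500/v1/status/leader | grep -q ':8300'; do
    sleep 5
  done
  sleep 30   # let clients re-establish streaming subscriptions
done
```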
Solution 3: Upgrade to Consul 1.19.0 or Later (Permanent Fix)
Target Versions:
- Minimum recommended: Consul 1.19.0 (includes both critical fixes)
- Current stable: Consul 1.21.x (includes additional stability and diagnostic improvements)
Key Fixes by Version:
- Consul 1.18.0 (February 2024):
  - Stream termination on snapshot restore (GH-20642)
  - Automatic restart of controllers on internal stream errors
  - Improved handling of stream lifecycle during cluster events
- Consul 1.19.0 (June 2024):
  - Consistent ACL error handling at blocking query timeout boundaries (GH-20876)
  - Fixed race conditions in materialized view updates
  - Enhanced token validation in streaming subscriptions
- Consul 1.21.0 (May 2025):
  - Improved logging for subscription retries on ACL changes (GH-22141)
  - Better operational visibility into streaming backend health
  - Enhanced metrics for materialized view lifecycle
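After upgrading, you can confirm the binary and the running agent agree on the new version; the jq path below reflects the agent's `/v1/agent/self` output:

```bash
# Confirm the upgraded version on the local agent and via the API
consul version
curl -s http://consul:8500/v1/agent/self \
  -H "X-Consul-Token: ${CONSUL_HTTP_TOKEN}" | jq -r '.Config.Version'
```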
Outcome
After following the above steps, you should observe:
- Blocking queries return current data consistently for all ACL tokens
- The `X-Consul-Index` header increments monotonically without regressions
- No difference in behavior between original and newly created tokens
- Snapshot restore operations do not cause streaming views to become stale
- Client applications receive timely updates without requiring restarts
- Token rotation or client restarts are no longer necessary to resolve staleness
If the issue persists after upgrading to 1.19.0 or later, review the configuration steps carefully or contact HashiCorp Support for further assistance.
Additional Information
Best Practices for Blocking Queries
When building applications that use Consul blocking queries:
- Always track index values: Monitor `X-Consul-Index` across requests and detect anomalies
- Implement reconnection logic: Handle connection drops and API errors gracefully with exponential backoff
- Use circuit breakers: Fail fast when staleness is detected to prevent cascading failures
- Log query metadata: Include `X-Consul-Query-Backend` and `X-Consul-Index` in debug logs
- Set appropriate timeouts: Use reasonable `wait` times (2-5 minutes) to balance freshness and load
- Handle index resets: Be prepared for index=0 scenarios after major cluster events
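Putting these practices together, here is a minimal blocking-query loop sketch in shell. The service name, 5-minute wait, and 32-second backoff cap are illustrative choices, not prescriptions:

```bash
#!/usr/bin/env bash
# Blocking-query loop: track the index, reset to 0 on regression, back off on errors.
ADDR="${CONSUL_HTTP_ADDR:-http://consul:8500}"
index=0
backoff=1
while true; do
  headers=$(mktemp)
  if curl -sf -D "$headers" -o /dev/null \
      "${ADDR}/v1/health/service/myservice?index=${index}&wait=5m" \
      -H "X-Consul-Token: ${CONSUL_HTTP_TOKEN}"; then
    new=$(awk 'tolower($1) == "x-consul-index:" {print $2}' "$headers" | tr -d '\r')
    if [ -z "$new" ] || [ "$new" -lt "$index" ]; then
      index=0                      # missing or regressed index: reset and re-fetch
    else
      index="$new"                 # normal case: resume from the latest index
    fi
    backoff=1
  else
    sleep "$backoff"
    backoff=$(( backoff < 32 ? backoff * 2 : 32 ))   # exponential backoff, capped
  fi
  rm -f "$headers"
done
```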
Monitoring Recommendations
Deploy comprehensive monitoring of Consul's streaming backend:
Server-side metrics:
- `consul.streaming.subscriptions` - Total active subscriptions
- `consul.streaming.resets` - Number of stream resets
- `consul.raft.apply` - Raft log application rate
- Custom alerts on snapshot restore completion
- Track subscription lifecycle events in logs
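One way to spot-check these server-side metrics is the agent telemetry endpoint; the example below assumes Prometheus-format output has been enabled in the agent's telemetry configuration:

```bash
# Pull streaming- and raft-related metrics from the local agent
curl -s "http://consul:8500/v1/agent/metrics?format=prometheus" \
  -H "X-Consul-Token: ${CONSUL_HTTP_TOKEN}" | grep -Ei 'streaming|raft_apply'
```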
Client-side metrics:
- Time between successful blocking query updates (should be < 5 minutes under normal load)
- Index regression detection counter (should be zero)
- Token rotation events and timing
- Client reconnection frequency
- Blocking query timeout rate
Application-level indicators:
- Service discovery staleness (compare discovered endpoints with actual registrations)
- Configuration propagation delay (time from KV update to application awareness)
- Health check accuracy (compare reported health with actual service status)
Related Resources
- Consul Blocking Queries - Official documentation on blocking query behavior
- Consul Streaming Backend - Technical details on the streaming architecture
- Consistency Modes - Understanding stale, consistent, and leader read modes
- ACL Token Management - Token lifecycle and rotation strategies
- Consul 1.19.0 Release Notes - Full changelog for the fix version
- Consul 1.18.0 Release Notes - Initial stream termination fix
- Upgrading Consul - General upgrade guidance
Related GitHub Issues
- GH-20642 - Stream termination on snapshot restore (fixed in 1.18.0)
- GH-20876 - ACL error handling at timeout boundaries (fixed in 1.19.0)
- GH-20868 - WAN federation streaming backend usage (backported to 1.15.11)
- GH-18636 - Envoy endpoint population after restore (fixed in 1.16.2)
- GH-22141 - SubMatView retry logging improvements (1.21.0)