Blocking queries return stale data after DR/snapshot restore – HashiCorp Help Center

Introduction

This article provides detailed steps to resolve an issue where Consul blocking queries return stale data after a disaster recovery (DR) operation or snapshot restore. The issue affects clusters running Consul versions 1.15.x through 1.18.x and is fully resolved in version 1.19.0 and later.
The behaviour is token-scoped: a specific ACL token continues to receive outdated responses while queries using a different token immediately return current data. This article includes diagnostic steps, reproduction guidance, temporary mitigations, and the permanent solution through upgrading.Problem
After a disaster recovery operation or snapshot restore in Consul, blocking queries using the streaming backend may continue to return stale data for specific ACL tokens. The issue persists until the client application restarts or the ACL token is rotated. Switching to a different token immediately returns fresh data, confirming the token-scoped nature of the problem.
This can lead to:

Applications making decisions based on outdated service health information
Service discovery returning stale endpoint lists
Configuration changes not being reflected in real-time
Inconsistent behaviour across different clients or tokens

Prerequisites

Consul Enterprise or OSS version 1.15.x through 1.18.x
Streaming backend enabled (default since Consul 1.10)
ACL tokens in use for authentication
Client applications using blocking queries (with wait and index parameters)
Access to Consul server logs and API

Cause

This issue occurs because Consul's streaming backend, which powers blocking queries through materialized views, maintains token-scoped subscriptions that are filtered by ACL permissions. During disaster recovery operations—particularly snapshot restores—internal event streams may not be properly terminated and reset, leaving materialized views in a stale state.

1. Stream Termination on Snapshot Restore
Before Consul 1.18.0, internal streams were not properly terminated when a snapshot restore occurred. This left materialized views pointing to pre-restore state, continuing to serve outdated data even after the cluster state had been rolled back.

2. ACL Error Handling at Timeout Boundaries
Before Consul 1.19.0, when blocking queries reached their timeout while encountering ACL errors, the streaming backend did not consistently handle the error condition. This caused subscriptions to become stuck in an inconsistent state, unable to refresh or reconnect properly.

Why Token Rotation or Client Restart Resolves the Issue

Token rotation: Creates a new materialized view with a fresh subscription to event streams, bypassing the stale view
Client restart: Drops the existing connection and subscription; the new connection creates a new view from scratch
Server reload/restart: Clears all in-memory state including materialized views and forces all clients to reconnect

Error Message
You may not see explicit errors in client logs, but server logs might contain:

[ERROR] agent.server: stream error: error="snapshot restore interrupted stream"
[WARN] agent.server: materialized view not updating: token=<token-id>
[ERROR] streaming: subscription error: error="ACL not found"

Overview of Possible Solutions

The issue can be permanently resolved by upgrading to Consul 1.19.0 or later, which includes the necessary fixes. Until you can upgrade, temporary mitigations include:

Using consistent=true parameter for critical reads to bypass the streaming backend
Implementing client-side staleness detection with automatic reconnection
Performing rolling server reloads after DR operations to clear stale views
Implementing periodic ACL token rotation for long-running applications

Solutions

Solution 1: Detect the Issue

Follow these diagnostic steps to confirm you're experiencing this issue:

Step 1: Verify Streaming Backend Is Active

Check the X-Consul-Query-Backend response header to confirm the streaming backend is handling your requests:

curl -v http://consul:8500/v1/health/service/myservice?index=123&wait=5m \
  -H "X-Consul-Token: your-token" 2>&1 | grep -i query-backend

Expected output:

< X-Consul-Query-Backend: streaming

If the header shows blocking-query instead of streaming, the streaming backend is not active and this issue does not apply.

Step 2: Monitor for Index Regression

The X-Consul-Index should monotonically increase or stay the same (during no-change periods). If it decreases, you have stale data:

for i in {1..5}; do
  echo "Request $i:"
  curl -s -D - http://consul:8500/v1/health/service/myservice?index=1&wait=5m \
    -H "X-Consul-Token: your-token" -o /dev/null | grep X-Consul-Index
  sleep 2
done

Example of stale behavior:

Request 1: X-Consul-Index: 5432
Request 2: X-Consul-Index: 5432  # No updates (normal during wait)
Request 3: X-Consul-Index: 5432  # Still waiting
Request 4: X-Consul-Index: 5420  # Index went backwards - PROBLEM!
Request 5: X-Consul-Index: 5420  # Stuck at old value

Step 3: Verify Token-Scoped Behavior
Create a new test token with identical permissions and compare responses:

# Create a test token with the same policy
consul acl token create -policy-name=service-read -description="Test token for debugging"

# Query with the original token
curl -s http://consul:8500/v1/health/service/myservice \
  -H "X-Consul-Token: original-token" | jq '.[] | {ID: .Service.ID, Checks: [.Checks[].Status]}'

# Query with the new token
curl -s http://consul:8500/v1/health/service/myservice \
  -H "X-Consul-Token: new-test-token" | jq '.[] | {ID: .Service.ID, Checks: [.Checks[].Status]}'

If the new token returns current data while the original token returns stale data, this confirms the token-scoped streaming view issue.

Step 4: Check Server Logs
Look for streaming-related errors around the time of the recovery event:

# On server nodes, search for relevant log entries
grep -E "(SubMatView|streaming|snapshot restore|event publisher)" /var/log/consul/consul.log

# Look for ACL-related streaming errors
grep -i "materialized view" /var/log/consul/consul.log | grep -i error

Warning signs include:

Errors mentioning "SubMatView" or "materialized view"
Subscription errors that aren't being retried
Missing "snapshot restore complete" messages
ACL errors in streaming handlers

Solution 2: Apply Temporary Mitigations (Pre-Upgrade)

If you cannot immediately upgrade to Consul 1.19.0 or later, implement one or more of these mitigations:

Use Consistency Mode for Critical Reads
Force critical blocking queries to bypass the streaming backend and read directly from the leader:

curl "http://consul:8500/v1/health/service/myservice?index=123&wait=5m&consistent=true" \
  -H "X-Consul-Token: your-token"

Trade-offs:

Increased load on the Raft leader
Reduced scalability benefits of the streaming backend
Higher latency for queries

When to use:

Critical decision paths where stale data could cause outages
Health checks that trigger automated failover
Service discovery for critical infrastructure components

Reference:- https://developer.hashicorp.com/consul/api-docs/features/consistency

Solution 3: Upgrade to Consul 1.19.0 or Later (Permanent Fix)

Target Versions:

Minimum recommended: Consul 1.19.0 (includes both critical fixes)
Current stable: Consul 1.21.x (includes additional stability and diagnostic improvements)

Key Fixes by Version:

Consul 1.18.0 (February 2024):
- Stream termination on snapshot restore (GH-20642)
- Automatic restart of controllers on internal stream errors
- Improved handling of stream lifecycle during cluster events
Consul 1.19.0 (June 2024):
- Consistent ACL error handling at blocking query timeout boundaries (GH-20876)
- Fixed race conditions in materialized view updates
- Enhanced token validation in streaming subscriptions
Consul 1.21.0 (May 2025):
- Improved logging for subscription retries on ACL changes (GH-22141)
- Better operational visibility into streaming backend health
- Enhanced metrics for materialized view lifecycle

Outcome

After following the above steps, you should observe:

Blocking queries return current data consistently for all ACL tokens
The X-Consul-Index header increments monotonically without regressions
No difference in behavior between original and newly created tokens
Snapshot restore operations do not cause streaming views to become stale
Client applications receive timely updates without requiring restarts
Token rotation or client restarts are no longer necessary to resolve staleness

If the issue persists after upgrading to 1.19.0 or later, review the configuration steps carefully or contact HashiCorp Support for further assistance.

Additional Information

Best Practices for Blocking Queries

When building applications that use Consul blocking queries:

Always track index values: Monitor X-Consul-Index across requests and detect anomalies
Implement reconnection logic: Handle connection drops and API errors gracefully with exponential backoff
Use circuit breakers: Fail fast when staleness is detected to prevent cascading failures
Log query metadata: Include X-Consul-Query-Backend and X-Consul-Index in debug logs
Set appropriate timeouts: Use reasonable wait times (2-5 minutes) to balance freshness and load
Handle index resets: Be prepared for index=0 scenarios after major cluster events

Monitoring Recommendations

Deploy comprehensive monitoring of Consul's streaming backend:

Server-side metrics:

consul.streaming.subscriptions - Total active subscriptions
consul.streaming.resets - Number of stream resets
consul.raft.apply - Raft log application rate
Custom alerts on snapshot restore completion
Track subscription lifecycle events in logs

Client-side metrics:

Time between successful blocking query updates (should be < 5 minutes under normal load)
Index regression detection counter (should be zero)
Token rotation events and timing
Client reconnection frequency
Blocking query timeout rate

Application-level indicators:

Service discovery staleness (compare discovered endpoints with actual registrations)
Configuration propagation delay (time from KV update to application awareness)
Health check accuracy (compare reported health with actual service status)

Monitoring Recommendations

Consul Blocking Queries - Official documentation on blocking query behavior
Consul Streaming Backend - Technical details on the streaming architecture
Consistency Modes - Understanding stale, consistent, and leader read modes
ACL Token Management - Token lifecycle and rotation strategies
Consul 1.19.0 Release Notes - Full changelog for the fix version
Consul 1.18.0 Release Notes - Initial stream termination fix
Upgrading Consul - General upgrade guidance

Related GitHub Issues

GH-20642 - Stream termination on snapshot restore (fixed in 1.18.0)
GH-20876 - ACL error handling at timeout boundaries (fixed in 1.19.0)
GH-20868 - WAN federation streaming backend usage (backported to 1.15.11)
GH-18636 - Envoy endpoint population after restore (fixed in 1.16.2)
GH-22141 - SubMatView retry logging improvements (1.21.0)

Articles in this section

Related articles