Introduction
This article covers troubleshooting two common issues observed in Consul API Gateway deployments: unexpected request failures with response_code: 0 and high latency for backend services. Both can significantly impact application functionality. The article discusses potential causes, mitigation steps, and recommendations for improved observability and stability in your Consul architecture.
Scenario
Issue 1: Request Failures with response_code: 0
Summary:
A backend/upstream service exhibited unexpected request failures with response_code: 0 in the Envoy access logs when accessed via Consul API Gateway. This behavior persisted even for requests well within configured timeout thresholds and impacted key application functions.
Key Observations:
- Requests terminated abruptly with response_code: 0.
- Logs indicated downstream_remote_disconnect in the Consul data plane logs for the backend service.
- Envoy metrics like rq_total included requests not categorized under rq_error, rq_success, or rq_timeout.
- Further analysis suggested these failures were due to unexpected client-side connection terminations.
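The triage behind these observations can be sketched in Python. This is a minimal sketch, assuming the Envoy access logs are written as one JSON object per line with keys named response_code, response_flags, and response_code_details. Those key names and the function name summarize_zero_code are assumptions for illustration and should be matched to your own access log format.

```python
import json
from collections import Counter


def summarize_zero_code(lines):
    """Count response_code 0 entries, grouped by (response_flags, response_code_details)."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
            code = int(entry.get("response_code", -1))
        except (ValueError, TypeError, AttributeError):
            # Skip lines that are not JSON objects or that carry a non-numeric code.
            continue
        if code != 0:
            continue
        key = (
            entry.get("response_flags", "-"),
            entry.get("response_code_details", "-"),
        )
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real deployment.
    sample = [
        '{"response_code": 0, "response_flags": "DC", '
        '"response_code_details": "downstream_remote_disconnect"}',
        '{"response_code": 200, "response_flags": "-", '
        '"response_code_details": "via_upstream"}',
    ]
    for key, count in summarize_zero_code(sample).items():
        print(key, count)
```

A group dominated by the Envoy response flag DC (downstream connection termination) together with downstream_remote_disconnect is consistent with the client-side terminations described above.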
Issue 2: High Latency for Backend Services
Summary:
High latency was observed for a backend/upstream service via the Consul API Gateway, causing performance degradation in critical application functions.
Key Observations:
- Requests exceeded the configured timeout threshold of 60 seconds, resulting in 504 errors.
- Some requests within the timeout threshold terminated unexpectedly with response_code: 0.
- Envoy access log timing parameters (e.g., duration, upstream_service_time) indicated connection terminations before request completion.
- Further analysis pointed to the application unexpectedly closing connections.
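The distinction between the two failure modes above can be sketched as a small classifier. This is an illustrative sketch, assuming each access log entry has already been parsed into a dictionary with response_code, duration (in milliseconds, as Envoy reports it), and upstream_service_time keys. The function names and the 60000 ms default, which mirrors the 60 second timeout mentioned above, are assumptions.

```python
def _to_int(value, default):
    """Convert a log field to int, falling back to default for '-' or missing values."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


def classify_request(entry, timeout_ms=60_000):
    """Label one access log entry as 'timeout', 'early_termination', or 'other'."""
    code = _to_int(entry.get("response_code"), -1)
    duration_ms = _to_int(entry.get("duration"), 0)
    if code == 504 and duration_ms >= timeout_ms:
        # The request ran until the configured timeout was reached.
        return "timeout"
    if code == 0 and duration_ms < timeout_ms:
        # The connection ended before any response code was produced.
        return "early_termination"
    return "other"


def upstream_responded(entry):
    """Return True when the entry carries a numeric upstream_service_time."""
    value = entry.get("upstream_service_time")
    return value is not None and str(value).isdigit()
```

An entry labeled early_termination for which upstream_responded returns False suggests the connection closed before the backend returned response headers, which matches the pattern described in the observations.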
Recommendation
- Analyze Application Behavior:
  - Investigate the application stack for unexpected connection terminations. Ensure proper handling of client connections to avoid abrupt disconnects.
  - Consider implementing retries or failover mechanisms to handle transient connection issues.
- Enable Distributed Tracing (if feasible):
  - Enable application-level distributed tracing for granular insights into request flow and connection behavior.
  - Propagate required tracing headers as outlined in the Consul Distributed Tracing Documentation.
- Enhance Envoy Observability:
  - Configure custom Envoy access log formatting to include additional parameters for detailed analysis. Useful Envoy command operators include response_flags, downstream_local_address, and upstream_transport_failure_reason.
  - Enable metrics collection for a deeper understanding of response_code: 0 occurrences.
- Optimize Timeout Settings:
  - Verify timeout settings in proxy-defaults and adjust based on application requirements. Consider breaking down large transactions into smaller chunks to minimize long-lived connections.
- Leverage Solutions Architecture and Security Team Support:
  - Engage your Solution Architect and your Security Team to design tailored observability and troubleshooting strategies, such as integrating third-party open source monitoring tools into your application or customizing Envoy configurations for custom log formatting.
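For the tracing recommendation above, header propagation inside an application can be sketched as follows. This is a minimal sketch and not the method prescribed by Consul: the header list combines x-request-id with the Zipkin B3 and W3C Trace Context header names as an assumption, and the function name propagate_trace_headers is hypothetical. Confirm the exact headers required for your tracer against the Consul Distributed Tracing Documentation.

```python
# Assumed header set: x-request-id plus Zipkin B3 and W3C Trace Context names.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "traceparent",
    "tracestate",
)


def propagate_trace_headers(incoming, outgoing=None):
    """Copy known tracing headers from an incoming request onto outgoing headers."""
    result = dict(outgoing or {})
    # HTTP header names are case-insensitive, so compare in lowercase.
    lowered = {name.lower(): value for name, value in incoming.items()}
    for name in TRACE_HEADERS:
        if name in lowered:
            result[name] = lowered[name]
    return result
```

The application would call this when it makes an upstream request on behalf of an inbound one, so that spans emitted by each proxy along the path can be joined into a single trace.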
Additional Information
- Links to Related Resources:
By following these recommendations and leveraging the additional resources, you can address and prevent similar issues in your Consul deployments, ensuring a stable and performant environment for your critical application functions.