Introduction
This document describes the information that should be collected when opening support tickets for issues relating to Consul and Nomad.
Expected Outcome
The instructions below gather the required information by running various commands. Many of these steps can be skipped if a centralized logging facility is available from which logs can be extracted and correlated.
NOTE: The output of each command should be written to a file whose name identifies the component and the host it was collected from.
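For example, a consistent component-and-host naming scheme might look like the following (the host names consul-server-1 and nomad-client-1 are hypothetical):
consul members > consul-members.consul-server-1.txt
sudo journalctl --since "3 days ago" --no-pager > journalctl.nomad-client-1.log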
Use Case
Types of Issues
There are primarily two categories of issues that determine the level and type of information to be collected.
- Cluster-Level Issues: These relate to the core Consul and Nomad services themselves. They can be caused by configuration problems, performance degradation, network issues, etc.
- Service-Level Issues: These affect service-to-service communication inside the service mesh. They can be caused by application performance or network issues, and only rarely by performance problems in Nomad or Consul themselves.
Procedure - Cluster-Level Issues
The following information should be collected and attached to the ticket for cluster-level issues:
Issue Details
It is important to describe the issue clearly; the following details are crucial for understanding it:
- Description of the issue (include the name of the Cluster)
- Timeline of the issue
- Changes that led to the issue (if known)
Data Collection
NOTE: Additional data will have to be collected, based on whether the issue is happening live, or did happen in the past.
Live Issue
If the issue is happening live, collect the following information:
Nomad:
NOTE: If you are running a large cluster, limit the capture to the affected members by specifying specific server and node IDs instead of all.
nomad operator debug -duration=2m -log-level=TRACE -server-id=all -node-id=all
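For example, on a large cluster the bundle can be scoped to two affected servers and a single client node; the server names and node ID prefix below are hypothetical:
nomad operator debug -duration=2m -log-level=TRACE -server-id=nomad-server-1,nomad-server-2 -node-id=9a1f2b3c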
Consul:
For Consul, run the following command from the leader node and from the other affected nodes:
consul debug
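The capture window and output location can be adjusted if needed; a sketch with explicit values (the output name is hypothetical and follows the component-and-host convention above):
consul debug -duration=2m -interval=30s -output=consul-debug.consul-server-1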
Past Issue
In addition to the above details, collect the past logs from the affected hosts using journalctl (adjust the duration based on the timeline of the issue):
sudo journalctl --since "3 days ago" --no-pager
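If Consul and Nomad run as systemd services, the capture can also be narrowed to their units and written to per-host files; the unit names consul and nomad and the host names are assumptions to adjust for your environment:
sudo journalctl -u consul --since "3 days ago" --no-pager > journalctl.consul.consul-server-1.log
sudo journalctl -u nomad --since "3 days ago" --no-pager > journalctl.nomad.nomad-client-1.log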
Procedure - Service-Level Issues
High-Level Components
The following diagram represents the various components at a high level and the request flow between them.
- Gateway-to-Service: These are requests from outside the cluster that arrive at the API Gateway, which then routes them to the applications running inside the mesh.
- Service-to-Service: These are requests between two services inside the mesh (e.g., App A talking to App B via the sidecar proxies)
Service-to-Service
Issue Details
It is important to describe the issue clearly; the following details are crucial for understanding it:
- Description of the issue
- Timeline of the issue
- Changes that led to the issue (if known)
Data Collection
The following information should be collected from both the upstream and downstream services (a consolidated example follows at the end of this list):
- Nomad Job and Allocation Status
- Nomad Job Status
nomad status -verbose [-namespace <namespace-name>] <job-name>
- Find the allocation that belongs to the job
nomad job allocs -verbose [-namespace <namespace-name>] <job-name>
- Collect the allocation status
nomad alloc status -verbose [-namespace <namespace-name>] <alloc-id>
- System Logs from the host using journalctl
SSH onto the host running the allocation and collect the journalctl logs since the issue started happening. Ref: https://www.freedesktop.org/software/systemd/man/latest/journalctl.html#-S
sudo journalctl --since <duration> --no-pager
- Envoy debug information from the allocation
- Exec into the allocation
nomad exec [-namespace <namespace-name>] -task <task-name> -job <job-name> sh
- Run the commands below from inside the allocation
NOTE: For allocations with multiple tasks that have sidecar services, port 19001 becomes 1900x, where x is the index of the task in the job definition.
curl 127.0.0.2:19001/clusters
curl 127.0.0.2:19001/listeners
curl 127.0.0.2:19001/stats
curl 127.0.0.2:19001/config_dump?include_eds=true
- Increase the Envoy log level to DEBUG
curl 127.0.0.2:19001/logging?level=debug -X POST
- Once the log level is bumped up, collect the allocation logs, and re-initiate the service request to capture the interactions.
nomad logs [-namespace <namespace-name>] -task connect-proxy-<task-name> -f <alloc-id>
- If the issue appears to be performance-related on the Nomad or Consul side, generate a debug bundle from the leader node of both Nomad and Consul (refer to the Cluster-Level Issues section above).
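A minimal end-to-end sketch of the steps above, assuming a hypothetical job payments-api in namespace prod, an allocation ID a1b2c3d4, a task image that provides sh and curl, and a single sidecar-enabled task (so the Envoy admin interface is on port 19001):

nomad status -verbose -namespace prod payments-api > nomad-job-status.payments-api.txt
nomad job allocs -verbose -namespace prod payments-api > nomad-job-allocs.payments-api.txt
nomad alloc status -verbose -namespace prod a1b2c3d4 > nomad-alloc-status.a1b2c3d4.txt

# Open a shell inside the allocation (long form of the nomad exec command above)
nomad alloc exec -namespace prod -task payments-api a1b2c3d4 sh
# ...then, from that shell, query the Envoy admin interface and bump the log level
curl -s 127.0.0.2:19001/clusters
curl -s 127.0.0.2:19001/listeners
curl -s 127.0.0.2:19001/stats
curl -s '127.0.0.2:19001/config_dump?include_eds=true'
curl -s -X POST '127.0.0.2:19001/logging?level=debug'

# Back on a host with the Nomad CLI, follow the sidecar proxy logs while re-running the failing request
# (add -stderr if the proxy output appears on stderr rather than stdout)
nomad logs -namespace prod -task connect-proxy-payments-api -f a1b2c3d4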
Gateway-to-Service
For any issues related to API Gateway, collect the following information while opening a ticket:
Issue Details
It is important to describe the issue clearly; the following details are crucial for understanding it:
- Description of the issue
- Timeline of the issue
- Changes that led to the issue (if known)
Data Collection
- Nomad Job and Allocation Status
- Nomad Job Status
nomad status -verbose [-namespace <namespace-name>] <job-name>
- Find the allocation that belongs to the job
nomad job allocs -verbose [-namespace <namespace-name>] <job-name>
- Collect the allocation status
nomad alloc status -verbose [-namespace <namespace-name>] <alloc-id>
- Exec into the Network Namespace
NOTE: The steps in this section are only required for allocations into which you can't get a nomad exec shell session (e.g., Nomad Gateway allocations). The Consul API Gateway uses a distroless Envoy container, so you won't be able to exec into the allocation to collect debug data. Instead, log into the host and enter the network namespace of the allocation to collect this information.
- SSH into the worker node where the allocation is running (you can find this from step 2 above)
- Find the Docker containers running the API Gateway allocation (the pause and envoy containers)
sudo docker ps | grep <alloc-id>
- Find the process ID of the process running inside the container
sudo docker top nomad_init_<alloc-id> -o pid
Example:
sudo docker top nomad_init_1786bb5c-9b8e-a9c1-b864-9204486cf65f -o pid
PID
1890
- Use nsenter to enter the network namespace, using the PID from the previous output
sudo nsenter -n -t <PID>
- Envoy debug information from the network namespace
When the issue is related to service connectivity (e.g., timeouts, connection errors, latency), collect the following information from the nsenter session (or by exec-ing into the allocation, where that is possible):
curl 127.0.0.2:19000/clusters
curl 127.0.0.2:19000/listeners
curl 127.0.0.2:19000/stats
curl 127.0.0.2:19000/config_dump?include_eds=true
Increase the Envoy log level to DEBUG
curl 127.0.0.2:19000/logging?level=debug -X POST
Once the log level is bumped up, collect the allocation logs, and re-initiate the service request to capture the interactions.
nomad logs [-namespace <namespace-name>] -task api -f <alloc-id>
- Config Entries
Because Gateways are configured using Consul config entries, include the output of the following commands, which show the state of the relevant config entries:
NOTE: Include any additional config entry relevant to the issue.
consul config read -kind api-gateway -name <api-gateway-service-name>
consul config read -kind http-route -name <http-route-name>
consul config read -kind proxy-defaults -name global
- System Logs
- If the issue is with jobs/allocations, SSH onto the host running the allocation and collect the journalctl logs since the issue started happening. Ref: https://www.freedesktop.org/software/systemd/man/latest/journalctl.html#-S
sudo journalctl --since <duration> --no-pager
Example:
sudo journalctl --since "3 days ago" --no-pager
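A sketch of the network-namespace path above, using a hypothetical gateway allocation ID a1b2c3d4 in namespace prod, a pause-container PID of 2345, and config entry names my-api-gateway and my-http-route; substitute the real values from your environment:

# On the worker node running the gateway allocation
sudo docker ps | grep a1b2c3d4
sudo docker top nomad_init_a1b2c3d4 -o pid
# Enter the allocation's network namespace using the PID printed above
sudo nsenter -n -t 2345
# From inside the network namespace, query the Envoy admin interface and bump the log level
curl -s 127.0.0.2:19000/clusters
curl -s 127.0.0.2:19000/listeners
curl -s 127.0.0.2:19000/stats
curl -s '127.0.0.2:19000/config_dump?include_eds=true'
curl -s -X POST '127.0.0.2:19000/logging?level=debug'

# From a host with the Nomad and Consul CLIs (and an ACL token if ACLs are enabled)
nomad logs -namespace prod -task api -f a1b2c3d4
consul config read -kind api-gateway -name my-api-gateway > config-entry.api-gateway.json
consul config read -kind http-route -name my-http-route > config-entry.http-route.json
consul config read -kind proxy-defaults -name global > config-entry.proxy-defaults.json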