Introduction
This document describes the information that should be collected when opening support tickets for issues relating to Consul and Nomad.
Expected Outcome
The instructions below gather the required information by running various commands. Many of these steps can be skipped if a centralized logging facility is available from which logs can be extracted and correlated.
NOTE: The output of each command should be written to a file whose name identifies the component and the host it was collected from.
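For example, a consistent component-and-host naming scheme might look like the following (the host names consul-server-1 and nomad-client-1 are hypothetical):
consul members > consul-members.consul-server-1.txt
sudo journalctl --since "3 days ago" --no-pager > journalctl.nomad-client-1.log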
Use Case
Types of Issues
There are primarily two categories of issues that determine the level and type of information to be collected.
- Cluster-Level Issues: These relate to the core Consul and Nomad services themselves. They can be caused by configuration problems, performance degradation, network issues, etc.
- Service-Level Issues: These affect service-to-service communication inside the service mesh. They can be caused by application performance or network issues, and only rarely by performance problems in Nomad or Consul themselves.
Procedure - Cluster-Level Issues
The following information should be collected and attached to the ticket for cluster-level issues:
Issue Details
It is important to describe the issue clearly; the following details are crucial for understanding it:
- Description of the issue (include the name of the Cluster)
- Timeline of the issue
- Changes that led to the issue (if known)
Data Collection
NOTE: Additional data will have to be collected, based on whether the issue is happening live, or did happen in the past.
Live Issue
If the issue is happening live, collect the following information:
Nomad:
NOTE: If you are running a large cluster, limit the capture to the affected members by specifying specific server and node IDs instead of all.
nomad operator debug -duration=2m -log-level=TRACE -server-id=all -node-id=all
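For example, on a large cluster the bundle can be scoped to two affected servers and a single client node; the server names and node ID prefix below are hypothetical:
nomad operator debug -duration=2m -log-level=TRACE -server-id=nomad-server-1,nomad-server-2 -node-id=9a1f2b3c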
Consul:
For Consul, run the following command from the leader node and from the other affected nodes:
consul debug
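The capture window and output location can be adjusted if needed; a sketch with explicit values (the output name is hypothetical and follows the component-and-host convention above):
consul debug -duration=2m -interval=30s -output=consul-debug.consul-server-1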
Past Issue
In addition to the above details, collect the past logs from the affected hosts using journalctl (adjust the duration based on the timeline of the issue):
sudo journalctl --since "3 days ago" --no-pager
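If Consul and Nomad run as systemd services, the capture can also be narrowed to their units and written to per-host files; the unit names consul and nomad and the host names are assumptions to adjust for your environment:
sudo journalctl -u consul --since "3 days ago" --no-pager > journalctl.consul.consul-server-1.log
sudo journalctl -u nomad --since "3 days ago" --no-pager > journalctl.nomad.nomad-client-1.log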
Procedure - Service-Level Issues
High-Level Components
The following diagram represents the various components at a high level and the request flow between them.
- Gateway-to-Service: These are requests from outside the cluster that arrive at the API Gateway, which then routes them to the applications running inside the mesh.
- Service-to-Service: These are requests between two services inside the mesh (e.g., App A talking to App B via the sidecar proxies)
Service-to-Service
Issue Details
It is important to describe the issue clearly; the following details are crucial for understanding it:
- Description of the issue
- Timeline of the issue
- Changes that led to the issue (if known)
Data Collection
The following information should be collected from both the upstream and downstream services (a consolidated example follows at the end of this list):
- Nomad Job and Allocation Status
- Nomad Job Status
nomad status -verbose [-namespace <namespace-name>] <job-name>
- Find the allocation that belongs to the job
nomad job allocs -verbose [-namespace <namespace-name>] <job-name>
- Collect the allocation status
nomad alloc status -verbose [-namespace <namespace-name>] <alloc-id>
- System Logs from the host using journalctl
SSH onto the host running the allocation and collect the journalctl logs since the issue started happening. Ref: https://www.freedesktop.org/software/systemd/man/latest/journalctl.html#-S
sudo journalctl --since <duration> --no-pager
- Envoy debug information from the allocation
- Exec into the allocation
nomad exec [-namespace <namespace-name>] -task <task-name> -job <job-name> sh
- Run the commands below from inside the allocation
NOTE: For allocations with multiple tasks that have sidecar services, port 19001 becomes 1900x, where x is the index of the task in the job definition.
curl 127.0.0.2:19001/clusters
curl 127.0.0.2:19001/listeners
curl 127.0.0.2:19001/stats
curl 127.0.0.2:19001/config_dump?include_eds=true
- Increase the Envoy log level to DEBUG
curl 127.0.0.2:19001/logging?level=debug -X POST
- Once the log level is bumped up, collect the allocation logs, and re-initiate the service request to capture the interactions.
nomad logs [-namespace <namespace-name>] -task connect-proxy-<task-name> -f <alloc-id>
- If the issue appears to be performance-related on the Nomad or Consul side, generate a debug bundle from the leader node of both Nomad and Consul (refer to the Cluster-Level Issues section above).
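A minimal end-to-end sketch of the steps above, assuming a hypothetical job payments-api in namespace prod, an allocation ID a1b2c3d4, a task image that provides sh and curl, and a single sidecar-enabled task (so the Envoy admin interface is on port 19001):

nomad status -verbose -namespace prod payments-api > nomad-job-status.payments-api.txt
nomad job allocs -verbose -namespace prod payments-api > nomad-job-allocs.payments-api.txt
nomad alloc status -verbose -namespace prod a1b2c3d4 > nomad-alloc-status.a1b2c3d4.txt

# Open a shell inside the allocation (long form of the nomad exec command above)
nomad alloc exec -namespace prod -task payments-api a1b2c3d4 sh
# ...then, from that shell, query the Envoy admin interface and bump the log level
curl -s 127.0.0.2:19001/clusters
curl -s 127.0.0.2:19001/listeners
curl -s 127.0.0.2:19001/stats
curl -s '127.0.0.2:19001/config_dump?include_eds=true'
curl -s -X POST '127.0.0.2:19001/logging?level=debug'

# Back on a host with the Nomad CLI, follow the sidecar proxy logs while re-running the failing request
# (add -stderr if the proxy output appears on stderr rather than stdout)
nomad logs -namespace prod -task connect-proxy-payments-api -f a1b2c3d4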
Gateway-to-Service
For any issues related to API Gateway, collect the following information while opening a ticket:
Issue Details
It is important to describe the issue clearly; the following details are crucial for understanding it:
- Description of the issue
- Timeline of the issue
- Changes that led to the issue (if known)
Data Collection
- Nomad Job and Allocation Status
- Nomad Job Status
nomad status -verbose [-namespace <namespace-name>] <job-name>
- Find the allocation that belongs to the job
nomad job allocs -verbose [-namespace <namespace-name>] <job-name>
- Collect the allocation status
nomad alloc status -verbose [-namespace <namespace-name>] <alloc-id>
- Exec into the Network Namespace
NOTE: The steps in this section are only required for allocations into which you can't get a nomad exec shell session (e.g., Nomad Gateway allocations). The Consul API Gateway uses a distroless Envoy container, so you won't be able to exec into the allocation to collect debug data. Instead, log into the host and enter the network namespace of the allocation to collect this information.
- SSH into the worker node where the allocation is running (you can find this from step 2 above)
- Find the Docker containers running the API Gateway allocation (the pause and envoy containers)
sudo docker ps | grep <alloc-id>
- Find the process ID of the process running inside the container
sudo docker top nomad_init_<alloc-id> -o pid
Example:
sudo docker top nomad_init_1786bb5c-9b8e-a9c1-b864-9204486cf65f -o pid
PID
1890
- Use nsenter to enter the network namespace, using the PID from the previous output
sudo nsenter -n -t <PID>
- Envoy debug information from the network namespace
When the issue is related to service connectivity (e.g., timeouts, connection errors, latency), collect the following information from the nsenter session (or by exec-ing into the allocation, where that is possible):
curl 127.0.0.2:19000/clusters
curl 127.0.0.2:19000/listeners
curl 127.0.0.2:19000/stats
curl 127.0.0.2:19000/config_dump?include_eds=true
Increase the Envoy log level to DEBUG
curl 127.0.0.2:19000/logging?level=debug -X POST
Once the log level is bumped up, collect the allocation logs, and re-initiate the service request to capture the interactions.
nomad logs [-namespace <namespace-name>] -task api -f <alloc-id>
- Config Entries
Because Gateways are configured using Consul config entries, include the output of the following commands, which show the state of the relevant config entries:
NOTE: Include any additional config entry relevant to the issue.
consul config read -kind api-gateway -name <api-gateway-service-name>
consul config read -kind http-route -name <http-route-name>
consul config read -kind proxy-defaults -name global
- System Logs
- If the issue is with jobs/allocations, SSH onto the host running the allocation and collect the journalctl logs since the issue started happening. Ref: https://www.freedesktop.org/software/systemd/man/latest/journalctl.html#-S
sudo journalctl --since <duration> --no-pager
Example:
sudo journalctl --since "3 days ago" --no-pager
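A sketch of the network-namespace path above, using a hypothetical gateway allocation ID a1b2c3d4 in namespace prod, a pause-container PID of 2345, and config entry names my-api-gateway and my-http-route; substitute the real values from your environment:

# On the worker node running the gateway allocation
sudo docker ps | grep a1b2c3d4
sudo docker top nomad_init_a1b2c3d4 -o pid
# Enter the allocation's network namespace using the PID printed above
sudo nsenter -n -t 2345
# From inside the network namespace, query the Envoy admin interface and bump the log level
curl -s 127.0.0.2:19000/clusters
curl -s 127.0.0.2:19000/listeners
curl -s 127.0.0.2:19000/stats
curl -s '127.0.0.2:19000/config_dump?include_eds=true'
curl -s -X POST '127.0.0.2:19000/logging?level=debug'

# From a host with the Nomad and Consul CLIs (and an ACL token if ACLs are enabled)
nomad logs -namespace prod -task api -f a1b2c3d4
consul config read -kind api-gateway -name my-api-gateway > config-entry.api-gateway.json
consul config read -kind http-route -name my-http-route > config-entry.http-route.json
consul config read -kind proxy-defaults -name global > config-entry.proxy-defaults.json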