This is a relatively comprehensive collection of errors and warnings emitted by Consul agents or found in Consul server/client log output. Each entry provides a detailed explanation of the error and, when possible, typical root causes, workarounds, or solutions.
[ERR] memberlist: Failed to send gossip to :8301: write udp :8301->:8301: sendto: invalid argument
-> This usually indicates a problem with ARP table overflow. A possible solution is to increase the ARP cache expiration time:
On Linux:
net.ipv4.neigh.[interface].base_reachable_time_ms
On Windows:
netsh int ipv4 set interface [interface] basereachable
The maximum value on Windows is 1 hour.
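On Linux, for example, the setting can be raised with sysctl; the interface name (eth0) and value below are only illustrative:
sysctl -w net.ipv4.neigh.eth0.base_reachable_time_ms=600000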
[ERR] agent: failed to sync remote state: No known Consul servers
and/or
[ERR] dns: rpc error: No known Consul servers
-> This can happen when an agent is first starting, but it shouldn’t happen during normal operation!
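If the message persists after startup, the agent most likely cannot reach any server. As a minimal sketch (server addresses are placeholders), a client agent configuration can point at the servers with retry_join:
{
  "retry_join": ["10.0.0.10", "10.0.0.11", "10.0.0.12"]
}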
[WARN] memberlist: handler queue full, dropping message
-> This is indicative of a fairly high message load. One possible cause is a lot of node flapping on one or more nodes.
[WARN] serf: Intent queue depth
-> This issue was mostly seen with older versions of Consul (<0.7), where we used a small, fixed-size circular buffer. This was fixed in 0.7 by storing intents per-node instead.
[WARN] raft: not part of stable configuration, aborting election
-> This means you don’t have a complete peers.json on all the servers (the server is not seeing itself in the peer configuration). You’ll need to stop all the servers and create an identical peers.json file on each, which includes all the server IP:port pairs. Once they all have the same peers.json file you can start them again.
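As an illustration, a peers.json listing the server IP:port pairs might look like the following (addresses are placeholders; servers running Raft protocol version 3 or later instead use an array of objects with id, address, and non_voter fields):
[
  "10.1.0.1:8300",
  "10.1.0.2:8300",
  "10.1.0.3:8300"
]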
Error starting agent: Failed to start Consul server: Failed to start Raft: recovery failed: refused to recover cluster with no initial state, this is probably an operator error
-> This message is generated when you set a peers.json on a fresh server (nothing in the data-dir’s “raft” directory). Consul assumes this is an error because usually you want to recover a server with data; fresh servers can be joined into an existing cluster with consul join. It is possible that the data-dir got wiped, or that it is a newly-spawned server. The fix here is to only set that file on the servers that were part of the cluster you are trying to recover and that have valid Raft data in their data-dir, and then join the servers to the cluster in the usual way.
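For example, a fresh server can be joined to any existing member of the cluster (the address is a placeholder):
consul join 10.1.0.1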
Error parsing xxx/yyy.json: 1 error(s) occurred
-> This can be seen if you recently upgraded Consul from 0.9.x or earlier to 1.x and you are using something like "type":"tcp", "tcp":"10.10.40.12:9999",... in a check definition. Here, the type flag was always undocumented from the get-go, since it was an internal flag in that struct that was set based on whether you passed “tcp”, “http”, or whatever other value you used. If you tried to pass it in pre-1.0 versions, it was essentially a no-op. We made some changes in the parser in 1.0 that prevent this internal-only field from being set in config files. If you remove it, your config file will work.
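As a sketch, a check definition like the following (the name and interval are illustrative) parses correctly once the internal type field is removed:
{
  "check": {
    "name": "tcp-check",
    "tcp": "10.10.40.12:9999",
    "interval": "10s"
  }
}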
QName Invalid:
-> This DNS error is produced when Consul is given a name it is supposed to resolve (something under the .consul domain) but the name doesn’t parse correctly.
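A well-formed query names something under the .consul domain, for example (the service name is a placeholder, and 8600 is the default DNS port):
dig @127.0.0.1 -p 8600 web.service.consul SRV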
[ERR] memberlist: Decrypt packet failed: Unsupported encryption version 88 from=10.112.139.134:48050
-> This is due to an unsupported encryption version. Valid encryption version values are 0 or 1, so seeing anything else (88 in this case) is extremely unusual and could point to something on the network (VIPs or VLAN tagging, for example) altering packets.
140600432383648:error:14094412:SSL routines:SSL3_READ_BYTES:sslv3 alert bad certificate:s3_pkt.c:1262:SSL alert number 42
or
140600432383648:error:140790E5:SSL routines:SSL23_WRITE:ssl handshake failure:s23_lib.c:177:
-> When a user enables all forms of the tls_* configuration flags, including verify_incoming, errors can be observed when connecting to Consul without presenting a client certificate. Use a command like the following with the correct certificate and key pairs to connect:
echo quit | openssl s_client -showcerts -CAfile rootCA.pem -key client.key \
-cert client.crt -connect localhost:8501
or with curl:
curl -k https://192.168.1.2:8501/v1/catalog/services?pretty \
--key client.key --cert client.crt
{
"consul": []
}
These commands demonstrate that the Consul server is indeed serving TLS correctly, and that presenting a client certificate allows a successful handshake and connection.
Failed to invoke watch handler /tmp/script.sh exit status 126
-> A shell exit status of 126 like this indicates a non-executable script file (or another permissions issue). Here is a list of other relevant shell exit status codes: http://www.tldp.org/LDP/abs/html/exitcodes.html
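For example, making the handler script executable typically resolves this (the path is taken from the error message above):
chmod +x /tmp/script.sh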
[WARN] consul.fsm: EnsureRegistration failed: failed inserting node: Error while renaming Node ID: "node-id": Node name "node-name" reserved by node "n" with name "node-name"
-> This can be seen when a user is re-installing a node. Starting with Consul v0.8.5, the node ID is no longer predictable by default, so a re-installed node may come up with a different node ID, which leads to this message. The message can also appear in the logs if you re-run a node with an identical node name but a different node ID. When this occurs, Consul refuses to let the new node "steal" the node name, because the previous node was seen too recently. One way to fix this is to properly leave the cluster when the "first" node is being decommissioned. Another method is to stop the node in question, log into a different node running in the same Consul cluster, run the consul force-leave CLI command against the node name, and then restart the node in question.
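For example, using the node name from the message above:
consul force-leave node-name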
consul: error getting server health from "SERVER NAME N": context deadline exceeded
consul: error getting server health from "SERVER NAME N": last request still outstanding
-> These messages generally indicate that the instances running the server nodes may be running low on resources (e.g. network I/O, RAM, or storage). You always want to make sure that your nodes have enough of these resources so that the Consul cluster can properly and efficiently run your specific Consul use-case.
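Overall server health can also be inspected via the autopilot health endpoint, for example (the address is a placeholder for your local agent):
curl http://127.0.0.1:8500/v1/operator/autopilot/health?pretty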