Introduction
If Consul is used as the storage backend for Vault, it is important to consider the Consul agent parameter http_max_conns_per_client. Its default value of 200 connections may be insufficient under high load, resulting in issues when Vault's parallel connections to Consul exceed the configured maximum.
Cause
Some examples of related scenarios can include:
- High volume of requests to Vault or a new peak in the number of requests per second. Vault telemetry such as vault.consul.*, as well as the Vault Audit Logs, should help identify volumes and transaction rates. Consul-specific metrics such as consul.client.rpc* and consul.rpc.request can also help.
- During the Vault startup / boot phase, iterative requests stop abruptly midway, resulting in the service restarting and cycling through similar events.
- Vault is restarted after a prolonged period of being offline, and a large number of lease revocations need to be performed at a rate that exceeds the available connections, resulting in a restart each time. This is particularly common in versions of Vault prior to 1.7.x, which lack the revocation manager improvements that help prevent this. In these cases the revoked leases referenced from the offending mount differ between restarts (different lease IDs): Vault expunges some expired leases on each start, then exceeds the available connection limit and restarts again.
- A large number of 429 responses logged in the Vault operational logs. An example log line is included below:
Oct 24 15:29:17 vault[3317]: {"@level":"error","@message":"failed to revoke lease","@module":"expiration","@timestamp":"2023-10-24T15:29:17.846828Z","error":"failed to revoke token: failed to scan for children: Unexpected response code: 429","lease_id":"auth/aws/login/hc394621b09308176cb45029f7424e7257fba288b6abd014aee1c4bab7449b5d6.kGL2x"}
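To gauge how often this error occurs, the matching log lines can be counted. The following is a minimal sketch, not an official tool: the function name count_429 is hypothetical, and it assumes the error text appears exactly as in the example log line above.

```shell
#!/usr/bin/env bash
# count_429 (hypothetical helper): count log lines reporting a 429 from Consul.
# Reads log text on stdin, so it works with journalctl output or a saved file.
count_429() {
  # grep -c prints the number of matching lines; it exits non-zero when
  # that number is 0, so the exit status is deliberately ignored.
  grep -c 'Unexpected response code: 429' || true
}

# Example, assuming Vault runs as the systemd unit "vault" as in this article:
#   sudo journalctl -u vault --since "1 hour ago" | count_429
```

Running the count over successive time windows gives a rough view of whether the errors cluster around restarts or load peaks.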
Solution
Set http_max_conns_per_client according to the measured demand, using the Vault Audit Log or the Vault Operational Log to determine what is needed, while also considering the hardware resources available to the Consul servers. For example, an increased value of 300 may be sufficient if the scenarios above are intermittent, occurring only during certain peaks or after a restart that is followed by several more restarts before the Vault service becomes stable.
To begin, stop the Vault service on the Vault hosts before making any adjustments to the Consul configuration file for the Consul agent running in client mode on the same host. It is also worth recording the initial state of Consul members and peers so that the same state can be confirmed at the very end, once all changes have been successfully applied.
# // On Vault host:
consul members ;
sudo systemctl stop vault ;
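To make the before-and-after comparison of cluster membership less error-prone, the member list can be reduced to node names and statuses and saved. This is a sketch under assumptions: the helper name members_summary and the file path /tmp/members.before are hypothetical, and it assumes the usual `consul members` column order (Node, Address, Status, ...).

```shell
#!/usr/bin/env bash
# members_summary (hypothetical helper): reduce `consul members` output,
# read from stdin, to sorted "node status" pairs.
members_summary() {
  # Skip the header row; column 1 is the node name, column 3 its status.
  awk 'NR > 1 { print $1, $3 }' | sort
}

# Example usage (the file path is an arbitrary choice):
#   consul members | members_summary > /tmp/members.before
#   ... apply the configuration changes ...
#   consul members | members_summary | diff /tmp/members.before -
```

An empty diff at the end suggests that every node is present with the same status as before the change.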
Add the new limits stanza to the HCL file of the Consul agent running on the Vault host:
# // contents of Consul agent conf '/etc/consul.d/consul.hcl' on Vault
server = false
data_dir = "/opt/consul"
node_name = "PR-US-vault1-agent"
# ... rest of conf ...
# // add:
limits {
http_max_conns_per_client = 300
}
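Before applying the change, it can be worth confirming that the file actually holds the intended value. The snippet below is a small sketch; the function name configured_limit is hypothetical, and it assumes the setting sits on its own line as in the example above.

```shell
#!/usr/bin/env bash
# configured_limit (hypothetical helper): print the numeric value assigned to
# http_max_conns_per_client in the given file. Prints nothing if it is not set.
configured_limit() {
  sed -n 's/^[[:space:]]*http_max_conns_per_client[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1"
}

# Example, using the configuration path from this article:
#   configured_limit /etc/consul.d/consul.hcl
```

The `consul validate` subcommand can additionally be pointed at the configuration directory to catch syntax errors in the edited file.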
Perform a `consul reload` so that the set parameter can take effect:
consul reload ;
consul members ;
Within the Consul agent logs, search for the keyword "reload" to confirm that the reload has actually been performed.
sudo journalctl -u consul | grep -i "reload"
It is important to perform this change on the Consul server nodes as well. A hot reload like the one above is sufficient; a Consul service restart isn't required. Perform the following on all Consul server nodes (including the leader):
Add the new limits stanza to the HCL file of the Consul agent running as a server on the Consul host:
# // contents of Consul agent conf '/etc/consul.d/consul.hcl' on Consul hosts
server = true
data_dir = "/opt/consul"
node_name = "PR-US-vault1-agent"
# ... rest of conf ...
# // add:
limits {
http_max_conns_per_client = 300
}
Perform a `consul reload` on the server node so that the set parameter can take effect:
consul reload ;
Within the Consul agent logs, search for the keyword "reload" to confirm that the reload has actually been performed.
sudo journalctl -u consul | grep -i "reload"
The above needs to be performed on all consul server nodes.
Proceed to restart Vault:
sudo systemctl start vault && sudo journalctl -u vault -f ;
Continue to monitor all telemetry and logs to verify that the newly increased limits are sufficient.
Other CLI tools such as lsof may also be used to count the connections open to Consul. Combined with the watch command, fluctuations in the number of connections can be monitored. For example:
# // consul connections on IPv4
sudo lsof -i4 | grep consul ;
# consul 15643 consul 8u IPv4 45227 0t0 TCP ...:8301 (LISTEN)
# consul 15643 consul 9u IPv4 45228 0t0 UDP ...:8301
# consul 15643 consul 10u IPv4 45229 0t0 UDP localhost:8600
# consul 15643 consul 11u IPv4 45231 0t0 TCP localhost:8600 (LISTEN)
# consul 15643 consul 12u IPv4 45233 0t0 TCP localhost:8500 (LISTEN)
# consul 15643 consul 13u IPv4 45279 0t0 TCP localhost:8500->localhost:42466 (ESTABLISHED)
# consul 15643 consul 14u IPv4 45349 0t0 TCP ....:45753->....:8300 (ESTABLISHED)
# consul 15643 consul 16u IPv4 45313 0t0 TCP ....:41505->....:8300 (ESTABLISHED)
watch 'sudo lsof -i4 | grep consul | wc -l' ;
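The count above includes listening sockets and gossip/RPC connections. To look only at established connections on the agent's HTTP port, the lsof output can be filtered further. This is a sketch under assumptions: the function name count_http_conns is hypothetical, and it assumes the default HTTP port 8500 and the lsof column layout shown above.

```shell
#!/usr/bin/env bash
# count_http_conns (hypothetical helper): count ESTABLISHED TCP connections
# whose local endpoint is the Consul HTTP port (8500 assumed).
# Expects `lsof -i4 -nP` output on stdin.
count_http_conns() {
  awk '$1 == "consul" && $NF == "(ESTABLISHED)" {
         # The NAME column looks like local:port->remote:port.
         split($(NF - 1), ends, "->")
         if (ends[1] ~ /:8500$/) n++
       }
       END { print n + 0 }'
}

# Example (-n and -P keep addresses and ports numeric):
#   sudo lsof -i4 -nP | count_http_conns
```

Comparing this number against the configured http_max_conns_per_client gives a rough indication of how much headroom remains on that host.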