A running Vault server consumes file descriptors both for accessing files on the filesystem and for the network sockets that represent connections to other hosts. A common problem administrators run into when scaling up a Vault workload is running out of available file descriptors, which causes Vault to become unresponsive and log error messages like this:
http: Accept error: accept tcp4 0.0.0.0:8200: accept4: too many open files; retrying in 1s
These issues may resolve themselves as utilization decreases, but if the underlying cause is left unaddressed it can lead to further transient problems or, depending on load, a longer-lasting service disruption or outage. File descriptor exhaustion can also break snapshots, since the snapshot files cannot be opened.
Typically, these issues are caused by an insufficient file descriptor limit configured at the operating system level, in the service definition, or both. It is also possible that something is consuming file descriptors at an excessive rate.
The following troubleshooting steps should help identify the root cause of file descriptor exhaustion:
- Find out how many file handles are in use by the vault process by taking its PID and running:
sudo lsof -p $(pidof vault) | wc -l
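To see what kind of descriptors dominate, the same lsof output can be grouped by its TYPE column (a sketch; column positions can vary slightly between lsof versions):

```shell
# Count open descriptors for the vault process, grouped by type
# (REG = regular file, IPv4/IPv6 = network socket, and so on).
# NR>1 skips the lsof header line before extracting column 5 (TYPE).
sudo lsof -p "$(pidof vault)" | awk 'NR>1 {print $5}' | sort | uniq -c | sort -rn
```

A large count of IPv4/IPv6 entries points at network connections rather than files as the consumer.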
- Check the maximum number of open files allowed for the vault process:
cat /proc/$(pidof vault)/limits | awk 'NR==1; /Max open files/'
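Putting the two checks together makes it easy to see how close the process is to its limit. A minimal sketch, assuming a Linux host where /proc is available and the fourth field of the "Max open files" line is the soft limit:

```shell
pid=$(pidof vault)
# Each open descriptor appears as an entry under /proc/<pid>/fd.
in_use=$(sudo ls "/proc/$pid/fd" | wc -l)
# The "Max open files" line lists the soft limit in field 4.
limit=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits")
echo "vault is using $in_use of $limit allowed file descriptors"
```

If in_use is consistently near limit, the process-level limit is the likely bottleneck.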
- Check the system-wide maximum open files value:
cat /proc/sys/fs/file-max
- If the process-level limits are indeed the bottleneck, increase them by editing the service file for vault, typically defined in /etc/systemd/system/vault.service. Adding (or modifying) the LimitNOFILE parameter raises the limit (capped at the system-wide file-max), and a value like 65536 brings the process-level limit up to a reasonable level. The service definition then needs to be reloaded (sudo systemctl daemon-reload) and the service restarted for the change to take effect.
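One non-destructive way to apply this is a systemd drop-in rather than editing the unit file directly. A sketch, assuming the unit is named vault.service (adjust the name if yours differs):

```shell
# Create a drop-in directory and override only the LimitNOFILE setting.
sudo mkdir -p /etc/systemd/system/vault.service.d
sudo tee /etc/systemd/system/vault.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=65536
EOF
# Reload unit definitions and restart vault to pick up the new limit.
sudo systemctl daemon-reload
sudo systemctl restart vault
```

A drop-in survives package upgrades that replace the main unit file, which is why it is often preferred over editing vault.service in place.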
- If the same problems still occur after increasing the limit, check the new number of file handles in use to confirm the changes were applied properly. If they were, and you are still hitting the new, larger cap, there may be an issue with connections from Vault to other systems (such as external databases configured as secrets engines) that needs to be addressed.
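A quick way to watch for leaking connections is to count the vault process's open sockets over time. A sketch, assuming a Linux host where socket descriptors appear as "socket:[inode]" symlinks under /proc (root access is needed to read another process's fd directory):

```shell
# Count the vault process's open sockets; run this periodically.
# A number that grows without bound suggests connections to a backend
# (database, storage, etc.) are being opened but never closed.
sudo ls -l "/proc/$(pidof vault)/fd" | grep -c 'socket:'
```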
For more details, please refer to the following performance-tuning article: https://learn.hashicorp.com/tutorials/vault/performance-tuning#max-open-files