If you are troubleshooting an issue with Vault and review its operational logs, you may see errors that contain `context deadline exceeded`. This article explains what this error means and what its possible causes are.
What does `context deadline exceeded` mean?
Most HashiCorp software, including Vault, is written in Go. Go programs commonly use contexts to attach a timeout or deadline to an operation such as a network request. The error `context deadline exceeded` means that an operation did not complete within the expected timeframe. For Vault, this is typically related to a network connection to an external system such as a database, or to a storage backend such as Consul.
2020-05-09T02:19:38.136-0500 [ERROR] expiration: failed to revoke lease: lease_id=auth/approle/login/<hashed_value> error="failed to revoke token: failed to revoke entry: failed to delete entry: context deadline exceeded"
Looking at the error above, we can see there was a problem revoking a lease. In this particular case Vault was using Consul as its storage backend, and the request to delete the entry did not complete before its deadline.
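Reading a message like this is easier once the chain of wrapped errors is split apart, since the innermost (last) element is usually the root cause. The sketch below is one illustrative way to do that in Go. `errorChain` is a hypothetical helper, and splitting on `": "` is a heuristic that assumes the individual messages do not themselves contain a colon followed by a space.

```go
package main

import (
	"fmt"
	"strings"
)

// errorChain (hypothetical helper) extracts the quoted error="..." field from
// a log line and splits it on ": " into the outermost-to-innermost messages.
// It returns nil if the line has no error="..." field.
func errorChain(line string) []string {
	const key = `error="`
	start := strings.Index(line, key)
	if start < 0 {
		return nil
	}
	rest := line[start+len(key):]
	end := strings.LastIndex(rest, `"`)
	if end < 0 {
		return nil
	}
	return strings.Split(rest[:end], ": ")
}

func main() {
	line := `2020-05-09T02:19:38.136-0500 [ERROR] expiration: failed to revoke lease: lease_id=auth/approle/login/<hashed_value> error="failed to revoke token: failed to revoke entry: failed to delete entry: context deadline exceeded"`

	// Print each layer indented one step further than the previous one.
	for i, msg := range errorChain(line) {
		fmt.Printf("%s%s\n", strings.Repeat("  ", i), msg)
	}
}
```

For the log line in this article, the last element is `context deadline exceeded` and the one before it is `failed to delete entry`, which points at a delete against the storage backend as the step that timed out.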
- Firewall Rules / Cloud Security Rules
- Resource Contention
Several issues can lead to this error message. Most often the cause is network related and affects the connection to an external database or to the Vault storage backend. This can be increased latency between Vault and the external system, or a misconfiguration of firewall rules or cloud security rules/groups. You can also see this error if there is slow I/O on the external system. During periods of high Vault usage (token generation, lease expiration, etc.), I/O load on the underlying storage system grows. Depending on the underlying infrastructure, this can limit how quickly a response comes back. Using Consul as an example, if Consul is experiencing high I/O load, it may be slower to respond to a request originating from Vault. If the response takes longer than Vault expects, you will see `context deadline exceeded` errors.
Before you can solve the problem, you need to determine what is actually failing. For example, the error above involves a lease that was generated by the AppRole auth method, so we can determine that it does not rely on a network connection to an external database such as MySQL or Oracle. That means we are looking at something internal to Vault itself. Since we know Vault is using Consul as its storage backend in this case, we can begin investigating potential communication issues between Vault and Consul.
- Investigate Network connectivity: Can Vault and the required systems communicate in general? Were any changes made recently that could have impacted this communication? Are there any firewalls in place that could be preventing communication? If in the cloud, are the instances in the same VPC/Network? Do the instances belong to the correct/expected Security Groups?
- Resource Contention: Are we asking too much of the underlying provisioned infrastructure? What is our CPU/Memory usage?
- Slow I/O: Are we using an external storage backend? Are we seeing I/O wait? Are we in the cloud? Do we have enough IOPS provisioned for our storage?