This article describes a performance issue that can occur on Consul < 1.7.6 when telemetry is sent to the Datadog agent, and its solution.
Consul versions earlier than 1.7.6 had a performance issue that could be triggered when dogstatsd was used for telemetry. The dogstatsd client was updated to fix the problem, and we upgraded the underlying armon/go-metrics library, which is used in all HashiCorp products, to use the fixed dogstatsd client in v0.3.0.
- Consul: < 1.7.6, 1.8.0, 1.8.1
- Vault: < 1.3.0
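As a quick way to check an inventory against the list above, here is a minimal sketch in Python. The helper names `parse_version` and `is_affected` are hypothetical, and the function encodes only the version ranges stated in this article; it does not handle pre-release or enterprise suffixes such as `+ent`.

```python
def parse_version(version: str) -> tuple:
    """Turn '1.7.5' or 'v1.7.5' into (1, 7, 5). Suffixes like '+ent' are not handled."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def is_affected(product: str, version: str) -> bool:
    """Return True if the version falls in the affected ranges listed in this article."""
    parsed = parse_version(version)
    name = product.lower()
    if name == "consul":
        # Consul: < 1.7.6, plus 1.8.0 and 1.8.1
        return parsed < (1, 7, 6) or parsed in {(1, 8, 0), (1, 8, 1)}
    if name == "vault":
        # Vault: < 1.3.0
        return parsed < (1, 3, 0)
    raise ValueError(f"no affected-version data for {product!r}")
```

For example, `is_affected("consul", "1.8.1")` returns `True`, while `is_affected("vault", "1.3.0")` returns `False`.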
Vault servers in one datacenter had a very high average response time (on the order of minutes) for issuing new tokens. As a result, most clients were timing out and Vault was effectively down: new workloads in the Nomad cluster could not start because their tokens could not be created.
- Highly concurrent Vault writes, from the expiry routines attempting to prune the backlog of tokens, triggered the concurrency issue in dogstatsd, which limited throughput even though the Consul servers were using few resources.
- The switch from dogstatsd to statsd (which buffers metric writes) seems to have fixed all timings and left both Vault and Consul much more stable than before this incident.
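To illustrate what buffering metric writes means in practice, here is a minimal sketch of a buffered sink: callers enqueue a metric and return immediately, and a separate flush step drains the queue, so the code path handling requests never waits on the metrics backend. This is an illustration only, not the go-metrics or dogstatsd implementation; the class name `BufferedSink`, its methods, the default buffer size, and the drop-when-full policy are all assumptions.

```python
import queue


class BufferedSink:
    """Hypothetical buffered metrics sink (illustration only)."""

    def __init__(self, max_buffered: int = 4096):
        self._queue = queue.Queue(maxsize=max_buffered)
        # Plain counter; not synchronized, so only approximate under concurrency.
        self.dropped = 0

    def emit(self, line: str) -> bool:
        """Enqueue without blocking; drop the metric if the buffer is full."""
        try:
            self._queue.put_nowait(line)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def flush(self) -> list:
        """Drain everything buffered so far and return it as one batch.

        A real sink would send this batch to the metrics backend.
        """
        batch = []
        while True:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                return batch
```

The design choice in this sketch is to lose a metric rather than stall the caller when the buffer is full, which keeps telemetry from becoming a bottleneck for the writes being measured.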
The primary issue identified was very high raft commit times in Consul that were slowing Vault write operations to a crawl.
- CPU usage, memory pressure, and disk I/O - all were low
- Checked Snapshot
- Checked telemetry metrics - We looked specifically at the `consul.raft.leader.dispatchLog` timing. This is a pretty good proxy for disk I/O latency issues since it wraps the append to the leader's logs. It was high, with a mean of around 20ms and a max of several hundred ms. For a high-performance disk like this one that is not under stress, we expected a max in the low single-digit milliseconds, or even microseconds.
- pprof profiling information from Consul
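The timing check in the list above amounts to comparing the mean and max of the Raft leader log-dispatch samples against what the disk should be able to deliver. A minimal sketch, assuming the samples have already been exported as a list of milliseconds; the function name `summarize_timings` and the 5 ms default ceiling are assumptions chosen to match "low single-digit milliseconds", not values from Consul.

```python
from statistics import mean


def summarize_timings(samples_ms, expected_max_ms=5.0):
    """Return the mean, the max, and whether the max exceeds the expected ceiling."""
    if not samples_ms:
        raise ValueError("no samples")
    worst = max(samples_ms)
    return {
        "mean_ms": mean(samples_ms),
        "max_ms": worst,
        # Hypothetical threshold: flags values above what an idle fast disk should show.
        "suspicious": worst > expected_max_ms,
    }
```

A flagged result only says the timing is higher than expected. As in this incident, the disk itself may still be healthy, so the other checks in the list remain necessary.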
After checking all of the above, we switched to statsd, which fixed all the timings and stabilized Consul and Vault.
Other HashiCorp products use this library too. Even if Consul is on a version that is not impacted, another HashiCorp product in the same environment may be on a version that is, so keep this in mind when troubleshooting.