Problem
Audit log entries are occasionally missing from the Terraform Enterprise logs. This can affect any audit log or component. The following example is for a Sentinel run.
For a Sentinel run, you would expect the following three audit log lines:
policy_check - created || queued || passed
"log\":\"2024-02-10 00:58:41 [INFO] [ce17c08e-1225-4d75-a4da-428a4d5e78d2] [dd.service=atlas dd.trace_id=3995258940181134386 dd.span_id=0 ddsource=ruby] [Audit Log] {\\\"resource\\\":\\\"policy_check\\\",\\\"action\\\":\\\"created\\\"
"log\":\"2024-02-10 00:58:52 [INFO] [Audit Log] {\\\"resource\\\":\\\"policy_check\\\",\\\"action\\\":\\\"queued\\\"
"log\":\"2024-02-10 00:58:55 [INFO] [3374520e-833d-429b-a777-25bfd06a5c19] [dd.service=atlas dd.trace_id=1910763846437290669 dd.span_id=0 ddsource=ruby] [Audit Log] {\\\"resource\\\":\\\"policy_check\\\",\\\"action\\\":\\\"passed\\\
Instead, your logs contain only the audit log entry for policy_check - queued:
"log\":\"2024-02-10 16:58:53 [INFO] [Audit Log] {\\\"resource\\\":\\\"policy_check\\\",\\\"action\\\":\\\"queued\\\",\\\"resource_id\\\":\\\"polchk-3ucnF1gM4if7nPfK\\\",
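One way to check which of the three entries are present is to count the policy_check actions in a saved copy of the logs. The following is a minimal sketch, not an official tool: the function name count_policy_check_actions and the file name tfe-logs.txt are hypothetical, and the pattern assumes the escaped JSON layout shown in the examples above.

```shell
# Hypothetical helper: count audit log entries per policy_check action.
# Usage: count_policy_check_actions FILE
count_policy_check_actions() {
  # Tolerate any number of backslashes and quotes between the JSON tokens,
  # because the escaping depth depends on how the logs were exported.
  grep -oE 'policy_check[\\"]*,[\\"]*action[\\"]*:[\\"]*[a-z_]+' "$1" \
    | sed -E 's/.*[^a-z_]([a-z_]+)$/\1/' \
    | sort | uniq -c \
    | awk '{ print $2 ": " $1 }'
}
```

For example, count_policy_check_actions tfe-logs.txt prints one line per action with its count. If created or passed is absent while queued is present, the logs match the symptom described here.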
Prerequisites
- Terraform Enterprise version less than (<) 202402-1
Cause
In the Terraform Enterprise container, the application logs are picked up by fluent-bit. When a log line is larger than 32k, fluent-bit crashes with the following error:
2024-02-15T13:09:17.764748000Z [2024/02/15 13:09:17] [error] [input:tail:tail.0] file=/var/log/terraform-enterprise/atlas.log requires a larger buffer size, lines are too long. Skipping file.
When this error happens, fluent-bit crashes and stops picking up logs for a while, so audit log entries written during that time go missing.
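To see whether a log file contains lines over the buffer limit, you can measure line lengths directly. This is a minimal sketch under stated assumptions: the function name find_long_lines is hypothetical, 32768 bytes is used as the value of "32k", and the path in the usage note is taken from the error message above.

```shell
# Hypothetical helper: list lines whose length exceeds a byte limit.
# Usage: find_long_lines FILE [LIMIT_BYTES]   (default limit: 32768)
find_long_lines() {
  file="$1"
  limit="${2:-32768}"
  # LC_ALL=C makes awk count bytes rather than multibyte characters.
  LC_ALL=C awk -v limit="$limit" \
    'length($0) > limit { print NR ": " length($0) " bytes" }' "$file"
}
```

Run inside the Terraform Enterprise container, for example as find_long_lines /var/log/terraform-enterprise/atlas.log. Pass 131072 as the second argument to check against a 128k limit instead.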
Solution 1:
Upgrade your Terraform Enterprise environment to version 202402-1, in which the buffer limit has been increased from 32k to 128k.
If the issue persists after the upgrade, check whether you have, for example, Sentinel run output that exceeds 128k.
Solution 2:
If you are still missing logs, verify that your journald process is not rate limiting. Check whether the journald logs contain a message like the following:
journalctl | grep -i suppressed
[ 3.495723] printk: systemd: 19 output lines suppressed due to ratelimiting
If so, change the limits by editing the file /etc/systemd/journald.conf (the settings belong in the [Journal] section):
RateLimitInterval=0
RateLimitBurst=0
Then restart the journald service:
sudo systemctl restart systemd-journald
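To confirm which rate-limit settings are active in the configuration file, you can list the uncommented lines. This is a minimal sketch; the function name show_journald_ratelimit is hypothetical, and it only inspects the single file it is given.

```shell
# Hypothetical helper: print uncommented rate-limit settings from a journald config.
# Usage: show_journald_ratelimit [FILE]   (default: /etc/systemd/journald.conf)
show_journald_ratelimit() {
  conf="${1:-/etc/systemd/journald.conf}"
  # Matches RateLimitInterval, RateLimitBurst, and RateLimitIntervalSec,
  # the spelling that newer systemd releases document for the interval.
  grep -E '^[[:space:]]*RateLimit(Interval(Sec)?|Burst)[[:space:]]*=' "$conf"
}
```

If the function prints nothing, the file contains no uncommented rate-limit settings.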
Additional Information
- KB article related to solution 2 can be found here