Problem
In some instances of Terraform Enterprise, Sentinel policy or Cost Estimation checks may consistently fail for all runs. This issue can also prevent the generation and download of Sentinel Mocks in workspaces.
Cause
This issue can be caused by a race condition that occurs when the Terraform Enterprise application restarts. The container responsible for scheduling Nomad jobs may start accepting requests before it is fully initialized, causing jobs to fail. This problem primarily affects versions of Terraform Enterprise prior to v202212-1.
You can confirm this issue by observing the following symptoms.
Symptom 1: Error Logs
Review the logs for the relevant container based on your Terraform Enterprise version.
- TFE v202212-1 or later: tfe-task-worker
- TFE v202205-1 through v202211-1: tfe-nomad
- TFE prior to v202205-1: ptfe_nomad
Check the logs for that container. The command varies with your TFE version.
## For TFE v202212-1 or later
$ docker logs tfe-task-worker
## For TFE v202205-1 through v202211-1
$ docker logs tfe-nomad
## For TFE prior to v202205-1
$ docker logs ptfe_nomad
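The version-to-container mapping can be sketched as a small shell helper. This is only an illustration: the function name `tfe_log_container` is hypothetical, and it assumes the version string has the `vYYYYMM-N` form used in this article.

```shell
# Hypothetical helper (not part of Terraform Enterprise): prints the name of
# the container to inspect for a given TFE version such as "v202211-1".
tfe_log_container() {
  local version="$1" yyyymm
  # Strip the leading "v" and the "-N" suffix: v202211-1 -> 202211
  yyyymm="${version#v}"
  yyyymm="${yyyymm%%-*}"
  case "$yyyymm" in
    [0-9][0-9][0-9][0-9][0-9][0-9]) ;;
    *) echo "unrecognized version: $version" >&2; return 1 ;;
  esac
  if [ "$yyyymm" -ge 202212 ]; then
    echo tfe-task-worker
  elif [ "$yyyymm" -ge 202205 ]; then
    echo tfe-nomad
  else
    echo ptfe_nomad
  fi
}

# Usage:
#   docker logs "$(tfe_log_container v202211-1)"
```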
If the race condition occurred, you may see an error for Cost Estimation.
2020/08/13 06:45:37.600969 [ERR] http: Request /v1/job/cost-estimation-worker/dispatch, error: parameterized job not found
Alternatively, you may see an error for Sentinel.
2020/06/05 03:10:47.760165 [ERR] http: Request /v1/job/sentinel-worker/dispatch, error: parameterized job not found
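To avoid scanning the logs by eye, a filter along these lines may help. It is a minimal sketch: the function name `failed_dispatch_jobs` is hypothetical, and it assumes the error lines have the same shape as the two examples above.

```shell
# Hypothetical helper: reads log text on stdin and prints the name of each
# worker job whose dispatch failed with "parameterized job not found".
failed_dispatch_jobs() {
  grep 'error: parameterized job not found' \
    | sed -n 's|.*Request /v1/job/\([^/]*\)/dispatch.*|\1|p' \
    | sort -u
}

# Usage (substitute the container name for your TFE version; 2>&1 merges the
# container's stderr stream into the pipe):
#   docker logs tfe-nomad 2>&1 | failed_dispatch_jobs
```

Empty output means no matching dispatch errors were found in the logs that were read.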
Additionally, check the Sidekiq container logs.
## For TFE v202205-1 or later
$ docker logs tfe-sidekiq
## For TFE prior to v202205-1
$ docker logs ptfe_sidekiq
You may see the following error.
2020-08-11 13:08:39 [ERROR] {:msg=>"Failed to enqueue cost estimate", :run_id=>622, :cost_estimate_id=>590, :exception=>#<RestClient::InternalServerError: 500 Internal Server Error>}
Symptom 2: Missing Nomad Jobs
Check the status of the Nomad jobs. The output should list three scheduled jobs: cost-estimation-worker, plan-export-worker, and sentinel-worker. If any of them are missing, this confirms the issue.
## For TFE v202205-1 through v202211-1
$ docker exec -it tfe-nomad nomad job status
## For TFE prior to v202205-1
$ docker exec -it ptfe_nomad nomad job status
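The check for the three expected jobs can also be scripted. The sketch below is illustrative only: `missing_worker_jobs` is a hypothetical name, and it assumes that `nomad job status` prints one row per job with the job ID in the first column.

```shell
# Hypothetical helper: reads `nomad job status` output on stdin and prints
# each expected worker job that does not appear in it.
missing_worker_jobs() {
  local status_output job
  status_output="$(cat)"
  for job in cost-estimation-worker plan-export-worker sentinel-worker; do
    # Exact match on the first whitespace-delimited column, so dispatched
    # child jobs with longer IDs do not count as the parent job.
    if ! printf '%s\n' "$status_output" \
        | awk -v j="$job" '$1 == j { found = 1 } END { exit !found }'; then
      printf '%s\n' "$job"
    fi
  done
}

# Usage (-it is omitted here so the output can be piped without TTY
# line endings):
#   docker exec tfe-nomad nomad job status | missing_worker_jobs
```

Empty output suggests all three jobs are scheduled; any printed name is a job that needs rescheduling.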
Solution
This solution involves manually rescheduling the Nomad jobs. This procedure is only applicable to Terraform Enterprise versions v202211-1 and earlier.
Step 1: Manually Reschedule Nomad Jobs
Execute the following command on the Terraform Enterprise instance, using the correct container name for your version.
For Terraform Enterprise v202205-1 through v202211-1, use tfe-nomad.
$ docker exec -it tfe-nomad /bin/bash -c 'for i in ${WORKERDIR}/*.job; do nomad run "${i}"; done'
For Terraform Enterprise versions prior to v202205-1, use ptfe_nomad.
$ docker exec -it ptfe_nomad /bin/bash -c 'for i in ${WORKERDIR}/*.job; do nomad run "${i}"; done'
This command reschedules the jobs required for Sentinel and Cost Estimation checks.
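For readers who want to see what the one-liner does, the loop can be written out with basic error reporting. This is a sketch, not the supported procedure: `reschedule_worker_jobs` is a hypothetical name, and it is meant to run inside the Nomad container, where `nomad` and `WORKERDIR` are available.

```shell
# Hypothetical expansion of the one-liner above: runs `nomad run` on every
# .job file in the given directory and reports any file that fails.
reschedule_worker_jobs() {
  local dir="$1" job_file failed=0
  for job_file in "$dir"/*.job; do
    # With no matches the glob stays literal, so check that the file exists.
    [ -e "$job_file" ] || { echo "no .job files found in $dir" >&2; return 1; }
    if ! nomad run "$job_file"; then
      echo "failed to schedule: $job_file" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Usage, from a shell inside the container:
#   reschedule_worker_jobs "$WORKERDIR"
```

Unlike the one-liner, this version returns a non-zero status when a job file fails to schedule or when no job files are found, which makes it easier to notice a partial failure.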
Step 2: Verify the Fix
To verify that the jobs are now scheduled correctly, run the status command again.
For Terraform Enterprise v202205-1 through v202211-1:
$ docker exec -it tfe-nomad nomad job status
For Terraform Enterprise versions prior to v202205-1:
$ docker exec -it ptfe_nomad nomad job status
The output should now show the cost-estimation-worker, plan-export-worker, and sentinel-worker jobs.
Additional Information
This race condition is a known bug affecting older versions of Terraform Enterprise, such as v202008-1. To permanently prevent this issue from recurring, you should upgrade your instance to a more recent version where the bug has been addressed.
For more details on managing and troubleshooting your instance, please refer to the official Terraform Enterprise documentation.