Problem
Terraform Enterprise initially connects to a VCS repository to pull in a module and publish it for use. Throughout the module's lifecycle, changes and updates are committed to it. Normally Terraform Enterprise ingresses and publishes these changes without issue. Occasionally, however, module updates fail to ingress and are not published.
Cause
This issue may stem from several causes. It is best to begin by looking at the ptfe_sidekiq, ptfe_registry_worker, or ptfe_atlas logs to identify what is happening in the environment.
To check the logs for any of these containers, run the following command with the intended <container_name>.
NOTE: For Terraform Enterprise v202205-1 or later, the container names have changed: the leading "p" has been removed, so they are tfe_sidekiq, tfe_registry_worker, and tfe_atlas.
$ docker logs <container_name>
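Because the container prefix depends on the release, a small helper can pick whichever name is present. This is a minimal sketch, assuming the ptfe_/tfe_ prefix is the only difference between releases; tfe_container is a hypothetical helper name, not part of Terraform Enterprise.

```shell
# Hypothetical helper: read container names on stdin and print the first one
# that is exactly ptfe_<suffix> or tfe_<suffix>.
tfe_container() {
  grep -E -x "p?tfe_$1" | head -n 1
}

# Usage on the Terraform Enterprise host (requires Docker):
#   docker logs "$(docker ps --format '{{.Names}}' | tfe_container sidekiq)"
```

Reading the names from stdin keeps the helper independent of Docker itself, so it can be tried against a plain list of names first.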
Some common issues seen in the logs are:
- Expired VCS/OAuth Token.
This can be found by checking the logs of the ptfe_sidekiq container, or tfe_sidekiq for Terraform Enterprise v202205-1 or later:
$ docker logs ptfe_sidekiq
$ docker logs tfe_sidekiq (TFE v202205-1 or later)
The logs can be very active depending on load, so useful terms to search for are error or the HTTP response code 400.
[ERROR] [e9253f0c-8bb4-4dde-b293-6459b0b024b9] {:resource=>"Vcs::AdoServices::OAuthTokenGetter", :method=>"refresh_token", :token_external_id=>"ot-TToDhdksACKN5FG2", :msg=>"Post completed with an error", :exception=>"400 Bad Request", :attempt_number=>1}
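The search described above can be sketched as a small filter. This is an illustration only: filter_token_errors is a hypothetical name, and the pattern simply combines the two search terms mentioned above.

```shell
# Keep only lines that mention an error or an HTTP 400 response.
# Reads log text on stdin, so it works with any log source.
filter_token_errors() {
  grep -E -i 'error|400 Bad Request'
}

# Usage on the Terraform Enterprise host. docker logs replays the container's
# stderr on stderr, so 2>&1 merges both streams before filtering:
#   docker logs tfe_sidekiq 2>&1 | filter_token_errors
```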
- Errors Within The Module Configuration Files.
These errors will show up in the ptfe_registry_worker container logs, or tfe_registry_worker for Terraform Enterprise v202205-1 or later, and can be seen by running the following command:
$ docker logs ptfe_registry_worker
Depending on how active the logs are, it can be useful to search for error or .tf.
2020-10-09T20:36:41.760311392Z 2020/10/09 20:36:41 [ERROR] push_ingress_version[76d8279467a67b84508db2c5] error loading the module: Argument or block definition required: An argument or block definition is required here. (in main.tf on line 54)
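When many such errors appear, the file and line number can be pulled out of each message. The following is a sketch that assumes every message ends with the same "(in <file>.tf on line <n>)" suffix as the example above; module_error_locations is a hypothetical helper name.

```shell
# Print "<file>:<line>" for every log line that carries the
# "(in <file>.tf on line <n>)" suffix; other lines are dropped.
module_error_locations() {
  sed -n 's/.*(in \(.*\.tf\) on line \([0-9][0-9]*\)).*/\1:\2/p'
}

# Usage on the Terraform Enterprise host:
#   docker logs ptfe_registry_worker 2>&1 | module_error_locations
```

For the example message above, this prints main.tf:54.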
- Network Errors.
A network-related issue will present itself in the ptfe_atlas container logs, or tfe_atlas for Terraform Enterprise v202205-1 or later, which can be viewed by running the following:
$ docker logs ptfe_atlas
It is important to filter this log, as this is the main container for the Terraform Enterprise application and its logs can be very active. Useful things to search for are HTTP error responses such as 400, 404, or 500.
2020-09-10T16:57:36.233865500Z 172.17.0.17 - - [10/Sep/2020:16:57:36 +0000] "GET /v1/modules/nmhc?offset=0 HTTP/1.1" 404 49 "" "rest-client/2.0.2 (linux-musl x86_64) ruby/2.4.5p335"
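Searching for a bare number such as 404 also matches timestamps and byte counts, so it can help to match the status code only where it follows the quoted request. This is a sketch that assumes the access-log layout shown in the example above; filter_http_errors is a hypothetical helper name.

```shell
# Keep only access-log lines whose status code (the field right after the
# quoted request) is in the 4xx or 5xx range.
filter_http_errors() {
  grep -E 'HTTP/[0-9.]+" [45][0-9][0-9] '
}

# Usage on the Terraform Enterprise host:
#   docker logs ptfe_atlas 2>&1 | filter_http_errors
```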
Another specific network connectivity issue that can sometimes show up is Terraform Enterprise losing its connection to the PostgreSQL database. Since the ptfe_atlas or tfe_atlas logs are usually very active, they can be searched for PG::ConnectionBad or .rb to help locate the error messages.
May 27 07:57:07 tfe-production-000 dockerd-current[9181]: 2020-05-27T05:57:07.356Z 1 TID-gt8c5mwc5 WARN: PG::ConnectionBad: could not connect to server: Connection refused
May 27 07:57:07 tfe-production-000 dockerd-current[9181]: Is the server running on host "postgres" (172.17.0.1) and accepting
May 27 07:57:07 tfe-production-000 dockerd-current[9181]: TCP/IP connections on port 5432?
May 27 07:57:07 tfe-production-000 dockerd-current[9181]:
May 27 07:57:07 tfe-production-000 dockerd-current[9181]: 2020-05-27T05:57:07.356Z 1 TID-gt8c5mwc5 WARN: /app/vendor/bundle/ruby/2.4.0/gems/activerecord-4.2.11.1/lib/active_record/connection_adapters/postgresql_a
May 27 07:57:07 tfe-production-000 dockerd-current[9181]: /app/vendor/bundle/ruby/2.4.0/gems/activerecord-4.2.11.1/lib/active_record/connection_adapters/postgresql_adapter.rb:651:in `new'
May 27 07:57:07 tfe-production-000 dockerd-current[9181]: /app/vendor/bundle/ruby/2.4.0/gems/activerecord-4.2.11.1/lib/active_record/connection_adapters/postgresql_adapter.rb:651:in `connect'
- Module In Stuck State
Sometimes a published module will enter a stuck state, preventing any changes from being ingressed. This generally isn’t indicated in any of the Terraform Enterprise application logs and is only identifiable by taking the steps to resolve it.
Solution
- Expired VCS/OAuth Token
If the issue is caused by an expired VCS/OAuth token that will not refresh, it has to be corrected manually. The token can be refreshed by navigating to the workspace’s VCS connection in question and selecting Revoke Connection. Once the connection is revoked, the VCS can be reconnected by selecting Connect Organization <organization name>.
Once the VCS is reconnected, all VCS settings will need to be reconfigured on any workspace or module that was previously connected to it.
- Errors Within The Module Configuration Files.
Issues within the module’s Terraform configuration files will need to be resolved before any new updates can properly ingress. The error message indicates the file name and line number that contain the error, as seen in the example error message: (in main.tf on line 54).
Once the errors have been resolved in the Terraform configuration files and the fixes pushed to the VCS repository the module is connected to, the changes should ingress successfully.
- Network Errors.
Network errors most commonly occur when the network that a Terraform Enterprise instance is connected to does not meet the minimum requirements in the Terraform Enterprise Network Requirements documentation. These minimum requirements must be met along the entire network path, including any firewalls, security groups, load balancers, proxies, or other network devices.
The case in which Terraform Enterprise cannot connect to its PostgreSQL instance is more commonly seen when deployed in External Services mode. It is important to ensure that the network path between Terraform Enterprise and the PostgreSQL instance also meets the minimum requirements in the Terraform Enterprise Network Requirements documentation. Provided the network requirements are met, further troubleshooting will be required to ensure that the external PostgreSQL instance is online and accepting connections.
- Module In Stuck State
When the module is in a stuck state the only real way to correct this is to delete the module via the Terraform Enterprise API, then re-publish the module.
For this a user token with Owners permissions will be needed. This token will be used to hit the Delete Module API endpoint.
A few variables are used in this example. Replace them in the command or set them as environment variables before making the API call.
- $TOKEN is a user token with Owners team permissions
- $TFE_URL is the hostname of the Terraform Enterprise instance, without the https:// prefix
- $ORG_NAME is the name of the organization where the module is published
- $MOD_NAME is the name of the module that will be deleted
Any of these variables can be set in a Unix-based environment using the export command. For example:
$ export TFE_URL='app.terraform.io'
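On Windows, the set command can be used to set any of these environment variables. Note that the Windows Command Prompt expands variables written as %TFE_URL% rather than $TFE_URL, so the references in the API call must be adjusted accordingly. For example: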
$ set TFE_URL=app.terraform.io
The following API call can then be made to the Delete Module API endpoint.
curl -k --header "Authorization: Bearer $TOKEN" --header "Content-Type: application/vnd.api+json" --request POST https://$TFE_URL/api/v2/registry-modules/actions/delete/$ORG_NAME/$MOD_NAME
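To confirm whether the delete request succeeded, curl can be asked to print only the HTTP status code. This is a sketch under stated assumptions: delete_module_url is a hypothetical helper that assembles the same URL as the command above, and the -k flag is left out here on the assumption that the instance presents a trusted certificate.

```shell
# Hypothetical helper: build the Delete Module URL used in the curl command above.
# Arguments: <tfe_hostname> <org_name> <module_name>
delete_module_url() {
  printf 'https://%s/api/v2/registry-modules/actions/delete/%s/%s\n' "$1" "$2" "$3"
}

# Usage: discard the response body and print only the HTTP status code.
#   curl --silent --output /dev/null --write-out '%{http_code}\n' \
#     --header "Authorization: Bearer $TOKEN" \
#     --header "Content-Type: application/vnd.api+json" \
#     --request POST "$(delete_module_url "$TFE_URL" "$ORG_NAME" "$MOD_NAME")"
```

A status outside the 2xx range suggests the module was not deleted, and the token permissions and variable values should be rechecked.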
Once the module has been successfully deleted via the API, it can be republished in Terraform Enterprise per the documentation: https://www.terraform.io/docs/cloud/registry/publish.html#publishing-a-new-module