Table of Contents
- Terraform Enterprise - Basic Troubleshooting Guide
- Overview
- Cloud Infrastructure
- Linux Instance
- Docker
- Replicated
- Roles of the Replicated containers
- Terraform Enterprise
- Roles of the Terraform Enterprise containers
- Install and Configure Terraform Enterprise
- Verification
- Terraform Enterprise Application Troubleshooting
- Managing Replicated
- General Terraform Enterprise (TFE) Information
- Active/Active Terraform Enterprise (TFE) Information
- Other Terraform Enterprise (TFE) Monitoring Information
- Active/Active Admin Commands
- Contacting HashiCorp Support
Terraform Enterprise - Basic Troubleshooting Guide
______________________________________________________________________________________
Overview
Terraform Enterprise operates in layers. It is important to understand the architecture of the application in order to properly troubleshoot issues.
First, cloud infrastructure from a provider such as Azure, GCP, or AWS is used to create a virtual network and cloud resources, which host a virtual Linux instance. VMware is also commonly used as an on-premises infrastructure solution. The Replicated installation script is executed on the Linux instance, typically via cloud-init; if online mode is chosen it downloads and installs Docker along with the Replicated and Terraform Enterprise Docker images. Once all of the Replicated containers have started successfully, the Terraform Enterprise containers can start and the application can launch.
______________________________________________________________________________________
Cloud Infrastructure
At the base of the application stack is cloud infrastructure (layer 1 of the application stack). Cloud infrastructure refers to the hardware and software components, such as servers, storage, networking, virtualization software, services and management tools, that support the computing requirements of a cloud computing model.
Proper cloud networking is required to enable Terraform Enterprise to communicate both internally at the Docker container level and externally due to certain API requests being made from the Docker containers back out to the fully qualified domain name.
Proxies, load balancers, application firewalls and security groups can impact application communication to itself and external services. Ensure that networking requirements have been met.
The memory, CPU, and IOPS configured for the Linux system can also affect the performance of Terraform Enterprise. Ensure that the cloud infrastructure is created according to HashiCorp's reference architecture.
______________________________________________________________________________________
Linux Instance
Terraform Enterprise runs on a Linux-based operating system (layer 2 of the application stack). SELinux and iptables are part of the built-in Linux security framework: iptables provides firewall restrictions on the Linux operating system, and SELinux provides file system security.
Both SELinux and Iptables can affect how Terraform Enterprise communicates to itself and external services.
Ensure that your Linux instance meets these specific requirements:
· https://www.terraform.io/docs/enterprise/before-installing/index.html#selinux
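A quick way to review the current SELinux mode and firewall rules on the instance (a minimal sketch; sestatus may require the policycoreutils package, and firewalld-based distributions differ):
# Check the current SELinux mode (Enforcing, Permissive, or Disabled)
$ getenforce
$ sestatus
# List the active iptables rules that could block container or application traffic
$ sudo iptables -L -n -v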
______________________________________________________________________________________
Docker
Docker utilizes OS-level virtualization to deliver software in packages called containers (layer 3 of the application stack). Containers are isolated from one another and bundle their own software, libraries and configuration files. They can communicate with each other through internal Docker networks.
HashiCorp utilizes Docker containers to facilitate the runtime of various services used by Terraform Enterprise. These Docker containers are managed by Replicated.
Docker requirements vary based on the Linux operating system chosen. See the Docker compatibility guide for specific requirements.
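For example, the installed Docker version can be confirmed on the instance before checking it against the compatibility guide:
# Show the Docker server version installed on the instance
$ sudo docker version --format '{{.Server.Version}}'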
______________________________________________________________________________________
Replicated
Terraform Enterprise uses a service named Replicated to manage the installation, configuration, and management of the various Docker containers that make up the Terraform Enterprise application (layer 4 of the application stack). Terraform Enterprise uses Replicated’s native scheduler to manage the lifecycle of the Docker containers.
Replicated also provides a CLI tool called replicatedctl that can be used to interact with the Replicated service, and by proxy, the Terraform Enterprise application itself.
Roles of the Replicated containers
replicated
- The daemon that runs Replicated services and starts the application. It communicates with the external Replicated API and registry unless running in airgap mode. This is the only component that communicates externally.
replicated-ui
- Provides the Replicated console which listens on host port 8800. It communicates internally with the Replicated daemon and with the premkit service.
replicated-operator
- A utility image to transfer files between the host and daemon and to run application containers if using the native scheduler. It communicates internally with the Replicated daemon on port 9879.
replicated-premkit
- This serves as a reverse proxy to the audit log, metrics, and integration services. It communicates internally with the daemon, audit log, and metrics services.
replicated-statsd
- This image is used for a metrics service that runs when the application is running.
support-bundle
- This image is run to collect system information when the customer creates a support bundle.
cmd
- This image may be used for custom commands if configured in the application yaml. It may communicate internally or externally if configured to do so by the vendor’s application.
retraced
- Retraced provides an API and worker for the audit log component and communicates internally with the audit log’s Postgres and NSQ services. The following are the API and worker containers:
- retraced-processor
- retraced-api
- retraced-cron
retraced-postgres
- This is the database for the audit log.
retraced-nsq
- This is the audit log’s queue.
______________________________________________________________________________________
Terraform Enterprise
Terraform Enterprise is a fully featured application that allows teams to run Terraform CLI within a data center or cloud provider (layer 5). Terraform Enterprise can be thought of as the top layer of the overall application stack. If the previous layers are not configured according to HashiCorp's recommendations then this can cause an issue with communication between the containers and/or the external services that Terraform Enterprise interacts with.
In addition to the Replicated Docker containers, there are long-running containers that comprise the Terraform Enterprise application. To troubleshoot properly, it is important to understand the roles of each container as this will guide you on how to troubleshoot the application.
Roles of the Terraform Enterprise containers
Note: For installations using Terraform Enterprise v202205-01 or later, all container names now follow the naming convention of "tfe-<service>"
Example:
ptfe_atlas > tfe-atlas
ptfe_archivist > tfe-archivist
This article will be updated to remove references to the "ptfe" prefix at a later date.
More information can be found in the release notes here.
ptfe_nginx
- Nginx reverse proxy, facilitates access to the Terraform Enterprise services
ptfe_atlas
- The API and Web UI. Terraform Enterprise used to be known as Atlas
ptfe_build_manager
- Manages the queue of Terraform runs
ptfe_build_worker
- Creates workers on-demand as required by the queue. Injects variables, secrets, and Terraform configuration to a temporary container, ptfe_worker
ptfe_worker
- Executes a Terraform plan or apply. This container can be replaced with a custom image. This ephemeral container may be created with a randomly generated name by Docker
ptfe_vault
- HashiCorp Vault, utilizes transit encryption for items such as sensitive workspace variables
ptfe_registry_api
- Terraform Private Module Registry API
ptfe_slug_ingress
(or ptfe_ingress in older versions of Terraform Enterprise) - Listens for VCS webhooks. Packages VCS repo data as a slug and sends it to ptfe_archivist
ptfe_registry_worker
- Processes VCS slugs, prepares module to be published on the Terraform private Module Registry
ptfe_sidekiq
- Background job scheduler system
ptfe_redis
- Redis in-memory database, used for caching and the ptfe_sidekiq
queue. This container will not be active on Active/Active Terraform Enterprise installations.
ptfe_nomad
- HashiCorp Nomad, schedules Sentinel and Cost Estimation runs
ptfe_archivist
- Object storage API
ptfe_migrations
- Runs on startup only, runs database migrations from ptfe_atlas
ptfe_postgres
- PostgreSQL database, holds relational data such as workspace applies and references to where their state is stored in object storage
ptfe_state_parser
- Reads Terraform state files and parses important information out of them
rabbitmq
- RabbitMQ message queue
ptfe_backup_restore
- The Terraform Enterprise Backup and Restore API
ptfe_outbound_http_proxy
- Security control used to filter user-controlled network traffic (e.g., sentinel imports) and prevent them from accessing internal services directly
ptfe_health_check
- Runs a periodic health check against Terraform Enterprise
ptfe_base_startup
- Runs on install only. Initializes Terraform Enterprise for installation
ptfe_registry_migrations
- Runs on startup only, runs database migrations from ptfe_registry_api
telegraf
- Data collection agent for collecting and reporting metrics. This container runs when enable_metrics_collection
is enabled in the application configuration
influxdb
- Time-series database for storing metrics data from telegraf. This container runs when enable_metrics_collection
is enabled in the application configuration
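If metrics collection needs to be enabled, the setting can be changed through the application configuration (a hedged example; confirm the expected value format for your installation before applying it):
# Enable metrics collection in the application configuration
$ replicatedctl app-config set enable_metrics_collection --value 1
# Restart the application so the change takes effect
$ replicatedctl app stop
$ replicatedctl app start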
______________________________________________________________________________________
Install and Configure Terraform Enterprise
Verify Replicated Config prior to installation
The replicated config MUST be present at /etc/replicated.conf
and contains information for replicated to start, but also imports the TFE settings from another file.
View the file:
$ cat /etc/replicated.conf
{
"DaemonAuthenticationType": "password",
"DaemonAuthenticationPassword": "xxxxxxxxxxxxx",
"TlsBootstrapHostname": "tfe.company.com",
"TlsBootstrapType": "self-signed",
"TlsBootstrapCert": "",
"TlsBootstrapKey": "",
"BypassPreflightChecks": true,
"ImportSettingsFrom": "/etc/replicated-tfe.json",
"LicenseBootstrapAirgapPackagePath": "/etc/tfe/latest.airgap",
"LicenseFileLocation": "/etc/replicated.rli"
}
Verify:
- "ImportSettingsFrom" is set to a valid file, more information on the contents in the next section,
cat <path to tfe configs>
- "DaemonAuthenticationPassword" has a password set.
- "LicenseFileLocation" is set to the path of a valid file,
cat <path to license file>
- This file is technically a binary, but can be visually inspected.
- Should start with the string "license.json"
- Should also contain the strings ["key.signature", "BEGIN RSA PUBLIC KEY", "END RSA PUBLIC KEY"]
- "TlsBootstrapHostname" is accurately set to the FQDN the will be used to access TFE.
- If "TlsBootstrapType" == "self-signed", "TlsBootstrapCert" and "TlsBootstrapKey" should be empty.
- If "TlsBootstrapType" == "server-path", "TlsBootstrapCert" and "TlsBootstrapKey" should not be empty.
- Check that the cert and key paths are valid,
cat <path to cert or key>
- Verify that the cert is PEM encoded, does not contain embedded "\n" strings, and includes the entire CA chain.
-----BEGIN CERTIFICATE-----
...
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
...
-----END CERTIFICATE-----
- Verify that the key is an RSA private key.
-----BEGIN RSA PRIVATE KEY-----
...
-----END RSA PRIVATE KEY-----
- If "LicenseBootstrapAirgapPackagePath" is set, verify the file is present and is at least 900MB+,
ls -lh <path to airgap file>
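These checks can be spot-verified from the shell. A minimal sketch, using the paths from the sample /etc/replicated.conf above (the cert and key paths are placeholders; strings and openssl are used here only for illustration):
# Confirm the imported TFE settings file and license file exist
$ cat /etc/replicated-tfe.json
$ ls -lh /etc/replicated.rli
# Inspect the license file for the expected markers
$ strings /etc/replicated.rli | grep -E 'license.json|key.signature|RSA PUBLIC KEY'
# If TlsBootstrapType is "server-path", validate the certificate and key
$ openssl x509 -in <path to cert> -noout -subject -enddate
$ openssl rsa -in <path to key> -check -noout
# If an airgap package is configured, confirm it is present and of a plausible size
$ ls -lh /etc/tfe/latest.airgap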
______________________________________________________________________________________
Installation Instructions
______________________________________________________________________________________
Verification
To verify that the installation was successful, review the cloud-final logs on the Linux instance.
Verify Cloud Init
This assumes cloud-init is being used from the user_data
argument.
SSH into the instance running TFE to perform the following checks:
$ journalctl -xu cloud-final -o cat
Verify that ./install.sh
ran and exited properly.
Here is an example:
root : TTY=unknown ; PWD=/etc/tfe ; USER=root ; COMMAND=./install.sh ... private-address=10.0.5.59 public-address=10.0.5.59
Determining local address
The installer will use local address '10.0.5.59' (from parameter)
Running preflight checks...
[INFO] / disk usage is at 15%
[INFO] /var/lib/docker disk usage is at 15%
[INFO] Docker http proxy not set
...
To continue the installation, visit the following URL in your browser:
http://10.0.5.59:8800
Verify Docker is Running
If cloud-init
shows a successful installation then verify that Docker and the Replicated services have started. By running docker info
the output will confirm whether or not this layer of the application has started. If docker info
returns information then it is safe to assume that this layer has loaded successfully. Continue to the next layer to verify that the Replicated services are up.
Run docker info command:
$ sudo docker info
Good
Containers: 36
Running: 28
Paused: 0
Stopped: 8
Images: 37
Server Version: 1.13.1
...
Bad
Error: error response from daemon get https://index.docker.io connection refused
Resolution
Verify docker service is running, or if there are any errors:
$ systemctl status docker
Resolve any networking or routing issues.
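For online installations, registry reachability and DNS can be tested directly from the instance (a minimal sketch; any HTTP response, even 401, confirms connectivity, while a timeout or refusal indicates a network issue):
# Test reachability of the Docker registry used by the installer
$ curl -v https://index.docker.io/v2/
# Confirm DNS resolution for the registry hostname
$ nslookup index.docker.io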
Verify Replicated is Running
Run the system status command:
$ replicatedctl system status
Good
{
"Replicated": "ready",
"Retraced": "ready"
}
Bad
{
"Replicated": "ready",
"Retraced": "initializing"
}
Resolution
Verify Replicated services are running, and check logs:
# Verify Replicated services are running
$ systemctl status replicated replicated-operator replicated-ui
● replicated.service - Replicated Service
Loaded: loaded (/etc/systemd/system/replicated.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2021-10-06 15:39:41 UTC; 4 weeks 1 days ago
Main PID: 14949 (docker)
Tasks: 13 (limit: 4915)
CGroup: /system.slice/replicated.service
└─14949 /usr/bin/docker run --name=replicated -p 9874-9879:9874-9879/tcp -u 1001:999
● replicated-operator.service - Replicated Operator Service
Loaded: loaded (/etc/systemd/system/replicated-operator.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2021-10-06 15:39:51 UTC; 4 weeks 1 days ago
Main PID: 15570 (docker)
Tasks: 11 (limit: 4915)
CGroup: /system.slice/replicated-operator.service
└─15570 /usr/bin/docker run --name=replicated-operator -u 1001:999
● replicated-ui.service - Replicated Service
Loaded: loaded (/etc/systemd/system/replicated-ui.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2021-10-06 15:39:41 UTC; 4 weeks 1 days ago
Main PID: 15013 (docker)
Tasks: 11 (limit: 4915)
CGroup: /system.slice/replicated-ui.service
└─15013 /usr/bin/docker run --name=replicated-ui -p 8800:8800/tcp -u 1001:999
View Replicated Logs
$ journalctl -xu replicated -o cat
View Replicated Docker Logs
# View & follow replicated container logs
$ sudo docker logs replicated -f
# View & follow replicated-operator container logs
$ sudo docker logs replicated-operator
Run Preflight Checks
Run replicatedctl preflight
checks to ensure that the minimum requirements have been met:
# Preflight checks
$ replicatedctl preflight run
✓ OS linux is supported
- The operating system must be linux
✓ Kernel version requirement met
- Kernel version must be at least 3.10
✓ Total space requirement met for directory /tmp
- Directory must have at least 1GB total space
✓ Total space requirement met for directory /var/lib/replicated
- Directory must have at least 250MB total space
✓ Successful TLS connection
- Can connect to TLS 10.0.172.53 address
✓ Successful HTTP request
- Can access api.replicated.com
✓ Docker server version requirement met
- Docker server version must be at least 1.7.1
✓ Memory requirement met
- Server must have at least 4GB total memory
✓ Total space requirement met for directory /var/lib/docker
- Directory must have at least 40GB total space
✓ CPU cores requirement met
- Server must have at least 2 CPU cores
✓ Total space requirement met for directory /
- Directory must have at least 10GB total space
✓ Successful connection to https://releases.hashicorp.com.
- Can connect to https://releases.hashicorp.com.
✓ Successful Docker registry ping
- Can access registry index.docker.io
✓ Successful Docker registry ping
- Can access registry registry.replicated.com
NODE: 916916fc055445185808a3101b79d795
✓ CPU cores requirement met
- Server must have at least 2 CPU cores
✓ Memory requirement met
- Server must have at least 4GB total memory
✓ OS linux is supported
- The operating system must be linux
✓ Kernel version requirement met
- Kernel version must be at least 3.10
✓ Total space requirement met for directory /
- Directory must have at least 10GB total space
✓ Total space requirement met for directory /var/lib/docker
- Directory must have at least 40GB total space
✓ Docker server version requirement met
- Docker server version must be at least 1.7.1
✓ Successful TLS connection
- Can connect to TLS 10.0.172.53 address
All preflight checks passed!
Verify Terraform Enterprise Containers
Terraform Enterprise runs as a series of Docker containers which are managed by Replicated. Sometimes, it’s necessary to use the docker command to view the logs of a container or to execute a command within a container. Those actions and more are detailed below.
The docker ps
command is used to list all currently running containers. A healthy, idle Terraform Enterprise installation should have around 25-30 containers running at a given time, about 10 of which should be Replicated containers.
Note that the last container to be started is ptfe_atlas
. If this container has started, it is indicative of the application being up.
# Containers
$ sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e9ef9136b222 172.31.83.56:9874/hashicorp-terraform-build-worker:651-723542f "start-tbw /terrafor…" 20 minutes ago Up 20 minutes affectionate_spence
e670cb2451bb 172.31.83.56:9874/hashicorp-archivist:608-11824ea "/usr/bin/wait-for-t…" 20 minutes ago Up 20 minutes 0.0.0.0:7675->7675/tcp ptfe_archivist
296029f74e33 172.31.83.56:9874/hashicorp-tf-registry:1209-cef77f6 "setup-ssl /usr/bin/…" 20 minutes ago Up 20 minutes ptfe_registry_worker
34cc9cc8816b 172.31.83.56:9874/hashicorp-tf-registry:1209-cef77f6 "setup-ssl /usr/bin/…" 20 minutes ago Up 20 minutes ptfe_registry_api
fbf0a59f5e0f 172.31.83.56:9874/hashicorp-atlas:CIRC-52813-8885d12 "/usr/bin/init.sh /a…" 20 minutes ago Up 20 minutes ptfe_sidekiq
084de8a5812c 172.31.83.56:9874/hashicorp-atlas:CIRC-52813-8885d12 "/usr/bin/init.sh bu…" 20 minutes ago Up 20 minutes 0.0.0.0:9292->9292/tcp ptfe_atlas
fe85af2e0ec2 172.31.83.56:9874/hashicorp-terraform-build-manager:556-6f20add "/usr/bin/tbm-start" 20 minutes ago Up 20 minutes ptfe_build_manager
6c3e323b864e 172.31.83.56:9874/hashicorp-ptfe-vault:CIRC-43-c3012a5 "vault-start" 20 minutes ago Up 20 minutes 0.0.0.0:8200->8200/tcp ptfe_vault
5a172f506e51 172.31.83.56:9874/hashicorp-ptfe-postgres:2738c44 "docker-entrypoint.s…" 20 minutes ago Up 20 minutes 0.0.0.0:5432->5432/tcp ptfe_postgres
7f26d1117c8f 172.31.83.56:9874/hashicorp-ptfe-rabbitmq:3-7a948ea "/start.sh rabbitmq-…" 20 minutes ago Up 20 minutes 0.0.0.0:5672->5672/tcp, 0.0.0.0:32784->4369/tcp, 0.0.0.0:32783->5671/tcp, 0.0.0.0:32782->25672/tcp rabbitmq
49800f0208c0 172.31.83.56:9874/hashicorp-ptfe-nginx:2-de4e9dc "/usr/bin/run-ssl ng…" 20 minutes ago Up 20 minutes 0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp, 0.0.0.0:23001->8080/tcp ptfe_nginx
a1c6d60aa825 172.31.83.56:9874/hashicorp-slug-ingress:597-f03e9b3 "/usr/bin/slug-ingre…" 20 minutes ago Up 20 minutes 0.0.0.0:7586->7586/tcp ptfe_ingress
36134510d493 172.31.83.56:9874/hashicorp-tfe-backup-restore:129-3e8e200 "/usr/bin/wait-for-t…" 20 minutes ago Up 20 minutes 0.0.0.0:23009->23009/tcp ptfe_backup_restore
24a466b9a15b 172.31.83.56:9874/hashicorp-ptfe-nomad:6-f61e114 "nomad-run" 20 minutes ago Up 20 minutes 0.0.0.0:23020->23020/tcp ptfe_nomad
17ef12b51d22 172.31.83.56:9874/influxdb:1.6.4-alpine "/entrypoint.sh infl…" 21 minutes ago Up 20 minutes 0.0.0.0:8086->8086/tcp influxdb
7f4d0b37df93 172.31.83.56:9874/telegraf:1.8.1-alpine "/entrypoint.sh tele…" 21 minutes ago Up 21 minutes 0.0.0.0:23010->23010/udp, 0.0.0.0:32774->8092/udp, 0.0.0.0:32781->8094/tcp, 0.0.0.0:32773->8125/udp telegraf
067596c60f7f 172.31.83.56:9874/hashicorp-ptfe-redis:4-de207d6 "docker-entrypoint.s…" 21 minutes ago Up 21 minutes 0.0.0.0:6379->6379/tcp ptfe_redis
d40584c68f99 172.31.83.56:9874/hashicorp-terraform-state-parser:537-b3c2c82 "/terraform-state-pa…" 21 minutes ago Up 21 minutes 0.0.0.0:7588->7588/tcp ptfe_state_parser
efd6ab856559 registry.replicated.com/library/statsd-graphite:1.0.6 "/usr/bin/supervisor…" 21 minutes ago Up 21 minutes 0.0.0.0:32780->2003/tcp, 0.0.0.0:32779->2004/tcp, 0.0.0.0:32778->2443/tcp, 0.0.0.0:32772->8125/udp replicated-statsd
7788adf5b847 172.31.83.56:9874/hashicorp-ptfe-health-check:CIRC-194-6bcc7e9 "/root/ptfe-health-c…" 21 minutes ago Up 21 minutes 0.0.0.0:23005->23005/tcp ptfe-health-check
f3aecc245e6a registry.replicated.com/library/retraced:1.3.28 "/src/replicated-aud…" About an hour ago Up About an hour 0.0.0.0:9873->3000/tcp retraced-api
80fd359750bf registry.replicated.com/library/retraced:1.3.28 "/src/replicated-aud…" About an hour ago Up About an hour 3000/tcp retraced-processor
0fb6beab6188 registry.replicated.com/library/retraced:1.3.28 "/bin/sh -c '/usr/lo…" About an hour ago Up About an hour 3000/tcp retraced-cron
26beab508fb6 registry.replicated.com/library/retraced-postgres:10.10-20200213 "docker-entrypoint.s…" About an hour ago Up About an hour 5432/tcp retraced-postgres
6390c05766e8 registry.replicated.com/library/retraced-nsq:v1.0.0-compat-20191118 "/bin/sh -c nsqd" About an hour ago Up About an hour 4150-4151/tcp, 4160-4161/tcp, 4170-4171/tcp retraced-nsqd
151e2830c9b8 registry.replicated.com/library/premkit:1.3.1 "/usr/bin/premkit da…" About an hour ago Up About an hour 80/tcp, 443/tcp, 2080/tcp, 0.0.0.0:9880->2443/tcp replicated-premkit
820d95c6586d quay.io/replicated/replicated-operator:current "/usr/bin/replicated…" About an hour ago Up About an hour replicated-operator
15ef1cea94ad quay.io/replicated/replicated-ui:current "/usr/bin/replicated…" About an hour ago Up About an hour 0.0.0.0:8800->8800/tcp replicated-ui
60456468a40b quay.io/replicated/replicated:current "entrypoint.sh -d" About an hour ago Up About an hour 0.0.0.0:9874-9879->9874-9879/tcp replicated
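A quick way to compare the running container count against the expected range and to confirm ptfe_atlas is up (a minimal sketch):
# Count the currently running containers
$ sudo docker ps --format '{{.Names}}' | wc -l
# Confirm the ptfe_atlas container is running
$ sudo docker ps --filter name=ptfe_atlas --format '{{.Names}}: {{.Status}}'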
Terraform Enterprise Health
A health check command can be used to verify the application is up and healthy.
# Check Terraform Enterprise Health
$ tfe-admin health-check
checking: Archivist Health Check...
| checks that Archivist is up and healthy
|- ✓ PASS
checking: Terraform Enterprise Health Check...
| checks that Terraform Enterprise is up and can communicate with Redis and Postgres
|- ✓ PASS
checking: Terraform Enterprise Vault Health Check...
| checks that Terraform Enterprise can connect to Vault and is able to encrypt and decrypt tokens
|- ✓ PASS
checking: Fluent Bit Health Check...
| checks that the configure Fluent Bit server is healthy
|- SKIPPED
checking: RabbitMQ Health Check...
| checks that RabbitMQ can be connected to and that we can send and consume messages
|- ✓ PASS
checking: Vault Server Health Check...
| checks that the configured Vault Server is healthy
|- ✓ PASS
All checks passed.
______________________________________________________________________________________
Terraform Enterprise Application Troubleshooting
When beginning the troubleshooting process it is important to keep all of the layers of the application stack in mind, as issues can occur at any layer and ultimately cause failures in Terraform Enterprise. Along with the application stack, review the list of containers and the purpose they serve in the application.
With all of this in mind, ask the following questions should any errors occur in the application.
- What is the error?
- Is it occurring within the application or outside of the application?
- Have there been any changes to the infrastructure, Linux instance, code, or application recently?
- What layer is this issue occurring at?
- Once the layer is identified, work your way up the stack, verifying each layer along the way.
- Where can I locate the logs for the failure?
- Check the HashiCorp Help Center to see if there is an article around the error. For best results, search based on the most unique part of the error.
- Is this a known bug that is fixed?
______________________________________________________________________________________
Terraform Enterprise Application Fails to Start
If all of the containers are not up yet, use watch docker ps
which will refresh the docker ps
command every two seconds. This will allow you to watch the startup process. The last container to be started is ptfe_atlas
. The application will still be in a starting status if this container is not up yet.
If the application fails to start, the Replicated service will usually report which container it failed on. Locate the container name by viewing the errors under sudo docker logs replicated
. Note the container that Terraform Enterprise failed to start.
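For example, a minimal sketch for narrowing down the failing container (exact log wording varies by Replicated version):
# Search the Replicated daemon logs for container start failures
$ sudo docker logs replicated 2>&1 | grep -iE 'error|failed' | tail -n 50
# List containers that have exited, which often identifies the failing service
$ sudo docker ps -a --filter status=exited --format '{{.Names}}: {{.Status}}'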
To view the logs for a given container, use the docker logs CONTAINER
command where CONTAINER is a container ID or name. To view the logs for the ptfe_vault
container, use docker logs ptfe_vault
. You can also follow the logs as they come in by using the -f
option. To follow the logs for the ptfe_vault
container, use docker logs -f ptfe_vault
.
$ sudo docker container ls
$ sudo docker logs ptfe_vault
Good
2020-05-27T17:44:27.446Z [INFO] core: vault is unsealed
Key Value
--- -----
Seal Type shamir
Initialized true
Sealed false
Total Shares 1
Threshold 1
Version 1.2.3
Cluster Name vault-cluster-bf612c25
Cluster ID 469f4ad7-7329-91e0-d39e-84f820edf4c5
HA Enabled false
Bad
Vault is already initialized
+ killing vault with pid 33
==> Vault shutdown triggered
+ vault has exited
+ exiting vault setup with 0
+ Retrieving Vault unseal key
get unseal: could not decrypt unseal key: crypto: could not decrypt ciphertext: chacha20poly1305: message authentication failed
Resolution
Likely you have the wrong encryption password set for an existing data layer.
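To compare the configured encryption password against the one used when the existing data layer was created, the current value can be read from the application configuration (a hedged example; enc_password is the usual TFE setting name, confirm it applies to your installation):
# Show the encryption password currently set in the application configuration
$ replicatedctl app-config export --template '{{ .enc_password.Value }}'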
$ sudo docker logs ptfe_postgresql_setup
Good
+ Detected postgresql up and active
CREATE SCHEMA
CREATE SCHEMA
CREATE SCHEMA
NOTICE: schema "rails" already exists, skipping
NOTICE: schema "vault" already exists, skipping
NOTICE: schema "registry" already exists, skipping
Bad
+ Detected postgresql up and active
ERROR: permission denied for database tfe
2021-10-19T20:02:24.732742989Z psql: error: could not connect to server: No route to host
2021-10-19T20:02:24.732790701Z Is the server running on host "10.22.0.2" and accepting
2021-10-19T20:02:24.732798293Z TCP/IP connections on port 5432?
2021-10-19T20:02:30.787133573Z psql: error: timeout expired
Resolution
Likely your Postgres connection information is incorrect, or there is a network rule/firewall blocking Terraform Enterprise from connecting to Postgres.
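Connectivity and configuration can be spot-checked from the instance (a hedged sketch; the Postgres host is a placeholder, nc may need to be installed, and the pg_ setting names assume an external-services installation):
# Test TCP connectivity from the TFE instance to the Postgres host and port
$ nc -zv <postgres-host> 5432
# Review the configured database connection settings
$ replicatedctl app-config export | grep -i pg_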
Tracking Application Errors
All application requests come through the ptfe_atlas
container. If the application fails to save a setting and an error is displayed in the user interface then the error will be logged in this container and most likely with a stack trace as well. If the error is reproducible then run sudo docker logs ptfe_atlas -f --tail 100
which will follow the data coming into the container live and also provide the last 100 log lines. The error will most likely be displayed there.
Stack Trace Example:
2021-09-07T18:18:55.762071351Z 2021-09-07 18:18:55 [DEBUG] OOM command not allowed when used memory > 'maxmemory'. excluded from capture: DSN not set
2021-09-07T18:18:55.762083007Z 2021-09-07 18:18:55 [ERROR] [451fd9ba-d423-4ec2-9371-8cc2a2d8a846] {:error=>"Redis::CommandError", :id=>28844840, :message=>"OOM command not allowed when used memory > 'maxmemory'."}
2021-09-07T18:18:55.762421452Z 2021-09-07 18:18:55 [DEBUG] [451fd9ba-d423-4ec2-9371-8cc2a2d8a846] {:error=>"Redis::CommandError", :id=>28844840, :message=>"OOM command not allowed when used memory > 'maxmemory'.", :backtrace=>["/app/vendor/bundle/ruby/2.7.0/gems/redis-3.3.5/lib/redis/client.rb:199:in `call_pipelined'", "/app/vendor/bundle/ruby/2.7.0/gems/redis-3.3.5/lib/redis/client.rb:157:in `block in call_pipeline'", "/app/vendor/bundle/ruby/2.7.0/gems/redis-3.3.5/lib/redis/client.rb:293:in `with_reconnect'", "/app/vendor/bundle/ruby/2.7.0/gems/redis-3.3.5/lib/redis/client.rb:155:in `call_pipeline'", "/app/vendor/bundle/ruby/2.7.0/gems/redis-3.3.5/lib/redis.rb:2304:in `block in multi'", "/app/vendor/bundle/ruby/2.7.0/gems/redis-3.3.5/lib/redis.rb:58:in `block in synchronize'", "/usr/local/lib/ruby/2.7.0/monitor.rb:202:in `synchronize'", "/usr/local/lib/ruby/2.7.0/monitor.rb:202:in `mon_synchronize'", "/app/vendor/bundle/ruby/2.7.0/gems/redis-3.3.5/lib/redis.rb:58:in `synchronize'", "/app/vendor/bundle/ruby/2.7.0/gems/redis-3.3.5/lib/redis.rb:2296:in `multi'", "/app/vendor/bundle/ruby/2.7.0/gems/redis-namespace-1.8.1/lib/redis/namespace.rb:523:in `namespaced_block'", "/app/vendor/bundle/ruby/2.7.0/gems/redis-namespace-1.8.1/lib/redis/namespace.rb:294:in `multi'", "/app/vendor/bundle/ruby/2.7.0/gems/sidekiq-5.2.9/lib/sidekiq/client.rb:184:in `block in raw_push'", "/app/vendor/bundle/ruby/2.7.0/gems/connection_pool-2.2.3/lib/connection_pool.rb:63:in `block (2 levels) in with'", "/app/vendor/bundle/ruby/2.7.0/gems/connection_pool-2.2.3/lib/connection_pool.rb:62:in `handle_interrupt'", "/app/vendor/bundle/ruby/2.7.0/gems/connection_pool-2.2.3/lib/connection_pool.rb:62:in `block in with'", "/app/vendor/bundle/ruby/2.7.0/gems/connection_pool-2.2.3/lib/connection_pool.rb:59:in `handle_interrupt'", "/app/vendor/bundle/ruby/2.7.0/....
Take action based on the error message.
Sentinel, Cost Estimation Failures, & Plan Exports
Sentinel, Cost Estimation, and Plan Export jobs are run by the Nomad container. If there are errors with the jobs themselves then data can be found in the Nomad container.
SSH to the Terraform Enterprise instance and run docker exec -it ptfe_nomad /bin/sh
to connect to the Nomad container.
Within the Nomad container execute the following commands:
$ cd /var/lib/nomad/alloc
$ find . -name "worker.stderr.0" | xargs ls -l
-rw-r--r--. 1 root root 1312 Oct 1 14:00 ./2cc6dd57-5be5-f51b-b6a5-bf484360591b/alloc/logs/worker.stderr.0
-rw-r--r--. 1 root root 1081 Oct 1 14:00 ./50987597-2f85-f51b-cb29-8e808aa8d17f/alloc/logs/worker.stderr.0
Locate the latest worker log and cat
out the contents.
Example:
cat 2cc6dd57-5be5-f51b-b6a5-bf484360591b/alloc/logs/worker.stderr.0
Sentinel Worker version a9c99fb
Input must be a configuration file or Terraform plan.
Error parsing as configuration file: bad response code: 403
Error parsing as Teraform plan: input must be legacy Terraform plan or directory: https://<TFE-HOSTNAME>/api/internal/v2/policy-check/polchk-YfbvexddxgFW/payload
If there is an error with the job then it will be displayed in this file. Take action based on the error or contact HashiCorp Support.
Audit log:
Cost estimation and policy check audit entries can also be found in the ptfe_atlas
container. grep
the ptfe_atlas
container for polchk
, cost-estimates
, or plan-exports
to get more information on those requests.
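For example, a minimal sketch:
# Search the atlas logs for policy check, cost estimation, or plan export activity
$ sudo docker logs ptfe_atlas 2>&1 | grep -E 'polchk|cost-estimates|plan-exports'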
Sentinel Audit Policy Example:
Finished policy check polchk-izEdMooz5hDuufYG on run run-X9Nznasbcwf35bKe4. Result: true, Passed: 52, Total failed: 0, Hard failed: 0, Soft failed: 0, Advisory failed: 0, Duration ms: 0
Webhooks Troubleshooting
Webhooks from Version Control Systems (VCS) come into the sidekiq
container. If the instance is not receiving webhooks then check the VCS repository's webhook delivery settings to see what error, if any, is reported when delivering the payload to Terraform Enterprise.
There are common reasons for the webhook delivery failures such as network restrictions or DNS issues which are often diagnosable using cURL
or nslookup
from the ptfe_atlas
container. Self-signed certificates on either the VCS or Terraform Enterprise side can cause connectivity issues as well.
HTTPS connection issues can be diagnosed by connecting to the ptfe_atlas
container using sudo docker exec -it ptfe_atlas /bin/bash
and using curl
to test access to the VCS instance: curl -v -L https://<VCS-HOSTNAME.COM>
. A curl
test can also be run from the VCS instance to the Terraform Enterprise hostname.
If the curl
command completes successfully, the output should indicate that the chain of trust (TLS) for the HTTPS connection was completed successfully. If the curl
command throws an error then it is likely due to a DNS issue or a chain of trust that could not be completed successfully. If there are any errors in this process then ensure that both sides have publicly trusted certificates or make configuration changes to allow them to trust each other.
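A minimal sketch of these checks (the VCS hostname is a placeholder):
# Connect to the atlas container
$ sudo docker exec -it ptfe_atlas /bin/bash
# From inside the container, verify DNS resolution and the TLS handshake to the VCS
$ nslookup <VCS-HOSTNAME.COM>
$ curl -v -L https://<VCS-HOSTNAME.COM>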
- https://www.terraform.io/docs/enterprise/install/installer.html#certificate-authority-ca-bundle
- https://support.hashicorp.com/hc/en-us/articles/360046090994-Terraform-runs-failing-with-x509-certificate-signed-by-unknown-authority-error
- https://www.terraform.io/cloud-docs/vcs/troubleshooting#certificate-errors-on-terraform-enterprise
- https://www.terraform.io/cloud-docs/vcs/troubleshooting#can-t-trigger-workspace-runs-from-vcs-webhook
Modules Failing to Import
Modules are posted to Terraform Enterprise via webhooks. If the VCS repository is showing a successful delivery of the payload into Terraform Enterprise but the module is not showing in the application then there are several potential causes for this. Review the troubleshooting steps in the articles below.
- https://support.hashicorp.com/hc/en-us/articles/1500000278482-Module-Updates-Failing-to-Ingress-in-Terraform-Enterprise
- https://support.hashicorp.com/hc/en-us/articles/4407858770451-Failing-to-Add-a-Private-Registry-Module-in-Terraform-Enterprise
If the error cannot be located then generate and upload a support bundle to HashiCorp Support.
SLUG Errors
The SIC-001 (Source Ingress Controller) error is a generic failure to process a Terraform slug. A slug refers to a blob of data which contains the current state of the Terraform configuration files. Terraform Enterprise uses slug services to pull in VCS information to extract, merge, and process Terraform configuration files. After a slug is ingressed and processed it is then uploaded to blob storage via archivist
.
Common causes of SIC-001 errors:
- The OAuth token expires
- Permissions on the VCS side are changed or revoked
- Network issues keeping TFE from reaching the VCS or its internal store of slugs
- TFE is misconfigured
- Use of symlinks that link outside of the workspace
- Extremely large repository sizes
- Incorrect or non-existent Terraform Working Directory
Typically, SIC-001s are identified by reviewing the logs in the ptfe_slug_ingress
container sudo docker logs ptfe_slug_ingress
and reviewing it for errors. Errors can also be found in the ptfe_archivist
container by running sudo docker logs ptfe_archivist
and grepping it for errors.
The SIW-001 error occurs when Terraform Enterprise has not been installed correctly, but may manifest itself as a slug ingress error when importing a module or linking a workspace to VCS. The cause of this error can be confirmed by running the command tfe-admin health-check
. If you see that the Archivist and Vault containers are not healthy, then there has been an issue with the IP address configuration during install. The private-address
and public-address
flags need to be set when running the install.sh
script. More details can be found here: https://www.terraform.io/enterprise/install/automated/automating-the-installer#invoking-the-online-installation-script.
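A hedged example of re-running the installer with these flags (the addresses are placeholders; see the linked documentation for the full set of installer flags):
# Re-run the installer with explicit private and public addresses
$ sudo bash ./install.sh private-address=10.0.5.59 public-address=203.0.113.10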
Terraform Cloud Agents on Terraform Enterprise
Terraform Cloud Agents allow Terraform Enterprise to communicate with isolated infrastructure by deploying lightweight agents within a specific network segment.
Output from the Terraform execution will be visible on the run’s page within Terraform Enterprise, however, if there are issues with the agent then debug logging will not be displayed by default. Starting the environment with TFC_AGENT_LOG_LEVEL=DEBUG
along with TF_LOG=TRACE
will allow the agent to capture debug logs for the agent and the Terraform run to assist with troubleshooting.
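For example, when the agent runs as a container, debug logging can be enabled at startup (a sketch assuming the hashicorp/tfc-agent image; the token and hostname are placeholders, and TF_LOG=TRACE is normally set as an environment variable on the workspace):
# Start the agent with debug logging enabled
$ docker run -e TFC_AGENT_TOKEN=<agent-token> -e TFC_AGENT_ADDRESS=https://<TFE-HOSTNAME> -e TFC_AGENT_LOG_LEVEL=DEBUG hashicorp/tfc-agent:latest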
Errors within Runs
Runs within Terraform Enterprise are executed within Docker containers, Agents, or local machines using the remote
backend. If there are any failures within the runs, first identify the Terraform operation being performed when the error occurs and take action based on the common issues below.
For plans within Terraform Enterprise:
- Unpacks the configuration that was provided
- If specified in the workspace settings, changes to the given working directory
- Generates variables.tfvars from the workspace's Variables page
- Exports environment variables from the workspace's Variables page
- Generates a backend override file so that the workspace is always used for state storage regardless of any backend block in the configuration
- Runs terraform init and discards the log if it is successful
- Runs terraform plan, which loads the workspace state, refreshes (reads) all resources and data sources into the in-memory state, and compares the configuration to the current deployments to determine planned changes
- Generates and stores the plan file, JSON plan file, and final configuration filesystem
Common issues which cause errors during plans include:
- Syntax errors reported by terraform, which usually include a filename and line number where the error was encountered
- Configuration changes that are syntactically valid but lead to errors or unexpected results, such as changing variable values or resource names
- Incorrect variable values, especially if provided from multiple sources (Variables page, configuration default, *.auto.tfvars)
- Incorrect or insufficient service credentials, which are errors from the cloud provider reported by Terraform when authenticating to, e.g., refresh resources
- Incorrect but otherwise valid provider configuration. For example, an incorrect region
- Incorrect module sources or versions in the configuration
- Incorrect provider sources or versions in the configuration or terraform-bundle, if a bundle of providers and terraform version is in use
- Incorrect configuration version, which can be checked by expanding the plan's run details and following the link to the commit in the VCS for verification
- Modifications to resources outside of Terraform that cannot be detected or reconciled (e.g. by another automation system or manually by a user at the cloud web console)
- Using old versions of Terraform and providers that lack features and bug fixes
For apply within Terraform Enterprise:
- Unpacks the plan filesystem and plan file
- If specified in the workspace settings, changes to the given working directory
- Exports environment variables from the workspace's Variables page
- Runs terraform init and discards the log if it is successful
- Runs terraform apply with the plan file, which executes the planned changes
- Generates and stores the state file
Common issues which cause errors during applies include:
- Incorrect or insufficient service credentials. Only read permissions are required to plan, but write permissions are required to create or modify resources
- Issues with values given in the configuration that are rejected by the service provider. E.g., some combinations of otherwise valid values may not be accepted by the service
- Service timeouts or excessive rate limiting, usually due to attempting to manage too many resources in one workspace or across multiple workspaces running simultaneously
- Modifications to resources outside of Terraform that cannot be detected or reconciled (e.g. by another automation system or manually by a user at the cloud web console)
- Using old versions of TFE, Terraform, and/or providers that lack features and bug fixes
Managing Replicated
Replicated provides a CLI tool called replicatedctl
that can be used to interact with the Replicated service, and by proxy, the Terraform Enterprise application itself. Some of the common replicatedctl
commands are detailed below and more can be found at https://help.replicated.com/api/replicatedctl/
Restarting Terraform Enterprise (Start/Stop/Status)
$ replicatedctl app status
$ replicatedctl app stop
$ watch replicatedctl app status
$ replicatedctl app start
Restarting Replicated Service
$ systemctl stop replicated replicated-operator replicated-ui
$ systemctl start replicated replicated-operator replicated-ui
Replicated Application Settings
To export the Replicated application settings, use the replicatedctl params export
command.
To change a given setting, use the replicatedctl params set NAME --value VALUE
command where NAME is the name of the attribute that is to be changed and VALUE is the value to be assigned to that attribute. To change the ReleaseSequence
attribute to the value 0, the command replicatedctl params set ReleaseSequence --value 0
would be used.
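For example, to inspect and then pin the ReleaseSequence attribute described above:
# Export the current Replicated settings and check the attribute
$ replicatedctl params export | grep ReleaseSequence
# Set the attribute to a new value
$ replicatedctl params set ReleaseSequence --value 0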
Application Configuration
To change a given setting, use the replicatedctl app-config set NAME --value VALUE
command where NAME is the name of the attribute that is to be changed and VALUE is the value to be assigned to that attribute. The attributes list can be found by running replicatedctl app-config export
.
$ replicatedctl app-config set NAME --value VALUE
General Terraform Enterprise (TFE) Information
Main Page (follow left pane navigation for Deployment and Operation and Application Usage and Other Docs sections)
Monitoring/Health Check basics
Reference Architectures (including Active/Active)
Active/Active Terraform Enterprise (TFE) Information
TFE Active/Active Install/Configure
TFE Active/Active Administration
Other Terraform Enterprise (TFE) Monitoring Information
The following HashiCorp blog posts were written by HashiCorp staff but are not official documentation:
Monitoring and Logging for Terraform Enterprise
Monitoring and Logging for Terraform Enterprise — Azure Monitor
Monitoring and Logging for Terraform Enterprise — GCP Operations
Active/Active Admin Commands
As Active/Active installations disable the Replicated UI by default, we have provided admin commands to facilitate configuration changes, safe application stops, support bundles, and more. This work is done in a new container, tfe-admin.
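A few commonly used tfe-admin commands (a short, hedged sample; verify availability for your version):
# Check overall application health
$ tfe-admin health-check
# Generate a support bundle from this node
$ tfe-admin support-bundle
# Change an application configuration value
$ tfe-admin app-config -k <key> -v <value>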
These and other CLI commands are published on TFE Active/Active Administration
Contacting HashiCorp Support
When contacting HashiCorp Support, please include any detailed run logs using TF_LOG=TRACE
, redacted Terraform code (if necessary) and a support bundle as this will help ensure a timely response to your support request.
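On a standalone (Replicated-managed) installation, a support bundle can typically be generated from the CLI; on Active/Active installations use the tfe-admin support-bundle command noted above:
# Generate a support bundle on the instance
$ sudo replicatedctl support-bundle
# Bundles are typically written to /var/lib/replicated/support-bundles
$ ls -lh /var/lib/replicated/support-bundles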