Table of Contents
- Terraform Enterprise - Basic Troubleshooting Guide
- Overview
- Cloud Infrastructure
- Linux Instance
- Docker
- Replicated
- Roles of the Replicated containers
- Terraform Enterprise
- Roles of the Terraform Enterprise containers
- Install and Configure Terraform Enterprise
- Verification
- Terraform Enterprise Application Troubleshooting
- Managing Replicated
- General Terraform Enterprise (TFE) Information
- Active/Active Terraform Enterprise (TFE) Information
- Other Terraform Enterprise (TFE) Monitoring Information
- Active/Active Admin Commands
- Contacting HashiCorp Support
Terraform Enterprise - Basic Troubleshooting Guide
Overview
Terraform Enterprise operates in layers. It is important to understand the architecture of the application in order to properly troubleshoot issues.
First, cloud infrastructure from a provider such as Azure, GCP, or AWS is used to create a virtual network and the cloud resources that host a virtual Linux instance. VMware is also commonly used as an on-premises infrastructure solution. The Replicated install script is executed on the Linux instance, typically via cloud-init. If online mode is chosen, the script downloads and installs Docker, along with the Replicated and Terraform Enterprise Docker images. Once all of the Replicated containers have started successfully, the Terraform Enterprise containers start and the application eventually launches.
Cloud Infrastructure
At the base of the application stack is cloud infrastructure (layer 1 of the application stack). Cloud infrastructure refers to the hardware and software components, such as servers, storage, networking, virtualization software, services and management tools, that support the computing requirements of a cloud computing model.
Proper cloud networking is required to enable Terraform Enterprise to communicate both internally at the Docker container level and externally due to certain API requests being made from the Docker containers back out to the fully qualified domain name.
Proxies, load balancers, application firewalls, and security groups can affect how the application communicates with itself and with external services. Ensure that networking requirements have been met.
The memory, CPU, and IOPS configured for the Linux system can also have an effect on the performance of Terraform Enterprise. Ensure that the cloud infrastructure is created according to HashiCorp’s reference architecture.
Linux Instance
Terraform Enterprise runs on a Linux-based operating system (layer 2 of the application stack). SELinux and iptables are part of the built-in Linux security framework. iptables refers to the firewall restrictions on the Linux operating system, and SELinux provides file system security.
Both SELinux and iptables can affect how Terraform Enterprise communicates with itself and with external services.
Ensure that your Linux Instance meets these specific requirements.
- https://www.terraform.io/docs/enterprise/before-installing/index.html#selinux
Docker
Docker utilizes OS-level virtualization to deliver software in packages called containers (layer 3 of the application stack). Containers are isolated from one another and bundle their own software, libraries and configuration files. They can communicate with each other through internal Docker networks.
HashiCorp utilizes Docker containers to facilitate the runtime of various services used by Terraform Enterprise. These Docker containers are managed by Replicated.
Docker requirements vary based on the Linux operating system chosen. See the Docker compatibility guide for specific requirements.
Replicated
Terraform Enterprise uses a service named Replicated to handle the installation, configuration, and management of the various Docker containers that make up the Terraform Enterprise application (layer 4 of the application stack). Terraform Enterprise uses Replicated’s native scheduler to manage the lifecycle of the Docker containers.
Replicated also provides a CLI tool called replicatedctl that can be used to interact with the Replicated service, and by proxy, the Terraform Enterprise application itself.
Roles of the Replicated containers
replicated
- The daemon that runs Replicated services and starts the application. It communicates with the external Replicated API and registry unless running in airgap mode. This is the only component that communicates externally.
replicated-ui
- Provides the Replicated console which listens on host port 8800. It communicates internally with the Replicated daemon and with the premkit service.
replicated-operator
- A utility image to transfer files between the host and daemon and to run application containers if using the native scheduler. It communicates internally with the Replicated daemon on port 9879.
replicated-premkit
- This serves as a reverse proxy to the audit log, metrics, and integration services. It communicates internally with the daemon, audit log, and metrics services.
replicated-statsd
- This image is used for a metrics service that runs when the application is running.
support-bundle
- This image is run to collect system information when the customer creates a support bundle.
cmd
- This image may be used for custom commands if configured in the application yaml. It may communicate internally or externally if configured to do so by the vendor’s application.
retraced
- Retraced provides an API and worker for the audit log component and communicates internally with the audit log’s Postgres and NSQ services. The following are the API and worker containers:
- retraced-processor
- retraced-api
- retraced-cron
retraced-postgres
- This is the database for the audit log.
retraced-nsq
- This is the audit log’s queue.
Terraform Enterprise
Terraform Enterprise is a fully featured application that allows teams to run Terraform CLI within a data center or cloud provider (layer 5). Terraform Enterprise can be thought of as the top layer of the overall application stack. If the previous layers are not configured according to HashiCorp's recommendations then this can cause an issue with communication between the containers and/or the external services that Terraform Enterprise interacts with.
In addition to the Replicated Docker containers, there are long-running containers that comprise the Terraform Enterprise application. To troubleshoot properly, it is important to understand the roles of each container as this will guide you on how to troubleshoot the application.
Roles of the Terraform Enterprise containers
Note: For installations using Terraform Enterprise v202205-01 or later, all container names now follow the naming convention of "tfe-<service>"
Example:
ptfe_atlas > tfe-atlas
ptfe_archivist > tfe-archivist
This article will be updated to remove references to the "ptfe" prefix at a later date.
More information can be found in the release notes here.
ptfe_nginx
- Nginx reverse proxy, facilitates access to the Terraform Enterprise services
ptfe_atlas
- The API and Web UI. Terraform Enterprise used to be known as Atlas
ptfe_build_manager
- Manages the queue of Terraform runs
ptfe_build_worker
- Creates workers on-demand as required by the queue. Injects variables, secrets, and Terraform configuration to a temporary container, ptfe_worker
ptfe_worker
- Executes a Terraform plan or apply. This container can be replaced with a custom image. This ephemeral container may be created with a randomly generated name by Docker
ptfe_vault
- HashiCorp Vault, utilizes transit encryption for items such as sensitive workspace variables
ptfe_registry_api
- Terraform Private Module Registry API
ptfe_slug_ingress (or ptfe_ingress in older versions of Terraform Enterprise)
- Listens for VCS webhooks. Packages VCS repo data as a slug and sends it to ptfe_archivist
ptfe_registry_worker
- Processes VCS slugs, prepares module to be published on the Terraform private Module Registry
ptfe_sidekiq
- Background job scheduler system
ptfe_redis
- Redis in-memory database, used for caching and for the ptfe_sidekiq queue. This container will not be active on Active/Active Terraform Enterprise installations.
ptfe_nomad
- HashiCorp Nomad, schedules Sentinel and Cost Estimation runs
ptfe_archivist
- Object storage API
ptfe_migrations
- Runs on startup only, runs database migrations from ptfe_atlas
ptfe_postgres
- PostgreSQL database, holds relational data such as workspace applies and where their state is stored in object storage
ptfe_state_parser
- Reads Terraform state files and parses important information out of them
rabbitmq
- RabbitMQ message queue
ptfe_backup_restore
- The Terraform Enterprise Backup and Restore API
ptfe_outbound_http_proxy
- Security control used to filter user-controlled network traffic (e.g., sentinel imports) and prevent them from accessing internal services directly
ptfe_health_check
- Runs a periodic health check against Terraform Enterprise
ptfe_base_startup
- Runs on install only. Initializes Terraform Enterprise for installation
ptfe_registry_migrations
- Runs on startup only, runs database migrations from ptfe_registry_api
telegraf
- Data collection agent for collecting and reporting metrics. This container runs when enable_metrics_collection is enabled in the application configuration
influxdb
- Time-series database for storing metrics data from telegraf. This container runs when enable_metrics_collection is enabled in the application configuration
Install and Configure Terraform Enterprise
Verify Replicated Config prior to installation
The Replicated config MUST be present at /etc/replicated.conf. It contains the information Replicated needs to start and also imports the TFE settings from another file.
View the file:
$ cat /etc/replicated.conf
{
"DaemonAuthenticationType": "password",
"DaemonAuthenticationPassword": "xxxxxxxxxxxxx",
"TlsBootstrapHostname": "tfe.company.com",
"TlsBootstrapType": "self-signed",
"TlsBootstrapCert": "",
"TlsBootstrapKey": "",
"BypassPreflightChecks": true,
"ImportSettingsFrom": "/etc/replicated-tfe.json",
"LicenseBootstrapAirgapPackagePath": "/etc/tfe/latest.airgap",
"LicenseFileLocation": "/etc/replicated.rli"
}
Verify:
- "ImportSettingsFrom" is set to a valid file, more information on the contents in the next section,
cat <path to tfe configs>
- "DaemonAuthenticationPassword" has a password set.
- "LicenseFileLocation" is set to the path of a valid file,
cat <path to license file>
- This file is technically a binary, but can be visually inspected.
- Should start with the string "license.json"
- Should also contain the strings ["key.signature", "BEGIN RSA PUBLIC KEY", "END RSA PUBLIC KEY"]
- "TlsBootstrapHostname" is accurately set to the FQDN that will be used to access TFE.
- If "TlsBootstrapType" == "self-signed", "TlsBootstrapCert" and "TlsBootstrapKey" should be empty.
- If "TlsBootstrapType" == "server-path", "TlsBootstrapCert" and "TlsBootstrapKey" should not be empty.
- Check that the cert and key paths are valid,
cat <path to cert or key>
- Verify that the cert is PEM encoded, does not contain embedded "\n" strings, and includes the entire CA chain.
```sh
-----BEGIN CERTIFICATE----- ... -----END CERTIFICATE-----
-----BEGIN CERTIFICATE----- ... -----END CERTIFICATE-----
```
- Verify that the key is an RSA private key.
```sh
-----BEGIN RSA PRIVATE KEY----- ... -----END RSA PRIVATE KEY-----
```
- If "LicenseBootstrapAirgapPackagePath" is set, verify the file is present and is at least 900MB,
ls -lh <path to airgap file>
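The license-file checks in the list above can be scripted. The following is a minimal sketch, not an official HashiCorp or Replicated tool: the function name check_license_markers is hypothetical, and it only looks for the marker strings listed above. It does not validate the license itself.

```sh
# check_license_markers is a hypothetical helper name, not part of Replicated or TFE.
# It only tests for the marker strings listed above; it does not validate the license.
check_license_markers() {
  license_file=$1
  [ -r "$license_file" ] || { echo "cannot read $license_file" >&2; return 1; }

  # The file should start with the string "license.json" (12 bytes).
  [ "$(head -c 12 "$license_file")" = "license.json" ] || {
    echo "file does not start with license.json" >&2; return 1; }

  # -a treats the binary file as text, -F matches fixed strings, -q suppresses output.
  for marker in "key.signature" "BEGIN RSA PUBLIC KEY" "END RSA PUBLIC KEY"; do
    grep -a -q -F "$marker" "$license_file" || {
      echo "missing marker: $marker" >&2; return 1; }
  done
  return 0
}

# Example (path taken from the sample /etc/replicated.conf above):
# check_license_markers /etc/replicated.rli && echo "license markers present"
```

A non-zero return only means that one of the expected strings was not found; the file should still be inspected by hand before concluding that the license is invalid.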
Installation Instructions
Verification
To verify that the installation was successful, review the cloud-final logs on the Linux instance.
Verify Cloud Init
This assumes cloud-init is being used from the user_data argument.
SSH into the instance running TFE to perform the following checks:
$ journalctl -xu cloud-final -o cat
Verify that ./install.sh ran and exited properly.
Here is an example:
root : TTY=unknown ; PWD=/etc/tfe ; USER=root ; COMMAND=./install.sh ... private-address=10.0.5.59 public-address=10.0.5.59
Determining local address
The installer will use local address '10.0.5.59' (from parameter)
Running preflight checks...
[INFO] / disk usage is at 15%
[INFO] /var/lib/docker disk usage is at 15%
[INFO] Docker http proxy not set
...
To continue the installation, visit the following URL in your browser:
http://10.0.5.59:8800
Verify Docker is Running
If cloud-init shows a successful installation, verify that Docker and the Replicated services have started. Running docker info confirms whether this layer of the application has started. If docker info returns information, it is safe to assume that this layer has loaded successfully. Continue to the next layer to verify that the Replicated services are up.
Run the docker info command:
$ sudo docker info
Good
Containers: 36
Running: 28
Paused: 0
Stopped: 8
Images: 37
Server Version: 1.13.1
...
Bad
Error: error response from daemon get https://index.docker.io connection refused
Resolution
Verify that the docker service is running and check for any errors:
$ systemctl status docker
Resolve any networking or routing issues.
Verify Replicated is Running
Run the system status command:
$ replicatedctl system status
Good
{
"Replicated": "ready",
"Retraced": "ready"
}
Bad
{
"Replicated": "ready",
"Retraced": "initializing"
}
Resolution
Verify Replicated services are running, and check logs:
# Verify Replicated services are running
$ systemctl status replicated replicated-operator replicated-ui
● replicated.service - Replicated Service
Loaded: loaded (/etc/systemd/system/replicated.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2021-10-06 15:39:41 UTC; 4 weeks 1 days ago
Main PID: 14949 (docker)
Tasks: 13 (limit: 4915)
CGroup: /system.slice/replicated.service
└─14949 /usr/bin/docker run --name=replicated -p 9874-9879:9874-9879/tcp -u 1001:999
● replicated-operator.service - Replicated Operator Service
Loaded: loaded (/etc/systemd/system/replicated-operator.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2021-10-06 15:39:51 UTC; 4 weeks 1 days ago
Main PID: 15570 (docker)
Tasks: 11 (limit: 4915)
CGroup: /system.slice/replicated-operator.service
└─15570 /usr/bin/docker run --name=replicated-operator -u 1001:999
● replicated-ui.service - Replicated Service
Loaded: loaded (/etc/systemd/system/replicated-ui.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2021-10-06 15:39:41 UTC; 4 weeks 1 days ago
Main PID: 15013 (docker)
Tasks: 11 (limit: 4915)
CGroup: /system.slice/replicated-ui.service
└─15013 /usr/bin/docker run --name=replicated-ui -p 8800:8800/tcp -u 1001:999
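The Good and Bad outputs of replicatedctl system status shown earlier can be compared in a script. This is a minimal sketch under stated assumptions: replicated_is_ready is a hypothetical helper name, and it assumes the flat JSON shape shown in those examples, read on standard input.

```sh
# replicated_is_ready is a hypothetical helper name. It assumes the flat JSON
# shape shown in the Good/Bad examples above and reads that JSON on stdin.
replicated_is_ready() {
  # Pull out every "Component": "state" pair, one per line.
  pairs=$(grep -Eo '"[A-Za-z]+"[[:space:]]*:[[:space:]]*"[a-z]+"')
  # No pairs at all means the input was not the expected status output.
  [ -n "$pairs" ] || return 1
  # Fail if any component reports a state other than "ready".
  if printf '%s\n' "$pairs" | grep -v -q '"ready"$'; then
    return 1
  fi
  return 0
}

# Example: poll every 10 seconds until every component reports ready.
# until replicatedctl system status | replicated_is_ready; do sleep 10; done
```

The function treats unparseable input as "not ready", so a failing replicatedctl call is not mistaken for a healthy system.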
View Replicated Logs
$ journalctl -xu replicated -o cat
View Replicated Docker Logs
# View & follow replicated container logs
$ sudo docker logs -f replicated
# View & follow replicated-operator container logs
$ sudo docker logs -f replicated-operator
Run Preflight Checks
Run the replicatedctl preflight checks to ensure that the minimum requirements have been met:
# Preflight checks
$ replicatedctl preflight run
✓ OS linux is supported
- The operating system must be linux
✓ Kernel version requirement met
- Kernel version must be at least 3.10
✓ Total space requirement met for directory /tmp
- Directory must have at least 1GB total space
✓ Total space requirement met for directory /var/lib/replicated
- Directory must have at least 250MB total space
✓ Successful TLS connection
- Can connect to TLS 10.0.172.53 address
✓ Successful HTTP request
- Can access api.replicated.com
✓ Docker server version requirement met
- Docker server version must be at least 1.7.1
✓ Memory requirement met
- Server must have at least 4GB total memory
✓ Total space requirement met for directory /var/lib/docker
- Directory must have at least 40GB total space
✓ CPU cores requirement met
- Server must have at least 2 CPU cores
✓ Total space requirement met for directory /
- Directory must have at least 10GB total space
✓ Successful connection to https://releases.hashicorp.com.
- Can connect to https://releases.hashicorp.com.
✓ Successful Docker registry ping
- Can access registry index.docker.io
✓ Successful Docker registry ping
- Can access registry registry.replicated.com
NODE: 916916fc055445185808a3101b79d795
✓ CPU cores requirement met
- Server must have at least 2 CPU cores
✓ Memory requirement met
- Server must have at least 4GB total memory
✓ OS linux is supported
- The operating system must be linux
✓ Kernel version requirement met
- Kernel version must be at least 3.10
✓ Total space requirement met for directory /
- Directory must have at least 10GB total space
✓ Total space requirement met for directory /var/lib/docker
- Directory must have at least 40GB total space
✓ Docker server version requirement met
- Docker server version must be at least 1.7.1
✓ Successful TLS connection
- Can connect to TLS 10.0.172.53 address
All preflight checks passed!
Verify Terraform Enterprise Containers
Terraform Enterprise runs as a series of Docker containers which are managed by Replicated. Sometimes, it’s necessary to use the docker command to view the logs of a container or to execute a command within a container. Those actions and more are detailed below.
The docker ps command is used to list all currently running containers. A healthy, idle Terraform Enterprise installation should have around 25-30 containers running at a given time, about 10 of which should be Replicated containers.
Note: the last container to be started is ptfe_atlas. If this container is started, it indicates that the application is up.
# Containers
$ sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e9ef9136b222 172.31.83.56:9874/hashicorp-terraform-build-worker:651-723542f "start-tbw /terrafor…" 20 minutes ago Up 20 minutes affectionate_spence
e670cb2451bb 172.31.83.56:9874/hashicorp-archivist:608-11824ea "/usr/bin/wait-for-t…" 20 minutes ago Up 20 minutes 0.0.0.0:7675->7675/tcp ptfe_archivist
296029f74e33 172.31.83.56:9874/hashicorp-tf-registry:1209-cef77f6 "setup-ssl /usr/bin/…" 20 minutes ago Up 20 minutes ptfe_registry_worker
34cc9cc8816b 172.31.83.56:9874/hashicorp-tf-registry:1209-cef77f6 "setup-ssl /usr/bin/…" 20 minutes ago Up 20 minutes ptfe_registry_api
fbf0a59f5e0f 172.31.83.56:9874/hashicorp-atlas:CIRC-52813-8885d12 "/usr/bin/init.sh /a…" 20 minutes ago Up 20 minutes ptfe_sidekiq
084de8a5812c 172.31.83.56:9874/hashicorp-atlas:CIRC-52813-8885d12 "/usr/bin/init.sh bu…" 20 minutes ago Up 20 minutes 0.0.0.0:9292->9292/tcp ptfe_atlas
fe85af2e0ec2 172.31.83.56:9874/hashicorp-terraform-build-manager:556-6f20add "/usr/bin/tbm-start" 20 minutes ago Up 20 minutes ptfe_build_manager
6c3e323b864e 172.31.83.56:9874/hashicorp-ptfe-vault:CIRC-43-c3012a5 "vault-start" 20 minutes ago Up 20 minutes 0.0.0.0:8200->8200/tcp ptfe_vault
5a172f506e51 172.31.83.56:9874/hashicorp-ptfe-postgres:2738c44 "docker-entrypoint.s…" 20 minutes ago Up 20 minutes 0.0.0.0:5432->5432/tcp ptfe_postgres
7f26d1117c8f 172.31.83.56:9874/hashicorp-ptfe-rabbitmq:3-7a948ea "/start.sh rabbitmq-…" 20 minutes ago Up 20 minutes 0.0.0.0:5672->5672/tcp, 0.0.0.0:32784->4369/tcp, 0.0.0.0:32783->5671/tcp, 0.0.0.0:32782->25672/tcp rabbitmq
49800f0208c0 172.31.83.56:9874/hashicorp-ptfe-nginx:2-de4e9dc "/usr/bin/run-ssl ng…" 20 minutes ago Up 20 minutes 0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp, 0.0.0.0:23001->8080/tcp ptfe_nginx
a1c6d60aa825 172.31.83.56:9874/hashicorp-slug-ingress:597-f03e9b3 "/usr/bin/slug-ingre…" 20 minutes ago Up 20 minutes 0.0.0.0:7586->7586/tcp ptfe_ingress
36134510d493 172.31.83.56:9874/hashicorp-tfe-backup-restore:129-3e8e200 "/usr/bin/wait-for-t…" 20 minutes ago Up 20 minutes 0.0.0.0:23009->23009/tcp ptfe_backup_restore
24a466b9a15b 172.31.83.56:9874/hashicorp-ptfe-nomad:6-f61e114 "nomad-run" 20 minutes ago Up 20 minutes 0.0.0.0:23020->23020/tcp ptfe_nomad
17ef12b51d22 172.31.83.56:9874/influxdb:1.6.4-alpine "/entrypoint.sh infl…" 21 minutes ago Up 20 minutes 0.0.0.0:8086->8086/tcp influxdb
7f4d0b37df93 172.31.83.56:9874/telegraf:1.8.1-alpine "/entrypoint.sh tele…" 21 minutes ago Up 21 minutes 0.0.0.0:23010->23010/udp, 0.0.0.0:32774->8092/udp, 0.0.0.0:32781->8094/tcp, 0.0.0.0:32773->8125/udp telegraf
067596c60f7f 172.31.83.56:9874/hashicorp-ptfe-redis:4-de207d6 "docker-entrypoint.s…" 21 minutes ago Up 21 minutes 0.0.0.0:6379->6379/tcp ptfe_redis
d40584c68f99 172.31.83.56:9874/hashicorp-terraform-state-parser:537-b3c2c82 "/terraform-state-pa…" 21 minutes ago Up 21 minutes 0.0.0.0:7588->7588/tcp ptfe_state_parser
efd6ab856559 registry.replicated.com/library/statsd-graphite:1.0.6 "/usr/bin/supervisor…" 21 minutes ago Up 21 minutes 0.0.0.0:32780->2003/tcp, 0.0.0.0:32779->2004/tcp, 0.0.0.0:32778->2443/tcp, 0.0.0.0:32772->8125/udp replicated-statsd
7788adf5b847 172.31.83.56:9874/hashicorp-ptfe-health-check:CIRC-194-6bcc7e9 "/root/ptfe-health-c…" 21 minutes ago Up 21 minutes 0.0.0.0:23005->23005/tcp ptfe-health-check
f3aecc245e6a registry.replicated.com/library/retraced:1.3.28 "/src/replicated-aud…" About an hour ago Up About an hour 0.0.0.0:9873->3000/tcp retraced-api
80fd359750bf registry.replicated.com/library/retraced:1.3.28 "/src/replicated-aud…" About an hour ago Up About an hour 3000/tcp retraced-processor
0fb6beab6188 registry.replicated.com/library/retraced:1.3.28 "/bin/sh -c '/usr/lo…" About an hour ago Up About an hour 3000/tcp retraced-cron
26beab508fb6 registry.replicated.com/library/retraced-postgres:10.10-20200213 "docker-entrypoint.s…" About an hour ago Up About an hour 5432/tcp retraced-postgres
6390c05766e8 registry.replicated.com/library/retraced-nsq:v1.0.0-compat-20191118 "/bin/sh -c nsqd" About an hour ago Up About an hour 4150-4151/tcp, 4160-4161/tcp, 4170-4171/tcp retraced-nsqd
151e2830c9b8 registry.replicated.com/library/premkit:1.3.1 "/usr/bin/premkit da…" About an hour ago Up About an hour 80/tcp, 443/tcp, 2080/tcp, 0.0.0.0:9880->2443/tcp replicated-premkit
820d95c6586d quay.io/replicated/replicated-operator:current "/usr/bin/replicated…" About an hour ago Up About an hour replicated-operator
15ef1cea94ad quay.io/replicated/replicated-ui:current "/usr/bin/replicated…" About an hour ago Up About an hour 0.0.0.0:8800->8800/tcp replicated-ui
60456468a40b quay.io/replicated/replicated:current "entrypoint.sh -d" About an hour ago Up About an hour 0.0.0.0:9874-9879->9874-9879/tcp replicated
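Because the atlas container is the last one to start, its presence in the docker ps listing can be tested directly. The sketch below is illustrative only: atlas_is_up is a hypothetical helper name, and it accepts both the older ptfe_atlas name and the tfe-atlas name used from v202205-01 onward.

```sh
# atlas_is_up is a hypothetical helper name. It reads container names on stdin,
# one per line, and succeeds if the atlas container is listed under either the
# older "ptfe_atlas" name or the "tfe-atlas" name used from v202205-01 onward.
atlas_is_up() {
  grep -E -q '^(ptfe_atlas|tfe-atlas)$'
}

# Example: list only the names of running containers and test for atlas.
# sudo docker ps --format '{{.Names}}' | atlas_is_up && echo "application is up"
```

The match is anchored to the whole line so that a container whose name merely contains "atlas" is not counted.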
Terraform Enterprise Health
A health check command can be used to verify the application is up and healthy.
# Check Terraform Enterprise Health
$ tfe-admin health-check
checking: Archivist Health Check...
| checks that Archivist is up and healthy
|- ✓ PASS
checking: Terraform Enterprise Health Check...
| checks that Terraform Enterprise is up and can communicate with Redis and Postgres
|- ✓ PASS
checking: Terraform Enterprise Vault Health Check...
| checks that Terraform Enterprise can connect to Vault and is able to encrypt and decrypt tokens
|- ✓ PASS
checking: Fluent Bit Health Check...
| checks that the configure Fluent Bit server is healthy
|- SKIPPED
checking: RabbitMQ Health Check...
| checks that RabbitMQ can be connected to and that we can send and consume messages
|- ✓ PASS
checking: Vault Server Health Check...
| checks that the configured Vault Server is healthy
|- ✓ PASS
All checks passed.
Terraform Enterprise Application Troubleshooting
When beginning the troubleshooting process, it is important to keep all of the layers of the application stack in mind, as issues can occur at any level and ultimately cause failures in Terraform Enterprise. Along with the application stack, review the list of containers and the purpose they serve in the application.
With all of these in mind ask the following questions should any errors occur in the application.
- What is the error?
- Is it occurring within the application or outside of the application?
- Have there been any changes to the infrastructure, Linux instance, code, or application recently?
- What layer is this issue occurring at?
- Once the layer is identified, work your way up the stack, verifying each layer along the way.
- Where can I locate the logs for the failure?
- Check the HashiCorp Help Center to see if there is an article around the error. For best results, search based on the most unique part of the error.
- Is this a known bug that is fixed?
Terraform Enterprise Application Fails to Start
If not all of the containers are up yet, use watch docker ps, which reruns the docker ps command every two seconds so that you can watch the startup process. The last container to be started is ptfe_atlas. The application remains in a starting status until this container is up.
If the application fails to start, the Replicated service will usually report which container it failed on. Locate the container name by viewing the errors under sudo docker logs replicated. Note the container that Terraform Enterprise failed to start.
To view the logs for a given container, use the docker logs CONTAINER command, where CONTAINER is a container ID or name. To view the logs for the ptfe_vault container, use docker logs ptfe_vault. You can also follow the logs as they come in by using the -f option. To follow the logs for the ptfe_vault container, use docker logs -f ptfe_vault.
$ sudo docker container ls
$ sudo docker logs ptfe_vault
Good
2020-05-27T17:44:27.446Z [INFO] core: vault is unsealed
Key Value
--- -----
Seal Type shamir
Initialized true
Sealed false
Total Shares 1
Threshold 1
Version 1.2.3
Cluster Name vault-cluster-bf612c25
Cluster ID 469f4ad7-7329-91e0-d39e-84f820edf4c5
HA Enabled false
false
Bad
Vault is already initialized
+ killing vault with pid 33
==> Vault shutdown triggered
+ vault has exited
+ exiting vault setup with 0
+ Retrieving Vault unseal key
get unseal: could not decrypt unseal key: crypto: could not decrypt ciphertext: chacha20poly1305: message authentication failed
Resolution
The wrong encryption password is likely set for an existing data layer.
$ sudo docker logs ptfe_postgresql_setup
Good
+ Detected postgresql up and active
CREATE SCHEMA
CREATE SCHEMA
CREATE SCHEMA
NOTICE: schema "rails" already exists, skipping
NOTICE: schema "vault" already exists, skipping
NOTICE: schema "registry" already exists, skipping
Bad
+ Detected postgresql up and active
ERROR: permission denied for database tfe
2021-10-19T20:02:24.732742989Z psql: error: could not connect to server: No route to host
2021-10-19T20:02:24.732790701Z Is the server running on host "10.22.0.2" and accepting
2021-10-19T20:02:24.732798293Z TCP/IP connections on port 5432?
2021-10-19T20:02:30.787133573Z psql: error: timeout expired
Resolution
The Postgres connection information is likely incorrect, or a network rule or firewall is preventing Terraform Enterprise from connecting to Postgres.
Tracking Application Errors
All application requests come through the ptfe_atlas container. If the application fails to save a setting and an error is displayed in the user interface, the error will be logged in this container, most likely with a stack trace. If the error is reproducible, run sudo docker logs ptfe_atlas -f --tail 100, which follows new log output live and also shows the last 100 log lines. The error will most likely be displayed there.
Stack Trace Example:
2021-09-07T18:18:55.762071351Z 2021-09-07 18:18:55 [DEBUG] OOM command not allowed when used memory > 'maxmemory'. excluded from capture: DSN not set
2021-09-07T18:18:55.762083007Z 2021-09-07 18:18:55 [ERROR] [451fd9ba-d423-4ec2-9371-8cc2a2d8a846] {:error=>"Redis::CommandError", :id=>28844840, :message=>"OOM command not allowed when used memory > 'maxmemory'."}
2021-09-07T18:18:55.762421452Z 2021-09-07 18:18:55 [DEBUG] [451fd9ba-d423-4ec2-9371-8cc2a2d8a846] {:error=>"Redis::CommandError", :id=>28844840, :message=>"OOM command not allowed when used memory > 'maxmemory'.", :backtrace=>["/app/vendor/bundle/ruby/2.7.0/gems/redis-3.3.5/lib/redis/client.rb:199:in `call_pipelined'", "/app/vendor/bundle/ruby/2.7.0/gems/redis-3.3.5/lib/redis/client.rb:157:in `block in call_pipeline'", "/app/vendor/bundle/ruby/2.7.0/gems/redis-3.3.5/lib/redis/client.rb:293:in `with_reconnect'", "/app/vendor/bundle/ruby/2.7.0/gems/redis-3.3.5/lib/redis/client.rb:155:in `call_pipeline'", "/app/vendor/bundle/ruby/2.7.0/gems/redis-3.3.5/lib/redis.rb:2304:in `block in multi'", "/app/vendor/bundle/ruby/2.7.0/gems/redis-3.3.5/lib/redis.rb:58:in `block in synchronize'", "/usr/local/lib/ruby/2.7.0/monitor.rb:202:in `synchronize'", "/usr/local/lib/ruby/2.7.0/monitor.rb:202:in `mon_synchronize'", "/app/vendor/bundle/ruby/2.7.0/gems/redis-3.3.5/lib/redis.rb:58:in `synchronize'", "/app/vendor/bundle/ruby/2.7.0/gems/redis-3.3.5/lib/redis.rb:2296:in `multi'", "/app/vendor/bundle/ruby/2.7.0/gems/redis-namespace-1.8.1/lib/redis/namespace.rb:523:in `namespaced_block'", "/app/vendor/bundle/ruby/2.7.0/gems/redis-namespace-1.8.1/lib/redis/namespace.rb:294:in `multi'", "/app/vendor/bundle/ruby/2.7.0/gems/sidekiq-5.2.9/lib/sidekiq/client.rb:184:in `block in raw_push'", "/app/vendor/bundle/ruby/2.7.0/gems/connection_pool-2.2.3/lib/connection_pool.rb:63:in `block (2 levels) in with'", "/app/vendor/bundle/ruby/2.7.0/gems/connection_pool-2.2.3/lib/connection_pool.rb:62:in `handle_interrupt'", "/app/vendor/bundle/ruby/2.7.0/gems/connection_pool-2.2.3/lib/connection_pool.rb:62:in `block in with'", "/app/vendor/bundle/ruby/2.7.0/gems/connection_pool-2.2.3/lib/connection_pool.rb:59:in `handle_interrupt'", "/app/vendor/bundle/ruby/2.7.0/....
Take action based on the error message.
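When the log is busy, it can help to keep only the lines tagged with the error level. This is a minimal sketch that assumes the [ERROR] tag format seen in the stack trace example above; atlas_errors is a hypothetical helper name.

```sh
# atlas_errors is a hypothetical helper name. It reads log lines on stdin and
# prints only those carrying the [ERROR] level tag seen in the example above.
atlas_errors() {
  # -F matches the tag as a fixed string, so the brackets are not a regex class.
  grep -F '[ERROR]'
}

# Example: scan the last 1000 lines of the ptfe_atlas logs for errors.
# docker logs can write to both stdout and stderr, so merge them first.
# sudo docker logs --tail 1000 ptfe_atlas 2>&1 | atlas_errors
```

The backtrace itself appears on the matching [DEBUG] line in the example, so the bracketed identifier on an [ERROR] line is a useful string to search for next.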
Sentinel, Cost Estimation Failures, & Plan Exports
Sentinel, Cost Estimation, and Plan Export jobs are run by the Nomad container. If there are errors with the jobs themselves then data can be found in the Nomad container.
SSH to the Terraform Enterprise instance and run docker exec -it ptfe_nomad /bin/sh to connect to the Nomad container.
Within the Nomad container execute the following commands:
$ cd /var/lib/nomad/alloc
$ find . -name "worker.stderr.0" | xargs ls -l
-rw-r--r--. 1 root root 1312 Oct 1 14:00 ./2cc6dd57-5be5-f51b-b6a5-bf484360591b/alloc/logs/worker.stderr.0
-rw-r--r--. 1 root root 1081 Oct 1 14:00 ./50987597-2f85-f51b-cb29-8e808aa8d17f/alloc/logs/worker.stderr.0
Locate the most recent worker log and print its contents with cat.
Example:
cat 2cc6dd57-5be5-f51b-b6a5-bf484360591b/alloc/logs/worker.stderr.0
Sentinel Worker version a9c99fb
Input must be a configuration file or Terraform plan.
Error parsing as configuration file: bad response code: 403
Error parsing as Teraform plan: input must be legacy Terraform plan or directory: https://<TFE-HOSTNAME>/api/internal/v2/policy-check/polchk-YfbvexddxgFW/payload
If there is an error with the job then it will be displayed in this file. Take action based on the error or contact HashiCorp Support.
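Picking the newest worker log by eye becomes tedious when there are many allocations. The sketch below selects it by modification time. latest_worker_log is a hypothetical helper name, and the example assumes the shell inside the Nomad container provides find, ls, and head with these options.

```sh
# latest_worker_log is a hypothetical helper name. It prints the most recently
# modified worker.stderr.0 file found under the directory given as $1.
# Assumes the allocation paths contain no whitespace (they are UUIDs above).
latest_worker_log() {
  # ls -t sorts the files that find passes to it by modification time, newest first.
  find "$1" -name 'worker.stderr.0' -exec ls -t {} + | head -n 1
}

# Example, run inside the Nomad container:
# cat "$(latest_worker_log /var/lib/nomad/alloc)"
```

If no worker log exists under the directory, the function prints nothing, so check for empty output before passing the result to cat.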
Audit log:
Cost estimation and policy check audit entries can also be found in the ptfe_atlas container logs. grep the ptfe_atlas container logs for polchk, cost-estimates, or plan-exports to get more information on those requests.
Sentinel Audit Policy Example:
Finished policy check polchk-izEdMooz5hDuufYG on run run-X9Nznasbcwf35bKe4. Result: true, Passed: 52, Total failed: 0, Hard failed: 0, Soft failed: 0, Advisory failed: 0, Duration ms: 0
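The three search terms can be combined into a single filter. This is a minimal sketch, with audit_entries as a hypothetical helper name; it matches the strings named above and nothing else.

```sh
# audit_entries is a hypothetical helper name. It reads log lines on stdin and
# keeps those that mention policy checks, cost estimates, or plan exports.
audit_entries() {
  grep -E 'polchk|cost-estimates|plan-exports'
}

# Example: docker logs can emit on stderr as well, so merge the streams first.
# sudo docker logs ptfe_atlas 2>&1 | audit_entries
```

To follow a single request, replace the pattern with its full identifier, for example the polchk ID from the run in question.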
Webhooks Troubleshooting
Webhooks from Version Control Systems (VCS) come into the sidekiq container. If the instance is not receiving webhooks, check the webhook deliveries page in the VCS repository settings to see what error, if any, it reports when delivering the payload to Terraform Enterprise.
Common reasons for webhook delivery failures include network restrictions and DNS issues, which are often diagnosable using cURL or nslookup from the ptfe_atlas container. Self-signed certificates on either the VCS or Terraform Enterprise side can cause connectivity issues as well.
HTTPS connection issues can be diagnosed by connecting to the ptfe_atlas container using sudo docker exec -it ptfe_atlas /bin/bash and using curl to test access to the VCS instance: curl -v -L https://<VCS-HOSTNAME.COM>. A curl test can also be run from the VCS instance to the Terraform Enterprise hostname.
If the curl command completes successfully, the output should indicate that the chain of trust (TLS) for the HTTPS connection was established. If the curl command returns an error, it is likely due to DNS or a chain of trust that could not be completed. If there are any errors in this process, ensure that both sides have publicly trusted certificates or make configuration changes to allow them to trust each other.
- https://www.terraform.io/docs/enterprise/install/installer.html#certificate-authority-ca-bundle
- https://support.hashicorp.com/hc/en-us/articles/360046090994-Terraform-runs-failing-with-x509-certificate-signed-by-unknown-authority-error
- https://www.terraform.io/cloud-docs/vcs/troubleshooting#certificate-errors-on-terraform-enterprise
- https://www.terraform.io/cloud-docs/vcs/troubleshooting#can-t-trigger-workspace-runs-from-vcs-webhook
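When running the curl test described above, the curl exit code already separates DNS, network, and certificate failures. The sketch below maps a few well-known curl exit codes to those classes. The function names are hypothetical and the list of codes is deliberately short, not exhaustive.

```shell
# Map common curl exit codes to the failure classes discussed above.
explain_curl_exit() {
  case "$1" in
    0)  echo "ok: connection and TLS handshake succeeded" ;;
    6)  echo "dns: could not resolve host" ;;
    7)  echo "network: failed to connect to host" ;;
    28) echo "network: operation timed out" ;;
    35) echo "tls: SSL/TLS handshake failed" ;;
    60) echo "tls: certificate not trusted by the local CA bundle" ;;
    *)  echo "other: curl exit code $1" ;;
  esac
}

# Test HTTPS access to a hostname and print a one-line diagnosis.
check_https() {
  curl -sS -o /dev/null -L --max-time 15 "https://$1"
  explain_curl_exit "$?"
}
```

Usage: run check_https <VCS-HOSTNAME.COM> from inside the ptfe_atlas container, then repeat from the VCS side against the Terraform Enterprise hostname.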
Modules Failing to Import
Modules are posted to Terraform Enterprise via webhooks. If the VCS repository is showing a successful delivery of the payload into Terraform Enterprise but the module is not showing in the application then there are several potential causes for this. Review the troubleshooting steps in the articles below.
- https://support.hashicorp.com/hc/en-us/articles/1500000278482-Module-Updates-Failing-to-Ingress-in-Terraform-Enterprise
- https://support.hashicorp.com/hc/en-us/articles/4407858770451-Failing-to-Add-a-Private-Registry-Module-in-Terraform-Enterprise
If the error cannot be located then generate and upload a support bundle to HashiCorp Support.
SLUG Errors
The SIC-001 (Source Ingress Controller) error is a generic failure to process a Terraform slug. A slug is a blob of data that contains the current state of the Terraform configuration files. Terraform Enterprise uses slug services to pull in VCS information in order to extract, merge, and process Terraform configuration files. After a slug is ingressed and processed, it is uploaded to blob storage via archivist.
Common causes of SIC-001 errors:
- The OAuth token expires
- Permissions on the VCS side are changed or revoked
- Network issues keeping TFE from reaching the VCS or its internal store of slugs
- TFE is misconfigured
- Use of symlinks that link outside of the workspace
- Extremely large repository sizes
- Incorrect or non-existent Terraform Working Directory
Typically, SIC-001 errors are identified by reviewing the logs of the ptfe_slug_ingress container (sudo docker logs ptfe_slug_ingress) for errors. Errors can also be found in the ptfe_archivist container by running sudo docker logs ptfe_archivist and grepping the output for errors.
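Both containers can be scanned in one pass. The sketch below is an illustration only: slug_errors and the DOCKER_CMD override are hypothetical names, and the case-insensitive match on the word error is an assumption about how failures appear in these logs.

```shell
# DOCKER_CMD is a hypothetical override (for example, to drop sudo).
DOCKER_CMD="${DOCKER_CMD:-sudo docker}"

# Print recent log lines containing "error" (any case) from the slug ingress
# and archivist containers, prefixed with the container name.
slug_errors() {
  for container in ptfe_slug_ingress ptfe_archivist; do
    $DOCKER_CMD logs --tail 2000 "$container" 2>&1 \
      | awk -v c="$container" 'tolower($0) ~ /error/ { print "[" c "] " $0 }'
  done
}
```

Usage: slug_errors | less. Raise or remove the --tail limit if the failure happened further back in the log.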
The SIW-001 error occurs when Terraform Enterprise has not been installed correctly, but it may manifest as a slug ingress error when importing a module or linking a workspace to VCS. The cause can be confirmed by running the command tfe-admin health-check. If the Archivist and Vault containers are reported as not healthy, there was an issue with the IP address configuration during install. The private-address and public-address flags need to be set when running the install.sh script. More details can be found here: https://www.terraform.io/enterprise/install/automated/automating-the-installer#invoking-the-online-installation-script.
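As an illustration of setting both flags, the helper below only prints an install.sh invocation and does not run it. The function name install_cmd is hypothetical, and the key=value argument form is an assumption based on the linked installer documentation; confirm it there before use.

```shell
# Print the install.sh invocation with both address flags set.
# Nothing is executed; review the printed command before running it.
# Assumption: install.sh accepts key=value arguments as in the linked docs.
install_cmd() {
  if [ -z "$1" ] || [ -z "$2" ]; then
    echo "usage: install_cmd PRIVATE_IP PUBLIC_IP" >&2
    return 1
  fi
  printf 'sudo bash ./install.sh private-address=%s public-address=%s\n' "$1" "$2"
}
```

Usage: install_cmd 10.0.0.5 203.0.113.10 (both addresses are placeholders).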
Terraform Cloud Agents on Terraform Enterprise
Terraform Cloud Agents allow Terraform Enterprise to communicate with isolated infrastructure by deploying lightweight agents within a specific network segment.
Output from the Terraform execution will be visible on the run’s page within Terraform Enterprise. However, if there are issues with the agent, debug logging is not displayed by default. Starting the environment with TFC_AGENT_LOG_LEVEL=DEBUG along with TF_LOG=TRACE will allow the agent to capture debug logs for both the agent and the Terraform run to assist with troubleshooting.
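One way to apply the two settings is an environment file that is passed to the agent when it starts. This is a sketch under assumptions: write_agent_debug_env is a hypothetical helper, and how the file is consumed depends on how the agent is deployed.

```shell
# Write an env file that enables debug logging for the agent and trace
# logging for the Terraform run. The file path is chosen by the caller.
write_agent_debug_env() {
  cat > "$1" <<'EOF'
TFC_AGENT_LOG_LEVEL=DEBUG
TF_LOG=TRACE
EOF
}
```

Usage: write_agent_debug_env ./agent-debug.env, then start the agent with that file. For a Docker-based agent this would typically be docker run --env-file ./agent-debug.env plus the usual agent token and address settings, which are assumed here and not shown.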
Errors within Runs
Runs within Terraform Enterprise are executed within Docker containers, Agents, or local machines using the remote backend. If a run fails, first identify the Terraform operation being performed when the error occurs, then take action based on the common issues below.
For plans within Terraform Enterprise:
- Unpacks the configuration that was provided
- If specified in the workspace settings, changes to the given working directory
- Generates variables.tfvars from the workspace's Variables page
- Exports environment variables from the workspace's Variables page
- Generates a backend override file so that the workspace is always used for state storage regardless of any backend block in the configuration
- Runs terraform init and discards the log if it is successful
- Runs terraform plan, which loads the workspace state, refreshes (reads) all resources and data sources into the in-memory state, and compares the configuration to the current deployments to determine planned changes
- Generates and stores the plan file, JSON plan file, and final configuration filesystem
Common issues which cause errors during plans include:
- Syntax errors reported by terraform, which usually include a filename and line number where the error was encountered
- Configuration changes that are syntactically valid but lead to errors or unexpected results, such as changing variable values or resource names
- Incorrect variable values, especially if provided from multiple sources (Variables page, configuration default, *.auto.tfvars)
- Incorrect or insufficient service credentials, which surface as errors from the cloud provider reported by Terraform when authenticating, e.g., to refresh resources
- Incorrect but otherwise valid provider configuration. For example, an incorrect region
- Incorrect module sources or versions in the configuration
- Incorrect provider sources or versions in the configuration or terraform-bundle, if a bundle of providers and terraform version is in use
- Incorrect configuration version, which can be checked by expanding the plan's run details and following the link to the commit in the VCS for verification
- Modifications to resources outside of Terraform that cannot be detected or reconciled (e.g. by another automation system or manually by a user at the cloud web console)
- Using old versions of Terraform and providers that lack features and bug fixes
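Several of the issues above (syntax errors, wrong module or provider sources, outdated versions) can be caught locally before a run is queued. The sketch below chains standard Terraform CLI commands. plan_preflight and the TERRAFORM_BIN override are hypothetical names, and it assumes it is run from the configuration's working directory.

```shell
# TERRAFORM_BIN is a hypothetical override for selecting a specific binary.
TERRAFORM_BIN="${TERRAFORM_BIN:-terraform}"

# Run local checks in the current directory, stopping at the first failure.
plan_preflight() {
  $TERRAFORM_BIN version || return 1                          # CLI version in use
  $TERRAFORM_BIN init -backend=false -input=false || return 1 # fetch modules and providers only
  $TERRAFORM_BIN validate || return 1                         # syntax and internal consistency
  $TERRAFORM_BIN providers                                    # provider requirements by module
}
```

This does not check credentials, variable values, or drift, so a clean result narrows the search but does not guarantee a successful plan.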
For apply within Terraform Enterprise:
- Unpacks the plan filesystem and plan file
- If specified in the workspace settings, changes to the given working directory
- Exports environment variables from the workspace's Variables page
- Runs terraform init and discards the log if it is successful
- Runs terraform apply with the plan file, which executes the planned changes
- Generates and stores the state file
Common issues which cause errors during applies include:
- Incorrect or insufficient service credentials. Only read permissions are required to plan, but write permissions are required to create or modify resources
- Issues with values given in the configuration that are rejected by the service provider. For example, some combinations of otherwise valid values may not be accepted by the service
- Service timeouts or excessive rate limiting, usually due to attempting to manage too many resources in one workspace or across multiple workspaces running simultaneously
- Modifications to resources outside of Terraform that cannot be detected or reconciled (e.g. by another automation system or manually by a user at the cloud web console)
- Using old versions of TFE, Terraform, and/or providers that lack features and bug fixes
Managing Replicated
Replicated provides a CLI tool called replicatedctl that can be used to interact with the Replicated service, and by proxy, the Terraform Enterprise application itself. Some of the common replicatedctl commands are detailed below and more can be found at https://help.replicated.com/api/replicatedctl/
Restarting Terraform Enterprise Start/Stop/Status
$ replicatedctl app status
$ replicatedctl app stop
$ watch replicatedctl app status
$ replicatedctl app start
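Instead of watching the status by hand, the wait between stop and start can be scripted. This is a sketch: app_state and wait_for_app_state are hypothetical helpers, and they assume that replicatedctl app status prints JSON containing a "State" field with values such as started or stopped.

```shell
# Extract the first "State" value from `replicatedctl app status` JSON on stdin.
# Assumption: the output contains a field of the form "State": "<value>".
app_state() {
  sed -n 's/.*"State"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Poll every 5 seconds until the application reaches the wanted state.
# Returns 1 if the timeout (in seconds, default 600) expires first.
wait_for_app_state() {
  want="$1"; timeout="${2:-600}"; waited=0
  while [ "$(replicatedctl app status | app_state)" != "$want" ]; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 5
    waited=$((waited + 5))
  done
}
```

Usage: replicatedctl app stop && wait_for_app_state stopped && replicatedctl app start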
Restarting Replicated Service
$ systemctl stop replicated replicated-operator replicated-ui
$ systemctl start replicated replicated-operator replicated-ui
Replicated Application Settings
To export the Replicated application settings, use the replicatedctl params export command.
To change a given setting, use the replicatedctl params set NAME --value VALUE command, where NAME is the name of the attribute to be changed and VALUE is the value to be assigned to that attribute. For example, to change the ReleaseSequence attribute to the value 0, run replicatedctl params set ReleaseSequence --value 0.
Application Configuration
To change a given setting, use the replicatedctl app-config set NAME --value VALUE command, where NAME is the name of the attribute to be changed and VALUE is the value to be assigned to that attribute. The attributes list can be found by running replicatedctl app-config export.
$ replicatedctl app-config set NAME --value VALUE
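It can be useful to save the current settings before changing one. The sketch below combines the two commands described above. set_app_config and the REPLICATEDCTL override are hypothetical names, and the backup file name is an arbitrary choice.

```shell
# REPLICATEDCTL is a hypothetical override for the replicatedctl binary.
REPLICATEDCTL="${REPLICATEDCTL:-replicatedctl}"

# Export the current application settings to a timestamped file, then set
# NAME to VALUE. Arguments: NAME VALUE [BACKUP_DIR, default: current directory]
set_app_config() {
  cfg_backup="${3:-.}/app-config-$(date +%Y%m%d%H%M%S).json"
  $REPLICATEDCTL app-config export > "$cfg_backup" || return 1
  $REPLICATEDCTL app-config set "$1" --value "$2" || return 1
  echo "previous settings saved to $cfg_backup"
}
```

Usage: set_app_config NAME VALUE /root/tfe-backups. The exported file may contain sensitive values, so store it accordingly.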
General Terraform Enterprise (TFE) Information
Main Page (follow left pane navigation for Deployment and Operation and Application Usage and Other Docs sections)
Monitoring/Health Check basics
Reference Architectures (including Active/Active)
Active/Active Terraform Enterprise (TFE) Information
TFE Active/Active Install/Configure
TFE Active/Active Administration
Other Terraform Enterprise (TFE) Monitoring Information
The following HashiCorp blog posts were written by HashiCorp staff but are not official documentation:
Monitoring and Logging for Terraform Enterprise
Monitoring and Logging for Terraform Enterprise — Azure Monitor
Monitoring and Logging for Terraform Enterprise — GCP Operations
Active/Active Admin Commands
Because the active/active modules disable the Replicated UI by default, admin commands are provided to facilitate configuration changes, safe application stops, support bundles, and so on. This work is done in a new container, tfe-admin.
These and other CLI commands are published on TFE Active/Active Administration.
Contacting HashiCorp Support
When contacting HashiCorp Support, please include any detailed run logs captured using TF_LOG=TRACE, redacted Terraform code (if necessary), and a support bundle, as this will help ensure a timely response to your support request.