Introduction
This guide covers common issues encountered during setup and migration from Replicated to FDO (Flexible Deployment Options), organized by deployment option. It focuses on troubleshooting specific problems rather than full installation procedures. When using this guide alongside installation documentation, carefully review your deployment mode requirements - many issues stem from mixing configuration settings between different modes or including unnecessary settings from example templates. Start with the minimal required configuration for your chosen deployment mode and add settings only as needed.
Common Red Herrings
- Multiple database timeouts in logs are often a sign of network/firewall issues rather than actual database problems. Focus troubleshooting on network connectivity before investigating database configuration.
- Fluent-bit errors during startup typically self-resolve and do not indicate actual problems unless they persist after system initialization. Wait a few minutes before investigating further.
- Initial SSL errors may appear in logs during startup before certificates are fully loaded into the system. These should clear once startup is complete.
- Migration lock messages are normal during startup when running multiple pods. The system will handle these automatically unless a lock becomes stuck due to an improper shutdown.
Common Issues Across All Deployments
Immediate Container or Pod Crashes
- Symptom: Container crashes before any useful logs are produced, or before any commands can be run.
- Resolution Steps: Use this article to allow the container to continue running and gather logs.
Database Connectivity
- Symptom: Container stops shortly after startup with database timeout.
- Common Error:
check failed: name=database duration=1m30.00443164s err="timeout: context deadline exceeded"
- Resolution Steps:
- Test basic connectivity to the database server.
- Verify port access is available, and that the port number defined in the configuration matches the actual database port.
- Check security group configurations.
- Verify database user has appropriate permissions and correct password.
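The connectivity and port checks above can be scripted from the host that runs Terraform Enterprise, so the test follows the same network path as the application. This is a minimal sketch, assuming bash and the coreutils `timeout` command are installed; the function name, host name, and port in the example are placeholders, not values from this guide.

```shell
# check_tcp HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection can be opened, non-zero otherwise.
# Uses bash's /dev/tcp redirection, so no extra client tools are required.
check_tcp() {
  timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example (hypothetical host and port; substitute your own):
# check_tcp db.example.internal 5432 && echo "port reachable" || echo "port blocked or closed"
```

A failure here points to routing, firewall, or security group rules rather than database credentials, which only come into play after the TCP connection succeeds.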
Vault Encryption Errors
- Symptom: Vault fails to decrypt configuration.
- Common Error:
Error reading Vault configuration: failed decrypting unseal key: could not decrypt ciphertext: chacha20poly1305: message authentication failed
- Resolution Steps:
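- Check for special characters in the encryption password. Docker Compose treats $ as the start of a variable reference, so a literal $ in a compose file must be written as $$ or it will be altered before it reaches the container.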
- Verify encryption password matches previous configuration.
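If the password is set in a Docker Compose file, the escaping can be done mechanically. The helper below is an illustrative sketch (the function name and sample password are made up); it doubles every dollar sign so that Compose passes the original value through unchanged.

```shell
# escape_compose_value VALUE
# Prints VALUE with every "$" doubled, which is how Docker Compose
# represents a literal dollar sign inside a compose file.
escape_compose_value() {
  printf '%s\n' "$1" | sed 's/\$/$$/g'
}

# Example with a made-up password:
# escape_compose_value 'pa$w0rd'
```

Single-quote the argument when calling the helper so the shell itself does not expand the dollar sign first.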
Certificate Errors
- Symptom: SSL/TLS errors, failed health checks, or unsuccessful runs.
- Resolution Steps:
- Verify certificate paths match your compose/configuration file.
- Check that the certificate chain is complete.
- Ensure certificates are in .pem format.
- Verify the certificates are in the correct order (typically the server certificate first, followed by any intermediates).
- If you use a certificate issued by a private Certificate Authority, you must provide the certificate for that CA in the Certificate Authority (CA) Bundle section of the installation. This allows services running within Terraform Enterprise to access each other properly.
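Several of these checks can be done with grep and openssl. The sketch below is illustrative only: the function name and the file names `cert.pem`, `chain.pem`, and `ca_bundle.pem` are placeholders for whatever your configuration references.

```shell
# count_certs FILE
# Prints the number of PEM-encoded certificates found in FILE.
# A result of 0 suggests the file is not PEM (for example, DER or PKCS#12).
count_certs() {
  grep -c -- '-----BEGIN CERTIFICATE-----' "$1"
}

# Inspect the first certificate in a file (placeholder file name):
# openssl x509 -in cert.pem -noout -subject -issuer -enddate

# Verify a server certificate against a private CA, supplying
# intermediates separately (placeholder file names):
# openssl verify -CAfile ca_bundle.pem -untrusted chain.pem cert.pem
```

For a file that should hold a full chain, a count of 1 usually means the intermediates are missing.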
VCS Integration Errors
- Symptom: Unable to connect to VCS provider.
- Common Error:
You don't have permission to access that OAuth Client
or
SSL_connect returned=1 errno=0 state=error: unexpected eof
- Resolution Steps:
1. Test basic connectivity:
curl -v -L https://<vcs-fqdn>
2. Verify SSL certificates:
curl -v https://<tfe_fqdn>/_health_check
openssl s_client -showcerts -connect VCS_IP:443 </dev/null
openssl s_client -showcerts -connect TFE_IP:443 </dev/null
3. Check firewall and WAF configurations.
4. If the steps above succeed, a new Application Link from the VCS provider may be required.
Log Forwarding Errors
- Symptom: Logs not being forwarded.
- Common Error:
[error] [output:syslog:syslog.1] no upstream connections available
- Resolution Steps:
- Enable syslog UDP reception on port 514 in /etc/rsyslog.conf.
- Restart syslog service.
- Verify port is listening.
- Restart TFE.
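The steps above might look like the following on a systemd-based syslog host. This is a sketch under assumptions: the rsyslog directives shown in the comments use rsyslog's current configuration syntax, the service is assumed to be named `rsyslog`, and `has_udp_listener` is a hypothetical helper that filters `ss -lun` output.

```shell
# Lines to enable in /etc/rsyslog.conf (often present but commented out):
#   module(load="imudp")
#   input(type="imudp" port="514")
#
# Then restart the service:
#   sudo systemctl restart rsyslog

# has_udp_listener PORT
# Reads `ss -lun` output on stdin and succeeds if a socket is bound to PORT.
has_udp_listener() {
  grep -Eq ":$1[[:space:]]"
}

# Example:
# ss -lun | has_udp_listener 514 && echo "syslog UDP port is listening"
```

If the port is listening but the error persists, check for a firewall between Terraform Enterprise and the syslog host before restarting TFE.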
Docker-Specific Issues
Network Configuration
- Symptom: Container startup fails with network errors.
- Common Error:
Error response from daemon: network tfe_terraform_isolation not found
- Resolution Steps:
- Create required network:
docker network create tfe_terraform_isolation
- Verify Docker driver settings.
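Running the create command when the network already exists returns an error, so a check-then-create wrapper can be safer in startup scripts. A minimal sketch; `ensure_network` is a hypothetical helper name.

```shell
# ensure_network NAME
# Creates the Docker network NAME only if it does not already exist.
ensure_network() {
  if docker network inspect "$1" >/dev/null 2>&1; then
    echo "network $1 already exists"
  else
    docker network create "$1"
  fi
}

# Example:
# ensure_network tfe_terraform_isolation
```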
Version Compatibility
- Symptom: Container fails to start or exhibits unexpected behavior.
- Resolution Steps:
- Ensure the Docker version is compatible.
- Downgrade if using an incompatible version.
- Remove all containers and volumes if downgrading.
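To compare the installed Docker Engine version against a supported range, a small helper can be used. This sketch assumes GNU `sort -V` is available; the version numbers in the example are placeholders, so take the real bounds from the Terraform Enterprise requirements for your release.

```shell
# version_ge A B
# Succeeds if version A is greater than or equal to version B.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example (placeholder minimum version):
# installed=$(docker version --format '{{.Server.Version}}')
# version_ge "$installed" "20.10.0" || echo "Docker $installed is older than the minimum"
```

Swapping the arguments gives an upper-bound check, which is the relevant one when a newer Docker release is the incompatible one.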
Storage Configuration
- Symptom: Service fails after modifying Docker root or storage paths.
- Resolution Steps:
- Stop TFE and Docker services before making storage changes.
- Verify any mounted volumes have correct ownership and permissions.
- Check all storage paths in compose file match actual system paths.
- Ensure sufficient disk space is available in new locations.
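The ownership and disk space checks can be run as follows. A sketch only: `free_kb` is a hypothetical helper, `/path/to/new/location` is a placeholder, and `stat -c` assumes GNU coreutils.

```shell
# free_kb PATH
# Prints the free space, in kilobytes, on the filesystem that holds PATH.
free_kb() {
  df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Example (placeholder path):
# docker info --format '{{.DockerRootDir}}'     # confirm the active Docker root
# free_kb /path/to/new/location                 # free space at the new location
# stat -c '%U:%G %a %n' /path/to/new/location   # owner, group, and mode
```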
SystemD Service Issues
- Symptom: Service fails to start.
- Resolution Steps:
- Verify the Docker binary path in the service file (commonly /usr/bin/docker vs /usr/local/bin/docker).
- Check for hidden characters in the compose file.
- Validate that the working directory exists and has correct permissions.
- Ensure certificate paths are correct relative to mounted directories.
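Hidden characters such as carriage returns from Windows editors, tabs, non-breaking spaces, or a byte-order mark are hard to spot by eye. The helper below is a sketch (the function name and the `compose.yaml` file name are placeholders) that prints any line containing a byte outside printable ASCII.

```shell
# find_hidden_chars FILE
# Prints, with line numbers, every line of FILE that contains a byte
# outside printable ASCII (carriage returns, tabs, non-ASCII bytes).
# Exits 0 if any such line is found, 1 if the file is clean.
find_hidden_chars() {
  LC_ALL=C grep -n '[^[:print:]]' "$1"
}

# Example (placeholder file name):
# find_hidden_chars compose.yaml | cat -A
# command -v docker    # shows which Docker binary path is on this host
```

Piping the result through `cat -A` makes the offending bytes visible, for example `^M` for a carriage return.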
Kubernetes-Specific Issues
Frequently used Kubernetes Commands
Resource Constraints
- Symptom: Pods fail to start or enter CrashLoopBackOff.
- Common Error:
error waiting for kubernetes container to start: pod container is not ready: context deadline exceeded
- Resolution Steps:
- Check resource quotas in namespace.
- Review TFE_CAPACITY_CPU setting (default values may conflict with quotas).
- Verify memory limits (default 2048M).
- Check if node has sufficient resources.
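The checks above map onto a handful of kubectl queries. This sketch groups them in a hypothetical helper; the namespace `tfe` in the example is an assumption, and `kubectl top nodes` only works when a metrics server is installed in the cluster.

```shell
# tfe_resource_report NAMESPACE
# Prints quota, pod, and recent event information for NAMESPACE.
tfe_resource_report() {
  kubectl get resourcequota -n "$1"
  kubectl get pods -n "$1" -o wide
  kubectl get events -n "$1" --sort-by=.lastTimestamp
}

# Example (assumed namespace name):
# tfe_resource_report tfe
# kubectl top nodes    # node-level CPU and memory usage, needs metrics-server
```

Events mentioning quota limits or insufficient CPU or memory usually indicate which of the settings above needs adjusting.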
Permission Issues
- Symptom: Task worker failures.
- Common Error:
error building kubernetes config: open /var/run/secrets/kubernetes.io/serviceaccount/token: permission denied
- Resolution Steps:
- Review the pod security context.
- Verify service account permissions.
- Review RBAC configurations.
- Consider helm uninstall/install if permissions are correct.
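Service account and RBAC permissions can be tested directly by impersonating the service account. In this sketch the helper name, the namespace `tfe`, the service account `tfe-sa`, and the verb and resource are all assumptions; substitute the values from your deployment. Impersonation requires that your own user is allowed to impersonate service accounts.

```shell
# sa_can_i NAMESPACE SERVICEACCOUNT VERB RESOURCE
# Asks the API server whether the service account may perform VERB on RESOURCE.
sa_can_i() {
  kubectl auth can-i "$3" "$4" -n "$1" --as="system:serviceaccount:$1:$2"
}

# Example (assumed names):
# sa_can_i tfe tfe-sa create jobs
#
# Show the security context applied to a pod (placeholder pod name):
# kubectl get pod <pod-name> -n tfe -o jsonpath='{.spec.securityContext}'
```

A "permission denied" error on the token file itself, as in the error above, more often relates to the security context (user and group IDs) than to RBAC, so it is worth checking both.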
Networking Issues