Problem
A Terraform Enterprise instance that crashes or is stopped prematurely may not finish running database migrations. Other Terraform Enterprise instances connected to the same database can then hang for up to one hour before proceeding with migrations.
Cause
Terraform Enterprise (TFE) uses Redis to hold a lock while database migrations run. The lock prevents data corruption caused by multiple TFE instances (such as in an Active/Active deployment) modifying the database at the same time. In catastrophic cases, such as a Terraform Enterprise node being forcefully terminated while migrations are running, the lock value may not be removed from Redis. Other Terraform Enterprise nodes then wait to acquire the lock before running migrations.
Solution
The migration lock is created with a one-hour TTL, so after a crashed migration the lock expires on its own one hour after the migration process began. To reduce this delay, the lock can be removed manually:
- Connect to Redis
- >= v202404-2
docker exec -ti <TFE_CONTAINER> bash -c '. /var/run/terraform-enterprise/atlas/atlas-env && redli -u $REDIS_URL'
- < v202404-2
docker exec -ti <TFE_CONTAINER> bash -c '. atlas-env && redli -u $REDIS_URL'
- Delete the tfe_migration_lock Redis value
> del tfe_migration_lock
(integer) 1
Once the key is removed from Redis, any nodes waiting to run migrations will retry acquiring the lock and then proceed with running migrations.
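The version-dependent connection step above can be sketched as a small shell helper. This is a minimal sketch, not part of Terraform Enterprise: the function names `atlas_env_path` and `redli_command` are hypothetical, and it assumes release tags of the form vYYYYMM-N (for example v202404-2); any other format is rejected. It only prints the command so it can be reviewed before being run.

```shell
# Hypothetical helper: print the atlas-env path to source for a given release.
# Assumes release tags look like vYYYYMM-N, e.g. v202404-2.
atlas_env_path() {
  version="${1#v}"                 # drop the leading "v" if present
  case "$version" in
    *-*) ;;                        # must contain a month-patch separator
    *) echo "unrecognized version: $1" >&2; return 1 ;;
  esac
  month="${version%%-*}"           # e.g. 202404
  patch="${version#*-}"            # e.g. 2
  for part in "$month" "$patch"; do
    case "$part" in
      ''|*[!0-9]*) echo "unrecognized version: $1" >&2; return 1 ;;
    esac
  done
  # Releases at or after v202404-2 use the full path to atlas-env.
  if [ "$month" -gt 202404 ] || { [ "$month" -eq 202404 ] && [ "$patch" -ge 2 ]; }; then
    echo "/var/run/terraform-enterprise/atlas/atlas-env"
  else
    echo "atlas-env"
  fi
}

# Hypothetical helper: print (not run) the Redis connection command
# for a container name and release tag.
redli_command() {
  container="$1"
  env_path="$(atlas_env_path "$2")" || return 1
  printf "docker exec -ti %s bash -c '. %s && redli -u \$REDIS_URL'\n" "$container" "$env_path"
}
```

For example, `redli_command <TFE_CONTAINER> v202404-2` prints the first command listed above. Once connected, running `ttl tfe_migration_lock` at the Redis prompt before deleting shows how many seconds remain until the lock expires on its own; Redis returns -2 if the key no longer exists.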