Problem
When troubleshooting Terraform Enterprise (TFE) deployments where runs stop progressing because of resource constraints, especially low disk space on the host machine, running Docker prune commands without care can significantly delay recovery while deleted images are re-downloaded and the affected containers are started again. In Airgapped environments the impact is far more severe: deleted images cannot be re-downloaded, and if the Airgap bundle is not present on the machine, recovery is delayed significantly.
Cause
The root cause lies in how Docker distinguishes between resources that are operationally required and resources it considers unused. In effect, any resource that is not referenced by a running container can be deemed 'unused'. The critical pieces here are stopped containers and the images those containers use, which may still be crucial for operation. During troubleshooting it is common to run commands such as docker system prune -a or docker system prune --volumes, which remove all unused images and unused volumes respectively and so prolong the recovery of production systems.
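To make the point about stopped containers concrete, the sketch below lists the images that are referenced only through containers in the exited state. It is an illustration, not a TFE tool: the function name is hypothetical, and it assumes that "stopped" corresponds to the exited status.

```shell
# Sketch: show which images are tied to stopped containers.
# The function name is hypothetical; "stopped" is assumed to mean status=exited.
images_used_by_stopped_containers() {
  # List the image of every exited container, one per line, without duplicates.
  docker ps -a --filter "status=exited" --format '{{.Image}}' | sort -u
}
```

Calling images_used_by_stopped_containers before any cleanup shows the images that deserve a second look before a prune is run.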
Recommendation
Online
During a production impact, inspect the Docker environment first. This gives visibility into the resources on the host and helps prevent the removal of resources that appear 'unused' but are operationally necessary.
- All containers, including stopped ones
$ docker ps -a
- Disk Usage Breakdown
$ docker system df
- Docker Volume Inventory
$ docker volume ls
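The three inspection commands above can be captured in one step so that a record of the host state exists before anything is removed. The following is a minimal sketch; the function name and the snapshot file naming are assumptions made for illustration.

```shell
# Sketch: save the output of the inspection commands to a timestamped file.
# Function name and file naming are illustrative assumptions.
docker_preprune_snapshot() {
  # $1: directory for the snapshot file (defaults to the current directory)
  local out_dir="${1:-.}"
  local out_file="${out_dir}/docker-snapshot-$(date +%Y%m%d-%H%M%S).txt"

  {
    echo "== All containers (running and stopped) =="
    docker ps -a
    echo "== Disk usage breakdown =="
    docker system df
    echo "== Volume inventory =="
    docker volume ls
  } > "${out_file}" || return 1

  # Print the path so the caller can attach the file to a support ticket.
  echo "${out_file}"
}
```

For example, docker_preprune_snapshot /tmp writes the snapshot under /tmp and prints the path of the file it created.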
Some precautionary strategies can also be adopted. For example, docker container prune removes only stopped containers, and docker image prune (without the -a flag) removes only dangling images. Time-based filters such as docker image prune --filter "until=24h" target only older resources. Label-based protection, where critical images carry a label and a label filter excludes them from pruning operations, adds a further safeguard.
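The filters described above can be combined into a preview step followed by a conservative prune. This is a sketch under stated assumptions: the function names are hypothetical, and the default label key tfe.keep is an invented example rather than a label that TFE images are known to carry. Image labels are set when an image is built, so the label filter only protects images that were built with that label.

```shell
# Sketch: look first, then prune only old, unlabelled, dangling images.
# Function names and the default label key "tfe.keep" are hypothetical.
preview_dangling_images() {
  # Dangling images are the only images a plain "docker image prune" removes.
  docker images --filter "dangling=true"
}

cautious_image_prune() {
  # $1: minimum age of images to prune (default 24h)
  # $2: label key that marks an image as protected (default tfe.keep)
  local max_age="${1:-24h}"
  local protect_label="${2:-tfe.keep}"

  # No -a flag: only dangling images are considered at all.
  docker image prune -f \
    --filter "until=${max_age}" \
    --filter "label!=${protect_label}"
}
```

A typical sequence would be preview_dangling_images to review the candidates, then cautious_image_prune 72h to remove only dangling images older than three days.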
In Terraform Enterprise environments, pay special attention to the critical TFE core images, custom worker images and database container images. Do not remove these images from the host even when they appear unused, as they may be required for specific operations. A safer approach is to run docker container prune -f and docker image prune -f (dangling only), and to avoid docker image prune -a, which removes all unused images.
The key difference between safe and dangerous pruning lies in the command flags:
$ docker image prune
This is the safe approach: it deletes only dangling images.
$ docker image prune -a
This is potentially destructive: it deletes all unused images from the machine.
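One way to keep the dangerous flags out of routine cleanup is a small wrapper that only permits the narrower prune commands. The sketch below is an assumption about how such a guard could look, not an existing Docker or TFE feature; the function name is hypothetical and the flag matching is deliberately simple.

```shell
# Sketch: allow "container" and "image" prunes, refuse -a/--all/--volumes.
# The function name is hypothetical and the checks are intentionally basic.
guarded_prune() {
  # $1: object type to prune; remaining arguments are passed to docker
  local target="${1:-}"
  if [ "$#" -gt 0 ]; then shift; fi

  case "${target}" in
    container|image) ;;
    *)
      echo "refusing: only 'container' and 'image' prunes are allowed" >&2
      return 1
      ;;
  esac

  local arg
  for arg in "$@"; do
    case "${arg}" in
      # Matches --all, --volumes, -a, and short-flag clusters such as -af or -fa.
      --all|--volumes|-a*|-[!-]*a*)
        echo "refusing: '${arg}' can remove resources that are still needed" >&2
        return 1
        ;;
    esac
  done

  docker "${target}" prune "$@"
}
```

With this in place, guarded_prune image -f runs docker image prune -f, while guarded_prune image -a and guarded_prune system -f are rejected with a non-zero exit status.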
Airgapped
For Airgapped environments, implement a pre-pruning backup procedure: export the critical images with docker save -o tfe-images-backup.tar followed by the names of the images to export, before any pruning is performed on the machine to free up space. Also maintain an image inventory that documents all the critical TFE images which should never be pruned.
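The inventory and the backup can be tied together so that the archive always contains exactly what the inventory lists. The following is a sketch only: the function names and the inventory file format (one image reference per line) are assumptions, and in practice the generated inventory should be reviewed and trimmed to the critical TFE images.

```shell
# Sketch: record tagged images in an inventory, then export them with docker save.
# Function names and the one-reference-per-line file format are assumptions.
write_image_inventory() {
  # $1: path of the inventory file to write
  # Untagged (<none>) images cannot be saved by name, so they are skipped.
  docker images --format '{{.Repository}}:{{.Tag}}' | grep -v '<none>' | sort -u > "$1"
}

backup_images_from_inventory() {
  # $1: inventory file, $2: archive path (default tfe-images-backup.tar)
  local inventory="$1"
  local archive="${2:-tfe-images-backup.tar}"

  if [ ! -s "${inventory}" ]; then
    echo "inventory '${inventory}' is missing or empty" >&2
    return 1
  fi

  # Word splitting is intentional: image references contain no spaces.
  # shellcheck disable=SC2046
  docker save -o "${archive}" $(cat "${inventory}")
}
```

If images are pruned by mistake, the archive can be restored on the same host with docker load -i tfe-images-backup.tar. Make sure the archive is written to a location with enough free space, since the host is already short on disk.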