Due to the inherent limitations of Source NAT (Network Address Translation), running Terraform Cloud agents at scale behind a NAT gateway can result in intermittent network connection issues.
We have seen cases where customers run Terraform Cloud agents on compute instances (containerised or plain VMs) behind NAT in order to control the egress IP range and ensure that no incoming connections are possible.
Google explains the NAT behaviour in detail on this page:
1) Google Cloud NAT reserves a set number of source tuples (src-ip, src-port) on each VM
2) A destination 3-tuple (dst-ip, dst-port, proto [tcp,udp]) cannot have more active connections than the VM has reserved source tuples
As that limit is approached, a number of endpoint-independent conflicts are expected to occur.
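The two rules above can be illustrated with a toy model. This is only a sketch of the accounting, not of how Cloud NAT is implemented: the class name VmNatPorts, the value of 64 reserved tuples and the documentation-range IP addresses are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

# A destination 3-tuple: (dst-ip, dst-port, proto)
Destination = tuple[str, int, str]


@dataclass
class VmNatPorts:
    """Toy model of the source tuples a NAT gateway reserves for one VM."""

    reserved_tuples: int  # hypothetical value, set per VM
    active: dict[Destination, int] = field(default_factory=dict)

    def open_connection(self, dst: Destination) -> bool:
        """Return True if a connection to dst can be translated."""
        in_use = self.active.get(dst, 0)
        if in_use >= self.reserved_tuples:
            # Every reserved source tuple is already paired with this destination.
            return False
        self.active[dst] = in_use + 1
        return True


vm = VmNatPorts(reserved_tuples=64)
api = ("203.0.113.10", 443, "tcp")  # illustrative address only

results = [vm.open_connection(api) for _ in range(70)]
accepted = sum(results)

# The same source tuples can still be reused towards a different destination.
other_ok = vm.open_connection(("203.0.113.20", 443, "tcp"))

print(accepted, other_ok)  # 64 True
```

In this model the 65th connection to the same destination is refused, while a connection to a different destination still succeeds because the limit is counted per destination 3-tuple.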
Furthermore, GCP NAT has a concept of 'Delay for TCP source port reuse':
"After a Cloud NAT gateway closes a TCP connection, Google Cloud enforces a two-minute delay before the gateway can reuse the same NAT source IP address and source port tuple with the same destination (destination IP address, destination port, and protocol)."
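To see why this delay matters, a rough upper bound on the sustained rate of new connections to a single destination can be worked out. The sketch below assumes that each source tuple is unavailable for that destination for the lifetime of the connection plus the two-minute delay; the function name and the example figures (64 tuples, 8-second connections) are illustrative assumptions, not measured values.

```python
REUSE_DELAY_SECONDS = 120  # the two-minute delay quoted above


def max_new_connections_per_second(
    reserved_tuples: int, avg_connection_seconds: float
) -> float:
    """Rough upper bound on sustained new TCP connections to ONE destination.

    Simplifying assumption: a tuple is busy for the connection lifetime and
    then blocked for the reuse delay before it can serve the same destination.
    """
    seconds_tuple_is_unavailable = avg_connection_seconds + REUSE_DELAY_SECONDS
    return reserved_tuples / seconds_tuple_is_unavailable


rate = max_new_connections_per_second(reserved_tuples=64, avg_connection_seconds=8)
print(f"{rate:.2f} new connections per second")  # 0.50 new connections per second
```

Under these assumptions, a VM with 64 reserved tuples can sustain only about one new connection to the same destination every two seconds, no matter how short-lived the connections are.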
You can enable logging on GCP NAT, which captures NAT connections and errors, as per this guide.
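Once logging is enabled, the exported entries can be summarised to see which destinations are affected. The sketch below assumes the entries are exported as one JSON object per line, and that each entry carries a jsonPayload with an allocation_status field (value DROPPED for failed allocations) and a connection object with dest_ip and dest_port. Please check these field names against your own log entries before relying on them; the sample entries are made up for illustration.

```python
import json
from collections import Counter


def dropped_by_destination(log_lines):
    """Count dropped NAT allocations per (dest_ip, dest_port).

    Field names are assumptions about the exported log format.
    """
    drops = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        payload = json.loads(line).get("jsonPayload", {})
        if payload.get("allocation_status") != "DROPPED":
            continue
        conn = payload.get("connection", {})
        drops[(conn.get("dest_ip"), conn.get("dest_port"))] += 1
    return drops


# Made-up sample entries using documentation-range addresses.
sample = [
    json.dumps({"jsonPayload": {"allocation_status": "DROPPED",
                                "connection": {"dest_ip": "203.0.113.10", "dest_port": 443}}}),
    json.dumps({"jsonPayload": {"allocation_status": "OK",
                                "connection": {"dest_ip": "203.0.113.10", "dest_port": 443}}}),
    json.dumps({"jsonPayload": {"allocation_status": "DROPPED",
                                "connection": {"dest_ip": "203.0.113.10", "dest_port": 443}}}),
]

summary = dropped_by_destination(sample)
for (dest_ip, dest_port), count in summary.most_common():
    print(f"{dest_ip}:{dest_port} dropped={count}")
```

A destination that dominates this summary is a likely candidate for the tuple exhaustion described in this article.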
As the majority of the requests coming from a Terraform run are to a small set of cloud providers' API addresses, it is almost certain that, with scale, the available address tuples will be exhausted.
We agree with the following Google recommendation:
The problem could manifest with the following error messages:
The "remote" backend encountered an unexpected error while creating the
│ Terraform Enterprise client: Get "https://app.terraform.io/api/v2/ping":
│ dial tcp 18.104.22.168:443: i/o timeout.
Failed to create the Terraform Enterprise client
The default GCP NAT settings seem to set the available ports per VM to 64, which means a VM can open 64 connections to the same destination IP address and port (it can open another 64 TCP and 64 UDP connections to a different destination IP address and port).
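As a rough sizing aid, the number of ports a VM would need for its busiest destination can be estimated from the agent workload. This is a back-of-the-envelope sketch: the workload figures (4 agents per VM, 0.5 new connections per agent per second, 8-second connections) are hypothetical, and the model assumes each port is blocked for the connection lifetime plus the two-minute reuse delay.

```python
import math

REUSE_DELAY_SECONDS = 120   # two-minute TCP source port reuse delay
DEFAULT_PORTS_PER_VM = 64   # default observed above


def ports_needed_per_vm(
    agents_per_vm: int,
    new_connections_per_agent_per_second: float,
    avg_connection_seconds: float,
) -> int:
    """Estimate ports needed for ONE destination under the stated assumptions."""
    rate = agents_per_vm * new_connections_per_agent_per_second
    return math.ceil(rate * (avg_connection_seconds + REUSE_DELAY_SECONDS))


needed = ports_needed_per_vm(
    agents_per_vm=4,
    new_connections_per_agent_per_second=0.5,
    avg_connection_seconds=8,
)
print(needed, needed > DEFAULT_PORTS_PER_VM)  # 256 True
```

With these example figures the estimate is four times the default of 64, which is consistent with the intermittent timeouts described above appearing only once agents are run at scale.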