Best practices around running Terraform Cloud agents behind GCP NAT (SNAT) – HashiCorp Help Center

Introduction

Due to the inherent limitations of Source NAT (Network Address Translation), running Terraform Cloud agents at scale behind a NAT gateway could result in intermittent network connection issues.

Scenario

We have seen cases where customers run Terraform Cloud agents on compute instances (containerised or plain VMs) behind NAT in order to control the egress IP range and ensure that no incoming connections are possible.

Google explains the NAT behaviour in detail on this page:

Google Cloud NAT reserves a set number of source tuples (src-ip, src-port) on each VM
A destination 3-tuple (dst-ip, dst-port, proto [tcp,udp]) cannot have more active connections than the VM has source tuples

As that limit is approached, a number of endpoint-independent conflicts are to be expected.
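
The two rules above can be illustrated with a toy model. The following Python sketch is not how Cloud NAT is implemented; it only assumes a fixed pool of 64 source ports per VM (the default figure mentioned under Additional Information) and tracks usage per destination 3-tuple. The class name NatPortPool and the 203.0.113.x addresses are hypothetical and used for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class NatPortPool:
    """Toy model of the source tuples reserved for a single VM."""

    ports_per_vm: int = 64  # assumption: the default discussed in this article
    # Maps a destination 3-tuple to the set of source ports currently in use.
    in_use: dict = field(default_factory=dict)

    def connect(self, dst_ip, dst_port, proto="tcp"):
        """Return the source port used, or None if no tuple is free."""
        dest = (dst_ip, dst_port, proto)
        used = self.in_use.setdefault(dest, set())
        for port in range(self.ports_per_vm):
            if port not in used:
                used.add(port)
                return port
        return None  # every source tuple is busy for this destination


pool = NatPortPool()

# Open 65 simultaneous connections to the same destination 3-tuple.
results = [pool.connect("203.0.113.10", 443) for _ in range(65)]
accepted = sum(r is not None for r in results)

# A different destination can still reuse the same source ports.
other = pool.connect("203.0.113.20", 443)

print(accepted, results[-1], other)
```

In this model the 65th connection to the same destination finds no free tuple, while a connection to a different destination still succeeds.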

Furthermore, GCP NAT has a concept of 'Delay for TCP source port reuse':

"After a Cloud NAT gateway closes a TCP connection, Google Cloud enforces a two-minute delay before the gateway can reuse the same NAT source IP address and source port tuple with the same destination (destination IP address, destination port, and protocol)."

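The effect of the two-minute delay quoted above on short-lived connections can be sketched with a small simulation. This is a simplified model under stated assumptions, not a description of Cloud NAT internals: every connection goes to the same destination, closes within the second it was opened, and its port then stays unavailable for 120 seconds. The function name simulate and its parameters are hypothetical.

```python
from collections import deque

REUSE_DELAY_S = 120  # the two-minute delay quoted above


def simulate(ports, conns_per_second, duration_s):
    """Count connection attempts to ONE destination that find no free tuple.

    Assumption: each connection is closed within the second it was opened,
    so its port immediately enters the reuse delay.
    """
    free = ports
    cooling = deque()  # times at which a port becomes reusable again
    failed = 0
    for t in range(duration_s):
        # Return ports whose reuse delay has elapsed.
        while cooling and cooling[0] <= t:
            cooling.popleft()
            free += 1
        for _ in range(conns_per_second):
            if free > 0:
                free -= 1
                cooling.append(t + REUSE_DELAY_S)
            else:
                failed += 1
    return failed


print(simulate(ports=64, conns_per_second=1, duration_s=300))
```

Under these assumptions, a single new connection per second to one destination is enough to start failing once the first 64 ports are in their reuse delay.
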
Troubleshooting

You can enable logging on GCP NAT, which will capture NAT connections and errors as per this guide.
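
Once logging is enabled, exported log entries can be summarised to see which VMs and destinations are affected. The Python sketch below assumes that each entry carries a jsonPayload with an allocation_status field (with the value DROPPED for failed allocations), a connection object containing dest_ip and dest_port, and an endpoint object containing vm_name. Verify these field names against your own logs before relying on them. The function name, the VM name, and the sample entries are hypothetical.

```python
import json
from collections import Counter


def dropped_by_destination(entries):
    """Count dropped NAT allocations per (vm, dest_ip, dest_port)."""
    counts = Counter()
    for entry in entries:
        payload = entry.get("jsonPayload", {})
        # Assumed field: allocation_status is "DROPPED" when no tuple was free.
        if payload.get("allocation_status") != "DROPPED":
            continue
        conn = payload.get("connection", {})
        vm = payload.get("endpoint", {}).get("vm_name", "unknown")
        counts[(vm, conn.get("dest_ip"), conn.get("dest_port"))] += 1
    return counts


# Hypothetical sample entries, shaped according to the assumptions above.
sample = json.loads("""
[
  {"jsonPayload": {"allocation_status": "DROPPED",
                   "connection": {"dest_ip": "99.83.150.238", "dest_port": 443},
                   "endpoint": {"vm_name": "tfc-agent-1"}}},
  {"jsonPayload": {"allocation_status": "OK",
                   "connection": {"dest_ip": "99.83.150.238", "dest_port": 443},
                   "endpoint": {"vm_name": "tfc-agent-1"}}},
  {"jsonPayload": {"allocation_status": "DROPPED",
                   "connection": {"dest_ip": "99.83.150.238", "dest_port": 443},
                   "endpoint": {"vm_name": "tfc-agent-1"}}}
]
""")

summary = dropped_by_destination(sample)
for key, count in summary.most_common():
    print(key, count)
```

A high count for a single destination IP address and port suggests that the source tuples for that destination are being exhausted.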

Recommendation

Because most requests made during a Terraform run go to a small set of cloud provider API addresses, the source tuples available for those destinations will almost certainly be exhausted at scale.

We agree with the following Google recommendation:

"If a VM needs to rapidly open and close TCP connections to the same destination IP address and destination port by using the same protocol, you should assign an external IP address to the VM and use firewall rules to limit unsolicited ingress connections instead of using Cloud NAT."

Additional Information

The problem can manifest as error messages such as the following:

The "remote" backend encountered an unexpected error while creating the
│ Terraform Enterprise client: Get "https://app.terraform.io/api/v2/ping":
│ dial tcp 99.83.150.238:443: i/o timeout.

Failed to create the Terraform Enterprise client

The default GCP NAT settings appear to allocate 64 ports per VM, which means a VM can open 64 connections to the same destination IP address and port (it can open another 64 TCP and 64 UDP connections to a different destination IP address and port).

Workaround

As a temporary workaround, you might want to explore tuning the GCP NAT configuration in order to increase the number of available connections:

"You can increase the number of NAT source IP address and source port tuples that the Cloud NAT gateway allocates to each VM by increasing the minimum ports per VM value. For example, you might need to increase the minimum number of ports if VMs need to make more new connections to the same destination within the two-minute delay before reusing a 5-tuple for a new TCP connection."

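As a rough way to choose a minimum ports per VM value, the number of tuples needed for one destination can be estimated from the connection rate. The sketch below is a back-of-the-envelope estimate, not an official sizing formula: it assumes a steady rate of new connections to a single destination and that each connection occupies its tuple for its lifetime plus the two-minute reuse delay. The function name and parameters are hypothetical.

```python
import math

REUSE_DELAY_S = 120  # two-minute delay before a tuple can be reused


def min_ports_needed(new_conns_per_second, avg_conn_lifetime_s=1.0):
    """Estimate the source tuples needed for ONE destination 3-tuple.

    Each connection holds its tuple while open and during the reuse delay,
    so the steady-state demand is rate * (lifetime + delay).
    """
    busy_seconds = avg_conn_lifetime_s + REUSE_DELAY_S
    return math.ceil(new_conns_per_second * busy_seconds)


# Example: two new one-second connections per second to the same API endpoint.
print(min_ports_needed(2))
```

Compare the estimate with the configured minimum ports per VM value, and treat it as a starting point to validate against the NAT logs rather than as a guarantee.
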
Additional resources

https://cloud.google.com/nat/docs/ports-and-addresses