Introduction
Due to the inherent limitations of Source Network Address Translation (SNAT), running HCP Terraform agents at scale behind a Google Cloud Platform (GCP) NAT gateway can result in intermittent network connection issues.
This article outlines the cause of these issues and provides Google's recommended architecture to avoid them.
Problem
When running HCP Terraform agents on compute instances behind a GCP NAT gateway, you may encounter intermittent I/O timeout errors or failures to connect to HCP Terraform. These issues often manifest during Terraform runs with the following error messages:
The "remote" backend encountered an unexpected error while creating the Terraform Enterprise client: Get "https://app.terraform.io/api/v2/ping": dial tcp 99.83.150.238:443: i/o timeout.
Failed to create the Terraform Enterprise client
Cause
These connection issues occur because of how GCP NAT allocates source ports and handles connections. As explained in the Google Cloud NAT behavior documentation, GCP NAT reserves a set number of source tuples (source IP, source port) for each virtual machine (VM). A VM cannot hold more simultaneous connections to a single destination (destination IP, destination port, protocol) than it has source tuples available.
Since the majority of requests from a Terraform run target a small set of cloud provider API addresses, the available source tuples for those destinations are likely to be exhausted at scale.
Furthermore, GCP NAT enforces a 'Delay for TCP source port reuse'. After a TCP connection closes, GCP enforces a two-minute delay before the gateway can reuse the same NAT source IP address and source port tuple for the same destination. This delay increases the likelihood of port exhaustion.
By default, GCP NAT may allocate as few as 64 ports per VM, which limits a VM to 64 concurrent connections to the same destination IP address, port, and protocol.
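To make the arithmetic concrete, the following Python sketch estimates how quickly a VM could run out of ports toward a single destination. It is a simplified model, not a description of Cloud NAT internals: it assumes a single NAT IP address, short-lived TCP connections, and that every closed connection holds its source tuple for the full two-minute reuse delay. The function names and parameters are illustrative and not part of any GCP API.

```python
from typing import Optional


def max_sustained_connection_rate(ports_per_vm: int,
                                  reuse_delay_seconds: float = 120.0) -> float:
    """Upper bound on new connections per second to one destination.

    Assumption: each closed connection keeps its source tuple unavailable
    for the whole reuse delay, so at most `ports_per_vm` connections can be
    opened per delay window.
    """
    if ports_per_vm <= 0 or reuse_delay_seconds <= 0:
        raise ValueError("ports_per_vm and reuse_delay_seconds must be positive")
    return ports_per_vm / reuse_delay_seconds


def seconds_until_exhaustion(ports_per_vm: int,
                             new_connections_per_second: float,
                             reuse_delay_seconds: float = 120.0) -> Optional[float]:
    """Estimate the time until all ports are consumed, starting from idle.

    Returns None when the connection rate is low enough that tuples are
    released before the pool runs out.
    """
    if new_connections_per_second <= 0:
        return None
    sustainable = max_sustained_connection_rate(ports_per_vm, reuse_delay_seconds)
    if new_connections_per_second <= sustainable:
        return None
    return ports_per_vm / new_connections_per_second


if __name__ == "__main__":
    # With 64 ports, the sustainable rate is roughly one new connection
    # every two seconds to the same destination.
    print(max_sustained_connection_rate(64))
    # At 2 new connections per second, the pool is consumed within a minute.
    print(seconds_until_exhaustion(64, 2.0))
```

Under these assumptions, a single agent VM that opens only a couple of connections per second to one provider API endpoint can exhaust a 64-port allocation well before the first tuples become reusable.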
You can enable logging on GCP NAT to capture connection details and errors by following this monitoring guide.
Recommendation
To avoid port exhaustion and connection issues, Google recommends an alternative architecture for VMs that need to rapidly open and close connections to the same destination.
Instead of using Cloud NAT, you should assign an external IP address to the VM and use firewall rules to limit unsolicited ingress connections. This approach bypasses the port allocation limitations of Cloud NAT.
Workaround
As a temporary workaround, you can tune the GCP NAT configuration to increase the number of available connections per VM.
By increasing the 'minimum ports per VM' value, you can allocate more NAT source IP address and source port tuples to each VM. This may be necessary if your VMs need to make many new connections to the same destination within the two-minute port reuse delay. For more details, refer to the documentation on how to increase the number of available connections.
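As a rough sizing aid for the 'minimum ports per VM' value, the Python sketch below estimates how many ports a VM would need to sustain a given rate of new connections to one destination. It assumes a single NAT IP address, short-lived connections, and that each closed connection occupies its tuple for the full two-minute reuse delay. The function name, the `headroom` parameter, and the example rates are hypothetical; Cloud NAT may impose its own constraints on allowed values, so treat the result as a starting point and confirm it against the GCP documentation.

```python
import math


def required_min_ports_per_vm(new_connections_per_second: float,
                              reuse_delay_seconds: float = 120.0,
                              headroom: float = 1.0) -> int:
    """Estimate the ports needed to sustain a connection rate to one destination.

    Assumption: every connection opened during the reuse delay window needs
    its own source port, so the requirement is rate * delay, scaled by an
    optional safety factor (`headroom` >= 1.0).
    """
    if new_connections_per_second < 0:
        raise ValueError("new_connections_per_second must not be negative")
    if reuse_delay_seconds <= 0:
        raise ValueError("reuse_delay_seconds must be positive")
    if headroom < 1.0:
        raise ValueError("headroom must be at least 1.0")
    return math.ceil(new_connections_per_second * reuse_delay_seconds * headroom)


if __name__ == "__main__":
    # Hypothetical workload: 10 new connections per second to one API endpoint.
    print(required_min_ports_per_vm(10))
    # Same workload with 25% headroom for bursts.
    print(required_min_ports_per_vm(10, headroom=1.25))
```

Because the estimate grows linearly with the connection rate, agents that run many concurrent Terraform operations may need far more than the default allocation, which is why this tuning is treated as a workaround rather than a fix.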