Introduction
Runs in Terraform Enterprise workspaces that use the AzureRM provider for Terraform fail intermittently, with no obvious pattern.
Problem
A terraform apply fails with the error "HTTP response was nil; connection may have been reset" for some of the resources that are being created with the Terraform AzureRM provider.
Here is an example of an error observed during a run:
Error: Creating/Updating Load Balancing Rule (Subscription: "<subscription-id>"
Resource Group Name: "<resource-group-name>"
Load Balancer Name: "<load-balancer-name>"
Load Balancing Rule Name: "example"): performing CreateOrUpdate:
Put https://management.azure.com/subscriptions/<subscription-id>/
resourceGroups/<resource-group-name>/providers/Microsoft.Network/
loadBalancers/<load-balancer-name>?api-version=2023-06-01:
HTTP response was nil; connection may have been reset
The resource on which the run fails is created in Azure. However, because of the connection reset, Terraform never receives the response and therefore cannot record the new resource in the state file.
Prerequisites
- Terraform Enterprise
- AzureRM provider
Cause
The connection is reset by the peer. The error can be found in the TRACE log of a terraform apply from the workspace (TRACE logging is enabled by setting the TF_LOG environment variable to TRACE):
[ERROR] provider.terraform-provider-azurerm_v3.93.0_x5:
PUT https://management.azure.com/subscriptions/<subscription-id>/resourceGroups/
<resource-group-id>/providers/Microsoft.Network/loadBalancers/
<id-load-balancer>?api-version=2023-06-01 request failed:
Put "https://management.azure.com/subscriptions/<subscription-id>/
resourceGroups/<resource-group-id>/providers/Microsoft.Network/loadBalancers/
<id-load-balancer>?api-version=2023-06-01": read tcp <ip>:<port>-><ip>:8080:
read: connection reset by peer: timestamp=2024-10-08T11:05:23.223Z
The root cause is a network connection reset between the customer's environment and the cloud service provider. Terraform has built-in retry logic, but in this specific case there is nothing Terraform can do to prevent the failure.
Terraform is a client inside the customer's network environment. After a reset it has no way to tell whether the service processed the previous request, and retrying the request could cause other issues, such as duplicate resources.
Recommendation
Customers should involve their internal networking teams (proxy, load balancer, firewall, and any others) who can investigate the root cause in the infrastructure. Which teams are relevant depends on the internal design of the network.
Start the investigation by reviewing the following error message:
read tcp <ip>:<port>-><ip-of-peer>:8080: read: connection reset by peer: timestamp=2024-10-08T11:05:23.223Z
Identify which device in your own network owns the second address in the error, <ip-of-peer>:8080, and start by searching that device's logs.
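When many TRACE log lines need to be checked, the two endpoints can be pulled out of the error text with a short script. The following is a minimal Python sketch, not part of Terraform or Terraform Enterprise. The function name parse_reset_endpoints and the sample addresses are hypothetical, and the pattern assumes the "read tcp <local>-><peer>: read: connection reset by peer" shape shown above, with or without spaces after the colons.

```python
import re

# Matches the error text shown above. Whitespace after the colons is optional
# because log renderings differ in how they space the message.
RESET_RE = re.compile(
    r"read tcp\s+(?P<local>[^\s>]+?)->(?P<peer>\S+?):\s*read:\s*connection reset by peer"
)


def parse_reset_endpoints(line):
    """Return (local_host, local_port, peer_host, peer_port), or None if no match."""
    match = RESET_RE.search(line)
    if match is None:
        return None
    # Split on the last colon so the host part may itself contain colons.
    local_host, _, local_port = match.group("local").rpartition(":")
    peer_host, _, peer_port = match.group("peer").rpartition(":")
    return local_host, int(local_port), peer_host, int(peer_port)


if __name__ == "__main__":
    # Placeholder addresses for illustration only.
    sample = (
        "read tcp 10.0.0.5:49152->10.0.0.9:8080: read: "
        "connection reset by peer: timestamp=2024-10-08T11:05:23.223Z"
    )
    print(parse_reset_endpoints(sample))
    # prints ('10.0.0.5', 49152, '10.0.0.9', 8080)
```

The third and fourth values are the peer address to look up in your network inventory. The first value is the source address that the peer's own logs should show for the same connection.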
Additionally, find the IP address of your Terraform Enterprise (TFE) instance and search the logs of every networking layer in your infrastructure for that IP.
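Searching exported logs for the TFE IP can be scripted as well. The sketch below is one possible approach in Python, assuming the logs are available as plain text with one entry per line. The function name lines_mentioning_ip and the sample entries are hypothetical. The match is anchored so that searching for 10.0.0.5 does not also return 10.0.0.50 or 110.0.0.5.

```python
import re


def lines_mentioning_ip(lines, ip):
    """Return the lines that contain `ip` as a complete IPv4 address."""
    # Reject matches that are only a fragment of a longer dotted address.
    pattern = re.compile(r"(?<![\d.])" + re.escape(ip) + r"(?!\.?\d)")
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


if __name__ == "__main__":
    # Made-up log entries for illustration only.
    sample_log = [
        "11:05:23 RST sent src=10.0.0.9:8080 dst=10.0.0.5:49152\n",
        "11:05:24 ACCEPT src=10.0.0.50:51000 dst=10.0.0.9:8080\n",
    ]
    for entry in lines_mentioning_ip(sample_log, "10.0.0.5"):
        print(entry)
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the list. Narrowing the results to entries close to the timestamp in the Terraform error helps correlate events across layers.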
Overview of possible solutions
The solution needs to come from your internal networking teams, who must determine how to stop the peer from resetting the connection.