Overview
This article provides a detailed explanation of the HashiCorp Nomad Autoscaler’s architecture, workflow, and best practices for running it with Google Cloud Platform Managed Instance Groups (GCP MIGs), Datadog, and Prometheus. It also documents real-world troubleshooting steps for resolving issues such as autoscaler freezing, scaling anomalies, and log visibility, referencing actual configuration and log files.
Autoscaler Workflow Overview
Key Components
Nomad Autoscaler: Evaluates scaling policies and interacts with Nomad and cloud APIs to scale infrastructure.
Nomad Servers & Clients: The control plane (servers) and worker nodes (clients) managed by Nomad.
GCP MIGs: Google’s Managed Instance Groups, used as the scaling target for Nomad clients.
Metrics Providers: Datadog and Prometheus, supplying real-time metrics for autoscaler policy checks.
Drain/Shutdown Scripts: Custom scripts for graceful node draining and shutdown.
Logging & Monitoring: Centralized log collection (e.g., Datadog) and alerting.
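The Datadog and Prometheus metrics referenced throughout this article (e.g., nomad.client.allocated.cpu) originate from Nomad's own telemetry. As a sketch, a telemetry stanza like the following on Nomad clients and servers exposes them; the datadog_address value assumes a local DogStatsD agent, so adjust or drop options you do not use:

telemetry {
  collection_interval        = "1s"
  publish_allocation_metrics = true
  publish_node_metrics       = true
  prometheus_metrics         = true              # expose /v1/metrics for Prometheus scraping
  datadog_address            = "localhost:8125"  # assumption: DogStatsD agent running on each node
}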
Autoscaler Workflow:
Autoscaler Policy Evaluation -> Query Metrics Provider (Datadog/Prometheus) -> Evaluate Scaling Policy -> (threshold crossed) Trigger Scale Up/Down via GCP MIG Plugin -> Nomad Node Drain (if scaling in) -> Node Removed from MIG -> Feedback Loop: Continue Monitoring
Workflow Steps
Nomad Autoscaler is designed to automatically scale Nomad client nodes or application workloads based on real-time metrics. The typical workflow is:
Metrics Collection: The autoscaler queries external metrics providers (e.g., Datadog, Prometheus) for resource usage data.
Policy Evaluation: Scaling policies (defined in HCL files) specify thresholds and strategies (e.g., target CPU utilization). The autoscaler evaluates these policies at a configured interval.
Scaling Decision: If a policy threshold is crossed, the autoscaler triggers a scale-up or scale-down action via the relevant target plugin (e.g., gce-mig for GCP MIGs).
Node Draining: Before removing a node, the autoscaler initiates a Nomad node drain to gracefully migrate workloads.
Feedback Loop: The autoscaler continues to monitor metrics and adjust the cluster size as needed.
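Concretely, the autoscaler agent wires these steps together in its own configuration file, which points at Nomad, registers the metrics (APM) and target plugins, and names the directory holding the scaling policies. A minimal sketch follows; all addresses, keys, and paths are placeholders, and the gce-mig plugin is assumed to pick up GCP Application Default Credentials from its environment:

nomad {
  address = "http://nomad.service.consul:4646"   # placeholder Nomad API address
}

policy {
  dir = "/etc/nomad-autoscaler/policies"         # placeholder path to scaling policy files
}

apm "datadog" {
  driver = "datadog"
  config = {
    dd_api_key = "<DATADOG_API_KEY>"             # placeholders
    dd_app_key = "<DATADOG_APP_KEY>"
  }
}

apm "prometheus" {
  driver = "prometheus"
  config = {
    address = "http://prometheus.service.consul:9090"   # placeholder Prometheus address
  }
}

target "gce-mig" {
  driver = "gce-mig"
  # Credentials are assumed to come from the environment here;
  # see the plugin documentation for explicit credential options.
}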
Key Configuration Snippets
Example scaling policy:
scaling "nomad-client-prd-asia-southeast1-a-igm-v1cf" { enabled = true min = 1 max = 24 policy { cooldown = "5m" evaluation_interval = "1m" check "allocated-cpu-pct" { source = "datadog" query = "sum:nomad.client.allocated.cpu{instance-template:*} by {*}.weighted() * 100 / (sum:nomad.client.allocated.cpu{instance-template:*} by {*}.weighted() + sum:nomad.client.unallocated.cpu{instance-template:*} by {*}.weighted())" strategy "target-value" { target = 70 } } # ...other checks... target "gce-mig" { project = "your-gcp-project" zone = "asia-southeast1-a" mig_name = "your-mig-name" node_drain_deadline = "8m" node_purge = true } } }
Reference:
Scaling Policies | Nomad | HashiCorp Developer
Autoscaling Plugins: GCE MIG | Nomad | HashiCorp Developer
Common Issues and Troubleshooting
A. Log Quota and Missing Logs
Symptom:
Logs are missing or inconsistent, especially during high activity or testing.
Root Cause:
Log ingestion quota is reached, causing logs to be dropped.
Solution:
Increase log quota or retention in your logging backend.
Monitor log pipeline health.
Always verify log pipeline status before troubleshooting service-level issues.
B. 429 Errors: Too Many Concurrent Connections
Symptom:
Autoscaler or Nomad API returns HTTP 429 errors during scaling events.
Root Cause:
Nomad server’s http_max_conns_per_client limit is exceeded, often during large deployments or bursts of scaling/drain actions.
Solution:
Increase http_max_conns_per_client in the Nomad agent configuration. Note that it belongs in the limits block (the default is 100), for example:
limits {
  http_max_conns_per_client = 200
}
Tune the autoscaler policy's cooldown and evaluation_interval to reduce API pressure.
Stagger or batch scaling actions if possible.
Monitor for 429s after deployments or large scaling events.
Reference:
Nomad Agent Configuration | Nomad | HashiCorp Developer
C. TCP Connection Refused to Prometheus
Symptom:
Autoscaler logs show “connection refused” errors to Prometheus.
Root Cause:
Prometheus is unreachable (service down, network issue, or being restarted).
Solution:
Run Prometheus on a static node pool, not subject to autoscaling.
Ensure network/firewall rules allow the autoscaler to reach Prometheus.
Consider running multiple Prometheus replicas for high availability.
Monitor Prometheus uptime and alert on outages.
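One way to keep Prometheus off the autoscaled MIG is to constrain its Nomad job to clients outside that group. The sketch below assumes the static clients are started with node_class = "static"; the class name, image tag, and sizing are illustrative:

job "prometheus" {
  datacenters = ["dc1"]

  # Only place this job on clients in the static pool, never on autoscaled MIG nodes.
  constraint {
    attribute = "${node.class}"
    value     = "static"
  }

  group "prometheus" {
    task "prometheus" {
      driver = "docker"

      config {
        image = "prom/prometheus:v2.53.0"   # version tag is an assumption; pin your own
      }

      resources {
        cpu    = 500
        memory = 1024
      }
    }
  }
}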
D. “Target Not Ready” and MIG Scaling Races
Symptom:
Autoscaler reports “target not ready” when a MIG is still reconciling a previous scaling action.
Root Cause:
Autoscaler’s evaluation interval or cooldown is too short for the time it takes GCP MIGs to complete operations.
Solution:
Increase cooldown and/or evaluation_interval in your scaling policy to match real-world MIG operation times (e.g., 5–10 minutes).
Consider contributing to the gce-mig plugin to make retry limits configurable.
Monitor for repeated “target not ready” errors and adjust intervals accordingly.
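As a sketch, the relevant attributes of the scaling policy shown earlier would change roughly as follows; the values are illustrative, so measure how long your MIG operations actually take and set the intervals above that:

policy {
  cooldown            = "10m"   # longer than the slowest observed MIG resize/drain cycle
  evaluation_interval = "5m"    # evaluate less often than the MIG needs to reconcile
  # ...checks and target block unchanged...
}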
E. MIG Count Drops to 0 Despite min=1
Symptom:
A MIG scales down to 0 instances even though min = 1 is set.
Root Cause:
Possible race condition, bug, or external automation interfering with the MIG.
Solution:
Double-check all scaling policies to ensure min = 1 is set.
Ensure no other automation or manual process is modifying the MIG.
F. OOM and Resource Sizing
Symptom:
Autoscaler container is killed due to out-of-memory (OOM).
Solution:
Increase memory allocation in your Nomad job spec (e.g., 1024 MiB).
Monitor resource usage and adjust as needed.
Use Nomad’s resource limits to prevent noisy neighbor issues.
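A sketch of the corresponding task stanza in the autoscaler's Nomad job spec is shown below; the task name, image, and values are illustrative:

task "autoscaler" {
  driver = "docker"

  config {
    image = "hashicorp/nomad-autoscaler"   # pin a specific version tag in practice
  }

  resources {
    cpu    = 500
    memory = 1024   # MiB; raise this if the container is repeatedly OOM-killed
  }
}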
G. Node Drain Deadline Alignment
Symptom:
Allocation migration failures during scale-in.
Root Cause:
Drain deadlines too short for graceful shutdown and migration.
Solution:
Set node_drain_deadline in your scaling policy to match your shutdown/drain scripts (e.g., 8–10 minutes).
Align all drain/cooldown/shutdown deadlines across policies and scripts.
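As a sketch, the alignment looks roughly like this: the policy's drain deadline must exceed the longest shutdown any workload needs, and the workloads declare how long they need via kill_timeout (values below are illustrative):

# In the cluster scaling policy:
target "gce-mig" {
  node_drain_deadline = "10m"   # must exceed the longest expected task shutdown
  # ...project/zone/mig_name as shown earlier...
}

# In workload job specs:
task "app" {
  kill_timeout = "90s"   # time allowed between the kill signal and SIGKILL
}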
H. Autoscaler Freezing or Not Scaling
Symptom:
Autoscaler appears to freeze or not take scaling actions.
Root Cause:
Often due to missing logs (log quota), network issues, or policy misconfiguration.
Solution:
Ensure logs are not being dropped due to quota.
Check connectivity to Nomad server, Datadog, and Prometheus.
Validate that scaling policies are enabled and correctly configured.
Use Nomad debug bundles (hcdiag) for deep troubleshooting.
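When diagnosing a freeze, it also helps to raise the autoscaler agent's log verbosity; the following are top-level agent configuration options, shown as a sketch to be merged into the existing agent config:

log_level = "DEBUG"   # default is INFO; DEBUG records each policy evaluation and plugin call
log_json  = true      # structured logs are easier to parse in Datadog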
Best Practices & Recommendations
Static Node Pools: Run the autoscaler and Prometheus on static node pools to prevent disruption during scale-in.
Align Deadlines: Set node_drain_deadline to match your shutdown/drain scripts (e.g., 8–10 minutes).
Monitor and Tune: Regularly monitor for 429s, OOMs, and “target not ready” errors. Tune policy intervals and server limits as needed.
Use Service Discovery: Point the autoscaler at Nomad server(s) using service discovery for high availability (see the sketch after this list).
Review Logs: Ensure log pipelines are healthy and not hitting quotas.
Policy Hygiene: Keep all scaling policies up to date and ensure min/max values are correct.
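For the service discovery recommendation, a sketch of the autoscaler's nomad block using a Consul DNS name is shown below; the nomad.service.consul name assumes Consul DNS is available, and any load-balanced address in front of the Nomad servers serves the same purpose:

nomad {
  address = "http://nomad.service.consul:4646"   # resolves to a healthy Nomad server via Consul DNS
}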
Deep Troubleshooting with Nomad Debug Bundles
Use hcdiag to collect full debug bundles for Nomad (journald-nomad.log, metrics.json, eventstream.json, etc.).
Analyze logs for scaling actions, errors, and policy evaluations.
Review the autoscaler and plugin configuration files for policy correctness.
Summary
By understanding the full workflow, component interactions, and following the troubleshooting steps above, you can ensure a robust and reliable autoscaling experience with Nomad and GCP MIGs. Always monitor logs, align policy intervals with real-world timings, and keep critical infrastructure on static node pools.
References
Autoscaling Concepts | Nomad | HashiCorp Developer
Autoscaling Policy Evaluation | Nomad | HashiCorp Developer
Checks | Nomad | HashiCorp Developer