Overview
This article provides a detailed explanation of the HashiCorp Nomad Autoscaler’s architecture, workflow, and best practices for running it with Google Cloud Platform Managed Instance Groups (GCP MIGs), Datadog, and Prometheus. It also documents real-world troubleshooting steps for resolving issues such as autoscaler freezing, scaling anomalies, and log visibility, referencing actual configuration and log files.
Autoscaler Workflow Overview
Key Components
Nomad Autoscaler: Evaluates scaling policies and interacts with Nomad and cloud APIs to scale infrastructure.
Nomad Servers & Clients: The control plane (servers) and worker nodes (clients) managed by Nomad.
GCP MIGs: Google’s Managed Instance Groups, used as the scaling target for Nomad clients.
Metrics Providers: Datadog and Prometheus, supplying real-time metrics for autoscaler policy checks.
Drain/Shutdown Scripts: Custom scripts for graceful node draining and shutdown.
Logging & Monitoring: Centralized log collection (e.g., Datadog) and alerting.
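The Datadog and Prometheus metrics referenced throughout this article (e.g., nomad.client.allocated.cpu) originate from Nomad's own telemetry. As a sketch, a telemetry stanza like the following on Nomad clients and servers exposes them; the datadog_address value assumes a local DogStatsD agent, so adjust or drop options you do not use:

telemetry {
  collection_interval        = "1s"
  publish_allocation_metrics = true
  publish_node_metrics       = true
  prometheus_metrics         = true              # expose /v1/metrics for Prometheus scraping
  datadog_address            = "localhost:8125"  # assumption: DogStatsD agent running on each node
}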
Autoscaler Workflow:
Autoscaler Policy Evaluation -> Query Metrics Provider (Datadog/Prometheus) -> Evaluate Scaling Policy -> (threshold crossed) Trigger Scale Up/Down via GCP MIG Plugin -> Nomad Node Drain (if scaling in) -> Node Removed from MIG -> Feedback Loop: Continue Monitoring
Workflow Steps
Nomad Autoscaler is designed to automatically scale Nomad client nodes or application workloads based on real-time metrics. The typical workflow is:
Metrics Collection: The autoscaler queries external metrics providers (e.g., Datadog, Prometheus) for resource usage data.
Policy Evaluation: Scaling policies (defined in HCL files) specify thresholds and strategies (e.g., target CPU utilization). The autoscaler evaluates these policies at a configured interval.
Scaling Decision: If a policy threshold is crossed, the autoscaler triggers a scale-up or scale-down action via the relevant target plugin (e.g., gce-mig for GCP MIGs).
Node Draining: Before removing a node, the autoscaler initiates a Nomad node drain to gracefully migrate workloads.
Feedback Loop: The autoscaler continues to monitor metrics and adjust the cluster size as needed.
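Concretely, the autoscaler agent wires these steps together in its own configuration file, which points at Nomad, registers the metrics (APM) and target plugins, and names the directory holding the scaling policies. A minimal sketch follows; all addresses, keys, and paths are placeholders, and the gce-mig plugin is assumed to pick up GCP Application Default Credentials from its environment:

nomad {
  address = "http://nomad.service.consul:4646"   # placeholder Nomad API address
}

policy {
  dir = "/etc/nomad-autoscaler/policies"         # placeholder path to scaling policy files
}

apm "datadog" {
  driver = "datadog"
  config = {
    dd_api_key = "<DATADOG_API_KEY>"             # placeholders
    dd_app_key = "<DATADOG_APP_KEY>"
  }
}

apm "prometheus" {
  driver = "prometheus"
  config = {
    address = "http://prometheus.service.consul:9090"   # placeholder Prometheus address
  }
}

target "gce-mig" {
  driver = "gce-mig"
  # Credentials are assumed to come from the environment here;
  # see the plugin documentation for explicit credential options.
}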
Key Configuration Snippets
Example scaling policy:
scaling "nomad-client-prd-asia-southeast1-a-igm-v1cf" { enabled = true min = 1 max = 24 policy { cooldown = "5m" evaluation_interval = "1m" check "allocated-cpu-pct" { source = "datadog" query = "sum:nomad.client.allocated.cpu{instance-template:*} by {*}.weighted() * 100 / (sum:nomad.client.allocated.cpu{instance-template:*} by {*}.weighted() + sum:nomad.client.unallocated.cpu{instance-template:*} by {*}.weighted())" strategy "target-value" { target = 70 } } # ...other checks... target "gce-mig" { project = "your-gcp-project" zone = "asia-southeast1-a" mig_name = "your-mig-name" node_drain_deadline = "8m" node_purge = true } } }
Reference:
Scaling Policies | Nomad | HashiCorp Developer
Autoscaling Plugins: GCE MIG | Nomad | HashiCorp Developer
Common Issues and Troubleshooting
A. Log Quota and Missing Logs
Symptom:
Logs are missing or inconsistent, especially during high activity or testing.
Root Cause:
Log ingestion quota is reached, causing logs to be dropped.
Solution:
Increase log quota or retention in your logging backend.
Monitor log pipeline health.
Always verify log pipeline status before troubleshooting service-level issues.
B. 429 Errors: Too Many Concurrent Connections
Symptom:
Autoscaler or Nomad API returns HTTP 429 errors during scaling events.
Root Cause:
Nomad server’s http_max_conns_per_client limit is exceeded, often during large deployments or bursts of scaling/drain actions.
Solution:
Increase http_max_conns_per_client in the Nomad agent configuration. Note that it belongs in the limits block (the default is 100), for example:
limits {
  http_max_conns_per_client = 200
}
Tune the autoscaler policy's cooldown and evaluation_interval to reduce API pressure.
Stagger or batch scaling actions if possible.
Monitor for 429s after deployments or large scaling events.
Reference:
Nomad Agent Configuration | Nomad | HashiCorp Developer
C. TCP Connection Refused to Prometheus
Symptom:
Autoscaler logs show “connection refused” errors to Prometheus.
Root Cause:
Prometheus is unreachable (service down, network issue, or being restarted).
Solution:
Run Prometheus on a static node pool, not subject to autoscaling.
Ensure network/firewall rules allow the autoscaler to reach Prometheus.
Consider running multiple Prometheus replicas for high availability.
Monitor Prometheus uptime and alert on outages.
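One way to keep Prometheus off the autoscaled MIG is to constrain its Nomad job to clients outside that group. The sketch below assumes the static clients are started with node_class = "static"; the class name, image tag, and sizing are illustrative:

job "prometheus" {
  datacenters = ["dc1"]

  # Only place this job on clients in the static pool, never on autoscaled MIG nodes.
  constraint {
    attribute = "${node.class}"
    value     = "static"
  }

  group "prometheus" {
    task "prometheus" {
      driver = "docker"

      config {
        image = "prom/prometheus:v2.53.0"   # version tag is an assumption; pin your own
      }

      resources {
        cpu    = 500
        memory = 1024
      }
    }
  }
}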
D. “Target Not Ready” and MIG Scaling Races
Symptom:
Autoscaler reports “target not ready” when a MIG is still reconciling a previous scaling action.
Root Cause:
Autoscaler’s evaluation interval or cooldown is too short for the time it takes GCP MIGs to complete operations.
Solution:
Increase cooldown and/or evaluation_interval in your scaling policy to match real-world MIG operation times (e.g., 5–10 minutes).
Consider contributing to the gce-mig plugin to make retry limits configurable.
Monitor for repeated “target not ready” errors and adjust intervals accordingly.
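As a sketch, the relevant attributes of the scaling policy shown earlier would change roughly as follows; the values are illustrative, so measure how long your MIG operations actually take and set the intervals above that:

policy {
  cooldown            = "10m"   # longer than the slowest observed MIG resize/drain cycle
  evaluation_interval = "5m"    # evaluate less often than the MIG needs to reconcile
  # ...checks and target block unchanged...
}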
E. MIG Count Drops to 0 Despite min=1
Symptom:
A MIG scales down to 0 instances even though min = 1 is set.
Root Cause:
Possible race condition, bug, or external automation interfering with the MIG.
Solution:
Double-check all scaling policies to ensure min = 1 is set.
Ensure no other automation or manual process is modifying the MIG.
F. OOM and Resource Sizing
Symptom:
Autoscaler container is killed due to out-of-memory (OOM).
Solution:
Increase memory allocation in your Nomad job spec (e.g., 1024 MiB).
Monitor resource usage and adjust as needed.
Use Nomad’s resource limits to prevent noisy neighbor issues.
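A sketch of the corresponding task stanza in the autoscaler's Nomad job spec is shown below; the task name, image, and values are illustrative:

task "autoscaler" {
  driver = "docker"

  config {
    image = "hashicorp/nomad-autoscaler"   # pin a specific version tag in practice
  }

  resources {
    cpu    = 500
    memory = 1024   # MiB; raise this if the container is repeatedly OOM-killed
  }
}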
G. Node Drain Deadline Alignment
Symptom:
Allocation migration failures during scale-in.
Root Cause:
Drain deadlines too short for graceful shutdown and migration.
Solution:
Set node_drain_deadline in your scaling policy to match your shutdown/drain scripts (e.g., 8–10 minutes).
Align all drain/cooldown/shutdown deadlines across policies and scripts.
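As a sketch, the alignment looks roughly like this: the policy's drain deadline must exceed the longest shutdown any workload needs, and the workloads declare how long they need via kill_timeout (values below are illustrative):

# In the cluster scaling policy:
target "gce-mig" {
  node_drain_deadline = "10m"   # must exceed the longest expected task shutdown
  # ...project/zone/mig_name as shown earlier...
}

# In workload job specs:
task "app" {
  kill_timeout = "90s"   # time allowed between the kill signal and SIGKILL
}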
H. Autoscaler Freezing or Not Scaling
Symptom:
Autoscaler appears to freeze or not take scaling actions.
Root Cause:
Often due to missing logs (log quota), network issues, or policy misconfiguration.
Solution:
Ensure logs are not being dropped due to quota.
Check connectivity to Nomad server, Datadog, and Prometheus.
Validate that scaling policies are enabled and correctly configured.
Use Nomad debug bundles (hcdiag) for deep troubleshooting.
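When diagnosing a freeze, it also helps to raise the autoscaler agent's log verbosity; the following are top-level agent configuration options, shown as a sketch to be merged into the existing agent config:

log_level = "DEBUG"   # default is INFO; DEBUG records each policy evaluation and plugin call
log_json  = true      # structured logs are easier to parse in Datadog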
Best Practices & Recommendations
Static Node Pools: Run the autoscaler and Prometheus on static node pools to prevent disruption during scale-in.
Align Deadlines: Set node_drain_deadline to match your shutdown/drain scripts (e.g., 8–10 minutes).
Monitor and Tune: Regularly monitor for 429s, OOMs, and “target not ready” errors. Tune policy intervals and server limits as needed.
Use Service Discovery: Point the autoscaler at Nomad server(s) using service discovery for high availability (see the sketch after this list).
Review Logs: Ensure log pipelines are healthy and not hitting quotas.
Policy Hygiene: Keep all scaling policies up to date and ensure min/max values are correct.
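For the service discovery recommendation, a sketch of the autoscaler's nomad block using a Consul DNS name is shown below; the nomad.service.consul name assumes Consul DNS is available, and any load-balanced address in front of the Nomad servers serves the same purpose:

nomad {
  address = "http://nomad.service.consul:4646"   # resolves to a healthy Nomad server via Consul DNS
}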
Deep Troubleshooting with Nomad Debug Bundles
Use hcdiag to collect full debug bundles for Nomad (journald-nomad.log, metrics.json, eventstream.json, etc.).
Analyze logs for scaling actions, errors, and policy evaluations.
Review the autoscaler and plugin configuration files for policy correctness.
Summary
By understanding the full workflow, component interactions, and following the troubleshooting steps above, you can ensure a robust and reliable autoscaling experience with Nomad and GCP MIGs. Always monitor logs, align policy intervals with real-world timings, and keep critical infrastructure on static node pools.
References
Autoscaling Concepts | Nomad | HashiCorp Developer
Autoscaling Policy Evaluation | Nomad | HashiCorp Developer
Checks | Nomad | HashiCorp Developer