Introduction
Boundary workers must run continuously to maintain service quality. This KB article guides you through setting up a Bash script that checks worker health by querying the local health endpoint for HTTP status, operational state, and upstream connection state. Optionally, the script can publish metrics to Amazon CloudWatch so that you can receive alerts through SNS.
By following these steps, you can proactively manage worker health. If issues arise, share logs with support for assistance.
Script Setup
The script queries the Boundary health endpoint and parses the response for HTTP status, operational state, and upstream connection state. It runs every 5 minutes via cron and appends an entry to a log file when it detects a problem, such as a non-200 HTTP status or an upstream connection state other than "READY", so you can track worker health locally.
Steps
- The health endpoint is served by a listener defined with purpose = "ops". Add the stanza below to your worker configuration file, then restart the Boundary worker service.
listener "tcp" {
address = "0.0.0.0:9203"
purpose = "ops"
tls_disable = true
}
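- Save the script below as /usr/local/bin/boundary_health_check.sh and make it executable: sudo chmod +x /usr/local/bin/boundary_health_check.sh
- Schedule it with crontab. Run sudo crontab -e and add these lines to run the script every 5 minutes and on reboot:
*/5 * * * * /usr/local/bin/boundary_health_check.sh
@reboot /usr/local/bin/boundary_health_check.sh
- Check the system log to confirm that cron is executing the script (this path applies to Debian and Ubuntu; RHEL-based systems log cron activity to /var/log/cron):
sudo grep CRON /var/log/syslog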
Example script
#!/bin/bash
# Set up basic variables
HEALTH_URL="http://localhost:9203/health?worker_info=1"
LOG_FILE="/var/log/boundary_health_check.log"
# Get the current time for logging
CURRENT_TIME=$(date "+%Y-%m-%d %H:%M:%S")
# Check the health URL and grab the response; --show-error makes curl report failures despite --silent
RESPONSE=$(curl --silent --show-error --max-time 10 --write-out "HTTPSTATUS:%{http_code}" "$HEALTH_URL" 2>> "$LOG_FILE")
CURL_EXIT=$?
# If curl itself failed (for example, connection refused), log it and exit.
# Checking only for an empty response is not enough: --write-out still prints HTTPSTATUS:000 on failure.
if [ "$CURL_EXIT" -ne 0 ] || [ -z "$RESPONSE" ]; then
    echo "$CURRENT_TIME: Curl failed to get response (exit code $CURL_EXIT)" >> "$LOG_FILE"
    exit 1
fi
# Split the response into body and status code
BODY=$(echo "$RESPONSE" | sed -e 's/HTTPSTATUS:[0-9]*//g')
# -n with the p flag prints only the line that carried the marker, so a multi-line body cannot leak into the status
HTTP_STATUS=$(echo "$RESPONSE" | sed -n -e 's/.*HTTPSTATUS://p' | tr -d '[:space:]')
# Check HTTP status and log/process accordingly
if [ "$HTTP_STATUS" -eq 200 ]; then
    STATE=$(echo "$BODY" | jq -r '.worker_process_info.state')
    UPSTREAM_STATE=$(echo "$BODY" | jq -r '.worker_process_info.upstream_connection_state')
    # Log if state is not active
    if [ "$STATE" != "active" ]; then
        echo "$CURRENT_TIME: Boundary worker state is not active. State: $STATE" >> "$LOG_FILE"
    fi
    # Log if upstream state is not READY
    if [ "$UPSTREAM_STATE" != "READY" ]; then
        echo "$CURRENT_TIME: Boundary worker upstream connection is not READY. State: $UPSTREAM_STATE" >> "$LOG_FILE"
    fi
else
    echo "$CURRENT_TIME: Worker health endpoint returned a non-200 response. HTTP Status: $HTTP_STATUS" >> "$LOG_FILE"
fi
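The body/status split is the step most likely to misbehave when the response body spans several lines, so it can help to try it on its own. The following is a minimal sketch, not part of the original script: the function names response_body and response_status are hypothetical, and the sample string is illustrative rather than captured output. It uses shell parameter expansion instead of sed.

```shell
#!/bin/bash
# Hypothetical helpers: split "<body>HTTPSTATUS:<code>", the shape produced by
# curl --write-out "HTTPSTATUS:%{http_code}".

# Print everything before the last "HTTPSTATUS:" marker (the body).
response_body() {
    printf '%s' "${1%HTTPSTATUS:*}"
}

# Print everything after the last "HTTPSTATUS:" marker (the status code).
response_status() {
    printf '%s' "${1##*HTTPSTATUS:}"
}

# Illustrative sample standing in for a real curl result (assumed, not captured).
SAMPLE=$(printf '%s\n%s' '{"worker_process_info":{"state":"active"}}' 'HTTPSTATUS:200')

echo "status: $(response_status "$SAMPLE")"
echo "body:   $(response_body "$SAMPLE")"
```

Because the expansion works on the whole string rather than line by line, a newline at the end of the body does not affect the extracted status code.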
Optional CloudWatch Integration
Metrics are sent to CloudWatch under the BoundaryWorker namespace, and each EC2 instance's metrics carry an InstanceId dimension so that you can tell workers apart. This lets you track the health of multiple instances. You can then create CloudWatch alarms that send email alerts through SNS when any worker fails.
Example script with CloudWatch
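Required:
- AWS CLI
- curl and jq (both are used by the script)
- An IAM role attached to the EC2 instance with permission to write metrics to CloudWatch (the cloudwatch:PutMetricData action)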
#!/bin/bash
# Set up basic variables
HEALTH_URL="http://localhost:9203/health?worker_info=1"
# Adjust --region to match the region your instance runs in
AWS_CLI="aws cloudwatch put-metric-data --region us-east-1"
# Request an IMDSv2 session token, then use it to look up this instance's ID
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
LOG_FILE="/var/log/boundary_health_check.log"
# Get the current time for logging
CURRENT_TIME=$(date "+%Y-%m-%d %H:%M:%S")
# Check the health URL and grab the response; --show-error makes curl report failures despite --silent
RESPONSE=$(curl --silent --show-error --max-time 10 --write-out "HTTPSTATUS:%{http_code}" "$HEALTH_URL" 2>> "$LOG_FILE")
CURL_EXIT=$?
# If curl itself failed, log it and send a bad health status to CloudWatch.
# Checking only for an empty response is not enough: --write-out still prints HTTPSTATUS:000 on failure.
if [ "$CURL_EXIT" -ne 0 ] || [ -z "$RESPONSE" ]; then
    echo "$CURRENT_TIME: Curl failed to get response (exit code $CURL_EXIT)" >> "$LOG_FILE"
    $AWS_CLI --namespace "BoundaryWorker" --metric-name "HealthCheckStatus" --value 1 --unit "None" --dimensions Name=InstanceId,Value="$INSTANCE_ID"
    exit 1
fi
# Split the response into body and status code
BODY=$(echo "$RESPONSE" | sed -e 's/HTTPSTATUS:[0-9]*//g')
# -n with the p flag prints only the line that carried the marker, so a multi-line body cannot leak into the status
HTTP_STATUS=$(echo "$RESPONSE" | sed -n -e 's/.*HTTPSTATUS://p' | tr -d '[:space:]')
# Check HTTP status, set metrics, and process state/upstream
if [ "$HTTP_STATUS" -eq 200 ]; then
    HTTP_METRIC_VALUE=0
    STATE=$(echo "$BODY" | jq -r '.worker_process_info.state')
    UPSTREAM_STATE=$(echo "$BODY" | jq -r '.worker_process_info.upstream_connection_state')
    # Log if state is not active
    if [ "$STATE" != "active" ]; then
        echo "$CURRENT_TIME: Boundary worker state is not active. State: $STATE" >> "$LOG_FILE"
    fi
    # Log if upstream state is not READY and set metric value
    if [ "$UPSTREAM_STATE" = "READY" ]; then
        UPSTREAM_METRIC_VALUE=0
    else
        UPSTREAM_METRIC_VALUE=1
        echo "$CURRENT_TIME: Boundary worker upstream connection is not READY. State: $UPSTREAM_STATE" >> "$LOG_FILE"
    fi
    # Send upstream state to CloudWatch
    $AWS_CLI --namespace "BoundaryWorker" --metric-name "UpstreamState" --value "$UPSTREAM_METRIC_VALUE" --unit "None" --dimensions Name=InstanceId,Value="$INSTANCE_ID"
else
    HTTP_METRIC_VALUE=1
    echo "$CURRENT_TIME: Worker health endpoint returned a non-200 response. HTTP Status: $HTTP_STATUS" >> "$LOG_FILE"
fi
# Send the HTTP status to CloudWatch every time
$AWS_CLI --namespace "BoundaryWorker" --metric-name "HttpStatus" --value "$HTTP_METRIC_VALUE" --unit "None" --dimensions Name=InstanceId,Value="$INSTANCE_ID"
# Send overall health status to CloudWatch: 0 if everything's good, 1 if not
if [ "$HTTP_STATUS" -eq 200 ] && [ "$STATE" = "active" ] && [ "$UPSTREAM_STATE" = "READY" ]; then
    $AWS_CLI --namespace "BoundaryWorker" --metric-name "HealthCheckStatus" --value 0 --unit "None" --dimensions Name=InstanceId,Value="$INSTANCE_ID"
else
    $AWS_CLI --namespace "BoundaryWorker" --metric-name "HealthCheckStatus" --value 1 --unit "None" --dimensions Name=InstanceId,Value="$INSTANCE_ID"
fi
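Once the HealthCheckStatus metric is being published, an alarm along the following lines could deliver the SNS alerts described above. This is a sketch under stated assumptions, not a prescribed configuration: the function name create_health_alarm, the alarm name, the DRY_RUN switch, and the threshold and period values are all illustrative, and the SNS topic ARN must point to a topic you have already created and subscribed to. Missing data is treated as breaching so that a worker whose script has stopped reporting also raises the alarm.

```shell
#!/bin/bash
# Hypothetical helper: create a CloudWatch alarm on HealthCheckStatus for one
# instance. With DRY_RUN=1 the command is printed instead of executed.
create_health_alarm() {
    local instance_id="$1"    # EC2 instance ID used as the InstanceId dimension
    local sns_topic_arn="$2"  # ARN of an existing SNS topic (assumed to exist)

    # Build the command as positional parameters so each argument stays intact.
    set -- aws cloudwatch put-metric-alarm \
        --region us-east-1 \
        --alarm-name "boundary-worker-health-${instance_id}" \
        --namespace BoundaryWorker \
        --metric-name HealthCheckStatus \
        --dimensions "Name=InstanceId,Value=${instance_id}" \
        --statistic Maximum \
        --period 300 \
        --evaluation-periods 1 \
        --threshold 1 \
        --comparison-operator GreaterThanOrEqualToThreshold \
        --treat-missing-data breaching \
        --alarm-actions "$sns_topic_arn"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

To preview the command without calling AWS, set DRY_RUN=1 and run create_health_alarm with your instance ID and topic ARN. The 300-second period is chosen to match the 5-minute cron schedule.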
Alternative: External Health Monitoring
As an alternative to local scripting, you can use an external service to monitor Boundary worker health by querying <instance-ip>:9203. This approach centralizes monitoring but requires updating the EC2 instance’s security group to allow inbound TCP traffic on port 9203. Ensure proper security measures, like restricting source IPs, to protect the endpoint.
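As a rough illustration of the external approach, the sketch below could run on a separate monitoring host. It assumes curl is installed there and that port 9203 is reachable from that host; the function names worker_http_status and check_workers are hypothetical, and the addresses in the usage note are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: print the HTTP status code returned by a worker's health
# endpoint, or 000 if curl cannot connect within the time limit.
worker_http_status() {
    local host="$1"
    # --max-time bounds each check so an unreachable worker cannot stall the loop
    curl --silent --output /dev/null --max-time 5 \
        --write-out '%{http_code}' "http://${host}:9203/health?worker_info=1" || true
}

# Report every worker that does not answer with HTTP 200.
# Returns 1 if at least one worker is unhealthy, 0 otherwise.
check_workers() {
    local host status rc=0
    for host in "$@"; do
        status=$(worker_http_status "$host")
        if [ "$status" != "200" ]; then
            echo "UNHEALTHY ${host} (HTTP ${status})"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, check_workers 10.0.1.10 10.0.1.11 prints one line per failing worker and returns a non-zero exit code that a scheduler or alerting tool can act on.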
Additional Information
What should I do if I have more questions?
- For additional questions or support, please open a Support ticket.