Introduction
Problem
In some cases, Vault Integrated Storage (Raft) Autopilot reports Vault follower (standby) nodes as unhealthy. This can be observed by running:
vault read -format=json sys/storage/raft/autopilot/state | jq '.data.servers | map_values({
id: .id,
healthy: .healthy,
last_contact: .last_contact,
last_index: .last_index,
last_term: .last_term
})'
{
"vault-node1": {
"id": "vault-node1",
"healthy": true,
"last_contact": "0s",
"last_index": 2259757,
"last_term": 3
},
"vault-node2": {
"id": "vault-node2",
"healthy": false,
"last_contact": "4.702715835s",
"last_index": 2259745,
"last_term": 3
},
"vault-node3": {
"id": "vault-node3",
"healthy": false,
"last_contact": "3.723809127s",
"last_index": 2259747,
"last_term": 3
}
}
Alternatively, the vault.autopilot.node.healthy metric can be used to retrieve the health status of each node in a Vault Autopilot configuration:
vault read -format=json sys/metrics | jq -r '.data.Gauges[] | select(.Name == "vault.autopilot.node.healthy") | "\(.Labels.node_id): \(.Value)"'
vault-us-east-2-us-east-2a-pure-bunny: 1
vault-us-east-2-us-east-2b-pure-bunny: 0
vault-us-east-2-us-east-2c-pure-bunny: 0
- A value of 1 on the gauge means that Autopilot deems the node indicated by node_id healthy.
- A value of 0 on the gauge means that Autopilot cannot communicate with the node indicated by node_id, or deems the node unhealthy.
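The gauge check above can also be scripted. The following is a minimal sketch, assuming the output of `vault read -format=json sys/metrics` has been parsed into a Python dict; the function name `unhealthy_nodes` and the `sample` data are illustrative only and mirror the output shown above.

```python
def unhealthy_nodes(metrics: dict) -> list[str]:
    """Return the node IDs whose vault.autopilot.node.healthy gauge is 0."""
    gauges = metrics.get("data", {}).get("Gauges") or []
    return sorted(
        gauge["Labels"]["node_id"]
        for gauge in gauges
        if gauge.get("Name") == "vault.autopilot.node.healthy"
        and gauge.get("Value") == 0
    )


# Illustrative sample shaped like the sys/metrics output above (not live data).
sample = {
    "data": {
        "Gauges": [
            {"Name": "vault.autopilot.node.healthy",
             "Labels": {"node_id": "vault-us-east-2-us-east-2a-pure-bunny"},
             "Value": 1},
            {"Name": "vault.autopilot.node.healthy",
             "Labels": {"node_id": "vault-us-east-2-us-east-2b-pure-bunny"},
             "Value": 0},
            {"Name": "vault.autopilot.node.healthy",
             "Labels": {"node_id": "vault-us-east-2-us-east-2c-pure-bunny"},
             "Value": 0},
        ]
    }
}

for node_id in unhealthy_nodes(sample):
    print(f"{node_id} is reported as unhealthy")
```

In practice the dict would come from loading the saved command output with `json.load` rather than from a hard-coded sample.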
Prerequisites (if applicable)
- Vault Enterprise
- Vault Telemetry
- Vault Integrated storage (Raft) backend (only)
Cause
A follower node is typically reported as unhealthy when the value set for either max_trailing_logs or last_contact_threshold has been exceeded. The last_contact_threshold can be exceeded when, for example, there are connectivity issues or the node is unavailable.
- max_trailing_logs (int: 1000) - Amount of entries in the Raft Log that a server can be behind before being considered unhealthy. If this value is too low, it can cause the cluster to lose quorum if a follower falls behind. This value only needs to be increased from the default if you have a very high write load on Vault and you see that it takes a long time to promote new servers to becoming voters. This is an unlikely scenario and most users should not modify this value.
- last_contact_threshold (string: "10s") - Limit on the amount of time a server can go without leader contact before being considered unhealthy.
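To make the two thresholds concrete, the sketch below models only the two checks described above. It is a simplified illustration, not Vault's actual Autopilot implementation, which may evaluate more than these two conditions; `parse_duration` and `exceeded_thresholds` are hypothetical helper names.

```python
import re

# Seconds per unit for Go-style duration strings such as "10s" or "24h0m0s".
_UNITS = {"ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
          "s": 1.0, "m": 60.0, "h": 3600.0}
_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")


def parse_duration(text: str) -> float:
    """Convert a duration string like "4.702715835s" to seconds."""
    parts = _PART.findall(text)
    # Reject input that is not made up entirely of number+unit pairs.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(float(num) * _UNITS[unit] for num, unit in parts)


def exceeded_thresholds(last_contact: str, trailing_logs: int,
                        last_contact_threshold: str = "10s",
                        max_trailing_logs: int = 1000) -> list[str]:
    """Return the names of the thresholds a follower has exceeded."""
    reasons = []
    if parse_duration(last_contact) > parse_duration(last_contact_threshold):
        reasons.append("last_contact_threshold")
    if trailing_logs > max_trailing_logs:
        reasons.append("max_trailing_logs")
    return reasons


# A follower 12 entries behind, last contacted 4.7s ago:
print(exceeded_thresholds("4.702715835s", 12))                       # defaults
print(exceeded_thresholds("4.702715835s", 12, max_trailing_logs=1))  # lowered limit
```

With the default values neither threshold is exceeded, while lowering max_trailing_logs to 1 makes the same follower exceed that limit.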
In most cases this can be confirmed by first retrieving the Vault autopilot configuration:
vault read -format=json sys/storage/raft/autopilot/configuration
{
"request_id": "00d46fe1-1216-f0bf-40a6-bc6b70818b78",
"lease_id": "",
"lease_duration": 0,
"renewable": false,
"data": {
"cleanup_dead_servers": false,
"dead_server_last_contact_threshold": "24h0m0s",
"disable_upgrade_migration": false,
"last_contact_threshold": "10s",
"max_trailing_logs": 1,
"min_quorum": 0,
"server_stabilization_time": "10s"
},
"warnings": null,
"mount_type": "system"
}
For this example, the value specified for max_trailing_logs has been changed from the default value of 1000 to 1. Please note that setting max_trailing_logs to 1 is not recommended; this was done for reproduction purposes only.
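A quick way to spot such a change is to compare the retrieved configuration against the documented defaults. The sketch below is illustrative only: it covers just the two defaults quoted in this article, compares duration strings literally (so "10s" and "0m10s" would be treated as different), and the names `DOCUMENTED_DEFAULTS` and `non_default_settings` are hypothetical.

```python
# Defaults as quoted in this article; other Autopilot settings are not checked.
DOCUMENTED_DEFAULTS = {
    "max_trailing_logs": 1000,
    "last_contact_threshold": "10s",
}


def non_default_settings(config: dict) -> dict:
    """Return the settings whose value differs from the documented default."""
    data = config.get("data", {})
    return {
        key: data[key]
        for key, default in DOCUMENTED_DEFAULTS.items()
        if key in data and data[key] != default
    }


# Illustrative sample: the "data" section of the configuration output above.
sample_config = {
    "data": {
        "cleanup_dead_servers": False,
        "dead_server_last_contact_threshold": "24h0m0s",
        "disable_upgrade_migration": False,
        "last_contact_threshold": "10s",
        "max_trailing_logs": 1,
        "min_quorum": 0,
        "server_stabilization_time": "10s",
    }
}

print(non_default_settings(sample_config))
```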
Comparing the last_index values shows that the follower nodes trail the leader by more than 1, the value specified for max_trailing_logs (vault-node2 by 12 entries and vault-node3 by 10), which causes these nodes to be reported as "healthy": false:
vault read -format=json sys/storage/raft/autopilot/state | jq '.data.servers | map_values({
id: .id,
healthy: .healthy,
last_contact: .last_contact,
last_index: .last_index,
last_term: .last_term
})'
{
"vault-node1": {
"id": "vault-node1",
"healthy": true,
"last_contact": "0s",
"last_index": 2259757,
"last_term": 3
},
"vault-node2": {
"id": "vault-node2",
"healthy": false,
"last_contact": "4.702715835s",
"last_index": 2259745,
"last_term": 3
},
"vault-node3": {
"id": "vault-node3",
"healthy": false,
"last_contact": "3.723809127s",
"last_index": 2259747,
"last_term": 3
}
}
Alternatively, the metric vault.raft_storage.follower.applied_index_delta can be used to retrieve the difference in values returned for the active and follower nodes:
vault read -format=json sys/metrics | jq '.data.Gauges[] | select(.Name == "vault.raft_storage.follower.applied_index_delta")'
{
"Labels": {
"peer_id": "vault-node2"
},
"Name": "vault.raft_storage.follower.applied_index_delta",
"Value": 13
}
{
"Labels": {
"peer_id": "vault-node3"
},
"Name": "vault.raft_storage.follower.applied_index_delta",
"Value": 11
}
Description for the metric: The difference between the index applied by the leader and the index applied by the follower as reported by echoes.
The values retrieved for the vault.raft_storage.follower.applied_index_delta metric can be used to confirm whether the value specified for max_trailing_logs was exceeded at the time the vault.autopilot.node.healthy metric reported nodes as unhealthy.
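The same comparison can be made from the Autopilot state output by subtracting each node's last_index from the leader's. The sketch below is a rough illustration: it assumes the node with the highest last_index is the leader, and `trailing_followers` is a hypothetical helper name. The sample values are the ones shown in the state output earlier in this article.

```python
def trailing_followers(servers: dict, max_trailing_logs: int) -> dict:
    """Map each node that trails by more than max_trailing_logs to its lag."""
    # Assumption: the leader holds the highest last_index in the cluster.
    leader_index = max(server["last_index"] for server in servers.values())
    lags = {name: leader_index - server["last_index"]
            for name, server in servers.items()}
    return {name: lag for name, lag in lags.items() if lag > max_trailing_logs}


# Illustrative sample: the .data.servers section of the state output.
sample_servers = {
    "vault-node1": {"healthy": True, "last_index": 2259757},
    "vault-node2": {"healthy": False, "last_index": 2259745},
    "vault-node3": {"healthy": False, "last_index": 2259747},
}

print(trailing_followers(sample_servers, max_trailing_logs=1))
# prints {'vault-node2': 12, 'vault-node3': 10}
print(trailing_followers(sample_servers, max_trailing_logs=1000))
# prints {}
```

Because last_index and the metric are sampled at different moments, the lag computed this way will not necessarily match the applied_index_delta values exactly.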
The value specified for max_trailing_logs can be exceeded for different reasons: the nodes in the Vault cluster could be insufficiently resourced, or the number of requests made to the Vault cluster could have increased (unexpectedly). Please raise a ticket with HashiCorp Global Support for further analysis if required.
Additional Information
Vault API Documentation: Raft AutoPilot
Vault Documentation: Full Metrics list