The information contained in this article has been verified as up-to-date on the date of the original publication of the article. HashiCorp endeavors to keep this information up-to-date and correct, but it makes no representations or warranties of any kind, express or implied, about the ongoing completeness, accuracy, reliability, or suitability of the information provided.
All information contained in this article is for general information purposes only. Any reliance you place on such information as it applies to your use of your HashiCorp product is therefore strictly at your own risk.
Introduction
To clean up the Consul data-dir in a Kubernetes (k8s) environment for a single replica without impacting other replicas, you can follow the procedure outlined below. This process ensures that the problematic replica's data-dir is cleaned up and allows for automatic raft log syncing to bring the new replica into sync.
Expected Outcome
You should be able to operate on a single Consul server replica without needing to restart the Consul server StatefulSet (STS).
Prerequisites
- The Consul server is installed in a k8s environment.
- The Consul server has more than 1 replica.
- The other Consul server replicas have raft log entries that will automatically be synced to the new replica once it boots up.
Use Case
Suppose a situation arises wherein you are either unable to start the Consul server replica to clear its data-dir directly, or you are not authorized to delete the data-dir PersistentVolumeClaim (PVC) for the replica and re-run the installation to recreate it. In those cases, you will need a way to empty the data-dir using a process that does not affect the health of the other replicas.
Example
There is a 3-replica Consul server STS, where one of the replicas is unable to start due to an error like:
[ERROR] agent.server.raft: failed to get log: index=24946951 error="log not found"
panic: log not found
- See the Error log example below for the complete log.
That replica eventually goes into the CrashLoopBackOff status, and a rolling restart of the Consul server STS cannot be performed.
Procedure
- Verify that the other Consul server replicas are healthy by checking the leader status.
consul operator raft list-peers;
consul info | egrep 'applied_index|commit_index|last_log_index'
# Make sure that applied_index, commit_index and last_log_index are close to each other.
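If the consul binary is not available on your workstation, these checks can be run inside one of the healthy replicas instead. The pod name consul-server-0 and the consul namespace below are assumptions based on a default installation; adjust them to your environment.

```shell
# Run the health checks from inside a healthy replica
# (consul-server-0 and the consul namespace are assumed names).
kubectl -n consul exec consul-server-0 -- consul operator raft list-peers
kubectl -n consul exec consul-server-0 -- sh -c \
  "consul info | egrep 'applied_index|commit_index|last_log_index'"
```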
- Mount the PVC meant for the failing Consul server replica's data-dir using a POD spec like the example below.
# Note that you can find the PVC meant for this replica using an example like below.
#
# In the below example, the Consul server is installed within the consul namespace
# and the PVCs contain the name `consul-server` in them.
kubectl -n consul get pvc | grep consul-server
- Once the new temporary pod starts up, run the below commands to clear the data-dir.
BKP_DATA_DIR="data_backup_$(date | tr ' ' '_')";
mkdir "/consul/data/${BKP_DATA_DIR}";
cd /consul/data;
mv * "${BKP_DATA_DIR}"/;
# Note, you may ignore a response like the below for the `mv` command.
# mv: can't rename 'data_backup_Tue_Feb_28_05:40:10_UTC_2023': Invalid argument
# It appears because we are moving everything under /consul/data using a wildcard,
# and just means that the data_backup_... directory cannot be moved into itself.
- Finally, delete the temporary POD along with the stuck Consul server replica POD.
- You should see the new Consul server replica in the Running state and the sync of raft logs from the remote kicking off.
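The final deletion and verification steps might look like the following sketch. The stuck replica name consul-server-2 is an assumption; the temporary pod name matches the POD spec example below.

```shell
# Delete the temporary pod and the stuck replica
# (consul-server-2 is an assumed name for the failing replica).
kubectl -n consul delete pod consul-support-temp-pod consul-server-2

# Watch the replacement replica reach the Running state,
# then confirm the raft sync is kicking off in its logs.
kubectl -n consul get pods -w
kubectl -n consul logs consul-server-2 | grep -i raft
```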
Error log example:
2023-02-27T11:32:08.129Z [INFO] agent.server.raft: starting restore from snapshot: id=28805-24946950-1677014351704 last-index=24946950 last-term=28805 size-in-bytes=32778380
2023-02-27T11:32:25.628Z [INFO] agent.server.raft: snapshot restore progress: id=28805-24946950-1677014351704 last-index=24946950 last-term=28805 size-in-bytes=32778380 read-bytes=32778380 percent-complete=100.00%
2023-02-27T11:32:25.648Z [INFO] agent.server.raft: restored from snapshot: id=28805-24946950-1677014351704 last-index=24946950 last-term=28805 size-in-bytes=32778380
2023-02-27T11:32:25.650Z [ERROR] agent.server.raft: failed to get log: index=24946951 error="log not found"
panic: log not found
goroutine 1 [running]:
github.com/hashicorp/raft.NewRaft(0xc000620640, {0x306dec0, 0xc000837b60}, {0x30bc548, 0xc000d4a240}, {0x30a2fc0, 0xc0008909a8}, {0x306e100, 0xc000cdacc0}, {0x30e4910, ...})
/home/runner/go/pkg/mod/github.com/hashicorp/raft@v1.3.6/api.go:568 +0xf56
github.com/hashicorp/consul/agent/consul.(*Server).setupRaft(0xc000daee00)
/home/runner/work/consul/consul/agent/consul/server.go:853 +0x10cc
github.com/hashicorp/consul/agent/consul.NewServer(0xc0007c4900, {{0x3104be0, 0xc000dfc060}, 0xc0009c56b0, 0xc0009c5810, 0xc000b86980, 0xc0006799a0, {0x306c030, 0xc000cda7e0}, {0x30361e0, ...}, ...})
/home/runner/work/consul/consul/agent/consul/server.go:479 +0x1065
github.com/hashicorp/consul/agent.(*Agent).Start(0xc0002afb80, {0x30a29d8, 0xc0006efb80})
/home/runner/work/consul/consul/agent/agent.go:539 +0x6e7
github.com/hashicorp/consul/command/agent.(*cmd).run(0xc0003edc00, {0xc000136060, 0x2, 0x2})
/home/runner/work/consul/consul/command/agent/agent.go:248 +0xc53
github.com/hashicorp/consul/command/agent.(*cmd).Run(0xc0003edc00, {0xc000136060, 0x0, 0x0})
/home/runner/work/consul/consul/command/agent/agent.go:61 +0x27
github.com/mitchellh/cli.(*CLI).Run(0xc0007a0dc0)
/home/runner/go/pkg/mod/github.com/mitchellh/cli@v1.1.0/cli.go:260 +0x5f8
main.realMain()
/home/runner/work/consul/consul/main.go:53 +0x40e
main.main()
/home/runner/work/consul/consul/main.go:23 +0x19
POD spec example:
apiVersion: v1
kind: Pod
metadata:
  name: consul-support-temp-pod
spec:
  volumes:
    - name: consul-pv-storage
      persistentVolumeClaim:
        claimName: <pvc-claim-name-for-the-original-consul-replica-pod>
  containers:
    - name: consul-pv-container
      image: nginx
      command: ["sleep", "3000"]
      volumeMounts:
        - mountPath: "/consul/data"
          name: consul-pv-storage
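Assuming the spec above is saved as consul-support-temp-pod.yaml (an illustrative file name), the temporary pod can be created and entered with:

```shell
# Create the temporary pod that mounts the failing replica's PVC at /consul/data.
kubectl -n consul apply -f consul-support-temp-pod.yaml

# Open a shell inside it to run the data-dir cleanup commands from the procedure.
kubectl -n consul exec -it consul-support-temp-pod -- sh
```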