The information contained in this article has been verified as up-to-date on the date of the original publication of the article. HashiCorp endeavors to keep this information up-to-date and correct, but it makes no representations or warranties of any kind, express or implied, about the ongoing completeness, accuracy, reliability, or suitability of the information provided.
All information contained in this article is for general information purposes only. Any reliance you place on such information as it applies to your use of your HashiCorp product is therefore strictly at your own risk.
Introduction
This article goes over the steps for restoring a snapshot taken from a running Consul cluster into a different cluster. This is useful for running tests in a development or sandbox environment before making any changes in the production environment.
Expected Outcome
Successfully restore a Consul cluster from a different environment.
Prerequisites
- A snapshot backup from the production/existing Consul cluster
- A development/sandbox/newly created Consul cluster (e.g. 3 servers, 2 clients, and no services registered)
- Gossip encryption is enabled
Requirements
- Ensuring that gossip encryption is enabled in production
- Updating the retry_join parameter of the Consul server and client agents
- Ensuring that the gossip key differs from production
If the new environment uses the same gossip key as production, its agents may be able to join the production cluster. It is therefore extremely important to make sure the new environment uses a different gossip key.
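For example, a distinct gossip key for the new environment can be generated with consul keygen and set, together with retry_join, in each agent's configuration. The following is a minimal sketch; <new-gossip-key> is a placeholder, and the retry_join addresses are the development servers used later in this article:

# Generate a distinct gossip key for the new environment
$ consul keygen
<new-gossip-key>

# Agent configuration fragment (JSON) applying the key and the join addresses
{
  "encrypt": "<new-gossip-key>",
  "retry_join": ["172.31.25.146", "172.31.17.109", "172.31.31.242"]
}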
Procedure
Setup Example
- Details in the example Consul snapshot backup file backup.snap taken from the production cluster:
  - ACL tokens: 1 for servers, 2 for clients, 2 for services, plus the bootstrap and anonymous tokens
  - 2 registered services with their sidecars
  - Service intentions: 1 for deny-all, 1 for allowing the frontend service to reach the backend service
  - KVs with the prefix leaderboard/scores
  - Gossip key: G3wLAZ3lVcB6JzZI16SbaUBzEG9PoEr6iv5PTitTIg0=
- The newly created Consul cluster (e.g. development cluster):
  - consul members

Node      Address             Status  Type    Build   Protocol  DC   Partition  Segment
server-1  172.31.25.146:8301  alive   server  1.13.2  2         dc1  default    <all>
server-2  172.31.17.109:8301  alive   server  1.13.2  2         dc1  default    <all>
server-3  172.31.31.242:8301  alive   server  1.13.2  2         dc1  default    <all>
backend   172.31.18.195:8301  alive   client  1.13.2  2         dc1  default    <default>
frontend  172.31.27.252:8301  alive   client  1.13.2  2         dc1  default    <default>

  - consul operator raft list-peers

Node      ID                                    Address             State     Voter  RaftProtocol
server-1  17b9a2fa-a5dc-6bdd-61e1-54373dc9144d  172.31.25.146:8300  leader    true   3
server-2  33ce1c45-7764-6b13-1c06-9d3b9e4051f4  172.31.17.109:8300  follower  true   3
server-3  d9d1f1c2-56cb-6b6d-851e-106c328e7980  172.31.31.242:8300  follower  true   3

  - Gossip key in use (cat /opt/consul/serf/local.keyring):

["CbvxLi23GyFRtqTCTAweUMDl9QzTcqrxL5/NKdSSI7I="]

  - Tokens and policy list (5 tokens)
  - No services are registered.
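For reference, the snapshot used here can be created and examined on the production cluster with the built-in snapshot commands; a minimal sketch, assuming a token with sufficient privileges is set in CONSUL_HTTP_TOKEN:

# On the production cluster: save an atomic, point-in-time snapshot
$ consul snapshot save backup.snap

# Optional: inspect the snapshot metadata before copying it to the new cluster
$ consul snapshot inspect backup.snap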
Restoration
- Restore the snapshot backup to the new Consul cluster.
  - Run consul snapshot restore backup.snap to restore the snapshot:

$ consul snapshot restore backup.snap
Restored snapshot
- Running consul members fails with the error '403 (ACL not found)':

$ consul members
Error retrieving members: Unexpected response code: 403 (ACL not found)

The same error message appears in the agent logs:

[ERROR] agent: Coordinate update error: error="ACL not found"
- Checking the logs for any other issues shows that Consul has failed to insert nodes, because the node names are reserved by nodes with different node IDs:
[INFO] agent.server: member joined, marking health alive: member=server-3 partition=default
[WARN] agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "b45d787b-5093-99e7-e46e-c764aa20fc24": Node name server-3 is reserved by node dc2dd31a-c219-e10d-a723-a1ad9dece3d3 with name server-3 (172.31.9.223)"
[ERROR] agent.server: failed to reconcile member: member="{server-3 172.31.31.242 8301 map[acls:1 build:1.13.2:0e046bbb dc:dc1 expect:3 ft_fs:1 ft_si:1 grpc_port:8502 id:b45d787b-5093-99e7-e46e-c764aa20fc24 port:8300 raft_vsn:3 role:consul segment: use_tls:1 vsn:2 vsn_max:3 vsn_min:2 wan_join_port:8302] alive 1 5 2 2 5 4}" partition=default error="failed inserting node: Error while renaming Node ID: "b45d787b-5093-99e7-e46e-c764aa20fc24": Node name server-3 is reserved by node dc2dd31a-c219-e10d-a723-a1ad9dece3d3 with name server-3 (172.31.9.223)"
[INFO] agent.server: member joined, marking health alive: member=server-1 partition=default
[WARN] agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "cae400d4-83ff-8677-b5e6-46100ad5e88f": Node name server-1 is reserved by node bc53941f-ab68-0a85-6e26-493d3516fcd2 with name server-1 (172.31.14.227)"
[ERROR] agent.server: failed to reconcile member: member="{server-1 172.31.25.146 8301 map[acls:1 build:1.13.2:0e046bbb dc:dc1 expect:3 ft_fs:1 ft_si:1 grpc_port:8502 id:cae400d4-83ff-8677-b5e6-46100ad5e88f port:8300 raft_vsn:3 role:consul segment: use_tls:1 vsn:2 vsn_max:3 vsn_min:2 wan_join_port:8302] alive 1 5 2 2 5 4}" partition=default error="failed inserting node: Error while renaming Node ID: "cae400d4-83ff-8677-b5e6-46100ad5e88f": Node name server-1 is reserved by node bc53941f-ab68-0a85-6e26-493d3516fcd2 with name server-1 (172.31.14.227)"
[INFO] agent.server: member joined, marking health alive: member=server-2 partition=default
[WARN] agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "c7a5f070-88ec-118d-37ea-06c6a5fd1112": Node name server-2 is reserved by node 0ae9f430-01e3-be2d-deb8-a0b08dc873df with name server-2 (172.31.13.129)"
[ERROR] agent.server: failed to reconcile member: member="{server-2 172.31.17.109 8301 map[acls:1 build:1.13.2:0e046bbb dc:dc1 expect:3 ft_fs:1 ft_si:1 grpc_port:8502 id:c7a5f070-88ec-118d-37ea-06c6a5fd1112 port:8300 raft_vsn:3 role:consul segment: use_tls:1 vsn:2 vsn_max:3 vsn_min:2 wan_join_port:8302] alive 1 5 2 2 5 4}" partition=default error="failed inserting node: Error while renaming Node ID: "c7a5f070-88ec-118d-37ea-06c6a5fd1112": Node name server-2 is reserved by node 0ae9f430-01e3-be2d-deb8-a0b08dc873df with name server-2 (172.31.13.129)"
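To see which node ID currently holds a given node name, the node ID stored on the local agent can be compared with what the restored catalog reports. A minimal sketch, assuming the data directory used in this example (/opt/consul) and a valid management token (set in the next step):

# Node ID generated by the local agent (data_dir assumed to be /opt/consul)
$ cat /opt/consul/node-id

# Node IDs and addresses known to the restored catalog
$ consul catalog nodes -detailed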
Fix the issues
Now let’s fix the issues.
- Fix the ACL not found issue
  - Set the CONSUL_HTTP_TOKEN environment variable using the bootstrap (global management) token from the production cluster:

$ export CONSUL_HTTP_TOKEN=f064f480-2ee8-5eff-355f-fcdd3227cfc2

  - Run both consul members and consul operator raft list-peers to verify the fix:
$ consul members
Node      Address             Status  Type    Build   Protocol  DC   Partition  Segment
server-1  172.31.25.146:8301  alive   server  1.13.2  2         dc1  default    <all>
server-2  172.31.17.109:8301  alive   server  1.13.2  2         dc1  default    <all>
server-3  172.31.31.242:8301  alive   server  1.13.2  2         dc1  default    <all>
backend   172.31.18.195:8301  alive   client  1.13.2  2         dc1  default    <default>
frontend  172.31.27.252:8301  alive   client  1.13.2  2         dc1  default    <default>

$ consul operator raft list-peers
Node      ID                                    Address             State     Voter  RaftProtocol
server-1  cae400d4-83ff-8677-b5e6-46100ad5e88f  172.31.25.146:8300  leader    true   3
server-3  b45d787b-5093-99e7-e46e-c764aa20fc24  172.31.31.242:8300  follower  true   3
server-2  c7a5f070-88ec-118d-37ea-06c6a5fd1112  172.31.17.109:8300  follower  true   3

  - List all the tokens and verify that they were all replaced by the tokens from the production cluster (i.e. from backup.snap):
AccessorID: dc6ddd83-0cd3-59f5-ff99-f22329b16f55
SecretID: 2cbb0af4-792e-8423-07ca-1823e495f862
Description: Frontend Service Token
...
Service Identities:
frontend (Datacenters: all)
AccessorID: 4c93eaf7-4994-f2d6-298c-1e4e64dde3d9
SecretID: 77ef273a-fd6c-1b32-be7a-33f91879335c
Description: Backend Service Token
...
Service Identities:
backend (Datacenters: all)
AccessorID: 17373374-83f3-2716-76f2-122c07040025
SecretID: 81ae1a54-04ad-d0d4-a969-7f5d40b09b7b
Description: Client Backend Token
...
Policies:
61682efd-f2f7-94d0-e37d-6780b27330e3 - backend
AccessorID: 4905ffd0-e9ef-e036-b4cd-21d23900a870
SecretID: a7123b7c-f845-f554-cd9a-a5954f83affe
Description: Client Frontend Token
...
Policies:
62465f6e-ba23-684f-47e3-3f8ac509847b - frontend
AccessorID: 00000000-0000-0000-0000-000000000002
SecretID: anonymous
Description: Anonymous Token
...
AccessorID: c49fa656-f2ea-45ab-7da3-9f63200b002e
SecretID: dcff1c86-197b-be45-d76b-81d04abfbc89
Description: Server Agent Token
...
Policies:
77d6ad3d-aa41-8cf5-774b-d69a033fdd9b - server-policy
AccessorID: 1fa43ad1-a3a8-0f8a-7822-6148ff3f0ae6
SecretID: f064f480-2ee8-5eff-355f-fcdd3227cfc2
Description: Bootstrap Token (Global Management)
...
Policies:
00000000-0000-0000-0000-000000000001 - global-management
- Fix the issue of node names reserved by other node IDs.
- Go to each node and update the stored ACL tokens in the agent configuration files with the associated SecretID from the restored token list:
{
"acl": {
"enabled": true,
"default_policy": "deny",
"down_policy": "extend-cache",
"enable_token_persistence": true,
"tokens": {
"initial_management": "f064f480-2ee8-5eff-355f-fcdd3227cfc2",
"agent": "dcff1c86-197b-be45-d76b-81d04abfbc89"
}
}
}
- Restart the Consul service and verify that the issue is fixed.
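A minimal sketch of this step, assuming the agents run as a systemd service named consul (adjust to your process manager). Because enable_token_persistence is true, the agent token can alternatively be updated over the CLI instead of editing the config file:

# Alternative to editing the config file: set the agent token via the CLI (persisted by the agent)
$ consul acl set-agent-token agent "dcff1c86-197b-be45-d76b-81d04abfbc89"

# Restart the agent and confirm the node-name/node-ID conflict is gone
$ sudo systemctl restart consul
$ consul members
$ journalctl -u consul | grep -i "failed inserting node"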
Other Restored Items
Given that the ACL tokens and policies have been replicated to the new Consul cluster, the following items are also restored:
- Services and service intentions are restored.

$ consul catalog services
backend
backend-sidecar-proxy
consul
frontend
frontend-sidecar-proxy

$ consul intention list
ID                                    Source    Action  Destination  Precedence
4cf00618-c844-9005-0397-a051bd2ea662  frontend  allow   backend      9
                                      *         deny    *            5

- The KV store is also restored.
$ curl -s --header "X-Consul-Token: f064f480-2ee8-5eff-355f-fcdd3227cfc2" http://127.0.0.1:8500/v1/kv/?recurse | jq
[
{
"LockIndex": 0,
"Key": "leaderboard/scores",
"Flags": 0,
"Value": "ewogICJ1c2VyLWEiOiAxMDAsCiAgInVzZXItYiI6IDI1MCwKICAidXNlci1jIjogNzUKfQo=",
"CreateIndex": 8669,
"ModifyIndex": 8669
}
]

- Given that the new Consul cluster has no services registered and ACLs are enabled, the registered services on the client nodes will be deregistered after the client agent tokens are updated.
$ consul catalog services
The reason is that consul snapshot restore overwrites the contents of the raft.db file with the contents of the snapshot file. If there were services registered in the production cluster, they will appear when running consul catalog services right after the restoration. We still need to go to the client nodes in the new cluster and recreate/update the service definitions to re-register them (see the sketch after this list).
- In the scenario where the new Consul cluster does not have any client agents, the client nodes will be deregistered right after the restoration. Any registered services will also be deregistered.

agent.server: deregistering member: member=backend partition=default reason=reaped
agent.server: deregistering member: member=frontend partition=default reason=reaped

$ consul catalog services
consul
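As referenced above, re-registering a service on a client node can be done with consul services register. A minimal sketch; the file name, port, and definition below are hypothetical placeholders and should be replaced with the definitions used in the source cluster (the token is the Frontend Service Token SecretID restored from the snapshot):

$ cat frontend.json
{
  "service": {
    "name": "frontend",
    "port": 8080,
    "token": "2cbb0af4-792e-8423-07ca-1823e495f862",
    "connect": { "sidecar_service": {} }
  }
}

$ consul services register frontend.json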
Conclusion
As shown above, it is feasible to restore a Consul cluster from one datacenter to another. Everything stored in the snapshot is replicated to the new cluster when running consul snapshot restore. If ACLs are enabled in the source cluster, all the ACL tokens used in the config files for agents and services must be updated manually as well.
Also note that consul snapshot restore restores an atomic, point-in-time snapshot of the state of the Consul servers, which includes KV entries, the service catalog, prepared queries, sessions, and ACLs. Gossip keyrings and TLS certificates are not part of the snapshot and are therefore not restored. If you intend to replicate them, they can be added to the new cluster at any time.
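For example, gossip keyring entries can be distributed and rotated on the running cluster with the consul keyring command; a minimal sketch, where <new-key> and <old-key> are placeholders:

# List the keys currently installed in the cluster keyring
$ consul keyring -list

# Install an additional key, promote it to the primary encryption key, then remove the old one
$ consul keyring -install="<new-key>"
$ consul keyring -use="<new-key>"
$ consul keyring -remove="<old-key>"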
Additional Information
- Consul snapshot commands
- Consul snapshot restore command
- Consul ACL token commands
- Consul Raft protocol overview
- Consul disaster recovery for federated primary datacenter