The information contained in this article has been verified as up-to-date on the date of the original publication of the article. HashiCorp endeavors to keep this information up-to-date and correct, but it makes no representations or warranties of any kind, express or implied, about the ongoing completeness, accuracy, reliability, or suitability of the information provided.
All information contained in this article is for general information purposes only. Any reliance you place on such information as it applies to your use of your HashiCorp product is therefore strictly at your own risk.
Introduction
This article goes over the steps for restoring a snapshot taken from a running Consul cluster into a different cluster. This is useful for running tests in a development or sandbox environment before making any changes in the production environment.
Expected Outcome
Successfully restore a Consul cluster from a different environment.
Prerequisites
- A snapshot backup from the production/existing Consul cluster
- A development/sandbox/newly created Consul cluster (e.g. 3 servers, 2 clients, and no services registered)
- Gossip encryption is enabled
Requirements
- Ensuring that gossip encryption is enabled in production
- Updating the retry_join parameter of the Consul server and client agents
- Ensuring that the gossip key differs from production
If the new environment uses the same gossip key as production, its agents may be able to join the production cluster. It is therefore extremely important to make sure the new environment uses a different gossip key.
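For example, a distinct gossip key for the new environment can be generated with consul keygen and set, together with retry_join, in each agent's configuration. The following is a minimal sketch; <new-gossip-key> is a placeholder, and the retry_join addresses are the development servers used later in this article:

# Generate a distinct gossip key for the new environment
$ consul keygen
<new-gossip-key>

# Agent configuration fragment (JSON) applying the key and the join addresses
{
  "encrypt": "<new-gossip-key>",
  "retry_join": ["172.31.25.146", "172.31.17.109", "172.31.31.242"]
}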
Procedure
Setup Example
- Details in the example Consul snapshot backup file backup.snap taken from the production cluster:
  - ACL tokens: 1 for servers, 2 for clients, 2 for services, plus the bootstrap and anonymous tokens
  - 2 registered services with their sidecars
  - Service intentions: 1 for deny-all, 1 for allowing the frontend service to reach the backend service
  - KVs with the prefix leaderboard/scores
  - Gossip key: G3wLAZ3lVcB6JzZI16SbaUBzEG9PoEr6iv5PTitTIg0=
- The newly created Consul cluster (e.g. development cluster):
  - consul members

Node      Address             Status  Type    Build   Protocol  DC   Partition  Segment
server-1  172.31.25.146:8301  alive   server  1.13.2  2         dc1  default    <all>
server-2  172.31.17.109:8301  alive   server  1.13.2  2         dc1  default    <all>
server-3  172.31.31.242:8301  alive   server  1.13.2  2         dc1  default    <all>
backend   172.31.18.195:8301  alive   client  1.13.2  2         dc1  default    <default>
frontend  172.31.27.252:8301  alive   client  1.13.2  2         dc1  default    <default>

  - consul operator raft list-peers

Node      ID                                    Address             State     Voter  RaftProtocol
server-1  17b9a2fa-a5dc-6bdd-61e1-54373dc9144d  172.31.25.146:8300  leader    true   3
server-2  33ce1c45-7764-6b13-1c06-9d3b9e4051f4  172.31.17.109:8300  follower  true   3
server-3  d9d1f1c2-56cb-6b6d-851e-106c328e7980  172.31.31.242:8300  follower  true   3

  - Gossip key in use (cat /opt/consul/serf/local.keyring):

["CbvxLi23GyFRtqTCTAweUMDl9QzTcqrxL5/NKdSSI7I="]

  - Tokens and policy list (5 tokens)
  - No services are registered.
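For reference, the snapshot used here can be created and examined on the production cluster with the built-in snapshot commands; a minimal sketch, assuming a token with sufficient privileges is set in CONSUL_HTTP_TOKEN:

# On the production cluster: save an atomic, point-in-time snapshot
$ consul snapshot save backup.snap

# Optional: inspect the snapshot metadata before copying it to the new cluster
$ consul snapshot inspect backup.snap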
Restoration
- Restore the snapshot backup to the new Consul cluster.
  - Run consul snapshot restore backup.snap to restore the snapshot:

$ consul snapshot restore backup.snap
Restored snapshot
- Running consul members fails with the error '403 (ACL not found)':

$ consul members
Error retrieving members: Unexpected response code: 403 (ACL not found)

The same error message appears in the agent logs:

[ERROR] agent: Coordinate update error: error="ACL not found"
- Checking the logs for any other issues shows that Consul has failed to insert nodes, because the node names are reserved by nodes with different node IDs:
[INFO] agent.server: member joined, marking health alive: member=server-3 partition=default
[WARN] agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "b45d787b-5093-99e7-e46e-c764aa20fc24": Node name server-3 is reserved by node dc2dd31a-c219-e10d-a723-a1ad9dece3d3 with name server-3 (172.31.9.223)"
[ERROR] agent.server: failed to reconcile member: member="{server-3 172.31.31.242 8301 map[acls:1 build:1.13.2:0e046bbb dc:dc1 expect:3 ft_fs:1 ft_si:1 grpc_port:8502 id:b45d787b-5093-99e7-e46e-c764aa20fc24 port:8300 raft_vsn:3 role:consul segment: use_tls:1 vsn:2 vsn_max:3 vsn_min:2 wan_join_port:8302] alive 1 5 2 2 5 4}" partition=default error="failed inserting node: Error while renaming Node ID: "b45d787b-5093-99e7-e46e-c764aa20fc24": Node name server-3 is reserved by node dc2dd31a-c219-e10d-a723-a1ad9dece3d3 with name server-3 (172.31.9.223)"
[INFO] agent.server: member joined, marking health alive: member=server-1 partition=default
[WARN] agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "cae400d4-83ff-8677-b5e6-46100ad5e88f": Node name server-1 is reserved by node bc53941f-ab68-0a85-6e26-493d3516fcd2 with name server-1 (172.31.14.227)"
[ERROR] agent.server: failed to reconcile member: member="{server-1 172.31.25.146 8301 map[acls:1 build:1.13.2:0e046bbb dc:dc1 expect:3 ft_fs:1 ft_si:1 grpc_port:8502 id:cae400d4-83ff-8677-b5e6-46100ad5e88f port:8300 raft_vsn:3 role:consul segment: use_tls:1 vsn:2 vsn_max:3 vsn_min:2 wan_join_port:8302] alive 1 5 2 2 5 4}" partition=default error="failed inserting node: Error while renaming Node ID: "cae400d4-83ff-8677-b5e6-46100ad5e88f": Node name server-1 is reserved by node bc53941f-ab68-0a85-6e26-493d3516fcd2 with name server-1 (172.31.14.227)"
[INFO] agent.server: member joined, marking health alive: member=server-2 partition=default
[WARN] agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "c7a5f070-88ec-118d-37ea-06c6a5fd1112": Node name server-2 is reserved by node 0ae9f430-01e3-be2d-deb8-a0b08dc873df with name server-2 (172.31.13.129)"
[ERROR] agent.server: failed to reconcile member: member="{server-2 172.31.17.109 8301 map[acls:1 build:1.13.2:0e046bbb dc:dc1 expect:3 ft_fs:1 ft_si:1 grpc_port:8502 id:c7a5f070-88ec-118d-37ea-06c6a5fd1112 port:8300 raft_vsn:3 role:consul segment: use_tls:1 vsn:2 vsn_max:3 vsn_min:2 wan_join_port:8302] alive 1 5 2 2 5 4}" partition=default error="failed inserting node: Error while renaming Node ID: "c7a5f070-88ec-118d-37ea-06c6a5fd1112": Node name server-2 is reserved by node 0ae9f430-01e3-be2d-deb8-a0b08dc873df with name server-2 (172.31.13.129)"
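To see which node ID currently holds a given node name, the node ID stored on the local agent can be compared with what the restored catalog reports. A minimal sketch, assuming the data directory used in this example (/opt/consul) and a valid management token (set in the next step):

# Node ID generated by the local agent (data_dir assumed to be /opt/consul)
$ cat /opt/consul/node-id

# Node IDs and addresses known to the restored catalog
$ consul catalog nodes -detailed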
Fix the issues
Now let’s fix the issues.
- Fix the ACL not found issue
  - Set the CONSUL_HTTP_TOKEN environment variable using the bootstrap (global management) token from the production cluster:

$ export CONSUL_HTTP_TOKEN=f064f480-2ee8-5eff-355f-fcdd3227cfc2

  - Run both consul members and consul operator raft list-peers to verify the fix:
$ consul members
Node      Address             Status  Type    Build   Protocol  DC   Partition  Segment
server-1  172.31.25.146:8301  alive   server  1.13.2  2         dc1  default    <all>
server-2  172.31.17.109:8301  alive   server  1.13.2  2         dc1  default    <all>
server-3  172.31.31.242:8301  alive   server  1.13.2  2         dc1  default    <all>
backend   172.31.18.195:8301  alive   client  1.13.2  2         dc1  default    <default>
frontend  172.31.27.252:8301  alive   client  1.13.2  2         dc1  default    <default>

$ consul operator raft list-peers
Node      ID                                    Address             State     Voter  RaftProtocol
server-1  cae400d4-83ff-8677-b5e6-46100ad5e88f  172.31.25.146:8300  leader    true   3
server-3  b45d787b-5093-99e7-e46e-c764aa20fc24  172.31.31.242:8300  follower  true   3
server-2  c7a5f070-88ec-118d-37ea-06c6a5fd1112  172.31.17.109:8300  follower  true   3

  - List all the tokens and verify that they were all replaced by the tokens from the production cluster (i.e. from backup.snap):
AccessorID: dc6ddd83-0cd3-59f5-ff99-f22329b16f55
SecretID: 2cbb0af4-792e-8423-07ca-1823e495f862
Description: Frontend Service Token
...
Service Identities:
frontend (Datacenters: all)
AccessorID: 4c93eaf7-4994-f2d6-298c-1e4e64dde3d9
SecretID: 77ef273a-fd6c-1b32-be7a-33f91879335c
Description: Backend Service Token
...
Service Identities:
backend (Datacenters: all)
AccessorID: 17373374-83f3-2716-76f2-122c07040025
SecretID: 81ae1a54-04ad-d0d4-a969-7f5d40b09b7b
Description: Client Backend Token
...
Policies:
61682efd-f2f7-94d0-e37d-6780b27330e3 - backend
AccessorID: 4905ffd0-e9ef-e036-b4cd-21d23900a870
SecretID: a7123b7c-f845-f554-cd9a-a5954f83affe
Description: Client Frontend Token
...
Policies:
62465f6e-ba23-684f-47e3-3f8ac509847b - frontend
AccessorID: 00000000-0000-0000-0000-000000000002
SecretID: anonymous
Description: Anonymous Token
...
AccessorID: c49fa656-f2ea-45ab-7da3-9f63200b002e
SecretID: dcff1c86-197b-be45-d76b-81d04abfbc89
Description: Server Agent Token
...
Policies:
77d6ad3d-aa41-8cf5-774b-d69a033fdd9b - server-policy
AccessorID: 1fa43ad1-a3a8-0f8a-7822-6148ff3f0ae6
SecretID: f064f480-2ee8-5eff-355f-fcdd3227cfc2
Description: Bootstrap Token (Global Management)
...
Policies:
00000000-0000-0000-0000-000000000001 - global-management
- Fix the issue of node names reserved by other node IDs.
- Go to each node and update the stored ACL tokens in the agent configuration files with the associated SecretID from the restored token list:
{
"acl": {
"enabled": true,
"default_policy": "deny",
"down_policy": "extend-cache",
"enable_token_persistence": true,
"tokens": {
"initial_management": "f064f480-2ee8-5eff-355f-fcdd3227cfc2",
"agent": "dcff1c86-197b-be45-d76b-81d04abfbc89"
}
}
}
- Restart the Consul service and verify that the issue is fixed.
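A minimal sketch of this step, assuming the agents run as a systemd service named consul (adjust to your process manager). Because enable_token_persistence is true, the agent token can alternatively be updated over the CLI instead of editing the config file:

# Alternative to editing the config file: set the agent token via the CLI (persisted by the agent)
$ consul acl set-agent-token agent "dcff1c86-197b-be45-d76b-81d04abfbc89"

# Restart the agent and confirm the node-name/node-ID conflict is gone
$ sudo systemctl restart consul
$ consul members
$ journalctl -u consul | grep -i "failed inserting node"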
Other Restored Items
Given that the ACL tokens and policies have been replicated to the new Consul cluster, the following items are also restored:
- Services and service intentions are restored.

$ consul catalog services
backend
backend-sidecar-proxy
consul
frontend
frontend-sidecar-proxy

$ consul intention list
ID                                    Source    Action  Destination  Precedence
4cf00618-c844-9005-0397-a051bd2ea662  frontend  allow   backend      9
                                      *         deny    *            5

- The KV store is also restored.
$ curl -s --header "X-Consul-Token: f064f480-2ee8-5eff-355f-fcdd3227cfc2" http://127.0.0.1:8500/v1/kv/?recurse | jq
[
{
"LockIndex": 0,
"Key": "leaderboard/scores",
"Flags": 0,
"Value": "ewogICJ1c2VyLWEiOiAxMDAsCiAgInVzZXItYiI6IDI1MCwKICAidXNlci1jIjogNzUKfQo=",
"CreateIndex": 8669,
"ModifyIndex": 8669
}
]

- Given that the new Consul cluster has no services registered and ACLs are enabled, the registered services on the client nodes will be deregistered after the client agent tokens are updated.
$ consul catalog services
The reason is that consul snapshot restore overwrites the contents of the raft.db file with the contents of the snapshot file. If there were services registered in the production cluster, they will appear when running consul catalog services right after the restoration. We still need to go to the client nodes in the new cluster and recreate/update the service definitions to re-register them (see the sketch after this list).
- In the scenario where the new Consul cluster does not have any client agents, the client nodes will be deregistered right after the restoration. Any registered services will also be deregistered.

agent.server: deregistering member: member=backend partition=default reason=reaped
agent.server: deregistering member: member=frontend partition=default reason=reaped

$ consul catalog services
consul
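As referenced above, re-registering a service on a client node can be done with consul services register. A minimal sketch; the file name, port, and definition below are hypothetical placeholders and should be replaced with the definitions used in the source cluster (the token is the Frontend Service Token SecretID restored from the snapshot):

$ cat frontend.json
{
  "service": {
    "name": "frontend",
    "port": 8080,
    "token": "2cbb0af4-792e-8423-07ca-1823e495f862",
    "connect": { "sidecar_service": {} }
  }
}

$ consul services register frontend.json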
Conclusion
As shown above, it is feasible to restore a Consul cluster from one datacenter to another. Everything stored in the snapshot is replicated to the new cluster when running consul snapshot restore. If ACLs are enabled in the source cluster, all the ACL tokens used in the config files for agents and services must be updated manually as well.
Also note that consul snapshot restore restores an atomic, point-in-time snapshot of the state of the Consul servers, which includes KV entries, the service catalog, prepared queries, sessions, and ACLs. Gossip keyrings and TLS certificates are not part of the snapshot and are therefore not restored. If you intend to replicate them, they can be added to the new cluster at any time.
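For example, gossip keyring entries can be distributed and rotated on the running cluster with the consul keyring command; a minimal sketch, where <new-key> and <old-key> are placeholders:

# List the keys currently installed in the cluster keyring
$ consul keyring -list

# Install an additional key, promote it to the primary encryption key, then remove the old one
$ consul keyring -install="<new-key>"
$ consul keyring -use="<new-key>"
$ consul keyring -remove="<old-key>"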
Additional Information
- Consul snapshot commands
- Consul snapshot restore command
- Consul ACL token commands
- Consul Raft protocol overview
- Consul disaster recovery for federated primary datacenter