The information contained in this article has been verified as up-to-date on the date of the original publication of the article. HashiCorp endeavors to keep this information up-to-date and correct, but it makes no representations or warranties of any kind, express or implied, about the ongoing completeness, accuracy, reliability, or suitability of the information provided.
All information contained in this article is for general information purposes only. Any reliance you place on such information as it applies to your use of your HashiCorp product is therefore strictly at your own risk.
Overview of the issue
If any version of Consul released before Dec 13, 2022 is using Vault 1.11.0+ as Consul’s Connect CA provider, Consul control plane or service mesh communication will break at some point. The intermediate CA will become unable to issue the leaf certificates needed by:
- Service mesh: Services in the mesh to communicate with mTLS
- All use cases: Consul client agents if using auto-encrypt or auto-config, and using TLS to communicate with Consul server agents
You are using the Vault CA provider if either of the following configurations exists:
- The Consul server agent configuration option connect.ca_provider is set to “vault”, or
- The Consul on Kubernetes Helm Chart global.secretsBackend.vault.connectCA value is configured.
If you meet the conditions above, take the action recommended herein to avoid the onset of communication failure.
Underlying cause & onset of failure
Consul attempts to rotate the intermediate CA certificate when it reaches 50% of the lifetime defined by IntermediateCertTTL, which defaults to 1 year. When using Vault 1.11+ as a Connect CA provider, the intermediate CA rotation will fail in such a way that:
- The size of new leaf certificates will increase as Consul reattempts intermediate CA rotation every hour, eventually causing TLS handshake failures. This failure mode may occur days after 50% of IntermediateCertTTL is reached.
- The old intermediate CA will remain the issuer of leaf certificates. As expiration approaches, it becomes unable to issue new leaf certificates. This failure mode will occur days before IntermediateCertTTL is reached.
You may need to act immediately. Assuming default settings, intermediate CA rotation is attempted every 6 months (50% of 1 year). Without knowing when the last successful rotation happened, the next attempted rotation could occur anytime between now and 6 months from now. If using Vault 1.11+ as Consul's Connect CA provider, the next attempted rotation will fail unless you intervene using the recommended workaround.
Consul is only affected if Vault’s version is 1.11.0 or later. In Vault 1.11.0, the PKI secrets engine’s API was modified in a way that is backward-compatible for single endpoint calls, but a breaking change for a multi-endpoint workflow used by Consul. For more details, refer to the recent Vault PR that allows Vault operators to opt into the previous multi-endpoint behavior of the PKI secrets engine. That Vault PR is not a substitute for upgrading Consul once a Consul fix is available.
Action required: upgrade Consul or apply workaround
You must take one of the following actions to avoid the onset of communication failure:
- Preferred: Upgrade Consul to a fixed version
- Operationalize the provided workaround if you cannot upgrade Consul yet
- a) For Vault Versions 1.13.0, 1.12.2, 1.11.6, follow the steps to set the Vault `default_follows_latest_issuer` parameter to true
- b) For Vault Versions 1.11.0 - 1.11.5 or Vault 1.12.0 - 1.12.1, follow the Consul manual workaround
If an intermediate CA rotation may occur before you can complete the upgrade, you must apply the provided workaround in the interim.
Option 1: Upgrade Consul to a fixed version
Compatibility fixes were made available on the Consul 1.12 - 1.14 release branches on Dec 13, 2022. To avoid the onset of failure, you must upgrade Consul to a version listed below before an intermediate CA rotation is attempted:
| Release series | Versions with the compatibility fix |
|---|---|
| 1.12.x | 1.12.8+ |
| 1.13.x | 1.13.5+ |
| 1.14.x | 1.14.3+ |
| 1.15.x and beyond | All versions 1.15.0+ |
If an intermediate CA rotation might occur before the upgrade is completed, you must apply the provided workaround in the meantime.
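The version table above can be encoded as a small check, for example to audit many clusters. This is a minimal sketch, not an official tool: the function name `consul_has_fix` is hypothetical, release series older than 1.12 are treated as unfixed because the table lists no fixed version for them, and suffixes such as `+ent` or `-rc1` are ignored.

```shell
#!/bin/sh
# Sketch only: consul_has_fix is a hypothetical helper, not part of Consul.
# It succeeds (exit 0) when the given version is listed in the table above
# as carrying the compatibility fix.
consul_has_fix() (
  ver=${1#v}            # tolerate a leading "v", as in "v1.14.3"
  ver=${ver%%[+-]*}     # drop suffixes such as "+ent" or "-rc1" (assumption)
  major=${ver%%.*}
  rest=${ver#*.}
  minor=${rest%%.*}
  case $rest in
    *.*) patch=${rest#*.}; patch=${patch%%.*} ;;
    *)   patch=0 ;;
  esac
  if [ "$major" -gt 1 ]; then exit 0; fi
  if [ "$major" -lt 1 ]; then exit 1; fi
  case $minor in
    12) [ "$patch" -ge 8 ] ;;   # 1.12.8+
    13) [ "$patch" -ge 5 ] ;;   # 1.13.5+
    14) [ "$patch" -ge 3 ] ;;   # 1.14.3+
    *)  [ "$minor" -ge 15 ] ;;  # 1.15.0+; anything below 1.12 is unfixed
  esac
)
```

On a server agent, one possible use is `consul_has_fix "$(consul version | awk 'NR==1 {print $2}')" && echo "fixed version"`, assuming the first line of `consul version` has the form `Consul v1.14.3`.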
We generally recommend that an upgrade involves moving both server agents and client agents to the new version, as described in Consul’s upgrade documentation. However, this compatibility fix is only necessary on the server agents. For expeditiousness, it is acceptable to upgrade just the server agents for now, and to temporarily defer upgrading the client agents. If upgrading by more than one release series, consult the large version jump documentation.
Option 2: Workaround if you cannot upgrade Consul yet
a) For Vault Versions 1.13.0, 1.12.2, 1.11.6, set the Vault `default_follows_latest_issuer` parameter to true
1. Retrieve the most recent issuer ID by running the command below:
$ for issuer in $(/opt/vault/bin/vault list <insert-your-intermediate-pki-path>/issuers | grep '^[0-9a-f]'); do echo "$(/opt/vault/bin/vault read -field=certificate <insert-your-intermediate-pki-path>/issuer/${issuer}/json | openssl x509 -noout -enddate) ${issuer}"; done | sort | tail -n 1
2. Create a .json file with the default issuer ID and the "default_follows_latest_issuer" parameter, where ${inter_id} holds the issuer ID retrieved in step 1:
( cat <<-EOF
{
  "default": "${inter_id}",
  "default_follows_latest_issuer": "true"
}
EOF
) > intermediate-default-issuer.json
3. Set the default issuer to the most recent one and also set the "default_follows_latest_issuer" parameter to "true"
$ curl \
--header "X-Vault-Token: ${vault_token}" \
--request POST \
--data @intermediate-default-issuer.json \
${VAULT_ADDR}/v1/${intermediate_path}/config/issuers
4. Restart the Vault agents.
5. Restart the Consul servers in a rolling fashion.
6. Run the command below and confirm that default_follows_latest_issuer is true and that default: <id> still matches the default issuer set in step 3:
$ /opt/vault/bin/vault read <insert-your-intermediate-pki-path>/config/issuers
This option can be applied before Consul reaches 50% of the intermediate CA (ICA) lifetime and attempts a renewal. That renewal does not succeed unless the default issuer is updated to the latest issuer, either manually or through this new Vault setting.
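Steps 1 and 2 above can also be sketched as small shell functions. One caveat motivates this: sorting the textual `notAfter=` lines orders them by month name, which may not be chronological when the certificates expire in different months, so this sketch compares expiry times as epoch seconds instead. The function names are hypothetical, and `list_issuer_expiries` assumes GNU `date` (for `date -d`) plus the `vault` and `openssl` CLIs on the PATH.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, not part of Vault.

# Read "<epoch-seconds> <issuer-id>" lines on stdin and print the issuer id
# whose certificate expires last.
pick_latest_issuer() {
  sort -n | tail -n 1 | cut -d ' ' -f 2
}

# Print the JSON body from step 2 for the given issuer id.
issuer_config_json() {
  printf '{"default": "%s", "default_follows_latest_issuer": "true"}\n' "$1"
}

# Print "<epoch-seconds> <issuer-id>" for every issuer under a PKI mount.
# Assumes GNU date and that vault is already authenticated.
list_issuer_expiries() {
  pki_path=$1
  for issuer in $(vault list "$pki_path/issuers" | grep '^[0-9a-f]'); do
    end=$(vault read -field=certificate "$pki_path/issuer/$issuer/json" |
      openssl x509 -noout -enddate | cut -d= -f2)
    echo "$(date -d "$end" +%s) $issuer"
  done
}
```

A possible invocation, with the mount path as a placeholder: `inter_id=$(list_issuer_expiries <insert-your-intermediate-pki-path> | pick_latest_issuer)` followed by `issuer_config_json "$inter_id" > intermediate-default-issuer.json`.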
b) For Vault Versions 1.11.0 - 1.11.5 or Vault 1.12.0 - 1.12.1, follow the Consul manual workaround
Until affected deployments are able to upgrade to a version of Consul with the compatibility fix, you must operationalize the workaround below to avoid the onset of communication failure.
At a high level, the steps are:
- Continuously monitor for initiation of intermediate CA rotation
- Manually intervene in Vault to finish the intermediate CA rotation
Note: When using Vault as a Connect CA provider, every Consul datacenter has its own intermediate CA. Therefore, the steps must be applied to every Consul datacenter.
Continuously monitor for initiation of intermediate CA rotation
For each datacenter:
1. Check the IntermediateCertTTL, such as with the Get CA Configuration HTTP API endpoint or CLI command. Calculate that TTL as a number of seconds.
2. Monitor the consul.mesh.active-signing-ca.expiry metric, which tracks the number of seconds until the intermediate CA expires and is updated every hour.
3. Set up a mechanism to be alerted when consul.mesh.active-signing-ca.expiry falls below 50% of the intermediate CA TTL in seconds.
4. Once the alert from step 3 is triggered, verify that Consul has attempted to rotate the intermediate CA. The rotation attempt should happen within an hour of the alert from step 3.
   - This can be accomplished within Consul if the log level is set to “INFO” on server agents. Look for a log message on the leader containing the substring “new intermediate certificate”. The full log message depends on whether the intermediate is in a primary or secondary datacenter.
     - Primary datacenter: connect.ca: generated new intermediate certificate for primary datacenter
     - Secondary datacenter: connect.ca: received new intermediate certificate from primary datacenter
   - When the time comes to issue a new ICA (50% of IntermediateCertTTL), Consul will successfully generate a new intermediate certificate, as the log message above shows.
   - In the Vault audit log, this will be seen as pki/.../root/sign-intermediate and pki/.../intermediate/set-signed.
   - You may also verify the issue by comparing the most recent issuer to the default one.
Run:
$ for issuer in $(/opt/vault/bin/vault list <insert-your-intermediate-pki-path>/issuers | grep '^[0-9a-f]'); do echo "$(/opt/vault/bin/vault read -field=certificate <insert-your-intermediate-pki-path>/issuer/${issuer}/json | openssl x509 -noout -enddate) ${issuer}"; done | sort | tail -n 1
Example output:
notAfter=Nov 11 20:20:53 2022 GMT 4f1ea071-5f3f-fcd6-ce16-c28a3cae38c0
The default issuer has not changed, as the following command output shows.
Run:
$ /opt/vault/bin/vault read <insert-your-intermediate-pki-path>/config/issuers
Example output:
Key Value
--- -----
default f3901ee0-a3ac-0970-4398-3fceec2f30ad
5. After verifying that Consul has attempted to rotate the intermediate CA, proceed to manually intervene in Vault to finish the intermediate CA rotation.
6. After performing the manual intervention, the consul.mesh.active-signing-ca.expiry metric should increase to approximately IntermediateCertTTL upon the metric’s next update.
7. Return to step 1 to monitor for the next initiation of intermediate CA rotation.
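The arithmetic behind steps 1 to 3 can be sketched in shell. With the default IntermediateCertTTL of 1 year (8760h), the TTL is 31,536,000 seconds, so the alert threshold is 15,768,000 seconds. The function names below are hypothetical. `signing_ca_expiry` additionally assumes `curl` and `jq` are installed, that a local agent answers on 127.0.0.1:8500, and that the gauge appears in the agent metrics output under the name used in this article.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical and not part of Consul.

# Convert a Go-style duration made of h/m/s parts (for example "8760h" or
# "72h30m") to whole seconds.
ttl_to_seconds() (
  rest=$1
  total=0
  while [ -n "$rest" ]; do
    num=${rest%%[!0-9]*}        # leading digits
    [ -n "$num" ] || { echo "unsupported duration: $1" >&2; exit 1; }
    rest=${rest#"$num"}
    unit=${rest%"${rest#?}"}    # first character after the digits
    rest=${rest#?}
    case $unit in
      h) total=$((total + num * 3600)) ;;
      m) total=$((total + num * 60)) ;;
      s) total=$((total + num)) ;;
      *) echo "unsupported duration: $1" >&2; exit 1 ;;
    esac
  done
  echo "$total"
)

# Succeed when the seconds left on the signing CA are below 50% of the TTL.
# Arguments: <expiry-in-whole-seconds> <ttl-duration>
rotation_due() {
  expiry_seconds=$1
  ttl_seconds=$(ttl_to_seconds "$2") || return 2
  [ "$expiry_seconds" -lt $((ttl_seconds / 2)) ]
}

# Read the gauge from a local agent (assumptions: curl, jq, default address).
signing_ca_expiry() {
  curl -s http://127.0.0.1:8500/v1/agent/metrics |
    jq -r '.Gauges[] | select(.Name == "consul.mesh.active-signing-ca.expiry") | .Value | floor'
}
```

For example, `rotation_due "$(signing_ca_expiry)" 8760h && echo "rotation window reached"` could be run from a scheduled job on a server agent in each datacenter. A real deployment would more likely express the same comparison as an alert rule in its existing monitoring system.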
Manually intervene in Vault to finish the intermediate CA rotation
For the datacenter which attempted intermediate CA rotation, perform the following intervention in Vault on the PKI secrets engine at the path specified in the datacenter’s IntermediatePKIPath:
1. Within the PKI secrets engine, find the ID of the most recently created intermediate CA issuer:
Run:
$ for issuer in $(/opt/vault/bin/vault list <insert-your-intermediate-pki-path>/issuers | grep '^[0-9a-f]'); do echo "$(/opt/vault/bin/vault read -field=certificate <insert-your-intermediate-pki-path>/issuer/${issuer}/json | openssl x509 -noout -enddate) ${issuer}"; done | sort | tail -n 1
Example output:
notAfter=Nov 11 20:20:53 2022 GMT 4f1ea071-5f3f-fcd6-ce16-c28a3cae38c0
2. Within the PKI secrets engine, manually make the most recently created intermediate CA issuer the new default issuer:
$ echo '{"default": "4f1ea071-5f3f-fcd6-ce16-c28a3cae38c0"}' | /opt/vault/bin/vault write <insert-your-intermediate-pki-path>/config/issuers -
3. Within the PKI secrets engine, verify that the most recently created intermediate CA issuer is now the default issuer:
Run:
$ /opt/vault/bin/vault read <insert-your-intermediate-pki-path>/config/issuers
Example output:
Key Value
--- -----
default 4f1ea071-5f3f-fcd6-ce16-c28a3cae38c0
4. (Optional) Within the PKI secrets engine, confirm the validity of the certificate for the most recently created intermediate CA issuer:
Run:
$ /opt/vault/bin/vault read -field=certificate <insert-your-intermediate-pki-path>/issuer/4f1ea071-5f3f-fcd6-ce16-c28a3cae38c0/json | openssl x509 -noout -issuer -subject -dates
Example output:
issuer=CN = pri-pznnyctn.vault.ca.e45f4cd9.consul
subject=CN = pri-y3121we.vault.ca.e45f4cd9.consul
notBefore=Nov 11 17:20:23 2022 GMT
notAfter=Nov 11 20:20:53 2022 GMT
5. Within Consul, perform a rolling restart of the Consul server agents:
   - For each follower, perform the following steps one follower at a time:
     - Restart the follower by running systemctl restart consul
     - Wait until the follower is healthy and has rejoined the cluster before moving to the next follower.
   - Once all the followers have been completed, run systemctl restart consul on the leader.
   - For Kubernetes environments, see How to restart Consul servers in k8 env to restart the servers.
6. Within Consul, confirm that the certificate from step 4 is present in the list of intermediate certificates from the List CA Root Certificates endpoint:
Run:
$ wget -qO- http://127.0.0.1:8500/v1/connect/ca/roots | jq -r '.Roots[]|.IntermediateCerts[]'
Example output:
-----BEGIN CERTIFICATE-----
MIICLDCCAdOgAwIBAgIUQxZayjfgDow3i0dPrPL6Uo0ikqMwCgYIKoZIzj0EAwIw
MDEuMCwGA1UEAxMlcHJpLXB6bm55Y3RuLnZhdWx0LmNhLmU0NWY0Y2Q5LmNvbnN1
bDAeFw0yMjExMTExNzIwMjNaFw0yMjExMTEyMDIwNTNaMC8xLTArBgNVBAMTJHBy
aS15MzEyMXdlLnZhdWx0LmNhLmU0NWY0Y2Q5LmNvbnN1bDBZMBMGByqGSM49AgEG
CCqGSM49AwEHA0IABH2TnEMcnbKTrfA6yxgaAa8sirCizKORF5tttwSMxH+kSFw3
863/2Igq+cZw0lCQydx93tnAIhAOYwh2W3FnQuWjgcswgcgwDgYDVR0PAQH/BAQD
AgEGMA8GA1UdEwEB/wQFMAMBAf8wHQYDVR0OBBYEFDbRnncB076GI0lvEiKJis2n
XC0TMB8GA1UdIwQYMBaAFEMv0o6cmTrJxnzA428ZcQxlN6NnMGUGA1UdEQReMFyC
JHByaS15MzEyMXdlLnZhdWx0LmNhLmU0NWY0Y2Q5LmNvbnN1bIY0c3BpZmZlOi8v
ZTQ1ZjRjZDktNTM2OS1mMjgxLWU0Y2UtMGI1MmVhZjNhMmVkLmNvbnN1bDAKBggq
hkjOPQQDAgNHADBEAiAOzyaq3KCvhyvbcd0QeNQXdQbryIILpIFaZCesZJhOpgIg
AS1swQA0eC34snC0Jiu2X9/eRM4b/QahclxbTClTrpg=
-----END CERTIFICATE-----
Save the certificate output from the above command to <cert-filename>, then use the command below to check the validity of the certificate:
$ openssl x509 -in <cert-filename> -text -noout
Example output:
Certificate:
Data:
Version: 3 (0x2)
Serial Number:
43:16:5a:ca:37:e0:0e:8c:37:8b:47:4f:ac:f2:fa:52:8d:22:92:a3
Signature Algorithm: ecdsa-with-SHA256
Issuer: CN = pri-pznnyctn.vault.ca.e45f4cd9.consul
Validity
Not Before: Nov 11 17:20:23 2022 GMT
Not After : Nov 11 20:20:53 2022 GMT ⇐= Matches step 4
...
How to restart Consul servers in k8 env
1. Set the server.updatePartition value equal to the number of server replicas.
values.yaml
server:
  updatePartition: <number of server replicas>
2. The updatePartition value controls how many instances of the server cluster are updated. Only instances with an index greater than or equal to the updatePartition value are updated (zero-indexed). Therefore, by setting it equal to replicas, none should update yet.
3. Next, run the following command:
$ helm upgrade consul hashicorp/consul --namespace consul --version <your-version> --values /path/to/your/values.yaml
This will not cause the servers to redeploy (although the resource will be updated).
4. If everything is stable, begin by decreasing the updatePartition value by one, and performing helm upgrade again. This will cause the first Consul server to be stopped and restarted. Wait until the Consul server cluster is healthy again (30s to a few minutes). This can be confirmed by issuing consul members on one of the previous servers, and ensuring that all servers are listed and are alive.
5. Decrease updatePartition by one and do `helm upgrade` again. Continue until updatePartition is 0. At this point, you may remove the updatePartition configuration.
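The health check between restarts ("all servers are listed and are alive") can be scripted rather than read by eye. This is a minimal sketch with hypothetical function names. It assumes the default `consul members` column order (Node, Address, Status, Type, ...) and that the `consul` CLI can reach the cluster from wherever the script runs.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Consul or Helm.

# Count server agents reported as "alive" in `consul members` output on stdin.
# Assumes the default columns: Node, Address, Status, Type, ...
alive_servers() {
  awk 'NR > 1 && $3 == "alive" && $4 == "server" { n++ } END { print n + 0 }'
}

# Poll until at least <expected> servers are alive, or give up.
# Arguments: <expected-server-count> [tries, default 30]; waits 10s per try.
wait_for_servers() {
  expected=$1
  tries=${2:-30}
  while [ "$tries" -gt 0 ]; do
    [ "$(consul members | alive_servers)" -ge "$expected" ] && return 0
    tries=$((tries - 1))
    sleep 10
  done
  return 1
}
```

For a three-server cluster, `wait_for_servers 3` could be run after each `helm upgrade` before decreasing updatePartition again. On Kubernetes the `consul members` call would typically go through `kubectl exec` against a server pod, whose name depends on the Helm release.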
Related GitHub issues:
1. WARN Message: agent.fsm: Failed to apply CA operation: operation=set-roots-config
https://github.com/hashicorp/consul/issues/15824
2. https://github.com/hashicorp/consul/issues/15217