Problem :-
Users may encounter failing runs in their environment due to a bug in the code that automatically rotates signing keys. The bug causes the key rotation mechanism to incorrectly identify the newest signing key.
Error :- Error: failed to refresh cached credentials, failed to retrieve credentials,exceeded maximum number of attempts, 3, https response error StatusCode: 400, RequestID: abcde123-a234-567d-abcde9874621, InvalidIdentityToken: Token signature invalid
Cause :-
For TFE :-
This problem manifests in every version of TFE that supports dynamic credentials and was released before the fix (v202407-1), once the Vault OIDC key has been rotated 9 times. Customers using Dynamic Credentials should therefore proactively perform one of the mitigation steps below to avoid failing runs. The default rotation period for OIDC signing keys is 90 days, so all of a customer’s runs that rely on Dynamic Credentials will start failing approximately 810 days (9 rotations at 90 days each) after the customer installed the first version of TFE that supported Dynamic Credentials (i.e. v202207-1). Customers who upgraded to v202207-1 close to its release should start to see failures around October 2024.
Vault uses a monotonically increasing integer to denote key versions. These versions are sorted lexicographically in one part of atlas and numerically in another. When atlas holds multiple key versions with differing numbers of digits, the two parts select different key versions. For more technical detail, see the PR description for the issue on GitHub.
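The 810-day estimate can be reproduced with simple date arithmetic. The following is a minimal sketch only: the install date is a hypothetical example, not a known release or customer date, and should be replaced with the date the customer actually installed a version supporting Dynamic Credentials.

```ruby
require "date"

# Values taken from the description above.
rotations_before_failure = 9   # failures begin once the key has been rotated 9 times
rotation_period_days     = 90  # default OIDC signing key rotation period

# Hypothetical install date, used for illustration only.
install_date = Date.new(2022, 7, 20)

days_until_failure    = rotations_before_failure * rotation_period_days
expected_failure_date = install_date + days_until_failure

puts "days until failure:    #{days_until_failure}"
puts "expected failure date: #{expected_failure_date}"
```

A customer with a longer configured rotation period would reach the failure point proportionally later.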
For TFC :-
The root cause was a bug in the code. TFC automatically rotates the signing key on a periodic basis. Each key has a unique ID, which is a monotonically increasing integer.
As part of the automatic key rotation mechanism, the code identifies the newest key by its numeric ID. Because of the bug, the key rotation code used lexicographical sorting instead of numerical sorting and so selected the wrong signing key. As a result, key ID 9 was incorrectly interpreted as newer than key ID 10.
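The sorting mismatch is easy to reproduce in plain Ruby. This is an illustrative sketch, not the actual atlas or TFC code: the variable names are hypothetical, and the key IDs are modelled as strings simply to show how the two comparisons diverge.

```ruby
# Hypothetical key IDs as they might exist after the tenth key is created.
key_ids = (1..10).map(&:to_s)

# Lexicographic comparison works character by character, so "9" ranks above "10".
newest_lexicographic = key_ids.max

# Numeric comparison converts each ID to an integer before comparing.
newest_numeric = key_ids.max_by(&:to_i)

puts "lexicographic newest: #{newest_lexicographic}"
puts "numeric newest:       #{newest_numeric}"
```

With IDs 1 through 10, the lexicographic result is "9" and the numeric result is "10", which matches the behaviour described above.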
Solution :-
For TFC :-
The bug has been fixed and is not expected to cause further impact.
For TFE :-
Upgrade TFE to latest version (Highly recommended)
This issue was fixed in v202407-1 of TFE. Customers who are able to upgrade to the latest version of TFE will be fully insulated from the problem. This is the permanent fix.
Forcefully rotate and trim signing keys
Using the rails console (or the Vault UI), customers can temporarily resolve this issue by making sure that the first 9 versions of their OIDC signing key no longer exist in their Vault.
The easiest approach is usually to have customers rotate the atlas_oidc_key 9 times and then trim the key. This can be done via the Vault UI, or via the following commands in the rails console:
# rotate key 9 times
9.times { OIDC::KeyManager.instance.rotate_key }
# perform trim operation to trim all keys except the latest
OIDC::KeyManager.instance.trim_key
This is not a permanent solution to the problem. It should allow customers to keep rotating the key safely until the 100th rotation, at which point their Vault will contain keys with both 2-digit and 3-digit version numbers. How long that takes depends on the customer’s configured key rotation period (default 90 days).
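To see why the workaround only holds until the version numbers gain another digit, the sketch below finds the first key version at which lexicographic and numeric selection disagree. It is a simplified model: `first_unsafe_version` is a hypothetical helper, and it assumes every version from the oldest retained key onward is still present, which may not match a given customer's Vault.

```ruby
# Hypothetical helper: given the oldest key version still retained after a trim,
# return the first newer version at which string-based and integer-based
# "newest key" selection would pick different keys.
def first_unsafe_version(oldest_retained)
  version = oldest_retained
  loop do
    version += 1
    retained = (oldest_retained..version).map(&:to_s)
    # Disagreement appears as soon as the retained versions differ in digit count.
    return version if retained.max != retained.max_by(&:to_i)
  end
end

puts first_unsafe_version(1)   # single-digit keys still present
puts first_unsafe_version(19)  # only two-digit keys remain after the trim
```

In this model, an untrimmed set starting at version 1 breaks at version 10, while a set trimmed down to a two-digit version stays consistent until version 100.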
Extend key rotation period
This mitigation strategy will NOT work if the customer is already encountering the problem (i.e. their OIDC runs are all failing).
If customers cannot upgrade to the latest version of TFE AND are not able or willing to use the atlas rails console to manually trigger rotation, another possible solution is to extend the key rotation period to give them more time to upgrade their TFE version.
The key rotation period can be configured via the WORKLOAD_IDENTITY_AUTO_ROTATE_PERIOD environment variable. This period should remain as short as possible, but a key rotation period of a year or more should be perfectly safe.
Supporting Document :-