This article relates to Vault Enterprise with DR or PR replication enabled, running version 1.11.3 or higher.
Dirty pages result from Vault operational activity that outpaces the system's ability to write those changes to its storage (aka backend). A running total is reported in the Vault operational logs with the reference num_dirty, indicating the number of pages pending a flush to storage.
sudo journalctl -u vault --no-pager | grep -E 'flushed|dirty|pages'
……… [DEBUG] replication.index.perf: flushed dirty pages: pages_flushed=2 pages_outstanding=49
……… [DEBUG] replication.index.local: flushed dirty pages: pages_flushed=1 pages_outstanding=48
……… [DEBUG] replication.index.perf: saved checkpoint: num_dirty=49
These conditions can arise from inadequate hardware or system resources (slow storage, limited CPU, etc) as well as from high volumes of requests to Vault.
Momentary bursts in the reported num_dirty can be expected and are often normal, for example when Vault administrators enable or disable components such as secrets or authentication mounts and namespaces. Vault attempts to flush dirty pages at a default rate of 2% per interval so that its focus remains on processing user requests rather than on potentially blocking background activity.
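As a rough illustration of why a backlog drains gradually rather than all at once (this assumes the 2% rate applies to the outstanding count at each interval, which is a simplification of Vault's actual behaviour):

```shell
# Illustrative only: if ~2% of the outstanding dirty pages are flushed
# each interval, the backlog decays roughly geometrically:
#   pages_n ≈ pages_0 * 0.98^n
remaining=$(awk 'BEGIN { p = 1000; for (i = 0; i < 10; i++) p *= 0.98; printf "%d", p }')
echo "after 10 intervals: ${remaining} of 1000 pages remain"
```

This is why a declining trend, rather than an instantly zeroed count, is the healthy signal to look for.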
The reported num_dirty is a relative measure, and a rise is no cause for alarm provided it is observed to decline steadily after any sudden spike.
However, in scenarios where num_dirty remains high for too long, broader factors such as the concurrent volume of requests to Vault or the configuration of upstream system mounts in Vault may be contributing causes that require further investigation. For example, where num_dirty is in the thousands (eg: 9876 pages) and fluctuating instead of declining, it is best to use the available monitoring and log data (telemetry, Vault operational logs and Vault audit logs) to determine the root causes.
Impact
A sustained, high value of num_dirty can result in issues in two areas of Vault:
- Restart / leadership change time is prolonged, because the next active Vault instance must parse the pending dirty pages.
- The replication state of dependent DR or PR clusters can stall, leaving them unable to remain in sync, either persistently or during particular cycles when an increase in pages occurs on the primary cluster.
Solutions
In some cases, simply moving to a storage type with greater IOPS and capacity can be a remedy.
Other example causes include non-optimal user request patterns, where writes or updates are made in rapid succession to the same paths in Vault, sometimes with steps that could instead be performed in a single request.
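As an illustration, with the KV secrets engine several rapid writes to one path can often be collapsed into a single request. The path and field names below are hypothetical:

```shell
# Two rapid writes to the same path, each a separate storage update
# (and with KV v2, the second put replaces the first version's data):
#   vault kv put secret/app db_user=app
#   vault kv put secret/app db_pass=example
#
# Combined into one request and one storage update:
#   vault kv put secret/app db_user=app db_pass=example
```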
Custom plugins and mount-specific considerations related to upstream systems and their configuration in Vault can also be contributing causes.
Generally, it is best to strive for an understanding of the contributing causes whenever num_dirty is observed to be high, and whether they are induced by specific user workflows, uses of particular mounts, inadequate system resources, or any combination of these.
It's also possible to expedite the flushing of dirty pages by using the:
During the later stages of reindexing, a flush of the dirty pages is attempted, and you can track its progress and completion by following the:
In most cases, there should be no need to invoke a reindex, as the number of dirty pages is expected to gradually decline to zero over time. Operators may still opt to invoke a reindex in line with other maintenance when a sudden rise is anticipated and there is a preference to flush pages promptly after that rise.
Exceptional flush rate adjustment in Vault 1.11.3 & higher
Vault administrators may deem it necessary to raise the default 2% flush rate in rarer circumstances or to accommodate exceptional use cases. For example, telemetry and the available log data may show that a flush rate of 4% or 8% (two or four times the default) is more suitable in some periods, and that flushing more aggressively will neither exceed IOPS limits nor incur further delays to Vault user requests.
The environment variable VAULT_FLUSH_DIRTY_PAGES_PCT was exposed in Vault 1.11.3 to allow adjustment from the default value of 0.02 (2%). To adjust it, ensure that you are appropriately exporting or setting this variable as part of your launch / restart process for Vault (export VAULT_FLUSH_DIRTY_PAGES_PCT='0.04').
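A minimal sketch of setting the variable before launching Vault. The variable name comes from this article; the systemd unit name and drop-in approach are assumptions about how your Vault service is managed:

```shell
# Double the default 0.02 (2%) flush rate to 0.04 (4%) for this shell's
# child processes, e.g. a manually launched Vault:
export VAULT_FLUSH_DIRTY_PAGES_PCT='0.04'

# With systemd, the equivalent is a drop-in (sudo systemctl edit vault):
#   [Service]
#   Environment=VAULT_FLUSH_DIRTY_PAGES_PCT=0.04
# followed by: sudo systemctl daemon-reload && sudo systemctl restart vault
echo "flush rate set to ${VAULT_FLUSH_DIRTY_PAGES_PCT}"
```

Remember that the variable must be present in Vault's own environment at startup; exporting it in an interactive shell does not affect an already-running service.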
It is generally advised not to exceed a flush rate of 10% (ie 0.1), since larger rates can negatively impact the time taken to process user requests, which ought to remain the primary focus and activity of Vault.
It is also possible that adjusting the default 2% flush rate will have no effect, even at the recommended maximum of 10% or higher, wherever underlying causes continue to contribute to the fluctuation in dirty pages.
Related Links:
- Support KB: Replication Reindex Process