This article relates to Vault Enterprise with DR or PR replication enabled, running version 1.11.3 or higher.
Dirty pages result from Vault operational activity that outpaces the system's ability to write those changes to its storage (aka backend). A running total is reported in the Vault operational logs with the reference num_dirty, indicating the number of pages pending a flush to storage.
sudo journalctl -u vault --no-pager | grep -E 'flushed|dirty|pages'
……… [DEBUG] replication.index.perf: flushed dirty pages: pages_flushed=2 pages_outstanding=49
……… [DEBUG] replication.index.local: flushed dirty pages: pages_flushed=1 pages_outstanding=48
……… [DEBUG] replication.index.perf: saved checkpoint: num_dirty=49
These conditions can arise from inadequate hardware or system resources (slow storage, limited CPU, etc) as well as from high volumes of requests to Vault.
Momentary bursts in the reported num_dirty can be expected and are often normal, for example when Vault administrators enable or disable components such as secrets or authentication mounts and namespaces. Vault attempts to flush dirty pages at a default rate of 2% per interval so that its focus remains on processing user requests rather than on potentially blocking background activity.
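As a rough illustration of why a backlog drains gradually rather than all at once (this assumes the 2% rate applies to the outstanding count at each interval, which is a simplification of Vault's actual behaviour):

```shell
# Illustrative only: if ~2% of the outstanding dirty pages are flushed
# each interval, the backlog decays roughly geometrically:
#   pages_n ≈ pages_0 * 0.98^n
remaining=$(awk 'BEGIN { p = 1000; for (i = 0; i < 10; i++) p *= 0.98; printf "%d", p }')
echo "after 10 intervals: ${remaining} of 1000 pages remain"
```

This is why a declining trend, rather than an instantly zeroed count, is the healthy signal to look for.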
The reported num_dirty is a relative measure, and a rise is no cause for alarm provided it is observed to decline steadily after any sudden spike.
However, in scenarios where num_dirty remains high for too long, broader factors such as the concurrent volume of requests to Vault or the configuration of upstream system mounts in Vault may be contributing causes that require further investigation. For example, where num_dirty is in the thousands (eg: 9876 pages) and fluctuating instead of declining, it is best to use the available monitoring and log data (telemetry, Vault operational logs and Vault audit logs) to determine the root causes.
Impact
A sustained, high value of num_dirty can result in issues in two areas of Vault:
- Restart / leadership change time is prolonged, because the next active Vault instance must parse the pending dirty pages.
- The replication state of dependent DR or PR clusters can stall, leaving them unable to remain in sync, either persistently or during particular cycles when an increase in pages occurs on the primary cluster.
Solutions
In some cases, simply moving to a storage type with greater IOPS and capacity can be a remedy.
Other example causes include non-optimal user request patterns, where writes or updates are made in rapid succession to the same paths in Vault, sometimes with steps that could instead be performed in a single request.
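As an illustration, with the KV secrets engine several rapid writes to one path can often be collapsed into a single request. The path and field names below are hypothetical:

```shell
# Two rapid writes to the same path, each a separate storage update
# (and with KV v2, the second put replaces the first version's data):
#   vault kv put secret/app db_user=app
#   vault kv put secret/app db_pass=example
#
# Combined into one request and one storage update:
#   vault kv put secret/app db_user=app db_pass=example
```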
Custom plugins and mount-specific considerations related to upstream systems and their configuration in Vault can also be contributing causes.
Generally, it is best to strive for an understanding of the contributing causes whenever num_dirty is observed to be high, and whether they are induced by specific user workflows, uses of particular mounts, inadequate system resources, or any combination of these.
It's also possible to expedite the flushing of dirty pages by using the:
During the later stages of reindexing, a flush of the dirty pages is attempted, and you can track its progress and completion by following the:
In most cases, there should be no need to invoke a reindex, as the number of dirty pages is expected to gradually decline to zero over time. Operators may still opt to invoke a reindex in line with other maintenance when a sudden rise is anticipated and there is a preference to flush pages promptly after that rise.
Exceptional flush rate adjustment in Vault 1.11.3 & higher
Vault administrators may deem it necessary to raise the default 2% flush rate in rarer circumstances or to accommodate exceptional use cases. For example, telemetry and the available log data may show that a flush rate of 4% or 8% (two or four times the default) is more suitable in some periods, and that flushing more aggressively will neither exceed IOPS limits nor incur further delays to Vault user requests.
The environment variable VAULT_FLUSH_DIRTY_PAGES_PCT was exposed in Vault 1.11.3 to allow adjustment from the default value of 0.02 (2%). To adjust it, ensure that you are appropriately exporting or setting this variable as part of your launch / restart process for Vault (export VAULT_FLUSH_DIRTY_PAGES_PCT='0.04').
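A minimal sketch of setting the variable before launching Vault. The variable name comes from this article; the systemd unit name and drop-in approach are assumptions about how your Vault service is managed:

```shell
# Double the default 0.02 (2%) flush rate to 0.04 (4%) for this shell's
# child processes, e.g. a manually launched Vault:
export VAULT_FLUSH_DIRTY_PAGES_PCT='0.04'

# With systemd, the equivalent is a drop-in (sudo systemctl edit vault):
#   [Service]
#   Environment=VAULT_FLUSH_DIRTY_PAGES_PCT=0.04
# followed by: sudo systemctl daemon-reload && sudo systemctl restart vault
echo "flush rate set to ${VAULT_FLUSH_DIRTY_PAGES_PCT}"
```

Remember that the variable must be present in Vault's own environment at startup; exporting it in an interactive shell does not affect an already-running service.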
It is generally advised not to exceed a flush rate of 10% (ie 0.1), since larger rates can negatively impact the time taken to process user requests, which ought to remain the primary focus and activity of Vault.
It is also possible that adjusting the default 2% flush rate will have no effect, even at the recommended maximum of 10% or higher, wherever underlying causes continue to contribute to the fluctuation in dirty pages.
Related Links:
- Support KB: Replication Reindex Process