| --- |
| layout: docs |
| page_title: Vault Enterprise Eventual Consistency |
| description: Vault Enterprise Consistency Model |
| --- |
| |
| # Vault eventual consistency |
| |
| @include 'alerts/enterprise-and-hcp.mdx' |
| |
| When running in a cluster, Vault has an eventual consistency model. |
| Only one node (the leader) can write to Vault's storage. |
Users generally expect read-after-write consistency: in other
words, after writing foo=1, a subsequent read of foo should return 1. Depending
on the Vault configuration, this isn't always the case. When using performance
standbys with Integrated Storage, or when using performance replication,
some sequences of operations don't yield read-after-write consistency.
| |
| ## Performance standby nodes |
| |
| When using the Integrated Storage backend without performance standbys, only |
| a single Vault node (the active node) handles requests. Requests sent to |
regular standbys are forwarded to the active node. This configuration
gives Vault the same behavior as the default Consul consistency model.
| |
| When using the Integrated Storage backend with performance standbys, both the |
| active node and performance standbys can handle requests. If a performance standby |
| handles a login request, or a request that generates a dynamic secret, the |
| performance standby will issue a remote procedure call (RPC) to the active node to store the token |
| and/or lease. If the performance standby handles any other request that |
| results in a storage write, it will forward that request to the active node |
| in the same way a regular standby forwards all requests. |
| |
| With Integrated Storage, all writes occur on the active node, which then issues |
| RPCs to update the local storage on every other node. Between when the active |
| node writes the data to its local disk, and when those RPCs are handled on the |
| other nodes to write the data to their local disks, those nodes present a stale |
| view of the data. |
| |
| As a result, even if you're always talking to the same performance standby, |
| you may not get read-after-write semantics. The write gets sent to the active |
| node, and if the subsequent read request occurs before the new data gets sent |
| to the node handling the read request, the read request won't be able to take |
| the write into account because the new data isn't present on that node yet. |
| |
| ## Performance replication |
| |
| A similar phenomenon occurs when using performance replication. One example |
| of how this manifests is when using shared mounts. If a KV secrets engine |
| is mounted on the primary with `local=false`, it will exist on the secondary |
cluster as well. The secondary cluster can handle requests to that mount,
though as with performance standbys, write requests must be forwarded, in
this case to the primary's active node. Once data is written to the primary cluster,
it won't be visible on the secondary cluster until it has been replicated
from the primary. Until then, on the secondary cluster, it appears as if
the write hasn't happened.
| |
If the secondary cluster is using Integrated Storage, and the read request is
being handled on one of its performance standbys, the problem is exacerbated:
the data must travel first from the primary active node to the secondary active node,
and then from there to the secondary performance standby, with each hop
introducing its own lag.
| |
| Even without shared secret engines, stale reads can still happen with performance |
replication. The Identity subsystem aims to provide a view of entities and
groups that spans clusters. As such, when logging in to a secondary cluster
| using a shared mount, Vault tries to generate an entity and alias if they don't |
| already exist, and these must be stored on the primary using an RPC. Something |
| similar happens with groups. |
| |
| ## Mitigations |
| |
There has long been a partial mitigation for the above problems. When writing
data via RPC, for example when a performance standby registers tokens and leases on the
active node after a login or after generating a dynamic secret, part of the response
includes a number known as the WAL (Write-Ahead Log) index.
| |
| A full explanation of this is outside the scope of this document, but the short |
| version is that both performance replication and performance standbys use log |
| shipping to stay in sync with the upstream source of writes. The mitigation |
historically used by nodes doing writes via RPC is to look at the WAL index in
the response and wait up to 2 seconds for that WAL index to appear in the
logs being shipped from upstream. Once the WAL index is seen, the Vault node
| handling the request that resulted in RPCs can return its own response to the |
| client: it knows that any subsequent reads will be able to see the value that |
| was just written. If the WAL index isn't seen within those 2 seconds, the Vault |
| node completes the request anyway, returning a warning in the response. |
| |
| This mitigation option still exists in Vault 1.7, though now there is a |
| configuration option to adjust the wait time: |
| [best_effort_wal_wait_duration](/vault/docs/configuration/replication). |
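
For example, a server could extend the wait beyond the 2-second default with a `replication` stanza along these lines (a sketch; the linked documentation has the authoritative syntax and accepted values):

```hcl
replication {
  best_effort_wal_wait_duration = "5s"
}
```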
| |
| ## Vault 1.7 mitigations |
| |
| There are now a variety of other mitigations available: |
| |
| - per-request option to always forward the request to the active node |
| - per-request option to conditionally forward the request to the active node |
| if it would otherwise result in a stale read |
| - per-request option to fail requests if they might result in a stale read |
| - Vault Proxy configuration to do the above for proxied requests |
| |
| The remainder of this document describes the tradeoffs of these mitigations and |
| how to use them. |
| |
| Note that any headers requesting forwarding are disabled by default, and must |
| be enabled using [allow_forwarding_via_header](/vault/docs/configuration/replication). |
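
For example, a server configuration permitting the forwarding headers described below might look like the following sketch (consult the linked documentation for the authoritative syntax):

```hcl
replication {
  allow_forwarding_via_header = ["X-Vault-Forward", "X-Vault-Inconsistent"]
}
```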
| |
| ### Unconditional forwarding (Performance standbys only) |
| |
The simplest way to avoid stale reads from a performance standby
is to provide the following HTTP header in the request:
| |
| ``` |
| X-Vault-Forward: active-node |
| ``` |
| |
| The drawback here is that if all your requests are forwarded to the active node, |
| you might as well not be using performance standbys. So this mitigation only |
| makes sense to use selectively. |
| |
| This mitigation will not help with stale reads relating to performance replication. |
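
With the Vault Go API, one way to scope this header to a single request is the `ForwardAlways` callback described under Client API helpers below. A minimal sketch, assuming forwarding headers are enabled server-side and that `secret/data/foo` is a placeholder path:

```go
package main

import (
	"log"

	"github.com/hashicorp/vault/api"
)

func main() {
	// DefaultConfig reads VAULT_ADDR (and NewClient reads VAULT_TOKEN)
	// from the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// WithRequestCallbacks returns a shallow clone of the client, so only
	// this request carries the X-Vault-Forward: active-node header.
	secret, err := client.WithRequestCallbacks(api.ForwardAlways()).
		Logical().Read("secret/data/foo")
	if err != nil {
		log.Fatal(err)
	}
	log.Println(secret)
}
```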
| |
| ### Conditional forwarding (Performance standbys only) |
| |
| As of Vault Enterprise 1.7, all requests that modify storage now return a new |
| HTTP response header: |
| |
| ``` |
| X-Vault-Index: <base64 value> |
| ``` |
| |
| To ensure that the state resulting from that write request is visible to a |
| subsequent request, add these headers to that second request: |
| |
| ``` |
| X-Vault-Index: <base64 value taken from previous response> |
| X-Vault-Inconsistent: forward-active-node |
| ``` |
| |
The node handling the request compares its local state against the state
described by the `X-Vault-Index` header; if the local state doesn't yet contain
it, the node forwards the request to the active node.
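
As a sketch with the Vault Go API, assuming a `client` built as in the previous example (`RecordState`, `RequireState`, and `ForwardInconsistent` are described under Client API helpers below):

```go
// Capture the X-Vault-Index header returned by the write.
var state string
_, err := client.WithResponseCallbacks(api.RecordState(&state)).
	Logical().Write("secret/data/foo", map[string]interface{}{
		"data": map[string]interface{}{"foo": "1"},
	})
if err != nil {
	log.Fatal(err)
}

// Replay that state on the read; a node that doesn't yet have it
// forwards the request to the active node rather than serving stale data.
secret, err := client.WithRequestCallbacks(
	api.RequireState(state), api.ForwardInconsistent(),
).Logical().Read("secret/data/foo")
```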
| |
| The drawback here is that when requests are forwarded to the active node, |
| performance standbys provide less value. If this happens often enough |
| the active node can become a bottleneck, limiting the horizontal read scalability |
| performance standbys are intended to provide. |
| |
| ### Retry stale requests |
| |
| As of Vault Enterprise 1.7, all requests that modify storage now return a new |
| HTTP response header: |
| |
| ``` |
| X-Vault-Index: <base64 value> |
| ``` |
| |
| To ensure that the state resulting from that write request is visible to a |
subsequent request, add this header to that second request:
| |
| ``` |
| X-Vault-Index: <base64 value taken from previous response> |
| ``` |
| |
| When the desired state isn't present, Vault will return a failure response with |
| HTTP status code 412. This tells the client that it should retry the request. |
| The advantage over the Conditional Forwarding solution above is twofold: |
| first, there's no additional load on the active node. Second, this solution |
| is applicable to performance replication as well as performance standbys. |
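
Schematically, the exchange looks like this (the paths are placeholders assuming a KV v2 mount at `secret/`, and the index value is illustrative):

```
PUT /v1/secret/data/foo            --> 200, X-Vault-Index: <index>
GET /v1/secret/data/foo
    X-Vault-Index: <index>         --> 412 (state not yet present; retry)
GET /v1/secret/data/foo
    X-Vault-Index: <index>         --> 200 (state present; fresh read)
```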
| |
| The Vault Go API will now automatically retry 412s, and provides convenience |
| methods for propagating the X-Vault-Index response header into the request |
| header of subsequent requests. Those not using the Vault Go API will want |
| to build equivalent functionality into their client library. |
| |
| ### Vault proxy and consistency headers |
| |
When configured, the [Vault API Proxy](/vault/docs/agent-and-proxy/proxy/apiproxy) proxies incoming requests to Vault.
The `api_proxy` stanza provides configuration that applies some of the above
mitigations to proxied requests without modifying clients.
| |
By setting `enforce_consistency="always"`, Proxy will always provide
the `X-Vault-Index` consistency header. The value it uses for the header
is based on the responses that have previously passed through the Proxy.
| |
| The option `when_inconsistent` controls how stale reads are prevented: |
| |
| - `"fail"` means that when a `412` response is seen, it is returned to the client |
| - `"retry"` means that `412` responses will be retried automatically by Proxy, |
| so the client doesn't have to deal with them |
| - `"forward"` makes Proxy provide the |
| `X-Vault-Inconsistent: forward-active-node` header as described above under |
| Conditional Forwarding |
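
Putting these together, a minimal sketch of a proxy configuration file (the listener and Vault addresses are placeholders):

```hcl
api_proxy {
  enforce_consistency = "always"
  when_inconsistent   = "retry"
}

listener "tcp" {
  address     = "127.0.0.1:8100"
  tls_disable = true
}

vault {
  address = "https://vault.example.com:8200"
}
```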
| |
| ## Vault 1.10 mitigations |
| |
In Vault 1.10, the token format changed: service tokens now employ server-side
consistency. This means that, by default, requests made to nodes which cannot
support read-after-write consistency, because they lack the WAL index needed
to check the Vault token locally, return a 412 status code. The Vault Go API
automatically retries when receiving 412s, so unless there is considerable
replication delay, users will experience read-after-write consistency.
| |
The replication option [allow_forwarding_via_token](/vault/docs/configuration/replication)
can be used so that requests which would otherwise return a 412 in this way
are instead forwarded to the active node.
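
A sketch of the corresponding server configuration (see the linked documentation for the accepted values):

```hcl
replication {
  allow_forwarding_via_token = ["new_token"]
}
```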
| |
| Refer to the [Server Side Consistent Token FAQ](/vault/docs/faq/ssct) for details. |
| |
| ## Client API helpers |
| |
| There are some new helpers in the `api` package to work with the new headers. |
| `WithRequestCallbacks` and `WithResponseCallbacks` create a shallow clone of |
| the client and populate it with the given callbacks. `RecordState` and |
| `RequireState` are used to store the response header from one request and |
| provide it in a subsequent request. For example: |
| |
```go
client, err := api.NewClient(api.DefaultConfig())
if err != nil {
	// handle error
}
var state string
_, err = client.WithResponseCallbacks(api.RecordState(&state)).Logical().Write(path, data)
secret, err := client.WithRequestCallbacks(api.RequireState(state)).Logical().Read(path)
```
| |
| This will retry the `Read` until the data stored by the `Write` is present. |
| There are also callbacks to use forwarding: `ForwardInconsistent` and |
| `ForwardAlways`. |