Recover from catastrophic failure with disaster recovery replication
Enterprise Feature
This tutorial covers Disaster Recovery Replication, a Vault Enterprise feature that requires a Vault Enterprise Standard license.
A disaster recovery (DR) strategy to protect your Vault deployment from catastrophic failure of an entire cluster helps reduce recovery efforts and minimize outage downtime. Vault Enterprise supports multi-datacenter deployments, so that you can replicate data across datacenters for improved performance and disaster recovery capabilities.
Challenge
When a disaster occurs, a Vault operator must be able to respond to the situation by performing failover from the affected cluster. Similarly, failing back to an original cluster state is typically required after you resolve the incident.
Solution
Vault Enterprise Disaster Recovery (DR) Replication features failover and failback capabilities to assist in recovery from catastrophic failure of entire clusters.
Learning to fail over a DR replication primary cluster to a secondary cluster, and to fail back to the original cluster state, is crucial for operating Vault in more than one datacenter.
Use the basic example workflow in this tutorial scenario to get acquainted with the steps involved in failing over and failing back using the Vault API, CLI, or UI.
Prerequisites
This intermediate Vault Enterprise operations tutorial assumes that you already have some working knowledge of operating Vault with the API, CLI, or web UI. If you aren't familiar with the Vault Enterprise Disaster Recovery replication functionality, you should review the Disaster Recovery Replication Setup tutorial before proceeding with this tutorial.
You also need the following resources to complete the tutorial hands-on scenario:
- Docker installed.
- Vault binary installed on your PATH for CLI operations. You must use a Vault Enterprise server throughout this tutorial, but you can use the Vault Community Edition binary for all CLI examples.
- curl to use the API command examples.
- jq for parsing and pretty-printing JSON output.
- A web browser for accessing the Vault UI.
Note
This procedure requires both Vault clusters to run the same version of Vault.
Once the original DR primary cluster is demoted, you cannot replicate to it from a promoted cluster running a higher version of Vault.
For example, if you have Cluster A (a DR primary) running Vault 1.11.x and Cluster B (a new DR secondary) running Vault 1.15.x, you can promote Cluster B, but you cannot replicate to Cluster A until Cluster A is upgraded to 1.15.x or above.
This limitation exists because Vault does not make backward-compatibility guarantees for its data store.
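The version rule above can be expressed as a small pre-flight check. The following is a minimal Python sketch, not part of Vault: the function names are hypothetical, and it compares only the major and minor components, which is a simplification of the "1.15.x or above" wording.

```python
def parse_version(version: str) -> tuple[int, int]:
    """Return (major, minor) from strings such as '1.15.2' or '1.11.x'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def can_replicate_to(demoted_cluster: str, promoted_cluster: str) -> bool:
    """True when the demoted cluster is not older than the promoted primary.

    Hypothetical helper: patch levels are ignored for simplicity.
    """
    return parse_version(demoted_cluster) >= parse_version(promoted_cluster)


# Cluster A on 1.11.x cannot receive data from a promoted Cluster B on 1.15.x.
print(can_replicate_to("1.11.x", "1.15.x"))
```

Running a check like this before demoting the original primary can flag a version mismatch while there is still time to upgrade.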
Policy requirements
You must have a token with highly privileged policies, such as a root token, to configure Vault Enterprise Replication. Some API endpoints also require the sudo capability.
If you aren't using the root token, your token needs ACL policies that permit the operations described in this tutorial.
Note
If you aren't familiar with policies, complete the policies tutorial.
Scenario introduction
To successfully follow this tutorial, you will deploy 2 single-node Vault Enterprise clusters with integrated storage:
- Cluster A is the initial primary cluster.
- Cluster B is the initial secondary cluster.
Note
The tutorial scenario uses single-node Vault clusters as a convenience to the learner and to simplify the deployment. For production Vault deployments, you should use highly available (HA) integrated storage described in the Vault with Integrated Storage Deployment Guide tutorial.
You will use these 2 clusters to simulate the following failover and failback workflows.
Failover to DR secondary cluster
In the current state, cluster A is the primary and replicates data to the secondary cluster B. You will perform the following actions to fail over so that cluster B becomes the new primary cluster.
- Generate batch DR operation token on cluster A.
- Promote DR cluster B to become new primary.
- Demote cluster A to become secondary.
- Point cluster A to new primary cluster B.
- Test access to Vault data while cluster B is the primary.
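The failover steps above map onto a short sequence of Vault HTTP API calls. The following Python sketch lists them in order as plain data, so it runs without a Vault server. The endpoint paths are the DR replication paths from the Vault API. The tuple layout and the function name are illustrative assumptions, and the plan includes the secondary token that the promoted cluster B must issue before cluster A can be pointed at it.

```python
def failover_plan() -> list[tuple[str, str, str]]:
    """Return (cluster, credential, API path) tuples in failover order.

    "dr_operation_token" means the token travels in the request body,
    because a DR secondary cannot validate regular Vault tokens.
    "X-Vault-Token" means a regular token sent in the request header.
    """
    return [
        ("B", "dr_operation_token", "sys/replication/dr/secondary/promote"),
        ("A", "X-Vault-Token", "sys/replication/dr/primary/demote"),
        ("B", "X-Vault-Token", "sys/replication/dr/primary/secondary-token"),
        ("A", "dr_operation_token", "sys/replication/dr/secondary/update-primary"),
    ]


for cluster, credential, path in failover_plan():
    print(f"cluster {cluster}: POST /v1/{path} (auth: {credential})")
```

Keeping the order as data makes it easy to review the runbook before an incident. In a real outage cluster A may be unreachable, in which case the steps that target cluster A wait until it returns.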
Failback to original primary cluster
In the current state, cluster B is the primary and replicates data to the secondary cluster A. You will perform the following actions to fail back to the original cluster replication state.
- Generate secondary token on cluster A.
- Promote cluster A.
- Demote cluster B.
- Point cluster B to cluster A, so cluster B is a DR secondary of cluster A.
- Test access to Vault data while cluster A is the primary cluster.
Prepare environment
The goal of this section is for you to prepare and deploy the Vault cluster containers.
You will start the Vault cluster Docker containers, and perform some initial configuration to ready the Vault clusters for replication.
This tutorial requires a Vault Enterprise Standard license, so you need to first specify your license string as the value of the MY_VAULT_LICENSE environment variable.
Note
Be sure to use your Vault Enterprise license string value, and not the non-functional example value shown here.
Export the environment variable HC_LEARN_LAB with a value that represents the lab directory, /tmp/learn-vault-lab.
Make the directory.
Change into the lab directory.
You will perform all steps of the tutorial scenario from within this directory.
Create directories for Vault configuration and data for the 2 clusters.
Pull the latest Vault Enterprise Docker image.
Note
You must log into Docker Hub before pulling the Vault Enterprise image.
Create a Docker network named learn-vault.
Start the cluster A container
Each cluster container uses a unique Vault server configuration file.
Create the cluster A configuration file.
Note
Although the listener stanza disables TLS (tls_disable = 1) for this tutorial, Vault should always be used with TLS in production to enable secure communication between clients and the Vault server. This configuration requires a certificate file and key file on each Vault host.
Start the cluster A container.
Confirm that the cluster A container is up.
Example expected output:
Initialize the cluster A Vault, writing the initialization information including unseal key and initial root token to the file cluster-a/.init.
Note
The initialization example here uses the Shamir's Secret Sharing based seal with 1 key share for convenience in the hands-on lab. You should use more than one key share or an auto seal type in production.
Export the environment variable CLUSTER_A_UNSEAL_KEY with the cluster A unseal key as its value.
Export the environment variable CLUSTER_A_ROOT_TOKEN with the cluster A initial root token as its value.
Unseal Vault in cluster A.
Successful output example:
After you unseal Vault, it returns a status with Sealed having a value of false. This means that Vault is now unsealed and ready for use in cluster A.
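As an illustration of the initialize and unseal steps, the following Python sketch extracts the unseal key and root token from saved initialization output and checks a seal status. It assumes the .init file holds the JSON produced by vault operator init -format=json (fields unseal_keys_b64 and root_token) and that the status comes from vault status -format=json (field sealed). The function names and sample values are hypothetical.

```python
import json


def parse_init_output(init_json: str) -> tuple[str, str]:
    """Return (first unseal key, initial root token) from init JSON output."""
    data = json.loads(init_json)
    return data["unseal_keys_b64"][0], data["root_token"]


def is_unsealed(status: dict) -> bool:
    """True only when the status explicitly reports sealed == false."""
    return status.get("sealed") is False


# Placeholder values for illustration, not real key material.
sample_init = json.dumps(
    {"unseal_keys_b64": ["ZXhhbXBsZS1rZXk="], "root_token": "hvs.example"}
)
unseal_key, root_token = parse_init_output(sample_init)
print(unseal_key, root_token, is_unsealed({"sealed": False}))
```

With a single key share, the first entry of unseal_keys_b64 is the only key needed to unseal the cluster.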
Start the cluster B container
Repeat a variation of the earlier workflow to start cluster B.
Create the cluster B configuration file.
Network ports
Cluster B uses a different and non-standard set of port numbers for the Vault API and cluster addresses than cluster A. This is for simplicity in communicating with each cluster from the Docker host.
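To make the difference between the two configuration files concrete, here is a minimal Python sketch that renders a single-node server configuration for a given host name and port pair. The configuration keys (listener, storage "raft", api_addr, cluster_addr, tls_disable) are standard Vault server options. The host names, the /vault/file data path, and the cluster B port numbers 8220 and 8221 are assumptions made for illustration.

```python
def render_server_config(host: str, api_port: int, cluster_port: int) -> str:
    """Render a minimal single-node Vault server configuration in HCL."""
    lines = [
        "ui = true",
        f'api_addr = "http://{host}:{api_port}"',
        f'cluster_addr = "http://{host}:{cluster_port}"',
        "",
        'listener "tcp" {',
        f'  address = "0.0.0.0:{api_port}"',
        f'  cluster_address = "0.0.0.0:{cluster_port}"',
        "  tls_disable = 1",  # lab only; use TLS in production
        "}",
        "",
        'storage "raft" {',
        '  path = "/vault/file"',  # assumed data path inside the container
        f'  node_id = "{host}"',
        "}",
    ]
    return "\n".join(lines) + "\n"


# Cluster A on the default ports, cluster B on hypothetical non-standard ports.
print(render_server_config("cluster-a", 8200, 8201))
print(render_server_config("cluster-b", 8220, 8221))
```

Using distinct ports per cluster lets both containers publish their API ports on the same Docker host without a conflict.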
Start the cluster B container.
Check the container status.
Initialize the cluster B Vault, writing the initialization information including unseal key and initial root token to the file secondary/.init.
Export the environment variable CLUSTER_B_UNSEAL_KEY with the cluster B unseal key as its value.
Export the environment variable CLUSTER_B_ROOT_TOKEN with the cluster B initial root token as its value.
Unseal Vault in cluster B.
Successful output example:
You are now prepared to configure DR replication between cluster A and cluster B using the Vault CLI, HTTP API, or UI.
Configure replication
The basic steps to configure DR replication are as follows:
- Enable DR primary replication on cluster A.
- Generate secondary token on cluster A.
- Enable DR secondary replication on cluster B.
- Confirm replication status on both clusters.
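The first three steps above each correspond to one POST request against the Vault HTTP API. The following Python sketch builds those requests with the standard library and does not send them, so it runs without a Vault server. The endpoint paths and the id and token parameters are from the Vault replication API. The helper names, addresses, and token values are placeholders.

```python
import json
import urllib.request


def vault_post(addr: str, path: str, token: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) an authenticated POST request to Vault."""
    return urllib.request.Request(
        url=f"{addr}/v1/{path}",
        data=json.dumps(body).encode(),
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


def enable_primary(addr: str, token: str) -> urllib.request.Request:
    return vault_post(addr, "sys/replication/dr/primary/enable", token, {})


def secondary_token(addr: str, token: str, secondary_id: str) -> urllib.request.Request:
    # The response to this request is wrapped by default.
    return vault_post(
        addr, "sys/replication/dr/primary/secondary-token", token, {"id": secondary_id}
    )


def enable_secondary(addr: str, token: str, activation_token: str) -> urllib.request.Request:
    # Warning: enabling a DR secondary clears its existing data.
    return vault_post(
        addr, "sys/replication/dr/secondary/enable", token, {"token": activation_token}
    )


def wrapped_token(response: dict) -> str:
    """Extract the activation token from a wrapped secondary-token response."""
    return response["wrap_info"]["token"]


req = secondary_token("http://127.0.0.1:8200", "placeholder-root-token", "cluster-b")
print(req.get_method(), req.full_url, req.data.decode())
```

Passing a built request to urllib.request.urlopen would send it. The sketch stops short of that so you can inspect each URL and body first.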
Enable replication on cluster A
Export a VAULT_ADDR environment variable to communicate with the cluster A Vault.
Log in with the initial root token.
Enable DR replication on cluster A.
Generate a secondary token and assign its value to the exported environment variable DR_SECONDARY_TOKEN.
Confirm the DR_SECONDARY_TOKEN environment variable value.
The output should resemble this example:
Enable replication on cluster B
You must perform the following operations on cluster B.
Now you can enable replication on cluster B. Vault will use the secondary token to automatically configure cluster B as a secondary to cluster A.
Export a VAULT_ADDR environment variable to communicate with Vault in cluster B.
Log in with the cluster B initial root token.
Enable DR replication on the secondary cluster.
Warning
This clears all data in the secondary cluster.
Expected output:
Confirm replication status
Now that you have successfully enabled DR replication, you will enable a new secrets engine and create a secret on cluster A, then confirm replication status between the clusters.
Enable the KV version 2 secrets engine, write a secret, and verify the replication status.
Export a VAULT_ADDR environment variable to communicate with the primary cluster Vault.
Log in with the cluster A root token.
Enable a Key/Value version 2 secrets engine at the path replicated-secrets.
Put a test secret into the newly enabled secrets engine.
Successful example output:
Check the replication status on the primary cluster (cluster A).
Check the replication status on cluster B.
The replication state on cluster A is running and its mode is primary. On cluster B, the state is stream-wals and the mode is secondary. This detail in combination with matching last_wal and last_remote_wal values confirms that the secret you created replicated to the secondary, and that the clusters synced.
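The check described above can be written as a small predicate over the two status outputs. This Python sketch assumes each argument is the data object of a DR replication status response, and that the primary's last_wal is compared with the secondary's last_remote_wal. The function name and sample values are hypothetical.

```python
def replication_in_sync(primary: dict, secondary: dict) -> bool:
    """True when both clusters report the expected mode and state and the
    primary's last_wal matches the secondary's last_remote_wal."""
    last_wal = primary.get("last_wal")
    return (
        primary.get("mode") == "primary"
        and primary.get("state") == "running"
        and secondary.get("mode") == "secondary"
        and secondary.get("state") == "stream-wals"
        and last_wal is not None
        and last_wal == secondary.get("last_remote_wal")
    )


# Sample values for illustration only.
cluster_a = {"mode": "primary", "state": "running", "last_wal": 57}
cluster_b = {"mode": "secondary", "state": "stream-wals", "last_remote_wal": 57}
print(replication_in_sync(cluster_a, cluster_b))
```

On a busy primary the two WAL indexes can differ briefly while the secondary catches up, so a single mismatch is a reason to check again rather than proof of a fault.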
Tip
You can learn more about replication monitoring in the Monitoring Vault Replication tutorial.
You are now ready to continue with the failover and failback scenario.
Failover scenario
The goal of this section is to fail over from the current primary cluster A, and promote the current secondary cluster B to become the new primary cluster.
You will also validate access to your secret data from the newly promoted primary, and update cluster A, setting cluster B as its new primary.
Take a snapshot
Before proceeding with any failover or failback, it's critical that you have a recent backup of the Vault data. Since the scenario environment uses Vault servers with Integrated Storage, you can take a snapshot of the cluster A Vault data, and write it to cluster-a/vault-cluster-a-snapshot.snap as a backup.
Export a VAULT_ADDR environment variable to communicate with the cluster A Vault.
Take a snapshot of the cluster A data, and write it to cluster-a/vault-cluster-a-snapshot.snap.
This command produces no output.
Confirm that the snapshot file is present in the cluster-a directory:
After confirming replication status and taking a snapshot of Vault data, you are ready to begin the failover workflow.
Batch disaster recovery operation token strategy
To promote a DR secondary cluster to be the new primary, you typically need a DR operation token. However, generating a DR operation token requires a threshold of unseal keys, or recovery keys if Vault uses auto unseal. This can be troublesome because an unexpected incident usually causes a cluster failure, and you might find it difficult to coordinate among the key holders to generate the DR operation token in a timely fashion.
As of Vault 1.4, you can create a batch DR operation token that you can use to promote and demote clusters as needed. This is a strategic operation that the Vault administrator can use to prepare for loss of the DR primary ahead of time. The batch DR operation token also has the advantage of being usable from the primary or secondary more than once.
Vault version
The following steps require Vault 1.4 or later. If you are running an earlier version of Vault, follow the DR operation token generation steps in the Promote DR Secondary to Primary section.
Export a VAULT_ADDR environment variable to communicate with the cluster A Vault.
Create a policy named "dr-secondary-promotion" on cluster A allowing the update capability for the sys/replication/dr/secondary/promote path. In addition, you can add a policy for the sys/replication/dr/secondary/update-primary path so that you can use the same DR operation token to update the primary cluster that the secondary cluster points to.
Successful example output:
Note
The policy on the sys/storage/raft/autopilot/state path is only required if your cluster uses Integrated Storage as its persistence layer. Refer to the Integrated Storage Autopilot tutorial to learn more about Autopilot.
Verify that you enabled the "dr-secondary-promotion" policy.
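For reference, the following Python sketch assembles a policy document covering the three paths mentioned above and wraps it in the JSON body accepted by the sys/policies/acl/dr-secondary-promotion endpoint. The update capability on the first two paths follows the text. The capabilities shown for the autopilot path are an assumption, as are the variable and function names.

```python
import json

# Policy text in HCL. The autopilot capabilities are an assumption.
DR_PROMOTION_POLICY = """\
path "sys/replication/dr/secondary/promote" {
  capabilities = ["update"]
}

path "sys/replication/dr/secondary/update-primary" {
  capabilities = ["update"]
}

# Only needed when the cluster uses Integrated Storage.
path "sys/storage/raft/autopilot/state" {
  capabilities = ["update", "read"]
}
"""


def policy_request_body(policy_hcl: str) -> str:
    """Return the JSON body for a PUT to sys/policies/acl/<policy name>."""
    return json.dumps({"policy": policy_hcl})


print(policy_request_body(DR_PROMOTION_POLICY))
```

Keeping the policy narrow matters here, because any holder of a token carrying it can promote a DR secondary.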
Create a token role named "failover-handler" with the dr-secondary-promotion policy attached and its type set to batch. You can't renew a batch token, so set the renewable parameter value to false. Also, set the orphan parameter to true.
Create a token for the "failover-handler" role with time-to-live (TTL) set to 8 hours.
Successful example output:
Export the token as the value of the CLUSTER_B_DR_OP_TOKEN environment variable.
Securely store this batch token. If you need to promote the DR secondary cluster, you can use the batch DR operation token to perform the promotion. The batch token works on both primary and secondary clusters.
This eliminates the need for the unseal keys (or recovery keys if using auto unseal).
Note
Batch tokens have a fixed TTL and the Vault server automatically deletes them after they expire. A Vault operator can take advantage of this by generating a batch DR operation token with a TTL equal to the duration of their shift.
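The role and token settings described in this section translate into two small request bodies for the token auth method. This Python sketch builds them as dictionaries. The parameter names (allowed_policies, orphan, renewable, token_type, ttl) and the auth.client_token response field are from the Vault token API, while the function names and the shift-length helper are illustrative.

```python
ROLE_PATH = "auth/token/roles/failover-handler"
CREATE_PATH = "auth/token/create/failover-handler"


def token_role_body() -> dict:
    """Body for a POST to ROLE_PATH: an orphan, non-renewable batch token role."""
    return {
        "allowed_policies": ["dr-secondary-promotion"],
        "orphan": True,
        "renewable": False,  # batch tokens cannot be renewed
        "token_type": "batch",
    }


def token_create_body(shift_hours: int = 8) -> dict:
    """Body for a POST to CREATE_PATH, with the TTL matched to an operator shift."""
    if shift_hours <= 0:
        raise ValueError("shift_hours must be positive")
    return {"ttl": f"{shift_hours}h"}


def client_token(response: dict) -> str:
    """Extract the new token from a token creation response."""
    return response["auth"]["client_token"]


print(ROLE_PATH, token_role_body())
print(CREATE_PATH, token_create_body())
```

Deriving the TTL from the shift length keeps the token short-lived while still covering the period in which the operator might need to promote the secondary.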
Generate a disaster recovery operation token
If you are on a version of Vault before 1.4.0, you need to create a DR operation token to perform this task.
The following process is similar to Generating a Root Token (via CLI). You must share a number of unseal keys (or recovery keys for auto unseal) equal to the threshold value. Vault generated the unseal and recovery keys when you initialized cluster A.
Note
If you have a DR operation batch token, you can skip the DR operation token generation and proceed to the Promote cluster B to primary status section.
Perform this operation on the DR secondary cluster (Cluster B).
Start the DR operation token generation process.
Example expected output:
Tip
Distribute the generated Nonce value to each unseal key holder.
Each unseal key holder should execute the following operation with their key share to generate a DR operation token.
Example:
Once you reach the threshold, the output displays an encoded DR operation token.
Example:
Decode the generated DR operation token (Encoded Token).
Example:
Export the token as the value of the CLUSTER_B_DR_OP_TOKEN environment variable.
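To show what the decode step does conceptually, here is a Python sketch that reverses the encoding. It assumes that the encoded token is the unpadded base64 form of the token XORed byte for byte with the one-time password (OTP), which is how recent Vault versions are understood to encode it. Treat that as an assumption and use vault operator generate-root -dr-token -decode with the -otp flag in practice. The encode helper exists only to produce a sample value, and all values are made up.

```python
import base64


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings of equal length."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def decode_operation_token(encoded_token: str, otp: str) -> str:
    """Recover the token from its encoded form and the OTP (assumed scheme)."""
    padded = encoded_token + "=" * (-len(encoded_token) % 4)  # restore padding
    return xor_bytes(base64.b64decode(padded), otp.encode()).decode()


def encode_operation_token(token: str, otp: str) -> str:
    """Inverse of decode_operation_token, used here to build sample data."""
    raw = xor_bytes(token.encode(), otp.encode())
    return base64.b64encode(raw).decode().rstrip("=")


sample_token = "hvs.example-dr-token"  # made-up value
sample_otp = "x" * len(sample_token)   # made-up OTP of matching length
encoded = encode_operation_token(sample_token, sample_otp)
print(decode_operation_token(encoded, sample_otp) == sample_token)
```

Because the OTP is required to decode the result, key holders can pass the encoded token around without exposing the usable DR operation token.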
Promote cluster B to primary status
The first step in this failover workflow is to promote cluster B as a primary.
While you can demote cluster A before promoting cluster B, in production DR scenarios you might instead promote cluster B before demoting cluster A due to unavailability of cluster A.
Note
For a brief time (between promotion of cluster B and demotion of cluster A) both clusters will be primary. You must redirect all traffic to cluster B once you promote it to primary. If there's a load balancer configured to route traffic to the cluster, you should change its rules to re-route traffic to the correct cluster. You should also update DNS entries for the cluster servers as needed during this phase.
Promote cluster B to primary using the batch DR operation token.
Successful example output:
Demote cluster A to secondary status
Demote cluster A so that it's no longer the primary cluster.
Export a VAULT_ADDR environment variable to address cluster A.
Demote cluster A.
Successful example output:
Test access to Vault data
Now that cluster B is the primary, you can use the initial root token from cluster A to check that the Vault data is available on the new primary cluster.
Export a VAULT_ADDR environment variable to address cluster B.
Check for the failover secret in replicated-secrets using the cluster A initial root token.
Successful example output:
The secret is present in your newly promoted primary cluster.
Create an updated version of the secret, and set the value of key failover to true.
Successful example output:
You have created version 2 of the secret while cluster B is acting as the primary cluster.
Point demoted cluster A to new primary cluster B
Now that you have verified access to the Vault data on cluster B, update cluster A to be a secondary in DR replication to cluster B.
You can use the secondary_public_key parameter to demonstrate updating the secondary in a network environment where the primary's API port is not available and thus an unwrap API call cannot be made. This instructs the primary to encrypt the connection details with the secondary's public key instead of using a wrapping token, which is the default behavior.
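The difference between the two token generation modes comes down to one optional request parameter. This Python sketch builds the body for the secondary-token request with and without secondary_public_key, and the body for the later update-primary request. The parameter names are from the Vault DR replication API. The function names, the secondary id, and the sample values are placeholders.

```python
from typing import Optional


def secondary_token_body(secondary_id: str, secondary_public_key: Optional[str] = None) -> dict:
    """Body for sys/replication/dr/primary/secondary-token on the primary.

    Without a public key, Vault returns a wrapping token (the default).
    With a public key, the connection details are encrypted for the
    secondary instead, so the secondary needs no unwrap API call.
    """
    body = {"id": secondary_id}
    if secondary_public_key:
        body["secondary_public_key"] = secondary_public_key
    return body


def update_primary_body(dr_operation_token: str, secondary_token: str) -> dict:
    """Body for sys/replication/dr/secondary/update-primary on the secondary."""
    return {"dr_operation_token": dr_operation_token, "token": secondary_token}


print(secondary_token_body("cluster-a"))
print(secondary_token_body("cluster-a", "placeholder-public-key"))
print(update_primary_body("placeholder-dr-op-token", "placeholder-secondary-token"))
```

The public key itself comes from the sys/replication/dr/secondary/generate-public-key endpoint on the cluster that is about to become the secondary.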
Export VAULT_ADDR environment variable to address cluster A.
On cluster A, generate the public key and export its value as the DR_SECONDARY_PUB_KEY environment variable.
Export a VAULT_ADDR environment variable to address cluster B.
Generate a new secondary token and assign its value to the exported environment variable CLUSTER_A_DR_SECONDARY_TOKEN. Notice that the secondary public key is also specified with the secondary_public_key parameter.
Confirm the environment variable value.
Successful output example:
Export VAULT_ADDR environment variable to address cluster A.
Point cluster A to cluster B, so that cluster A becomes a secondary cluster of (the new primary) cluster B. Use the batch operation token value or DR operation token value with the secondary token value to do so.
Check replication status on cluster A using JSON output for a bit more readability.
Successful output example:
Cluster A is now in mode secondary, and shows that it has a primary with a primary_cluster_addr value of https://secondary (cluster B), as expected.
Read the replicated-secrets/learn-failover secret with the cluster A initial root token.
Successful example output:
Vault returns the expected secret value, and cluster A is now a secondary cluster to cluster B.
Failback scenario
Now it's time to fail back, and restore the clusters to their initial replication state.
At this point cluster B is the new primary with cluster A as the secondary. You will now promote Cluster A (the original primary) back to primary.
Verify replication status on cluster A.
Successful output example:
Verify replication status on cluster B.
Successful output example:
From the replication status output, you can learn that cluster B is the primary, cluster A is the secondary, and replication is running and in stream-wals state.
You can now start the failback workflow.
Promote cluster A to primary status
Begin failback by promoting cluster A to primary status.
Note
At this point, you should begin redirecting all client traffic back to cluster A after its promotion to primary.
Use the batch DR operation token value from the CLUSTER_B_DR_OP_TOKEN environment variable to promote cluster A back to primary status.
Successful output example:
Demote cluster B to secondary status
Demote cluster B back to secondary status.
Successful output example:
Confirm replication status and access to data
The goal of this section is to check the replication status of cluster A and B, and read the secret data to confirm the failback.
Verify replication status on cluster A.
Successful output example:
Verify replication state on cluster B.
Successful output example:
The status indicates that the clusters are replicating again in their original state with cluster A being the primary and cluster B the secondary.
Try to update the secret data in cluster A.
Successful example output:
You have created a second version of the secret while cluster A is once again acting as the primary cluster.
Update replication primary on cluster B
The goal of this section is to update cluster B and point it to cluster A as the new primary cluster.
This time, you can use the default secondary token generation behavior, which is to encrypt the connection details in a wrapping token.
Generate a secondary token on cluster A and assign its value to the exported environment variable CLUSTER_A_DR_SECONDARY_TOKEN.
This command produces no output.
DR operation tokens are one-time use, so you need to generate a new one for this step. Use environment variables to override the Vault address and root token values, and start the DR operation token generation process.
Successful output example:
Export the OTP value from the earlier output as the environment variable CLUSTER_B_DR_OTP.
Display the cluster A unseal key value to use in the next step:
Generate the encoded token value.
When prompted, enter the unseal key from cluster A.
Successful output example:
Export the "Encoded Token" value from the earlier output as the environment variable CLUSTER_B_DR_ENCODED_TOKEN.
Complete the DR operation token generation, and export the resulting token value as the environment variable CLUSTER_B_DR_OP_TOKEN for later use.
Echo the CLUSTER_B_DR_OP_TOKEN environment variable to confirm that it's set.
Successful output example:
Update cluster B so that it uses cluster A as the new primary cluster.
Now check replication status on cluster B.
Successful example output:
The output shows that cluster B is now a secondary with a known primary cluster address that matches cluster A.
You have completed the failover and failback scenario with the Vault DR Replication feature.
Clean up
Stop the Docker containers (this will also automatically remove them).
Remove the Docker network.
Change into your home directory.
Remove the learn-vault-lab project directory.
Unset the environment variables.
Summary
You have learned how to establish a DR replication configuration between a primary and secondary cluster. You have also learned the essential workflow for failover from an existing primary cluster and failback to the original cluster state after operating in a failed over state.
Next steps
You can learn more about replication, including popular topics such as monitoring replication, setting up Performance Replication, and Performance Replication with HashiCorp Cloud Platform (HCP) Vault.