Disaster Recovery in a Serverless World - Part 2

Apurva Jantrania

This is part two of a multi-part blog series. In the previous post, we covered Disaster Recovery planning when building serverless applications. In this post, we'll discuss the systems engineering needed for an automated solution in the AWS cloud.

As I started looking into implementing Stackery's automated backup solution, my goal was simple: to support a disaster recovery plan, we needed a system that automatically creates backups of our database in a different account and a different region. This seemed like a straightforward task, but I was surprised to find no documentation on how to do this in an automated, scalable way - everything I could find covered only partial solutions, all done manually via the AWS Console. Yuck.

I hope that this post will help fill that void and help you understand how to implement automated backups for your own disaster recovery plan. This post does get a bit long, so if that's not your thing, see the TL;DR at the end.

The Initial Plan

AWS RDS has automated backups, which seemed like the perfect platform to base this automation upon. Furthermore, RDS emits events that seemed ideal for kicking off a Lambda function that would then copy the snapshot to the disaster recovery account.

Discoveries

The first issue I discovered was that AWS does not allow you to share automated snapshots - AWS requires that you first make a manual copy of the snapshot before you can share it with another account. I initially thought that this wouldn't be a major issue - I could easily make my Lambda function first kick off a manual copy. According to the RDS Events documentation, there is an event, RDS-EVENT-0042, that fires when a manual snapshot is created. I could then use that event to share the newly created manual snapshot with the disaster recovery account.

This led to the second issue - while RDS will emit events for snapshots that are created manually, it does not emit events for snapshots that are copied manually. The AWS docs aren't clear about this and it's an unfortunate feature gap. This means that I had to fall back to a timer-based Lambda function that searches for and shares the latest available snapshot.

Final Implementation Details

While this ended up more complicated than initially envisioned, Stackery still makes it easy to add all the needed pieces for fully automated backups. My implementation ended up looking like this:

The DB Event Subscription resource is a CloudFormation Resource which contains a small snippet of CloudFormation that subscribes the DB Events topic to the RDS database.
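For reference, a subscription of this kind might look roughly like the following, written here as a JavaScript object that mirrors the CloudFormation resource. This is only a sketch: the helper name, the logical IDs `DBEventSubscription` and `DBEventsTopic`, and the choice of `SourceType` and `EventCategories` are assumptions for illustration, not the actual snippet from the stack.

```javascript
// Hypothetical helper: builds an AWS::RDS::EventSubscription resource that
// forwards snapshot creation events to an SNS topic in the same template.
function buildDbEventSubscription (topicLogicalId) {
  return {
    Type: 'AWS::RDS::EventSubscription',
    Properties: {
      Enabled: true,
      // Ref on an AWS::SNS::Topic resolves to the topic ARN
      SnsTopicArn: { Ref: topicLogicalId },
      // Assumption: subscribe to snapshot events rather than instance events
      SourceType: 'db-snapshot',
      EventCategories: ['creation']
    }
  };
}

const template = {
  Resources: {
    DBEventSubscription: buildDbEventSubscription('DBEventsTopic')
  }
};

console.log(JSON.stringify(template, null, 2));
```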

Function 1 - dbBackupHandler

This function will receive the events from the RDS database via the DB Events topic. It then creates a copy of the snapshot with an ID that identifies the snapshot as an automated disaster recovery snapshot.

    const AWS = require('aws-sdk');
    const rds = new AWS.RDS();

    const DR_KEY = 'dr-snapshot';
    const ENV = process.env.ENV;

    module.exports = async message => {
      // Only run DB Backups on Production and Staging
      if (!['production', 'staging'].includes(ENV)) {
        return {};
      }

      let records = message.Records;
      for (let i = 0; i < records.length; i++) {
        let record = records[i];

        if (record.EventSource === 'aws:sns') {
          let msg = JSON.parse(record.Sns.Message);
          if (msg['Event Source'] === 'db-snapshot' && msg['Event Message'] === 'Automated snapshot created') {
            let snapshotId = msg['Source ID'];
            let targetSnapshotId = `${snapshotId}-${DR_KEY}`.replace('rds:', '');

            let params = {
              SourceDBSnapshotIdentifier: snapshotId,
              TargetDBSnapshotIdentifier: targetSnapshotId
            };

            try {
              await rds.copyDBSnapshot(params).promise();
            } catch (error) {
              if (error.code === 'DBSnapshotAlreadyExists') {
                console.log(`Manual copy ${targetSnapshotId} already exists`);
              } else {
                throw error;
              }
            }
          }
        }
      }

      return {};
    };

A couple of things to note:

  • I'm leveraging Stackery Environments in this function - I have used Stackery to define process.env.ENV based on the environment the stack is deployed to
  • Automatic RDS snapshots have an id that begins with 'rds:'. However, snapshots created by the user cannot have a ':' in the ID, so the function strips that prefix when naming the copy.
  • To make future steps easier, I append dr-snapshot to the id of the snapshot that is created
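To make the ID handling concrete, here is a small sketch of the two pure steps in the function above: picking the snapshot ID out of the event message, and deriving the ID of the manual copy. The helper names and the sample payload values are made up for illustration; only the message keys and the `dr-snapshot` suffix come from the function itself.

```javascript
const DR_KEY = 'dr-snapshot';

// Returns the automated snapshot's ID from a parsed RDS event message,
// or null if the message is not an "automated snapshot created" event.
function getAutomatedSnapshotId (msg) {
  const isSnapshotCreated = msg['Event Source'] === 'db-snapshot' &&
    msg['Event Message'] === 'Automated snapshot created';
  return isSnapshotCreated ? msg['Source ID'] : null;
}

// Derives the ID of the manual copy: append the DR suffix and drop the
// 'rds:' prefix, since user-created snapshot IDs cannot contain ':'.
function toDrSnapshotId (snapshotId) {
  return `${snapshotId}-${DR_KEY}`.replace('rds:', '');
}

// Illustrative payload only; the values are invented.
const sampleMessage = JSON.stringify({
  'Event Source': 'db-snapshot',
  'Source ID': 'rds:mydb-2018-10-01-06-10',
  'Event Message': 'Automated snapshot created'
});

const sourceId = getAutomatedSnapshotId(JSON.parse(sampleMessage));
console.log(sourceId);                 // rds:mydb-2018-10-01-06-10
console.log(toDrSnapshotId(sourceId)); // mydb-2018-10-01-06-10-dr-snapshot
```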

Function 2 - shareDatabaseSnapshot

This function runs every few minutes and shares any disaster recovery snapshots with the disaster recovery account.

    const AWS = require('aws-sdk');
    const rds = new AWS.RDS();

    const DR_KEY = 'dr-snapshot';
    const DR_ACCOUNT_ID = process.env.DR_ACCOUNT_ID;
    const ENV = process.env.ENV;

    module.exports = async message => {
      // Only run on Production and Staging
      if (!['production', 'staging'].includes(ENV)) {
        return {};
      }

      // Get latest snapshot
      let snapshot = await getLatestManualSnapshot();

      if (!snapshot) {
        return {};
      }

      // See if snapshot is already shared with the Disaster Recovery Account
      let data = await rds.describeDBSnapshotAttributes({ DBSnapshotIdentifier: snapshot.DBSnapshotIdentifier }).promise();
      let attributes = data.DBSnapshotAttributesResult.DBSnapshotAttributes;

      let isShared = attributes.find(attribute => {
        return attribute.AttributeName === 'restore' && attribute.AttributeValues.includes(DR_ACCOUNT_ID);
      });

      if (!isShared) {
        // Share Snapshot with Disaster Recovery Account
        let params = {
          DBSnapshotIdentifier: snapshot.DBSnapshotIdentifier,
          AttributeName: 'restore',
          ValuesToAdd: [DR_ACCOUNT_ID]
        };
        await rds.modifyDBSnapshotAttribute(params).promise();
      }

      return {};
    };

    async function getLatestManualSnapshot (latest = undefined, marker = undefined) {
      let result = await rds.describeDBSnapshots({ Marker: marker }).promise();

      result.DBSnapshots.forEach(snapshot => {
        if (snapshot.SnapshotType === 'manual' && snapshot.Status === 'available' && snapshot.DBSnapshotIdentifier.includes(DR_KEY)) {
          if (!latest || new Date(snapshot.SnapshotCreateTime) > new Date(latest.SnapshotCreateTime)) {
            latest = snapshot;
          }
        }
      });

      if (result.Marker) {
        return getLatestManualSnapshot(latest, result.Marker);
      }

      return latest;
    }

  • Once again, I'm leveraging Stackery Environments to populate the ENV and DR_ACCOUNT_ID environment variables.
  • When sharing a snapshot with another AWS account, the AttributeName should be set to restore (see the AWS RDS SDK)
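The "already shared?" check is easy to try in isolation. The sketch below pulls it into a pure helper and runs it against a hand-written response in the shape that describeDBSnapshotAttributes returns. The helper name `isSharedWith` and the account IDs are placeholders I made up for the example.

```javascript
// True when the snapshot's 'restore' attribute already lists the account.
function isSharedWith (attributesResponse, accountId) {
  const attributes = attributesResponse.DBSnapshotAttributesResult.DBSnapshotAttributes;
  return attributes.some(attribute =>
    attribute.AttributeName === 'restore' &&
    attribute.AttributeValues.includes(accountId));
}

// Hand-written sample response; the snapshot ID and account ID are invented.
const sampleResponse = {
  DBSnapshotAttributesResult: {
    DBSnapshotIdentifier: 'mydb-2018-10-01-06-10-dr-snapshot',
    DBSnapshotAttributes: [
      { AttributeName: 'restore', AttributeValues: ['111122223333'] }
    ]
  }
};

console.log(isSharedWith(sampleResponse, '111122223333')); // true
console.log(isSharedWith(sampleResponse, '444455556666')); // false
```

Because the check only looks at the `restore` attribute, re-running the share function on a snapshot that is already shared is a no-op, which is what makes a simple timer safe here.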

Function 3 - copyDatabaseSnapshot

This function will run in the Disaster Recovery account and is responsible for detecting snapshots that are shared with it and making a local copy in the correct region - in this example, it will make a copy in us-east-1.

    const AWS = require('aws-sdk');
    const sourceRDS = new AWS.RDS({ region: 'us-west-2' });
    const targetRDS = new AWS.RDS({ region: 'us-east-1' });

    const DR_KEY = 'dr-snapshot';
    const ENV = process.env.ENV;

    module.exports = async message => {
      // Only Production_DR and Staging_DR are Disaster Recovery Targets
      if (!['production_dr', 'staging_dr'].includes(ENV)) {
        return {};
      }

      let [shared, local] = await Promise.all([getSourceSnapshots(), getTargetSnapshots()]);

      for (let i = 0; i < shared.length; i++) {
        let snapshot = shared[i];
        let fullSnapshotId = snapshot.DBSnapshotIdentifier;
        let snapshotId = getCleanSnapshotId(fullSnapshotId);
        if (!snapshotExists(local, snapshotId)) {
          let targetId = snapshotId;

          let params = {
            SourceDBSnapshotIdentifier: fullSnapshotId,
            TargetDBSnapshotIdentifier: targetId
          };
          // Issue the copy against the target region so the copy lands in
          // us-east-1 regardless of the region this function runs in
          await targetRDS.copyDBSnapshot(params).promise();
        }
      }

      return {};
    };

    // Get snapshots that are shared to this account
    async function getSourceSnapshots () {
      return getSnapshots(sourceRDS, 'shared');
    }

    // Get snapshots that have already been created in this account
    async function getTargetSnapshots () {
      return getSnapshots(targetRDS, 'manual');
    }

    async function getSnapshots (rds, typeFilter, snapshots = [], marker = undefined) {
      let params = {
        IncludeShared: true,
        Marker: marker
      };

      let result = await rds.describeDBSnapshots(params).promise();

      result.DBSnapshots.forEach(snapshot => {
        if (snapshot.SnapshotType === typeFilter && snapshot.DBSnapshotIdentifier.includes(DR_KEY)) {
          snapshots.push(snapshot);
        }
      });

      if (result.Marker) {
        return getSnapshots(rds, typeFilter, snapshots, result.Marker);
      }

      return snapshots;
    }

    // Check to see if the snapshot `snapshotId` is in the list of `snapshots`
    function snapshotExists (snapshots, snapshotId) {
      for (let i = 0; i < snapshots.length; i++) {
        let snapshot = snapshots[i];
        if (getCleanSnapshotId(snapshot.DBSnapshotIdentifier) === snapshotId) {
          return true;
        }
      }
      return false;
    }

    // Cleanup the IDs from automatic backups that are prepended with `rds:`
    function getCleanSnapshotId (snapshotId) {
      let result = snapshotId.match(/:([a-zA-Z0-9-]+)$/);

      if (!result) {
        return snapshotId;
      } else {
        return result[1];
      }
    }

  • Once again, leveraging Stackery Environments to populate ENV, I ensure this function only runs in the Disaster Recovery accounts
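It may help to see what the ID cleanup does on real-looking input. As far as I can tell, snapshots shared from another account are listed with their full ARN as the identifier, so the same regular expression that strips `rds:` also reduces an ARN to its final segment. The sketch below reuses getCleanSnapshotId and adds a hypothetical `findMissingSnapshots` helper to show the "which shared snapshots still need a local copy" comparison; the ARNs and account ID are invented.

```javascript
// Keeps only the text after the last ':' (handles 'rds:' prefixes and ARNs).
function getCleanSnapshotId (snapshotId) {
  const result = snapshotId.match(/:([a-zA-Z0-9-]+)$/);
  return result ? result[1] : snapshotId;
}

// Hypothetical helper: shared snapshots that have no local copy yet.
function findMissingSnapshots (shared, local) {
  const localIds = new Set(local.map(s => getCleanSnapshotId(s.DBSnapshotIdentifier)));
  return shared.filter(s => !localIds.has(getCleanSnapshotId(s.DBSnapshotIdentifier)));
}

// Invented sample data
const shared = [
  { DBSnapshotIdentifier: 'arn:aws:rds:us-west-2:111122223333:snapshot:mydb-2018-10-01-06-10-dr-snapshot' },
  { DBSnapshotIdentifier: 'arn:aws:rds:us-west-2:111122223333:snapshot:mydb-2018-10-02-06-10-dr-snapshot' }
];
const local = [
  { DBSnapshotIdentifier: 'mydb-2018-10-01-06-10-dr-snapshot' }
];

console.log(getCleanSnapshotId(shared[0].DBSnapshotIdentifier));
// mydb-2018-10-01-06-10-dr-snapshot
console.log(findMissingSnapshots(shared, local).map(s => getCleanSnapshotId(s.DBSnapshotIdentifier)));
// [ 'mydb-2018-10-02-06-10-dr-snapshot' ]
```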

TL;DR - How Automated Backups Should Be Done

  1. Have a function that manually creates an RDS snapshot, triggered by a timer. Use an interval that makes sense for your use case.
  2. Don't bother trying to leverage the daily automated snapshot provided by AWS RDS.
  3. Have a second function that monitors for the successful creation of the snapshot from the first function and shares it with your disaster recovery account.
  4. Have a third function that operates in your disaster recovery account, monitors for snapshots shared with the account, and then creates a copy of the snapshot that is owned by the disaster recovery account and lives in the correct region.
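The first step in the list above isn't shown elsewhere in this post, so here is a minimal sketch of what such a timer-triggered function could look like. It is not code from our stack: the helper names, the `mydb` instance ID, and the timestamp format are assumptions. The RDS client is passed in so the naming logic can be tried without AWS credentials; in a real function you would pass `new AWS.RDS()` from aws-sdk.

```javascript
const DR_KEY = 'dr-snapshot';

// Builds createDBSnapshot parameters with a timestamped ID that uses only
// letters, digits, and hyphens, and carries the DR suffix.
function buildSnapshotParams (dbInstanceId, date) {
  // '2018-10-01T06:10:00.000Z' -> '2018-10-01-06-10'
  const stamp = date.toISOString().slice(0, 16).replace(/[T:]/g, '-');
  return {
    DBInstanceIdentifier: dbInstanceId,
    DBSnapshotIdentifier: `${dbInstanceId}-${stamp}-${DR_KEY}`
  };
}

// Returns a handler to attach to a timer. `rds` is an AWS.RDS client
// (aws-sdk v2 style, where requests expose .promise()).
function makeSnapshotHandler (rds, dbInstanceId) {
  return async () => {
    const params = buildSnapshotParams(dbInstanceId, new Date());
    await rds.createDBSnapshot(params).promise();
    return params.DBSnapshotIdentifier;
  };
}

console.log(buildSnapshotParams('mydb', new Date('2018-10-01T06:10:00Z')).DBSnapshotIdentifier);
// mydb-2018-10-01-06-10-dr-snapshot
```

Because the resulting ID already contains `dr-snapshot`, the share and copy functions shown earlier should pick these snapshots up without changes.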
