Disaster Recovery in a Serverless World - Part 3

This is part three of a multi-part blog series. In the first post, we covered Disaster Recovery planning when building serverless applications, and in the second post, we covered the systems engineering needed for an automated solution in the AWS cloud. In this post, we’ll discuss the end-to-end Disaster Recovery exercise we performed to validate the plan.

The time has come to exercise the Disaster Recovery plan. The objective of this phase is to see how closely the plan correlates to the current reality of your systems and to identify areas for improvement. In practice, this means assembling a team to conduct the exercise and documenting the process within the communication channel you outlined in the plan.

Background

In the first post, you’ll find an outline of the plan that we’ll be referencing in this exercise. Section 2 describes the process of initiating the plan and assigning roles, and Section 3 describes the communication channels used to keep stakeholders in sync.

Set the Stage

Decide who will be involved in the exercise, block out a chunk of time equivalent to the Recovery Time Objective, and create the communication channel in Slack (or whichever communication tool your organization uses, e.g., Google Hangouts or Skype).

Create a document to capture key information relevant to the Disaster Recovery exercise, such as who is assigned which roles, the AWS regions in play, and, most importantly, a timeline recording when each recovery step was initiated and completed. Whereas the Disaster Recovery plan is a living document to be updated over time, the Disaster Recovery exercise document mentioned here is specific to the date of the exercise, e.g., Disaster Recovery Exercise 20180706.
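
If you would rather keep the timeline in a structured form alongside the written document, here is a minimal sketch of what that record could look like. Everything in it is an assumption made for illustration: the class and field names are hypothetical, and the regions, roles, step names, timestamps, and four-hour Recovery Time Objective are placeholders rather than values from our exercise.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Step:
    """One recovery step and when it was initiated and completed."""
    name: str
    started: datetime
    completed: Optional[datetime] = None


@dataclass
class ExerciseLog:
    """Hypothetical record for a single, dated exercise."""
    title: str
    source_region: str
    target_region: str
    rto: timedelta
    roles: Dict[str, str] = field(default_factory=dict)
    steps: List[Step] = field(default_factory=list)

    def elapsed(self) -> timedelta:
        """Time from the first step starting to the last completed step finishing."""
        done = [s for s in self.steps if s.completed is not None]
        if not done:
            return timedelta(0)
        first_start = min(s.started for s in self.steps)
        last_finish = max(s.completed for s in done)
        return last_finish - first_start

    def within_rto(self) -> bool:
        """True when the recorded recovery time fits the Recovery Time Objective."""
        return self.elapsed() <= self.rto


# Placeholder values only; substitute your own regions, roles, and times.
log = ExerciseLog(
    title="Disaster Recovery Exercise 20180706",
    source_region="us-east-1",
    target_region="us-west-2",
    rto=timedelta(hours=4),
    roles={"coordinator": "person-a", "executive representative": "person-b"},
    steps=[
        Step("Restore data in target region",
             datetime(2018, 7, 6, 9, 5), datetime(2018, 7, 6, 9, 35)),
        Step("Update DNS",
             datetime(2018, 7, 6, 9, 35), datetime(2018, 7, 6, 9, 55)),
    ],
)
print(log.elapsed(), log.within_rto())
```

Recording a start and a completion time for every step is what makes the comparison against the Recovery Time Objective possible afterwards, and it gives the retrospective concrete numbers to discuss.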

Conduct the Exercise

Enter the communication channel and post a message that signals the start of the exercise, immediately followed by a message to the executive team requesting their approval to initiate the Disaster Recovery plan.

Upon completion of the exercise, post a message that signals the end of the exercise.

Key Takeaways

  1. Disaster Recovery exercises can be stressful. Be courteous and supportive to one another.
  2. High-performing teams develop trust by practicing effective communication. Lean towards communication that is both precise and unambiguous. This is easier said than done, but it gets easier with experience. Over time, the Disaster Recovery plan and exercise will become self-reinforcing.
  3. Don’t forget to involve a representative of the executive team. This is to ensure that the operational status is being communicated across all levels.
  4. Be clear on which AWS region represents the source and which AWS region represents the DR target.
  5. Research and experiment with the options you have available to automate DNS changes. The key here is gaining skill and confidence through testing, and optimizing for fast recovery. Know your record types (A and CNAME) and the importance of TTL.
  6. Verify completion of recovery from the perspective of a customer.
  7. Conduct a retrospective to illuminate areas for improvement or research. Ask questions that encourage participants to openly discuss both the technical and psychological aspects of the exercise.
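
To make the point about TTL in takeaway 5 concrete, here is a minimal sketch of the arithmetic. It assumes that resolvers honor the TTL, which not every client does, and the function names and example times are hypothetical. The TTL that matters is the one on the record as it was being served before the change, since that is the value caches are still holding.

```python
from datetime import datetime, timedelta


def latest_stale_answer(changed_at: datetime, ttl_seconds: int) -> datetime:
    """Latest time a TTL-honoring resolver could still return the old record."""
    return changed_at + timedelta(seconds=ttl_seconds)


def fits_recovery_window(exercise_start: datetime, changed_at: datetime,
                         ttl_seconds: int, rto: timedelta) -> bool:
    """True if cached copies of the old record expire before the RTO deadline."""
    deadline = exercise_start + rto
    return latest_stale_answer(changed_at, ttl_seconds) <= deadline


# Placeholder times: exercise starts at 09:00, DNS is changed at 09:35.
start = datetime(2018, 7, 6, 9, 0)
changed = datetime(2018, 7, 6, 9, 35)
one_hour = timedelta(hours=1)

print(fits_recovery_window(start, changed, 300, one_hour))    # 5 minute TTL
print(fits_recovery_window(start, changed, 86400, one_hour))  # 24 hour TTL
```

With a five-minute TTL the old record ages out of caches well inside a one-hour window, while a 24-hour TTL would leave some customers pointed at the source region long after the recovery steps have finished. This is one reason to verify recovery from the perspective of a customer rather than from the DNS console alone.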