This is part three of a multi-part blog series. In the first post, we covered Disaster Recovery planning when building serverless applications, and in the second post, we covered the systems engineering needed for an automated solution in the AWS cloud. In this post, we’ll discuss the end-to-end Disaster Recovery exercise we performed to validate the plan.
The time has come to exercise the Disaster Recovery plan. The objective of this phase is to see how closely the plan correlates to the current reality of your systems and to identify areas for improvement. In practice, this means assembling a team to conduct the exercise and documenting the process within the communication channel you outlined in the plan.
In the first post, you’ll find an outline of the plan that we’ll be referencing in this exercise. Section 2 describes the process of initiating the plan and assigning roles, and Section 3 describes the communication channels used to keep stakeholders in sync.
Decide who will be involved in the exercise, block out a chunk of time equivalent to the Recovery Time Objective, and create the communication channel in Slack (or whichever communication tool your organization uses, e.g., Google Hangouts or Skype).
Create a document to capture key information relevant to the Disaster Recovery exercise, such as who is assigned which roles, the AWS regions in play, and, most importantly, a timeline recording when each recovery step was initiated and completed. Whereas the Disaster Recovery plan is a living document to be updated over time, the Disaster Recovery exercise document mentioned here is specific to the date of the exercise, e.g., Disaster Recovery Exercise 20180706.
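If you would rather capture the exercise document as structured data than as free text, the following is a minimal sketch of one way to do it. The class names, field names, and the idea of comparing elapsed time against the Recovery Time Objective in code are our own assumptions for illustration, not part of the plan itself; adapt them to whatever your plan actually tracks.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class RecoveryStep:
    """One recovery step with its initiation and completion times."""
    name: str
    started_at: datetime
    completed_at: Optional[datetime] = None


@dataclass
class ExerciseRecord:
    """Hypothetical structure for a single, dated exercise document."""
    exercise_date: date
    rto: timedelta  # Recovery Time Objective taken from your plan
    roles: Dict[str, str] = field(default_factory=dict)    # role -> person
    regions: Dict[str, str] = field(default_factory=dict)  # purpose -> AWS region
    steps: List[RecoveryStep] = field(default_factory=list)

    @property
    def title(self) -> str:
        # Produces a name such as "Disaster Recovery Exercise 20180706".
        return f"Disaster Recovery Exercise {self.exercise_date:%Y%m%d}"

    def start_step(self, name: str, at: datetime) -> None:
        self.steps.append(RecoveryStep(name=name, started_at=at))

    def complete_step(self, name: str, at: datetime) -> None:
        # Mark the first still-open step with this name as completed.
        for step in self.steps:
            if step.name == name and step.completed_at is None:
                step.completed_at = at
                return
        raise KeyError(f"no open step named {name!r}")

    def elapsed(self) -> timedelta:
        # Time from the first step starting to the last completed step ending.
        finished = [s.completed_at for s in self.steps if s.completed_at is not None]
        if not finished:
            return timedelta(0)
        return max(finished) - min(s.started_at for s in self.steps)

    def within_rto(self) -> bool:
        return self.elapsed() <= self.rto
```

Each participant can call `start_step` and `complete_step` as the exercise progresses, and `within_rto()` gives a quick check afterwards of whether the recorded timeline fit inside the time you blocked out.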
Enter the communication channel and post a message that signals the start of the exercise, immediately followed by a message to the executive team requesting their approval to initiate the Disaster Recovery plan.
Upon completion of the exercise, post a message that signals the end of the exercise.
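The start, approval, and end messages can of course be typed by hand. If you would like them to be consistent from one exercise to the next, here is a rough sketch that posts them through a Slack incoming webhook. It assumes you have already configured an incoming webhook for the channel; the message wording, the function names, and the webhook URL you pass in are all placeholders of our own, not something prescribed by the plan.

```python
import json
import urllib.request

# Hypothetical message templates; adjust the wording to suit your team.
TEMPLATES = {
    "start": "START: {name} is beginning now.",
    "approval": (
        "Executive team: requesting approval to initiate "
        "the Disaster Recovery plan for {name}."
    ),
    "end": "END: {name} is complete.",
}


def build_message(kind: str, exercise_name: str) -> dict:
    """Build the JSON payload for one of the exercise messages."""
    return {"text": TEMPLATES[kind].format(name=exercise_name)}


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

For example, `post_to_slack(webhook_url, build_message("start", "Disaster Recovery Exercise 20180706"))` would announce the start, where `webhook_url` is the URL Slack generated for your channel. Keeping the payload builder separate from the network call makes it easy to check the wording without sending anything.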