Stacks on Stacks

The Serverless Ecosystem Blog by Stackery.

Posts on DevOps

Stand On the Shoulders of Giants or Fall in Their Footsteps?
Toby Fee

Toby Fee | October 09, 2018

Stand On the Shoulders of Giants or Fall in Their Footsteps?

The first-mover advantage is much touted in the technology press. After all, you won’t really see much benefit to being the second car insurance company that makes a sitcom about cavemen. But in technology, being first to market only works when that first enables a new business. On other occasions, following the pioneers can enable others to move faster than the pioneers.

All My Heroes Were the First to Release New Products

With a long enough temporal distance it’s easy to remember iPods, Nintendo Entertainment Systems, or even Heroku as ‘the first thing like that to appear’ when all of these are actually beneficiaries of second mover advantage.

Here I have the extreme pleasure to acquaint you with the MacRumors forum thread from 2001 when Apple fans decried the iPod as just another mp3 player with ‘no more features than my Rio Diamond Jukebox.’ Sadly while writing this article I could not find a way to work in the fact that Steve Job’s presentation stack for the first iPod is typeset in Comic Sans.

In short, first movers, even those who have a successful product launch, offer a ‘free ride’ to their competitors who can use the first movers’ market research, product design, and advertising. If parts of the market don’t respond to the first product, a second-mover can offer a new product that, along with improvements, addresses the market better.

How do these differences shake out in the technical field? When we’re not talking about marketing a product to a group but rather the engineering, the process is extremely different when you’re not breaking new ground.

And how, fundamentally, is the development process different? Can we stand on the shoulders of giants? Or will we get bogged down in a landscape already riddled with giant’s footprints? What are the pitfalls of going second?

Follow the giants’ map.

When Walmart Labs took up NodeJS when it was an open source darling with a few un-formatted websites proudly proclaiming its benefits. When the team documented their path to success it largely recapped a framework’s path to mainstream success.

  • Patching core modules to support interoperability with other frameworks - when node core HTTP was converting everything to lowercase and the existing Java framework expected custom headers in uppercase, a patch was the only solution. When you are trying to roll out a new open source tool in your enterprise, you won’t be able to wait for your Github issue to get upvotes. Like it or not you will end up either forking or patching core.

  • Implement procrustean resource imports - I want security updates and bug fixes as much as the next gal, so the inbuilt Semantic Versioning (semver) support in npm is perfect for me. But in enterprise, we need to know that rollouts are identical, to the bit, with what we did last week. That means we want to lock our packaging to an exact version of a module. There are a ton of ways to do this in npm now but all of this was stuff Walmart Labs had to work out on their own.

  • Onboarding - How long does it take to go from ‘tinkerer’ to ‘competent’? I’ve been writing C for the Arduino for for 4 years and I’m just above a beginner. Focused self-directed study might only take a few months, but onboarding in an enterprise shop is a matter of weeks. Again, in the early days of a framework you may end up drafting internal documents rather than forcing new devs to sift through mountains of repo documentation, blog posts, and Stack Overflow questions to get up to speed.

Enterprise features and tools are not automatic.

In these examples Node was 100 percent working it had no bugs or fundamental flaws. All of the missing pieces were ones that you only need if you are working in an enterprise environment. The clever engineers who built node didn’t ‘forget’ these parts, but using Node to make money required them!

Blazing a trail even if you’re using AWS

iRobot dived feet-first into serverless architecture to run countless IoT devices, and while this has come with some great successes it can lead to some odd moments! In articles like this one Ben Kehoe of iRobot tries to be the first one to wrap his mind around the new AWS Fargate, and how it changes the map for iRobot.

Even without facing road bumps with a new project, keeping up with bleeding-edge features means your team might change directions after every re:Invent conference. Occasionally it means that a service that you thought was an expansion of serverless is more like automatically managed containers.

Rewrite your queries to serve customers better, not to save RAM.

Being the second mover in a managed environment means that both performance issues and enterprise requirements should be solved for the majority of cases. The examples above are all about making your team and the code work, none of them involve real business problems.

What if all our bugs could be business focused? Rather than trying to optimize queries for memory use, you were optimizing sorting, providing more relevant results, or adding metadata to power cool new features?

The power of serverless, a highly managed environment where operations can focus on observing real performance, is to allow your developers to spend less time optimizing within frameworks and more time understanding your business needs.

…and Stackery Can Help

Stackery, a tool for creating serverless stacks, makes it easy for teams to collaborate on AWS resources and configuration. With tools to let your whole team modify and approve changes to your stack, it has the potential to let you stick to best practices as you roll out services for your customers. AWS CloudFormation has an open source standard for describing and deploying stacks Serverless Application Model (SAM) YAML, which Stackery automatically creates, making it easier to learn the standard.

Move Slow and Make Things

Did you ever hear the one about two friends who come upon a bear in the woods, and one man sits down to tie his shoes? “What are you doing?” says the other man, “shoes or no you can’t outrun a bear.” The first man finishes lacing up and says “I don’t need to outrun the bear, I just need to outrun you.”

If the message of this piece is ‘move slow’ it should really be ‘move a bit slower than the competition.’

If you are breaking new territory, plan the time you’ll need for enterprise features and documentation, and if you really want to deny your competitors second-mover advantage, keep tight-lipped about best practices and cancel your senior devs’ speaking schedule!

Observability is Not Just Logging or Metrics
Toby Fee

Toby Fee | October 01, 2018

Observability is Not Just Logging or Metrics

Lessons from Real-World Operations

We generally expect that every new technology will have, along with massive new advantages, also has fundamental flaws that will mean the old technology always has its place in our tool belt. Such a narrative is comforting! It means none of the time we spent learning older tools was wasted, since the new hotness will never truly replace what came before. The term ‘observability’ has a lot of cache when planning serverless applications. Some variation of this truism is ‘serverless is a great product that has real problems with observability.’

The reality is not one of equal offerings with individual strengths and weaknesses. Serverless is superior to previous managed hosting tools, and it is the lack of hassle associated with logging, metrics, measurement, and analytics. Observability stands out as one of the few problems that serverless doesn’t solve on its own.

What Exactly is Observability?

Everything from logging to alerts gets labelled as observability, but the shortest definition is: observability lets you see externally how a system is working internally.

Observability should let you see what’s going wrong with your code without deploying new code. Does logging qualify as observability? Possibly! If a lambda logs each request its receiving, and the error is being caused by malformed URL’s being passed to that lambda, logging would certainly resolve the issue! But when the question is ‘how are URLs getting malformed?’ It’s doubtful that logging will provide a clear answer.

In general, it would be difficult to say that aggregated metrics increase observability. If we know that all account updates sent after 9pm take over 200ms, it is hard to imagine how that will tell us what’s wrong with the code.

Preparing for the Past

A very common solution to an outage or other emergency is to deploy a dashboard of metrics to detect the problem in future. This is an odd thing to do. Unless you can explain why you’re unable to fix this problem, there’s no reason to add detection for this specific error. Further, dashboards often exist to detect the same symptoms e.g. memory running out on a certain subset of servers. But running out of memory could be caused by many things, and provided we’re not looking at exactly the same problem saying ‘the server ran out of memory’ is a pretty worthless clue to start worthless clue to start with.

Real crises are those that affect your users. And problems that have a real effect on users are neither single interactions nor are they aggregated information. Think about some statements and whether they constitute an acute crisis:

  • Average load times are up 5% for all users. This kind of issue is a critical datum for project planning and management, but ‘make the site go faster for everyone’ is, or should be, a goal for all development whenever you’re not adding features.
  • One transaction took 18 minutes. I bet you one million dollars this is either a maintenance task or delayed job.
  • Thousands of Croatian accounts can’t log in. Now we actually have a trend! While we might be seeing a usage pattern (possibly a brute force attack), but there’s a chance that a routing or database layer is acting up in a way that affects one subset of users.
  • All logins with a large number of notifications backed up are incredibly slow, more than 30 seconds. This gives us a nice tight section of code to examine. As long as our code base is functional, it shouldn’t be tough to root out a cause!

How Do We Fix This?

1. The right tools

The tool that could have been created to fix this exact problem is Rookout, which lets you add logging dynamically without re-deploying. While pre-baked logging is unlikely to help you fix a new problem, Rookout lets you add logging to any line of code without a re-deploy (a wrapper grabs new rookout config at invocation). Right now I’m working on a tutorial where we hunt down Python bugs using Rookout, and it’s a great tool for hunting down errors.

Two services offer event-based logging that moves away from a study of averages and metrics and toward trends.

  • isn’t targeted at serverless directly, but offers great sampling tools. Sampling offers performance advantages over logging event details every time.
  • IOpipe is targeted at serverless and is incredibly easy to get deployed on your lambdas. The information gathered favors transactions over invocations.

2. Tag, cross-reference, and group

Overall averages are dangerous, they lead us into broad-reaching diagnoses that don’t point to specific problems. Generalized optimization looks a lot like ‘pre-optimization,’ where you’re rewriting code without knowing what problems you’re trying to fix or how. The best way to ensure that you’re spotting trends is to add as many tags as are practical to what you’re measuring. You’ll also need a way to gather this back together, and try to find indicators of root causes. Good initial tag categories:

  • Geography
  • Account Status
  • Connection Type
  • Platform
  • Request Pattern

Note that a lot of analytics tools will measure things like user agent, but you have to be careful to make sure that you don’t gather information that’s too specific. You need to be able to make statements like ‘all Android users are seeing errors’ and not get bogged down in specific build numbers.

3. Real-word transactions are better than any other information

A lot of the cross-reference information mentioned above isn’t meaningful if data is only gathered from one layer. A list of the slowest functions or highest-latency DB requests indicates a possible problem, but only slow or error-prone user transactions indicate problems that a user somewhere will actually care about.

Indicators like testing, method instrumentation, or server health present very tiny fragments of a larger picture. It’s critical to do your best to measure total transaction time, with the as many tags and groupings as possible.

4. Annotate your timeline

This final tip is has become a standard part of the devops playbook but it bears repeating: once you’re measuring changes in transaction health, be ready to look back about what has changed in your codebase with enough time accurately to correlate it with performance hits.

This approach can seem crude: weren’t we supposed to be targeting problems with APM-like tools that give us high detail? Sure, but fundamentally the fastest way to find newly introduced problems is to see them shortly after deployment.

Wrapping Up: Won’t This Cost a Lot?

As you expand your logging and event measurement, you should identify that the logging and metrics become less and less useful. Dashboards that were going weeks without being looked at will go months, and the initial ‘overhead’ of more event-and-transaction-focused measurement will pay off ten-fold in shorter and less frequent outages where no one knows what’s going on.

Disaster Recovery in a Serverless World - Part 2
Apurva Jantrania

Apurva Jantrania | September 17, 2018

Disaster Recovery in a Serverless World - Part 2

This is part two of a multi-part blog series. In the previous post, we covered Disaster Recovery planning when building serverless applications. In this post, we’ll discuss the systems engineering needed for an automated solution in the AWS cloud.

As I started looking into implementing Stackery’s automated backup solution, my goal was simple: In order to support a disaster recovery plan, we needed to have a system that automatically creates backups of our database to a different account and to a different region. This seemed like a straightforward task, but I was surprised to find that there was no documentation on how to do this in an automated, scalable solution - all existing documentation I could find only discussed partial solutions and were all done manually via the AWS Console. Yuck.

I hope that this post will make help fill that void and help you understand how to implement an automated solution for your own disaster recovery solution. This post does get a bit long so if that’s not your thing, see the tl;dr.

The Initial Plan

AWS RDS has automated backups which seemed like the perfect platform to base this automation upon. Furthermore, RDS even emits events that seem ideal for using to kick off a lambda function that will then copy the snapshot to the disaster recovery account.


The first issue I discovered was that AWS does not allow you to share automated snapshots - AWS requires that you first make a manual copy of the snapshot before you can share it with another account. I initially thought that this wouldn’t be a major issue - I can easily make my lambda function first kick off a manual copy. According to the RDS Events documentation, there is an event RDS-EVENT-0042 that would fire when a manual snapshot was created. I could then use that event to then share the newly created manual snapshot to the disaster recovery account.

This leads to the second issue - while RDS will emit events for snapshots that are created manually, it does not emit events for snapshots that are copied manually. The AWS docs aren’t clear about this and it’s an unfortunate feature gap. This means that I have to fall back to a timer based lambda function that will search for and share the latest available snapshot.

Final Implementation Details

While this ended up more complicated than initially envisioned, Stackery still makes it easy to add all the needed pieces for fully automated backups. My implementation ended up looking like this:

The DB Event Subscription resource is a CloudFormation Resource in which contains a small snippet of CloudFormation that subscribes the DB Events topic to the RDS database

Function 1 - dbBackupHandler

This function will receive the events from the RDS database via the DB Events topic. It then creates a copy of the snapshot with an ID that identifies the snapshot as an automated disaster recovery snapshot

const AWS = require('aws-sdk');
const rds = new AWS.RDS();

const DR_KEY = 'dr-snapshot';
const ENV = process.env.ENV;

module.exports = async message => {
  // Only run DB Backups on Production and Staging
  if (!['production', 'staging'].includes(ENV)) {
    return {};

  let records = message.Records;
  for (let i = 0; i < records.length; i++) {
    let record = records[i];

    if (record.EventSource === 'aws:sns') {
      let msg = JSON.parse(record.Sns.Message);
      if (msg['Event Source'] === 'db-snapshot' && msg['Event Message'] === 'Automated snapshot created') {
        let snapshotId = msg['Source ID'];
        let targetSnapshotId = `${snapshotId}-${DR_KEY}`.replace('rds:', '');

        let params = {
          SourceDBSnapshotIdentifier: snapshotId,
          TargetDBSnapshotIdentifier: targetSnapshotId

        try {
          await rds.copyDBSnapshot(params).promise();
        } catch (error) {
          if (error.code === 'DBSnapshotAlreadyExists') {
            console.log(`Manual copy ${targetSnapshotId} already exists`);
          } else {
            throw error;

  return {};

A couple of things to note:

  • I’m leveraging Stackery Environments in this function - I have used Stackery to define process.env.ENV based on the environment the stack is deployed to
  • Automatic RDS snapshots have an id that begins with ‘rds:’. However, snapshots created by the user cannot have a ‘:’ in the ID.
  • To make future steps easier, I append dr-snapshot to the id of the snapshot that is created

Function 2 - shareDatabaseSnapshot

This function runs every few minutes and shares any disaster recovery snapshots to the disaster recovery account

const AWS = require('aws-sdk');
const rds = new AWS.RDS();

const DR_KEY = 'dr-snapshot';
const DR_ACCOUNT_ID = process.env.DR_ACCOUNT_ID;
const ENV = process.env.ENV;

module.exports = async message => {
  // Only run on Production and Staging
  if (!['production', 'staging'].includes(ENV)) {
    return {};

  // Get latest snapshot
  let snapshot = await getLatestManualSnapshot();

  if (!snapshot) {
    return {};

  // See if snapshot is already shared with the Disaster Recovery Account
  let data = await rds.describeDBSnapshotAttributes({ DBSnapshotIdentifier: snapshot.DBSnapshotIdentifier }).promise();
  let attributes = data.DBSnapshotAttributesResult.DBSnapshotAttributes;

  let isShared = attributes.find(attribute => {
    return attribute.AttributeName === 'restore' && attribute.AttributeValues.includes(DR_ACCOUNT_ID);

  if (!isShared) {
    // Share Snapshot with Disaster Recovery Account
    let params = {
      DBSnapshotIdentifier: snapshot.DBSnapshotIdentifier,
      AttributeName: 'restore',
      ValuesToAdd: [DR_ACCOUNT_ID]
    await rds.modifyDBSnapshotAttribute(params).promise();

  return {};

async function getLatestManualSnapshot (latest = undefined, marker = undefined) {
  let result = await rds.describeDBSnapshots({ Marker: marker }).promise();

  result.DBSnapshots.forEach(snapshot => {
    if (snapshot.SnapshotType === 'manual' && snapshot.Status === 'available' && snapshot.DBSnapshotIdentifier.includes(DR_KEY)) {
      if (!latest || new Date(snapshot.SnapshotCreateTime) > new Date(latest.SnapshotCreateTime)) {
        latest = snapshot;

  if (result.Marker) {
    return getLatestManualSnapshot(latest, result.Marker);

  return latest;
  • Once again, I’m leveraging Stackery Environments to populate the ENV and DR_ACCOUNT_ID environment variables.
  • When sharing a snapshot with another AWS account, the AttributeName should be set to restore (see the AWS RDS SDK)

Function 3 - copyDatabaseSnapshot

This function will run in the Disaster Recovery account and is responsible for detecting snapshots that are shared with it and making a local copy in the correct region - in this example, it will make a copy in us-east-1.

const AWS = require('aws-sdk');
const rds = new AWS.RDS();

const sourceRDS = new AWS.RDS({ region: 'us-west-2' });
const targetRDS = new AWS.RDS({ region: 'us-east-1' });

const DR_KEY = 'dr-snapshot';
const ENV = process.env.ENV;

module.exports = async message => {
  // Only Production_DR and Staging_DR are Disaster Recovery Targets
  if (!['production_dr', 'staging_dr'].includes(ENV)) {
    return {};

  let [shared, local] = await Promise.all([getSourceSnapshots(), getTargetSnapshots()]);

  for (let i = 0; i < shared.length; i++) {
    let snapshot = shared[i];
    let fullSnapshotId = snapshot.DBSnapshotIdentifier;
    let snapshotId = getCleanSnapshotId(fullSnapshotId);
    if (!snapshotExists(local, snapshotId)) {
      let targetId = snapshotId;

      let params = {
        SourceDBSnapshotIdentifier: fullSnapshotId,
        TargetDBSnapshotIdentifier: targetId
      await rds.copyDBSnapshot(params).promise();

  return {};

// Get snapshots that are shared to this account
async function getSourceSnapshots () {
  return getSnapshots(sourceRDS, 'shared');

// Get snapshots that have already been created in this account
async function getTargetSnapshots () {
  return getSnapshots(targetRDS, 'manual');

async function getSnapshots (rds, typeFilter, snapshots = [], marker = undefined) {
  let params = {
    IncludeShared: true,
    Marker: marker

  let result = await rds.describeDBSnapshots(params).promise();

  result.DBSnapshots.forEach(snapshot => {
    if (snapshot.SnapshotType === typeFilter && snapshot.DBSnapshotIdentifier.includes(DR_KEY)) {

  if (result.Marker) {
    return getSnapshots(rds, typeFilter, snapshots, result.Marker);

  return snapshots;

// Check to see if the snapshot `snapshotId` is in the list of `snapshots`
function snapshotExists (snapshots, snapshotId) {
  for (let i = 0; i < snapshots.length; i++) {
    let snapshot = snapshots[i];
    if (getCleanSnapshotId(snapshot.DBSnapshotIdentifier) === snapshotId) {
      return true;
  return false;

// Cleanup the IDs from automatic backups that are prepended with `rds:`
function getCleanSnapshotId (snapshotId) {
  let result = snapshotId.match(/:([a-zA-Z0-9-]+)$/);

  if (!result) {
    return snapshotId;
  } else {
    return result[1];
  • Once again, leveraging Stackery Environments to populate ENV, I ensure this function only runs in the Disaster Recovery accounts

TL;DR - How Automated Backups Should Be Done

  1. Have a function that will manually create an RDS snapshot using a timer and lambda. Use a timer that makes sense for your use case
    • Don’t bother trying to leverage the daily automated snapshot provided by AWS RDS.
  2. Have a second function, that monitors for the successful creation of the snapshot from the first function and shares it to your disaster recovery account.

  3. Have a third function that will operate in your disaster recovery account that will monitor for snapshots shared to the account, and then create a copy of the snapshot that will be owned by the disaster recovery account, and in the correct region.
Why You Should Stop YAML Engineering
Chase Douglas

Chase Douglas | August 30, 2018

Why You Should Stop YAML Engineering

Here’s some sample JavaScript code:

let foo = 5;

Do you know register held the value 5 when executing that code? Do you even know what a register is or what are plausible answers to the question?

The correct answer to this question is: No, and I don’t care. Registers are locations in a CPU that can hold numerical values. Your computing device of choice calculates values using registers millions of times each second.

Way back when computers were young people programmed them using assembly languages. The languages directly tell CPUs what to do: load values into registers from memory, calculate other values in the registers, save values back to memory, etc.

Between 1943 and 1945, Konrad Zuse developed the first (or at least one of the first) high-level programming languages: Plankalkül. It wasn’t pretty by modern standards, but its thesis could be boiled down to: you will program more efficiently and effectively using logical formulas. A more blunt corollary might be: Stop programming CPU registers!

A Brief History Of Computing Abstractions

Software engineering is built upon the practice of developing higher-order abstractions for common tasks. Allow me to evince this pattern:

1837 — Charles Babbage: Stop computing by pencil and paper! (Analytical Engine)

1938 — Conrad Zuse: Stop computing by hand! (Z1, first electrical computer)

1950 — UNIVAC: Stop repeating yourself! (UNIVAC 1101, first computer with programs stored in memory)

1985 — Bjarne Stroustrup: Stop treating everything like primitives! (C++ language, introduced object-oriented-programming to the C language)

1995 — James Gosling: Stop tracking memory! (Java language makes garbage collection mainstream)

2006 — Amazon: Stop managing data centers! (AWS EC2 makes virtual servers easy)

2009 — Ryan Dahl: Stop futzing with threads! (Node.js introduces event/callback-based semantics)

2011 — Amazon: Stop provisioning resources manually! (AWS CloudFormation makes Infrastructure-As-Code easier)

2014 — Amazon: Stop managing servers! (AWS Lambda makes services “functional”)

Software engineers are always finding ways to make whatever “annoying thing” they have to deal with go away. They do this by building abstractions. It’s now time to build another abstraction.

2018: Stop YAML Engineering!

The combination of infrastructure-as-code and serverless apps means developers are inundated with large, complex infrastructure templates, mostly written in YAML.

It’s time we learn from the past. We are humans who work at a logical, higher-order abstraction level. Infrastructure-as-code is meant to be consumed by computers. We need to abstract away the compiling of our logic into infrastructure code.

That’s what Stackery does. It turns this…

…into Serverless Application Model (SAM) YAML without asking the user to do anything more than drag and wire resources in a canvas. Stackery’s own backend is over 2,000 lines of YAML. We wouldn’t be able to manage it all without a tool that helps us both maintain and diagram the relationships between resources. It even performs this feat in both directions: you can take your existing SAM applications, import them into Stackery, and instantly visualize and extend the infrastructure architecture.

You may be a principal architect leading adoption of serverless in your organization. You may be perfectly capable of holding thousands of lines of infrastructure configuration in your head. But can everyone on your team do the same? Would your organization benefit from the automatic visualization and relationship/dependency management of a higher-order infrastructure-as-code abstraction?

As can be seen throughout the history of software engineering, the industry will move (and already is moving) to abstract away this lower level of engineering. Check out Stackery if you want to stay ahead of the curve.

Disaster Recovery in a Serverless World - Part 1
Nuatu Tseggai

Nuatu Tseggai | July 12, 2018

Disaster Recovery in a Serverless World - Part 1

This is part one of a multi-part blog series. In this post we’ll discuss Disaster Recovery planning when building serverless applications. In future posts we’ll highlight Disaster Recovery exercises and the engineering preparation necessary for success.

‘Eat Your Own Dog Food’

Nearly the entire mix of Stackery backend microservices run on AWS Lambda compute. That’s not shocking - after all - the entire purpose of our business is to build a cohesive set of tools that enable teams to build production-ready serverless applications. It’s only fitting that we eat our own dogfood and use serverless technologies wherever possible.

Which leads to the central question this blog post is highlighting: How should a team reason about Disaster Recovery when they build software atop serverless technologies?

(Spoiler Alert) Serverless doesn’t equate to a free lunch! The important bits of DR revolve around establishing a cohesive plan and exercising it regularly - all of which remain important when utilizing serverless infrastructure. But there’s good news! Serverless architectures free engineers from the minutia of administering a platform leaving them more time to focus their sights on higher level concepts such as Disaster Recovery, Security, and Technical Debt.

Before we get too far - let’s define Disaster Recovery (DR). In simple terms, it’s a documented plan that aims to minimize downtime and data loss in the event of a disaster. The term is most often used in the context of yearly audit-related exercises wherein organizations demonstrate compliance in order to meet regulatory requirements. It’s also very familiar to those who are charged with developing IT capabilities for mission-critical functions of the government.

Many of us at Stackery used to work at New Relic during a particularly explosive growth stage of the business. We were exposed to DR exercises that took months of work (from dozens of managers/engineers) to reach the objectives set by the business. That experience influenced us as we embarked on developing a DR plan for Stackery, but we still needed to work through a multitude of questions specific to our architecture.

What would happen to our product(s) if any of the following services running in AWS region XYZ experienced an outage? (S3, RDS, Dynamo, Cognito, Lambda, Fargate, etc.)

  • How long before we fully recover?
  • How much data loss would we incur?
  • What process would we follow to recover?
  • How would we communicate status and next steps internally?
  • How would we communicate status and next steps to customers?

These questions quickly reminded us that DR planning requires direction from the business. In our case, we looked to our CEO, CTO, and VP of Engineering to set two goals:

  1. Recovery Time Objective (RTO): the length of time it would take us to swap to a second, hot production service in a separate AWS region.
  2. Recovery Point Objective (RPO): the acceptable amount of data loss measured in time.

In order to determine these goals our executives had to consider the financial impact to the business during downtime (determined by considering loss of business and damage to our reputation). Not surprisingly, the dimensions of this business decision will be unique to every business. It’s important that your executive team takes the time to understand why it’s important for them to be in charge of defining the RTO and RPO and that they are engaged in the ongoing development and execution of the DR plan. It’s a living plan and as such will require improvements as the company evolves.

Based on our experience, we developed the below outline that you may find helpful as your team develops a DR plan.

Disaster Recovery Plan

  1. Goals
  2. Process
    • Initiating DR
    • Assigning Roles
    • Incident Commander
    • Technical Lead
  3. Communication
    • Engineering Coordination
    • Leadership Updates
  4. Recovery Steps
  5. Continuous Improvement
    • TODO
    • Lessons Learned
    • Frequency

This section describes our RTO and RPO (see above).


This section describes the process to follow in the event that it becomes necessary to initiate Disaster Recovery. This is the same process followed during Disaster Recovery Exercises.

Initiating DR:

The Disaster Recovery procedure may be initiated in the event of a major prolonged outage upon the CEO’s request. If the CEO is unavailable and cannot be reached DR can be initiated by another member of the executive team.

Assigning Roles:

Roles will be assigned by the executive initiating the DR process.

Incident Commander (IC):

The Incident Commander is responsible for coordinating the operational response and communicating status to stakeholders. The IC is responsible for designating a Technical Lead and engaging additional employees necessary for the response. During the DR process the IC will send hourly email updates to the executive team. These updates will include: current status of DR process, timeline of events since DR was initiated, requests for help or additional resources.

Technical Lead (TL):

The Technical Lead has primary responsibility for driving the DR process towards a successful technical resolution. The IC will solicit status information and requests for additional assistance from the TL.


Communication is critical to an effective and well coordinated response. The following communication channels should be used:

Engineering Coordination:

The IC, TL and engineers directly involved with the response will communicate in the #disaster-recovery-XYZ slack channel. In the event that slack is unavailable the IC will initiate a Google Hangout and communicate instructions for connecting via email and cell phone.

Leadership Updates:

The IC will provide hourly updates to the executive team via email. See details in separate Incident Commander doc.

Recovery Steps:

High level steps to be performed during DR.

  1. Update Status Page
  2. Restore Datastore(s) in prodY from latest prodX
    • DB
    • Authentication
    • Authorization
    • Cache
    • Blob Storage
  3. Restore backend microservices
    • Bootstrap services with particular focus on upstream and downstream dependencies
  4. Swap CloudFront distribution(s)
  5. Swap API endpoint(s) via DNS
    • Update DNS records to point to prodY API endpoints
  6. Verify recovery is complete
    • Redeploy stack from user account to verify service level
  7. Update Status Page
Continuous Improvement:

This section captures TODO action items and next steps, lessons learned, and the frequency in which we’ll revisit the plan and accomplish the TODO action items.

In the next post, we’ll dig into the work it takes to prepare for and perform DR exercises. To learn how Stackery can make building microservices on Lambda manageable and efficient, contact our sales team or get a free trial today.

Self Healing Serverless Applications - Part 3
Nate Taggart

Nate Taggart | July 04, 2018

Self Healing Serverless Applications - Part 3

This blog post is based on a presentation I gave at Glue Conference 2018. The original slides: Self-Healing Serverless Applications – GlueCon 2018. View the rest here. Parts: 1, 2, 3

This is part three of a three-part blog series. In the first post we covered some of the common failure scenarios you may face when building serverless applications. In the second post we introduced the principles behind self-healing application design. In this next post, we’ll apply these principles with solutions to real-world scenarios.

Self-Healing Serverless Applications

We’ve been covering the fundamental principles of self-healing applications and the underlying errors that serverless applications commonly face. Now it’s time to apply these principles to the real-world scenarios that the errors will raise. Let’s step through five common problems together.

We’ll solve:

  • Uncaught Exceptions
  • Upstream Bottlenecks
  • Timeouts
  • Stuck steam processing
  • and, Downstream Bottlenecks

Uncaught Exceptions

Uncaught exceptions are unhandled errors in your application code. While this problem isn’t unique to Lambda, diagnosing it in a serverless application can be a little trickier because your compute instance is ultimately ephemeral and will shut down upon an uncaught exception. What we want to do is detect that an exception is about to occur, and either remediate or collect diagnostic information at runtime before the Lambda instance is gone. After we’ve handled the error, we can simply re-throw it to avoid corrupting the behavior of the application. To do that, we’ll use three of the principles we previously introduced: introducing universal instrumentation, collecting event-centric diagnostics, and giving everyone visibility.

At an abstract level, it’s relatively easy to instrument a function to catch errors (we could simply wrap our code in a try/except loop). While this solution works just fine for a single function, though, it doesn’t easily scale across an entire application or organization. Do you really want to be responsible for ensuring that every single function is individually instrumented?

The better approach is to use “universal instrumentation.” We’ll create a generic handler which will invoke our real target function and always use it as the top level handler for every piece of code. Here’s an example:

def genericHandler(message, context):
		return targetHandler(message, context)
	except Exception as error:
		# Collect event diagnostics
		# Possibly re-route the event or otherwise remediate the transaction
		raise error

As you can see, we have in fact just run our function through a try/except clause, with the benefit of now being able to invoke any function with one standard piece of instrumentation code. This means that every function will now behave consistently (consistent logs, metrics, etc) across our entire stack.

This instrumentation also allows us to collect event-centric diagnostics. Keep in mind that by default, a Lambda exception will give you a log with a stack trace, but no information on the event which led to this exception. It’s much easier to debug and improve application health with relevant event data. And now that we have centralized logs, events, and metrics, it’s much easier for everyone on the team to have visibility into the health of the entire application.

Note: you’ll want to be careful that you’re not logging any sensitive data when you capture events.

Upstream Bottleneck

An upstream bottleneck occurs when a service calling into Lambda hits a scaling limit, even though Lambda isn’t being throttled. A classic example of this is API Gateway reaching throughput limits and failing to invoke downstream Lambdas on every request.

The key principles to focus on here is: identifying service limits, using self-throttling, and notifying a human.

It’s pretty straightforward to identify service limits, and if you haven’t done this you really should. Know what your throughput limits are for each of the AWS services you’re using and set alarms on throughput metrics before you hit capacity (notify a human!).

The more sophisticated, self-healing approach comes into play when you choose to throttle yourself before you get throttled by AWS. In the case of an API Gateway limit, you (or someone in your organization) may already control the requests coming to this Gateway. If, say for example, you have a front-end application backed by API Gateway and Lambda, you could introduce an exponential backoff logic to kick in whenever you have backend errors. Pay particular attention to HTTP 429 To Many Requests responses, which is (generally) what API Gateway will return when it’s being throttled. I say “generally” because in practice this is actually inconsistent and will sometimes return 5XX error codes as well. In any event, if you are able to control the volume of requests (which may be from another service tier), you can help your application to self-heal and fail gracefully.


Sometimes Lambdas time out, which can be a particularly painful and expensive kind of error, since Lambdas will automatically retry multiple times in many cases, driving up the active compute time. When a timeout occurs, the Lambda will ultimately fail without capturing much in terms of diagnostics. No event data, no stack trace – just a timeout error log. Fortunately, we can handle these errors pretty similarly to uncaught exceptions. We’ll use the principles of self-throttle, universal instrumentation, and considering alternative resource types.

The instrumentation for this is a little more complex, but stick with me:

def genericHandler(message, context):
	# Detect when the Lambda will time out and set a timer for 1 second sooner
	timeout_duration = context.get_remaining_time_in_millis() - 1000

	# Invoke the original handler in a separate thread and set our new stricter timeout limit
	handler_thread = originalHandlerThread(message, context)
	handler_thread.join(timeout_duration / 1000)

	# If timeout occurs
	if handler_thread.is_alive():
		error = TimeoutError('Function timed out')

		# Collect event diagnostics here

		raise error
	return handler_thread.result

This universal instrumentation is essentially self-throttling by forcing us to conform to a slightly stricter timeout limit. In this way, we’re able to detect an imminent timeout while the Lambda is still alive and can extract meaningful diagnostic data to retroactively diagnose the issue. This instrumentation can, of course, be mixed with our error handling logic.

If this seems a bit complex, you might like using Stackery: we automatically provide instrumentation for all of our customers without requiring *any* code modification. All of these best practices are just built in.

Finally, sometimes we should be considering other resource types. Fargate is another on-demand compute instance which can run longer and with higher resource limits that Lambda. It can still be triggered by events and is a better fit for certain workloads.

Stream Processing Gets “Stuck”

When Lambda is reading off of a kinesis stream, failing invocations can cause the stream to get stuck (more accurately: just that shard). This is because Lambda will continue to retry the failing message until it’s successful and will not get to the workload behind the stuck message until it’s handled. This introduces an opportunity for some of the other self-healing principles: reroute and unblock, automate known solutions, consider alternative resource types.

Ultimately, you’re going to need to remove the stuck message. Initially, you might be doing this manually. That will work if this is a one-off issue, but issues rarely are. The ideal solution here is to automate the process of rerouting failures and unblocking the rest of the workload.

The approach that we use is to build a simple state machine. The logic is very straightforward: is this the first time we’ve seen this message? If so, log it. If not, this is a recurring failure and we need to move it out of the way. You might simply “pass” on the message, if your workload is fairly fault tolerant. If it’s critical, though, you could move it to a dedicated “failed messages” stream for someone to investigate or possibly to route through a separate service.

This is where alternative resources come into play again. Maybe the Lambda is failing because it’s timing out (good thing you introduced universal instrumentation!). Maybe sending your “failed messages” stream to a Fargate instance solves your problem. You might also want to investigate the similar but actually different ways that Kinesis, SQS, and SNS work and make sure you’re choosing the right tool for the job.

Downstream Bottleneck

We talked about upstream bottlenecks where Lambda is failing to be invoked, but you can also hit a case where Lambda is scaling up faster that its dependencies and causing downstream bottlenecks. A classic example of this is Lambda depleting the connection pool for an RDS instance.

You might be surprised to learn that Lambda holds onto its connection, even while cached, unless you explicitly close the connection in your code. So do that. Easy enough. But you’re also going to want to pay attention to some of our self-healing key principles again: identify service limits, automate known solutions, and give everyone visibility.

In theory, Lambda is (nearly, kind of, sort of) infinitely scalable. But the rest of your application (and the other resource tiers) aren’t. Know your service limits: how many connections can your database handle? Do you need to scale your database?

What makes this error class actually tricky, though, is that multiple services may have shared dependencies. You’re looking at a performance bottleneck thinking to yourself, “but I’m not putting that much load on the database…” This is an example of why it’s so important to have shared visibility. If your dependencies are shared, you need to understand not just your own load, but that of all of the other services hammering this resource. You’ll really want a monitoring solution that includes service maps and makes it clear how the various parts of your stack are related. That’s why, even though most of our customers work day-to-day from the Stackery CLI, the UI is still a meaningful part of the product.

The Case for Self-Healing

Before we conclude, I’d like to circle back and talk again about the importance of self-healing applications. Serverless is a powerful technology that outsources a lot of the undifferentiated heavy lifting of infrastructure management, but it requires a thoughtful approach to software development. As we add tools to accelerate the software lifecycle, we need to keep some focus on application health and resiliency. The “Self-Healing” philosophy is the approach that we’ve found which allows us to capture the velocity gains of serverless and unlock the upside of scalability, without sacrificing reliability or SLAs. If you’re taking serverless seriously, you should incorporate these techniques and champion them across your organization so that serverless becomes a mainstay technology in your stack.

Self Healing Serverless Applications - Part 2
Nate Taggart

Nate Taggart | June 18, 2018

Self Healing Serverless Applications - Part 2

This blog post is based on a presentation I gave at Glue Conference 2018. The original slides: Self-Healing Serverless Applications – GlueCon 2018. View the rest here. Parts: 1, 2, 3

This is part two of a three-part blog series. In the last post we covered some of the common failure scenarios you may face when building serverless applications. In this post we’ll introduce the principles behind self-healing application design. In the next post, we’ll apply these principles with solutions to real-world scenarios.

Learning to Fail

Before we dive into the solution-space, it’s worth introducing the defining principles for creating self-healing systems: plan for failure, standardize, and fail gracefully.

Plan for Failure

As an industry, we have a tendency to put a lot of time planning for success and relatively little time planning for failure. Think about it. How long ago did you first hear about Load Testing? Load testing is a great way to prepare for massive success! Ok, now when did you first hear about Chaos Engineering? Chaos Engineering is a great example of planning for failure. Or, at least, a great example of intentionally failing and learning from it. In any event, if you want to build resilient systems, you have to start planning to fail.

Planning for failure is not just a lofty ideal, there are easy, tangible steps that you can begin to take today:

  • Identify Service Limits: Remember the Lambda concurrency limits we covered in Part 1? You should know what yours are. You’re also going to want to know the limits on the other service tiers, like any databases, API Gateway, and event streams you have in your architecture. Where are you most likely to bottleneck first?
  • Use Self-Throttling: By the time you’re being throttled by Lambda, you’ve already ceded control of your application to AWS. You should handle capacity limits in your own application, while you still have options. I’ll show you how we do it, when we get to the solutions.
  • Consider Alternative Resource Types: I’m about as big a serverless fan as there is, but let’s be serious: it’s not the only tool in the tool chest. If you’re struggling with certain workloads, take a look at alternative AWS services. Fargate can be good for long-running tasks, and I’m always surprised how poorly understood the differences are between Kinesis vs. SQS vs. SNS. Choose wisely.


One of the key advantages of serverless is the dramatic velocity it can enable for engineering orgs. In principle, this velocity gain comes from outsourcing the “undifferentiated heavy lifting” of infrastructure to the cloud vendor, and while that’s certainly true, a lot of the velocity in practice comes from individual teams self-serving their infrastructure needs and abandoning the central planning and controls that an expert operations team provides.

The good news is, it doesn’t need to be either-or. You can get the engineering velocity of serverless while retaining consistency across your application. To do this, you’ll need a centralized mechanism for building and delivering each release, managing multiple AWS accounts, multiple deployment targets, and all of the secrets management to do this securely. This means that your open source framework, which was an easy way to get started, just isn’t going to cut it anymore.

Whether you choose to plug in software like Stackery or roll your own internal tooling, you’re going to need to build a level of standardization across your serverless architecture. You’ll need standardized instrumentation so that you know that every function released by every engineer on every team has the same level of visibility into errors, metrics, and event diagnostics. You’re also going to want a centralized dashboard that can surface performance bottlenecks to the entire organization – which is more important than ever before since many serverless functions will distribute failures to other areas of the application. Once these basics are covered, you’ll probably want to review your IAM provisioning policies and make sure you have consistent tagging enforcement for inventory management and cost tracking.

Now, admittedly, this standardization need isn’t the fun part of serverless development. That’s why many enterprises are choosing to use a solution like Stackery to manage their serverless program. But even if standardization isn’t exciting, it’s critically important. If you want to build serverless into your company’s standard tool chest, you’re going to need for it to be successful. To that end, you’ll want to know that there’s a single, consistent way to release or roll back your serverless applications. You’ll want to ensure that you always have meaningful log and diagnostic data and that everyone is sending it to the same place so that in a crisis you’ll know exactly what to do. Standardization will make your serverless projects reliable and resilient.

Fail Gracefully

We plan for failure and standardize serverless implementations so that when failure does happen we can handle it gracefully. This is where “self-healing” gets exciting. Our standardized instrumentation will help us identify bottlenecks automatically and we can automate our response in many instances.

One of the core ideas in failing gracefully is that small failures are preferable to large ones. With this in mind, we can oftentimes control our failures to minimize the impact and we do this by rerouting and unblocking from failures. Say, for example, you have a Lambda reading off of a Kinesis stream and failing on a message. That failure is now holding up the entire Kinesis shard, so instead of having one failed transaction you’re now failing to process a significant workload. If we, instead, allow that one blocking transaction to fail by removing it from the stream and (instead of processing it normally) simply log it out as a failed transaction and get it out of the way. Small failure, but better than a complete system failure in most cases.

Finally, while automating solutions is always the goal with self-healing serverless development, we can’t ignore the human element. Whenever you’re taking an automated action (like moving that failing Kinesis message), you should be notifying a human. Intelligent notification and visibility is great, but it’s even better when that notification comes with all of the diagnostic details including the actual event that failed. This allows your team to quickly reproduce and debug the issue and turn around a fix as quickly as possible.

See it in action

In our final post we’ll talk through five real-world use cases and show how these principles apply to common failure patterns in serverless architectures.

We’ll solve: - Uncaught Exceptions - Upstream Bottlenecks - Timeouts - Stuck steam processing - and, Downstream Bottlenecks

If you want to start playing with self-healing serverless concepts right away, go start a free 60 day evaluation of Stackery. You’ll get all of these best practices built-in.

Self Healing Serverless Applications - Part 1
Nate Taggart

Nate Taggart | June 07, 2018

Self Healing Serverless Applications - Part 1

This blog post is based on a presentation I gave at Glue Conference 2018. The original slides: Self-Healing Serverless Applications – GlueCon 2018. View the rest here. Parts: 1, 2, 3

This is part one of a multi-part blog series. In this post we’ll discuss some of the common failure scenarios you may face when building serverless applications. In future posts we’ll highlight solutions based on real-world scenarios.

What to expect when you’re not expecting

If you’ve been swept up in the serverless hype, you’re not alone. Frankly, serverless applications have a lot of advantages, and my bet is that (despite the stupid name) “serverless” is the next major wave of cloud infrastructure and a technology you should be betting heavily on.

That said, like all new technologies, serverless doesn’t always live up to the hype, and separating fact from fiction is important. At first blush, serverless seems to promise infinite and immediate scaling, high availability, and little-to-no configuration. The truth is that while serverless does offload a lot of the “undifferentiated heavy lifting” of infrastructure management, there are still challenges with managing application health and architecting for scale. So, let’s take a look at some of the common failure modes for AWS Lambda-based applications.

Runtime failures

The first major category of failures are what I classify as “runtime” or application errors. I assume it’s no surprise to you that if you introduce bugs in your application, Lambda doesn’t solve that problem. That said, you may be very surprised to learn that when you throw an uncaught exception, Lambda will behave differently depending on how you architect your application.

Let’s briefly touch on three common runtime failures:

  • Uncaught Exceptions: any unhandled exception (error) in your application.
  • Timeouts: your code doesn’t complete within the maximum execution time.
  • Bad State: a malformed message or improperly provided state causes unexpected behavior.
Uncaught Exceptions

When Lambda is running synchronously (like in a request-response loop, for example), the Lambda will return an error to the caller and will log an error message and stack trace to CloudWatch, which is probably what you would expect. It’s different though when a Lambda is called asynchronously, as might be the case with a background task. In that event, when throwing an error, Lambda will retry up to three times. In fact, it will even try indefinitely when reading off of a stream, like with Kinesis. In any case, when a Lambda fails in an asychronous architecture, the caller is unaware of the error – although there is still a log record sent to CloudWatch with the error message and stack trace.


Sometimes Lambda will fail to complete within the configured maximum execution time, which by default is 3 seconds. In this case, it will behave like an uncaught exception, with the caveat that you won’t get a stack trace out in the logs and the error message will be for the timeout and not for the potentially underlying application issue, if there is one. Using only the default behavior, it can be tricky to diagnose why Lambdas are timing out unexpectedly.

Bad State

Since serverless function invocations are stateless, state must be supplied on or after invocation. This means that you may be passing state data through input messages or by connecting to a database to retrieve state when the function starts. In either scenario, it’s possible to invoke a function but not properly supply the state which the function needs to properly execute. The trick here is that these “bad state” situations can either fairly noisily (as an uncaught exception) or fail silently without raising any alarms. When these errors occur silently it can be nearly impossible to diagnose, or sometimes to even notice that you have a problem. The major risk is that the events which trigger these functions have expired and since state may not be being stored correctly, you may have permanent data loss.

Scaling failures

The other class of failures worth discussing are scaling failures. If we think of runtime errors as application-layer problems, we can think of scaling failures as infrastructure-layer problems.

The three common scaling failures are:

  • Concurrency Limits: when Lambda can’t scale high enough.
  • Spawn Limits: when Lambda can’t scale fast enough.
  • Bottlenecking: when your architecture isn’t scaling as much as Lambda.
Concurrency Limits

You can understand why Amazon doesn’t exactly advertise their concurrency limits, but there are, in fact, some real advantages to having limits in place. First, concurrency limits are account limits that determine how many simultaneously running instances you can have of your Lambda functions. These limits can really save you in scenarios where you accidentally trigger an unexpected and massive workload. Sure, you may quickly run up a tidy little bill, but at least you’ll hit a cap and have time to react before things get truly out of hand. Of course, the flipside to this is that your application won’t really scale “infinitely,” and if you’re not careful you could hit your limits and throttle your real traffic. No bueno.

For synchronous architectures, these Lambdas will simply fail to be invoked without any retry. If you’re invoking your Lambdas asynchronously, like reading off of a stream, the Lambdas will fail to invoke initially, but will resume once your concurrency drops below the limit. You may experience some performance bottlenecks in this case, but eventually the workload should catch up. It’s worth noting, by contacting AWS you can usually get them to raise your limits if needed.

Spawn Limits

While most developers with production Lambda workloads have probably heard of concurrency limits, in my experience very few know about spawn limits. Spawn limits are account limits on what rate new Lambdas can be invoked. This can be tricky to identify because if you glance at your throughput metrics you may not even be close to your concurrency limit, but could still be throttling traffic.

The default behavior for spawn limits matches concurrency limits, but again, it may be especially challenging to identify and diagnose spawn limit throttling. Spawn limits are also very poorly documented and, to my knowledge, it’s not possible to have these limits raised in any way.


The final scaling challenge involves managing your overall architecture. Even when Lambda scales perfectly (which, in fairness is most of the time!), you must design your other service tiers to scale as well or you may experience upstream or downstream bottlenecks. In an upstream bottleneck, like when you hit throughput limits in API Gateway, your Lambdas may fail to invoke. In this case, you won’t get any Lambda logs (they really didn’t invoke), and so you’ll have to be paying attention to other metrics to detect this. It’s also possible to create downstream bottlenecks. One way this can happen is when your Lambdas scale up, but deplete your connection pool for a downstream database. These kind of problems can behave like an uncaught exception, lead to timeouts, or distribute failures to other functions and services.

Introducing Self-Healing Serverless Applications

The solution to all of this is to build resiliency with “Self-Healing” Serverless Applications. This is an approach for architecting applications for high-resiliency and for automated error resolution.

In our next post, we’ll dig into the three design principles for self-healing systems:

  • Plan for Failure
  • Standardize
  • Fail Gracefully

We’ll also learn to apply these principles to real-world scenarios that you’re likely to encounter as you embrace serverless architecture patterns. Be sure to watch for the next post!

Get the Serverless Development Toolkit for Teams

Sign up now for a 60-day free trial. Contact one of our product experts to get started building amazing serverless applications today.

To Top