Stacks on Stacks

The Serverless Ecosystem Blog by Stackery.

Custom CloudFormation Resources: Real Ultimate Power

Chase Douglas | May 24, 2018

Lately, I’ve found CloudFormation custom resources to be supremely helpful for many use cases. I actually wanted to write a post mimicking Real Ultimate Power:

Hi, this post is all about CloudFormation custom resources, REAL CUSTOM RESOURCES. This post is awesome. My name is Chase and I can’t stop thinking about custom resources. These things are cool; and by cool, I mean totally sweet.

Trust me, it would have been hilarious, but rather than spend a whole post on a meme that’s past its prime let’s take a look at the real reasons why custom resources are so powerful!

What Are Custom Resources?

Custom resources are virtual CloudFormation resources that can invoke AWS Lambda functions. Inside the Lambda function you have access to the properties of the custom resource (which can include information about other resources in the same CloudFormation stack by way of Ref and Fn::GetAtt functions). The function can then do anything in the world as long as it (or another resource it invokes) reports success or failure back to CloudFormation within one hour. In the response to CloudFormation, the custom resource can provide data that can be referenced from other resources within the same stack.
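
To make that concrete, here is a minimal sketch (illustrative only, not a Stackery implementation) of a Node.js function backing a custom resource. CloudFormation invokes it with a RequestType of Create, Update, or Delete, the resource’s declared properties, and a pre-signed ResponseURL that the function must PUT its result to:

const https = require('https');
const url = require('url');

exports.handler = (event, context) => {
  // event.RequestType is 'Create', 'Update', or 'Delete'
  // event.ResourceProperties holds the properties declared on the custom resource
  console.log(JSON.stringify(event));

  // ... do whatever provisioning / verification work the resource represents ...

  // Report success or failure back to CloudFormation via the pre-signed ResponseURL
  const body = JSON.stringify({
    Status: 'SUCCESS', // or 'FAILED'
    Reason: 'See CloudWatch logs for details',
    PhysicalResourceId: context.logStreamName,
    StackId: event.StackId,
    RequestId: event.RequestId,
    LogicalResourceId: event.LogicalResourceId,
    Data: { Example: 'values here are readable via Fn::GetAtt' }
  });

  const { hostname, path } = url.parse(event.ResponseURL);
  const request = https.request(
    { hostname, path, method: 'PUT', headers: { 'content-type': '', 'content-length': body.length } },
    () => context.done()
  );
  request.on('error', err => context.done(err));
  request.end(body);
};

AWS’s cfn-response helper does essentially this plumbing for you; the sketch just shows what’s happening under the hood.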

What Can I Do With Custom Resources?

Custom resources are such a fundamental building block that it isn’t obvious at first glance how many use cases they enable. Because a custom resource can be invoked once or on every deployment, it’s a powerful mechanism for lifecycle management of many resources.

For example, you could use custom resources to enable post-provisioning smoke/verification testing:

  1. A custom resource is “updated” as the last resource of a deployment (this is achieved by adding every other resource in the stack to its DependsOn property; see the template sketch after this list)
  2. The Lambda function backing the custom resource triggers smoke tests to run, then returns success or failure to CloudFormation
  3. If a failure occurs, CloudFormation automatically rolls back the deployment
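
A rough sketch of what that could look like in a template (resource names here are hypothetical). Giving the custom resource a property that changes on every deployment, such as a timestamp parameter, is one way to make sure it is “updated” each time:

{
  "Resources": {
    "SmokeTest": {
      "Type": "Custom::SmokeTest",
      "DependsOn": ["RestApi", "BackendFunction", "RecordsTable"],
      "Properties": {
        "ServiceToken": { "Fn::GetAtt": ["SmokeTestFunction", "Arn"] },
        "DeploymentTimestamp": { "Ref": "DeploymentTimestamp" }
      }
    }
  }
}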

Honestly, while I have begun using custom resources for many use cases, I discover new use cases all the time. I feel like I have hardly scratched the surface of what’s possible through custom resources.

And that’s what I call REAL Ultimate Power!!!!!!!!!!!!!!!!!!

Alexa, tell Stackery to deploy

Apurva Jantrania | May 01, 2018

We have a couple of Amazon Dots around the office, and one day Nate wondered if we could use Alexa to deploy a stack. That sounded like a fun side project, although I’d never created an Alexa skill before. So this week, I’m going to write a bit about the proof-of-concept I made and some of the things I learned along the way.

To learn about Alexa skills, I used two guides:

  1. Steps to Build a Custom Skill to guide me through building the custom Alexa Skill
  2. Developing Your First Skill to understand how custom Skill Handlers are written in NodeJS

Creating the Alexa Skill

Designing and building the Alexa Skill following the first guide was surprisingly straightforward. I decided I wanted to build my skill to enable deploying a stack into a specific environment. For the purposes of this POC, I decided that also asking which branch to use for the deployment made the utterance/dialog too long. My ideal phrasing was to be able to say “Alexa, tell stackery to deploy $STACK_NAME into $ENVIRONMENT_NAME”.

The first issue I came across was the skill invocation name. I wanted to just use stackery, but there is a very large dialog box that lists requirements, and at the top of that list is that the invocation name should be two or more words. That seemed incredibly unwieldy and I wasn’t sure what I’d go with. This requirement also seemed to go against some of the examples I’d seen in Amazon’s own guides. I decided that I really did want stackery as my invocation name, and I got lucky when I tried it - it turns out that Amazon’s definition of “requirement” here is synonymous with “guideline”.

I then created a new intent that I called DeployIntent and populated the sample utterances with a couple of phrases:

deploy
deploy {stackName}
deploy {stackName} to {env}

Here, {stackName} and {env} are slots. Diving into their Edit Dialog settings let me tell Alexa that both slots are required and how to prompt for them if the user doesn’t provide them.

I’ve got to say, the Alexa Skills UI/UX really made this easy for me as a first-time developer. It felt slick.

With this, I was pretty much done creating the skill, and now I needed to create the handler that would actually do the deployment.


Creating the Alexa Skill Handler

I created a new stack in Stackery called alexaDeployments. Since an Alexa skill can directly invoke an AWS Lambda function, I deleted all of the existing resources and started with a fresh function which I called alexaHandler, and I updated the timeout to be 300 seconds. Note that deployments can easily take more than 5 minutes; to really be robust, the stack deployment should be handled by a Docker Task resource instead, but since this was just a POC, I was willing to accept this limitation to speed things up.

I then saved the stack in the UI and cloned the repo locally to start developing the handler. Following the second guide quickly gave me the skeleton of my alexaHandler Lambda function. It’s a lot of relatively repetitive code that’s well outlined in the guide, so I’m not going to add it here. What I needed to do now was code my DeployIntentHandler and add the Stackery CLI to the function.

When Stackery packages a function for Lambda, it includes everything in the function directory, so taking advantage of that, I downloaded the Linux variant of the Stackery CLI into the /Stackery/functions/alexaHandler folder in my repo. The Stackery CLI needs a few things to be able to deploy:

  • A .stackery.toml file that is created by running through the stackery login command
  • AWS credentials provided either via the command line (--access-key-id and --secret-access-key) or via a profile in the ~/.aws/credentials file

To make things easier, I took my .stackery.toml file and added it to the function folder so I could skip the stackery login step on each invocation. As for my AWS credentials, I get them from environment variables set via Stackery’s Environment Configurations.

With that, my DeployIntentHandler looked like this:

const childProcess = require('child-process-promise');

const DeployIntentHandler = {
  canHandle (handlerInput) {
    return handlerInput.requestEnvelope.request.type === 'IntentRequest'
      && handlerInput.requestEnvelope.request.intent.name === 'DeployIntent';
  },
  handle (handlerInput) {
    console.log('DeployIntent Invoked');
    console.dir(handlerInput);

    const request = handlerInput.requestEnvelope.request;

    if (request.dialogState !== 'COMPLETED') {
      return {
        directives: [{"type": "Dialog.Delegate"}]
      };
    }

    const stackName = request.intent.slots.stackName.value;
    const env = request.intent.slots.env.value;

    let args = ['deploy', stackName, env, 'master',
                '--config', './.stackery.toml',
                '--access-key-id', process.env.accessKeyId,
                '--secret-access-key', process.env.secretAccessKey];

    return childProcess.execFile('./stackery', args)
      .then(result => {
        console.log(`stackery returned: stdout: ${result.stdout}`);
        console.log(`stackery returned: stderr: ${result.stderr}`);
      })
      .catch(error => {
        console.log(`ChildProcess errored with ${JSON.stringify(error)}`);
        if (error.stdout) {
          console.log(error.stdout);
          console.log(error.stderr);
        }
      })
      .then(() => {
        const speechText = `Starting deployment of ${stackName} into ${env}`;

        return handlerInput.responseBuilder
          .speak(speechText)
          .getResponse();
      })
  }
};

I committed my changes and deployed my alexaDeployments stack. Once deployed, I was able to go into the Deployed Stack Dashboard and click on the alexaHandler resource to get the Lambda ARN, which let me finish the last step in setting up my Alexa Skill - connecting the Alexa Skill to the Lambda function.

Function Permission Errors

However, when I tried to add the ARN of the Lambda function to the Alexa skill, I got an error: “The trigger setting for the Lambda arn:aws:lambda:us-west-2:<account>:function:<functionName> is invalid. Error code: SkillManifestError - Friday, Apr 27, 2018, 1:43 PM”. Whoops - I forgot to give Alexa permission to access the Lambda function. Stackery usually takes care of all the permissions needed, but since it didn’t know about the Alexa Skill, I was going to have to manually add the needed permission. Luckily, Stackery makes this easy with Custom CloudFormation Resources. I added a custom resource to my stack with the following CloudFormation:

{
  "Resources": {
    "alexaSkillPolicy": {
      "Type": "AWS::Lambda::Permission",
      "Properties": {
        "Action": "lambda:InvokeFunction",
        "FunctionName": "stackery-85928785027043-development-33181332-alexaHandler",
        "Principal": "alexa-appkit.amazon.com"
      }
    }
  }
}

This lets alexa-appkit.amazon.com invoke my function. After re-deploying my stack with this change, I was able to finish linking my Alexa Skill to my handler, and it was time to test!

Timeouts and Retry Errors

Initial testing looked good - Alexa was able to run my skill, my handler was getting invoked, and I could see my test stack (a simple hello world stack) was getting re-deployed. However, when I looked into the CloudWatch logs for my alexaHandler function, I noticed errors printed from the Stackery CLI: “Failed to prepare deployment: Stackery API responded with status code: 409. You probably already have a deployment in progress”. With some inspection, I realized that since the handler took the time to actually deploy before responding to Alexa, Alexa was seemingly timing out and retrying after about 30 seconds, so this error came from the re-invocation of the Stackery CLI.

Ideally, I’d be able to provide intermittent status updates via Alexa, but unfortunately you are only allowed to respond once. To handle this issue, I refactored my alexaHandler function to asynchronously invoke another Lambda function, stackeryWrapper.

So now, my DeployIntentHandler looked like this:

const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

// stackeryWrapper references the wired stackeryWrapper function and provides the functionName used below
const DeployIntentHandler = {
  canHandle (handlerInput) {
    return handlerInput.requestEnvelope.request.type === 'IntentRequest'
      && handlerInput.requestEnvelope.request.intent.name === 'DeployIntent';
  },
  handle (handlerInput) {
    console.log('DeployIntent Invoked');
    console.dir(handlerInput);

    const request = handlerInput.requestEnvelope.request;

    if (request.dialogState !== 'COMPLETED') {
      return {
        directives: [{ "type": "Dialog.Delegate" }]
      };
    }

    // Strip the spaces Alexa inserts into the spoken stack and environment names
    const stackName = request.intent.slots.stackName.value.replace(/ /g, '');
    const env = request.intent.slots.env.value.replace(/ /g, '');
    let message = { stackName, env };
    const Payload = JSON.stringify(message, null, 2);

    return lambda.invoke({
      FunctionName: stackeryWrapper.functionName,
      InvocationType: 'Event',
      Payload
    }).promise()
      .then(() => {
        const speechText = `Starting deployment of ${stackName} into ${env}!`;

        return handlerInput.responseBuilder
          .speak(speechText)
          .getResponse();
      })
  }
};

And my new stackeryWrapper function looks like this:

const childProcess = require('child-process-promise');

module.exports = async message => {
  console.dir(message);

  const stackName = message.stackName;
  const env = message.env;

  return childProcess.execFile('./stackery', ['deploy', stackName, env, 'master', '--config', './.stackery.toml', '--access-key-id', process.env.accessKeyId, '--secret-access-key', process.env.secretAccessKey])
    .then(result => {
      console.log(`stackery returned: stdout: ${result.stdout}`);
      console.log(`stackery returned: stderr: ${result.stderr}`);
    })
    .catch(error => {
      console.log(`ChildProcess errored with ${error}`);
      if (error.stdout) {
        console.log(error.stdout);
        console.log(error.stderr);
      }
    });
}

And my stack looks like this:


Final Thoughts

While this project is far from being usable by anyone else as it stands, I found it interesting and honestly exciting to be able to get Stackery deployments to work via Alexa. Ramping up on Alexa was relatively painless, although Amazon does have some contradictory documentation that can muddy the waters. And with Stackery, it was painless to handle adding the CLI and the refactoring that I needed. There’s a lot that could still be done on this project, such as authorization, authentication, and status updates, but that will have to wait for another day.

Fargate and Cucumber-js: A Review

Stephanie Baum | April 16, 2018

Lately at Stackery, as we’ve begun shipping features more rapidly into the product, we’ve also been shifting some of our focus towards reliability and integration testing. I decided to try out AWS Fargate for UI integration testing using BDD and Cucumber-js in a day-long experimental POC. Cucumber is a behavior driven development testing framework with test cases written in a language called Gherkin that focuses specifically on user features. AWS Fargate is a recently released abstraction on top of ECS services that gets rid of managing EC2 instances. These are my conclusions:

1. Fargate is awesome. Why would you not use Fargate?

Configuring a Fargate task via the AWS UI is somewhat confusing and clumsy. With Stackery, you can configure Fargate while avoiding the pain of the AWS UI entirely. The communication from AWS Lambda to a Fargate task is the same as it would be for a normal ECS service, so moving existing ECS clusters/services to Fargate is straightforward, application logic-wise. Here’s a simplified code snippet; dockerTaskPort refers to the conveniently provided Stackery port environment variable. See our docs for the Docker Task node for more information.

  const AWS = require('aws-sdk');

  // `token` (a GitHub access token) comes from configuration elided in this simplified snippet
  const repoName = `cross-region-us-east`;
  const browserCiRepo = `https://${token}@github.com/sbaum1994/${repoName}.git`;

  const dockerCommands = [
    `echo 'Running node index.js'`,
    `node index.js`
  ];

  const env = {
    ENV_VAR: 'value'
  };

  let dockerCommand = ['/bin/bash', '-c', dockerCommands.join('; ')];

  const params = {
    taskDefinition: dockerTaskPort.taskDefinitionId,
    overrides: {
      containerOverrides: [
        {
          name: '0'
        }
      ]
    },
    launchType: 'FARGATE'
  };

  params.networkConfiguration = {
    awsvpcConfiguration: {
      subnets: dockerTaskPort.vpcSubnets.split(','),
      assignPublicIp: (dockerTaskPort.assignPublicIPAddress ? 'ENABLED' : 'DISABLED')
    }
  };

  params.overrides.containerOverrides[0].command = dockerCommand;

  params.overrides.containerOverrides[0].environment = Object.keys(env).map((name) => {
    return {name, value: env[name]};
  });

  const ecs = new AWS.ECS({ region: process.env.AWS_REGION });
  return ecs.runTask(params)...

It’s a nice plus that there are no EC2 configurations to worry about, and it also simplifies scaling. In the past we’ve had to use an ECS cluster and service for CI when the integration testing has been too long-running for AWS Lambda. Here, my Fargate service just scales up and down nicely without my having to worry about configuration, bottlenecks, or cost.

Here’s my UI integration testing setup, triggered by an endpoint that specifies the environment to test.

With Fargate there is still technically an ECS cluster that needs configuring on setup, along with a load balancer and target group if you use them. You are still creating a task definition, containers, and a service. Stackery’s UI makes it easy to understand and configure, but if I were doing this on my own I’d still find it a PIA. Furthermore, I could see Fargate not being ideal in some use cases, since you can’t select the EC2 instance type.

Stackery UI setting up Fargate:

2. Cucumber is pretty cool too. BDD creates clear tests and transparent reporting.

I really like the abstraction Cucumber provides between the test definitions and underlying assertions/implementations. For this POC I created a simple “login.feature” file as follows:

Feature: Login
  In order to use Stackery
  As a single user
  I want to login to my Stackery account

  Background:
    Given I've navigated to the Stackery app in the browser
    And it has loaded successfully

  Scenario: Logging in as a user with a provider set up
    Given a test user account exists with a provider
    When I login with my username and password
    Then I'm taken to the "Stacks" page and see the text "Select a stack"
    And I see the "Stackery Stacks" section populated in the page
    And I see the "CloudFormation Stacks" section populated in the page

Each step maps to a function that uses Selenium WebDriver on headless Chrome under the hood to run the tests. I also pass in configuration that lets the test know what the test account username and password are, which Stackery environment is being tested, and other definitions like the timeout settings. In my pipeline, I also added an S3 bucket to hold the latest Cucumber reporting results for visibility after a test finishes.
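
To give a sense of what sits underneath the feature file, here’s a rough sketch of a couple of step definitions using cucumber-js and selenium-webdriver. The selectors and configuration object are illustrative, not our actual test code:

const { Given, Then } = require('cucumber');
const { Builder, By, until } = require('selenium-webdriver');

Given("I've navigated to the Stackery app in the browser", async function () {
  // this.config holds the environment URL, credentials, and timeouts passed into the test run
  // (headless Chrome options omitted for brevity)
  this.driver = await new Builder().forBrowser('chrome').build();
  await this.driver.get(this.config.appUrl);
});

Then('I\'m taken to the {string} page and see the text {string}', async function (page, text) {
  // Wait for the expected text to appear anywhere on the page
  const locator = By.xpath(`//*[contains(text(), "${text}")]`);
  await this.driver.wait(until.elementLocated(locator), this.config.timeout);
});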

Report generated:

Overall, I think this can potentially be a great way to keep adding new features while maintaining existing ones and making sure everything is regression tested on each merge. Furthermore, it’s clear, organized, and user-flow oriented, which works well for a dashboard-style app like ours with multiple, repeatable, extensible steps (Create Environment, Deploy a Stack to Environment, etc.).

Quickly Iterating on Developing and Debugging AWS Lambda Functions

Apurva Jantrania | March 15, 2018

Recently, I found myself having to develop a complex Lambda function that required a lot of iteration and interactive debugging. Iterating on Lambda functions can be painful due to the amount of time it takes to re-deploy an update to Lambda, and attaching a debugger to Lambda just isn’t an option. If you find yourself re-deploying more than a handful of times, the delay introduced by the redeployment process can feel like watching paint dry. I thought I’d take this opportunity to share some of the strategies I use to alleviate the issues I’ve encountered developing and debugging both simple and complex Lambda functions.

I find that it is always useful to log the event or input (depending on your language of choice) for any deployed Lambda function - while you can mock this out (and should for unit tests!), I’ve found that having the full event has been critical for some debug cases. Even with AWS X-Ray enabled on your function, there usually isn’t enough information to recreate the full event structure. Depending on your codebase you may also want to log the context object, but in my experience this isn’t usually necessary.
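
In Node.js, for example, that can be as simple as a log line at the top of the handler (the Python example in Method 1 below does the equivalent with print):

exports.handler = async (event, context) => {
  // Log the full incoming event so a failing invocation can be replayed locally later
  console.log(JSON.stringify(event, null, 2));

  // ... actual handler logic ...
};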

Method 1: A quick and dirty method

With the event logged, it is straightforward to build a quick harness to run the failure instance locally in a way that is usually good enough.

Let’s look at an example in Python - if for example, our handler is handler() in my_lambda.py:

def handler(message):
    print('My Handler')
    print(message)
    # Do stuff

    # Error happens here
    raise Exception('Beep boop bop')

    return None

First, open the CloudWatch logs for this Lambda function (if you are using Stackery to manage your stack, you can find a direct link to your logs in the deployment panel) and capture the message that the function printed.

Then, we can create a simple wrapper file tester.py and import the handler into it. For expediency, I also just dump the event into a variable in this file.

import my_lambda

message = {
  'headers': {
    'accept': '...',
    'accept-language': '...',
    # ...
  }
}


my_lambda.handler(message)

With this, you can quickly iterate on the code in your handler with the message that caused your failure. Just run python tester.py.

There are a handful of caveats to keep in mind with this implementation:

  • ENV vars: If your function requires any ENV vars to be set, you’ll want to add those to the testing harness.
  • AWS SDK: If your Lambda function invokes any AWS SDKs, they will run with the credentials defined for the default profile in ~/.aws/credentials, which may cause permission issues
  • Dependencies: You’ll need to install any dependencies your function requires

But with those caveats in mind, I find this is usually good enough and is the fastest way to replicate an error and iterate on Lambda development.

Method 2: Using Docker

For the times you need to run in a sandboxed environment that is identical (or as close as possible) to Lambda, I turn to Docker with the images provided by LambCI.

When debugging/iterating, I find that my cycle time is sped up by using the build versions of the LambCI images and running bash interactively. For example, if my function runs on Python 2.7, I’ll use the lambci/lambda:build-python2.7 image. I prefer launching into bash rather than having Docker run my Lambda function directly because otherwise any dependencies would need to be downloaded and installed on each run, which can add significant latency.

So in the above example, my command would be docker run -v /path/to/code:/test -it lambci/lambda:build-python2.7 bash. Then, once bash is loaded in the Docker Container, I first do the following:

  1. CD to the test directory: cd /test
  2. Install your dependencies
  3. Run the tester: python /test/tester.py

With this, since we are running docker run with the -v flag to mount the handler directory inside the container as a volume, any changes you make to your code will immediately affect your next run, enabling the same iteration speed as Method 1 above. You can also attach a debugger of your choice if needed.

While this method requires some setup of Docker and thus is a little more cumbersome to start up than Method 1, it will enable you to run locally in an environment identical to Lambda.

Implementing the Strangler Pattern with Serverless

Stephanie Baum | February 28, 2018

By now we’ve all read Martin Fowler’s Strangler Pattern approach to splitting up monolithic applications. It sounds wonderful, but in practice it can be tough to do, particularly when you’re under a time crunch to enable shiny new “modern” features at the same time.

An example I’ve seen several times now is going from an older, on-prem, or in general just slower, traditional architecture to a cloud-based, event-streaming one, which enables things like push notifications to customers, high availability, and sophisticated data analytics. This can seem like a big leap, especially when you’ve got an application that is so old most of your engineering team doesn’t know how it still functions, and is so fragile that if you look at it wrong it’ll start returning java.null.pointer .html pages from its API.

Here’s the good news: serverless can help you! Stackery can help you! By creating serverless API layers for your existing domains, you can abstract away the old, exposing painless RESTful API interfaces to your frontends, while simultaneously incorporating event streaming into your architecture. Furthermore, by using Stackery, you can do this while maintaining a high degree of monitoring (with our Health Dashboard), operations management, and security (since we configure the IAM permissions between services, handle environments, and encrypt configuration storage for you).

The Situation

Let’s take a hypothetical customer loyalty application. It has some XML-based Java APIs that map to some pretty old, non-RESTful application logic. The application works as-is, if slowly, but the cost of maintaining it is getting too high, it’s fragile and prone to tip-overs, and we’ve got a directive to start abstracting some of it away into some sort of new cloud-based architecture on AWS. We also want to justify some of this refactor with some new feature enablement, such as push notifications to customers’ phones when they reach a certain loyalty tier or spending cashback amount, and an event-based data analytics pipeline.

Steps to Enlightenment

  1. Use Domain Driven Design techniques to define a new, cleaner, microservice-like understanding of your application, including the events you want to surface.
  2. Define your new API contracts based on these new domains. In our example, the domains are pretty straightforward: loyalty and customer. Perhaps before they were combined into one, but as we add more loyalty-based functionality we’ve decided to separate them for future-proofing and ease of understanding.
  3. Define how your old APIs map to these new APIs. For example, say we want to enable a new POST /customer endpoint. Previously, the frontend had to send an XML request to service x and an XML request to service y. We will encapsulate and abstract that logic away in our serverless API function.
  4. Build your new architecture!

Above, I have laid out a hypothetical strangler pattern-esque architecture in Stackery’s editor panel to solve the situation.

We have two Rest API Nodes corresponding to the two new domains, which front and forward all requests to two Function Nodes, CustomerAPI and LoyaltyAPI. These implement our new API contract, combined with any abstracted-away logic dealing with the underlying legacy application to enable that contract. So far we have achieved the essential goal of the strangler pattern by abstracting away some of our old logic and exposing it via new domain-driven, segmented APIs.
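
As a rough illustration of step 3 above, the function behind the new POST /customer endpoint becomes a thin facade: it accepts a clean JSON request, fans the work out to the legacy XML services, and returns a RESTful response. The legacy endpoints and payload shapes below are entirely hypothetical:

const axios = require('axios');

// Hypothetical legacy endpoints, provided via environment configuration
const LEGACY_SERVICE_X = process.env.LEGACY_SERVICE_X_URL;
const LEGACY_SERVICE_Y = process.env.LEGACY_SERVICE_Y_URL;

exports.handler = async event => {
  const customer = JSON.parse(event.body);

  // Translate the clean JSON contract into the XML the legacy services expect
  const xmlPayload =
    `<customer><name>${customer.name}</name><email>${customer.email}</email></customer>`;

  await Promise.all([
    axios.post(LEGACY_SERVICE_X, xmlPayload, { headers: { 'Content-Type': 'application/xml' } }),
    axios.post(LEGACY_SERVICE_Y, xmlPayload, { headers: { 'Content-Type': 'application/xml' } })
  ]);

  // A contextual "customer created" event would also be emitted to the event stream here

  return { statusCode: 201, body: JSON.stringify({ status: 'created' }) };
};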

Now for enabling some new functionality. These API nodes, in addition to returning responses to the frontend, emit contextual events to the Events Stream Node, which in turn outputs to the Listener Function Node that listens for the customer or loyalty events it “cares” about. Those events are forwarded on to a Notifications SNS Topic Node, enabling event-based SNS notifications. We also have an Analytics Function Node that gets events from the event stream as well as any error events. The Errors node emits any uncaught errors from our new functions to the UncaughtExceptionHandler Function Node for easier error management and greater visibility.

Conclusion

Not all legacy application migrations will follow the steps I’ve listed here. In fact, one of the biggest struggles with doing something like this is that each strangler pattern must be uniquely tailored based on an in-depth understanding of the existing business logic, risks, and end goals. Often, the engineering team implementing the pattern will be somewhat unfamiliar with some of the new technology being used. That also comes with inherent risks, such as…

  • What if it takes too long to PoC?
  • What if you configure the IAM policies and security groups incorrectly?
  • What if something breaks anywhere in the pipeline? How do we know if it was in the new API layer or the old application?

When you migrate to distributed cloud-based services, it’s more complicated than it’s made out to be. Stackery can help you manage these risks and concerns by making your new applications faster to PoC, managing secure access between services for you, and surfacing errors and metrics. There are a lot of things that can go wrong, and AWS doesn’t make it easy to find the problem. There’s also the task of fine-tuning all these services for cost efficiency and maximum availability. Ask yourself whether you would rather be doing that by digging through the inception that is AWS’s UI, or with Stackery’s Serverless Health Dashboard.

Tracing Serverless Applications with AWS X-Ray

Apurva Jantrania | February 01, 2018

Debugging serverless applications can be very hard. Often, the traditional tools and methodologies commonly used in monolithic applications don’t work (easily, at least). While each service is smaller and easier to fully understand and test, a lot of the complexity and issues are now found in the interconnections between the microservices. The event-driven architecture inherent in serverless further increases the complexity of tracing data through the application, and with it the difficulty of debugging.

Much of the DevOps tooling in this area is still in its infancy, but Amazon took a large step forward with AWS X-Ray. X-Ray helps tie together the various pieces of your serverless application in a way that makes it possible to understand the relationships between the different services and trace the flow of data and failures. One of the key features is X-Ray’s service map, a visual representation of the AWS services in your application and the data flow between them; this ability to visually see your architecture is something we’ve always valued at Stackery and is a key reason we let you design your application architecture visually.

As a quick side note, it is interesting to see how Stackery visualizes a stack compared to the AWS X-Ray visualization:

Stackery Representation

AWS X-Ray Representation

When a request hits a service that provides active X-Ray integration (and one that you’ve set up to use X-Ray), it will add a unique tracing header to the request, which will also be added to any downstream requests that are generated. Currently, Amazon supports active integration only for AWS Lambda, API Gateway, EC2, Elastic Load Balancers, and Elastic Beanstalk. Most other services support passive integration, which is to say that they’ll continue adding to the trace if the request already has the tracing header set.
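
For reference, the tracing header looks something like this (the IDs are illustrative):

X-Amzn-Trace-Id: Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1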

With AWS X-Ray enabled throughout your application, you can click on nodes in the Service Map to see details such as the response distribution and dive into trace data. Here are some traces for a few AWS services - CloudFormation, DynamoDB, Lambda, and STS:

Response Distributions

This view is useful for getting a high-level picture of the health and status of your services. Diving in further allows you to view specific traces, which is critical for understanding which services are slowing your application down or for root-causing failures.

Trace

One limitation to keep in mind is that the X-Ray service map only lets you view data in chunks of 6 hours or less, but it keeps a 30-day rolling history.

Enabling X-Ray can be tedious. For instance, to enable X-Ray on AWS Lambda, you need to do three things for each lambda function:

  1. Enable active tracing
  2. Update your code to use the AWS X-Ray enabled SDK rather than the standard AWS SDK (a Node.js sketch follows this list)
    • X-Ray enabled SDKs are available for Node.js, Java, Go, Python, .NET, and Ruby
    • Using the AWS X-Ray enabled SDK lets Lambda decide how often and when to sample/upload requests
  3. Add the needed IAM permissions to upload the trace segments
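
For Node.js, step 2 boils down to wrapping the standard SDK with the X-Ray SDK so that downstream AWS calls are recorded as subsegments. A minimal sketch (the table and environment variable names are hypothetical):

const AWSXRay = require('aws-xray-sdk-core');
// Wrapping the standard SDK makes every client created from it record X-Ray subsegments
const AWS = AWSXRay.captureAWS(require('aws-sdk'));

const dynamodb = new AWS.DynamoDB.DocumentClient();

exports.handler = async event => {
  // This call now appears as a subsegment in the function's trace
  return dynamodb.get({ TableName: process.env.TABLE_NAME, Key: { id: event.id } }).promise();
};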

Unfortunately, needing to do this for every lambda function, old and new, makes it ripe for human error.

Details on how to enable active tracing on other services can be found here.

At Stackery, we think enabling data tracing is another critical component of Serverless Ops, just like handling errors and Lambda timeouts. So any stack deployed with Stackery has AWS X-Ray automatically enabled - we make sure that any AWS service used has the correct settings to enable active AWS X-Ray tracing where supported, and for Lambda functions we take care of all of the steps above so you don’t need to worry about permissions or updating your code to use the right SDK.

How Does Docker Fit In A Serverless World?

Chase Douglas | January 04, 2018

The debut of AWS Lambda in 2014 spawned a debate: serverless vs Docker. There have been countless articles comparing cost efficiency, performance, constraints, and vendor lock-in between the two technologies.

Thankfully, the second half of 2017 has shown this all to be a bit beside the point. With recent product announcements from Azure and AWS, it is more clear than ever that serverless and Docker are not opposing technologies. In this article, we’re going to take a look at how Docker fits into a serverless world.

Docker Isn’t Serverless (circa 2016)

“Serverless” has about a dozen definitions, but a few are particularly important when we talk about Docker:

  • On-demand resource provisioning
  • Pay-per-use billing (vs up-front provisioning based on peak utilization)
  • Elimination of underlying server management

Until recently, Docker-based applications deployed to the major cloud providers were not “serverless” according to these three attributes. In the beginning, Docker containers were deployed to servers directly, and Ops teams had to build custom solutions to provision and manage pools of these servers. In 2014, both Kubernetes and AWS EC2 Container Service (since renamed Elastic Container Service) enabled easier provisioning and management of these server pools.

Kubernetes and AWS ECS provided two key benefits. On the technical side, they provided templates for provisioning pools of servers to run Docker containers, making it easier for devops teams to get started and maintain them. On the business side, they provided proof that Docker as a technology was mature enough for production workloads. Partly because of these tools, in the past few years Docker became an increasingly common choice for hosting services.

And yet, with all that Kubernetes and AWS ECS provided, we were still left with the tasks of optimizing resource utilization and maintaining the underlying servers that make up the Docker cluster resource pools.

Docker Is Serverless (2017)

Two new services have brought Docker into the serverless realm: Azure Container Instances and AWS Fargate. These services enable running a Docker container on-demand without up-front provisioning of underlying server resources. By extension, this also means there is no management of the underlying server, either.

According to our definition above, Docker is now “serverless”. Now it starts to make sense to compare Docker and Functions-as-a-Service (FaaS), like AWS Lambda. In one sense, we’ve come full circle back to our familiar comparisons between Docker and “serverless”. Except the goal has shifted from the less useful question of which technology is “better” to the more interesting question of when you should use Docker vs when you should use FaaS.

FaaS vs Docker

Going back to the dozen definitions of “serverless”, there are a few definitions that are now clearly misplaced. They are instead definitions of FaaS:

  • Low latency scaling (on the order of a second or less to invoke computation)
  • Managed runtime (Node.js, Java, Python, etc.)
  • Short-lived executions (5 minutes or less)

The new Docker invocation mechanisms now show how these are not applicable to all forms of “serverless” computing. Serverless Docker has the following characteristics instead:

  • Medium latency scaling (on the order of minutes or less to invoke computation)
  • Complete control of runtime environment
  • Unlimited execution duration

These differences help us determine where Docker fits in the serverless world.

How Does Docker Fit In Serverless?

Now that we have seen how Docker can be serverless and also how it differs from FaaS, we can make some generalizations about where to use FaaS and Docker in serverless applications:

Use Cases For Functions-as-a-Service

  • Low-latency, highly volatile (e.g. API services, database side-effect computation, generic event handling)
  • Short-lived computations (FaaS is cheaper because of faster startup, which is reflected in the per-invocation costs)
  • Where the provided runtimes work (if the runtimes work for your application, let the service provider deal with maintaining them)

Use Cases For Docker

  • Background jobs (where invocation latency is not an issue)
  • Long-lived computations (execution duration is unlimited)
  • Custom runtime requirements

This categorization papers over a few cases and leaves a lot of gray area. For example, a serverless Docker application could still back a low-latency API service by spinning up multiple containers and load balancing across them. But having these gray areas is also helpful because it means that we now have two tools we can choose from to optimize for other concerns like cost.

Taking the low-latency API service example again, a decision could be made between a FaaS and a Docker backend based on the cost difference between the two. One could even imagine a future where base load for a highly volatile service is handled by a Docker-based backend, but peak demand is handled by a FaaS backend.

2018 Will Be Exciting For Serverless In All Forms

Given that it’s the beginning of the new year, it’s hard not to look forward and be excited about what this next year will bring in the serverless space. A little over three years since AWS Lambda was announced it has become clear that building applications without worrying about servers is empowering. With Docker joining the fold, even more exciting possibilities open up for serverless.

AWS Lambda Cost Optimization

Sam Goldstein | December 22, 2017

Serverless application architectures put a heavy emphasis on pay-per-use billing models. In this post I’ll look at the characteristics of pay-per-use vs. other billing models and discuss how to approach optimizing your AWS lambda usage for optimal cost/performance tradeoffs.

How Do You Want To Pay For That?

There are basically three ways to pay for your infrastructure.

  1. Purchase hardware up front. You install it in a datacenter and use it until it breaks or you replace it with newer hardware. This is the oldest method for managing capacity and the least flexible. IT procurement and provisioning cycles are generally measured in weeks, if not months, and as a result it’s necessary to provision capacity well ahead of actual need. It’s common for servers provisioned into these environments to use 15% of their capacity or less, meaning most capacity is sitting idle most of the time.
  2. Pay-to-provision. You provision infrastructure using a cloud provider’s pay-to-provision Infrastructure as a Service (IaaS). This approach eliminates the long procurement and provisioning cycles since new servers can be spun up at the push of a button. However it’s still necessary to provision enough capacity to handle peak load, meaning it’s typical to have an (often large) buffer of capacity sitting idle, waiting for the next traffic spike. It’s common to see infrastructure provisioned with this approach with an average utilization in the 30-60% range.
  3. Pay-per-use. This is the most recent infrastructure billing model and it’s closely tied to the rise of serverless architectures. Functions as a Service (FaaS) compute services such as AWS Lambda and Azure Functions bill you only for the time your code is running and scale automatically to handle incoming traffic. As a result it’s possible to build systems that handle large spikes in load, without having a buffer of idle capacity. This billing model is gaining popularity since it aligns costs closely with usage and it’s being applied to an increasing variety of services like databases (both SQL and NoSQL) and Docker-based services.

Approaching AWS Cost Optimization

There are a few things that are important to note before we get into how to optimize your AWS Lambda costs.

  1. AWS Lambda allows you to choose the amount of memory you want for your function from 128MB to 3GB.
  2. Based on the memory setting you choose, a proportional amount of CPU and other resources are allocated.
  3. Billing is based on GB-SECONDS consumed, meaning a 256MB function invocation that runs for 100ms will cost twice as much as a 128MB function invocation that runs for 100ms.
  4. For billing purposes the function duration is rounded up to the nearest 100ms. A 128MB function that runs for 50ms will cost the same amount as one that runs for 100ms. (The sketch after this list turns these rules into a per-invocation cost estimate.)
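
To make those rules concrete, here’s a small helper that estimates per-invocation cost from memory size and duration. The per-GB-second price below is an assumption for illustration; check current AWS pricing for your region:

// ASSUMED price per GB-second, for illustration only
const PRICE_PER_GB_SECOND = 0.00001667;

function invocationCost (memoryMB, durationMs) {
  const billedMs = Math.ceil(durationMs / 100) * 100;      // duration rounds up to the nearest 100ms
  const gbSeconds = (memoryMB / 1024) * (billedMs / 1000); // GB-seconds consumed
  return gbSeconds * PRICE_PER_GB_SECOND;
}

// A 256MB invocation running 100ms costs twice a 128MB invocation running 100ms
console.log(invocationCost(256, 100) / invocationCost(128, 100)); // 2
// A 128MB function running 50ms is billed the same as one running 100ms
console.log(invocationCost(128, 50) === invocationCost(128, 100)); // true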

There are also a few questions you should ask yourself before diving into Lambda cost optimization:

  1. What percentage of my total infrastructure costs is AWS Lambda? In nearly every serverless application FaaS components integrate with resources like databases, queueing systems, and/or virtual networks, and often are a fraction of the overall costs. It may not be worth spending cycles optimizing Lambda costs if they’re a small percentage of your total.
  2. What are the performance requirements of my system? Changing your function’s memory settings can have a significant impact on cold start time and overall run time. If parts of your system have low latency requirements you’ll want to avoid changes that degrade performance in favor of lower costs.
  3. Which functions run most frequently? Since the cost of a single Lambda invocation is insanely low, it makes sense to focus cost optimization on functions with monthly invocation counts in the hundreds of thousands or millions.

AWS Lambda Cost Optimization Metrics

Now let’s look at the two primary metrics you’ll use when optimizing Lambda cost.

Allocated Memory Utilization

Each time a Lambda function is invoked, two memory-related values are printed to CloudWatch Logs, labeled Memory Size and Max Memory Used. Memory Size is the function’s memory setting (which also controls allocation of CPU resources). Max Memory Used is how much memory was actually used during the invocation. It may make sense to write a Lambda function that parses these values out of CloudWatch Logs and calculates the percentage of allocated memory used. Watching this metric, you can decrease memory allocation on functions that are overprovisioned and watch for increasing memory use that may indicate functions becoming under-allocated.
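
Here’s a sketch of what that parsing could look like, based on the REPORT line Lambda writes at the end of every invocation (how you feed log lines into this function - for example via a CloudWatch Logs subscription - is left out):

// Example REPORT line from CloudWatch Logs:
// REPORT RequestId: ... Duration: 42.51 ms  Billed Duration: 100 ms  Memory Size: 512 MB  Max Memory Used: 78 MB

const REPORT_PATTERN = /Duration: ([\d.]+) ms\s+Billed Duration: (\d+) ms\s+Memory Size: (\d+) MB\s+Max Memory Used: (\d+) MB/;

function parseReportLine (line) {
  const match = REPORT_PATTERN.exec(line);
  if (!match) return null;

  const [, duration, billedDuration, memorySize, maxMemoryUsed] = match.map(Number);
  return {
    allocatedMemoryUtilization: maxMemoryUsed / memorySize,   // e.g. 0.15 suggests the function is overprovisioned
    billedDurationUtilization: duration / billedDuration      // e.g. 0.10 means you pay for 10x your actual runtime
  };
}

The same line carries the Duration and Billed Duration values needed for the Billed Duration Utilization metric discussed next.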

Billed Duration Utilization

It’s important to remember that AWS Lambda usage is billed in 100ms intervals. Like memory usage, Duration and Billed Duration are logged to CloudWatch after each function invocation, and these can be used to calculate a metric representing the percentage of billed time for which your functions were actually running. While 100ms billing intervals are granular compared to most pay-to-provision services, there can still be major cost implications to watch out for. Take, for example, a 1GB function that generally runs in 10ms. Each invocation of this function will be billed as if it took 100ms - a 10x difference in cost! In this case it may make sense to decrease the memory setting of this function so its runtime is closer to 100ms, with significantly lower costs. An alternative approach is to rewrite the function to perform more work per invocation (in use cases where this is possible), for example processing multiple items from a queue instead of one, to increase Billed Duration Utilization.

Conversely, there are cases where increasing the memory setting can result in lower costs and better performance. Take as an example a 1GB function that runs in 110ms; this will be billed as 200ms. Increasing the memory setting (which also controls CPU resources) slightly may allow the function to execute in under 100ms, which will decrease the billed duration by 50% and result in lower costs.

The New Cost Optimization

The pay-per-use billing model significantly changes the relationship between application code and infrastructure costs, and in many ways enforces a DevOps approach to managing these concerns. Instead of provisioning for peak load plus a buffer, infrastructure is provisioned on demand and billed based on application performance characteristics. In general this dramatically simplifies the process of tracking utilization and optimizing costs, but it also transforms the concern. Instead of watching resource utilization across a pool of servers, it becomes necessary to track application-level metrics like invocation duration and memory utilization in order to fully understand and optimize costs. Traditional application performance metrics like response time, batch size, and memory utilization now have direct cost implications and can be used as levers to control infrastructure costs. This is yet another example of where serverless technologies are driving the convergence of development and operational concerns. In the serverless world, infrastructure costs and application performance and behavior become highly coupled.

Ready to Get Started?

Contact one of our product experts to get started building amazing serverless applications quickly with Stackery.
