Stacks on Stacks

The Serverless Ecosystem Blog by Stackery.

Posts on DevOps

PHP on Lambda? Layers Makes it Possible!

Nuatu Tseggai | November 29, 2018

AWS’s announcement of Lambda Layers means big things for those of us using serverless in production. The ability to create shared components that can be included with any number of Lambdas means you no longer have to zip up your application code and all its dependencies each time you deploy a serverless stack. It also allows you to include dependencies that are much more bespoke to your particular serverless environment.

In order to enable Stackery customers with Layers at launch, we took a look at Lambda Layers use cases. I also decided to go a bit further and publish a layer that enables you to write a Lambda in PHP. Keep in mind that this is an early iteration of the PHP runtime Layer, which is not yet ready for production. Feel free to use this Layer to learn about the new Lambda Layers feature and begin experimenting with PHP functions and send us any feedback; we expect this will evolve as the activity around proof of concepts expands.

What does PHP do?

PHP is a complete programming language, and you could use it to emulate the event-processing style of a general-purpose Lambda. But really, PHP is used to create websites, so this implementation maintains that model: your Lambda accepts API Gateway events and processes them through a PHP web server.

How do you use it?

Configure your function as follows:

  1. Set the Runtime to provided
  2. Determine the latest version of the layer: aws lambda list-layer-versions --layer-name arn:aws:lambda:<your region>:887080169480:layer:php71
  3. Add the following Lambda Layer: arn:aws:lambda:<your region>:887080169480:layer:php71:<latest version>

If you are using AWS SAM it’s even easier! Update your function:

Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      ...
      Runtime: provided
      Layers:
        - !Sub arn:aws:lambda:${AWS::Region}:887080169480:layer:php71:<latest version>

Now let’s write some Lambda code!

<?php
// Set a custom response header
header('Foo: bar');
// Echo the request headers passed through by API Gateway
print('Request Headers:');
print_r(getallheaders());
// Echo the query string parameters
print('Query String Params:');
print_r($_GET);
?>

Hi!

The response you get from this code isn’t very well formatted, but it does contain the header information passed by API Gateway.

If you request any path other than the one with a configured API endpoint, you’ll get an error response that the sharp-eyed will recognize as coming from the PHP web server, which, as mentioned above, is processing all requests.

Implementation Details

Layers can be shared between AWS accounts, which is why the instructions above for adding a layer work: you don’t have to create a new layer for each Lambda. Some key points to remember:

  • A layer must be published in your region
  • You must specify the version number for a layer
  • For the layer publisher, a version number is an integer that increments each time you deploy your layer (see the sketch below)
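
If you publish layers of your own, SAM can create and version them from the same template as your functions. Here is a minimal sketch, assuming a hypothetical layer name and content directory; deploying changed content produces a new, incremented version:

Resources:
  MySharedLayer:
    Type: AWS::Serverless::LayerVersion
    Properties:
      LayerName: my-shared-libs     # hypothetical layer name
      ContentUri: ./layer-src       # hypothetical directory holding the layer contents
      CompatibleRuntimes:
        - provided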

How Stackery Makes an Easy Process Easier

Stackery can improve every part of the serverless deployment process, and Lambda Layers are no exception. Stackery makes it easy to configure components like your API gateway.

Stackery also has integrated features in the Stackery Operations Console, which lets you add layers to your Lambda.

Conclusions

Lambda Layers offers the potential for more complex serverless applications that make use of a deep library of components, both internal to your team and shared. Try adding some data variables or a few npm modules as a layer today!

GitHub Actions: Automating Serverless Deployments

Toby Fee | November 20, 2018

The whole internet is abuzz over GitHub Actions, if by ‘whole internet’ you mean ‘the part of the internet that is obsessed with serverless ops’ and by ‘abuzz’ you mean ‘aware of‘.

But Actions are a bit surprising! GitHub is a company that has famously focused on doing a single thing extremely well. As the ranks of developer-tooling SaaS companies swell by the day, you would think GitHub would have long ago joined the fray. Wouldn’t you like to try out a browser-based IDE, CI/CD tools, or live debugging tools, with the GitHub logo at the top left corner?

The GitHub Actions product page promises to let you build your own ‘workflow’ from private and shared ‘actions’ in any order you want, with each action run in its own Docker container and the whole thing configured with simple-but-powerful scripting logic. GitHub Actions feels like a big, ambitious expansion of the product. Could this be the influence of notoriously ambitious new leadership at Microsoft?

In reality GitHub Actions are a powerful new tool for expanding what you do based on GitHub events and nothing else!

What can it do?

A lot! Again, it’s all in the realm of ‘doing something to your repo when something happens to your repo.’ Some use cases that stand out:

  • Build your assets when you commit to Master
  • Raise alerts for critical issues
  • Take some custom action when commits get comments
  • Notify stakeholders when feature branches are merged to production

What can’t it do?

A whole lot! Workflows can’t:

  • Respond to anything other than GitHub repo events (you can’t send a request from anywhere else to kick off a workflow)
  • Take more than an hour
  • Have more than 100 actions - a limitation that seems pretty minor since actions can do arbitrarily large tasks

Overall Impressions

GitHub Actions are definitely a step in the right direction, since both the configuration for a workflow and the Docker images for each action can all be part of a single repo, managed like other code. And as I and others have often harped on: one should always prefer managing code over managing config. GitHub Actions increases the likelihood that your whole team will be able to see how the workflow around your code is supposed to work, and that’s an unalloyed benefit to your team.

“I’m sold, GitHub Actions forever! I’ll add them to master tomorrow.”

Bad news, sport: GitHub Actions is in a beta with a waitlist, and while GitHub has its sights set on integrating Actions with your production process, a warning at the top of the developer guide explicitly states that GitHub Actions isn’t ready for that yet.

So for now head on over and get on the waiting list for the feature, and try it out with your dev branches sometime soon.

GitHub makes no secret of the fact that Actions replace the task of building an app to receive webhooks from your repository. If you’d like to build an app in the simplest possible structure, my coworker Anna Spysz wrote about how to receive GitHub webhooks in just a few steps. Further, using Stackery makes it easy to hook your app up to a docker container to run your build tasks.

The Case for Minimalist Infrastructure

Garrett Gillas | November 13, 2018

If your company could grow its engineering organization by 40% without increasing costs, would they do it? If your DevOps team could ship more code and features with fewer people, would they want to? Hopefully, the answer to both of these questions is ‘yes’. At Stackery, we believe in helping people create the most minimal application infrastructure possible.

Let me give you some personal context. Last year, I was in charge of building a web application integrated with a CMS that required seven virtual machines, three MongoDBs, a MySQL database and CDN caching for production. In addition, we had staging and dev environments with similar quantities of infrastructure. Over the course of 2 weeks, we were able to work with our IT-Ops team to get our environments up and running and start building the application relatively painlessly.

After we got our application running, something happened. Our IT-Ops team went through their system hardening procedure. For those outside the cybersecurity industry, system hardening can be defined as “securing a system by reducing its surface of vulnerability”. This often includes things like changing default passwords, removing unnecessary software, unnecessary logins, and the disabling or removal of unnecessary services. This sounds fairly straightforward, but it isn’t.

In our case, it involved checking our system against a set of rules like this one for Windows VMs and this one for Linux. Because we cared about security, this included closing every single port on every single appliance that was not in use. As the project lead, I discovered three things by the end.

  • We had spent much more people-hours on security and ops than on development.
  • Because there were no major missteps, this was nobody’s fault.
  • This should never happen.

Every engineering manager should have a ratio in their head of work hours spent in their organization on software engineering vs other related tasks (ops, QA, product management, etc…). The idea is that organizations that spend the majority of their time actually shipping code will perform better than groups that spend a larger percentage of their time on operations. At this point, I was convinced that there had to be a better way.

Serverless Computing

There have been many attempts since the exodus to the cloud to make infrastructure easier to manage in a way that requires fewer personnel hours. We moved from bare-metal hardware to datacenter VMs, then to VMs in the cloud, and later to containers.

In November 2014 Amazon Web Services announced AWS Lambda. The purpose of Lambda was to simplify building on-demand applications that are responsive to events and data. At Stackery, we saw a big opportunity that allows software teams to spend less time on infrastructure, and more time building software. We have made it our mission to make it easier for software engineers to build highly-scalable applications on the most minimal, modern cloud infrastructure available.

Five Ways Serverless Changes DevOps

Sam Goldstein | October 31, 2018

I spent last week at DevOps Enterprise Summit in Las Vegas where I had the opportunity to talk with many people from the world’s largest companies about DevOps, serverless, and the ways they are delivering software faster with better stability. We were encouraged to hear of teams using serverless for everything from cron jobs to core bets on accelerating digital transformation initiatives.

Lots of folks had questions about what we’ve learned running the serverless engineering team at Stackery, how to ensure innovative serverless projects can coexist with enterprise standards, and most frequently, how serverless changes DevOps workflows. Since I now have experience building developer enablement software out of virtual machines, container infrastructures, and serverless services I thought I’d share some of the key differences with you in this post.

Developers Need Cloud-Side Environments to Work Effectively

At its core, serverless development is all about combining managed services in the cloud to create applications and microservices. The serverless approach has major benefits. You can move incredibly fast, outsourcing tons of infrastructure friction and orchestration complexity.

However, because your app consists of managed services, you can’t run it on your laptop. You can’t run the cloud on your laptop.

Let’s pause here to consider the implications of this. With VMs and containers, deploying to the cloud is part of the release process. New features get developed locally on laptops and deployed when they’re ready. With serverless, deploying to the cloud becomes part of the development process. Engineers need to deploy as part of their daily workflow developing and testing functionality. Automated testing generally needs to happen against a deployed environment, where the managed service integrations can be fully exercised and validated.

This means the environment management needs of a serverless team shift significantly. You need to get good at managing a multitude of AWS accounts and developer-specific environments, avoiding namespace collisions, injecting environment-specific configuration, and promoting code versions from cloud-side development environments toward production.
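
As a rough sketch of what that looks like in a SAM template (the parameter and resource names here are illustrative, not Stackery’s implementation), an environment name can be passed in and folded into anything that must be unique per deployment:

Parameters:
  EnvironmentName:
    Type: String
    Default: dev                    # one value per developer, test stage, or prod
Resources:
  OrdersFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: !Sub orders-${EnvironmentName}    # avoids namespace collisions
      CodeUri: ./src
      Handler: index.handler
      Runtime: nodejs8.10
      Environment:
        Variables:
          TABLE_NAME: !Sub orders-${EnvironmentName}  # environment-specific configuration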

Note: While there are tools like SAM CLI and localstack that enable developers to invoke functions and mimic some managed services locally, they tend to have gaps and behave differently than a cloud-side environment.

Infrastructure Management = Configuration Management

The serverless approach focuses on leveraging the cloud provider to do more of the undifferentiated heavy lifting of scaling the IT infrastructure, freeing your team to maintain laser focus on the unique problems which your organization solves.

To repeat what I wrote a few paragraphs ago, serverless teams build applications by combining managed services that have the most desirable scaling, cost, and durability characteristics. However, here’s another big shift. Developers now need familiarity with a hefty catalog of services. They need to understand their pros and cons, when to use each service, and how to configure each service correctly.

A big part of solving this problem is to leverage Infrastructure as Code (IaC) to define your serverless infrastructure. For serverless teams this commonly takes the form of an AWS Serverless Application Model (SAM) template, a serverless.yml, or a CloudFormation template. Infrastructure as Code provides the mechanism to declare the configuration and relationships between the managed services that compose your serverless app. However, because serverless apps typically involve coordinating many small components (Lambda functions, IAM permissions, API & GraphQL gateways, datastores, etc.) the YAML files containing the IaC definition tend to balloon to hundreds (or sometimes thousands) of lines, making them tedious to modify and hard to keep consistent with good hygiene. Multiply the size and complexity of a microservice IaC template by your dev, test, and prod environments, the engineers on the team, and the number of microservices, and you quickly get to a place where you will want to carefully consider how you’ll manage the IaC layer and avoid being sucked into YAML hell.
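
To make that concrete, here is a hedged sketch (names and paths are illustrative) of roughly what one small endpoint looks like in SAM; multiply it by every function, environment, and microservice and the template grows quickly:

Resources:
  GetOrderFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: ./src
      Handler: getOrder.handler
      Runtime: nodejs8.10
      Policies:
        - DynamoDBReadPolicy:       # IAM permissions scoped to one table
            TableName: !Ref OrdersTable
      Events:
        GetOrder:
          Type: Api                 # one API Gateway route
          Properties:
            Path: /orders/{id}
            Method: get
  OrdersTable:
    Type: AWS::Serverless::SimpleTable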

Microservice Strategies Are Similar But Deliver Faster

Serverless is now an option for both new applications and refactoring monoliths into microservices. We’ve seen teams deliver highly scalable, fault-tolerant services in days instead of months to replace functionality in monoliths and legacy systems. We recently saw a team employ the serverless strangler pattern to transition a monolith to GraphQL serverless microservices, delivering a production ready proof of concept in just under a week. We’ve written about the Serverless Strangler Pattern before on the Stackery blog, and I’d highly recommend you consider this approach to technical transformation.

A key difference with serverless is the potential to eliminate infrastructure and platform provisioning cycles completely from the project timeline. By choosing managed services, you’re intentionally limiting yourself to a menu of services with built-in orchestration, fault tolerance, scalability, and defined security models. Building scalable distributed systems is now focused exclusively on the configuration management of your infrastructure as code (see above). Just whisper the magic incantation (in 500-1000 lines of YAML) and microservices spring to life, configured to scale on demand, rather than being brought online through cycles of infrastructure provisioning.

Regardless of platform, enforcing cross-cutting operational concerns when the number of services increases is a (frequently underestimated) challenge. With microservices it’s easy to keep the pieces of your system simple, but it’s hard to keep them all consistent as the number of pieces grows.

What cross-cutting concerns need to be kept in sync? It’s things like:

  • access control
  • secrets management
  • environment configuration
  • deployment
  • rollback
  • auditability
  • so many other things…

Addressing cross-cutting concerns is an area where many serverless teams struggle, sometimes getting bogged down in a web of inconsistent tooling, processes, and visibility. However, the serverless teams that do master cross-cutting concerns are able to deliver on microservice transformation initiatives much faster than those using other technologies.

Serverless is Innovating Quickly

Just like serverless teams, the serverless ecosystem is moving fast. Cloud providers are pushing out new services and features every day. Serverless patterns and best practices are undergoing rapid, iterative evolution. There are multiple AWS product and feature announcements every day. It’s challenging to stay current on the ever expanding menu of cloud managed services, let alone best practices.

Our team at Stackery is obsessed with tracking changes in the serverless ecosystem, identifying best practices, and sharing these with the serverless community. AWS Secrets Manager, easy authorization hooks for REST APIs in AWS SAM, 15 minute Lambda timeouts, and AWS Fargate Containers are just a few examples of recent serverless ecosystem changes our team is using. Only a serverless team can keep up with a serverless team. We have learned a lot of lessons, some of them the hard way, about how to do serverless right. We’ll keep refining our serverless approach and can honestly say we’re moving faster and with more focus than we’d ever thought possible.

Patching and Capacity Distractions Go Away (Mostly)

Raise your hand if the productivity of your team ever ground to a halt because you needed to fight fires or were blocked waiting for infrastructure to be provisioned. High profile security vulnerabilities are being discovered all the time. The week Heartbleed was announced a lot of engineers dropped what they had been working on to patch operating systems and reboot servers. Serverless teams intentionally don’t manage OS’s. There’s less surface area for them to patch, and as a result they’re less likely to get distracted by freshly discovered vulnerabilities. This doesn’t completely remove a serverless team’s need to track vulnerabilities in their dependencies, but it does significantly scope them down.

Capacity constraints are a similar story. Since serverless systems scale on demand, it’s not necessary to plan capacity in the traditional sense, managing a buffer of (often slow to provision) capacity to avoid hitting a ceiling in production. However, serverless teams do need to watch for a wide variety of AWS resource limits and request increases before they are hit. It is important to understand how your architecture scales and how that will affect your infrastructure costs. Instead of your system breaking, it might just send you a larger bill, so understanding the relationship between scale, reliability, and cost is critical.
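
As one hedged example of watching a limit before it bites (the threshold and alert topic are illustrative), a CloudWatch alarm can track account-wide Lambda concurrency, which defaults to 1,000 concurrent executions per region:

Resources:
  ConcurrencyNearLimitAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Lambda concurrency is approaching the regional limit
      Namespace: AWS/Lambda
      MetricName: ConcurrentExecutions    # account-wide when no dimensions are set
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 5
      Threshold: 800                      # illustrative; roughly 80% of the default limit
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref OpsAlertsTopic             # hypothetical SNS topic defined elsewhere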

As a community we need to keep pushing the serverless envelope and guiding more teams in the techniques to break out of technical debt, overcome infrastructure inertia, embrace a serverless mindset, and start showing results they never knew they could achieve.

The '8 Fallacies of Distributed Computing' Aren't Fallacies Anymore

Apurva Jantrania | October 23, 2018

In the mid-90’s, centralized ‘mainframe’ systems were in direct competition with microcomputing for dominance of the technology marketplace and developers’ time. Peter Deutsch, a Sun Microsystems engineer who was a ‘thought leader’ before we had the term, wrote down seven fallacious assumptions that many developers make about distributed computing, to which James Gosling added one more to make the famous list of The 8 Fallacies of Distributed Computing.

  1. The network is reliable
  2. Latency is zero
  3. Bandwidth is infinite
  4. The network is secure
  5. Topology doesn’t change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous

Microcomputing would win that debate in the 90’s, with development shifting to languages and platforms that could be run on a single desktop machine. Twenty years later we’re still seeing these arguments used against distributed computing, especially against the most rarefied version of distributed computing, serverless. Recently, more than one person has replied to these fallacies by saying they ‘no longer apply’ or ‘aren’t as critical’ as they once were, but the truth is none of these are fallacies any more.

How can these fallacies be true? There is still latency, and there are still network vulnerabilities.

Before about 2000, the implied comparison with local computing didn’t need to be stated: “The network is more reliable than your local machine…” was obviously a fallacy. But now that assumption is the one under examination. Networks still have latency, but the superiority of local machines over networks is now either in doubt or absent.

1. The Network is Reliable

This is a pretty good example of how the list of fallacies ‘begs the question.’ What qualifies as ‘reliable?’ Now that we have realistic stats about hard drive failure rates, even a RAID-compliant local cluster has some failure rate.

2. Latency is Zero

Latency is how much time it takes for data to move from one place to another (versus bandwidth, which is how much data can be transferred during that time). The reference here is to mainframe ‘thin client’ systems where every few keystrokes had to round-trip to a server over an unreliable, laggy network.

While our networking technology is a lot better, another major change has been effective AJAX and async tools that check in when needed and show responsive interfaces all the time. On top of a network request structure that hasn’t changed much since the 90’s, and a browser whose memory needs seem to double annually, we still manage to run cloud IDEs that perform pretty well.

3. Bandwidth is Infinite

Bandwidth still costs something, but beyond the network improvements mentioned above, the cost of bandwidth has become extremely small. Bills for bandwidth do exist, and I’ve even seen some teams optimize to try to save on their bandwidth costs!

In general this costs way more in developer wages than it saves, and brings us to the key point. Bandwidth still costs money, but the limited resource is not technology but people. You can buy a better network connection at 2AM on a Saturday, but you cannot hire a SQL expert who can halve your number of DB queries.

4. The Network is Secure

As Bloomberg struggles to back up its reports of a massive hardware bugging attack against server hardware, many people want to return to a time when networks were inherently untrustworthy. More accurately, since few developers can do their jobs without constant network access to at least GitHub and NPM, untrustworthy networks are an easy scapegoat for the poor operational security practices that almost everyone is guilty of.

The Dimmie attack, which peaked in 2017, targeted the actual spot where most developers are vulnerable: their laptops. With enough access to load in-memory modules on your home machine, attackers can corrupt your code repos with malicious content. In a well-run dev shop, it’s the private computing resources that tend to be entry points for malicious attacks; the laptops we take home with us for personal use should be the least-trusted components.

5. Topology Doesn’t Change

With the virtualization options available in something serverless like AWS’s Relational Database Service (RDS), it’s likely that topology never has to change from the point of view of the client. In a local or highly controlled environment, there are setups where no DB architectures, interfaces, or request structures have changed in years. This is called ‘Massive Tech Debt’.

6. There is One Administrator

If this isn’t totally irrelevant (no one works on code that has a single human trust source anymore, and if they do, that’s… real bad; get a second GitHub admin, please), it might still be a reason to use serverless and not roll your own network config.

For people just dipping a toe into managed services, there are still horror stories about the one single AWS admin leaving for six weeks of vacation, deciding to join a monastery, and leaving the dev team unable to make changes. In those situations, where there wasn’t much consideration of the ‘bus factor’ on the team, there still is just one administrator: the service provider. And as long as you’re the one paying for the service, you can wrest back control.

7. Transport Cost is Zero

Yes, transport cost is zero. This one is just out of date.

8. The Network is Homogeneous

Early networked systems had real issues with this. I am reminded of the college that reported it could only send email to places within 500 miles; there are ‘special places’ in a network that can confound your tests and your general understanding of the network.

This fallacy isn’t so much true as moot: the awkward parts of a network are now clearly labelled as such. CI/CD explicitly tests in a controlled environment, and even AWS, which does its darndest to present you with a smooth, homogeneous system, intentionally makes you aware of geographic zones.

Conclusions

We’ve all seen people on Twitter pointing at an AWS outage and shouting that this means we should ‘not trust someone else’s computer,’ but I’ve never seen an AWS-hosted service have a tenth of the outages of a self-hosted one. Next time someone shares a report of a two-hour outage in a single Asian AWS region, ask to see their red team logs from the last six months.

At Stackery, we have made it our mission to make modern cloud infrastructure as accessible and useful as possible. Get your engineering team the best solution for building, managing and scaling serverless applications with Stackery today.

Disaster Recovery in a Serverless World - Part 3

Nuatu Tseggai | October 18, 2018

This is part three of a multi-part blog series. In the first post we covered Disaster Recovery planning when building serverless applications and in the second post we covered the systems engineering needed for an automated solution in the AWS cloud. In this post, we’ll discuss the end-to-end Disaster Recovery exercise we performed to validate the plan.

The time has come to exercise the Disaster Recovery plan. The objective of this phase is to see how closely the plan correlates to the current reality of your systems and to identify areas for improvement. In practice, this means assembling a team to conduct the exercise and documenting the process within the communication channel you outlined in the plan.

Background

In the first post you’ll find an outline of the plan that we’ll be referencing in this exercise. Section 2 describes the process of initiating the plan and assigning roles and Section 3 describes the communication channels used to keep stakeholders in sync.

Set the Stage

Decide who will be involved in the exercise, block out a chunk of time equivalent to the Recovery Time Objective, and create the communication channel in Slack (or whatever communication tool your organization uses, e.g. Google Hangouts or Skype).

Create a document to capture key information relevant to the Disaster Recovery exercise such as who is assigned which roles, the AWS regions in play, and most importantly, a timeline pertaining to the initiation and completion of each recovery step. Whereas the Disaster Recovery plan is a living document to be updated over time, the Disaster Recovery exercise document mentioned here is specific to the date of the exercise, e.g. Disaster Recovery Exercise 20180706.

Conduct the Exercise

Enter the communication channel and post a message that signals the start of the exercise immediately followed by a message to the executive team to get their approval to initiate the Disaster Recovery plan.

Upon completion of the exercise, post a message that signals the end of the exercise.

Key Takeaways

  1. Disaster Recovery exercises can be stressful. Be courteous and supportive to one another.
  2. High-performance teams develop trust by practicing effective communication. Lean towards communication that is both precise and unambiguous. This is easier said than done, but it gets easier with experience. Over time, the Disaster Recovery plan and exercise will become self-reinforcing.
  3. Don’t forget to involve a representative of the executive team. This is to ensure that the operational status is being communicated across all levels.
  4. Be clear on which AWS region represents the source and which AWS region represents the DR target.
  5. Research and experiment with the options you have available to automate DNS changes; see the sketch after this list for one approach. Again, the key here is gaining the skills and confidence through testing and optimizing for fast recovery. Know your records (A and CNAME) and the importance of TTL.
  6. Verify completion of recovery from the perspective of a customer.
  7. Conduct a retrospective to illuminate areas for improvement or research. Ask questions that encourage participants to openly discuss both the technical and psychological aspects of the exercise.
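
As one example of automating the DNS piece mentioned in item 5 (the domain names, regions, and health check below are illustrative, not our actual setup), Route 53 failover records with a short TTL let traffic shift to the DR region quickly:

Resources:
  PrimaryApiRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: api.example.com.
      Type: CNAME
      TTL: '60'                                 # short TTL so the switch takes effect quickly
      SetIdentifier: primary
      Failover: PRIMARY
      HealthCheckId: !Ref PrimaryHealthCheck    # hypothetical health check resource
      ResourceRecords:
        - primary-api.us-east-1.example.com
  SecondaryApiRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: api.example.com.
      Type: CNAME
      TTL: '60'
      SetIdentifier: secondary
      Failover: SECONDARY
      ResourceRecords:
        - dr-api.us-west-2.example.com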

Stand On the Shoulders of Giants or Fall in Their Footsteps?

Toby Fee | October 09, 2018

The first-mover advantage is much touted in the technology press. After all, you won’t really see much benefit to being the second car insurance company to make a sitcom about cavemen. But in technology, being first to market only works when being first enables a new business. On other occasions, following the pioneers can let you move faster than the pioneers did.

All My Heroes Were the First to Release New Products

With a long enough temporal distance, it’s easy to remember iPods, Nintendo Entertainment Systems, or even Heroku as ‘the first thing like that to appear,’ when all of these were actually beneficiaries of second-mover advantage.

Here I have the extreme pleasure to acquaint you with the MacRumors forum thread from 2001, when Apple fans decried the iPod as just another mp3 player with ‘no more features than my Rio Diamond Jukebox.’ Sadly, while writing this article I could not find a way to work in the fact that Steve Jobs’ presentation stack for the first iPod is typeset in Comic Sans.

In short, first movers, even those who have a successful product launch, offer a ‘free ride’ to their competitors who can use the first movers’ market research, product design, and advertising. If parts of the market don’t respond to the first product, a second-mover can offer a new product that, along with improvements, addresses the market better.

How do these differences shake out in the technical field? When we’re talking not about marketing a product but about the engineering, the process is still extremely different when you’re not breaking new ground.

And how, fundamentally, is the development process different? Can we stand on the shoulders of giants? Or will we get bogged down in a landscape already riddled with giants’ footprints? What are the pitfalls of going second?

Follow the giants’ map.

Walmart Labs took up NodeJS when it was an open-source darling, with a few unformatted websites proudly proclaiming its benefits. When the team documented their path to success, it largely recapped the framework’s path to mainstream success.

  • Patching core modules to support interoperability with other frameworks - when node core HTTP was converting everything to lowercase and the existing Java framework expected custom headers in uppercase, a patch was the only solution. When you are trying to roll out a new open source tool in your enterprise, you won’t be able to wait for your Github issue to get upvotes. Like it or not you will end up either forking or patching core.

  • Implement procrustean resource imports - I want security updates and bug fixes as much as the next gal, so the inbuilt Semantic Versioning (semver) support in npm is perfect for me. But in enterprise, we need to know that rollouts are identical, to the bit, with what we did last week. That means we want to lock our packaging to an exact version of a module. There are a ton of ways to do this in npm now but all of this was stuff Walmart Labs had to work out on their own.

  • Onboarding - How long does it take to go from ‘tinkerer’ to ‘competent’? I’ve been writing C for the Arduino for 4 years and I’m just above a beginner. Focused self-directed study might only take a few months, but onboarding in an enterprise shop is a matter of weeks. Again, in the early days of a framework you may end up drafting internal documents rather than forcing new devs to sift through mountains of repo documentation, blog posts, and Stack Overflow questions to get up to speed.

Enterprise features and tools are not automatic.

In these examples Node was 100 percent working: it had no bugs or fundamental flaws. All of the missing pieces were ones that you only need if you are working in an enterprise environment. The clever engineers who built Node didn’t ‘forget’ these parts, but using Node to make money required them!

Blazing a trail even if you’re using AWS

iRobot dove feet-first into serverless architecture to run countless IoT devices, and while this has come with some great successes, it can lead to some odd moments! In articles like this one, Ben Kehoe of iRobot tries to be the first to wrap his mind around the new AWS Fargate and how it changes the map for iRobot.

Even without facing road bumps with a new project, keeping up with bleeding-edge features means your team might change directions after every re:Invent conference. Occasionally it means that a service that you thought was an expansion of serverless is more like automatically managed containers.

Rewrite your queries to serve customers better, not to save RAM.

Being the second mover in a managed environment means that both performance issues and enterprise requirements should be solved for the majority of cases. The examples above are all about making your team and the code work; none of them involve real business problems.

What if all our bugs could be business-focused? What if, rather than trying to optimize queries for memory use, you were optimizing sorting, providing more relevant results, or adding metadata to power cool new features?

The power of serverless, a highly managed environment where operations can focus on observing real performance, is to allow your developers to spend less time optimizing within frameworks and more time understanding your business needs.

…and Stackery Can Help

Stackery, a tool for creating serverless stacks, makes it easy for teams to collaborate on AWS resources and configuration. With tools that let your whole team modify and approve changes to your stack, it has the potential to let you stick to best practices as you roll out services for your customers. AWS CloudFormation has an open-source standard for describing and deploying stacks, Serverless Application Model (SAM) YAML, which Stackery automatically creates, making it easier to learn the standard.

Move Slow and Make Things

Did you ever hear the one about two friends who come upon a bear in the woods, and one man sits down to tie his shoes? “What are you doing?” says the other man. “Shoes or no, you can’t outrun a bear.” The first man finishes lacing up and says, “I don’t need to outrun the bear, I just need to outrun you.”

If the message of this piece is ‘move slow’ it should really be ‘move a bit slower than the competition.’

If you are breaking new territory, plan the time you’ll need for enterprise features and documentation, and if you really want to deny your competitors second-mover advantage, keep tight-lipped about best practices and cancel your senior devs’ speaking schedule!

Observability is Not Just Logging or Metrics

Toby Fee | October 01, 2018

Lessons from Real-World Operations

We generally expect that every new technology will have, along with massive new advantages, fundamental flaws that mean the old technology always has its place in our tool belt. Such a narrative is comforting! It means none of the time we spent learning older tools was wasted, since the new hotness will never truly replace what came before. The term ‘observability’ has a lot of cachet when planning serverless applications. Some variation of this truism is ‘serverless is a great product that has real problems with observability.’

The reality is not one of equal offerings with individual strengths and weaknesses. Serverless is superior to previous managed hosting tools, and part of that superiority is the lack of hassle associated with logging, metrics, measurement, and analytics. Observability, though, stands out as one of the few problems that serverless doesn’t solve on its own.

What Exactly is Observability?

Everything from logging to alerts gets labelled as observability, but the shortest definition is: observability lets you see externally how a system is working internally.

Observability should let you see what’s going wrong with your code without deploying new code. Does logging qualify as observability? Possibly! If a Lambda logs each request it receives, and the error is being caused by malformed URLs being passed to that Lambda, logging would certainly resolve the issue! But when the question is ‘how are URLs getting malformed?’, it’s doubtful that logging will provide a clear answer.

In general, it would be difficult to say that aggregated metrics increase observability. If we know that all account updates sent after 9pm take over 200ms, it is hard to imagine how that will tell us what’s wrong with the code.

Preparing for the Past

A very common solution to an outage or other emergency is to deploy a dashboard of metrics to detect the problem in the future. This is an odd thing to do. Unless you can explain why you’re unable to fix this problem, there’s no reason to add detection for this specific error. Further, dashboards often exist to detect the same symptoms, e.g. memory running out on a certain subset of servers. But running out of memory could be caused by many things, and provided we’re not looking at exactly the same problem, saying ‘the server ran out of memory’ is a pretty worthless clue to start with.

Real crises are those that affect your users. And problems that have a real effect on users are neither single interactions nor are they aggregated information. Think about some statements and whether they constitute an acute crisis:

  • Average load times are up 5% for all users. This kind of issue is a critical datum for project planning and management, but ‘make the site go faster for everyone’ is, or should be, a goal for all development whenever you’re not adding features.
  • One transaction took 18 minutes. I bet you one million dollars this is either a maintenance task or delayed job.
  • Thousands of Croatian accounts can’t log in. Now we actually have a trend! We might be seeing a usage pattern (possibly a brute force attack), but there’s a chance that a routing or database layer is acting up in a way that affects one subset of users.
  • All logins with a large number of notifications backed up are incredibly slow, more than 30 seconds. This gives us a nice tight section of code to examine. As long as our code base is functional, it shouldn’t be tough to root out a cause!

How Do We Fix This?

1. The right tools

The tool that could have been created to fix this exact problem is Rookout, which lets you add logging dynamically without re-deploying. While pre-baked logging is unlikely to help you fix a new problem, Rookout lets you add logging to any line of code without a re-deploy (a wrapper grabs new Rookout config at invocation). Right now I’m working on a tutorial where we hunt down Python bugs using Rookout, and it’s a great tool for rooting out errors.

Two services offer event-based logging that moves away from a study of averages and metrics and toward trends.

  • Honeycomb.io isn’t targeted at serverless directly, but offers great sampling tools. Sampling offers performance advantages over logging event details every time.
  • IOpipe is targeted at serverless and is incredibly easy to get deployed on your lambdas. The information gathered favors transactions over invocations.

2. Tag, cross-reference, and group

Overall averages are dangerous: they lead us into broad-reaching diagnoses that don’t point to specific problems. Generalized optimization looks a lot like ‘pre-optimization,’ where you’re rewriting code without knowing what problems you’re trying to fix or how. The best way to ensure that you’re spotting trends is to add as many tags as are practical to what you’re measuring. You’ll also need a way to gather this back together and try to find indicators of root causes. Good initial tag categories:

  • Geography
  • Account Status
  • Connection Type
  • Platform
  • Request Pattern

Note that a lot of analytics tools will measure things like user agent, but you have to be careful to make sure that you don’t gather information that’s too specific. You need to be able to make statements like ‘all Android users are seeing errors’ and not get bogged down in specific build numbers.

3. Real-world transactions are better than any other information

A lot of the cross-reference information mentioned above isn’t meaningful if data is only gathered from one layer. A list of the slowest functions or highest-latency DB requests indicates a possible problem, but only slow or error-prone user transactions indicate problems that a user somewhere will actually care about.

Indicators like testing, method instrumentation, or server health present very tiny fragments of a larger picture. It’s critical to do your best to measure total transaction time, with as many tags and groupings as possible.

4. Annotate your timeline

This final tip has become a standard part of the DevOps playbook, but it bears repeating: once you’re measuring changes in transaction health, be ready to look back at what has changed in your codebase, with timing accurate enough to correlate those changes with performance hits.

This approach can seem crude: weren’t we supposed to be targeting problems with APM-like tools that give us high detail? Sure, but fundamentally the fastest way to find newly introduced problems is to see them shortly after deployment.

Wrapping Up: Won’t This Cost a Lot?

As you expand your logging and event measurement, you should find that the old dashboards of logs and metrics become less and less useful. Dashboards that were going weeks without being looked at will go months, and the initial ‘overhead’ of more event-and-transaction-focused measurement will pay off ten-fold in shorter and less frequent outages where no one knows what’s going on.

Get the Serverless Development Toolkit for Teams

Sign up now for a 30-day free trial. Contact one of our product experts to get started building amazing serverless applications today.
