Stacks on Stacks

A serverless/FaaS technology blog by Stackery

Prototyping Serverless Applications

Author | Anna Yovandich

When starting a prototype, it’s easy to get lost in the weeds before anything is built. What helps me before writing code is to outline a build plan that clarifies: What is the simplest approach to build an initial concept? What valuable features reach beyond the basics? How can they be progressively implemented? What architectural resources will be required for each stage?

For instance, I’m building a prototype for a browser-based multiplayer game that tracks player connections, turns, and scores in realtime. To initialize the game, a URL will be generated by the “host” player, which will open a socket connection scoped to the URL’s unique path. The URL serves as the entry point for other players to join the game. A socket connection will enable bi-directional messages to be sent and received between client and server when a new player joins, a player takes their turn, or the game ends. I scoped three build strategies, from feature-light to most robust, using Stackery to prototype, simplify, and expedite the heavy lifting.

The first and most feature-light approach can be achieved using only JavaScript on the client and server, with Express (a Node.js application framework) and socket.io (to send and receive messages in realtime). When a player creates a new game, a unique game URL path will be provided to Express as the endpoint to open a scoped socket connection. The game client will send and receive messages as players join, take turns, and score/win/lose. For lightweight data persistence, localStorage can be used to store game and player data so a game can be rejoined and resumed after a broken connection by reloading the URL. At this point, it would be helpful to test the game on a remote domain. To do this, I’ll create a simple stack with an ObjectStore and a CDN, which will provide access to a stackery-stacks.io domain.
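
As a rough sketch of this first approach (assuming Express and socket.io; the route, file layout, and event names here are hypothetical), each game URL can map onto a socket.io room:

  // Minimal Express + socket.io server: one room per game URL.
  const express = require('express');
  const app = express();
  const server = require('http').createServer(app);
  const io = require('socket.io')(server);

  // Any unique game path serves the same client bundle.
  app.get('/game/:gameId', (req, res) => {
    res.sendFile(__dirname + '/public/index.html');
  });

  io.on('connection', (socket) => {
    // The client emits 'join' with the gameId parsed from its URL.
    socket.on('join', ({ gameId, player }) => {
      socket.join(gameId);
      io.to(gameId).emit('playerJoined', player);
    });

    // Relay each turn to every player in the same game.
    socket.on('turn', ({ gameId, move }) => {
      io.to(gameId).emit('turnTaken', move);
    });
  });

  server.listen(3000);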

The next strategy adds data persistence, beyond localStorage’s capabilities, that can store user data (profiles), joinable game URLs (lobby), and game scores (leaderboards). To quickly prototype these features without much overhead (especially for a frontender like me), it’s Stackery to the rescue. It’s quick to spin up a Rest Api that receives user and game data, then sends it to a Function node that pipes it into a Table.
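
Assuming the Table node is backed by DynamoDB, the Function code amounts to a small handler along these lines (the environment variable and field names are illustrative):

  // Receive game data from the Rest Api node and pipe it into the Table.
  const AWS = require('aws-sdk');
  const db = new AWS.DynamoDB.DocumentClient();

  exports.handler = async (event) => {
    const game = JSON.parse(event.body);
    await db.put({
      TableName: process.env.GAMES_TABLE, // hypothetical env var
      Item: { id: game.id, players: game.players, scores: game.scores }
    }).promise();
    return { statusCode: 201, body: JSON.stringify({ id: game.id }) };
  };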

The third and most robust implementation adds another Function node to the pipeline above, opening up a wide range of user notifications. When a Table connects its output to a Function, the Function can detect changes in state from the transaction events it receives from the Table, and can then notify users accordingly in various ways (a sketch follows this list):

  • Email an invite for another player to join a game
  • Notify a player when it’s their turn
  • Email a player when their highest score is defeated on the leaderboard
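
A sketch of that notifier Function, assuming the Table emits DynamoDB-style stream records (the event shape and addresses are hypothetical):

  // Inspect each transaction event from the Table and email the player
  // whose turn it now is, via SES.
  const AWS = require('aws-sdk');
  const ses = new AWS.SES();

  exports.handler = async (event) => {
    for (const record of event.Records) {
      if (record.eventName !== 'MODIFY') continue;
      const game = AWS.DynamoDB.Converter.unmarshall(record.dynamodb.NewImage);
      if (game.currentTurn) {
        await ses.sendEmail({
          Source: 'game@example.com',
          Destination: { ToAddresses: [game.currentTurn.email] },
          Message: {
            Subject: { Data: 'Your turn!' },
            Body: { Text: { Data: `It is your move in game ${game.id}.` } }
          }
        }).promise();
      }
    }
  };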

A solid starting point is the first approach, relying solely on JavaScript and the browser for a simple and usable multiplayer experience. From there, advanced features can be prototyped and implemented without too much architectural sweat. Depending on desired behavior (e.g. varying responses to state changes), the Function code will require a range of effort, but that’s what’s great about Stackery: when architectural complexity becomes trivial, building behavior becomes central.

Implementing the Strangler Pattern with Serverless

Author | Stephanie Baum

By now we’ve all read Martin Fowler’s Strangler Pattern approach to splitting up monolithic applications. It sounds wonderful, but in practice it can be tough to do, particularly when you’re under a time crunch to enable shiny new “modern” features at the same time.

An example I’ve seen several times now is moving from an older, on-prem, or just generally slower traditional architecture to a cloud-based, event-streaming one, which enables things like push notifications to customers, high availability, and sophisticated data analytics. This can seem like a big leap, especially when you’ve got an application that is so old most of your engineering team doesn’t know how it still functions, and so fragile that if you look at it wrong it’ll start returning NullPointerException HTML pages from its API.

Here’s the good news: serverless can help you! Stackery can help you! By creating serverless API layers for your existing domains, you can abstract away the old, exposing painless RESTful API interfaces to your frontends, while simultaneously incorporating event streaming into your architecture. Furthermore, by using Stackery, you can do this while maintaining a high degree of monitoring (with our Health Dashboard), operations management, and security (since we configure the IAM permissions between services, handle environments, and encrypt configuration storage for you).

The Situation

Let’s take a hypothetical customer loyalty application. It has some XML-based Java APIs that map to some pretty old, non-RESTful application logic. The application works as is, if slowly, but the cost of maintaining it is getting too high, it’s fragile and prone to tipping over, and we’ve got a directive to start abstracting some of it away into a new cloud-based architecture on AWS. We also want to justify some of this refactor with new feature enablement, such as push notifications to customers’ phones when they reach a certain loyalty tier or cashback amount, and an event-based data analytics pipeline.

Steps to Enlightenment

  1. Use Domain Driven Design techniques to define a new, cleaner, microservice-like understanding of your application, including the events that you want to surface.
  2. Define your new API contracts based on these new domains. In our example the domains are pretty straightforward: loyalty and customer. Perhaps they were combined into one before, but as we add more loyalty-based functionality we’ve decided to separate them for future-proofing and ease of understanding.
  3. Define how your old APIs map to these new APIs. For example, say we want to enable a new POST /customer endpoint. Previously, the frontend had to send an XML request to service x and another XML request to service y. We will encapsulate and abstract that logic away in our serverless API function, as sketched after this list.
  4. Build your new architecture!
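
Here’s a sketch of the POST /customer handler from step 3 (the legacy hosts, XML shape, and field names are all hypothetical):

  // CustomerAPI function: expose a clean JSON contract and translate it
  // into the two legacy XML calls the frontend used to make itself.
  const axios = require('axios');

  exports.handler = async (event) => {
    const customer = JSON.parse(event.body);
    const xml = `<customer><name>${customer.name}</name></customer>`;

    await axios.post('http://legacy-service-x/customers', xml, {
      headers: { 'Content-Type': 'application/xml' }
    });
    await axios.post('http://legacy-service-y/profiles', xml, {
      headers: { 'Content-Type': 'application/xml' }
    });

    return { statusCode: 201, body: JSON.stringify({ status: 'created' }) };
  };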

In Stackery’s editor panel, I have laid out a hypothetical strangler-pattern-esque architecture to address this situation.

We have two Rest API Nodes corresponding to the two new domains. They front and forward all requests to two Function Nodes, CustomerAPI and LoyaltyAPI, which implement our new API contracts along with any abstracted-away logic that deals with the underlying legacy application. So far we have achieved the essential goal of the strangler pattern: abstracting away some of our old logic and exposing it via new, domain-driven, segmented APIs.

Now for enabling some new functionality. These API nodes, in addition to returning responses to the frontend, emit contextual events to the Events Stream Node, which in turn outputs to the Listener Function Node, which listens for the customer or loyalty events it “cares” about. Those events are forwarded on to the NotificationsSNS Topic Node, enabling event-based SNS notifications. We also have an Analytics Function Node that gets events from the event stream as well as any error events. The Errors node emits any uncaught errors from our new functions to the UncaughtExceptionHandler Function Node for easier error management and greater visibility.
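
A sketch of the Listener Function Node, assuming the Events Stream Node is Kinesis-backed (so record data arrives base64-encoded; the event type and topic ARN variable are hypothetical):

  // Filter stream events and forward the ones we "care" about to SNS.
  const AWS = require('aws-sdk');
  const sns = new AWS.SNS();

  exports.handler = async (event) => {
    for (const record of event.Records) {
      const msg = JSON.parse(Buffer.from(record.kinesis.data, 'base64').toString());
      if (msg.type === 'LOYALTY_TIER_REACHED') {
        await sns.publish({
          TopicArn: process.env.NOTIFICATIONS_TOPIC_ARN,
          Message: JSON.stringify(msg)
        }).promise();
      }
    }
  };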


Not all legacy application migrations will follow the steps I’ve listed here. In fact, one of the biggest struggles with doing something like this is that each strangler pattern must be uniquely tailored based on an in-depth understanding of the existing business logic, risks, and end goals. Oftentimes, the engineering team implementing the pattern will be somewhat unfamiliar with some of the new technology being used. That inherently comes with risks, such as:

  • What if it takes too long to PoC?
  • What if you configure the IAM policies and security groups incorrectly?
  • What if something breaks anywhere in the pipeline? How do we know if it was in the new API layer or the old application?

Migrating to distributed cloud-based services is more complicated than it’s often made out to be. Stackery can help you manage these risks and concerns by making your new applications faster to PoC, managing secure access between services for you, and surfacing errors and metrics. There are a lot of things that can go wrong, and AWS doesn’t make it easy to find the problem. There’s also the task of fine-tuning all these services for cost efficiency and maximum availability. Ask yourself whether you would rather do that by digging through the inception that is AWS’s UI, or with Stackery’s Serverless Health Dashboard.

How an under-provisioned database 10X'd our AWS Lambda costs

Author | Sam Goldstein

This is the story of how running a too-small Postgres RDS instance caused a 10X increase in our daily AWS Lambda costs.

It was Valentine’s Day. I’d spent a good chunk of the week working on an internal business application which I’d built using serverless architecture. This application does several million function invocations per month and stores about 2GB of data in an RDS Postgres database. That week I’d been working on adding additional data sources, which had increased the amount of data stored in Postgres by about 30%.

Shortly after I got into the office on Valentine’s Day, I was alerted that there were problems with this application. Errors on several of the Lambda functions had spiked, and I was seeing function timeouts as well. I started digging into CloudWatch metrics and quickly discovered that my recently added data sources were causing growing pains in my Postgres DB. More specifically, it was running out of memory.

You can see the memory pressure clearly in this graph:

I was able to quickly diagnose that memory pressure within the RDS instance was leading to slow queries, causing function timeouts and errors, which would trigger automatic retries (AWS Lambda automatically retries failed function invocations twice). At some point this hit the DB’s connection limits, causing even more errors, a downward spiral. Fortunately I’d designed the microservices within the application to be fault tolerant and resilient to failures, but at the moment the system was limping along and needed intervention to fully recover.

It was clear I needed to increase DB resources, so I initiated an upgrade to a larger RDS instance through Stackery’s Operations Console. While the upgrade was running I did some more poking around in the AWS console.

This is when things started to get really interesting. I popped into the AWS Cost Explorer and immediately noticed something strange. My Lambda costs for the application had increased 10X, from about 50¢ the previous day to over $5 on Valentine’s Day. What was going on here?

I did some more digging and things started to make sense. Not only had the under-provisioned RDS instance degraded app performance, it had also dramatically increased my average function duration. Functions that ordinarily completed in a few tenths of a second were running until their 30-second timeouts, or longer in some cases. Because they hit the timeout and failed, they’d be retried, which meant even more long function invocations.
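
To put rough, illustrative numbers on it (using AWS’s published Lambda price of $0.0000166667 per GB-second; the memory size and durations here are hypothetical): a 512MB function that normally finishes in 0.3 seconds costs about $0.0000025 per invocation, while the same function running to a 30-second timeout costs about $0.00025, a 100X increase per invocation, and that’s before the automatic retries multiply the damage further.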

You can see the dramatic increase in function runtime clearly in this graph:

Once the RDS instance upgrade had completed, things settled down. Error rates dropped and function duration returned to normal. Fortunately, the additional $4.50 in Lambda costs won’t break the bank either. However, this highlights the tight coupling between cost and performance that exists in serverless architectures. Generally this results in significantly lower hosting costs than traditional architectures, but the serverless world is not without its gotchas. Fortunately I had excellent monitoring, alerting, and observability in place for the performance, health, and cost of my system, which meant I could quickly detect and resolve the problem before it turned into a full-scale outage and a spiking AWS bill.


Building Serverless State Machines

Author | Chase Douglas @txase

State machines are useful mechanisms for decoupling logic flow from computation. They can be used to refactor complex spaghetti code into easy, understandable statements and diagrams. Further, state machines are built upon a mathematical model, which provides the ability to “prove” whether they correctly implement desired functionality.

A little over a year ago, AWS released Step Functions, their service built on top of AWS Lambda to provide state machine logic flow. The best way to describe Step Functions in terms of state machines is to equate the states in a state machine with invocations of Lambda Functions, and state transitions with evaluations of the results of those invocations. Here’s an example AWS Step Function diagram:
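
Underneath a diagram like that, the machine is defined in the Amazon States Language; a minimal two-state machine (the Lambda ARNs below are placeholders) looks roughly like this:

  {
    "StartAt": "A",
    "States": {
      "A": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:A",
        "Next": "B"
      },
      "B": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:B",
        "End": true
      }
    }
  }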


Microsoft has a similar service called Azure Logic Apps. Both Logic Apps and Step Functions provide a powerful abstraction layer on top of bare Functions-as-a-Service (FaaS) services. However, these services have one glaring issue: they are too expensive for high-throughput applications. Instead, these services cater to “business logic” use cases. Some examples:

  • Customer account onboarding
  • Contract approvals
  • Paperwork-to-electronic-record synchronization

These scenarios benefit from auditing and the ability for state machines to exist over a long period of time (potentially up to a year in length). Unfortunately, providing these features drives cost up to the point that the services are not a great fit for high-throughput applications.

That said, it is still very possible to build serverless applications using state machine models. There are two key mechanisms we need to build serverless state machines: concurrency coordination and state transition logic.

Concurrency Coordination

One of the primary features of FaaS is the ability to easily scale the number of simultaneous invocations of functions. One typical use case is performing a MapReduce computation, where a large data set is split into smaller chunks that are processed by a reducer function in parallel. This is easily accomplished using serverless techniques, except for one issue: how can you tell when the last reducer function has completed?

External state is the primary mechanism for providing concurrency coordination. The state can be stored anywhere with atomic operations. Great fits in the serverless ecosystem are Key-Value stores like AWS DynamoDB and Azure CosmosDB. These stores are serverless, cost-effective, and horizontally scalable.

The following are the basic steps to coordinate concurrent function invocations for a MapReduce algorithm (a sketch of the atomic decrement follows the list):

  1. A mapper function splits the data set into chunks
  2. The mapper function counts the total number of chunks and inserts the count into the external state data store (this represents the total number of in-process function invocations)
  3. The mapper function invokes the required number of reducer functions in parallel to process each chunk
  4. When each reducer function invocation completes, it atomically decrements the count of in-process invocations in the external state data store
  5. If the count of in-process invocations is zero, then the current reducer function is the last to complete and the system can continue on to the next state in the state machine
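
A minimal sketch of steps 4 and 5, assuming DynamoDB as the external store (the table name, key, and attribute are illustrative):

  // Each reducer decrements the shared in-process counter atomically and
  // checks whether it was the last invocation to finish.
  const AWS = require('aws-sdk');
  const db = new AWS.DynamoDB.DocumentClient();

  async function markChunkDone(jobId) {
    const result = await db.update({
      TableName: 'JobState',
      Key: { jobId },
      UpdateExpression: 'ADD remaining :dec',
      ExpressionAttributeValues: { ':dec': -1 },
      ReturnValues: 'UPDATED_NEW'
    }).promise();

    if (result.Attributes.remaining === 0) {
      // This invocation was the last reducer to complete; invoke the
      // function for the next state in the machine here.
    }
  }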

There are many other ways concurrency may need to be coordinated, but they all share the property of needing to track state in an atomic, external store.

State Transition Logic

This brings us to state transitions. How can one function invocation make the proper decision on which function(s) should be executed next? At first glance, this doesn’t sound too difficult. We are used to writing this sort of logic directly in our function source code. But embedding the logic of a state machine in the source code of all its functions leads to the spaghetti code mess we are hoping to avoid by using a more formal state machine model.

To inch our way towards a more general approach to state machine transition logic, let’s take a look at a simple use case: custom retry logic. In this example, the process performed for state “A” in the diagram below should be retried up to MAX_RETRIES times.

Upon success, the logic transitions by invoking the function for state “B”. But let’s focus on the failure scenario.

Function A is invoked with a message. That message may look something like:

  {
    userId: 12345
  }

When a failure occurs, function A looks at the input message to see if it has a retries property. It does not, so function A adds the property with a value of 1:

  {
    userId: 12345,
    retries: 1
  }

Then, function A can re-invoke itself with the newly updated message. If it fails again, it increments the retries property again:

  {
    userId: 12345,
    retries: 2
  }

Each time a failure occurs, the number of retries is compared against the MAX_RETRIES value. If the number of retries exceeds MAX_RETRIES, then the state machine can transition into a failure state. This may include pushing the input message onto a dead letter queue or notifying someone of the failure.
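
Putting the failure path together, function A’s handler might look like this sketch (the MAX_RETRIES value is illustrative; AWS_LAMBDA_FUNCTION_NAME is the standard environment variable Lambda provides):

  const AWS = require('aws-sdk');
  const lambda = new AWS.Lambda();
  const MAX_RETRIES = 3;

  async function handleFailure(message) {
    const retries = (message.retries || 0) + 1;
    if (retries > MAX_RETRIES) {
      // Transition to a failure state, e.g. push the message onto a
      // dead letter queue or notify someone.
      return;
    }
    // Re-invoke this same function asynchronously with the updated state.
    await lambda.invoke({
      FunctionName: process.env.AWS_LAMBDA_FUNCTION_NAME,
      InvocationType: 'Event',
      Payload: JSON.stringify(Object.assign({}, message, { retries }))
    }).promise();
  }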

The key here is that the state of the machine can be passed as part of the input into the function. This is a powerful mechanism for managing the logic flow of an application.

A Generic Serverless State Machine?

While the above mechanisms for concurrency coordination and state transitions can be used to build bespoke state machines, it would be great if we could use a standard state machine specification to build general logic flows. While this has yet to be completely achieved, there have been some promising attempts. One example is Ben Kehoe’s Heaviside project. Heaviside attempts to provide both the state machine definition and state to functions via AWS Lambda’s Context mechanism using the same Amazon States Language specification used for Step Functions. While this works in theory, it breaks down in practice because AWS Lambda doesn’t support providing arbitrary context data when invoking functions asynchronously. This is an impediment to implementing state machines where the machine definition and state are passed separately from the input messages. It would be interesting to investigate whether mixing the state machine data with the input message could achieve the desired result.

Even with the current challenges, I’m excited to see the progress being made in pursuit of serverless state machines. I believe it is only a matter of time before someone builds a generic framework on top of existing FaaS solutions and/or cloud providers support high-throughput state machines.

Serverless Health Status Dashboard

Author | Sam Goldstein

Stackery’s Operations Console is the place DevOps teams go to manage their serverless infrastructure and applications. This week we’re announcing the general availability of Serverless Health Dashboards, which surface realtime health status data for deployed serverless applications. As early adopters of microservice and serverless architectures, we’ve experienced firsthand how complexity shifts away from monolithic codebases towards integrating (and reasoning about) many distributed components. That’s why we designed Serverless Health Dashboards to provide visibility into the realtime status of serverless applications, surfacing the key data needed to identify production problems and understand the health of serverless applications.

Once you’ve set up a Stackery account you’ll see a list of all the CloudFormation stacks that you’ve deployed within your AWS account. When you drill into a stack we display a visual representation that shows the stack’s provisioned resources and architectural relationships. I personally love this aspect of the console, since it’s challenging to track the many moving parts of a microservices architecture. Having an always-up-to-date visualization of how all the pieces fit together is incredibly valuable for keeping a team coordinated and up to speed on the systems they manage.

Within the stack visualization we surface key health metrics for each node. This enables you to assess the operational health of the stack at a glance, and quickly drill down on the parts of the stack experiencing errors or other problems. When you need to dig deeper to understand complex interactions between different stack components, you can access detailed logs, historical metrics, and X-Ray transaction traces through the node’s properties panel.

Getting access to Stackery’s Serverless Health Dashboards requires creating a free Stackery account. You’ll immediately be able to see health status for any application that’s been deployed via AWS CloudFormation, Serverless Framework, or Stackery Deployment Pipeline. We hope you’ll try it out and enjoy the increased visibility into the health and status of your serverless infrastructure.

Tracing Serverless Applications with AWS X-Ray

Tracing Serverless Applications with AWS X-Ray

Author | Apurva Jantrania

Debugging serverless applications can be very hard. The traditional tools and methodologies commonly used with monolithic applications often don’t work (easily, at least). While each service is smaller and easier to fully understand and test, a lot of the complexity and issues are now found in the interconnections between the microservices. The event-driven architecture inherent in serverless further increases the complexity of tracing data through the application, and with it the difficulty of debugging.

Much of the DevOps tooling in this area is still in its infancy, but Amazon took a large step forward with AWS X-Ray. X-Ray helps tie together the various pieces of your serverless application in a way that makes it possible to understand the relationships between the different services and trace the flow of data and failures. One of the key features is X-Ray’s service map, a visual representation of the AWS services in your application and the data flow between them; this ability to visually see your architecture is something we’ve always valued at Stackery and is a key reason we let you design your application architecture visually.

As a quick side note, it is interesting to compare how Stackery visualizes a stack with the AWS X-Ray visualization:

Stackery Representation

AWS X-Ray Representation

When a request hits a service that provides active X-Ray integration (and one that you’ve set up to use X-Ray), it will add a unique tracing header to the request which will also be added to any downstream requests that are generated. Currently, Amazon supports only AWS Lambda, API Gateway, EC2, Elastic Load Balancers and Elastic Beanstalk for active integration. Most other services support passive integration, which is to say that they’ll continue adding to the trace if the request already has the tracing header set.

With AWS X-Ray enabled throughout your application, you can click on nodes in the Service Map to see details such as the response distribution and dive into trace data. Here are some traces for a few AWS services - CloudFormation, DynamoDB, Lambda, and STS:

Response Distributions

This view is useful for getting a high-level picture of the health and status of your services. Diving in further allows you to view specific traces, which is critical for understanding which services are slowing your application down and for root-causing failures.


One limitation to keep in mind is that the X-Ray service map only lets you view data in chunks of six hours or less, though it keeps a 30-day rolling history.

Enabling X-Ray can be tedious. For instance, to enable X-Ray on AWS Lambda, you need to do three things for each Lambda function:

  1. Enable active tracing
  2. Update your code to use the AWS X-Ray enabled SDK rather than the standard AWS SDK
    • Available for Node.js, Java, Go, Python, .NET, and Ruby
    • Using the AWS X-Ray enabled SDK lets Lambda decide how often and when to sample/upload requests (see the Node.js sketch after this list)
  3. Add the needed IAM permissions to upload the trace segments
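
For Node.js, step 2 is a small swap at the top of each function (the table name here is illustrative):

  // Wrap the AWS SDK so every downstream AWS call is traced as a
  // subsegment of the current Lambda invocation.
  const AWSXRay = require('aws-xray-sdk-core');
  const AWS = AWSXRay.captureAWS(require('aws-sdk'));

  const db = new AWS.DynamoDB.DocumentClient();

  exports.handler = async (event) => {
    // This call now shows up in the trace and the service map.
    await db.get({ TableName: 'Games', Key: { id: event.id } }).promise();
    return { statusCode: 200 };
  };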

Unfortunately, needing to do this for every Lambda function, old and new, makes the process ripe for human error.

Details on how to enable active tracing on other services can be found here.

At Stackery, we think enabling data tracing is another critical component of Serverless Ops, just like handling errors and Lambda timeouts. So any stack deployed with Stackery has AWS X-Ray automatically enabled: we make sure that any AWS service used has the correct settings for active AWS X-Ray tracing where supported, and for Lambda functions we take care of all the steps, so you don’t need to worry about permissions or updating your code to use the right SDK.

