Self Healing Serverless Applications - Part 3

This blog post is based on a presentation I gave at Glue Conference 2018. The original slides: Self-Healing Serverless Applications – GlueCon 2018. View the rest here. Parts: 1, 2, 3

This is part three of a three-part blog series. In the first post we covered some of the common failure scenarios you may face when building serverless applications. In the second post we introduced the principles behind self-healing application design. In this next post, we’ll apply these principles with solutions to real-world scenarios.

Self-Healing Serverless Applications

We’ve been covering the fundamental principles of self-healing applications and the underlying errors that serverless applications commonly face. Now it’s time to apply these principles to the real-world scenarios that the errors will raise. Let’s step through five common problems together.

We’ll solve:

  • Uncaught Exceptions
  • Upstream Bottlenecks
  • Timeouts
  • Stuck steam processing
  • and, Downstream Bottlenecks

Uncaught Exceptions

Uncaught exceptions are unhandled errors in your application code. While this problem isn’t unique to Lambda, diagnosing it in a serverless application can be a little trickier because your compute instance is ultimately ephemeral and will shut down upon an uncaught exception. What we want to do is detect that an exception is about to occur, and either remediate or collect diagnostic information at runtime before the Lambda instance is gone. After we’ve handled the error, we can simply re-throw it to avoid corrupting the behavior of the application. To do that, we’ll use three of the principles we previously introduced: introducing universal instrumentation, collecting event-centric diagnostics, and giving everyone visibility.

At an abstract level, it’s relatively easy to instrument a function to catch errors (we could simply wrap our code in a try/except loop). While this solution works just fine for a single function, though, it doesn’t easily scale across an entire application or organization. Do you really want to be responsible for ensuring that every single function is individually instrumented?

The better approach is to use “universal instrumentation.” We’ll create a generic handler which will invoke our real target function and always use it as the top level handler for every piece of code. Here’s an example:

def genericHandler(message, context):
	try:
		return targetHandler(message, context)
	except Exception as error:
		# Collect event diagnostics
		# Possibly re-route the event or otherwise remediate the transaction
		raise error

As you can see, we have in fact just run our function through a try/except clause, with the benefit of now being able to invoke any function with one standard piece of instrumentation code. This means that every function will now behave consistently (consistent logs, metrics, etc) across our entire stack.

This instrumentation also allows us to collect event-centric diagnostics. Keep in mind that by default, a Lambda exception will give you a log with a stack trace, but no information on the event which led to this exception. It’s much easier to debug and improve application health with relevant event data. And now that we have centralized logs, events, and metrics, it’s much easier for everyone on the team to have visibility into the health of the entire application.

Note: you’ll want to be careful that you’re not logging any sensitive data when you capture events.

Upstream Bottleneck

An upstream bottleneck occurs when a service calling into Lambda hits a scaling limit, even though Lambda isn’t being throttled. A classic example of this is API Gateway reaching throughput limits and failing to invoke downstream Lambdas on every request.

The key principles to focus on here is: identifying service limits, using self-throttling, and notifying a human.

It’s pretty straightforward to identify service limits, and if you haven’t done this you really should. Know what your throughput limits are for each of the AWS services you’re using and set alarms on throughput metrics before you hit capacity (notify a human!).

The more sophisticated, self-healing approach comes into play when you choose to throttle yourself before you get throttled by AWS. In the case of an API Gateway limit, you (or someone in your organization) may already control the requests coming to this Gateway. If, say for example, you have a front-end application backed by API Gateway and Lambda, you could introduce an exponential backoff logic to kick in whenever you have backend errors. Pay particular attention to HTTP 429 To Many Requests responses, which is (generally) what API Gateway will return when it’s being throttled. I say “generally” because in practice this is actually inconsistent and will sometimes return 5XX error codes as well. In any event, if you are able to control the volume of requests (which may be from another service tier), you can help your application to self-heal and fail gracefully.

Timeouts

Sometimes Lambdas time out, which can be a particularly painful and expensive kind of error, since Lambdas will automatically retry multiple times in many cases, driving up the active compute time. When a timeout occurs, the Lambda will ultimately fail without capturing much in terms of diagnostics. No event data, no stack trace – just a timeout error log. Fortunately, we can handle these errors pretty similarly to uncaught exceptions. We’ll use the principles of self-throttle, universal instrumentation, and considering alternative resource types.

The instrumentation for this is a little more complex, but stick with me:

def genericHandler(message, context):
	# Detect when the Lambda will time out and set a timer for 1 second sooner
	timeout_duration = context.get_remaining_time_in_millis() - 1000

	# Invoke the original handler in a separate thread and set our new stricter timeout limit
	handler_thread = originalHandlerThread(message, context)
	handler_thread.start()
	handler_thread.join(timeout_duration / 1000)

	# If timeout occurs
	if handler_thread.is_alive():
		error = TimeoutError('Function timed out')

		# Collect event diagnostics here

		raise error
	return handler_thread.result

This universal instrumentation is essentially self-throttling by forcing us to conform to a slightly stricter timeout limit. In this way, we’re able to detect an imminent timeout while the Lambda is still alive and can extract meaningful diagnostic data to retroactively diagnose the issue. This instrumentation can, of course, be mixed with our error handling logic.

If this seems a bit complex, you might like using Stackery: we automatically provide instrumentation for all of our customers without requiring *any* code modification. All of these best practices are just built in.

Finally, sometimes we should be considering other resource types. Fargate is another on-demand compute instance which can run longer and with higher resource limits that Lambda. It can still be triggered by events and is a better fit for certain workloads.

Stream Processing Gets “Stuck”

When Lambda is reading off of a kinesis stream, failing invocations can cause the stream to get stuck (more accurately: just that shard). This is because Lambda will continue to retry the failing message until it’s successful and will not get to the workload behind the stuck message until it’s handled. This introduces an opportunity for some of the other self-healing principles: reroute and unblock, automate known solutions, consider alternative resource types.

Ultimately, you’re going to need to remove the stuck message. Initially, you might be doing this manually. That will work if this is a one-off issue, but issues rarely are. The ideal solution here is to automate the process of rerouting failures and unblocking the rest of the workload.

The approach that we use is to build a simple state machine. The logic is very straightforward: is this the first time we’ve seen this message? If so, log it. If not, this is a recurring failure and we need to move it out of the way. You might simply “pass” on the message, if your workload is fairly fault tolerant. If it’s critical, though, you could move it to a dedicated “failed messages” stream for someone to investigate or possibly to route through a separate service.

This is where alternative resources come into play again. Maybe the Lambda is failing because it’s timing out (good thing you introduced universal instrumentation!). Maybe sending your “failed messages” stream to a Fargate instance solves your problem. You might also want to investigate the similar but actually different ways that Kinesis, SQS, and SNS work and make sure you’re choosing the right tool for the job.

Downstream Bottleneck

We talked about upstream bottlenecks where Lambda is failing to be invoked, but you can also hit a case where Lambda is scaling up faster that its dependencies and causing downstream bottlenecks. A classic example of this is Lambda depleting the connection pool for an RDS instance.

You might be surprised to learn that Lambda holds onto its connection, even while cached, unless you explicitly close the connection in your code. So do that. Easy enough. But you’re also going to want to pay attention to some of our self-healing key principles again: identify service limits, automate known solutions, and give everyone visibility.

In theory, Lambda is (nearly, kind of, sort of) infinitely scalable. But the rest of your application (and the other resource tiers) aren’t. Know your service limits: how many connections can your database handle? Do you need to scale your database?

What makes this error class actually tricky, though, is that multiple services may have shared dependencies. You’re looking at a performance bottleneck thinking to yourself, “but I’m not putting that much load on the database…” This is an example of why it’s so important to have shared visibility. If your dependencies are shared, you need to understand not just your own load, but that of all of the other services hammering this resource. You’ll really want a monitoring solution that includes service maps and makes it clear how the various parts of your stack are related. That’s why, even though most of our customers work day-to-day from the Stackery CLI, the UI is still a meaningful part of the product.

The Case for Self-Healing

Before we conclude, I’d like to circle back and talk again about the importance of self-healing applications. Serverless is a powerful technology that outsources a lot of the undifferentiated heavy lifting of infrastructure management, but it requires a thoughtful approach to software development. As we add tools to accelerate the software lifecycle, we need to keep some focus on application health and resiliency. The “Self-Healing” philosophy is the approach that we’ve found which allows us to capture the velocity gains of serverless and unlock the upside of scalability, without sacrificing reliability or SLAs. If you’re taking serverless seriously, you should incorporate these techniques and champion them across your organization so that serverless becomes a mainstay technology in your stack.