
Uptime is prime time

An old cliché - engineering is about tradeoffs. Cost vs. quality. Reliability vs. the cost of a failure. Maintainability vs. speed, and so on. How do you know what to optimize for? Which is more important - getting the product to market a month early, or spending that time hardening the key user flows with more automated testing?

There's no general answer to any of these questions since they are all context dependent. Shipping a major update to a key library that thousands of developers depend on? You might want to spend more time making sure it's up to snuff before publishing the new release. Conversely, if you're leading the team responsible for getting your shiny new blue ocean startup's MVP into the hands of early adopters, speed is key to snatching up a big share of this tasty new market - ship first, tweak later.

 
Uptime, uptime über alles!
— Hoffmann von Fallersleben, probably

Thou shalt love thy uptime as thyself.
— Yahweh the Jealous

Keep thy uptime high shall be the whole of the law.
— Aleister Crowley
 

I'm a product engineer. I work on a 12-year-old codebase that powers a SaaS product that tens of thousands of customers spend a large portion of their day using, and rely upon to drive and maintain their relationships with their own customers. We have all sorts of metrics to track how we're doing as an engineering team. Reigning supreme above all others is uptime.

 

What's my uptime?

In the early days of the web, uptime was pretty straightforward. Everyone was publishing monolithic apps: either the URL resolved and presented a working (for a given value of 'working') application that let you log in and do your thing, or it didn't - you got a 500 page, a browser DNS error, or a spinner that sat there indefinitely and never did anything.

These days, large and complex apps are typically delivered as a set of networked but independent services, each performing its own task and communicating with the others. We began with a monolithic application from which we extracted several services, each with its own persistent storage and responsibilities. From a systems engineering perspective, this approach arms us with a powerful new tool - circuit breakers.

Circuit Breakers

The boundaries between services, where one part of the application is depending on data or functionality provided by another, can be hardened against cascading failure using a circuit breaker.

This is a piece of logic invoked when a network error, internal server error, timeout, or similar failure prevents a service from responding in a timely manner. Rather than letting the failure propagate, it returns a default value to the client service and allows execution of the user's web request to continue, albeit with a potentially degraded or eventually-consistent state.

For example, one of the core services that powers PipelineDeals is responsible for billing and subscriptions. When the main app needs to know which features a particular user or account has access to, it must make an internal API call to this billing service. What if the database server this billing service talks to has a hardware failure? 

Well, since we have a circuit breaker installed in the code that handles this communication, the main app will detect that requests to the billing service are currently timing out or failing. It will then substitute a standard set of features for the realtime set that the billing service would normally return, and continue serving the user's request.
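
Here's a minimal sketch of that pattern in Python. It's illustrative only - the names (fetch_features_from_billing, DEFAULT_FEATURES) and the thresholds are placeholders, not PipelineDeals internals:

```python
# A minimal circuit-breaker sketch. Function names, thresholds, and the
# DEFAULT_FEATURES fallback are illustrative placeholders, not real code.
import time

DEFAULT_FEATURES = {"deals": True, "custom_fields": False}  # safe fallback set
FAILURE_THRESHOLD = 5   # trip the breaker after this many consecutive failures
RETRY_INTERVAL = 30     # seconds to wait before letting a probe request through


class CircuitBreaker:
    def __init__(self):
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, fn, fallback):
        # While the circuit is open, short-circuit straight to the fallback
        # until the retry interval has elapsed.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < RETRY_INTERVAL:
                return fallback
            self.opened_at = None  # half-open: allow one probe request through

        try:
            result = fn()
            self.failures = 0  # a success resets the failure count
            return result
        except (TimeoutError, ConnectionError):
            self.failures += 1
            if self.failures >= FAILURE_THRESHOLD:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback


def fetch_features_from_billing(account_id):
    # Stand-in for the real internal API call to the billing service.
    raise TimeoutError("billing service unreachable")


billing_breaker = CircuitBreaker()


def features_for(account_id):
    # Ask billing for the account's realtime feature set, falling back to the
    # standard set when the breaker is open or the call fails.
    return billing_breaker.call(
        fn=lambda: fetch_features_from_billing(account_id),
        fallback=DEFAULT_FEATURES,
    )
```

Once the breaker trips, calls like features_for(42) return the fallback immediately instead of blocking every web request on a dead dependency.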

Without the circuit breaker, a failure in the billing service would have cascaded into failures in all services dependent on it. In this naive case, the app's uptime is only as good as the uptime of the least reliable service in the request's critical path through the system. 
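
To put rough numbers on that (an illustrative calculation, not our real figures): if a request's critical path runs through five services and each is independently available 99.9% of the time, the composite availability is the product of the individual figures, 0.999^5 ≈ 0.995. The failure probabilities compound, so that chain delivers roughly 44 hours of downtime a year, compared with under 9 hours for any single link.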

 

Early Warning System

Service-based apps have some failure modes that you don't see in monolithic apps. Chief among these is the cascading failure. One component will start receiving more load than it can handle, causing requests to it from other components to fail or hang around waiting for a response.

These delays and failed requests pile up in the client components, causing retries and degraded service for their own clients, and so on. Before you know it, the whole web of interdependent services seizes up, requests from the web app and the customer API start failing or being refused, and you have a downtime event on your hands.

There's an opportunity here as well. Well-designed services have well-defined boundaries, and it's at these boundaries that we can look for trouble (any good monitoring service will make this easy to set up) and alert early. Data points I've found helpful to monitor include queued job counts, error rates, and 95th-percentile API response times.
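
To make that concrete, here's a toy version of such a boundary check. The metric names, thresholds, and the fetch/alert helpers are made-up stand-ins for whatever your monitoring service already provides:

```python
# Toy early-warning check at a service boundary. Metric names, thresholds,
# and the fetch/alert helpers are placeholders, not a real monitoring setup.

THRESHOLDS = {
    "billing.queued_jobs": 5_000,       # jobs waiting in the queue
    "billing.error_rate": 0.02,         # fraction of requests returning 5xx
    "billing.p95_response_ms": 800,     # 95th-percentile response time
}


def fetch_metric(name):
    # Stand-in for a query against your metrics store; fake samples for the demo.
    samples = {
        "billing.queued_jobs": 7_200,
        "billing.error_rate": 0.005,
        "billing.p95_response_ms": 450,
    }
    return samples.get(name)


def alert(message):
    # Stand-in for paging or posting to the on-call channel.
    print(f"ALERT: {message}")


def check_service_boundary():
    for metric, limit in THRESHOLDS.items():
        value = fetch_metric(metric)
        if value is not None and value > limit:
            alert(f"{metric} is {value}, above the {limit} threshold")


if __name__ == "__main__":
    check_service_boundary()  # flags the queue backlog before it becomes downtime
```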

Responding to these alerts early enough will allow a team to avoid a full-blown downtime by, at worst, shutting down services not critical to the core product during an emergency. Happy days!

Defence in Depth

We can also look at the monitoring and alerting infrastructure in a top-down way. Failures can occur not only across components but also at different layers within a single component. Imagine you suddenly start seeing elevated 5xx responses to user-facing web transactions. What's the cause? Depending on your infrastructure, this could point to a problem with DNS, the load balancers, the application hosts, the web server, the routing layer, the database, in-memory object storage, and so on.

In an emergency situation, the less probing in the dark we have to do, the better. Setting up monitoring at every level of the stack is an excellent way to cut down this search space. For example, if our ping tests report that everything is fine but application error monitoring is showing elevated error responses, it's reasonable to conclude that everything upstream of the app servers is functioning correctly and that the likely culprit is a recent code change.
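
In the same spirit, a layered smoke test can answer "which layer is broken?" before anyone starts digging. This is a sketch only - the hostname and the /health endpoint are placeholders, not our real infrastructure, and the probes are deliberately simplistic:

```python
# Layered smoke test: probe each layer of the stack from the outside in,
# so an incident starts with "which layer failed?" already narrowed down.
# The hostname and health endpoint below are placeholders.
import socket
import urllib.request

HOST = "app.example.com"


def dns_ok():
    # Does the hostname resolve at all?
    try:
        socket.gethostbyname(HOST)
        return True
    except socket.gaierror:
        return False


def load_balancer_ok():
    # Can we open a TCP connection to the load balancer?
    try:
        with socket.create_connection((HOST, 443), timeout=5):
            return True
    except OSError:
        return False


def app_ok():
    # Does the application answer its (hypothetical) health endpoint?
    try:
        with urllib.request.urlopen(f"https://{HOST}/health", timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False


if __name__ == "__main__":
    checks = [("DNS", dns_ok), ("load balancer", load_balancer_ok), ("application", app_ok)]
    for layer, check in checks:
        print(f"{layer}: {'ok' if check() else 'FAILING'}")
```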

Cost/Benefit

So, what are all these handy dashboards and fancy infrastructure going to cost you? It's a sliding scale. There are engineering teams working at all points on the spectrum between a single monolithic app and a constellation of microservices.

Having said that, there is no question that delivering your app as a set of networked services carries additional fixed and marginal costs. Most easily measurable are the direct costs of extra servers and hosting infrastructure. We usually want redundancy at the level of the hardware each service runs on, so most production setups will deploy one service per server instance. The extra database hardware needed to isolate storage per service is another significant cost, as is the data transfer between service APIs.

The cost incurred for your engineering team's time is much harder to quantify. Even assuming everyone on the team has the DevOps chops to isolate and debug production issues, the resolution of those issues will necessarily be more complex as they involve more moving parts than under the original monolithic approach.

Do I need it?

Distributed application architectures are not a silver bullet for all that ails your application. In fact, shoehorned into a deployment where they don't make sense, they will multiply your problems. Best practice is to begin with a simple, monolithic deployment model that's quick to iterate on and develop new features for.

As your app acquires paying customers, you're building a business case for investing the time and money required to improve the resiliency of the infrastructure the business relies on. Even at this point, that may not mean a fully distributed service model as we've discussed above - it's an art as much as it is a science.
