When we design intricate systems, beginning with the happy path is a helpful simplification, but it is a big mistake to design only for the happy path. This is true of any computer program, and our problems multiply when we interconnect things in the cloud. Here’s a partial list of things that go wrong all the time:
- Something you want to reach over the network is unreachable.
- Something you want to reach over the network is unusually slow.
- Demand for your service suddenly overwhelms its capacity.
- Users create data payloads orders of magnitude larger than you expected.
- Your API requests are being throttled by your platform.
Among other implications, the cloud era means that operational concerns have become development concerns. Guarding against the unhappy path will make the difference between a reliable system and a smoking wreck. Any part of your system that is without limits is a part that can bring down your system. This applies to everything from inputs you accept to the amount of time you wait for a response from a downstream system. Enforce cardinalities. Do you expect your customers to create thousands of entries in your content management system? Then don’t make it possible for them to create billions.

Another place to enforce limits is at the front door to your service; even with automated horizontal scaling, you must place limits on the number of requests you will accept. When your service is running at peak capacity, it is far better to reject new work than to accept it and fall over.
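As a rough illustration, here is a minimal load-shedding sketch in Python; the handler, the cap of 100 in-flight requests, and the 503-style response are assumptions for illustration, not a prescription:

```python
# A minimal load-shedding sketch: cap concurrent work and reject the overflow.
# MAX_IN_FLIGHT and the 503-style response are illustrative assumptions.
import threading

MAX_IN_FLIGHT = 100  # hard limit on requests this instance will work on at once
_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def handle_with_limit(request, handler):
    # Refuse immediately rather than piling up unbounded work at peak load.
    if not _slots.acquire(blocking=False):
        return {"status": 503, "body": "at capacity, please retry later"}
    try:
        return handler(request)
    finally:
        _slots.release()
```

The important property is that the rejection is cheap and immediate, so the requests you do accept still get served.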
The Architecture Diagram Is Also a Map of Failure Modes
When you look at an architecture diagram with your reliability engineering hat on, you’ll begin to see that every box is a subsystem that can fail, and every line is a communication path that can flake out on you. Enumerate these failure modes in your design documentation. For every dependency, ask the following questions (a sketch covering several of them follows the list):
- Do you have a good, brisk timeout?
- What is the retry policy?
- Is there a circuit breaker?
- Is there a reasonable fallback value we can use in case of failure?
- Can we defer the work and try again later?
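To make the questions concrete, here is a Python sketch of a brisk timeout, a bounded retry policy, and a fallback value. The endpoint, retry counts, and empty-list fallback are assumptions; a production version would also want jittered backoff and a circuit breaker:

```python
# A brisk timeout, a bounded retry policy, and a fallback value for one dependency.
# The URL and the tuning numbers are illustrative assumptions.
import time
import requests

FALLBACK_RECOMMENDATIONS = []  # a safe, degraded answer if the dependency is down

def fetch_recommendations(user_id, retries=2, timeout_s=0.5):
    url = f"https://recommendations.internal/users/{user_id}"  # hypothetical endpoint
    for attempt in range(retries + 1):
        try:
            resp = requests.get(url, timeout=timeout_s)  # never wait without a limit
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt < retries:
                time.sleep(0.1 * (2 ** attempt))  # simple exponential backoff
    return FALLBACK_RECOMMENDATIONS  # degrade gracefully instead of failing the page
```

A failure of this one dependency now degrades a single feature instead of taking down the whole response.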
Asynchronous Communication Is a Friend of Cloud Reliability
Since everything you want to communicate with over a network can fail, synchronous requests are the most brittle of all. Whenever you can tolerate a little more latency, put requests on a queue so that the consumer of that queue can do the work when it’s ready. In case of trouble, the consumer can handle the highest-priority messages first; in case of an outage, the requests can be deferred until the system is healthy again.
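As a sketch of the pattern, here is an in-process priority queue standing in for a managed broker (SQS, Pub/Sub, RabbitMQ, and so on); the priorities and tasks are invented for illustration:

```python
# Defer work through a priority queue; the consumer drains it when it's ready.
# This in-process queue is a stand-in for a real message broker.
import itertools
import queue
import threading

work = queue.PriorityQueue()
_seq = itertools.count()  # tiebreaker so equal priorities stay orderable

def submit(priority, task):
    # The caller returns immediately; the work happens when the consumer is ready.
    work.put((priority, next(_seq), task))

def consumer():
    while True:
        priority, _, task = work.get()  # lowest number = highest priority
        try:
            task()
        finally:
            work.task_done()

threading.Thread(target=consumer, daemon=True).start()
submit(0, lambda: print("urgent: reconcile the payment"))
submit(5, lambda: print("routine: rebuild the nightly report"))
work.join()  # shown for the demo; a real producer would not wait on the queue
```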
Exercise Adverse Conditions
You can make educated guesses about how your service will perform under duress, but it’s far better to put it through its paces in a series of controlled experiments. For any alerts that are configured to page your team, try to create the conditions that will trigger the alert. Practice your disaster recovery plans in a controlled environment before you need them in production.
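One low-tech way to create those conditions is to inject faults deliberately. The sketch below wraps a dependency call with artificial latency and failures; the failure rate and the added delay are illustrative assumptions, and it belongs in a test environment, not on the production path:

```python
# A tiny fault-injection wrapper for controlled experiments.
# Wrap a dependency call with it to create the slow or failing conditions
# your alerts and fallbacks are supposed to handle.
import random
import time

def with_chaos(call, failure_rate=0.2, extra_latency_s=1.5):
    def wrapped(*args, **kwargs):
        time.sleep(random.uniform(0, extra_latency_s))  # simulate a slow dependency
        if random.random() < failure_rate:
            raise ConnectionError("injected failure for the experiment")
        return call(*args, **kwargs)
    return wrapped
```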