Ideas Behind Site Reliability Engineering

by kingcoyote

As the software industry grows and matures, the systems that run all around us grow in size and complexity.

Users' demand for reliability combined with this growth has produced a new specialization: the Site Reliability Engineer (SRE).

While the role relies on a mixture of sysadmin and software development skills and overlaps with infrastructure engineering, it is made unique by the mindset that it brings to bear on the problem.  I want to share what I know about working like this because it's a relatively unknown specialty and because it soothes my heart to know that humanity isn't one error away from turning the world into a Mad Max-like desert.

At its core is the belief that as systems grow, they become less legible.  No longer can we look at a Unified Modeling Language (UML) diagram and predict all of a system's behaviors.  When we had to take care of ten or twenty hosts or a simple web application, it was possible for a single person, usually the senior engineer, to understand the system and keep it in a stable state.  But when the number of hosts grows and the application becomes distributed and has dozens or hundreds of engineers changing it every day, it becomes a murky pool of statistical probability where something somewhere is always failing.

Disks are dying, network links are going down, and processes are exhausting the available resources.  Hiring more people doesn't work for two reasons: it's really expensive and it increases the communication overhead (Brooks's law).  How do SREs attack this problem?  By learning from the broader engineering community how to deal with complex systems like aircraft.

The foundation of this approach is observability.  The system has to continuously report its state so that the engineering team knows whether it's working, broken, or becoming broken.  This pushes the existing practices into overdrive because we want to collect and store all the metrics we can get our hands on.  Some examples here are host-level metrics like CPU, disk, network, and memory utilization; service-level metrics like rate, type, and latency of incoming and outgoing requests; and every log line the service produces.  Not only should these all be gathered and stored, but they should be easily accessible and searchable by everyone on the team.
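To make that concrete, here's a minimal sketch of my own (not anything a particular team prescribes) of a service exposing request counts and latency in Prometheus format using the Python prometheus_client library.  The metric names, the label, and the port are invented for the example.

# Minimal sketch: expose service-level metrics (rate, outcome, latency).
# Metric names and the port are arbitrary examples.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "myapp_requests_total",           # hypothetical metric name
    "Incoming requests by outcome",
    ["outcome"],
)
LATENCY = Histogram(
    "myapp_request_latency_seconds",  # hypothetical metric name
    "Request handling latency",
)

def handle_request():
    start = time.monotonic()
    try:
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
        REQUESTS.labels(outcome="ok").inc()
    except Exception:
        REQUESTS.labels(outcome="error").inc()
        raise
    finally:
        LATENCY.observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(8000)  # metrics get scraped from :8000/metrics
    while True:
        handle_request()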

Having these, we can, over time, single out those that provide us the strongest signal about how well things are working.  We will be able to go back and study the state of the system closely and investigate all the dimensions in which it deteriorated when things were broken.  We will also be able to build some automation on top of them to fix certain recurring problems automatically.  Any system will experience a steady flow of problems, like disks dying or hosts getting into a weird configuration state, but time is precious for us, so we want the system to react to these events on its own.  We want to take as many humans out of the equation as possible.
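As a rough illustration of that kind of automation, here's a sketch of a watchdog that handles one recurring problem, a scratch disk filling up, by purging old files before a human has to get involved.  The path, threshold, and age limit are made-up values for the example.

# Sketch of self-healing for one recurring problem: a disk filling up
# with old temporary files.  A real remediation would be more careful.
import os
import shutil
import time

DATA_PATH = "/var/tmp/myapp"   # hypothetical scratch directory
THRESHOLD = 90                 # percent full before we act
MAX_AGE = 7 * 24 * 3600        # purge files older than a week

def disk_percent_used(path):
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total

def purge_old_files(path, max_age):
    now = time.time()
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isfile(full) and now - os.path.getmtime(full) > max_age:
            os.remove(full)

if __name__ == "__main__":
    while True:
        if disk_percent_used(DATA_PATH) > THRESHOLD:
            purge_old_files(DATA_PATH, MAX_AGE)  # react without waking a human
        time.sleep(300)                          # check every five minutes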

Knowing how the system is behaving every second, we can automate away a good chunk of senseless toil that happens whenever we change it.  The biggest contributor here is the stream of new features and bug fixes.  Having service-level metrics means that once a change has been reviewed by a human, it can be deployed automatically because we trust the system to detect a problem, revert the change to the last known good state, and notify someone.  This is a great thing to have for a couple of reasons.  Our users will appreciate that even if something is broken, it's likely to get fixed within minutes or even seconds.  The people making these changes will appreciate it because they will be getting quick feedback about their code while it's still fresh in their minds.
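Here's a sketch of what that deploy-and-watch step could look like.  The deploy(), error_rate(), rollback(), and notify() callables are placeholders for whatever release tooling and monitoring API a real setup plugs in, and the error budget and watch window are arbitrary.

# Sketch of a deploy step that watches service metrics and reverts
# on its own when they degrade.
import time

ERROR_BUDGET = 0.01     # acceptable fraction of failing requests
WATCH_SECONDS = 300     # how long to watch the new version

def release(new_version, last_good, deploy, error_rate, rollback, notify):
    """Push a change, watch the metrics, and roll back if they go bad."""
    deploy(new_version)
    deadline = time.time() + WATCH_SECONDS
    while time.time() < deadline:
        if error_rate() > ERROR_BUDGET:
            rollback(last_good)
            notify(new_version + " rolled back: error rate over budget")
            return False
        time.sleep(10)
    return True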

Finally, exercising this flow gives us confidence that we can make changes quickly, which is pretty handy when we need to get a fix out ASAP.

The second source of toil is usually managing the configuration of all the hosts.  Instead of crafting artisanal coconut milk configs by hand for each of them, we can roll out a uniform, self-enforcing configuration everywhere.  Whenever a host deviates from this golden standard, it can be automatically re-imaged and reconfigured without a single person taking action.  This view is summed up as "treat your hosts like cattle, not pets."  This setup leaves us with more time to focus on anomalies that need a human to investigate.  It also speeds up our reaction time considerably.  Imagine that the primary data center goes down.  Now imagine how much stress, sweat, and coffee all this automation would save us if all we had to do was point it at a set of blank machines in a new location and wait an hour for everything to go back to normal.
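A minimal sketch of that drift check might look like the following.  The manifest format and the reimage() hook are assumptions made for the example; real configuration management tools do the same job with a lot more ceremony.

# Sketch of drift detection: compare a host's config files against a
# "golden" manifest of checksums and flag it for re-imaging on a mismatch.
import hashlib
import json

def file_sha256(path):
    try:
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()
    except FileNotFoundError:
        return None  # a missing file counts as drift

def host_has_drifted(manifest_path):
    # manifest: {"/etc/myapp.conf": "<sha256>", ...}  (hypothetical format)
    with open(manifest_path) as f:
        golden = json.load(f)
    return any(file_sha256(path) != digest for path, digest in golden.items())

def enforce(manifest_path, reimage):
    if host_has_drifted(manifest_path):
        reimage()  # hand the host back to the provisioning system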

In my experience, the most important piece in all of this is how the engineering team handles failures.  It's organizational, not technical, in nature.  First, all production incidents should be investigated and discussed at a post-mortem meeting with everyone affected present.  The goal isn't to dish out blame and punishment; it's to build a shared understanding of how the system entered a bad state.  Trust is essential in order to bring up all the little details and go through as many follow-ups as feasible to prevent the problem from happening again.  Without trust, people will hesitate to report incidents or their details for fear of punishment.  Think of it as a group learning process.  It's important to note that some incidents may be the result of how the work is organized, so managers should be a part of this, too.

Second, there's the on-call process, where a rotating member of the team is notified whenever something is broken and has to fix it.  It's familiar to many, but to make it truly work, all technical team members should be part of the rotation.  This puts equal pressure on everyone to keep reliability in mind, as no one likes to be woken up at 2 am.  It directs everyone on the team toward the same set of goals.  The opposite approach is why ops and security teams so often failed in the past - the "feature team" doesn't understand that security or reliability is part of the product and introduces bug after bug, vulnerability after vulnerability, while the ops and security teams take up drinking because it's the only way to handle a dysfunctional relationship like that.

None of these practices are new; they just needed to be discovered and put into practice by the right people in the right place.  I imagine we, as both users and builders of systems, will reap more of the benefits of these practices as they gain popularity.

For those interested in learning more, here are some reading materials:
