Distributed systems theory for the distributed systems engineer

Updated June 2018 with content on atomic broadcast, gossip, chain replication and more

Gwen Shapira, who at the time was an engineer at Cloudera and now is spreading the Kafka gospel, asked a question on Twitter that got me thinking.

My response of old might have been “well, here’s the FLP paper, and here’s the Paxos paper, and here’s the Byzantine generals paper…”, and I’d have prescribed a laundry list of primary source material which would have taken at least six months to get through if you rushed. But I’ve come to thinking that recommending a ton of theoretical papers is often precisely the wrong way to go about learning distributed systems theory (unless you are in a PhD program). Papers are usually deep, usually complex, and require both serious study, and usually significant experience to glean their important contributions and to place them in context. What good is requiring that level of expertise of engineers?

And yet, unfortunately, there’s a paucity of good ‘bridge’ material that summarises, distills and contextualises the important results and ideas in distributed systems theory; particularly material that does so without condescending. Considering that gap lead me to another interesting question:

What distributed systems theory should a distributed systems engineer know?

A little theory is, in this case, not such a dangerous thing. So I tried to come up with a list of what I consider the basic concepts that are applicable to my every-day job as a distributed systems engineer. Let me know what you think I missed!

First steps

These four readings do a pretty good job of explaining what about building distributed systems is challenging. Collectively they outline a set of abstract but technical difficulties that the distributed systems engineer has to overcome, and set the stage for the more detailed investigation in later sections

Distributed Systems for Fun and Profit is a short book which tries to cover some of the basic issues in distributed systems including the role of time and different strategies for replication.

Notes on distributed systems for young bloods - not theory, but a good practical counterbalance to keep the rest of your reading grounded.

A Note on Distributed Systems - a classic paper on why you can’t just pretend all remote interactions are like local objects.

The fallacies of distributed computing - 8 fallacies of distributed computing that set the stage for the kinds of things system designers forget.

You should know about safety and liveness properties:

  • safety properties say that nothing bad will ever happen. For example, the property of never returning an inconsistent value is a safety property, as is never electing two leaders at the same time.

  • liveness properties say that something good will eventually happen. For example, saying that a system will eventually return a result to every API call is a liveness property, as is guaranteeing that a write to disk always eventually completes.

Failure and Time

Many difficulties that the distributed systems engineer faces can be blamed on two underlying causes:

  1. Processes may fail

  2. There is no good way to tell that they have done so

There is a very deep relationship between what, if anything, processes share about their knowledge of time, what failure scenarios are possible to detect, and what algorithms and primitives may be correctly implemented. Most of the time, we assume that two different nodes have absolutely no shared knowledge of what time it is, or how quickly time passes.

You should know:

The basic tension of fault tolerance

A system that tolerates some faults without degrading must be able to act as though those faults had not occurred. This means usually that parts of the system must do work redundantly, but doing more work than is absolutely necessary typically carries a cost both in performance and resource consumption. This is the basic tension of adding fault tolerance to a system.

You should know:

Basic primitives

There are few agreed-upon basic building blocks in distributed systems, but more are beginning to emerge. You should know what the following problems are, and where to find a solution for them:

Fundamental Results

Some facts just need to be internalised. There are more than this, naturally, but here’s a flavour:

  • You can’t implement consistent storage and respond to all requests if you might drop messages between processes. This is the CAP theorem.

  • Consensus is impossible to implement in such a way that it both a) is always correct and b) always terminates if even one machine might fail in an asynchronous system with crash-* stop failures (the FLP result). The first slides - before the proof gets going - of my Papers We Love SF talk do a reasonable job of explaining the result, I hope. Suggestion: there’s no real need to understand the proof.

  • Consensus is impossible to solve in fewer than 2 rounds of messages in general.

  • Atomic broadcast is exactly as hard as consensus - in a precise sense, if you solve atomic broadcast, you solve consensus, and vice versa. Chandra and Toueg prove this, but you just need to know that it’s true.

Real systems

The most important exercise to repeat is to read descriptions of new, real systems, and to critique their design decisions. Do this over and over again. Some suggestions:


Not Google:


If you tame all the concepts and techniques on this list, I’d like to talk to you about engineering positions working with the menagerie of distributed systems we curate at Cloudera.