Barbara Liskov has just been announced as the recipient of the 2008 Turing Award, which is one of the most important prizes in computer science, and can be thought of as our field’s equivalent to the various Nobel Prizes. Professor Liskov is a worthy recipient of the award, even if judged alone by her citation which lists a number of the important contributions she has made to operating systems, programming languages and distributed systems.
Professor Liskov seems to be particularly well known for the Liskov substitution principle which says that some property of a supertype ought to hold of its subtypes. I’m not in any position to speak as to the importance of this contribution. However, her more recent work has been regarding the tolerance of Byzantine failures in distributed systems, which is much more close to my heart.
The only work of Liskov’s that I am really familiar with is the late 90s work on Practical Byzantine Fault Tolerance with Miguel Castro and is first published in this OSDI ‘99 paper. I’m not going to do a full review, but the topic sits so nicely with my recent focus on consensus protocols that it makes sense to briefly discuss its importance.
Remember that in the articles on Paxos, we were able to deal with two kinds of failure: fail-stop and fail-recover. Although these two failure modes capture a lot of the possible failures in a distributed system there is a yet more general failure mode which captures all possible kinds of errors. Byzantine failures occur when a participant in a protocol deviates arbitrarily from the specified protocol, and thus cannot be relied upon to do anything useful. Worse than that, Byzantine nodes may often look like they are non-faulty up until a crucial point where they might, for example, lie about the result of a transaction, or send two different responses to identical requests, or simply fail to respond. Although Byzantine failures are often equated with malicious attacks, where an attacker has gained control of a node and is controlling its behaviour, in fact many Byzantine failures occur due to programming errors in the implementation of a protocol (Amazon were susceptible to a Byzantine failure recently when a bit-flip caused a gossipped message to be incorrectly propogated, taking down S3 for hours).
As we’ve talked about elsewhere on this blog, consensus is a hugely important primitive for distributed algorithms. When Byzantine failures come into the picture consensus takes even more of the centre stage. Every action that a replica is instructed to perform has to be validated by every fault-free replica because there’s a strong possibility that the coordinator is itself faulty and has told different nodes to do different things. So nodes must run a consensus protocol to agree on what they were asked to do, let alone the result of doing it.
Of course, standard consensus algorithms won’t do as they themselves are not Byzantine fault tolerant. So building Byzantine fault tolerant systems is often reduced to building a Byzantine fault tolerant consensus primitive, and using that to build replicated state machines in the usual style.
When we talk about BFT consensus we often distinguish a co-ordinator node responsible for initiating the consensus protocol and proposing the initial value, which is usually received from an external client. The other nodes are replicas, as normal. There’s nothing particularly special about the co-ordinator, as it still functions like a replica for the rest of the protocol. However, we can show that if the co-ordinator is definitely not faulty then we can execute a more relaxed protocol; more on this later.
BFT consensus has similar requirements to standard consensus: validity, which is that only a proposed value can be decided upon, termination and agreement. Agreement says that all correct nodes must agree upon the same value - this is the same as for standard consensus, but the definition of faulty here is widened to include Byzantine faults. Validity also requires a small relaxation - if the co-ordinator is faulty then we allow correct nodes to agree upon some default ‘no-op’ value. This is because if the co-ordinator is faulty it’s impossible to tell what the original request from the client was, and so the safe action is to do nothing. If the co-ordinator is non-faulty, then validity is required as normal. This of course implies that correct replicas must be able to detect a faulty co-ordinator.
How might faults appear in an instance of Byzantine consensus? There are the standard failure modes that we know about - messages might take a long time to be processed, due to a fail-recover fault, or replicas might crash. More perniciously, replicas might falsify messages to other replicas. The co-ordinator might tell half the replicas that value A has been proposed, and half the replicas that value B was. Faulty replicas in either half might then lie to other replicas about the value that it received from the co-ordinator. The task of a Byzantine fault-tolerant consensus algorithm is to figure out if the co-ordinator is non-faulty, and if so what values it sent to the good replicas. This is a difficult problem - how, with no a priori notion of trust, do you tell who is lying?
Leslie Lamport, along with Shostak and Pease, wrote the seminal paper on the subject of BFT. Indeed, they introduced the metaphor of the ‘Byzantine Generals’, deciding whether to attack or retreat from their distributed camp sites in the presence of potential traitors, and therefore misinformation.
Their paper set out two important results. The first is the most well known: that to tolerate completely arbitrary Byzantine failures in a network where Byzantine replicas may lie not only about their own state, but what they have heard from other replicas, at least \(3f+1\) nodes total are required to tolerate \(f\) faults.
This is an extremely famous result, and one that is quoted time and time again in the literature - often without a full understanding of why it holds and in what conditions. The proof itself will be the subject of a later post, but the intuition comes from considering the \(2f+1\) case when \(f=1\) . In order to reach agreement with three nodes, it can be seen very easily that every replica must send every other replica a message containing the proposal it got from the co-ordinator. If only the co-ordinator could be faulty then the replicas can figure this out very quickly as, when they compare notes, they will see that both received different proposals. However, if one of the replicas could be faulty then the good replica will still see the same message from the faulty replica: both values that they claim to have received from the co-ordinator are different. Since the good replica can’t distinguish this situation from when the co-ordinator is faulty, it can’t decide who to believe and therefore can’t decide consistently. The paper then goes on to show that any purported solution which uses fewer than \(3f+1\) replicas could be used to solve the three-replica, one fault case. Therefore, any correct solution must involve \(3f+1\) replicas or more. The paper gives one such solution.
The second result is that, if nodes may sign their messages to other nodes in a non-forgeable way so that Byzantine nodes cannot lie about what they have heard, there is an algorithm for Byzantine consensus in a synchronous network that tolerates \(f = n-2\) failures. This is dramatically better than the \(3f+1\) bound, but comes at the cost of an expensive protocol in terms of the number of rounds needed to execute.
It’s difficult to overestimate how influential this paper has been. All subsequent BFT papers that I’ve read (quite a few!) have characterised their solutions similarly, in terms of the fraction of faulty replicas that their algorithm will tolerate. The idea of having replicas sign their messages, so that other replicas could not lie about what they had heard is highly practical given the advent of public-key cryptography, and clearly greatly strengthens the system.
But Lamport’s paper only dealt with synchronous signed-message networks. What about asynchronous ones? Treatment of these had to wait until a few years later for a paper by Bracha and Toueg. They showed that, even if replicas could not forge responses from other replicas, there was still a fundamental \(3f+1\) lower bound on the number of replicas. The proof was by construction of an arrangement of replicas such that correct replicas couldn’t decide between good and bad responses under two different, but identical from the perspective of correct replicas, executions.
The intuition behind this result is that each correct replica must hear from \(2f+1\) other replicas in the worst case, so that the \(f\) faulty replicas that may be lying about the proposed value that they heard can be outvoted. So the number of replicas must be at least \(r \geq 2f+1\) .
At the same time, a replica can only wait to hear from at most \(n-f\) replicas (where \(n\) is the total number of replicas). Why? Because in an asynchronous system it’s possible either that a) \(f\) replicas are faulty and have failed to reply or b) \(f\) replicas are not faulty, but haven’t replied yet. A correct replica can’t distinguish between these cases, and so can’t wait for an unbounded amount of time to get all \(n\) replies, which might never be forthcoming. If the network were synchronous, a good replica could detect that the \(f\) replicas that had failed to reply must be faulty, which means that the replies it had already received were all from the other correct replicas. In an asynchronous network, this is not possible.
Therefore we have that \(n -f \geq r \geq 2f+1\) , or that \(n \geq 3f+1\) .
Practical Byzantine Fault Tolerance
Miguel Castro and Professor Liskov had a 1999 OSDI paper called Practical Byzantine Fault Tolerance. Their contribution was to develop a Byzantine fault tolerant consensus protocol that was both efficient and applicable to realistic scenarios.
Liskov and Castro were the first to propose a correct algorithm that worked efficiently in asynchronous networks, and that realised the \(3f+1\) lower bound. FLP impossibility tells us that consensus is in general impossible to solve in asynchronous networks with even just fail-stop failures, rather than the more general Byzantine failures, so Liskov and Castro’s PBFT took a similar approach to Paxos and guaranteed liveness only when the network was synchronous. During periods of asynchrony PBFT may not terminate, but will never violate its safety properties (which are, as ever, validity and agreement).
PBFT’s other advantages were its high performance - owing in part to the use of the more efficient Message Authentication Codes rather than public-key cryptography for fast message signing - and its ability to tolerate an unbounded number of faults over the lifetime of execution, as long as no more than \(f\) were concurrent. PBFT provides the familiar state-machine abstract interface, with a primary through which requests go that proposes a sequence number for every operation. The correct replicas agree (via Byzantine consensus) on the sequence number, and then commit the requests in sequence order.
The protocol itself is reasonably simple, but I’m not going to describe it in detail here as that would require a lot of groundwork to prove some basic results and show why it’s correct: this will be the subject of a future series of posts, as there’s a lot of interesting work in BFT still being done. The basic idea of PBFT is to have the primary broadcast a request to all replicas, which then retransmit what they have heard to every other replica. If all replicas agree on the same operation, then the primary is currently correct, and the replicas broadcast a commit message to each other, much like a standard three-phase commit protocol. In fact, replicas will proceed once they have got \(2f+1\) identical replies in the first phase, which is a Byzantine quorum, any two of which will be guaranteed to contain at least one correct replica in common. The idea is to ensure that the primary can’t have two conflicting sequence numbers for a single request accepted (since this would require a correct replica to accept both, which it will not do).
If, however, the primary turns out to be faulty, then a view-change protocol has to be executed, which causes all replicas to stop taking requests from the primary, and elect a new one in its stead. Getting this protocol right involves taking checkpoints of mutually agreed upon histories, since a view-change might occur during a request, then agreeing on the new leader and restarting any pending requests. All the while making sure that faulty replicas can’t hijack the process.
If you’re interested, the best paper is not the OSDI one, but the one in Transactions on Computer Systems, which is a longer paper and an easier read.
PBFT has since been update by a variety of different work, but the main structure of the protocol has yet to be fundamentally improved upon. There is still some distance to go before Byzantine fault tolerance techniques habitually make their way into production systems - there’s still increased complexity, cost and performance issues to worry about - but when they do, PBFT will doubtless have been a tremendously important step in the right direction.