<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Paper Trail</title>
    <link>https://www.the-paper-trail.org/</link>
    <description>Recent content on Paper Trail</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <managingEditor>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</managingEditor>
    <webMaster>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</webMaster>
    <lastBuildDate>Mon, 22 Jun 2020 12:52:31 -0700</lastBuildDate>
    
	<atom:link href="https://www.the-paper-trail.org/index.xml" rel="self" type="application/rss+xml" />
    
    
    <item>
      <title>Network Load Balancing with Maglev</title>
      <link>https://www.the-paper-trail.org/post/2020-06-23-maglev/</link>
      <pubDate>Mon, 22 Jun 2020 12:52:31 -0700</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2020-06-23-maglev/</guid>
      <description>Maglev: A Fast and Reliable Software Network Load Balancer Eisenbud et al., NSDI 2016
Load balancing is a fundamental primitive in modern service architectures: a load balancer assigns requests to servers so as to, well, balance the load on each server. This improves resource utilisation and ensures that servers aren&amp;rsquo;t unnecessarily overloaded.
Maglev is - or was, sometime before 2016 - Google&amp;rsquo;s network load-balancer that managed load-balancing duties for search, Gmail and other high-profile Google services.</description>
    </item>
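The paper's main trick is its consistent-hashing scheme, "Maglev hashing": each backend derives a deterministic permutation of lookup-table slots from an (offset, skip) pair, and backends take turns claiming their next preferred free slot, which balances slots almost perfectly while keeping assignments stable. A minimal sketch in Python — the table size, MD5-based hashes and names here are illustrative, not the paper's production parameters:

```python
import hashlib

def _h(s: str, seed: str) -> int:
    """Toy stand-in for the paper's hash functions."""
    return int(hashlib.md5((seed + s).encode()).hexdigest(), 16)

def maglev_table(backends, m=13):
    """Build a Maglev lookup table of (prime) size m.

    Each backend's preference list is a permutation of [0, m) generated
    from an (offset, skip) pair; backends round-robin over their lists,
    each claiming its next still-free preferred slot until the table fills.
    """
    offsets = [_h(b, "offset") % m for b in backends]
    skips = [_h(b, "skip") % (m - 1) + 1 for b in backends]  # skip in [1, m-1]
    next_idx = [0] * len(backends)
    table = [None] * m
    filled = 0
    while filled < m:
        for i, b in enumerate(backends):
            if filled == m:
                break
            # Walk backend i's permutation until a free slot is found.
            while True:
                slot = (offsets[i] + next_idx[i] * skips[i]) % m
                next_idx[i] += 1
                if table[slot] is None:
                    table[slot] = b
                    filled += 1
                    break
    return table

def lookup(table, flow_key: str):
    """Route a flow by hashing it into the lookup table."""
    return table[_h(flow_key, "flow") % len(table)]
```

Because filling proceeds round-robin, no backend ever holds more than one slot above any other, which is the near-perfect balance the paper emphasises.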
    
    <item>
      <title>Gray Failures</title>
      <link>https://www.the-paper-trail.org/post/2020-04-19-gray-failures/</link>
      <pubDate>Sat, 18 Apr 2020 22:04:31 -0700</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2020-04-19-gray-failures/</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;&lt;a href=&#34;https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf&#34;&gt;Gray Failure: The Achilles&amp;rsquo; Heel of Cloud-Scale Systems&lt;/a&gt;
&lt;em&gt;Huang et al., HotOS 2017&lt;/em&gt;  &lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Detecting faults in a large system is a surprisingly hard problem. First you have to decide what kind of thing you want to measure, or &amp;lsquo;observe&amp;rsquo;. Then you have to decide what pattern in that observation constitutes a sufficiently worrying situation (or &amp;lsquo;failure&amp;rsquo;) to require mitigation. Then you have to decide how to mitigate it!&lt;/p&gt;
&lt;p&gt;Complicating this already difficult issue is the fact that the health of your system is in part a matter of perspective. Your service might be working wonderfully from inside your datacenter, where your probes are run, but all of that means nothing to your users who have been trying to get their RPCs through an overwhelmed firewall for the last hour.&lt;/p&gt;
&lt;p&gt;That gap, between what your failure detectors observe, and what clients observe, is the subject of this paper on &amp;lsquo;Gray Failures&amp;rsquo;, which are the failure modes that happen when clients perceive an issue that is not yet detected by your internal systems. This is a good name for an old phenomenon (every failure detector I have built includes client-side mitigations to work around this exact issue).&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Availability in AWS&#39; Physalia</title>
      <link>https://www.the-paper-trail.org/post/2020-04-06-physalia/</link>
      <pubDate>Mon, 06 Apr 2020 22:04:31 -0700</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2020-04-06-physalia/</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;&lt;a href=&#34;https://www.usenix.org/conference/nsdi20/presentation/brooker&#34;&gt;Physalia: Millions of Tiny Databases&lt;/a&gt;  &lt;em&gt;Brooker et al., NSDI 2020&lt;/em&gt;&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Some notes on AWS&amp;rsquo; latest systems publication, which continues and expands their thinking about reducing the effect of failures in very large distributed systems (see &lt;a href=&#34;https://aws.amazon.com/blogs/architecture/shuffle-sharding-massive-and-magical-fault-isolation/&#34;&gt;shuffle sharding&lt;/a&gt; as an earlier and complementary technique for the same kind of problem).&lt;/p&gt;
&lt;p&gt;Physalia is a configuration store for AWS&amp;rsquo; Elastic Block Store (EBS; i.e. network-attached disks). EBS disks are replicated using chain replication, but the configuration of the replication chain needs to be stored somewhere - enter Physalia.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Beating hash tables with trees? The ART-ful radix trie</title>
      <link>https://www.the-paper-trail.org/post/art-paper-notes/</link>
      <pubDate>Sat, 03 Nov 2018 22:04:31 -0700</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/art-paper-notes/</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;&lt;a href=&#34;https://db.in.tum.de/~leis/papers/ART.pdf&#34;&gt;The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases&lt;/a&gt;
&lt;em&gt;Leis et al., ICDE 2013&lt;/em&gt;&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Tries are an unloved third data structure for
building key-value stores and indexes, after search trees (like
&lt;a href=&#34;https://en.wikipedia.org/wiki/B-tree&#34;&gt;B-trees&lt;/a&gt; and &lt;a href=&#34;https://en.wikipedia.org/wiki/Red%E2%80%93black_tree&#34;&gt;red-black
trees&lt;/a&gt;) and hash tables. Yet they have a
number of very appealing properties that make them worthy of consideration - for example, the height
of a trie is independent of the number of keys it contains, and a trie requires no rebalancing when
updated. Weighing against those advantages is the heavy memory cost that vanilla radix tries can
incur, because each node contains a pointer for every possible value of the &amp;lsquo;next&amp;rsquo; character in the
key. With one-byte characters, that&amp;rsquo;s 256 pointers for every node in the tree.&lt;/p&gt;
&lt;p&gt;But the astute reader will feel in their bones that this is naive - there must be more efficient
ways to store a set of pointers, indexed by a fixed size set of keys (the trie&amp;rsquo;s alphabet). Indeed,
there are - several of them, in fact, distinguished by the number of children the node &lt;em&gt;actually&lt;/em&gt;
has, not just how many it might &lt;em&gt;potentially&lt;/em&gt; have.&lt;/p&gt;
&lt;p&gt;This is where the &lt;em&gt;Adaptive Radix Tree&lt;/em&gt; (ART) comes in. In this breezy, easy-to-read paper, the
authors show how to reduce the memory cost of a regular radix trie by &lt;em&gt;adapting&lt;/em&gt; the data structure
used for each node to the number of children that it needs to store. In doing so they show, perhaps
surprisingly, that the amount of space consumed by a single key can be bounded no matter how long
the key is.&lt;/p&gt;</description>
    </item>
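ART adapts among four node types (Node4, Node16, Node48 and Node256 in the paper). A much-simplified sketch with just the two extremes shows the grow-on-overflow idea — the class names mirror the paper's, but the structure is toy Python, not the paper's C++ layout:

```python
class Node4:
    """Compact node: up to 4 (key byte, child) pairs, linear search."""
    CAP = 4

    def __init__(self):
        self.keys, self.children = [], []
        self.value = None                    # payload if a key ends here

    def child(self, b):
        return self.children[self.keys.index(b)] if b in self.keys else None

    def put_child(self, b, child):
        if b in self.keys:                   # replace an existing edge
            self.children[self.keys.index(b)] = child
        else:
            self.keys.append(b)
            self.children.append(child)

    def full(self):
        return len(self.keys) >= self.CAP

class Node256:
    """Large node: one slot per possible byte, O(1) child lookup."""
    def __init__(self):
        self.slots = [None] * 256
        self.value = None

    def child(self, b):
        return self.slots[b]

    def put_child(self, b, child):
        self.slots[b] = child

    def full(self):
        return False

def grow(node):
    """Copy a full Node4 into a Node256 (ART's adaptation step)."""
    big = Node256()
    big.value = node.value
    for b, c in zip(node.keys, node.children):
        big.slots[b] = c
    return big

class ART:
    def __init__(self):
        self.root = Node4()

    def insert(self, key: bytes, value):
        node, parent, pb = self.root, None, None
        for b in key:
            if node.full() and node.child(b) is None:
                node = grow(node)            # adapt node to its fan-out
                if parent is None:
                    self.root = node
                else:
                    parent.put_child(pb, node)
            nxt = node.child(b)
            if nxt is None:
                nxt = Node4()
                node.put_child(b, nxt)
            parent, pb, node = node, b, nxt
        node.value = value

    def search(self, key: bytes):
        node = self.root
        for b in key:
            node = node.child(b)
            if node is None:
                return None
        return node.value
```

The real ART also adds path compression and lazy leaf expansion, which is where the bounded per-key space result comes from; this sketch only shows the node-size adaptation.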
    
    <item>
      <title>Outperforming hash-tables with MICA</title>
      <link>https://www.the-paper-trail.org/post/mica-paper-notes/</link>
      <pubDate>Wed, 26 Sep 2018 12:52:31 -0700</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/mica-paper-notes/</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;&lt;a href=&#34;https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-lim.pdf&#34;&gt;MICA: A Holistic Approach to Fast In-Memory Key-Value Storage&lt;/a&gt; &lt;em&gt;Lim et al., NSDI 2014&lt;/em&gt;&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In this installment we&amp;rsquo;re going to look at a system from NSDI 2014. &lt;strong&gt;MICA&lt;/strong&gt; is another in-memory
key-value store, but in contrast to Masstree it does not support range queries and in much of the
paper it keeps a fixed working set by evicting old items, like a cache. Indeed, the closest
comparison system that you might think of when reading about MICA for the first time is a
humble&amp;hellip; hash table. Is there still room for improvement over such a fundamental data structure?
Read on and find out (including benchmarks!).&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Masstree: A cache-friendly mashup of tries and B-trees</title>
      <link>https://www.the-paper-trail.org/post/masstree-paper-notes/</link>
      <pubDate>Mon, 10 Sep 2018 12:13:31 -0700</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/masstree-paper-notes/</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;&lt;a href=&#34;https://pdos.csail.mit.edu/papers/masstree:eurosys12.pdf&#34;&gt;Cache Craftiness for Fast Multicore Key-Value Storage&lt;/a&gt;
&lt;em&gt;Mao et al., EuroSys 2012&lt;/em&gt; [&lt;a href=&#34;https://github.com/kohler/masstree-beta&#34;&gt;code&lt;/a&gt;]&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2 id=&#34;the-big-idea&#34;&gt;The Big Idea&lt;/h2&gt;
&lt;p&gt;Consider the problem of storing, in memory, millions of &lt;code&gt;(key, value)&lt;/code&gt; pairs, where &lt;code&gt;key&lt;/code&gt; is a
variable-length string. If we just wanted to support point lookup, we&amp;rsquo;d use a hash table. But
assuming we want to support range queries, some kind of tree structure is probably required. One
candidate might be a traditional B+-tree.&lt;/p&gt;
&lt;p&gt;In such a B+-tree, the number of levels of the tree are kept small thanks to the fact that
each node has a high fan-out. However, that means that a large number of keys are packed into a
single node, and so there&amp;rsquo;s still a large number of key comparisons to perform when searching
through the tree.&lt;/p&gt;
&lt;p&gt;This is further exacerbated by variable-length keys (e.g. strings), where the cost of key
comparisons can be quite high. If the keys are really long they can each occupy multiple cache
lines, and so comparing two of them can really mess up your cache locality.&lt;/p&gt;
&lt;p&gt;This paper proposes an efficient tree data structure that relies on splitting variable length keys
into a variable number of fixed-length keys called &lt;em&gt;slices&lt;/em&gt;. As you go down the tree, you compare
the first slice of each key, then the second, then the third and so on, but each comparison has
&lt;em&gt;constant cost&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;For example, think about the string &lt;code&gt;the quick brown fox jumps over the lazy dog&lt;/code&gt;. This string
consists of the following 8-byte slices: &lt;code&gt;the quic&lt;/code&gt;, &lt;code&gt;k brown_&lt;/code&gt;, &lt;code&gt;fox jump&lt;/code&gt;, &lt;code&gt;s over t&lt;/code&gt;, &lt;code&gt;he lazy_&lt;/code&gt;
and finally &lt;code&gt;dog&lt;/code&gt;. To find a string in a tree, you can look for all strings that match the first
slice first, and then look for the second slice only in strings that matched the first slice, and so
on - only comparing a &lt;em&gt;fixed&lt;/em&gt; size subset of the key at any time. This is much more efficient than
comparing long strings to one another over and over again. The trick is to design a structure that
takes advantage of the cache benefits of doing these fixed-size comparisons, without losing a
tradeoff based on the large cardinality of the slice &amp;lsquo;alphabet&amp;rsquo;. Enter the &lt;strong&gt;&lt;em&gt;Masstree&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;</description>
    </item>
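The slicing and narrowing described in the excerpt can be sketched directly — a toy illustration of the idea, not Masstree's actual trie-of-B+-trees layout (which packs each 8-byte slice into an integer so a comparison is a single machine-word operation):

```python
def slices(key: str, width: int = 8):
    """Split a key into fixed-width byte slices, as Masstree does
    with 8-byte chunks."""
    data = key.encode()
    return [data[i:i + width] for i in range(0, len(data), width)]

keys = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "a completely different key",
]
target = keys[1]

# Narrow the candidate set one fixed-size slice at a time: at depth d,
# only keys whose first d slices matched the target's slices remain.
cands = keys
for d, s in enumerate(slices(target)):
    cands = [k for k in cands if d < len(slices(k)) and slices(k)[d] == s]
```

The third key is eliminated at depth 0 and the "lazy dog" key only at depth 5; at no point is a comparison wider than 8 bytes performed.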
    
    <item>
      <title>The CAP FAQ</title>
      <link>https://www.the-paper-trail.org/page/cap-faq/</link>
      <pubDate>Fri, 08 Jun 2018 16:22:58 -0700</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/page/cap-faq/</guid>
      <description>0. What is this document? No subject appears to be more controversial to distributed systems engineers than the oft-quoted, oft-misunderstood CAP theorem. The purpose of this FAQ is to explain what is known about CAP, so as to help those new to the theorem get up to speed quickly, and to settle some common misconceptions or points of disagreement.
Of course, there&amp;rsquo;s every possibility I&amp;rsquo;ve made superficial or completely thorough mistakes here.</description>
    </item>
    
    <item>
      <title>Dist Sys Slack</title>
      <link>https://www.the-paper-trail.org/page/dist-sys-slack/</link>
      <pubDate>Fri, 08 Jun 2018 11:51:35 -0700</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/page/dist-sys-slack/</guid>
      <description>Want to chat about distributed systems and databases? Come join over 2000 like-minded individuals at the dist-sys slack!
Click here for an invite. </description>
    </item>
    
    <item>
      <title>Reading List</title>
      <link>https://www.the-paper-trail.org/page/reading-list/</link>
      <pubDate>Thu, 07 Jun 2018 14:49:07 -0700</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/page/reading-list/</guid>
      <description>Distributed Systems   Service Fabric: A Distributed Platform for Building Microservices in the Cloud - Kakivaya et al., EuroSys 2018
  [notes] Gray Failure: The Achilles’ Heel of Cloud-Scale Systems - Huang et al., HotOS 2017
  Cache-aware load balancing of data center applications - Archer et al., VLDB 2019
  Slicer: Auto-Sharding for Datacenter Applications - Adya et al., OSDI 2016
  [notes] Maglev: A Fast and Reliable Software Network Load Balancer - Eisenbud et.</description>
    </item>
    
    <item>
      <title>Exactly-once or not, atomic broadcast is still impossible in Kafka - or anywhere</title>
      <link>https://www.the-paper-trail.org/post/2017-07-28-exactly-not-atomic-broadcast-still-impossible-kafka/</link>
      <pubDate>Fri, 28 Jul 2017 16:23:38 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2017-07-28-exactly-not-atomic-broadcast-still-impossible-kafka/</guid>
      <description>&lt;h5 id=&#34;intro&#34;&gt;Intro&lt;/h5&gt;
&lt;p&gt;I read an &lt;a href=&#34;https://t.co/xrA4IROUue&#34;&gt;article recently by Jay Kreps&lt;/a&gt; about a feature for delivering
messages &amp;lsquo;exactly-once&amp;rsquo; within the Kafka framework. Everyone&amp;rsquo;s excited, and for good reason. But
there&amp;rsquo;s been a bit of a side story about what exactly &amp;lsquo;exactly-once&amp;rsquo; means, and what Kafka can
actually do.&lt;/p&gt;
&lt;p&gt;In the article, Jay identifies the safety and liveness properties of &lt;a href=&#34;https://en.wikipedia.org/wiki/Atomic_broadcast&#34;&gt;atomic
broadcast&lt;/a&gt; as a pretty good definition for the set
of properties that Kafka is going after with their new exactly-once feature, and then starts to
address claims by naysayers that atomic broadcast is impossible.&lt;/p&gt;
&lt;p&gt;For this note, I&amp;rsquo;m &lt;em&gt;not&lt;/em&gt; going to address whether or not exactly-once is an implementation of atomic
broadcast. I also believe that exactly-once is a powerful feature that&amp;rsquo;s been impressively realised
by Confluent and the Kafka community; nothing here is a criticism of that effort or the feature
itself. But the article makes some claims about impossibility that are, at best, a bit shaky - and,
well, impossibility&amp;rsquo;s kind of my jam. Jay posted his article with a
&lt;a href=&#34;https://twitter.com/jaykreps/status/881563991742349313&#34;&gt;tweet&lt;/a&gt; saying he couldn&amp;rsquo;t &amp;lsquo;resist a good
argument&amp;rsquo;. I&amp;rsquo;m responding in that spirit.&lt;/p&gt;
&lt;p&gt;In particular, the article makes the claim that atomic broadcast is &amp;lsquo;solvable&amp;rsquo; (and later that
consensus is as well&amp;hellip;), which is wrong. What follows is why, and why that matters.&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;This deserves a response: I think the conclusions are right but the imposs. arguments aren&amp;#39;t. But it&amp;#39;s 8pm in England and I&amp;#39;m in the pub. &lt;a href=&#34;https://t.co/akmVv9rhW7&#34;&gt;https://t.co/akmVv9rhW7&lt;/a&gt;&lt;/p&gt;&amp;mdash; Henry Robinson (@HenryR) &lt;a href=&#34;https://twitter.com/HenryR/status/881591741966569472?ref_src=twsrc%5Etfw&#34;&gt;July 2, 2017&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;


&lt;p&gt;I have since left the pub. So let&amp;rsquo;s begin.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Make any algorithm lock-free with this one crazy trick</title>
      <link>https://www.the-paper-trail.org/post/2016-05-25-make-any-algorithm-lock-free-with-this-one-crazy-trick/</link>
      <pubDate>Wed, 25 May 2016 22:51:03 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2016-05-25-make-any-algorithm-lock-free-with-this-one-crazy-trick/</guid>
      <description>&lt;p&gt;Lock-free algorithms often operate by having several versions of a data structure in use at one time. The general pattern is that you can prepare an update to a data structure, and then use a machine primitive to atomically install the update by changing a pointer. This means that all subsequent readers will follow the pointer to its new location - for example, to a new node in a linked-list - but this pattern can’t do anything about readers that have already followed the old pointer value, and are traversing the previous version of the data structure.&lt;/p&gt;</description>
    </item>
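The prepare-then-atomically-install pattern the excerpt describes looks like this in a toy lock-free stack push. The `Cell` class here is a stand-in for the hardware compare-and-swap primitive that real implementations use (e.g. `std::atomic` in C++ or `AtomicReference` in Java); its internal lock only simulates atomicity for illustration:

```python
import threading

class Cell:
    """Toy atomic cell: compare_and_swap stands in for the hardware CAS
    instruction that lock-free algorithms rely on."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()   # simulates atomicity only

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Install `new` only if the cell still holds `expected`."""
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False

class Node:
    def __init__(self, value, nxt=None):
        self.value, self.next = value, nxt

def push(head: Cell, value):
    """Lock-free stack push: prepare a new node off to the side, then
    atomically swing the head pointer; retry if a racer won."""
    while True:
        old = head.load()
        node = Node(value, old)                 # prepare the new version
        if head.compare_and_swap(old, node):    # atomically install it
            return
        # CAS failed: another thread moved head; loop and retry.
```

Note the problem the excerpt raises: a reader that loaded the old head before the CAS is still traversing the previous version, which is exactly why reclaiming old nodes safely is the hard part.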
    
    <item>
      <title>Distributed systems theory for the distributed systems engineer</title>
      <link>https://www.the-paper-trail.org/post/2014-08-09-distributed-systems-theory-for-the-distributed-systems-engineer/</link>
      <pubDate>Sat, 09 Aug 2014 20:45:38 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2014-08-09-distributed-systems-theory-for-the-distributed-systems-engineer/</guid>
      <description>&lt;p&gt;&lt;em&gt;Updated June 2018 with content on atomic broadcast, gossip, chain replication and more&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Gwen Shapira, who at the time was an engineer at Cloudera and now is spreading the Kafka gospel, asked a question on Twitter that got me thinking.&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;I need to improve my proficiency in distributed systems theory. Where do I start? Any recommended books?&lt;/p&gt;&amp;mdash; Gwen (Chen) Shapira (@gwenshap) &lt;a href=&#34;https://twitter.com/gwenshap/status/497203248332165121?ref_src=twsrc%5Etfw&#34;&gt;August 7, 2014&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;


&lt;p&gt;My response of old might have been &amp;ldquo;well, here&amp;rsquo;s the FLP paper, and here&amp;rsquo;s the Paxos paper, and here&amp;rsquo;s the Byzantine generals paper&amp;hellip;&amp;rdquo;, and I&amp;rsquo;d have prescribed a laundry list of primary source material which would have taken at least six months to get through if you rushed. But I&amp;rsquo;ve come to thinking that recommending a ton of theoretical papers is often precisely the wrong way to go about learning distributed systems theory (unless you are in a PhD program). Papers are usually deep and complex, and require both serious study and &lt;em&gt;significant experience&lt;/em&gt; to glean their important contributions and to place them in context. What good is requiring that level of expertise of engineers?&lt;/p&gt;
&lt;p&gt;And yet, unfortunately, there&amp;rsquo;s a paucity of good &amp;lsquo;bridge&amp;rsquo; material that summarises, distills and contextualises the important results and ideas in distributed systems theory; particularly material that does so without condescending. Considering that gap led me to another interesting question:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;What distributed systems theory should a distributed systems engineer know?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;A little theory is, in this case, not such a dangerous thing. So I tried to come up with a list of what I consider the basic concepts that are applicable to my every-day job as a distributed systems engineer. Let me know what you think I missed!&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>The Elephant was a Trojan Horse: On the Death of Map-Reduce at Google</title>
      <link>https://www.the-paper-trail.org/post/2014-06-25-the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google/</link>
      <pubDate>Wed, 25 Jun 2014 17:49:39 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2014-06-25-the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google/</guid>
      <description>&lt;p&gt;&lt;em&gt;Note: this is a personal blog post, and doesn&amp;rsquo;t reflect the views of my employers at Cloudera&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Map-Reduce is on its way out. But we shouldn&amp;rsquo;t measure its importance in the number of bytes it crunches, but the fundamental shift in data processing architectures it helped popularise.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This morning, at their I/O Conference, Google revealed that they’re &lt;a href=&#34;http://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreduce-favor-new-hyper-scale-analytics-system/&#34;&gt;not using Map-Reduce to process data internally at all any more&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We shouldn’t be surprised. The writing has been on the wall for Map-Reduce for some time. The truth is that Map-Reduce as a processing paradigm continues to be severely restrictive, and is no more than a subset of richer processing systems.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Paper notes: MemC3, a better Memcached</title>
      <link>https://www.the-paper-trail.org/post/2014-06-18-paper-notes-memc3-a-better-memcached/</link>
      <pubDate>Wed, 18 Jun 2014 14:36:51 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2014-06-18-paper-notes-memc3-a-better-memcached/</guid>
      <description>&lt;h2 id=&#34;memc3-compact-and-concurrent-memcache-with-dumber-caching-and-smarter-hashing&#34;&gt;MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Fan, Andersen and Kaminsky, &lt;a href=&#34;https://www.usenix.org/conference/nsdi13/&#34;&gt;NSDI 2013&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&#34;the-big-idea&#34;&gt;The big idea&lt;/h3&gt;
&lt;p&gt;This is a paper about choosing your data structures and algorithms carefully. By paying careful attention to the workload and functional requirements, the authors reimplement &lt;a href=&#34;http://memcached.org/&#34;&gt;memcached&lt;/a&gt; to achieve a) better concurrency and b) better space efficiency. Specifically, they introduce a variant of cuckoo hashing that is highly amenable to concurrent workloads, and integrate the venerable CLOCK cache eviction algorithm with the hash table for space-efficient approximate LRU.&lt;/p&gt;</description>
    </item>
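Basic cuckoo hashing, the starting point that MemC3 refines, can be sketched as follows — a toy single-threaded version, not MemC3's optimistic concurrent variant, with MD5-derived bucket choices standing in for real hash functions:

```python
import hashlib

def _h(seed: str, key: str) -> int:
    """Deterministic toy hash; real tables use faster functions."""
    return int(hashlib.md5((seed + key).encode()).hexdigest(), 16)

class CuckooHash:
    """Minimal two-choice cuckoo hash table: every key has exactly two
    candidate slots; inserting into a full slot 'kicks' the occupant
    to its alternate slot, possibly cascading."""
    def __init__(self, size=256):
        self.size = size
        self.slots = [None] * size          # (key, value) pairs or None

    def _buckets(self, key):
        return _h("a", key) % self.size, _h("b", key) % self.size

    def get(self, key):
        # A lookup inspects at most two slots - the key property
        # that makes reads fast and cache-friendly.
        for b in self._buckets(key):
            if self.slots[b] is not None and self.slots[b][0] == key:
                return self.slots[b][1]
        return None

    def put(self, key, value, max_kicks=32):
        b1, b2 = self._buckets(key)
        for b in (b1, b2):
            if self.slots[b] is None or self.slots[b][0] == key:
                self.slots[b] = (key, value)
                return True
        # Both candidates occupied: evict an occupant, re-home it in its
        # alternate slot, and repeat; a real table would resize on failure.
        b, item = b1, (key, value)
        for _ in range(max_kicks):
            item, self.slots[b] = self.slots[b], item   # swap in, evict
            alts = [x for x in self._buckets(item[0]) if x != b]
            b = alts[0] if alts else b
            if self.slots[b] is None:
                self.slots[b] = item
                return True
        return False  # table too full
```

MemC3's contribution is on top of this: making the kick sequence safe for many concurrent readers (via optimistic versioning) and a single writer, plus tagging to avoid touching keys on most probes.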
    
    <item>
      <title>Paper notes: Anti-Caching</title>
      <link>https://www.the-paper-trail.org/post/2014-06-06-paper-notes-anti-caching/</link>
      <pubDate>Fri, 06 Jun 2014 11:03:39 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2014-06-06-paper-notes-anti-caching/</guid>
      <description>&lt;h2 id=&#34;anti-caching-a-new-approach-to-database-management-system-architecture&#34;&gt;Anti-Caching: A New Approach to Database Management System Architecture&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;DeBrabant et al., VLDB 2013&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&#34;the-big-idea&#34;&gt;The big idea&lt;/h3&gt;
&lt;p&gt;Traditional databases typically rely on the OS page cache to bring hot tuples into memory and keep them there. This suffers from a number of problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No control over granularity of caching or eviction (so keeping a tuple in memory might keep all the tuples in its page as well, even though there&amp;rsquo;s not necessarily a usage correlation between them)&lt;/li&gt;
&lt;li&gt;No control over when fetches are performed (fetches are typically slow, and transactions may hold onto locks or latches while the access is being made)&lt;/li&gt;
&lt;li&gt;Duplication of resources - tuples can occupy both disk blocks and memory pages.&lt;/li&gt;
&lt;/ul&gt;</description>
    </item>
    
    <item>
      <title>Paper notes: Stream Processing at Google with Millwheel</title>
      <link>https://www.the-paper-trail.org/post/2014-06-04-paper-notes-stream-processing-at-google-with-millwheel/</link>
      <pubDate>Wed, 04 Jun 2014 12:07:04 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2014-06-04-paper-notes-stream-processing-at-google-with-millwheel/</guid>
      <description>&lt;h2 id=&#34;millwheel-fault-tolerant-stream-processing-at-internet-scale&#34;&gt;Millwheel: Fault-Tolerant Stream Processing at Internet Scale&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Akidau et al., VLDB 2013&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&#34;the-big-idea&#34;&gt;The big idea&lt;/h3&gt;
&lt;p&gt;Streaming computations at scale are nothing new. Millwheel is a standard DAG stream processor, but
one that runs at &amp;lsquo;Google&amp;rsquo; scale. This paper really answers the following questions: what guarantees
should be made about delivery and fault-tolerance to support most common use cases cheaply? What
optimisations become available if you choose these guarantees carefully?&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Paper notes: DB2 with BLU Acceleration</title>
      <link>https://www.the-paper-trail.org/post/2014-05-14-paper-notes-db2-with-blu-acceleration/</link>
      <pubDate>Wed, 14 May 2014 18:02:15 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2014-05-14-paper-notes-db2-with-blu-acceleration/</guid>
      <description>&lt;h2 id=&#34;db2-with-blu-acceleration-so-much-more-than-just-a-column-store&#34;&gt;DB2 with BLU Acceleration: So Much More than Just a Column Store&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Raman et al., VLDB 2013&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&#34;the-big-idea&#34;&gt;The big idea&lt;/h3&gt;
&lt;p&gt;IBM&amp;rsquo;s venerable DB2 technology was based on traditional row-based technology. By moving to a columnar execution engine, and crucially then by taking full advantage of the optimisations that columnar formats allow, the &amp;lsquo;BLU Acceleration&amp;rsquo; project was able to improve read-mostly BI workloads by a 10 to 50 times speed-up.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Étale cohomology</title>
      <link>https://www.the-paper-trail.org/post/2014-03-04-etale-cohomology/</link>
      <pubDate>Tue, 04 Mar 2014 22:22:25 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2014-03-04-etale-cohomology/</guid>
      <description>&lt;p&gt;&lt;em&gt;The second in an extremely irregular series of posts made on behalf of my father, who has spent much of his retirement so far doing very hard mathematics. What is attached here is the essay he wrote for the &lt;a href=&#34;http://en.wikipedia.org/wiki/PartIII_of_the_Mathematical_Tripos&#34; title=&#34;Cambridge Mathematics Part III&#34;&gt;Part III of the Cambridge Mathematical Tripos&lt;/a&gt;, a one year taught course. The subject is &lt;a href=&#34;http://en.wikipedia.org/wiki/%C3%89tale_cohomology&#34;&gt;étale cohomology&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>ByteArrayOutputStream is really, really slow sometimes in JDK6</title>
      <link>https://www.the-paper-trail.org/post/2014-01-10-535/</link>
      <pubDate>Fri, 10 Jan 2014 14:57:41 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2014-01-10-535/</guid>
      <description>&lt;p&gt;TLDR: Yesterday I &lt;a href=&#34;https://twitter.com/HenryR/status/421415424807297024&#34; title=&#34;Twitter&#34;&gt;mentioned on Twitter&lt;/a&gt; that I&amp;rsquo;d found a bad performance problem when writing to a large &lt;a href=&#34;http://docs.oracle.com/javase/6/docs/api/java/io/ByteArrayOutputStream.html&#34; title=&#34;ByteArrayOutputStream Javadoc&#34;&gt;&lt;code&gt;ByteArrayOutputStream&lt;/code&gt;&lt;/a&gt; in Java. After some digging, it appears to be the case that there&amp;rsquo;s a bad bug in JDK6 that doesn&amp;rsquo;t affect correctness, but does cause performance to nosedive when a &lt;code&gt;ByteArrayOutputStream&lt;/code&gt; gets large. This post explains why.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>On Raft, briefly</title>
      <link>https://www.the-paper-trail.org/post/2013-10-31-on-raft-briefly/</link>
      <pubDate>Thu, 31 Oct 2013 12:03:51 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2013-10-31-on-raft-briefly/</guid>
      <description>Raft is a new-ish consensus implementation whose great benefit, to my mind, is its applicability to real systems. We briefly discussed it internally at Cloudera, and I thought I&amp;rsquo;d share what I contributed, below. There&amp;rsquo;s an underlying theme here regarding the role of distributed systems research in practitioners&amp;rsquo; daily work, and how the act of building a distributed system has not yet been sufficiently well commoditised to render a familiarity with the original research unnecessary.</description>
    </item>
    
    <item>
      <title>Some miscellanea</title>
      <link>https://www.the-paper-trail.org/post/2013-05-19-some-miscellanea/</link>
      <pubDate>Sun, 19 May 2013 22:39:57 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2013-05-19-some-miscellanea/</guid>
      <description>CAP FAQ I wrote an FAQ on The CAP Theorem. The aim is to definitively settle some of the common misconceptions around CAP so as to help prevent its invocation in useless places. If someone says they got around CAP, refer them to the FAQ. It should be a pretty simple introduction to the theorem as well. I think that CAP itself is a pretty uninteresting result, but it does at least shine a light on tradeoffs implicit in distributed systems.</description>
    </item>
    
    <item>
      <title>Columnar Storage</title>
      <link>https://www.the-paper-trail.org/post/2013-01-30-columnar-storage/</link>
      <pubDate>Wed, 30 Jan 2013 19:46:31 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2013-01-30-columnar-storage/</guid>
      <description>&lt;p&gt;You&amp;rsquo;re going to hear a lot about columnar storage formats in the next few months, as a variety of distributed execution engines are beginning to consider them for their IO efficiency, and the optimisations that they open up for query execution. In this post, I&amp;rsquo;ll explain why we care so much about IO efficiency and show how columnar storage - which is a simple idea - can drastically improve performance for certain workloads.&lt;/p&gt;
&lt;p&gt;Caveat: This is a personal, general research summary post, and as usual doesn&amp;rsquo;t necessarily reflect our thinking at Cloudera about columnar storage.&lt;/p&gt;
&lt;p&gt;Disks are still the major bottleneck in query execution over large datasets. Even a machine with twelve disks running in parallel (for an aggregate bandwidth north of 1GB/s) can&amp;rsquo;t keep all the cores busy; running a query against memory-cached data can achieve tens of GB/s of throughput. IO bandwidth matters. Therefore, the best thing an engineer can usually do to improve the performance of disk-based query engines (like RDBMSs and Impala) is to improve the performance of reading bytes from disk. This can mean decreasing latency (for small queries, where the time to find the data to read might dominate), but most often it means improving the effective throughput of reads from disk.&lt;/p&gt;
&lt;p&gt;The traditional way to improve disk bandwidth has been to wait, and allow disks to get faster. However, disks are not getting faster very quickly (having settled at roughly 100 MB/s, with ~12 disks per server), and SSDs can&amp;rsquo;t yet achieve the storage density to be directly competitive with HDDs on a per-server basis.&lt;/p&gt;
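To make the bandwidth arithmetic concrete, here is a back-of-the-envelope sketch (my own illustration, using the per-disk figures above; the table size and column fraction are invented for the example) of how scan time depends on how many of the stored bytes a query actually needs:

```python
# Back-of-the-envelope scan-time estimate (illustrative figures only).
DISK_BW_MBPS = 100                       # ~100 MB/s per spinning disk, as above
NUM_DISKS = 12                           # ~12 disks per server, as above
AGG_BW_MBPS = DISK_BW_MBPS * NUM_DISKS   # ~1.2 GB/s aggregate bandwidth

def scan_seconds(table_mb, useful_fraction):
    """Time to scan the bytes a layout forces us to read.

    A row-oriented layout makes a query touching one column read whole
    records anyway (useful_fraction of 1.0); a columnar layout reads
    only the needed column's bytes.
    """
    return (table_mb * useful_fraction) / AGG_BW_MBPS

table_mb = 100_000  # a hypothetical 100 GB table
print(scan_seconds(table_mb, 1.0))  # row layout: read everything, ~83s
print(scan_seconds(table_mb, 0.1))  # query touches ~10% of bytes, ~8.3s
```

Nothing about the IO subsystem changed between the two calls; only the fraction of bytes read that were useful.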
&lt;p&gt;The other way to improve disk performance is to maximise the ratio of &amp;lsquo;useful&amp;rsquo; bytes read to total bytes read. The idea is not to read more data than is absolutely necessary to serve a query, so the useful bandwidth realised is increased without actually improving the performance of the IO subsystem. Enter &lt;em&gt;columnar storage&lt;/em&gt;, a principle for file format design that aims to do exactly that for query engines that deal with record-based data.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Cloudera Impala</title>
      <link>https://www.the-paper-trail.org/post/2012-11-04-cloudera-impala/</link>
      <pubDate>Sun, 04 Nov 2012 18:12:12 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2012-11-04-cloudera-impala/</guid>
      <description>&lt;p&gt;&lt;em&gt;If you have a strong background in either databases or distributed systems, and fancy working on such an exciting technology, &lt;a href=&#34;mailto:henry@cloudera.com&#34;&gt;send me a note!&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s great to finally be able to say something about what I&amp;rsquo;ve been working on at &lt;a href=&#34;http://www.cloudera.com&#34;&gt;Cloudera&lt;/a&gt; for nearly a year. At StrataConf / Hadoop World in New York a couple of weeks ago we announced &lt;a href=&#34;http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/&#34;&gt;Cloudera Impala&lt;/a&gt;. Impala is a distributed query execution engine that understands a subset of SQL, and critically runs over HDFS and HBase as storage managers. It&amp;rsquo;s very similar in functionality to Apache Hive, but it is much, much, much (anecdotally up to 100x) faster.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>On some subtleties of Paxos</title>
      <link>https://www.the-paper-trail.org/post/2012-11-03-on-some-subtleties-of-paxos/</link>
      <pubDate>Sat, 03 Nov 2012 19:02:22 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2012-11-03-on-some-subtleties-of-paxos/</guid>
      <description>&lt;p&gt;There&amp;rsquo;s one particular aspect of the Paxos protocol that gives readers of this blog - and for some time, me! - some difficulty. This short post tries to clear up some confusion on a part of the protocol that is poorly explained in pretty much every major description.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Links</title>
      <link>https://www.the-paper-trail.org/post/2012-08-06-links/</link>
      <pubDate>Mon, 06 Aug 2012 14:05:50 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2012-08-06-links/</guid>
      <description>Reasoning about Knowledge
  Toward a Cloud Computing Research Agenda (2009) -
 &amp;ldquo;One of the LADIS attendees commented at some point that Byzantine Consensus could be used to improve Chubby, making it tolerant of faults that could disrupt it as currently implemented. But for our keynote speakers, enhancing Chubby to tolerate such faults turns out to be of purely academic interest.&amp;rdquo;
   Low-level data structures -  The llds general working thesis is: for large memory applications, virtual memory layers can hurt application performance due to increased memory latency when dealing with large data structures.</description>
    </item>
    
    <item>
      <title>Something a bit different: translations of classic mathematical texts (!)</title>
      <link>https://www.the-paper-trail.org/post/2012-08-04-something-a-bit-different-translations-of-classic-mathematical-texts/</link>
      <pubDate>Sat, 04 Aug 2012 14:50:18 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2012-08-04-something-a-bit-different-translations-of-classic-mathematical-texts/</guid>
      <description>During his retirement, my father has been able to spend much time indulging his love of mathematics. This included, amongst other impressive endeavours, attending Cambridge at a more advanced age than average to take (and pass!) the Part III of the Mathematical Tripos, often considered one of the hardest taught courses in maths in the world.
Since then, he has hardly been idle, and has recently been undertaking a translation of a classic work in modern algebra by Dedekind and Weber from its original 100+ pages of German into English.</description>
    </item>
    
    <item>
      <title>EuroSys 2012 blog notes</title>
      <link>https://www.the-paper-trail.org/post/2012-04-15-eurosys-2012-blog-notes/</link>
      <pubDate>Sun, 15 Apr 2012 18:20:33 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2012-04-15-eurosys-2012-blog-notes/</guid>
      <description>EuroSys 2012 was last week - one of the premier European systems conferences. Over at the Cambridge System Research Group&amp;rsquo;s blog, various people from the group have written notes on the papers presented. They&amp;rsquo;re very well-written summaries, and worth checking out for an overview of the research presented.
 Day 1 Day 2 Day 3  </description>
    </item>
    
    <item>
      <title>FLP and CAP aren&#39;t the same thing</title>
      <link>https://www.the-paper-trail.org/post/2012-03-25-flp-and-cap-arent-the-same-thing/</link>
      <pubDate>Sun, 25 Mar 2012 20:55:34 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2012-03-25-flp-and-cap-arent-the-same-thing/</guid>
      <description>&lt;p&gt;An &lt;a href=&#34;http://www.quora.com/Distributed-Systems/Are-the-FLP-impossibility-result-and-Brewers-CAP-theorem-basically-equivalent&#34;&gt;interesting question&lt;/a&gt; came up on &lt;a href=&#34;http://www.quora.com&#34;&gt;Quora&lt;/a&gt; this last week. Roughly speaking, the question asked how, if at all, the &lt;a href=&#34;https://the-paper-trail.org/blog/?p=49&#34;&gt;FLP&lt;/a&gt; theorem and the &lt;a href=&#34;https://the-paper-trail.org/blog/?p=290&#34;&gt;CAP theorem&lt;/a&gt; were related. I&amp;rsquo;d thought idly about exactly the same question myself before. Both theorems concern the impossibility of solving fairly similar fundamental distributed systems problems in what appear to be fairly similar distributed systems settings. The CAP theorem gets all the airtime, but FLP to me is a more beautiful result. Wouldn&amp;rsquo;t it be fascinating if both theorems turned out to be equivalent; that is, effectively restatements of each other?&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Should I take a systems reading course?</title>
      <link>https://www.the-paper-trail.org/post/2012-03-09-should-i-take-a-systems-reading-course/</link>
      <pubDate>Fri, 09 Mar 2012 18:05:13 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2012-03-09-should-i-take-a-systems-reading-course/</guid>
      <description>&lt;p&gt;A smart student asked me a couple of days ago whether I thought taking a 2xx-level reading course in operating systems was a good idea. The student, understandably, was unsure whether talking about these systems was as valuable as actually building them, and also whether, since his primary interest is in &amp;lsquo;distributed&amp;rsquo; systems, he stood to benefit from a deep understanding of things like virtual memory.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>How consistent is eventual consistency?</title>
      <link>https://www.the-paper-trail.org/post/2012-01-04-how-consistent-is-eventual-consistency/</link>
      <pubDate>Wed, 04 Jan 2012 15:22:05 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2012-01-04-how-consistent-is-eventual-consistency/</guid>
      <description>This page, from the &amp;lsquo;PBS&amp;rsquo; team at Berkeley&amp;rsquo;s AMPLab, is quite interesting. It allows you to tweak the parameters of a Dynamo-style system, then, by running a series of Monte Carlo simulations, estimates the likelihood that a read issued after a write will return stale data.
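To give a flavour of what such a simulation does, here is a toy sketch of my own (not the PBS model itself; the delay distribution and its mean are invented for illustration): draw a propagation delay for each replica, then ask how often a read that contacts R of the N replicas some time after a write finds only stale copies.

```python
import random

def stale_read_probability(n=3, r=1, t_ms=10.0, trials=10_000, seed=42):
    """Toy Dynamo-style staleness estimate (illustrative, not PBS).

    A write reaches each of n replicas after an exponentially
    distributed delay (mean 5 ms here, an arbitrary choice); a read
    issued t_ms later contacts r random replicas and is stale iff
    none of them has seen the write yet.
    """
    rng = random.Random(seed)
    stale = 0
    for _ in range(trials):
        delays = [rng.expovariate(1 / 5.0) for _ in range(n)]
        contacted = rng.sample(range(n), r)
        if all(delays[i] > t_ms for i in contacted):
            stale += 1
    return stale / trials

print(stale_read_probability(t_ms=1.0))   # reads soon after a write: often stale
print(stale_read_probability(t_ms=50.0))  # waiting longer: rarely stale
```

Even this crude version shows the shape of the result: staleness probability decays quickly with the time elapsed since the write, and with larger R.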
Since the Dynamo paper appeared and really popularised eventual consistency, the debate has focused on a fairly binary treatment of its merits. Either you can&amp;rsquo;t afford to be wrong, ever, or it&amp;rsquo;s ok to have your reads be stale for a potentially unbounded amount of time.</description>
    </item>
    
    <item>
      <title>STM: Not (much more than) a research toy?</title>
      <link>https://www.the-paper-trail.org/post/2011-04-21-stm-not-much-more-than-a-research-toy/</link>
      <pubDate>Thu, 21 Apr 2011 13:02:38 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2011-04-21-stm-not-much-more-than-a-research-toy/</guid>
      <description>It&amp;rsquo;s a sign of how down-trodden the Software Transactional Memory (STM) effort must have become that the article (sorry, ACM subscription required) published in a recent CACM might have been just as correctly called &amp;ldquo;STM: Not as bad as the worst possible case&amp;rdquo;. The authors present a series of experiments that demonstrate that highly concurrent STM code beats sequential, single-threaded code. You&amp;rsquo;d hope that this had long ago become a given, but all this demonstrates is that, hey, STM allows some parallelism.</description>
    </item>
    
    <item>
      <title>The Theorem That Will Not Go Away</title>
      <link>https://www.the-paper-trail.org/post/2010-10-07-the-theorem-that-will-not-go-away/</link>
      <pubDate>Thu, 07 Oct 2010 23:28:55 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2010-10-07-the-theorem-that-will-not-go-away/</guid>
      <description>The CAP theorem gets another airing.
I think the article makes a point worth making again, and makes it fairly well - that CAP is really about P =&amp;gt; ~(C &amp;amp; A). A couple of things I want to call out though, after a rollicking discussion on Hacker News.
 &amp;ldquo;For a distributed (i.e., multi-node) system to not require partition-tolerance it would have to run on a network which is guaranteed to never drop messages (or even deliver them late) and whose nodes are guaranteed to never die.</description>
    </item>
    
    <item>
      <title>CAP confusion: Problems with Partition Tolerance</title>
      <link>https://www.the-paper-trail.org/post/2010-04-27-cap-confusion-problems-with-partition-tolerance/</link>
      <pubDate>Tue, 27 Apr 2010 10:44:29 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2010-04-27-cap-confusion-problems-with-partition-tolerance/</guid>
      <description>Over on the Cloudera blog I&amp;rsquo;ve written an article that should be of interest to readers of this blog.
I&amp;rsquo;m no great fan of the ubiquity of the CAP theorem - it&amp;rsquo;s a solid impossibility result which appeals to the theorist in me, but it doesn&amp;rsquo;t capture every fundamental tension in a distributed system. For example: we make our systems distributed across more than one machine usually for reasons of performance and to eliminate a single point of failure.</description>
    </item>
    
    <item>
      <title>Apache ZooKeeper is looking for Google Summer of Code applicants</title>
      <link>https://www.the-paper-trail.org/post/2010-03-24-apache-zookeeper-is-looking-for-google-summer-of-code-applicants/</link>
      <pubDate>Wed, 24 Mar 2010 10:18:02 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2010-03-24-apache-zookeeper-is-looking-for-google-summer-of-code-applicants/</guid>
      <description>Students! Over at Apache ZooKeeper we&amp;rsquo;re looking for great students with a strong interest in distributed systems to work with us over the summer as part of Google&amp;rsquo;s Summer of Code, 2010.
Summer of Code is a great program - providing stipends to students and more importantly connecting them with mentors in open source projects. ZooKeeper has a number of interesting projects to get started on.
ZooKeeper is a distributed coordination platform on which you can build the distributed equivalents of many traditional concurrent primitives like locks, queues and barriers.</description>
    </item>
    
    <item>
      <title>GFS Retrospective in ACM Queue</title>
      <link>https://www.the-paper-trail.org/post/2009-08-12-gfs-retrospective-in-acm-queue/</link>
      <pubDate>Wed, 12 Aug 2009 21:01:11 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-08-12-gfs-retrospective-in-acm-queue/</guid>
      <description>This is a really great article. Sean Quinlan talks very openly and critically about the design of the Google File System given ten years of use (ten years!).
What&amp;rsquo;s interesting is that the general sentiment seems to be that the concessions GFS made for performance and simplicity (single master, loose consistency model) have in hindsight turned out to be net bad decisions, although they probably weren&amp;rsquo;t at the time.
There are scaling issues with GFS - the well known many-small-files problem that also plagues HDFS, and a similar huge-files problem.</description>
    </item>
    
    <item>
      <title>SOSP 2009 Program Available</title>
      <link>https://www.the-paper-trail.org/post/2009-06-29-sosp-2009-program-available/</link>
      <pubDate>Mon, 29 Jun 2009 14:49:17 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-06-29-sosp-2009-program-available/</guid>
      <description>The accepted papers for SOSP 2009 are here. As ever, some excellent looking papers. If you search for the titles you can often turn up drafts or even the submitted versions.
The best looking sessions to me are &amp;lsquo;scalability&amp;rsquo; and &amp;lsquo;clusters&amp;rsquo;, but there&amp;rsquo;s at least one great looking title in every session. I&amp;rsquo;ll start posting some reviews once I find some bandwidth (and have finished the computation theory series - next one on its way).</description>
    </item>
    
    <item>
      <title>In Which I Prove Employable</title>
      <link>https://www.the-paper-trail.org/post/2009-04-09-in-which-i-prove-employable/</link>
      <pubDate>Thu, 09 Apr 2009 08:51:54 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-04-09-in-which-i-prove-employable/</guid>
      <description>Although I try and keep personal information to a relative minimum on this blog, here&amp;rsquo;s some news that&amp;rsquo;s relevant. Recently I accepted an offer to start work at Cloudera, a young company in the San Francisco area. Initially I&amp;rsquo;ll be working from the UK, with a view to a permanent move out to California when timing and visas allow.
Hadoop is Cloudera&amp;rsquo;s business. Hadoop is an open-source implementation of Google&amp;rsquo;s MapReduce. Cloudera provides support for Hadoop, and their own fully supported distribution of the Hadoop toolset.</description>
    </item>
    
    <item>
      <title>Barbara Liskov&#39;s Turing Award, and Byzantine Fault Tolerance</title>
      <link>https://www.the-paper-trail.org/post/2009-03-30-barbara-liskovs-turing-award-and-byzantine-fault-tolerance/</link>
      <pubDate>Mon, 30 Mar 2009 14:53:08 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-03-30-barbara-liskovs-turing-award-and-byzantine-fault-tolerance/</guid>
      <description>&lt;p&gt;&lt;a href=&#34;http://www.pmg.csail.mit.edu/~liskov/&#34;&gt;Barbara Liskov&lt;/a&gt; has just been announced as the recipient of the &lt;a href=&#34;http://www.acm.org/press-room/news-releases/turing-award-08/&#34;&gt;2008 Turing Award&lt;/a&gt;, which is one of the most important prizes in computer science, and can be thought of as our field&amp;rsquo;s equivalent to the various Nobel Prizes. Professor Liskov is a worthy recipient of the award, even if judged alone by her &lt;a href=&#34;http://awards.acm.org/citation.cfm?id=1108679&amp;amp;srt=all&amp;amp;aw=140&amp;amp;ao=AMTURING&#34;&gt;citation&lt;/a&gt;, which lists a number of the important contributions she has made to operating systems, programming languages and distributed systems.&lt;/p&gt;
&lt;p&gt;Professor Liskov seems to be particularly well known for the &lt;a href=&#34;http://en.wikipedia.org/wiki/Liskov_substitution_principle&#34;&gt;Liskov substitution principle&lt;/a&gt; which says that some property of a supertype ought to hold of its subtypes. I&amp;rsquo;m not in any position to speak as to the importance of this contribution. However, her more recent work has been regarding the tolerance of Byzantine failures in distributed systems, which is much closer to my heart.&lt;/p&gt;
&lt;p&gt;The only work of Liskov&amp;rsquo;s that I am really familiar with is the late 90s work on &lt;a href=&#34;http://www.pmg.lcs.mit.edu/~castro/osdi99_html/osdi99.html&#34;&gt;Practical Byzantine Fault Tolerance&lt;/a&gt; with &lt;a href=&#34;http://research.microsoft.com/en-us/um/people/mcastro/&#34;&gt;Miguel Castro&lt;/a&gt; and is first published in &lt;a href=&#34;http://www.pmg.lcs.mit.edu/papers/osdi99.pdf&#34;&gt;this OSDI &amp;lsquo;99 paper&lt;/a&gt;. I&amp;rsquo;m not going to do a full review, but the topic sits so nicely with my recent focus on consensus protocols that it makes sense to briefly discuss its importance.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>OSDI &#39;08: FlightPath: Obedience vs. Choice in Cooperative Services</title>
      <link>https://www.the-paper-trail.org/post/2009-03-03-osdi-08-flightpath-obedience-vs-choice-in-cooperative-services/</link>
      <pubDate>Tue, 03 Mar 2009 15:45:18 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-03-03-osdi-08-flightpath-obedience-vs-choice-in-cooperative-services/</guid>
      <description>&lt;p&gt;This is one of my favourite papers from OSDI &amp;lsquo;08 (yes, still doing a few reviews, trying to get to five or so before SOSP&amp;hellip;). &lt;a href=&#34;http://www.usenix.org/events/osdi08/tech/full_papers/li_h/li_h.pdf&#34;&gt;FlightPath&lt;/a&gt; is a system developed by some folks mainly at UT Austin for peer-to-peer streaming in dynamic networks. This is a reasonably challenging problem in itself, although one that&amp;rsquo;s seen a good deal of work before. However, the really cool thing about this paper is that they treat participants in the network as potentially rational agents. Since Lamport&amp;rsquo;s seminal work on the Byzantine generals problem, it&amp;rsquo;s been standard practice to assign one of two behaviour modes to members of distributed systems: either you&amp;rsquo;re altruistic, which means that you do exactly what the protocol tells you to do, no matter what the cost to yourself, or Byzantine, which means that you do whatever you like, again no matter what the cost to yourself.&lt;/p&gt;
&lt;p&gt;It was realised recently that this is a false dichotomy: there&amp;rsquo;s a whole class of behaviour that&amp;rsquo;s not captured by these two extremes. &lt;em&gt;Rational&lt;/em&gt; agents participate in a protocol as long as it is worth their while to do so. At its simplest, this means that rational agents will not incur a cost unless they expect to recoup a benefit worth at least as much as that cost. This gave rise to the Byzantine-Altruistic-Rational (BAR) model, due to the same UT Austin group, which can be used to more realistically model the performance of peer-to-peer protocols.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Consensus Protocols: A Paxos Implementation</title>
      <link>https://www.the-paper-trail.org/post/2009-02-09-consensus-protocols-a-paxos-implementation/</link>
      <pubDate>Mon, 09 Feb 2009 19:37:44 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-02-09-consensus-protocols-a-paxos-implementation/</guid>
      <description>&lt;p&gt;It&amp;rsquo;s one thing to wax lyrical about an algorithm or protocol having simply read the paper it appeared in. It&amp;rsquo;s another to have actually taken the time to build an implementation. There are many slips twixt hand and mouth, and the little details that you&amp;rsquo;ve abstracted away at the point of reading come back to bite you hard at the point of writing.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m a big fan of building things to understand them - this blog is essentially an expression of that idea, as the act of constructing an explanation of something helps me understand it better. Still, I felt that in order to be properly useful, this blog probably needed more code.&lt;/p&gt;
&lt;p&gt;So when, yesterday, it was suggested I back up my previous post on Paxos with a toy implementation I had plenty of motivation to pick up the gauntlet. However, I&amp;rsquo;m super-pressed for time at the moment while I write my PhD thesis, so I gave myself a deadline of a few hours, just to keep it interesting.&lt;/p&gt;
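For anyone following along, the heart of any such toy is the acceptor's promise/accept logic, which can be sketched like this (a minimal illustration of my own, not necessarily how the linked code is structured):

```python
class Acceptor:
    """Single-decree Paxos acceptor: the promise/accept rules."""

    def __init__(self):
        self.promised = -1      # highest proposal number promised so far
        self.accepted = None    # (number, value) of the last accepted proposal

    def prepare(self, n):
        # Phase 1b: promise to ignore proposals numbered below n,
        # and report any value already accepted.
        if n > self.promised:
            self.promised = n
            return ("promise", n, self.accepted)
        return ("nack", n, None)

    def accept(self, n, value):
        # Phase 2b: accept unless a higher-numbered promise was made.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return ("accepted", n, value)
        return ("nack", n, None)

a = Acceptor()
print(a.prepare(1))      # ("promise", 1, None)
print(a.accept(1, "x"))  # ("accepted", 1, "x")
print(a.prepare(2))      # ("promise", 2, (1, "x")) - reports the prior value
print(a.accept(1, "y"))  # ("nack", 1, None) - superseded by promise 2
```

The last two calls are where the safety of Paxos lives: a later proposer learns of the accepted value, and a stale proposer is refused.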
&lt;p&gt;A few hours later, I&amp;rsquo;d written &lt;a href=&#34;https://github.com/henryr/toy_paxos&#34;&gt;this&lt;/a&gt; from-scratch implementation of Paxos. There&amp;rsquo;s enough interesting stuff in it, I think, to warrant this post on how it works. Hopefully some of you will find it useful, and something you can use as a springboard to your own implementations. You can run an example by simply invoking python toy_paxos.py.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Consensus Protocols: Paxos</title>
      <link>https://www.the-paper-trail.org/post/2009-02-03-consensus-protocols-paxos/</link>
      <pubDate>Tue, 03 Feb 2009 17:03:14 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-02-03-consensus-protocols-paxos/</guid>
      <description>&lt;p&gt;You can&amp;rsquo;t really read two articles about distributed systems today without someone mentioning the Paxos algorithm. Google use it in &lt;a href=&#34;http://labs.google.com/papers/chubby.html&#34;&gt;Chubby&lt;/a&gt;, Yahoo used something a bit like it (but not the same!) in &lt;a href=&#34;https://zookeeper.apache.org/&#34;&gt;ZooKeeper&lt;/a&gt; and it seems that it&amp;rsquo;s considered the ne plus ultra of consensus algorithms. It also comes with a reputation as being fantastically difficult to understand - a subtle, complex algorithm that is only properly appreciated by a select few.&lt;/p&gt;
&lt;p&gt;This is kind of true and not true at the same time. Paxos is an algorithm whose entire behaviour is subtly difficult to grasp. However, the algorithm itself is fairly intuitive, and certainly relatively simple. In this article I&amp;rsquo;ll describe how basic Paxos operates, with reference to previous articles on two-phase and three-phase commit. I&amp;rsquo;ve included a bibliography at the end, for those who want plenty more detail.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Tuesday Links, 27th January 2009</title>
      <link>https://www.the-paper-trail.org/post/2009-01-27-tuesday-links-27th-january-2009/</link>
      <pubDate>Tue, 27 Jan 2009 14:52:33 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-01-27-tuesday-links-27th-january-2009/</guid>
      <description>Web highlights discovered in the last week or so:
 The C10K problem - detailed discussion of how to do IO on a server that you want to handle 10000 simultaneous connections.
 Scalability by Design - Coding for Systems With Large CPU Counts - via High Scalability.
 Anti-RDBMS: A list of distributed key-value stores - good, if superficial, survey.
 Mark Russinovich: Inside Windows 7 - kernel-level look at what&amp;rsquo;s new in Windows.
 Project Voldemort - Dynamo-a-like from LinkedIn.</description>
    </item>
    
    <item>
      <title>OSDI &#39;08 - CuriOS: Improving Reliability Through Operating System Structure</title>
      <link>https://www.the-paper-trail.org/post/2009-01-19-osdi-08-curios-improving-reliability-through-operating-system-structure/</link>
      <pubDate>Mon, 19 Jan 2009 17:28:01 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-01-19-osdi-08-curios-improving-reliability-through-operating-system-structure/</guid>
      <description>&lt;p&gt;The second paper from OSDI that I&amp;rsquo;ll mention here is one I&amp;rsquo;ll only treat briefly - partly because it&amp;rsquo;s a bit lightweight compared to some, and partly because I&amp;rsquo;m writing in a hurry. &lt;a href=&#34;http://www.usenix.org/events/osdi08/tech/full_papers/david/david.pdf&#34;&gt;CuriOS: Improving Reliability Through Operating System Structure&lt;/a&gt; attacks a problem with recovery from errors in microkernel operating systems.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>OSDI &#39;08: Corey, an operating system for many cores</title>
      <link>https://www.the-paper-trail.org/post/2009-01-14-osdi-08-corey-an-operating-system-for-many-cores/</link>
      <pubDate>Wed, 14 Jan 2009 22:50:19 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-01-14-osdi-08-corey-an-operating-system-for-many-cores/</guid>
      <description>&lt;p&gt;Just before Christmas, the systems community held one of its premier conferences - Operating Systems Design and Implementation (OSDI &amp;lsquo;08). This biannual conference showcases some of the best research in operating systems, networks, distributed systems and software technology from the past couple of years.&lt;/p&gt;
&lt;p&gt;Although I wasn&amp;rsquo;t lucky enough to go, I did grab a copy of the proceedings and had a read through a bunch of the papers that interested me. I plan to post summaries of a few to this blog. I see people ask repeatedly on various forums (fora?) &amp;ldquo;what&amp;rsquo;s new in computer science?&amp;rdquo;. No-one seems to give a satisfactory answer, for a number of reasons. Hopefully I can redress some of the balance here, at least in the systems world.&lt;/p&gt;
&lt;p&gt;Without further ado, I&amp;rsquo;ll get stuck in to one of the OSDI papers: &lt;a href=&#34;http://www.mit.edu/~y_z/papers/corey-osdi08.pdf&#34;&gt;Corey: an operating system for many cores&lt;/a&gt; by Boyd-Wickizer et al from a combination of MIT, Fudan University, MSR Asia and Xi&amp;rsquo;an Jiaotong University (12 authors!). Download the paper and play along at home, as usual.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Consensus with lossy links: Establishing a TCP connection</title>
      <link>https://www.the-paper-trail.org/post/2009-01-12-consensus-with-lossy-links-establishing-a-tcp-connection/</link>
      <pubDate>Mon, 12 Jan 2009 13:51:27 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-01-12-consensus-with-lossy-links-establishing-a-tcp-connection/</guid>
      <description>&lt;p&gt;After a hiatus for the Christmas break, during which I travelled to the States, had a job interview, went to Vegas, became an uncle and got a cold, I&amp;rsquo;m back on a more regular posting schedule now. And I&amp;rsquo;ve got lots to post about.&lt;/p&gt;
&lt;p&gt;Before I talk about other theoretical consensus protocols such as Paxos, I want to illustrate a consensus protocol running in the wild, and show how different modelling assumptions can lead to protocols that are rather different to the *PC variants we&amp;rsquo;ve looked at in the &lt;a href=&#34;http://hnr.dnsalias.net/wordpress/?p=90&#34;&gt;last&lt;/a&gt; &lt;a href=&#34;http://hnr.dnsalias.net/wordpress/?p=103&#34;&gt;couple&lt;/a&gt; of posts. We&amp;rsquo;ve been considering situations like database commit, where many participants agree en masse to the result of a transaction. We&amp;rsquo;ve assumed that all participants may communicate reliably, without fear of packet loss (or, if packets are lost, that the situation is the same as if the sending host had failed).&lt;/p&gt;
&lt;p&gt;The Transmission Control Protocol (TCP) gives us at least some approximation to a reliable link due to the use of sequence numbers and acknowledgements. However, before we can use TCP, both hosts involved in a point-to-point communication have to establish a connection: that is, they must both agree that a connection is established. This is a two-party consensus problem. Neither party can rely on reliable transmission, and can instead only use the IP stack and below to negotiate a connection. IP does not give reliable transmission semantics to packets and works only on a best-effort principle. If the network is noisy or prone to outages then packets will be lost. How can we achieve consensus in this scenario?&lt;/p&gt;
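To see the retransmission idea in miniature, here is a toy sketch of my own (nothing like the real TCP state machine, and the loss model is invented): each message is resent until it gets through, so the handshake completes provided losses eventually stop.

```python
def handshake(drop_pattern):
    """Toy three-way handshake over a lossy link.

    drop_pattern: one boolean per transmission, True meaning that
    transmission is lost; once the pattern is exhausted, no further
    losses occur. The client retransmits its SYN until a SYN-ACK
    round-trip survives, so any finite run of losses is tolerated.
    Returns the number of attempts needed.
    """
    sends = iter(drop_pattern)

    def send_ok():
        return not next(sends, False)  # past the pattern: always delivered

    attempts = 0
    while True:
        attempts += 1
        if send_ok() and send_ok():  # SYN delivered, then SYN-ACK delivered
            send_ok()                # final ACK (a loss here is healed later)
            return attempts

print(handshake([]))                   # no loss: succeeds on attempt 1
print(handshake([True, False, True]))  # losses: succeeds on attempt 3
```

Note what the sketch cannot fix: if losses never stop, the loop never terminates, which is exactly the impossibility result discussed below.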
&lt;p&gt;Those who have been reading this blog as far back as my explanation of &lt;a href=&#34;http://hnr.dnsalias.net/wordpress/?p=49&#34;&gt;FLP impossibility&lt;/a&gt; will probably be thinking that this is a trick question. FLP impossibility shows that if there is an unbounded delay in the transmission of a packet (i.e. an asynchronous network model) then consensus is, in general, unsolvable. Lossy links can be regarded as delaying packet delivery infinitely - therefore it seems very likely that consensus is unsolvable with packet loss.&lt;/p&gt;
&lt;p&gt;In fact, this is completely true. Consensus with arbitrary packet loss is an unsolvable problem, even in an otherwise synchronous network. In this post I want to demonstrate the short and intuitive proof that this is the case, then show how this impossibility is avoided where possible in TCP connection establishment.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Consensus Protocols: Three-phase Commit</title>
      <link>https://www.the-paper-trail.org/post/2008-11-29-consensus-protocols-three-phase-commit/</link>
      <pubDate>Sat, 29 Nov 2008 14:35:36 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-11-29-consensus-protocols-three-phase-commit/</guid>
      <description>&lt;p&gt;Last time we looked extensively at two-phase commit, a consensus algorithm that has the benefit of low latency but which is offset by fragility in the face of participant machine crashes. In this short note, I&amp;rsquo;m going to explain how the addition of an extra phase to the protocol can shore things up a bit, at the cost of a greater latency.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Consensus Protocols: Two-Phase Commit</title>
      <link>https://www.the-paper-trail.org/post/2008-11-27-consensus-protocols-two-phase-commit/</link>
      <pubDate>Thu, 27 Nov 2008 16:41:53 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-11-27-consensus-protocols-two-phase-commit/</guid>
      <description>&lt;p&gt;For the next few articles here, I&amp;rsquo;m going to write about one of the most fundamental concepts in distributed computing - of equal importance to the theory and practice communities. The &lt;em&gt;consensus problem&lt;/em&gt; is the problem of getting a set of nodes in a distributed system to agree on something - it might be a value, a course of action or a decision. Achieving consensus allows a distributed system to act as a single entity, with every individual node aware of and in agreement with the actions of the whole of the network.&lt;/p&gt;
&lt;p&gt;For example, some possible uses of consensus are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;deciding whether or not to commit a transaction to a database&lt;/li&gt;
&lt;li&gt;synchronising clocks by agreeing on the current time&lt;/li&gt;
&lt;li&gt;agreeing to move to the next stage of a distributed algorithm (this is the famous &lt;em&gt;replicated state machine&lt;/em&gt; approach)&lt;/li&gt;
&lt;li&gt;electing a leader node to coordinate some higher-level protocol&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Such a simple-sounding problem has, surprisingly, been at the core of distributed systems research - particularly on the theoretical side - for over twenty years. How come? As I see it, the answers are threefold.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>BigTable: Google&#39;s Distributed Data Store</title>
      <link>https://www.the-paper-trail.org/post/2008-10-29-bigtable-googles-distributed-data-store/</link>
      <pubDate>Wed, 29 Oct 2008 21:53:28 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-10-29-bigtable-googles-distributed-data-store/</guid>
      <description>&lt;p&gt;Although GFS provides Google with reliable, scalable distributed file storage, it does not provide any facility for structuring the data contained in the files beyond a hierarchical directory structure and meaningful file names. It&amp;rsquo;s well known that more expressive solutions are required for large data sets. Google&amp;rsquo;s terabytes upon terabytes of data that they retrieve from web crawlers, amongst many other sources, need organising, so that client applications can quickly perform lookups and updates at a finer granularity than the file level.&lt;/p&gt;
&lt;p&gt;So they built BigTable, wrote it up, and published it in OSDI 2006. The paper is &lt;a href=&#34;http://labs.google.com/papers/bigtable-osdi06.pdf&#34;&gt;here&lt;/a&gt;, and my walkthrough follows.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Yahoo&#39;s PNUTS</title>
      <link>https://www.the-paper-trail.org/post/2008-10-12-yahoos-pnuts/</link>
      <pubDate>Sun, 12 Oct 2008 22:53:00 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-10-12-yahoos-pnuts/</guid>
      <description>&lt;p&gt;In these politically charged times, it&amp;rsquo;s important for written media to give equal coverage to all major parties so as not to appear biased or to be endorsing one particular group. With that in mind, we at Paper Trail are happy to devote significant programming time to all the major distributed systems players.&lt;/p&gt;
&lt;p&gt;This, therefore, is a party political broadcast on behalf of the Yahoo Party.&lt;/p&gt;
&lt;h2 id=&#34;pnuts-yahoos-hosted-data-serving-platform&#34;&gt;PNUTS: Yahoo!&amp;rsquo;s Hosted Data Serving Platform&lt;/h2&gt;
&lt;p&gt;(Please note, that&amp;rsquo;s the first and last time in this article that I&amp;rsquo;ll be using the exclamation mark in Yahoo&amp;rsquo;s name; it looks funny.)&lt;/p&gt;
&lt;p&gt;As you might expect from the company that runs Flickr, Yahoo have need for a large scale distributed data store. In particular, they need a system that runs in many geographical locations in order to optimise response times for users from any region, while at the same time coordinating data across the entire system. As ever, the system must exhibit high availability and fault tolerance, scalability and good latency properties.&lt;/p&gt;
&lt;p&gt;These, of course, are not new or unique requirements. We&amp;rsquo;ve seen already that Amazon&amp;rsquo;s Dynamo, and Google&amp;rsquo;s BigTable/GFS stack offer similar services. Any business that has a web-based product that requires storing and updating data for thousands of users has a need for a system like Dynamo. Many can&amp;rsquo;t afford the engineering time required to develop their own tuned solution, so settle for well-understood RDBMS-based stacks. However, as readers of this blog will know, RDBMSs can be almost too strict in terms of how data are managed, sacrificing responsiveness and throughput for correctness. This is a tradeoff that many systems are willing to explore.&lt;/p&gt;
&lt;p&gt;PNUTS is Yahoo&amp;rsquo;s entry into this space. As usual, it occupies the grey area somewhere between a straightforward distributed hash-table and a fully-featured relational database. They published details at the conference on Very Large Data Bases (VLDB) in 2008. Read on to find out what design decisions they made&amp;hellip;&lt;/p&gt;
&lt;p&gt;(The paper is &lt;a href=&#34;http://www.brianfrankcooper.net/pubs/pnuts.pdf&#34;&gt;here&lt;/a&gt;, and playing along at home is as ever encouraged).&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>The Google File System</title>
      <link>https://www.the-paper-trail.org/post/2008-10-01-the-google-file-system/</link>
      <pubDate>Wed, 01 Oct 2008 13:19:44 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-10-01-the-google-file-system/</guid>
      <description>&lt;p&gt;It&amp;rsquo;s been a little while since my last technically meaty update. One system that I&amp;rsquo;ve been looking at a fair bit recently is Hadoop, which is an open-source implementation of Google&amp;rsquo;s MapReduce. For me, the interesting part is the large-scale distributed filesystem on which it runs called HDFS. It&amp;rsquo;s well known that HDFS is based heavily on its Google equivalent.&lt;/p&gt;
&lt;p&gt;In 2003 Google published a &lt;a href=&#34;http://labs.google.com/papers/gfs-sosp2003.pdf&#34;&gt;paper&lt;/a&gt; on their Google File System (GFS) at &lt;a href=&#34;http://www.cs.rochester.edu/meetings/sosp2003/&#34;&gt;SOSP&lt;/a&gt;, the Symposium on Operating Systems Principles. This is the same venue at which Amazon published their Dynamo work, albeit four years earlier. One of the lecturers in my group tells me that SOSP is a venue where &amp;ldquo;interesting&amp;rdquo; is rated highly as a criterion for acceptance, over other more staid conferences. So what, if anything, was interesting about GFS? Read on for some details&amp;hellip;&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Pain and suffering and ffmpeg</title>
      <link>https://www.the-paper-trail.org/post/2008-09-19-pain-and-suffering-and-ffmpeg/</link>
      <pubDate>Fri, 19 Sep 2008 11:55:02 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-09-19-pain-and-suffering-and-ffmpeg/</guid>
      <description>All I wanted to do was to transcode real media files from MIT OCW to iPod compatible mp4 on Linux. It shouldn&amp;rsquo;t have been this difficult. As of now, I still don&amp;rsquo;t have a satisfactory solution.
Problem 1: mplayer / mencoder read and play the stream correctly, but the mp4 files they produce when transcoding don&amp;rsquo;t work on the iPod. In particular, they&amp;rsquo;re not readable by any utilities I have, such as Easytag and Amarok.</description>
    </item>
    
    <item>
      <title>The Real GoogleOS?</title>
      <link>https://www.the-paper-trail.org/post/2008-09-02-the-real-googleos/</link>
      <pubDate>Tue, 02 Sep 2008 14:43:45 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-09-02-the-real-googleos/</guid>
      <description>&lt;p&gt;So Google have announced &lt;a href=&#34;http://googleblog.blogspot.com/2008/09/fresh-take-on-browser.html&#34;&gt;Chrome&lt;/a&gt;, their entrant into the web browser circus. They are presenting Chrome as a complete reboot of the browser, which of course it isn&amp;rsquo;t. It is interesting, however, to speculate wildly about Google&amp;rsquo;s intentions. We shouldn&amp;rsquo;t, of course, discount their stated intent of &amp;lsquo;adding value for users&amp;rsquo;; a lot of features of Chrome are focused upon improving today&amp;rsquo;s browsing experience. See, for example, pop-ups that are modal only in their own tab, which is something I have been wishing for for ages. However, looking at the big picture, even from a viewpoint far removed, is good for a laugh sometimes.&lt;/p&gt;
&lt;p&gt;Read on for some rampant speculation.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Consistency and availability in Amazon&#39;s Dynamo</title>
      <link>https://www.the-paper-trail.org/post/2008-08-26-consistency-and-availability-in-amazons-dynamo/</link>
      <pubDate>Tue, 26 Aug 2008 12:50:08 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-08-26-consistency-and-availability-in-amazons-dynamo/</guid>
      <description>&lt;p&gt;There is a continuing and welcome trend amongst large, modern technology companies like Google, Yahoo and Amazon to publish details of their systems at academic conferences. One of the problems that researchers at universities have is making a convincing case that their ideas would work well in the real world, since no matter how many assumptions are made there really is no substitute for field testing, and the infrastructure, workloads and data just aren&amp;rsquo;t available to do that effectively. However, companies have infrastructure to burn and a genuine use-case with genuine users. Using their experience and data to discover what does and doesn&amp;rsquo;t work, and what is and is not really important provides an invaluable feedback loop to researchers.&lt;/p&gt;
&lt;p&gt;More than that, large systems are built from a set of independent ideas. Most academic papers leave the construction of a practical real-world system as an exercise for the reader. Synthesising a set of disparate techniques often throws up lots of gotchas which no papers directly address. Companies with businesses to run have a much greater incentive to build a robust system that works.&lt;/p&gt;
&lt;p&gt;At 2007&amp;rsquo;s &lt;a href=&#34;http://www.sosp2007.com&#34;&gt;Symposium on Operating Systems Principles&lt;/a&gt; (SOSP), Amazon presented a &lt;a href=&#34;http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf&#34;&gt;paper&lt;/a&gt; about one of their real-world systems: &amp;ldquo;Dynamo: Amazon&amp;rsquo;s Highly Available Key-value Store&amp;rdquo;. It wound up winning, I think, the audience prize for best paper. In this post, I was planning to describe Dynamo &amp;lsquo;inside-out&amp;rsquo;, based on a reading group mandated close reading of the paper. However, trying to lucidly explain a dense 12 page paper leads to many more than 12 pages of explanation. So instead, I want to focus on one particular aspect of Dynamo which I think is the most interesting.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Good survey of the important papers in distributed consensus</title>
      <link>https://www.the-paper-trail.org/post/2008-08-25-good-survey-of-the-important-papers-in-distributed-consensus/</link>
      <pubDate>Mon, 25 Aug 2008 16:11:35 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-08-25-good-survey-of-the-important-papers-in-distributed-consensus/</guid>
      <description>This blog post is an excellent survey of the last thirty years of research into consensus problems.</description>
    </item>
    
    <item>
      <title>A Brief Tour of FLP Impossibility</title>
      <link>https://www.the-paper-trail.org/post/2008-08-13-a-brief-tour-of-flp-impossibility/</link>
      <pubDate>Wed, 13 Aug 2008 11:30:29 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-08-13-a-brief-tour-of-flp-impossibility/</guid>
      <description>&lt;p&gt;One of the most important results in distributed systems theory was published in April 1985 by Fischer, Lynch and Patterson. Their short paper &lt;a href=&#34;http://cs-www.cs.yale.edu/homes/arvind/cs425/doc/fischer.pdf&#34;&gt;&amp;lsquo;Impossibility of Distributed Consensus with One Faulty Process&amp;rsquo;&lt;/a&gt;, which eventually won the Dijkstra award given to the most influential papers in distributed computing, definitively placed an upper bound on what it is possible to achieve with distributed processes in an asynchronous environment.&lt;/p&gt;
&lt;p&gt;This particular result, known as the &amp;lsquo;FLP result&amp;rsquo;, settled a dispute that had been ongoing in distributed systems for the previous five to ten years. The problem of consensus - that is, getting a distributed network of processors to agree on a common value - was known to be solvable in a synchronous setting, where processes could proceed in simultaneous steps. In particular, the synchronous solution was resilient to faults, where processors crash and take no further part in the computation. Informally, synchronous models allow failures to be detected by waiting one entire step length for a reply from a processor, and presuming that it has crashed if no reply is received.&lt;/p&gt;
&lt;p&gt;This kind of failure detection is impossible in an asynchronous setting, where there are no bounds on the amount of time a processor might take to complete its work and then respond with a message. Therefore it&amp;rsquo;s not possible to say whether a processor has crashed or is simply taking a long time to respond. The FLP result shows that in an asynchronous setting, where only one processor might crash, there is no distributed algorithm that solves the consensus problem.&lt;/p&gt;
&lt;p&gt;In this post, I want to give a tour of the proof itself because, although it is quite subtle, it is short and profound. I&amp;rsquo;ll start by introducing consensus, and then after describing some notation and assumptions I&amp;rsquo;ll work through the main two lemmas in the paper.&lt;/p&gt;
&lt;p&gt;If you want to follow along at home (highly, highly recommended) a copy of the paper is available &lt;a href=&#34;http://cs-www.cs.yale.edu/homes/arvind/cs425/doc/fischer.pdf&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Binomial Heaps</title>
      <link>https://www.the-paper-trail.org/post/2008-07-11-binomial-heaps/</link>
      <pubDate>Fri, 11 Jul 2008 17:35:44 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-07-11-binomial-heaps/</guid>
      <description>&lt;p&gt;(The python code for this article is available &lt;a href=&#34;http://hnr.dnsalias.net/binomial_heap.py&#34;&gt;here&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;The standard binary heaps that everyone learns as part of a first algorithms course are very cool. They give guaranteed \(O(n \log n)\) sorting cost, can be stored compactly in memory since they&amp;rsquo;re full binary trees and allow for very fast implementations of priority queues. However, there are a couple of operations that we might be interested in that binary heaps don&amp;rsquo;t give us, at least not cheaply.&lt;/p&gt;
&lt;p&gt;In particular, we might be concerned with merging two heaps together. Say, for example, that we&amp;rsquo;re shutting down a processor with its own priority queue for schedulable processes, and we want to merge the workload in with another processor. One way to do this would be to insert every item in the first processor&amp;rsquo;s queue into the receiving processor&amp;rsquo;s queue. However, this takes  \(O(n)\) time - at least, depending on how the queues are implemented. We&amp;rsquo;d like to be able to do that more efficiently.&lt;/p&gt;
&lt;p&gt;Step forward binomial heaps. Binomial heaps are rather different to binary heaps - although they do share a few details. Binomial heaps allow us to merge two heaps together in  \(O(\log n)\) time, in return for some extra cost when finding the minimum. However, extracting the minimum still takes \(O(\log n)\) , which is the same as a binary heap.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Reservoir Sampling</title>
      <link>https://www.the-paper-trail.org/post/2008-04-09-reservoir-sampling/</link>
      <pubDate>Wed, 09 Apr 2008 16:55:09 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-04-09-reservoir-sampling/</guid>
      <description>&lt;p&gt;Right, time to get this blog back on track. I want to talk about a useful technique that&amp;rsquo;s both highly practical and crops up in interview scenarios regularly.&lt;/p&gt;
&lt;p&gt;Consider this problem: How can we efficiently randomly select  \(k\) items from a set  \(S\) of size \(n &amp;gt; k\) , where  \(n\) is unknown? Each member of  \(S\) should have an equal probability of being selected.&lt;/p&gt;
&lt;p&gt;At first glance, this problem looks a strange mix of trivial and impossible. If we don&amp;rsquo;t know \(n\) , how can we know how to weight our selection probabilities? And, assuming  \(S\) is finite, how can we not know  \(n\) if we&amp;rsquo;re able - as we must be - to visit every element in \(S\) ?&lt;/p&gt;
&lt;p&gt;To address the second point first: we can find  \(n\) by iterating over all elements in \(S\) . However, this adds a pass over the entire set that might be expensive. Consider selecting rows at random from a large database table. We don&amp;rsquo;t want to bring the whole table into memory just to count the number of rows. Linked lists are another good motivating example - say we would like to select  \(k\) elements from a linked list of length \(n\) . Even if we do loop over the list to count its elements, employing the simple approach of choosing  \(k\) elements at random will still be slow because random access in linked lists is not a constant-time operation. In general we will take on average  \(O(kn)\) time. By using the following technique, called &amp;lsquo;reservoir sampling&amp;rsquo;, we can bring this down to \(\Theta(n)\) .&lt;/p&gt;
&lt;p&gt;We can keep our selection of  \(k\) elements updated on-line as we scan through our data structure, and we can do it very simply.&lt;/p&gt;</description>
    </item>
    
  </channel>
</rss>