<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Paper Trail</title>
    <link>https://www.the-paper-trail.org/</link>
    <description>Recent content on Paper Trail</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <managingEditor>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</managingEditor>
    <webMaster>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</webMaster>
    <lastBuildDate>Mon, 22 Jun 2020 12:52:31 -0700</lastBuildDate>
    
	<atom:link href="https://www.the-paper-trail.org/index.xml" rel="self" type="application/rss+xml" />
    
    
    <item>
      <title>Network Load Balancing with Maglev</title>
      <link>https://www.the-paper-trail.org/post/2020-06-23-maglev/</link>
      <pubDate>Mon, 22 Jun 2020 12:52:31 -0700</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2020-06-23-maglev/</guid>
      <description>Maglev: A Fast and Reliable Software Network Load Balancer Eisenbud et al., NSDI 2016
Load balancing is a fundamental primitive in modern service architectures: a load balancer assigns requests to servers so as to, well, balance the load on each server. This improves resource utilisation and ensures that servers aren&amp;rsquo;t unnecessarily overloaded.
Maglev is - or was, sometime before 2016 - Google&amp;rsquo;s network load-balancer that managed load-balancing duties for search, Gmail and other high-profile Google services.</description>
    </item>
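The paper's main trick is its consistent-hashing scheme, "Maglev hashing": each backend derives a deterministic permutation of lookup-table slots from an (offset, skip) pair, and backends take turns claiming their next preferred free slot, which balances slots almost perfectly while keeping assignments stable. A minimal sketch in Python — the table size, MD5-based hashes and names here are illustrative, not the paper's production parameters:

```python
import hashlib

def _h(s: str, seed: str) -> int:
    """Toy stand-in for the paper's hash functions."""
    return int(hashlib.md5((seed + s).encode()).hexdigest(), 16)

def maglev_table(backends, m=13):
    """Build a Maglev lookup table of (prime) size m.

    Each backend's preference list is a permutation of [0, m) generated
    from an (offset, skip) pair; backends round-robin over their lists,
    each claiming its next still-free preferred slot until the table fills.
    """
    offsets = [_h(b, "offset") % m for b in backends]
    skips = [_h(b, "skip") % (m - 1) + 1 for b in backends]  # skip in [1, m-1]
    next_idx = [0] * len(backends)
    table = [None] * m
    filled = 0
    while filled < m:
        for i, b in enumerate(backends):
            if filled == m:
                break
            # Walk backend i's permutation until a free slot is found.
            while True:
                slot = (offsets[i] + next_idx[i] * skips[i]) % m
                next_idx[i] += 1
                if table[slot] is None:
                    table[slot] = b
                    filled += 1
                    break
    return table

def lookup(table, flow_key: str):
    """Route a flow by hashing it into the lookup table."""
    return table[_h(flow_key, "flow") % len(table)]
```

Because filling proceeds round-robin, no backend ever holds more than one slot above any other, which is the near-perfect balance the paper emphasises.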
    
    <item>
      <title>Gray Failures</title>
      <link>https://www.the-paper-trail.org/post/2020-04-19-gray-failures/</link>
      <pubDate>Sat, 18 Apr 2020 22:04:31 -0700</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2020-04-19-gray-failures/</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;&lt;a href=&#34;https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf&#34;&gt;Gray Failure: The Achilles&amp;rsquo; Heel of Cloud-Scale Systems&lt;/a&gt;
&lt;em&gt;Huang et al., HotOS 2017&lt;/em&gt;  &lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Detecting faults in a large system is a surprisingly hard problem. First you have to decide what kind of thing you want to measure, or &amp;lsquo;observe&amp;rsquo;. Then you have to decide what pattern in that observation constitutes a sufficiently worrying situation (or &amp;lsquo;failure&amp;rsquo;) to require mitigation. Then you have to decide how to mitigate it!&lt;/p&gt;
&lt;p&gt;Complicating this already difficult issue is the fact that the health of your system is in part a matter of perspective. Your service might be working wonderfully from inside your datacenter, where your probes are run, but all of that means nothing to your users who have been trying to get their RPCs through an overwhelmed firewall for the last hour.&lt;/p&gt;
&lt;p&gt;That gap, between what your failure detectors observe, and what clients observe, is the subject of this paper on &amp;lsquo;Gray Failures&amp;rsquo;, which are the failure modes that happen when clients perceive an issue that is not yet detected by your internal systems. This is a good name for an old phenomenon (every failure detector I have built includes client-side mitigations to work around this exact issue).&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Availability in AWS&#39; Physalia</title>
      <link>https://www.the-paper-trail.org/post/2020-04-06-physalia/</link>
      <pubDate>Mon, 06 Apr 2020 22:04:31 -0700</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2020-04-06-physalia/</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;&lt;a href=&#34;https://www.usenix.org/conference/nsdi20/presentation/brooker&#34;&gt;Physalia: Millions of Tiny Databases&lt;/a&gt;  &lt;em&gt;Brooker et al., NSDI 2020&lt;/em&gt;&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Some notes on AWS&amp;rsquo; latest systems publication, which continues and expands their thinking about reducing the effect of failures in very large distributed systems (see &lt;a href=&#34;https://aws.amazon.com/blogs/architecture/shuffle-sharding-massive-and-magical-fault-isolation/&#34;&gt;shuffle sharding&lt;/a&gt; as an earlier and complementary technique for the same kind of problem).&lt;/p&gt;
&lt;p&gt;Physalia is a configuration store for AWS&amp;rsquo; Elastic Block Store (EBS; i.e. network-attached disks). EBS disks are replicated using chain replication, but the configuration of the replication chain needs to be stored somewhere - enter Physalia.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Beating hash tables with trees? The ART-ful radix trie</title>
      <link>https://www.the-paper-trail.org/post/art-paper-notes/</link>
      <pubDate>Sat, 03 Nov 2018 22:04:31 -0700</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/art-paper-notes/</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;&lt;a href=&#34;https://db.in.tum.de/~leis/papers/ART.pdf&#34;&gt;The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases&lt;/a&gt;
&lt;em&gt;Leis et al., ICDE 2013&lt;/em&gt;&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Tries are an unloved third data structure for
building key-value stores and indexes, after search trees (like
&lt;a href=&#34;https://en.wikipedia.org/wiki/B-tree&#34;&gt;B-trees&lt;/a&gt; and &lt;a href=&#34;https://en.wikipedia.org/wiki/Red%E2%80%93black_tree&#34;&gt;red-black
trees&lt;/a&gt;) and hash tables. Yet they have a
number of very appealing properties that make them worthy of consideration - for example, the height
of a trie is independent of the number of keys it contains, and a trie requires no rebalancing when
updated. Weighing against those advantages is the heavy memory cost that vanilla radix tries can
incur, because each node contains a pointer for every possible value of the &amp;lsquo;next&amp;rsquo; character in the
key. With one-byte characters, that&amp;rsquo;s 256 pointers for every node in the tree.&lt;/p&gt;
&lt;p&gt;But the astute reader will feel in their bones that this is naive - there must be more efficient
ways to store a set of pointers, indexed by a fixed size set of keys (the trie&amp;rsquo;s alphabet). Indeed,
there are - several of them, in fact, distinguished by the number of children the node &lt;em&gt;actually&lt;/em&gt;
has, not just how many it might &lt;em&gt;potentially&lt;/em&gt; have.&lt;/p&gt;
&lt;p&gt;This is where the &lt;em&gt;Adaptive Radix Tree&lt;/em&gt; (ART) comes in. In this breezy, easy-to-read paper, the
authors show how to reduce the memory cost of a regular radix trie by &lt;em&gt;adapting&lt;/em&gt; the data structure
used for each node to the number of children that it needs to store. In doing so they show, perhaps
surprisingly, that the amount of space consumed by a single key can be bounded no matter how long
the key is.&lt;/p&gt;</description>
    </item>
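ART adapts among four node types (Node4, Node16, Node48 and Node256 in the paper). A much-simplified sketch with just the two extremes shows the grow-on-overflow idea — the class names mirror the paper's, but the structure is toy Python, not the paper's C++ layout:

```python
class Node4:
    """Compact node: up to 4 (key byte, child) pairs, linear search."""
    CAP = 4

    def __init__(self):
        self.keys, self.children = [], []
        self.value = None                    # payload if a key ends here

    def child(self, b):
        return self.children[self.keys.index(b)] if b in self.keys else None

    def put_child(self, b, child):
        if b in self.keys:                   # replace an existing edge
            self.children[self.keys.index(b)] = child
        else:
            self.keys.append(b)
            self.children.append(child)

    def full(self):
        return len(self.keys) >= self.CAP

class Node256:
    """Large node: one slot per possible byte, O(1) child lookup."""
    def __init__(self):
        self.slots = [None] * 256
        self.value = None

    def child(self, b):
        return self.slots[b]

    def put_child(self, b, child):
        self.slots[b] = child

    def full(self):
        return False

def grow(node):
    """Copy a full Node4 into a Node256 (ART's adaptation step)."""
    big = Node256()
    big.value = node.value
    for b, c in zip(node.keys, node.children):
        big.slots[b] = c
    return big

class ART:
    def __init__(self):
        self.root = Node4()

    def insert(self, key: bytes, value):
        node, parent, pb = self.root, None, None
        for b in key:
            if node.full() and node.child(b) is None:
                node = grow(node)            # adapt node to its fan-out
                if parent is None:
                    self.root = node
                else:
                    parent.put_child(pb, node)
            nxt = node.child(b)
            if nxt is None:
                nxt = Node4()
                node.put_child(b, nxt)
            parent, pb, node = node, b, nxt
        node.value = value

    def search(self, key: bytes):
        node = self.root
        for b in key:
            node = node.child(b)
            if node is None:
                return None
        return node.value
```

The real ART also adds path compression and lazy leaf expansion, which is where the bounded per-key space result comes from; this sketch only shows the node-size adaptation.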
    
    <item>
      <title>Outperforming hash-tables with MICA</title>
      <link>https://www.the-paper-trail.org/post/mica-paper-notes/</link>
      <pubDate>Wed, 26 Sep 2018 12:52:31 -0700</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/mica-paper-notes/</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;&lt;a href=&#34;https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-lim.pdf&#34;&gt;MICA: A Holistic Approach to Fast In-Memory Key-Value Storage&lt;/a&gt; &lt;em&gt;Lim et al., NSDI 2014&lt;/em&gt;&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In this installment we&amp;rsquo;re going to look at a system from NSDI 2014. &lt;strong&gt;MICA&lt;/strong&gt; is another in-memory
key-value store, but in contrast to Masstree it does not support range queries and in much of the
paper it keeps a fixed working set by evicting old items, like a cache. Indeed, the closest
comparison system that you might think of when reading about MICA for the first time is a
humble&amp;hellip; hash table. Is there still room for improvement over such a fundamental data structure?
Read on and find out (including benchmarks!).&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Masstree: A cache-friendly mashup of tries and B-trees</title>
      <link>https://www.the-paper-trail.org/post/masstree-paper-notes/</link>
      <pubDate>Mon, 10 Sep 2018 12:13:31 -0700</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/masstree-paper-notes/</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;&lt;a href=&#34;https://pdos.csail.mit.edu/papers/masstree:eurosys12.pdf&#34;&gt;Cache Craftiness for Fast Multicore Key-Value Storage&lt;/a&gt;
&lt;em&gt;Mao et al., EuroSys 2012&lt;/em&gt; [&lt;a href=&#34;https://github.com/kohler/masstree-beta&#34;&gt;code&lt;/a&gt;]&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2 id=&#34;the-big-idea&#34;&gt;The Big Idea&lt;/h2&gt;
&lt;p&gt;Consider the problem of storing, in memory, millions of &lt;code&gt;(key, value)&lt;/code&gt; pairs, where &lt;code&gt;key&lt;/code&gt; is a
variable-length string. If we just wanted to support point lookup, we&amp;rsquo;d use a hash table. But
assuming we want to support range queries, some kind of tree structure is probably required. One
candidate might be a traditional B+-tree.&lt;/p&gt;
&lt;p&gt;In such a B+-tree, the number of levels of the tree are kept small thanks to the fact that
each node has a high fan-out. However, that means that a large number of keys are packed into a
single node, and so there&amp;rsquo;s still a large number of key comparisons to perform when searching
through the tree.&lt;/p&gt;
&lt;p&gt;This is further exacerbated by variable-length keys (e.g. strings), where the cost of key
comparisons can be quite high. If the keys are really long they can each occupy multiple cache
lines, and so comparing two of them can really mess up your cache locality.&lt;/p&gt;
&lt;p&gt;This paper proposes an efficient tree data structure that relies on splitting variable length keys
into a variable number of fixed-length keys called &lt;em&gt;slices&lt;/em&gt;. As you go down the tree, you compare
the first slice of each key, then the second, then the third and so on, but each comparison has
&lt;em&gt;constant cost&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;For example, think about the string &lt;code&gt;the quick brown fox jumps over the lazy dog&lt;/code&gt;. This string
consists of the following 8-byte slices: &lt;code&gt;the quic&lt;/code&gt;, &lt;code&gt;k brown_&lt;/code&gt;, &lt;code&gt;fox jump&lt;/code&gt;, &lt;code&gt;s over t&lt;/code&gt;, &lt;code&gt;he lazy_&lt;/code&gt;
and finally &lt;code&gt;dog&lt;/code&gt;. To find a string in a tree, you can look for all strings that match the first
slice first, and then look for the second slice only in strings that matched the first slice, and so
on - only comparing a &lt;em&gt;fixed&lt;/em&gt; size subset of the key at any time. This is much more efficient than
comparing long strings to one another over and over again. The trick is to design a structure that
takes advantage of the cache benefits of doing these fixed-size comparisons, without losing a
tradeoff based on the large cardinality of the slice &amp;lsquo;alphabet&amp;rsquo;. Enter the &lt;strong&gt;&lt;em&gt;Masstree&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;</description>
    </item>
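The slicing and narrowing described in the excerpt can be sketched directly — a toy illustration of the idea, not Masstree's actual trie-of-B+-trees layout (which packs each 8-byte slice into an integer so a comparison is a single machine-word operation):

```python
def slices(key: str, width: int = 8):
    """Split a key into fixed-width byte slices, as Masstree does
    with 8-byte chunks."""
    data = key.encode()
    return [data[i:i + width] for i in range(0, len(data), width)]

keys = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "a completely different key",
]
target = keys[1]

# Narrow the candidate set one fixed-size slice at a time: at depth d,
# only keys whose first d slices matched the target's slices remain.
cands = keys
for d, s in enumerate(slices(target)):
    cands = [k for k in cands if d < len(slices(k)) and slices(k)[d] == s]
```

The third key is eliminated at depth 0 and the "lazy dog" key only at depth 5; at no point is a comparison wider than 8 bytes performed.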
    
    <item>
      <title>The CAP FAQ</title>
      <link>https://www.the-paper-trail.org/page/cap-faq/</link>
      <pubDate>Fri, 08 Jun 2018 16:22:58 -0700</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/page/cap-faq/</guid>
      <description>0. What is this document? No subject appears to be more controversial to distributed systems engineers than the oft-quoted, oft-misunderstood CAP theorem. The purpose of this FAQ is to explain what is known about CAP, so as to help those new to the theorem get up to speed quickly, and to settle some common misconceptions or points of disagreement.
Of course, there&amp;rsquo;s every possibility I&amp;rsquo;ve made superficial or completely thorough mistakes here.</description>
    </item>
    
    <item>
      <title>Dist Sys Slack</title>
      <link>https://www.the-paper-trail.org/page/dist-sys-slack/</link>
      <pubDate>Fri, 08 Jun 2018 11:51:35 -0700</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/page/dist-sys-slack/</guid>
      <description>Want to chat about distributed systems and databases? Come join over 2000 like-minded individuals at the dist-sys slack!
Click here for an invite. </description>
    </item>
    
    <item>
      <title>Reading List</title>
      <link>https://www.the-paper-trail.org/page/reading-list/</link>
      <pubDate>Thu, 07 Jun 2018 14:49:07 -0700</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/page/reading-list/</guid>
      <description>Distributed Systems   Service Fabric: A Distributed Platform for Building Microservices in the Cloud - Kakivaya et al., EuroSys 2018
  [notes] Gray Failure: The Achilles’ Heel of Cloud-Scale Systems - Huang et al., HotOS 2017
  Cache-aware load balancing of data center applications - Archer et al., VLDB 2019
  Slicer: Auto-Sharding for Datacenter Applications - Adya et al., OSDI 2016
  [notes] Maglev: A Fast and Reliable Software Network Load Balancer - Eisenbud et.</description>
    </item>
    
    <item>
      <title>Exactly-once or not, atomic broadcast is still impossible in Kafka - or anywhere</title>
      <link>https://www.the-paper-trail.org/post/2017-07-28-exactly-not-atomic-broadcast-still-impossible-kafka/</link>
      <pubDate>Fri, 28 Jul 2017 16:23:38 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2017-07-28-exactly-not-atomic-broadcast-still-impossible-kafka/</guid>
      <description>&lt;h5 id=&#34;intro&#34;&gt;Intro&lt;/h5&gt;
&lt;p&gt;I read an &lt;a href=&#34;https://t.co/xrA4IROUue&#34;&gt;article recently by Jay Kreps&lt;/a&gt; about a feature for delivering
messages &amp;lsquo;exactly-once&amp;rsquo; within the Kafka framework. Everyone&amp;rsquo;s excited, and for good reason. But
there&amp;rsquo;s been a bit of a side story about what exactly &amp;lsquo;exactly-once&amp;rsquo; means, and what Kafka can
actually do.&lt;/p&gt;
&lt;p&gt;In the article, Jay identifies the safety and liveness properties of &lt;a href=&#34;https://en.wikipedia.org/wiki/Atomic_broadcast&#34;&gt;atomic
broadcast&lt;/a&gt; as a pretty good definition for the set
of properties that Kafka is going after with their new exactly-once feature, and then starts to
address claims by naysayers that atomic broadcast is impossible.&lt;/p&gt;
&lt;p&gt;For this note, I&amp;rsquo;m &lt;em&gt;not&lt;/em&gt; going to address whether or not exactly-once is an implementation of atomic
broadcast. I also believe that exactly-once is a powerful feature that&amp;rsquo;s been impressively realised
by Confluent and the Kafka community; nothing here is a criticism of that effort or the feature
itself. But the article makes some claims about impossibility that are, at best, a bit shaky - and,
well, impossibility&amp;rsquo;s kind of my jam. Jay posted his article with a
&lt;a href=&#34;https://twitter.com/jaykreps/status/881563991742349313&#34;&gt;tweet&lt;/a&gt; saying he couldn&amp;rsquo;t &amp;lsquo;resist a good
argument&amp;rsquo;. I&amp;rsquo;m responding in that spirit.&lt;/p&gt;
&lt;p&gt;In particular, the article makes the claim that atomic broadcast is &amp;lsquo;solvable&amp;rsquo; (and later that
consensus is as well&amp;hellip;), which is wrong. What follows is why, and why that matters.&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;This deserves a response: I think the conclusions are right but the imposs. arguments aren&amp;#39;t. But it&amp;#39;s 8pm in England and I&amp;#39;m in the pub. &lt;a href=&#34;https://t.co/akmVv9rhW7&#34;&gt;https://t.co/akmVv9rhW7&lt;/a&gt;&lt;/p&gt;&amp;mdash; Henry Robinson (@HenryR) &lt;a href=&#34;https://twitter.com/HenryR/status/881591741966569472?ref_src=twsrc%5Etfw&#34;&gt;July 2, 2017&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;


&lt;p&gt;I have since left the pub. So let&amp;rsquo;s begin.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Make any algorithm lock-free with this one crazy trick</title>
      <link>https://www.the-paper-trail.org/post/2016-05-25-make-any-algorithm-lock-free-with-this-one-crazy-trick/</link>
      <pubDate>Wed, 25 May 2016 22:51:03 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2016-05-25-make-any-algorithm-lock-free-with-this-one-crazy-trick/</guid>
      <description>&lt;p&gt;Lock-free algorithms often operate by having several versions of a data structure in use at one time. The general pattern is that you can prepare an update to a data structure, and then use a machine primitive to atomically install the update by changing a pointer. This means that all subsequent readers will follow the pointer to its new location - for example, to a new node in a linked-list - but this pattern can’t do anything about readers that have already followed the old pointer value, and are traversing the previous version of the data structure.&lt;/p&gt;</description>
    </item>
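The prepare-then-atomically-install pattern the excerpt describes looks like this in a toy lock-free stack push. The `Cell` class here is a stand-in for the hardware compare-and-swap primitive that real implementations use (e.g. `std::atomic` in C++ or `AtomicReference` in Java); its internal lock only simulates atomicity for illustration:

```python
import threading

class Cell:
    """Toy atomic cell: compare_and_swap stands in for the hardware CAS
    instruction that lock-free algorithms rely on."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()   # simulates atomicity only

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Install `new` only if the cell still holds `expected`."""
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False

class Node:
    def __init__(self, value, nxt=None):
        self.value, self.next = value, nxt

def push(head: Cell, value):
    """Lock-free stack push: prepare a new node off to the side, then
    atomically swing the head pointer; retry if a racer won."""
    while True:
        old = head.load()
        node = Node(value, old)                 # prepare the new version
        if head.compare_and_swap(old, node):    # atomically install it
            return
        # CAS failed: another thread moved head; loop and retry.
```

Note the problem the excerpt raises: a reader that loaded the old head before the CAS is still traversing the previous version, which is exactly why reclaiming old nodes safely is the hard part.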
    
    <item>
      <title>Distributed systems theory for the distributed systems engineer</title>
      <link>https://www.the-paper-trail.org/post/2014-08-09-distributed-systems-theory-for-the-distributed-systems-engineer/</link>
      <pubDate>Sat, 09 Aug 2014 20:45:38 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2014-08-09-distributed-systems-theory-for-the-distributed-systems-engineer/</guid>
      <description>&lt;p&gt;&lt;em&gt;Updated June 2018 with content on atomic broadcast, gossip, chain replication and more&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Gwen Shapira, who at the time was an engineer at Cloudera and now is spreading the Kafka gospel, asked a question on Twitter that got me thinking.&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;I need to improve my proficiency in distributed systems theory. Where do I start? Any recommended books?&lt;/p&gt;&amp;mdash; Gwen (Chen) Shapira (@gwenshap) &lt;a href=&#34;https://twitter.com/gwenshap/status/497203248332165121?ref_src=twsrc%5Etfw&#34;&gt;August 7, 2014&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;


&lt;p&gt;My response of old might have been &amp;ldquo;well, here&amp;rsquo;s the FLP paper, and here&amp;rsquo;s the Paxos paper, and here&amp;rsquo;s the Byzantine generals paper&amp;hellip;&amp;rdquo;, and I&amp;rsquo;d have prescribed a laundry list of primary source material which would have taken at least six months to get through if you rushed. But I&amp;rsquo;ve come to thinking that recommending a ton of theoretical papers is often precisely the wrong way to go about learning distributed systems theory (unless you are in a PhD program). Papers are usually deep and complex, and require both serious study and &lt;em&gt;significant experience&lt;/em&gt; to glean their important contributions and to place them in context. What good is requiring that level of expertise of engineers?&lt;/p&gt;
&lt;p&gt;And yet, unfortunately, there&amp;rsquo;s a paucity of good &amp;lsquo;bridge&amp;rsquo; material that summarises, distills and contextualises the important results and ideas in distributed systems theory; particularly material that does so without condescending. Considering that gap led me to another interesting question:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;What distributed systems theory should a distributed systems engineer know?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;A little theory is, in this case, not such a dangerous thing. So I tried to come up with a list of what I consider the basic concepts that are applicable to my every-day job as a distributed systems engineer. Let me know what you think I missed!&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>The Elephant was a Trojan Horse: On the Death of Map-Reduce at Google</title>
      <link>https://www.the-paper-trail.org/post/2014-06-25-the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google/</link>
      <pubDate>Wed, 25 Jun 2014 17:49:39 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2014-06-25-the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google/</guid>
      <description>&lt;p&gt;&lt;em&gt;Note: this is a personal blog post, and doesn&amp;rsquo;t reflect the views of my employers at Cloudera&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Map-Reduce is on its way out. But we shouldn&amp;rsquo;t measure its importance in the number of bytes it crunches, but the fundamental shift in data processing architectures it helped popularise.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This morning, at their I/O Conference, Google revealed that they’re &lt;a href=&#34;http://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreduce-favor-new-hyper-scale-analytics-system/&#34;&gt;not using Map-Reduce to process data internally at all any more&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We shouldn’t be surprised. The writing has been on the wall for Map-Reduce for some time. The truth is that Map-Reduce as a processing paradigm continues to be severely restrictive, and is no more than a subset of richer processing systems.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Paper notes: MemC3, a better Memcached</title>
      <link>https://www.the-paper-trail.org/post/2014-06-18-paper-notes-memc3-a-better-memcached/</link>
      <pubDate>Wed, 18 Jun 2014 14:36:51 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2014-06-18-paper-notes-memc3-a-better-memcached/</guid>
      <description>&lt;h2 id=&#34;memc3-compact-and-concurrent-memcache-with-dumber-caching-and-smarter-hashing&#34;&gt;MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Fan, Andersen and Kaminsky, &lt;a href=&#34;https://www.usenix.org/conference/nsdi13/&#34;&gt;NSDI 2013&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&#34;the-big-idea&#34;&gt;The big idea&lt;/h3&gt;
&lt;p&gt;This is a paper about choosing your data structures and algorithms carefully. By paying careful attention to the workload and functional requirements, the authors reimplement &lt;a href=&#34;http://memcached.org/&#34;&gt;memcached&lt;/a&gt; to achieve a) better concurrency and b) better space efficiency. Specifically, they introduce a variant of cuckoo hashing that is highly amenable to concurrent workloads, and integrate the venerable CLOCK cache eviction algorithm with the hash table for space-efficient approximate LRU.&lt;/p&gt;</description>
    </item>
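Basic cuckoo hashing, the starting point that MemC3 refines, can be sketched as follows — a toy single-threaded version, not MemC3's optimistic concurrent variant, with MD5-derived bucket choices standing in for real hash functions:

```python
import hashlib

def _h(seed: str, key: str) -> int:
    """Deterministic toy hash; real tables use faster functions."""
    return int(hashlib.md5((seed + key).encode()).hexdigest(), 16)

class CuckooHash:
    """Minimal two-choice cuckoo hash table: every key has exactly two
    candidate slots; inserting into a full slot 'kicks' the occupant
    to its alternate slot, possibly cascading."""
    def __init__(self, size=256):
        self.size = size
        self.slots = [None] * size          # (key, value) pairs or None

    def _buckets(self, key):
        return _h("a", key) % self.size, _h("b", key) % self.size

    def get(self, key):
        # A lookup inspects at most two slots - the key property
        # that makes reads fast and cache-friendly.
        for b in self._buckets(key):
            if self.slots[b] is not None and self.slots[b][0] == key:
                return self.slots[b][1]
        return None

    def put(self, key, value, max_kicks=32):
        b1, b2 = self._buckets(key)
        for b in (b1, b2):
            if self.slots[b] is None or self.slots[b][0] == key:
                self.slots[b] = (key, value)
                return True
        # Both candidates occupied: evict an occupant, re-home it in its
        # alternate slot, and repeat; a real table would resize on failure.
        b, item = b1, (key, value)
        for _ in range(max_kicks):
            item, self.slots[b] = self.slots[b], item   # swap in, evict
            alts = [x for x in self._buckets(item[0]) if x != b]
            b = alts[0] if alts else b
            if self.slots[b] is None:
                self.slots[b] = item
                return True
        return False  # table too full
```

MemC3's contribution is on top of this: making the kick sequence safe for many concurrent readers (via optimistic versioning) and a single writer, plus tagging to avoid touching keys on most probes.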
    
    <item>
      <title>Paper notes: Anti-Caching</title>
      <link>https://www.the-paper-trail.org/post/2014-06-06-paper-notes-anti-caching/</link>
      <pubDate>Fri, 06 Jun 2014 11:03:39 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2014-06-06-paper-notes-anti-caching/</guid>
      <description>&lt;h2 id=&#34;anti-caching-a-new-approach-to-database-management-system-architecture&#34;&gt;Anti-Caching: A New Approach to Database Management System Architecture&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;DeBrabant et al., VLDB 2013&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&#34;the-big-idea&#34;&gt;The big idea&lt;/h3&gt;
&lt;p&gt;Traditional databases typically rely on the OS page cache to bring hot tuples into memory and keep them there. This suffers from a number of problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No control over granularity of caching or eviction (so keeping a tuple in memory might keep all the tuples in its page as well, even though there&amp;rsquo;s not necessarily a usage correlation between them)&lt;/li&gt;
&lt;li&gt;No control over when fetches are performed (fetches are typically slow, and transactions may hold onto locks or latches while the access is being made)&lt;/li&gt;
&lt;li&gt;Duplication of resources - tuples can occupy both disk blocks and memory pages.&lt;/li&gt;
&lt;/ul&gt;</description>
    </item>
    
    <item>
      <title>Paper notes: Stream Processing at Google with Millwheel</title>
      <link>https://www.the-paper-trail.org/post/2014-06-04-paper-notes-stream-processing-at-google-with-millwheel/</link>
      <pubDate>Wed, 04 Jun 2014 12:07:04 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2014-06-04-paper-notes-stream-processing-at-google-with-millwheel/</guid>
      <description>&lt;h2 id=&#34;millwheel-fault-tolerant-stream-processing-at-internet-scale&#34;&gt;Millwheel: Fault-Tolerant Stream Processing at Internet Scale&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Akidau et al., VLDB 2013&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&#34;the-big-idea&#34;&gt;The big idea&lt;/h3&gt;
&lt;p&gt;Streaming computations at scale are nothing new. Millwheel is a standard DAG stream processor, but
one that runs at &amp;lsquo;Google&amp;rsquo; scale. This paper really answers the following questions: what guarantees
should be made about delivery and fault-tolerance to support most common use cases cheaply? What
optimisations become available if you choose these guarantees carefully?&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Paper notes: DB2 with BLU Acceleration</title>
      <link>https://www.the-paper-trail.org/post/2014-05-14-paper-notes-db2-with-blu-acceleration/</link>
      <pubDate>Wed, 14 May 2014 18:02:15 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2014-05-14-paper-notes-db2-with-blu-acceleration/</guid>
      <description>&lt;h2 id=&#34;db2-with-blu-acceleration-so-much-more-than-just-a-column-store&#34;&gt;DB2 with BLU Acceleration: So Much More than Just a Column Store&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Raman et al., VLDB 2013&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&#34;the-big-idea&#34;&gt;The big idea&lt;/h3&gt;
&lt;p&gt;IBM&amp;rsquo;s venerable DB2 technology was based on traditional row-based technology. By moving to a columnar execution engine, and crucially then by taking full advantage of the optimisations that columnar formats allow, the &amp;lsquo;BLU Acceleration&amp;rsquo; project was able to improve read-mostly BI workloads by a 10 to 50 times speed-up.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Étale cohomology</title>
      <link>https://www.the-paper-trail.org/post/2014-03-04-etale-cohomology/</link>
      <pubDate>Tue, 04 Mar 2014 22:22:25 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2014-03-04-etale-cohomology/</guid>
      <description>&lt;p&gt;&lt;em&gt;The second in an extremely irregular series of posts made on behalf of my father, who has spent much of his retirement so far doing very hard mathematics. What is attached here is the essay he wrote for the &lt;a href=&#34;http://en.wikipedia.org/wiki/PartIII_of_the_Mathematical_Tripos&#34; title=&#34;Cambridge Mathematics Part III&#34;&gt;Part III of the Cambridge Mathematical Tripos&lt;/a&gt;, a one year taught course. The subject is &lt;a href=&#34;http://en.wikipedia.org/wiki/%C3%89tale_cohomology&#34;&gt;étale cohomology&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>ByteArrayOutputStream is really, really slow sometimes in JDK6</title>
      <link>https://www.the-paper-trail.org/post/2014-01-10-535/</link>
      <pubDate>Fri, 10 Jan 2014 14:57:41 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2014-01-10-535/</guid>
      <description>&lt;p&gt;TLDR: Yesterday I &lt;a href=&#34;https://twitter.com/HenryR/status/421415424807297024&#34; title=&#34;Twitter&#34;&gt;mentioned on Twitter&lt;/a&gt; that I&amp;rsquo;d found a bad performance problem when writing to a large &lt;a href=&#34;http://docs.oracle.com/javase/6/docs/api/java/io/ByteArrayOutputStream.html&#34; title=&#34;ByteArrayOutputStream Javadoc&#34;&gt;&lt;code&gt;ByteArrayOutputStream&lt;/code&gt;&lt;/a&gt; in Java. After some digging, it appears to be the case that there&amp;rsquo;s a bad bug in JDK6 that doesn&amp;rsquo;t affect correctness, but does cause performance to nosedive when a &lt;code&gt;ByteArrayOutputStream&lt;/code&gt; gets large. This post explains why.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>On Raft, briefly</title>
      <link>https://www.the-paper-trail.org/post/2013-10-31-on-raft-briefly/</link>
      <pubDate>Thu, 31 Oct 2013 12:03:51 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2013-10-31-on-raft-briefly/</guid>
      <description>Raft is a new-ish consensus implementation whose great benefit, to my mind, is its applicability to real systems. We briefly discussed it internally at Cloudera, and I thought I&amp;rsquo;d share what I contributed, below. There&amp;rsquo;s an underlying theme here regarding the role of distributed systems research in practitioners&amp;rsquo; daily work, and how the act of building a distributed system has not yet been sufficiently well commoditised to render a familiarity with the original research unnecessary.</description>
    </item>
    
    <item>
      <title>Some miscellanea</title>
      <link>https://www.the-paper-trail.org/post/2013-05-19-some-miscellanea/</link>
      <pubDate>Sun, 19 May 2013 22:39:57 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2013-05-19-some-miscellanea/</guid>
      <description>CAP FAQ I wrote an FAQ on The CAP Theorem. The aim is to definitively settle some of the common misconceptions around CAP so as to help prevent its invocation in useless places. If someone says they got around CAP, refer them to the FAQ. It should be a pretty simple introduction to the theorem as well. I think that CAP itself is a pretty uninteresting result, but it does at least shine a light on tradeoffs implicit in distributed systems.</description>
    </item>
    
    <item>
      <title>Columnar Storage</title>
      <link>https://www.the-paper-trail.org/post/2013-01-30-columnar-storage/</link>
      <pubDate>Wed, 30 Jan 2013 19:46:31 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2013-01-30-columnar-storage/</guid>
      <description>&lt;p&gt;You&amp;rsquo;re going to hear a lot about columnar storage formats in the next few months, as a variety of distributed execution engines are beginning to consider them for their IO efficiency, and the optimisations that they open up for query execution. In this post, I&amp;rsquo;ll explain why we care so much about IO efficiency and show how columnar storage - which is a simple idea - can drastically improve performance for certain workloads.&lt;/p&gt;
&lt;p&gt;Caveat: This is a personal, general research summary post, and as usual doesn&amp;rsquo;t necessarily reflect our thinking at Cloudera about columnar storage.&lt;/p&gt;
&lt;p&gt;Disks are still the major bottleneck in query execution over large datasets. Even a machine with twelve disks running in parallel (for an aggregate bandwidth north of 1GB/s) can&amp;rsquo;t keep all the cores busy; running a query against memory-cached data can achieve tens of GB/s of throughput. IO bandwidth matters. Therefore, the best thing an engineer can usually do to improve the performance of disk-based query engines (like RDBMSs and Impala) is to improve the performance of reading bytes from disk. This can mean decreasing latency (for small queries, where the time to find the data to read might dominate), but most often it means improving the effective throughput of reads from disk.&lt;/p&gt;
&lt;p&gt;The traditional way to improve disk bandwidth has been to wait, and allow disks to get faster. However, disks are not getting faster very quickly (having settled at roughly 100 MB/s, with ~12 disks per server), and SSDs can&amp;rsquo;t yet achieve the storage density to be directly competitive with HDDs on a per-server basis.&lt;/p&gt;
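To make the bandwidth arithmetic concrete, here is a back-of-the-envelope sketch (my own illustration, using the per-disk figures above; the table size and column fraction are invented for the example) of how scan time depends on how many of the stored bytes a query actually needs:

```python
# Back-of-the-envelope scan-time estimate (illustrative figures only).
DISK_BW_MBPS = 100                       # ~100 MB/s per spinning disk, as above
NUM_DISKS = 12                           # ~12 disks per server, as above
AGG_BW_MBPS = DISK_BW_MBPS * NUM_DISKS   # ~1.2 GB/s aggregate bandwidth

def scan_seconds(table_mb, useful_fraction):
    """Time to scan the bytes a layout forces us to read.

    A row-oriented layout makes a query touching one column read whole
    records anyway (useful_fraction of 1.0); a columnar layout reads
    only the needed column's bytes.
    """
    return (table_mb * useful_fraction) / AGG_BW_MBPS

table_mb = 100_000  # a hypothetical 100 GB table
print(scan_seconds(table_mb, 1.0))  # row layout: read everything, ~83s
print(scan_seconds(table_mb, 0.1))  # query touches ~10% of bytes, ~8.3s
```

Nothing about the IO subsystem changed between the two calls; only the fraction of bytes read that were useful.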
&lt;p&gt;The other way to improve disk performance is to maximise the ratio of &amp;lsquo;useful&amp;rsquo; bytes read to total bytes read. The idea is not to read more data than is absolutely necessary to serve a query, so the useful bandwidth realised is increased without actually improving the performance of the IO subsystem. Enter &lt;em&gt;columnar storage&lt;/em&gt;, a principle for file format design that aims to do exactly that for query engines that deal with record-based data.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Cloudera Impala</title>
      <link>https://www.the-paper-trail.org/post/2012-11-04-cloudera-impala/</link>
      <pubDate>Sun, 04 Nov 2012 18:12:12 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2012-11-04-cloudera-impala/</guid>
      <description>&lt;p&gt;&lt;em&gt;If you have a strong background in either databases or distributed systems, and fancy working on such an exciting technology, &lt;a href=&#34;mailto:henry@cloudera.com&#34;&gt;send me a note!&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s great to finally be able to say something about what I&amp;rsquo;ve been working on at &lt;a href=&#34;http://www.cloudera.com&#34;&gt;Cloudera&lt;/a&gt; for nearly a year. At StrataConf / Hadoop World in New York a couple of weeks ago we announced &lt;a href=&#34;http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/&#34;&gt;Cloudera Impala&lt;/a&gt;. Impala is a distributed query execution engine that understands a subset of SQL, and critically runs over HDFS and HBase as storage managers. It&amp;rsquo;s very similar in functionality to Apache Hive, but it is much, much, much (anecdotally up to 100x) faster.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>On some subtleties of Paxos</title>
      <link>https://www.the-paper-trail.org/post/2012-11-03-on-some-subtleties-of-paxos/</link>
      <pubDate>Sat, 03 Nov 2012 19:02:22 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2012-11-03-on-some-subtleties-of-paxos/</guid>
      <description>&lt;p&gt;There&amp;rsquo;s one particular aspect of the Paxos protocol that gives readers of this blog - and for some time, me! - some difficulty. This short post tries to clear up some confusion on a part of the protocol that is poorly explained in pretty much every major description.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Links</title>
      <link>https://www.the-paper-trail.org/post/2012-08-06-links/</link>
      <pubDate>Mon, 06 Aug 2012 14:05:50 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2012-08-06-links/</guid>
      <description>Reasoning about Knowledge
  Toward a Cloud Computing Research Agenda (2009) -
 &amp;ldquo;One of the LADIS attendees commented at some point that Byzantine Consensus could be used to improve Chubby, making it tolerant of faults that could disrupt it as currently implemented. But for our keynote speakers, enhancing Chubby to tolerate such faults turns out to be of purely academic interest.&amp;rdquo;
   Low-level data structures -  The llds general working thesis is: for large memory applications, virtual memory layers can hurt application performance due to increased memory latency when dealing with large data structures.</description>
    </item>
    
    <item>
      <title>Something a bit different: translations of classic mathematical texts (!)</title>
      <link>https://www.the-paper-trail.org/post/2012-08-04-something-a-bit-different-translations-of-classic-mathematical-texts/</link>
      <pubDate>Sat, 04 Aug 2012 14:50:18 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2012-08-04-something-a-bit-different-translations-of-classic-mathematical-texts/</guid>
      <description>During his retirement, my father has been able to spend much time indulging his love of mathematics. This included, amongst other impressive endeavours, attending Cambridge at a more advanced age than average to take (and pass!) the Part III of the Mathematical Tripos, often considered one of the hardest taught courses in maths in the world.
Since then, he has hardly been idle, and has recently been undertaking a translation of a classic work in modern algebra by Dedekind and Weber from its original 100+ pages of German into English.</description>
    </item>
    
    <item>
      <title>EuroSys 2012 blog notes</title>
      <link>https://www.the-paper-trail.org/post/2012-04-15-eurosys-2012-blog-notes/</link>
      <pubDate>Sun, 15 Apr 2012 18:20:33 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2012-04-15-eurosys-2012-blog-notes/</guid>
      <description>EuroSys 2012 was last week - one of the premier European systems conferences. Over at the Cambridge System Research Group&amp;rsquo;s blog, various people from the group have written notes on the papers presented. They&amp;rsquo;re very well-written summaries, and worth checking out for an overview of the research presented.
 Day 1 Day 2 Day 3  </description>
    </item>
    
    <item>
      <title>FLP and CAP aren&#39;t the same thing</title>
      <link>https://www.the-paper-trail.org/post/2012-03-25-flp-and-cap-arent-the-same-thing/</link>
      <pubDate>Sun, 25 Mar 2012 20:55:34 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2012-03-25-flp-and-cap-arent-the-same-thing/</guid>
      <description>&lt;p&gt;An &lt;a href=&#34;http://www.quora.com/Distributed-Systems/Are-the-FLP-impossibility-result-and-Brewers-CAP-theorem-basically-equivalent&#34;&gt;interesting question&lt;/a&gt; came up on &lt;a href=&#34;http://www.quora.com&#34;&gt;Quora&lt;/a&gt; this last week. Roughly speaking, the question asked how, if at all, the &lt;a href=&#34;https://the-paper-trail.org/blog/?p=49&#34;&gt;FLP&lt;/a&gt; theorem and the &lt;a href=&#34;https://the-paper-trail.org/blog/?p=290&#34;&gt;CAP theorem&lt;/a&gt; were related. I&amp;rsquo;d thought idly about exactly the same question myself before. Both theorems concern the impossibility of solving fairly similar fundamental distributed systems problems in what appear to be fairly similar distributed systems settings. The CAP theorem gets all the airtime, but FLP to me is a more beautiful result. Wouldn&amp;rsquo;t it be fascinating if both theorems turned out to be equivalent; that is, effectively restatements of each other?&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Should I take a systems reading course?</title>
      <link>https://www.the-paper-trail.org/post/2012-03-09-should-i-take-a-systems-reading-course/</link>
      <pubDate>Fri, 09 Mar 2012 18:05:13 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2012-03-09-should-i-take-a-systems-reading-course/</guid>
      <description>&lt;p&gt;A smart student asked me a couple of days ago whether I thought taking a 2xx-level reading course in operating systems was a good idea. The student, understandably, was unsure whether talking about these systems was as valuable as actually building them, and also whether, since his primary interest is in &amp;lsquo;distributed&amp;rsquo; systems, he stood to benefit from a deep understanding of things like virtual memory.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>How consistent is eventual consistency?</title>
      <link>https://www.the-paper-trail.org/post/2012-01-04-how-consistent-is-eventual-consistency/</link>
      <pubDate>Wed, 04 Jan 2012 15:22:05 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2012-01-04-how-consistent-is-eventual-consistency/</guid>
      <description>This page, from the &amp;lsquo;PBS&amp;rsquo; team at Berkeley&amp;rsquo;s AMPLab, is quite interesting. It allows you to tweak the parameters of a Dynamo-style system, then, by running a series of Monte Carlo simulations, estimates the likelihood that a read issued after a write will return stale data.
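To give a flavour of what such a simulation does, here is a toy sketch of my own (not the PBS model itself; the delay distribution and its mean are invented for illustration): draw a propagation delay for each replica, then ask how often a read that contacts R of the N replicas some time after a write finds only stale copies.

```python
import random

def stale_read_probability(n=3, r=1, t_ms=10.0, trials=10_000, seed=42):
    """Toy Dynamo-style staleness estimate (illustrative, not PBS).

    A write reaches each of n replicas after an exponentially
    distributed delay (mean 5 ms here, an arbitrary choice); a read
    issued t_ms later contacts r random replicas and is stale iff
    none of them has seen the write yet.
    """
    rng = random.Random(seed)
    stale = 0
    for _ in range(trials):
        delays = [rng.expovariate(1 / 5.0) for _ in range(n)]
        contacted = rng.sample(range(n), r)
        if all(delays[i] > t_ms for i in contacted):
            stale += 1
    return stale / trials

print(stale_read_probability(t_ms=1.0))   # reads soon after a write: often stale
print(stale_read_probability(t_ms=50.0))  # waiting longer: rarely stale
```

Even this crude version shows the shape of the result: staleness probability decays quickly with the time elapsed since the write, and with larger R.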
Since the Dynamo paper appeared and really popularised eventual consistency, the debate has focused on a fairly binary treatment of its merits. Either you can&amp;rsquo;t afford to be wrong, ever, or it&amp;rsquo;s ok to have your reads be stale for a potentially unbounded amount of time.</description>
    </item>
    
    <item>
      <title>STM: Not (much more than) a research toy?</title>
      <link>https://www.the-paper-trail.org/post/2011-04-21-stm-not-much-more-than-a-research-toy/</link>
      <pubDate>Thu, 21 Apr 2011 13:02:38 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2011-04-21-stm-not-much-more-than-a-research-toy/</guid>
      <description>It&amp;rsquo;s a sign of how down-trodden the Software Transactional Memory (STM) effort must have become that the article (sorry, ACM subscription required) published in a recent CACM might have been just as correctly called &amp;ldquo;STM: Not as bad as the worst possible case&amp;rdquo;. The authors present a series of experiments that demonstrate that highly concurrent STM code beats sequential, single-threaded code. You&amp;rsquo;d hope that this had long ago become a given, but all this demonstrates is that, hey, STM allows some parallelism.</description>
    </item>
    
    <item>
      <title>The Theorem That Will Not Go Away</title>
      <link>https://www.the-paper-trail.org/post/2010-10-07-the-theorem-that-will-not-go-away/</link>
      <pubDate>Thu, 07 Oct 2010 23:28:55 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2010-10-07-the-theorem-that-will-not-go-away/</guid>
      <description>The CAP theorem gets another airing.
I think the article makes a point worth making again, and makes it fairly well - that CAP is really about P =&amp;gt; ~(C &amp;amp; A). A couple of things I want to call out though, after a rollicking discussion on Hacker News.
 &amp;ldquo;For a distributed (i.e., multi-node) system to not require partition-tolerance it would have to run on a network which is guaranteed to never drop messages (or even deliver them late) and whose nodes are guaranteed to never die.</description>
    </item>
    
    <item>
      <title>CAP confusion: Problems with Partition Tolerance</title>
      <link>https://www.the-paper-trail.org/post/2010-04-27-cap-confusion-problems-with-partition-tolerance/</link>
      <pubDate>Tue, 27 Apr 2010 10:44:29 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2010-04-27-cap-confusion-problems-with-partition-tolerance/</guid>
      <description>Over on the Cloudera blog I&amp;rsquo;ve written an article that should be of interest to readers of this blog.
I&amp;rsquo;m no great fan of the ubiquity of the CAP theorem - it&amp;rsquo;s a solid impossibility result which appeals to the theorist in me, but it doesn&amp;rsquo;t capture every fundamental tension in a distributed system. For example: we make our systems distributed across more than one machine usually for reasons of performance and to eliminate a single point of failure.</description>
    </item>
    
    <item>
      <title>Apache ZooKeeper is looking for Google Summer of Code applicants</title>
      <link>https://www.the-paper-trail.org/post/2010-03-24-apache-zookeeper-is-looking-for-google-summer-of-code-applicants/</link>
      <pubDate>Wed, 24 Mar 2010 10:18:02 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2010-03-24-apache-zookeeper-is-looking-for-google-summer-of-code-applicants/</guid>
      <description>Students! Over at Apache ZooKeeper we&amp;rsquo;re looking for great students with a strong interest in distributed systems to work with us over the summer as part of Google&amp;rsquo;s Summer of Code, 2010.
Summer of Code is a great program - providing stipends to students and more importantly connecting them with mentors in open source projects. ZooKeeper has a number of interesting projects to get started on.
ZooKeeper is a distributed coordination platform on which you can build the distributed equivalents of many traditional concurrent primitives like locks, queues and barriers.</description>
    </item>
    
    <item>
      <title>GFS Retrospective in ACM Queue</title>
      <link>https://www.the-paper-trail.org/post/2009-08-12-gfs-retrospective-in-acm-queue/</link>
      <pubDate>Wed, 12 Aug 2009 21:01:11 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-08-12-gfs-retrospective-in-acm-queue/</guid>
      <description>This is a really great article. Sean Quinlan talks very openly and critically about the design of the Google File System given ten years of use (ten years!).
What&amp;rsquo;s interesting is that the general sentiment seems to be that the concessions GFS made for performance and simplicity (single master, loose consistency model) have in hindsight turned out to be net bad decisions, although they probably weren&amp;rsquo;t at the time.
There are scaling issues with GFS - the well known many-small-files problem that also plagues HDFS, and a similar huge-files problem.</description>
    </item>
    
    <item>
      <title>SOSP 2009 Program Available</title>
      <link>https://www.the-paper-trail.org/post/2009-06-29-sosp-2009-program-available/</link>
      <pubDate>Mon, 29 Jun 2009 14:49:17 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-06-29-sosp-2009-program-available/</guid>
      <description>The accepted papers for SOSP 2009 are here. As ever, some excellent looking papers. If you search for the titles you can often turn up drafts or even the submitted versions.
The best looking sessions to me are &amp;lsquo;scalability&amp;rsquo; and &amp;lsquo;clusters&amp;rsquo;, but there&amp;rsquo;s at least one great looking title in every session. I&amp;rsquo;ll start posting some reviews once I find some bandwidth (and have finished the computation theory series - next one on its way).</description>
    </item>
    
    <item>
      <title>In Which I Prove Employable</title>
      <link>https://www.the-paper-trail.org/post/2009-04-09-in-which-i-prove-employable/</link>
      <pubDate>Thu, 09 Apr 2009 08:51:54 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-04-09-in-which-i-prove-employable/</guid>
      <description>Although I try and keep personal information to a relative minimum on this blog, here&amp;rsquo;s some news that&amp;rsquo;s relevant. Recently I accepted an offer to start work at Cloudera, a young company in the San Francisco area. Initially I&amp;rsquo;ll be working from the UK, with a view to a permanent move out to California when timing and visas allow.
Hadoop is Cloudera&amp;rsquo;s business. Hadoop is an open-source implementation of Google&amp;rsquo;s MapReduce. Cloudera provides support for Hadoop, and their own fully supported distribution of the Hadoop toolset.</description>
    </item>
    
    <item>
      <title>Barbara Liskov&#39;s Turing Award, and Byzantine Fault Tolerance</title>
      <link>https://www.the-paper-trail.org/post/2009-03-30-barbara-liskovs-turing-award-and-byzantine-fault-tolerance/</link>
      <pubDate>Mon, 30 Mar 2009 14:53:08 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-03-30-barbara-liskovs-turing-award-and-byzantine-fault-tolerance/</guid>
      <description>&lt;p&gt;&lt;a href=&#34;http://www.pmg.csail.mit.edu/~liskov/&#34;&gt;Barbara Liskov&lt;/a&gt; has just been announced as the recipient of the &lt;a href=&#34;http://www.acm.org/press-room/news-releases/turing-award-08/&#34;&gt;2008 Turing Award&lt;/a&gt;, which is one of the most important prizes in computer science, and can be thought of as our field&amp;rsquo;s equivalent to the various Nobel Prizes. Professor Liskov is a worthy recipient of the award, even if judged alone by her &lt;a href=&#34;http://awards.acm.org/citation.cfm?id=1108679&amp;amp;srt=all&amp;amp;aw=140&amp;amp;ao=AMTURING&#34;&gt;citation&lt;/a&gt;, which lists a number of the important contributions she has made to operating systems, programming languages and distributed systems.&lt;/p&gt;
&lt;p&gt;Professor Liskov seems to be particularly well known for the &lt;a href=&#34;http://en.wikipedia.org/wiki/Liskov_substitution_principle&#34;&gt;Liskov substitution principle&lt;/a&gt; which says that some property of a supertype ought to hold of its subtypes. I&amp;rsquo;m not in any position to speak as to the importance of this contribution. However, her more recent work has been regarding the tolerance of Byzantine failures in distributed systems, which is much closer to my heart.&lt;/p&gt;
&lt;p&gt;The only work of Liskov&amp;rsquo;s that I am really familiar with is the late 90s work on &lt;a href=&#34;http://www.pmg.lcs.mit.edu/~castro/osdi99_html/osdi99.html&#34;&gt;Practical Byzantine Fault Tolerance&lt;/a&gt; with &lt;a href=&#34;http://research.microsoft.com/en-us/um/people/mcastro/&#34;&gt;Miguel Castro&lt;/a&gt; and is first published in &lt;a href=&#34;http://www.pmg.lcs.mit.edu/papers/osdi99.pdf&#34;&gt;this OSDI &amp;lsquo;99 paper&lt;/a&gt;. I&amp;rsquo;m not going to do a full review, but the topic sits so nicely with my recent focus on consensus protocols that it makes sense to briefly discuss its importance.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>OSDI &#39;08: FlightPath: Obedience vs. Choice in Cooperative Services</title>
      <link>https://www.the-paper-trail.org/post/2009-03-03-osdi-08-flightpath-obedience-vs-choice-in-cooperative-services/</link>
      <pubDate>Tue, 03 Mar 2009 15:45:18 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-03-03-osdi-08-flightpath-obedience-vs-choice-in-cooperative-services/</guid>
      <description>&lt;p&gt;This is one of my favourite papers from OSDI &amp;lsquo;08 (yes, still doing a few reviews, trying to get to five or so before SOSP&amp;hellip;). &lt;a href=&#34;http://www.usenix.org/events/osdi08/tech/full_papers/li_h/li_h.pdf&#34;&gt;FlightPath&lt;/a&gt; is a system developed by some folks mainly at UT Austin for peer-to-peer streaming in dynamic networks. This is a reasonably challenging problem in itself, although one that&amp;rsquo;s seen a good deal of work before. However, the really cool thing about this paper is that they treat participants in the network as potentially rational agents. Since Lamport&amp;rsquo;s seminal work on the Byzantine generals problem, it&amp;rsquo;s been standard practice to assign one of two behaviour modes to members of distributed systems: either you&amp;rsquo;re altruistic, which means that you do exactly what the protocol tells you to do, no matter what the cost to yourself, or Byzantine, which means that you do whatever you like, again no matter what the cost to yourself.&lt;/p&gt;
&lt;p&gt;It was realised recently that this is a false dichotomy: there&amp;rsquo;s a whole class of behaviour that&amp;rsquo;s not captured by these two extremes. &lt;em&gt;Rational&lt;/em&gt; agents participate in a protocol as long as it is worth their while to do so. At its simplest, this means that rational agents will not incur a cost unless they expect to recoup a benefit worth at least as much as that cost. This gave rise to the Byzantine-Altruistic-Rational (BAR) model, due to the same UT Austin group, which can be used to more realistically model the performance of peer-to-peer protocols.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Consensus Protocols: A Paxos Implementation</title>
      <link>https://www.the-paper-trail.org/post/2009-02-09-consensus-protocols-a-paxos-implementation/</link>
      <pubDate>Mon, 09 Feb 2009 19:37:44 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-02-09-consensus-protocols-a-paxos-implementation/</guid>
      <description>&lt;p&gt;It&amp;rsquo;s one thing to wax lyrical about an algorithm or protocol having simply read the paper it appeared in. It&amp;rsquo;s another to have actually taken the time to build an implementation. There are many slips twixt hand and mouth, and the little details that you&amp;rsquo;ve abstracted away at the point of reading come back to bite you hard at the point of writing.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m a big fan of building things to understand them - this blog is essentially an expression of that idea, as the act of constructing an explanation of something helps me understand it better. Still, I felt that in order to be properly useful, this blog probably needed more code.&lt;/p&gt;
&lt;p&gt;So when, yesterday, it was suggested I back up my previous post on Paxos with a toy implementation I had plenty of motivation to pick up the gauntlet. However, I&amp;rsquo;m super-pressed for time at the moment while I write my PhD thesis, so I gave myself a deadline of a few hours, just to keep it interesting.&lt;/p&gt;
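For anyone following along, the heart of any such toy is the acceptor's promise/accept logic, which can be sketched like this (a minimal illustration of my own, not necessarily how the linked code is structured):

```python
class Acceptor:
    """Single-decree Paxos acceptor: the promise/accept rules."""

    def __init__(self):
        self.promised = -1      # highest proposal number promised so far
        self.accepted = None    # (number, value) of the last accepted proposal

    def prepare(self, n):
        # Phase 1b: promise to ignore proposals numbered below n,
        # and report any value already accepted.
        if n > self.promised:
            self.promised = n
            return ("promise", n, self.accepted)
        return ("nack", n, None)

    def accept(self, n, value):
        # Phase 2b: accept unless a higher-numbered promise was made.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return ("accepted", n, value)
        return ("nack", n, None)

a = Acceptor()
print(a.prepare(1))      # ("promise", 1, None)
print(a.accept(1, "x"))  # ("accepted", 1, "x")
print(a.prepare(2))      # ("promise", 2, (1, "x")) - reports the prior value
print(a.accept(1, "y"))  # ("nack", 1, None) - superseded by promise 2
```

The last two calls are where the safety of Paxos lives: a later proposer learns of the accepted value, and a stale proposer is refused.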
&lt;p&gt;A few hours later, I&amp;rsquo;d written &lt;a href=&#34;https://github.com/henryr/toy_paxos&#34;&gt;this&lt;/a&gt; from-scratch implementation of Paxos. There&amp;rsquo;s enough interesting stuff in it, I think, to warrant this post on how it works. Hopefully some of you will find it useful, and something you can use as a springboard to your own implementations. You can run an example by simply invoking python toy_paxos.py.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Consensus Protocols: Paxos</title>
      <link>https://www.the-paper-trail.org/post/2009-02-03-consensus-protocols-paxos/</link>
      <pubDate>Tue, 03 Feb 2009 17:03:14 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-02-03-consensus-protocols-paxos/</guid>
      <description>&lt;p&gt;You can&amp;rsquo;t really read two articles about distributed systems today without someone mentioning the Paxos algorithm. Google use it in &lt;a href=&#34;http://labs.google.com/papers/chubby.html&#34;&gt;Chubby&lt;/a&gt;, Yahoo used something a bit like it (but not the same!) in &lt;a href=&#34;https://zookeeper.apache.org/&#34;&gt;ZooKeeper&lt;/a&gt; and it seems that it&amp;rsquo;s considered the ne plus ultra of consensus algorithms. It also comes with a reputation as being fantastically difficult to understand - a subtle, complex algorithm that is only properly appreciated by a select few.&lt;/p&gt;
&lt;p&gt;This is kind of true and not true at the same time. Paxos is an algorithm whose entire behaviour is subtly difficult to grasp. However, the algorithm itself is fairly intuitive, and certainly relatively simple. In this article I&amp;rsquo;ll describe how basic Paxos operates, with reference to previous articles on two-phase and three-phase commit. I&amp;rsquo;ve included a bibliography at the end, for those who want plenty more detail.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Tuesday Links, 27th January 2009</title>
      <link>https://www.the-paper-trail.org/post/2009-01-27-tuesday-links-27th-january-2009/</link>
      <pubDate>Tue, 27 Jan 2009 14:52:33 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-01-27-tuesday-links-27th-january-2009/</guid>
      <description>Web highlights discovered in the last week or so:
 The C10K problem - detailed discussion of how to do IO on a server that you want to handle 10000 simultaneous connections.
 Scalability by Design - Coding for Systems With Large CPU Counts - via High Scalability.
 Anti-RDBMS: A list of distributed key-value stores - good, if superficial, survey.
 Mark Russinovich: Inside Windows 7 - kernel-level look at what&amp;rsquo;s new in Windows.
 Project Voldemort - Dynamo-a-like from LinkedIn.</description>
    </item>
    
    <item>
      <title>OSDI &#39;08 - CuriOS: Improving Reliability Through Operating System Structure</title>
      <link>https://www.the-paper-trail.org/post/2009-01-19-osdi-08-curios-improving-reliability-through-operating-system-structure/</link>
      <pubDate>Mon, 19 Jan 2009 17:28:01 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-01-19-osdi-08-curios-improving-reliability-through-operating-system-structure/</guid>
      <description>&lt;p&gt;The second paper from OSDI that I&amp;rsquo;ll mention here is one I&amp;rsquo;ll only treat briefly - partly because it&amp;rsquo;s a bit lightweight compared to some, and partly because I&amp;rsquo;m writing in a hurry. &lt;a href=&#34;http://www.usenix.org/events/osdi08/tech/full_papers/david/david.pdf&#34;&gt;CuriOS: Improving Reliability Through Operating System Structure&lt;/a&gt; attacks a problem with recovery from errors in microkernel operating systems.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>OSDI &#39;08: Corey, an operating system for many cores</title>
      <link>https://www.the-paper-trail.org/post/2009-01-14-osdi-08-corey-an-operating-system-for-many-cores/</link>
      <pubDate>Wed, 14 Jan 2009 22:50:19 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-01-14-osdi-08-corey-an-operating-system-for-many-cores/</guid>
      <description>&lt;p&gt;Just before Christmas, the systems community held one of its premier conferences - Operating Systems Design and Implementation (OSDI &amp;lsquo;08). This biannual conference showcases some of the best research in operating systems, networks, distributed systems and software technology from the past couple of years.&lt;/p&gt;
&lt;p&gt;Although I wasn&amp;rsquo;t lucky enough to go, I did grab a copy of the proceedings and had a read through a bunch of the papers that interested me. I plan to post summaries of a few to this blog. I see people ask repeatedly on various forums (fora?) &amp;ldquo;what&amp;rsquo;s new in computer science?&amp;rdquo;. No-one seems to give a satisfactory answer, for a number of reasons. Hopefully I can redress some of the balance here, at least in the systems world.&lt;/p&gt;
&lt;p&gt;Without further ado, I&amp;rsquo;ll get stuck in to one of the OSDI papers: &lt;a href=&#34;http://www.mit.edu/~y_z/papers/corey-osdi08.pdf&#34;&gt;Corey: an operating system for many cores&lt;/a&gt; by Boyd-Wickizer et al from a combination of MIT, Fudan University, MSR Asia and Xi&amp;rsquo;an Jiaotong University (12 authors!). Download the paper and play along at home, as usual.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Consensus with lossy links: Establishing a TCP connection</title>
      <link>https://www.the-paper-trail.org/post/2009-01-12-consensus-with-lossy-links-establishing-a-tcp-connection/</link>
      <pubDate>Mon, 12 Jan 2009 13:51:27 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2009-01-12-consensus-with-lossy-links-establishing-a-tcp-connection/</guid>
      <description>&lt;p&gt;After a hiatus for the Christmas break, during which I travelled to the States, had a job interview, went to Vegas, became an uncle and got a cold, I&amp;rsquo;m back on a more regular posting schedule now. And I&amp;rsquo;ve got lots to post about.&lt;/p&gt;
&lt;p&gt;Before I talk about other theoretical consensus protocols such as Paxos, I want to illustrate a consensus protocol running in the wild, and show how different modelling assumptions can lead to protocols that are rather different to the *PC variants we&amp;rsquo;ve looked at in the &lt;a href=&#34;http://hnr.dnsalias.net/wordpress/?p=90&#34;&gt;last&lt;/a&gt; &lt;a href=&#34;http://hnr.dnsalias.net/wordpress/?p=103&#34;&gt;couple&lt;/a&gt; of posts. We&amp;rsquo;ve been considering situations like database commit, where many participants agree en masse to the result of a transaction. We&amp;rsquo;ve assumed that all participants may communicate reliably, without fear of packet loss (or, if packets are lost, that the situation is the same as if the sending host had failed).&lt;/p&gt;
&lt;p&gt;The Transmission Control Protocol (TCP) gives us at least some approximation to a reliable link due to the use of sequence numbers and acknowledgements. However, before we can use TCP, both hosts involved in a point-to-point communication have to establish a connection: that is, they must both agree that a connection is established. This is a two-party consensus problem. Neither party can rely on reliable transmission, and can instead only use the IP stack and below to negotiate a connection. IP does not give reliable transmission semantics to packets and works only on a best-effort principle. If the network is noisy or prone to outages then packets will be lost. How can we achieve consensus in this scenario?&lt;/p&gt;
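To see the retransmission idea in miniature, here is a toy sketch of my own (nothing like the real TCP state machine, and the loss model is invented): each message is resent until it gets through, so the handshake completes provided losses eventually stop.

```python
def handshake(drop_pattern):
    """Toy three-way handshake over a lossy link.

    drop_pattern: one boolean per transmission, True meaning that
    transmission is lost; once the pattern is exhausted, no further
    losses occur. The client retransmits its SYN until a SYN-ACK
    round-trip survives, so any finite run of losses is tolerated.
    Returns the number of attempts needed.
    """
    sends = iter(drop_pattern)

    def send_ok():
        return not next(sends, False)  # past the pattern: always delivered

    attempts = 0
    while True:
        attempts += 1
        if send_ok() and send_ok():  # SYN delivered, then SYN-ACK delivered
            send_ok()                # final ACK (a loss here is healed later)
            return attempts

print(handshake([]))                   # no loss: succeeds on attempt 1
print(handshake([True, False, True]))  # losses: succeeds on attempt 3
```

Note what the sketch cannot fix: if losses never stop, the loop never terminates, which is exactly the impossibility result discussed below.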
&lt;p&gt;Those who have been reading this blog as far back as my explanation of &lt;a href=&#34;http://hnr.dnsalias.net/wordpress/?p=49&#34;&gt;FLP impossibility&lt;/a&gt; will probably be thinking that this is a trick question. FLP impossibility shows that if there is an unbounded delay in the transmission of a packet (i.e. an asynchronous network model) then consensus is, in general, unsolvable. Lossy links can be regarded as delaying packet delivery infinitely - therefore it seems very likely that consensus is unsolvable with packet loss.&lt;/p&gt;
&lt;p&gt;In fact, this is completely true. Consensus with arbitrary packet loss is an unsolvable problem, even in an otherwise synchronous network. In this post I want to demonstrate the short and intuitive proof that this is the case, then show how this impossibility is avoided where possible in TCP connection establishment.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Consensus Protocols: Three-phase Commit</title>
      <link>https://www.the-paper-trail.org/post/2008-11-29-consensus-protocols-three-phase-commit/</link>
      <pubDate>Sat, 29 Nov 2008 14:35:36 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-11-29-consensus-protocols-three-phase-commit/</guid>
      <description>&lt;p&gt;Last time we looked extensively at two-phase commit, a consensus algorithm that has the benefit of low latency but which is offset by fragility in the face of participant machine crashes. In this short note, I&amp;rsquo;m going to explain how the addition of an extra phase to the protocol can shore things up a bit, at the cost of a greater latency.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Consensus Protocols: Two-Phase Commit</title>
      <link>https://www.the-paper-trail.org/post/2008-11-27-consensus-protocols-two-phase-commit/</link>
      <pubDate>Thu, 27 Nov 2008 16:41:53 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-11-27-consensus-protocols-two-phase-commit/</guid>
      <description>&lt;p&gt;For the next few articles here, I&amp;rsquo;m going to write about one of the most fundamental concepts in distributed computing - of equal importance to the theory and practice communities. The &lt;em&gt;consensus problem&lt;/em&gt; is the problem of getting a set of nodes in a distributed system to agree on something - it might be a value, a course of action or a decision. Achieving consensus allows a distributed system to act as a single entity, with every individual node aware of and in agreement with the actions of the whole of the network.&lt;/p&gt;
&lt;p&gt;For example, some possible uses of consensus are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;deciding whether or not to commit a transaction to a database&lt;/li&gt;
&lt;li&gt;synchronising clocks by agreeing on the current time&lt;/li&gt;
&lt;li&gt;agreeing to move to the next stage of a distributed algorithm (this is the famous &lt;em&gt;replicated state machine&lt;/em&gt; approach)&lt;/li&gt;
&lt;li&gt;electing a leader node to coordinate some higher-level protocol&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Such a simple-sounding problem has, surprisingly, been at the core of distributed systems research - particularly on the theoretical side - for over twenty years. How come? As I see it, the answers are threefold.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>BigTable: Google&#39;s Distributed Data Store</title>
      <link>https://www.the-paper-trail.org/post/2008-10-29-bigtable-googles-distributed-data-store/</link>
      <pubDate>Wed, 29 Oct 2008 21:53:28 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-10-29-bigtable-googles-distributed-data-store/</guid>
      <description>&lt;p&gt;Although GFS provides Google with reliable, scalable distributed file storage, it does not provide any facility for structuring the data contained in the files beyond a hierarchical directory structure and meaningful file names. It&amp;rsquo;s well known that more expressive solutions are required for large data sets. Google&amp;rsquo;s terabytes upon terabytes of data that they retrieve from web crawlers, amongst many other sources, need organising, so that client applications can quickly perform lookups and updates at a finer granularity than the file level.&lt;/p&gt;
&lt;p&gt;So they built BigTable, wrote it up, and published it in OSDI 2006. The paper is &lt;a href=&#34;http://labs.google.com/papers/bigtable-osdi06.pdf&#34;&gt;here&lt;/a&gt;, and my walkthrough follows.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Yahoo&#39;s PNUTS</title>
      <link>https://www.the-paper-trail.org/post/2008-10-12-yahoos-pnuts/</link>
      <pubDate>Sun, 12 Oct 2008 22:53:00 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-10-12-yahoos-pnuts/</guid>
      <description>&lt;p&gt;In these politically charged times, it&amp;rsquo;s important for written media to give equal coverage to all major parties so as not to appear biased or to be endorsing one particular group. With that in mind, we at Paper Trail are happy to devote significant programming time to all the major distributed systems players.&lt;/p&gt;
&lt;p&gt;This, therefore, is a party political broadcast on behalf of the Yahoo Party.&lt;/p&gt;
&lt;h2 id=&#34;pnuts-yahoos-hosted-data-serving-platform&#34;&gt;PNUTS: Yahoo!&amp;rsquo;s Hosted Data Serving Platform&lt;/h2&gt;
&lt;p&gt;(Please note, that&amp;rsquo;s the first and last time in this article that I&amp;rsquo;ll be using the exclamation mark in Yahoo&amp;rsquo;s name; it looks funny.)&lt;/p&gt;
&lt;p&gt;As you might expect from the company that runs Flickr, Yahoo have need for a large scale distributed data store. In particular, they need a system that runs in many geographical locations in order to optimise response times for users from any region, while at the same time coordinating data across the entire system. As ever, the system must exhibit high availability and fault tolerance, scalability and good latency properties.&lt;/p&gt;
&lt;p&gt;These, of course, are not new or unique requirements. We&amp;rsquo;ve seen already that Amazon&amp;rsquo;s Dynamo, and Google&amp;rsquo;s BigTable/GFS stack offer similar services. Any business that has a web-based product that requires storing and updating data for thousands of users has a need for a system like Dynamo. Many can&amp;rsquo;t afford the engineering time required to develop their own tuned solution, so settle for well-understood RDBMS-based stacks. However, as readers of this blog will know, RDBMSs can be almost too strict in terms of how data are managed, sacrificing responsiveness and throughput for correctness. This is a tradeoff that many systems are willing to explore.&lt;/p&gt;
&lt;p&gt;PNUTS is Yahoo&amp;rsquo;s entry into this space. As usual, it occupies the grey area somewhere between a straightforward distributed hash-table and a fully-featured relational database. They published details at the conference on Very Large Data Bases (VLDB) in 2008. Read on to find out what design decisions they made&amp;hellip;&lt;/p&gt;
&lt;p&gt;(The paper is &lt;a href=&#34;http://www.brianfrankcooper.net/pubs/pnuts.pdf&#34;&gt;here&lt;/a&gt;, and playing along at home is as ever encouraged).&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>The Google File System</title>
      <link>https://www.the-paper-trail.org/post/2008-10-01-the-google-file-system/</link>
      <pubDate>Wed, 01 Oct 2008 13:19:44 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-10-01-the-google-file-system/</guid>
      <description>&lt;p&gt;It&amp;rsquo;s been a little while since my last technically meaty update. One system that I&amp;rsquo;ve been looking at a fair bit recently is Hadoop, which is an open-source implementation of Google&amp;rsquo;s MapReduce. For me, the interesting part is the large-scale distributed filesystem on which it runs called HDFS. It&amp;rsquo;s well known that HDFS is based heavily on its Google equivalent.&lt;/p&gt;
&lt;p&gt;In 2003 Google published a &lt;a href=&#34;http://labs.google.com/papers/gfs-sosp2003.pdf&#34;&gt;paper&lt;/a&gt; on their Google File System (GFS) at &lt;a href=&#34;http://www.cs.rochester.edu/meetings/sosp2003/&#34;&gt;SOSP&lt;/a&gt;, the Symposium on Operating Systems Principles. This is the same venue at which Amazon published their Dynamo work, albeit four years earlier. One of the lecturers in my group tells me that SOSP is a venue where &amp;ldquo;interesting&amp;rdquo; is rated highly as a criterion for acceptance, over other more staid conferences. So what, if anything, was interesting about GFS? Read on for some details&amp;hellip;&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Pain and suffering and ffmpeg</title>
      <link>https://www.the-paper-trail.org/post/2008-09-19-pain-and-suffering-and-ffmpeg/</link>
      <pubDate>Fri, 19 Sep 2008 11:55:02 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-09-19-pain-and-suffering-and-ffmpeg/</guid>
      <description>All I wanted to do was to transcode real media files from MIT OCW to iPod compatible mp4 on Linux. It shouldn&amp;rsquo;t have been this difficult. As of now, I still don&amp;rsquo;t have a satisfactory solution.
Problem 1: mplayer / mencoder read and play the stream correctly, but the mp4 files they produce when transcoding don&amp;rsquo;t work on the iPod. In particular, they&amp;rsquo;re not readable by any utilities I have, such as Easytag and Amarok.</description>
    </item>
    
    <item>
      <title>The Real GoogleOS?</title>
      <link>https://www.the-paper-trail.org/post/2008-09-02-the-real-googleos/</link>
      <pubDate>Tue, 02 Sep 2008 14:43:45 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-09-02-the-real-googleos/</guid>
      <description>&lt;p&gt;So Google have announced &lt;a href=&#34;http://googleblog.blogspot.com/2008/09/fresh-take-on-browser.html&#34;&gt;Chrome&lt;/a&gt;, their entrant into the web browser circus. They are presenting Chrome as a complete reboot of the browser, which of course it isn&amp;rsquo;t. It is interesting, however, to speculate wildly about Google&amp;rsquo;s intentions. We shouldn&amp;rsquo;t, of course, discount their stated intent of &amp;lsquo;adding value for users&amp;rsquo;; a lot of features of Chrome are focused upon improving today&amp;rsquo;s browsing experience. See, for example, pop-ups that are modal only in their own tab, which is something I have been wishing for for ages. However, looking at the big picture, even from a viewpoint far removed, is good for a laugh sometimes.&lt;/p&gt;
&lt;p&gt;Read on for some rampant speculation.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Consistency and availability in Amazon&#39;s Dynamo</title>
      <link>https://www.the-paper-trail.org/post/2008-08-26-consistency-and-availability-in-amazons-dynamo/</link>
      <pubDate>Tue, 26 Aug 2008 12:50:08 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-08-26-consistency-and-availability-in-amazons-dynamo/</guid>
      <description>&lt;p&gt;There is a continuing and welcome trend amongst large, modern technology companies like Google, Yahoo and Amazon to publish details of their systems at academic conferences. One of the problems that researchers at universities have is making a convincing case that their ideas would work well in the real world, since no matter how many assumptions are made there really is no substitute for field testing, and the infrastructure, workloads and data just aren&amp;rsquo;t available to do that effectively. However, companies have infrastructure to burn and a genuine use-case with genuine users. Using their experience and data to discover what does and doesn&amp;rsquo;t work, and what is and is not really important provides an invaluable feedback loop to researchers.&lt;/p&gt;
&lt;p&gt;More than that, large systems are built from a set of independent ideas. Most academic papers leave the construction of a practical real-world system as an exercise for the reader. Synthesising a set of disparate techniques often throws up lots of gotchas which no papers directly address. Companies with businesses to run have a much greater incentive to build a robust system that works.&lt;/p&gt;
&lt;p&gt;At 2007&amp;rsquo;s &lt;a href=&#34;http://www.sosp2007.com&#34;&gt;Symposium on Operating Systems Principles&lt;/a&gt; (SOSP), Amazon presented a &lt;a href=&#34;http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf&#34;&gt;paper&lt;/a&gt; about one of their real-world systems: &amp;ldquo;Dynamo: Amazon&amp;rsquo;s Highly Available Key-value Store&amp;rdquo;. It wound up winning, I think, the audience prize for best paper. In this post, I was planning to describe Dynamo &amp;lsquo;inside-out&amp;rsquo;, based on a reading group mandated close reading of the paper. However, trying to lucidly explain a dense 12 page paper leads to many more than 12 pages of explanation. So instead, I want to focus on one particular aspect of Dynamo which I think is the most interesting.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Good survey of the important papers in distributed consensus</title>
      <link>https://www.the-paper-trail.org/post/2008-08-25-good-survey-of-the-important-papers-in-distributed-consensus/</link>
      <pubDate>Mon, 25 Aug 2008 16:11:35 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-08-25-good-survey-of-the-important-papers-in-distributed-consensus/</guid>
      <description>This blog post is an excellent survey of the last thirty years of research into consensus problems.</description>
    </item>
    
    <item>
      <title>A Brief Tour of FLP Impossibility</title>
      <link>https://www.the-paper-trail.org/post/2008-08-13-a-brief-tour-of-flp-impossibility/</link>
      <pubDate>Wed, 13 Aug 2008 11:30:29 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-08-13-a-brief-tour-of-flp-impossibility/</guid>
      <description>&lt;p&gt;One of the most important results in distributed systems theory was published in April 1985 by Fischer, Lynch and Patterson. Their short paper &lt;a href=&#34;http://cs-www.cs.yale.edu/homes/arvind/cs425/doc/fischer.pdf&#34;&gt;&amp;lsquo;Impossibility of Distributed Consensus with One Faulty Process&amp;rsquo;&lt;/a&gt;, which eventually won the Dijkstra award given to the most influential papers in distributed computing, definitively placed an upper bound on what it is possible to achieve with distributed processes in an asynchronous environment.&lt;/p&gt;
&lt;p&gt;This particular result, known as the &amp;lsquo;FLP result&amp;rsquo;, settled a dispute that had been ongoing in distributed systems for the previous five to ten years. The problem of consensus - that is, getting a distributed network of processors to agree on a common value - was known to be solvable in a synchronous setting, where processes could proceed in simultaneous steps. In particular, the synchronous solution was resilient to faults, where processors crash and take no further part in the computation. Informally, synchronous models allow failures to be detected by waiting one entire step length for a reply from a processor, and presuming that it has crashed if no reply is received.&lt;/p&gt;
&lt;p&gt;This kind of failure detection is impossible in an asynchronous setting, where there are no bounds on the amount of time a processor might take to complete its work and then respond with a message. Therefore it&amp;rsquo;s not possible to say whether a processor has crashed or is simply taking a long time to respond. The FLP result shows that in an asynchronous setting, where only one processor might crash, there is no distributed algorithm that solves the consensus problem.&lt;/p&gt;
&lt;p&gt;In this post, I want to give a tour of the proof itself because, although it is quite subtle, it is short and profound. I&amp;rsquo;ll start by introducing consensus, and then after describing some notation and assumptions I&amp;rsquo;ll work through the main two lemmas in the paper.&lt;/p&gt;
&lt;p&gt;If you want to follow along at home (highly, highly recommended) a copy of the paper is available &lt;a href=&#34;http://cs-www.cs.yale.edu/homes/arvind/cs425/doc/fischer.pdf&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Binomial Heaps</title>
      <link>https://www.the-paper-trail.org/post/2008-07-11-binomial-heaps/</link>
      <pubDate>Fri, 11 Jul 2008 17:35:44 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-07-11-binomial-heaps/</guid>
      <description>&lt;p&gt;(The python code for this article is available &lt;a href=&#34;http://hnr.dnsalias.net/binomial_heap.py&#34;&gt;here&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;The standard binary heaps that everyone learns as part of a first algorithms course are very cool. They give guaranteed \(O(n \log n)\) sorting cost, can be stored compactly in memory since they&amp;rsquo;re full binary trees and allow for very fast implementations of priority queues. However, there are a couple of operations that we might be interested in that binary heaps don&amp;rsquo;t give us, at least not cheaply.&lt;/p&gt;
&lt;p&gt;In particular, we might be concerned with merging two heaps together. Say, for example, that we&amp;rsquo;re shutting down a processor with its own priority queue for schedulable processes, and we want to merge the workload in with another processor. One way to do this would be to insert every item in the first processor&amp;rsquo;s queue into the receiving processor&amp;rsquo;s queue. However, this takes  \(O(n)\) time - at least, depending on how the queues are implemented. We&amp;rsquo;d like to be able to do that more efficiently.&lt;/p&gt;
&lt;p&gt;Step forward binomial heaps. Binomial heaps are rather different to binary heaps - although they do share a few details. Binomial heaps allow us to merge two heaps together in  \(O(\log n)\) time, in return for some extra cost when finding the minimum. However, extracting the minimum still takes \(O(\log n)\) , which is the same as a binary heap.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Reservoir Sampling</title>
      <link>https://www.the-paper-trail.org/post/2008-04-09-reservoir-sampling/</link>
      <pubDate>Wed, 09 Apr 2008 16:55:09 +0000</pubDate>
      <author>henry.robinson&#43;papertrail@gmail.com (Henry Robinson)</author>
      <guid>https://www.the-paper-trail.org/post/2008-04-09-reservoir-sampling/</guid>
      <description>&lt;p&gt;Right, time to get this blog back on track. I want to talk about a useful technique that&amp;rsquo;s both highly practical and crops up in interview scenarios regularly.&lt;/p&gt;
&lt;p&gt;Consider this problem: How can we efficiently randomly select  \(k\) items from a set  \(S\) of size \(n &amp;gt; k\) , where  \(n\) is unknown? Each member of  \(S\) should have an equal probability of being selected.&lt;/p&gt;
&lt;p&gt;At first glance, this problem looks a strange mix of trivial and impossible. If we don&amp;rsquo;t know \(n\) , how can we know how to weight our selection probabilities? And, assuming  \(S\) is finite, how can we not know  \(n\) if we&amp;rsquo;re able - as we must be - to visit every element in \(S\) ?&lt;/p&gt;
&lt;p&gt;To address the second point first: we can find  \(n\) by iterating over all elements in \(S\) . However, this adds a pass over the entire set that might be expensive. Consider selecting rows at random from a large database table. We don&amp;rsquo;t want to bring the whole table into memory just to count the number of rows. Linked lists are another good motivating example - say we would like to select  \(k\) elements from a linked list of length \(n\) . Even if we do loop over the list to count its elements, employing the simple approach of choosing  \(k\) elements at random will still be slow because random access in linked lists is not a constant-time operation. In general we will take on average  \(O(kn)\) time. By using the following technique, called &amp;lsquo;reservoir sampling&amp;rsquo;, we can bring this down to \(\Theta(n)\) .&lt;/p&gt;
&lt;p&gt;We can keep our selection of  \(k\) elements updated on-line as we scan through our data structure, and we can do it very simply.&lt;/p&gt;</description>
    </item>
    
  </channel>
</rss>