Beating hash tables with trees? The ART-ful radix trie

The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases

Leis et. al., ICDE 2013 [paper]

Tries are an unloved third data structure for building key-value stores and indexes, after search trees (like B-trees and red-black trees) and hash tables. Yet they have a number of very appealing properties that make them worthy of consideration - for example, the height of a trie is independent of the number of keys it contains, and a trie requires no rebalancing when updated. Weighing against those advantages is the heavy memory cost that vanilla radix tries can incur, because each node contains a pointer for every possible value of the ‘next’ character in the key. With ASCII as an example, that’s 256 pointers for every node in the tree.

But the astute reader will feel in their bones that this is naive - there must be more efficient ways to store a set of pointers, indexed by a fixed size set of keys (the trie’s alphabet). Indeed, there are - several of them, in fact, distinguished by the number of children the node actually has, not just how many it might potentially have.

This is where the Adaptive Radix Tree (ART) comes in. In this breezy, easy-to-read paper, the authors show how to reduce the memory cost of a regular radix trie by adapting the data structure used for each node to the number of children that it needs to store. In doing so they show, perhaps surprisingly, that the amount of space consumed by a single key can be bounded no matter how long the key is.

[Read More]

Outperforming hash-tables with MICA

MICA: A Holistic Approach to Fast In-Memory Key-Value Storage

Lim et. al., NSDI 2014 [paper, code]

In this installment we’re going to look at a system from NSDI 2014. MICA is another in-memory key-value store, but in contrast to Masstree it does not support range queries and in much of the paper it keeps a fixed working set by evicting old items, like a cache. Indeed, the closest comparison system that you might think of when reading about MICA for the first time is a humble… hash table. Is there still room for improvement over such a fundamental data structure? Read on and find out (including benchmarks!).

[Read More]

Masstree: A cache-friendly mashup of tries and B-trees

Cache Craftiness for Fast Multicore Key-Value Storage

Mao et. al., EuroSys 2012 [paper, code]

The Big Idea

Consider the problem of storing, in memory, millions of (key, value) pairs, where key is a variable-length string. If we just wanted to support point lookup, we’d use a hash table. But assuming we want to support range queries, some kind of tree structure is probably required. One candidate might be a traditional B+-tree.

In such a B+-tree, the number of levels of the tree are kept small thanks to the fact that each node has a high fan-out. However, that means that a large number of keys are packed into a single node, and so there’s still a large number of key comparisons to perform when searching through the tree.

This is further exacerbated by variable-length keys (e.g. strings), where the cost of key comparisons can be quite high. If the keys are really long they can each occupy multiple cache lines, and so comparing two of them can really mess up your cache locality.

This paper proposes an efficient tree data structure that relies on splitting variable length keys into a variable number of fixed-length keys called slices. As you go down the tree, you compare the first slice of each key, then the second, then the third and so on, but each comparision has constant cost.

For example, think about the string the quick brown fox jumps over the lazy dog. This string consists of the following 8-byte slices: the quic, k brown_, fox jump, s over t, he lazy_ and finally dog. To find a string in a tree, you can look for all strings that match the first slice first, and then look for the second slice only in strings that matched the first slice, and so on - only comparing a fixed size subset of the key at any time. This is much more efficient than comparing long strings to one another over and over again. The trick is to design a structure that takes advantage of the cache benefits of doing these fixed-size comparisons, without losing a tradeoff based on the large cardinality of the slice ‘alphabet’. Enter the Masstree.

[Read More]

Exactly-once or not, atomic broadcast is still impossible in Kafka - or anywhere

Update: Jay responded on Twitter, which you can read here.

I read an article recently by Jay Kreps about a feature for delivering messages ‘exactly-once’ within the Kafka framework. Everyone’s excited, and for good reason. But there’s been a bit of a side story about what exactly ‘exactly-once’ means, and what Kafka can actually do.

In the article, Jay identifies the safety and liveness properties of atomic broadcast as a pretty good definition for the set of properties that Kafka is going after with their new exactly-once feature, and then starts to address claims by naysayers that atomic broadcast is impossible.

For this note, I’m not going to address whether or not exactly-once is an implementation of atomic broadcast. I also believe that exactly-once is a powerful feature that’s been impressively realised by Confluent and the Kafka community; nothing here is a criticism of that effort or the feature itself. But the article makes some claims about impossibility that are, at best, a bit shaky - and, well, impossibility’s kind of my jam. Jay posted his article with a tweet saying he couldn’t ‘resist a good argument’. I’m responding in that spirit.

In particular, the article makes the claim that atomic broadcast is ‘solvable’ (and later that consensus is as well…), which is wrong. What follows is why, and why that matters.

I have since left the pub. So let’s begin.

[Read More]

Make any algorithm lock-free with this one crazy trick

Lock-free algorithms often operate by having several versions of a data structure in use at one time. The general pattern is that you can prepare an update to a data structure, and then use a machine primitive to atomically install the update by changing a pointer. This means that all subsequent readers will follow the pointer to its new location - for example, to a new node in a linked-list - but this pattern can’t do anything about readers that have already followed the old pointer value, and are traversing the previous version of the data structure.

[Read More]

Distributed systems theory for the distributed systems engineer

Updated June 2018 with content on atomic broadcast, gossip, chain replication and more

Gwen Shapira, who at the time was an engineer at Cloudera and now is spreading the Kafka gospel, asked a question on Twitter that got me thinking.

My response of old might have been “well, here’s the FLP paper, and here’s the Paxos paper, and here’s the Byzantine generals paper…”, and I’d have prescribed a laundry list of primary source material which would have taken at least six months to get through if you rushed. But I’ve come to thinking that recommending a ton of theoretical papers is often precisely the wrong way to go about learning distributed systems theory (unless you are in a PhD program). Papers are usually deep, usually complex, and require both serious study, and usually significant experience to glean their important contributions and to place them in context. What good is requiring that level of expertise of engineers?

And yet, unfortunately, there’s a paucity of good ‘bridge’ material that summarises, distills and contextualises the important results and ideas in distributed systems theory; particularly material that does so without condescending. Considering that gap lead me to another interesting question:

What distributed systems theory should a distributed systems engineer know?

A little theory is, in this case, not such a dangerous thing. So I tried to come up with a list of what I consider the basic concepts that are applicable to my every-day job as a distributed systems engineer. Let me know what you think I missed!

[Read More]

The Elephant was a Trojan Horse: On the Death of Map-Reduce at Google

Note: this is a personal blog post, and doesn’t reflect the views of my employers at Cloudera

Map-Reduce is on its way out. But we shouldn’t measure its importance in the number of bytes it crunches, but the fundamental shift in data processing architectures it helped popularise.

This morning, at their I/O Conference, Google revealed that they’re not using Map-Reduce to process data internally at all any more.

We shouldn’t be surprised. The writing has been on the wall for Map-Reduce for some time. The truth is that Map-Reduce as a processing paradigm continues to be severely restrictive, and is no more than a subset of richer processing systems.

[Read More]

Paper notes: MemC3, a better Memcached

MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing

Fan and Andersen, NSDI 2013

The big idea

This is a paper about choosing your data structures and algorithms carefully. By paying careful attention to the workload and functional requirements, the authors reimplement memcached to achieve a) better concurrency and b) better space efficiency. Specifically, they introduce a variant of cuckoo hashing that is highly amenable to concurrent workloads, and integrate the venerable CLOCK cache eviction algorithm with the hash table for space-efficient approximate LRU.

[Read More]

Paper notes: Anti-Caching

Anti-Caching: A New Approach to Database Management System Architecture

DeBrabant et. al., VLDB 2013

The big idea

Traditional databases typically rely on the OS page cache to bring hot tuples into memory and keep them there. This suffers from a number of problems:

  • No control over granularity of caching or eviction (so keeping a tuple in memory might keep all the tuples in its page as well, even though there’s not necessarily a usage correlation between them)
  • No control over when fetches are performed (fetches are typically slow, and transactions may hold onto locks or latches while the access is being made)
  • Duplication of resources - tuples can occupy both disk blocks and memory pages.

[Read More]

Paper notes: Stream Processing at Google with Millwheel

Millwheel: Fault-Tolerant Stream Processing at Internet Scale

Akidau et. al., VLDB 2013

The big idea

Streaming computations at scale are nothing new. Millwheel is a standard DAG stream processor, but one that runs at ‘Google’ scale. This paper really answers the following questions: what guarantees should be made about delivery and fault-tolerance to support most common use cases cheaply? What optimisations become available if you choose these guarantees carefully?

[Read More]