You’re going to hear a lot about columnar storage formats in the next few months, as a variety of distributed execution engines are beginning to consider them for their IO efficiency, and the optimisations that they open up for query execution. In this post, I’ll explain why we care so much about IO efficiency and show how columnar storage - which is a simple idea - can drastically improve performance for certain workloads.
Caveat: This is a personal, general research summary post, and as usual doesn’t neccessarily reflect our thinking at Cloudera about columnar storage.
Disks are still the major bottleneck in query execution over large datasets. Even a machine with twelve disks running in parallel (for an aggregate bandwidth of north of 1GB/s) can’t keep all the cores busy; running a query against memory-cached data can get tens of GB/s of throughput. IO bandwidth matters. Therefore, the best thing an engineer can do to improve the performance of disk-based query engines (like RDBMs and Impala) usually is to improve the performance of reading bytes from disk. This can mean decreasing the latency (for small queries where the time to find the data to read might dominate), but most usually this means improving the effective throughput of reads from disk.
The traditional way to improve disk bandwidth has been to wait, and allow disks to get faster. However, disks are not getting faster very quickly (having settled at roughly 100 MB/s, with ~12 disks per server), and SSDs can’t yet achieve the storage density to be directly competitive with HDDs on a per-server basis.
The other way to improve disk performance is to maximise the ratio of ‘useful’ bytes read to total bytes read. The idea is not to read more data than is absolutely necessary to serve a query, so the useful bandwidth realised is increased without actually improving the performance of the IO subsystem. Enter columnar storage, a principle for file format design that aims to do exactly that for query engines that deal with record-based data.[Read More]
If you have a strong background in either databases or distributed systems, and fancy working on such an exciting technology, send me a note!
It’s great to finally be able to say something about what I’ve been working at Cloudera for nearly a year. At StrataConf / Hadoop World in New York a couple of weeks ago we announced Cloudera Impala. Impala is a distributed query execution engine that understands a subset of SQL, and critically runs over HDFS and HBase as storage managers. It’s very similar in functionality to Apache Hive, but it is much, much, much (anecdotally up to 100x) faster.[Read More]
On some subtleties of Paxos
There’s one particular aspect of the Paxos protocol that gives readers of this blog - and for some time, me! - some difficulty. This short post tries to clear up some confusion on a part of the protocol that is poorly explained in pretty much every major description.[Read More]
Something a bit different: translations of classic mathematical texts (!)
EuroSys 2012 blog notes
EuroSys 2012 was last week - one of the premier European systems conferences. Over at the Cambridge System Research Group’s blog, various people from the group have written notes on the papers presented. They’re very well-written summaries, and worth checking out for an overview of the research presented.
FLP and CAP aren't the same thing
An interesting question came up on Quora this last week. Roughly speaking, the question asked how, if at all, the FLP theorem and the CAP theorem were related. I’d thought idly about exactly the same question myself before. Both theorems concern the impossibility of solving fairly similar fundamental distributed systems problems in what appear to be fairly similar distributed systems settings. The CAP theorem gets all the airtime, but FLP to me is a more beautiful result. Wouldn’t it be fascinating if both theorems turned out to be equivalent; that is effectively restatements of each other?[Read More]
Should I take a systems reading course?
A smart student asked me a couple of days ago whether I thought taking a 2xx-level reading course in operating systems was a good idea. The student, understandably, was unsure whether talking about these systems was as valuable as actually building them, and also whether, since his primary interest is in ‘distributed’ systems, he stood to benefit from a deep understanding of things like virtual memory.[Read More]