Availability in AWS' Physalia

Physalia: Millions of Tiny Databases, Brooker et al., NSDI 2020

Some notes on AWS’ latest systems publication, which continues and expands their thinking about reducing the effect of failures in very large distributed systems (see shuffle sharding as an earlier and complementary technique for the same kind of problem).

Physalia is a configuration store for AWS’ Elastic Block Store (EBS), which provides network-attached disks. EBS volumes are replicated using chain replication, but the configuration of each replication chain needs to be stored somewhere - enter Physalia.
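To make that concrete, here is a minimal Python sketch of the kind of record such a configuration store might hold. The `ChainConfig` type, its fields, and the `remove_failed_node` helper are invented for illustration; this is not the actual EBS schema.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ChainConfig:
    volume_id: str
    epoch: int                 # bumped on every reconfiguration so stale configs can be rejected
    chain: Tuple[str, ...]     # ordered storage nodes, head first, tail last


def remove_failed_node(cfg: ChainConfig, failed: str) -> ChainConfig:
    """Return the successor configuration with `failed` dropped from the chain."""
    if failed not in cfg.chain:
        return cfg
    new_chain = tuple(n for n in cfg.chain if n != failed)
    return replace(cfg, epoch=cfg.epoch + 1, chain=new_chain)


cfg = ChainConfig("vol-123", epoch=7, chain=("node-a", "node-b", "node-c"))
print(remove_failed_node(cfg, "node-b"))
# ChainConfig(volume_id='vol-123', epoch=8, chain=('node-a', 'node-c'))
```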

A single database is neither available nor reliable enough, so EBS runs ‘millions’ of Physalia deployments. Physalia itself is a transactional key-value store. The most interesting part is that Physalia’s deployment is tailored to increase availability and durability from the perspective of each individual client: Physalia aims to be extremely available, but only for the keys that it knows each client needs.

One of the major ideas here is blast radius. When a service fails, one parameter we worry about is the MTTR - how quickly, on average, we can get the system working again. Blast radius is another important parameter: how many users, clients, or workloads are affected by a single outage? If MTTR is the length of an outage, blast radius is its width. This is one reason why a single sharded database can’t meet Physalia’s requirements: a service-wide outage makes every key unavailable.
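A toy calculation (all numbers and the `incident_impact` helper are invented) makes the length-vs-width distinction concrete: a monolithic store and a fleet of independent cells can have the same MTTR, but each incident touches a very different fraction of clients.

```python
# MTTR is the length of an outage; blast radius is its width. Both matter.
def incident_impact(total_clients: int, affected_fraction: float, mttr_minutes: float):
    """Return (clients affected, minutes down) for a single incident."""
    return int(total_clients * affected_fraction), mttr_minutes


CLIENTS = 1_000_000

# One shared, sharded database: a service outage makes every key unavailable.
print(incident_impact(CLIENTS, affected_fraction=1.0, mttr_minutes=30))          # (1000000, 30)

# Millions of independent cells: a single cell outage touches roughly one client.
print(incident_impact(CLIENTS, affected_fraction=1 / CLIENTS, mttr_minutes=30))  # (1, 30)
```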

It’s worth pausing here to note that the paper says Physalia came out of a real incident: the EBS control plane suffered a complete brown-out during a cascading failure. That’s a failure mode with a huge blast radius!

The other important availability idea is failure correlation. When a service fails, it doesn’t matter so much if the client of that service is also experiencing an outage. If we can ensure that the service only fails when its client has also failed, their failure modes are perfectly correlated, and the client observes perfect availability, which is all that matters. One easy way to do this is to co-locate the service with the client (here, that means storing configuration data on the same instance as the EBS client). The problem is that other design goals are in tension with this kind of deployment: here, durability demands a degree of replication, and that means several different hosts.
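A quick sketch of the conditional-availability argument, with made-up probabilities: what a client observes is P(service up | client up), so correlating the service’s failures with the client’s improves the client’s view even when the service’s unconditional availability is unchanged.

```python
p_service_up = 0.999           # unconditional availability of the service

# Independent failures: the client sometimes finds the service down while it
# is itself perfectly healthy.
p_observed_independent = p_service_up

# Perfectly correlated failures (e.g. the service co-located with the client):
# whenever the client is up, so is the service.
p_observed_correlated = 1.0

print(p_observed_independent, p_observed_correlated)   # 0.999 1.0
```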

So, roughly, the availability goal of Physalia is to correlate its failures with those of its clients as closely as possible, given other service constraints such as replication. (You can see this argument made in section 3.2, mostly for EBS, though it’s a bit confusing how the paper switches between describing volumes and Physalia instances.) Thinking hard about blast radius is one way to improve failure correlation.

The details of the Physalia service itself are more traditional: each instance is a seven-way-replicated Paxos group with some protocol tweaks. Physalia offers a partitioned key space: each instance serves one key, and that key contains an entire K-V store which may be transactionally updated (with linearizable, multi-row operations). Each instance, called a cell, is independent of all other cells, except that the nodes running a cell’s replicas can also host replicas of other cells; some blast radius is given up here in return for the efficiency of packing replicas onto a single node. But care is taken that no two cells have too much in common, so the loss of one cell’s nodes cannot take out a majority of another cell just by being a bad neighbour.
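Here is a rough sketch of the shape of that data model: one logical key whose value is a small K-V store, updated with atomic multi-row transactions. The `Cell` class and its `transact` method are illustrative only, not Physalia’s actual API, and the Paxos replication and linearizability machinery are elided entirely.

```python
from typing import Callable, Dict, Optional


class Cell:
    """One Physalia-style cell: a single logical key holding a small K-V store."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}   # the rows held under this cell's key
        self._version = 0                  # bumped on every committed transaction

    def read(self, key: str) -> Optional[str]:
        return self._store.get(key)

    def transact(self, txn: Callable[[Dict[str, str]], Dict[str, str]]) -> int:
        """Run `txn` against a snapshot and apply all of its writes atomically."""
        snapshot = dict(self._store)
        writes = txn(snapshot)
        self._store.update(writes)
        self._version += 1
        return self._version


# Example: atomically update two rows describing a volume's replication chain.
cell = Cell()
cell.transact(lambda rows: {"chain": "node-a,node-c", "epoch": "8"})
print(cell.read("chain"), cell.read("epoch"))   # node-a,node-c 8
```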

One of the most fascinating parts of the system is the ‘control plane’ - the part of Physalia that decides where cell replicas should be placed. The paper describes what sound like a couple of heuristics: place replicas close to their clients, so that unrelated infrastructure outages don’t cause an uncorrelated failure, but also place replicas with enough diversity that the cell is reliable in toto (so not all on the same rack). The paper suggests that cell placement should be at least as diverse as the client and its EBS replicas.
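A hedged sketch of what such a placement heuristic might look like: the `place_cell` function, the rack model, and the overlap cap below are my guesses at the general shape, not the paper’s algorithm. It prefers nodes near the client, caps replicas per rack for diversity, and refuses to share a near-majority of nodes with any existing cell.

```python
from typing import Dict, List, Set


def place_cell(candidates: Dict[str, str], client_rack: str,
               existing_cells: List[Set[str]], replicas: int = 7,
               max_overlap: int = 3) -> List[str]:
    """Pick `replicas` nodes: near the client, rack-diverse, low overlap with other cells."""
    # Prefer nodes in the client's rack (a crude proxy for 'close to the client').
    ranked = sorted(candidates, key=lambda n: 0 if candidates[n] == client_rack else 1)
    chosen: List[str] = []
    per_rack: Dict[str, int] = {}
    for node in ranked:
        rack = candidates[node]
        if per_rack.get(rack, 0) >= 2:        # diversity: at most two replicas per rack
            continue
        tentative = set(chosen) | {node}
        if any(len(tentative & other) > max_overlap for other in existing_cells):
            continue                          # don't share a near-majority of nodes with another cell
        chosen.append(node)
        per_rack[rack] = per_rack.get(rack, 0) + 1
        if len(chosen) == replicas:
            break
    return chosen
```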

Other than these tantalising hints, not too much is said about the control plane. For example, how does it decide when a cell has failed and needs to be replaced? This is a subtle question: if the idea is to maximise availability from the client’s perspective, shouldn’t the availability of the cell be measured from the client as well? Otherwise we are back to the classic failure-detection problem: a centralised entity thinks everything is fine while the client is completely partitioned from the cell, and the failure modes become uncorrelated again. I would love to see a follow-up paper about the control plane itself: how it’s distributed, how much coordination it needs, how millions of instances are managed by operators in aggregate, and so on.

Some particular failure modes are considered, notably software-induced failures, including ‘poison pill’ transactions that exercise a bug in the software and, since every replica executes them identically, take down the whole cell. Physalia deploys software by ‘colors’: each cell has exactly one color (and cells which share nodes have the same color), so it’s very hard for bugs to propagate across colors. Basically this partitions the set of Physalia instances into several labelled subsets, and if you’re careful you can only take down one of those subsets at a time. Again, blast radius!
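A small sketch of how a color-aware rollout might be organised; the `CELL_COLOR` assignments and the `rollout_waves` helper are hypothetical, not AWS’s deployment tooling. A buggy build pushed to ‘red’ can only poison red cells; the other colors keep running the previous version until red looks healthy.

```python
from typing import Dict, List

# Each cell carries exactly one color; nodes hosting replicas of the same cell
# share that color. These assignments are illustrative.
CELL_COLOR: Dict[str, str] = {"cell-1": "red", "cell-2": "blue",
                              "cell-3": "red", "cell-4": "green"}


def rollout_waves(color_order: List[str]) -> List[List[str]]:
    """Group cells by color: each inner list is one deployment wave, pushed
    only after the previous wave looks healthy."""
    return [[cell for cell, color in CELL_COLOR.items() if color == wave]
            for wave in color_order]


print(rollout_waves(["red", "blue", "green"]))
# [['cell-1', 'cell-3'], ['cell-2'], ['cell-4']]
```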
