Network Load Balancing with Maglev

June 22, 2020

Maglev: A Fast and Reliable Software Network Load Balancer Eisenbud et. al., NSDI 2016

Load balancing is a fundamental primitive in modern service architectures - a service that assigns requests to servers so as to, well, balance the load on each server. This improves resource utilisation and ensures that servers aren’t unnecessarily overloaded.

Maglev is - or was, sometime before 2016 - Google’s network load-balancer that managed load-balancing duties for search, GMail and other high-profile Google services. A network load balancer typically balances TCP connections, having no understanding of the actual request load that any connection contains, unlike an application load balancer which is request-aware (e.g. an HTTP request load balancer).

In times of yore, load balancing was performed by dedicated network hardware, usually a monolithic beast which was hard to scale out, and which had poor fault tolerance properties (because you were roughly resilient to at most $N=1$ failures, and then only if there was a standby). As with databases, a scale-out, ‘commodity’ hardware approach made more sense - e.g. the proverbial racks of pizza boxes - and the load balancing could be done in software, once those pizza boxes got fast enough to process a high packet rate.

To be competitive with a hardware load balancer, however, a scale-out software solution must solve two challenges, per this paper: it must have high-enough per-node throughput to be economical (no use replacing one load balancer with 10k machines), and it must support connection persistence:

packets belonging to the same connection should always be directed to the same service endpoint

Connection persistence is a no-brainer requirement: TCP is a stateful, connection oriented protocol and if you start sending packets meant for one machine to another, otherwise identical, server, the connection will be reset immediately as the server realises it has no idea what the client is talking about.

Connection persistence is easy with a single machine, with no-one to share state with! It’s harder with a fleet of load balancers: they must have some way of coordinating their choices of where to send packets so that ultimately the same decision is made for every connection. It’s made even harder by the fact that not only might the set of backends change (and connections still need to be routed to the right place), but the set of load balancers might change, and a new load balancer server will have to ensure uninterrupted packet flows for existing connections.

Maglev’s contributions to the load balancing problem are:

A new consistent hashing algorithm to achieve near-perfect load balancing (with enough connections, while keeping connection churn low when the set of backends changes
A fast per-node packet forwarder that doesn’t involve the Linux kernel to send packets
Using a per-Maglev flow table that ensures all packets in the same flow go to the same host, which is computable independently at every Maglev.

Packet Flow

As a precursor, let’s remember that connections are identified by the so-called 5-tuple - (source host, destination host, source port, destination port, protocol). Almost all IP packets include the 5-tuple as part of their headers, which makes it possible to identify which connection, or flow, a packet belongs to. There’s going to be a lot of hashing of the 5-tuple throughout the packet lifecycle.

A Maglev deployment consists of a fleet of frontend load balancers, called Maglevs, and one or more sets of backend servers to which packets must be forwarded. Maglevs are internet-facing, and the clients that send traffic to them live out in the wild, non-Google network. Maglevs advertise one Virtual IP for each service instance - so simplistically, GMail might be 172.217.6.37 . Clients use DNS to look up that VIP, and (unknowingly) actually send a packet to a Maglev to start a connection.

Clients must also load balance amongst Maglevs, so we need some mechanism for ensuring that packets sent over the Internet arrive at some randomly picked Maglev. Maglev uses Equal-Cost Multipath routing (ECMP) - all Maglevs advertise BGP routes for that VIP of the same weight. A router should then pick amongst the available Maglevs at random. There is no guarantee of affinity to a particular Maglev for a given flow, even not withstanding restarts.

When a packet arrives at a Maglev it must be forwarded to one of the available backends. Let’s postpone the question of choosing which backend for now. The actual mechanics of forwarding a packet are handled by direct access to the local NIC’s send and receive queues, without involving the kernel at any point, a la DPDK. Packets are pulled from the NIC inbound queue and hashed to one of the per-core receiving queues - the hash as ever is based on the 5-tuple of the inbound packet. A per-core thread then pulls a packet out of the receiving queue, decides which Maglev to send it to, and wraps the packet using Generic Routing Encapsulation, which is essentially a standardised way of forwarding a packet that was not addressed to its actual destination. The wrapped packet, or more precisely a pointer to it, lands on a transmission queue, and a separate thread of execution pulls packets off each transmission queue and sends them to the NIC for actual transmission. This is all pretty standard user-space packet processing, using ring-buffers for queues and passing pointers wherever possible to avoid moving packets between cores.

At this point, the packet arrives at the actual service backend that can process it. Responses are sent directly to the client using Direct Server Return, because there’s no need for the inbound load-balancing tier to handle the extra work.

Backend Selection

So the remaining question is how a Maglev selects a service backend for a given connection. We recall that ideal backend selection does a great job of load balancing amongst available backends, and suffers little or no churn when the set of backends (or Maglevs) changes. We’d also like to avoid coordination between Maglevs where possible to make the LB tier more scalable, and achieving connection persistence under this constraint is a particular pain.

Google care more about load balancing than perfect connection persistence: you get more backend efficiency and capacity planning is easier. The downside is likely more connection churn, but some small amount is acceptable.

Some ideas that seem like they might work, but don’t:

Each Maglev round-robins connections between backends - load balances very well, but when the set of Maglevs change it’s impossible for a replacement Maglev to figure out what decisions its predecessor made and so all connections get reset because they are assigned to a likely-new backend.
Hash connections onto the set of backends - pretty good load balancing with enough connections, and two different Maglevs should make the same assignment decisions, but when the set of backends changes everything gets re-hashed (like when a hash table gets resized), and pretty much every connection gets broken.

A better approach is to come up with some hash scheme where the set of backends changing doesn’t actually change the existing assignment of connections to backends. A really simple device is simply to store connection assignments in a table, so that you don’t recompute it every time you see a packet on that connection. That way, if the set of backends changes, you are using the cached assignment for a connection which should point to the actual backend that connection is using.

This works well, but doesn’t handle the set of Maglevs changing - or what the paper calls connection affinity from a client to a particular Maglev. The client may, for many reasons, start sending packets to a different Maglev, and the new Maglev might not have a cached assignment for that particular connection: it will need to recompute the assignment and come up with the same backend if we want to avoid the connection being broken. So we need the computation to be stable in the face of Maglev and server churn.

Maglev uses consistent hashing for this. Consistent hashing is a set of algorithms for solving exactly the problem we’ve described: hash a set of objects into a set of buckets in such a way that if the set of buckets changes, only a few objects have to move. As the paper describes, consistent hashing is certainly not new, but previous algorithms focus more on minimising churn - the number of objects that move when things get rehashed - than on ideal load balancing.

Damian Gryski has written a wonderful survey of consistent hashing in the context of the kind of problem Maglev is trying to solve, including plenty of detail on Maglev and its precursors.

The basic idea in Maglev is to subdivide the space that connections can be mapped onto into many many buckets, each of which covers an equally-sized range of the hash space. Each backend claims an equal number of buckets - and so receives an equal, load-balanced share of the connection load. When a backend fails, or new ones are added, only a few of the buckets change owners.

Each Maglev can compute ownership for each bucket without having to talk to any other Maglevs - and if it has the exact same view of the set of backends, it will compute the same ownership for each bucket (and small differences in the set of backends will only mean small differences in the bucket ownership, because of consistent hashing!).

For each backend, Maglev computes a ‘preference list’ - basically an ordered list of bucket indexes - for each backend. So the first bucket in your list is the one you want to get assigned most, the second one the second most and so on. A preference list is just a permutation of the set of buckets, and is deterministically computed from hashing the name of the backend.

$\mathit{offset} \leftarrow h_1(\mathit{name}[i]) \mod M$

$\mathit{skip} \leftarrow h_2(\mathit{name})[i]) \mod M$

$\mathit{permutation}[i][j] \leftarrow (\mathit{offset} + j*\mathit{skip}) \mod M$

Here $i$ is the backend whose preference list is being calculated, $M$ is the number of buckets, and $h_1$ and $h_2$ are independent hash functions. So if $M$ is prime, which it should be, starting at offset and going to the bucket that is $\mathit{skip}$ buckets away, $\mod{M}$ , you’ll hit every bucket in a unique order for every backend.

Once all the permutations are computed, Maglev can assign buckets to backends by looping over every backend in turn over and over, and picking the next unassigned bucket in its preference list. So if you had preferences [3, 1, 4, 2], and buckets 3 and 1 were already assigned, you’d get bucket 4 on this iteration. Once all buckets are assigned, the loop exits.

Since every backend gets a bucket in turn, until all buckets are assigned, every backend gets an equal share of the buckets. (They may differ by up to 1 bucket). If you make the number of buckets large, each bucket covers less of the hash space and so the actual difference between backends is very small.

As for keeping churn low: the paper doesn’t carefully analyze the expected amount of churn. You can intuitively see that if a backend leaves the set, not too many buckets should move because in order for them to move, because the remaining backends still have the same preference order, and so still have a tendency to get assigned roughly the same buckets. 1/M of the existing connections would certainly get reset, which is a lower bound given that the physical server they have been assigned is no longer able to respond. Adding a new server would appear to have the same effect: 1/(M +1) of the buckets are assigned to it. However, fewer connections would get reset because of the caching effects of Maglevs storing existing connection assignments in a table.

Once you have the bucket assignment table, connection assignment is easy: take the 5-tuple hash and permanently assign the connection to whichever backend owns the bucket you were just hashed into.

Other Bits

As mentioned above, Maglevs need a consistent view of the set of backends. Changes to the set - like new backends being added - are delivered via a config RPC. Each Maglev individually checks the health of all backends, however, presumably to avoid Gray failures where whoever provides the config has a different view of backend health. It seems that this means a large disparity in the health views between Maglevs could lead to quite different bucket assignment tables, but that is unlikely in a datacenter setting if Maglevs are co-located.

A noted issue is SYN flooding, when an attacker tries to create a huge amount of connections in short order. The problem then is that the connection tables on each Maglev get full, and they’re hard to garbage collect because how do you know the connection is dead? I think it’s probably safe to age out some connections, trusting in the relative stability of connection preference for those that return after a Lazarus-like pause.

If the number of backends is small, it’s not so clear that Maglev is a great choice. Hashing connections onto backends is more likely to be unbalanced, and if you’re a tiny service with only a few backends, maybe round-robin with a shared connections table to mitigate the effect of LB restarts might be more effective. Of course, if you’re Google, you’re always going to say that your service is the biggest ever…