Paul Khuong: some Lisp

Fixing the hashing in "Hashing modulo α-equivalence"

2022-12-29T15:12:06-05:00

Per Vognsen sent me a link to Maziarz et al’s Hashing Modulo Alpha-Equivalence because its Lemma 6.6 claims to solve a thorny problem we have both encountered several times.

Essentially, the lemma says that computing the natural recursive combination of hash values over $b$ bits for two distinct trees (ADT instances) $a$ and $b$ yields a collision probability at most $\frac{|a| + |b|}{2^b}$ if we use a random hash function (sure), and Section 6.2 claims without proof that the result can be safely extended to the unspecified “seeded” hash function they use.

That’s a minor result, and the paper’s most interesting contribution (to me) is an algorithmically efficient alternative to the locally nameless representation: rather than representing bindings with simple binders and complex references, as in de Bruijn indices (lambda is literally just a lambda literal, but references must count how many lambdas to go up in order to find the correct bindings), Maziarz and his coauthors use simple references (holes, all identical), and complex binders (each lambda tracks the set of paths from the lambda binding to the relevant holes).

The rest all flows naturally from this powerful idea.

Part of the naturally flowing rest are collision probability analyses for a few hashing-based data structures. Of course it’s not what PLDI is about, but that aspect of the paper makes it look like the authors are unaware of analysis and design tools for hashing based algorithms introduced in the 1970s (a quick Ctrl-F for “universal,” “Wegman,” or “Carter” yields nothing). That probably explains the reckless generalisation from truly random hash functions to practically realisable ones.

There are two core responsibilities for the hashing logic:

incrementally hash trees bottom up (leaf to root)
maintain the hash for a map of variable name to (hash of) trees (that may grow bottom-up as well)

As Per saliently put it, there are two options for formal analysis of collision probabilities here: we can either assume a cryptographic hash function like SHA-3 or BLAKE3, in which case any collision is world-breaking news, so all that matters is serialising data unambiguously when feeding bytes to the hash function, or we can work in the universal hashing framework.

Collision probability analysis for the former is trivial, so let’s assume we want the latter, pinpoint where the paper is overly optimistic, and figure out how to fix it.

Incremental bottom-up hashing, without novelty

Let’s tackle the first responsibility: incrementally hashing trees bottom up.

The paper essentially says the following in Appendix A. Assume we have one truly random variable-arity hash function (“hash combiner”) $f$, and a tag for each constructor (e.g., $s_{\texttt{Plus}}$ for (Plus a b)); we can simply feed the constructor’s arity, its tag, and the subtrees’ hash values to $f$, e.g., $f(2, s_{\texttt{Plus}}, hv_a, hv_b)$… and goes on to show a surprisingly weak collision bound (the collision rate for two distinct trees grows with the sum of the size of both trees).¹

A non-intuitive fact in hash-based algorithms is that results for truly random hash functions often fail to generalise for the weaker “salted” hash functions we can implement in practice. For example, linear probing hash tables need 5-universal hash functions² in order to match the performance we expect from a naïve analysis with truly random hash functions. A 5-universal family of hash functions isn’t the kind of thing we use or come up with by accident (such families are parameterised by at least 5 words for word-sized outputs, and that’s a lot of salt).

The paper’s assumption that the collision bound it gets for a truly random function $f$ holds for practical salted/seeded hash functions is thus unwarranted (see, for example, these counterexamples for linear probing, or the seed-independent collisions that motivated the development of SipHash); strong cryptographic hash functions could work (find a collision, break Bitcoin), but we otherwise need a more careful analysis.

It so happens that we can easily improve on the collision bound with a classic incremental hashing approach: polynomial string hashing.

Polynomial string hash functions are computed over a fixed finite field $\mathbb{F}$ (e.g., arithmetic modulo a prime number $p$), and parameterised by a single point $x \in \mathbb{F}$.

Assuming a string of “characters” $v_i \in \mathbb{F}$ (e.g., we could hash strings of atomic bytes in arithmetic modulo a prime $p \geq 256$ by mapping each byte to the corresponding binary-encoded integer), the hash value is simply

\[v_0 + v_1 x + v_2 x^2 + \ldots + v_{n - 1} x^{n - 1},\]

evaluated in the field $\mathbb{F}$, e.g., $\mathbb{Z}/p\mathbb{Z}$.

For more structured atomic (leaf) values, we can serialise to bits and make sure the field is large enough, or split longer bit serialised values into multiple characters. And of course, we can linearise trees to strings by encoding them in binary S-expressions, with dedicated characters for open ( and close ) parentheses.³

The only remaining problem is to commute hashing and string concatenation: given two subtrees a, b, we want to compute the hash value of (Plus a b), i.e., hash "(Plus " + a + " " + b + ")" in constant time, given something of constant size, like hash values for a and b.

Polynomials offer a lot of algebraic structure, so it shouldn’t be a surprise that there exists a solution.

In addition to computing h(a), i.e., $\sum_{i=0}^{|a| - 1} a_i x^i,$ we will remember $x^{|a|}$, i.e., the product of x repeated for each “character” we fed to the hash function while hashing the subtree a. We can obviously compute that power in time linear in the size of a, although in practice we might prefer to first compute that size, and later exponentiate in logarithmic time with repeated squaring.

Equipped with this additional power of $x\in\mathbb{F}$, we can now compute the hash for the concatenation of two strings $h(a \mathtt{++} b)$ in constant time, given the hash and power of x for the constituent strings $a$ and $b$.

Expanding $h(a \mathtt{++} b)$ and letting $m = |a|, $ $n = |b| $ yields:

\[a_0 + a_1 x + \ldots + a_{m - 1} x^{m - 1} + b_0 x^m + b_1 x^{m + 1} + \ldots + b_{n - 1} x^{m + n - 1},\]

which we can rearrange as

\[a_0 + a_1 x + \ldots + a_{m - 1} x^{m - 1} + x^m (b_0 + b_1 x + \ldots + b_{n - 1} x^{n - 1}),\]

i.e.,

\[h(a \mathtt{++} b) = h(a) + x^{|a|} h(b),\]

and we already have all three right-hand side terms $h(a),$ $x^{|a|},$ and $h(b).$

Similarly, $x^{|a \mathtt{++} b|} = x^{|a| + |b|} = x^{|a|} \cdot x^{|b|},$ computable in constant time as well.

This gives us an explicit representation for the hash summary of each substring, so it’s easy to handle, e.g., commutative and associative operators by sorting the pairs of $(h(\cdot), x^{|\cdot|})$ that correspond to each argument before hashing their concatenation.

TL;DR: a small extension of classic polynomial string hashing commutes efficiently with string concatenation.
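The concatenation identity above can be checked with a minimal sketch. The field (arithmetic modulo the Mersenne prime $2^{61} - 1$) and the names `summarise`, `concat`, and `concat_sorted` are assumptions made for illustration; they do not come from the paper or from any library.

```python
import random

P = (1 << 61) - 1  # assumed field: integers modulo a Mersenne prime


def summarise(chars, x):
    """Return (h(s), x^|s|) for a sequence of field elements, low-order term first."""
    h, power = 0, 1
    for v in chars:
        h = (h + v * power) % P
        power = (power * x) % P
    return h, power


def concat(left, right):
    """Summary of a ++ b from the summaries of a and b, in constant time."""
    h_a, pow_a = left
    h_b, pow_b = right
    # h(a ++ b) = h(a) + x^|a| h(b), and x^|a ++ b| = x^|a| * x^|b|
    return (h_a + pow_a * h_b) % P, (pow_a * pow_b) % P


def concat_sorted(summaries):
    """Order-insensitive combination: sort the (hash, power) pairs, then concatenate."""
    result = (0, 1)  # summary of the empty string
    for summary in sorted(summaries):
        result = concat(result, summary)
    return result


x = random.randrange(P)  # the random evaluation point (the "seed")
a = list(b"(Plus 1 ")
b = list(b"2)")
assert concat(summarise(a, x), summarise(b, x)) == summarise(a + b, x)
```

Sorting the pairs before concatenating is one way to get the same summary regardless of argument order, which is what a commutative operator needs.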

And the collision rate? We compute the same polynomial string hash, so two distinct strings of length at most $n$ collide with probability at most $n/|\mathbb{F}|$ (with the expectation over the generation of the random point $x \in \mathbb{F}$);⁴ never worse than Lemma 6.6 of Maziarz et al, and up to twice as good.

Practical implementations of polynomial string hashing tend to evaluate the polynomial with Horner’s method rather than maintaining $x^i$. The result computes a different hash function, since it reverses the order of the terms in the polynomial, but that’s irrelevant for collision analysis. The concatenation trick is similarly little affected: we now want $h(a \mathtt{++} b) = x^{|b|} h(a) + h(b)$.
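For the Horner-style ordering, a similar sketch applies; only the concatenation rule changes. As before, the modulus and the function names are illustrative assumptions.

```python
P = (1 << 61) - 1  # assumed field: integers modulo a Mersenne prime


def summarise_horner(chars, x):
    """Return (h(s), x^|s|), evaluating the polynomial with Horner's method."""
    h, power = 0, 1
    for v in chars:
        h = (h * x + v) % P  # the first character ends up with the highest power
        power = (power * x) % P
    return h, power


def concat_horner(left, right):
    """Summary of a ++ b: h(a ++ b) = x^|b| h(a) + h(b)."""
    h_a, pow_a = left
    h_b, pow_b = right
    return (h_a * pow_b + h_b) % P, (pow_a * pow_b) % P
```

Appending a single character `v` is then `concat_horner(summary, (v % P, x % P))`, which is the usual Horner update step.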

Hashing unordered maps and sets

The term representation introduced in “Hashing Modulo Alpha-Equivalence” contains a map from variable name to a tree representation of the holes where the variable goes (like a DAWG representation for a set of words where each word is a path, except the paths only share as they get closer to the root of the tree… so maybe more like snoc lists with sharing).

We already know how to hash trees incrementally; the new challenge is in maintaining the hash value for a map.

Typically, one hashes unordered sets or maps by storing them in balanced trees sorted primarily on the key’s hash value, and secondarily on the key.⁵ We can also easily tweak arbitrary balanced trees to maintain the tree’s hash value as we add or remove entries: augment each node with the hash and power of x for the serialised representation of subtree rooted at the node.⁶

The paper instead takes the treacherously attractive approach of hashing individual key-value pairs, and combining them with an abelian group operator (commutative and associative, and where each element has an inverse)… in their case, bitwise xor over fixed-size words.

Of course, for truly random hash functions, this works well enough, and the proof is simple. Unfortunately, just because a practical hash function is well distributed for individual values does not mean pairs or triplets of values won’t show any “clumping” or pattern. That’s what $k$-universality is all about.

For key-value pairs, we can do something simple: associate one hash function from an (almost-xor)-universal family with each key, and use it to mix the associated value before xoring everything together.

It’s not always practical to associate one hash function with each key, but it does work for the data structure introduced in “Hashing modulo Alpha-Equivalence:” the keys are variable names, and these were regenerated arbitrarily to ensure uniqueness in a prior linear traversal of the expression tree. The “variable names” could thus include (or be) randomly generated parameters for an (almost-xor)-universal family.

Multiply-shift is universal, so that would work; other approaches modulo a Mersenne prime should also be safe to xor.

For compilers where hashing speed is more important than compact hash values, almost-universal families could make sense.

The simplest almost-xor-universal family of hash functions on contemporary hardware is probably PH, a 1-universal family that maps a pair of words $(x_1, x_2)$ to a pair of output words, and is parameterised on a pair of words $(a_1, a_2)$:

\[\texttt{PH}_a(x) = (x_1 \oplus a_1) \odot (x_2 \oplus a_2),\]

where $\oplus$ is the bitwise xor, and $\odot$ an unreduced carryless multiplication (e.g., x86 CLMUL).

Each instance of PH accepts a pair of $w$-bit words and returns a $2w$-bit result; that’s not really a useful hash function.

However, not only does PH guarantee a somewhat disappointing collision rate at most $2^{-w}$ for distinct inputs (expectation taken over the $2w$-bit parameter $(a_1, a_2)$), but, crucially, the results from any number of independently parameterised PH can be combined with xor and maintain that collision rate!
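A rough sketch of PH and of the xor combination follows, with the carryless multiplication done in software instead of with CLMUL. The word size, the `(params, value)` entry layout, and every function name are assumptions made for illustration.

```python
W = 64  # assumed word size, in bits


def clmul(a, b):
    """Unreduced carryless multiplication (polynomial product over GF(2))."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def ph(params, x):
    """PH_a(x) = (x1 xor a1) clmul (x2 xor a2), for pairs of W-bit words."""
    a1, a2 = params
    x1, x2 = x
    return clmul(x1 ^ a1, x2 ^ a2)


def map_hash(entries):
    """Xor together PH of each value, using the parameters attached to its key.

    `entries` is an iterable of (params, value) pairs, one per key.
    """
    acc = 0
    for params, value in entries:
        acc ^= ph(params, value)
    return acc
```

Because xor is commutative and self-inverse, the result does not depend on iteration order, and an entry can be removed by xoring its PH value in again.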

For compilers that may not want to rely on cryptographic extensions, the NH family also works, with $\oplus$ mapping to addition modulo $2^w$, and $\odot$ to full multiplication of two $w$-bit multiplicands into a single $2w$-bit product. The products have the similar property of colliding with probability $2^{-w}$ even once combined with addition modulo $2^{2w}$.
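The NH variant only needs ordinary integer arithmetic. This sketch assumes 64-bit words and a hypothetical `(params, value)` layout for each map entry; none of the names come from an existing library.

```python
W = 64  # assumed word size, in bits
MASK_W = (1 << W) - 1
MASK_2W = (1 << (2 * W)) - 1


def nh(params, x):
    """NH_a(x) = ((x1 + a1) mod 2^W) * ((x2 + a2) mod 2^W), a 2W-bit product."""
    a1, a2 = params
    x1, x2 = x
    return ((x1 + a1) & MASK_W) * ((x2 + a2) & MASK_W)


def map_hash_nh(entries):
    """Sum NH of each (params, value) entry, modulo 2^(2W)."""
    acc = 0
    for params, value in entries:
        acc = (acc + nh(params, value)) & MASK_2W
    return acc
```

Addition modulo $2^{2w}$ plays the role xor plays for PH: it is commutative, and subtracting an entry's NH value removes it from the sum.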

Regardless of the hash function, it’s cute. Useful? Maybe not, when we could use purely functional balanced trees, and time complexity is already in linearithmic land.

Unknown unknowns and walking across the campus

None of this takes away from the paper, which I found both interesting and useful (I intend to soon apply its insights), and it’s all fixable with a minimal amount of elbow grease… but the paper does make claims it can’t back, and that’s unfortunate when reaching out to people working on hash-based data structures would have easily prevented the issues.

I find cross-disciplinary collaboration most effective for problems we’re not even aware of, unknown unknowns for some, unknown knowns for the others. Corollary: we should especially ask experts for pointers and quick gut checks when we think it’s all trivial because we don’t see anything to worry about.

Thank you Per for linking to Maziarz et al’s nice paper and for quick feedback as I iterated on this post.

Perhaps not that surprising given the straightforward union bound. ↩
Twisted tabular hashing also works despite not being quite 5-universal, and is already at the edge of practicality. ↩
It’s often easier to update a hash value when appending a string, so reverse Polish notation could be a bit more efficient. ↩
Two distinct inputs a and b define polynomials $p_a$ and $p_b$ of respective degree $|a|$ and $|b|$. They only collide for a seed $x\in\mathbb{F}$ when $p_a(x) = p_b(x),$ i.e., $p_a(x) - p_b(x) = 0$. This difference is a non-zero polynomial of degree at most $\max(|a|, |b|),$ so at most that many of the $|\mathbb{F}|$ potential values for $x$ will lead to a collision. ↩
A more efficient option in practice, if maybe idiosyncratic, is to use Robin Hood hashing with linear probing to maintain the key-value pairs sorted by hash(key) (and breaking improbable ties by comparing the keys themselves), but that doesn’t lend itself well to incremental hash maintenance. ↩
Cryptographically-minded readers might find Incremental Multiset Hashes and their Application to Integrity Checking interesting. ↩

Plan B for UUIDs: double AES-128

2022-07-11T22:38:02-04:00

It looks like internauts are having another go at the “UUID as primary key” debate, where the fundamental problem is the tension between nicely structured primary keys that tend to improve spatial locality in the storage engine, and unique but otherwise opaque identifiers that avoid running into Hyrum’s law when communicating with external entities and generally prevent unintentional information leakage.¹

I guess I’m lucky that the systems I’ve worked on mostly fall in two classes:²

those with trivial write load (often trivial load in general), where the performance implications of UUIDs for primary keys are irrelevant.
those where performance concerns lead us to heavily partition the data, by tenant if not more finely… making information leaks from sequential allocation a minor concern.

Of course, there’s always the possibility that a system in the first class eventually handles a much higher load. Until roughly 2016, I figured we could always sacrifice some opacity and switch to one of the many k-sorted alternatives created by web-scale companies.

By 2016-17, I felt comfortable assuming AES-NI was available on any x86 server,³ and that opens up a different option: work with structured “leaky” keys internally, and encrypt/decrypt them at the edge (e.g., by printing a user-defined type in the database server). Assuming we get the cryptography right, such an approach lets us have our cake (present structured keys to the database’s storage engine), and eat it too (present opaque unique identifiers to external parties), as long as the computation overhead of repeated encryption and decryption at the edge remains reasonable.

I can’t know why this approach has so little mindshare, but I think part of the reason must be that developers tend to have an outdated mental cost model for strong encryption like AES-128.⁴ This quantitative concern is the easiest to address, so that’s what I’ll do in this post. That leaves the usual hard design questions around complexity, debuggability, and failure modes… and new ones related to symmetric key management.

A short intermission for questions^Wcomments

Brandur compares sequential keys and UUIDs. I’m thinking more generally about “structured” keys, which may be sequential in single-node deployments, or include a short sharding prefix in smaller (range-sharded) distributed systems. Eventually, a short prefix will run out of bits, and fully random UUIDs are definitely more robust for range-sharded systems that might scale out to hundreds of nodes… especially ones focused more on horizontal scalability than single-node performance.

That being said, design decisions that unlock scalability to hundreds or thousands of nodes have a tendency to also force you to distribute work over a dozen machines when a laptop might have sufficed.

Mentioning cryptography makes people ask for a crisp threat model. There isn’t one here (and the question makes sense outside cryptography and auth!).

Depending on the domain, leaky or guessable external ids can enable scraping, let competitors estimate the creation rate and number of accounts (or, similarly, activity) in your application, or, more benignly, expose an accidentally powerful API endpoint that will be difficult to replace.

Rather than try to pinpoint the exact level of dedication we’re trying to foil, from curious power user to nation state actor, let’s aim for something that’s hopefully as hard to break as our transport (e.g., HTTPS). AES should be helpful.

Hardware-assisted AES: not not fast

Intel shipped their first chip with AES-NI in 2010, and AMD in 2013. A decade later, it’s anything but exotic, and is available even in low-power Goldmont Atoms. For consumer hardware, with a longer tail of old machines than servers, the May 2022 Steam hardware survey shows 96.28% of the responses came from machines that support AES-NI (under “Other Settings”), an availability rate somewhere between those of AVX (2011) and SSE4.2 (2008).

The core of the AES-NI extension to the x86-64 instruction set is a pair of instructions to perform one round of AES encryption (AESENC) or one round of decryption (AESDEC) on a 16-byte block. Andreas Abel’s uops.info shows that the first implementation, in Westmere, had a 6-cycle latency for each round, and that Intel and AMD have been optimising the instructions to bring their latencies down to 3 (Intel) or 4 (AMD) cycles per round.

That’s pretty good (on the order of a multiplication), but each instruction only handles one round. The schedule for AES-128, the fastest option, consists of 10 rounds: an initial whitening xor, 9 aesenc / aesdec and 1 aesenclast / aesdeclast. Multiply 3 cycles per round by 10 “real” rounds, and we find a latency of 30 cycles ($+ 1$ for the whitening xor) on recent Intels and $40 + 1$ cycles on recent AMDs, assuming the key material is already available in registers or L1 cache.

This might be disappointing given that AES128-CTR could already achieve more than 1 byte/cycle in 2013. There’s a gap between throughput and latency because pipelining lets contemporary x86 chips start two rounds per cycle, while prior rounds are still in flight (i.e., 6 concurrent rounds when each has a 3 cycle latency).
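The latency and pipelining arithmetic above can be spelled out as a tiny worked example. The function names are made up for illustration, and the per-round latencies are the ones quoted in the text.

```python
ROUNDS = 10  # AES-128: 9 aesenc/aesdec rounds plus 1 aesenclast/aesdeclast


def block_latency(round_latency, whitening=1):
    """Cycles to push one 16-byte block through all rounds, one after the other."""
    return ROUNDS * round_latency + whitening


def rounds_in_flight(rounds_started_per_cycle, round_latency):
    """How many independent rounds the pipeline can overlap."""
    return rounds_started_per_cycle * round_latency


intel = block_latency(3)       # 30 + 1 cycles
amd = block_latency(4)         # 40 + 1 cycles
overlap = rounds_in_flight(2, 3)  # 6 concurrent rounds
```

The gap between the two functions is the gap between latency and throughput: a single block cannot use the overlap, while a stream of independent blocks can.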

Still, 35-50 cycles latency to encrypt or decrypt a single 16-byte block with AES-128 is similar to a L3 cache hit… really not that bad compared to executing a durable DML statement, or even a single lookup in a big hash table stored in RAM.

A trivial encryption scheme for structured keys

AES works on 16 byte blocks, and 16-byte randomish external ids are generally accepted practice. The simplest approach to turn structured keys into something that’s provably difficult to distinguish from random bits probably goes as follows:

Fix a global AES-128 key.
Let primary keys consist of a sequential 64-bit id and a randomly generated 64-bit integer.⁵
Convert a primary key to an external id by encrypting the primary key’s 128 bits with AES-128, using the global key (each global key defines a unique permutation from 128-bit inputs to 128-bit outputs).
Convert an external id to a potential primary key by decrypting the external id with AES-128, using the same global key.

source: aes128.c

The computational core lies in the encode and decode functions, two identical functions from a performance point of view. We can estimate how long it takes to encode (or decode) an identifier by executing encode in a tight loop, with a data dependency linking each iteration to the next; the data dependency is necessary to prevent superscalar chips from overlapping multiple loop iterations.⁶

uiCA predicts 36 cycles per iteration on Ice Lake. On my unloaded 2 GHz EPYC 7713, I observe 50 cycles/encode (without frequency boost), and 13.5 ns/encode when boosting a single active core. That’s orders of magnitude less than a syscall, and in the same range as a slow L3 hit.

source: aes128-latency.c

This simple solution works if our external interface may expose arbitrary 16-byte ids. AES-128 defines a permutation, so we could also run it in reverse to generate sequence/nonce pairs for preexisting rows that avoid changing their external id too much (e.g., pad integer ids with zero bytes).

However, it’s sometimes important to generate valid UUIDs, or to at least save one bit in the encoding as an escape hatch for a versioning scheme. We can do that, with format-preserving encryption.

Controlling one bit in the external encrypted id

We view our primary keys as pairs of 64-bit integers, where the first integer is a sequentially allocated identifier. Realistically, the top bit of that sequential id will always be zero (i.e., the first integer’s value will be less than $2^{63}$). Let’s ask the same of our external ids.

The code in this post assumes a little-endian encoding, for simplicity (and because the world runs on little endian), but the same logic works for big endian.

Black and Rogaway’s cycle-walking method can efficiently fix one input/output bit: we just keep encrypting the data until bit 63 is zero.

When decrypting, we know the initial (fully decrypted) value had a zero in bit 63, and we also know that we only re-encrypted when the output did not have a zero in bit 63. This means we can keep iterating the decryption function (at least once) until we find a value with a zero in bit 63.

source: aes128-cycle-walk.c

This approach terminates after two rounds of encryption (encode) or decryption (decode), in expectation.
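Here is a sketch of cycle walking in Python. The standard library has no AES, so `permute` / `unpermute` are a toy 4-round Feistel permutation over keyed BLAKE2b that merely stands in for AES-128 encryption and decryption; it is an assumption made to keep the example self-contained, not the construction used in aes128-cycle-walk.c. Blocks are 128-bit integers whose low 64 bits hold the first (sequential) integer.

```python
import hashlib

MASK64 = (1 << 64) - 1


def _prf(key, round_index, half):
    """Stand-in PRF: keyed BLAKE2b truncated to 64 bits (not AES)."""
    digest = hashlib.blake2b(half.to_bytes(8, "little"), key=key,
                             person=bytes([round_index]), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def permute(key, block):
    """Toy 128-bit permutation, standing in for AES-128 encryption."""
    lo, hi = block & MASK64, block >> 64
    for r in range(4):
        lo, hi = hi, lo ^ _prf(key, r, hi)
    return lo | (hi << 64)


def unpermute(key, block):
    """Inverse of permute, standing in for AES-128 decryption."""
    lo, hi = block & MASK64, block >> 64
    for r in reversed(range(4)):
        lo, hi = hi ^ _prf(key, r, lo), lo
    return lo | (hi << 64)


def bit63(block):
    return (block >> 63) & 1


def encode(key, internal):
    """Encrypt repeatedly until bit 63 of the result is zero."""
    assert bit63(internal) == 0
    external = permute(key, internal)
    while bit63(external):
        external = permute(key, external)
    return external


def decode(key, external):
    """Decrypt at least once, and keep going until bit 63 is zero."""
    internal = unpermute(key, external)
    while bit63(internal):
        internal = unpermute(key, internal)
    return internal
```

Both loops terminate because a permutation only has cycles, and the cycle that contains a valid input contains at least one value with a zero in bit 63: the input itself.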

That’s not bad, but some might prefer a deterministic algorithm. More importantly, the expected runtime scales exponentially with the number of bits we want to control, and no one wants to turn their database server into a glorified shitcoin miner. This exponential scaling is far from ideal for UUIDv4, where only 122 of the 128 bits act as payload: we can expect to loop 64 times in order to fix the remaining 6 bits.

Controlling more bits with a Feistel network

A Feistel network derives a permutation over tuples of values from hash functions over the individual values. There are NIST recommendations for general format-preserving encryption (FFX) with Feistel networks, but they call for 8+ AES invocations to encrypt one value.

FFX solves a much harder problem than ours: we only have 64 bits (not even) of actual information, the rest is just random bits. Full format-preserving encryption must assume everything in the input is meaningful information that must not be leaked, and supports arbitrary domains (e.g., decimal credit card numbers).

Our situation is closer to a 64-bit payload (the sequential id) and a 64-bit random nonce. It’s tempting to simply xor the payload with the low bits of (truncated) AES-128, or any PRF like SipHash⁷ or BLAKE3 applied to the nonce:

BrokenPermutation(id, nonce):
    id ^= PRF_k(nonce)[0:len(id)]  # e.g., truncated AES_k
    return (id, nonce)

The nonce is still available, so we can apply the same PRF_k to the nonce, and undo the xor (xor is a self-inverse) to recover the original id. Unfortunately, random 64-bit values could repeat on realistic database sizes (a couple billion rows). When an attacker observes two external ids with the same nonce, they can xor the encrypted payloads and find the xor of the two plaintext sequential ids. This might seem like a minor information leak, but clever people have been known to amplify similar leaks and fully break encryption systems.
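The leak is easy to reproduce. In this sketch, keyed BLAKE2b truncated to 64 bits stands in for `PRF_k`; the helper names, the key, and the example ids are arbitrary assumptions.

```python
import hashlib


def prf64(key, value):
    """Stand-in PRF: keyed BLAKE2b truncated to 64 bits."""
    digest = hashlib.blake2b(value.to_bytes(8, "little"), key=key,
                             digest_size=8).digest()
    return int.from_bytes(digest, "little")


def broken_permutation(key, ident, nonce):
    """Xor the id with PRF_k(nonce) and return the nonce unchanged."""
    return ident ^ prf64(key, nonce), nonce


key = b"0123456789abcdef"
repeated_nonce = 0x1234  # two rows that happened to draw the same nonce
first = broken_permutation(key, 1000, repeated_nonce)
second = broken_permutation(key, 1007, repeated_nonce)

# The attacker sees equal nonces, xors the encrypted ids, and the PRF output cancels.
leak = first[0] ^ second[0]
assert leak == 1000 ^ 1007
```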

Intuitively, we’d want to also mix the 64 random bits before returning an external id. That sounds a lot like a Feistel network, for which Luby and Rackoff have shown that 3 rounds are pretty good:

PseudoRandomPermutation(A, B):
    B ^= PRF_k1(A)[0:len(B)]  # e.g., truncated AES_k1
    A ^= PRF_k2(B)[0:len(A)]
    B ^= PRF_k3(A)[0:len(B)]
    return (A, B)

This function is reversible (a constructive proof that it’s a permutation): apply the ^= PRF_k steps in reverse order (at each step, the value fed to the PRF passes unscathed), like peeling an onion.

If we let A be the sequentially allocated id, and B the 64 random bits, we can observe that xoring the uniformly generated B with a pseudorandom function’s output is the same as generating bits uniformly. In our case, we can skip the first round of the Feistel network; we deterministically need exactly two PRF evaluations, instead of the two expected AES (PRP) evaluations for the previous cycle-walking algorithm.

ReducedPseudoRandomPermutation(id, nonce):
    id ^= AES_k1(nonce)[0:len(id)]
    nonce ^= AES_k2(id)[0:len(nonce)]
    return (id, nonce)

This is a minimal tweak to fix BrokenPermutation: we hide the value of nonce before returning it, in order to make it harder to use collisions. That Feistel network construction works for arbitrary splits between id and nonce, but closer (balanced) bitwidths are safer. For example, we can work within the layout proposed for UUIDv8 and assign $48 + 12 = 60$ bits for the sequential id (row id or timestamp), and 62 bits for the uniformly generated value.⁸
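A sketch of the two-round construction with the 60/62-bit split mentioned above, including the inverse. Keyed BLAKE2b stands in for truncated AES here (an assumption, so that the example only needs the standard library), and the function names are hypothetical; packing the two fields into an actual UUID layout is left out.

```python
import hashlib

ID_BITS = 60     # sequential id (row id or timestamp)
NONCE_BITS = 62  # uniformly generated value


def prf(key, value, bits):
    """Stand-in PRF: keyed BLAKE2b, truncated to the requested number of bits."""
    digest = hashlib.blake2b(value.to_bytes(8, "little"), key=key,
                             digest_size=8).digest()
    return int.from_bytes(digest, "little") & ((1 << bits) - 1)


def encode(k1, k2, ident, nonce):
    """Two Feistel rounds: hide the id behind the nonce, then hide the nonce."""
    ident ^= prf(k1, nonce, ID_BITS)
    nonce ^= prf(k2, ident, NONCE_BITS)
    return ident, nonce


def decode(k1, k2, ident, nonce):
    """Undo the rounds in reverse order; each PRF input is still intact."""
    nonce ^= prf(k2, ident, NONCE_BITS)
    ident ^= prf(k1, nonce, ID_BITS)
    return ident, nonce
```

Truncating the PRF output to the width of the field being masked is what keeps each half within its bit budget, so the result always fits the chosen layout.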

source: aes128-feistel.c

Again, we can evaluate the time it takes to encode (or symmetrically, decode) an internal identifier into an opaque UUID by encoding in a loop, with a data dependency between each iteration and the next (source: aes128-feistel-latency.c).

The format-preserving Feistel network essentially does double the work of a plain AES-128 encryption, with a serial dependency between the two AES-128 evaluations. We expect roughly twice the latency, and uiCA agrees: 78 cycles/format-preserving encoding on Ice Lake (compared to 36 cycles for AES-128 of 16 bytes).

On my unloaded 2 GHz EPYC 7713, I observe 98 cycles/format-preserving encoding (compared to 50 cycles for AES-128 of 16 bytes), and 26.5 ns/format-preserving encoding when boosting a single active core (13.5 ns for AES-128).

Still much faster than a syscall, and, although twice as slow as AES-128 of one 16 byte block, not that slow: somewhere between a L3 hit and a load from RAM.

Sortable internal ids, pseudo-random external ids: not not fast

With hardware-accelerated AES-128 (SipHash or BLAKE3 specialised for 8-byte inputs would probably be slower, but not unreasonably so), converting between structured 128-bit ids and opaque UUIDs takes less than 100 cycles on contemporary x86-64 servers… faster than a load from main memory!

This post only addressed the question of runtime performance. I think the real challenges with encrypting external ids aren’t strictly technical in nature, and have more to do with making it hard for programmers to accidentally leak internal ids. I don’t know how that would go because I’ve never had to use this trick in a production system, but it seems like it can’t be harder than doing the same in a schema that has explicit internal primary keys and external ids on each table. I’m also hopeful that one could do something smart with views and user-defined types.

Either way, I believe the runtime overhead of encrypting and decrypting 128-bit identifiers is a non-issue for the vast majority of database workloads. Arguments against encrypting structured identifiers should probably focus on system complexity, key management⁹ (e.g., between production and testing environments), and graceful failure in the face of faulty hardware or code accidentally leaking internal identifiers.

Thank you Andrew, Barkley, Chris, Jacob, Justin, Marius, and Ruchir, for helping me clarify this post, and for reminding me about things like range-sharded distributed databases.

I’m told I must remind everyone that sharing internal identifiers with external systems is a classic design trap, because one day you’ll want to decouple your internal representation from the public interface, and that’s really hard to do when there’s no explicit translation step anywhere. ↩
There’s also a third class of really performance-sensitive systems, where the high-performance data plane benefited from managing a transient (reallocatable) id space separately from the control plane’s domain-driven keys… much like one would use mapping tables to decouple internal and external keys. ↩
ARMv8’s cryptographic extension offers similar AESD/AESE instructions. ↩
On the other hand, when I asked twitter to think about it, most responses were wildly optimistic, maybe because people were thinking of throughput and not latency. ↩
The first 64-bit field can be arbitrarily structured, and, e.g., begin with a sharding key. The output also isn’t incorrect if the second integer is always 0 or a table-specific value. However, losing that entropy makes it easier for an attacker to correlate ids across tables. ↩
It’s important to measure latency and not throughput because we can expect to decode one id at a time, and immediately block with a data dependency on the decoded result. Encoding may sometimes be closer to a throughput problem, but low latency usually implies decent throughput, while the converse is often false. For example, a 747 carrying 400 passengers across the Atlantic in just under 5 hours is more efficient in person-km/h (throughput) than a Concorde, with a maximum capacity of 100 passengers, but the Concorde was definitely faster: three and a half hours from JFK to LHR is shorter than five hours, and that’s the metric individual passengers usually care about. ↩
Most likely an easier route than AES in a corporate setting that’s likely to mandate frequent key rotation. ↩
Or copy UUIDv7, with its 48-bit timestamp and 74 bit random value. ↩
Rotating symmetric keys isn’t a hard technical problem, when generating UUIDs with a Feistel network: we can use 1-2 bits to identify keys, and eventually reuse key ids. Rotation however must imply that we will eventually fail to decode (reject) old ids, which may be a bug or a feature, depending on who you ask. A saving grace may be that it should be possible for a service to update old external ids to the most recent symmetric key without accessing any information except the symmetric keys. ↩

Hacking tips for Linux perf porcelain

2022-06-01T21:09:08-04:00

Sometimes you just want to abuse Linux perf to make it do a thing it’s not designed for, and a proper C program would represent an excessive amount of work.

Here are two tricks I find helpful when jotting down hacky analysis scripts.

Programmatically interacting with `addr2line -i`

Perf can resolve symbols itself, but addr2line is a lot more flexible (especially when you inflict subtle things on your executable’s mappings).

It’s already nice that addr2line -Cfe /path/to/binary lets you write hex addresses to stdin and spits out the corresponding function name on one line, and its source location on the next (or ?? / ??:0 if debug info is missing). However, for heavily inlined (cough C++ cough) programs, you really want the whole callstack that’s encoded in the debug info, not just the most deeply inlined function (“oh great, it’s in std::vector::size()”).

The --inline flag addresses that… by printing source locations for inline callers on their own line(s). Now that the output for each address can span a variable number of lines, how is one to know when to stop reading?

A simple trick is to always write two addresses to addr2line’s standard input: the address we want to symbolicate, and one that never has debug info (e.g., 0).

EDIT: Travis Downs reports that llvm-addr2line-14 finds debug info for 0x0 (presumably a bug; I don’t see that on llvm-addr2line-12) and suggests looking for 0x0.* in addition to ??/??:0. It’s easy enough to stop when either happens, and clang’s version of addr2line can be a lot faster than binutils’ on files with a lot of debug information.¹

We now know that the first set of resolution information lines (one line when printing only the file and line number, two lines when printing function names as well with -f) belongs to the address we want to symbolicate. We also know to expect output for missing information (??:0 or ?? / ??:0) from the dummy address. We can thus keep reading until we find a set of lines that corresponds to missing information, and disregard that final source info.

For example, passing $IP\n0\n on stdin could yield:

??
??:0
??
??:0

or, without -f function names,

??:0
??:0

In both cases we first consume the first set of lines (the output for $IP must include at least one record), then consume the next set of lines and observe that it represents missing information, so we stop reading.

When debug information is present, we might instead find

foo()
src/foo.cc:10
??
??:0

The same algorithm clearly works.

Finally, with inlining, we might instead observe

inline_function()
src/bar.h:5
foo()
src/foo.cc:12
??
??:0

We’ll unconditionally assign the first pair of lines to $IP, read a second pair of lines, see that it’s not ?? / ??:0 and push that to the bottom of the inline source location stack, and finally stop after reading the third pair of lines.
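The reading loop described above can be sketched in a few lines of Python. This is a minimal sketch, not code from the original script: the names `MISSING`, `read_frames`, and `symbolicate` are made up for illustration, and it assumes that `-f` is passed (so each record spans two lines) and that missing information shows up exactly as ?? / ??:0.

```python
import subprocess
from typing import Iterator, List, Tuple

# What addr2line -f prints for an address without debug info (assumed exact).
MISSING = ("??", "??:0")


def read_frames(lines: Iterator[str]) -> List[Tuple[str, str]]:
    """Read the records for one address, up to and excluding the sentinel's."""

    def next_pair() -> Tuple[str, str]:
        function = next(lines).rstrip("\n")
        location = next(lines).rstrip("\n")
        return function, location

    # The first record always belongs to the address we asked about,
    # even if it is itself ?? / ??:0.
    frames = [next_pair()]
    while True:
        pair = next_pair()
        if pair == MISSING:
            # This is the output for the dummy address: stop and drop it.
            return frames
        # Otherwise, it is an inline caller of the previous frame.
        frames.append(pair)


def symbolicate(proc: subprocess.Popen, address: int) -> List[Tuple[str, str]]:
    """Write the address and the dummy address 0, then read the answer.

    `proc` is assumed to be something like
    subprocess.Popen(["addr2line", "-Cfie", path], stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE, text=True).
    """
    proc.stdin.write(f"{address:#x}\n0\n")
    proc.stdin.flush()
    return read_frames(proc.stdout)
```

Supporting llvm-addr2line-14 would mean also stopping on a record that matches 0x0.*, as suggested in the EDIT above.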

Triggering PMU events from non-PMU perf events

Performance monitoring events in perf tend to be much more powerful than non-PMU events: each perf “driver” works independently, so only PMU events can snapshot the Processor Trace buffer, for example.

However, we sometimes really want to trigger on a non-PMU event. For example, we might want to watch for writes to a specific address with a hardware breakpoint, and snapshot the PT buffer to figure out what happened in the microseconds preceding that write. Unfortunately, that doesn’t work out of the box: only PMU events can snapshot the buffer. I remember running into a similar limitation when I wanted to capture performance counters after non-PMU events.

There is however a way to trigger PMU events from most non-PMU events: watch for far branches! I believe I also found these events much more reliable to detect preemption than the scheduler’s software event, many years ago.

Far branches are rare (they certainly don’t happen in regular x86-64 userspace programs), but interrupts usually trigger a far CALL to execute the handler in ring 0 (attributed to ring 0), and a far RET to switch back to the user program (attributed to ring 3).

We can thus configure

perf record \
    -e intel_pt//u \
    -e BR_INST_RETIRED.FAR_BRANCH/aux-sample-size=...,period=1/u \
    -e mem:0x...:wu ...

to:

trigger a debug interrupt when userspace writes to the watched memory address
which will increment the far_branch performance monitoring counter
which triggers Linux’s performance monitoring interrupt handler
which will finally write both the far branch event and its associated PT buffer to the perf event ring buffer.

Not only does this work, but it also minimises the trigger latency. That’s a big win compared to, e.g., perf record’s built-in --switch-output-event: a trigger latency on the order of hundreds of microseconds forces a large PT buffer in order to capture the period we’re actually interested in, and copying that large buffer slows down everything.

Is this documented?

Who knows? (Who cares?) These tricks fulfill a common need in quick hacks, and I’ve been using (and rediscovering) them for years.

I find tightly scoped tools that don’t try to generalise have an ideal insight:effort ratio. Go write your own!

I ended up generating and passing a string suffixed with a UUIDv4 as a sentinel: llvm-addr2line just spits back any line that doesn’t look like an address. Alexey Alexandrov on the profiler developers’ slack noted that llvm-symbolizer cleanly terminates each sequence of frames with an empty line. ↩

Bounded dynamicism with cross-modifying code

2021-12-19T19:43:01-05:00

Originally posted on the Backtrace I/O tech blog.

All long-lived programs are either implemented in dynamic languages,¹ or eventually Greenspun themselves into subverting static programming languages to create a dynamic system (e.g., Unix process trees). The latter approach isn’t a bad idea, but it’s easy to introduce more flexibility than intended (e.g., data-driven JNDI lookups) when we add late binding features piecemeal, without a holistic view of how all the interacting components engender a weird program modification language.

At Backtrace, we mostly implement late (re)binding by isolating subtle logic in dedicated executables with short process lifetimes: we can replace binaries on disk atomically, and their next invocation will automatically pick up the change. In a pinch, we sometimes edit template or Lua source files and hot reload them in nginx. We prefer this to first-class programmatic support for runtime modification because Unix has a well understood permission model around files, and it’s harder to bamboozle code into overwriting files when that code doesn’t perform any disk I/O.

However, these patterns aren’t always sufficient. For example, we sometimes wish to toggle code that’s deep in performance-sensitive query processing loops, or tightly coupled with such logic. That’s when we rely on our dynamic_flag library.

This library lets us tweak flags at runtime, but flags can only take boolean values (enabled or disabled), so the dynamicism it introduces is hopefully bounded enough to avoid unexpected emergent complexity. The functionality looks like classic feature flags, but thanks to the flags’ minimal runtime overhead coupled with the ability to flip them at runtime, there are additional use cases, such as disabling mutual exclusion logic during single-threaded startup or toggling log statements. The library has also proved invaluable for crisis management, since we can leave flags (enabled by default) in well-trodden pieces of code without agonising over their impact on application performance. These flags can serve as ad hoc circuit breakers around complete features or specific pieces of code when new inputs tickle old latent bugs.

The secret behind this minimal overhead? Cross-modifying machine code!

Intel tells us we’re not supposed to do that, at least not without pausing threads… yet the core of the dynamic_flag C library has been toggling branches on thousands of machines for years, without any issue. It’s available under the Apache license for other adventurous folks.

Overhead matters

Runtime efficiency is an essential feature in dynamic_flag (enough to justify mutating machine code while it’s executing on other cores), not only because it unlocks additional use cases, but, more importantly, because it frees programmers from worrying about the performance impact of branching on a flag in the most obvious location, even if that’s in the middle of a hot inner loop.

With the aim of encouraging programmers to spontaneously protect code with flag checks, without prodding during design or code review, we designed dynamic_flag to minimise the amount of friction and mental overhead of adding a new feature flag. That’s why we care so much about all forms of overhead, not just execution time. For example, there’s no need to break one’s flow and register flags separately from their use points. Adding a feature flag should not feel like a chore.

However, we’re also aware that feature flags tend to stick around forever. We try to counteract this inertia with static registration: all the DF_* expansions in an executable appear in its dynamic_flag_list section, and the dynamic_flag_list_state function enumerates them at runtime. Periodic audits will reveal flags that have become obsolete, and flags are easy to find: each flag’s full name includes its location in the source code.

We find value in dynamic_flag because its runtime overhead is negligible for all but the most demanding code,² while the interface lets us easily make chunks of code toggleable at runtime without having to worry about things like “where am I supposed to register this new option?” The same system is efficient and ergonomic enough for all teams in all contexts, avoids contention in our source tree, and guarantees discoverability for whoever happens to be on call.

How to use `dynamic_flag`

All dynamic flags have a “kind” (namespace) string, and a name. We often group all flags related to an experimental module or feature in the same “kind,” and use the name to describe the specific functionality in the feature guarded by the flag. A dynamic flag can be disabled by default (like a feature flag), or enabled by default, and evaluating a dynamic flag’s value implicitly defines and registers it with the dynamic_flag library.

A dynamic flag introduced with the DF_FEATURE macro, as in the code snippet below, is disabled (evaluates to false) by default, and instructs the compiler to optimise for that default value.

We can instead enable code by default and optimise for cases where the flag is enabled with the DF_DEFAULT macro.

Each DF_* condition in the source is actually its own flag; a flag’s full name looks like kind:name@source_file:line_number (e.g., my_module:flag_name@:15), and each condition has its own state record. It’s thus safe, if potentially confusing, to define flags of different types (feature or default) with the same kind and name. These macros may appear in inline or static inline functions: each instantiation will get its own metadata block, and an arbitrary number of blocks can share the same full name.

Before manipulating these dynamic flags, applications must call dynamic_flag_init_lib to initialise the library’s state. Once the library is initialised, interactive or configuration-driven usage typically toggles flags by calling dynamic_flag_activate and dynamic_flag_deactivate with POSIX extended regexes that match the flags’ full names.
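To make the naming scheme concrete, here is a small Python model of full names and regex-based selection. It is only a sketch: the helper names, file names, and line numbers are hypothetical, Python’s `re` merely approximates POSIX extended regexes, and the unanchored search is an assumption rather than documented library behaviour.

```python
import re
from typing import List


def full_name(kind: str, name: str, source_file: str, line: int) -> str:
    """Format a flag's full name as kind:name@source_file:line_number."""
    return f"{kind}:{name}@{source_file}:{line}"


def matching_flags(pattern: str, flags: List[str]) -> List[str]:
    """Return the full names matched by `pattern`.

    Assumption: an unanchored search, approximated with Python's `re`,
    which is not POSIX ERE; check the library for the exact semantics.
    """
    regex = re.compile(pattern)
    return [flag for flag in flags if regex.search(flag)]


# Hypothetical flags; the file names and line numbers are made up.
FLAGS = [
    full_name("my_module", "flag_name", "src/my_module.c", 15),
    full_name("my_module", "flag_name", "src/my_module.c", 42),
    full_name("request_tracing", "log_headers", "src/http.c", 7),
]
```

In this model, a pattern like `^my_module:` selects every condition in the kind, while `:15$` singles out one source location, even though two conditions share the same kind and name.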

Using `dynamic_flag` programmatically

The DF_FEATURE and DF_DEFAULT macros directly map to classic feature flags, but the dynamic_flag library still has more to offer. Applications can programmatically enable and disable blocks of code to implement a restricted form of aspect oriented programming: “advice” cannot be inserted post hoc, and must instead be defined inline in the source, but may be toggled at runtime by unrelated code.

For example, an application could let individual HTTP requests opt into detailed tracing with a query string parameter ?tracing=1, and set request->tracing_mode = true in its internal request object when it accepts such a request. Environments where fewer than one request in a million enables tracing could easily spend more aggregate time evaluating if (request->tracing_mode == true) than they do in the tracing logic itself. One could try to reduce the overhead by coalescing the trace code in fewer conditional blocks, but that puts more distance between the tracing code and the traced logic it’s supposed to record, which tends to cause the two to desynchronise and adds to development friction.

It’s tempting to instead optimise frequent checks for the common case (no tracing) with a dynamic flag that is enabled whenever at least one in-flight request has opted into tracing. That’s why the DF_OPT (for opt-in logic) macro exists.

The DF_OPT macro instructs the compiler to assume the flag is disabled, but leaves the flag enabled (i.e., the conditional always evaluates request->tracing_mode) until the library is initialised with dynamic_flag_init_lib.³ After initialisation, the flag acts like a DF_FEATURE (i.e., the overhead is a test eax instruction that falls through without any conditional branching) until it is explicitly enabled again.

With this flag-before-check pattern, it’s always safe to enable request_tracing flags: in the worst case, we’ll just look at the request object, see that request->tracing_mode == false, and skip the tracing logic. Of course, that’s not ideal for performance. When we definitely know that no request has asked for tracing, we want to disable request_tracing flags and not even look at the request object’s tracing_mode field.

Whenever the application receives a request that opts into tracing, it can enable all flags with kind request_tracing by executing dynamic_flag_activate_kind(request_tracing, NULL). When that same request leaves the system (e.g., when the application has fully sent a response back), the application undoes the activation with dynamic_flag_deactivate_kind(request_tracing, NULL).

Activation and deactivation calls actually increment and decrement counters associated with each instance of a DF_... macro, so this scheme works correctly when multiple requests with overlapping lifetimes opt into tracing: tracing blocks will check whether request->tracing_mode == true whenever at least one in-flight request has tracing_mode == true, and skip these conditionals as soon as no such request exists.
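The flag-before-check pattern with counted activations can be modelled in Python as follows. This is a toy model of one DF_OPT site after initialisation, not the library’s implementation: `Request`, `OptFlag`, and `field_checks` are names invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    tracing_mode: bool = False


class OptFlag:
    """Toy model of one DF_OPT site: enabled while its count is positive."""

    def __init__(self) -> None:
        self.activations = 0
        self.field_checks = 0  # how often we looked at request.tracing_mode

    def activate(self) -> None:
        # Called when a request that opted into tracing enters the system.
        self.activations += 1

    def deactivate(self) -> None:
        # Called when that request leaves; the count never goes negative.
        self.activations = max(0, self.activations - 1)

    def should_trace(self, request: Request) -> bool:
        if self.activations == 0:
            # Common case: nobody traces, skip the request object entirely.
            return False
        # Flag is on: it is now safe (if slower) to check the request itself.
        self.field_checks += 1
        return request.tracing_mode
```

With two overlapping traced requests, the count goes 0, 1, 2, 1, 0, and `request.tracing_mode` is only ever read while the count is positive.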

Practical considerations for programmatic manipulation

Confirming that a flag is set to its expected value (disabled for DF_FEATURE and DF_OPT, enabled for DF_DEFAULT) is fast… because we shifted all the complexity to the flag flipping code. Changing the value for a set of flags is extremely slow (milliseconds of runtime and several IPIs for multiple mprotect(2) calls), so it only makes sense to use dynamic flags when they are rarely activated or deactivated (e.g., less often than once a minute or even less often than once an hour).

We have found programmatic flag manipulation to be useful not just for opt-in request tracing or to enable log statements, but also to minimise the impact of complex logic on program phases that do not require them. For example, mutual exclusion and safe memory reclamation deferral (PDF) may be redundant while a program is in a single-threaded startup mode; we can guard such code behind DF_OPT(steady_state, ...) to accelerate startup, and enable steady_state flags just before spawning worker threads.

It can also make sense to guard slow paths with DF_OPT when a program only enters phases that need this slow path logic every few minutes. That was the case for a software transactional memory system with batched updates. Most of the time, no update is in flight, so readers never have to check for concurrent writes. These checks can be guarded with DF_OPT(stm, ...) conditions, as long as the program enables stm flags around batches of updates. Enabling and disabling all these flags can take a while (milliseconds), but, as long as updates are infrequent enough, the improved common case (getting rid of a memory load and a conditional jump for a read barrier) means the tradeoff is favourable.

Even when flags are controlled programmatically, it can be useful to work around bugs by manually forcing some flags to remain enabled or disabled. In the tracing example above, we could find a crash in one of the tracing blocks, and wish to prevent request->tracing_mode from exercising that block of code.

It’s easy to force a flag into an active state: flag activations are counted, so it suffices to activate it manually, once. However, we want it to be safe to issue ad hoc dynamic_flag_deactivate calls without wedging the system in a weird state, so activation counts don’t go negative. Unfortunately, this means we can’t use deactivations to prevent, e.g., a crashy request tracing block from being activated.

Flags can instead be “unhooked” dynamically. While unhooked, increments to a flag’s activation count are silently disregarded. The dynamic_flag_unhook function unhooks DF_* conditions when their full name matches the extended POSIX regular expression it received as an argument. When a flag has been “unhook”ed more often than it has been “rehook”ed, attempts to activate it will silently no-op. Once a flag has been unhooked, we can issue dynamic_flag_deactivate calls until its activation count reaches 0. At that point, the flag is disabled, and will remain disabled until rehooked.
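The interaction between activation counts and unhooking is easier to follow with a small state model. The Python sketch below only mirrors the rules stated above; the class and attribute names are hypothetical, and the behaviour of surplus rehook calls (ignored here) is an assumption.

```python
class HookableFlag:
    """Toy model of a flag's activation count and unhook depth."""

    def __init__(self) -> None:
        self.activations = 0
        self.unhook_depth = 0  # number of unhooks minus number of rehooks

    @property
    def enabled(self) -> bool:
        return self.activations > 0

    def activate(self) -> None:
        if self.unhook_depth > 0:
            return  # silently disregarded while unhooked
        self.activations += 1

    def deactivate(self) -> None:
        # Saturate at zero so stray deactivations can't wedge the flag.
        self.activations = max(0, self.activations - 1)

    def unhook(self) -> None:
        self.unhook_depth += 1

    def rehook(self) -> None:
        # Assumption: extra rehooks are ignored rather than going negative.
        self.unhook_depth = max(0, self.unhook_depth - 1)
```

To pin a crashy block off in this model, one would unhook it, deactivate until the count reaches 0, and only rehook once the bug is fixed.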

The core implementation trick

The introduction of asm goto in GCC 4.5 made it possible to implement control operators in inline assembly. When the condition actually varies at runtime, it usually makes more sense to set an output variable with a condition code, but dynamic_flag conditions are actually static in machine code: each DF_* macro expands to one 5-byte instruction, a test eax, imm32 instruction that falls through to the common case when that’s the flag’s value (i.e., enabled for DF_DEFAULT, disabled for DF_FEATURE and DF_OPT), and a 32-bit relative jmp rel32 to the unexpected path (disabled for DF_DEFAULT, enabled for DF_FEATURE and DF_OPT) otherwise. Activating and deactivating dynamic flags toggles the corresponding target instructions between test imm32 (0xA9) and jmp rel32 (0xE9).

The DF_... macros expand into a lot more inline assembly than just that one instruction; the rest of the expansion is a lot of noise to register everything with structs and pointers in dedicated sections. Automatic static registration is mostly orthogonal to the performance goals, but is key to the (lazy-)programmer-friendly interface.

We use test eax, imm32 instead of a nop because it’s exactly five bytes, just like jmp rel32, and because its 4-byte immediate is in the same place as the 4-byte offset of jmp rel32. We can thus encode the jump offset at assembly-time, and flip between falling through to the common path (test) and jumping to the unexpected path (jmp) by overwriting the opcode byte (0xA9 for test, 0xE9 for jmp).
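The encoding trick can be checked on plain bytes. The following Python sketch builds the 5-byte site and flips its opcode; it only manipulates a `bytearray` (no actual code patching), and the function names are invented for the example.

```python
import struct

TEST_EAX_IMM32 = 0xA9  # opcode of test eax, imm32
JMP_REL32 = 0xE9       # opcode of jmp rel32


def encode_site(rel32: int, jump: bool) -> bytearray:
    """Encode a 5-byte flag site: one opcode byte, then a 32-bit payload."""
    opcode = JMP_REL32 if jump else TEST_EAX_IMM32
    return bytearray([opcode]) + struct.pack("<i", rel32)


def set_jump(site: bytearray, jump: bool) -> None:
    """Flip between falling through and jumping by rewriting only byte 0."""
    site[0] = JMP_REL32 if jump else TEST_EAX_IMM32


def jump_target(site_address: int, site: bytes) -> int:
    """Where jmp rel32 lands: relative to the end of the 5-byte instruction."""
    (rel32,) = struct.unpack_from("<i", site, 1)
    return site_address + 5 + rel32
```

The four payload bytes are identical in both states: as a `test` immediate they are harmless, and as a `jmp` offset they already point at the unexpected path.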

Updating a single byte for each dynamic flag avoids questions around the correct order for writes. This single-byte cross-modification (we overwrite instruction bytes while other threads may be executing the mutated machine code) also doesn’t affect the size of the instruction (both test eax and jmp rel span 5 bytes), which should hopefully suffice to avoid sharp edges around instruction decoding in hardware, despite our disregard for Intel’s recommendations regarding cross-modifying code in Section 8.1.3 of the SDM.⁴

The library does try to protect against code execution exploits by relaxing and reinstating page protection (with mprotect(2)) around all cross modification writes. Since mprotect-ing from Read-Write-eXecute permissions to Read-eXecute acts as a membarrier (issues IPIs) on Linux/x86-64, we can also know that the updated code is globally visible by the time a call to dynamic_flag_activate, etc., returns.

It’s not practical to bounce page protection for each DF_ expansion, especially with inlining (some users have hundreds of inlined calls to flagged functions, e.g., to temporarily paper over use-after-frees by nopping out a few calls to free(3)). Most of the complexity in dynamic_flag.c is simply in gathering metadata records for all DF_ sites that should be activated or deactivated, and in amortising mprotect calls for stretches of DF_ sites on contiguous pages.
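One way to picture the amortisation is to coalesce the addresses of the opcode bytes to patch into runs of contiguous pages, and change protection once per run. The Python sketch below illustrates that idea only; it is not the algorithm in dynamic_flag.c, and the 4 KB page size and the `page_runs` helper are assumptions.

```python
from typing import Iterable, List, Tuple

PAGE_SIZE = 4096  # assumption: 4 KB pages


def page_runs(opcode_addresses: Iterable[int]) -> List[Tuple[int, int]]:
    """Coalesce patch addresses into (start, length) runs of contiguous pages."""
    # Only the opcode byte is rewritten, so each site touches a single page.
    pages = sorted({address // PAGE_SIZE for address in opcode_addresses})
    runs: List[Tuple[int, int]] = []
    for page in pages:
        if runs and runs[-1][0] + runs[-1][1] == page * PAGE_SIZE:
            # This page directly follows the previous run: extend it.
            start, length = runs[-1]
            runs[-1] = (start, length + PAGE_SIZE)
        else:
            runs.append((page * PAGE_SIZE, PAGE_SIZE))
    return runs
```

Each run would then cost one pair of protection changes, however many sites it contains.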

Sometimes, code is just done

The dynamic_flag library is an updated interface for the core implementation of the 6-year-old an_hook, and reflects years of experience with that functionality. We’re happy to share it, but aren’t looking for feature requests or contributions.

There might be some small clean-ups as we add support for ARM or RISC-V, or let the library interoperate with a Rust implementation. However, we don’t expect changes to the interface, i.e., the DF_ macros and the activation/deactivation functions, nor to its core structure, especially given the contemporary tastes for hardening (for example, the cross-modification approach is completely incompatible with OpenBSD’s and OS X’s strict W^X policies). The library works for our target platforms, and we don’t wish to take on extra complexity that is of no benefit to us.

Of course, it’s Apache licensed, so anyone can fork the library and twist it beyond recognition. However, if you’re interested in powerful patching capabilities, dynamic languages (e.g., Erlang, Common Lisp, or even Python and Ruby), or tools like Live++ and Recode may be more appropriate.⁵ We want dynamic_flag to remain simple and just barely flexible enough for our usage patterns.

Thank you, Jacob, Josh, and Per, for feedback on earlier versions.

It’s no accident that canonical dynamic languages like Smalltalk, Forth, and Lisp are all image-based: how would an image-based system even work if it were impossible to redefine functions or types? ↩
Like guaranteed optimisations in Lisps, the predictable performance impact isn’t important because all code is performance sensitive, but because performance is a cross-cutting concern, and a predictably negligible overhead makes it easier to implement new abstractions, especially with the few tools available in C. In practice, the impact of considering a code path reachable in case a flag is flipped from its expected value usually dwarfs that of the single test instruction generated for the dynamic flag itself. ↩
Or if the dynamic_flag library isn’t aware of that DF_OPT, maybe because the function surrounding that DF_OPT conditional was loaded dynamically. ↩
After a few CPU-millennia of production experience, the cross-modification logic hasn’t been associated with any “impossible” bug, or with any noticeable increase in the rate of hardware hangs or failures. ↩
The industry could learn a lot from game development practices, especially for stateful non-interactive backend servers and slow batch computations. ↩

Slitter: a slab allocator that trusts, but verifies

2021-08-01T17:26:04-04:00

Originally posted on the Backtrace I/O tech blog.

Slitter is Backtrace’s deliberately middle-of-the-road thread-caching slab allocator, with explicit allocation class tags (rather than derived from the object’s size class). It’s mostly written in Rust, and we use it in our C backend server.

Slitter’s design is about as standard as it gets: we hope to dedicate the project’s complexity budget to always-on “observability” and safety features. We don’t wish to detect all or even most memory management errors, but we should statistically catch a small fraction (enough to help pinpoint production issues) of such bugs, and always constrain their scope to the mismanaged allocation class.¹

We decided to code up Slitter last April, when we noticed that we would immediately benefit from backing allocation with temporary file mappings:² the bulk of our data is mapped from persistent data files, but we also regenerate some cold metadata during startup, and accesses to that metadata have amazing locality, both temporal and spatial (assuming bump allocation). We don’t want the OS to swap out all the heap–that way lie grey failures–so we opt specific allocation classes into it.

By itself, this isn’t a reason to write a slab allocator: we could easily have configured specialised arenas in jemalloc, for example. However, we also had eyes on longer term improvements to observability and debugging or mitigation of memory management errors in production, and those could only be unlocked by migrating to an interface with explicit tags for each allocation class (type).

Classic mallocs like jemalloc and tcmalloc are fundamentally unable to match that level of integration: we can’t tell malloc(3) what we’re trying to allocate (e.g., a struct request in the HTTP module), only its size. It’s still possible to wrap malloc in a richer interface, and, e.g., track heap consumption by tag. Unfortunately, the result is slower than a native solution, and, without help from the underlying allocator, it’s easy to incorrectly match tags between malloc and free calls. In my experience, this frequently leads to useless allocation statistics, usually around the very faulty code paths one is attempting to debug.

Even once we have built detailed statistics on top of a regular malloc, it’s hard to convince the underlying allocator to only recycle allocations within an object class: not only do mallocs eagerly recycle allocations of similar sizes regardless of their type, but they will also release unused runs of address space, or repurpose them for totally different size classes. That’s what mallocs are supposed to do… it just happens to also make debugging a lot harder when things inevitably go wrong.³

Slab allocators work with semantically richer allocation tags: an allocation tag describes its objects’ size, but can also specify how to initialise, recycle, or deinitialise them. The problem is that slab allocators tend to focus exclusively on speed.

Forks of libumem may be the exception, thanks to the Solaris culture of pervasive hooking. However, umem’s design reflects the sensibilities of the 00s, when it was written: threads share a few caches, and the allocator tries to reuse address space. In contrast, Slitter assumes memory is plentiful enough for thread-local caches and type-stable allocations.⁴

Our experience so far

We have been running Slitter in production for over two months, and rely on it to:

detect when an allocation is freed with the wrong allocation class tag (i.e., detect type confusion on free).
avoid any in-band metadata: there are guard pages between allocations and allocator metadata, and no intrusive freelist for use-after-frees to stomp over.
guarantee type stable allocations: once an address has been used to fulfill a request for a certain allocation class, it will only be used for that class. Slitter doesn’t overlay intrusive lists on top of freed allocations, so the data always reflects what the application last stored there. This means that double-frees and use-after-frees only affect the faulty allocation class. An application could even rely on read-after-free being benign to simplify non-blocking algorithms.⁵
let each allocation class specify how its backing memory should be mapped in (e.g., plain 4 KB pages or file-backed swappable pages).

Thanks to extensive contracts and a mix of hardcoded and random tests, we encountered only two issues during the initial rollout, both in the small amount of lock-free C code that is hard to test.⁶

Type stability exerts a heavy influence all over Slitter’s design, and has obvious downsides. For example, a short-lived application that progresses through a pipeline of stages, where each stage allocates different types, would definitely waste memory if it were to replace a regular malloc with a type-stable allocator like Slitter. We believe the isolation benefits are more than worth the trouble, at least for long-lived servers that quickly enter a steady state.

In the future, we hope to also:

detect when an interior pointer is freed.
detect simple⁷ buffer overflows that cross allocation classes, by inserting guard pages.
always detect frees of addresses Slitter does not manage.
detect most back-to-back double-frees.
detect a random fraction of buffer overflows, with a sampling eFence.

In addition to these safety features, we plan to rely on the allocator to improve observability into the calling program, and wish to:

track the number of objects allocated and recycled in each allocation class.
sample the call stack when the heap grows.
track allocation and release call stacks for a small fraction of objects.

Here’s how it currently works, and why we wrote it in Rust, with a dash of C.

The high level design of Slitter

At a high level, Slitter

reserves shared 1 GB Chunks of memory via the Mapper trait
carves out smaller type-specific Spans from each chunk with Mill objects
bump-allocates objects from Spans with Press objects, into allocation Magazines
pushes and pops objects into/from thread-local magazines
caches populated magazines in global type-specific lock-free stacks
manages empty magazines with a global mostly lock-free Rack

Many general purpose memory allocators implement strategies similarly inspired by Bonwick’s slab allocator, and time-tested mallocs may well provide better performance and lower fragmentation than Slitter.⁸ The primary motivation for designing Slitter is that having explicit allocation classes in the API makes it easier for the allocator to improve the debuggability and resilience of the calling program.⁹ For example, most allocators can tell you the size of your program’s heap, but that data is much more useful when broken down by struct type or program module.

Most allocators try to minimise accesses to the metadata associated with allocations. In fact, that’s often seen as a strength of the slab interface: the allocator can just rely on the caller to pass the correct allocation class tag, instead of hitting metadata to figure out where the freed address should go.

We went in the opposite direction with Slitter. We still rely on the allocation class tag for speed, but also actively look for mismatches before returning from deallocation calls. Nothing depends on values computed by the mismatch detection logic, and the resulting branch is trivially predictable (the tag always matches), so we can hope that wide out-of-order CPUs will hide most of the checking code, if it’s simple enough.

This concern (access to metadata in few instructions) combined with our goal of avoiding in-band metadata led to a simple layout for each chunk’s data and metadata.

.-------.------.-------|---------------.-------.
| guard | meta | guard | data ... data | guard |
'-------'------'-------|---------------'-------'
  2 MB    2 MB   2 MB  |      1 GB        2 MB
                       v
               Aligned to 1 GB

A chunk’s data is always a 1 GB address range, aligned to 1 GB: the underlying mapper doesn’t have to immediately back that with memory, but it certainly can, e.g., in order to use gigantic pages. The chunk is preceded and followed by 2 MB guard pages. The metadata for the chunk’s data lives in a 2 MB range, just before the preceding guard page (i.e., 4 MB to 2 MB before the beginning of the aligned 1 GB range). Finally, the 2 MB metadata range is itself preceded by a 2 MB guard page.

Each chunk is statically divided in 65536 spans of 16 KB each. We can thus map a span to its slot in the metadata block with shifts, masks, and some address arithmetic. Mills don’t have to hand out individual 16 KB spans at a time, they simply have to work in multiples of 16 KB, and never split a span in two.
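The address arithmetic implied by this layout can be written out in Python. This is a sketch derived from the sizes given above, not Slitter’s actual code: the function names are hypothetical, and the 32-byte slot size is only the upper bound that falls out of dividing the 2 MB metadata range by 65536 spans.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

CHUNK_SIZE = 1 * GB
SPAN_SIZE = 16 * KB
SPANS_PER_CHUNK = CHUNK_SIZE // SPAN_SIZE  # 65536
META_OFFSET = 4 * MB  # metadata starts 4 MB before the chunk's data
META_SIZE = 2 * MB
# Upper bound implied by the layout; the real record size is an assumption.
META_SLOT_SIZE = META_SIZE // SPANS_PER_CHUNK  # 32 bytes


def chunk_base(address: int) -> int:
    """Round down to the 1 GB-aligned start of the chunk's data."""
    return address & ~(CHUNK_SIZE - 1)


def span_index(address: int) -> int:
    """Index of the 16 KB span that contains `address` in its chunk."""
    return (address & (CHUNK_SIZE - 1)) >> 14  # 16 KB == 1 << 14


def span_metadata_address(address: int) -> int:
    """Address of the metadata slot for the span that contains `address`."""
    slot_offset = span_index(address) * META_SLOT_SIZE
    return chunk_base(address) - META_OFFSET + slot_offset
```

Under these assumptions, the slot for the last span ends exactly where the guard page between metadata and data begins, so the metadata range is fully used and never overlaps the guard.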

Why we wrote Slitter in Rust and C

We call Slitter from C, but wrote it in Rust, despite the more painful build¹⁰ process: that pain isn’t going anywhere, since we expect our backend to be in a mix of C, C++, and Rust for a long time. We also sprinkled in some C when the alternative would have been to pull in a crate just to make a couple syscalls, or to enable unstable Rust features: we’re not “rewrite-it-in-Rust” absolutists, and merely wish to use Rust for its strengths (control over data layout, support for domain-specific invariants, large ecosystem for less performance-sensitive logic, ability to lie to the compiler where necessary, …), while avoiding its weaknesses (interacting with Linux interfaces defined by C headers, or fine-tuning code generation).

The majority of allocations only interact with the thread-local magazines. That’s why we wrote that code in C: stable Rust doesn’t (yet) let us access likely/unlikely annotations, nor fast “initial-exec” thread-local storage. Of course, allocation and deallocation are the main entry points into a memory allocation library, so this creates a bit of friction with Rust’s linking process.¹¹

We also had to implement our lock-free multi-popper Treiber stack in C: x86-64 doesn’t have anything like LL/SC, so we instead pair the top-of-stack pointer with a generation counter… and Rust hasn’t stabilised 128-bit atomics yet.

We chose to use atomics in C instead of a simple lock in Rust because the lock-free stack (and the atomic bump pointer, which Rust handles fine) are important for our use case: when we rehydrate cold metadata at startup, we do so from multiple I/O-bound threads, and we have observed hiccups due to lock contention in malloc. At some point, lock acquisitions are rare enough that contention isn’t an issue; that’s why we’re comfortable with locks when refilling bump allocation regions.

Come waste performance on safety!

A recurring theme in the design of Slitter is that we find ways to make the core (de)allocation logic slightly faster, and immediately spend that efficiency on safety, debuggability or, eventually, observability. For a lot of code, performance is a constraint to satisfy, not a goal to maximise; once we’re close to good enough, it makes sense to trade performance away.¹² I also believe that there is lower-hanging fruit in memory placement than in shaving a few nanoseconds from the allocation path.

Slitter also focuses on instrumentation and debugging features that are always active, even in production, instead of leaving that to development tools, or to logic that must be explicitly enabled. In a SaaS world, development and debugging are never done. Opt-in tools are definitely useful, but always-on features are much more likely to help developers catch the rarely occurring bugs on which they tend to spend an inordinate amount of investigation effort (and if a debugging feature can be safely enabled in production at a large scale, why not leave it enabled forever?).

If that sounds like an interesting philosophy for a slab allocator, come hack on Slitter! Admittedly, the value of Slitter isn’t as clear for pure Rust hackers as it is for those of us who blend C and Rust, but per-class allocation statistics and placement decisions should be useful, even in safe Rust, especially for larger programs with long runtimes.

Our MIT-licensed code is on GitHub. There are plenty of small improvements to work on and, while we still have to re-review the documentation, the code has decent test coverage, and we try to keep it straightforward.

This post was much improved by feedback from my beta readers, Barkley, David, Eloise, Mark, Per, Phil, Ruchir, and Samy.

In my experience, their unlimited blast radius is what makes memory management bugs so frustrating to track down. The design goals of generic memory allocators (e.g., recycling memory quickly) and some implementation strategies (e.g., in-band metadata) make it easy for bugs in one module to show up as broken invariants in a completely unrelated one that happened to share allocation addresses with the former. Adversarial thinkers will even exploit the absence of isolation to amplify small programming errors into arbitrary code execution. Of course, one should simply not write bugs, but when they do happen, it’s nice to know that the broken code most likely hit itself and its neighbours in the callgraph, and not unrelated code that also uses the same memory allocator (something Windows got right with private heaps). ↩
Linux does not have anything like the BSDs’ MAP_NOSYNC mmap flag. This has historically created problems for heavy mmap users like LMDB. Empirically, Linux’s flushing behaviour is much more reasonable these days, especially when dirty pages are a small fraction of physical RAM, as they are for us: in a well configured installation of our backend server, most of the RAM goes to clean file mappings, so only the dirty_expire_centisecs timer triggers write-outs, and we haven’t been growing the file-backed heap fast enough for the time-based flusher to thrash too much. ↩
There are obvious parallels with undefined behaviour in C and C++… ↩
umem also takes a performance hit in order to let object classes define callbacks for object initialisation, recycling, and destruction. It makes sense to let the allocator do some pre-allocation work: if you’re going to incur a cache miss for the first write to an allocation, it’s preferable to do so before you immediately want that newly allocated object (yes, profiles will show more cycles in the allocators, but you’re just shifting work around, hopefully farther from the critical path). Slitter only supports the bare minimum: objects are either always zero-initialised, or initially zero-filled and later left untouched. That covers the most common cases, without incurring too many branch mispredictions. ↩
One could be tempted to really rely on it not just for isolation and resilience, but during normal operations. That sounds like a bad idea (we certainly haven’t taken that leap), at least until Slitter works with Valgrind/ASan/LSan: it’s easier to debug easily reproducible issues when one can just slot in calls to regular malloc/calloc/free with a dedicated heap debugger. ↩
It would be easy to blame the complexity of lock-free code, but the initial version, with C11 atomics, was correct. Unfortunately, gcc backs C11 atomic uint128_ts with locks, so we had to switch to the legacy interface, and that’s when the errors crept in. ↩
There isn’t much the allocator can do if an application writes to a wild address megabytes away from the base object. Thankfully, buffer overflows tend to proceed linearly from the actual end of the undersized object. ↩
In fact, Slitter actively worsens external fragmentation to guarantee type-stable allocations. We think it’s reasonable to sacrifice heap footprint in order to control the blast radius of use-after-frees and double-frees. ↩
That’s why we’re interested in allocation class tags, but they can also help application and malloc performance. Some malloc developers are looking into tags for placement (should the allocation be backed by memory local to the NUMA node, with huge pages, …?) or lifetime (is the allocation immortal, short-lived, or tied to a request?) hints. ↩
We re-export our dependencies from an uber-crate, and let our outer meson build invoke cargo to generate a static library for that facade uber-crate. ↩
Rust automatically hides foreign symbols when linking cdylibs. We worked around that with static linking, but statically linked Rust libraries are mutually incompatible, hence the uber-crate. ↩
And not just for safety or productivity features! I find it often makes sense to give up on small performance wins (e.g., aggressive autovectorisation or link-time optimisation) when they would make future performance investigations harder. The latter are higher risk, and offer only potential benefits, but their upside (order of magnitude improvements) dwarfs guaranteed small wins that freeze the code in time. ↩