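A C sketch of the one-bit hash function under discussion, reconstructed from the description later in this post (the exact original may differ; `bit_hash`, `table`, and `bit` are the names the prose uses):

```c
#include <assert.h>
#include <stdint.h>

/* One-bit hash: `table` selects which bits of x may flip the hash,
 * and `bit` is the hash value for x = 0.  The parity of the masked
 * bits counts how many times the hash gets flipped. */
static uint64_t bit_hash(uint64_t x, uint64_t table, uint64_t bit)
{
        return ((uint64_t)__builtin_popcountll(x & table) & 1) ^ bit;
}
```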
With hardware popcount, this compiles down to a handful of instructions: an and to mask, a single popcnt, and an xor to fold in the initial bit.
This should raise a few questions. Someone with a passing familiarity with x86 would also ask why we use popcnt instead of checking the parity flag after the xor. Unfortunately, the parity flag only considers the least significant byte of the result (:
When implementing something like the hashing trick or count sketches (PDF), you need two sets of provably strong hash functions: one to pick the destination bucket, and another to decide whether to increment or decrement by the sketched value.
One-bit hash functions are ideal for the latter use case.
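For instance, a count-sketch row update needs exactly that bucket/sign pair of hashes. A self-contained C sketch (the names `sketch_update`, `sketch_query`, and the splitmix64-style `mix64` stand-in are mine, not from any real library):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in 64-bit mixer (splitmix64's finaliser); any strong hash works. */
static uint64_t mix64(uint64_t x)
{
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
}

/* Count-sketch row update: one hash picks the destination counter,
 * a second (one-bit) hash decides whether to add or subtract. */
static void sketch_update(int64_t *row, size_t width, uint64_t key,
                          int64_t weight)
{
        uint64_t h = mix64(key);
        int64_t sign = (mix64(h) & 1) ? 1 : -1;

        row[h % width] += sign * weight;
}

/* Point query: apply the same sign to undo the randomisation. */
static int64_t sketch_query(const int64_t *row, size_t width, uint64_t key)
{
        uint64_t h = mix64(key);
        int64_t sign = (mix64(h) & 1) ? 1 : -1;

        return sign * row[h % width];
}
```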
The bitwise operations in bit_hash implement a degenerate form of tabulation hashing. It considers the 64-bit input value x as a vector of 64 bits, and associates two intermediate output values with each index. The naïve implementation would be something like the following.
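A C sketch of that naïve loop, assuming the table layout described above (`random_table` would be filled with random bits at startup; here it is only a placeholder):

```c
#include <assert.h>
#include <stdint.h>

/* Naive one-bit tabulation hash: each of the 64 input bits selects
 * one of two random output bits, and all selected bits are xored
 * together.  random_table must be initialised with random 0/1 values. */
static uint8_t random_table[64][2];

static uint8_t naive_bit_hash(uint64_t x)
{
        uint8_t acc = 0;

        for (int i = 0; i < 64; i++)
                acc ^= random_table[i][(x >> i) & 1];
        return acc & 1;
}
```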
Of course, the representation of random_table is inefficient, and we should hand-roll a bitmap. However, the loop itself is a problem.
The trick is to notice that we can normalise the table so that the value for random_table[i][0] is always 0: in order to do so, we have to fix the initial value for acc to a random bit. That initial value is the hash value for 0, and the values in random_table[i][1] now encode whether a non-zero bit i in x flips the hash value or leaves it as is.
The table argument for bit_hash is simply the 64 bits in random_table[i][1], and bit is the hash value for 0. If bit i in table is 0, bit i is irrelevant to the hash. If bit i in table is 1, the hash flips when bit i in x is 1. Finally, the parity counts how many times the hash was flipped.
I don’t think so. Whenever we need a hash bit, we also want a hash bucket; we might as well steal one bit from the latter, wider, hash. Worse, we usually want a few such bucket/bit pairs, so we could also compute a wider hash and carve out individual bits.
I only thought about this trick because I’ve been reading a few empirical evaluations of sketching techniques, and a few authors find it normal that computing a hash bit doubles the CPU time spent on hashing. It seems to me the right way to do this is to map columns/features to not-too-small integers (e.g., universal hashing to [0, n^2) if we have n features), and apply strong hashing to these integers. Hashing machine integers is fast, and we can always split strong hashes into multiple values.
In the end, this family of one-bit hash functions seems like a good solution to a problem no one should ever have. But it’s still a cute trick!
In July 2012, I started really looking into searching in static sorted sets, and found the literature disturbingly useless. I reviewed a lot of code, and it turned out that most binary searches out there are not only unsafe against overflow, but also happen to be badly micro-optimised for small arrays with simple comparators. That led to Binary Search eliminates Branch Mispredictions, a reaction to popular assertions that binary search has bad constant factors (compared to linear search or a breadth-first layout) on modern microarchitectures, mostly due to branch mispredictions. That post has code for really riced-up searches on fixed array sizes, so here’s the size-generic inner loop I currently use.
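A sketch of such a size-generic, branchless lower-bound loop (my reconstruction in the spirit the post describes, not necessarily the exact original):

```c
#include <assert.h>
#include <stddef.h>

/* Branchless binary search: returns the index of the first element
 * >= key, or n if every element is smaller.  n must be >= 1. */
static size_t lower_bound(const int *x, size_t n, int key)
{
        const int *base = x;

        while (n > 1) {
                const size_t half = n / 2;

                /* The ternary compiles to a conditional move: no
                 * hard-to-predict branch in the loop body. */
                base = (base[half] < key) ? base + half : base;
                n -= half;
        }
        return (size_t)(base - x) + (*base < key);
}
```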
The snippet above implements a binary search, instead of dividing by three to avoid aliasing issues. That issue only shows up with array sizes that are (near) powers of two. I know of two situations where that happens a lot:
The fix for the first case is to do proper benchmarking on a wide range of input sizes. Ternary or offset binary search are only really useful in the second case. There’s actually a third case: when I’m about to repeatedly search in the same array, I dispatch to unrolled ternary searches, with one routine for each power of two. I can reduce any size to a power of two with one initial iteration on an off-center “midpoint.” Ternary search has a high overhead for small arrays, unless we can precompute offsets by unrolling the whole thing.
My work on binary search taught me how to implement binary search not stupidly–unlike real implementations–and that most experiments on searching in array permutations seem broken in their very design (they focus on full binary trees).
I don’t think I ever made that explicit, but the reason I even started looking into binary search is that I wanted to have a fast implementation of searching in a van Emde Boas layout! However, none of the benchmarks (or analyses) I found were convincing, and I kind of lost steam as I improved sorted arrays: sortedness tends to be useful for operations other than predecessor/successor search.
Some time in May this year, I found Pat Morin’s fresh effort on the exact question I had abandoned over the years: how do popular permutations work in terms of raw CPU time? The code was open, and even good by research standards! Pat had written the annoying part (building the permutations), generated a bunch of tests I could use to check correctness, and avoided obvious microbenchmarking pitfalls. He also found a really nice way to find the return value for BFS searches from the location where the search ends, with fast bit operations (j = (i + 1) >> __builtin_ffs(~(i + 1)), which he explains in the paper).
I took that opportunity to improve constant factors for all the implementations, and to really try and explain in detail the performance of each layout with respect to each other, as well as how they respond to the size of the array. That sparked a very interesting back and forth with Pat from May until September (!). Pat eventually took the time to turn our informal exchange into a coherent paper. More than 3 years after I started spending time on the question of array layouts for implicit search trees, I found the research I was looking for… all it took was a bit of collaboration (:
Bonus: the results were unexpected! Neither of the usual suspects (B-tree or van Emde Boas) came out on top, even for very large arrays. I was also surprised to see the breadth-first layout perform much better than straight binary search: none of the usual explanations made sense to me. It turns out that the improved performance (when people weren’t testing on round, power-of-two, array sizes) was probably an unintended consequence of bad code! Breadth-first search is fast, faster than layouts with better cache efficiency, because it prefetches well enough to hide latency even when it extracts only one bit of information from each cache line; its performance has nothing to do with cacheability. Our code prefetches explicitly, but slower branchy implementations in the wild get implicit prefetching, thanks to speculative execution.
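A sketch of such a breadth-first (Eytzinger) search with explicit prefetch, using the bit trick quoted earlier; this is my reconstruction, not Pat’s exact code:

```c
#include <assert.h>
#include <stddef.h>

/* Breadth-first (Eytzinger) layout search with explicit prefetch.
 * a[] stores a sorted set in level order: the children of a[i] are
 * a[2i + 1] and a[2i + 2].  Returns the index of the first element
 * >= key, or n if there is none.  The final shift strips the trailing
 * ones of (i + 1), undoing the trailing "go right" steps to recover
 * the last node where the search went left. */
static size_t eytzinger_lower_bound(const int *a, size_t n, int key)
{
        size_t i = 0;

        while (i < n) {
                /* Prefetch great-grandchildren: they share a cache line. */
                __builtin_prefetch(&a[8 * i + 8]);
                i = (a[i] < key) ? 2 * i + 2 : 2 * i + 1;
        }

        size_t j = (i + 1) >> __builtin_ffsll((long long)~(i + 1));
        return (j == 0) ? n : j - 1;
}
```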
Conclusion: if you need to do a lot of comparison-based searches in larger-than-L2 arrays, use a breadth-first order and prefetch. If you need sorted arrays, consider sticking some prefetches in a decent binary search loop. If only I’d known that in 2012!
A couple months ago, I found LZ77-like Compression with Fast Random Access by Kreft and Navarro. They describe a Lempel-Ziv approach that is similar to LZ77, but better suited to decompressing arbitrary substrings. The hard part about applying LZ77 compression to (byte)code is that parses may reuse any substring that happens to appear earlier in the original text. That’s why I had to use Jez’s algorithm to convert the LZ77 parse into a (one-word) grammar.
LZEnd fixes that.
Kreft and Navarro improve random access decompression by restricting the format of “backreferences” in the LZ parse. The parse decomposes the original string into a sequence of phrases; concatenating the phrases yields the string back, and phrases have a compact representation. In LZ77, phrases are compressed because they refer back to substrings in prior phrases. LZEnd adds another constraint: the backreferences cannot end in the middle of phrases.
For example, LZ77 might have a back-reference into

[abc][def][ghi]

to represent “cdefg,” a match that begins inside one phrase and ends in the middle of another. LZEnd would be forced to end the new phrase at “f,” and only represent “cdef.” The paper shows that this additional restriction has a marginal impact on compression rate, and uses the structure to speed up operations on compressed data. (The formal definition also forbids the cute/annoying self-reference-as-loop idiom of LZ77, without losing too much compression power!)
We can apply the same idea to compress code. Each phrase is now a subroutine with a return at the end. A back-reference is a series of calls to subroutines; the first call might begin in the middle, but matches always end on return, exactly like normal code does! A phrase might begin in the middle of a phrase that itself consists of calls. That’s still implementable: the referrer can see through the indirection and call in the middle of the callee’s callee (etc.), and then go back to the callee for a suitably aligned submatch.
That last step looks like it causes a space blowup, and I can’t bound it (yet).
But that’s OK, because I was only looking into compressing traces as a last resort. I’m much more interested in expression trees, but couldn’t find a way to canonicalise sets (e.g., arguments to integer +) and sequences (e.g., floating point *) so that similar collections have similar subtrees… until I read Hammer et al’s work on Nominal Adapton, which solves a similar problem in a different context.
They want a tree representation for lists and tries (sets/maps) such that a small change in the list/trie causes a small change in the tree that mostly preserves identical subtrees. They also want the representation to be a deterministic function of the list/trie. That way, they can efficiently reuse computations after incremental changes to inputs.
That’s exactly my sequence/set problem! I want a treebased representation for sequences (lists) and sets (tries) such that similar sequences and sets have mostly identical subtrees for which I can reuse pregenerated code.
Nominal Adapton uses a hash-based construction described by Pugh and Teitelbaum in 1989 (Incremental computation via function caching) to represent lists, and extends the idea for tries. I can “just” use the same trick to canonicalise lists and sets into binary trees, and (probabilistically) get common subexpressions for free, even across expression trees! It’s not perfect, but it should scale pretty well.
That’s what I’m currently exploring when it comes to using compression to reduce cache footprint while doing aggressive specialisation. Instead of finding redundancy in linearised bytecode after the fact, induce identical subtrees for similar expressions, and directly reuse code fragments for subexpressions.
I thought I’d post a snippet on the effect of alignment and virtual memory tricks on TLBs, but couldn’t find time for that. Perhaps later this week. In the meantime, I have to prepare a short talk on the software transactional memory system we built at AppNexus. Swing by 23rd Street on December 15 if you’re in New York!
What do memory allocation, histograms, and event scheduling have in common? They all benefit from rounding values to predetermined buckets, and the same bucketing strategy combines acceptable precision with reasonable space usage for a wide range of values. I don’t know if it has a real name; I had to come up with the (confusing) term “linear-log bucketing” for this post! I also used it twice last week, in otherwise unrelated contexts, so I figure it deserves more publicity.
I’m sure the idea is old, but I first came across this strategy in jemalloc’s binning scheme for allocation sizes. The general idea is to simplify allocation and reduce external fragmentation by rounding allocations up to one of a few bin sizes. The simplest scheme would round up to the next power of two, but experience shows that’s extremely wasteful: in the worst case, an allocation for \(k\) bytes can be rounded up to \(2k - 2\) bytes, for almost 100% space overhead! Jemalloc further divides each power-of-two range into 4 bins, reducing the worst-case space overhead to 25%.
This sub-power-of-two binning covers medium and large allocations. We still have to deal with small ones: the ABI forces alignment on every allocation, regardless of their size, and we don’t want to have too many small bins (e.g., 1 byte, 2 bytes, 3 bytes, …, 8 bytes). Jemalloc adds another constraint: bins are always multiples of the allocation quantum (usually 16 bytes).
The sequence for bin sizes thus looks like: 16, 32, 48, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, 384, … (0 is special because malloc must either return NULL [bad for error checking] or treat it as a full-blown allocation).
I like to think of this sequence as a special initial range with 4 linearly spaced sub-bins (0 to 63), followed by power-of-two ranges that are again split in 4 sub-bins (i.e., almost logarithmic binning). There are thus two parameters: the size of the initial linear range, and the number of sub-bins per range. We’re working with integers, so we also know that the linear range is at least as large as the number of sub-bins (it’s hard to subdivide 8 integers in 16 bins).
Assuming both parameters are powers of two, we can find the bucket for any value with only a couple of x86 instructions, and no conditional jump or lookup in memory. That’s a lot simpler than jemalloc’s implementation; if you’re into Java, HdrHistogram’s binning code is nearly identical to mine.
As always when working with bits, I first doodled in SLIME/SBCL: CL’s bit manipulation functions are more expressive than C’s, and a REPL helps exploration.
Let linear be the \(\log\sb{2}\) of the linear range, and subbin the \(\log\sb{2}\) of the number of sub-bins per range, with linear >= subbin.
The key idea is that we can easily find the power-of-two range (with a BSR), and that we can determine the sub-bin in that range by shifting the value right to keep only its subbin most significant (non-zero) bits.
I clearly need something like \(\lfloor\log\sb{2} x\rfloor\); in CL, that’s just (1- (integer-length x)) for positive x.
I’ll also want to treat values smaller than 2**linear as though they were about 2**linear in size. We’ll do that with
nbits := (lb (logior x (ash 1 linear))) === (max linear (lb x))
We now want to shift away all but the top subbin bits of x:
shift := (- nbits subbin)
subindex := (ash x (- shift))
For a memory allocator, the problem is that the last rightward shift rounds down! Let’s add a small mask to round things up:
mask := (ldb (byte shift 0) -1) ; that's `shift` one bits
rounded := (+ x mask)
subindex := (ash rounded (- shift))
We have the top subbin bits (after rounding) in subindex. We only need to find the range index:
range := (- nbits linear) ; nbits >= linear
Finally, we combine these two together by shifting range by subbin bits:
index := (+ (ash range subbin) subindex)
Extra! Extra! We can also find the maximum value for the bin with
size := (logandc2 rounded mask)
Assembling all this yields the bucket function exercised in the transcript below.
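A C port of the assembled function, following the steps above (the function and parameter names mirror the CL pseudocode; lb is \(\lfloor\log\sb{2} x\rfloor\)):

```c
#include <assert.h>
#include <stdint.h>

/* floor(log2(x)) for x > 0. */
static unsigned lb(uint64_t x)
{
        return 63 - (unsigned)__builtin_clzll(x);
}

/* Round-up linear-log bucketing.  linear is the log2 of the linear
 * range, subbin the log2 of the number of sub-bins per range, with
 * linear >= subbin.  Returns the bucket index and stores the
 * rounded-up size in *size. */
static uint64_t bucket(uint64_t x, unsigned linear, unsigned subbin,
                       uint64_t *size)
{
        unsigned n_bits = lb(x | ((uint64_t)1 << linear));
        unsigned shift = n_bits - subbin;
        uint64_t mask = ((uint64_t)1 << shift) - 1;
        uint64_t rounded = x + mask;
        uint64_t sub_index = rounded >> shift;
        unsigned range = n_bits - linear;

        *size = rounded & ~mask;
        return ((uint64_t)range << subbin) + sub_index;
}
```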
Let’s look at what happens when we want \(2\sp{2} = 4\) sub-bins per range, and a linear progression over \([0, 2\sp{4} = 16)\).
CL-USER> (bucket 0 4 2)
0 ; 0 gets bucket 0 and rounds up to 0
0
CL-USER> (bucket 1 4 2)
1 ; 1 gets bucket 1 and rounds up to 4
4
CL-USER> (bucket 4 4 2)
1 ; so does 4
4
CL-USER> (bucket 5 4 2)
2 ; 5 gets the next bucket
8
CL-USER> (bucket 9 4 2)
3
12
CL-USER> (bucket 15 4 2)
4
16
CL-USER> (bucket 17 4 2)
5
20
CL-USER> (bucket 34 4 2)
9
40
The sequence is exactly what we want: 0, 4, 8, 12, 16, 20, 24, 28, 32, 40, 48, …!
The function is marginally simpler if we can round down instead of up.
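The round-down variant, again as a C sketch: the only difference is that no rounding mask is added to x before shifting.

```c
#include <assert.h>
#include <stdint.h>

/* floor(log2(x)) for x > 0. */
static unsigned lb(uint64_t x)
{
        return 63 - (unsigned)__builtin_clzll(x);
}

/* Round-down linear-log bucketing: same index computation as the
 * round-up version, without adding the rounding mask to x. */
static uint64_t bucket_down(uint64_t x, unsigned linear, unsigned subbin,
                            uint64_t *size)
{
        unsigned n_bits = lb(x | ((uint64_t)1 << linear));
        unsigned shift = n_bits - subbin;
        uint64_t mask = ((uint64_t)1 << shift) - 1;
        uint64_t sub_index = x >> shift;
        unsigned range = n_bits - linear;

        *size = x & ~mask;
        return ((uint64_t)range << subbin) + sub_index;
}
```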
CL-USER> (bucket-down 0 4 2)
0 ; 0 still gets the 0th bucket
0 ; and rounds down to 0
CL-USER> (bucket-down 1 4 2)
0 ; but now so does 1
0
CL-USER> (bucket-down 3 4 2)
0 ; and 3
0
CL-USER> (bucket-down 4 4 2)
1 ; 4 gets its bucket
4
CL-USER> (bucket-down 7 4 2)
1 ; and 7 shares it
4
CL-USER> (bucket-down 15 4 2)
3 ; 15 gets the 3rd bucket for [12, 15]
12
CL-USER> (bucket-down 16 4 2)
4
16
CL-USER> (bucket-down 17 4 2)
4
16
CL-USER> (bucket-down 34 4 2)
8
32
That’s the same sequence of bucket sizes, but rounded down in size instead of up.
I first implemented this code to mimic jemalloc’s binning scheme: in a memory allocator, a linear-logarithmic sequence gives us alignment and bounded space overhead (bounded internal fragmentation), while keeping the number of size classes down (controlling external fragmentation).
High dynamic range histograms use the same class of sequences to bound the relative error introduced by binning, even when recording latencies that vary between microseconds and hours.
I’m currently considering this binning strategy to handle a large number of timeout events, when an exact priority queue is overkill. A timer wheel would work, but tuning memory usage is annoying. Instead of going for a hashed or hierarchical timer wheel, I’m thinking of binning events by timeout, with one FIFO per bin: events may be late, but never by more than, e.g., 10% of their timeout. I also don’t really care about sub-millisecond precision, but wish to treat zero specially; that’s all taken care of by the “round up” linear-log binning code.
In general, if you ever think to yourself that dispatching on the bit width of a number would mostly work, except that you need more granularity for large values, and perhaps less for small ones, linear-logarithmic binning sequences may be useful. They let you tune the granularity at both ends, and we know how to round values and map them to bins with simple functions that compile to fast and compact code!
P.S. If a chip out there has fast int-to-FP conversion and slow bit scans (!?), there’s another approach: convert the integer to FP, scale by, e.g., \(1.0 / 16\), add 1, and shift/mask to extract the bottom of the exponent and the top of the significand. That’s not slow, but unlikely to be faster than a bit scan and a couple of shifts/masks.
Here’s what I quickly (so quickly that my phone failed to focus correctly) put together on embedding search trees in sorted arrays. You’ll note that the “slides” are very low-tech; hopefully, more people will contribute their own style to the potluck next time (:
I didn’t really think about implementing search trees until 3-4 years ago. I met an online collaborator in Paris who, after a couple of G&Ts, brought up the topic of “desert island” data structures: if you were stuck with a computer and a system programming guide on a desert island, how would you rebuild a standard library from scratch? Most data structures and algorithms that we use every day are fairly easy to remember, especially if we don’t care about proofs of performance: basic dynamic memory allocation, hash tables, sorting, not-so-bignum arithmetic, etc. are all straightforward. He even had a mergeable priority queue, with skew heaps. However, we both got stuck on balanced search trees: why would anyone want to remember rotation rules? (Tries were rejected on what I argue are purely theoretical grounds ;)
I love searching in sorted arrays, so I kept looking for a way to build simpler search trees on top of that. That led me to Bentley and Saxe’s (PDF) dynamisation trick. The gist of it is that there’s a family of methods to build dynamic sets on top of static versions. For sorted arrays, one extreme is an unsorted list with fast inserts and slow reads, and the other exactly one sorted array, with slow inserts and fast lookups. The most interesting design point lies in the middle, with \( \log n \) sorted arrays, yielding \( \mathcal{O}(\lg n) \) time inserts and \( \mathcal{O}(\lg\sp{2}n) \) lookups; we can see that design in write-optimised databases. The problem is that my workloads tend to be read-heavy.
Some time later, I revisited a paper by Brodal, Fagerberg, and Jacob (PDF). They do a lot of clever things to get interesting performance bounds, but I’m really not convinced it’s all worth the complexity^{1}… especially in the context of our desert island challenge. I did find one trick very interesting: they preserve logarithmic time lookups when binary searching arrays with missing values by recasting these arrays as implicit binary trees and guaranteeing that “NULLs” never have valid entries as descendants. That’s a lot simpler than other arguments based on guaranteeing a minimum density. It’s so much simpler that we can easily make it work with a branch-free binary search: we only need to treat NULLs as \( \pm \infty \) (depending on whether we want a predecessor or a successor).
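A sketch of that trick (my code, not theirs): successor search in an implicit breadth-first binary search tree with holes, where empty slots hold a +∞ sentinel and, per the invariant, never have valid descendants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EMPTY UINT64_MAX /* NULL slot, read as +infinity */

/* Successor search in an implicit BFS-layout BST with holes: returns
 * the smallest element >= key, or EMPTY if there is none.  Empty
 * slots compare as +infinity, so they simply steer the search left;
 * the min keeps them from clobbering a real candidate.  Both
 * conditionals compile to conditional moves: no data-dependent branch. */
static uint64_t successor(const uint64_t *a, size_t n, uint64_t key)
{
        uint64_t best = EMPTY;
        size_t i = 0;

        while (i < n) {
                uint64_t v = a[i];
                int right = v < key;

                if (!right && v < best)
                        best = v;
                i = 2 * i + 1 + right;
        }
        return best;
}
```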
While lookups are logarithmic time, inserts are \(\mathcal{O}(\lg\sp{2} n) \) time. Still no satisfying answer to the desert island challenge.
I went back to my real research in optimisation, and somehow stumbled on Igal Galperin’s PhD thesis on both online optimisation/learning and… simpler balanced binary search trees!
Scapegoat trees (PDF) rebalance by guaranteeing a bound \( \alpha > 0 \) on the relative difference between the optimal depth (\( \lceil\lg n\rceil \)) for a set of \(n\) values and the height (maximal depth) of the balanced tree (at most \( (1+\alpha)\lceil\lg n\rceil \)). The only property that a scapegoat tree has (in addition to those of binary search trees) is this bound on the height of the tree, as a function of its size. Whenever a new node would be inserted at a level too deep for the size of the tree, we go up its ancestors to find a subtree that is small enough to accommodate the newcomer and rebuild it from scratch. I will try to provide an intuition of how they work, but the paper is a much better source.
For a tree of \(n = 14\) elements, we could have \(\alpha = 0.25\), for a maximum depth of \(1.25 \lceil\lg 14\rceil = 5\). Let’s say we attempt to insert a new value, but the tree is structured such that the value would be the child of a leaf that’s already at depth \(5\); we’d violate the (im)balance bound. Instead, we go up until we find an ancestor \(A\) at depth, e.g., \(3\) with \(4\) descendants. The ancestor is shallow enough that it has space for \(5 - 3 = 2\) levels of descendants, for a total height of \(2 + 1 = 3\) for the subtree. A full binary tree of height \(3\) has \(2\sp{3} - 1 = 7\) nodes, and we thus have enough space for \(A\), its \(4\) descendants, and the new node! These 6 values are rebuilt in a near-perfect binary tree: every level must be fully populated, except for the last one.
The criteria to find the scapegoat subtree are a bit annoying to remember–especially given that we don’t want to constantly rebuild the whole tree–but definitely simpler than rotation rules. I feel like that finally solves the desert island balanced search tree challenge… but we still have gapped sorted arrays to address.
What’s interesting about scapegoat trees is that rebalancing is always localised to a subtree. Rotating without explicit pointers is hard (not impossible, amazingly enough), but scapegoat trees just reconstruct the whole subtree, i.e., a contiguous section of the sorted array. That’s easy: slide non-empty values to the right, and redistribute recursively. But, again, finding the scapegoat subtree is annoying.
That made me think: what if I randomised scapegoat selection? Rather than counting elements in subtrees, I could approximate that probabilistically by sampling from an exponential distribution… which we can easily approximate with the geometric for \(p = 0.5\) by counting leading zeros in bitstrings.
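Sampling from that geometric distribution is a one-liner: count the length of the run of identical bits at one end of a random word (trailing works as well as leading). A sketch, with the function name mine:

```c
#include <assert.h>
#include <stdint.h>

/* Geometric(p = 1/2) sample: the number of trailing zeros in a
 * uniformly random word is 0 with probability 1/2, 1 with probability
 * 1/4, 2 with 1/8, etc.  The |(1 << 63) guards against an all-zero
 * input, capping the result at 63. */
static unsigned geometric_half(uint64_t random_bits)
{
        return (unsigned)__builtin_ctzll(random_bits | (1ULL << 63));
}
```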
I’m still not totally convinced that it works, but I vaguely remember successfully testing an implementation and sketching a proof that we can find the scapegoat subtree by going up according to a scaled geometric to preserve amortised logarithmic time inserts. The probability function decreases quickly enough that we preserve logarithmic time inserts on average, yet slowly enough that we can expect to redistribute a region before it runs out of space.
The argument is convoluted, but the general idea is based on the observation that, in a tree of maximum height \(m\), a subtree at depth \(k\) can contain at most \(n\sp\prime = 2\sp{m - k + 1} - 1\) elements (including the subtree’s root).
We only violate the imbalance bound in a subtree if we attempt to insert more than \(n\sp\prime\) elements in it. Rebalancing works by designating the shallowest subtree that’s not yet full as the scapegoat. We could simplify the selection of the scapegoat tree by counting the number of inserts in each subtree, but that’d waste a lot of space. Instead, we count probabilistically and ensure that there’s a high probability (that’s why we always go up by at least \(\lg \lg n\) levels) that each subtree will be rebalanced at least once before it hits its insertion count limit. The memoryless property of the geometric distribution means that this works even after a rebalance. If we eventually fail to find space, it’s time to completely rebuild the subtree; this case happens rarely enough (\(p \approx \frac{\lg n}{n}\)) that the amortised time for insertions is still logarithmic.
We can do the same thing when embedding scapegoat trees in implicit trees. The problem is that a multiplicative overhead in depth results in an exponential space blowup. The upside is that the overhead is tunable: we can use less space at the expense of slowing down inserts.
In fact, if we let \( \alpha \rightarrow 0 \), we find Brodal et al’s scheme (I don’t know why they didn’t just cite Galperin and Rivest on scapegoat trees)! The difference is that we are now pretty sure that we can easily let a random number generator guide our redistribution.
I only covered insertions and lookups so far. It turns out that deletions in scapegoat trees are easy: replace the deleted node with one of its leaves. Deletions should also eventually trigger a full rebalance to guarantee logarithmic time lookups.
Classical implicit representations for sorted sets make us choose between appallingly slow (linear time) inserts and slow lookups. With stochastic scapegoat trees embedded in implicit binary trees, we get logarithmic time lookups, and we have a continuum of choices between wasting an exponential amount of space and slow \( \mathcal{O}(\lg\sp{2} n) \) inserts. In order to get there, we had to break one rule: we allowed ourselves \(\mathcal{O}(n)\) additional space, rather than \(\mathcal{o}(n)\), but it’s all empty space.
What other rule or assumption can we challenge (while staying true to the spirit of searching in arrays)?
I’ve been thinking about interpolation lately: what if we had a monotone (not necessarily injective) function to map from the set’s domain to machine integers? That’d let us bucket values or interpolate to skip the first iterations of the search. If we can also assume that the keys are uniformly distributed once mapped to integers, we can use a linear Robin Hood hash table: with a linear (but small) space overhead, we get constant time expected inserts and lookups, and what seems to be \( O(\lg \lg n) \) worst case^{2} lookups with high probability.
Something else is bothering me. We embed in full binary trees, and thus binary search over arrays of size \(2\sp{n} - 1\)… and we know that’s a bad idea. We could switch to ternary trees, but that means inserts and deletes must round to the next power of three. Regular div-by-mul and scaling back up by the divisor always works; is there a simpler way to round to a power of three or to find the remainder by such a number?
I don’t know! Can anyone offer insights or suggest new paths to explore?
I think jumping the van Emde Boa[s] is a thing, but they at least went for the minor version, the van Emde Boas layout ;)↩
The maximal distance between the interpolation point and the actual location appears to scale logarithmically with the number of elements. We perform a binary search over a logarithmic-size range, treating empty entries as \(\infty\).↩
As programmers, we have a tendency to try and maximise generality: if we can support multiple writers, why would one bother with measly SPMC systems? The thing is, SPMC is harder than SPSC, and MPMC is even more complex. Usually, more concurrency means programs are harder to get right, harder to scale, and harder to maintain. Worse: it also makes it more difficult to provide theoretical progress guarantees.
Apart from architecting around simple cases, there are a few ways to deal with this reality. We can define new, weaker, classes of programs, like obstruction-freedom: a system is obstruction-free when one thread is guaranteed to make progress if every other thread is suspended. We can also weaken the guarantees of our data structure. For example, rather than exposing a single FIFO, we could distribute load and contention across multiple queues; we lose strict FIFO order, but we also eliminate a system bottleneck. Another option is to try and identify how real computers are more powerful than our abstract models: some argue that, realistically, many lock-free schemes are wait-free, and others exploit the fact that x86-TSO machines have finite store buffers.
Last week, I got lost doodling with x86-specific cross-modifying code, but still stumbled on a cute example of a simple lock-free protocol: lock-free sequence locks. This sounds like an oxymoron, but I promise it makes sense.
It helps to define the terms better. Lock-freedom means that the overall system will always make progress, even if some (but not all) threads are suspended. Classical sequence locks are an optimistic form of write-biased reader/writer locks: concurrent writes are forbidden (e.g., with a spinlock), read transactions abort whenever they observe that writes are in progress, and a generation counter avoids ABA problems (when a read transaction would observe that no write is in progress before and after a quick write).
In Transactional Mutex Locks (PDF), sequence locks proved to have enviable performance on small systems and scaled decently well for readheavy workloads. They even allowed lazy upgrades from reader to writer by atomically checking that the generation has the expected value when acquiring the sequence lock for writes. However, we lose nearly all progress guarantees: one suspended writer can freeze the whole system.
The central trick of lock-freedom is cooperation: it doesn’t matter if a thread is suspended in the middle of a critical section, as long as any other thread that would block can instead complete the work that remains. In general, this is pretty hard, but we can come up with restricted use cases that are idempotent. For lock-free sequence locks, the critical section is a precomputed set of writes: a series of assignments that must appear to execute atomically. It’s fine if writes happen multiple times, as long as they stop before we move on to another set of writes.
There’s a primitive based on compare-and-swap that can easily achieve such conditional writes: restricted double compare and single swap (RDCSS, introduced in A Practical Multi-Word Compare-and-Swap (PDF)). RDCSS atomically checks if both a control word (e.g., a generation counter) and a data word (a mutable cell) have the expected values and, if so, writes a new value in the data word. The pseudocode for regular writes looks like
if (CAS(self.data, self.old, self) == fail) {
return fail;
}
if (*self.control != self.expected) {
CAS(self.data, self, self.old);
return fail;
}
CAS(self.data, self, self.new);
return success;
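The pseudocode above can be fleshed out with C11 atomics; a simplified sketch (names mine): the descriptor is published in the data word with its low pointer bit set as a tag, which assumes descriptors are at least 2-byte aligned; the path where another thread helps a descriptor it finds in *data is omitted for brevity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* RDCSS descriptor: write new_value to *data iff *data == old and
 * *control == expected. */
struct rdcss {
        _Atomic uintptr_t *control;
        uintptr_t expected;
        _Atomic uintptr_t *data;
        uintptr_t old;
        uintptr_t new_value;
};

static int rdcss(struct rdcss *self)
{
        uintptr_t desc = (uintptr_t)self | 1; /* low tag bit marks descriptors */
        uintptr_t seen = self->old;

        /* Phase 1: reserve data by installing the descriptor. */
        if (!atomic_compare_exchange_strong(self->data, &seen, desc))
                return 0;

        /* Phase 2: check the control word, then complete or roll back.
         * Any thread that sees the descriptor has enough information to
         * do the same, so it does not matter if we stall right here. */
        if (atomic_load(self->control) != self->expected) {
                atomic_compare_exchange_strong(self->data, &desc, self->old);
                return 0;
        }
        atomic_compare_exchange_strong(self->data, &desc, self->new_value);
        return 1;
}
```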
The trick is that, if the first CAS succeeds, we always know how to undo it (data’s old value must be self.old), and that information is stored in self, so any thread that observes the first CAS has enough information to complete or roll back the RDCSS. The only annoying part is that we need a two-phase commit: reserve data, confirm that control is as expected, and only then write to data.
For the cost of two compare-and-swaps per write – plus one to acquire the sequence lock – writers don’t lock out other writers (writers help each other make progress instead). Threads (especially readers) can still suffer from starvation, but at least the set of writes can be published ahead of time, so readers can even look values up in that set rather than waiting for/helping writes to complete. The generation counter remains a bottleneck, but, as long as writes are short and happen rarely, that seems like an acceptable trade to avoid the 3n CAS of a multi-word compare-and-swap.
Here’s what the scheme looks like in SBCL.
First, a mutable box, because we don’t have raw pointers in CL (I could also have tried to revive my sb-locative hack).
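In this toy Python model of the scheme (the original listing is in CL; Box and its slot name are my guess at the shape), a box is just a one-slot object:

```python
class Box:
    """A one-slot mutable cell; stands in for a raw pointer to a heap
    location (CL has no locatives, hence the box)."""
    __slots__ = ("value",)

    def __init__(self, value=None):
        self.value = value
```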
Next, the type for write records: we have the value for the next generation (once the write is complete) and a hash table mapping boxes to pairs of old and new values. There’s a key difference with the way RDCSS is used to implement multi-word compare-and-swap: we don’t check for mismatches in the old value; we simply assume that it is correct.
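A sketch of that record in the toy Python model (names are mine, not the post’s CL definitions):

```python
class Box:
    __slots__ = ("value",)
    def __init__(self, value=None):
        self.value = value

class WriteRecord:
    """A precomputed set of writes: the generation to publish once the
    writes are done, plus a mapping from box to (old, new) pair.
    Unlike RDCSS-based MCAS, the old value is trusted, never compared."""
    def __init__(self, generation, writes):
        self.generation = generation
        self.writes = dict(writes)   # Box -> (old, new)
```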
The central bottleneck is the sequence lock, which each (read) transaction must snapshot before attempting to read consistent values.
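In the toy model, the sequence lock is a single cell that holds either a positive generation number or an in-flight write record, and a snapshot is a single read of that cell (again, the names are mine; the real CL code also needs memory barriers that this sketch omits):

```python
class WriteRecord:
    def __init__(self, generation, writes):
        self.generation = generation
        self.writes = dict(writes)

class SequenceLock:
    def __init__(self):
        # Either a positive generation, or the WriteRecord in progress.
        self.state = 1

def snapshot(lock):
    # One read of the lock; this is where an acquire barrier would sit.
    return lock.state

def snapshot_generation(snap):
    # The generation of a snapshot: the snapshot itself when it is a
    # positive integer, the record's generation when a write is in flight.
    return snap if isinstance(snap, int) else snap.generation
```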
The generation associated with a snapshot is the snapshot itself if it is a positive fixnum; otherwise, it is the write record’s generation.
Before using any read, we make sure that the generation counter hasn’t changed.
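In the toy model, that validation is one comparison: a batch of reads is consistent only if the lock still holds the exact snapshot we started from, since any change (a new record or a new generation) means our reads may have observed a torn state. A hedged sketch:

```python
class SequenceLock:
    def __init__(self):
        self.state = 1   # positive generation, or an in-flight record

def reads_valid(lock, snap):
    # Reads taken since `snap` are usable iff the lock hasn't moved.
    # In real code this recheck needs a load barrier before it.
    return lock.state == snap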
I see two ways to deal with starting a read transaction while a write is in progress: we can help the write complete, or we can overlay the write on top of the current heap in software. I chose the latter: reads can already be started by writers. If a write is in progress when we start a transaction, we stash the write set in *current-map* and look it up first:
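The overlaid lookup, sketched in the toy Python model (CURRENT_MAP plays the role of *current-map*; set_current_map is a helper of my own, not from the post):

```python
class Box:
    __slots__ = ("value",)
    def __init__(self, value=None):
        self.value = value

CURRENT_MAP = None   # write set of an in-flight record, if any

def set_current_map(writes):
    global CURRENT_MAP
    CURRENT_MAP = writes

def box_value(box):
    # If a write was in progress when the transaction started, its
    # write set shadows the heap: look there first, then in the box.
    if CURRENT_MAP is not None and box in CURRENT_MAP:
        _old, new = CURRENT_MAP[box]
        return new
    return box.value
```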
We’re now ready to start read transactions. We take a snapshot of the generation counter, update *current-map*, and try to execute a function that uses box-value. Again, we don’t need a read-read barrier on x86oids (nor on SPARC, but SBCL doesn’t have threads on that platform).
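The retry loop looks roughly like this in the toy Python model (single-threaded, so the retry never fires here; all names are mine):

```python
class Box:
    __slots__ = ("value",)
    def __init__(self, value=None):
        self.value = value

class SequenceLock:
    def __init__(self):
        self.state = 1

CURRENT_MAP = None

def box_value(box):
    if CURRENT_MAP is not None and box in CURRENT_MAP:
        return CURRENT_MAP[box][1]
    return box.value

def with_read_transaction(lock, thunk):
    # Snapshot the lock, overlay any in-flight write set, run the
    # thunk, and retry from scratch if the lock moved underneath us.
    global CURRENT_MAP
    while True:
        snap = lock.state
        CURRENT_MAP = None if isinstance(snap, int) else snap.writes
        result = thunk()
        if lock.state == snap:   # reads were consistent
            return result
```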
The next function is the keystone: helping a write record go through exactly once.
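The toy model only captures the happy path of helping: the real help function also has to stop laggard writers from clobbering later generations, which is exactly what the per-box RDCSS dance is for, and this sketch elides that entirely. With that caveat:

```python
class Box:
    __slots__ = ("value",)
    def __init__(self, value=None):
        self.value = value

class WriteRecord:
    def __init__(self, generation, writes):
        self.generation = generation
        self.writes = dict(writes)

class SequenceLock:
    def __init__(self):
        self.state = 1

def help_record(lock, record):
    # Any thread that observes an in-flight record may push it forward:
    # re-executing a write just stores the same value again, so racing
    # helpers are harmless in this model.  Publishing record.generation
    # in the lock is what makes the record go through exactly once.
    if lock.state is not record:
        return
    for box, (_old, new) in record.writes.items():
        box.value = new
    lock.state = record.generation   # a CAS in real code
```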
Now we can commit with a small wrapper around help. Transactional mutex locks have the notion of transactions that are directly created as write transactions. We assume that we always know how to undo writes, so transactions can only be upgraded from reader to writer. Committing a write thus checks that the generation counter is still consistent with the (read) transaction before publishing the new write set and helping it forward.
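In the toy Python model, commit is the snapshot check plus publish-and-help; the two conditional stores below would both be CASes in the real (concurrent) code:

```python
class Box:
    __slots__ = ("value",)
    def __init__(self, value=None):
        self.value = value

class WriteRecord:
    def __init__(self, generation, writes):
        self.generation = generation
        self.writes = dict(writes)

class SequenceLock:
    def __init__(self):
        self.state = 1

def help_record(lock, record):
    if lock.state is record:
        for box, (_old, new) in record.writes.items():
            box.value = new
        lock.state = record.generation

def commit(lock, snap, writes):
    # Upgrade the read transaction to a writer: publish a record only
    # if the lock still holds our snapshot, then help it through.
    if lock.state != snap:
        return False
    gen = snap if isinstance(snap, int) else snap.generation
    record = WriteRecord(gen + 1, writes)
    lock.state = record
    help_record(lock, record)
    return True
```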
And now, some syntactic sugar to schedule writes:
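The key property of scheduling a write is that nothing touches the heap until commit time: writes only accumulate in the transaction’s write set. A hedged sketch in the toy Python model (the Transaction class and its method are mine):

```python
class Box:
    __slots__ = ("value",)
    def __init__(self, value=None):
        self.value = value

class Transaction:
    """Collects scheduled writes; boxes are only mutated when the
    record is published and helped through at commit time."""
    def __init__(self, snapshot):
        self.snapshot = snapshot
        self.writes = {}          # Box -> (old, new)

    def write(self, box, new):
        # Remember the original value the first time a box is written,
        # so later writes to the same box keep the same old value.
        old = self.writes[box][0] if box in self.writes else box.value
        self.writes[box] = (old, new)
```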
That’s enough for a smoke test on my dual-core laptop.
The function test-reads counts the number of successful read transactions and checks that (box-value a) and (box-value b) are always equal. That consistency is preserved by test-writes, which counts the number of times it succeeds in incrementing both (box-value a) and (box-value b).
The baseline case should probably be serial execution, while the ideal case for the transactional mutex lock is when there is at most one writer. Hopefully, lock-free sequence locks also do well when there are multiple writers.
First, the serial case. As expected, all the transactions succeed, in 6.929 seconds total (6.628 without GC time). With one writer and two readers, all the writes succeed (as expected), and so do 98.5% of reads; all that in 4.186 non-GC seconds, a 65% speedup. Finally, with two writers and two readers, 76% of writes and 98.5% of reads complete in 4.481 non-GC seconds. That 7% slowdown compared to the single-writer case is pretty good: my laptop only has two cores, so I would expect more aborted reads and a lot more contention with, e.g., a spinlock.
CL-USER> (gc :full t) (time (test-serial 1000000))
Evaluation took:
6.929 seconds of real time
6.944531 seconds of total run time (6.750770 user, 0.193761 system)
[ Run times consist of 0.301 seconds GC time, and 6.644 seconds non-GC time. ]
100.23% CPU
11,063,956,432 processor cycles
3,104,014,784 bytes consed
(10000000 10000000 1000000 1000000)
CL-USER> (gc :full t) (time (test-single-writer 1000000))
Evaluation took:
4.429 seconds of real time
6.465016 seconds of total run time (5.873936 user, 0.591080 system)
[ Run times consist of 0.243 seconds GC time, and 6.223 seconds non-GC time. ]
145.97% CPU
6,938,703,856 processor cycles
2,426,404,384 bytes consed
(9863611 9867095 1450000)
CL-USER> (gc :full t) (time (test-multiple-writers 1000000))
Evaluation took:
4.782 seconds of real time
8.573603 seconds of total run time (7.644405 user, 0.929198 system)
[ Run times consist of 0.301 seconds GC time, and 8.273 seconds non-GC time. ]
179.30% CPU
7,349,757,592 processor cycles
3,094,950,400 bytes consed
(9850173 9853102 737722 730614)
How does a straight mutex do with four threads?
CL-USER> (gc :full t) (time (test-mutex 1000000))
Evaluation took:
5.814 seconds of real time
11.226734 seconds of total run time (11.169670 user, 0.057064 system)
193.10% CPU
9,248,370,000 processor cycles
1,216 bytes consed
(#<SB-THREAD:THREAD FINISHED values: NIL {1003A6E1F3}>
#<SB-THREAD:THREAD FINISHED values: NIL {1003A6E383}>
#<SB-THREAD:THREAD FINISHED values: NIL {1003A6E513}>
#<SB-THREAD:THREAD FINISHED values: NIL {1003A6E6A3}>)
There’s almost no allocation (there’s no write record), but the lack of read parallelism makes locks about 20% slower than the lock-free sequence lock. A reader-writer lock would probably close that gap. The difference is that the lock-free sequence lock has stronger guarantees in the worst case: no unlucky preemption (or crash, with shared-memory IPC) can cause the whole system to stutter or even halt.
The results above correspond to my general experience. Lock-free algorithms aren’t always (or even regularly) more efficient than well-thought-out locking schemes; however, they are more robust and easier to reason about. When throughput is already more than adequate, it makes sense to eliminate locks, not to improve the best or even the average case, but rather to eliminate a class of worst cases – including deadlocks.
P.S., here’s a sketch of the horrible cross-modifying code hack. It turns out that the instruction cache is fully coherent on (post-586) x86oids; the prefetch queue will even reset itself based on the linear (virtual) address of writes. With a single atomic byte write, we can turn a xchg (%rax), %rcx into xchg (%rbx), %rcx, where %rbx points to a location that’s safe to mutate arbitrarily. That’s an atomic store predicated on the value of a control word elsewhere (hidden in the instruction stream itself, in this case). We could then dedicate one such sequence of machine code to each transaction and reuse them via some Safe Memory Reclamation mechanism (PDF).
There’s one issue: even without preemption (if a writer is preempted, it should see the modified instruction upon rescheduling), stores can take pretty long to execute: in the worst case, the CPU has to translate to a physical address and wait for the bus lock. I’m pretty sure there’s a bound on how long a xchg m, r64 can take, but I couldn’t find any documentation with hard figures. If we knew that xchg m, r64 never lasts more than, e.g., 10k cycles, a program could wait that many cycles before enqueueing a new write. That wait is bounded and, as long as writes are disabled very rarely, should improve the worst-case behaviour without affecting average throughput.