I find most developers are vastly more comfortable dealing with
pointerbased data structures than with implicit ones. I blame our
courses, which focus on the former and completely fail to show us how
to reason about the latter. For example, the typical presentation of
the binary heap introduces an indexing scheme to map from parent to
children – the children of heap[i]
are heap[2 * i]
and
heap[2 * i + 1]
, with onebased indices – that is hard to
generalise to ternary or kary heaps (Knuth’s presentation, with
parents at \(\lfloor i/2 \rfloor\), is no better). The reason it’s
so hard to generalise is that the indexing scheme hides a simple
bijection between paths in kway trees and natural numbers.
I find it ironic that I first encountered the idea of describing data structures or algorithms in terms of number systems, through Okasaki’s and Hinze’s work on purely functional data structures: that vantage point seems perfectly suited to the archetypal mutable data structure, the array! I’ll show how number systems help us understand implicit data structures with two examples: a simple indexing scheme for kary heaps and compositions of specially structured permutations to implement inplace bitreversal and (some) matrix transpositions.
The
classical
way
to present implicit binary heaps is to work with onebased array
indices and to root an implicit binary tree at heap[1]
. For any
node heap[i]
, the children live at heap[2 * i]
and
heap[2 * i + 1]
. My issue with this presentation is that it’s
unclear how to extend the scheme to ternary or kary trees: if the
children of heap[1]
are heap[3 * 1 ... 3 * 1 + 2]
, i.e.,
heap[3, 4, 5]
,
what do we do with heap[2]
? We end up with k  1
parallel kary
trees stashed in the same array. Knuth’s choice of mapping children
to parents with a floored integer division suffers the same fate.
Onebased arrays hides the beauty of the binary indexing scheme. With
zerobased arrays, the children of heap[i]
are heap[2 * i + 1]
and
heap[2 * i + 2]
. This is clearly isomorphic to the onebased scheme
for binary heaps. The difference is that the extension to kway
trees is obvious: the children of heap[i]
are heap[k * i + 1 ... k * i + k]
.
A couple examples like the one below fail to show any problem. However, even a large number of tests is no proof. Thinking in terms of number systems leads to a nice demonstration that the scheme creates a bijection between infinite kary trees and the naturals.
There is a unique finite path from the root to any vertex in an infinite kary tree. This path can be described as a finite sequence \((c\sb{1}, c\sb{2}, \ldots, c\sb{n})\) of integers between 1 and k (inclusively). If \(c\sb{1} = 1\), we first went down the first child of the root node; if \(c\sb{1} = 2\), we instead went down the second child; etc. If \(c\sb{2} = 3\), we then went down the third child, and so on. We can recursively encode such finite paths as naturals (in pidgin ML):
path_to_nat [] = 0
path_to_nat [c : cs] = k * path_to_nat cs + c
Clearly, this is an injection from finite paths in kary trees to the
naturals. There’s only a tiny difference with the normal positional
encoding of naturals in base k
: there is no 0, and digits instead
include k
. This prevents us from padding a path with zeros, which
would map multiple paths to the same natural.
We only have to show that path_to_nat
is a bijection between finite
paths and naturals. I’ll do that by constructing an inverse that is
total on the naturals.
nat_to_path 0 = []
nat_to_path n = let c = pop n
in c : nat_to_path (n  c) / k
where pop
is a version of the mod k
that returns k
instead of 0:
pop n = let c = n `mod` k
in
if (c != 0)
then c
else k
The base case is nat_to_path 0 = []
.
In the induction step, we can assume that path_to_nat cs = n
and
that nat_to_path n = cs
. We only have to show that, for any
\(1 \leq c \leq k\), nat_to_path path_to_nat c:cs = c:cs
. Let
n' = path_to_nat c:cs = k * n + c
.
\[n\sp{\prime} = kn + c \equiv c\quad \mod k,\]
so pop n'
will correctly return c
(and k
rather than 0). It’s
then a tiny bit of algebra to show that (n'  c) / k = n
, and we
fall back to the induction hypothesis.
This scheme is so simple that I wound up coding a version of
heapsort(3)
that lets callers choose the heap’s arity at runtime.
Higher arity heaps perform more comparisons but fewer swaps than
binary heaps; the tradeoff is profitable when sorting large items.
It seems to me that, for decades, we’ve been presenting implicit heaps
and heapsort in a way that marginally simplifies the binary case at
the expense of obscuring the elegant general case.
Bit reversing an array of length \(2\sp{l}\) sends the element x[i]
to
x[j]
, where the binary representation of i
(including padding up
to l
bits) is the reverse of j
. For example, in an array of
length 16, 3 = 0011
becomes 1100 = 12
.
Reversing a fixedwidth integer’s binary representation is its selfinverse, so bit reversing an array is a sequence of swaps. This means that the permutation can be performed inplace, as a series of independent swaps. Bit reversal used to be slow on cached machines: contiguous elements (with indices that only vary in their low order bits) swap with faroff elements (indices that only vary in their high order bits). Worse, the stride between the latter elements is a large power of two, which causes all sorts of aliasing issues. Workarounds (see Zhang 99 (PDF)) mostly end up implementing a software cache with explicit buffers. Nowadays, even L1 caches have such a high associativity that aliasing is a much less important issue.
NapaFFT3 implements bit reversals by calling a few specialised functions that only swap the lower and higher bits; the main routine iterates over an array of precomputed middle bit reversals (similar to various publications of Elster’s, but recursing on the middle bits first). In this implementation, the number of L1 cache misses incurred by bit reversing an array is only slightly greater than the compulsory misses. Bit reversal isn’t free, but it’s also not clear that autosorting FFTs are quicker than outoforder FFTs followed by a bit reversal pass.
Bit reversal is the only array permutation I’ve seen described in terms of its effect on indices. I think it’s a fruitful avenue for other inplace permutations.
For example, the viewpoint makes it clear how to transpose a matrix of dimensions \(2\sp{m} \times 2\sp{n}\) with a sequence of inplace bit reversals (each \(i\sb{k}\) and \(j\sb{l}\) is a bit in the index’s binary representation).
For a rowmajor layout, the sketch above corresponds to:
Bit reversal, like all other permutations, is a reversible linear operation. We can thus change the order of operation if we want to. For example, it’s not necessarily preferable to bitreverse contiguous rows first. We could also flip the highorder bits of the indices: rather than swapping scalars, we would swap rows. Separately bit reversing contiguous rows works best when each row fits in cache. Bit reversing columns instead amortises the bad access pattern inherent to bit reversal by spending more time on each swap: swapping rows is slower than swapping scalars, but also very efficients with regards to (streaming!) I/O.
This is interesting because inplace transposition of rectangular matrices is hard, and transposition is already a bad fit for caches. Transposing matrices with a sequence of bit reversals might just be practical. In fact, that’s what I intend to do in NapaFFT3 for multidimensional DFTs: we can fuse all but the middle wholevector bit reversal with mixedradix FFTs (and the latter might similarly benefit from operating on [sub]rows rather than scalars).
One obvious question now appears: can we generalise the trick to general dimensions? It’s pretty clear that we can do it for any other base \(b\) and matrices of dimension \(b\sp{m} \times b\sp{n}\) (it’s interesting how highly composite numbers dimensions are easy to transpose, and, IIRC, so are coprime ones). What if there’s no such factorisation? The best I can do is “more or less.”
For arbitrary matrix dimensions \(m \times n\), I think it’s best to decompose indices in a mixed radix (but still positional) number system. For example, a \(63 \times 21\) matrix might have indices in radix \(3,7\ \ 3,7,3\). Given this number system, matrix transposition is
It’s a small generalisation to let the radices be \(a,b\ \ a,b,a\), for a matrix of dimension \(ab \times a\sp{2}b\). We can then perform most of a matrix transposition by swapping positions of identical weight: first a full mixedradix digit reversal (the weights are palindromic), followed by another mixedradix reversal on the first three positions.
This leaves the last chunk \(b\sb{2},a\sb{3}\), which should instead be \(a\sb{3},b\sb{2}\). That’s another rectangular matrix tranpose, but smaller than the original one. It might be practical to execute that last step with a straightforward outofplace transpose: a smaller transpose needs less scratch space and may fit in cache. We can also apply the same trick as for bit reversals and apply the transpose before everything else, by permuting rows rather than scalars. The simplest way to do that is to transpose a matrix of pointers before replicating the permutation on the actual data (glibc’s mergesort references Knuth vol. 3, exercise 5.210).
Finally, this also works for \(a\sp{2}b \times ab\): matrix transposition is its own inverse, so we only have execute the inverse of each step, in reverse order.
Definitely mixed results, but at least we have some intuition on why general rectangular transpositions are hard to perform in place: they’re hard to express as sequences of swaps.
This post is the more theoretical prologue to
a lowlevel look
at qsort(3)
: I really wanted to make sure the nice implicit tree
layout in the first section had the space it deserves.
I tried to make inorder implicit trees fit in this number system approach. I can’t see how. The problem is that inorder trees associate ranges (rather than indices) with nodes; for example, at what depth is index 1000? It depends on the size of the search tree. It might be the root (in a tree of 2000 vertices) or a leaf (in a tree of 1001 vertices).
]]>I don’t really follow the compression scene, and only pay minimal attention to machine learning. Nevertheless, the “Compression is Learning” slogan feels more and more right to me. In this post, I’ll explore another relationship that I find surprising: one between compression and compilation.
Five years ago, I took Marc Feeley’s compiler class, and he let us choose our term project. I settled on generating traces from recursive algorithms (typical of cacheoblivious algorithms) and reordering them to get iterative code that was better suited to the first level of cache. I came up with a gnarly CLaccented researchquality prototype, but the result was surprisingly effective. Sadly, I never really managed to recover loops or function calls from the fully inlined trace, so even mediumsize kernels would exhaust the instruction cache.
I believe I now see a practical and scalable solution, thanks to Artur Jez’s work on “A really simple approximation of smallest grammar.” His work might lead to a functionoriented analogue of trace compilation. Trace compilers are notoriously weak on recursion (recursive function calls don’t map well to loops), so it would be nifty to have an alternative that identifies functions rather than loops.
This makes me quip that “(Trace) Compilation is Compression.” We can see trace compilers as lazily unpacking traces from the source program, rewriting them a bit, and recompressing traces in an executable format. Admittedly, the analogy is a bit of a stretch for classical compilers: they are less aggressive in decompressing source code and directly translate from one compressed representation (a small source file may describe billions of execution paths) to another.
Anyway… this is extremely hypothetical and Jez’s work is fun independently of my weekend speculations.
Once we cast a program trace as a sequence of opcodes (perhaps without operands, to expose more redundancy), it’s obvious that reducing code size is “just” compression, and LZtype algorithms quickly come to mind: they compress strings by referring to earlier substrings.
The heart of LZ77 is a loop that streams through the input sequence and finds the longest common subsequence earlier in the input. In practice, the search usually considers a fixedsize window; when I write LZ77 here, I instead refer to the theoretical approach with an unbounded search window. Repeated LCS searches on an unbounded window sounds slow, but, in sophisticated implementations, the bottleneck is sorting the input string to generate a suffix array.
I don’t feel like being sophisticated and will implement a quadratictime LCS search.
1 2 3 4 5 6 7 8 9 10 11 12 

Some backreferences clearly look like function calls.
1 2 3 4 5 6 

It’s a bit surprising at first, but some also correspond to repeat
loops.
1 2 3 

This last patch is strange because it’s selfreferential. However, if
we decompress character by character, its meaning becomes obvious: we
replicate the substring in [0, 1]
to generate exactly 5 characters.
This is basically a loop; we can handle partial copies with, e.g.,
rotation and entering the middle of the loop (like Duff’s device).
Lempel and Ziv’s result tells us that we can look for references greedily in an unbounded window and obtain asymptotically optimal (proportional to the string’s entropy) compression. Again, there are algorithms to do that in linear time – via radix sort – but I’ll just bruteforce my way in cubic time.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 

Here’s the problem with converting an LZ77 factorisation into calls and loops: there is no guarantee that patches nest sanely.
1 2 3 

This factorisation encodes “ababacbac” by repeating “ab” two and a half times, inserting “c”, and then repeating “bac.” The issue is that we wish to convert “ababa” into a loop, but the last patch would then need to enter that loop in its last iteration. We could also have the opposite problem, with a patch that only covers a prefix of an earlier patch (a few iterations of a loop). The sketch below shows how we might even have to deal with both issues in the same factor.
Jez’s paper analyses a method to recover a CFL with a single member, the string we wish to compress. The CFL is described by a fully deterministic (it produces exactly one string) CFG in Chomsky normal form, i.e., a straightline program.
This “really simple” approach takes a step back from LZ compression and instead starts with the easiest, greedy, way to generate a straightline program from a string: just pair characters as you encounter them, lefttoright.
For example, on “abab,” this would pair “(ab)(ab)” and then ”((ab)(ab)),” and generate the following grammar:
G > XX
X > ab
We got lucky: the repetition synced up with the greedy pairing. That’s not always the case; for example, “abcab” becomes “(ab)(ca)b” – which exposes no redundancy – rather than “(ab)c(ab).”
Jez addresses this issue by using LZ77 factorisation as an oracle to synchronise pairings.
First, we simplify the algorithm by assuming that there are no factors that begin exactly one character before their output. Such factors correspond to repetitions of that one character (e.g., “aaaa” = “a” x 4) and we can easily turn them into repetitions of a pair (e.g., “aa” x 2).
It’s then simply a matter of not pairing when it would prevent the next character from synchronising with its backreference. For example, in “abcab,” we’d skip pairing “ca” so that “ab” could pair like the first occurrence did. We also don’t force a pair when the backreference only includes the first letter, but not the second. Finally, we opportunistically match consecutive unpaired letters.
Each such pass creates a new, shorter, string of literals and production rules. We iteratively apply the guided pairing loop until we’re left with a single production rule.
That’s it. Jez’s paper is mostly concerned with showing that we only have to run LZ compression once and with proving suboptimality bounds. The key is that we can convert factors for the input into factors for the output, and that each pass shrinks the string by a multiplicative factor. I’ll simplify things further by recomputing a factorisation (in cubic time!) from scratch in each pass.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 

My implementation is exceedingly naïve and runs in cubic time in the worst case. With a lot more work, the bottleneck (LZ77 factorisation) runs at the speed of sort… but constructing a suffix array would only get in the way of this post. Let’s just see what pairing the code generates.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 

In short, “abcab” becomes “(ab)c(ab)”, “abababa” “(ab)(ab)(ab)a”, and “ababacbac” “(ab)(ab)(ac)()b(ac)”.
So far, we’ve only made pairing decisions. The next step is to translate them into production rules (i.e., functions), and to merge equivalent rules to actually save space.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 

On the examples above, we find:
1 2 3 4 5 6 7 8 9 10 11 

We only have to pair production rules further until the sequence consists of a single rule (or literal).
1 2 3 4 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 

The “call graph” below shows there’s a lot of sharing for such a short string.
I started by writing about compiling program traces, but so far I’ve only been working with strings of characters. Let’s pretend the following PostScript file was written by someone who’s never heard of functions or loops.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 

Phew, programming is hard! That’s a lot of copypaste to produce a simple pattern.
Let’s see what our (simplified) implementation of Jez’s algorithm can do.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 

35 functions that each contain two literals or function calls. Not bad (: It’s an actual SMOP to convert this grammar into a PostScript program.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 

This second PostScript program comprises 211 tokens, rather than the original’s 156, but 105 of those are “{ } def” noise to define functions. Arguably, the number of meaningful tokens was reduced from 156 to slightly more than 100. Crucially, the output remains the same.
Stack programs seem particularly well suited to this rewriting approach: explicit operands (e.g., registers) introduce trivial discrepancies. In a VM, I would consider “compressing” the opcode stream separately from the operands. Compiling only the former to native code would already reduce interpretative overhead, and extracting function calls from the instruction stream should avoid catastrophic size explosions.
There are several obvious deficiencies in the direct LZ77guided
pairing. Just breaking Chomsky normalisation to inline singleuse
function would already help. It might also make sense to specialcase
repetitions as repeat n
loops, instead of hiding that in
\(\log n\) levels of recursion. In a real program, it would also be
useful to inline the bottommost rules so as not to drown in control
flow. The thing is, destroying structure in the name of performance
is well understood compiler tech; recovering it efficiently was the
hard part.
It’s already Google Summer of Code season; like spring, it’s slightly earlier this year. For multiple reasons – including the accelerated schedule and my transition to having a Real Job – I did not manage to guide any of the students who sent in applications. After an initial ranking, we’re left with a fair number of decent proposals for a Summer of SBCL. They’re of comparable quality, and all leave me a bit uneasy. In this post, I’ll try to explain what I look for in GSoC proposals… or, really, in action plans in general.
I care about the quality of GSoC plans for two reasons. First, because it reassures me that the applicant understands the difficulties they’ll have to overcome to complete their project. Second, because, like TDD, a good plan makes it easier to feel when we’re making concrete progress. To me, these two factors are necessary conditions for a good success rate.
We’ll probably allow additional time to update the proposals we have already received, so hopefully this post can be immediately useful. I have a single vote, and other mentors no doubt have different criteria. I only hope that this post can give ideas to applicants looking to review their proposal.
This is a common objection to any form of planning for software projects. I don’t understand where it comes from.
Software development should be easier to plan than most endeavours: we have to deal with relatively few people, and, when we do, it’s rarely in an adversarial setting… and I’m willing to bet most of us can better forecast the behaviour of programs than that of people. There’s currently an electoral campaign in Québec, and I’m certain that, before the date was even announced, every single party had a precise plan for each day, at the riding, regional, and provincial levels. Politicians can do it; why can’t software developers?
Of course, there will be constant adjustments to the initial plan. The reason I think it’s important to spend time and thought on that plan isn’t to stick to it, but primarily because it forces us to systematically consider all sorts of potential issues, challenges, failure modes, etc. This exercise becomes useful when problems crop up: rather than making decisions in the heat of the moment, we can rely on our past (well rested and composed) self’s research, notes, and ideas.
I also find plans tend to become much more handwavy as we get farther in the future. That’s a normal tendency. I think we’re usually pretty good at shortterm planning; most people can cook a meal without sitting down to figure out how they’ll cause awesome chemical reactions to occur without burning their house down. It’s a different matter when it comes to monthlong camping trips or feeding a thousand people.
The difference is one of scale, and it’s exactly when we’re out of our comfort zone that mindful planning becomes useful. I expect an action plan to be as thorough in its last week as in its first. Most of it will be fiction, but, as I wrote above, the thinking involved in writing that fiction is what I’m after.
There are tons of ways to describe a plan. I like this one, but the main thing is what I have in mind when I read each section. For instance, last year’s proposal for register allocation via graph colouring (PDF) had:
It’s a different structure than what I suggest below, but both mostly present the same information. My wish is to feel confident that the student has done the work to have that information.
First, a description of the goal. SBCL offers prefab proposals, and it’s tempting to just copypaste. However, I feel much more confident in the intrinsic motivation of a student who’s made the project their own. Many of our suggestions are skeletons for projects; it should be easy to come up with ways to extend and better specify them according to the student’s interests. Others are already fleshed out; those are a bit harder, but even small variations on the initial theme are encouraging. In all cases, if I don’t feel like the student owns the project, I probably won’t rank the proposal highly, and certainly won’t be inclined to mentor them. The problem statement is a simple way to demonstrate one’s understanding of – and interest for – the project.
Second, a review of the current environment. What currently exists, in SBCL or elsewhere? Can we exploit interesting research or code? Has there been previous efforts, and how/why did they not pan out? Perhaps there’s a relevant postmortem from a similar project that will help understand the problem domain. What potential difficulties do we know of? What don’t we know (known unknowns)? How bad do we estimate unknown unknowns to be, and why? What persons or resources could be helpful? To a certain extent, this is simply demonstrating that one’s done their due diligence. However, it also tells other people (e.g., potential mentors) how they can best work with the applicant. This section should be a good resource to reread when bad things happen.
Third, the suggested course of action at a high level. Not only what would happen in a perfect execution, but, ideally, ways to address (ahead of time or when/if they occur) some of the difficulties and unknowns listed in the previous section.
Fourth, success indicators, a translation of the initial goal description into a few checkboxes. Ideally, the criteria are falsifiable… much like an experiment, with the null hypothesis being project failure. I find that checking even a few of these boxes gives me a good sense of closure on a project.
Finally the calendar, to show how one might execute the third section and satisfy the success indicators in the fourth section. The first few weeks are usually easy to envision, but proposals tend to become fuzzier with time. I’m more confident in proposals that only use one or two week periods, with actions and a few testable for each period. Shorter periods make fuzziness more obvious, but, more importantly, they combine with the milestones to let us tell that we’re (or aren’t) getting traction.
In the unfortunate case that we’re somehow not making the expected progress, frequent milestones means we can tell more quickly that something is off. We only have to make up for a one week setback rather than for a month, i.e., one quarter of the GSoC program. The odds of outright project failure are lower, and we don’t risk feeling (as) down for wasting a month of work.
One thing that I find particularly scary in project calendars are “Magic Happens” steps. This seems more common with longer planning periods. Many proposals are quite detailed with respect to preliminary work and research, and to final integration and documentation. The issue is that most of the work is stashed in a 35 week interval during which the actual project becomes completed. It’s not clear what the student will do to get there (or even who will do the work ;). Planning each week individually makes such steps obvious.
I know that it’s hard to determine how we’ll achieve a goal when we’ve never done something similar before. But we can pretend. I want to feel that an applicant has a clear idea how to execute the task they propose, if only in an artificial perfect world. Otherwise, why should I believe they can do so in reality?
That’s it. I’m told that writers should show rather than tell. I feel the same about plans. I like to be shown how goals could be achieved in ideal and lessthanideal worlds rather than told when steps will be completed. I think the most important point is that I don’t appreciate detailed plans because we should (or can) stick to them. Rather, I believe that the exercise of coming up with such a plan is an efficient way to ensure we can fruitfully respond to what’ll actually happen when we execute it.
The regalloc proposal was one of the longest ones last year, but I felt confident that the student had a good idea of what challenges made the task nontrivial. In the end, we followed only a small part of the plan: it’s hard to come up with a suite of bad regalloc examples, visualisation turned out not to scale to real code, and there was too little time to work on live range splitting. However, every time there was a change, Alexandra had no problem determining how to adjust the schedule and figuring out the next step. On my end, I could easily follow the modified schedule and see that we were making progress.
This planning exercise is not easy, and it’s certainly not quick. (FWIW, I find it useful to get someone else, even without relevant experience, to look at the plan and ask questions.) Action plans have helped me countless times: when I needed to react to problems; when I felt like I was running in circles but could tell for sure that I was making progress; when a project forcibly came to an end but I still had a sense of closure… Hopefully, they can do the same for SBCL GSoC students.
]]>NEXT
sequence (used to) encode
an effective address with an index register but no base. The mistake
doesn’t affect the meaning of the instruction, but forces a wasteful
encoding. The difference in machine code are as follows.
Before (14 bytes):
1 2 3 4 

Now (9 bytes):
1 2 3 4 

I fixed the definition of NEXT
, but not the disassembly snippets
below; they still show the old machine code.
Earlier this week, I took another look at the
F18. As usual with Chuck Moore’s
work, it’s hard to tell the difference between insanity and mere
brilliance ;) One thing that struck me is how small the stack is: 10
slots, with no fancy overflow/underflow trap. The rationale is that,
if you need more slots, you’re doing it wrong, and that silent
overflow is useful when you know what you’re doing. That certainly
jibes with my experience on the HP41C and with x87. It also reminds
me of a
post of djb’s decrying our misuse of x87’s rotating stack:
his thesis was that, with careful scheduling, a “free” FXCH
makes
the stack equivalent – if not superior – to registers. The post
ends with a (nonpipelined) loop that wastes no cycle on shuffling
data, thanks to the x87’s implicit stack rotation.
That lead me to wonder what implementation techniques become available for stackbased VMs that restrict their stack to, e.g., 8 slots. Obviously, it would be ideal to keep everything in registers. However, if we do that naïvely, push and pop become a lot more complicated; there’s a reason why Forth engines usually cache only the top 12 elements of the stack.
I decided to mimic the x87 and the F18 (EDIT: modulo the latter’s two TOS cache registers): pushing/popping doesn’t cause any data movement. Instead, like the drawing below shows, they decrement/increment a modular counter that points to the top of the stack (TOS). That would still be slow in software (most ISAs can’t index registers). The key is that the counter can’t take too many values: only 8 values if there are 8 slots in the stack. Stack VMs already duplicate primops for performance reasons (e.g., to help the BTB by spreading out execution of the same primitive between multiple addresses), so it seems reasonable to specialise primitives for all 8 values the stack counter can take.
In a regular direct threaded VM, most primops would end with a code
sequence that jumps to the next one (NEXT
), something like
add rsi, 8 ; increment virtual IP before jumping
jmp [rsi8] ; jump to the address RSI previously pointed to
where rsi
is the virtual instruction pointer, and VM instructions
are simply pointers to the machine code for the relevant primitive.
I’ll make two changes to this sequence. I don’t like hardcoding addresses in bytecode, and 64 bits per virtual instruction is overly wasteful. Instead, I’ll encode offsets from the primop code block:
mov eax, [rsi]
add rsi, 4
add rax, rdi
jmp rax
where rdi
is the base address for primops.
I also need to dispatch based on the new value of the implicit stack
counter. I decided to make the dispatch as easy as possible by
storing the variants of each primop at regular intervals (e.g. one
page). I rounded that up to 64 * 67 = 4288
bytes to minimise
aliasing accidents. NEXT
becomes something like
mov eax, [rsi]
add rsi, 4
lea rax, [rax + rdi + variant_offset]
jmp rax
The trick is that variant_offset = 4288 * stack_counter
, and the
stack counter is (usually) known when the primitive is compiled. If
the stack is left as is, so is the counter; pushing a value decrements
the counter and popping one increments it.
That seems reasonable enough. Let’s see if we can make it work.
I want to explore a problem for which I’ll emit a lot of repetitive machine code. SLIME’s REPL and SBCL’s assembler are perfect for the task! (I hope it’s clear that I’m using unsupported internals; if it breaks, you keep the pieces.)
The basic design of the VM is:
r8
r15
: stack slots (32 bits);rsi
: base address for machine code primitives;rdi
: virtual instruction pointer (points to the next instruction);rax
,rbx
,rcx
,rdx
: scratch registers;rsp
: (virtual) return stack pointer.1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 

The idea is that we’ll define functions to emit assembly code for each
primitive; these functions will be implicitly parameterised on
*stackpointer*
thanks to @
. We can then call them as many times
as needed to cover all values of *stackpointer*
. The only
complication is that code sequences will differ in length, so we must
insert padding to keep everything in sync. That’s what emitcode
does:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 

This function is used by emitallcode
to emit the machine code for
a bunch of primitives, while tracking the start offset for each
primitive.
1 2 3 4 5 6 7 8 9 10 

Now, the pièce de résistance:
1 2 3 4 5 6 7 8 9 10 11 12 13 

Let’s add a few simple primitives.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 

The code for swap
lives between bytes 0 and 32. Let’s take a look
at the version for *stackpointer* = 0
and *stackpointer* = 1
.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 

dup
is at 3264, and sub
at 128152:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 

These are relatively tight. I certainly like how little data
shuffling there is; the NEXT
sequence is a bit hairy, but the
indirect branch is likely its weakest (and least avoidable) point.
A VM without control flow isn’t even a toy. First, unconditional
relative jumps. These can be encoded as [jmp] [offset]
, with the 32
bit offset relative to the end of offset
. We just overwrite
*virtualip*
with the new address.
1 2 3 4 5 

Call and return are at the heart of Forthlike engines. ret
is
easy: just pop from the control stack into *virtualip*
.
1 2 3 

Call is a bit more complicated. It’s like jmp
, but pushes the
address of the next instruction to the control stack:
1 2 3 4 5 6 

Let’s look at the resulting machine code.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 

We finally almost have enough for interesting demos. The only
important thing that’s still missing is calls from CL into the VM.
I’ll assume that the caller takes care of saving any important
register, and that the primop (rsi
) and virtual IP (rdi
) registers
are setup correctly. The stack will be filled on entry, by copying
values from the buffer rax
points to, and written back on exit.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 

The CLside interlocutor of enter
follows, as a VOP:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 

The only thing missing is to store the machine code for our primop in a range of memory that’s executable.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 

1 2 3 

Let’s execute add sub
(and finish with leave
).
1 2 3 4 5 6 7 8 9 

And, indeed, 10  (3 + 2) = 5
.
We should also test function calls
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 

Instead of executing add
directly, this bytecode sequence calls to
whatever is 8 bytes (2 dwords) after the call instruction; in our
case, add ret
.
Writing bytecode by hand is annoying. This tiny functions takes care of the stupid stuff.
1 2 3 4 5 6 7 

We can now write
1 2 3 4 5 6 7 

We can now either write (basically) straightline code or infinite
loops. We need conditionals. Their implementation is much like
jmp
, with a tiny twist. Let’s start with jump if (top of stack is)
nonzero and jump if zero.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 

It’s hard to write programs without immediate values. Earlier control
flow primitives already encode immediate data in the virtual
instruction stream. We’ll do the same for lit
, inc
, and dec
:
1 2 3 4 5 6 7 8 9 10 11 12 

Finally, we have enough for a decentlooking (if useless) loop. First, update the primop code page:
1 2 3 4 5 6 7 8 

1 2 3 4 5 6 7 8 9 10 

One million iterations of this stupid loop that only decrements a counter took 15M cycles. 15 cycles/iteration really isn’t that bad… especially considering that it executes an indirect jump after loading 1, after subtracting, and after comparing with 0.
We can do better by fusing lit sub
into dec
:
1 2 3 4 5 6 7 8 9 10 

Decrementing a counter and jumping if non zero is a common operation
(old x86 even implemented that in hardware, with loop
). Let’s add
decrement and jump if nonzero (djn
) to the VM:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 

1 2 3 4 5 6 7 8 9 10 

That’s better… But I’m really not convinced by the conditional move.
The branch will usually be predictable, so it makes sense to expose
that to the hardware and duplicate the NEXT
sequence.
1 2 3 4 5 6 7 8 9 10 

The resulting code isn’t too large, and the two indirect jumps are 16 bytes apart.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 

This alternative implementation does work better on our stupid loop.
1 2 3 4 5 6 7 8 9 10 

Let’s see how that compares to straight assembly code.
1 2 3 4 5 6 

1 2 3 4 5 6 7 8 9 10 

My slow macbook air gets 1 iteration/cycle on a loop that’s 100%
control overhead. With djn2
, a good implementation of a reasonable
specialised operator, the loop is about 6x as slow as native code. A
worse implementation of djn
is still only 8x as slow as pure native
code, and horribly unspecialised bytecode is 1115x as slow as native
code.
Specialising primops to a virtual stack pointer is feasible in
practice, when the stack is restricted to a small size. It also seems
to have a reasonable runtime overhead for threaded interpreters. I’m
not actually interested in straight stack languages; however, I
believe that a fixed stack VM makes a nice runtime IR, when coupled
with a mechanism for local variables. We’ll see if I find time to
translate a high level language into superoperators for such a VM.
Fused operators would reduce the importance of NEXT
; in constrast,
simpler function calls (because there’s less shuffling of items to
stack them up in the right position) would remain as useful.
SBCL has definitely proven itself to be a good companion to explore the generation of domainspecific machine code. I don’t know of any other language implementation with that kind of support for interactive programming and machine code generation (and inspection). FWIW, I believe LuaJIT + dynasm will soon be comparable.
Steel Bank Common Lisp: because sometimes C abstracts away too much ;)
]]>EDIT: I was asked for feedback on some of these books, so I added some inline.
Brassard and Bratley, Fundamentals of Algorithmics is my goto reference book for basic algorithmic stuff. I love the presentation, the authors are careful with Landau notation, and they specify in what circumstances handwaving tricks apply. The quality of Brassard’s lectures is probably a factor as well: the book reminds me of one of the best courses in Université de Montréal’s undergraduate curriculum.
CLRS for baseline implementations of classic data structures and algorithms.
Knuth 14A for its attention to minutiae. Also very useful as a reference to point to: if something is so old or well known that I can’t think of a canonical reference, it’s probably in Knuth.
Sedgewick, Algorithms (First edition) has an interesting section on geometric algorithms. I also like the way many older data structure and algorithm books clearly take lowlevel coding consideration into account while discussing asymptotics. I find nearly everything by Tarjan very good in that regard; his papers are among the few that describe (and painstakingly prove) algorithmic improvements (e.g., the inverseAckermann Union Find) only to reveal that a simpler implementation with a worse bound consistently performs better in practice.
Bertot and Castéran, Interactive Theorem Proving and Program Development is heavy and I’m still going through the book.
Pierce, Types and Programming Languages should be mandatory skimming before entering a debate on “static” and “dynamic” types.
Gries, The Science of Programming changed the way I write programs. It’s not that I go full formal on my code, but that I try and structure it so it’s easier to reason about. I was just reminded of Dijkstra’s Discipline of Programming. The first time I read both books, I did so concurrently. It was a long time ago, and I was less mature mathematically, but I definitely remember finding Gries clearer and more directly applicable, while Dijkstra did a better job of making me think hard about how to best express the meaning of code. Even now, I’d say they’re still complementary.
Jackson, Software Abstraction describes an interesting approach to interpolate between testing and formal methods. This book changed the way I write tests. It’s also based on a nice application of combinatorial optimisation ;)
Oram and Wilson, Beautiful Code feels uneven to me. Some chapters cover inspiring gems, but many left no mark. A friend suggests Bentley’s Programming Pearls or Neuman’s Computer Related Risks.
Golub and Van Loan, Matrix Computations is obviously a good reference for the implementation of dense linear algebra. However, I like this work even more for its clear and precise discussion of performance and numerical precision matters. One can’t hide behind bigOh when discussing linear algebra. Yet, the book is as enlightening and general as classic tomes on asymptotic analysis of algorithms and data structures. Better: that is true for both the implementation and the use of the algorithms they describe. If all software engineering were as well understood as numerical linear algebra, our discipline would be in a good place.
McGeoch, A Guide to Experimental Algorithmics was my bedside companion for the holiday season. On several occasions, I told friends that this is the book I wish I had read ten years ago and that I dreamt of writing the past couple year. It has good rules of thumb and suggestions for everything from designing proper experiments to deriving robust performance models.
Warren, Hacker’s Delight is really a fundamental book for me. It’s not that bit twiddling tricks are that important, but that I know I can count on Warren’s nice exposition if I need to direct someone to a reference. Now nicely complemented by Knuth 4A.
Ahuja, Magnanti and Orlin, Network Flows because it describes all the classics. In particular, if an optimisation problem is in P, it’s probably in there.
Bazaraa, Sherali and Shetty, Nonlinear Programming mostly for a couple convergence proofs and as a reference.
Chvátal, Linear Programming for its unique (? rare, definitely) and insightful presentation of linear optimisation and of classic duality theorems. I like the sections on Network flows and on Basis factorisation for similar reasons.
HiriartUrruty and Lemaréchal, Convex Analysis and Minimization Algorithms because decomposition methods depend on convex nondifferentiable optimisation. I rarely care about general convex optimisation, but it’s an unavoidable building block for Lagrangian decomposition methods.
Bollobás, Modern Graph Theory mostly as a reference for classic results. I remember having a lot of fun with the section on matchings.
Lawler, Combinatorial Optimization is short but dense. I don’t grok matroids, and someone I respect gave me this book and told me it would all make sense once I’d gone through it.
Nemhauser and Wolsey, Integer and Combinatorial Optimization for the dark times when I need to directly think in terms of polytopes and facets, or when I try to generate fancy cuts.
Schrijver, Combinatorial Optimization mostly as a reference for theoretical results. It’s exhaustive, so I can use it like the TAoCP of optimisation and as a springboard to other work.
Wolsey, Integer programming when Nemhauser is hard to follow.
These books help me approach problems that don’t fit neatly in any box.
Bertsekas, Dynamic Programming and Optimal Control and Law, Simulation, Modeling and Analysis fill holes left by my focusing on deterministic singleperiod problems. I use Bertsekas and Law as refreshers when I remember that I’m forgetting something. I can then decide whether to go more deeply in that direction.
Polya, How to Solve It. For the really dark times when it feels like I’m asked to do the impossible. Perhaps it’s mostly useful because it distracts me from the immediate task for a couple minutes. Either way, I like to reread it when I get stuck.
Tufte, The Visual Display of Quantitative Information, along with Hadley Wickham’s ggplot2, taught me to think deliberately about presenting results. Our visual cortex is pretty awesome but bad visualisation wastes our brain on distractions.
]]>That’s not too surprising. Given any change, it’s usually easy to design a situation that makes it shine. I find it just as important to ask the opposite question: what’s the worst slowdown I can get from switching from regular 4KB pages to huge 1GB pages? This way, I can make more robust decisions, by taking into account both a good (if not best) and a worst case. By the end of this post, I’ll multiply runtime by 250% for the same operation, simply by switching from 4KB to 1GB pages… with a setup is so contrived that it seems unlikely to occur by accident.
But first, a quick overview of the benefits and downsides of huge pages.
In short, manufacturers are adding huge pages because translating virtual addresses to physical ones is slow.
On x8664, the mapping from virtual to physical pages is represented with a trie; each level dispatches on 9 bits (i.e., each node has 512 children), and leaves correspond to 4KB pages. There are 4 levels from the root node to the leaves, which covers the (currently) standard 48bit virtual address space.
The address translation table is (mostly) stored in normal memory and is too large to fit in cache. Thus, translating a virtual address requires four reads, any of which can hit uncached memory.
This is where the translation lookaside buffer (TLB) comes in. On my E54617, each core has 64 entries for regular 4KB pages in its L1 dTLB, and 512 (shared) entries in its L2 TLB. I don’t know if the TLBs are exclusive or inclusive, but even if they’re exclusive, that’s only enough for 2.25MB worth of data. Assuming that the working set is completely contiguous (i.e., the best case), the TLB space for 4KB pages is less than the total cache/core (2.5MB L3/core + 256 KB L2 + 32 KB L1D).
2MB and 1GB “huge pages” address this imbalance: nine 2MB pages suffice to cover more address space than all the caches in a 6core E54617. However, there are only 32 L1dTLB entries for 2MB pages – and 4 entries for 1GB pages – on my E5.
In addition to covering more address space in the TLB, huge pages offer secondary benefits: there are fewer page table entries, and the trie is shallower. Fewer page table entries means that a larger fraction of memory can be used by data, rather than metadata, and that the page table walk is more likely to stay in cache. Moreover, larger pages are closer to the trie’s root: while the processor traverses 4 levels to get to a 4KB page, it only traverses 3 levels to reach a 2MB page and 2 levels to for a 1GB page. These two effects compound to make TLB misses quicker to handle.
This idea that one must cover as much address space as possible with the TLB is most relevant in two settings: trivially, if the working set is completely covered by the (L1d)TLB, or, more interestingly, when the access patterns show a lot of locality. Examples of the latter case are BLAS routines: with appropriate blocking, they can usually access each page once or twice, but read almost every byte in a page before switching to the next.
The opposite, worst, case would be something like lookups in a large (too big for the TLB) hash table: we choose a virtual address pseudorandomly and painstakingly translate it to a physical address, only to read a handful of words from that page. In that situation, we want as many TLB entries as possible, regardless of the address space each one covers… and that’s where 4KB pages ought to shine. Taking into account both the L1DTLB and the L2TLB, each core has 576 TLB entries for 4KB (data) pages, versus 64x2MB and 4x1GB. Now, I don’t know if the TLBs are exclusive or not, so I’ll assume the worst case and work with 512*4KB entries.
The thing is, 512 TLB entries aren’t that many. If, by chance, our hash table lookups keep hitting the same 512 pages, a contrived microbenchmark will show that 4KB pages are a big win (but really, a software cache might be a better way to exploit the situation). It’s more likely that it’ll be hard to avoid TLB misses regardless of page size, and huge pages then become useful because each TLB miss is handled more quickly. Regardless, I’ll try to approximate this worstcase behaviour to see how bad things can get.
Ideally, I would want to read from 512 (or a bit fewer) locations 1GB apart, but I don’t have that much RAM. In the interest of realism, I decided to “only” allocate 24GB.
My first microbenchmark follows: I allocate 24GB, divide that space in 512 chunks, and read the first word of each chunk in a loop. At first, I didn’t even randomise the traversal order (so as to abuse LRU), but there seems to be some prefetching for 1GB pages.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 

The results: 1.13e10 cycles for 4KB pages, 1.60e10 for 2MB and 1.56e10 for 1GB. That’s only 40% more cycles… it’s bad, but not horrible. The reason is that the data vector spans only 24x1GB, so 1/6th of the random lookups will hit the 1GB TLB. Instead, let’s try and load from each of these 24 pages, in random order. 24 pages will easily fit in the L1DTLB for 4KB pages, but not in the 4 slots for 1GB pages.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 

The results are even worse (better)! 4.82e9 cycles for 4KB pages, versus 3.96e9 and 2.84e9 for 2MB and 1GB pages!
The problem is aliasing. The TLB on my E5 has limited wayness (4way, I believe), so, by aligning everything to a 1GB boundary, the effective size of the 4KB page TLB is 4 entries (same for 2MB). In a way, this highlights the effect of page size when TLBs are useless (random accesses to dozens or hundreds of GBs): 2MB pages shave 18% off the runtime, and 1GB pages another 30%, for a total of 60% as much time to handle a 1GB TLB miss versus 4KB.
Let’s try again, with indices[i] = (x*stride + (x*4096)%(1ul<<30)) % (24ul << 30);
on line 19. I now find 1.14e9, 6.18e9 and 2.65e9 cycles. Much better!
For fun, I also tried to offset by 2MB increments, with indices[i] =
(x*stride + (x<<21)%(1ul<<30)) % (24ul << 30);
, and found 2.76e9,
1.30e9, and 2.85e9 cycles.
Finally, I tried
size_t offset = 4096 + (1ul<<21);
indices[i] = (x*stride + (x*offset)%(1ul<<30)) % (24ul << 30);
so that neither 4KB nor 2MB pages would alias, and got 1.13e9, 1.08e9 and 2.65e9 cycles. That’s 234% as much time for 1GB pages as for 4KB.
We’re close: this setup is such that 1GB pages cause a lot of TLB
misses, but neither 4KB nor 2MB pages do. However, perf stat
shows
there’s a lot of cache misses, and that probably reduces the
difference between 4KB and 1GB pages.
Let’s try one last thing, with size_t offset = 4096 + (1ul<<21) +
64;
(to avoid aliasing at the data cache level), and a smaller index
vector that fits in cache.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 

We get 1.06e9, 9.94e8, and 2.62e9 cycles, i.e., 250% as much time with 1GB pages than 4KB ones.
We can easily turn this around: we just have to loop over more than 4 4KBaligned locations in a 4GB space. For example, with
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 

With the above, I find 7.55e9 cycles for 4KB pages, 3.35e9 for 2MB and
1.09e9 for 1GB pages. Here, 4KB pages are almost 7x as slow as 1GB
pages. If I instead let size_t offset = 4096 + 64;
(to avoid
aliasing in the 4KB TLB), I get 4.72e9 cycles for 4KB pages, so still
433% as much time.
We can also play the same trick over 32*2MB = 64MB. On my E5, I find 3.23e9 cycles for 4KB pages, versus 1.09e9 for 2MB and 1GB pages. Eliminating pagelevel aliasing only brings the 4KB case down to 3.02e9 cycles, and doesn’t affect the other two cases.
The following table summarises the runtimes of all the variations above with 2MB and 1GB pages (as a fraction of the number of cycles for 4KB pages).
2MB/4KB  1GB/4KB 
1.42  1.38 
0.82  0.59 
5.42  2.32 
0.47  1.03 
0.96  2.34 
0.94  2.47 
0.44  0.14 
0.72  0.23 
0.34  0.34 
0.36  0.36 
Overall, I think that I wouldn’t automatically switch to 2MB pages, but that 1GB pages are a solid choice for machines that basically run a single process at a time. When the data fits in 4GB, 1GB pages completely eliminate TLB misses. When the data is even larger, 2MB and 1GB pages make page table walks quicker (by 18% and 40%, respectively). It takes a very contrived situation – in which a program keeps hitting fewer than 512 4KBpages that are spread out across multiple GBs – for smaller pages to be preferable. The worst I managed was 250% as much time for 1GB pages vs 4KB; in the other direction, I achieved 693% as much time for 4KB pages versus 1GB, and 433% with a realistic situation (e.g., repeated lookups in a 4GB hash table). Plus, there’s another interesting benefits from larger pages that did not show up in this post: we get more control over aliasing in data caches.
With multiple processes in play, there are fragmentation issues, and things aren’t as clearcut… especially given that 1GB pages must currently be allocated at boottime, on Linux.
I’m also still unsure how 1GB pages interact with NUMA. I’m particularly worried about interleaving: interleaving at a 1GB granularity seems unlikely to smooth out the ratio of local:remote accesses as much as doing it at a 4KB granularity.
]]>One rainy day at SophiaAntipolis, a researcher told me a story about an optimisation software company. They’d looked into adding explicit support for DantzigWolfe decomposition, but market research showed there were a few hundred potential users at most: too few people are familiar enough with decomposition methods to implement new ones. The problem is, the strength of these methods is that they are guided by domain expertise; useful decompositions are custommade for the target problem, and thus obviously novel. What this story tells me is that we, researchers in decomposition, are really bad communicators.
It’s not like these methods are new or ill understood. The theory on how to break up optimisation problems into smaller ones (for Benders and DantzigWolfe decompositions, at least) was fully developed in the fifties and sixties. It’s the same theory that backs stateoftheart solvers for planetscale planning problems today. Surprisingly, it seems that a single undergraduate or graduate course in optimisation methods does not suffice for expertise to trickle down to the wider computer science and management communities.
The divide seems particularly wide with the constraint programming (CP) community (despite the work of Hooker, Chvátal and Achterberg [that I know of]). Perhaps this explains why the VPRI’s efforts on cooperating constraint solvers are not informed by decomposition methods at all. Then again, it’s not just decomposition people who exchange too little with the CP community, but all of mathematical programming. I only single out decomposition because our tools are general enough to work with CP. We write and sometimes think in mixed integer linear programming (MILP) terms, but mostly for theoretical reasons; implementations may (useful ones usually do) include specialised solvers for combinatorial problems that only happen to be presented as MILP.
That’s why I spent three months in Nice: I explored how to crosspollinate between CP and decomposition by developing a TSP solver (it’s a simple problem and Concorde is an impressive baseline). I have a lot to say about that, certainly more than can fit in this buffer. In this first post of many, I hope to communicate an intuition on why DantzigWolfe and Benders decomposition work, to relate these methods to work from the SAT/CSP/CP community, and to show how to decompose finite domain problems.
I noted earlier that we work in MILP terms because it gives a sound theoretical basis for decomposition methods. That basis is duality, especially linear programming duality.
In optimisation, duality refers to the relationship between two problems a primal problem (P) and its dual (D). I’ll assume that we’re interested in solving (P), a minimisation problem (minimise badness, e.g., costs); (D) is then a maximisation problem (maximise goodness, e.g., profits). These problems are such that the minimal (least) value for (P) is always higher or equal to the maximal (greatest) value for (D).
So, even if (P) and (D) are so hard that we can only solve them heuristically, a pair of feasible solutions always brackets the optimal value for the primal problem (P). \(\tilde{x}\), the feasible solution for (P) might not be optimal, but it’s still a solution. In turn, \(\tilde{y}\), the dual feasible solution, gives us a lower bound on the optimal value: we don’t know the optimal value, but we know it can’t be lower than that of any solution for (D). We can then compare the value of \(\tilde{x}\) to this lower bound. If they’re close enough, we stop looking; otherwise we use both primal and dual information to refine our solutions and tighten the bracket (close the gap).
The primal solution is useful: it’s a feasible plan. However, the dual solution is often where we gain insights into the problem. For example, if I wish to minimise my carbon footprint, it’s clear that driving less will help me. However, what is the impact of allowing myself more commuting time? and how does that compare with simply shortening the distance between my home and work? That’s what dual solutions tell us: they estimate the impact of (the righthand side of) each constraint on the objective value.
Linear programs are a special case: they have a “strong” duality, in that we can easily write down an exact dual program for any primal linear program. The dual is exact because its maximal value matches the minimal value of the primal; there is no duality gap. This dual is another linear program that’s exactly as large as the primal, and the dual of the dual is… the initial primal program.
Decomposition exploits this relationship to extract information about simpler subproblems, and translate that into a refined approximation of the complete problem. It would be disappointing if that only worked on linear programs, and this is where Lagrangian duality comes in.
You may remember Lagrange multipliers from calculus courses. That’s exactly what I’m talking about. For reasons both practical and theoretical, I’ll assume that we have a (sub)problem that’s simpler once a few linear (affine) constraints are removed, i.e. something like $$\min\sb{x} z(x)$$ subject to $$Ax \leq b,$$ $$x\in X.$$
This form is general; the objective function \(z\) need only be convex and the feasible set \(X\) is arbitrary.
Lagrangian duality says that we can relax this problem into $$\min\sb{x\in X} z(x) + \lambda(Axb),$$ where \(\lambda\) is any vector of nonnegative multipliers for each row (constraint) in \(A\). I find such subproblems interesting because minimising over \(X\) can be easy with a specialised procedure, but hard as a linear program. For example, \(X\) could represent spanning trees for a graph. That’s how the onetree relaxation for the TSP works. Once you remove an arbitrary “onenode” from a Hamiltonian cycle, the remaining Hamiltonian path is a special kind of spanning tree: the degree of all but two nodes is two, and the remaining two nodes are endpoints (would be connected to the onenode). The onetree relaxation dualises (relaxes into the objective function) the constraint that the degree of each node be at most two, and solves the remaining minimum spanning tree problem for many values of \(\lambda\).
The onetree subproblem, and Lagrangian subproblems in general, is a relaxation because any optimal solution to this problem is a lower bound for the optimal value of the initial problem.
To show this, we first notice that the feasible set of the subproblem is a superset of that of the initial problem: we only removed a set of linear constraints. Moreover, for any feasible solution \(\bar{x}\) for the initial problem, $$z(\bar{x}) + \lambda(A\bar{x}b) \leq z(\bar{x}):$$ \(A\bar{x} \leq b\), i.e., \(A\bar{x}b \leq 0\), and \(\lambda\geq 0\). In particular, that is true if \(\bar{x}\) is an optimal solution to the initial problem. Obviously, an optimal solution to the relaxed subproblem must be at least as good as \(\bar{x}\), and the optimal value of the subproblem is a lower bound for that of the initial problem.
This relation between optimal solutions for the initial problem and for the relaxed subproblem is true given any \(\lambda\geq 0\). However, it’s often useless: bad multipliers (e.g., \(\lambda = 0\)) lead to trivially weak bounds. The Lagrangian dual problem with respect to \(Ax \leq b\) is to maximise this lower bound: $$\max\sb{\lambda\geq 0} \min\sb{x\in X} F(\lambda, x),$$ where $$F(\lambda, x) = z(x) + \lambda(Axb) = z(x) + \lambda Ax  \lambda b.$$
Again, we don’t have to maximise this dual exactly. As long as the minisation subproblem is solved to optimality, we have a set of lower bounds; we only have to consider the highest lower bound. Maximising this Lagrangian dual, given an oracle that solves the linearly penalised subproblem, is an interesting problem in itself that I won’t address in this post. The important bit is that it’s doable, if only approximately, even with very little storage and computing power.
There’s some nifty theoretical results that tell us that, when \(z(x)\) is a linear function, maximising the Lagrangian dual function is equivalent to solving the linear programming dual of $$\min z(x)$$ subject to $$Ax \leq b,$$ $$x\in \textrm{conv}(X),$$ where \(\textrm{conv}(X)\) is the convex hull of \(X\).
This is a relaxation (the convex hull is superset of \(X\)), but a particularly tight one. For example, if \(X\) is a finite set, optimising a linear function over \(\textrm{conv}(X)\) is equivalent to doing so over \(X\) itself. Sadly, this doesn’t hold anymore once we add the linear constraint \(Ax \leq b\) after taking the convex hull, but still gives an idea of how strong the relaxation can be.
One key detail here is that the optimality of \(x\) depends on \(\lambda\), \(z\), \(X\) and \(A\), but not on \(b\): once we have a pair \((\lambda\sp{\star}, x\sp{\star})\) such that \(x\sp{\star}\) is minimal for \(F(\lambda\sp{\star},\cdot)\), their objective value is always a valid lower bound for the initial problem, regardless of the righthand side \(b\). Thus, it make sense to cache a bunch of multipliersolution pairs: whenever \(b\) changes, we only need to reevaluate their new objective value to obtain lower bounds for the modified problem (and interesting pairs can serve to warm start the search for better maximising multipliers).
Because I’m a heartless capitalist. (I find it interesting how economists sometimes seem overly eager to map results based on strong duality to the [not obviously convex] world.)
Seriously, because optimisation composes better than satisfaction. For example, imagine that we only had a routine to enumerate elements of \(X\). There’s not a lot we can do to optimise over \(X\) (when you seriously consider a British Museum search…), or even just to determine if the intersection of \(X\) and \(Ax \leq b\) is empty.
I’ve hinted as to how we can instead exploit an optimisation routine over \(X\) to optimise over an outer approximation of that intersection. It’s not exact, but it’s often close enough; when it isn’t, that approximation is still useful to guide branchandbound searches. A satisfaction problem is simply an optimisation problem with \(z(x) = 0\). Whenever we have a relaxed solution with a strictly positive objective value, we can prune a section of the search space; at some point we either find a feasible solution or prune away all the search space. Thus, even for pure satisfaction problems, the objective function is useful: it replaces a binary indicator (perhaps feasible, definitely infeasible) with a continuous one (higher values for less clearly feasible parts of the search space).
I said optimisation composes better because decomposition also works for the intersection of sets defined by oracles. For example, we can satisfy over the intersection of finite sets \(X\) and \(Y\) by relaxing the linear constraint in $$x = y,$$ $$x\in X, y\in Y.$$ The relaxed subproblem is then $$\min\sb{(x,y)\in X\times Y} \lambda(xy),$$ which we can separate (linear objectives are separable) into two optimisation subproblems: $$\min\sb{x\in X} \lambda x,$$ and $$\min\sb{y\in Y} \lambda y.$$ Even if the initial problem is one of satisfaction, Lagrangian decomposition considers each finite set in the intersection through independent minimisation: components communicate via their objective functions.
All this can also work with categorical data; we just have to do what no CP practicioner would ever do and map each choice to a binary (0/1) variable. When, in the constraint program, a variable \(x\) can take values \({a, b, c}\), we express that in integer programs with a triplet of variables:
and a constraint \(\sum\sb{i\in {a,b,c}} x\sb{i} = 1\).
We can perform this mapping around each subproblem (e.g., optimisation over \(X\)). On entry, the coefficient for \(x\sb{a}\) in the linear objective corresponds to the impact on the objective function of letting \(x=a\). On exit, \(x=c\) becomes \(x\sb{c}=1\), \(x\sb{a}=0\) and \(x\sb{b}=0\), for example.
Benders decomposition is the mathematical programming equivalent of clause learning in SAT solvers, or explanationbased search in CP.
Clause learning improves straight backtracking by extracting information from infeasible branches: what’s a small (minimal is NPhard) set of assignments (e.g. \(x=\mathtt{false}\) and \(y=\mathtt{false}\)) that causes this infeasibility? The problem is then updated to avoid this partial assignment.
Explanations in CP are similar: when a global constraint declares infeasibility, it can also provide an explanation, a small set of assignments that must be avoided. This, combined with explanations from other constraints, can trigger further propagation.
Benders decomposition generalises this feasibilityoriented view for the optimisation of problems with a hierarchical structure. I think the parallel has been known for a while, just not popularised. John Hooker proposed logicbased Benders to try and strengthen the link with logic programming, but I don’t know who (if anyone) else works in that direction.
The idea behind Benders decomposition is to partition the decision variables in two sets, strategic variables \(y\), and tactical ones \(x\). The decomposition attempts to determine good values for \(y\) without spending too much time on \(x\), which are left for the subproblem. This method is usually presented for mixed integer programs with continuous tactical (\(x\)) variables, but I’ll instead present the generalised form, which seems well suited to constraint programming.
We start with the initial integrated formulation $$\min\sb{(x,y)\in S} cx + fy,$$ which we reformulate as $$\min\sb{(x,y)} cx + fy,$$ $$y\in Y,$$ $$x\in X(y).$$
For example, this could be a network design problem: we wish to determine where to build links so that all (known) communication demands are satisfied at least total cost.
Benders decomposition then eliminates variables \(x\) from the reformulation. The latter becomes the master problem $$\min\sb{y,z} fy + z$$ subject to $$Ay = d,$$ $$By \leq z,$$ $$y\in Y.$$
The two linear constraints are generated dynamically, by solving subproblems over \(X\). The first corresponds directly to learned clauses or failure explanations: whenever an assignment for \(y\) happens to be infeasible, we add a linear constraint to avoid it (and similar assignments) in the future. The second enables optimisation over \(S\).
The idea is to solve a Lagrangian relaxation of the (slave) subproblem for a given assignment \(y\sp{i}\): $$\min\sb{x,y} cx$$ subject to $$y = y\sp{i},$$ $$x\in X(y).$$
We will relax the first equality constraint to obtain a new feasible set that is a relaxation (outer approximation) of \(S\): it only considers interactions within \(X\) and between \(X\) and \(Y\), but not within \(Y\) (those are handled in the master problem).
In my network design example, the unrelaxed subproblem would be a lot of independent shortest path problems (over the edges that are open in \(y\sp{i}\)).
We dualise the first constraint and obtain the Lagrangian dual $$\max\sb{\lambda} \min\sb{y\in\mathbb{R}\sp{m}, x\in X(y)} cx + \lambda(yy\sp{i})$$ or, equivalently, $$\max\sb{\lambda} \lambda y\sp{i} + \min\sb{y\in\mathbb{R}\sp{m}, x\in X(y)} cx + \lambda y.$$
The network design subproblem still reduces to a lot of independent shortest path problems, but (with appropriate duplication of decision variables) forbidden edges may be taken, at additional cost per flow.
In this example, the previous step is gratuitously convoluted: we can solve the unrelaxed subproblem and shortest paths can be seen as linear programs, so there are other ways to recover optimal multipliers directly. However, this is useful for more complicated subproblems, particularly when the \(x = y\sp{i}\) constraint is difficult to handle. Note that the slave can be reformulated (e.g., with multiple clones of \(y\sp{i}\)) to simplify the Lagrangian subproblem.
I pointed out earlier that, given \(\bar{\lambda}\sp{i}\) and an optimal solution \((\bar{x}\sp{i},\bar{y}\sp{i})\) for that vector of multipliers, we always have a valid lower bound, regardless of the righthand side, i.e., regardless of the assignment \(y\sp{i}\).
We found a first (hopefully feasible) solution by solving a restricted subproblem in the previous step. The multipliers for our dualised constraint \(y = y\sp{i}\) explain the value associated to \(y\sp{i}\), what part of \(y\sp{i}\) makes the solution to the restricted subproblem so bad (or good).
That’s how we dynamically constrain \(z\). Each time we solve the (updated) master problem, we get a tentative assignment for the strategic variables, and a lower bound for the initial problem. We can then solve for \(x\in X(y\sp{i})\) to find a heuristic solution (or generate a feasibility cut to tell the master why the assignment is infeasible); if the heuristic solution is close enough to our current lower bound, we can stop the search. Otherwise, we solve the fixed Lagrangian dual and generate a fresh optimality cut $$\bar{\lambda}\sp{i}y + c\sp{i} \leq z,$$ where \(c\sp{i} = c\bar{x}\sp{i} + \bar{\lambda}\sp{i}\bar{y}\sp{i}\): we update the master problem with information about why its last tentative assignment isn’t as good as it looked.
It can be harder to explain infeasibility than to solve the Lagrangian subproblem. In that case, it suffices to have an upper bound on the optimal value (e.g., 0 for pure feasibility problems): when the lower bound is greater than the upper bound, the problem is infeasible. Given that the master problem also tries to minimise its total cost, it will avoid solutions that are infeasible if only because their value is too high. The advantage of feasibility cuts is that, with some thinking, a single feasibility cut can forbid a lot more solutions than multiple optimality cuts.
The problem with Benders decomposition is that the master mixed integer program gains new constraints at each iteration. Eventually, its size becomes a liability.
There are a few folk workarounds. One is to not stop (and then restart) the branchandbound search as soon as it finds an optimal (for the current relaxed master problem) integer feasible assignment for \(y\): instead, add a new constraint to the formulation and resume the search (previously computed lower bounds remain valid when we add constraints). Another is to drop constraints that have been inactive for a long while (much like cache replacement strategies); they can always be added back if they ever would be active again.
Chvátal’s Resolution Search takes a different tack: it imposes a structure on the clauses it learns to make sure they can be represented compactly. The master problem is completely different and doesn’t depend on solving mixed integer programs anymore. It works well on problems where the difficulty lies in satisfying the constraints more than finding an optimal solution. A close friend of mine worked on extensions of the method, and it’s not yet clear to me that it’s widely applicable… but there’s hope.
Another issue is that the Benders lower bound is only exact if the slave subproblem is a linear program. In the general case, the master problem is still useful, as a lower bounding method and as a fancy heuristic to guide our search, but it may be necessary to also branch on \(x\) to close the gap.
I believe Lagrangian decomposition, a special case of Lagrangian relaxation, is a more promising approach to improve constraint programming with mathematical programming ideas: I think it works better with the split between constraint propagation and search. The decomposition scheme helps constraints communicate better than via only domain reductions: adjustments to the objective functions pass around partial information, e.g., “\(x\) is likely not equal to \(a\) in feasible/optimal solutions.” Caveat lector, after 7 years of hard work, I’m probably biased (Stockholm syndrome and all that ;).
I already presented the basic idea. We take an initial integrated problem $$\min\sb{x\in X\cap Y} cx$$ and reformulate it as $$\min\sb{(x,y)\in X\times Y} cx$$ subject to $$x = y.$$ Of course, this works for any finite number of sets in the intersection as well as for partial intersections, e.g., \(x\sb{0} = y\sb{0}\) and \(x\) and \(y\) otherwise independent.
Dualising the linear constraint leaves the independent subproblems $$\min\sb{x\in X} cx+\lambda x,$$ and $$\min\sb{y\in Y} \lambda y.$$
The two subproblems communicate indirectly, through adjustments to \(\lambda\)… and updating Lagrange multipliers is a fascinating topic in itself that I’ll leave for a later post (: The basic idea is that, if \(x\sb{i} = 1\) and \(y\sb{i} = 0\), we increase the associated multiplier \(\lambda\sb{i}\), and decrease it in the opposite situation: the new multipliers make the current incoherence less attractive. For a first implementation, subgradient methods (e.g., the volume algorithm) may be interesting because they’re so simple. In my experience, however, bundle methods work better for nontrivial subproblems.
When we have an upper bound on the optimal value (e.g., 0 for pure satisfaction problems), each new multipliers \(\lambda\) can also trigger more pruning: if forcing \(x = a\) means that the lower bound becomes higher than the known upper bound, we must have \(x \neq a\). The challenges becomes to find specialised ways to perform such pruning (based on reduced costs) efficiently, without evaluating each possibility.
Lagrangian decomposition also gives us a fractional solution to go with our lower bound. Thus, we may guide branching decisions with fractional variables, rather than only domains. Again, an optimisationoriented view augments the classic discrete indicators of CP (this variable may or may not take that value) with continuous ones (this assignment is unattractive, or that variable is split .5/.5 between these two values).
I believe decomposition methods are useful for constraint programming at two levels: to enable more communication between constraints, and then to guide branching choices and heuristics.
In classic constraint programming, constraint communicate by reducing the domain of decision variables. For example, given \(x \leq y\), any change to the upper bound for \(y\) (e.g., \(y\in [0, 5]\)) means that we can tighten the upper bound of \(x\): \(x \leq y \leq 5\). This can in turn trigger more domain reduction.
Some constraints also come in “weighted” version, e.g., \(x\sb{i}\) form a Hamiltonian cycle of weight (cost) at most 5, but these constraints still only communicate through the domain of shared decision variables.
With Lagrangian decomposition, we generalise domain reduction through optimisation and reduced cost oracles. For each constraint, we ask, in isolation,
Adding the objective values computed by the optimisation oracle gives us a lower bound. If the lower bound is greater than some upper bound, the problem (search node) is infeasible. Otherwise, we can use reduced costs to prune possibilities.
Let’s say the difference between our current lower and upper bounds is \(\Delta\). We prune possibilities by determining whether setting \(x=a\) increases the lower bound by more than \(\Delta\). We can do this independently for each constraint, or we can go for additive bounding. Additive bounding computes a (lower) estimate for the impact of each assignment on the objective function, and sums these reduced costs for all constraints. The result is a set of valid lower estimates for all constraints simultaneously. We can then scan these reduced costs and prune away all possibilities that lead to an increase that’s greater than \(\Delta\).
The optimisation oracle must be exact, but the reduced cost oracle can be trivially weak (in the extreme, reduced costs are always 0) without affecting correctness. Domain reduction propagators are simply a special case of reduced cost oracles: infeasible solutions have an arbitrarily bad objective value.
Some constraints also come with explanations: whenever they declare infeasibility, they can also generate a small assignment to avoid in the future. The mathematical programming equivalent is cut generation to eliminate fractional solutions. Some very well understood problems (constraints) come with specialised cuts that eliminate fractional solutions, which seems useful in hybrid solvers. For example, in my work on the TSP, I added the (redundant) constraint that Hamiltonian cycles must be 2matchings. Whenever the Lagrangian dual problem was solved fairly accurately, I stopped to compute a fractional solution (for a convexification of the discrete feasible set) that approximately met the current lower bound. I then used that fractional solution to generate new constraints that forbid fractional 2matchings, including the current fractional solution. These constraints increased the lower bound the next time I maximised the Lagrangian dual, and thus caused additional reduced cost pruning.
The same fractional solution I used to generate integrality cuts can also guide branching. There’s a lot of interesting literature on branching choices, in CP as well as in mathematical programming. Lagrangian decomposition, by computing fractional solutions and lower bounds, let us choose from both worlds. Again, the lower bound is useful even for pure feasibility problems: they are simply problems for which we have a trivial upper bound of 0, but relaxations may be lower. Only branching on fractional decision variables also let us cope better with weak domain propagators: the relaxation will naturally avoid assignments that are infeasible for individual constraints… we just can’t tell the difference between a choice that’s infeasible or simply uninteresting because of the objective function.
This is a lot more tentative than everything above, but I believe that we can generate decomposition automatically by studying (hyper)tree decompositions for our primal constraint graphs. Problems with a small (hyper)treewidth are amenable to quick dynamic programming, if we explore the search space in the order defined by the tree decomposition: a small treewidth means that each subtree depends on a few memoisation keys, and there are thus few possibilities to consider.
For example, I’m pretty sure that the fun polytime dynamic programming algorithms in the Squiggol book all correspond to problems with a bounded treewidth for the constraint graph.
In general though, problems don’t have a small treewidth. This is where mathematical decomposition methods come in. Benders decomposition could move a few key linking variables to the master problem; once they are taken out of the slave subproblem, the fixed Lagrangian subproblem has a small tree width.
However, I already wrote that Lagrangian decomposition seems like a better fit to me. In that case, we find a few variables such that removing them from the primal graph leaves a small tree decomposition. Each subproblem gets its own clones of these variables, and Lagrangian decomposition manipulates the objective function to make the clones agree with one another.
With this approach, we could also automatically generate dynamic programming solvers for Lagrangian subproblems… and it might even be practical to do the same for reduced cost oracle.
Going to hypertreewidth improves our support for global constraints
(e.g., alldiff
) or constraints that otherwise affect many variables
directly. Global constraint should probably get their own
decomposition component, to easily exploit their pruning routines, but
not necessarily large ad hoc ones. Hypertree decomposition takes the
latter point into account, that a large clique created by a single
constraint is simpler to handle than one that corresponds to many
smaller (e.g., pairwise) constraints.
I don’t know. I won’t have a lot of time to work on this stuff now that I have a real job. I’ve decided to dedicate what little time I have to a solver for Lagrangian master problems, because even the research community has a bad handle on that.
For the rest, I’m mostly unexcited about work that attempts to explore the search space as quickly as possible. I believe we should instead reify the constraint graph as much as possible to expose the structure of each instance to analyses and propagators, and thus reduce our reliance on search.
As a first step, I’d look into a simple system that only does table constraints (perhaps represented as streams of feasible tuples). A bucket heuristic for tree decomposition may suffice to get decent dynamic programming orders and efficiently optimise over the intersection of these table constraints. (Yes, this is highly related to query optimisation; constraint programming and relational joins have much in common.)
After that, I’d probably play with a few externally specified decomposition schemes, then add support for global constraints, and finally automatic decomposition.
All in all, this looks like more than a thesis’s worth of work…
]]>Linear programming (LP) is awesome. I based my PhD work on solving LPs with millions of variables and constraints… instead of integer programs a couple orders of magnitude smaller. However, writing LP solvers is an art that should probably be left to experts, or to those willing to dedicate a couple years to becoming one.
That being said, if one must write their own, an interior point method (IPM) is probably the best approach. This post walks through the development of a primal affine scaling method; it doesn’t guarantee stateoftheart practical or theoretical efficiency (it’s not even polynomial time), but is simple and easy to understand.
I think interior point methods suffer from an undeserved bad reputation: they only seem complicated compared to the nice combinatorial Simplex method. That would explain why I regularly (once or twice a year) see hackers implement a (classic!) simplex method, but rarely IPMs.
The problem is that simplex only works because worldclass
implementations are marvels of engineering: it’s almost like their
performance is in spite of the algorithm, only thanks to
codercenturies of effort. Worse, textbooks usually present the
classic simplex method (the dark force behind simplex tableaux) so
that’s what unsuspecting hackers implement. Pretty much no one uses
that method: it’s slow and doesn’t really exploit sparsity (the fact
that each constraint tends to only involve a few variables). Avoiding
the trap of the classic simplex is only the first step. Factorising
the basis — and updating the factors — is essential for efficient
revised simplex methods, and some clever tricks really help in
practice (packages like
LUSOL mostly
take care of that); pivot selection (pricing) is fiddly but has a huge
impact on the number of iterations — never mind degeneracy, which
complicates everything; and all that must be balanced with minimising
work per iteration (which explains the edge of dual simplex: there’s
no primal analogue to devex normalised dual steepest edge
pricing). There’s also presolving and crash basis construction, which
can simplify a problem so much that it’s solved without any simplex
pivot. Despite all this sophistication, only recent theoretical
advances
protect against instances going exponential.
Sidenote: Vasek Chvatal’s Linear Programming takes a view that’s closer to the revised simplex algorithm and seems more enlightening to me. Also, it’s been a while since I read about pricing in the simplex, but the dual simplex definitely has a computational edge for obscure reasons; this dissertation might have interesting details.
There’s also the learning angle. My perspective is biased (I’ve been swimming in this for 5+ years), but I’d say that implementing IPMs teaches a lot more optimisation theory than implementing simplex methods. The latter are mostly engineering hacks, and relying on tableaux or bases to understand duality is a liability for large scale optimisation.
I hope to show that IPMs are the better choice, with respect to both the performance of naïve implementations and the insights gained by coding them, by deriving a decent implementation from an intuitive starting point.
I’ll solve linear problems of the form $$\min\sb{x} cx$$ subject to $$Ax = b$$ $$l \leq x \leq u.$$
I will assume that we already have a feasible point that is strictly inside the box \([l, u]\) and that A has full row rank. Full rank doesn’t always pan out in practice, but that’s a topic for another day.
If it weren’t for the box constraint on x, we would be minimising a smooth function subject to affine equalities. We could then easily find a feasible descent direction by projecting \(c\) (the opposite of the gradient) in the nullspace of A, i.e., by solving the linear equalityconstrained least squares $$\min\sb{d} \(c)d\\sb{2}$$ subject to $$Ad = 0,$$ e.g. with LAPACK’s xGGLSE.
The problem with this direction is that it doesn’t take into account the box constraints. Once some component of x is at its bound, we can’t move past that wall, so we should try to avoid getting to close to any bound.
Interior point methods add a logarithmic penalty on the distance between x and its bounds; (the gradient of) the objective function then reflects how close each component of x is to its lower or upper bound. As long as x isn’t directly on the border of the box, we’ll be able to make a nontrivial step forward without getting too close to that border.
There’s a similar interpretation for primal affine scaling methods, but I prefer the original explanation. These methods rescale the space in which we solve for d so that x is always far from its bounds; rescaling the direction back in the original space means that we’ll tend to move more quickly along coordinates that are far from their bounds, and less for those close to either bound. As long as we’re strictly inside the box, the transformation is meaningful and invertible.
It’s a primal method because we only manipulate primal variables; dual methods instead work with the dual variables associated with the linear constraints.
This sketch might be useful. x is close to the lefthand side bound, and, after projection in the nullspace, d quickly hits the border. With scaling, d is more vertical, and this carries to the unscaled direction.
The descent direction may naturally guide us away from a closeby bound; scaling is then useless. However, each jump away from that bound makes the next longer, and such situations resolve themselves after a couple iterations.
Formally, let $$s = \min\{xl, ux\} > 0$$ be the slack vector (computed element wise) for the distance from x to its bounds. We’ll just scale d and c elementwise by S (S is the diagonal matrix with the elements of s on its diagonal). The scaled least square system is $$\min\sb{d\sb{s}} \(Sc)d\sb{s}\ $$ subject to $$ASd\sb{s} = 0.$$
Now that we have a descent direction \(d=Sd\sb{s}\), we only have to decide how far along to go in that direction. We determine the limit, r, with a ratio test. For each coordinate, look at \(d\sb{i}\): if it’s zero, this coordinate doesn’t affect the limit; if it’s negative, the limit must be at most \((lx)\sb{i}/d\sb{i}\); if it’s positive, the limit is at most \((ux)\sb{i}/d\sb{i}\). We then take the minimum of these ratios.
If the minimum doesn’t exist (i.e., \(r=\infty\)), the problem is unbounded.
Otherwise, we could go up to \(x+rd\) to improve the solution while maintaining feasibility. However, we want to stay strictly inside the feasible space: once we hit a bound, we’re stuck there. That’s why we instead take a slightly smaller step, e.g., \(0.9r\).
This drawing shows what’s going on: given d, r takes us exactly to the edge, so we take a slightly shorter step. The new solution is still strictly feasible, but has a better objective value than the previous one. In this case, we’re lucky and the new iterate \(x\sp{\prime}\) is more isotropic than the previous one; usually, we slide closer to the edge, only less so than without scaling.
We then solve the constrained least squares system again, with a new value for S.
I assumed that x is a strictly feasible (i.e., interior) solution: it satisfies the equality constraint \(Ax=b\) and is strictly inside its box \([l, u]\). That’s not always easy to achieve; in theory, finding a feasible solution is as hard as optimising a linear program.
I’ll now assume that x is strictly inside its box and repair it toward feasibility, again with a scaled least squares solution. This time, we’re looking for the least norm (the system is underdetermined) solution of $$Ad = (b  Ax).$$
We rescale the norm to penalise movements in coordinates that are already close to their bound by instead solving $$ASd\sb{s} = (b  Ax),$$ for example with xGELSD.
If we move in direction \(Sd\sb{s}\) with step \(r=1\), we’ll satisfy the equality exactly. Agin, we must also take the box into account, and perform a ratio test to determine the step size; if \(0.9r > 1\), we instead set \(r=1/0.9\) and the result is a strictly feasible solution.
I uploaded some hacktastic CL to github. The initial affine scaling method corresponds to this commit. The outer loop looks at the residual \(Axb\) to determine whether to run a repair (feasibility) or optimisation iteration.
The code depends on a few libraries for the MPS parser and on matlisp, and isn’t even ASDFloadable; I just saved and committed whatever I defined in the REPL.
This commit depends on a patch to matlisp’s src/lapack.lisp
(and to
packages.lisp
to export lapack:dgglse
):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 

1 2 3 4 5 6 7 8 

MPS is a very old (punchcardoriented) format to describe linear and
mixed integer programs. The parser mostly works (the format isn’t
very well specified), and standardform.lisp
converts everything to
our standard form with equality constraints and a box around decision
variables.
I will test the code on a couple LPs from netlib, a classic instance set (I think the tarball in CUTEr is OK).
AFIRO is a tiny LP (32 variables by 28 constraints, 88 nonzeros). Good news: the program works on this trivial instance and finds the optimum (the relative difference with the reference value is 7e6). The first four “repair” iterations find a feasible solutions, and 11 more iterations eventually get to the optimum.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 

I also tested on ADLITTLE, another tiny instance (97x57, 465 nz): 48 iterations, 0.37 seconds total.
AGG is more interesting, at 163x489 and 2541 nz; that one took 156 iterations and 80 seconds (130% CPU, thanks to Apple’s parallel BLAS). Finally, FIT1D is challenging, at 1026x25 and 14430 nz; that one took 100 seconds and the final objective value was off .5%.
All these instances are solved in a fraction of a second by stateoftheart solvers. The next steps will get us closer to a decent implementation.
The initial implementation resorted to DGGLSE for all least squares solves. That function is much too general.
I changed the repair least squares to a GELSY (GELSD isn’t nicely wrapped in matlisp yet).
We can do the same for the optimisation constrained least squares. Some doodling around with Lagrange multipliers shows that the solution of $$\min\sb{x} \xc\\sb{2}\sp{2}$$ subject to $$Ax = 0$$ is $$x = A\sp{t}(AA\sp{t})\sp{1}Ac  c.$$
This is a series of matrixvector multiplications and a single linear solve, which I first perform with GELSY: when we’re nearly optimal, the system is really badly conditioned and regular linear solves just crash.
Both linear algebra microoptimisations are in this commit.
ADLITTLE becomes about five times faster, and FIT1D now runs in 2.9 seconds (instead of 100), but AGG eventually fails to make progress: we still have problems with numerical stability.
Our numerical issues aren’t surprising: we work with a scaled matrix \(AS\), where the diagonal of \(S\) reflects how close \(x\) is to its bounds. As we get closer to an optimal solution, some of \(x\) will converge to the bounds, and some to a value inbetween: the limit is a basic solution. We then take this badly conditioned matrix and multiply it by its own transpose, which squares the condition number! It’s a wonder that we can get any useful bit from our linear solves.
There’s a way out: the normal matrix \(AA\sp{t}\) is not only symmetric, but also positive (semi)definite (the full rank assumption isn’t always satisfied in practice). This means we can go for a Cholesky \(LL\sp{t}\) (or \(LDL\sp{t}\)) factorisation.
We’re lucky: it turns out that Cholesky factorisation reacts to our badly conditioned matrix in a way that is safe for Newton steps in IPMs (for reasons I totally don’t understand). It also works for us. I guess the intuition is that the bad conditioning stems from our scaling term. When we go back in the unscaled space, rounding errors have accumulated exactly in the components that are reduced into nothingness.
I used Cholesky factorisation with LAPACK’s DPOTRF and DPOTRS for the optimisation projection, and for repair iterations as well.
LAPACK’s factorisation may fail on semidefinite matrices; I call out to GELSY when this happens.
To my eyes, this yields the first decent commit.
The result terminates even closer to optimum, more quickly. AGG stops after 168 iterations, in 26 seconds, and everything else is slightly quicker than before.
I think the code is finally usable: the key was to reduce everything to PD (positive definite) solves and Cholesky factorisation. So far, we’ve been doing dense linear algebra with BLAS and LAPACK and benefitted from our platform’s tuned and parallel code. For small (up to a hundred or so constraints and variables) or dense instances, this is a good simple implementation.
We’ll now get a 10100x speedup on practical instances. I already noted, en passant, that linear programs in the wild tend to be very sparse. This is certainly true of our four test instances: their nonzero density is around 1% or less. Larger instances tend to be even sparser.
There’s been a lot of work on sparse PD solvers. Direct sparse linear solvers is another complex area that I don’t think should be approached lightly: expensive supercomputers have been solving sparse PD linear systems for a couple decades, and there’s some really crazy code around. I’ve read reports that, with appropriate blocking and vectorisation, a sparse matrix can be factored to disk at 80% peak FLOP/s. If you’re into that, I’m told there are nice introductions, like Tim Davis’s book, which covers a didactical yet useful implementation of sparse Cholesky in 2000 LOC.
I decided to go with CHOLMOD from
SuiteSparse, a
mixed GPL/LGPL library. CHOLMOD implements state of the art methods:
when all its dependencies are available, it exposes parallelisation
and vectorisation opportunities to the BLAS (thanks to permutations
computed, e.g., by METIS) and exploits Intel’s TBB and Nvidia’s CUDA.
It’s also well designed for embedding in other programs; for example,
it won’t abort(3)
on error (I’m looking at you, LLVM), and includes
data checking/printing routines that help detect FFI issues. Its
interface even helps you find memory leaks! Overall, it’s a really
refreshing experience compared to other academic code, and I wish all
libraries were this well thought out.
I built only the barest version on my Mac and linked it with a
wrapper
to help
my bindings.
I had to build the dynamic library with Wl,all_load
on OS X to
preserve symbols that only appear in archives; Wl,wholearchive
should do it for gnu ld. (I also bound a lot of functions that I
don’t use: I was paying more attention to the TV at the time.)
CHOLMOD includes functions to simultaneously multiply a sparse matrix with
its transpose and factorise the result.
Function solvedense
shows how I use it to solve a dense PD system. It’s a threestep
process:
This is really stupid: I’m not even dropping zero entries in the dense matrix. Yet, it suffices to speed up AGG to 14 seconds!
It’s clear that we need to preserve the constraint matrix A in a sparse format. CHOLMOD has a function to translate from the trivial triplet representation (one vector of row indices, another of column indices, and another of values) to the more widely used compressed representation. Instances in our standard form already represent the constraint matrix as a vector of triplets. We only have to copy from CL to a CHOLMOD triplet struct to exploit sparsity in Cholesky solves and in matrixvector multiplications.
We now solve AGG in 0.83 seconds and FIT1D in 0.95. I think we can expect runtimes of one second or less for instances up to ~200x200. Better, we can finally hope to solve respectable LPs, like FIT2D (10500x26, 138018 nz, 2.3 s) or FIT1P (1677x628, 10894 nz, 14 s).
Our normal matrix \(ASS\sp{t}A\sp{t}\) always has the same pattern (nonzero locations): we only change the scaling diagonals in the middle. Sparse solvers separate the analysis and factorisation steps for exactly such situations. When we solve a lot of systems with the same pattern, it makes sense to spend a lot of time on a onetime analysis that we then reuse at each iteration: fancy analysis routines generate factors that take up less space and need fewer FLOPs to build.
I do that here. Our less tiny instances are almost twice as fast. We solve FIT1P in 8 seconds now, and even FIT2P (13525x3001, 60784 nz) in 200 seconds.
I then made some microoptimisation to reuse storage.
Finally, I added steps that push the current solution away from closeby bounds. These centering steps help subsequent iterations make longer jumps toward optimality
The last two changes shave a couple more seconds on large instances and gets closer to optimality on nastier ones.
I hope I managed to communicate the intuition behind primal affine scaling methods, and that the walkthrough helped map that intuition to a sparse implementation. I also realise that the code isn’t pretty: I wrote it during a short marathon and tried to only make incremental changes. Still, the algorithm should be more or less usable for small instances; more than a naïve simplex method, anyway.
That’s particularly true on multicore machines. Parallelising simplex methods has been an area of slow research for a couple decades; the best work I’ve seen so far takes a huge hit by reverting to the classic algorithm and hopes that parallelisation can compensate for that initial 101000x slowdown. In contrast, even our sparse method is already parallel: CHOLMOD automatically vectorises, parallelises and offloads work to the GPU.
I’ll try to code a sparse primaldual affine scaling method from scratch soon. Primaldual methods usually work better than pure primal (or dual) methods and I find their theory interesting (if a bit complicated).
If you liked this post, you might be interested in Stephen Boyd’s Convex optimisation course. He’s offering it online this winter, starting January 21st.
]]>I think the underlying confusion is frequent; I’ve seen it in a couple discussions where people report conflicting experiences with Robin Hood hashing.
The scheme described in Pedro Celis’s dissertation (PDF) is a variant of double hashing. Hashing is based on simulating random allocation with a hash function; double hashing asks for a second hash function to simulate independent random probing sequences for each key. This is really useful to reduce the worstcase probe sequence length, and Robin Hood hashing further improves the distribution’s tail by reducing the variance: better locations are given to faroff entries after bumping out welloff ones, hence the name.
Twentyfive years later, I don’t think any variant of double hashing makes sense. If we allow multiple hash functions and reads to uncorrelated addresses (and writes that bump entries out), we might as well go for simpler strategies with awesome worstcase bounds, like 2left or cuckoo hashing. Then again, I’m not convinced that these schemes are useful in software either: we can do a lot of probing in the time it takes to read a second random address [*]. In hardware, the tradeoffs seem completely different; I wouldn’t be surprised if two 128 KB SRAM chips were cheaper than a single 256 KB chip.
[*] We could hope for memorylevel parallelism… but, in my experience, TLB misses are where (uncached) hash lookups hurt, and those don’t seem to be resolved in parallel. In cache, linear probing eliminates bank conflicts and is really quick.
When I looked into Robin Hood hashing, I was interested in a degenerate variant in which probing sequences are linear. This variant also seems the one others usually mean when they say “Robin Hood hashing,” nowadays. The algorithms for inserts and lookups in Celis’s dissertation still work: probe locations and distances just happen to be computed trivially. However, the analysis doesn’t hold anymore. This is where later work like Viola’s Distributional analysis of Robin Hood linear probing hashing with buckets and the associated journal paper comes in. The bounds are worse than with (pseudo)random probing sequences, but each probe is a lot quicker.
Such weaker bounds also mean that Celis’s suggestion of using tombstone (markers for deleted entries) doesn’t perform as well. Linear probing clumps up, even with Robin Hood rebalancing. Tombstones can be cleared out with a full rehash after \(\Theta(n)\) operations, or deletion instead implemented by copying later elements backwards. I’m a fan of the latter option: when ties are broken deterministically (e.g., by comparing keys), the layout of the hash table is independent of the sequence of insertions and deletions executed to get there.
When there’s a disagreement over the performance of Robin Hood hashing, it may help to specify the probing sequence. Older (pre2000) references and experiments probably refer to the original twist on double hashing; newer ones may instead describe the linear probing variant. Robin Hood double hashing seems obsolete, and the linear probing variant isn’t what Celis described. It may be more appropriate to refer to the latter as “Robin Hood linear probing.”
]]>SBCL is known for its flowsensitive type propagation pass… or, rather, for its effects. That is probably why it can be so shocking when SBCL fails to infer obvious invariants. I feel like I can now describe the root causes of these surprising weaknesses and propose a series of fixes. Coding that up would likely take months, but the result should be more precise type propagation with improved asymptotic efficiency, and even a tunable knob between precision and compile times.
The compiler in SBCL, Python, doesn’t work on an IR in SSA form, nor does it convert internally to CPS à la Rabbit/Orbit. However, a clever – I don’t recall seeing it described anywhere else – design decision achieves similar effects. Operations receive their arguments as linear vars (LVARs) and write their results to an LVAR. Each LVAR has exactly one destination (one reader node), and, on any execution path, can only be used (written to) by one node. LVARs may nevertheless have multiple uses: these correspond to the value of conditionals and of conditional nonlocal exits. Assignment and access to lexical variables are represented as SET and REF nodes that receive and produce values through LVARs. This means that nearly everything can work in terms of LVARs when manipulating code and reasoning about dataflow. The conversion of readonce lexical variables to LVARs further simplifies a lot of mostly functional code.
Python takes this split to the extreme: only the constraint (type) propagation pass really handles fullblown lexical variables. Everything else depends on the flowsensitive type information summarised in CFG nodes and joined in LVARs. Want to check whether a given argument is always positive? Simply test for subtyping on the LVAR’s derived type to implicitly leverage flowsensitive type information. Want to insert some computation (e.g. a type assertion or an explicit modulo for overflowing arithmetic) between a value and its destination? Insert a new node in the CFG and substitute around nothing but opaque LVARs. These operations are so natural in Python that I only recently realised the design decision that makes them so easy.
However, I believe that Python went too far in relegating lexical variables to only the constraint pass. That pass only handles lexical variables (LAMBDAVARs) and flowsensitive information about them, and only propagates three kinds of constraints (derived information):
The importance of EQL constraints is subtle. These constraints represent the way that, e.g., information about a lexical variable is also true of the result of a lookup for that variable’s value (and vice versa), even if that information is learned after the variable is accessed. Similarly, when two lexical variables are EQL, information about one is true of the other. Finally, equality with a constant is always useful for constant propagation.
All three constraint types are centered on lexical variables. The reason is that no flowsensitive information needs to be computed for LVARs: they are only used once. For example, if a type test is performed on an LVAR, that is the LVAR’s only appearance and there is no information to propagate… unless that information is also true of some lexical variable.
There’s some more clever engineering and the analysis is simplified by disregarding variables that may be assigned from closures, but that’s the gist of it. A vanilla dataflow analysis worklist loop propagates information around, and constraints sets for each basic block shrink until the least (in terms of types, greatest when measuring information) fixed point is reached.
The first weakness is caused by the aforementioned cleverness: knowledge about the state of the world as each basic block is entered and exited is represented with sets of constraints (bitsets indexed with consecutive indices assigned to constraints on demand). In the interest of speed, join and meet are implemented as (bit)set union/intersection. The problem with this approach is that it completely discards information about a lexical variable when it is not identically present in both sets. For example, when one predecessor says that V is a positive integer and another that V is an integer, their successor discards all information about the type of V.
In practice, this is a lesser problem than it first appears to be: LVARs merge information about their multiple uses (writes) without any loss. The kind of code that suffers is code that does things like:
(lambda (x)
(ecase x
(1 ...)
(2 ...))
[work on X, knowing that it is either 1 or 2])
The two branches of the ecase that eventually continue execution respectively derive that X is EQL to 1 and to 2. However, their shared successor combines that information by forgetting about both equalities rather than weakening/downgrading the two constraints into “X is of type (INTEGER 1 2).”
Another interesting case is
(lambda (x)
(let ((y nil)) ; IF expression translated to an effectful IR
(if x
(setf y 1)
(setf y 2))
y))
for which Python derives that the result is either 1, 2, or NIL (the
union of all the values ever assigned to Y). The equivalent code
(lambda (x) (if x 1 2))
compiles everything to LVARs and the result is
known to be either 1 or 2 even before constraint propagation.
This can be addressed with a redundant representation that adds, to each constraint set, a dictionary from lexical variables to the (downgradable) information known about them: their types and the types they’re known to be < or > than. When intersecting constraint sets, these dictionaries make it easy to weaken information and insert the corresponding constraints.
The second weakness is deeply embedded in the structure of the compiler: the constraint propagation pass only pushes around precomputed flowinsensitive information. That information is valid because the flowinsensitive pass (that computes things like the type of the values produced by each CFG node) works from an older (over) approximation of the flowsensitive information. Once a fixed point is reached in the constraint propagation pass, the flowinsensitive pass is reexecuted and information refined. In the end, there is a backandforth between the simple flowsensitive constraint propagation pass and the flowinsensitive “optimization” pass that leverages a huge knowledge base about Common Lisp operators and functions.
That exchange of information not only takes many (hefty) iterations to converge, but is also lossy: the constraint propagation pass is flowsensitive, but manipulates types based on an earlier pass or on type declarations. In effect, constraint propagation computes the least fixed point of a function that is defined by an earlier coarse upper approximation… a process that ends up preserving a lot of the initial weakness. This strange choice presents an interesting characteristic: every iteration the outer communication loop between flow sensitive and insensitive analysis produces valid (but overly conservative) information. So convergence is slow, but at least intermediate results can guide rewrites safely.
The obvious fix is to derive node types within the constraint propagation pass, even as the final result of the flowsensitive pass is approximated from below. For example, upon assignment to a lexical variable, the current pass adds a new constraint: the type of that variable is that of the LVAR for the assignment’s value, and the LVAR’s type is computed based on the information from the previous constraint propagation pass. Instead, it should be based on the current (overly agressive) flowsensitive information. That would eventually converge to a smaller fixed point. This alternative design would also obviate the need for a questionable hack that improves precision for some lucky iteration variables.
A weaker form of this change would be to use preliminary (perhaps overly aggressive) information only to tentatively detect impossible conditional branches. This would enable Python to derive that the return value of
(lambda ()
(loop with x = 0
while (foo)
do (case x
(1 (setf x 2))
(2 (setf x 1)))
finally (return x)))
is always 0 (and never 1 or 2), or that the expression (if (eql x y) (eql x y) t)
is always true.
With suitable handling of calls to local functions, useful types might even be derived for recursive functions. The current type propagation pass derives insanely stupid types for recursive functions and for assignment to variables in loops (unless the iteration variable hack suffices). For example
(lambda ()
(labels ((foo (x)
(if (bar)
(foo (1+ x))
x)))
(foo 42)))
is derived to return an arbitrary NUMBER. Instead, the return type of local functions can be initialised to the empty type (in which case propagation must be stopped for the caller block) and extended as argument types grow and as recursion triggers more propagation.
The problem is that this more precise flowsensitive pass loses the finite chain condition: there is no guarantee that the least fixed point will always be reached in finite time. The current constraint propagation pass operates on a finite lattice: only variables and types that exist in the input IR1 graph can appear in constraints, so the fixpointed function is defined over the powerset of this finite set of constraints for each basic block. In contrast, the full CL type universe is most definitely not finite (type tests aren’t even decidable in general).
There is a simple solution to this problem: before extending the constraint universe, widen types to satisfy the finite chain condition, i.e., make sure that types eventually stop growing. For example, when taking a nontrivial union or intersection of types, the result could be quantised to a coarse set of types (singleton types, named class types, a few key numeric ranges, etc.). Appropriate widening can also accelerate finite but slow convergence; the resulting fixed point isn’t guaranteed to be minimal, but is always valid. In fact, widening is likely to be necessary for the first improvement (with redundant dictionaries) as well.
Another issue in the constraint propagation pass is an instance of the phase ordering problem. Some sourcetosource transforms help type propagation (e.g., turning complicated functions into simpler ones for which type propagation is feasible), and others hinder it (e.g., converting integer division by a constant to multiply and shift or MAPCAR to an inline loop). Worse, there’s a dependency cycle as transforms rely on type propagation to determine whether they are applicable. There’s been some academic work that exposes multiple equivalent code sequences to analyses, and it seems possible to do something similar in Python as well: sourcetosource transforms generate functions, so we’d “only” have to analyse the function without splicing it in. However, that’s likely to be impractically slow (even termination is nonobvious)… perhaps it will be doable for select transforms that expand into pure expressions.
The only suggestion to accelerate constraint propagation so far is to widen types, and that mostly compensates for slowdowns (or infinite convergence) introduced by other changes. There’s a classic trick that ought to accelerate even the current pass: guide the order in which basic blocks are analysed by partitioning them in strongly connected components. Python exploits a simple ordering technique, a depthfirst ordering that minimises backward edges in reducible controlflow graphs. Basic blocks could instead be partitioned into strongly connected components: contracting each strongly connected component into a single vertex leaves a directed acyclic graph. It then suffices to fixpoint on each strong component and traverse the condensed DAG so that each component is only visited after all its predecessors. Python already computes loop nests, so that’s an easy change; it’s likely even better to look at nested loops and recursively order analysis within each component. A few other passes perform some form of dataflow analysis and would benefit from that improved ordering… perhaps compile times can be really improved (Amdahl says that’s unlikely until splicing in sourcetosource transformations is a lot quicker).
Recap :
In terms of implementation difficulty, these changes probably go 163425. This also happens to be the order in which I’d try to implement them.
]]>So, why is the continuation monad the mother of all monads? The short answer is that, by enabling transparent inversion of control, it eliminates the need to sprinkle hooks for monadspecific code everywhere; normal (as much as anything involving delimited continuations can be “normal”) evaluation rules will be subverted as needed. Here’s how.
First, some boilerplate (boilerfoil may be the more appropriate term).
#lang racket
(require racket/control) ; for shift/reset
(define (run thunk return . arguments)
(reset (return (apply thunk arguments))))
This definition is all that is needed to execute arbitrary Racket code
in a monad: the only thing to specify is how a ground value should be
moved in the monad. bind
will be defined implicitly, through the
code for monadspecific behaviour. The body of run
defines a
context to determine where to stop capturing the continuation, and
executes the form (return (apply thunk arguments))
in that context:
the thunk is called with any argument provided, and the result is
return
ed into the monad.
For the sake of generalisation, I’ll start with a trivial example: the Maybe monad. First, I’ll quickly define structure types. In practice, a distinguished “nothing” value would suffice, but this way parallels Haskell more closely.
(struct Maybe ())
(struct Nothing Maybe ())
(struct Some Maybe (value))
The constructor Some
also doubles as return
. In order to abort
out, nothing
must unwind the continuation and return a Nothing
object.
(define (nothing)
(shift k (Nothing)))
> (run (lambda () 42) Some)
#<Some>
> (Somevalue (run (lambda () 42) Some))
42
The function run
obviously works as it should in these trivial
examples. It’s also not surprising that nothing
works, because it’s
the obvious implementation of unwinding with delimited continuations.
> (run (lambda () 42 (nothing) #t) Some)
#<Nothing>
In the List monad, return
is just list
. run
can be called with a
thunk and list
as the second argument, and indeed, the result is a
normal computation that returns its value as a singleton.
> (run (lambda () (+ 4 5)) list)
'(9)
The useful thing to do with the List monad is to specify multiple return values, and have the computation fork (lazily in Haskell, eagerly here, because I’m working with strict lists) on each choice. That’s a oneliner:
(define (injectvalues . values)
(shift k (appendmap k values)))
This function captures the continuation up to reset
, unwinds the
current continuation up to that point, and binds the captured
delimited continuation to k
. Then, it passes each possible value to
the continuation, and appends the results together.
Here’s a first example:
> (run (lambda () (+ (injectvalues 1 2) (injectvalues 3 4))) list)
'(4 5 5 6)
That is, reassuringly, the four possible sums: 1 + 3 = 4
, 1 + 4 = 5
,
2 + 3 = 5
, and 2 + 4 = 6
. The magic is that all Scheme code,
by virtue of supporting the capture (and unwind) of delimited
continuation, is now “in” the List monad. It is certainly the case
for that uninstrumented thunk. The predefined function map
provides a more convincing example.
> (run (lambda ()
(map cons
(injectvalues '(1 2) '(4 5))
(injectvalues '(3 4) '(5 6))))
list)
'(((1 . 3) (2 . 4))
((1 . 5) (2 . 6))
((4 . 3) (5 . 4))
((4 . 5) (5 . 6)))
I’ve taken the liberty of linewrapping the return value, and clearly,
(map cons ...)
has been called with all four pairs of lists… but
all the special monadic operations happens outside map
. Moving
injectvalues
inside the mapped function is a much stronger evidence
that arbitrary uninstrumented Scheme code is implicitly lifted in the
monad: map
is a predefined (precompiled even) library function.
> (run (lambda ()
(map (lambda (x y) ((injectvalues + ) x y))
'(1 2 3)
'(4 5 6)))
list)
'((5 7 9) ; + + +
(5 7 3) ; + + 
(5 3 9) ; +  +
(5 3 3) ; +  
(3 7 9) ;  + +
(3 7 3) ;  + 
(3 3 9) ;   +
(3 3 3)) ;   
At each call to the mapped function, the computation explores both the branch in which the arguments are added and the one in which they are subtracted. The result, for 3 pairs of triplets, is a list of 8 triplets.
Neither of these implementations is surprising or new; I believe
they’re standard undergraduate exercises in a few universities. The
insight in
Filinski’s work
is that both nothing
and injectvalues
share the same structure
and can be defined in terms of the monad they help implement. Because
I dislike scrolling, their definitions are copied here.
(define (nothing) ; Maybe
(shift k (Nothing)))
(define (injectvalues . values) ; List
(shift k (appendmap k values)))
Instead of returning the (monadic) value (Nothing)
or values
(a
list), both capture and unwind the delimited continuation, bind it to
k
, and then do something more to the value. Some squinting reveals
that that something more is calling bind
with the continuation k
as the next step. In the Maybe monad, Nothing >>= k
evaluates to
Nothing
. In the List monad, values >>= k
becomes foldr ((++)
. k) [] values
, which is basically (appendmap k values)
. The
general form for any monad is then to implement any operator that
doesn’t just return
a value as
(define (operator ...)
(shift k (bind [special stuff] k)))
As code, this gives
(define (makeoperator bind operator)
(lambda arguments
(shift k (bind (apply operator arguments) k))))
after adding support for variadic operators. For example, choosing between multiple items in a list is
(define (listbind x k) (appendmap k x))
(define injectvalues2 (makeoperator listbind list))
Some sketching on paper will show that the transformation is generic
and correct, with a proof that mostly goes through repeated
applications of the monad laws. Capturing the continuation moves the
whole stack of functions that will eventually receive results into a
single function (the continuation), and that function can then be
passed to bind
. The monad laws guarantee that this associative
reordering of nested bind
s preserves the meaning of the program.
We can also mimic do
notation more closely and implement join
, an
“antireturn”. Given that operator, not return
ing a value can instead
be implemented as join
ing it after the fact. One definition is
(makeoperator bind identity)
, but I feel it’s simpler to just do it
longhand.
(define (makejoin bind)
(lambda (value)
(shift k (bind value k))))
(define joinlist (makejoin listbind))
> (run (lambda () (+ 1 (joinlist '(1 2 3)))) list)
'(2 3 4)
Of course all this also works when operator
is equivalent to
return
; it’s just pointless. The shift
/return
/bind
dance is
then a convoluted way to demand regular Scheme evaluation rules.
And that is how the continuation monad is universal. When code is in the continuation monad (or in a language with delimited continuations), there is a mechanical way to have that code execute in almost any other monad. There are technical restrictions on the monad for the transformation to work, but I think any monad that can be implemented in (pure) Haskell qualifies.
I feel like sigfpe’s presentation was hobbled by the fact that he used the Continuation monad in Haskell, making it harder to see that the Continuation monad of the implementation is completely independent of the emulated one. Really, the idea is one of these nice insights that formalise and generalise old hacks, and seem obvious in retrospect.
The post’s title refers to the fact that delimited continuations can
themselves be implemented in terms of callwithcurrentcontinuation
(and a single mutable cell). There are many ways to interpret the
corollary that call/cc
suffices to transpose code into arbitrary
monads. It certainly seems like a victory for the minimalist crowd.
On the other hand, I believe that the power of programming languages
and paradigms lies as much in what they enable as it does in what they
forbid (or make inconvenient, at least). From that point of view,
it’s not obvious whether the universality of call/cc
makes a strong
case for or against the feature.
This result also provides a tentative explanation for the low traction of monads among Lispers: perhaps many would rather directly hack the features in, with (ersatz) continuations.
]]>1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 

The canonical bitap algorithm matches literal strings, e.g. “abad”. Like other bitap algorithms, it exploits the fact that, given a bitset representation of states, it’s easy to implement transfer functions of the form “if state \( i \) was previously active and we just consumed character \( c \), state \( i+1 \) is now active”. It suffices to shift the bitset representation of the set of active states by 1, and to mask out transitions that are forbidden by the character that was just consumed.
For example, when matching the literal string “abad”, a state is associated with each position in the pattern. 0 is the initial state, 1 is the state when we’ve matched the first ‘a’, 2 after we’ve also matched ‘b’, 3 after the second ‘a’, and 4 after ‘d’, the final character, has been matched. Transposing this information gives us the information we really need: ‘a’ allows a transition from 0 to 1 and from 2 to 3, ‘b’ a transition from 1 to 2, and ‘d’ from 3 to 4.
A trivial state machine is probably clearer.
The initial state is in grey, the accept state is a double circle, and transitions are labelled with the character they accept.
Finding the substring “abad” in a long string thus reduces to a bunch of bitwise operations. At each step, we OR bit 0 in the set of states: we’re always looking for the beginning of a new match. Then, we consume a character. If it’s ‘a’, the mask is 0b01010; if it’s ‘b’, mask is 0b00100; if it’s ‘d’, mask is 0b10000; otherwise, its mask is 0. Then, we shift the set of state left by 1 bit, and AND with mask. The effect is that bit 1 is set iff mask has bit 1 set (and that only happens if the character is ‘a’), and bit 0 was previously set (… which is always the case), bit 2 is set iff mask has bit 2 set (i.e. we consumed a ‘b’) and bit 1 was previously set, etc.
We declare victory whenever state 4 is reached, after a simple bit test.
It’s pretty clear that this idea can be extended to wildcards or even character classes: multiple characters can have the same bit set to 1 in their mask. Large alphabets are also straightforward to handle: the mapping from character to mask can be any associative dictionary, e.g. a perfect hash table (wildcards and inverted character classes then work by modifying the default mask value). This works not only with strings, but with any sequence of objects, as long as we can easily map objects to attributes, and attributes to masks. Some thinking shows it’s even possible to handle the repetition operator ”+”: it’s simply a conditional transition from state \( i \) to state \( i \).
What I find amazing is that the technique extends naturally to arbitrary regular expressions (or nondeterministic finite automata).
Simulating an NFA by advancing a set of states in lockstep is an old technique. Russ Cox has written a nice review of classic techniques to parse or recognize regular languages. The NFA simulation was first described by Thomson in 1977, but it’s regularly being rediscovered as taking the Brzozowski derivative of regular expressions ;)
Fast Text Searching With Errors by Sun Wu and Udi Manber is a treasure trove of clever ideas to compile pattern matching, a symbolic task, into simple bitwise operations. For regular expressions, the key insight is that the set of states can be represented with a bitset such that the transition table for the NFA’s states can be factored into a simple datadependent part followed by \( \epsilon \)transitions that are the same, regardless of the character consumed. Even better: the NFA’s state can be numbered so that each character transition is from state \( i \) to \( i+1 \), and such a numbering falls naturally from the obvious way to translate regular expression ASTs into NFAs.
For the initial example “ab(cde)*fg”, the AST looks like a node to match ‘a’, succeeded by a node to match ‘b’, then a repetition node into either “cd”, “e” or \( \epsilon \), and the repetition node is succeeded by “fg” (and, finally, accept). Again, a drawing is much clearer!
Circles correspond to charactermatching states, the square to a repetition node, the diamond to a choice node, and the pointy boxes to dummy states. The final accept state is, again, a double circle. \( \epsilon \)transitions are marked in blue, and regular transitions are labelled with the character they consume.
The nodes are numbered according to a straight depthfirst ordering in which children are traversed before the successor. As promised, character transitions are all from \( i \) to \( i+1 \), e.g. 0 to 1, 1 to 2, 6 to 7, … This is a simple consequence of the fact that charactermatching nodes have no children and a single successor.
This numbering is criminally wasteful. States 3, 6 and 10 serves no purpose, except forwarding \( \epsilon \) transitions to other states. The size of the state machine matters because bitwise operations become less efficient when values larger than a machine word must be manipulated. Using fewer states means that larger regular expressions will be executed more quickly.
Eliminating them and sliding the numbers back yields the something equivalent to the more reasonable 11state machine shown in Wu and Manber (Figure 3 on page 10). Simply not assigning numbers to states that don’t match characters and don’t follow character states suffices to obtain such decent numberings.
Some more thinking shows we can do better, and shave three more states. State 2 could directly match against character ‘e’, instead of only forwarding into state 3; what is currently state 4 could match against character ‘c’, instead of only forwarding into state 8, then 2 and then 5 (which itself matches against ‘c’); and similarly for state 7 into state 8. The result wastes not a single state: each state is used to match against a character, except for the accept state. Interestingly, the \( \epsilon \) transitions are also more regular: they form a complete graph between states 2, 3 and 5.
It’s possible to use fewer states than the naïve translation, and it’s useful to do so. How can a program find compact numberings?
The natural impulse for functional programmers (particularly functional compiler writers ;) is probably to start matching patterns to iteratively reduce the graph. If they’ve had bad experience with slow fixpoint computations, there might also be some attempts at recognising patterns before even emitting them.
This certainly describes my first couple stabs; they were either mediocre or wrong (sometimes both), and certainly not easy to reason about. It took me a while to heed ageold advice about crossing the street and compacting state machines.
Really, what we’re trying to do when compacting the state machine is to determine equivalence classes of states: sets of states that can be tracked as an atomic unit. With rewrite rules, we start by assuming that all the states are distinct, and gradually merge them. In other words, we’re computing a fixed point starting from the initial hypothesis that nothing is equivalent.
Problem is, we should be doing the opposite! If we assume that all the states can be tracked as a single unit, and break equivalence classes up in as we’re proven wrong, we’ll get maximal equivalence classes (and thus as few classes as possible).
To achieve this, I start with the naïvely numbered state machine. I’ll refer to the start state and character states as “interesting sources”, and to the accept state and character states as “interesting destinations”. Ideally, we’d be able to eliminate everything but interesting destinations: the start state can be preprocessed away by instead working with all the interesting destinations transitively reachable from the start state via \( \epsilon \) transitions (including itself if applicable).
The idea is that two states are equivalent iff they are always active after the same set of interesting sources. For example, after the start state 0, only state 1 is active (assuming that the character matches). After state 1, however, all of 2, 3, 4, 6, 7, 10 and 11 are active. We have the same set after states 4 and 8. Finally, only one state is alive after each of 11 and 12.
Intersecting the equivalence relations thus defined give a few trivial equivalence classes (0, 1, 5, 8, 9, 12, 13), and one huge equivalence class comprised of {2,3,4,6,7,10,11} made of all the states that are active exactly after 1, 5 and 9. For simplicity’s sake, I’ll refer to that equivalence class as K. After contraction, we find this smaller state graph.
We can renumber this reasonably to obey the restriction on character edges: K is split into three nodes (one for each outgoing characterconsuming edge) numbered 2, 4 and 7.
We could do even better if 5 and 9 (3 and 6 above) in the earlier contracted graph were also in equivalence class K: they only exist to forward back into K! I suggest to achieve that with a simple postprocessing pass.
Equivalence classes are found by find determining after which interesting node each state is live. States that are live after exactly the same sets of interesting nodes define an equivalence class. I’ll denote this map from state to transitive interesting predecessors \( pred(state) \).
We can coarsen the relationship a bit, to obtain \( pred\sp{\prime}(state) \). For interesting destinations, \( pred\sp{\prime} = pred \). For other nodes, \(pred\sp{\prime}(state) = \cap\sb{s\in reachable(state)}pred(s)\), where \(reachable(state)\) is the set of interesting destinations reachable via \( \epsilon \) transitions from \(state\). This widening makes sense because \(state\) isn’t interesting (we never want to know whether it is active, only whether its reachable set is), so it doesn’t matter if \(state\) is active when it shouldn’t, as long as its destinations would all be active anyway.
This is how we get the final set of equivalence classes.
We’re left with a directed multigraph, and we’d like to label nodes such that each outgoing edge goes from its own label \( i \) to \( i+1 \). We wish to do so while using the fewest number of labels. I’m pretty sure we can reduce something NPHard like the minimum path cover problem to this problem, but we can still attempt a simple heuristic.
If there were an Eulerian directed path in the multigraph, that path would give a minimal set of labels: simply label the origin of each arc with its rank in the path. An easy way to generate an Eulerian circuit, if there is one, is to simply keep following any unvisited outgoing arc. If we’re stuck in a dead end, restart from any vertex that has been visited and still has unvisited outgoing arcs.
There’s a fair amount of underspecification there. Whenever many equivalence classes could be chosen, I choose the one that corresponds to the lexicographically minimal (sorted) set of regexp states (with respect to their depthfirst numbering). This has the effect of mostly following the depthfirst traversal, which isn’t that bad. There’s also no guarantee that there exists an Eulerian path. If we’re completely stuck, I start another Eulerian path, again starting from the lexicographically minimal equivalence class with an unvisited outgoing edge.
Finally, once the equivalence states are labelled, both character and \( \epsilon \) transitions are reexpressed in terms of these labels. The result is a nice 8state machine.
This only covers the abstract stuff. There’s a
CL code dump on github.
You’re probably looking for compileregexp3.lisp
and
scalarbitap.lisp
; the rest are failed experiments.
Once a small labeling is found, generating a matcher is really straightforward. The datadependent masks are just a dictionary lookup (probably in a vector or in a perfect hash table), a shift and a mask.
Traditionally, epsilon transitions have been implemented with a few
table lookups. For example, the input state can be cut up in bytes;
each byte maps to a word in a different lookup table, and all the
bytes are ORed together. The tables can be pretty huge (nstates/8
lookup tables of 256 state values each), and the process can be slow
for large states (bitsets). This makes it even more important to
reduce the size of the state machine.
When runtime compilation is easy, it seems to make sense to instead generate a small number of test and conditional moves… even more so if SIMD is used to handle larger state sets. A couple branchfree instructions to avoid some uncorrelated accesses to LUTs looks like a reasonable tradeoff, and, if SIMD is involved, the lookups would probably cause some slow crosspipeline pingponging.
There’s another interesting lowlevel trick. It’s possible to handle large state sets without multiword shifts: simply insert padding states (linked via \( \epsilon \) transitions) to avoid character transitions that straddle word boundaries.
There’s a lot more depth to this bitap for regexp matching thing. For example, bitap regular expressions can be adapted to fuzzy matchings (up to a maximum edit distance), by counting the edit distance in unary and working with one bitset for each edit distance value. More important in practice, the approach described so far only handles recognising a regular language; parsing into capture groups and selecting the correct match is a complex issue about which Russ Cox has a lot to say.
What I find interesting is that running the NFA backward from accept states gives us a forward oracle: we can then tell whether a certain state at a given location in the string will eventually reach an accept state. Guiding an otherwise deterministic parsing process with such a regular language oracle clearly suffices to implement capture groups (all nondeterministic choices become deterministic), but it also looks like it would be possible to parse useful nonregular languages without backtracking or overly onerous memoisation.
]]>Huge month for #SBCL: devious bugs fixed, new optimisations, new features. Help us find regressions in HEAD!sbcl.git.sourceforge.net/git/gitweb.cgi…
— Paul Khuong (@pkhuong) May 24, 2013One feature I’m particularly excited about is the basic support for SSE intrinsics on x8664. It’s only the fundamental infrastructure needed for the compiler and runtime to manipulate and compile operations on SIMD packs (the platformindependent name suggested by Kevin Reid). However, now that it’s merged in, we can easily add new functionality that maps (more or less) directly to SSE assembly in a running image, like the sbrotatebyte contrib does for bitwise rotations.
There is currently no such contrib in SBCL, although Alexander Gavrilov has been maintaining clsimd, an extension for SBCL and ECL (he also kept an old SSE/SBCL branch of mine on life support since 2008). I’m really not sure what the right interface will look like for Common Lisp, so I decided to only provide the mechanism to implement such an interface; when something emerges that looks sane and has been used in anger, we can merge it in.
In the meantime, we get to define our own interface and see where that experience leads us. In this post, I’ll use a classic example to guide my first stab at an interface: the Mandelbrot set. I’ll only define as much of an interface as I need, and adapt it along the way if I see the need for more sophistication; I prefer to build programs and libraries bottomup and iteratively, rather than trying to divine my requirements ahead of time. A consequence of this approach is that I have to/am able to maintain a dynamic development style, even when working with (and working on) SSE intrinsics. I know many people use REPLs as desktop calculators; maybe others can be persuaded to also use SBCL as an SIMD calculator, to doodle on SSE code sequences (:
1 2 3 4 5 6 7 8 9 

The last form adds three new functions (f4+
, f4*
and f4
) to the
compiler’s database (fndb). Each function accepts two packs of
single floats as arguments, and returns another single float pack (the
result of elementwise addition, multiplication or subtraction).
Moreover, the functions can be reordered and flushed as dead code by
the compiler, and, whenever they appear in code, they should
ultimately be translated to assembly code. The last bit is so the
runtime system silently overwrites the database if the functions
are already there, rather than warning at loadtime.
Next, we add a template (VOP) to convert calls to f4+
into assembly
code. This task is so deeply tied with the internals that it makes
sense to just do it from SBVM.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 

The first line defines a VOP with name mandelbrot::f4+
(it’s an
arbitrary symbol and the important bit is only to avoid collisions,
but good naming is useful when reading compiler traces or optimisation
notes).
Then, we specify that it’s to be automatically considered when there
is a call to mandelbrot::f4+
, and that it can be used both in fast
and in safe code.
The template converts a call with two arguments, x
and y
, that
must both be in SSE registers of single floats. Both value must be
represented as single floats in SIMD packs: when defining VOPs,
types refer to primitive types, and primitive types are concerned with
representation (like C types) rather than only sets of values (a
given CL value or type can be mapped to many primitive types
depending on the context). We also declare a preference for x
to be
allocated in the same register as the result r
, if possible.
The template has a single result, which is also an SIMD pack of single floats allocated in an SSE register of single floats.
Finally, we have code to emit assembly. If r
and y
were packed
(registerallocated) in the same register, we exploit commutativity to
add x
into y
. Otherwise, we move x
into r
if necessary (the
move
function takes care of checking for equivalent locations), and
emit a packed single float addition (addps
) of y
into r
(which
has just been overwritten with the contents of x
).
The reason we have to specify that the SSE registers hold single
float values (rather than doubles or integers) is for functions like
move
to emit a move of FP SSE registers rather than integer, for
example. However, this additional information doesn’t really affect
register allocation: the three storage classes (SC) –
singlessereg
, doublessereg
and intssereg
– all map to the
same storage base (SB), and the same SSE register can be used for an
simdpacksingle
value at one program point, an simdpackint
at
another, and a doublefloat
at yet another.
This VOP is the bare minimum for a useful packed addition of single floats: in the future, it will probably make sense to add support for a constant argument, which could be directly loaded from a RIPrelative address to save a register (SBCL isn’t clever enough to determine when it makes sense to keep a constant in a register across a basic block or a loop).
We do the same for multiplication and subtraction. There’s one last
twist for f4
. We can’t exploit commutativity there, so we have to
make sure y
and r
aren’t packed in the same register; this is
achieved by extending the live range of r
from the end of the
(default) live range for x
. There are obvious opportunities for
macroisation, but, again, I can refactor when I have a clear need
(e.g. when I decide to add a dozen more intrinsics).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 

Now that we’ve declared the functions and defined one way to convert
each to machine code, we can define stubs to interact with them from
interpreted code or as normal firstclass functions. The compiler
already has all the information needed to convert any call to f4+
et
al to machine code; however, no such function is actually defined. We
can define them with what looks like infinite recursion: the
defknown
will ensure the compiler inserts type checks for
(simdpack singlefloat)
as needed and infers that the result is
also an (simdpack singlefloat)
, so the templates will be
applicable.
1 2 3 4 5 6 7 8 

Now, we can call our new functions at the REPL, pass them to higherorder functions, etc., like any normal function.
1 2 3 4 

The first form makes an simdpack
from four single float values. We
must already track at compiletime whether each SIMD pack value is a
pack of singles, doubles or integers, so it makes sense to do the same
at runtime: that way we preserve type information when constant
folding, and that we can print them nicely. Thus, we can call f4+
on that value, and see that 1.0 + 1.0 = 2.0
. We can also
disassemble the function f4*
to confirm that it is a normal function
that can be passed around (in this case to disassemble
), and that it
doesn’t endlessly recurse.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 

The inner loop body to approximate the Mandelbrot set is simply z' =
z*z + c
, where z
and c
are complex values. If we expand the
complexes into their real and imaginary parts (zr + i zi
, cr + i
ci
), we find z*z+c = (zr*zr  zi*zi + cr) + i(2*zr*zi + ci)
. That’s easily
vectorised if we have a pack of four zr
values, another for the
zi
, and similarly for cr
and ci
.
1 2 3 4 5 6 

1 2 3 4 5 6 7 8 9 

This is a straight translation of the scalar formula above, but it computes four values in parallel (that’s why struct of array layouts lead to easy to vectorisation). Crucially, a disassembly reveals we get the code we expect:
1 2 3 4 5 6 7 8 9 10 11 12 13 

The beauty of Python’s representation selection capabilities is that we can specify multiple representations (unboxed in a register or on stack, or boxed as a generic pointer to a heapallocated value) for SIMD packs, and the right (usually…) one will be chosen depending on context. In this case, the function receives boxed SIMD packs, unboxes them into SSE registers in the prologue, performs its FP arithmetic only on SSE registers, and converts the results back into boxed SIMD packs.
Even better: it seems to compute the right thing (:
1 2 3 4 5 6 7 8 9 10 11 12 13 

The first return value is a pack of real components, and the second a pack of imaginary components, and the values do correspond to the scalar computation in normal complex arithmetic.
The next step is to compute the (squared) norm of complex values, in order to detect escape from the orbit.
1 2 3 4 

Again, a couple smoke tests, and a disassembly
1 2 3 4 5 6 7 8 9 10 11 12 

Finally, we have to compare floats (squared norms) against a limit
(4.0), and perform some integer arithmetic to count the number of
iterations until escape. f4signmask
is the first instance of a
function that accepts arbitrary SIMD pack types: singles, integers, or
even doubles, it’ll extract sign bits from the four 32bit chunks.
1 2 3 4 5 6 7 8 9 10 11 12 13 

The VOPs are really straightforward and somewhat repetitive, again: we only want the bare minimum to get something working, and fancy special cases can wait until they actually show up.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 

Again, we add stubs to easily work at the REPL
1 2 3 4 5 6 7 8 9 10 11 

And some smoke tests
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 

f4<=
compares two packs of single floats element by element, and
returns a pack of integers with all 32 bits set to 1 for pairs that
are <=
, and 0 otherwise. i4
subtracts 32bit signed integers
elementwise; subtracting 1 (#xffff...
) is equivalent to adding 1.
Finally, f4signmask
extracts the topmost (sign) bit from each
chunk of 32 bits, and returns the 4 bits as a normal integer (i.e. in
a general purpose register). The function sbext:%simdpackub32s
is useful to extract the four 32 bit unsigned values from an SIMD pack
– in this case the result of subtracting the comparison masks from
zero – and work on them in a scalar context.
The idea is that we’ll want to compare the result of %norm^2
with
the limit (4.0), and stop iterating when the maximal iteration count
is reached, or when all four norms are greater than 4.0 (all sign bits
are zero). Until then, we can subtract from the counter to increment
the number of unescaped iterations. When we’re done, we can easily
extract individual iteration counts from the packed counters.
This yields
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 

We can run a couple random tests and compare with a straightforward scalar version:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 

So, there’s some register allocation oddities (mostly because the VOPs
don’t handle the case when both arguments are the same, I think), but
it’s pretty good. The one issue that bothers me is (zerop
(f4signmask stillinorbit))
: it’s a common operation, and there’s
a slight miscompile (the result of f4signmask
is converted to a
fixnum before comparison… an instance of bad representation
selection). These two reasons – factoring code out and improving
code generation for a mediumlevel operation – are enough for me to
add a specialised function to test if all the sign bits are zero.
We define predicates simply as functions that return boolean
values (this one is prefixed with f4
because the instruction
officially works on single floats).
1 2 3 

However, VOPs for such functions shouldn’t return values; instead,
they can define how to determine that their result is true. In this
case, if the :z
(zero) flag is set. Code generation will take care of
using that information in conditional branches or conditional moves.
1 2 3 4 5 6 7 8 9 10 

For example, in this stub, the compiler will emit the template, followed by a conditional move to return T or NIL.
1 2 

1 2 3 4 5 6 7 8 9 10 11 12 

Now, we can slightly adjust mandelbrotescape
to exploit this new
function:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 

This all looks like a lot of work, but the vast majority of it is
reusable code to define SSE operations: 22 LOC of defknown, 83 LOC for
assembly templates, 20 LOC in stubs and utility code. There’s only 31
LOC for the SIMD mandelbrot loop (%mandelbrotiter
, %norm^2
and
mandelbrotescape
), 6 for the naïve scalar loop, and 12 for the
random test harness. I uploaded all the files on
github to simplify
further experimentation; just fork and play around! The function
names could certainly be improved, for one.
My request for extrahard bug shaking in SBCL 1.1.8 didn’t function as well as I’d wish: only a few hours after release (and after a week of code freeze), we received a new bug report (with fixes committed very rapidly, at least). We observed the same phenomenon in 1.1.6, and wound up being extremely conservative for 1.1.7. I’m not sure what’s the best way to encourage testing (binary release candidates during freeze?), but I am starting to see the point of the old odd/even policy for Linux. For now, the result is that 1.1.8 offers a lot of exciting improvements, but also more than its share of bugs; for production code, it’s probably best to either build an early 1.1.8.gitcommit, or wait for 1.1.9.
Longtime readers might remember that I first blogged about SSE
intrinsics in
2008.
The problem is that I hit some issues with representing type
information (whether an SIMD pack contains integer, single float or
double float data), and didn’t manage to find a nice resolution. I
also had trouble with type operations (intersection, union, negation)
on (simdpack [type])
. For the first problem, I bit the bullet and
created many storage classes and primitive types; as shown above, a
function can accept arbitrary simdpack
, and so can VOPs: when
converting to primitive types, unspecialised simdpack
are
treated like integer simdpack
. The second, I solved by moving to
a type upgrading system inspired by array types. There are really only
three specialised simdpack
types: on integers, single floats and
double floats. Any specialised simdpack
type specifier must be
specialised on a recognisable subtype of one of these three types.
It’s a hack, but all type operations become trivial and easy to think
about.
The reason I finally sat down, reread my code and thought hard for a few days is Google’s Summer of Code. A student proposed to build on that work to vectorise standard sequence operations in CL, and I really didn’t want to be blocking their work. In the end we were only allotted two slots, and had to select a pair of extremely strong proposals from ten interesting submissions… more on that soon!
In the meantime, I hope the recent burst of activity and outside interest can motivate others to contribute to SBCL, even without Google support. We already have one person working on a contrib to exploit GMP for bignum and rational arithmetic, instead of our handrolled routines; I expect to see it merged in 1.1.9 or in 1.1.10.
P.S. I recently stumbled on Glassman’s FFT and it’s really nifty: an inorder generalsize FFT (not so fast when prime factors are huge) in approximately 50 LOC. The performance isn’t awesome on cached architectures, but shouldn’t be atrocious either… and it handles arbitrary DFT sizes. There are issues with numerical stability when computing the twiddle factors on the fly, and the access patterns could be improved, but it’s a very simple and interesting foundation… especially given that the inner loop is naturally vectorisable.
]]>EDIT: There’s a section with a bunch of general references.
EDIT 2: Added a note on genesis when playing with core formats.
The first step is probably to install git, and clone our repo
(the github mirror works well, and
lets you fork to your own github account for easy publication). Then,
building from source and installing SBCL (a local installation to
$HOME works fine) is obviously useful experience, and will be useful
to explore the source. Reading INSTALL should be enough to get
started on Linux or OS X and x86[64]. Other platforms may need more
work, and might not be the best choice if you’re not interested in
improving the port itself (although I’m told FreeBSD and Solaris work
very well on x86[64]). To build SBCL from source, you’ll need an
SBCL binary (bootstrapping from other CLs should work, but support
regularly bitrots away), and the usual C development tools
(e.g. buildessential on debian). A fancy build (./make.sh fancy
)
is probably the best choice for development.
You’ll also want to run the test suite often; better try it out now
(cd tests; sh runtests.sh
) to make sure you can get it working.
The test suite will barf if there’s nonASCII characters in the
environment. Oh my Zsh’s
git customisation systematically trips me up, for example (I currently
kludge it and start a bash from ~ and then run the tests).
Once SBCL HEAD is working and installed, it’s probably best to install emacs and SLIME. Quicklisp’s quicklispslimehelper can take care of installing SLIME. It is possible to work on SBCL without SLIME. However, SLIME has a lot of useful extensions, if only to explore the code base. If you’re not (and don’t wish to become) comfortable with emacs, it’s probably best to nevertheless use emacs and SLIME for the REPL, debugger, inspector, etc. and write code in your favourite editor. Later on, it’ll be useful to figure out how to make SLIME connect to a freshlybuilt SBCL.
I often see newcomers try to read the source like a book, and, once they realise there’s a lot of code, try to figure out a good order to read the source. I don’t think that’s the best approach. SBCL is pretty huge, and I doubt anyone ever simultaneously holds the complete system in their head. RAM’s “The Python Compiler for CMU Common Lisp” is still useful as an overview, and SBCL’s internals manual is a good supplement. Once you get close to bootstrapping logic, Christophe Rhodes’s “SBCL: a SanelyBootstrappable Common Lisp” helps understand the exclamation marks. Past that, I believe it’s preferable to start out small, learn just enough to get the current task done, and accept that some things just work, without asking how (for now).
In that spirit, I’d say M. (Alt period, Command period on some OS X emacsen) is the best way to explore most of SBCL’s source. SBCL’s build process preserves a lot of source location information, and M. queries that information to jump to the definitions for any given symbol (M, will pop back up to the previous location). For example, if you type “(truncate” at the REPL and hit M. (with the point on or just after “truncate”), you’ll find the out of line definition for truncate in (mostly) regular Common Lisp, optimisation rules regarding truncate, and VOPs, assembly language templates, for truncate called with a few sets of argument and return types. The out of line definition isn’t that interesting. The transforms, however, are. (VOPs aren’t useful if one isn’t comfortable with the platform’s assembly language, and mostly selfexplanatory otherwise.)
The one to “convert integer division to multiplication” is a very good
example. One could M. on deftransform
, and go down a very long
chain of definitions. Instead, I think it’s only essential to see
that the form defines a new rule, like a compiler macro, such that
compiletime values (lvars) that represent its two arguments are bound
to x and y, and the rule only triggers if its first argument is known
to be an unsigned word, and its second a constant unsigned word. If
that’s satisfied, the transformation still only triggers if the speed
optimisation quality is higher than both compilation speed and space
(code size).
Then, the constant value for y is extracted and bound to y, and a
conservative bound on the maximum value that x can take at runtime is
computed. If truncate by y should be handled elsewhere, the transform
gives up. Otherwise, it returns a form that will be wrapped in
(lambda (x y) ...)
and spliced in the call, instead of truncate.
To extend SBCL’s support for division by constants, it’s not necessary to understand more of SBCL’s compiler than the above. There’s no need to try and understand how deftransform works, only that it defines a rule to simplify calls to truncate. Similarly for lvarvalue and lvartype: the former extracts the value for constant lvars, and the latter the static type derived for that lvar (value at a program point). With time, knowledge will slowly accrete. However it’s possible, if not best, to start hacking without understanding the whole system. This approach will lead to a bit of cargo culting, but mentors and people on IRC will help make sure it doesn’t do any harm, and can explain more stuff if it’s interesting or à propos.
Working on the compiler itself is a bit more work. I think the best
approach is to go in src/compiler/main.lisp
and look for
compilecomponent
. ir1phases
loops on a component and performs
highlevel optimisations until fixpoint (or we get tired of waiting),
while %compilecomponent
handles the conversion to IR2 and then to
machine code. The compilation pipeline hasn’t really changed since
the Python paper was written, and the subphases each have their own
function (and file). M. on stuff that sounds interesting is probably
the best approach at the IR2 level.
The C and assembly runtime lives in src/runtime/
. There’s a lot of
stuff that’s symlinked or generated during the build, so it’s probably
best to look at it after a successful build. Sadly, we don’t track
source locations there, but {c,e,whatever}tags works; so does grep.
GC stuff is in the obvious suspects (gccommon, gencgc, gc, etc.), but
may end up affecting core loading/saving (core, coreparse, save).
Depending on what in the core loading code is affected, code in
genesis (the initial bootstrap that reads fasl files from the cross
compiler and builds the initial core file) might also have to be
modified (mostly in src/compiler/generic/genesis.lisp
). That’s…
more work. Like the project suggestions list says, when we change
things in the runtime, it sometimes ends up affecting a lot of other
components.
GDB tends to be less than useful, because of the many tricks SBCL
plays on itself. It’s usually hard to beat pen, paper, and printf.
At least, rebuilding the C runtime is quick: if the feature
:sbafterxccore
is enabled (which already happens for fancy
builds), slam.sh
should be able to rebuild only the C runtime, and
then continue the bootstrap with the rest of SBCL from the previous
build. That mostly leaves PCL to build, so the whole thing should
takes less than a minute on a decent machine.
I was replying to an email when I realised that some general compiler references would be useful, in addition to project and SBCL specific tips.
Christian Queinnec’s Lisp in Small Pieces gives a good overview of issues regarding compiling Lisplike languages. Andrew Appel’s Modern Compiler Implementation in ML is more, well, modern (I hear the versions in C and Java have the same text, but the code isn’t as nice… and ML is a very nice language for writing compilers). I also remember liking Appel’s Compiling with Continuations, but I don’t know if it’s particularly useful for CL or the projects we suggest.
For more complicated stuff, I believe Stephen Muchnick’s Advanced Compiler Design and Implementation would have been really nice to have, instead of slogging through code and dozens of papers. Allen & Kennedy’s Optimizing Compilers for Modern Architectures: A Dependencebased Approach is another really good read, but I’m not sure how useful it would be when working on SBCL: we still have a lot of work to do before reaching for the really sophisticated stuff (and what sophistication there is is fairly nonstandard).
I believe the Rabbit and Orbit compilers have influenced the design of CMUCL and SBCL. The Lambda papers provide some historical perspective, and the RABBIT and ORBIT theses are linked here.
What little magic remains in SBCL and CMUCL is the type derivation (constraint propagation) pass, and how it’s used to exploit a repository of sourcetosource transformations (deftransforms). The rest is bogstandard tech from the 70s or 80s. When trying to understand SBCL’s type derivation pass at a very high level, I remember finding Henry Baker’s The Nimble Type Inferencer for Common Lisp84 very useful, even though it describes a scheme that doesn’t quite work for Common Lisp (it’s very hard to propagate information backward while respecting the final standard). Kaplan and Ullman’s A Scheme for the Automatic Inference of Variable Types was also helpful.
Over the years, I’ve seen a couple of people come in with great ambition, and give up after some time, seemingly without having made any progress. I believe a large part of the problem is that they tried to understand all of SBCL instead of just learning the bare minimum to get hacking, and that their goal was too big. I already wrote that SBCL is probably best approached bit by bit, with some guidance from people who’ve been there before, and I hope the projects we suggest can all lead to visible progress quickly, after a couple days or two weeks of work at most.
Still, before investing my time, I like to see the other person also give some of theirs to SBCL. This is why, as I wrote on the mailing list last week, I’m much more inclined to help someone who’s already built SBCL on their own and has submitted a patch that’s been committed or is being improved on the mailing list. I absolutely do not care what the patch is; it can be new code, a bugfix for a highly unlikely corner case, better documentation, or spelling and grammar corrections in comments. The bugs tagged as easy in our bugtracker may provide some inspiration. However trivial a patch might seem, it’s still a sign that someone is willing to put the work in to concretely make SBCL better, and I like that… it’s also a minimal test to make sure the person is able to work with our toolchain. (This isn’t SBCL policy for GSoC. It’s simply how I feel about these things.)
Again, I’m amazed by the number of people who wish to hack on SBCL this summer (as part of Google’s Summer of Code or otherwise). Because of that, I think it’s important to note that this is our first year, and so we’ll likely not have more than two or three spots. However, I always like seeing more contributors, and I hope anyone who’d like to contribute will always be guided, GSoC or not.
Finally, I’ll note that Google’s Summer of Code program was only a good excuse to write up our list of projects: they’re simply suggestions to incite programmers to see what they can do that is useful for SBCL and, most importantly, is interesting for them. Anyone should feel welcome to work on any of these projects, even if they’re not eligible or chosen for GSoC. They’re also only suggestions; if someone has their own idea, we can likely help them out just the same.
]]>What’s wrong with Taylor approximations? Why the need to specify a range?
Why are the results different for single and double floating point arithmetic? Doesn’t rounding each coefficient to the closest float suffice?
Why do I deem only certain approximations to be interesting and others not, and how can there be so few?
In this post, I attempt to provide answers to these interrogations, and sketch how I exploited classic operations research (OR) tools to enter the numerical analysts’ playground.
The final section describe how I’d interpret the catalogue when coding quick and slightly inaccurate polynomial approximations. Such lossy approximations seem to be used a lot in machine learning, game programming and signal processing: for these domains, it makes sense to allow more error than usual, in exchange for faster computations.
The CL code is all up in the repository for rationalsimplex, but it’s definitely researchgrade. Readers beware.
The natural way to approximate functions with polynomials (particularly if one spends time with physicists or engineers) is to use truncated Taylor series. Taylor approximations are usually easy to compute, but only provide good approximations over an infinitesimal range. For example, the degree1 Taylor approximation for \(\exp\) centered around 0 is \(1 + x\). It’s also obviously suboptimal, if one wishes to minimise the worstcase error: the exponential function is convex, and gradients consistently underapproximate convex functions.
Another affine function, \(1.26 + 1.18x\), intersects \(\exp\) in two points and is overall much closer over \([1, 1]\). In fact, this latter approximation (roughly) minimises the maximal absolute error over that range: it’s clearly not as good as the Taylor approximation in the vicinity of 0, but it’s also much better in the worst case (all bets are off outside \([1, 1]\)). That’s why I’m (or anyone’s libm, most likely) not satisfied by Taylor polynomials, and instead wish to compute approximations that minimise the error over a known range; functionspecific identities can be exploited to reduce any input to such a range.
As far as I know, the typical methods to find polynomial approximation that minimise the maximal error (minimax polynomials) are iterative algorithms in the style of the Remez exchange algorithm. These methods exploit real analysis results to reduce the problem to computing minimax polynomials over very few points (one per coefficient, i.e. one more than the degree): once an approximation is found by solving a linear equation system, error extrema are computed and used as a basis to find the next approximation. Given arbitraryprecision arithmetic and some properties on the approximated function, the method converges. It’s elegant, but depends on highprecision arithmetic.
Instead, I reduce the approximation problem to a sequence of linear optimisation programs. Exchange algorithms solve minimax subproblems over exactly as many points as there are coefficients: the fit can then be solved as a linear equation. I find it simpler to use many more constraints than there are coefficients, and solve the resulting optimisation problem subject to linear inequalities directly, as a linear program.
There are obviously no cycling problems in this cutting planes approach (the set of points grows monotonically), and all the machinery from exchange algorithms can be reused: there is the same need for a good set of initial points and for determining error extrema. The only difference is that points can always be added to the subproblem without having to remove any, and that we can restrict points to correspond to floating values (i.e. values we might actually get as input) without hindering convergence. The last point seems pretty important when looking at highprecision approximations.
For example, let’s approximate the exponential function over \([1, 1]\) with an affine function. The initial points could simply be the bounds, 1 and 1. The result is the line that passes by \(\exp(1)\) and \(\exp(1)\), approximately \(1.54 + 1.18x\). The error is pretty bad around 0; solving for the minimax line over three points (1, 0 and 1) yields a globally optimal solution, approximately \(1.16 + 1.18x\).
There’s a lot of meat to wrap around this bone. I use a computable reals package in Common Lisp to pull arbitraryprecision rational approximations for arithmetic expressions; using libm directly would approximate an approximation, and likely cause strange results. A variant of Newton’s algorithm (with bisection steps) converges to error extrema (points which can be added to the linear subproblem); arbitrary precision real arithmetic is very useful to ensure convergence down to machine precision here. Each linear subproblem is solved with an exact simplex algorithm in rational arithmetic, and convergence is declared when the value of the error extrema found in the Newton steps correspond to that estimated by the subproblem. Finally, the fact that the input are floatingpoint values is exploited by ensuring that all the points considered in the linear subproblems correspond exactly to FP values: rather than only rounding a point to the nearest FP value, its two immediately neighbouring (predecessor and successor) FP values were also added to the subproblem. Adding immediate neighbours helps skip iterations in which extrema move only by one ULP.
Initialising the method with a good set of points is essential to obtain reasonable performance. The Chebyshev nodes are known to yield nearly optimal [PDF] initial approximations. The LPbased approach can exploit a large number of starting points, so I went with very many initial Chebyshev nodes (256, for polynomials of degree at most 16), and, again, adjoined three neighbouring FP values for each point. It doesn’t seem useful to me to determine the maximal absolute error very precisely, and I declared convergence when the value predicted by the LP relaxation was off by less than 0.01%. Also key to the performance were tweaks in the polynomial evaluation function to avoid rational arithmetic until the very last step.
The previous section gives one reason why there are different tables for single and double float approximations: the ranges of input considered during optimisation differ. There’s another reason that’s more important, particularly for single float coefficients: rounding coefficients can result in catastrophic error blowups.
For example, rounding \(2.5 x\sp{2}\) to the nearest integer coefficient finds either \(2 x\sp{2}\) or \(3 x\sp{2}\). However \(2 x\sp{2} + x\sp{3}\) is also restricted to integer coefficients, but more accurate over \([0, .5]\). Straight coefficientwise rounding yields an error of \(\frac{1}{2} x\sp{2}\), versus \(\frac{1}{2} x\sp{2}  x\sp{3}\) for the degree3 approximation. As the following graph shows, the degree3 approximation is much closer for the range we’re concerned with.
Floating point values are more densely distributed than the integers for the range of values we usually encounter in polynomial approximations, but still discrete, and the same phenomenon crops up, at a smaller scale.
The state of the art for this version of the approximation problem seems to be based on an initial Remez step, followed by a reduction to a CVP.
The cutting planes approach points to a natural solution from the OR world: a branch and cut method. The issue with the cutting planes solution is that, while I only generate points that correspond to FP values, the decision variables (coefficients) are free to take arbitrary rational values.
Branching can be used to split up a decision variable’s range and eliminate from consideration a range of nonFP values. The decision variables’ upper and lower bounds are tightened in a large number of subproblems; solutions gradually converge to allFP values and their optimality is then proven.
Embedding the cutting planes method in a branch and bound lets us find polynomial approximations with float coefficients that minimise the maximal error over float arguments, with correctly rounded powers. The only remaining simplification is that we assume that the dot product in the polynomial evaluation is errorfree. Sadly, removing this simplification results in a horrible discrete optimisation problem with a very bumpy objective function. I’m not sure that any exact approach can solve this in reasonable time. Still, there are ways to evaluate FP polynomials very accurately, and I’m mostly interested in approximations to trade accuracy for speed, so rounding errors may well be negligible compared to approximation errors.
As in all branch and bound methods, a problem is split in subproblems by restricting the range of some coefficient to exclude some infeasible (nonFP) values. For example, in the previous example (in which we’re looking for integer coefficients), the 2.5 coefficient would lead to two subproblems, one in which the degree2 coefficient is at most 2 \((\lfloor 2.5\rfloor)\), and another in which it’s at least 3 \((\lceil2.5 \rceil)\), excluding all the fractional values between 2 and 3. For floats, we restrict to the closest floats that under or over approximate the current value.
However, instead of solving the full continuous relaxation (which is impractical, given the large number of potential argument values), our subproblems are solved with cutting planes. The trick is that cuts (constraints, which correspond to argument values) from any branch can be used everywhere else. Thus, the global point pool is shared between all subproblems, rather than regenerating it from scratch for each subproblem. In practice, this lifting of cutting planes to the root problem seems essential for efficiency; that’s certainly what it took for branchandcut MIP solvers to take off.
I generate cuts at the root, but afterwards only when a subproblem yields allFP values. This choice was made for efficiency reasons. Finding error extrema is relatively slow, but, more importantly, adding cuts really slows down reoptimisation: the state of the simplex algorithm can be preserved between invocations, and warmstarting tends to be extremely efficient when only some variables’ bounds have been modified. However, it’s also necessary to add constraints when an incumbent is allFP, lest we prematurely declare victory.
There are two really important choices when designing branch and bound algorithms: how the branching variable is chosen, and the order in which subproblems are explored. With polynomial approximations, it seems to make sense to branch on the coefficient corresponding to the lowest degree first: I simply scan the solution and choose the leastdegree coefficient that isn’t represented exactly as a float. Nodes are explored in a hybrid depthfirst/bestfirst order: when a subproblem yields children (its objective value is lower than the current best feasible solution, and it isn’t feasible itself), the next node is its child corresponding to the bound closest to the branching variable’s value, otherwise the node with the least predicted value is chosen. The depthfirst/closestfirst dives quickly converge to decent feasible solutions, while the bestfirst choice will increase the lower bound, bringing us closer to proving optimality.
A randomised rounding heuristic is also used to provide initial feasible (allFP) solutions, and the incumbent is considered close enough to optimal when it’s less than 5% off from the bestknown lower bound. When the method returns, we have a polynomial approximation with coefficients that are exactly representable as single or double floats (depending on the setting), and which (almost) minimises the approximation error on float arguments.
The branch and cut can be used to determine an (nearly) optimally accurate FP polynomial approximation given a maximum degree. However, the degree isn’t the only tunable to accelerate polynomial evaluation: some coefficients are nicer than others. Multiplying by 0 is obviously very easy (nothing to do), while multiplication by +/1 is pretty good (no multiplication), and by +/ 2 not too bad either (strengthreduce the multiplication into an addition). It would be possible to consider other integers, but it doesn’t seem to make sense: on current X86, floating point multiplication is only 3366% slower (more latency) than FP addition, and fiddling with the exponent field means pingponging between the FP and integer domains.
There are millions of such approximations with a few “nice” coefficients, even when restricting the search to low degrees (e.g. 10 or lower). However, the vast majority of them will be wildly inaccurate. I decided to only consider approximations that are at least as accurate as the best approximation of degree three lower: e.g. a degree3 polynomial with nice coefficients is only (potentially) interesting if it’s at least as accurate as a constant. Otherwise, it’s most likely even quicker to just use a lowerdegree approximation.
That’s not enough: this filter still leaves thousands of polynomials. Most of the approximations will be dominated by another; what’s the point in looking at a degree4 polynomial with one coefficient equal to 0 if there’s a degree3 with one zero that’s just as accurate? The relative importance of the degree, and the number of zeroes, ones and twos will vary depending on the environment and evaluation technique. However, it seems reasonable to assume that zeroes are always at least as quick to work with as ones, and ones as quick as twos. The constant offset ought to be treated distinctly from other coefficients. It’s only added rather than multiplied, so any speedup is lower, but an offset of 0 is still pretty nice, and, on some architectures, there is special support to load constants like 1 or 2.
This lets me construct a simple but robust performance model: a polynomial is more quickly evaluated than another if it’s of lower or same degree, doesn’t have more nonzero, non{zero, one} or non{zero, one, two} multipliers, and if its constant offset is a nicer integer (0 is nicer than +/ 1 is nicer than +/ 2 is nicer than arbitrary floats).
With these five metrics, in addition to accuracy, we have a multiobjective optimisation problem. In some situations, humans may be able to determine a weighting procedure to bring the dimension down to a scalar objective value, but the procedure would be highly domainspecific. Instead, we can use this partial order to report solutions on the Pareto front of accuracy and efficiency: it’s only worth reporting an approximation if there is no approximation that’s better or equal in accuracy and in all the performance metrics. The performance and accuracy characteristics are strongly correlated (decreasing the degree or forcing nice coefficients tends to decrease the accuracy, and a zero coefficient is also zeroorone, etc.), so it’s not too surprising that there are so few nondominated solutions.
The branch and cut can be used to find the most accurate approximation, given an assignment for a few values (e.g. the constant term is 1, or the first degree coefficient 0). I’ll use it as a subproblem solver, in a more exotic branch and bound approach.
We wish to enumerate all the partial assignments that correspond to nottoohorrible solutions, and save those. A normal branch and bound can’t be applied directly, as one of the choices is to leave a given variable free to take any value. However, bounding still works: if a given partial assignment leads to an approximation that’s too inaccurate, the accuracy won’t improve by fixing even more coefficients.
I started with a search in which children were generated by adjoining one fixed value to partial assignments. So, after the root node, there could be one child with the constant term fixed to 0, 1 or 2 (and everything else free), another with the first coefficient fixed to 0, 1 or 2 (everything else left free), etc.
Obviously, this approach leads to a search graph: fixing the constant term to 0 and then the first coefficient to 0, or doing in the reverse order leads to the same partial assignment. A hash table ensures that no partial assignment is explored twice. There’s still a lot of potential for wasted computation: if bounding lets us determine that fixing the constant coefficient to 0 is worthless, we will still generate children with the constant fixed to 0 in other branches!
I borrowed a trick from the SAT solving and constraint programming communities, nogood recording. Modern SAT solvers go far beyond strict branching search: in practice, it seems that good branching orders will lead to quick solutions, while a few bad ones take forever to solve. Solvers thus frequently reset the search tree to a certain extent. However, information is still communicated across search trees via learned clauses. When the search backtracks (an infeasible partial assignment is found), a nogood set of conflicting partial assignments can be computed: no feasible solution will include this nogood assignment.
Some time ago, Vasek Chvatal introduced Resolution Search to port this idea to 0/1 optimisation, and Marius Posta, a friend at CIRRELT and Université de Montréal, extended it to general discrete optimisation [PDF]. The complexity mostly comes from the desire to generate partial assignments that, if they’re bad, will merge well with the current set of learned clauses. This way, the whole set of nogoods (or, rather, an approximation that suffices to guarantee convergence) can be represented and queried efficiently.
There’s no need to be this clever here: each subproblem involves an exact (rational arithmetic) branch and cut. Simply scanning the set of arbitrary nogoods to look for a match significantly accelerates the search. The process is sketched below.
The search is further accelerated by executing multiple branch and cuts and nogood scans in parallel.
The size of the search graph is also reduced by only fixing coefficients to a few values. There’s no point in forcing a coefficient to take the value of 0, 1 or 2 (modulo sign) if it already does. Thus, a coefficient with a value of 0 is left free; otherwise a child extending the partial assignment with 0 is created. Similarly, a child extended with a value of 1 is only generated if the coefficient isn’t already at 0 or 1 (we suppose that multiplication by 0 is at least as efficient as by 1), and similarly for 2. Finally, coefficients between 0 and 1 are only forced to 0 or 1, and those between 1 and 3 to 0, 1 or 2. If a coefficient takes a greater absolute value than 3, fixing it most probably degrades the approximation too strongly, and it’s left free – then again such coefficients only seem to happen on really hardtoapproximate functions like \(\log\) over \([1, 2]\). Also, the last coefficient is never fixed to zero (that would be equivalent to looking at lowerdegree approximations, which was done in previous searches). Of course, with negative coefficients, the fixed values are negated as well.
This process generates a large number of potentially interesting polynomial approximations. The nondominated ones are found with a straight doubly nested loop, with a slight twist: accuracy is computed with machine floating point arithmetic, thus taking rounding into account. The downside is that it’s actually approximated, by sampling fairly many (the FP neighbourhood of 8K Chebyshev nodes) points; preliminary testing indicates that’s good enough for a relative error (on the absolute error estimate) lower than 1e5. There tends to be a few exactly equivalent polynomials (all the attributes are the same, including accuracy – they only differ by a couple ULP in a few coefficients); in that case, one is chosen arbitrarily. There’s definitely some lowhanging fruit to better capture the performance partial order; the error estimate is an obvious candidate. The hard part was generating all the potentially interesting approximations, though, so one can easily rerun the selection algorithm with tweaked criteria later.
I’m not sure what functions are frequently approximated over what ranges, so I went with the obvious ones: \(\cos\) and \(\sin\) over \([\pi/2, \pi/2]\), \(\exp\) and arctan over \([1, 1]\) or \([0, 1]\), \(\log\) over \([1, 2]\), and \(\log 1+x\) and \(\log\sb{2} 1+x\) over \([0, 1]\). This used up a fair amount of CPU time, so I stopped at degree 16.
Each file reports the accuracy and efficient metrics, then the coefficients in floating point and rational form, and a hash of the coefficients to identify the approximation. The summary columns are all aligned, but each line is very long, so the files are best read without line wrapping.
For example, if I were looking for a fairly good approximation of degree 3 for \(\exp\) in single floats, I’d look at expdegreelb_errornon_zeronon_onenon_twoconstanterror.
The columns report the accuracy and efficiency metrics, in the order used to sort approximations lexicographically:
After that, separated by pipes, come the coefficients in float form, then in rational form, and the MD5 hash of the coefficients in a float vector (in a contiguous vector, in increasing order of degree, with X86’s littleendian signmagnitude representation). The hash might also be useful if you’re worried that your favourite implementation isn’t parsing floats right.
There are three polynomials with degree 3, and they all offer approximately the same accuracy (lb_error = 10). I’d choose between the most accurate polynomial
1


or one with a nicer multiplier that doesn’t even double the maximum error
1


On the other hand, if I were looking for an accurateenough approximation of \(\log 1+x\), I’d open log1pxlb_errordegreeerrornon_zeronon_onenon_twoconstant.
The columns are in the order used for the lexicographic sort:
An error around 1e4 would be reasonable for my needs, and
log1px6AE509
seems interesting: maximum error is around 1.5e4,
it’s degree 4, the constant offset is 0 and the first multiplier 1.
If I needed a bit more accuracy (7.1e5), I’d consider log1pxE8200B
:
it’s degree 4 as well, and the constant is still 0.
It seems to me optimisation tools like approximation generators are geared toward fire and forget usage. I don’t believe that’s a realistic story: very often, the operator will have a fuzzy range of acceptable parameters, and presenting a small number of solutions with fairly close matches lets them exploit domainspecific insights. In this case, rather than specifying fixed coefficients and degree or accuracy goals, users can scan the indices and decide whether each tradeoff is worth it or not. That’s particularly true of the single float approximations, for which the number of possibilities tends to be tiny (e.g. eight nondominated approximations for \(\sin\)).
]]>In an earlier post, I noted how tedious coding unrolled sorts can be. Frankly, that’s the main reason I stopped at leaf sorts of size three. Recently, Neil Toronto wrote a nice post on the generation of sizespecialised merge sorts. The post made me think about that issue a bit more, and I now have a neat way to generate unrolled/inlined merge sorts that are significantly smaller than the comparison and size “optimal” inlined merge sorts.
The code is up as a singlefile library, and sorts short vectors faster than SBCL’s inlined heapsort by a factor of two to three… and compiles to less machine code. The generator is a bit less than 100 LOC, so I’m not sure I want to include it in the mainline yet. If someone wants to add support for more implementations, I’d be happy to extend Tabasco sort, and might even consider letting it span multiple files ;)
The inlined merge sort for three values (a
, b
, and c
) is copied
below. It has to detect between \(3! = 6\) permutation, and does so
with an optimal binary search tree. That scheme leads to code with
\(n!  1\) comparisons to sort n values, and for which each
execution only goes through two or three comparisons (\(\approx \lg n!\)).
1 2 3 4 5 6 7 8 9 10 11 

An optimal sorting network for three values needs only three comparisons, and always executes those three comparisons.
1 2 3 4 5 6 7 

Finally, the leaf sort I used in SBCL is smaller than the inlined merge sort (three comparisons), but sometimes executes fewer than three comparisons. It’s superoptimal ;)
1 2 3 4 5 6 7 8 

The optimal merge sort is larger than the optimal sorting network, and the optimal sorting network performs potentially more comparisons than the optimal merge sort…
Each implementation is optimal for different search spaces: the optimal merge sort never merges continuations, and the sorting network’s only control dependencies are in the conditional swaps.
The “superoptimal” merge sort does better by allowing itself both assignments (or tailcalls) and nontrivial control flow: it’s smaller than the inlined merge sort (but performs more data movement), and potentially executes fewer comparisons than the sorting network (with a larger total code size). And, essential attribute in practice, it’s easy to generate. This contrasts with optimal sorting networks, for which we do not have any generation method short of brute force search; in fact, in practice, sorting networks tend to exploit suboptimal (by a factor of \(\log n\)) schemes like bitonic sort or oddeven merge sort. Then again, we’re only concerned with tiny sorts, and asymptotics can be misleading: Batcher’s oddeven merge sort happens to be optimal for \(n\leq 8\). The issue with sorting networks remains: dataoblivious control flow pessimises their comparison count.
What the last merge sort does is to first sort both halves of the
values (a
is trivially sorted, and b c
needs one conditional
swap), and then, assuming that each half (a
and b c
) is sorted,
find the right permutation with which to merge them. Rather than
\(n!\) permutations, a merge only needs to distinguish between
\(C(n, \lfloor n/2\rfloor) = \frac{n!}{\lfloor n/2\rfloor!\lceil n/2\rceil!}\)
permutations, and the recursive sorts are negligible compared to the
merge step. That’s a huge reduction in code size!
A simple merge generator fits in half a screenful.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 

Given two lists of sorted variables, emitmerge
calls emitmerge1
to generate code that finds the right permutation, and executes it at
the leaf. A binary search tree is generated by keeping track of the
merged list in a reverseorder (to enable tailsharing) accumulator of
variable names. As expected, when merging a list of length one with
another of length two, we get pretty much the code I wrote by hand
earlier.
CLUSER> (emitmerge '(a) '(b c))
(if (< b a)
(if (< c a)
(setf (values a b c) (values b c a))
(setf (values a b c) (values b a c)))
(setf (values a b c) (values a b c)))
There’s one striking weakness: we generate useless code for the
identity permutation. We could detect that case, or, more generally,
we could find the cycle decomposition of each permutation and use it
to minimise temporary values; that’d implicitly take care of cases
like (setf (values a b c) (values b a c))
, in which some values are
left unaffected.
I’ll represent permutations as associative lists, from source to destination. Finding a cycle is easy: just walk the permutation from an arbitrary value until we loop back.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 

To generate the code corresponding to a permutation, I can extract all
the cycles, execute each cycle with a rotatef
.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 

The merge step for a sort of size three is now a bit more explicit, but likely compiles to code that uses fewer registers as well. It probably doesn’t matter on good SSAbased backends, but those are the exception rather than the norm in the Lisp world.
CLUSER> (emitmerge '(a) '(b c))
(if (< b a)
(if (< c a)
(progn (rotatef a b c))
(progn (rotatef a b)))
(progn))
The only thing missing for a merge sort is to add base cases and recursion. The base case is easy: lists of length one are sorted. Inlining recursion is trivial, as is usually the case when generating Lisp code.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 

The resulting threevalue sorter looks good; there are some
redundancies with nested or empty progn
s, but any halfdecent
compiler will take care of that. Python certainly does a goob job on
that code.
CLUSER> (emitsort '(a b c))
(progn
nil
(progn
nil
nil
(if (< c b)
(progn (rotatef b c))
(progn)))
(if (< b a)
(if (< c a)
(progn (rotatef a b c))
(progn (rotatef a b)))
(progn)))
CLUSER> (disassemble (lambda (a b c)
(declare (type fixnum a b c))
(inlinesort a b c)))
; disassembly for (lambda (a b c))
; 0E88F150: 498BD0 mov rdx, r8 ; noargparsing entry point
; 53: 498BC9 mov rcx, r9
; 56: 498BDA mov rbx, r10
; 59: 4D39CA cmp r10, r9
; 5C: 7C30 jl L3
; 5E: L0: 4C39C1 cmp rcx, r8
; 61: 7D0B jnl L1
; 63: 4C39C3 cmp rbx, r8
; 66: 7C1B jl L2
; 68: 488BD1 mov rdx, rcx
; 6B: 498BC8 mov rcx, r8
; 6E: L1: 488BF9 mov rdi, rcx
; 71: 488BF3 mov rsi, rbx
; 74: 488D5D10 lea rbx, [rbp+16]
; 78: B906000000 mov ecx, 6
; 7D: F9 stc
; 7E: 488BE5 mov rsp, rbp
; 81: 5D pop rbp
; 82: C3 ret
; 83: L2: 488BD1 mov rdx, rcx
; 86: 488BCB mov rcx, rbx
; 89: 498BD8 mov rbx, r8
; 8C: EBE0 jmp L1
; 8E: L3: 498BCA mov rcx, r10
; 91: 498BD9 mov rbx, r9
; 94: EBC8 jmp L0
I thought about generating calls to values
in the final merge,
rather than permuting, but decided against: I know SBCL doesn’t
generate clever code when permuting registers, and that’d result in
avoidable spills. I also considered generating code in CPS rather
than emitting assignments; again, I decided against because I can’t
depend on SBCL to emit clever permutation code. The transformation
would make sense in a dialect with weaker support (both compiler and
social) for assignment.
Both this inline merge sort and the original, permutationfree (except at the leaves), one actually define the exact same algorithms. For any input, both (should) execute the same comparisons in the same order: the original inline merge sort simply inlines the whole set of execution traces, without even merging control flow.
The permutationfree sort guarantees that it never performs redundant comparisons. Whether it performs the strict minimum number of comparisons, either on average or in the worst case, is another question. At first, I thought that \(\lg n!\) should be a good estimate, since the search tree seems optimally balanced. The problem is \(n!\) tends to have many other prime factors than 2, and we can thus expect multiple comparisons to extract less than 1 bit of information, for each execution. The lower bound can thus be fairly far from the actual value… Still, this question is only practically relevant for tiny sorts, so the discrepancy shouldn’t be too large.
A simple way to get the minimum, average or maximum comparison count would be to annotate the permutationfree generator to compute the shortest, average and longest path as it generates the search tree.
I’ll instead mimic the current generator.
The first step is to find the number of comparisons to perform a
merge of m
and n
values.
If either m
or n
is zero, merging the sequences is trivial.
Otherwise, the minimum number of comparisons is (min m n)
: the
sequences are presorted, and the shortest sequence comes first. The
maximum is (1 (+ m n))
. I couldn’t find a simple expression for
the average over all permutations. Instead, I iterate over all
possible combinations and count the number of comparisons until either
subsequence is exhausted.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 

Counting the number of comparisons in sorts is then trivial, with a recursive function. I didn’t even try to memoise repeated computations: the generated code is ludicrously long when sorting as few as 13 or 14 values.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 

CLUSER> (loop for i from 2 upto 16
do (multiplevaluebind (min avg max)
(sortcount i)
(format t "~4D ~6,2F ~6D ~6,2F ~6D~%"
i (log (! i) 2d0)
min (float avg) max)))
;; n lg(n!) min avg max ; best network
2 1.00 1 1.00 1 ; 1 1
3 2.58 2 2.67 3 ; 3 3
4 4.58 4 4.67 5 ; 5 5
5 6.91 5 7.17 8 ; 7 9
6 9.49 7 9.83 11 ; 10 12
7 12.30 9 12.73 14 ; 13 16
8 15.30 12 15.73 17 ; 16 19
9 18.47 13 19.17 21 ; 19 25?
10 21.79 15 22.67 25 ; 22 29?
11 25.25 17 26.29 29 ; 26 35?
12 28.84 20 29.95 33 ; 30 39?
13 32.54 22 33.82 37 ; 34 45?
14 36.34 25 37.72 41 ; 38 51?
15 40.25 28 41.69 45 ; 42 56?
16 44.25 32 45.69 49 ; 46? 60?
I annotated the output with comments (marked with semicolons). The columns are, from left to right, the sort size, the theoretical lower bound (on the average or maximum number of comparisons), the minimum number of comparisons (depending on the input permutation), the average (over all input permutations), and the maximum count. I added two columns by hand: the optimal worstcase (maximum) comparison counts (over all sorting methods, copied from the OEIS), and the optimal size for sorting networks, when known (lifted from a table here [pdf]). Inexact (potentially higher than the optimum) bounds are marked with a question mark.
For the input size the inline merge sort can reasonably tackle (up to ten or so), its worstcase is reasonably close to the best possible, and its average case tends to fall between the lower bound and the best possible. Over all these sizes, the merge sort’s worst case performs fewer comparisons than the optimal or bestknown sorting networks. The current best upper bounds on the minimal worstcase comparison count seem to be based on insertion sort passes that minimise the number of comparisons with a binary search. I don’t believe that’s directly useful for the current use case, but a similar trick might be useful to reduce the number of comparisons, at the expense of reasonably more data movement.
That’s already a decent proof of concept. It’s also far too plain to fit in the industrialstrength Common Lisp way. An inline sorting macro worthy of CL should be parameterised on both comparator and key functions, and work with arbitrary places rather than only variables.
Parameterising the comparator is trivial. The key could be handled by calling it at each comparison, but that’s wasteful. We’re generating code; might as well go for glory. Just like in the list merge sort, I’ll cache calls to the key function in the merge tree. I’ll also use special variables instead of passing a halfdozen parameters around in the generator.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 

Finally, handling arbitrary places, thus letting the macro take care of writing results back to the places, is just regular macrology.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 

Now, the macro can be used to sort, e.g., vectors of double floats “inplace” (inasmuch as copying everything to registers can be considered inplace).
CLUSER> (macroexpand1 `(inlinesort (#'< :key #')
(aref array 0) (aref array 1) (aref array 2)))
(let* ((#:comparator1184 #'<)
(#:comparator1184
(if (functionp #:comparator1184)
#:comparator11184
(symbolfunction #:comparator1184)))
(#:key1185 #')
(#:key1185
(if (functionp #:key1185)
#:key1185
(symbolfunction #:key1185)))
(#:overwrite1186 t)
(#:array1195 array)
(#:array1192 array)
(#:array1189 array)
(#:temp1193 (aref #:array1195 0))
(#:temp1190 (aref #:array1192 1))
(#:temp1187 (aref #:array1189 2)))
(declare (ignorable #:comparator1184 #:key1185 #:overwrite1186))
(progn
nil
(progn
nil
nil
(let ((#:leftheadkey1196 (funcall #:key1185 #:temp1190))
(#:rightheadkey1197 (funcall #:key1185 #:temp1187)))
(if (funcall #:comparator1184 #:rightheadkey1197 #:leftheadkey1196)
(progn (rotatef #:temp1190 #:temp1187))
(progn))))
(let ((#:leftheadkey1198 (funcall #:key1185 #:temp1193))
(#:rightheadkey1199 (funcall #:key1185 #:temp1190)))
(if (funcall #:comparator1184 #:rightheadkey1199 #:leftheadkey1198)
(let ((#:rightheadkey1199 (funcall #:key1185 #:temp1187)))
(if (funcall #:comparator1184 #:rightheadkey1199
#:leftheadkey1198)
(progn (rotatef #:temp1193 #:temp1190 #:temp1187))
(progn (rotatef #:temp1193 #:temp1190))))
(progn))))
(when #:overwrite1186
(let ((#:new1194 #:temp1193))
(sbkernel:%aset #:array1195 0 #:new1194))
(let ((#:new1191 #:temp1190))
(sbkernel:%aset #:array1192 1 #:new1191))
(let ((#:new1188 #:temp1187))
(sbkernel:%aset #:array1189 2 #:new1188)))
(values #:temp1193 #:temp1190 #:temp1187))
t
CLUSER> (disassemble (lambda (array)
(declare (type (simplearray doublefloat (3)) array))
(inlinesort (#'< :key #')
(aref array 0) (aref array 1) (aref array 2))
array))
; disassembly for (lambda (array))
; 0C5A5661: F20F105201 movsd XMM2, [rdx+1] ; noargparsing entry point
; 666: F20F104209 movsd XMM0, [rdx+9]
; 66B: F20F104A11 movsd XMM1, [rdx+17]
; 670: 660F28E0 movapd XMM4, XMM0
; 674: 660F5725A4000000 xorpd XMM4, [rip+164]
; 67C: 660F28D9 movapd XMM3, XMM1
; 680: 660F571D98000000 xorpd XMM3, [rip+152]
; 688: 660F2FDC comisd XMM3, XMM4
; 68C: 7A02 jp L0
; 68E: 7267 jb L3
; 690: L0: 660F28DA movapd XMM3, XMM2
; 694: 660F571D84000000 xorpd XMM3, [rip+132] ; negate doublefloat
; 69C: 660F28E0 movapd XMM4, XMM0
; 6A0: 660F572578000000 xorpd XMM4, [rip+120]
; 6A8: 660F2FE3 comisd XMM4, XMM3
; 6AC: 7A26 jp L1
; 6AE: 7324 jnb L1
; 6B0: 660F28E1 movapd XMM4, XMM1
; 6B4: 660F572564000000 xorpd XMM4, [rip+100]
; 6BC: 660F2FE3 comisd XMM4, XMM3
; 6C0: 7A27 jp L2
; 6C2: 7325 jnb L2
; 6C4: 660F28DA movapd XMM3, XMM2
; 6C8: 660F28D0 movapd XMM2, XMM0
; 6CC: 660F28C1 movapd XMM0, XMM1
; 6D0: 660F28CB movapd XMM1, XMM3
; 6D4: L1: F20F115201 movsd [rdx+1], XMM2
; 6D9: F20F114209 movsd [rdx+9], XMM0
; 6DE: F20F114A11 movsd [rdx+17], XMM1
; 6E3: 488BE5 mov rsp, rbp
; 6E6: F8 clc
; 6E7: 5D pop rbp
; 6E8: C3 ret
; 6E9: L2: 660F28DA movapd XMM3, XMM2
; 6ED: 660F28D0 movapd XMM2, XMM0
; 6F1: 660F28C3 movapd XMM0, XMM3
; 6F5: EBDD jmp L1
; 6F7: L3: 660F28D8 movapd XMM3, XMM0
; 6FB: 660F28C1 movapd XMM0, XMM1
; 6FF: 660F28CB movapd XMM1, XMM3
; 703: EB8B jmp L0
The inline sort supports the same options as CL:SORT
, so it’d be
really interesting to opportunistically compile calls to the latter
into sizespecialised inline sort. The usual, portable, way to code
that sort of macro qua sourcetosource optimiser in CL is with
compiler macros; compiler macros have access to all the usual
macroexpansiontime utility, but the function definition is left in
place. That way the user can still use the function as a firstclass
function, and the compilermacro can decline the transformation if a
regular call would work better (and the compiler can ignore any
compiler macro). That’s not enough for our needs, though… and there
can only be one compiler macro per function, so adding one to code we
don’t own is a bad idea.
Python’s first internal representation (ir1) is optimised by iteratively deriving tighter type information, and (mostly) pattern matching on the type of function calls. Its DEFTRANSFORM form lets us add new rules, and there may be an arbitrary number of such rules for each function.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 

The two deftransforms at the end define new rules that match on calls
to CL:SORT
and CL:STABLESORT
, with arbitrary argument types and
return types: basic type checks are performed elsewhere, and
maybeemitunrolledmergesort
does the rest. Transforms are
identified by the docstring (which also improve compiler notes), and
the argument and return types, so the forms are reevaluationsafe.
All the logic lies in maybeemitunrolledmergesort
. The policy
form checks that the optimisation policy at the call node has speed
greater than space
, and gives up on the transformation otherwise.
The next step is to make sure the sequence argument is an array, and
that its dimensions are known and define a vector (its dimension list
is a list of one number). The final guard makes sure we only
specialise on small sorts (at most
*unrolledvectorsortmaxlength*
).
Finally, we get to code generation itself. A vector of length 1 or 0
is trivially presorted. I could also directly emit the inner
inlinesort
form, but SBCL has some stuff to hoist out computations
related to hairy arrays. witharraydata
takes a potentially
complex array (e.g. displaced, or not a vector), and binds the
relevant variables to the underlying simple array of rank 1, and the
start and end indices corresponding to the range we defined
(defaulting to the whole range of the input array). Bound checks are
eliminated because static information ensures the accesses are safe
(or the user lied and asked not to insert type checks earlier), and
the start
index is declared to be small enough that we can add to it
without overflow – Python doesn’t implement the sort of sophisticated
shape analyses that could figure that out. Finally, a straight
inlinesort
form can be emitted.
That machinery means we’ll get quicker and shorter inline sort code
when the size is known ahead of time. For example, a quick
disassembly shows the following is about one fourth the size of the
sizegeneric inline code (with (simplearray doublefloat (*))
, and
(optimize speed (space 0))
.
CLUSER> (lambda (x)
(declare (type (simplearray doublefloat (4)) x)
(optimize speed))
(sort x #'<))
#<FUNCTION (lambda (x)) {100BFF457B}>
CLUSER> (disassemble *)
; disassembly for (lambda (x))
; 0BFF45CF: F20F105A01 movsd XMM3, [rdx+1] ; noargparsing entry point
; 5D4: F20F104209 movsd XMM0, [rdx+9]
; 5D9: F20F104A11 movsd XMM1, [rdx+17]
; 5DE: F20F105219 movsd XMM2, [rdx+25]
; 5E3: 660F2FC3 comisd XMM0, XMM3
; 5E7: 7A0E jp L0
; 5E9: 730C jnb L0
; 5EB: 660F28E3 movapd XMM4, XMM3
; 5EF: 660F28D8 movapd XMM3, XMM0
; 5F3: 660F28C4 movapd XMM0, XMM4
[...]
Even for vectors of length 8 (the default limit), the specialised merge sort is shorter than SBCL’s inlined heapsort, and about three times as fast on shuffled vectors.
It took me much longer to write this up than to code the generator, but I hope this can be useful to other people. One thing I’d like to note is that sorting networks are much harder to get right than this generator, and pessimise performance: without branches, there must be partially redundant computations on nonpoweroftwo sizes. In the absence of solid compiler support for conditional swaps, I doubt the additional overhead of optimal sorting networks can be compensated by the simpler control flow, never mind the usual oddeven or bitonic networks.
]]>The result? Pure noise.
(expt 10 10)
overflows 32 bit signed integers, so the C version
wound up going through 1410065408 iterations instead. In fact, signed
overflow is undefined in C, so a sufficiently devious compiler could
cap the iteration count to 65536 and still be standard compliant.
On SBCL/x8664, we can do the following and explicitly ask for machine unsigned arithmetic:
CLUSER> (lambda (max)
(declare (type (unsignedbyte 64) max)
(optimize speed))
(let ((sum 0))
(declare (type (unsignedbyte 64) sum))
(dotimes (i max sum)
(setf sum (ldb (byte 64 0) (+ sum i))))))
#<FUNCTION (LAMBDA (MAX)) {1004DA3D6B}>
CLUSER> (disassemble *)
; disassembly for (LAMBDA (MAX))
; 04DA3E02: 31C9 XOR ECX, ECX ; noargparsing entry point
; 04: 31C0 XOR EAX, EAX
; 06: EB0E JMP L1
; 08: 0F1F840000000000 NOP
; 10: L0: 4801C1 ADD RCX, RAX
; 13: 48FFC0 INC RAX
; 16: L1: 4839D0 CMP RAX, RDX
; 19: 72F5 JB L0
[ function epilogue ]
Now that ldb
portably ensures modular arithmetic, we
virtually get the exact same thing as what GCC outputs, down to
alignment. It’s still slower than the C version because it goes
through 1e10 iterations of the lossy sum, rather than
1.4e9.
Microbenchmarks are useful to improve our understanding of complex systems. Microbenchmarks whose results we completely discard not so much: if there’s nothing keeping us or the compiler honest, we might as well get them to compile to noops.
]]> addr = destination
offset = addr>>k;
(cards(heap_base>>k))[offset] = 1 ; mark one byte
write to addr
Some SBCL users allocate Lisp object lookalikes in the C heap, and we have stackallocated values; I have to test whether the address is in range or mask the offset to avoid overflows.
Or, we could exploit X86’s bitaddressing instructions:
addr = destination
bts cards, addr
write to addr
where cards
is a vector of 256 or 512MB (there’s some trickery to handle
negative offsets). bts
will index into that vector of 4G bits, and
set the corresponding bit to 1. On X8664, we can force cards
to be
in the lower 4GB, and stick to 32 bit addressing: the instruction will
also implicitly mask out the upper 32 bit of addr
before indexing
into cards
. Too bad it’s around twice or thrice as slow as a shift
and a byte write (or even shift, mask and byte write) and really sucks
with SMP.
There are also hacks [PDF] to abuse alignment checking as hardware lowtag (tag data in the lower bit of addresses) checks. Who says that contemporary machines don’t support safe languages well? (:
]]>I believe the approach I used to choose the implementation can be applied in other contexts, and the tiny tweak to adapt the sort to nearlysorted inputs is simple (much simpler than explicitly detecting runs like Timsort), if a bit weak, and works with pretty much any merge sort.
The original code is reproduced below. The sort is parameterised on two functions: a comparator (test) and a key function that extracts the property on which data are compared. The key function is often the identity, but having it available is more convenient than having to pull the calls into the comparator. The sort is also stable, so we use it for both stable and regular sorting; I’d like to keep things that way to minimise maintenance and testing efforts. This implementation seems like a good foundation to me: it’s simple but pretty good (both in runtime and in number of comparisons). Trying to modify alreadycomplicated code is no fun, and there’s little point trying to improve an implementation that doesn’t get the basics right.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 

There are a few obvious improvements to try out: larger base cases, recognition of sorted subsequences, converting branches to conditional moves, finding some way to avoid the initial call to length (which must traverse the whole linked list), … But first, what would interesting performance metrics be, and on what inputs?
I think it works better to first determine our objective, then the inputs to consider, and, last, the algorithmic variants to try and compare (decision variables). That’s more or less the reverse order of what’s usually suggested when defining mathematical models. The difference is that, in the current context, the space of inputs and algorithms are usually so large that we have to winnow them down by taking earlier choices into account.
A priori, three basic performance metrics seem interesting: runtimes, number of calls to the comparator, and number of calls to the key functions. On further thought, the last one doesn’t seem useful: if it really matters, a schwartzian transform suffices to reduce these calls to a minimum, regardless of the sort implementation.
There are some complications when looking at runtimes. The universe
of test and key functions is huge, and the sorts can be inlined, which
sometimes enables further specialisation on the test and key. I’ve
already decided that calls to key don’t matter directly. Let’s
suppose it’s very simple, the identity function. The number of
comparisons will correlate nicely with performance when comparisons
are slow. Again, let’s suppose that the comparator is simple, a
straight <
of fixnums. The performance of sorts, especially with a
trivial key and a simple comparator, can vary a lot depending on
whether the sort is specialised or not, and both cases are relevant in
practice. I’ll have to test for both cases: inlined comparator and
key functions, and generic sorts with unknown functions.
This process lead to a set of three objective functions: the number of calls to the comparator, the runtime (number of cycles) of normal, generic sort, and the number of cycles for a specialised sort.
The obvious thing to vary in the input (the list to sort) is the length of the list. The lengths should probably span a wide range of values, from short lists (e.g. 32 elements) to long ones (a few million elements). Programs that are sortbound on very long lists should probably use vectors, if only around the sort, and then invest in a sophisticated sort.
In real programs, sort is sometimes called on nearly sorted or reversesorted sequences, and it’s useful to sort such inputs faster, or with fewer comparisons. However, it’s probably not that interesting if the adaptivity comes at the cost of worse performance on fully shuffled lists. I decided to test on sorted and fully shuffled inputs. I also interpolated between the two by flipping randomlyselected subranges of the list a few times.
Finally, linked lists are different than vectors in one key manner: contiguous elements can be arbitrarily scattered around memory. SBCL’s allocation scheme ensures that consecutivelyallocated objects will tend to be located next to each other in memory, and the copying garbage collector is hacked to copy the spine of cons lists in order. However, a list can still temporarily exhibit bad locality, for example after an inplace sort. Again, I decided to go for ordered conses, fully scattered conses (only the conses were shuffled, not the list’s values), and to interpolate, this time by swapping randomlyselected pairs of consecutive subranges a couple times.
The textbook way to improve a recursive algorithm is to increase the base cases’ sizes. In the initial code the base case is a sublist of size one; such a list is trivially sorted. We can easily increase that to two (a single conditional swap suffices), and an optimal sorting network for three values is only slightly more complicated. I decided to stop there, with base cases of size one to three. These simple sorts are implemented as a series of conditional swaps (i.e. pairs of max/min computations), and these can be executed branchfree, with only conditional moves. There’s a bit of overhead, and conditional moves introduce more latency than well predicted branches, but it might be useful for the specialised sort on shuffled inputs, and otherwise not hurt too much.
The merge loop could cache the result of calls to the key functions. This won’t be useful in specialised sorts, and won’t affect the number of comparisons, but it’ll probably help the generic sort without really affecting performance otherwise.
With one more level of indirection, the merge loop can be branchfree:
mergeone
can be executed on references to the list heads, and these
references can be swapped with conditional moves. Again, the
additional complexity makes it hard to guess if the change would be a
net improvement.
Like I hinted back in May, we can accelerate the sort on presorted inputs by keeping track of the last cons in each list, and tweaking the merge function: if the first value in one list is greater than (or equal to) the last in the other, we can directly splice them in order. Stability means we have to add a bit of complexity to handle equal values correctly, but nothing major. With this tiny tweak, merge sort is lineartime on sorted or reversesorted lists (the recursive step is constanttime, and merge sort recurses on both halves); it also works on recursivelyprocessed sublists, and the performance is thus improved on nearlysorted inputs in general. There’s little point going through additional comparisons to accelerate the merger of two tiny lists; a minimal length check is in order. In addition to the current version, without any quick merge, I decided to try quick merges when the length of the two sublists summed to at least 8, 16 or 32. I didn’t try limits lower than 8 because any improvement would probably be marginal: trying to detect opportunities for quicker merge introduces two additional comparisons when it fails.
Finally, I tried to see if the initial call to length
(which has to
traverse the whole list) could be avoided, e.g. by switching to a
bottomup sort. The benchmarks I ran in May made me realise that’s
probably a bad idea. Such a merge sort almost assuredly has to split
its inputs in chunks of power of two (or some other base) sizes.
These splits are suboptimal on nonpoweroftwo inputs; for example,
when sorting a list of length (1+ (ash 1 n))
, the final merge is
between a list of length (ash 1 n)
and a list of length … one.
Knowing the exact length of the list means we can split optimally on
recursive calls, and that eliminates bumps in runtimes and in the
number of comparisons around “round” lengths.
I usually don’t try to do anything clever, and simply run a large number of repetitions for all the possible implementations and families of inputs, and then choose a few interesting statistical tests or sic an ANOVA on it. The problem is, I’d maybe want to test with ten lengths (to span the wide range between 32 and a couple million), a couple shuffledness values (say, four, between sorted and shuffled), a couple scatteredness values (say, four, again), and around 48 implementations (size1 base case, size3 base case, size3 base case with conditional moves, times cached or uncached key, times branchful or branchfree merge loop, times four possibilities for the quick merge). That’s a total of 7680 sets of parameter values. If I repeated each possibility 100 times, a reasonable sample size, I’d have to wait around 200 hours, given an average time of 1 second/execution (a generous estimate, given how slow shuffling and scattering lists can be)… and I’d have to do that separately when testing for comparison counts, generic sort runtimes and specialised sort runtimes!
I like working on SBCL, but not enough to give its merge sort multiple CPUweeks.
Executing multiple repetitions of the full cross product is overkill: that actually gives us enough information to extract information about the interaction between arbitrary pairs (or arbitrary subsets, in fact) of parameters (e.g. shuffledness and the minimum length at which we try to merge in constanttime). The thing is, I’d never even try to interpret all these crossed effects: there are way too many pairs, triplets, etc. I could instead try to determine interesting crosses ahead of time, and find a design that fits my simple needs.
Increasing the length of the list will lead to longer runtimes and more comparisons. Scattering the cons cells around will also slow the sorts down, particularly on long lists. Hopefully, the sorts are similar enough to be affected comparably by the length of the list and by how its conses are scattered in memory.
Presorted lists should be quicker to sort than shuffled ones, even without any clever merge step: all the branches that depend on comparisons are trivially predicted. Hopefully, the effect is more marked when sorted pairs of sublists are merged in constant time.
Finally, the interaction between the remaining algorithmic tweaks is pretty hard to guess, and there are only 12 combinations. I feel it’s reasonable to cross the three parameter sets.
That’s three sets of crossed effects (length and scatteredness, shuffledness and quick merge switchover, remaining algorithmic tweaks), but I’m not interested in any further interaction, and am actually hoping these interactions are negligible. A Latin square design can help bring the sample to a much more reasonable size.
An NxN Latin square is a square of NxN cells, with one of N symbols in each cell, with the constraint that each symbol appears once in each row and column; it’s a relaxed Sudoku.
When a first set of parameters values is associated with the rows, a second with the columns, and a third with the symbols, a Latin square defines N^{2} triplets that cover each pair of parameters between the three sets exactly once. As long as interactions are absent or negligible, that’s enough information to separate the effect of each set of parameters. The approach is interesting because there are only N^{2} cells (i.e. trials), instead of N^{3.} Better, the design can cope with very low repetition counts, as low as a single trial per cell.
Latin squares are also fairly easy to generate. It suffices to fill the first column with the symbols in arbitrary order, the second in the same order, rotated by one position, the third with a rotation by two, etc. The square can be further randomised by shuffling the rows and columns (with FisherYates, for example). That procedure doesn’t sample from the full universe of Latin squares, but it’s supposed to be good enough to uncover pairwise interactions.
Latin squares only make sense when all three sets of parameters are the same size. Latin rectangles can be used when one of the sets is smaller than the two others, by simply removing rows or columns from a random Latin square. Some pairs are then left unexplored, but the data still suffices for uncrossed linear fits, and generating independent rectangles helps cover more possibilities.
I’ll treat all the variables as categorical, even though some take numerical values: it’ll work better on nonlinear effects (and I have no clue what functional form to use).
Comparison counts are easier to analyse. They’re oblivious to microoptimisation issues like conditional moves or scattered conses, and results are deterministic for fixed inputs. There are much fewer possibilities to consider, and less noise.
Four values for the minimum length before checking for constanttime merger (8, 16, 32 or never), and ten shuffledness values (sorted, one, two, five, ten, 50, 100, 500 or 1000 flips, and full shuffle) seem reasonable; when the number of flips is equal to or exceeds the list length, a full shuffle is performed instead. That’s 40 values for one parameter set.
There are only two interesting values for the remaining algorithmic tweaks: size3 base cases or not (only size1).
This means there should be 40 list lengths to balance the design. I chose to interpolate from 32 to 16M (inclusively) with a geometric sequence, rounded to the nearest integer.
The resulting Latin rectangles comprise 80 cells. Each scenario was repeated five times (starting from the same five PRNG states), and 30 independent rectangles were generated. In total, that’s 12 000 executions. The are probably smarter ways to do this that better exploit the fact that there are only two algorithmic tweaks variants; I stuck to a very thin Latin rectangle to stay closer to the next two settings. Still, a full cross product with 100 repetitions would have called for 320 000 executions, nearly 30 times as many.
I wish to understand the effect of these various parameters on the number of times the comparison function is called to sort a list. Simple models tend to suppose additive effects. That doesn’t look like it’d work well here. I expect multiplicative effects: enabling quick merge shouldn’t add or subtract to the number of comparisons, but scale it (hopefully by less than one). A logarithmic transformation will convert these multiplications into additions. The ANOVA method and the linear regression I’ll use are parametric methods that suppose that the mean of experimental noise roughly follows a normal distribution. It seems like a reasonable hypothesis: variations will be caused by a sum of many small differences caused by the shuffling, and we’re working with many repetitions, hopefully enough for the central limit theorem to kick in.
The Latin square method also depends on the absence of crossed interactions between rows and columns, rows and symbols, or columns and symbols. If that constraint is violated, the design is highly vulnerable to Type I errors: variations caused by interactions between rows and columns could be assigned to rows or columns, for example.
My first step is to look for such interaction effects.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 

The main effects are statistically significant (in order, list length,
shuffling and quick merge limit, and the algorithmic tweaks), with p <
2e16. That’s reassuring: the odds of observing such results if they
had no effects are negligible. Two of the pairs are, as well. Their
effects, on the other hand, don’t seem meaningful. The Sum Sq
column reports how much of the variance in the data set is explained
when the parameters corresponding to each row (one for each degree of
freedom Df
) are introduced in the fit. Only the
Size.Scatter:Shuffle.Quick row really improves the fit, and that’s
with 1159 degrees of freedom; the mean improvement in fit, Mean Sq
(per degree of freedom) is tiny.
The additional assumption that interaction effects are negligible seems reasonably satisfied. The linear model should be valid, but, more importantly, we can analyse each set of parameters independently. Let’s look at a regression with only the main effects.
1 2 3 4 5 6 7 8 9 10 11 12 13 

The fit is only slightly worse than with pairwise interactions. The coefficient table follows. What we see is that half of the observations fall within 12% of the linear model’s prediction (the worst case is off by more than 100%), and that nearly all the coefficients are statistically significantly different than zero.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 

The Size.Scatter coefficients are plotted below. The number of comparison grows with the length of the lists. The logarithmic factor shows in the curve’s slight convexity (compare to the linear interpolation in blue).
The Shuffle.Quick values are the coefficients for the crossed effect of the level of shuffling and the minimum length (cutoff) at which constanttime merge may be executed; their values are reported in the next histogram, with error bars corresponding to one standard deviation. Hopefully, a shorter cutoff lowers the number of comparisons when lists are nearly presorted, and doesn’t increase it too much when lists are fully shuffled. On very nearly sorted lists, looking for presorted inputs as soon as eight or more values are merged divides the number of comparisons by a factor of 4 (these are base2 logarithms), and the advantage smoothly tails off as lists are shuffled better. Overall, cutting off at eight seems to never do substantially worse than the other choices, and is even roughly equivalent to vanilla merges on fully shuffled inputs.
The coefficient table tells us that nearly all of the Shuffle.Quick coefficients are statistically significant. The statistical significance values are for a null hypothesis that each of these coefficients is actually zero: the observation would be extremely unlikely if that were the case. That test tells us nothing about the relationship between two coefficients.
Comparing differences with standard deviations helps us detect hugely significant difference, but we can use statistical tests to try and make finer distinctions. Tukey’s Honest Significant Difference (HSD) method gives intervals on the difference between two coefficients for a given confidence level. For example, the 99.99% confidence interval between cutoff at 8 and 32 on lists that were flipped 50 times is [0.245, 0.00553]. This result means that, if the hypotheses for Tukey’s HSD method are satisfied, the probability of observing the results I found is less than .01% when the actual difference in effect between cutoff at 8 and 32 is outside that interval. Since even the upper bound is negative, it also means that the odds of observing the current results are less than .01% if the real value for cutoff at 8 isn’t lower than that of cutoff at 32: it’s pretty sure that looking for quick merges as early as length eight pays off compared to only doing so for merges of length 32 or more. One could also just prove that’s the case. Overall, cutting off at length eight results in fewer comparisons than the other options at nearly all shuffling levels (with very high confidence), and the few cases cases it doesn’t aren’t statistically significant at a 99.99% confidence level – of course, absence of evidence isn’t evidence of absence, but the differences between these estimates tend to be tiny anyway.
The last row, Leaf.Cache.BranchMergeTxFxT, reports the effect of adding base cases that sort lists of length 2 and 3. Doing so causes 4% more comparisons. That’s a bit surprising: adding specialised base cases usually improves performance. The issue is that the sorting networks are only optimal for dataoblivious executions. Sorting three values requires, in theory, 2.58 (\(\lg 3!\)) bits of information (comparisons). A sorting network can’t do better than the ceiling of that, three comparisons, but if control flow can depend on the comparisons, some lists can be sorted in two comparisons.
It seems that, if we wish to minimise the number of comparisons, I should avoid sorting networks for the size3 base case, and try to detect opportunities for constanttime list merges. Doing so as soon as the merged list will be of length eight or more seems best.
I decided to keep the same general shape of 40x40xM parameter values when looking at the cycle count for generic sorts. This time, scattering conses around in memory will affect the results. I went with conses laid out linearly, 10 swaps, 50 swaps, and full randomisation of addresses. These 4 scattering values leave 10 list lengths, in a geometric progression from 32 to 16M. Now, it makes sense to try all the other microoptimisations: trivial base case or base cases of size up to 3, with branches or conditional moves (3 choices), cached calls to key during merge (2 choices), and branches or conditional moves in the merge loop (2 choices). This calls for Latin rectangles of size 40x12; I generated 10 rectangles, and repeated each cell 5 times (starting from the same 5 PRNG seeds). In total, that’s 24 000 executions. A full cross product, without any repetition, would require 19 200 executions; the Latin square design easily saved a factor of 10 in terms of sample size (and computation time) for equivalent power.
I’m interested in execution times, so I generated the inputs ahead of time, before sorting them; during both generation and sorting, the garbage collector was disabled to avoid major slowdowns caused by the mprotectbased write barrier.
Again, I have to apply a logarithmic transformation for the additive model to make sense, and first look at the interaction effects. There’s a similar situation as the previous section on comparison counts: one of the crossed effects is statistically significant, but it’s not overly meaningful. A quick look at the coefficients reveals that the speedups caused by processing nearlysorted lists in close to linear time are overestimated on short lists and slightly underestimated on long ones.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 

We can basically read the ANOVA with only main effects by skipping the rows corresponding to crossed effects and instead adding their values to the residuals. There are statistically significant coefficients in there, and they’re reported below. Again, I’m quite happy to be able to examine each set of parameters independently, rather than having to understand how, e.g., scattering cons cells around affects quick merges differently than the vanilla merge. Maybe I just didn’t choose the right parameters, or was really unlucky; I’m just trying to do the best I can with induction.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 

The coefficients for list length crossed with scattering level are plotted below. Sorting seems to be slower on longer list (surprise!), especially when the cons cells are scattered; sorting long scattered lists is about twice as slow as sorting nicely laidout lists of the same length. The difference between linear and slightly scattered lists isn’t statistically significant.
Just as with comparison counts, sorting presorted lists is faster, with or without special logic. Looking for sorted inputs before merging pays off even on short lists, when the input is nearly sorted: the effect of looking for presorted inputs even on sublists of length eight is consistently more negative (i.e. reduces runtimes) than for the other cutoffs. The difference is statistically significant at nearly all shuffling levels, and never significantly positive.
Finally, the three algorithmic tweaks. Interestingly, the coefficients tell us that, overall, the additional overhead of the branchfree merge loop slows it down by 5%. The fastest combination seems to be larger base cases, with or without conditional moves (C or T), cached calls to key (T), and branchful merge loop (T); the differences are statistically significant against nearly all other combinations, except FxTxT (no leaf sort, cached key, and branchful merge loop). Compared with the current code (FxFxT), the speed up is on the order of 5%, and at least 2% with 99.99% confidence.
If I want to improve the performance of generic sorts, it looks like I want to test for presorted inputs when merging into a list of length 8 or more, probably implement larger base cases, cache calls to the key function, and keep the merge loop branchful.
I kept the exact same plan as for generic sorts. The only difference is that independent Latin rectangles were regenerated from scratch. With the overhead from generic indirect calls removed, I’m hoping to see more important effects from the microoptimisations.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 

Here as well, all the main and crossed effects are statistically significant. The effect of the microoptimisations (Leaf.Cache.BranchMerge) are now about as influential as the fast merge minimum length. It’s also even more clear that the crossed effects are much less important than the main ones, and that it’s probably not too bad to ignore the former.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 

The general aspect of the coefficients is pretty much the same as for generic sorts, except that differences are amplified now that the constant overhead of indirect calls is eliminated.
The coefficients for crossed list length and scattering level coefficients are plotted below. The graph shows that fully shuffling long lists around slows sort down by a factor of 8. The initial check for crossed effect gave good reasons to believe that this effect is fairly homogeneous throughout all implementations.
Checking for sorted inputs before merge still helps, even on short lists (of length 8 or more). In fact even on completely shuffled lists, looking for quick merge on short lists very probably accelerates the sort compared to not looking for presorted inputs, although the speed up compared to other cutoff values isn’t significant to a 99.99% confidence level.
The key function is the identity, and is inlined into nothing in these measurements. It’s not surprising that the difference between cached and uncached key values is tiny. The versions with larger base cases (C or T) sorts, and branchful merge are quicker than the others at 99.99% confidence level; compared to the initial code, they’re at least 13% faster with 99.99% confidence.
When the sort is specialised, I probably want to use a merge function that checks for presorted inputs very early, to implement larger base cases (with conditional moves or branches), and to keep the merge loop branchful.
Comparison counts are minimised by avoiding sorting networks, and by enabling opportunistic constanttime merges as early as possible. Generic sorts are fastest with larger base cases (with or without branches), cached calls to the key function, a branchful merge loop and early checks for constanttime merges. Specialised sorts are, similarly, fastest with larger base cases, a branchful merge loop and early checks when merging (without positive or negative effect from caching calls to the key function, even if it’s the identity).
Overall, these result point me toward one implementation: branchful size2 and size3 base cases that let me avoid redundant comparisons, cached calls to the key function, branchful merge loop, and checks for constanttime merges when the result is of length eight or more.
The compound effect of these choices is linear time complexity on sorted inputs, speedups (and reduction in comparison counts) by factors of 2 to 4 on nearlysorted inputs, and by 5% to 30% on shuffled lists.
The resulting code follows.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 

It’s somewhat longer than the original, but not much more complicated: the extra code mostly comes from the tedious but simple leaf sorts. Particularly satisfying is the absence of conditional move hack: SBCL only recognizes trivial forms, and only on X86 or X8664, so the code tends to be ugly and sometimes sacrifices performance on other platforms. SBCL’s bad support for conditional moves may explain the lack of any speed up from converting branches to select expressions: the conditional swaps had to be implemented as pairs of independent test with T/NIL and conditional moves. Worse, when the comparison is inlined, an additional conditional move converted the result of the integer comparison to a boolean value; in total, three pairs of comparison/conditional move were then executed instead of one comparison and two conditional moves. Previous work [PDF] on outofplace array merge sort in C found it useful to switch to conditional movebased merge loop and sorting networks. Some of the difference is probably caused by SBCL’s weaker code generation, but the additional overhead inherent to linked list manipulations (compared to linear accesses in arrays) may also play a part.
Another code generation issue is caused by the way the initial version called the comparison function in exactly one place. This meant that arbitrary comparators would almost always be inlined in the specialised sort’s single call site. We lose that property with accelerated merge and larger base cases. That issue doesn’t worry me too much: functions can be declared inline explicitly, and the key function was already called from multiple sites.
I’m a bit surprised that neither the sorting networks nor the merge loop were appreciably spedup by rewriting them with conditional moves. I’m a lot more surprised by the fact that it pays off to try and detect presorted lists even on tiny merges, and even when the comparator is inlined. The statistical tests were useful here, with results that defy my initial expectations and let me keep the code simpler. I would be pleasantly surprised if complex performance improvement patches, in SBCL and otherwise, went through similar testing. Code is a longterm liability, and we ought to be convinced the additional complexity is worth the trouble.
Independently of that, the Latin square design was very helpful: it easily saved me a couple CPUweeks, and I can see myself using it regularly in the future. The approach only works if we already have a rough (and simple) performance model, but I have a hard time interpreting complex models with hundreds of interacting parameters anyway. Between a simplistic, but still useful, model and a complex one with a much stronger fit, I’ll usually choose the former… as long as I can be fairly certain the simple model isn’t showing me a mirage.
More generally, research domains that deal with the real world have probably already hit the kind of scaling issues we’re now facing when we try to characterise how computers and digital systems function. Brute forcing is easier with computers than with interns, but it can still pay off to look elsewhere.
]]>Binary search suffers from a related ailment when executed on medium or large vectors of almost poweroftwo size (in bytes), but it can be cured. Once that is done, searching a sorted vector can be as fast as searches with a welltuned hash table, for a few realistic access patterns.
The task is interesting to me because I regularly work with static, or almost static, sets: sets for which there’s a majority of lookups, while updates are either rare or batchable. For such sets, the improved performance of explicit balanced search trees on insertions is rarely worth the slowdown on lookups, nor the additional space usage. Replacing binary search with slightly offcenter binary or quaternary (fourway) searches only adds a bit more code to provide even quicker, more consistent lookup times.
The following jittered scatterplot reports the runtimes, on my dual
2.8 GHz X5660, for binary searches to the first value in a single
sorted vector of size 16 to 1G elements (32bit unsigned
s), and is
overlaid with the median in green. For each poweroftwo size, the
benchmark program was executed 10 times, with 100 000 binary searches
per execution (and each search was timed individually, with rdtscp
).
Finally,
NUMA effects
were eliminated by forcing the processes to run and acquire memory
from the same unloaded node. 1 000 000 points are hard to handle, so
the plot is only based on a .1% random sample (i.e. 1 000 points for
each size).
Two things bother me here. The performance degrades much more quickly when \(\lg(n)\geq 17\) (i.e. the size exceeds 512 KB), but, worse, it shows a lot of variation on larger vectors. The variation seems more critical to me. Examining the data set reveals that the variation is purely interexecution: each set of 100 000 repetitions shows nearly constant timings.
The degradation after \(\lg(n)=17\) is worrying because drops in performance when the input is 512 KB or larger looks a lot like a cache issue. However, the workload is perfectly cachable: even on vectors of 1G values, binary search is always passed the same key and vector, and thus always reads from the same 30 locations. This should have no trouble fitting in L1 cache, for the whole range of sizes considered here.
The variation is even stranger. It happens between executions of identical binaries for a deterministic program, on the same machine and OS, but never in a single execution. The variation makes no sense. If the definition of insanity is doing the same thing over and over again and expecting different results, contemporary computers are quite ill. Worse, the variation would never have been uncovered by repetitions in a single process. I was lucky to look at the results from multiple executions, otherwise I could easily have concluded that binary search for the same element in a vector of 1G 32bit elements almost always takes around 700 cycles, or almost always 1400ish cycles.
What’s behind the speed bump and the extreme variations?
The next graph reports the number of cache misses at the first, second, third and translation lookaside buffer levels. The average count per search over 100 000 repetitions were recorded in 10 independent executions for each size. The points correspond to medians of ten average counts, and the vertical lines span the minimum and maximum counts.
The miss counts between executions are pretty much constant for L3 and TLB, and reasonably consistent for L1. L1 and TLB misses seem to cause the slowdown, even though the working sets are tiny. L2 misses on \(\lg(n)\geq 24\) are all over the place, and seem to be behind the variability.
Something’s wrong with the caches. We’ve found an issue with classic binary search – that splits the range at each iteration in the middle – on vectors of (nearly)poweroftwo size, in bytes.
Nearly all (cacheful) processors to this day map memory addresses to
cache locations by taking the modulo with a power of two. For
example, with 1K cache lines of 64 bytes, 0xabcd = 43981
is mapped
to \((43981\div 64)\mod 1024\). This is as simple to understand as
it is to implement: just select a short range of bits from the
address. It also has the side effect that nicelyaligned data will often map to the same cache lines.
Two vectors of 64KB each, allocated at 0x10000 and 0x20000, have the
unfortunate property that, for each index, the data at that index in
both vectors will map to the same cache lines in a directmapped cache
with 1024 lines of 64 bytes each. When only one cache line is allowed
per location, iterating through both vectors in the same loop will
pingpong the data from both vectors between cache and memory, without
ever benefiting from caching. The addresses alias each other, and
that’s bad.
There are workarounds, on both the hardware and software sides.
Current X86s tend to avoid fast, directmapped, caches even at the first level: setassociative caches are designed to handle a small number of collisions (like fixedsize buckets in hash tables), from 24 in the fastest levels to 8, 16 or more for last level caches. This ensures that a program can access two similarlyaligned vectors in lockstep, without having each access to one vector evicting cached data from the other. It’s far from perfect: a 2way associative cache enables a loop over two vectors to work fine, but, as soon as the loop iterates over three or more vectors, it suddenly becomes much slower. Arguably more robust solutions involving prime moduli, hash functions, or other clever addresstocacheline mappings have been proposed since the 80’s (this [pdf] is a recent one, with a nice set of references to older work), but I don’t think any has ever gained traction.
On the software end, smart memory allocators try to alleviate the problem by offsetting large allocations away from the beginning of pages. Inserting a variable number of cache lines as padding (before allocated data) minimises the odds that vector traversals suffer from aliasing. Sadly, that’s not yet the case for SBCL. In fact, as well as pretty much guaranteeing that large vectors will alias each other, SBCL’s allocator also makes cacheaware traversals hard to write: the allocator aligns the vector’s header (two words for a length and type descriptor) rather than its data. That’s the kind of consideration that is too easily disregarded when designing a runtime system; getting it right after the fact probably isn’t hard, but may call for some unsavory hacks.
Binary search doesn’t involve lockstep traversals over multiple vectors. However, when the vector to search into is of poweroftwo length (or almost), the first few accesses are to similarlyaligned addresses (e.g. in a 64KB vector of 32bit values, it could access indices 0x8000, 0x4000, 0x2000, 0x1000, etc.). Similar issues crop up when working with poweroftwosized matrices, or when bitreversing a vector. We can easily convince ourselves that the access patterns will be similarly structured when binary searching vectors of size slightly (on the order of one or two cache lines) shorter or longer than powers of two.
On my X5660, the L1D cache has 512 lines of 64 bytes each, L2 4096 lines, and L3 196608 lines (12 MB). Crucially, the L1 and L2 caches are 8way setassociative, and the L3 16way. The TLBs are 4way associative, and the first level has 64 entries and the second 512.
The number of L1 cache misses explodes on vectors of 512K elements and more. The L1D is designed so that only contiguous ranges of \(64\times 512\div 8 = 4 \textrm{KB}\) (one normal page) can be cached without collisions. For \(\lg(n)=19\), binary search needs 10 reads that all map to the same set of cache lines before reducing the candidate range below 4 KB. Once the first level cache’s 8way associativity has been exhausted, binary search on a vector of 512K 4byte elements still has to read two more locations before hitting a different set (bucket) of cache lines. This accounts for the L1 misses: so many reads are to locations that map to the same cache lines that a tiny working set exceeds the L1D’s wayness. To make things worse, LRUstyle replacement strategy are useless. We have a similar issue on TLB misses (plus, filling the TLB on misses means additional data cache misses to walk the page table).
In contrast, the L3 is large enough and has a highenough associativity to accomodate the binary search.
That leaves the wildlyvarying miss counts for the second level cache. The L1D cache is able to fullfill some reads, which explains why L2 misses only appear on larger searches. The variation is surprising. What happens is that, on the X5660, caching depends on physical memory addresses, and those are fully under the OS’s control and hidden from userspace. The exception is the L1D cache, in which the sets only cover 4 KB (one normal page). It’s just small enough that the bits used to determine the set are always the same for both virtual and physical addresses, and the CPU can thus perform part of the lookup in parallel with address translation. This hidden variable explains the large discrepancies in L2 miss counts that is then reflected in runtimes. Sometimes the OS hands us a range of physical pages that works well for our workloads, other times it doesn’t. Not only is there nothing we can do about it, but we also can’t even know about it, except indirectly, or with unportable hacks. Finally, the OS can’t change the mapping without copying data, so that happens very rarely, and memory allocators tend to minimise syscalls by keeping previouslyallocated address space around. In effect, an application that’s unlucky in its allotment of physical addresses can do little about it and can’t really tell except by logging bad performance.
Classic binary search on poweroftwo sized vectors is a pathological case for small (i.e. fast) caches and TLBs, and its performance varies a lot based on physical address allocation, something over which we have no control. The previous section only covered a trivial case, and real workloads will be even worse. We could try to improve performance and consistency by introducing padding, either after each element or between chunks of elements, but I’d prefer a different solution that preserves the optimal density and simplicity of sorted vectors.
There are many occurrences of two as a magic number in our test cases. On the hardware side, cache lines are poweroftwo sizes, as are L1 and L2 set (buckets) counts. On the software side, each unsigned is 4bytes long, the vectors are of poweroftwo lengths, and binary search divides the size of the candidate range by two. There’s little we can do about hardware, and changing our interface to avoid round element sizes or counts isn’t always possible. The only remaining knob to improve the performance and consistency of binary search is the division factor.
The successor of two is three. How can we divide a poweroftwo size vector in three subranges? One way is to vary the size of the subdivided range, and keep track of it at runtime. There’s a simpler workaround: it’s all right, if suboptimal, to search over a wider range than strictly necessary, as long as no outofbounds access is performed. For example, a vector of length 8 can be split at indices \(\lceil 8/3\rceil = 3\) and \(83 = 5\); the next range will be either [0, 3), [3, 5) or [5, 8). The middle range is of length 2 rather than 3, but the search is still correct if it’s executed over [3, 6) instead of [3, 5).
Ternary search functions were generated for each vector size; when the last range was of length 2, a binary search step was executed instead. The example below is for a vector of length 32K. It implements a lower bound search, assuming that the key is greater than or equal to the first element in the vector (otherwise, the first element is spuriously considered as the lower bound).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 

Similar code was generated for binary searches. We still have to define the expansion
for each TERNARY
or BINARY
step.
BINARY
is easy: we simply reuse the logic from the previous post on binary search and
branch mispredictions.
1 2 3 4 5 6 

The compiler barrier was the most straightforward way I found to prevent gcc from noticing potentiallyredundant accesses and trying to optimise them away by inserting branches.
TERNARY
present a choice: do we try to expose
memorylevel parallelism (MLP)
and read from both locations in parallel before comparing them to the
key, or do we instead avoid useless memory traffic and make the second
load depend on the result of the first comparison?
The first implementation is simple.
1 2 3 4 5 6 7 

The second is only a bit more convoluted:
1 2 3 4 5 6 7 

If the first comparison succeeds, the second will repeat the exact
same operation, and store the pointer in base
.
Finally, a quaternary search was also implemented: it enables memorylevel parallelism by reading from three locations at each iteration to divide the range’s size by four. As the code snippet below shows, the highlevel access patterns are otherwise exactly the same as binary search, and a twoway division is invoked when the range is down to two elements.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 

Comparing all four implementations (binary search, ternary search with parallel loads, ternary search with sequential loads, and quaternary search with parallel loads) helps us understand the respective impact of minimising aliasing issues and exploiting memorylevel parallelism.
Theoretically, the serial ternary search is the second best implementation after binary search, which is optimal in terms of reads per reduction in candidate range size: it executes (expected count, for uniformlydistributed searches) 1.667 reads to divide the range by three, so \(1.6\overline{6}/\lg(3) \approx 1.05\) times as many comparisons as binary search, versus approximately 1.26 and 1.5 times as many for parallel ternary and quadratic searches. On the other hand, if we can expect independent reads to be executed in parallel, parallel ternary search will only perform \(1/\lg(3) \approx 0.63\) times as many batches of reads as binary search, and quaternary search half as many.
The next graph reports the performance of the methods when searching for the first element in sorted vectors, just to make sure we actually handle that case right. One would hope that the ternary searches (‘st’ for the serial loads, and ‘ter’ for the concurrent loads) improve on the binary (‘bin’) or quaternary (‘quat’) searches (that both suffer from aliasing issues); this is the reason we’re interested in them. We do observe a marked improvement in runtimes, and the second graph lets us believe it is indeed due to more successful caching (beware, the colours correspond to cache levels there, not search algorithms; there’s one facet for each search algorithm instead). Interestingly, \(\lg(n)=30\) is a bit faster than \(\lg(n)=29\) for the ternary searches; I can’t even hazard a guess as to what’s going on there.
Additional MLP when the accesses cause many more cache misses doesn’t help. If anything, the performance of quaternary search is worse and less consistent than that of binary search. However, it does accelerate ternary search, for which caches work fine. The serial ternary search (‘st’) only improves on the binary or quaternary search for \(\lg(n)\geq 20\), where the classic search begin to really suffer from aliasing issues, but the ternary search with parallel loads (‘ter’) is always on par with the classic searches, or quicker.
The next set of graph compares the performance of the methods when searching for random (present) keys; the same seed was used in each execution and for each method to minimise differences.
On this less easily cached workload, we find the same general effect on performance. Ternary search works better on larger vectors, and ternary search with parallel loads is faster even on small vectors, although the quaternary search seems to fare better here. Reassuringly, the ternary searches are also much more consistent, both in runtimes (narrower spreads in the runtime jittered plot), and in average cache misses, even with different physical memory allocation. The cache miss graph shows that, not surprisingly, the versions with parallel loads (quaternary and ternary searches) incur more misses than the comparable methods (binary and serial ternary), but that doesn’t prevent them from executing faster (see the runtime graph).
I draw two conclusions from these preliminary experiments: avoiding aliasing issues with slightly suboptimal subdivision (ternary searches) is a win, and so is exposing memory level parallelism, even when it results in more cache misses.
The initial results let us exclude straight binary and quaternary search from consideration, as well as serial ternary search. We’re also know we’re looking for variants that avoid aliasing and expose memorylevel parallelism.
Binary search is theoretically optimal (one bit of information to divide the range in two), so we could also try a very close approximation that avoids aliasing issues caused by exact division in halves. I implemented a binary search, with an offseted midpoint. Rather than guiding the recursion by the exact middle, the offset binary search reads from the \(31/63\)rd index (rounded up), and iterates on a subrange \(32/63\)rd as long. This is only slightly suboptimal: each lookup results in a division by \(63/32\), so that the offcenter search goes through approximately 2.3% more lookups than a classic binary search. Comparing other methods with the offset binary search should be equivalent to comparing them with an hypothetical classic binary search free of aliasing issues.
Aliasing mostly happens at the beginning of the search, and ternary search is (theoretically) suboptimal. It may be faster to execute ternary search only at the beginning of the recursion (when aliasing stands in the way of caching frequentlyused locations), and otherwise go for a classic binary search. I generated ternary/binary searches by using ternary search steps until the candidate range size was less than the square root of the original range size, and then switching to regular binary search steps.
Finally, quaternary search is interesting because it goes through roughly the same access patterns as binary search, but exposes more memorylevel parallelism. Again, a very close approximation suffices to avoid aliasing issues; I went with reads at \(16/63\), \(32/63\) and \(116/63 = 47/63\), and iteration on a subrange \(16/63\)rd as long. Depending on the actual capacity for memory level parallelism during execution, it will perform between 51% and 152% as much work as binary search.
I also compared the performance of g++’s (4.7) STL set, and google’s dense_hash_set, which don’t search sorted vectors, but do work with sets, ordered or not.
The set
in g++’s STL is a
redblack tree
with a fair amount of space overhead: each node has three pointers
(parent, and left and right children) and a colour enum. For our data (32bit unsigned), that’s about 700% additional space for each value.
Even before considering malloc overhead, the set uses eight times as
much memory as a sorted vector. Obviously, this affects the size of
datasets that can fit in faster storage (data caches, TLB, or main
memory)… on multisocket machines, it also affects the size of
datasets that can be processed before having to use memory assigned to
a faraway socket. This is why the comparison for set was only
performed until 128M 32bit elements; after that, there wasn’t enough
memory (a bit less than 12 GB) on the socket. In theory, the
redblack tree is a bit less efficient than binary search, and
probably in the same range as the ternary or quaternary searches:
redblack trees trades potentially imperfect balancing (with some
leaves at depth up to twice the optimum) for much quicker insertions.
An interesting advantage of the redblack tree is that it naturally
avoids aliasing issues: nodes will not tend to be laid out in memory
following a depthfirst, inorder traversal.
Google’s dense_hash_set
is an openaddressed hash table with
quadratic probing. It seems to target about 25% to 50% utilisation,
so ends up using between four times or twice as much memory as the
sorted vector. I’ve seen horrible implementations of tr1::hash
for
unsigned values in (it’s the identity function on my mac), so I wrote
a quick tabular hash function. It’s
very similar to the obvious implementation, but the matrix is
transposed: when some octets are identical (e.g. allzero highorder
bits), the table lookups will hit closeby addresses. On the
workloads considered here (small consecutive integers), the identity
hash function would actually perform admirably well, but it would fail
very hard on more realistic inputs. The tabular hash function is
closer to the ideal, and makes the performance of the hash table
nearly independent of the values constituting the set. It’s also
pretty fast: we can expect the lookup table to fit in L2, and most
probably in L1D, for the microbenchmarks used here.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 

Comparing ordered sets with an unordered (hash) set may seem unfair: unordered sets expose fewer operations, and hash tables exploit that additional freedom to offer (usually) faster insertions and lookups. I don’t care. When I use a container, I don’t necessarily want an ordered or an unordered set. I wish to implement an algorithm, and that algorithm could well be adapted to exploit the strengths of either ordered or unordered sets. The dense hash table is a valid data point: it helps informs choices that happen in practice.
First, let’s compare these implementations – the hash table (‘goog’), the offset binary search (‘ob’), the offset quaternary search (‘oq’), the redblack tree (‘stl’), the ternary/binary search (‘tb’), and the ternary search (‘ter’) – on the two extreme cases we’ve been considering so far: constant searches for the least element, and searches for random elements. None of the algorithms suffer from aliasing issues, so the spread, if any, is smoothly covered. This allows me to avoid the rather busy jittered scatter plot and instead only interpolate between median cycle counts, with shaded regions corresponding to the values between the 10th and 90th percentile. The lines for ‘goog’ and ‘stl’ end early: these data structures could not represent the largest sets in the 12GB of local memory available in each processor. Also, now that we’ve dealt with the aliasing anomalies, I don’t feel it’s useful to graph the cache and TLB misses per size and implementation anymore.
When we always search for the least element in the set, the hash table obviously works very well: it only reads the same one or two cache lines, and they’re always in cache. Surprisingly, the redblack tree is also a bit faster than the searches in sorted vectors… until it exhausts all available memory. Looking at the behaviour on tiny sizes, it seems likely that the allocation sequence yields a nice layout for the first few steps of the search. The offset binary search is both faster and more consistent than the classic binary search (shown in earlier graphs), but never faster than the offset quaternary search. The ternary/binary search is similarly never faster than the straight ternary search.
On fully random searches, the hash table is obviously slower than when it keeps reading the same one or two words, but still much faster than any of the sorted sets. The redblack tree is now significantly slower than all the sorted vector searches (1000 cycles versus 200 to 400 cycles on vectors of 128K elements, for example). Again, we find that the offset quaternary search dominates the offset binary search, and the ternary search the hybrid ternary/binary search.
For the rest of the post, I’ll only consider google’s dense hash table, g++’s redblack tree, the offset quaternary search and the ternary search. Note that both the offset quaternary and ternary searches exhibit consistently better performance than the offset binary search, for which aliasing isn’t an issue. These alternative ways to search sorted vectors can thus be expected to improve on binary search even on vectors of lengths that are far from powers of two.
So far, I’ve only reported runtimes for two workloads: a trivial one, in which we always search for the same element, and a worstcase scenario, in which all the keys are chosen independently and randomly from all the values in the set (missing keys might be even worse for some hash tables). I see two dimensions along which it’s useful to interpolate between these workloads.
The first dimension is spatial locality. We sometimes work with very large sorted sets, but almost exclusively search for values in a small contiguous subrange. Within that subrange, however, there is no obvious relationship between search keys; we might as well be choosing them randomly, with uniform probability. The constantelement search is a degenerate case with a subrange of length one, and the fully random choice another degenerate case, with a subrange spanning the whole set. For all vector sizes from 1K to 1G elements, I ran the program with subranges beginning at the least value in the set and of size 1K, 16K, 64K, 256K, 1M and 2M (or size equal to the whole set if it’s too small). These values are interesting because they roughly correspond to working set sizes that bracket the cache and TLB sizes for all four implementations.
We can expect all methods to benefit from spatial locality, but not to the same extent. The hash table works well because it scatters everything pseudorandomly in a large vector; when the subrange is sufficiently small compared to the set, each key in that subrange can be expected to use its own cache line, and the hash table’s performance will be the same as on randomlychosen keys on relatively small subranges. Similarly, the redblack tree uses about eight times as much space as sorted vectors for the same data, and will also exceed the same cache level much more quickly.
The redblack tree was slightly faster than the sorted vector searches when the key was always the same. This isn’t the case when the keys are selected from a small subrange. On all six scenarios and across all sizes, the 10th percentile for ‘stl’ is slower than the 90th for either ‘oq’ or ‘ter’. As expected, the runtimes for the hash table (‘goog’) evolve in plateaus, as the working set outstrips each level of caching. When it exceeds even the L3 (256K values in distinct cache lines is just barely larger than L3) its performance is more or less comparable to that of the sorted vector searches, as long as the subrange is below 2M items. All the sorted set implementations exhibit the same roughly sigmoidal shape for runtimes as functions of the set size. However, except for a few spikes on very large vectors, runtimes for the sorted vector searches grow much more slowly, and are more stable than for the redblack tree. In general, the offset quaternary search seems to execute a bit faster than the ternary search, but the difference is minimal. More importantly, when working with medium subranges (around 64K to 1M elements) in medium or large sets (1M elements or more), the sorted vectors searches are competitive with a hash table, in addition to being more compact.
The second dimension is temporal locality. I simulated workloads exhibiting temporal locality with strided accesses. For example with a stride of 17, the search keys are, in order, at rank 0, 17, 34, 51, etc. in the sorted set. The constant search is a special case with stride zero. For all vector sizes from 1K to 1G elements, I ran the programs with various strides: 1, 17, 65, 129, 257, and 1K+1. The ranks were taken modulo the set’s size, so these odd strides ensure that every key is generated for small vectors (i.e. they’re mod1 step sizes, for corewar fans). The ordered sets should exploit this setting well (the redblack tree probably less so than the vector searches), and the hash table not at all: a good hash function guarantees that similar keys don’t hash to closeby values any more than very different keys, and the hash table will behave exactly as if keys were chosen randomly.
The hash table behaves almost exactly like it did with random keys; in particular, there is a large increase in runtimes when the set comprises at least 1M elements: the hash table then exceeds L3. Even on these very easy inputs, the redblack tree is only faster than the hash table on accesses with stride one, the very best case. Worse, the redblack tree already exhibits very wide variations (consider the height of the shaded areas for ‘stl’) on such traversals. It’s even wider on longer strides in large vectors, but the redblack tree is then also so slow that it’s completely offgraph. The sorted vector searches are always faster, and in general slightly more stable. In fact, they seem comparable with the hash table, if not faster, for nearly all vector sizes, as long as the access stride is around 129 or less.
I only worked on binary search here, and, even with this very simple algorithm, there are fairly obscure pitfalls. The issue with physical address allocation affecting cache line aliasing, particularly at the L2 level, is easy to miss, and can have a huge impact on results. Each execution of the test program showed very consistent timings, but different executions yielded timings that varied by a factor of two. We’ve been warned [pdf] about such hidden variables in the past, but that was a pretty stealthy instance with a serious effect.
Even without that hurdle, the slowdown caused by aliasing between cache lines when executing binary searches on vectors of (nearly)poweroftwo sizes is alarming. The ratio of runtimes between the classic binary search and the offset quaternary search is on the order of two to ten, depending on the test case. Some of that is due to MLP, but the offset binary search offers only slightly lesser improvements: aliasing really is an issue. And yet, I believe I haven’t seen anyone try to compensate for that when comparing fancier search methods that often work best on nearpoweroftwo sizes (e.g. breadthfirst or van Emde Boas layouts) with simple binary search.
Also worrying is the fact that, even with all the repetitions and performance counters, some oddities (especially spikes in runtimes on very large vectors) are still left unexplained.
Nevertheless, these experiments can probably teach us something useful.
I don’t think that basing decisions on illunderstood phenomena is a good practice. This is why I try to minimise the number of variables and start with a logical hypothesis (in this case, “aliasing hurts and MLP helps”), rather than just being a slave to the pvalue. Hopefully, this approach results in more robust conclusions, particularly in conjunction with quantilebased comparisons and a lot of visualisation. In the current case, the most useful conclusions seem to be:
These raise two obvious questions: could we go further than quaternary search, and how do the results map to loopy implementations? I don’t think it’s useful to try and go farther than quaternary search: caches can only handle a few concurrent requests, and quaternary search’s three requests is probably close to, if not past, the maximum. In an implementation that’s not fully unrolled the offset searches are probably most easily adapted by multiplying the range by \(33/64 = 1/2+1/64\) or \(17/64 = 1/4+1/64\), with two shifts and a few adds. In either case, L2 cache misses tend to outweight branch mispredictions, which themselves are usually more important than microoptimisations at the instruction level. The speedups brought about by eliminating aliasing and branches, while exposing memorylevel parallelism, ought to be notable, even in a loopy implementation.
]]>