As programmers, we have a tendency to try to maximise generality: if we can support multiple writers, why bother with a measly SPMC system? The thing is, SPMC is harder than SPSC, and MPMC is even more complex. Usually, more concurrency means programs are harder to get right, harder to scale, and harder to maintain. Worse: it also makes it more difficult to provide theoretical progress guarantees.
Apart from architecting around simple cases, there are a few ways to deal with this reality. We can define new, weaker, classes of program, like obstruction-freedom: a system is obstruction-free when one thread is guaranteed to make progress if every other thread is suspended. We can also weaken the guarantees of our data structure. For example, rather than exposing a single FIFO, we could distribute load and contention across multiple queues; we lose strict FIFO order, but we also eliminate a system bottleneck. Another option is to try to identify how real computers are more powerful than our abstract models: some argue that, realistically, many lock-free schemes are wait-free, and others exploit the fact that x86-TSO machines have finite store buffers.
Last week, I got lost doodling with x86-specific cross-modifying code, but still stumbled on a cute example of a simple lock-free protocol: lock-free sequence locks. This sounds like an oxymoron, but I promise it makes sense.
It helps to define the terms better. Lock-freedom means that the overall system will always make progress, even if some (but not all) threads are suspended. Classical sequence locks are an optimistic form of write-biased reader/writer locks: concurrent writes are forbidden (e.g., with a spinlock), read transactions abort whenever they observe that writes are in progress, and a generation counter avoids ABA problems (when a read transaction would observe that no write is in progress before and after a quick write).
In Transactional Mutex Locks (PDF), sequence locks proved to have enviable performance on small systems and scaled decently well for read-heavy workloads. They even allowed lazy upgrades from reader to writer, by atomically checking that the generation has the expected value when acquiring the sequence lock for writes. However, we lose nearly all progress guarantees: one suspended writer can freeze the whole system.
The central trick of lock-freedom is cooperation: it doesn’t matter if a thread is suspended in the middle of a critical section, as long as any other thread that would block can instead complete the work that remains. In general, this is pretty hard, but we can come up with restricted use cases that are idempotent. For lock-free sequence locks, the critical section is a precomputed set of writes: a series of assignments that must appear to execute atomically. It’s fine if writes happen multiple times, as long as they stop before we move on to another set of writes.
There’s a primitive based on compare-and-swap that can easily achieve such conditional writes: restricted double compare and single swap (RDCSS, introduced in A Practical Multi-Word Compare-and-Swap (PDF)). RDCSS atomically checks that both a control word (e.g., a generation counter) and a data word (a mutable cell) have the expected values and, if so, writes a new value in the data word. The pseudocode for regular writes looks like:
if (CAS(self.data, self.old, self) == fail) {
        return fail;
}
if (*self.control != self.expected) {
        CAS(self.data, self, self.old);
        return fail;
}
CAS(self.data, self, self.new);
return success;
The trick is that, if the first CAS succeeds, we always know how to undo it (data’s old value must be self.old), and that information is stored in self, so any thread that observes the first CAS has enough information to complete or roll back the RDCSS. The only annoying part is that we need a two-phase commit: reserve data, confirm that control is as expected, and only then write to data.
For the cost of two compare-and-swaps per write – plus one to acquire the sequence lock – writers don’t lock out other writers (they help each other make progress instead). Threads (especially readers) can still suffer from starvation, but at least the set of writes can be published ahead of time, so readers can even look values up in that set rather than waiting for/helping writes to complete. The generation counter remains a bottleneck, but, as long as writes are short and happen rarely, that seems like an acceptable trade to avoid the 3n CASes of multi-word compare-and-swap.
Here’s what the scheme looks like in SBCL.
First, a mutable box, because we don’t have raw pointers in CL (I could also have tried to revive my sb-locative hack).
Next, the type for write records: we have the value for the next generation (once the write is complete) and a hash table from box to pairs of old and new values. There’s a key difference with the way RDCSS is used to implement multi-word compare-and-swap: we don’t check for mismatches in the old value and simply assume that it is correct.
The central bottleneck is the sequence lock, which each (read) transaction must snapshot before attempting to read consistent values.
The generation associated with a snapshot is the snapshot value itself if it is a positive fixnum; otherwise, it is the write record’s generation.
Before using any read, we make sure that the generation counter hasn’t changed.
I see two ways to deal with starting a read transaction while a write is in progress: we can help the write complete, or we can overlay the write on top of the current heap in software. I chose the latter: reads can already be started by writers. If a write is in progress when we start a transaction, we stash the write set in *current-map* and look there first:
We’re now ready to start read transactions. We take a snapshot of the generation counter, update *current-map*, and try to execute a function that uses box-value. Again, we don’t need a read-read barrier on x86oids (nor on SPARC, but SBCL doesn’t have threads on that platform).
The next function is the keystone: helping a write record go through exactly once.
Now we can commit with a small wrapper around help. Transactional mutex locks have the notion of transactions that are directly created as write transactions. We assume that we always know how to undo writes, so transactions can only be upgraded from reader to writer. Committing a write will thus check that the generation counter is still consistent with the (read) transaction before publishing the new write set and helping it forward.
And now, some syntactic sugar to schedule writes:
That’s enough for a smoke test on my dual-core laptop.
The function test-reads counts the number of successful read transactions and checks that (box-value a) and (box-value b) are always equal. That consistency is preserved by test-writes, which counts the number of times it succeeds in incrementing both (box-value a) and (box-value b).
The baseline case should probably be serial execution, while the ideal case for transactional mutex locks is when there is at most one writer. Hopefully, lock-free sequence locks also do well when there are multiple writers.
First, the serial case. As expected, all the transactions succeed, in 6.929 seconds total (6.628 without GC time). With one writer and two readers, all the writes succeed (as expected), and 98.5% of reads do as well; all that in 4.186 non-GC seconds, a 65% speed-up. Finally, with two writers and two readers, 76% of writes and 98.5% of reads complete in 4.481 non-GC seconds. That 7% slowdown compared to the single-writer case is pretty good: my laptop only has two cores, so I would expect more aborts on reads and a lot more contention with, e.g., a spinlock.
CL-USER> (gc :full t) (time (test-serial 1000000))
Evaluation took:
6.929 seconds of real time
6.944531 seconds of total run time (6.750770 user, 0.193761 system)
[ Run times consist of 0.301 seconds GC time, and 6.644 seconds non-GC time. ]
100.23% CPU
11,063,956,432 processor cycles
3,104,014,784 bytes consed
(10000000 10000000 1000000 1000000)
CL-USER> (gc :full t) (time (test-single-writer 1000000))
Evaluation took:
4.429 seconds of real time
6.465016 seconds of total run time (5.873936 user, 0.591080 system)
[ Run times consist of 0.243 seconds GC time, and 6.223 seconds non-GC time. ]
145.97% CPU
6,938,703,856 processor cycles
2,426,404,384 bytes consed
(9863611 9867095 1450000)
CL-USER> (gc :full t) (time (test-multiple-writers 1000000))
Evaluation took:
4.782 seconds of real time
8.573603 seconds of total run time (7.644405 user, 0.929198 system)
[ Run times consist of 0.301 seconds GC time, and 8.273 seconds non-GC time. ]
179.30% CPU
7,349,757,592 processor cycles
3,094,950,400 bytes consed
(9850173 9853102 737722 730614)
How does a straight mutex do with four threads?
CL-USER> (gc :full t) (time (test-mutex 1000000))
Evaluation took:
5.814 seconds of real time
11.226734 seconds of total run time (11.169670 user, 0.057064 system)
193.10% CPU
9,248,370,000 processor cycles
1,216 bytes consed
(#<SB-THREAD:THREAD FINISHED values: NIL {1003A6E1F3}>
#<SB-THREAD:THREAD FINISHED values: NIL {1003A6E383}>
#<SB-THREAD:THREAD FINISHED values: NIL {1003A6E513}>
#<SB-THREAD:THREAD FINISHED values: NIL {1003A6E6A3}>)
There’s almost no allocation (there’s no write record), but the lack of read parallelism makes locks about 20% slower than the lock-free sequence lock. A reader-writer lock would probably close that gap. The difference is that the lock-free sequence lock has stronger guarantees in the worst case: no unlucky preemption (or crash, with shared-memory IPC) can cause the whole system to stutter or even halt.
The results above correspond to my general experience. Lock-free algorithms aren’t always (or even regularly) more efficient than well-thought-out locking schemes; however, they are more robust and easier to reason about. When throughput is more than adequate, it makes sense to eliminate locks, not to improve the best or even the average case, but rather to eliminate a class of worst cases – including deadlocks.
P.S., here’s a sketch of the horrible cross-modifying code hack. It turns out that the instruction cache is fully coherent on (post-586) x86oids; the prefetch queue will even reset itself based on the linear (virtual) address of writes. With a single atomic byte write, we can turn a xchg (%rax), %rcx into xchg (%rbx), %rcx, where %rbx points to a location that’s safe to mutate arbitrarily. That’s an atomic store predicated on the value of a control word elsewhere (hidden in the instruction stream itself, in this case). We can then dedicate one sequence of machine code to each transaction and reuse them via some Safe Memory Reclamation mechanism (PDF).
There’s one issue: even without preemption (if a writer is preempted, it should see the modified instruction upon rescheduling), stores can take pretty long to execute: in the worst case, the CPU has to translate to a physical address and wait for the bus lock. I’m pretty sure there’s a bound on how long a xchg m, r64 can take, but I couldn’t find any documentation on hard figures. If we knew that xchg m, r64 never lasts more than, e.g., 10k cycles, a program could wait that many cycles before enqueueing a new write. That wait is bounded and, as long as writes are disabled very rarely, should improve the worst-case behaviour without affecting the average throughput.
My work at AppNexus mostly involves performance optimisation, at any level from microarchitecture-driven improvements to data layout and assembly code, to improving the responsiveness of our distributed system under load. Technically, this is similar to what I was doing as a lone developer on research-grade programs. However, the scale of our (constantly changing) code base and collaboration with a dozen other coders mean that I approach the task differently: e.g., rather than single-mindedly improving throughput now, I aim to pick an evolution path that improves throughput today without imposing too much of a burden on future development or fossilising ourselves in a design dead-end. So, although numbers still don’t lie (hah), my current approach also calls for something like judgment and taste, as well as a fair bit of empathy for others. Rare are the obviously correct choices, and, in that regard, determining what changes to make and which to discard as over-the-top ricing feels like I’m drafting a literary essay.
This view is probably tainted by the fact that, between English and French classes, I spent something like half of my time in High School critiquing essays, writing essays, or preparing to write one. Initially, there was a striking difference between the two languages: English teachers had us begin with the five-paragraph format, where one presents multiple arguments for the same thesis, while French teachers imposed a thesis/antithesis/synthesis triad (and never really let it go until CÉGEP, but that’s another topic). When I write that performance optimisation feels like drafting essays, I’m referring to the latter “Hegelian” process, where one exposes arguments and counterarguments alike in order to finally make a stronger case.
I’ll stretch the analogy further. Reading between the lines gives us access to more arguments, but it’s also easy to get the context wrong and come up with hilariously far-fetched interpretations. When I try to understand a system’s performance, the most robust metrics treat the system as a black box: it’s hard to get throughput under production data wrong. However, I need finer-grained information (e.g., performance counters, instruction-level profiling, or application-specific metrics) to guide my work, and, the more useful that information can be – like domain-specific metrics that highlight what we could do differently rather than how to do the same thing more efficiently – the easier it is to measure incorrectly. That’s not a cause for despair, but rather a fruitful line of skepticism that helps me find more opportunities.
Just two weeks ago, questioning our application-specific metrics led to an easy 10% improvement in throughput for our biggest consumer of CPU cycles. The consumer is an application that determines whether internet advertising campaigns are eligible to bid on an ad slot and, if so, which creative (ad) to show and at what bid price. For the longest time, the most time-consuming part of that process was the first step, testing for campaign eligibility. Consequently, we tracked the execution of that step precisely and worked hard to minimise the time spent on ineligible campaigns, without paying much attention to the rest of the pipeline. However, we were clearly hitting diminishing returns in that area, so I asked myself how an adversary could use our statistics to mislead us. The easiest way I could think of was to have campaigns that are eligible to bid, but without any creative compatible with the ad slot (e.g., because it’s the wrong size or because the website forbids Flash ads): although the campaigns are technically eligible, they are unable to bid on the ad slot. We added code to track these cases and found that almost half of our “eligible” campaigns simply had no creative in the right size. Filtering these campaigns early proved to be a low-hanging fruit with an ideal code complexity:performance improvement ratio.
I recently learned that we also had to second-guess instruction-level profiles. Contemporary x86oids are out-of-order, superscalar, and speculative machines, so profiles are always messy: “blame” is scattered around the real culprit, and some instructions (pipeline hazards like conditional jumps and uncached memory accesses, mostly) seem to account for more than their actual share. What I never realised is that, in effect, some instructions systematically mislead and push their cycles to others.
Some of our internal spinlocks use mfence. I expected that to be suboptimal, since it’s common knowledge that locked instructions are more efficient barriers: serialising instructions like mfence have to affect streaming stores and other weakly ordered memory accesses, and that’s a lot more work than just preventing store/load reordering. However, our profiles showed that we spent very little time on spinlocking, so I never gave it much thought… until eliminating a set of spinlocks had a much better impact on performance than I would have expected from the profile. Faced with this puzzle, I had to take a closer look at the way mfence and locked instructions affect hardware-assisted instruction profiles on our production Xeon E5s (Sandy Bridge).
I came up with a simple synthetic microbenchmark to simulate locking on my E5-4617: the loop body is an adjustable set of memory accesses (reads and writes of out-of-TLB or uncached locations) or computations (divisions), bracketed by pairs of normal stores, mfence, or lock inc/dec to cached memory (I would replace the fences with an increment/decrement pair, and it looks like all read-modify-write instructions are implemented similarly on Intel). Comparing runtimes for normal stores with the other instructions helps us gauge their overhead. I can then execute each version under perf and estimate the overhead from the instruction-level profile. If mfence is indeed extra misleading, there should be a greater discrepancy between the empirical impact of the mfence pair and my estimate from the profile.
You can find the super crufty code here, along with an rdtscp version of cycle.h.
With locked instructions and random reads that miss the L3 cache, the (cycle) profile for the microbenchmark loop is:
Looking at that profile, I’d estimate that the two random reads account for ~50% of the runtime, and the pair of lock inc/dec for ~40%. The picture is completely different for mfence.
It looks like the loads from uncached memory represent ~85% of the runtime, while the mfence pair might account for at most ~15%, if I include all the noise from surrounding instructions.
If I trusted the profile, I would worry about eliminating locked instructions, but not so much about mfence. However, runtimes (in cycles), which are what I’m ultimately interested in, tell a different story. The same loop of LLC load misses takes 2.81e9 cycles for 32M iterations without any atomic or fence, versus 3.66e9 for lock inc/dec and 19.60e9 cycles for mfence. So, while the profile for the mfence loop would let me believe that only ~15% of the time is spent on synchronisation, the mfence pair really represents 86% \(((19.6 - 2.81) / 19.6)\) of the runtime for that loop! Inversely, the profile for the locked pair would make me guess that we spend about 40% of the time there, but, according to the timings, the real figure is around 23%.
The other tests all point to the same conclusion: the overhead of mfence is strongly underestimated by instruction-level profiling, and that of locked instructions exaggerated, especially when adjacent instructions write to memory.
setup cycles (est. overhead) ~actual overhead
div [ALU] (100 Mi iterations)
atomic: 20153782848 (20%) ~ 3.8%
mfence: 28202315112 (25%) ~31.3%
vanilla: 19385020088
Reads:
TLB misses (64Mi iterations)
atomic: 3776164048 (80%) ~39.3%
mfence: 12108883816 (50%) ~81.1%
vanilla: 2293219400
LLC misses (32Mi iterations)
atomic: 3661686632 (40%) ~23.3%
mfence: 19596840824 (15%) ~85.7%
vanilla: 2807258536
Writes:
TLB (64Mi iterations)
atomic: 3864497496 (80%) ~10.4%
mfence: 13860666388 (50%) ~75.0%
vanilla: 3461354848
LLC (32Mi iterations)
atomic: 4023626584 (60%) ~16.9%
mfence: 21425039912 (20%) ~84.4%
vanilla: 3345564432
I can guess why we observe this effect; it’s not like Intel is intentionally messing with us. mfence is a full pipeline flush: it slows code down because it waits for all in-flight instructions to complete their execution. Thus, while it’s flushing that slows us down, the profiling machinery will attribute these cycles to the instructions that are being flushed. Locked instructions instead affect stores that are still queued. By forcing such stores to retire, locked instructions become responsible for the extra cycles and end up “paying” for writes that would have taken up time anyway.
Losing faith in hardware profiling being remotely representative of reality makes me a sad panda; I now have to double-check perf profiles when hunting for misleading metrics. At least I can tell myself that knowing about this phenomenon helps us make better-informed – if less definite – decisions and ferret out more easy wins.
P.S., if you find this stuff interesting, feel free to send an email (pkhuong at $WORK.com). My team is hiring both experienced developers and recent graduates (:
Garbage collection is in the air. My friend Maxime is having issues with D’s garbage collector, and Martin Cracauer has a large patch to improve SBCL’s handling of conservative references. I started reviewing that patch today, and, after some discussion with Alastair Bridgewater, I feel like adding a mark-and-sweep component to SBCL’s GC might be easier than what the patch does, while achieving its goal of reducing the impact of conservative references. That led to the whiteboarding episode below and a plan to replace the garbage-collecting half of SBCL’s generational GC. But first, a summary of the current implementation.
CMUCL started out with a Cheney-style two-space collector. Two-space collectors free up space for more allocations by copying objects that might still be useful (that are reachable from “roots,” e.g., variables on the stack) from the old space to the new space. Cheney’s algorithm elegantly simplifies this task by storing bookkeeping information in the data itself. When we copy an object to the new space (because it is reachable), we want to make sure that all other references to that object are also replaced with references to the copy. Cheney’s solution to that desideratum is obvious: overwrite the old object with a broken heart (forwarding pointer), a marker that redirects to the object’s new location. This adds a constraint that heap-allocated objects can never be smaller than a broken heart, but broken hearts are usually one or two words (two in SBCL’s case), so the constraint is rarely binding.
When the garbage collector traverses the roots (the stack, for example) and finds a pointer, the code only has to dereference that pointer to determine if the object it points to has been moved. If so, the GC replaces the root pointer with a pointer to the copy in the new space. Otherwise, the GC copies the object to the new space, repoints the root to that copy, and overwrites the old object with a broken heart.
We also need to traverse objects recursively: when we find that an object is live and copy it to the new space, we must also make sure that anything that object points to is also preserved, and that any pointer in that object is updated with pointers to copies in the new space.
That’s a graph traversal, and the obvious implementation maintains a workset of objects to visit which, in the worst case, could include all the objects in old space. The good news is we don’t have to worry about objects reentering that workset: we always overwrite objects (in old space) with a broken heart when we visit them for the first time.
Cheney proposed a clever trick to implement this workset. Whenever an object enters the workset, it has just been copied to the new space; as long as we allocate in the new space by incrementing an allocation pointer, the new space itself can serve as the workset! In addition to the allocation pointer, we now need a “to-scan” pointer. Any object in the new space that’s below the to-scan pointer has already been scanned for pointers and fixed to point in the new space; any object between the to-scan pointer and the allocation pointer must be scanned for pointers to the old space. We pop an element from the workset by looking at the next object (in terms of address) after the to-scan pointer and incrementing that pointer by the object’s size. When the to-scan and the allocation pointers meet, the workset is empty and GC terminates.
Some SBCL platforms still use this two-space collector, but it doesn’t scale very well to large heaps (throughput is usually fine, but we waste a lot of space and GC pauses can be long). The generational conservative garbage collector (GENCGC; GENGC on precise/non-conservative platforms) is a hack on top of that Cheney GC.
The GC is “generational” because most passes only collect garbage from a small fraction of the heap, and “conservative” because we have to deal with values that may or may not be pointers (e.g., we don’t always know if the value in a register is a Lisp reference or just a machine integer) by considering some objects as live (not up for collection) while pinning them at their current address.
The runtime uses mprotect to record writes to the heap, except for the nursery (newly allocated objects) where we expect most writes to land. The heap is partitioned in pages, and the first write to a page after a GC triggers a protection fault; the signal handler marks that page as mutated and changes the protection to allow writes.
When a GC is triggered, we usually want to collect only the nursery, i.e., only objects that were allocated after the previous GC pass. GEN(C)GC adapts Cheney to this use case by building the set of all pages that might have been mutated to point somewhere in the nursery (thanks to the mprotect write barrier) and scanning them for roots, like the stack in Cheney GC. The default GENGC configuration has 7 generations and we extend this scheme by flagging pages with pointers to younger generations (newer objects), without noting what these generations might be.
Pinned objects are also handled by abusing the root set: pages that contain at least one pinned object don’t undergo garbage collection and are directly scanned for pointers, like the stack in Cheney GC.
Instead of having two heaps, an old space and a new space, we now have a lot of pages, and each page belongs to a generation. When we want to collect a given generation, pages in that generation form the old space, and pages allocated during GC the new space. This means that we lose the simplicity of Cheney’s new-space-is-the-workset trick: the new space isn’t contiguous, so a single to-scan pointer doesn’t cut it anymore! GENGC works around that by scanning the page table, but it’s not pretty and I really don’t know if Cheney is a good fit anymore.
GENCGC’s approach to pinned objects is stupid. If a page has no reference except for one conservative pointer, the whole page is considered live and scanned for references.
Martin’s solution is to allocate additional temporary metadata only for pinned pages and track the pinned status of individual objects. When the GC encounters a pointer to a page with pinned objects, it checks if it’s a pointer to a pinned object. If so, the pointee is left in place. Otherwise, it’s copied normally.
The patch has code to mark objects as live (pinned) and to overwrite objects once they have been copied. Basically, it is half of a mark-and-sweep garbage collector. The main difference is that the set of pinned objects doesn’t grow (being pinned isn’t a contagious property), so we don’t need a worklist for pinned objects. However, I already noted that I’m not convinced the worklist hack in GENGC is a good idea.
Instead of marking pages as containing pinned objects, I feel it may be simpler to collect some pages by copying, and others by marking. Any pinned page would have the “mark” GC policy, while pages that likely contain few live objects (e.g., the nursery and pages with a lot of unused memory) would be collected by copying. This too would avoid the issue with considering whole pages as live when pinned, and I think that having the choice of copying or marking at a page granularity will be simpler than toggling at the level of individual object.
Each “mark” page now has two (bit)sets, one for live objects and another for live objects that have already been scanned. We can maintain a worklist at the page granularity with an embedded linked list: whenever a “mark” page gains a new live object and it’s not already in the worklist, that page is enqueued for scanning.
Instead of emulating Cheney’s trick by looking for newly allocated pages in our page table, we can add pages in new space to the worklist whenever they become full.
Finally, at the end of the pass, we traverse all “mark” pages and clear dead objects.
That’s pretty simple (arguably simpler than the current implementation!), and shouldn’t involve too many changes to the rest of the code. Mostly, I’d have to adapt the weak pointer machinery to stop assuming that it can use forwarding pointers to determine when objects have been marked as live.
However, we might lose the ability to run medium GCs, to collect more than the nursery but less than the whole heap. If we only want to GC the nursery, the mprotect write barrier gives us all the information we need to find references from the rest of the heap to the nursery. If we wish to collect the whole heap, we only have to consider stacks and some statically allocated space as roots.
For medium GCs, e.g., collecting only generations 1–4 out of 7, GENGC exploits the way that garbage collection (almost) always copies to easily track pages with pointers to younger generations. It’s coarse, but usually acceptable thanks to the copying. I don’t know that it would work as well if the default is to only copy the nursery. Moreover, if we have a hybrid GC, it probably makes sense to focus copying on pages that are mostly empty, regardless of their age. If we do want medium GCs, we might have to track, for each page, the set of pages that point there. This set can include false positives, so it’s probably easiest to clear it before major GCs, and otherwise only add to that set (removing pages that were emptied by a GC pass sounds reasonable). I also expect that some pages will have many referrers; I’m thinking we might use a distinguished value to mean “referred to by every page” and not consider them for medium GC.
Martin’s patch clearly addresses an important weakness in SBCL’s garbage collector. If I can’t make good progress on the hybrid GC soon, I’ll make sure the patch is cleaned up for master, hopefully by Thanksgiving.
Until a new patched version of my favourite C compiler is released, there aren’t many tasteful workarounds for this performance bug. I’d have to switch to inline asm, and either force the compiler to allocate the same register to the input and the result, or force different registers and clear the spurious dependency with a xor. Ideally, I wouldn’t impose any additional constraint on the register allocator and only insert a xor if the destination and source registers don’t match.
SBCL easily supports this use case, without having to re-release or even recompile the implementation: VOPs (virtual operations) execute arbitrary CL code during code generation, and they can be defined at runtime.
The first step is to make sure that SBCL’s assembler knows how to emit popcnt: the assembler can also be extended at runtime, but that’s more hairy and a topic for another post. Instruction encodings are defined in src/compiler/$ARCH/insts.lisp, and a quick grep reveals (define-instruction popcnt (segment dst src) ...): the x86-64 backend learned about popcnt in May 2013 (thanks to Douglas Katzman).
We define VOPs via define-vop, a macro that exposes many options. Most of the time, it’s easiest to look at a preexisting definition for an operation that’s similar to the one we want to add. Popcount looks like integer negation: it has a single (machine integer) argument and returns another integer. Integer negation is defined in src/compiler/$ARCH/arith.lisp:
The code snippet above includes a bit of boilerplate to factor out commonalities via inheritance. The first definition introduces fast-safe-arith-op, VOPs that apply in both high speed and high safety settings (the rest is copy/pasted noise from earlier ports that sport a scheduler); the second one extends fast-safe-arith-op to define fixnum-unop, a base definition for single-argument operations on fixnums, while the third one is the same, but for machine integers. The last three definitions fill in the blanks so the compiler can compile %negate of fixnum, signed and unsigned integers. The (:translate %negate) bit means that these VOPs can be emitted instead of calls to %negate. The integer after :generator defines the “cost” of each variant; the compiler will choose the (applicable) variant with the least cost and execute the code sequence that follows to convert a call to %negate into machine code.
This kind of implementation inheritance is fine for an SBCL backend, where we define many VOPs and expect developers to understand the system. I doubt it’s a didactic win. Let’s do something simpler for popcnt. In the interest of simplicity, I’ll also completely disregard powerful details in define-vop that are rarely relevant when defining intrinsics that map directly to machine instructions.
First, we need to tell the compiler that we’re about to do special things to a function named popcnt (and to blow away any pre-existing information if the defknown form is re-evaluated).
This says that popcnt accepts a 64-bit unsigned integer and returns an integer between 0 and 64 (inclusively), and that the function can be constant-folded, flushed (eliminated as dead code) and moved around (it’s pure).
Now, to define a VOP that implements popcnt:
We define a new VOP named popcnt:popcnt (the name is arbitrary, as long as it doesn’t collide with another VOP) that is applicable at all optimization policies (both high speed and high debug level), and that implements popcnt:popcnt. Its first and only argument, x, is an unsigned-num, an unsigned machine integer, that can only be stored in a register. Moreover, if possible, we’d like x to be allocated the same register as the result, r. There’s only one result (r) and it’s an unsigned machine integer in a register, just like x. The generator, of cost 3 (a common default for arithmetic operations), breaks any dependency chain in r if necessary, and stores the population count of x in r.
At first sight, the defknown form seems to conflict with the VOP. We declare that the return value of popcnt is a small integer, clearly a fixnum, and then define a VOP that returns a machine integer. The subtlety is that defknown is concerned with IR1, the higher-level intermediate representation, which works on CL types (i.e., types as sets) and abstract values. VOPs, on the other hand, are defined for the lower-level IR2, where types describe concrete representations (like C). It is perfectly meaningful to say that a small integer will be represented as an untagged machine integer.
The next step isn’t strictly necessary, but helps people who like their REPL. The compiler knows how to compile calls to popcnt, so we can define popcnt… as a call to popcnt. Our new function is now a first-class value that can be called from interpreted code and passed to higher-order functions, like the compiler’s constant-folding pass.
CL-USER> (disassemble 'popcnt:popcnt)
; disassembly for POPCNT:POPCNT
; Size: 25 bytes
; 07FCDB6E: 4831D2 XOR RDX, RDX ; no-arg-parsing entry point
; 71: F3480FB8D1 POPCNT RDX,RCX
; 76: 48D1E2 SHL RDX, 1
; 79: 488BE5 MOV RSP, RBP
; 7C: F8 CLC
; 7D: 5D POP RBP
; 7E: C3 RET
[ error trap noise ]
CL-USER> (popcnt:popcnt 42)
3
The disassembly shows that we get the code that we expect, including the dependency-breaking workaround, and the smoke test passes.
There’s one interesting detail: we only defined a VOP that returns a machine integer. However, popcnt returns a tagged value (a fixnum), and does so with an efficient shift. IR2 takes care of inserting any coercion needed between VOPs (e.g., between popcnt and the VOP used to return boxed values from functions), and the IR1 defknown guarantees that the result of popcnt, despite being represented in an unsigned machine integer, is small enough for a fixnum.
Let’s see what happens when we feed arithmetic into popcnt, e.g.:
CL-USER> (disassemble (lambda (x y)
(declare (type (unsigned-byte 32) x y))
(popcnt:popcnt (+ x y))))
; disassembly for (LAMBDA (X Y))
; Size: 55 bytes
; 0752BD59: 4801FA ADD RDX, RDI ; no-arg-parsing entry point
; 5C: 48D1FA SAR RDX, 1
; 5F: F3480FB8D2 POPCNT RDX,RDX
; 64: 48D1E2 SHL RDX, 1
; 67: 488BE5 MOV RSP, RBP
; 6A: F8 CLC
; 6B: 5D POP RBP
; 6C: C3 RET
After adding two fixnums, an automatic coercion unboxes the resulting fixnum into a machine integer, which is then passed to popcnt (note the lack of a dependency-breaking xor now that the source and destination are the same register).
That’s pretty good code, but we can do better: fixnums are tagged with 0, so we can simply pass fixnums to popcnt without untagging.
This is where the cost parameter to :generator comes in: we can define another VOP for popcnt of fixnums and bias the compiler to prefer the fixnum VOP.
CL-USER> (disassemble (lambda (x y)
(declare (type (unsigned-byte 32) x y))
(popcnt:popcnt (+ x y))))
; disassembly for (LAMBDA (X Y))
; Size: 47 bytes
; 07BEABE9: 4801FA ADD RDX, RDI ; no-arg-parsing entry point
; BEC: F3480FB8D2 POPCNT RDX,RDX
; BF1: 48D1E2 SHL RDX, 1
; BF4: 488BE5 MOV RSP, RBP
; BF7: F8 CLC
; BF8: 5D POP RBP
; BF9: C3 RET
Unlike many low-level languages, CL includes a standard function for population count, logcount. SBCL includes a VOP for logcount (with a cost of 14), which we can supersede with our own popcnt-based VOPs: we only have to replace (:translate popcnt:popcnt) with (:translate logcount). That’s an easy improvement but isn’t in trunk because popcnt is a recent x86 extension.
Adding VOPs for (ad hoc) polymorphic or otherwise generic functions can be surprising: a VOP will only be considered if the arguments and the return values are known to have CL types that are compatible with the VOP’s representation specification. For popcnt, we guarantee that the return value is a positive integer between 0 and 64; for cl:logcount, the defknown guarantees that the return value is a positive fixnum. In both cases, the return values can always be represented as an unsigned machine integer, so our new VOPs will always match if the argument fits in positive fixnums or unsigned machine integers (and will have priority over the generic x86-64 VOP because their cost is lower). More complex cases depend on derive-type optimizers, but that’s rarely necessary when defining intrinsics for low-level code.
I first looked into integer division by constants four years ago, and I was immediately struck by the ad hoc treatment of the transformation: I have yet to find a paper that summarises and relates the algorithms currently in use. Worse, the pseudocode tends to assume fixed-width integers, which drowns the interesting logic in bignum-management noise. Back when I had free time, I uploaded an early draft of what may become a more enlightening introduction to the topic. My goal was to unite all the simplification algorithms I’d seen and to generalise them to SBCL’s needs: our optimisers benefit from precise integer range derivation, and codegen ought to deal with tagged fixnums. The draft should take shape as the GSoC project progresses.
There is one widespread – but very specialised – integer division algorithm that does not fit in the draft: multiplication by modular inverses. I’m guessing it’s common because it’s the first thing that comes to mind when we say division-by-multiplication. The transformation is also so specialised that I find it’s usually mentioned in contexts where it wouldn’t work. Still, it’s a nice application of algebra and the coefficients are simple enough to generate at runtime (even in C or assembly language), so here goes.
Let \(a\) and \(m\) be naturals. The multiplicative inverse of \(a\) modulo \(m\) is a natural \(b\) such that \(a \times b \equiv 1 \mod m\). Machine arithmetic is naturally modular (e.g., mod \(m = 2\sp{32}\)). This seems perfect!
There are a couple issues here: we must somehow compute the modular inverse; the inverse only exists when the divisor is coprime with the modulus; and the product only recovers the quotient when the dividend is an exact multiple of the divisor.
For a concrete example of the third issue, consider \(11\), the multiplicative inverse of \(3 \mod 16\): \(3\times 11 = 33 \equiv 1 \mod 16\) and \(6 \times 11 = 66 \equiv 2 \mod 16\). However, \(4 \times 11 = 44 \equiv 12 \mod 16\), and \(12\) is nowhere close to \(4 \div 3\).
This post addresses the first two points. There is no workaround for the last one.
We can generate a modular inverse with the extended Euclidean algorithm. Wikipedia shows the iterative version, which I can never remember, so I’ll instead construct the simple recursive one.
We already assume that \[\mathop{gcd}(a, b) = 1 \] and we wish to find \(x, y\) such that \[ax + by = 1.\] Bézout’s identity guarantees that such coefficients exist.
Things are simpler if we assume that \(a < b\) (they can only be equal if \(a = b = 1\), and that case is both annoying and uninteresting).
If \(a = 1\), then \(a\cdot 1 + b\cdot 0 = 1\).
Otherwise, let \(q = \lfloor b/a\rfloor\) and \(r = b  qa\). \[\mathop{gcd}(a, r) = \mathop{gcd}(a, b) = 1,\] and, given \[ax’ + ry’ = 1,\] we can revert our change to find \[ax’ + (b  qa)y’ = a(x’  qy’) + by’ = 1.\]
We’re working in modular arithmetic, so we can sprinkle mod m without changing the result. In C, this will naturally happen for unsigned integers, via overflows. In CL, we can still force modular reduction, just to convince ourselves that we don’t need bignums.
And a quick sanity check:
CL-USER> (loop for m from 2 upto (ash 1 10)
do (loop for i from 1 below m
when (= 1 (gcd i m))
do (inverse i m)))
NIL ; no assertion failure
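For reference, a sketch of the same computation in C (names mine): unsigned overflow provides the reduction mod \(2\sp{64}\) for free, and the only wrinkle is that \(2\sp{64}\) itself doesn’t fit in a uint64_t, so the first Euclidean division is unrolled by hand. This iterative version tracks only the coefficient of \(a\), which is all we need for the inverse.

```c
#include <stdint.h>

/* x such that a * x == 1 (mod 2^64), for odd a.  Coefficients are
 * tracked modulo 2^64: plain unsigned overflow performs the modular
 * reduction.  The first division is unrolled by hand because the
 * initial modulus, 2^64, doesn't fit in a uint64_t:
 * 2^64 = q * a + r, with q = (2^64 - a) / a + 1 and r = (2^64 - a) % a.
 */
static uint64_t
inverse(uint64_t a)
{
        uint64_t q = (0 - a) / a + 1;
        uint64_t r = (0 - a) % a;
        /* Invariant: old == x_old * a and cur == x_cur * a (mod 2^64),
         * where (old, cur) walks the Euclidean remainder sequence. */
        uint64_t old = a, cur = r;
        uint64_t x_old = 1, x_cur = 0 - q;

        while (cur != 0) {
                uint64_t quot = old / cur;
                uint64_t tmp;

                tmp = old - quot * cur;
                old = cur;
                cur = tmp;
                tmp = x_old - quot * x_cur;
                x_old = x_cur;
                x_cur = tmp;
        }
        /* gcd(a, 2^64) = 1 for odd a, so old == 1 and x_old * a == 1. */
        return x_old;
}
```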
The second issue is that the multiplicative inverse only exists if our divisor and our modulo (e.g., \(2\sp{32}\)) are coprime. The good news is that \(\mathop{gcd}(a, 2\sp{w})\) can only be a power of two. We only have to factor our divisor \(a = 2\sp{s} v\), and find \(i\), the multiplicative inverse of \(v\). Division by \(a\) is then a right shift by \(s\) and a multiplication by \(i\).
And now, a final round of tests:
CL-USER> (defun test-divisor (d m)
(let ((divisor (divisor d m)))
(loop for i upfrom 0
for j from 0 by d below m
do (assert (= (funcall divisor j) i)))))
TEST-DIVISOR
CL-USER> (loop for width from 1 upto 20
for mod = (ash 1 width)
do (loop for x from 1 below mod
do (testdivisor x mod)))
NIL
A simple transformation from integer division to shift and multiplication… that works only in very specific conditions.
I’ve only seen this transformation used for pointer subtractions in C-like languages: machines count in chars and programs in whatever the pointers point to. Pointer arithmetic is only defined within the same array, so the compiler can assume that the distance between the two pointers is a multiple of the object size.
The following program is deep in undefined behaviour, for example.
pkhuong:tmp pkhuong $ clang foo.c && ./a.out
2635249153387078801 0
What I find interesting is that, if we pay attention to the correctness analysis, it’s clear that general div-by-mul transformations benefit from known common factors between the divisor and the dividend. In the extreme case, when the dividend is always a multiple of the divisor, we can convert the division to a single double-wide multiplication, without any shift or additional multi-word arithmetic. On architectures with fast multipliers or ones that let us compute the high half of a product without the low part, the general case (coupled with a tight analysis) may be marginally quicker than this specialised transformation. Yet, both GCC and clang convert pointer subtractions to shifts and multiplications by modular inverses.
In the end, multiplicative inverses seem mostly useful as a red herring, and as minimal-complexity low-hanging fruit. The only reason I use them is that it’s easy to generate the coefficients in C, which is helpful when allocation sizes are determined at runtime.
The impetus for this post was a max-heap routine I had to write because libc, unlike the STL, does not support incremental or even partial sorting. After staring at the standard implicit binary heap for a while, I realised how to generalise it to arbitrary arity. The routine will be used for medium-size elements (a couple dozen bytes) and with a trivial comparison function; in that situation, it makes sense to implement a high-arity heap and perform fewer swaps in return for additional comparisons. In fact, the tradeoff is interesting enough that a heapsort based on this routine is competitive with BSD and glibc sorts. This post will present the k-ary heap code and explore the impact of memory traffic on the performance of a few classical sort routines. The worst-performing sorts in BSD’s and GNU’s libc overlook swaps and focus on minimising comparisons. I argue this is rarely the correct choice, although our hands are partly tied by POSIX.
The heart of a heap – and of heapsort – is the sift-down function. I explored the indexing scheme below in detail somewhere else, but the gist of it is that we work with regular 0-indexed arrays and the children of heap[i] lie at indices heap[way * i + 1] to heap[way * i + way], inclusively. Like sifting for regular binary heaps, the routine restores the max-heap property under the assumption that only the root (head) may be smaller than any of its children.
The expected use case for this routine is that compar is a trivial function (e.g., comparing two longs) and size relatively large, up to fifty or even a few hundred bytes. That’s why we do some more work to not only minimise the number of swaps, but also to accelerate each swap.
When the data will always be aligned to 16-byte boundaries, we swap vector registers. Otherwise, we resort to register-size calls to memcpy(3) – decent compilers will turn them into unaligned accesses – and handle any slop with byte-by-byte swaps.
It’s a quick coding job to wrap the above in a linear-time heapify and a heapsort.
This heap implementation proved more than good enough for our production use. In fact, it is so efficient that nheapsort is competitive with battle-tested sort implementations, particularly when sorting medium or large structs. However, these classical implementations were primarily tuned with comparison counts in mind… which I argue is now inappropriate. In the remainder of this post, I’ll explore the impact of swaps on the performance of sort routines.
I will compare with the sort routines in two libc: GNU’s glibc and BSD’s (FreeBSD and OS X, at least, and other *BSD I expect) libc. All are more than 20 years old.
The qsort(3) routine in glibc is usually a mergesort that allocates scratch space. It includes some tests to try and detect out-of-memory conditions, but I doubt the logic is very effective on contemporary hardware and with Linux’s overcommit by default. BSD manpages specify that qsort(3) is in-place. GNU’s don’t… because glibc’s qsort is not a quicksort.
I can see many arguments for glibc’s design choice. For one, a nicely tuned mergesort may be quicker than a safe (guaranteed \(\mathcal{O}(n \log n)\) time) quicksort. It also enables msort’s indirect sorting strategy for large elements: when elements span more than 32 chars, the sort phase only manipulates pointers and a final linear-time pass shuffles the actual data in place.
On the other hand, I would be annoyed if qsort triggered the OOM killer.
For this reason, I’ll explicitly consider the real in-place quicksort that gets called when it’s clear that allocating more memory is infeasible. Sadly, I doubt that quicksort has received much attention since it was tuned for a Sun 4/260. It’s clearly a product of the ’80s. It has an explicit stack, the recursion order guarantees logarithmic stack depth, the pivot is always chosen with a median of three, and the base case (a partition of 4 or fewer elements) switches to insertion sort.
The code and the very design of the routine seem to completely disregard memory traffic. The swap loop is always char-by-char, and the base case (insertion sort) isn’t actually called when recursion stops at a leaf. Instead, small unsorted partitions are left as is. Insertion sort is only executed at the end, over the whole array: unsorted partitions contain at most 4 elements and are otherwise ordered correctly, so each insertion sort inner loop iterates at most 4 times. This strategy wastes the locality advantage of divide-and-conquer sorts: the insertion sort pass is almost streaming, but it would likely operate on warm data if leaf partitions were sorted on the fly.
It also includes unsafe code like
char *const end_ptr = &base_ptr[size * (total_elems - 1)];
char *tmp_ptr = base_ptr;
char *thresh = min(end_ptr, base_ptr + max_thresh);
I can only pray that compilers never become smart enough to exploit base_ptr + max_thresh to determine that this routine will never sort arrays of fewer than 4 elements.
glibc’s implementation of insertion sort only aggravates the problem. After finding a sorted position for the new element, insertion sort must insert it there. Usually one would implement that by copying the new element (one past the end of the sorted section) to a temporary buffer, sliding the tail of the sorted array one slot toward the end, and writing the new element in its expected position. Unfortunately, we can’t allocate an arbitrary-size temporary buffer if our quicksort is to be in-place. The insertion loop in glibc circumvents the problem by doing the copy/slide/write dance… one char at a time. That’s the worst implementation I can think of with respect to memory access patterns. It operates on a set of contiguous addresses (the tail of the sorted array and the new element at its end) with a double loop of strided accesses that still ends up touching each of these addresses. When the element size is a power of two, it might even hit aliasing issues and exhaust the associativity of data caches.
The quicksort in *BSD is saner, perhaps because it is still called by qsort(3). While glibc pays lip service to Bentley and McIlroy’s “Engineering a Sort Function”, BSD actually uses their implementation. The swap macro works one long at a time if possible, and the recursive (with tail recursion replaced by a goto) function switches to insertion sort as soon as the input comprises fewer than 7 elements. Finally, unlike GNU’s insertion sort, BSD’s rotates in place with a series of swaps. This approach executes more writes than glibc’s strided rotations, but only accesses pairs of contiguous addresses.
There’s one tweak that I find worrying: whenever a pivoting step
leaves everything in place, quicksort directly switches to insertion
sort, regardless of the partition size. This is an attempt to detect
presorted subsequences, but malicious or unlucky input can easily
turn BSD’s quicksort into a quadratictime insertion sort.
GNU’s quicksort finishes with insertion sort because the latter copes well with almost-sorted data, including the result of its partial quicksort. BSD’s qsort is a straight, cache-friendly, divide-and-conquer implementation. It looks like it’d make sense to replace its base case with a selection sort: selection sort will perform more comparisons than insertion sort (the latter could even use binary search), but, with a single swap per iteration, will move much less data around. I tried it out and the change had little to no impact, even for large elements.
For completeness’s sake, I’ll also include results with BSD libc’s heapsort and mergesort. The heapsort doesn’t look like it’s been touched in a long while; it too gives compilers licence to kill code, via
/*
* Items are numbered from 1 to nmemb, so offset from size bytes
* below the starting address.
*/
base = (char *)vbase  size;
Even libcs can’t get undefined behaviour straight.
Ideally, we could expose the effect of both element size and element count on the performance of these sort routines. However, even if I were to measure runtimes on the cross product of a range of element size, range of element count, and a set of sort routine, I don’t know how I would make sense of the results. Instead, I’ll let element sizes vary from one word to a few hundred bytes, and bin element counts in a few representative ranges:
The test program times each sort routine on the same randomly
generated input (sampling with replacement from random(3)
), for each
element size. I try and smooth out outliers by repeating this process
(including regenerating a fresh input) 20 times for each element size and
count.
Regardless of the element size, the comparison function is the same:
it compares a single unsigned
field at the head of the member. I
chose this comparator to reflect real world workloads in which
comparisons are trivial. When comparisons are slow, element size
becomes irrelevant and classical performance analyses apply.
The idea is to track, within each count bin, relative speedup/slowdown compared to BSD’s quicksort: any difference in multiplicative factors, as a function of element size, should show itself. The element count bins can instead highlight differences in additive factors.
A test program logged runtimes for this cross product of all element sizes from 8 to 512 bytes (inclusively), all element counts (4 to 64, inclusively), and various sort routines. For each size/count pair, the program regenerated a number of random(3) inputs; each sort routine received a copy of the same input. Given such random input, we can expect all runtimes to scale with \(\Theta(n\log n)\), and it makes sense to report cycle counts scaled by a baseline (BSD’s quicksort).
I compiled each sort routine (with gcc 4.8.2 at -O2) separately from the benchmark driver. This prevented fancy interprocedural optimisation or specialisation from taking place, exactly like we would expect from calling to libc. Everything below reports (scaled) cycle counts on my machine, an E5-4617. I executed the benchmark with four processes pinned to different sockets, so the results should reflect a regular single-threaded setting. I definitely could have run the benchmark on more machines, but I doubt that the relative computational overhead of copying structs versus comparing two machine words has varied much in the past couple years.
I first tested our k-ary heapsort with many values for k: 2–9, 12, and 15–17.
When sorting 8-byte values, k doesn’t have too much of an impact. Nevertheless, it’s clear that large k (12 and over) are slower than even binary heapsort. The sweet spot seems to be around 4 to 7.
Larger elements (32 and 64 bytes) show that the usual choice of k = 2 causes more than a 30% slowdown when swaps are slow. The ideal k falls somewhere between 5 and 9, and between 6 and 17, respectively.
Finally, at the extreme, with 512-byte elements, binary heapsort is almost twice as slow as 7-ary to 17-ary heapsorts.
The best choice of k
will vary depending on how much slower it is to
swap two elements than to compare them. A fixed value between 5 and 7
should be close to optimal for elements of 8 to 512 bytes.
I already mentioned that GNU’s quicksort has a slow swap macro, and that it ends with an insertion sort pass rather than completely sorting recursion leaves. I tested versions of that quicksort with a different swap function:
I also tested versions that only stop recursing on inputs of size 1 (with #define MAX_THRESH 1). This is a simple-minded way to implement quicksort, but it avoids the final insertion sort pass (which proved too hairy to convert to the new swap function).
When elements are small (8 bytes), the trivial recursion base case (GQL) is a bad idea; it’s slower than the original GNU quicksort (GQ). The new swap function (GQS), however, is always useful. In fact, the combination of trivial base cases and improved swap (GQLS) is pretty much identical to the original GNU quicksort.
Switching to larger elements (16, 32, 64 or 512 bytes) shakes things up a bit. The impact of faster swaps increases, and GQLS – the version with faster swap and trivial base case – is markedly quicker than the other variants.
In the sequel, I will only consider the original GNU quicksort (GQ) and the variant with trivial base case and faster swaps. From now on, I’ll refer to that variant as GQ’.
I proposed to partition array lengths in 4 bins. For that to work, the behaviour of the sort routines must be fairly homogeneous within each bin.
The graph below fixes the element size at 8 bytes and presents normalised times for a few sort routines and a range of array sizes. We report the (geometric) average of 20 runtimes, scaled by the runtime for BSD quicksort on the same input, on a logarithmic scale. The logarithmic scale preserves the symmetry between a routine that is twice as slow as BSD quicksort (log ratio = 1) and one that is twice as fast (log ratio = -1). The sort routines are:
The graph is pretty noisy for smaller arrays, but there’s no obvious discrepancy within size bins. The same holds for larger element sizes (16, 64 and 512 bytes).
We can already see that BSD’s quicksort is hard to beat on the element sizes considered above.
The next graph shows the evolution of average (log) normalised sort times for tiny arrays (4–7 elements), as the size of each element grows from 8 to 512 bytes.
There’s some overplotting, but it’s clear that there are actually two sets of curves: the main one, and another that only comprises sizes that are multiples of 8 (wordaligned). I believe this is caused by special optimisations for nice element sizes in some (but not all) sort routines, including our baseline, BSD’s quicksort.
For word-unaligned sizes, GNU’s mergesort and our tweaked GNU quicksort are never slower than the baseline; for large items, they are around 4x as quick. On the same input, BSD’s mergesort and our k-ary heapsorts are on par with one another and slightly slower than the baseline, while GNU’s quicksort is slightly quicker. The only sort routine that’s markedly slower than BSD’s quicksort is BSD’s binary heapsort (around 2–3x as slow).
BSD’s word-alignment optimisation makes sense in practice: larger structs will often include at least one field that requires non-trivial alignment. I think it makes sense to look at arbitrary sizes only up to a small limit (e.g., 32 bytes), and then only consider word-aligned sizes.
For small element sizes, all but two routines are comparable to the baseline. BSD’s heapsort is markedly slower, and our tweaked GNU quicksort somewhat faster.
When element sizes are word-aligned, BSD’s quicksort is hard to beat. The only routine that manages to be quicker is GNU’s mergesort, for very large element sizes: the routine exploits dynamic storage to sort pointers and only permutes elements in place during a final linear-time pass. Our k-ary heapsorts remain in the middle of the pack, slightly slower than BSD’s mergesort and GNU’s quicksort.
The results in the previous section were noisy: that’s expected when sorting arrays of 4 to 7 elements. This section looks at arrays of 8 to 15 elements; there’s less random variation and the graphs are much clearer.
The graph for overall speeds is similar to the one for tiny arrays. However, it’s clear that, for general element sizes, our k-ary heapsorts are slightly faster than BSD’s mergesort and a bit slower than GNU’s quicksort. Again, GNU’s mergesort and our tweaked quicksort are even quicker than BSD’s quicksort.
When we only consider word-aligned element sizes, the picture changes. The most important change is that our baseline is a lot quicker, and even GNU’s mergesort is slightly slower than BSD’s quicksort. Our heapsorts are now slower than BSD’s mergesort, which is now comparable to GNU’s quicksort.
It’s only when we consider almost ridiculous element sizes that GNU’s mergesort edges out BSD’s quicksort: GNU’s indirect sorting strategy finally pays off.
Small and medium arrays lead to the same conclusion as tiny ones, only more clearly. For arbitrary element sizes, GNU’s mergesort and tweaked quicksorts are much quicker than everything else. As for k-ary heapsorts, they are comparable to BSD’s quicksort and quicker than BSD’s mergesort and heapsort, except for small elements (fewer than a dozen bytes each).
For word-aligned element sizes, GNU’s quicksort is still only outperformed by GNU’s indirect mergesort, while k-ary heapsorts are still slightly slower than GNU’s quicksort and BSD’s mergesort, but always quicker than the latter’s heapsort.
I’d say our results show that heapsort gets a bad rap because we always implement binary heaps. Higher-arity (around 7-way) heaps are just as easy to implement, and much more efficient when comparators are trivial and data movement slow. For small arrays of medium-size elements, k-way heapsorts are competitive with BSD’s mergesort and GNU’s quicksort, while guaranteeing \(\mathcal{O}(n \log n)\) worst-case performance and without allocating storage; they’re almost magical when we only need partial or incremental sorts.
There’s a more general conclusion from these experiments: qsort(3)’s very interface is a bad fit for a lot of workloads. qsort lets us implement arbitrary comparisons with a callback, but must rely on generic swap routines. This probably reflects a mindset according to which comparisons (logic) are slow and swaps (accessing data) trivially quick; GNU’s quicksort was definitely written with these assumptions in mind. The opposite is often the case, nowadays. Most comparators I’ve written are lexicographic comparisons of machine words, floating point values, and, very rarely, strings. Such comparators could be handled with a few specialised sort routines: lexicographic sorts can either exploit stability (à la bottom-up radix sort), or recursion (top-down radix sort). However, there were some cases when I could have implemented swaps more cleverly than with memcpy(3), and I could always have passed a size-specialised swap routine. There were also cases when I wanted to sort multiple arrays with respect to the same key, and had to allocate temporary storage, transpose the data, sort, and transpose back. Mostly, this happened because of hybrid struct-of-arrays data layouts imposed by memory performance considerations, but implementing the Schwartzian transform in C (which avoids recomputing keys) is basically the same thing. A swap function argument would be ideal: it’s quicker than generic swaps, supports memory-optimised layouts, and helps express a classic Perl(!) idiom for sorting on derived keys.
Behold, the modern sort interface:
struct comparison {
void *base;
size_t stride;
union { void *ptr; unsigned long mask; } auxiliary;
enum comparison_kind kind;
};
void
generic_sort(const struct comparison *cmp, size_t ncmp, size_t length,
void (*swap)(void *, size_t i, size_t j), void *swap_data);
Interpreted DSLs for logic and compiled data shuffling; it’s the future.
I find most developers are vastly more comfortable dealing with pointer-based data structures than with implicit ones. I blame our courses, which focus on the former and completely fail to show us how to reason about the latter. For example, the typical presentation of the binary heap introduces an indexing scheme to map from parent to children – the children of heap[i] are heap[2 * i] and heap[2 * i + 1], with one-based indices – that is hard to generalise to ternary or k-ary heaps (Knuth’s presentation, with parents at \(\lfloor i/2 \rfloor\), is no better). The reason it’s so hard to generalise is that the indexing scheme hides a simple bijection between paths in k-way trees and natural numbers.
I find it ironic that I first encountered the idea of describing data structures or algorithms in terms of number systems through Okasaki’s and Hinze’s work on purely functional data structures: that vantage point seems perfectly suited to the archetypal mutable data structure, the array! I’ll show how number systems help us understand implicit data structures with two examples: a simple indexing scheme for k-ary heaps, and compositions of specially structured permutations to implement in-place bit reversal and (some) matrix transpositions.
The classical way to present implicit binary heaps is to work with one-based array indices and to root an implicit binary tree at heap[1]. For any node heap[i], the children live at heap[2 * i] and heap[2 * i + 1]. My issue with this presentation is that it’s unclear how to extend the scheme to ternary or k-ary trees: if the children of heap[1] are heap[3 * 1 ... 3 * 1 + 2], i.e., heap[3, 4, 5], what do we do with heap[2]? We end up with k - 1 parallel k-ary trees stashed in the same array. Knuth’s choice of mapping children to parents with a floored integer division suffers the same fate.
One-based arrays hide the beauty of the binary indexing scheme. With zero-based arrays, the children of heap[i] are heap[2 * i + 1] and heap[2 * i + 2]. This is clearly isomorphic to the one-based scheme for binary heaps. The difference is that the extension to k-way trees is obvious: the children of heap[i] are heap[k * i + 1 ... k * i + k].
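The zero-based scheme drops straight into code. Below is a minimal Python sketch (mine, not the post’s heapsort; all names are made up) of k-ary heap indexing with a generic sift-down:

```python
def children(i, k):
    """Indices of the children of node i in a zero-based k-ary heap."""
    return range(k * i + 1, k * i + k + 1)

def parent(i, k):
    """Index of the parent of node i (for i > 0)."""
    return (i - 1) // k

def sift_down(heap, i, k):
    """Restore the max-heap property for the subtree rooted at i."""
    n = len(heap)
    while True:
        largest = i
        for c in children(i, k):
            if c < n and heap[c] > heap[largest]:
                largest = c
        if largest == i:
            return
        heap[i], heap[largest] = heap[largest], heap[i]
        i = largest

def heapify(heap, k):
    """Bottom-up heapification; works unchanged for any arity k >= 2."""
    for i in reversed(range(len(heap) // k + 1)):
        sift_down(heap, i, k)
```

With k = 2 this is the textbook binary heap; nothing in the code changes for ternary or 7-way heaps, which is the whole point of the zero-based indexing.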
A couple of examples like the one below fail to show any problem. However, even a large number of tests is no proof. Thinking in terms of number systems leads to a nice demonstration that the scheme defines a bijection between paths in infinite k-ary trees and the naturals.
There is a unique finite path from the root to any vertex in an infinite k-ary tree. This path can be described as a finite sequence \((c\sb{1}, c\sb{2}, \ldots, c\sb{n})\) of integers between 1 and k (inclusively). If \(c\sb{1} = 1\), we first went down the first child of the root node; if \(c\sb{1} = 2\), we instead went down the second child; etc. If \(c\sb{2} = 3\), we then went down the third child, and so on. We can recursively encode such finite paths as naturals (in pidgin ML):
path_to_nat [] = 0
path_to_nat [c : cs] = k * path_to_nat cs + c
Clearly, this is an injection from finite paths in k-ary trees to the naturals. There’s only a tiny difference with the normal positional encoding of naturals in base k: there is no 0, and digits instead include k. This prevents us from padding a path with zeros, which would map multiple paths to the same natural.
We only have to show that path_to_nat is a bijection between finite paths and naturals. I’ll do that by constructing an inverse that is total on the naturals.
nat_to_path 0 = []
nat_to_path n = let c = pop n
                in c : nat_to_path ((n - c) / k)
where pop is a version of mod k that returns k instead of 0:
pop n = let c = n `mod` k
        in if c != 0 then c else k
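Transcribed from the pidgin ML into Python (a sketch of mine, with the radix k passed explicitly), the encoding and its inverse round-trip exactly:

```python
def path_to_nat(path, k):
    """path_to_nat [c : cs] = k * path_to_nat cs + c, written iteratively."""
    n = 0
    for c in reversed(path):
        n = k * n + c
    return n

def pop(n, k):
    """A version of mod k that returns k instead of 0."""
    c = n % k
    return c if c != 0 else k

def nat_to_path(n, k):
    """Total inverse of path_to_nat: every digit lands in 1..k, no zeros."""
    path = []
    while n > 0:
        c = pop(n, k)
        path.append(c)
        n = (n - c) // k
    return path
```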
The base case is nat_to_path 0 = []. In the induction step, we can assume that path_to_nat cs = n and that nat_to_path n = cs. We only have to show that, for any \(1 \leq c \leq k\), nat_to_path (path_to_nat (c:cs)) = c:cs. Let n' = path_to_nat (c:cs) = k * n + c.
\[n\sp{\prime} = kn + c \equiv c \pmod{k},\]
so pop n' will correctly return c (and k rather than 0). It’s then a tiny bit of algebra to show that (n' - c) / k = n, and we fall back to the induction hypothesis.
This scheme is so simple that I wound up coding a version of heapsort(3) that lets callers choose the heap’s arity at runtime. Higher-arity heaps perform more comparisons but fewer swaps than binary heaps; the trade-off is profitable when sorting large items. It seems to me that, for decades, we’ve been presenting implicit heaps and heapsort in a way that marginally simplifies the binary case at the expense of obscuring the elegant general case.
Bit reversing an array of length \(2\sp{l}\) sends the element x[i] to x[j], where the binary representation of i (including padding up to l bits) is the reverse of that of j. For example, in an array of length 16, 3 = 0011 becomes 1100 = 12.
Reversing a fixed-width integer’s binary representation is its own inverse, so bit reversing an array is a sequence of swaps. This means that the permutation can be performed in place, as a series of independent swaps. Bit reversal used to be slow on cached machines: contiguous elements (with indices that only vary in their low-order bits) swap with far-off elements (indices that only vary in their high-order bits). Worse, the stride between the latter elements is a large power of two, which causes all sorts of aliasing issues. Workarounds (see Zhang 99 (PDF)) mostly end up implementing a software cache with explicit buffers. Nowadays, even L1 caches have such high associativity that aliasing is a much less important issue.
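Concretely, the permutation looks like the following Python sketch (mine, purely illustrative): each index pair (i, rev(i)) swaps exactly once, so the whole thing runs in place.

```python
def bit_reverse(i, l):
    """Reverse the low l bits of i."""
    r = 0
    for _ in range(l):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def bit_reverse_permute(x):
    """In-place bit-reversal permutation; len(x) must be a power of two."""
    n = len(x)
    l = n.bit_length() - 1
    assert 1 << l == n
    for i in range(n):
        j = bit_reverse(i, l)
        if i < j:  # visit each swap pair once; palindromic indices stay put
            x[i], x[j] = x[j], x[i]
    return x
```

Applying the permutation twice is the identity, which is just the self-inverse property in code.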
Napa-FFT3 implements bit reversals by calling a few specialised functions that only swap the lower and higher bits; the main routine iterates over an array of precomputed middle-bit reversals (similar to various publications of Elster’s, but recursing on the middle bits first). In this implementation, the number of L1 cache misses incurred by bit reversing an array is only slightly greater than the compulsory misses. Bit reversal isn’t free, but it’s also not clear that auto-sorting FFTs are quicker than out-of-order FFTs followed by a bit reversal pass.
Bit reversal is the only array permutation I’ve seen described in terms of its effect on indices. I think it’s a fruitful avenue for other in-place permutations.
For example, the viewpoint makes it clear how to transpose a matrix of dimensions \(2\sp{m} \times 2\sp{n}\) with a sequence of in-place bit reversals (each \(i\sb{k}\) and \(j\sb{l}\) is a bit in the index’s binary representation).
For a row-major layout, the sketch above corresponds to:
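A Python sketch (mine, not the post’s code) of that sequence: bit-reverse within each row of \(2\sp{n}\) elements, bit-reverse the whole array, then bit-reverse within each row of the transposed, \(2\sp{m}\)-wide shape.

```python
def bit_reverse(i, l):
    """Reverse the low l bits of i."""
    r = 0
    for _ in range(l):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def bit_reverse_blocks(x, bits):
    """Bit-reverse indices within each contiguous block of 2**bits elements."""
    block = 1 << bits
    for base in range(0, len(x), block):
        for i in range(block):
            j = bit_reverse(i, bits)
            if i < j:
                x[base + i], x[base + j] = x[base + j], x[base + i]

def transpose_bitrev(x, m, n):
    """Transpose a row-major 2**m x 2**n matrix in place, as bit reversals."""
    bit_reverse_blocks(x, n)      # reverse the j bits within each row
    bit_reverse_blocks(x, m + n)  # reverse all m + n bits (one big block)
    bit_reverse_blocks(x, m)      # reverse the i bits within each new row
    return x
```

Chasing one index through the three steps shows why it works: \([i][j] \to [i][\mathrm{rev}(j)] \to [j][\mathrm{rev}(i)] \to [j][i]\).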
Bit reversal, like all other permutations, is a reversible linear operation. We can thus change the order of operations if we want to. For example, it’s not necessarily preferable to bit-reverse contiguous rows first. We could also flip the high-order bits of the indices: rather than swapping scalars, we would swap rows. Separately bit reversing contiguous rows works best when each row fits in cache. Bit reversing columns instead amortises the bad access pattern inherent to bit reversal by spending more time on each swap: swapping rows is slower than swapping scalars, but also very efficient with regard to (streaming!) I/O.
This is interesting because in-place transposition of rectangular matrices is hard, and transposition is already a bad fit for caches. Transposing matrices with a sequence of bit reversals might just be practical. In fact, that’s what I intend to do in Napa-FFT3 for multidimensional DFTs: we can fuse all but the middle whole-vector bit reversal with mixed-radix FFTs (and the latter might similarly benefit from operating on [sub]rows rather than scalars).
One obvious question now appears: can we generalise the trick to arbitrary dimensions? It’s pretty clear that we can do it for any other base \(b\) and matrices of dimension \(b\sp{m} \times b\sp{n}\) (it’s interesting how highly composite dimensions are easy to transpose, and, IIRC, so are coprime ones). What if there’s no such factorisation? The best I can do is “more or less.”
For arbitrary matrix dimensions \(m \times n\), I think it’s best to decompose indices in a mixed radix (but still positional) number system. For example, a \(63 \times 21\) matrix might have indices in radix \(3,7\ \ 3,7,3\). Given this number system, matrix transposition is
It’s a small generalisation to let the radices be \(a,b\ \ a,b,a\), for a matrix of dimension \(ab \times a\sp{2}b\). We can then perform most of a matrix transposition by swapping positions of identical weight: first a full mixedradix digit reversal (the weights are palindromic), followed by another mixedradix reversal on the first three positions.
This leaves the last chunk \(b\sb{2},a\sb{3}\), which should instead be \(a\sb{3},b\sb{2}\). That’s another rectangular matrix transpose, but smaller than the original one. It might be practical to execute that last step with a straightforward out-of-place transpose: a smaller transpose needs less scratch space and may fit in cache. We can also apply the same trick as for bit reversals and apply the transpose before everything else, by permuting rows rather than scalars. The simplest way to do that is to transpose a matrix of pointers before replicating the permutation on the actual data (glibc’s mergesort references Knuth vol. 3, exercise 5.2-10).
Finally, this also works for \(a\sp{2}b \times ab\): matrix transposition is its own inverse, so we only have to execute the inverse of each step, in reverse order.
Definitely mixed results, but at least we have some intuition on why general rectangular transpositions are hard to perform in place: they’re hard to express as sequences of swaps.
This post is the more theoretical prologue to a low-level look at qsort(3): I really wanted to make sure the nice implicit tree layout in the first section had the space it deserves.
I tried to make in-order implicit trees fit in this number system approach. I can’t see how. The problem is that in-order trees associate ranges (rather than indices) with nodes; for example, at what depth is index 1000? It depends on the size of the search tree. It might be the root (in a tree of 2000 vertices) or a leaf (in a tree of 1001 vertices).
I don’t really follow the compression scene, and only pay minimal attention to machine learning. Nevertheless, the “Compression is Learning” slogan feels more and more right to me. In this post, I’ll explore another relationship that I find surprising: one between compression and compilation.
Five years ago, I took Marc Feeley’s compiler class, and he let us choose our term project. I settled on generating traces from recursive algorithms (typical of cache-oblivious algorithms) and reordering them to get iterative code that was better suited to the first level of cache. I came up with a gnarly CL-accented research-quality prototype, but the result was surprisingly effective. Sadly, I never really managed to recover loops or function calls from the fully inlined trace, so even medium-size kernels would exhaust the instruction cache.
I believe I now see a practical and scalable solution, thanks to Artur Jez’s work on “A really simple approximation of smallest grammar.” His work might lead to a function-oriented analogue of trace compilation. Trace compilers are notoriously weak on recursion (recursive function calls don’t map well to loops), so it would be nifty to have an alternative that identifies functions rather than loops.
This makes me quip that “(Trace) Compilation is Compression.” We can see trace compilers as lazily unpacking traces from the source program, rewriting them a bit, and recompressing traces in an executable format. Admittedly, the analogy is a bit of a stretch for classical compilers: they are less aggressive in decompressing source code and directly translate from one compressed representation (a small source file may describe billions of execution paths) to another.
Anyway… this is extremely hypothetical and Jez’s work is fun independently of my weekend speculations.
Once we cast a program trace as a sequence of opcodes (perhaps without operands, to expose more redundancy), it’s obvious that reducing code size is “just” compression, and LZ-type algorithms quickly come to mind: they compress strings by referring to earlier substrings.
The heart of LZ77 is a loop that streams through the input sequence and finds the longest common substring earlier in the input. In practice, the search usually considers a fixed-size window; when I write LZ77 here, I instead refer to the theoretical approach with an unbounded search window. Repeated LCS searches on an unbounded window sound slow, but, in sophisticated implementations, the bottleneck is sorting the input string to generate a suffix array.
I don’t feel like being sophisticated and will implement a quadratic-time LCS search.
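A brute-force sketch (mine, in Python rather than the post’s code) of that greedy factorisation with an unbounded window; matches are allowed to overlap their own output, which is what makes repeat loops representable:

```python
def lz77_factor(s):
    """Greedy LZ77 factorisation with an unbounded window (quadratic-ish).

    Returns a mix of literal characters and (start, length)
    back-references; a reference may run past its own starting point,
    i.e., overlap the text it produces."""
    factors, i = [], 0
    while i < len(s):
        best_len, best_start = 0, 0
        for start in range(i):
            length = 0
            while i + length < len(s) and s[start + length] == s[i + length]:
                length += 1
            if length > best_len:
                best_len, best_start = length, start
        if best_len >= 2:  # a reference shorter than 2 isn't worth emitting
            factors.append((best_start, best_len))
            i += best_len
        else:
            factors.append(s[i])
            i += 1
    return factors
```

On “ababacbac”, this yields two literals, a self-overlapping copy of “aba”, a literal “c”, and a copy of “bac” – the factorisation discussed below.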
Some back-references clearly look like function calls.
It’s a bit surprising at first, but some also correspond to repeat loops.
This last patch is strange because it’s self-referential. However, if we decompress character by character, its meaning becomes obvious: we replicate the substring in [0, 1] to generate exactly 5 characters. This is basically a loop; we can handle partial copies with, e.g., rotation and entering the middle of the loop (like Duff’s device).
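That character-by-character decompression is short enough to spell out (a Python sketch, mine):

```python
def expand_patch(prefix, start, length):
    """Expand a (start, length) back-reference at the end of prefix,
    one character at a time, so the copy may overlap its own output."""
    out = list(prefix)
    for k in range(length):
        out.append(out[start + k])
    return "".join(out)
```

expand_patch("ab", 0, 5) replicates the substring in [0, 1] until exactly 5 new characters have been produced, yielding "abababa": a loop in disguise.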
Lempel and Ziv’s result tells us that we can look for references greedily in an unbounded window and obtain asymptotically optimal (proportional to the string’s entropy) compression. Again, there are algorithms to do that in linear time – via radix sort – but I’ll just brute-force my way in cubic time.
Here’s the problem with converting an LZ77 factorisation into calls and loops: there is no guarantee that patches nest sanely.
This factorisation encodes “ababacbac” by repeating “ab” two and a half times, inserting “c”, and then repeating “bac.” The issue is that we wish to convert “ababa” into a loop, but the last patch would then need to enter that loop in its last iteration. We could also have the opposite problem, with a patch that only covers a prefix of an earlier patch (a few iterations of a loop). The sketch below shows how we might even have to deal with both issues in the same factor.
Jez’s paper analyses a method to recover a CFL with a single member, the string we wish to compress. The CFL is described by a fully deterministic (it produces exactly one string) CFG in Chomsky normal form, i.e., a straight-line program.
This “really simple” approach takes a step back from LZ compression and instead starts with the easiest, greedy, way to generate a straight-line program from a string: just pair characters as you encounter them, left-to-right.
For example, on “abab,” this would pair “(ab)(ab)” and then ”((ab)(ab)),” and generate the following grammar:
G -> XX
X -> ab
We got lucky: the repetition synced up with the greedy pairing. That’s not always the case; for example, “abcab” becomes “(ab)(ca)b” – which exposes no redundancy – rather than “(ab)c(ab).”
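The greedy pairing pass is trivial; a Python sketch (mine) reproduces both behaviours:

```python
def greedy_pairs(s):
    """Pair characters left-to-right; a trailing odd character stays single."""
    return [tuple(s[i:i + 2]) if i + 1 < len(s) else s[i]
            for i in range(0, len(s), 2)]
```

greedy_pairs("abab") gives [('a', 'b'), ('a', 'b')] – the repetition syncs up – while greedy_pairs("abcab") gives [('a', 'b'), ('c', 'a'), 'b'], which exposes no redundancy.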
Jez addresses this issue by using LZ77 factorisation as an oracle to synchronise pairings.
First, we simplify the algorithm by assuming that there are no factors that begin exactly one character before their output. Such factors correspond to repetitions of that one character (e.g., “aaaa” = “a” x 4) and we can easily turn them into repetitions of a pair (e.g., “aa” x 2).
It’s then simply a matter of not pairing when it would prevent the next character from synchronising with its back-reference. For example, in “abcab,” we’d skip pairing “ca” so that “ab” could pair like the first occurrence did. We also don’t force a pair when the back-reference only includes the first letter, but not the second. Finally, we opportunistically match consecutive unpaired letters.
Each such pass creates a new, shorter, string of literals and production rules. We iteratively apply the guided pairing loop until we’re left with a single production rule.
That’s it. Jez’s paper is mostly concerned with showing that we only have to run LZ compression once and with proving suboptimality bounds. The key is that we can convert factors for the input into factors for the output, and that each pass shrinks the string by a multiplicative factor. I’ll simplify things further by recomputing a factorisation (in cubic time!) from scratch in each pass.
My implementation is exceedingly naïve and runs in cubic time in the worst case. With a lot more work, the bottleneck (LZ77 factorisation) runs at the speed of sort… but constructing a suffix array would only get in the way of this post. Let’s just see what pairing the code generates.
In short, “abcab” becomes “(ab)c(ab)”, “abababa” “(ab)(ab)(ab)a”, and “ababacbac” “(ab)(ab)(ac)()b(ac)”.
So far, we’ve only made pairing decisions. The next step is to translate them into production rules (i.e., functions), and to merge equivalent rules to actually save space.
On the examples above, we find:
We only have to pair production rules further until the sequence consists of a single rule (or literal).
The “call graph” below shows there’s a lot of sharing for such a short string.
I started by writing about compiling program traces, but so far I’ve only been working with strings of characters. Let’s pretend the following PostScript file was written by someone who’s never heard of functions or loops.
Phew, programming is hard! That’s a lot of copypaste to produce a simple pattern.
Let’s see what our (simplified) implementation of Jez’s algorithm can do.
35 functions that each contain two literals or function calls. Not bad (: It’s an actual SMOP to convert this grammar into a PostScript program.
This second PostScript program comprises 211 tokens, rather than the original’s 156, but 105 of those are “{ } def” noise to define functions. Arguably, the number of meaningful tokens was reduced from 156 to slightly more than 100. Crucially, the output remains the same.
Stack programs seem particularly well suited to this rewriting approach: explicit operands (e.g., registers) introduce trivial discrepancies. In a VM, I would consider “compressing” the opcode stream separately from the operands. Compiling only the former to native code would already reduce interpretative overhead, and extracting function calls from the instruction stream should avoid catastrophic size explosions.
There are several obvious deficiencies in the direct LZ77-guided pairing. Just breaking Chomsky normalisation to inline single-use functions would already help. It might also make sense to special-case repetitions as repeat n loops, instead of hiding that in \(\log n\) levels of recursion. In a real program, it would also be useful to inline the bottom-most rules so as not to drown in control flow. The thing is, destroying structure in the name of performance is well understood compiler tech; recovering it efficiently was the hard part.
It’s already Google Summer of Code season; like spring, it’s slightly earlier this year. For multiple reasons – including the accelerated schedule and my transition to having a Real Job – I did not manage to guide any of the students who sent in applications. After an initial ranking, we’re left with a fair number of decent proposals for a Summer of SBCL. They’re of comparable quality, and all leave me a bit uneasy. In this post, I’ll try to explain what I look for in GSoC proposals… or, really, in action plans in general.
I care about the quality of GSoC plans for two reasons. First, because it reassures me that the applicant understands the difficulties they’ll have to overcome to complete their project. Second, because, like TDD, a good plan makes it easier to feel when we’re making concrete progress. To me, these two factors are necessary conditions for a good success rate.
We’ll probably allow additional time to update the proposals we have already received, so hopefully this post can be immediately useful. I have a single vote, and other mentors no doubt have different criteria. I only hope that this post can give ideas to applicants looking to review their proposal.
This is a common objection to any form of planning for software projects. I don’t understand where it comes from.
Software development should be easier to plan than most endeavours: we have to deal with relatively few people, and, when we do, it’s rarely in an adversarial setting… and I’m willing to bet most of us can better forecast the behaviour of programs than that of people. There’s currently an electoral campaign in Québec, and I’m certain that, before the date was even announced, every single party had a precise plan for each day, at the riding, regional, and provincial levels. Politicians can do it; why can’t software developers?
Of course, there will be constant adjustments to the initial plan. The reason I think it’s important to spend time and thought on that plan isn’t to stick to it, but primarily because it forces us to systematically consider all sorts of potential issues, challenges, failure modes, etc. This exercise becomes useful when problems crop up: rather than making decisions in the heat of the moment, we can rely on our past (well rested and composed) self’s research, notes, and ideas.
I also find plans tend to become much more handwavy as we get farther in the future. That’s a normal tendency. I think we’re usually pretty good at shortterm planning; most people can cook a meal without sitting down to figure out how they’ll cause awesome chemical reactions to occur without burning their house down. It’s a different matter when it comes to monthlong camping trips or feeding a thousand people.
The difference is one of scale, and it’s exactly when we’re out of our comfort zone that mindful planning becomes useful. I expect an action plan to be as thorough in its last week as in its first. Most of it will be fiction, but, as I wrote above, the thinking involved in writing that fiction is what I’m after.
There are tons of ways to describe a plan. I like this one, but the main thing is what I have in mind when I read each section. For instance, last year’s proposal for register allocation via graph colouring (PDF) had:
It’s a different structure than what I suggest below, but both mostly present the same information. My wish is to feel confident that the student has done the work to have that information.
First, a description of the goal. SBCL offers prefab proposals, and it’s tempting to just copy-paste. However, I feel much more confident in the intrinsic motivation of a student who’s made the project their own. Many of our suggestions are skeletons for projects; it should be easy to come up with ways to extend and better specify them according to the student’s interests. Others are already fleshed out; those are a bit harder, but even small variations on the initial theme are encouraging. In all cases, if I don’t feel like the student owns the project, I probably won’t rank the proposal highly, and certainly won’t be inclined to mentor them. The problem statement is a simple way to demonstrate one’s understanding of – and interest for – the project.
Second, a review of the current environment. What currently exists, in SBCL or elsewhere? Can we exploit interesting research or code? Have there been previous efforts, and how/why did they not pan out? Perhaps there’s a relevant postmortem from a similar project that will help understand the problem domain. What potential difficulties do we know of? What don’t we know (known unknowns)? How bad do we estimate unknown unknowns to be, and why? What persons or resources could be helpful? To a certain extent, this is simply demonstrating that one’s done their due diligence. However, it also tells other people (e.g., potential mentors) how they can best work with the applicant. This section should be a good resource to reread when bad things happen.
Third, the suggested course of action at a high level. Not only what would happen in a perfect execution, but, ideally, ways to address (ahead of time or when/if they occur) some of the difficulties and unknowns listed in the previous section.
Fourth, success indicators, a translation of the initial goal description into a few checkboxes. Ideally, the criteria are falsifiable… much like an experiment, with the null hypothesis being project failure. I find that checking even a few of these boxes gives me a good sense of closure on a project.
Finally, the calendar, to show how one might execute the third section and satisfy the success indicators in the fourth section. The first few weeks are usually easy to envision, but proposals tend to become fuzzier with time. I’m more confident in proposals that only use one- or two-week periods, with actions and a few testable milestones for each period. Shorter periods make fuzziness more obvious, but, more importantly, they combine with the milestones to let us tell that we’re (or aren’t) getting traction.
In the unfortunate case that we’re somehow not making the expected progress, frequent milestones mean we can tell more quickly that something is off. We only have to make up for a one-week setback rather than for a month, i.e., one quarter of the GSoC program. The odds of outright project failure are lower, and we don’t risk feeling (as) down for wasting a month of work.
One thing that I find particularly scary in project calendars is “Magic Happens” steps. This seems more common with longer planning periods. Many proposals are quite detailed with respect to preliminary work and research, and to final integration and documentation. The issue is that most of the work is stashed in a 3-5 week interval during which the actual project becomes completed. It’s not clear what the student will do to get there (or even who will do the work ;). Planning each week individually makes such steps obvious.
I know that it’s hard to determine how we’ll achieve a goal when we’ve never done something similar before. But we can pretend. I want to feel that an applicant has a clear idea how to execute the task they propose, if only in an artificial perfect world. Otherwise, why should I believe they can do so in reality?
That’s it. I’m told that writers should show rather than tell. I feel the same about plans. I like to be shown how goals could be achieved in ideal and lessthanideal worlds rather than told when steps will be completed. I think the most important point is that I don’t appreciate detailed plans because we should (or can) stick to them. Rather, I believe that the exercise of coming up with such a plan is an efficient way to ensure we can fruitfully respond to what’ll actually happen when we execute it.
The regalloc proposal was one of the longest ones last year, but I felt confident that the student had a good idea of what challenges made the task nontrivial. In the end, we followed only a small part of the plan: it’s hard to come up with a suite of bad regalloc examples, visualisation turned out not to scale to real code, and there was too little time to work on live range splitting. However, every time there was a change, Alexandra had no problem determining how to adjust the schedule and figuring out the next step. On my end, I could easily follow the modified schedule and see that we were making progress.
This planning exercise is not easy, and it’s certainly not quick. (FWIW, I find it useful to get someone else, even without relevant experience, to look at the plan and ask questions.) Action plans have helped me countless times: when I needed to react to problems; when I felt like I was running in circles but could tell for sure that I was making progress; when a project forcibly came to an end but I still had a sense of closure… Hopefully, they can do the same for SBCL GSoC students.
The NEXT sequence (used to) encode an effective address with an index register but no base. The mistake doesn’t affect the meaning of the instruction, but forces a wasteful encoding. The differences in machine code are as follows.
Before (14 bytes):
Now (9 bytes):
I fixed the definition of NEXT, but not the disassembly snippets below; they still show the old machine code.
Earlier this week, I took another look at the F18. As usual with Chuck Moore’s work, it’s hard to tell the difference between insanity and mere brilliance ;) One thing that struck me is how small the stack is: 10 slots, with no fancy overflow/underflow trap. The rationale is that, if you need more slots, you’re doing it wrong, and that silent overflow is useful when you know what you’re doing. That certainly jibes with my experience on the HP-41C and with x87. It also reminds me of a post of djb’s decrying our misuse of x87’s rotating stack: his thesis was that, with careful scheduling, a “free” FXCH makes the stack equivalent – if not superior – to registers. The post ends with a (non-pipelined) loop that wastes no cycle on shuffling data, thanks to the x87’s implicit stack rotation.
That led me to wonder what implementation techniques become available for stack-based VMs that restrict their stack to, e.g., 8 slots. Obviously, it would be ideal to keep everything in registers. However, if we do that naïvely, push and pop become a lot more complicated; there’s a reason why Forth engines usually cache only the top 1-2 elements of the stack.
I decided to mimic the x87 and the F18 (EDIT: modulo the latter’s two TOS cache registers): pushing/popping doesn’t cause any data movement. Instead, like the drawing below shows, they decrement/increment a modular counter that points to the top of the stack (TOS). That would still be slow in software (most ISAs can’t index registers). The key is that the counter can’t take too many values: only 8 values if there are 8 slots in the stack. Stack VMs already duplicate primops for performance reasons (e.g., to help the BTB by spreading out execution of the same primitive between multiple addresses), so it seems reasonable to specialise primitives for all 8 values the stack counter can take.
In a regular direct-threaded VM, most primops would end with a code sequence that jumps to the next one (NEXT), something like
add rsi, 8   ; increment virtual IP before jumping
jmp [rsi - 8] ; jump to the address RSI previously pointed to
where rsi is the virtual instruction pointer, and VM instructions are simply pointers to the machine code for the relevant primitive.
I’ll make two changes to this sequence. I don’t like hardcoding addresses in bytecode, and 64 bits per virtual instruction is overly wasteful. Instead, I’ll encode offsets from the primop code block:
mov eax, [rsi]
add rsi, 4
add rax, rdi
jmp rax
where rdi is the base address for primops.
I also need to dispatch based on the new value of the implicit stack counter. I decided to make the dispatch as easy as possible by storing the variants of each primop at regular intervals (e.g., one page). I rounded that up to 64 * 67 = 4288 bytes to minimise aliasing accidents. NEXT becomes something like
becomes something like
mov eax, [rsi]
add rsi, 4
lea rax, [rax + rdi + variant_offset]
jmp rax
The trick is that variant_offset = 4288 * stack_counter, and the stack counter is (usually) known when the primitive is compiled. If the stack is left as is, so is the counter; pushing a value decrements the counter and popping one increments it.
That seems reasonable enough. Let’s see if we can make it work.
I want to explore a problem for which I’ll emit a lot of repetitive machine code. SLIME’s REPL and SBCL’s assembler are perfect for the task! (I hope it’s clear that I’m using unsupported internals; if it breaks, you keep the pieces.)
The basic design of the VM is:
r8–r15: stack slots (32 bits);
rsi: base address for machine code primitives;
rdi: virtual instruction pointer (points to the next instruction);
rax, rbx, rcx, rdx: scratch registers;
rsp: (virtual) return stack pointer.
The idea is that we’ll define functions to emit assembly code for each primitive; these functions will be implicitly parameterised on *stack-pointer* thanks to @. We can then call them as many times as needed to cover all values of *stack-pointer*. The only complication is that code sequences will differ in length, so we must insert padding to keep everything in sync. That’s what emit-code does:
This function is used by emit-all-code to emit the machine code for a bunch of primitives, while tracking the start offset for each primitive.
Now, the pièce de résistance:
Let’s add a few simple primitives.
The code for swap lives between bytes 0 and 32. Let’s take a look at the versions for *stack-pointer* = 0 and *stack-pointer* = 1.
dup is at 32–64, and sub at 128–152:
These are relatively tight. I certainly like how little data shuffling there is; the NEXT sequence is a bit hairy, but the indirect branch is likely its weakest (and least avoidable) point.
A VM without control flow isn’t even a toy. First, unconditional relative jumps. These can be encoded as [jmp] [offset], with the 32-bit offset relative to the end of offset. We just overwrite *virtual-ip* with the new address.
Call and return are at the heart of Forth-like engines. ret is easy: just pop from the control stack into *virtual-ip*.
Call is a bit more complicated. It’s like jmp, but pushes the address of the next instruction to the control stack:
Let’s look at the resulting machine code.
We finally almost have enough for interesting demos. The only important thing that’s still missing is calls from CL into the VM. I’ll assume that the caller takes care of saving any important registers, and that the primop (rsi) and virtual IP (rdi) registers are set up correctly. The stack will be filled on entry, by copying values from the buffer rax points to, and written back on exit.
The CL-side interlocutor of enter follows, as a VOP:
The only thing missing is to store the machine code for our primop in a range of memory that’s executable.
Let’s execute add sub (and finish with leave).
And, indeed, 10 - (3 + 2) = 5.
We should also test function calls.
Instead of executing add directly, this bytecode sequence calls to whatever is 8 bytes (2 dwords) after the call instruction; in our case, add ret.
Writing bytecode by hand is annoying. This tiny function takes care of the stupid stuff.
We can now write
We can now either write (basically) straight-line code or infinite loops. We need conditionals. Their implementation is much like jmp, with a tiny twist. Let’s start with jump if (top of stack is) non-zero and jump if zero.
It’s hard to write programs without immediate values. Earlier control flow primitives already encode immediate data in the virtual instruction stream. We’ll do the same for lit, inc, and dec:
Finally, we have enough for a decent-looking (if useless) loop. First, update the primop code page:
One million iterations of this stupid loop that only decrements a counter took 15M cycles. 15 cycles/iteration really isn’t that bad… especially considering that it executes an indirect jump after loading 1, after subtracting, and after comparing with 0.
We can do better by fusing lit sub into dec:
Decrementing a counter and jumping if non-zero is a common operation (old x86 even implemented that in hardware, with loop). Let’s add decrement and jump if non-zero (djn) to the VM:
That’s better… But I’m really not convinced by the conditional move. The branch will usually be predictable, so it makes sense to expose that to the hardware and duplicate the NEXT sequence.
The resulting code isn’t too large, and the two indirect jumps are 16 bytes apart.
This alternative implementation does work better on our stupid loop.
Let’s see how that compares to straight assembly code.
My slow MacBook Air gets 1 iteration/cycle on a loop that’s 100% control overhead. With djn2, a good implementation of a reasonable specialised operator, the loop is about 6x as slow as native code. A worse implementation of djn is still only 8x as slow as pure native code, and horribly unspecialised bytecode is 11–15x as slow as native code.
Specialising primops to a virtual stack pointer is feasible in practice, when the stack is restricted to a small size. It also seems to have a reasonable runtime overhead for threaded interpreters. I’m not actually interested in straight stack languages; however, I believe that a fixed stack VM makes a nice runtime IR, when coupled with a mechanism for local variables. We’ll see if I find time to translate a high-level language into superoperators for such a VM. Fused operators would reduce the importance of NEXT; in contrast, simpler function calls (because there’s less shuffling of items to stack them up in the right position) would remain as useful.
SBCL has definitely proven itself to be a good companion to explore the generation of domainspecific machine code. I don’t know of any other language implementation with that kind of support for interactive programming and machine code generation (and inspection). FWIW, I believe LuaJIT + dynasm will soon be comparable.
Steel Bank Common Lisp: because sometimes C abstracts away too much ;)
EDIT: I was asked for feedback on some of these books, so I added some inline.
Brassard and Bratley, Fundamentals of Algorithmics is my go-to reference book for basic algorithmic stuff. I love the presentation, the authors are careful with Landau notation, and they specify in what circumstances handwaving tricks apply. The quality of Brassard’s lectures is probably a factor as well: the book reminds me of one of the best courses in Université de Montréal’s undergraduate curriculum.
CLRS for baseline implementations of classic data structures and algorithms.
Knuth 1–4A for its attention to minutiae. Also very useful as a reference to point to: if something is so old or well known that I can’t think of a canonical reference, it’s probably in Knuth.
Sedgewick, Algorithms (first edition) has an interesting section on geometric algorithms. I also like the way many older data structure and algorithm books clearly take low-level coding considerations into account while discussing asymptotics. I find nearly everything by Tarjan very good in that regard; his papers are among the few that describe (and painstakingly prove) algorithmic improvements (e.g., the inverse-Ackermann union-find) only to reveal that a simpler implementation with a worse bound consistently performs better in practice.
Bertot and Castéran, Interactive Theorem Proving and Program Development is heavy and I’m still going through the book.
Pierce, Types and Programming Languages should be mandatory skimming before entering a debate on “static” and “dynamic” types.
Gries, The Science of Programming changed the way I write programs. It’s not that I go full formal on my code, but that I try and structure it so it’s easier to reason about. I was just reminded of Dijkstra’s Discipline of Programming. The first time I read both books, I did so concurrently. It was a long time ago, and I was less mature mathematically, but I definitely remember finding Gries clearer and more directly applicable, while Dijkstra did a better job of making me think hard about how to best express the meaning of code. Even now, I’d say they’re still complementary.
Jackson, Software Abstraction describes an interesting approach to interpolate between testing and formal methods. This book changed the way I write tests. It’s also based on a nice application of combinatorial optimisation ;)
Oram and Wilson, Beautiful Code feels uneven to me. Some chapters cover inspiring gems, but many left no mark. A friend suggests Bentley’s Programming Pearls or Neumann’s Computer-Related Risks.
Golub and Van Loan, Matrix Computations is obviously a good reference for the implementation of dense linear algebra. However, I like this work even more for its clear and precise discussion of performance and numerical precision matters. One can’t hide behind big-Oh when discussing linear algebra. Yet, the book is as enlightening and general as classic tomes on asymptotic analysis of algorithms and data structures. Better: that is true for both the implementation and the use of the algorithms they describe. If all software engineering were as well understood as numerical linear algebra, our discipline would be in a good place.
McGeoch, A Guide to Experimental Algorithmics was my bedside companion for the holiday season. On several occasions, I told friends that this is the book I wish I had read ten years ago and that I dreamt of writing for the past couple of years. It has good rules of thumb and suggestions for everything from designing proper experiments to deriving robust performance models.
Warren, Hacker’s Delight is really a fundamental book for me. It’s not that bit twiddling tricks are that important, but that I know I can count on Warren’s nice exposition if I need to direct someone to a reference. Now nicely complemented by Knuth 4A.
Ahuja, Magnanti and Orlin, Network Flows because it describes all the classics. In particular, if an optimisation problem is in P, it’s probably in there.
Bazaraa, Sherali and Shetty, Nonlinear Programming mostly for a couple convergence proofs and as a reference.
Chvátal, Linear Programming for its unique (? rare, definitely) and insightful presentation of linear optimisation and of classic duality theorems. I like the sections on Network flows and on Basis factorisation for similar reasons.
Hiriart-Urruty and Lemaréchal, Convex Analysis and Minimization Algorithms because decomposition methods depend on convex non-differentiable optimisation. I rarely care about general convex optimisation, but it’s an unavoidable building block for Lagrangian decomposition methods.
Bollobás, Modern Graph Theory mostly as a reference for classic results. I remember having a lot of fun with the section on matchings.
Lawler, Combinatorial Optimization is short but dense. I don’t grok matroids, and someone I respect gave me this book and told me it would all make sense once I’d gone through it.
Nemhauser and Wolsey, Integer and Combinatorial Optimization for the dark times when I need to directly think in terms of polytopes and facets, or when I try to generate fancy cuts.
Schrijver, Combinatorial Optimization mostly as a reference for theoretical results. It’s exhaustive, so I can use it like the TAoCP of optimisation and as a springboard to other work.
Wolsey, Integer programming when Nemhauser is hard to follow.
These books help me approach problems that don’t fit neatly in any box.
Bertsekas, Dynamic Programming and Optimal Control and Law, Simulation, Modeling and Analysis fill holes left by my focusing on deterministic singleperiod problems. I use Bertsekas and Law as refreshers when I remember that I’m forgetting something. I can then decide whether to go more deeply in that direction.
Polya, How to Solve It. For the really dark times when it feels like I’m asked to do the impossible. Perhaps it’s mostly useful because it distracts me from the immediate task for a couple minutes. Either way, I like to reread it when I get stuck.
Tufte, The Visual Display of Quantitative Information, along with Hadley Wickham’s ggplot2, taught me to think deliberately about presenting results. Our visual cortex is pretty awesome but bad visualisation wastes our brain on distractions.
That’s not too surprising. Given any change, it’s usually easy to design a situation that makes it shine. I find it just as important to ask the opposite question: what’s the worst slowdown I can get from switching from regular 4KB pages to huge 1GB pages? This way, I can make more robust decisions, by taking into account both a good (if not best) case and a worst case. By the end of this post, I’ll multiply runtime by 250% for the same operation, simply by switching from 4KB to 1GB pages… with a setup so contrived that it seems unlikely to occur by accident.
But first, a quick overview of the benefits and downsides of huge pages.
In short, manufacturers are adding huge pages because translating virtual addresses to physical ones is slow.
On x86-64, the mapping from virtual to physical pages is represented with a trie; each level dispatches on 9 bits (i.e., each node has 512 children), and leaves correspond to 4KB pages. There are 4 levels from the root node to the leaves, which covers the (currently) standard 48-bit virtual address space.
The address translation table is (mostly) stored in normal memory and is too large to fit in cache. Thus, translating a virtual address requires four reads, any of which can hit uncached memory.
This is where the translation lookaside buffer (TLB) comes in. On my E5-4617, each core has 64 entries for regular 4KB pages in its L1 dTLB, and 512 (shared) entries in its L2 TLB. I don’t know if the TLBs are exclusive or inclusive, but even if they’re exclusive, that’s only enough for 2.25MB worth of data. Assuming that the working set is completely contiguous (i.e., the best case), the TLB space for 4KB pages is less than the total cache/core (2.5MB L3/core + 256KB L2 + 32KB L1D).
2MB and 1GB “huge pages” address this imbalance: nine 2MB pages suffice to cover more address space than all the caches in a 6-core E5-4617. However, there are only 32 L1 dTLB entries for 2MB pages – and 4 entries for 1GB pages – on my E5.
In addition to covering more address space in the TLB, huge pages offer secondary benefits: there are fewer page table entries, and the trie is shallower. Fewer page table entries means that a larger fraction of memory can be used by data, rather than metadata, and that the page table walk is more likely to stay in cache. Moreover, larger pages are closer to the trie’s root: while the processor traverses 4 levels to get to a 4KB page, it only traverses 3 levels to reach a 2MB page and 2 levels for a 1GB page. These two effects compound to make TLB misses quicker to handle.
This idea that one must cover as much address space as possible with the TLB is most relevant in two settings: trivially, if the working set is completely covered by the (L1d)TLB, or, more interestingly, when the access patterns show a lot of locality. Examples of the latter case are BLAS routines: with appropriate blocking, they can usually access each page once or twice, but read almost every byte in a page before switching to the next.
The opposite, worst, case would be something like lookups in a large (too big for the TLB) hash table: we choose a virtual address pseudorandomly and painstakingly translate it to a physical address, only to read a handful of words from that page. In that situation, we want as many TLB entries as possible, regardless of the address space each one covers… and that’s where 4KB pages ought to shine. Taking into account both the L1 dTLB and the L2 TLB, each core has 576 TLB entries for 4KB (data) pages, versus 64 × 2MB and 4 × 1GB. Now, I don’t know if the TLBs are exclusive or not, so I’ll assume the worst case and work with 512 × 4KB entries.
The thing is, 512 TLB entries aren’t that many. If, by chance, our hash table lookups keep hitting the same 512 pages, a contrived microbenchmark will show that 4KB pages are a big win (but really, a software cache might be a better way to exploit the situation). It’s more likely that it’ll be hard to avoid TLB misses regardless of page size, and huge pages then become useful because each TLB miss is handled more quickly. Regardless, I’ll try to approximate this worstcase behaviour to see how bad things can get.
Ideally, I would want to read from 512 (or a bit fewer) locations 1GB apart, but I don’t have that much RAM. In the interest of realism, I decided to “only” allocate 24GB.
My first microbenchmark follows: I allocate 24GB, divide that space in 512 chunks, and read the first word of each chunk in a loop. At first, I didn’t even randomise the traversal order (so as to abuse LRU), but there seems to be some prefetching for 1GB pages.
The results: 1.13e10 cycles for 4KB pages, 1.60e10 for 2MB and 1.56e10 for 1GB. That’s only 40% more cycles… it’s bad, but not horrible. The reason is that the data vector spans only 24x1GB, so 1/6th of the random lookups will hit the 1GB TLB. Instead, let’s try and load from each of these 24 pages, in random order. 24 pages will easily fit in the L1DTLB for 4KB pages, but not in the 4 slots for 1GB pages.
The results are even worse (better)! 4.82e9 cycles for 4KB pages, versus 3.96e9 and 2.84e9 for 2MB and 1GB pages!
The problem is aliasing. The TLB on my E5 has limited wayness (4-way, I believe), so, by aligning everything to a 1GB boundary, the effective size of the 4KB page TLB is 4 entries (same for 2MB). In a way, this highlights the effect of page size when TLBs are useless (random accesses to dozens or hundreds of GBs): 2MB pages shave 18% off the runtime, and 1GB pages another 30%, for a total of 60% as much time to handle a 1GB TLB miss versus 4KB.
Let’s try again, with indices[i] = (x*stride + (x*4096)%(1ul<<30)) % (24ul << 30);
on line 19. I now find 1.14e9, 6.18e9 and 2.65e9 cycles. Much better!
For fun, I also tried to offset by 2MB increments, with indices[i] = (x*stride + (x<<21)%(1ul<<30)) % (24ul << 30);, and found 2.76e9, 1.30e9, and 2.85e9 cycles.
Finally, I tried size_t offset = 4096 + (1ul<<21); indices[i] = (x*stride + (x*offset)%(1ul<<30)) % (24ul << 30); so that neither 4KB nor 2MB pages would alias, and got 1.13e9, 1.08e9 and 2.65e9 cycles. That’s 234% as much time for 1GB pages as for 4KB.
We’re close: this setup is such that 1GB pages cause a lot of TLB misses, but neither 4KB nor 2MB pages do. However, perf stat shows there’s a lot of cache misses, and that probably reduces the difference between 4KB and 1GB pages.
Let’s try one last thing, with size_t offset = 4096 + (1ul<<21) + 64; (to avoid aliasing at the data cache level), and a smaller index vector that fits in cache.
We get 1.06e9, 9.94e8, and 2.62e9 cycles, i.e., 250% as much time with 1GB pages as with 4KB ones.
We can easily turn this around: we just have to loop over more than 4 4KB-aligned locations in a 4GB space. For example, with
With the above, I find 7.55e9 cycles for 4KB pages, 3.35e9 for 2MB and 1.09e9 for 1GB pages. Here, 4KB pages are almost 7x as slow as 1GB pages. If I instead let size_t offset = 4096 + 64; (to avoid aliasing in the 4KB TLB), I get 4.72e9 cycles for 4KB pages, so still 433% as much time.
We can also play the same trick over 32*2MB = 64MB. On my E5, I find 3.23e9 cycles for 4KB pages, versus 1.09e9 for 2MB and 1GB pages. Eliminating pagelevel aliasing only brings the 4KB case down to 3.02e9 cycles, and doesn’t affect the other two cases.
The following table summarises the runtimes of all the variations above with 2MB and 1GB pages (as a fraction of the number of cycles for 4KB pages).
Variation | 2MB/4KB | 1GB/4KB
512 locations across 24GB | 1.42 | 1.38
24 1GB-aligned locations | 0.82 | 0.59
… staggered by 4KB | 5.42 | 2.32
… staggered by 2MB | 0.47 | 1.03
… staggered by 4KB + 2MB | 0.96 | 2.34
… staggered by 4KB + 2MB + 64B, small index vector | 0.94 | 2.47
Locations in 4GB | 0.44 | 0.14
… without 4KB aliasing | 0.72 | 0.23
Locations in 64MB | 0.34 | 0.34
… without 4KB aliasing | 0.36 | 0.36
Overall, I think that I wouldn’t automatically switch to 2MB pages, but that 1GB pages are a solid choice for machines that basically run a single process at a time. When the data fits in 4GB, 1GB pages completely eliminate TLB misses. When the data is even larger, 2MB and 1GB pages make page table walks quicker (by 18% and 40%, respectively). It takes a very contrived situation – in which a program keeps hitting fewer than 512 4KB pages that are spread out across multiple GBs – for smaller pages to be preferable. The worst I managed was 250% as much time for 1GB pages vs 4KB; in the other direction, I achieved 693% as much time for 4KB pages versus 1GB, and 433% with a realistic situation (e.g., repeated lookups in a 4GB hash table). Plus, there’s another interesting benefit from larger pages that did not show up in this post: we get more control over aliasing in data caches.
With multiple processes in play, there are fragmentation issues, and things aren’t as clear-cut… especially given that 1GB pages must currently be allocated at boot time, on Linux.
I’m also still unsure how 1GB pages interact with NUMA. I’m particularly worried about interleaving: interleaving at a 1GB granularity seems unlikely to smooth out the ratio of local:remote accesses as much as doing it at a 4KB granularity.
One rainy day at Sophia-Antipolis, a researcher told me a story about an optimisation software company. They’d looked into adding explicit support for Dantzig-Wolfe decomposition, but market research showed there were a few hundred potential users at most: too few people are familiar enough with decomposition methods to implement new ones. The problem is, the strength of these methods is that they are guided by domain expertise; useful decompositions are custom-made for the target problem, and thus obviously novel. What this story tells me is that we, researchers in decomposition, are really bad communicators.
It’s not like these methods are new or ill understood. The theory on how to break up optimisation problems into smaller ones (for Benders and DantzigWolfe decompositions, at least) was fully developed in the fifties and sixties. It’s the same theory that backs stateoftheart solvers for planetscale planning problems today. Surprisingly, it seems that a single undergraduate or graduate course in optimisation methods does not suffice for expertise to trickle down to the wider computer science and management communities.
The divide seems particularly wide with the constraint programming (CP) community (despite the work of Hooker, Chvátal and Achterberg [that I know of]). Perhaps this explains why the VPRI’s efforts on cooperating constraint solvers are not informed by decomposition methods at all. Then again, it’s not just decomposition people who exchange too little with the CP community, but all of mathematical programming. I only single out decomposition because our tools are general enough to work with CP. We write and sometimes think in mixed integer linear programming (MILP) terms, but mostly for theoretical reasons; implementations may (useful ones usually do) include specialised solvers for combinatorial problems that only happen to be presented as MILP.
That’s why I spent three months in Nice: I explored how to cross-pollinate between CP and decomposition by developing a TSP solver (it’s a simple problem and Concorde is an impressive baseline). I have a lot to say about that, certainly more than can fit in this buffer. In this first post of many, I hope to communicate an intuition on why Dantzig-Wolfe and Benders decomposition work, to relate these methods to work from the SAT/CSP/CP community, and to show how to decompose finite domain problems.
I noted earlier that we work in MILP terms because it gives a sound theoretical basis for decomposition methods. That basis is duality, especially linear programming duality.
In optimisation, duality refers to the relationship between two problems: a primal problem (P) and its dual (D). I’ll assume that we’re interested in solving (P), a minimisation problem (minimise badness, e.g., costs); (D) is then a maximisation problem (maximise goodness, e.g., profits). These problems are such that the minimal (least) value for (P) is always higher than or equal to the maximal (greatest) value for (D).
So, even if (P) and (D) are so hard that we can only solve them heuristically, a pair of feasible solutions always brackets the optimal value for the primal problem (P). \(\tilde{x}\), the feasible solution for (P) might not be optimal, but it’s still a solution. In turn, \(\tilde{y}\), the dual feasible solution, gives us a lower bound on the optimal value: we don’t know the optimal value, but we know it can’t be lower than that of any solution for (D). We can then compare the value of \(\tilde{x}\) to this lower bound. If they’re close enough, we stop looking; otherwise we use both primal and dual information to refine our solutions and tighten the bracket (close the gap).
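In symbols (the notation here is mine, matching the discussion loosely): writing \(z\sb{P}\) and \(z\sb{D}\) for the primal and dual objectives, weak duality chains the four values for any primal-feasible \(\tilde{x}\) and dual-feasible \(\tilde{y}\): $$z\sb{D}(\tilde{y}) \leq z\sb{D}\sp{\star} \leq z\sb{P}\sp{\star} \leq z\sb{P}(\tilde{x}),$$ so the gap \(z\sb{P}(\tilde{x}) - z\sb{D}(\tilde{y})\) bounds how far \(\tilde{x}\) can be from optimal.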
The primal solution is useful: it’s a feasible plan. However, the dual solution is often where we gain insights into the problem. For example, if I wish to minimise my carbon footprint, it’s clear that driving less will help me. However, what is the impact of allowing myself more commuting time? And how does that compare with simply shortening the distance between my home and work? That’s what dual solutions tell us: they estimate the impact of (the right-hand side of) each constraint on the objective value.
Linear programs are a special case: they have a “strong” duality, in that we can easily write down an exact dual program for any primal linear program. The dual is exact because its maximal value matches the minimal value of the primal; there is no duality gap. This dual is another linear program that’s exactly as large as the primal, and the dual of the dual is… the initial primal program.
Decomposition exploits this relationship to extract information about simpler subproblems, and translate that into a refined approximation of the complete problem. It would be disappointing if that only worked on linear programs, and this is where Lagrangian duality comes in.
You may remember Lagrange multipliers from calculus courses. That’s exactly what I’m talking about. For reasons both practical and theoretical, I’ll assume that we have a (sub)problem that’s simpler once a few linear (affine) constraints are removed, i.e. something like $$\min\sb{x} z(x)$$ subject to $$Ax \leq b,$$ $$x\in X.$$
This form is general; the objective function \(z\) need only be convex and the feasible set \(X\) is arbitrary.
Lagrangian duality says that we can relax this problem into $$\min\sb{x\in X} z(x) + \lambda(Ax-b),$$ where \(\lambda\) is any vector of nonnegative multipliers for each row (constraint) in \(A\). I find such subproblems interesting because minimising over \(X\) can be easy with a specialised procedure, but hard as a linear program. For example, \(X\) could represent spanning trees for a graph. That’s how the one-tree relaxation for the TSP works. Once you remove an arbitrary “one-node” from a Hamiltonian cycle, the remaining Hamiltonian path is a special kind of spanning tree: the degree of all but two nodes is two, and the remaining two nodes are endpoints (would be connected to the one-node). The one-tree relaxation dualises (relaxes into the objective function) the constraint that the degree of each node be at most two, and solves the remaining minimum spanning tree problem for many values of \(\lambda\).
The onetree subproblem, and Lagrangian subproblems in general, is a relaxation because any optimal solution to this problem is a lower bound for the optimal value of the initial problem.
To show this, we first notice that the feasible set of the subproblem is a superset of that of the initial problem: we only removed a set of linear constraints. Moreover, for any feasible solution \(\bar{x}\) for the initial problem, $$z(\bar{x}) + \lambda(A\bar{x}-b) \leq z(\bar{x}):$$ \(A\bar{x} \leq b\), i.e., \(A\bar{x}-b \leq 0\), and \(\lambda\geq 0\). In particular, that is true if \(\bar{x}\) is an optimal solution to the initial problem. Obviously, an optimal solution to the relaxed subproblem must be at least as good as \(\bar{x}\), and the optimal value of the subproblem is a lower bound for that of the initial problem.
This relation between optimal solutions for the initial problem and for the relaxed subproblem is true given any \(\lambda\geq 0\). However, it’s often useless: bad multipliers (e.g., \(\lambda = 0\)) lead to trivially weak bounds. The Lagrangian dual problem with respect to \(Ax \leq b\) is to maximise this lower bound: $$\max\sb{\lambda\geq 0} \min\sb{x\in X} F(\lambda, x),$$ where $$F(\lambda, x) = z(x) + \lambda(Ax-b) = z(x) + \lambda Ax - \lambda b.$$
Again, we don’t have to maximise this dual exactly. As long as the minimisation subproblem is solved to optimality, we have a set of lower bounds; we only have to consider the highest lower bound. Maximising this Lagrangian dual, given an oracle that solves the linearly penalised subproblem, is an interesting problem in itself that I won’t address in this post. The important bit is that it’s doable, if only approximately, even with very little storage and computing power.
There’s some nifty theoretical results that tell us that, when \(z(x)\) is a linear function, maximising the Lagrangian dual function is equivalent to solving the linear programming dual of $$\min z(x)$$ subject to $$Ax \leq b,$$ $$x\in \textrm{conv}(X),$$ where \(\textrm{conv}(X)\) is the convex hull of \(X\).
This is a relaxation (the convex hull is superset of \(X\)), but a particularly tight one. For example, if \(X\) is a finite set, optimising a linear function over \(\textrm{conv}(X)\) is equivalent to doing so over \(X\) itself. Sadly, this doesn’t hold anymore once we add the linear constraint \(Ax \leq b\) after taking the convex hull, but still gives an idea of how strong the relaxation can be.
One key detail here is that the optimality of \(x\) depends on \(\lambda\), \(z\), \(X\) and \(A\), but not on \(b\): once we have a pair \((\lambda\sp{\star}, x\sp{\star})\) such that \(x\sp{\star}\) is minimal for \(F(\lambda\sp{\star},\cdot)\), their objective value is always a valid lower bound for the initial problem, regardless of the right-hand side \(b\). Thus, it makes sense to cache a bunch of multiplier-solution pairs: whenever \(b\) changes, we only need to reevaluate their new objective value to obtain lower bounds for the modified problem (and interesting pairs can serve to warm-start the search for better maximising multipliers).
Because I’m a heartless capitalist. (I find it interesting how economists sometimes seem overly eager to map results based on strong duality to the [not obviously convex] world.)
Seriously, because optimisation composes better than satisfaction. For example, imagine that we only had a routine to enumerate elements of \(X\). There’s not a lot we can do to optimise over \(X\) (unless you seriously consider a British Museum search…), or even just to determine if the intersection of \(X\) and \(Ax \leq b\) is empty.
I’ve hinted at how we can instead exploit an optimisation routine over \(X\) to optimise over an outer approximation of that intersection. It’s not exact, but it’s often close enough; when it isn’t, that approximation is still useful to guide branch-and-bound searches. A satisfaction problem is simply an optimisation problem with \(z(x) = 0\). Whenever we have a relaxed solution with a strictly positive objective value, we can prune a section of the search space; at some point we either find a feasible solution or prune away all the search space. Thus, even for pure satisfaction problems, the objective function is useful: it replaces a binary indicator (perhaps feasible, definitely infeasible) with a continuous one (higher values for less clearly feasible parts of the search space).
I said optimisation composes better because decomposition also works for the intersection of sets defined by oracles. For example, we can satisfy over the intersection of finite sets \(X\) and \(Y\) by relaxing the linear constraint in $$x = y,$$ $$x\in X, y\in Y.$$ The relaxed subproblem is then $$\min\sb{(x,y)\in X\times Y} \lambda(x - y),$$ which we can separate (linear objectives are separable) into two optimisation subproblems: $$\min\sb{x\in X} \lambda x,$$ and $$\min\sb{y\in Y} -\lambda y.$$ Even if the initial problem is one of satisfaction, Lagrangian decomposition considers each finite set in the intersection through independent minimisation: components communicate via their objective functions.
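Here’s a sketch of that decomposition as an infeasibility test, on invented two-element sets: for any multiplier vector \(\lambda\), the separated bound is a lower bound on \(\min \lambda(x - y)\), and any \(\lambda\) for which that bound is strictly positive certifies that the intersection is empty (if \(X \cap Y\) were non-empty, taking \(x = y\) would drive the objective to 0).

```python
# Lagrangian decomposition as an infeasibility certificate (toy sets).
#   g(lam) = min_{x in X} lam.x - max_{y in Y} lam.y
# lower-bounds min lam.(x - y) over X x Y; g(lam) > 0 proves X and Y
# don't intersect.

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def g(lam, X, Y):
    return min(dot(lam, x) for x in X) - max(dot(lam, y) for y in Y)

X = [(0, 0), (0, 1)]  # first coordinate fixed to 0
Y = [(1, 0), (1, 1)]  # first coordinate fixed to 1: no intersection

# Scan a small grid of multipliers for a certificate.
grid = [(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)]
certificate = next(lam for lam in grid if g(lam, X, Y) > 0)
print(certificate)  # every x scores strictly higher than every y

# With a non-empty intersection, no multiplier can certify anything.
Y2 = [(0, 1), (1, 1)]
assert all(g(lam, X, Y2) <= 0 for lam in grid)
```

Note the limitation: when the convex hulls of \(X\) and \(Y\) intersect (even though the sets themselves don’t), no linear multiplier can prove infeasibility; the dual bound is only as strong as the convexified relaxation.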
All this can also work with categorical data; we just have to do what no CP practitioner would ever do and map each choice to a binary (0/1) variable. When, in the constraint program, a variable \(x\) can take values \(\{a, b, c\}\), we express that in integer programs with a triplet of 0/1 variables \(x\sb{a}\), \(x\sb{b}\), \(x\sb{c}\), and a constraint \(\sum\sb{i\in \{a,b,c\}} x\sb{i} = 1\).
We can perform this mapping around each subproblem (e.g., optimisation over \(X\)). On entry, the coefficient for \(x\sb{a}\) in the linear objective corresponds to the impact on the objective function of letting \(x=a\). On exit, \(x=c\) becomes \(x\sb{c}=1\), \(x\sb{a}=0\) and \(x\sb{b}=0\), for example.
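This mapping is a small round trip in code (names and costs below are invented):

```python
# Minimal sketch of the 0/1 mapping for a categorical CP variable.
# Each value in the domain gets its own binary variable; exactly one
# of them is set.

def encode(value, domain):
    return [1 if v == value else 0 for v in domain]

def decode(binaries, domain):
    assert sum(binaries) == 1  # the sum-to-one constraint
    return domain[binaries.index(1)]

domain = ["a", "b", "c"]
assert encode("c", domain) == [0, 0, 1]
assert decode([0, 0, 1], domain) == "c"

# On entry to a subproblem, the objective coefficient for x_a is the
# cost of letting x = a; costs[v] below plays that role.
costs = {"a": 2.0, "b": 0.5, "c": 1.0}
coeffs = [costs[v] for v in domain]  # linear objective over binaries
best = domain[coeffs.index(min(coeffs))]
assert best == "b"
```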
Benders decomposition is the mathematical programming equivalent of clause learning in SAT solvers, or explanation-based search in CP.
Clause learning improves straight backtracking by extracting information from infeasible branches: what’s a small (minimal is NP-hard) set of assignments (e.g. \(x=\mathtt{false}\) and \(y=\mathtt{false}\)) that causes this infeasibility? The problem is then updated to avoid this partial assignment.
Explanations in CP are similar: when a global constraint declares infeasibility, it can also provide an explanation, a small set of assignments that must be avoided. This, combined with explanations from other constraints, can trigger further propagation.
Benders decomposition generalises this feasibility-oriented view to the optimisation of problems with a hierarchical structure. I think the parallel has been known for a while, just not popularised. John Hooker proposed logic-based Benders to try and strengthen the link with logic programming, but I don’t know who (if anyone) else works in that direction.
The idea behind Benders decomposition is to partition the decision variables in two sets, strategic variables \(y\), and tactical ones \(x\). The decomposition attempts to determine good values for \(y\) without spending too much time on \(x\), which are left for the subproblem. This method is usually presented for mixed integer programs with continuous tactical (\(x\)) variables, but I’ll instead present the generalised form, which seems well suited to constraint programming.
We start with the initial integrated formulation $$\min\sb{(x,y)\in S} cx + fy,$$ which we reformulate as $$\min\sb{(x,y)} cx + fy,$$ $$y\in Y,$$ $$x\in X(y).$$
For example, this could be a network design problem: we wish to determine where to build links so that all (known) communication demands are satisfied at least total cost.
Benders decomposition then eliminates variables \(x\) from the reformulation. The latter becomes the master problem $$\min\sb{y,z} fy + z$$ subject to $$Ay = d,$$ $$By \leq z,$$ $$y\in Y.$$
The two linear constraints are generated dynamically, by solving subproblems over \(X\). The first corresponds directly to learned clauses or failure explanations: whenever an assignment for \(y\) happens to be infeasible, we add a linear constraint to avoid it (and similar assignments) in the future. The second enables optimisation over \(S\).
The idea is to solve a Lagrangian relaxation of the (slave) subproblem for a given assignment \(y\sp{i}\): $$\min\sb{x,y} cx$$ subject to $$y = y\sp{i},$$ $$x\in X(y).$$
We will relax the first equality constraint to obtain a new feasible set that is a relaxation (outer approximation) of \(S\): it only considers interactions within \(X\) and between \(X\) and \(Y\), but not within \(Y\) (those are handled in the master problem).
In my network design example, the unrelaxed subproblem would be a lot of independent shortest path problems (over the edges that are open in \(y\sp{i}\)).
We dualise the first constraint and obtain the Lagrangian dual $$\max\sb{\lambda} \min\sb{y\in\mathbb{R}\sp{m}, x\in X(y)} cx + \lambda(y - y\sp{i})$$ or, equivalently, $$\max\sb{\lambda} -\lambda y\sp{i} + \min\sb{y\in\mathbb{R}\sp{m}, x\in X(y)} cx + \lambda y.$$
The network design subproblem still reduces to a lot of independent shortest path problems, but (with appropriate duplication of decision variables) forbidden edges may be taken, at additional cost per flow.
In this example, the previous step is gratuitously convoluted: we can solve the unrelaxed subproblem and shortest paths can be seen as linear programs, so there are other ways to recover optimal multipliers directly. However, this is useful for more complicated subproblems, particularly when the \(x = y\sp{i}\) constraint is difficult to handle. Note that the slave can be reformulated (e.g., with multiple clones of \(y\sp{i}\)) to simplify the Lagrangian subproblem.
I pointed out earlier that, given \(\bar{\lambda}\sp{i}\) and an optimal solution \((\bar{x}\sp{i},\bar{y}\sp{i})\) for that vector of multipliers, we always have a valid lower bound, regardless of the righthand side, i.e., regardless of the assignment \(y\sp{i}\).
We found a first (hopefully feasible) solution by solving a restricted subproblem in the previous step. The multipliers for our dualised constraint \(y = y\sp{i}\) explain the value associated with \(y\sp{i}\): what part of \(y\sp{i}\) makes the solution to the restricted subproblem so bad (or good).
That’s how we dynamically constrain \(z\). Each time we solve the (updated) master problem, we get a tentative assignment for the strategic variables, and a lower bound for the initial problem. We can then solve for \(x\in X(y\sp{i})\) to find a heuristic solution (or generate a feasibility cut to tell the master why the assignment is infeasible); if the heuristic solution is close enough to our current lower bound, we can stop the search. Otherwise, we solve the fixed Lagrangian dual and generate a fresh optimality cut $$-\bar{\lambda}\sp{i}y + c\sp{i} \leq z,$$ where \(c\sp{i} = c\bar{x}\sp{i} + \bar{\lambda}\sp{i}\bar{y}\sp{i}\): we update the master problem with information about why its last tentative assignment isn’t as good as it looked.
It can be harder to explain infeasibility than to solve the Lagrangian subproblem. In that case, it suffices to have an upper bound on the optimal value (e.g., 0 for pure feasibility problems): when the lower bound is greater than the upper bound, the problem is infeasible. Given that the master problem also tries to minimise its total cost, it will avoid solutions that are infeasible if only because their value is too high. The advantage of feasibility cuts is that, with some thinking, a single feasibility cut can forbid a lot more solutions than multiple optimality cuts.
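The master/subproblem loop can be sketched on a toy facility-location instance. I don’t derive real multiplier-based cuts here; instead, this sketch substitutes the simpler integer L-shaped (Laporte–Louveaux) cuts, which are valid for binary \(y\), and solves the master by brute force. All names and data are invented.

```python
# Toy Benders-style loop: open facilities (strategic y), assign
# clients (tactical subproblem). Each integer L-shaped cut equals the
# subproblem value at the y that generated it and drops to a trivial
# lower bound L everywhere else, so it is valid for all binary y.
from itertools import product

f = (3, 2)                 # facility opening costs
c = ((1, 4), (4, 1))       # c[i][j]: cost of serving client i from j
L = 2                      # trivial lower bound on the subproblem value

def subproblem(y):
    # Tactical step: each client picks the cheapest open facility.
    if not any(y):
        return None  # infeasible: no facility open
    return sum(min(c[i][j] for j in (0, 1) if y[j]) for i in (0, 1))

opt_cuts, feas_cuts = [], []

def master():
    # Brute-force master: min f.y + z subject to all cuts and z >= L.
    best = None
    for y in product((0, 1), repeat=2):
        if any(sum(y[j] for j in ones) < 1 for ones in feas_cuts):
            continue
        z = max([L] + [(v - L) * (sum(y[j] for j in s1)
                                  - sum(y[j] for j in s0)
                                  - len(s1) + 1) + L
                       for (s1, s0, v) in opt_cuts])
        cand = (f[0] * y[0] + f[1] * y[1] + z, y, z)
        if best is None or cand[0] < best[0]:
            best = cand
    return best

best_ub = float("inf")
for _ in range(10):
    lb, y, z = master()
    v = subproblem(y)
    if v is None:
        feas_cuts.append([0, 1])  # require at least one open facility
        continue
    best_ub = min(best_ub, f[0] * y[0] + f[1] * y[1] + v)
    if lb >= best_ub - 1e-9:
        break  # the master's bound matches the incumbent: optimal
    s1 = [j for j in (0, 1) if y[j]]
    s0 = [j for j in (0, 1) if not y[j]]
    opt_cuts.append((s1, s0, v))

print(best_ub)
```

The loop follows the structure described above: the master proposes an assignment and a lower bound, the subproblem evaluates it, and either a feasibility cut or an optimality cut flows back.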
The problem with Benders decomposition is that the master mixed integer program gains new constraints at each iteration. Eventually, its size becomes a liability.
There are a few folk workarounds. One is to not stop (and then restart) the branch-and-bound search as soon as it finds an optimal (for the current relaxed master problem) integer feasible assignment for \(y\): instead, add a new constraint to the formulation and resume the search (previously computed lower bounds remain valid when we add constraints). Another is to drop constraints that have been inactive for a long while (much like cache replacement strategies); they can always be added back if they would ever become active again.
Chvátal’s Resolution Search takes a different tack: it imposes a structure on the clauses it learns to make sure they can be represented compactly. The master problem is completely different and doesn’t depend on solving mixed integer programs anymore. It works well on problems where the difficulty lies in satisfying the constraints more than finding an optimal solution. A close friend of mine worked on extensions of the method, and it’s not yet clear to me that it’s widely applicable… but there’s hope.
Another issue is that the Benders lower bound is only exact if the slave subproblem is a linear program. In the general case, the master problem is still useful, as a lower bounding method and as a fancy heuristic to guide our search, but it may be necessary to also branch on \(x\) to close the gap.
I believe Lagrangian decomposition, a special case of Lagrangian relaxation, is a more promising approach to improve constraint programming with mathematical programming ideas: I think it works better with the split between constraint propagation and search. The decomposition scheme helps constraints communicate better than via only domain reductions: adjustments to the objective functions pass around partial information, e.g., “\(x\) is likely not equal to \(a\) in feasible/optimal solutions.” Caveat lector, after 7 years of hard work, I’m probably biased (Stockholm syndrome and all that ;).
I already presented the basic idea. We take an initial integrated problem $$\min\sb{x\in X\cap Y} cx$$ and reformulate it as $$\min\sb{(x,y)\in X\times Y} cx$$ subject to $$x = y.$$ Of course, this works for any finite number of sets in the intersection as well as for partial intersections, e.g., \(x\sb{0} = y\sb{0}\) and \(x\) and \(y\) otherwise independent.
Dualising the linear constraint leaves the independent subproblems $$\min\sb{x\in X} cx+\lambda x,$$ and $$\min\sb{y\in Y} -\lambda y.$$
The two subproblems communicate indirectly, through adjustments to \(\lambda\)… and updating Lagrange multipliers is a fascinating topic in itself that I’ll leave for a later post (: The basic idea is that, if \(x\sb{i} = 1\) and \(y\sb{i} = 0\), we increase the associated multiplier \(\lambda\sb{i}\), and decrease it in the opposite situation: the new multipliers make the current incoherence less attractive. For a first implementation, subgradient methods (e.g., the volume algorithm) may be interesting because they’re so simple. In my experience, however, bundle methods work better for nontrivial subproblems.
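The update rule above fits in a few lines. This is a bare-bones sketch (toy sets, unit steps, deterministic tie-breaking), not a serious subgradient or bundle method: when \(x\sb{i} = 1\) and \(y\sb{i} = 0\), the multiplier grows, making that disagreement less attractive on the next iteration.

```python
# Minimal subgradient loop for Lagrangian decomposition of x = y.
# The clones x and y are optimised independently; lambda is nudged
# along (x - y) until the clones agree.

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

X = [(1, 0), (0, 1)]  # e.g. "exactly one of two options"
Y = [(0, 1), (1, 1)]  # another constraint's feasible set
lam = [0.0, 0.0]

agreed = None
for _ in range(20):
    x = min(X, key=lambda v: dot(lam, v))   # min lam.x over X
    y = min(Y, key=lambda v: -dot(lam, v))  # min -lam.y over Y
    if x == y:
        agreed = x  # the clones agree: a point of X intersect Y
        break
    # Subgradient step with unit step size: push lam along (x - y).
    lam = [l + (p - q) for l, p, q in zip(lam, x, y)]

print(agreed)
```

On this instance the clones agree after a couple of multiplier updates; real implementations need careful step sizes (or bundle methods) to converge reliably.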
When we have an upper bound on the optimal value (e.g., 0 for pure satisfaction problems), each new vector of multipliers \(\lambda\) can also trigger more pruning: if forcing \(x = a\) means that the lower bound becomes higher than the known upper bound, we must have \(x \neq a\). The challenge becomes to find specialised ways to perform such pruning (based on reduced costs) efficiently, without evaluating each possibility.
Lagrangian decomposition also gives us a fractional solution to go with our lower bound. Thus, we may guide branching decisions with fractional variables, rather than only domains. Again, an optimisationoriented view augments the classic discrete indicators of CP (this variable may or may not take that value) with continuous ones (this assignment is unattractive, or that variable is split .5/.5 between these two values).
I believe decomposition methods are useful for constraint programming at two levels: to enable more communication between constraints, and then to guide branching choices and heuristics.
In classic constraint programming, constraints communicate by reducing the domain of decision variables. For example, given \(x \leq y\), any change to the upper bound for \(y\) (e.g., \(y\in [0, 5]\)) means that we can tighten the upper bound of \(x\): \(x \leq y \leq 5\). This can in turn trigger more domain reduction.
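That propagation step, as a tiny interval sketch (invented domains):

```python
# Bounds propagation for x <= y on interval domains (lo, hi).

def propagate_le(dx, dy):
    x_lo, x_hi = dx
    y_lo, y_hi = dy
    # x can't exceed y's upper bound; y can't go below x's lower bound.
    return (x_lo, min(x_hi, y_hi)), (max(y_lo, x_lo), y_hi)

dx, dy = (0, 10), (0, 5)
dx, dy = propagate_le(dx, dy)
assert dx == (0, 5)  # x's upper bound tightened: x <= y <= 5
assert dy == (0, 5)  # y unchanged here
```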
Some constraints also come in “weighted” versions, e.g., \(x\sb{i}\) form a Hamiltonian cycle of weight (cost) at most 5, but these constraints still only communicate through the domain of shared decision variables.
With Lagrangian decomposition, we generalise domain reduction through optimisation and reduced cost oracles. For each constraint, we ask, in isolation, for an optimal solution to its penalised subproblem, and for reduced-cost estimates of the impact of each possible assignment.
Adding the objective values computed by the optimisation oracle gives us a lower bound. If the lower bound is greater than some upper bound, the problem (search node) is infeasible. Otherwise, we can use reduced costs to prune possibilities.
Let’s say the difference between our current lower and upper bounds is \(\Delta\). We prune possibilities by determining whether setting \(x=a\) increases the lower bound by more than \(\Delta\). We can do this independently for each constraint, or we can go for additive bounding. Additive bounding computes a (lower) estimate for the impact of each assignment on the objective function, and sums these reduced costs for all constraints. The result is a set of valid lower estimates for all constraints simultaneously. We can then scan these reduced costs and prune away all possibilities that lead to an increase that’s greater than \(\Delta\).
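Additive bounding fits in a few lines. The reduced costs below are invented for illustration; each constraint reports a lower estimate of how much forcing an assignment raises its objective, and the estimates are summed across constraints before pruning.

```python
# Additive bounding sketch: prune assignments whose summed reduced
# costs exceed the gap delta between current lower and upper bounds.

delta = 3.0  # gap between current lower and upper bounds

# reduced costs per constraint, per candidate value of some variable x
rc_alldiff = {"a": 0.0, "b": 1.5, "c": 2.0}
rc_circuit = {"a": 0.5, "b": 2.0, "c": 0.0}

total = {v: rc_alldiff[v] + rc_circuit[v] for v in rc_alldiff}
domain = [v for v, r in total.items() if r <= delta]  # keep the rest

assert total == {"a": 0.5, "b": 3.5, "c": 2.0}
assert domain == ["a", "c"]  # x = b would push past the upper bound
```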
The optimisation oracle must be exact, but the reduced cost oracle can be trivially weak (in the extreme, reduced costs are always 0) without affecting correctness. Domain reduction propagators are simply a special case of reduced cost oracles: infeasible solutions have an arbitrarily bad objective value.
Some constraints also come with explanations: whenever they declare infeasibility, they can also generate a small assignment to avoid in the future. The mathematical programming equivalent is cut generation to eliminate fractional solutions. Some very well understood problems (constraints) come with specialised cuts that eliminate fractional solutions, which seems useful in hybrid solvers. For example, in my work on the TSP, I added the (redundant) constraint that Hamiltonian cycles must be 2-matchings. Whenever the Lagrangian dual problem was solved fairly accurately, I stopped to compute a fractional solution (for a convexification of the discrete feasible set) that approximately met the current lower bound. I then used that fractional solution to generate new constraints that forbid fractional 2-matchings, including the current fractional solution. These constraints increased the lower bound the next time I maximised the Lagrangian dual, and thus caused additional reduced cost pruning.
The same fractional solution I used to generate integrality cuts can also guide branching. There’s a lot of interesting literature on branching choices, in CP as well as in mathematical programming. Lagrangian decomposition, by computing fractional solutions and lower bounds, lets us choose from both worlds. Again, the lower bound is useful even for pure feasibility problems: they are simply problems for which we have a trivial upper bound of 0, but relaxations may be lower. Branching only on fractional decision variables also lets us cope better with weak domain propagators: the relaxation will naturally avoid assignments that are infeasible for individual constraints… we just can’t tell the difference between a choice that’s infeasible or simply uninteresting because of the objective function.
This is a lot more tentative than everything above, but I believe that we can generate decompositions automatically by studying (hyper)tree decompositions of our primal constraint graphs. Problems with a small (hyper)treewidth are amenable to quick dynamic programming, if we explore the search space in the order defined by the tree decomposition: a small treewidth means that each subtree depends on a few memoisation keys, and there are thus few possibilities to consider.
For example, I’m pretty sure that the fun polytime dynamic programming algorithms in the Squiggol book all correspond to problems with a bounded treewidth for the constraint graph.
In general though, problems don’t have a small treewidth. This is where mathematical decomposition methods come in. Benders decomposition could move a few key linking variables to the master problem; once they are taken out of the slave subproblem, the fixed Lagrangian subproblem has a small treewidth.
However, I already wrote that Lagrangian decomposition seems like a better fit to me. In that case, we find a few variables such that removing them from the primal graph leaves a small tree decomposition. Each subproblem gets its own clones of these variables, and Lagrangian decomposition manipulates the objective function to make the clones agree with one another.
With this approach, we could also automatically generate dynamic programming solvers for Lagrangian subproblems… and it might even be practical to do the same for reduced cost oracles.
Going to hypertreewidth improves our support for global constraints (e.g., alldiff) or constraints that otherwise affect many variables directly. Global constraints should probably get their own decomposition component, to easily exploit their pruning routines, but not necessarily large ad hoc ones. Hypertree decomposition takes the latter point into account: a large clique created by a single constraint is simpler to handle than one that corresponds to many smaller (e.g., pairwise) constraints.
I don’t know. I won’t have a lot of time to work on this stuff now that I have a real job. I’ve decided to dedicate what little time I have to a solver for Lagrangian master problems, because even the research community has a bad handle on that.
For the rest, I’m mostly unexcited about work that attempts to explore the search space as quickly as possible. I believe we should instead reify the constraint graph as much as possible to expose the structure of each instance to analyses and propagators, and thus reduce our reliance on search.
As a first step, I’d look into a simple system that only does table constraints (perhaps represented as streams of feasible tuples). A bucket heuristic for tree decomposition may suffice to get decent dynamic programming orders and efficiently optimise over the intersection of these table constraints. (Yes, this is highly related to query optimisation; constraint programming and relational joins have much in common.)
After that, I’d probably play with a few externally specified decomposition schemes, then add support for global constraints, and finally automatic decomposition.
All in all, this looks like more than a thesis’s worth of work…
Linear programming (LP) is awesome. I based my PhD work on solving LPs with millions of variables and constraints… instead of integer programs a couple orders of magnitude smaller. However, writing LP solvers is an art that should probably be left to experts, or to those willing to dedicate a couple years to becoming one.
That being said, if one must write their own, an interior point method (IPM) is probably the best approach. This post walks through the development of a primal affine scaling method; it doesn’t guarantee state-of-the-art practical or theoretical efficiency (it’s not even polynomial time), but is simple and easy to understand.
I think interior point methods suffer from an undeserved bad reputation: they only seem complicated compared to the nice combinatorial Simplex method. That would explain why I regularly (once or twice a year) see hackers implement a (classic!) simplex method, but rarely IPMs.
The problem is that simplex only works because world-class implementations are marvels of engineering: it’s almost like their performance is in spite of the algorithm, only thanks to coder-centuries of effort. Worse, textbooks usually present the classic simplex method (the dark force behind simplex tableaux), so that’s what unsuspecting hackers implement. Pretty much no one uses that method: it’s slow and doesn’t really exploit sparsity (the fact that each constraint tends to only involve a few variables). Avoiding the trap of the classic simplex is only the first step. Factorising the basis — and updating the factors — is essential for efficient revised simplex methods, and some clever tricks really help in practice (packages like LUSOL mostly take care of that); pivot selection (pricing) is fiddly but has a huge impact on the number of iterations — never mind degeneracy, which complicates everything; and all that must be balanced with minimising work per iteration (which explains the edge of the dual simplex: there’s no primal analogue to devex normalised dual steepest edge pricing). There’s also presolving and crash basis construction, which can simplify a problem so much that it’s solved without any simplex pivot. Despite all this sophistication, only recent theoretical advances protect against instances going exponential.
Sidenote: Vašek Chvátal’s Linear Programming takes a view that’s closer to the revised simplex algorithm and seems more enlightening to me. Also, it’s been a while since I read about pricing in the simplex, but the dual simplex definitely has a computational edge for obscure reasons; this dissertation might have interesting details.
There’s also the learning angle. My perspective is biased (I’ve been swimming in this for 5+ years), but I’d say that implementing IPMs teaches a lot more optimisation theory than implementing simplex methods. The latter are mostly engineering hacks, and relying on tableaux or bases to understand duality is a liability for large scale optimisation.
I hope to show that IPMs are the better choice, with respect to both the performance of naïve implementations and the insights gained by coding them, by deriving a decent implementation from an intuitive starting point.
I’ll solve linear problems of the form $$\min\sb{x} cx$$ subject to $$Ax = b$$ $$l \leq x \leq u.$$
I will assume that we already have a feasible point that is strictly inside the box \([l, u]\) and that A has full row rank. Full rank doesn’t always pan out in practice, but that’s a topic for another day.
If it weren’t for the box constraint on x, we would be minimising a smooth function subject to affine equalities. We could then easily find a feasible descent direction by projecting \(-c\) (the opposite of the gradient) in the nullspace of A, i.e., by solving the linear equality-constrained least squares problem $$\min\sb{d} \|(-c) - d\|\sb{2}$$ subject to $$Ad = 0,$$ e.g. with LAPACK’s xGGLSE.
The problem with this direction is that it doesn’t take into account the box constraints. Once some component of x is at its bound, we can’t move past that wall, so we should try to avoid getting too close to any bound.
Interior point methods add a logarithmic penalty on the distance between x and its bounds; (the gradient of) the objective function then reflects how close each component of x is to its lower or upper bound. As long as x isn’t directly on the border of the box, we’ll be able to make a nontrivial step forward without getting too close to that border.
There’s a similar interpretation for primal affine scaling methods, but I prefer the original explanation. These methods rescale the space in which we solve for d so that x is always far from its bounds; rescaling the direction back in the original space means that we’ll tend to move more quickly along coordinates that are far from their bounds, and less for those close to either bound. As long as we’re strictly inside the box, the transformation is meaningful and invertible.
It’s a primal method because we only manipulate primal variables; dual methods instead work with the dual variables associated with the linear constraints.
This sketch might be useful. x is close to the left-hand bound, and, after projection in the nullspace, d quickly hits the border. With scaling, d is more vertical, and this carries to the unscaled direction.
The descent direction may naturally guide us away from a closeby bound; scaling is then useless. However, each jump away from that bound makes the next one longer, and such situations resolve themselves after a couple iterations.
Formally, let $$s = \min\{x - l, u - x\} > 0$$ be the slack vector (computed elementwise) for the distance from x to its bounds. We’ll just scale d and c elementwise by S (S is the diagonal matrix with the elements of s on its diagonal). The scaled least squares system is $$\min\sb{d\sb{s}} \|(-Sc) - d\sb{s}\|$$ subject to $$ASd\sb{s} = 0.$$
Now that we have a descent direction \(d=Sd\sb{s}\), we only have to decide how far along to go in that direction. We determine the limit, r, with a ratio test. For each coordinate, look at \(d\sb{i}\): if it’s zero, this coordinate doesn’t affect the limit; if it’s negative, the limit must be at most \((l-x)\sb{i}/d\sb{i}\); if it’s positive, the limit is at most \((u-x)\sb{i}/d\sb{i}\). We then take the minimum of these ratios.
If the minimum doesn’t exist (i.e., \(r=\infty\)), the problem is unbounded.
Otherwise, we could go up to \(x+rd\) to improve the solution while maintaining feasibility. However, we want to stay strictly inside the feasible space: once we hit a bound, we’re stuck there. That’s why we instead take a slightly smaller step, e.g., \(0.9r\).
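The ratio test and the damped step translate directly to code; here’s a sketch with invented numbers:

```python
# Ratio test: largest step r such that l <= x + r*d <= u, followed by
# the damped step 0.9*r that keeps the iterate strictly interior.

def ratio_test(x, d, l, u):
    r = float("inf")
    for xi, di, li, ui in zip(x, d, l, u):
        if di < 0:
            r = min(r, (li - xi) / di)  # heading toward lower bound
        elif di > 0:
            r = min(r, (ui - xi) / di)  # heading toward upper bound
    return r

l, u = (0.0, 0.0), (1.0, 1.0)
x = (0.5, 0.2)
d = (1.0, -0.5)

r = ratio_test(x, d, l, u)
assert abs(r - 0.4) < 1e-12  # second coordinate hits its lower bound

step = 0.9 * r  # stay strictly inside the box
x2 = tuple(xi + step * di for xi, di in zip(x, d))
assert all(li < xi < ui for xi, li, ui in zip(x2, l, u))
```

If `r` comes back infinite, no coordinate ever hits a bound along `d` and the problem is unbounded, as described above.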
This drawing shows what’s going on: given d, r takes us exactly to the edge, so we take a slightly shorter step. The new solution is still strictly feasible, but has a better objective value than the previous one. In this case, we’re lucky and the new iterate \(x\sp{\prime}\) is more isotropic than the previous one; usually, we slide closer to the edge, only less so than without scaling.
We then solve the constrained least squares system again, with a new value for S.
I assumed that x is a strictly feasible (i.e., interior) solution: it satisfies the equality constraint \(Ax=b\) and is strictly inside its box \([l, u]\). That’s not always easy to achieve; in theory, finding a feasible solution is as hard as optimising a linear program.
I’ll now assume that x is strictly inside its box and repair it toward feasibility, again with a scaled least squares solution. This time, we’re looking for the least norm (the system is underdetermined) solution of $$Ad = b - Ax.$$
We rescale the norm to penalise movements in coordinates that are already close to their bound by instead solving $$ASd\sb{s} = b - Ax,$$ for example with xGELSD.
If we move in direction \(Sd\sb{s}\) with step \(r=1\), we’ll satisfy the equality exactly. Again, we must also take the box into account, and perform a ratio test to determine the step size; if \(0.9r > 1\), we instead set \(r=1/0.9\) and the result is a strictly feasible solution.
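For a single constraint row, the least-norm solve degenerates to scalar arithmetic, which makes the repair step easy to check by hand. A sketch with invented data:

```python
# Repair step for one equality constraint: the least-norm solution of
# (AS) d_s = b - Ax is the scaled row times residual over (AS)(AS)^t
# (a scalar here); unscaling gives d = S d_s.

A = (1.0, 2.0)          # one constraint row
b = 3.0
l, u = (0.0, 0.0), (2.0, 2.0)
x = (0.5, 0.5)          # strictly inside the box, but Ax != b

s = tuple(min(xi - li, ui - xi) for xi, li, ui in zip(x, l, u))
As = tuple(ai * si for ai, si in zip(A, s))      # the scaled row AS
residual = b - sum(ai * xi for ai, xi in zip(A, x))

scale = residual / sum(v * v for v in As)        # (AS)(AS)^t is scalar
d_s = tuple(v * scale for v in As)               # least-norm solution
d = tuple(si * v for si, v in zip(s, d_s))       # unscale: d = S d_s

x2 = tuple(xi + di for xi, di in zip(x, d))      # full step r = 1
assert abs(sum(ai * xi for ai, xi in zip(A, x2)) - b) < 1e-9
assert all(li < xi < ui for xi, li, ui in zip(x2, l, u))
```

Here the full step lands both on the constraint and strictly inside the box; when the ratio test bites first, we take the shorter step as described above.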
I uploaded some hacktastic CL to github. The initial affine scaling method corresponds to this commit. The outer loop looks at the residual \(Ax - b\) to determine whether to run a repair (feasibility) or optimisation iteration.
The code depends on a few libraries for the MPS parser and on matlisp, and isn’t even ASDF-loadable; I just saved and committed whatever I defined in the REPL.
This commit depends on a patch to matlisp’s src/lapack.lisp (and to packages.lisp to export lapack:dgglse).
MPS is a very old (punch-card-oriented) format to describe linear and mixed integer programs. The parser mostly works (the format isn’t very well specified), and standard-form.lisp converts everything to our standard form with equality constraints and a box around decision variables.
I will test the code on a couple LPs from netlib, a classic instance set (I think the tarball in CUTEr is OK).
AFIRO is a tiny LP (32 variables by 28 constraints, 88 nonzeros). Good news: the program works on this trivial instance and finds the optimum (the relative difference with the reference value is 7e-6). The first four “repair” iterations find a feasible solution, and 11 more iterations eventually get to the optimum.
I also tested on ADLITTLE, another tiny instance (97x57, 465 nz): 48 iterations, 0.37 seconds total.
AGG is more interesting, at 163x489 and 2541 nz; that one took 156 iterations and 80 seconds (130% CPU, thanks to Apple’s parallel BLAS). Finally, FIT1D is challenging, at 1026x25 and 14430 nz; that one took 100 seconds, and the final objective value was off by 0.5%.
All these instances are solved in a fraction of a second by stateoftheart solvers. The next steps will get us closer to a decent implementation.
The initial implementation resorted to DGGLSE for all least squares solves. That function is much too general.
I changed the repair least squares to a GELSY (GELSD isn’t nicely wrapped in matlisp yet).
We can do the same for the optimisation constrained least squares. Some doodling around with Lagrange multipliers shows that the solution of $$\min\sb{x} \|x + c\|\sb{2}\sp{2}$$ subject to $$Ax = 0$$ is $$x = A\sp{t}(AA\sp{t})\sp{-1}Ac - c.$$
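We can sanity-check that formula on a tiny invented instance with one constraint row, so that \(AA\sp{t}\) is a scalar: the result lies in the nullspace of A and is a descent direction for \(\min cx\).

```python
# Check the projection formula d = A^t (A A^t)^{-1} A c - c on a toy:
# one constraint row, so the inner solve is a scalar division.

A = (1.0, 1.0, 1.0)   # single constraint row
c = (3.0, -1.0, 2.0)

AAt = sum(a * a for a in A)                      # scalar here
Ac = sum(a * ci for a, ci in zip(A, c))
d = tuple(a * (Ac / AAt) - ci for a, ci in zip(A, c))

assert abs(sum(a * di for a, di in zip(A, d))) < 1e-9  # A d = 0
assert sum(ci * di for ci, di in zip(c, d)) < 0        # c.d < 0
```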
This is a series of matrix-vector multiplications and a single linear solve, which I first perform with GELSY: when we’re nearly optimal, the system is really badly conditioned and regular linear solves just crash.
Both linear algebra microoptimisations are in this commit.
ADLITTLE becomes about five times faster, and FIT1D now runs in 2.9 seconds (instead of 100), but AGG eventually fails to make progress: we still have problems with numerical stability.
Our numerical issues aren’t surprising: we work with a scaled matrix \(AS\), where the diagonal of \(S\) reflects how close \(x\) is to its bounds. As we get closer to an optimal solution, some of \(x\) will converge to the bounds, and some to a value in between: the limit is a basic solution. We then take this badly conditioned matrix and multiply it by its own transpose, which squares the condition number! It’s a wonder that we can get any useful bit from our linear solves.
There’s a way out: the normal matrix \(AA\sp{t}\) is not only symmetric, but also positive (semi)definite (the full rank assumption isn’t always satisfied in practice). This means we can go for a Cholesky \(LL\sp{t}\) (or \(LDL\sp{t}\)) factorisation.
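To see why this matters computationally, here’s a didactic pure-Python sketch of Cholesky factorisation applied to a tiny normal matrix (invented data); a real implementation should of course call LAPACK’s DPOTRF/DPOTRS, as the post does below.

```python
# Tiny dense Cholesky: factor the normal matrix N = A A^t = L L^t,
# then solve N z = rhs by forward and back substitution.
from math import sqrt

A = [[1.0, 1.0, 0.0],
     [0.0, 1.0, 1.0]]
m, n = len(A), len(A[0])

# normal matrix AA^t (2x2, symmetric positive definite here)
N = [[sum(A[i][k] * A[j][k] for k in range(n)) for j in range(m)]
     for i in range(m)]

def cholesky(M):
    k = len(M)
    L = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][p] * L[j][p] for p in range(j))
            L[i][j] = sqrt(s) if i == j else s / L[j][j]
    return L

def solve(L, rhs):
    k = len(L)
    y = [0.0] * k
    for i in range(k):  # forward substitution: L y = rhs
        y[i] = (rhs[i] - sum(L[i][j] * y[j] for j in range(i))) / L[i][i]
    z = [0.0] * k
    for i in reversed(range(k)):  # back substitution: L^t z = y
        z[i] = (y[i] - sum(L[j][i] * z[j]
                           for j in range(i + 1, k))) / L[i][i]
    return z

L_factor = cholesky(N)
z = solve(L_factor, [1.0, 2.0])
# check N z = rhs
for i in range(m):
    assert abs(sum(N[i][j] * z[j] for j in range(m))
               - [1.0, 2.0][i]) < 1e-9
```

The `sqrt` of a non-positive pivot is exactly where semidefinite matrices break this factorisation, which is why the post falls back to GELSY when DPOTRF fails.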
We’re lucky: it turns out that Cholesky factorisation reacts to our badly conditioned matrix in a way that is safe for Newton steps in IPMs (for reasons I totally don’t understand). It also works for us. I guess the intuition is that the bad conditioning stems from our scaling term. When we go back in the unscaled space, rounding errors have accumulated exactly in the components that are reduced into nothingness.
I used Cholesky factorisation with LAPACK’s DPOTRF and DPOTRS for the optimisation projection, and for repair iterations as well.
LAPACK’s factorisation may fail on semidefinite matrices; I call out to GELSY when this happens.
To my eyes, this yields the first decent commit.
The result terminates even closer to optimum, more quickly. AGG stops after 168 iterations, in 26 seconds, and everything else is slightly quicker than before.
I think the code is finally usable: the key was to reduce everything to PD (positive definite) solves and Cholesky factorisation. So far, we’ve been doing dense linear algebra with BLAS and LAPACK and benefitted from our platform’s tuned and parallel code. For small (up to a hundred or so constraints and variables) or dense instances, this is a good simple implementation.
We’ll now get a 10-100x speedup on practical instances. I already noted, en passant, that linear programs in the wild tend to be very sparse. This is certainly true of our four test instances: their nonzero density is around 1% or less. Larger instances tend to be even sparser.
There’s been a lot of work on sparse PD solvers. Direct sparse linear solvers are another complex area that I don’t think should be approached lightly: expensive supercomputers have been solving sparse PD linear systems for a couple of decades, and there’s some really crazy code around. I’ve read reports that, with appropriate blocking and vectorisation, a sparse matrix can be factored to disk at 80% of peak FLOP/s. If you’re into that, I’m told there are nice introductions, like Tim Davis’s book, which covers a didactic yet useful implementation of sparse Cholesky in 2000 LOC.
I decided to go with CHOLMOD from
SuiteSparse, a
mixed GPL/LGPL library. CHOLMOD implements state of the art methods:
when all its dependencies are available, it exposes parallelisation
and vectorisation opportunities to the BLAS (thanks to permutations
computed, e.g., by METIS) and exploits Intel’s TBB and Nvidia’s CUDA.
It’s also well designed for embedding in other programs; for example,
it won’t abort(3)
on error (I’m looking at you, LLVM), and includes
data checking/printing routines that help detect FFI issues. Its
interface even helps you find memory leaks! Overall, it’s a really
refreshing experience compared to other academic code, and I wish all
libraries were this well thought out.
I built only the barest version on my Mac and linked it with a
wrapper
to help
my bindings.
I had to build the dynamic library with -Wl,-all_load
on OS X to
preserve symbols that only appear in archives; -Wl,--whole-archive
should do it for GNU ld. (I also bound a lot of functions that I
don’t use: I was paying more attention to the TV at the time.)
CHOLMOD includes functions to simultaneously multiply a sparse matrix with
its transpose and factorise the result.
Function solve-dense
shows how I use it to solve a dense PD system. It’s a three-step
process:
This is really stupid: I’m not even dropping zero entries in the dense matrix. Yet, it suffices to speed up AGG to 14 seconds!
It’s clear that we need to preserve the constraint matrix A in a sparse format. CHOLMOD has a function to translate from the trivial triplet representation (one vector of row indices, another of column indices, and another of values) to the more widely used compressed representation. Instances in our standard form already represent the constraint matrix as a vector of triplets. We only have to copy from CL to a CHOLMOD triplet struct to exploit sparsity in Cholesky solves and in matrixvector multiplications.
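scipy exposes the same triplet-to-compressed conversion, which makes for a convenient illustration (made-up data; CHOLMOD’s equivalent is cholmod_triplet_to_sparse, and its compressed format is column-major):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Triplet representation: parallel vectors of row indices, column
# indices, and values, exactly our standard form's constraint matrix.
rows = np.array([0, 0, 1, 2, 2])
cols = np.array([0, 2, 1, 0, 2])
vals = np.array([4.0, 1.0, 3.0, 1.0, 5.0])

A = coo_matrix((vals, (rows, cols)), shape=(3, 3))
A_csc = A.tocsc()  # compressed sparse column: what the solver consumes

# Sparse matrix-vector products now only touch the five stored entries.
x = np.ones(3)
print(A_csc @ x)  # [5. 3. 6.]
```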
We now solve AGG in 0.83 seconds and FIT1D in 0.95. I think we can expect runtimes of one second or less for instances up to ~200x200. Better, we can finally hope to solve respectable LPs, like FIT2D (10500x26, 138018 nz, 2.3 s) or FIT1P (1677x628, 10894 nz, 14 s).
Our normal matrix \(ASS\sp{t}A\sp{t}\) always has the same pattern (nonzero locations): we only change the scaling diagonals in the middle. Sparse solvers separate the analysis and factorisation steps for exactly such situations. When we solve a lot of systems with the same pattern, it makes sense to spend a lot of time on a onetime analysis that we then reuse at each iteration: fancy analysis routines generate factors that take up less space and need fewer FLOPs to build.
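We can check the invariant-pattern claim directly with scipy (made-up data): with positive entries and a positive scaling diagonal there is no cancellation, so the product’s nonzero locations never move.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(1)
A = sp.random(6, 10, density=0.3, random_state=1, format="csr")
A.data[:] = rng.uniform(0.5, 2.0, size=A.nnz)  # positive entries

def pattern(scaling):
    """Nonzero locations of A * diag(scaling) * A^t."""
    N = (A @ sp.diags(scaling) @ A.T).tocsc()
    return set(zip(*N.nonzero()))

# Two different scaling diagonals, same nonzero locations: the symbolic
# analysis can be done once and the numeric factorisation repeated.
p1 = pattern(np.ones(10))
p2 = pattern(rng.uniform(1e-6, 1.0, size=10))
print(p1 == p2)  # True
```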
I do that here. Our less tiny instances are almost twice as fast. We solve FIT1P in 8 seconds now, and even FIT2P (13525x3001, 60784 nz) in 200 seconds.
I then made some micro-optimisations to reuse storage.
Finally, I added steps that push the current solution away from close-by bounds. These centering steps help subsequent iterations make longer jumps toward optimality.
The last two changes shave a couple more seconds off large instances and get closer to optimality on nastier ones.
I hope I managed to communicate the intuition behind primal affine scaling methods, and that the walkthrough helped map that intuition to a sparse implementation. I also realise that the code isn’t pretty: I wrote it during a short marathon and tried to only make incremental changes. Still, the algorithm should be more or less usable for small instances; more than a naïve simplex method, anyway.
That’s particularly true on multicore machines. Parallelising simplex methods has been an area of slow research for a couple of decades; the best work I’ve seen so far takes a huge hit by reverting to the classic algorithm and hopes that parallelisation can compensate for that initial 10-1000x slowdown. In contrast, even our sparse method is already parallel: CHOLMOD automatically vectorises, parallelises, and offloads work to the GPU.
I’ll try to code a sparse primal-dual affine scaling method from scratch soon. Primal-dual methods usually work better than pure primal (or dual) methods, and I find their theory interesting (if a bit complicated).
If you liked this post, you might be interested in Stephen Boyd’s Convex optimisation course. He’s offering it online this winter, starting January 21st.
]]>I think the underlying confusion is frequent; I’ve seen it in a couple discussions where people report conflicting experiences with Robin Hood hashing.
The scheme described in Pedro Celis’s dissertation (PDF) is a variant of double hashing. Hashing is based on simulating random allocation with a hash function; double hashing asks for a second hash function to simulate independent random probing sequences for each key. This is really useful to reduce the worst-case probe sequence length, and Robin Hood hashing further improves the distribution’s tail by reducing the variance: better locations are given to far-off entries after bumping out well-off ones, hence the name.
Twenty-five years later, I don’t think any variant of double hashing makes sense. If we allow multiple hash functions and reads to uncorrelated addresses (and writes that bump entries out), we might as well go for simpler strategies with awesome worst-case bounds, like 2-left or cuckoo hashing. Then again, I’m not convinced that these schemes are useful in software either: we can do a lot of probing in the time it takes to read a second random address [*]. In hardware, the trade-offs seem completely different; I wouldn’t be surprised if two 128 KB SRAM chips were cheaper than a single 256 KB chip.
[*] We could hope for memory-level parallelism… but, in my experience, TLB misses are where (uncached) hash lookups hurt, and those don’t seem to be resolved in parallel. In cache, linear probing eliminates bank conflicts and is really quick.
When I looked into Robin Hood hashing, I was interested in a degenerate variant in which probing sequences are linear. This variant also seems the one others usually mean when they say “Robin Hood hashing,” nowadays. The algorithms for inserts and lookups in Celis’s dissertation still work: probe locations and distances just happen to be computed trivially. However, the analysis doesn’t hold anymore. This is where later work like Viola’s Distributional analysis of Robin Hood linear probing hashing with buckets and the associated journal paper comes in. The bounds are worse than with (pseudo)random probing sequences, but each probe is a lot quicker.
Such weaker bounds also mean that Celis’s suggestion of using tombstones (markers for deleted entries) doesn’t perform as well. Linear probing clumps up, even with Robin Hood rebalancing. Tombstones can be cleared out with a full rehash after \(\Theta(n)\) operations, or deletion can instead be implemented by copying later elements backwards. I’m a fan of the latter option: when ties are broken deterministically (e.g., by comparing keys), the layout of the hash table is independent of the sequence of insertions and deletions executed to get there.
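For concreteness, here’s a minimal sketch of Robin Hood linear probing with backward-shift deletion (hypothetical toy code: fixed capacity, no resizing, Python’s built-in hash):

```python
class RobinHoodTable:
    """Robin Hood linear probing with backward-shift deletion (toy sketch)."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.slots = [None] * capacity  # each slot: None or (key, value)

    def _dist(self, key, slot):
        # Probe distance of `key` when stored at `slot`.
        return (slot - hash(key)) % self.capacity

    def insert(self, key, value):
        slot = hash(key) % self.capacity
        dist = 0
        while True:
            entry = self.slots[slot]
            if entry is None:
                self.slots[slot] = (key, value)
                return
            if entry[0] == key:  # update in place
                self.slots[slot] = (key, value)
                return
            if self._dist(entry[0], slot) < dist:
                # Robin Hood: give this slot to the "poorer" entry and
                # carry the displaced one forward.
                self.slots[slot] = (key, value)
                key, value = entry
                dist = self._dist(key, slot)
            slot = (slot + 1) % self.capacity
            dist += 1

    def get(self, key, default=None):
        slot = hash(key) % self.capacity
        dist = 0
        while True:
            entry = self.slots[slot]
            if entry is None or self._dist(entry[0], slot) < dist:
                return default  # the key can't be any further along
            if entry[0] == key:
                return entry[1]
            slot = (slot + 1) % self.capacity
            dist += 1

    def delete(self, key):
        slot = hash(key) % self.capacity
        dist = 0
        while True:  # find the key, with the same early cutoff as get
            entry = self.slots[slot]
            if entry is None or self._dist(entry[0], slot) < dist:
                return False
            if entry[0] == key:
                break
            slot = (slot + 1) % self.capacity
            dist += 1
        # Backward shift: slide later entries one slot back; no tombstone.
        nxt = (slot + 1) % self.capacity
        while (self.slots[nxt] is not None
               and self._dist(self.slots[nxt][0], nxt) > 0):
            self.slots[slot] = self.slots[nxt]
            slot, nxt = nxt, (nxt + 1) % self.capacity
        self.slots[slot] = None
        return True
```

The early cutoff in lookups relies on the Robin Hood invariant (entries along a probe chain never get poorer), and backward shifting restores exactly the layout an insertion-only history would have produced.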
When there’s a disagreement over the performance of Robin Hood hashing, it may help to specify the probing sequence. Older (pre-2000) references and experiments probably refer to the original twist on double hashing; newer ones may instead describe the linear probing variant. Robin Hood double hashing seems obsolete, and the linear probing variant isn’t what Celis described. It may be more appropriate to refer to the latter as “Robin Hood linear probing.”
]]>SBCL is known for its flow-sensitive type propagation pass… or, rather, for its effects. That is probably why it can be so shocking when SBCL fails to infer obvious invariants. I feel like I can now describe the root causes of these surprising weaknesses and propose a series of fixes. Coding that up would likely take months, but the result should be more precise type propagation with improved asymptotic efficiency, and even a tunable knob between precision and compile times.
The compiler in SBCL, Python, doesn’t work on an IR in SSA form, nor does it convert internally to CPS à la Rabbit/Orbit. However, a clever – I don’t recall seeing it described anywhere else – design decision achieves similar effects. Operations receive their arguments as linear vars (LVARs) and write their results to an LVAR. Each LVAR has exactly one destination (one reader node), and, on any execution path, can only be used (written to) by one node. LVARs may nevertheless have multiple uses: these correspond to the value of conditionals and of conditional non-local exits. Assignment and access to lexical variables are represented as SET and REF nodes that receive and produce values through LVARs. This means that nearly everything can work in terms of LVARs when manipulating code and reasoning about dataflow. The conversion of read-once lexical variables to LVARs further simplifies a lot of mostly functional code.
Python takes this split to the extreme: only the constraint (type) propagation pass really handles full-blown lexical variables. Everything else depends on the flow-sensitive type information summarised in CFG nodes and joined in LVARs. Want to check whether a given argument is always positive? Simply test for subtyping on the LVAR’s derived type to implicitly leverage flow-sensitive type information. Want to insert some computation (e.g. a type assertion or an explicit modulo for overflowing arithmetic) between a value and its destination? Insert a new node in the CFG and substitute around nothing but opaque LVARs. These operations are so natural in Python that I only recently realised the design decision that makes them so easy.
However, I believe that Python went too far in relegating lexical variables to only the constraint pass. That pass only handles lexical variables (LAMBDA-VARs) and flow-sensitive information about them, and only propagates three kinds of constraints (derived information): that a variable is (or is not) of a given type, that it is EQL to another variable or to a constant, and that it is < or > than values of a given type.
The importance of EQL constraints is subtle. These constraints represent the way that, e.g., information about a lexical variable is also true of the result of a lookup for that variable’s value (and vice versa), even if that information is learned after the variable is accessed. Similarly, when two lexical variables are EQL, information about one is true of the other. Finally, equality with a constant is always useful for constant propagation.
All three constraint types are centered on lexical variables. The reason is that no flow-sensitive information needs to be computed for LVARs: they are only used once. For example, if a type test is performed on an LVAR, that is the LVAR’s only appearance and there is no information to propagate… unless that information is also true of some lexical variable.
There’s some more clever engineering and the analysis is simplified by disregarding variables that may be assigned from closures, but that’s the gist of it. A vanilla dataflow analysis worklist loop propagates information around, and constraint sets for each basic block shrink until the least (in terms of types, greatest when measuring information) fixed point is reached.
The first weakness is caused by the aforementioned cleverness: knowledge about the state of the world as each basic block is entered and exited is represented with sets of constraints (bitsets indexed with consecutive indices assigned to constraints on demand). In the interest of speed, join and meet are implemented as (bit)set union/intersection. The problem with this approach is that it completely discards information about a lexical variable when it is not identically present in both sets. For example, when one predecessor says that V is a positive integer and another that V is an integer, their successor discards all information about the type of V.
In practice, this is a lesser problem than it first appears to be: LVARs merge information about their multiple uses (writes) without any loss. The kind of code that suffers is code that does things like:
(lambda (x)
(ecase x
(1 ...)
(2 ...))
[work on X, knowing that it is either 1 or 2])
The two branches of the ecase that eventually continue execution respectively derive that X is EQL to 1 and to 2. However, their shared successor combines that information by forgetting about both equalities rather than weakening/downgrading the two constraints into “X is of type (INTEGER 1 2).”
Another interesting case is
(lambda (x)
(let ((y nil)) ; IF expression translated to an effectful IR
(if x
(setf y 1)
(setf y 2))
y))
for which Python derives that the result is either 1, 2, or NIL (the
union of all the values ever assigned to Y). The equivalent code
(lambda (x) (if x 1 2))
compiles everything to LVARs and the result is
known to be either 1 or 2 even before constraint propagation.
This can be addressed with a redundant representation that adds, to each constraint set, a dictionary from lexical variables to the (downgradable) information known about them: their types and the types they’re known to be < or > than. When intersecting constraint sets, these dictionaries make it easy to weaken information and insert the corresponding constraints.
The second weakness is deeply embedded in the structure of the compiler: the constraint propagation pass only pushes around precomputed flow-insensitive information. That information is valid because the flow-insensitive pass (that computes things like the type of the values produced by each CFG node) works from an older (over) approximation of the flow-sensitive information. Once a fixed point is reached in the constraint propagation pass, the flow-insensitive pass is re-executed and the information refined. In the end, there is a back-and-forth between the simple flow-sensitive constraint propagation pass and the flow-insensitive “optimization” pass that leverages a huge knowledge base about Common Lisp operators and functions.
That exchange of information not only takes many (hefty) iterations to converge, but is also lossy: the constraint propagation pass is flow-sensitive, but manipulates types based on an earlier pass or on type declarations. In effect, constraint propagation computes the least fixed point of a function that is defined by an earlier coarse upper approximation… a process that ends up preserving a lot of the initial weakness. This strange choice has one interesting characteristic: every iteration of the outer communication loop between the flow-sensitive and flow-insensitive analyses produces valid (but overly conservative) information. Convergence is slow, but at least intermediate results can safely guide rewrites.
The obvious fix is to derive node types within the constraint propagation pass, even as the final result of the flow-sensitive pass is approximated from below. For example, upon assignment to a lexical variable, the current pass adds a new constraint: the type of that variable is that of the LVAR for the assignment’s value, and the LVAR’s type is computed based on the information from the previous constraint propagation pass. Instead, it should be based on the current (overly aggressive) flow-sensitive information. That would eventually converge to a smaller fixed point. This alternative design would also obviate the need for a questionable hack that improves precision for some lucky iteration variables.
A weaker form of this change would be to use preliminary (perhaps overly aggressive) information only to tentatively detect impossible conditional branches. This would enable Python to derive that the return value of
(lambda ()
(loop with x = 0
while (foo)
do (case x
(1 (setf x 2))
(2 (setf x 1)))
finally (return x)))
is always 0 (and never 1 or 2), or that the expression (if (eql x y) (eql x y) t)
is always true.
With suitable handling of calls to local functions, useful types might even be derived for recursive functions. The current type propagation pass derives insanely stupid types for recursive functions and for assignment to variables in loops (unless the iteration variable hack suffices). For example
(lambda ()
(labels ((foo (x)
(if (bar)
(foo (1+ x))
x)))
(foo 42)))
is derived to return an arbitrary NUMBER. Instead, the return type of local functions can be initialised to the empty type (in which case propagation must be stopped for the caller block) and extended as argument types grow and as recursion triggers more propagation.
The problem is that this more precise flowsensitive pass loses the finite chain condition: there is no guarantee that the least fixed point will always be reached in finite time. The current constraint propagation pass operates on a finite lattice: only variables and types that exist in the input IR1 graph can appear in constraints, so the fixpointed function is defined over the powerset of this finite set of constraints for each basic block. In contrast, the full CL type universe is most definitely not finite (type tests aren’t even decidable in general).
There is a simple solution to this problem: before extending the constraint universe, widen types to satisfy the finite chain condition, i.e., make sure that types eventually stop growing. For example, when taking a nontrivial union or intersection of types, the result could be quantised to a coarse set of types (singleton types, named class types, a few key numeric ranges, etc.). Appropriate widening can also accelerate finite but slow convergence; the resulting fixed point isn’t guaranteed to be minimal, but is always valid. In fact, widening is likely to be necessary for the first improvement (with redundant dictionaries) as well.
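As a toy illustration of widening (the landmark set below is made up), interval endpoints can be snapped to a finite menu of bounds, so that only finitely many widened intervals exist and any increasing chain of them must stabilise:

```python
NEG_INF, POS_INF = float("-inf"), float("inf")

# A finite, ordered set of "landmark" bounds: widened intervals only use
# these endpoints, which restores the finite chain condition.
LANDMARKS = [NEG_INF, -256, -1, 0, 1, 16, 256, 65536, POS_INF]

def widen(lo, hi):
    """Round the interval [lo, hi] outward to landmark endpoints."""
    wlo = max(b for b in LANDMARKS if b <= lo)
    whi = min(b for b in LANDMARKS if b >= hi)
    return (wlo, whi)

print(widen(3, 100))  # (1, 256)
print(widen(-5, 5))   # (-256, 16)
```

The result is always a (coarser) superset of the input interval, so the fixed point reached with widened types is valid, just not guaranteed minimal.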
Another issue in the constraint propagation pass is an instance of the phase ordering problem. Some source-to-source transforms help type propagation (e.g., turning complicated functions into simpler ones for which type propagation is feasible), and others hinder it (e.g., converting integer division by a constant to multiply and shift, or MAPCAR to an inline loop). Worse, there’s a dependency cycle, as transforms rely on type propagation to determine whether they are applicable. There’s been some academic work that exposes multiple equivalent code sequences to analyses, and it seems possible to do something similar in Python as well: source-to-source transforms generate functions, so we’d “only” have to analyse the function without splicing it in. However, that’s likely to be impractically slow (even termination is non-obvious)… perhaps it will be doable for select transforms that expand into pure expressions.
The only suggestion to accelerate constraint propagation so far is to widen types, and that mostly compensates for slowdowns (or infinite convergence) introduced by other changes. There’s a classic trick that ought to accelerate even the current pass: guide the order in which basic blocks are analysed by partitioning them into strongly connected components. Python currently exploits a simple ordering technique, a depth-first ordering that minimises backward edges in reducible control-flow graphs. Basic blocks could instead be partitioned into strongly connected components: contracting each strongly connected component into a single vertex leaves a directed acyclic graph. It then suffices to fixpoint on each strong component and traverse the condensed DAG so that each component is only visited after all its predecessors. Python already computes loop nests, so that’s an easy change; it’s likely even better to look at nested loops and recursively order analysis within each component. A few other passes perform some form of dataflow analysis and would benefit from that improved ordering… perhaps compile times can even be really improved (Amdahl says that’s unlikely until splicing in source-to-source transformations is a lot quicker).
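The ordering itself is classic graph work. Here is a generic Python sketch (Kosaraju’s algorithm on a made-up graph, nothing SBCL-specific) that returns strongly connected components so that each appears only after all of its predecessors; a dataflow pass would fixpoint within each component before moving on:

```python
def sccs_in_topo_order(graph):
    """Return the SCCs of `graph` (dict: node -> list of successors),
    ordered so each component follows all of its predecessors."""
    order, seen = [], set()

    def dfs(u):  # first pass: record finish order
        seen.add(u)
        for v in graph[u]:
            if v not in seen:
                dfs(v)
        order.append(u)

    for u in graph:
        if u not in seen:
            dfs(u)

    rev = {u: [] for u in graph}  # second pass runs on the reverse graph
    for u in graph:
        for v in graph[u]:
            rev[v].append(u)

    comps, assigned = [], set()
    for u in reversed(order):
        if u in assigned:
            continue
        comp, stack = [], [u]
        assigned.add(u)
        while stack:
            x = stack.pop()
            comp.append(x)
            for y in rev[x]:
                if y not in assigned:
                    assigned.add(y)
                    stack.append(y)
        comps.append(comp)
    return comps

# Two loops in sequence: the {a, b} component must come before {c, d}.
g = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": ["c"]}
print([set(c) for c in sccs_in_topo_order(g)])
```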
Recap:
1. add redundant per-variable dictionaries to constraint sets, so intersections can weaken information instead of dropping it;
2. derive node and LVAR types within the constraint propagation pass itself;
3. a weaker form of 2: use tentative flow-sensitive information only to prune impossible conditional branches;
4. propagate usefully through local and recursive functions;
5. attack the phase ordering problem by analysing transform expansions without splicing them in;
6. order the analysis of basic blocks by strongly connected components.
In terms of implementation difficulty, these changes probably go 1, 6, 3, 4, 2, 5, from easiest to hardest. This also happens to be the order in which I’d try to implement them.
]]>So, why is the continuation monad the mother of all monads? The short answer is that, by enabling transparent inversion of control, it eliminates the need to sprinkle hooks for monad-specific code everywhere; normal (as much as anything involving delimited continuations can be “normal”) evaluation rules will be subverted as needed. Here’s how.
First, some boilerplate (boilerfoil may be the more appropriate term).
#lang racket
(require racket/control) ; for shift/reset
(define (run thunk return . arguments)
(reset (return (apply thunk arguments))))
This definition is all that is needed to execute arbitrary Racket code
in a monad: the only thing to specify is how a ground value should be
moved in the monad. bind
will be defined implicitly, through the
code for monad-specific behaviour. The body of run
defines a
context to determine where to stop capturing the continuation, and
executes the form (return (apply thunk arguments))
in that context:
the thunk is called with any arguments provided, and the result is
returned into the monad.
For the sake of generalisation, I’ll start with a trivial example: the Maybe monad. First, I’ll quickly define structure types. In practice, a distinguished “nothing” value would suffice, but this way parallels Haskell more closely.
(struct Maybe ())
(struct Nothing Maybe ())
(struct Some Maybe (value))
The constructor Some
also doubles as return
. In order to abort
out of a computation, nothing
must unwind the continuation and return a Nothing
object.
(define (nothing)
(shift k (Nothing)))
> (run (lambda () 42) Some)
#<Some>
> (Some-value (run (lambda () 42) Some))
42
The function run
obviously works as it should in these trivial
examples. It’s also not surprising that nothing
works, because it’s
the obvious implementation of unwinding with delimited continuations.
> (run (lambda () 42 (nothing) #t) Some)
#<Nothing>
In the List monad, return
is just list
. run
can be called with a
thunk and list
as the second argument, and indeed, the result is a
normal computation that returns its value as a singleton.
> (run (lambda () (+ 4 5)) list)
'(9)
The useful thing to do with the List monad is to specify multiple return values, and have the computation fork (lazily in Haskell, eagerly here, because I’m working with strict lists) on each choice. That’s a oneliner:
(define (inject-values . values)
  (shift k (append-map k values)))
This function captures the continuation up to reset
, unwinds the
current continuation up to that point, and binds the captured
delimited continuation to k
. Then, it passes each possible value to
the continuation, and appends the results together.
Here’s a first example:
> (run (lambda () (+ (inject-values 1 2) (inject-values 3 4))) list)
'(4 5 5 6)
That is, reassuringly, the four possible sums: 1 + 3 = 4
, 1 + 4 = 5
,
2 + 3 = 5
, and 2 + 4 = 6
. The magic is that all Scheme code,
by virtue of supporting the capture (and unwind) of delimited
continuation, is now “in” the List monad. It is certainly the case
for that uninstrumented thunk. The predefined function map
provides a more convincing example.
> (run (lambda ()
(map cons
(inject-values '(1 2) '(4 5))
(inject-values '(3 4) '(5 6))))
list)
'(((1 . 3) (2 . 4))
((1 . 5) (2 . 6))
((4 . 3) (5 . 4))
((4 . 5) (5 . 6)))
I’ve taken the liberty of line-wrapping the return value, and clearly,
(map cons ...)
has been called with all four pairs of lists… but
all the special monadic operations happen outside map
. Moving
inject-values
inside the mapped function is much stronger evidence
that arbitrary uninstrumented Scheme code is implicitly lifted into the
monad: map
is a predefined (precompiled even) library function.
> (run (lambda ()
       (map (lambda (x y) ((inject-values + -) x y))
            '(1 2 3)
            '(4 5 6)))
     list)
'((5 7 9)     ; + + +
  (5 7 -3)    ; + + -
  (5 -3 9)    ; + - +
  (5 -3 -3)   ; + - -
  (-3 7 9)    ; - + +
  (-3 7 -3)   ; - + -
  (-3 -3 9)   ; - - +
  (-3 -3 -3)) ; - - -
At each call to the mapped function, the computation explores both the branch in which the arguments are added and the one in which they are subtracted. The result, for 3 pairs of triplets, is a list of 8 triplets.
Neither of these implementations is surprising or new; I believe
they’re standard undergraduate exercises in a few universities. The
insight in
Filinski’s work
is that both nothing
and inject-values
share the same structure
and can be defined in terms of the monad they help implement. Because
I dislike scrolling, their definitions are copied here.
(define (nothing) ; Maybe
(shift k (Nothing)))
(define (inject-values . values) ; List
  (shift k (append-map k values)))
Instead of returning the (monadic) value (Nothing)
or values
(a
list), both capture and unwind the delimited continuation, bind it to
k
, and then do something more to the value. Some squinting reveals
that that something more is calling bind
with the continuation k
as the next step. In the Maybe monad, Nothing >>= k
evaluates to
Nothing
. In the List monad, values >>= k
becomes foldr ((++)
. k) [] values
, which is basically (append-map k values)
. The
general form for any monad is then to implement any operator that
doesn’t just return
a value as
(define (operator ...)
(shift k (bind [special stuff] k)))
As code, this gives
(define (make-operator bind operator)
  (lambda arguments
    (shift k (bind (apply operator arguments) k))))
after adding support for variadic operators. For example, choosing between multiple items in a list is
(define (list-bind x k) (append-map k x))
(define inject-values2 (make-operator list-bind list))
Some sketching on paper will show that the transformation is generic
and correct, with a proof that mostly goes through repeated
applications of the monad laws. Capturing the continuation moves the
whole stack of functions that will eventually receive results into a
single function (the continuation), and that function can then be
passed to bind
. The monad laws guarantee that this associative
reordering of nested bind
s preserves the meaning of the program.
We can also mimic do
notation more closely and implement join
, an
“anti-return”. Given that operator, not returning a value can instead
be implemented as joining it after the fact. One definition is
(make-operator bind identity)
, but I feel it’s simpler to just do it
longhand.
(define (make-join bind)
  (lambda (value)
    (shift k (bind value k))))
(define join-list (make-join list-bind))
> (run (lambda () (+ 1 (join-list '(1 2 3)))) list)
'(2 3 4)
Of course all this also works when operator
is equivalent to
return
; it’s just pointless. The shift
/return
/bind
dance is
then a convoluted way to demand regular Scheme evaluation rules.
And that is how the continuation monad is universal. When code is in the continuation monad (or in a language with delimited continuations), there is a mechanical way to have that code execute in almost any other monad. There are technical restrictions on the monad for the transformation to work, but I think any monad that can be implemented in (pure) Haskell qualifies.
I feel like sigfpe’s presentation was hobbled by the fact that he used the Continuation monad in Haskell, making it harder to see that the Continuation monad of the implementation is completely independent of the emulated one. Really, the idea is one of these nice insights that formalise and generalise old hacks, and seem obvious in retrospect.
The post’s title refers to the fact that delimited continuations can
themselves be implemented in terms of call-with-current-continuation
(and a single mutable cell). There are many ways to interpret the
corollary that call/cc
suffices to transpose code into arbitrary
monads. It certainly seems like a victory for the minimalist crowd.
On the other hand, I believe that the power of programming languages
and paradigms lies as much in what they enable as it does in what they
forbid (or make inconvenient, at least). From that point of view,
it’s not obvious whether the universality of call/cc
makes a strong
case for or against the feature.
This result also provides a tentative explanation for the low traction of monads among Lispers: perhaps many would rather directly hack the features in, with (ersatz) continuations.
]]>
The canonical bitap algorithm matches literal strings, e.g. “abad”. Like other bitap algorithms, it exploits the fact that, given a bitset representation of states, it’s easy to implement transfer functions of the form “if state \( i \) was previously active and we just consumed character \( c \), state \( i+1 \) is now active”. It suffices to shift the bitset representation of the set of active states by 1, and to mask out transitions that are forbidden by the character that was just consumed.
For example, when matching the literal string “abad”, a state is associated with each position in the pattern. 0 is the initial state, 1 is the state when we’ve matched the first ‘a’, 2 after we’ve also matched ‘b’, 3 after the second ‘a’, and 4 after ‘d’, the final character, has been matched. Transposing this information gives us the information we really need: ‘a’ allows a transition from 0 to 1 and from 2 to 3, ‘b’ a transition from 1 to 2, and ‘d’ from 3 to 4.
A trivial state machine is probably clearer.
The initial state is in grey, the accept state is a double circle, and transitions are labelled with the character they accept.
Finding the substring “abad” in a long string thus reduces to a bunch of bitwise operations. At each step, we OR bit 0 into the set of states: we’re always looking for the beginning of a new match. Then, we consume a character. If it’s ‘a’, the mask is 0b01010; if it’s ‘b’, the mask is 0b00100; if it’s ‘d’, the mask is 0b10000; otherwise, the mask is 0. Then, we shift the set of states left by 1 bit, and AND with the mask. The effect is that bit 1 is set iff the mask has bit 1 set (and that only happens if the character is ‘a’) and bit 0 was previously set (… which is always the case), bit 2 is set iff the mask has bit 2 set (i.e. we consumed a ‘b’) and bit 1 was previously set, etc.
We declare victory whenever state 4 is reached, after a simple bit test.
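The whole recogniser fits in a dozen lines. A Python sketch of the literal-matching bitap just described (bitap_find is a made-up name):

```python
def bitap_find(pattern, text):
    """Return the position just past the first occurrence of `pattern`
    in `text`, or -1 if there is no match."""
    m = len(pattern)
    # mask[c]: bit i+1 is set iff consuming c may move state i to i+1.
    mask = {}
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << (i + 1))
    accept = 1 << m
    states = 0
    for pos, c in enumerate(text):
        states |= 1                           # always ready to start a match
        states = (states << 1) & mask.get(c, 0)
        if states & accept:
            return pos + 1                    # the match ends here
    return -1

print(bitap_find("abad", "xabady"))  # 5: the match occupies indices 1-4
```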
It’s pretty clear that this idea can be extended to wildcards or even character classes: multiple characters can have the same bit set to 1 in their mask. Large alphabets are also straightforward to handle: the mapping from character to mask can be any associative dictionary, e.g. a perfect hash table (wildcards and inverted character classes then work by modifying the default mask value). This works not only with strings, but with any sequence of objects, as long as we can easily map objects to attributes, and attributes to masks. Some thinking shows it’s even possible to handle the repetition operator “+”: it’s simply a conditional transition from state \( i \) to state \( i \).
What I find amazing is that the technique extends naturally to arbitrary regular expressions (or nondeterministic finite automata).
Simulating an NFA by advancing a set of states in lockstep is an old technique. Russ Cox has written a nice review of classic techniques to parse or recognize regular languages. The NFA simulation was first described by Thompson in 1968, but it’s regularly being rediscovered as taking the Brzozowski derivative of regular expressions ;)
Fast Text Searching With Errors by Sun Wu and Udi Manber is a treasure trove of clever ideas to compile pattern matching, a symbolic task, into simple bitwise operations. For regular expressions, the key insight is that the set of states can be represented with a bitset such that the transition table for the NFA’s states can be factored into a simple data-dependent part followed by \( \epsilon \)-transitions that are the same, regardless of the character consumed. Even better: the NFA’s states can be numbered so that each character transition is from state \( i \) to \( i+1 \), and such a numbering falls naturally from the obvious way to translate regular expression ASTs into NFAs.
For the initial example “ab(cd|e)*fg”, the AST looks like a node to match ‘a’, succeeded by a node to match ‘b’, then a repetition node into either “cd”, “e” or \( \epsilon \), and the repetition node is succeeded by “fg” (and, finally, accept). Again, a drawing is much clearer!
Circles correspond to character-matching states, the square to a repetition node, the diamond to a choice node, and the pointy boxes to dummy states. The final accept state is, again, a double circle. \( \epsilon \)-transitions are marked in blue, and regular transitions are labelled with the character they consume.
The nodes are numbered according to a straight depth-first ordering in which children are traversed before the successor. As promised, character transitions are all from \( i \) to \( i+1 \), e.g. 0 to 1, 1 to 2, 6 to 7, … This is a simple consequence of the fact that character-matching nodes have no children and a single successor.
This numbering is criminally wasteful. States 3, 6 and 10 serve no purpose, except forwarding \( \epsilon \)-transitions to other states. The size of the state machine matters because bitwise operations become less efficient when values larger than a machine word must be manipulated. Using fewer states means that larger regular expressions will be executed more quickly.
Eliminating them and sliding the numbers back yields something equivalent to the more reasonable 11-state machine shown in Wu and Manber (Figure 3 on page 10). Simply not assigning numbers to states that don’t match characters and don’t follow character states suffices to obtain such decent numberings.
Some more thinking shows we can do better, and shave three more states. State 2 could directly match against character ‘e’, instead of only forwarding into state 3; what is currently state 4 could match against character ‘c’, instead of only forwarding into state 8, then 2 and then 5 (which itself matches against ‘c’); and similarly for state 7 into state 8. The result wastes not a single state: each state is used to match against a character, except for the accept state. Interestingly, the \( \epsilon \) transitions are also more regular: they form a complete graph between states 2, 3 and 5.
It’s possible to use fewer states than the naïve translation, and it’s useful to do so. How can a program find compact numberings?
The natural impulse for functional programmers (particularly functional compiler writers ;) is probably to start matching patterns to iteratively reduce the graph. If they’ve had bad experience with slow fixpoint computations, there might also be some attempts at recognising patterns before even emitting them.
This certainly describes my first couple stabs; they were either mediocre or wrong (sometimes both), and certainly not easy to reason about. It took me a while to heed age-old advice about crossing the street and compacting state machines.
Really, what we’re trying to do when compacting the state machine is to determine equivalence classes of states: sets of states that can be tracked as an atomic unit. With rewrite rules, we start by assuming that all the states are distinct, and gradually merge them. In other words, we’re computing a fixed point starting from the initial hypothesis that nothing is equivalent.
Problem is, we should be doing the opposite! If we assume that all the states can be tracked as a single unit, and break equivalence classes up as we’re proven wrong, we’ll get maximal equivalence classes (and thus as few classes as possible).
To achieve this, I start with the naïvely numbered state machine. I’ll refer to the start state and character states as “interesting sources”, and to the accept state and character states as “interesting destinations”. Ideally, we’d be able to eliminate everything but interesting destinations: the start state can be preprocessed away by instead working with all the interesting destinations transitively reachable from the start state via \( \epsilon \) transitions (including itself if applicable).
The idea is that two states are equivalent iff they are always active after the same set of interesting sources. For example, after the start state 0, only state 1 is active (assuming that the character matches). After state 1, however, all of 2, 3, 4, 6, 7, 10 and 11 are active. We have the same set after states 4 and 8. Finally, only one state is alive after each of 11 and 12.
Intersecting the equivalence relations thus defined gives a few trivial equivalence classes (0, 1, 5, 8, 9, 12, 13), and one huge equivalence class, {2, 3, 4, 6, 7, 10, 11}, made of all the states that are active exactly after 1, 5 and 9. For simplicity’s sake, I’ll refer to that equivalence class as K. After contraction, we find this smaller state graph.
We can renumber this reasonably to obey the restriction on character edges: K is split into three nodes (one for each outgoing characterconsuming edge) numbered 2, 4 and 7.
We could do even better if 5 and 9 (3 and 6 above) in the earlier contracted graph were also in equivalence class K: they only exist to forward back into K! I suggest achieving that with a simple postprocessing pass.
Equivalence classes are found by determining after which interesting node each state is live. States that are live after exactly the same sets of interesting nodes define an equivalence class. I’ll denote this map from state to transitive interesting predecessors \( pred(state) \).
We can coarsen the relationship a bit, to obtain \( pred\sp{\prime}(state) \). For interesting destinations, \( pred\sp{\prime} = pred \). For other nodes, \(pred\sp{\prime}(state) = \cap\sb{s\in reachable(state)}pred(s)\), where \(reachable(state)\) is the set of interesting destinations reachable via \( \epsilon \) transitions from \(state\). This widening makes sense because \(state\) isn’t interesting (we never want to know whether it is active, only whether its reachable set is), so it doesn’t matter if \(state\) is active when it shouldn’t, as long as its destinations would all be active anyway.
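Assuming \( pred \) and the \( \epsilon \)-reachable sets have already been computed, the coarsening and the final grouping are a few lines of Python (a sketch with hypothetical names):

```python
from collections import defaultdict

def coarsen(pred, reachable, interesting):
    # pred[s]: interesting sources after which state s is active.
    # reachable[s]: interesting destinations reachable from s via epsilon.
    # Interesting destinations keep pred as-is; other states widen to the
    # intersection of their reachable destinations' pred sets.
    out = {}
    for s in pred:
        if s in interesting:
            out[s] = frozenset(pred[s])
        else:
            sets = [frozenset(pred[d]) for d in reachable.get(s, ())]
            out[s] = frozenset.intersection(*sets) if sets else frozenset()
    return out

def equivalence_classes(coarse_pred):
    # States with identical (coarsened) pred sets form one class.
    groups = defaultdict(list)
    for s, p in coarse_pred.items():
        groups[p].append(s)
    return sorted(sorted(g) for g in groups.values())
```

On a toy machine where state 2 only forwards into interesting state 3, the two collapse into one class.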
This is how we get the final set of equivalence classes.
We’re left with a directed multigraph, and we’d like to label nodes such that each outgoing edge goes from its own label \( i \) to \( i+1 \). We wish to do so while using the fewest labels. I’m pretty sure we can reduce something NP-hard like the minimum path cover problem to this problem, but we can still attempt a simple heuristic.
If there were an Eulerian directed path in the multigraph, that path would give a minimal set of labels: simply label the origin of each arc with its rank in the path. An easy way to generate an Eulerian circuit, if there is one, is to simply keep following any unvisited outgoing arc. If we’re stuck in a dead end, restart from any vertex that has been visited and still has unvisited outgoing arcs.
There’s a fair amount of underspecification there. Whenever many equivalence classes could be chosen, I choose the one that corresponds to the lexicographically minimal (sorted) set of regexp states (with respect to their depthfirst numbering). This has the effect of mostly following the depthfirst traversal, which isn’t that bad. There’s also no guarantee that there exists an Eulerian path. If we’re completely stuck, I start another Eulerian path, again starting from the lexicographically minimal equivalence class with an unvisited outgoing edge.
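A simplified version of that heuristic (always restarting a fresh run rather than splicing into the existing path) can be sketched in Python; the input is a list of (source, destination) character arcs in the contracted multigraph, and the output assigns consecutive labels along each run:

```python
def greedy_labels(arcs):
    # arcs: (source, destination) pairs; ties are broken by taking the
    # minimal node, as described above.
    out = {}
    for idx, (src, _) in enumerate(arcs):
        out.setdefault(src, []).append(idx)
    nodes = sorted({s for s, _ in arcs} | {d for _, d in arcs})
    order = []                # order[k] = node occupying label k
    visited = set()
    while len(visited) < len(arcs):
        # start a run at the smallest node with an unvisited outgoing arc
        cur = min(n for n in nodes
                  if any(i not in visited for i in out.get(n, ())))
        order.append(cur)
        while True:
            pending = [i for i in out.get(cur, ()) if i not in visited]
            if not pending:
                break         # dead end: close this run
            visited.add(pending[0])
            cur = arcs[pending[0]][1]
            order.append(cur)
    return order
```

Each arc visited at step k goes from label k to k+1; a node walked through several times (like class K) receives one label per visit.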
Finally, once the equivalence classes are labelled, both character and \( \epsilon \) transitions are re-expressed in terms of these labels. The result is a nice 8-state machine.
This only covers the abstract stuff. There’s a CL code dump on github. You’re probably looking for compile-regexp-3.lisp and scalar-bitap.lisp; the rest are failed experiments.
Once a small labeling is found, generating a matcher is really straightforward. The data-dependent masks are just a dictionary lookup (probably in a vector or in a perfect hash table), a shift and a mask.
Traditionally, epsilon transitions have been implemented with a few table lookups. For example, the input state can be cut up in bytes; each byte maps to a word in a different lookup table, and all the bytes are ORed together. The tables can be pretty huge (nstates/8 lookup tables of 256 state values each), and the process can be slow for large states (bitsets). This makes it even more important to reduce the size of the state machine.
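The byte-at-a-time scheme can be sketched in Python, with eps_closure[i] holding the bitset of states reachable from state i through \( \epsilon \) transitions (names are mine):

```python
def make_eps_tables(eps_closure, nstates):
    # One 256-entry table per byte of the state bitset: entry v is the
    # union of the closures of the states whose bit is set in that byte.
    tables = []
    for byte in range((nstates + 7) // 8):
        table = [0] * 256
        for value in range(256):
            acc = 0
            for bit in range(8):
                s = byte * 8 + bit
                if s < nstates and value >> bit & 1:
                    acc |= eps_closure[s]
            table[value] = acc
        tables.append(table)
    return tables

def eps_step(state, tables):
    # Epsilon closure of a state bitset: one lookup and one OR per byte.
    acc = 0
    for byte, table in enumerate(tables):
        acc |= table[(state >> (8 * byte)) & 0xff]
    return acc
```

The per-step cost is the uncorrelated loads this trades against in the next paragraph.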
When runtime compilation is easy, it seems to make sense to instead generate a small number of test and conditional moves… even more so if SIMD is used to handle larger state sets. A couple branch-free instructions to avoid some uncorrelated accesses to LUTs looks like a reasonable tradeoff, and, if SIMD is involved, the lookups would probably cause some slow cross-pipeline ping-ponging.
There’s another interesting low-level trick. It’s possible to handle large state sets without multi-word shifts: simply insert padding states (linked via \( \epsilon \)-transitions) to avoid character transitions that straddle word boundaries.
There’s a lot more depth to this bitap-for-regexp-matching thing. For example, bitap regular expressions can be adapted to fuzzy matching (up to a maximum edit distance), by counting the edit distance in unary and working with one bitset for each edit distance value. More important in practice, the approach described so far only handles recognising a regular language; parsing into capture groups and selecting the correct match is a complex issue about which Russ Cox has a lot to say.
What I find interesting is that running the NFA backward from accept states gives us a forward oracle: we can then tell whether a certain state at a given location in the string will eventually reach an accept state. Guiding an otherwise deterministic parsing process with such a regular language oracle clearly suffices to implement capture groups (all nondeterministic choices become deterministic), but it also looks like it would be possible to parse useful nonregular languages without backtracking or overly onerous memoisation.
Huge month for #SBCL: devious bugs fixed, new optimisations, new features. Help us find regressions in HEAD! sbcl.git.sourceforge.net/git/gitweb.cgi…
— Paul Khuong (@pkhuong) May 24, 2013

One feature I’m particularly excited about is the basic support for SSE intrinsics on x86-64. It’s only the fundamental infrastructure needed for the compiler and runtime to manipulate and compile operations on SIMD packs (the platform-independent name suggested by Kevin Reid). However, now that it’s merged in, we can easily add new functionality that maps (more or less) directly to SSE assembly in a running image, like the sb-rotate-byte contrib does for bitwise rotations.
There is currently no such contrib in SBCL, although Alexander Gavrilov has been maintaining cl-simd, an extension for SBCL and ECL (he also kept an old SSE/SBCL branch of mine on life support since 2008). I’m really not sure what the right interface will look like for Common Lisp, so I decided to only provide the mechanism to implement such an interface; when something emerges that looks sane and has been used in anger, we can merge it in.
In the meantime, we get to define our own interface and see where that experience leads us. In this post, I’ll use a classic example to guide my first stab at an interface: the Mandelbrot set. I’ll only define as much of an interface as I need, and adapt it along the way if I see the need for more sophistication; I prefer to build programs and libraries bottomup and iteratively, rather than trying to divine my requirements ahead of time. A consequence of this approach is that I have to/am able to maintain a dynamic development style, even when working with (and working on) SSE intrinsics. I know many people use REPLs as desktop calculators; maybe others can be persuaded to also use SBCL as an SIMD calculator, to doodle on SSE code sequences (:
The last form adds three new functions (f4+, f4* and f4-) to the compiler’s database (fndb). Each function accepts two packs of single floats as arguments, and returns another single float pack (the result of element-wise addition, multiplication or subtraction). Moreover, the functions can be reordered and flushed as dead code by the compiler, and, whenever they appear in code, they should ultimately be translated to assembly code. The last bit is so that the runtime system silently overwrites the database if the functions are already there, rather than warning at load-time.
Next, we add a template (VOP) to convert calls to f4+ into assembly code. This task is so deeply tied with the internals that it makes sense to just do it from SB-VM.
The first line defines a VOP with name mandelbrot::f4+ (it’s an arbitrary symbol and the important bit is only to avoid collisions, but good naming is useful when reading compiler traces or optimisation notes).
Then, we specify that it’s to be automatically considered when there is a call to mandelbrot::f4+, and that it can be used both in fast and in safe code.
The template converts a call with two arguments, x and y, that must both be in SSE registers of single floats. Both values must be represented as single floats in SIMD packs: when defining VOPs, types refer to primitive types, and primitive types are concerned with representation (like C types) rather than only sets of values (a given CL value or type can be mapped to many primitive types depending on the context). We also declare a preference for x to be allocated in the same register as the result r, if possible.
The template has a single result, which is also an SIMD pack of single floats allocated in an SSE register of single floats.
Finally, we have code to emit assembly. If r and y were packed (register-allocated) in the same register, we exploit commutativity to add x into y. Otherwise, we move x into r if necessary (the move function takes care of checking for equivalent locations), and emit a packed single float addition (addps) of y into r (which has just been overwritten with the contents of x).
The reason we have to specify that the SSE registers hold single float values (rather than doubles or integers) is so that functions like move emit moves of FP SSE registers rather than integer ones. However, this additional information doesn’t really affect register allocation: the three storage classes (SC) – single-sse-reg, double-sse-reg and int-sse-reg – all map to the same storage base (SB), and the same SSE register can be used for an simd-pack-single value at one program point, an simd-pack-int at another, and a double-float at yet another.
This VOP is the bare minimum for a useful packed addition of single floats: in the future, it will probably make sense to add support for a constant argument, which could be directly loaded from a RIP-relative address to save a register (SBCL isn’t clever enough to determine when it makes sense to keep a constant in a register across a basic block or a loop).
We do the same for multiplication and subtraction. There’s one last twist for f4-. We can’t exploit commutativity there, so we have to make sure y and r aren’t packed in the same register; this is achieved by extending the live range of r from the end of the (default) live range for x. There are obvious opportunities for macroisation, but, again, I can refactor when I have a clear need (e.g. when I decide to add a dozen more intrinsics).
Now that we’ve declared the functions and defined one way to convert each to machine code, we can define stubs to interact with them from interpreted code or as normal first-class functions. The compiler already has all the information needed to convert any call to f4+ et al. to machine code; however, no such function is actually defined. We can define them with what looks like infinite recursion: the defknown will ensure the compiler inserts type checks for (simd-pack single-float) as needed and infers that the result is also an (simd-pack single-float), so the templates will be applicable.
Now, we can call our new functions at the REPL, pass them to higher-order functions, etc., like any normal function.
The first form makes an simd-pack from four single float values. We must already track at compile-time whether each SIMD pack value is a pack of singles, doubles or integers, so it makes sense to do the same at runtime: that way, we preserve type information when constant folding, and can print them nicely. Thus, we can call f4+ on that value, and see that 1.0 + 1.0 = 2.0. We can also disassemble the function f4* to confirm that it is a normal function that can be passed around (in this case to disassemble), and that it doesn’t endlessly recurse.
The inner loop body to approximate the Mandelbrot set is simply z' = z*z + c, where z and c are complex values. If we expand the complexes into their real and imaginary parts (zr + i zi, cr + i ci), we find z*z + c = (zr*zr - zi*zi + cr) + i(2*zr*zi + ci). That’s easily vectorised if we have a pack of four zr values, another for the zi, and similarly for cr and ci.
This is a straight translation of the scalar formula above, but it computes four values in parallel (that’s why struct-of-arrays layouts lead to easy vectorisation). Crucially, a disassembly reveals we get the code we expect:
The beauty of Python’s representation selection capabilities is that we can specify multiple representations (unboxed in a register or on stack, or boxed as a generic pointer to a heapallocated value) for SIMD packs, and the right (usually…) one will be chosen depending on context. In this case, the function receives boxed SIMD packs, unboxes them into SSE registers in the prologue, performs its FP arithmetic only on SSE registers, and converts the results back into boxed SIMD packs.
Even better: it seems to compute the right thing (:
The first return value is a pack of real components, the second a pack of imaginary components, and the values do correspond to the scalar computation in normal complex arithmetic.
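The four-lane arithmetic is easy to mimic in plain Python, each list standing in for one pack of single floats (a sketch; the function name is mine):

```python
def mandelbrot_iter(zr, zi, cr, ci):
    # z' = z*z + c on four lanes at once: real parts only ever combine
    # with real parts (likewise for imaginary parts), which is what makes
    # the struct-of-arrays layout vectorise so cleanly.
    new_zr = [zr[i] * zr[i] - zi[i] * zi[i] + cr[i] for i in range(4)]
    new_zi = [2.0 * zr[i] * zi[i] + ci[i] for i in range(4)]
    return new_zr, new_zi
```

Lane 0 with z = 3+4i and c = 1+2i yields -6+26i, matching (3+4j)**2 + (1+2j).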
The next step is to compute the (squared) norm of complex values, in order to detect escape from the orbit.
Again, a couple smoke tests, and a disassembly
Finally, we have to compare floats (squared norms) against a limit (4.0), and perform some integer arithmetic to count the number of iterations until escape. f4-sign-mask is the first instance of a function that accepts arbitrary SIMD pack types: whether the pack holds singles, integers, or even doubles, it extracts the sign bits from the four 32-bit chunks.
The VOPs are really straightforward and somewhat repetitive, again: we only want the bare minimum to get something working, and fancy special cases can wait until they actually show up.
Again, we add stubs to easily work at the REPL
And some smoke tests
f4<= compares two packs of single floats element by element, and returns a pack of integers with all 32 bits set to 1 for pairs that are <=, and 0 otherwise. i4- subtracts 32-bit signed integers element-wise; subtracting -1 (#xffff...) is equivalent to adding 1. Finally, f4-sign-mask extracts the topmost (sign) bit from each chunk of 32 bits, and returns the 4 bits as a normal integer (i.e. in a general purpose register). The function sb-ext:%simd-pack-ub32s is useful to extract the four 32-bit unsigned values from an SIMD pack – in this case the result of subtracting the comparison masks from zero – and work on them in a scalar context.
The idea is that we’ll want to compare the result of %norm^2 with the limit (4.0), and stop iterating when the maximal iteration count is reached, or when all four norms are greater than 4.0 (all sign bits are zero). Until then, we can subtract the comparison mask from the counter to increment the number of unescaped iterations. When we’re done, we can easily extract individual iteration counts from the packed counters.
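The counting trick is easy to prototype with four scalar Python lanes standing in for packs (names are mine; the integer -1 plays the role of the all-ones comparison mask, so counter - mask increments exactly the lanes still in orbit):

```python
def escape_counts(cr, ci, max_iter):
    # Iterate z' = z*z + c on four lanes; count iterations until each
    # lane's squared norm exceeds 4.0, mask-subtraction style.
    zr, zi = [0.0] * 4, [0.0] * 4
    counts = [0] * 4
    for _ in range(max_iter):
        norms = [zr[i] * zr[i] + zi[i] * zi[i] for i in range(4)]
        masks = [-1 if n <= 4.0 else 0 for n in norms]  # f4<= analogue
        if not any(masks):                  # all sign bits zero: escaped
            break
        counts = [counts[i] - masks[i] for i in range(4)]  # i4- analogue
        zr, zi = ([zr[i] * zr[i] - zi[i] * zi[i] + cr[i] for i in range(4)],
                  [2.0 * zr[i] * zi[i] + ci[i] for i in range(4)])
    return counts
```

Lanes with c in the Mandelbrot set (e.g. c = 0 or c = -1) run to max_iter, while escaping lanes stop counting as soon as their mask clears.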
This yields
We can run a couple random tests and compare with a straightforward scalar version:
So, there’s some register allocation oddities (mostly because the VOPs don’t handle the case when both arguments are the same, I think), but it’s pretty good. The one issue that bothers me is (zerop (f4-sign-mask still-in-orbit)): it’s a common operation, and there’s a slight miscompile (the result of f4-sign-mask is converted to a fixnum before comparison… an instance of bad representation selection). These two reasons – factoring code out and improving code generation for a medium-level operation – are enough for me to add a specialised function to test if all the sign bits are zero.
We define predicates simply as functions that return boolean values (this one is prefixed with f4 because the instruction officially works on single floats).
However, VOPs for such functions shouldn’t return values; instead, they define how to determine that their result is true – in this case, when the :z (zero) flag is set. Code generation will take care of using that information in conditional branches or conditional moves.
For example, in this stub, the compiler will emit the template, followed by a conditional move to return T or NIL.
Now, we can slightly adjust mandelbrot-escape to exploit this new function:
This all looks like a lot of work, but the vast majority of it is reusable code to define SSE operations: 22 LOC of defknown, 83 LOC for assembly templates, 20 LOC in stubs and utility code. There’s only 31 LOC for the SIMD mandelbrot loop (%mandelbrot-iter, %norm^2 and mandelbrot-escape), 6 for the naïve scalar loop, and 12 for the random test harness. I uploaded all the files on github to simplify further experimentation; just fork and play around! The function names could certainly be improved, for one.
My request for extra-hard bug shaking in SBCL 1.1.8 didn’t function as well as I’d wish: only a few hours after release (and after a week of code freeze), we received a new bug report (with fixes committed very rapidly, at least). We observed the same phenomenon in 1.1.6, and wound up being extremely conservative for 1.1.7. I’m not sure what’s the best way to encourage testing (binary release candidates during freeze?), but I am starting to see the point of the old odd/even policy for Linux. For now, the result is that 1.1.8 offers a lot of exciting improvements, but also more than its share of bugs; for production code, it’s probably best to either build an early post-1.1.8 git commit, or wait for 1.1.9.
Long-time readers might remember that I first blogged about SSE intrinsics in 2008. The problem is that I hit some issues with representing type information (whether an SIMD pack contains integer, single float or double float data), and didn’t manage to find a nice resolution. I also had trouble with type operations (intersection, union, negation) on (simd-pack [type]). For the first problem, I bit the bullet and created many storage classes and primitive types; as shown above, a function can accept arbitrary simd-packs, and so can VOPs: when converting to primitive types, unspecialised simd-packs are treated like integer simd-packs. The second, I solved by moving to a type upgrading system inspired by array types. There are really only three specialised simd-pack types: on integers, single floats and double floats. Any specialised simd-pack type specifier must be specialised on a recognisable subtype of one of these three types. It’s a hack, but all type operations become trivial and easy to think about.
The reason I finally sat down, reread my code and thought hard for a few days is Google’s Summer of Code. A student proposed to build on that work to vectorise standard sequence operations in CL, and I really didn’t want to be blocking their work. In the end we were only allotted two slots, and had to select a pair of extremely strong proposals from ten interesting submissions… more on that soon!
In the meantime, I hope the recent burst of activity and outside interest can motivate others to contribute to SBCL, even without Google support. We already have one person working on a contrib to exploit GMP for bignum and rational arithmetic, instead of our hand-rolled routines; I expect to see it merged in 1.1.9 or in 1.1.10.
P.S. I recently stumbled on Glassman’s FFT and it’s really nifty: an in-order general-size FFT (not so fast when prime factors are huge) in approximately 50 LOC. The performance isn’t awesome on cached architectures, but shouldn’t be atrocious either… and it handles arbitrary DFT sizes. There are issues with numerical stability when computing the twiddle factors on the fly, and the access patterns could be improved, but it’s a very simple and interesting foundation… especially given that the inner loop is naturally vectorisable.
EDIT: There’s a section with a bunch of general references.
EDIT 2: Added a note on genesis when playing with core formats.
The first step is probably to install git, and clone our repo (the github mirror works well, and lets you fork to your own github account for easy publication). Then, building from source and installing SBCL (a local installation to $HOME works fine) is obviously useful experience, and will be useful to explore the source. Reading INSTALL should be enough to get started on Linux or OS X and x86[-64]. Other platforms may need more work, and might not be the best choice if you’re not interested in improving the port itself (although I’m told FreeBSD and Solaris work very well on x86[-64]). To build SBCL from source, you’ll need an SBCL binary (bootstrapping from other CLs should work, but support regularly bitrots away), and the usual C development tools (e.g. build-essential on debian). A fancy build (./make.sh --fancy) is probably the best choice for development.
You’ll also want to run the test suite often; better try it out now (cd tests; sh run-tests.sh) to make sure you can get it working. The test suite will barf if there are non-ASCII characters in the environment. Oh My Zsh’s git customisation systematically trips me up, for example (I currently kludge it and start a bash from ~ and then run the tests).
Once SBCL HEAD is working and installed, it’s probably best to install emacs and SLIME. Quicklisp’s quicklisp-slime-helper can take care of installing SLIME. It is possible to work on SBCL without SLIME. However, SLIME has a lot of useful extensions, if only to explore the code base. If you’re not (and don’t wish to become) comfortable with emacs, it’s probably best to nevertheless use emacs and SLIME for the REPL, debugger, inspector, etc. and write code in your favourite editor. Later on, it’ll be useful to figure out how to make SLIME connect to a freshly-built SBCL.
I often see newcomers try to read the source like a book, and, once they realise there’s a lot of code, try to figure out a good order to read the source. I don’t think that’s the best approach. SBCL is pretty huge, and I doubt anyone ever simultaneously holds the complete system in their head. RAM’s “The Python Compiler for CMU Common Lisp” is still useful as an overview, and SBCL’s internals manual is a good supplement. Once you get close to bootstrapping logic, Christophe Rhodes’s “SBCL: a Sanely-Bootstrappable Common Lisp” helps understand the exclamation marks. Past that, I believe it’s preferable to start out small, learn just enough to get the current task done, and accept that some things just work, without asking how (for now).
In that spirit, I’d say M-. (Alt-period; Command-period on some OS X emacsen) is the best way to explore most of SBCL’s source. SBCL’s build process preserves a lot of source location information, and M-. queries that information to jump to the definitions for any given symbol (M-, will pop back up to the previous location). For example, if you type “(truncate” at the REPL and hit M-. (with the point on or just after “truncate”), you’ll find the out-of-line definition for truncate in (mostly) regular Common Lisp, optimisation rules regarding truncate, and VOPs, assembly language templates, for truncate called with a few sets of argument and return types. The out-of-line definition isn’t that interesting. The transforms, however, are. (VOPs aren’t useful if one isn’t comfortable with the platform’s assembly language, and mostly self-explanatory otherwise.)
The one to “convert integer division to multiplication” is a very good example. One could M-. on deftransform, and go down a very long chain of definitions. Instead, I think it’s only essential to see that the form defines a new rule, like a compiler macro, such that compile-time values (lvars) that represent its two arguments are bound to x and y, and the rule only triggers if its first argument is known to be an unsigned word, and its second a constant unsigned word. If that’s satisfied, the transformation still only triggers if the speed optimisation quality is higher than both compilation speed and space (code size).
Then, the constant value for y is extracted and bound to y, and a conservative bound on the maximum value that x can take at runtime is computed. If truncate by y should be handled elsewhere, the transform gives up. Otherwise, it returns a form that will be wrapped in (lambda (x y) ...) and spliced in the call, instead of truncate.
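The arithmetic behind such a transform can be sketched independently of SBCL: pick a multiplier \( m \approx 2^k / y \) so that one multiplication and one shift reproduce truncated division over the whole range of x. The sufficient condition below is from Granlund and Montgomery’s “Division by Invariant Integers using Multiplication”; the function name is mine.

```python
def magic(y, bits):
    # Find (m, k) with x // y == (x * m) >> k for all 0 <= x < 2**bits.
    # m = ceil(2^k / y) is exact whenever m*y - 2^k <= 2^(k - bits);
    # such a k always exists by k = bits + ceil(log2(y)) at the latest.
    for k in range(bits, 2 * bits + 1):
        m = ((1 << k) + y - 1) // y          # ceil(2^k / y)
        if m * y - (1 << k) <= 1 << (k - bits):
            return m, k
```

For example, magic(7, 10) yields a multiplier that divides any 10-bit x by 7 with a multiply and a shift.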
To extend SBCL’s support for division by constants, it’s not necessary to understand more of SBCL’s compiler than the above. There’s no need to try and understand how deftransform works, only that it defines a rule to simplify calls to truncate. Similarly for lvar-value and lvar-type: the former extracts the value for constant lvars, and the latter the static type derived for that lvar (value at a program point). With time, knowledge will slowly accrete. However, it’s possible, if not best, to start hacking without understanding the whole system. This approach will lead to a bit of cargo culting, but mentors and people on IRC will help make sure it doesn’t do any harm, and can explain more stuff if it’s interesting or à propos.
Working on the compiler itself is a bit more work. I think the best approach is to go in src/compiler/main.lisp and look for compile-component. ir1-phases loops on a component and performs high-level optimisations until fixpoint (or we get tired of waiting), while %compile-component handles the conversion to IR2 and then to machine code. The compilation pipeline hasn’t really changed since the Python paper was written, and the subphases each have their own function (and file). M-. on stuff that sounds interesting is probably the best approach at the IR2 level.
The C and assembly runtime lives in src/runtime/. There’s a lot of
stuff that’s symlinked or generated during the build, so it’s probably
best to look at it after a successful build. Sadly, we don’t track
source locations there, but {c,e,whatever}tags works; so does grep.
GC stuff is in the obvious suspects (gc-common, gencgc, gc, etc.), but
may end up affecting core loading/saving (core, coreparse, save).
Depending on what in the core loading code is affected, code in
genesis (the initial bootstrap that reads fasl files from the
cross-compiler and builds the initial core file) might also have to be
modified (mostly in src/compiler/generic/genesis.lisp). That’s…
more work. Like the project suggestions list says, when we change
things in the runtime, it sometimes ends up affecting a lot of other
components.
GDB tends to be less than useful, because of the many tricks SBCL
plays on itself. It’s usually hard to beat pen, paper, and printf.
At least, rebuilding the C runtime is quick: if the feature
:sb-after-xc-core is enabled (which already happens for fancy
builds), slam.sh should be able to rebuild only the C runtime, and
then continue the bootstrap with the rest of SBCL from the previous
build. That mostly leaves PCL to build, so the whole thing should
take less than a minute on a decent machine.
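As a sketch of that workflow (assuming the feature is passed to make.sh via a --with-… option; exact script names and flags may differ between versions, so check your tree first):

```
# Hypothetical workflow sketch; verify make.sh/slam.sh for your version.
sh make.sh --with-sb-after-xc-core   # one full bootstrap, keeping the xc core
# ... edit files under src/runtime/ ...
sh slam.sh                           # rebuild only the C runtime, then
                                     # finish the bootstrap from the old build
```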
I was replying to an email when I realised that some general compiler references would be useful, in addition to project- and SBCL-specific tips.
Christian Queinnec’s Lisp in Small Pieces gives a good overview of issues regarding compiling Lisp-like languages. Andrew Appel’s Modern Compiler Implementation in ML is more, well, modern (I hear the versions in C and Java have the same text, but the code isn’t as nice… and ML is a very nice language for writing compilers). I also remember liking Appel’s Compiling with Continuations, but I don’t know if it’s particularly useful for CL or the projects we suggest.
For more complicated stuff, I believe Stephen Muchnick’s Advanced Compiler Design and Implementation would have been really nice to have, instead of slogging through code and dozens of papers. Allen & Kennedy’s Optimizing Compilers for Modern Architectures: A Dependence-based Approach is another really good read, but I’m not sure how useful it would be when working on SBCL: we still have a lot of work to do before reaching for the really sophisticated stuff (and what sophistication there is is fairly non-standard).
I believe the Rabbit and Orbit compilers have influenced the design of CMUCL and SBCL. The Lambda papers provide some historical perspective, and the RABBIT and ORBIT theses are linked here.
What little magic remains in SBCL and CMUCL is the type derivation (constraint propagation) pass, and how it’s used to exploit a repository of source-to-source transformations (deftransforms). The rest is bog-standard tech from the 70s or 80s. When trying to understand SBCL’s type derivation pass at a very high level, I remember finding Henry Baker’s The Nimble Type Inferencer for Common Lisp-84 very useful, even though it describes a scheme that doesn’t quite work for Common Lisp (it’s very hard to propagate information backward while respecting the final standard). Kaplan and Ullman’s A Scheme for the Automatic Inference of Variable Types was also helpful.
Over the years, I’ve seen a couple of people come in with great ambition, and give up after some time, seemingly without having made any progress. I believe a large part of the problem is that they tried to understand all of SBCL instead of just learning the bare minimum to get hacking, and that their goal was too big. I already wrote that SBCL is probably best approached bit by bit, with some guidance from people who’ve been there before, and I hope the projects we suggest can all lead to visible progress quickly, after a couple of days or two weeks of work at most.
Still, before investing my time, I like to see the other person also give some of theirs to SBCL. This is why, as I wrote on the mailing list last week, I’m much more inclined to help someone who’s already built SBCL on their own and has submitted a patch that’s been committed or is being improved on the mailing list. I absolutely do not care what the patch is; it can be new code, a bugfix for a highly unlikely corner case, better documentation, or spelling and grammar corrections in comments. The bugs tagged as easy in our bugtracker may provide some inspiration. However trivial a patch might seem, it’s still a sign that someone is willing to put the work in to concretely make SBCL better, and I like that… it’s also a minimal test to make sure the person is able to work with our toolchain. (This isn’t SBCL policy for GSoC. It’s simply how I feel about these things.)
Again, I’m amazed by the number of people who wish to hack on SBCL this summer (as part of Google’s Summer of Code or otherwise). Because of that, I think it’s important to note that this is our first year, and so we’ll likely not have more than two or three spots. However, I always like seeing more contributors, and I hope anyone who’d like to contribute will always be guided, GSoC or not.
Finally, I’ll note that Google’s Summer of Code program was only a good excuse to write up our list of projects: they’re simply suggestions to incite programmers to see what they can do that is useful for SBCL and, most importantly, is interesting for them. Anyone should feel welcome to work on any of these projects, even if they’re not eligible or chosen for GSoC. They’re also only suggestions; if someone has their own idea, we can likely help them out just the same.