Paul Khuong mostly on Lisp

Fixed Points and Strike Mandates

Many tasks in compilation and program analysis (in symbolic computation in general, I suppose) amount to finding solutions to systems of the form \(x = f(x)\). However, when asked to define algorithms to find such fixed points, we rarely stop and ask “which fixed point are we looking for?”

In practice, we tend to be interested in fixed points of monotone functions: given a partial order \((\prec)\), we have \(a \prec b \Rightarrow f(a)\prec f(b)\). Now, in addition to being a fairly reasonable hypothesis, this condition usually lets us exploit Tarski’s fixed point theorem. If the domain of \(f\) (with \(\prec\)) forms a complete lattice, so does the set of fixpoints of \(f\) ! As a corollary, there then exists exactly one least and one greatest fixed point under \(\prec\).

This is extremely useful, because we can usually define useful meet and join operations, and enjoy a complete lattice. For example, for a domain that’s the power set of a given set, we can use \(\subset\) as the order relation, \(\cup\) as join, and \(\cap\) as meet. However, what I find interesting to note is that, when we don’t pay attention to which fixpoint we wish to find, humans seem to consistently develop algorithms that converge to the least or greatest one, depending on the problem. It’s as though we all have a common blind spot covering one of the extreme fixed points.

A simple example is dead value (useless variable) elimination. When I ask people how they’d identify such variables in a program, the naïve solutions tend to be very similar. They exploit the observation that a value is useless if it’s only used to compute values that are themselves useless. The routines start out with every value live (used), and prune away useless values, until there’s nothing left to remove.

These algorithms converge to solutions that are correct, but suboptimal (except for cycle-free code). We wish to identify as many useless values as possible, to eliminate as many computations as possible. Yet, if we start by assuming that all values are live, our algorithm will fail to identify some obviously-useless values, like x in:

for (...)
  x = x

We could keep adding more special cases. However, the correct (simplest) solution is to try and identify live values, rather than dead ones. A value is live if it’s used to compute a live value. Moreover, return values and writes to memory are always live. Our routine now starts out by assuming that only the latter values are live, and adjoins live values as it finds them, until there’s nothing left to add.

In this case, the intuitive solution converges to the greatest fixed point, but we’re looking for the least fixed point. Setting the right initial value ensures convergence to the right fixed point.

Other common instances of this pattern are reference counting instead of marking, or performing type propagation by initially assigning the top type to all values (like SBCL).

# I recently found a use for fixed point computations outside of math and computer science.

Most university or CEGEP student unions in Québec will vote (or already have voted) on strike mandates to help organize protests against rising university tuition fees this winter and spring. There are hundreds of such unions across the province representing, in total, around four hundred thousand students. The vast majority of these unions comprise a couple hundred (or fewer) students, and many feel it would be counter-productive for only a tiny number of students to be on strike. Thus, strike mandates commonly include conditions regarding the minimal number of other students who also hold strike mandates, along with additional lower bounds on the number of unions and universities or colleges involved. As far as I know, all the mandates adopted so far are monotone: if they are satisfied by a set striking unions, they are also satisfied by all of its supersets.

Tarski’s theorem applies (again, with \((\subset, \cup, \cap)\) on the power set of the set of student unions). Which fixed point are we looking for?

It’s clear to me that we’re looking for the fixed point with the largest set of striking unions. In some situations, the least fixed point could trivially be the empty set (or all unions that did not adopt any lower bound). Moreover, the mandates are usually presented with an explanation to the effect that, if unions representing at least \(n_0\) students adopt the same mandate, then all unions that have adopted the mandate will go on strike simultaneously.

I asked fellow graduate students in computer science to sketch an algorithm to determine which unions should go on strike given their mandates; they started with the set of student unions currently on strike, and adjoined unions for which all the conditions were met. Such algorithms will converge toward the least fixed point. For example, there could be two unions, each comprising 5 000 students, with the same strike floor of 10 000 students, and these algorithms would have both unions deadlocked, waiting for the other to go on strike.

Instead, we should start by assuming that all the unions (with a strike mandate) are on strike, and iteratively remove unions whose conditions are not all met, until we hit the greatest fixed point. I’m fairly sure this will end up being a purely theoretical concern, but it’s a pretty neat case of abstract mathematics helping us interpret a real-world situation.

This pattern of intuitively converging toward a suboptimal solution seems to come up a lot when computing fixed points. It’s not necessarily a bad choice: conservative initial values tend to lead to faster convergence, and often have the property that intermediate solutions are always correct (feasible). When we need quick results, it may make sense to settle for suboptimal solutions. However, it ought to be a deliberate choice, rather than a consequence of failing to consider other possibilities.

Migration and Synopsis

This blog has been going for five years. Back then, it seemed like the only widely-used static blog generators were Blosxom or pyBlosxom. They weren’t that hard to set up, but getting everything right rather than good enough is a lot of work. Latex and MathML support was also very weak, so I wound up using a (insane) one-off hack with tex4ht. I feel like Octopress and MathJax now do everything I need out of the box, better than anything I could design by myself.

The permalinks from the old blog are still around, but not the rss feeds or the date-based links.

I figure this is a good opportunity to make sure the (marginally useful) permalinks are available somewhere else than via google.

Lisp-related posts

Another way to accumulate data in vectors describes a copying-free extendable vector. The advantage over the usual geometric growth with copy is that the performance with respect to the number of elements added is much smoother. Runtimes are then more easily predictable, and sometimes improved (e.g. right when a copy would be needed). It’s also more amenable to a lock-free adaptation, while preserving O(1) operation complexity (assuming that integer-length on machine integers is constant time), as shown in Dechev et al’s “Lock-free Dynamically Resizable Arrays”.

Common Cold is a really old hack to get serialisable closures in SBCL, with serialisable continuations built on top of that. Nowadays, I’d do the closure part differently, without any macro or change to the source.

Concurrency with MVars has short and simple(istic) code for mvars, and uses it to implement same-fringe with threads.

Constraint sets in SBCL: preliminary exploration summarises some statistics on how constraint sets (internal SBCL data structures) are used by SBCL’s compiler.

SBCL’s flow sensitive analysis pass explores what operations on constraint sets actually mean. This, along with the stats from the previous post, guided a rewrite, not of constraint sets, but of the analysis pass that uses them. The frequency of slow operations or bad usage patterns is reduced enough to take care of most (all?) performance regression associated with the original switch to bit-vector-based constraint sets, without penalising the common case.

Finalizing foreign pointers just late enough is a short reminder that attaching finalizers to system area pointers isn’t a good idea: SAPs are randomly unboxed and consed back, like numbers.

Hacking SSE Intrinsics in SBCL (part 1) walks through an SBCL branch that adds support for SSE operations. Alexander Gavrilov has kept a fork on life support on github. There’s still no part 2, in which the branch is polished enough to merge it in the mainline.

In the meantime, Complex float improvements for sbcl 1.0.30/x86-64 built upon the original work on SSE intrinsics to implement operations on (complex single-float) and (complex double-float) with SIMD code on x86-64. That sped up most complex arithmetic operations by 100%. That work also came with support for references to unboxed constants on x86oids; this significantly improved floating point performance as well, for both real and complex values.

Initialising structure objects modularly is a solution to a problem that I hit, trying to implement non-trivial initialisation for structures, while allowing inheritance. Tobias Rittweiler points out that the protocol is very similar to a common CLOS pattern where, instead of functions that allocate objects, class designators are passed. It also looks a bit like the way Factor libraries seem to do struct initialisation, but with actual initialisation instead of assignment (which matters for read-only slots).

An Impure Persistent Dictionary is an example of a technique I find really useful to implement persistent versions of side-effectful data structures. Henry Baker has a paper that shows how shallow binding can be used to implement persistent arrays on top of functional arrays, with constant-time overhead for operations on the latest version. It’s a really nice generalisation of trailing in backtracking searches. Here, I use it to get persistent hash tables in only a couple dozen lines of code.

Pipes is an early attempt to develop a DSL for stream processing, like an 80% SERIES. I’ve refocused my efforts on Xecto, which only handles vectors, rather than potentially unbounded streams. The advantage is that Xecto looks like it has the potential to be simpler while achieving near-peak performance to me; the main downside is that vectors don’t allow us to represent control flow as data via lazy evaluation… and I’m not sure that’s such a bad thing.

The post on string-case is an overview of how I structured a CL macro to dispatch that compares with string= instead of eql. If I were to do this again, I’d probably try and improve string=; I later tested an SSE comparison routine, and it ended up being, in a lot of cases, faster and simpler (with a linear search) than the search tree generated by string-case.

The type-lower-bound branch describes early work on a branch that provides a way to shut the compiler up about certain failed type-directed optimisations. A lot of the output from SBCL’s compiler amounts to reports of optimisations that couldn’t be performed (e.g. converting multiplication by a constant power of two to a shift), and why (e.g. the variant argument isn’t known to be small enough). Sometimes, there’s nothing we can do about it: we can’t show the compiler that the argument is small enough because we know that it will sometimes be too large! Yet, CL’s type system (like most) does not let us express that information. Programmers are expected to provide upper bounds on the best static type of values (e.g. we can specify that a value is always a fixnum, although it may really only be integers between 0 and 1023). We would like a way to specify lower bounds as well: “I know that this will take arbitrary fixnum values.” Once we have that, the compiler can skip reporting optimisations that we know can’t be performed (as opposed to those we don’t know whether they can be performed).

Finally, Yet another way to fake continuations sketches a simple but somewhat inefficient way to implement continuations for pure programs. It may be useful for IO-heavy applications (web programming), or in certain cases similar to backtracking search, but in which most of the work is performed outside of backtracking (e.g. during constraint propagation).

General low-level programming issues

SWAR implementation of (some #’zerop …) sketches how we can use SIMD-within-a-register techniques to have fast search for patterns of sub-word size. A degenerate case is when we look for 0 or 1 in bit vectors; in these case, it’s clear how we can test whole words at a time. The idea can be extended to testing vectors of 2, 4, 8 (or any size) -bit elements. I haven’t found time to move this in SBCL’s runtime library (yet), but it would probably be a neat and feasible first project.

Revisiting VM tricks for safepoints explores the performance impact of switching from instrumented pseudo-atomic code sequences to safepoints. The bottom line is that it’s noise. However, some members of the russian Lisp mafia have used it as inspiration, and have managed to implement seemingly solid threaded SBCL on Windows! It’s still a third-party fork for now, but some committers are working on merging it with the mainline.

Fast Constant Integer Division has some stuff on integer division by constants. It’s mostly superseded by Lutz Euler’s work to implement the same algorithm as GCC. There are some interesting identities that can be used to improve on that algorithm a tiny bit and, more interestingly, to implement truncated multiplication by arbitrary fractions. I only stumbled upon those a long time after I wrote the post; I’ll try and come back to this topic in the coming months.

There’s more to locality than caches tracks my attempts to understand why a data structure designed to be cache-efficient did not perform as well as expected. It turns out that cache lines aren’t exactly read atomically (so reading two adjacent addresses may be significantly slower than only one), and that sometimes L2 matters less than TLB. The latter point was an important lesson for me. TLBs are used to accelerate the translation of virtual addresses to physical; every memory access must be translated. TLBs are usually fully associative (behave like content-addressed memory or hash tables, basically), but with a small fixed size, on the order of 512 pages for the slower level. With normal (on x86oids) 4KB pages, that’s only enough for 2 MB of data! Even worse: a cache miss results in a single access to main memory, which is equivalent to ~60-100 cycles at most; a TLB miss, however, results in a lookup in a 4 level page table on x86-64, which often takes on the order of 2-300 cycles. Luckily, there are workarounds, like using 2 MB pages.

Napa-FFT(2) implementation notes is where I try to make the code I wrote for a Fast Fourier transform understandable, especially why it does what it does. Napa-FFT and Napa-FFT2 are vastly faster than Bordeaux-FFT (and than all other CL FFT codes I know, on SBCL), but it’s still around 20-50% slower than the usual benchmark, FFTW. Napa-FFT3 is coming, and it’s a completely different approach which manages to be within a couple percent points of FFTW, and is faster on some operations.

0x7FDE623822FC16E6 : a magic constant for double float reciprocal is a surprisingly popular post. I was trying to approximate reciprocals as fast as possible for a mathematical optimization method. The usual way to do that is to use a hardware-provided approximation and then improve it with a couple iterations of Newton’s method. The post shows how we can instead use the way floats are laid out in memory to provide a surprisingly accurate guess with an integer subtraction. I actually think the interesting part was that it made for a practical use case for the golden section search…

Some notes on Warren has a couple notes about stuff in Warren’s book Hacker’s Delight. The sign extension bit probably deserves more attention; it seems like someone on #lisp asks how they can sign-extend unsigned integers at least once a month.

Two variations on old themes has some stuff on Linux’s ticket spinaphores, and is the beginning of my looking into Robin Hood hashing with linear probing for cache-friendly hash tables.

Interlude: Numerical experiments in hashing covers a first stab at designing a hash table that exploits cache memory. 2-left hashing looks interesting, but its performance was worse than expected, for various reasons, mostly related to the fact that caches can be surprisingly complicated. Two years later, More numerical experiments in hashing: a conclusion revisits the question, and settles on Robin Hood hashing with linear probing. It’s a tiny tweak to normal open addressing (insertions can bump previously-inserted items farther from their insertion point), but it suffices to greatly improve the worst and average probing length, while preserving the nice practical characteristics of linear probing. I’ve also started some work on implementing SBCL’s hash table this way, but there are practical issues with weak hash functions, GC and mutations.

Miscellaneous stuff

In Specify absolute deadlines, not relative timeouts and the sequel, I argue that we should have interfaces that allow users to specify an absolute timeout, with respect to a monotonic clock. Timeouts are convenient, but don’t compose well: how do we implement a timeout version of an operation that sequences two calls to functions that only offer timeouts as well? Any solution will be full of race conditions. PHK disagrees; I’m not sure if all of his complaints can be addressed by using a monotonic clock.

Finally, Space-complexity of SSA in practices has some early thoughts on how Static single assignment scales for typical functional programs. It’s fairly clear that many compilers for functional languages have inefficient (wrt to compiler performance) internal representations; however, it’s not as clear that the industry standard, SSA, would fare much better.