Paul Khuong mostly on Lisp

Starting to Hack on SBCL

SBCL was accepted as a mentoring organisation for Google’s Summer of Code 2013 (our list of project suggestion is here). This will be our first time, so that’s really great news. I’m also extremely surprised by the number of people who’ve expressed interest in working with us. I was going to reply to a bunch of emails individually, but I figure I should also centralise some of the stuff here.

EDIT: There’s a section with a bunch of general references.

EDIT 2: Added a note on genesis when playing with core formats.

Getting started

Setting up the basic tools

The first step is probably to install git, and clone our repo (the github mirror works well, and lets you fork to your own github account for easy publication). Then, building from source and installing SBCL (a local installation to $HOME works fine) is obviously useful experience, and will be useful to explore the source. Reading INSTALL should be enough to get started on Linux or OS X and x86[-64]. Other platforms may need more work, and might not be the best choice if you’re not interested in improving the port itself (although I’m told FreeBSD and Solaris work very well on x86[-64]). To build SBCL from source, you’ll need an SBCL binary (bootstrapping from other CLs should work, but support regularly bitrots away), and the usual C development tools (e.g. build-essential on debian). A fancy build (./make.sh --fancy) is probably the best choice for development.

You’ll also want to run the test suite often; better try it out now (cd tests; sh run-tests.sh) to make sure you can get it working. The test suite will barf if there’s non-ASCII characters in the environment. Oh my Zsh’s git customisation systematically trips me up, for example (I currently kludge it and start a bash from ~ and then run the tests).

Once SBCL HEAD is working and installed, it’s probably best to install emacs and SLIME. Quicklisp’s quicklisp-slime-helper can take care of installing SLIME. It is possible to work on SBCL without SLIME. However, SLIME has a lot of useful extensions, if only to explore the code base. If you’re not (and don’t wish to become) comfortable with emacs, it’s probably best to nevertheless use emacs and SLIME for the REPL, debugger, inspector, etc. and write code in your favourite editor. Later on, it’ll be useful to figure out how to make SLIME connect to a freshly-built SBCL.

Exploring the source

I often see newcomers try to read the source like a book, and, once they realise there’s a lot of code, try to figure out a good order to read the source. I don’t think that’s the best approach. SBCL is pretty huge, and I doubt anyone ever simultaneously holds the complete system in their head. RAM’s “The Python Compiler for CMU Common Lisp” is still useful as an overview, and SBCL’s internals manual is a good supplement. Once you get close to bootstrapping logic, Christophe Rhodes’s “SBCL: a Sanely-Bootstrappable Common Lisp” helps understand the exclamation marks. Past that, I believe it’s preferrable to start out small, learn just enough to get the current task done, and accept that some things just work, without asking how (for now).

In that spirit, I’d say M-. (Alt period, Command period on some OS X emacsen) is the best way to explore most of SBCL’s source. SBCL’s build process preserves a lot of source location information, and M-. queries that information to jump to the definitions for any given symbol (M-, will pop back up to the previous location). For example, if you type “(truncate” at the REPL and hit M-. (with the point on or just after “truncate”), you’ll find the out of line definition for truncate in (mostly) regular Common Lisp, optimisation rules regarding truncate, and VOPs, assembly language templates, for truncate called with a few sets of argument and return types. The out of line definition isn’t that interesting. The transforms, however, are. (VOPs aren’t useful if one isn’t comfortable with the platform’s assembly language, and mostly self-explanatory otherwise.)

The one to “convert integer division to multiplication” is a very good example. One could M-. on deftransform, and go down a very long chain of definitions. Instead, I think it’s only essential to see that the form defines a new rule, like a compiler macro, such that compile-time values (lvars) that represent its two arguments are bound to x and y, and the rule only triggers if its first argument is known to be an unsigned word, and its second a constant unsigned word. If that’s satisfied, the transformation still only triggers if the speed optimisation quality is higher than both compilation speed and space (code size).

Then, the constant value for y is extracted and bound to y, and a conservative bound on the maximum value that x can take at runtime is computed. If truncate by y should be handled elsewhere, the transform gives up. Otherwise, it returns a form that will be wrapped in (lambda (x y) ...) and spliced in the call, instead of truncate.

To extend SBCL’s support for division by constants, it’s not necessary to understand more of SBCL’s compiler than the above. There’s no need to try and understand how deftransform works, only that it defines a rule to simplify calls to truncate. Similarly for lvar-value and lvar-type: the former extracts the value for constant lvars, and the latter the static type derived for that lvar (value at a program point). With time, knowledge will slowly accrete. However it’s possible, if not best, to start hacking without understanding the whole system. This approach will lead to a bit of cargo culting, but mentors and people on IRC will help make sure it doesn’t do any harm, and can explain more stuff if it’s interesting or à propos.

Finding where the compiler lives

Working on the compiler itself is a bit more work. I think the best approach is to go in src/compiler/main.lisp and look for compile-component. ir1-phases loops on a component and performs high-level optimisations until fixpoint (or we get tired of waiting), while %compile-component handles the conversion to IR2 and then to machine code. The compilation pipeline hasn’t really changed since the Python paper was written, and the subphases each have their own function (and file). M-. on stuff that sounds interesting is probably the best approach at the IR2 level.

Runtime stuff

The C and assembly runtime lives in src/runtime/. There’s a lot of stuff that’s symlinked or generated during the build, so it’s probably best to look at it after a successful build. Sadly, we don’t track source locations there, but {c,e,whatever}tags works; so does grep.

GC stuff is in the obvious suspects (gc-common, gencgc, gc, etc.), but may end up affecting core loading/saving (core, coreparse, save). Depending on what in the core loading code is affected, code in genesis (the initial bootstrap that reads fasl files from the cross compiler and builds the initial core file) might also have to be modified (mostly in src/compiler/generic/genesis.lisp). That’s… more work. Like the project suggestions list says, when we change things in the runtime, it sometimes ends up affecting a lot of other components.

GDB tends to be less than useful, because of the many tricks SBCL plays on itself. It’s usually hard to beat pen, paper, and printf. At least, rebuilding the C runtime is quick: if the feature :sb-after-xc-core is enabled (which already happens for --fancy builds), slam.sh should be able to rebuild only the C runtime, and then continue the bootstrap with the rest of SBCL from the previous build. That mostly leaves PCL to build, so the whole thing should takes less than a minute on a decent machine.

Some references

I was replying to an email when I realised that some general compiler references would be useful, in addition to project- and SBCL- specific tips.

Christian Queinnec’s Lisp in Small Pieces gives a good overview of issues regarding compiling Lisp-like languages. Andrew Appel’s Modern Compiler Implementation in ML is more, well, modern (I hear the versions in C and Java have the same text, but the code isn’t as nice… and ML is a very nice language for writing compilers). I also remember liking Appel’s Compiling with Continuations, but I don’t know if it’s particularly useful for CL or the projects we suggest.

For more complicated stuff, I believe Stephen Muchnick’s Advanced Compiler Design and Implementation would have been really nice to have, instead of slogging through code and dozens of papers. Allen & Kennedy’s Optimizing Compilers for Modern Architectures: A Dependence-based Approach is another really good read, but I’m not sure how useful it would be when working on SBCL: we still have a lot of work to do before reaching for the really sophisticated stuff (and what sophistication there is is fairly non-standard).

I believe the Rabbit and Orbit compilers have influenced the design of CMUCL and SBCL. The Lambda papers provide some historical perspective, and the RABBIT and ORBIT theses are linked here.

What little magic remains in SBCL and CMUCL is the type derivation (constraint propagation) pass, and how it’s used to exploit a repository of source-to-source transformations (deftransforms). The rest is bog-standard tech from the 70s or 80s. When trying to understand SBCL’s type derivation pass at a very high level, I remember finding Henry Baker’s The Nimble Type Inferencer for Common Lisp-84 very useful, even though it describes a scheme that doesn’t quite work for Common Lisp (it’s very hard to propagate information backward while respecting the final standard). Kaplan and Ullman’s A Scheme for the Automatic Inference of Variable Types was also helpful.

Getting help

Over the years, I’ve seen a couple of people come in with great ambition, and give up after some time, seemingly without having made any progress. I believe a large part of the problem is that they tried to understand all of SBCL instead of just learning the bare minimum to get hacking, and that their goal was too big. I already wrote that SBCL is probably best approached bit by bit, with some guidance from people who’ve been there before, and I hope the projects we suggest can all lead to visible progress quickly, after a couple days or two weeks of work at most.

Still, before investing my time, I like to see the other person also give some of theirs to SBCL. This is why, as I wrote on the mailing list last week, I’m much more inclined to help someone who’s already built SBCL on their own and has submitted a patch that’s been committed or is being improved on the mailing list. I absolutely do not care what the patch is; it can be new code, a bugfix for a highly unlikely corner case, better documentation, or spelling and grammar corrections in comments. The bugs tagged as easy in our bugtracker may provide some inspiration. However trivial a patch might seem, it’s still a sign that someone is willing to put the work in to concretely make SBCL better, and I like that… it’s also a minimal test to make sure the person is able to work with our toolchain. (This isn’t SBCL policy for GSoC. It’s simply how I feel about these things.)

Again, I’m amazed by the number of people who wish to hack on SBCL this summer (as part of Google’s Summer of Code or otherwise). Because of that, I think it’s important to note that this is our first year, and so we’ll likely not have more than two or three spots. However, I always like seeing more contributors, and I hope anyone who’d like to contribute will always be guided, GSoC or not.

Finally, I’ll note that Google’s Summer of Code program was only a good excuse to write up our list of projects: they’re simply suggestions to incite programmers to see what they can do that is useful for SBCL and, most importantly, is interesting for them. Anyone should feel welcome to work on any of these projects, even if they’re not eligible or chosen for GSoC. They’re also only suggestions; if someone has their own idea, we can likely help them out just the same.

The Eight Useful Polynomial Approximations of `sinf(3)’

I just spent a few CPU-months to generate these text files. They catalogue all the “interesting” (from an efficiency and accuracy point of view) polynomial approximations of degree 16 or lower for a couple transcendental functions, over small but useful ranges, in single and double float arithmetic. This claim seems to raise many good questions when people hear it.

What’s wrong with Taylor approximations? Why the need to specify a range?

Why are the results different for single and double floating point arithmetic? Doesn’t rounding each coefficient to the closest float suffice?

Why do I deem only certain approximations to be interesting and others not, and how can there be so few?

In this post, I attempt to provide answers to these interrogations, and sketch how I exploited classic operations research (OR) tools to enter the numerical analysts’ playground.

The final section describe how I’d interpret the catalogue when coding quick and slightly inaccurate polynomial approximations. Such lossy approximations seem to be used a lot in machine learning, game programming and signal processing: for these domains, it makes sense to allow more error than usual, in exchange for faster computations.

The CL code is all up in the repository for rational-simplex, but it’s definitely research-grade. Readers beware.

Minimax approximations

The natural way to approximate functions with polynomials (particularly if one spends time with physicists or engineers) is to use truncated Taylor series. Taylor approximations are usually easy to compute, but only provide good approximations over an infinitesimal range. For example, the degree-1 Taylor approximation for \(\exp\) centered around 0 is \(1 + x\). It’s also obviously suboptimal, if one wishes to minimise the worst-case error: the exponential function is convex, and gradients consistently under-approximate convex functions.

Another affine function, \(1.26 + 1.18x\), intersects \(\exp\) in two points and is overall much closer over \([-1, 1]\). In fact, this latter approximation (roughly) minimises the maximal absolute error over that range: it’s clearly not as good as the Taylor approximation in the vicinity of 0, but it’s also much better in the worst case (all bets are off outside \([-1, 1]\)). That’s why I’m (or anyone’s libm, most likely) not satisfied by Taylor polynomials, and instead wish to compute approximations that minimise the error over a known range; function-specific identities can be exploited to reduce any input to such a range.

Computing minimax polynomials

As far as I know, the typical methods to find polynomial approximation that minimise the maximal error (minimax polynomials) are iterative algorithms in the style of the Remez exchange algorithm. These methods exploit real analysis results to reduce the problem to computing minimax polynomials over very few points (one per coefficient, i.e. one more than the degree): once an approximation is found by solving a linear equation system, error extrema are computed and used as a basis to find the next approximation. Given arbitrary-precision arithmetic and some properties on the approximated function, the method converges. It’s elegant, but depends on high-precision arithmetic.

Instead, I reduce the approximation problem to a sequence of linear optimisation programs. Exchange algorithms solve minimax subproblems over exactly as many points as there are coefficients: the fit can then be solved as a linear equation. I find it simpler to use many more constraints than there are coefficients, and solve the resulting optimisation problem subject to linear inequalities directly, as a linear program.

There are obviously no cycling problems in this cutting planes approach (the set of points grows monotonically), and all the machinery from exchange algorithms can be reused: there is the same need for a good set of initial points and for determining error extrema. The only difference is that points can always be added to the subproblem without having to remove any, and that we can restrict points to correspond to floating values (i.e. values we might actually get as input) without hindering convergence. The last point seems pretty important when looking at high-precision approximations.

For example, let’s approximate the exponential function over \([-1, 1]\) with an affine function. The initial points could simply be the bounds, -1 and 1. The result is the line that passes by \(\exp(-1)\) and \(\exp(1)\), approximately \(1.54 + 1.18x\). The error is pretty bad around 0; solving for the minimax line over three points (-1, 0 and 1) yields a globally optimal solution, approximately \(1.16 + 1.18x\).

There’s a lot of meat to wrap around this bone. I use a computable reals package in Common Lisp to pull arbitrary-precision rational approximations for arithmetic expressions; using libm directly would approximate an approximation, and likely cause strange results. A variant of Newton’s algorithm (with bisection steps) converges to error extrema (points which can be added to the linear subproblem); arbitrary precision real arithmetic is very useful to ensure convergence down to machine precision here. Each linear subproblem is solved with an exact simplex algorithm in rational arithmetic, and convergence is declared when the value of the error extrema found in the Newton steps correspond to that estimated by the subproblem. Finally, the fact that the input are floating-point values is exploited by ensuring that all the points considered in the linear subproblems correspond exactly to FP values: rather than only rounding a point to the nearest FP value, its two immediately neighbouring (predecessor and successor) FP values were also added to the subproblem. Adding immediate neighbours helps skip iterations in which extrema move only by one ULP.

Initialising the method with a good set of points is essential to obtain reasonable performance. The Chebyshev nodes are known to yield nearly optimal [PDF] initial approximations. The LP-based approach can exploit a large number of starting points, so I went with very many initial Chebyshev nodes (256, for polynomials of degree at most 16), and, again, adjoined three neighbouring FP values for each point. It doesn’t seem useful to me to determine the maximal absolute error very precisely, and I declared convergence when the value predicted by the LP relaxation was off by less than 0.01%. Also key to the performance were tweaks in the polynomial evaluation function to avoid rational arithmetic until the very last step.

Exactly represented coefficients

The previous section gives one reason why there are different tables for single and double float approximations: the ranges of input considered during optimisation differ. There’s another reason that’s more important, particularly for single float coefficients: rounding coefficients can result in catastrophic error blowups.

For example, rounding \(2.5 x\sp{2}\) to the nearest integer coefficient finds either \(2 x\sp{2}\) or \(3 x\sp{2}\). However \(2 x\sp{2} + x\sp{3}\) is also restricted to integer coefficients, but more accurate over \([0, .5]\). Straight coefficient-wise rounding yields an error of \(\frac{1}{2} x\sp{2}\), versus \(\frac{1}{2} x\sp{2} - x\sp{3}\) for the degree-3 approximation. As the following graph shows, the degree-3 approximation is much closer for the range we’re concerned with.

Floating point values are more densely distributed than the integers for the range of values we usually encounter in polynomial approximations, but still discrete, and the same phenomenon crops up, at a smaller scale.

The state of the art for this version of the approximation problem seems to be based on an initial Remez step, followed by a reduction to a CVP.

The cutting planes approach points to a natural solution from the OR world: a branch and cut method. The issue with the cutting planes solution is that, while I only generate points that correspond to FP values, the decision variables (coefficients) are free to take arbitrary rational values.

Branching can be used to split up a decision variable’s range and eliminate from consideration a range of non-FP values. The decision variables’ upper and lower bounds are tightened in a large number of subproblems; solutions gradually converge to all-FP values and their optimality is then proven.

Embedding the cutting planes method in a branch and bound lets us find polynomial approximations with float coefficients that minimise the maximal error over float arguments, with correctly rounded powers. The only remaining simplification is that we assume that the dot product in the polynomial evaluation is error-free. Sadly, removing this simplification results in a horrible discrete optimisation problem with a very bumpy objective function. I’m not sure that any exact approach can solve this in reasonable time. Still, there are ways to evaluate FP polynomials very accurately, and I’m mostly interested in approximations to trade accuracy for speed, so rounding errors may well be negligible compared to approximation errors.

A branch-and-cut method for polynomial approximations

As in all branch and bound methods, a problem is split in subproblems by restricting the range of some coefficient to exclude some infeasible (non-FP) values. For example, in the previous example (in which we’re looking for integer coefficients), the 2.5 coefficient would lead to two subproblems, one in which the degree-2 coefficient is at most 2 \((\lfloor 2.5\rfloor)\), and another in which it’s at least 3 \((\lceil2.5 \rceil)\), excluding all the fractional values between 2 and 3. For floats, we restrict to the closest floats that under- or over- approximate the current value.

However, instead of solving the full continuous relaxation (which is impractical, given the large number of potential argument values), our subproblems are solved with cutting planes. The trick is that cuts (constraints, which correspond to argument values) from any branch can be used everywhere else. Thus, the global point pool is shared between all subproblems, rather than re-generating it from scratch for each subproblem. In practice, this lifting of cutting planes to the root problem seems essential for efficiency; that’s certainly what it took for branch-and-cut MIP solvers to take off.

I generate cuts at the root, but afterwards only when a subproblem yields all-FP values. This choice was made for efficiency reasons. Finding error extrema is relatively slow, but, more importantly, adding cuts really slows down re-optimisation: the state of the simplex algorithm can be preserved between invocations, and warm-starting tends to be extremely efficient when only some variables’ bounds have been modified. However, it’s also necessary to add constraints when an incumbent is all-FP, lest we prematurely declare victory.

There are two really important choices when designing branch and bound algorithms: how the branching variable is chosen, and the order in which subproblems are explored. With polynomial approximations, it seems to make sense to branch on the coefficient corresponding to the lowest degree first: I simply scan the solution and choose the least-degree coefficient that isn’t represented exactly as a float. Nodes are explored in a hybrid depth-first/best-first order: when a subproblem yields children (its objective value is lower than the current best feasible solution, and it isn’t feasible itself), the next node is its child corresponding to the bound closest to the branching variable’s value, otherwise the node with the least predicted value is chosen. The depth-first/closest-first dives quickly converge to decent feasible solutions, while the best-first choice will increase the lower bound, bringing us closer to proving optimality.

A randomised rounding heuristic is also used to provide initial feasible (all-FP) solutions, and the incumbent is considered close enough to optimal when it’s less than 5% off from the best-known lower bound. When the method returns, we have a polynomial approximation with coefficients that are exactly representable as single or double floats (depending on the setting), and which (almost) minimises the approximation error on float arguments.

Interesting approximations

The branch and cut can be used to determine an (nearly-) optimally accurate FP polynomial approximation given a maximum degree. However, the degree isn’t the only tunable to accelerate polynomial evaluation: some coefficients are nicer than others. Multiplying by 0 is obviously very easy (nothing to do), while multiplication by +/-1 is pretty good (no multiplication), and by +/- 2 not too bad either (strength-reduce the multiplication into an addition). It would be possible to consider other integers, but it doesn’t seem to make sense: on current X86, floating point multiplication is only 33-66% slower (more latency) than FP addition, and fiddling with the exponent field means ping-ponging between the FP and integer domains.

There are millions of such approximations with a few “nice” coefficients, even when restricting the search to low degrees (e.g. 10 or lower). However, the vast majority of them will be wildly inaccurate. I decided to only consider approximations that are at least as accurate as the best approximation of degree three lower: e.g. a degree-3 polynomial with nice coefficients is only (potentially) interesting if it’s at least as accurate as a constant. Otherwise, it’s most likely even quicker to just use a lower-degree approximation.

That’s not enough: this filter still leaves thousands of polynomials. Most of the approximations will be dominated by another; what’s the point in looking at a degree-4 polynomial with one coefficient equal to 0 if there’s a degree-3 with one zero that’s just as accurate? The relative importance of the degree, and the number of zeroes, ones and twos will vary depending on the environment and evaluation technique. However, it seems reasonable to assume that zeroes are always at least as quick to work with as ones, and ones as quick as twos. The constant offset ought to be treated distinctly from other coefficients. It’s only added rather than multiplied, so any speed-up is lower, but an offset of 0 is still pretty nice, and, on some architectures, there is special support to load constants like 1 or 2.

This lets me construct a simple but robust performance model: a polynomial is more quickly evaluated than another if it’s of lower or same degree, doesn’t have more non-zero, non-{zero, one} or non-{zero, one, two} multipliers, and if its constant offset is a nicer integer (0 is nicer than +/- 1 is nicer than +/- 2 is nicer than arbitrary floats).

With these five metrics, in addition to accuracy, we have a multi-objective optimisation problem. In some situations, humans may be able to determine a weighting procedure to bring the dimension down to a scalar objective value, but the procedure would be highly domain-specific. Instead, we can use this partial order to report solutions on the Pareto front of accuracy and efficiency: it’s only worth reporting an approximation if there is no approximation that’s better or equal in accuracy and in all the performance metrics. The performance and accuracy characteristics are strongly correlated (decreasing the degree or forcing nice coefficients tends to decrease the accuracy, and a zero coefficient is also zero-or-one, etc.), so it’s not too surprising that there are so few non-dominated solutions.

Enumerating potentially-interesting approximations

The branch and cut can be used to find the most accurate approximation, given an assignment for a few values (e.g. the constant term is 1, or the first degree coefficient 0). I’ll use it as a subproblem solver, in a more exotic branch and bound approach.

We wish to enumerate all the partial assignments that correspond to not-too-horrible solutions, and save those. A normal branch and bound can’t be applied directly, as one of the choices is to leave a given variable free to take any value. However, bounding still works: if a given partial assignment leads to an approximation that’s too inaccurate, the accuracy won’t improve by fixing even more coefficients.

I started with a search in which children were generated by adjoining one fixed value to partial assignments. So, after the root node, there could be one child with the constant term fixed to 0, 1 or 2 (and everything else free), another with the first coefficient fixed to 0, 1 or 2 (everything else left free), etc.

Obviously, this approach leads to a search graph: fixing the constant term to 0 and then the first coefficient to 0, or doing in the reverse order leads to the same partial assignment. A hash table ensures that no partial assignment is explored twice. There’s still a lot of potential for wasted computation: if bounding lets us determine that fixing the constant coefficient to 0 is worthless, we will still generate children with the constant fixed to 0 in other branches!

I borrowed a trick from the SAT solving and constraint programming communities, nogood recording. Modern SAT solvers go far beyond strict branching search: in practice, it seems that good branching orders will lead to quick solutions, while a few bad ones take forever to solve. Solvers thus frequently reset the search tree to a certain extent. However, information is still communicated across search trees via learned clauses. When the search backtracks (an infeasible partial assignment is found), a nogood set of conflicting partial assignments can be computed: no feasible solution will include this nogood assignment.

Some time ago, Vasek Chvatal introduced Resolution Search to port this idea to 0/1 optimisation, and Marius Posta, a friend at CIRRELT and Université de Montréal, extended it to general discrete optimisation [PDF]. The complexity mostly comes from the desire to generate partial assignments that, if they’re bad, will merge well with the current set of learned clauses. This way, the whole set of nogoods (or, rather, an approximation that suffices to guarantee convergence) can be represented and queried efficiently.

There’s no need to be this clever here: each subproblem involves an exact (rational arithmetic) branch and cut. Simply scanning the set of arbitrary nogoods to look for a match significantly accelerates the search. The process is sketched below.

The search is further accelerated by executing multiple branch and cuts and nogood scans in parallel.

The size of the search graph is also reduced by only fixing coefficients to a few values. There’s no point in forcing a coefficient to take the value of 0, 1 or 2 (modulo sign) if it already does. Thus, a coefficient with a value of 0 is left free; otherwise a child extending the partial assignment with 0 is created. Similarly, a child extended with a value of 1 is only generated if the coefficient isn’t already at 0 or 1 (we suppose that multiplication by 0 is at least as efficient as by 1), and similarly for 2. Finally, coefficients between 0 and 1 are only forced to 0 or 1, and those between 1 and 3 to 0, 1 or 2. If a coefficient takes a greater absolute value than 3, fixing it most probably degrades the approximation too strongly, and it’s left free – then again such coefficients only seem to happen on really hard-to-approximate functions like \(\log\) over \([1, 2]\). Also, the last coefficient is never fixed to zero (that would be equivalent to looking at lower-degree approximations, which was done in previous searches). Of course, with negative coefficients, the fixed values are negated as well.

This process generates a large number of potentially interesting polynomial approximations. The non-dominated ones are found with a straight doubly nested loop, with a slight twist: accuracy is computed with machine floating point arithmetic, thus taking rounding into account. The downside is that it’s actually approximated, by sampling fairly many (the FP neighbourhood of 8K Chebyshev nodes) points; preliminary testing indicates that’s good enough for a relative error (on the absolute error estimate) lower than 1e-5. There tends to be a few exactly equivalent polynomials (all the attributes are the same, including accuracy – they only differ by a couple ULP in a few coefficients); in that case, one is chosen arbitrarily. There’s definitely some low-hanging fruit to better capture the performance partial order; the error estimate is an obvious candidate. The hard part was generating all the potentially interesting approximations, though, so one can easily re-run the selection algorithm with tweaked criteria later.

Exploiting the approximation indices

I’m not sure what functions are frequently approximated over what ranges, so I went with the obvious ones: \(\cos\) and \(\sin\) over \([-\pi/2, \pi/2]\), \(\exp\) and arctan over \([-1, 1]\) or \([0, 1]\), \(\log\) over \([1, 2]\), and \(\log 1+x\) and \(\log\sb{2} 1+x\) over \([0, 1]\). This used up a fair amount of CPU time, so I stopped at degree 16.

Each file reports the accuracy and efficient metrics, then the coefficients in floating point and rational form, and a hash of the coefficients to identify the approximation. The summary columns are all aligned, but each line is very long, so the files are best read without line wrapping.

For example, if I were looking for a fairly good approximation of degree 3 for \(\exp\) in single floats, I’d look at exp-degree-lb_error-non_zero-non_one-non_two-constant-error.

The columns report the accuracy and efficiency metrics, in the order used to sort approximations lexicographically:

  1. the approximation’s degree;
  2. the floor of the negated base-2 logarithm of the maximum error (roughly bits of absolute accuracy, rounded up);
  3. the number of non-zero multipliers;
  4. the number of non-{-1, 0, 1} multipliers;
  5. whe number of non-{-2, -1, 0, 1, 2} multipliers;
  6. whether the constant’s absolute value is 0, 1, 2, or other (in which case the value is 3); and
  7. the maximum error.

After that, separated by pipes, come the coefficients in float form, then in rational form, and the MD5 hash of the coefficients in a float vector (in a contiguous vector, in increasing order of degree, with X86’s little-endian sign-magnitude representation). The hash might also be useful if you’re worried that your favourite implementation isn’t parsing floats right.

There are three polynomials with degree 3, and they all offer approximately the same accuracy (lb_error = 10). I’d choose between the most accurate polynomial

1
exp(x) = 0.9994552 + 1.0166024 x + 0.42170283 x**2 + 0.2799766 x**3 # exp-74F7B9B7E0E73A804ABF6AC6C006BD98

or one with a nicer multiplier that doesn’t even double the maximum error

1
exp(x) = 1.0009761 + x + 0.4587815 x**2 + 0.2575481 x**3 # exp-D4C349D8F2C45EC0BE2154D1052EAA03

On the other hand, if I were looking for an accurate-enough approximation of \(\log 1+x\), I’d open log1px-lb_error-degree-error-non_zero-non_one-non_two-constant.

The columns are in the order used for the lexicographic sort:

  1. the number of bits of accuracy;
  2. the approximation’s degree;
  3. the maximum error;
  4. the number of non-zero multipliers;
  5. the number of non-{-1, 0, 1} multipliers;
  6. the number of non-{-2, -1, 0, 1, 2} multipliers; and
  7. whether the constant’s absolute value is 0, 1, 2, or other (in which case the value is 3).

An error around 1e-4 would be reasonable for my needs, and log1px-6AE509 seems interesting: maximum error is around 1.5e-4, it’s degree 4, the constant offset is 0 and the first multiplier 1. If I needed a bit more accuracy (7.1e-5), I’d consider log1px-E8200B: it’s degree 4 as well, and the constant is still 0.

It seems to me optimisation tools like approximation generators are geared toward fire and forget usage. I don’t believe that’s a realistic story: very often, the operator will have a fuzzy range of acceptable parameters, and presenting a small number of solutions with fairly close matches lets them exploit domain-specific insights. In this case, rather than specifying fixed coefficients and degree or accuracy goals, users can scan the indices and decide whether each trade-off is worth it or not. That’s particularly true of the single float approximations, for which the number of possibilities tends to be tiny (e.g. eight non-dominated approximations for \(\sin\)).

Tabasco Sort: A Super-optimal Merge Sort

EDIT: 2012-08-29: I added a section to compare comparison counts with known bounds for general comparison sorts and sorting networks.

In an earlier post, I noted how tedious coding unrolled sorts can be. Frankly, that’s the main reason I stopped at leaf sorts of size three. Recently, Neil Toronto wrote a nice post on the generation of size-specialised merge sorts. The post made me think about that issue a bit more, and I now have a neat way to generate unrolled/inlined merge sorts that are significantly smaller than the comparison and size “-optimal” inlined merge sorts.

The code is up as a single-file library, and sorts short vectors faster than SBCL’s inlined heapsort by a factor of two to three… and compiles to less machine code. The generator is a bit less than 100 LOC, so I’m not sure I want to include it in the mainline yet. If someone wants to add support for more implementations, I’d be happy to extend Tabasco sort, and might even consider letting it span multiple files ;)

Differently-optimal sorts

The inlined merge sort for three values (a, b, and c) is copied below. It has to detect between \(3! = 6\) permutation, and does so with an optimal binary search tree. That scheme leads to code with \(n! - 1\) comparisons to sort n values, and for which each execution only goes through two or three comparisons (\(\approx \lg n!\)).

“optimal” inlined merge sort (n = 3)
1
2
3
4
5
6
7
8
9
10
11
(if (< b c)
    (if (< a b)
        (values a b c)
        (if (< a c)
            (values b a c)
            (values b c a)))
    (if (< a c)
        (values a c b)
        (if (< a b)
            (values c a b)
            (values c b a))))

An optimal sorting network for three values needs only three comparisons, and always executes those three comparisons.

optimal sorting network (n = 3)
1
2
3
4
5
6
7
(progn
  (when (< c b)
    (rotatef b c))
  (when (< b a)
    (rotatef a b))
  (when (< c b)
    (rotatef b c)))

Finally, the leaf sort I used in SBCL is smaller than the inlined merge sort (three comparisons), but sometimes executes fewer than three comparisons. It’s superoptimal ;)

“super-optimal” inlined merge sort (n = 3)
1
2
3
4
5
6
7
8
(progn
  (when (< c b)
    (rotatef b c))
  (if (< b a)
      (if (< c a)
          (values b c a)
          (values b a c))
      (values a b c)))

The optimal merge sort is larger than the optimal sorting network, and the optimal sorting network performs potentially more comparisons than the optimal merge sort…

Each implementation is optimal for different search spaces: the optimal merge sort never merges continuations, and the sorting network’s only control dependencies are in the conditional swaps.

The “super-optimal” merge sort does better by allowing itself both assignments (or tail-calls) and non-trivial control flow: it’s smaller than the inlined merge sort (but performs more data movement), and potentially executes fewer comparisons than the sorting network (with a larger total code size). And, essential attribute in practice, it’s easy to generate. This contrasts with optimal sorting networks, for which we do not have any generation method short of brute force search; in fact, in practice, sorting networks tend to exploit suboptimal (by a factor of \(\log n\)) schemes like bitonic sort or odd-even merge sort. Then again, we’re only concerned with tiny sorts, and asymptotics can be misleading: Batcher’s odd-even merge sort happens to be optimal for \(n\leq 8\). The issue with sorting networks remains: data-oblivious control flow pessimises their comparison count.

Generalising from size 3

What the last merge sort does is to first sort both halves of the values (a is trivially sorted, and b c needs one conditional swap), and then, assuming that each half (a and b c) is sorted, find the right permutation with which to merge them. Rather than \(n!\) permutations, a merge only needs to distinguish between \(C(n, \lfloor n/2\rfloor) = \frac{n!}{\lfloor n/2\rfloor!\lceil n/2\rceil!}\) permutations, and the recursive sorts are negligible compared to the merge step. That’s a huge reduction in code size!

A simple merge generator fits in half a screenful.

unrolled merge generator
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
(defun emit-permute (destinations sources)
  ;; (setf values) is parallel assignment
  `(setf (values ,@destinations) (values ,@sources)))

(defun emit-merge-1 (destinations left right acc)
  "Build a search tree to determine the right permutation to
   merge LEFT and RIGHT, given that each is pre-sorted."
  (cond ((null left)
         (emit-permute destinations (append (reverse acc) right)))
        ((null right)
         (emit-permute destinations (append (reverse acc) left)))
        (t
         `(if (< ,(first right) ,(first left)) ; stable sort
              ,(emit-merge-1 destinations
                             left (rest right)
                             (cons (first right) acc))
              ,(emit-merge-1 destinations
                             (rest left) right
                             (cons (first left) acc))))))

(defun emit-merge (left right)
  (emit-merge-1 (append left right) left right nil))

Given two lists of sorted variables, emit-merge calls emit-merge-1 to generate code that finds the right permutation, and executes it at the leaf. A binary search tree is generated by keeping track of the merged list in a reverse-order (to enable tail-sharing) accumulator of variable names. As expected, when merging a list of length one with another of length two, we get pretty much the code I wrote by hand earlier.

CL-USER> (emit-merge '(a) '(b c))
(if (< b a)
    (if (< c a)
        (setf (values a b c) (values b c a))
        (setf (values a b c) (values b a c)))
    (setf (values a b c) (values a b c)))

There’s one striking weakness: we generate useless code for the identity permutation. We could detect that case, or, more generally, we could find the cycle decomposition of each permutation and use it to minimise temporary values; that’d implicitly take care of cases like (setf (values a b c) (values b a c)), in which some values are left unaffected.

A smarter permutation generator

I’ll represent permutations as associative lists, from source to destination. Finding a cycle is easy: just walk the permutation from an arbitrary value until we loop back.

extract a single cycle from a linearly-represented permutation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
(defun find-cycle (mapping)
  "Extract an arbitrary cycle from a non-empty mapping,
   returning both the cycle and the rest of the mapping."
  (assert mapping)
  (let* ((head  (pop mapping))
         (cycle (list (cdr head))))
    (loop
     (let* ((next-source (first cycle))
            (pair        (assoc next-source mapping)))
       (cond (pair
              (push (cdr pair) cycle)
              ;; if this sucks enough to matter, the output
              ;; will be humongous anyway
              (setf mapping (remove pair mapping)))
             (t
              (assert (eql next-source (first head)))
              (return (values cycle mapping))))))))

To generate the code corresponding to a permutation, I can extract all the cycles, execute each cycle with a rotatef.

cycle-decomposition-based permute generator
1
2
3
4
5
6
7
8
9
10
11
12
13
14
(defun emit-permute (destinations sources)
  "Emit a [destinations <- sources] permutation via its
   cycle decomposition"
  ;; source -> destination alist, minus trivial pairs
  (let ((mapping (remove-if (lambda (pair)
                              (eql (car pair) (cdr pair)))
                            (pairlis sources destinations))))
    `(progn
       ,@(loop while mapping
               collect
               (multiple-value-bind (cycle new-mapping)
                   (find-cycle mapping)
                 (setf mapping new-mapping)
                 `(rotatef ,@cycle))))))

The merge step for a sort of size three is now a bit more explicit, but likely compiles to code that uses fewer registers as well. It probably doesn’t matter on good SSA-based backends, but those are the exception rather than the norm in the Lisp world.

CL-USER> (emit-merge '(a) '(b c))
(if (< b a)
    (if (< c a)
        (progn (rotatef a b c))
        (progn (rotatef a b)))
    (progn))

Adding the recursive steps

The only thing missing for a merge sort is to add base cases and recursion. The base case is easy: lists of length one are sorted. Inlining recursion is trivial, as is usually the case when generating Lisp code.

“super-optimal” inlined merge sort
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
(defun emit-sort-1 (values length)
  (when (> length 1)
    (let* ((split (truncate length 2))
           (left  (subseq values 0 split))
           (right (subseq values split)))
      `(progn
         ,(emit-sort-1 left  split)
         ,(emit-sort-1 right (- length split))
         ,(emit-merge left right)))))

(defun emit-sort (values)
  (emit-sort-1 values (length values)))

(defmacro inline-sort (&rest values)
  (let* ((pairs (loop for value in values
                      collect `(,(gensym "TEMP") ,value)))
         (temps (mapcar #'first pairs)))
    `(let ,pairs
       ,(emit-sort temps)
       (values ,@temps))))

The resulting three-value sorter looks good; there are some redundancies with nested or empty progns, but any half-decent compiler will take care of that. Python certainly does a goob job on that code.

CL-USER> (emit-sort '(a b c))
(progn
 nil
 (progn
  nil
  nil
  (if (< c b)
      (progn (rotatef b c))
      (progn)))
 (if (< b a)
     (if (< c a)
         (progn (rotatef a b c))
         (progn (rotatef a b)))
     (progn)))
CL-USER> (disassemble (lambda (a b c)
                        (declare (type fixnum a b c))
                        (inline-sort a b c)))
; disassembly for (lambda (a b c))
; 0E88F150:       498BD0           mov rdx, r8                ; no-arg-parsing entry point
;       53:       498BC9           mov rcx, r9
;       56:       498BDA           mov rbx, r10
;       59:       4D39CA           cmp r10, r9
;       5C:       7C30             jl L3
;       5E: L0:   4C39C1           cmp rcx, r8
;       61:       7D0B             jnl L1
;       63:       4C39C3           cmp rbx, r8
;       66:       7C1B             jl L2
;       68:       488BD1           mov rdx, rcx
;       6B:       498BC8           mov rcx, r8
;       6E: L1:   488BF9           mov rdi, rcx
;       71:       488BF3           mov rsi, rbx
;       74:       488D5D10         lea rbx, [rbp+16]
;       78:       B906000000       mov ecx, 6
;       7D:       F9               stc
;       7E:       488BE5           mov rsp, rbp
;       81:       5D               pop rbp
;       82:       C3               ret
;       83: L2:   488BD1           mov rdx, rcx
;       86:       488BCB           mov rcx, rbx
;       89:       498BD8           mov rbx, r8
;       8C:       EBE0             jmp L1
;       8E: L3:   498BCA           mov rcx, r10
;       91:       498BD9           mov rbx, r9
;       94:       EBC8             jmp L0

I thought about generating calls to values in the final merge, rather than permuting, but decided against: I know SBCL doesn’t generate clever code when permuting registers, and that’d result in avoidable spills. I also considered generating code in CPS rather than emitting assignments; again, I decided against because I can’t depend on SBCL to emit clever permutation code. The transformation would make sense in a dialect with weaker support (both compiler and social) for assignment.

How good is the generated code?

Both this inline merge sort and the original, permutation-free (except at the leaves), one actually define the exact same algorithms. For any input, both (should) execute the same comparisons in the same order: the original inline merge sort simply inlines the whole set of execution traces, without even merging control flow.

The permutation-free sort guarantees that it never performs redundant comparisons. Whether it performs the strict minimum number of comparisons, either on average or in the worst case, is another question. At first, I thought that \(\lg n!\) should be a good estimate, since the search tree seems optimally balanced. The problem is \(n!\) tends to have many other prime factors than 2, and we can thus expect multiple comparisons to extract less than 1 bit of information, for each execution. The lower bound can thus be fairly far from the actual value… Still, this question is only practically relevant for tiny sorts, so the discrepancy shouldn’t be too large.

A simple way to get the minimum, average or maximum comparison count would be to annotate the permutation-free generator to compute the shortest, average and longest path as it generates the search tree.

I’ll instead mimic the current generator.

The first step is to find the number of comparisons to perform a merge of m and n values.

If either m or n is zero, merging the sequences is trivial.

Otherwise, the minimum number of comparisons is (min m n): the sequences are pre-sorted, and the shortest sequence comes first. The maximum is (1- (+ m n)). I couldn’t find a simple expression for the average over all permutations. Instead, I iterate over all possible combinations and count the number of comparisons until either subsequence is exhausted.

compute the number of comparisons during merges
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
(defun merge-count (m n)
  ;; return min, expected, max comparison to merge
  ;; presorted subsequences of m and n values
  (if (zerop (min m n))
      (values 0 0 0)
      (let ((comparisons 0)
            (count 0)
            (min   most-positive-fixnum)
            (max   0))
        (dotimes (i (ash 1 (+ m n))
                    (values min
                            (/ comparisons
                               count)
                            max))
          ;; only consider combinations with m ones
          ;; (and n zeros)
          (when (= (logcount i) m)
            (let ((cmp (1- (+ m n)))
                  (mask i))
              ;; no more comparison needed until two consecutive
              ;; elements from distinct subsequences
              (loop while (eql (logbitp 0 mask)
                               (logbitp 1 mask))
                    do (decf cmp)
                       (setf mask (ash mask -1)))
              (setf min (min min cmp)
                    max (max max cmp))
              (incf comparisons cmp)
              (incf count)))))))

Counting the number of comparisons in sorts is then trivial, with a recursive function. I didn’t even try to memoise repeated computations: the generated code is ludicrously long when sorting as few as 13 or 14 values.

compute the number of comparisons in merge sort
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
(defun sort-count (n)
  (if (<= n 1)       ; trivially sorted
      (values 0 0 0)
      (let ((min 0)
            (avg 0)
            (max 0))
        (flet ((inc (min- avg- max-)
                 (incf min min-)
                 (incf avg avg-)
                 (incf max max-)))
          ;; accumulate min/avg/max count from sorting the left and
          ;; right subsequences, and merging
          (multiple-value-call #'inc (sort-count (floor n 2)))
          (multiple-value-call #'inc (sort-count (ceiling n 2)))
          (multiple-value-call #'inc (merge-count (floor n 2)
                                                  (ceiling n 2))))
        (values min avg max))))
CL-USER> (loop for i from 2 upto 16
               do (multiple-value-bind (min avg max)
                      (sort-count i)
                    (format t "~4D ~6,2F ~6D ~6,2F ~6D~%"
                            i (log (! i) 2d0)
                            min (float avg) max)))
;; n  lg(n!)   min   avg     max ;   best   network
   2   1.00      1   1.00      1 ;   1      1
   3   2.58      2   2.67      3 ;   3      3
   4   4.58      4   4.67      5 ;   5      5
   5   6.91      5   7.17      8 ;   7      9
   6   9.49      7   9.83     11 ;   10     12
   7  12.30      9  12.73     14 ;   13     16
   8  15.30     12  15.73     17 ;   16     19
   9  18.47     13  19.17     21 ;   19     25?
  10  21.79     15  22.67     25 ;   22     29?
  11  25.25     17  26.29     29 ;   26     35?
  12  28.84     20  29.95     33 ;   30     39?
  13  32.54     22  33.82     37 ;   34     45?
  14  36.34     25  37.72     41 ;   38     51?
  15  40.25     28  41.69     45 ;   42     56?
  16  44.25     32  45.69     49 ;   46?    60?

I annotated the output with comments (marked with semicolons). The columns are, from left to right, the sort size, the theoretical lower bound (on the average or maximum number of comparisons), the minimum number of comparisons (depending on the input permutation), the average (over all input permutations), and the maximum count. I added two columns by hand: the optimal worst-case (maximum) comparison counts (over all sorting methods, copied from the OEIS), and the optimal size for sorting networks, when known (lifted from a table here [pdf]). Inexact (potentially higher than the optimum) bounds are marked with a question mark.

For the input size the inline merge sort can reasonably tackle (up to ten or so), its worst-case is reasonably close to the best possible, and its average case tends to fall between the lower bound and the best possible. Over all these sizes, the merge sort’s worst case performs fewer comparisons than the optimal or best-known sorting networks. The current best upper bounds on the minimal worst-case comparison count seem to be based on insertion sort passes that minimise the number of comparisons with a binary search. I don’t believe that’s directly useful for the current use case, but a similar trick might be useful to reduce the number of comparisons, at the expense of reasonably more data movement.

Making it CL-strength

That’s already a decent proof of concept. It’s also far too plain to fit in the industrial-strength Common Lisp way. An inline sorting macro worthy of CL should be parameterised on both comparator and key functions, and work with arbitrary places rather than only variables.

Parameterising the comparator is trivial. The key could be handled by calling it at each comparison, but that’s wasteful. We’re generating code; might as well go for glory. Just like in the list merge sort, I’ll cache calls to the key function in the merge tree. I’ll also use special variables instead of passing a half-dozen parameters around in the generator.

key/comparator-aware merge generator
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
(defvar *inline-sort-comparator*)
(defvar *inline-sort-key*)
(defvar *inline-sort-destinations*)
(defvar *inline-sort-left-head*)
(defvar *inline-sort-right-head*)

(defun emit-merge-1 (left right acc)
  "Build a search tree to determine the right permutation to
   merge LEFT and RIGHT, given that each is pre-sorted."
  ;; stability trickery
  `(if (funcall ,*inline-sort-comparator* ,*inline-sort-right-head*
                                          ,*inline-sort-left-head*)
       ,(let* ((acc        (cons (first right) acc))
               (right      (rest right)))
          ;; pop from RIGHT, and recurse if RIGHT isn't empty.
          (if right
              `(let ((,*inline-sort-right-head*
                       (funcall ,*inline-sort-key* ,(first right))))
                 ,(emit-merge-1 left right acc))
              (emit-permute *inline-sort-destinations*
                            (append (reverse acc) left))))
       ;; same
       ,(let* ((acc  (cons (first left) acc))
               (left (rest left)))
          (if left
              `(let ((,*inline-sort-left-head*
                       (funcall ,*inline-sort-key* ,(first left))))
                 ,(emit-merge-1 left right acc))
              (emit-permute *inline-sort-destinations*
                            (append (reverse acc) right))))))

(defun emit-merge (left right)
  "Caching calls to KEY means we have to special-case empty lists
   (which doesn't happen when we sort, anyway)"
  (cond ((null left)
         (emit-permute right right))
        ((null right)
         (emit-permute left left))
        (t
         (let ((*inline-sort-destinations* (append left right))
               (*inline-sort-left-head*  (gensym "LEFT-HEAD-KEY"))
               (*inline-sort-right-head* (gensym "RIGHT-HEAD-KEY")))
           `(let ((,*inline-sort-left-head*  (funcall ,*inline-sort-key*
                                                      ,(first left)))
                  (,*inline-sort-right-head* (funcall ,*inline-sort-key*
                                                      ,(first right))))
              ,(emit-merge-1 left right nil))))))

(defun emit-sort-1 (values length)
  "Unrolled and inlined recursive merge sort generator.
   Lists of length 1 or less are trivially sorted; recurse
   on the rest."
  (when (> length 1)
    (let* ((split (truncate length 2))
           (left  (subseq values 0 split))
           (right (subseq values split)))
      `(progn
         ,(emit-sort-1 left  split)
         ,(emit-sort-1 right (- length split))
         ,(emit-merge left right)))))

(defun emit-sort (values *inline-sort-comparator* *inline-sort-key*)
  (emit-sort-1 values (length values)))

Finally, handling arbitrary places, thus letting the macro take care of writing results back to the places, is just regular macrology.

CL-style inline sort macro
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
(defmacro inline-sort ((comparator &key key (overwrite t))
                       &body values
                       &environment env)
  "Sorts all VALUES in increasing order with respect to COMPARATOR and
   KEY.  COMPARATOR should be a strict order, like CL:<, and KEY defaults
   to NIL (which is interpreted as the identity).  By default, the result
   is written back to the places; that's skipped if OVERWRITE is NIL. A
   literal NIL value for overwrite will avoid generating any write.
   The SORT form always evaluates to the sorted values, in order."
  (let (vars vals
        store-vars writer-forms
        reader-forms
        temps
        (_comparator (gensym "COMPARATOR"))
        (_key        (gensym "KEY"))
        (_overwrite  (gensym "OVERWRITE")))
    (loop for value in (reverse values) do
      (push (gensym "TEMP") temps)
      ;; only use the setf expansion if we might write to the place.
      (if (not overwrite)
          (push value reader-forms)
          (multiple-value-bind (var val store-var writer reader)
              (get-setf-expansion value env)
            (setf vars (append var vars)
                  vals (append val vals))
            (push store-var store-vars)
            (push writer writer-forms)
            (push reader reader-forms))))
    `(let* ((,_comparator ,comparator)
            (,_comparator (if (functionp ,_comparator)
                              ,_comparator
                              (symbol-function ,_comparator)))
            (,_key        ,(or key '#'identity))
            (,_key        (if (functionp ,_key)
                              ,_key
                              (symbol-function ,_key)))
            (,_overwrite  ,overwrite)
            ,@(mapcar 'list vars vals)
            ,@(mapcar 'list temps reader-forms))
       (declare (ignorable ,_comparator ,_key ,_overwrite))
       ,(emit-sort temps _comparator _key)
       ,(and overwrite
             `(when ,_overwrite
                ,@(loop
                    for value in values
                    for store-var-list in store-vars
                    for writer in writer-forms
                    for temp in temps
                    collect
                    (progn
                      (unless (= 1 (length store-var-list))
                        (error "Can't sort multiple-value place ~S" value))
                      `(let ((,(first store-var-list) ,temp))
                         ,writer)))))
       (values ,@temps))))

Now, the macro can be used to sort, e.g., vectors of double floats “in-place” (inasmuch as copying everything to registers can be considered in-place).

CL-USER> (macroexpand-1 `(inline-sort (#'< :key #'-)
                           (aref array 0) (aref array 1) (aref array 2)))
(let* ((#:comparator1184 #'<)
       (#:comparator1233
        (if (functionp #:comparator1233)
            #:comparator1233
            (symbol-function #:comparator1233)))
       (#:key1185 #'-)
       (#:key1234
        (if (functionp #:key1234)
            #:key1234
            (symbol-function #:key1234)))
       (#:overwrite1186 t)
       (#:array1195 array)
       (#:array1192 array)
       (#:array1189 array)
       (#:temp1193 (aref #:array1195 0))
       (#:temp1190 (aref #:array1192 1))
       (#:temp1187 (aref #:array1189 2)))
  (declare (ignorable #:comparator1184 #:key1185 #:overwrite1186))
  (progn
   nil
   (progn
    nil
    nil
    (let ((#:left-head-key1196 (funcall #:key1185 #:temp1190))
          (#:right-head-key1197 (funcall #:key1185 #:temp1187)))
      (if (funcall #:comparator1184 #:right-head-key1197 #:left-head-key1196)
          (progn (rotatef #:temp1190 #:temp1187))
          (progn))))
   (let ((#:left-head-key1198 (funcall #:key1185 #:temp1193))
         (#:right-head-key1199 (funcall #:key1185 #:temp1190)))
     (if (funcall #:comparator1184 #:right-head-key1199 #:left-head-key1198)
         (let ((#:right-head-key1199 (funcall #:key1185 #:temp1187)))
           (if (funcall #:comparator1184 #:right-head-key1199
                        #:left-head-key1198)
               (progn (rotatef #:temp1193 #:temp1190 #:temp1187))
               (progn (rotatef #:temp1193 #:temp1190))))
         (progn))))
  (when #:overwrite1186
    (let ((#:new1194 #:temp1193))
      (sb-kernel:%aset #:array1195 0 #:new1194))
    (let ((#:new1191 #:temp1190))
      (sb-kernel:%aset #:array1192 1 #:new1191))
    (let ((#:new1188 #:temp1187))
      (sb-kernel:%aset #:array1189 2 #:new1188)))
  (values #:temp1193 #:temp1190 #:temp1187))
t
CL-USER> (disassemble (lambda (array)
                        (declare (type (simple-array double-float (3)) array))
                        (inline-sort (#'< :key #'-)
                          (aref array 0) (aref array 1) (aref array 2))
                        array))
; disassembly for (lambda (array))
; 0C5A5661:       F20F105201       movsd XMM2, [rdx+1]        ; no-arg-parsing entry point
;      666:       F20F104209       movsd XMM0, [rdx+9]
;      66B:       F20F104A11       movsd XMM1, [rdx+17]
;      670:       660F28E0         movapd XMM4, XMM0
;      674:       660F5725A4000000 xorpd XMM4, [rip+164]
;      67C:       660F28D9         movapd XMM3, XMM1
;      680:       660F571D98000000 xorpd XMM3, [rip+152]
;      688:       660F2FDC         comisd XMM3, XMM4
;      68C:       7A02             jp L0
;      68E:       7267             jb L3
;      690: L0:   660F28DA         movapd XMM3, XMM2
;      694:       660F571D84000000 xorpd XMM3, [rip+132]      ; negate double-float
;      69C:       660F28E0         movapd XMM4, XMM0
;      6A0:       660F572578000000 xorpd XMM4, [rip+120]
;      6A8:       660F2FE3         comisd XMM4, XMM3
;      6AC:       7A26             jp L1
;      6AE:       7324             jnb L1
;      6B0:       660F28E1         movapd XMM4, XMM1
;      6B4:       660F572564000000 xorpd XMM4, [rip+100]
;      6BC:       660F2FE3         comisd XMM4, XMM3
;      6C0:       7A27             jp L2
;      6C2:       7325             jnb L2
;      6C4:       660F28DA         movapd XMM3, XMM2
;      6C8:       660F28D0         movapd XMM2, XMM0
;      6CC:       660F28C1         movapd XMM0, XMM1
;      6D0:       660F28CB         movapd XMM1, XMM3
;      6D4: L1:   F20F115201       movsd [rdx+1], XMM2
;      6D9:       F20F114209       movsd [rdx+9], XMM0
;      6DE:       F20F114A11       movsd [rdx+17], XMM1
;      6E3:       488BE5           mov rsp, rbp
;      6E6:       F8               clc
;      6E7:       5D               pop rbp
;      6E8:       C3               ret
;      6E9: L2:   660F28DA         movapd XMM3, XMM2
;      6ED:       660F28D0         movapd XMM2, XMM0
;      6F1:       660F28C3         movapd XMM0, XMM3
;      6F5:       EBDD             jmp L1
;      6F7: L3:   660F28D8         movapd XMM3, XMM0
;      6FB:       660F28C1         movapd XMM0, XMM1
;      6FF:       660F28CB         movapd XMM1, XMM3
;      703:       EB8B             jmp L0

Bonus: Hooking in SBCL

The inline sort supports the same options as CL:SORT, so it’d be really interesting to opportunistically compile calls to the latter into size-specialised inline sort. The usual, portable, way to code that sort of macro qua source-to-source optimiser in CL is with compiler macros; compiler macros have access to all the usual macroexpansion-time utility, but the function definition is left in place. That way the user can still use the function as a first-class function, and the compiler-macro can decline the transformation if a regular call would work better (and the compiler can ignore any compiler macro). That’s not enough for our needs, though… and there can only be one compiler macro per function, so adding one to code we don’t own is a bad idea.

Python’s first internal representation (ir1) is optimised by iteratively deriving tighter type information, and (mostly) pattern matching on the type of function calls. Its DEFTRANSFORM form lets us add new rules, and there may be an arbitrary number of such rules for each function.

Hooking our sort generator in SBCL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
(in-package "SB-C")
(defvar *unrolled-vector-sort-max-length* 8)

(defun maybe-emit-unrolled-merge-sort (node sequence key)
  (unless (policy node (> speed space))
    (give-up-ir1-transform))
  (let* ((sequence-type (lvar-type sequence))
         (dimensions (array-type-dimensions-or-give-up
                      sequence-type)))
    (unless (typep dimensions '(cons number null))
      (give-up-ir1-transform
       "~@<sequence argument isn't a vector of known length~:@>"))
    (let ((length (first dimensions)))
      (when (> length *unrolled-vector-sort-max-length*)
        (give-up-ir1-transform
         "~@<sequence argument too long for unrolled sort ~
              (length ~S greater than ~S)~:@>"
         length *unrolled-vector-sort-max-length*))
      (if (<= length 1)
          'sequence
          `(with-array-data ((array sequence)
                             (start)
                             (end))
             (declare (optimize (insert-array-bounds-checks 0))
                      (ignore end))
             (inline-sort
                 ((%coerce-callable-to-fun predicate)
                  :key ,(if key
                            '(%coerce-callable-to-fun key)
                            '#'identity))
               ,@(loop for i below length
                       collect `(aref array (+ start ,i))))
             sequence)))))

(deftransform sort ((sequence predicate &key key)
                    * * :node node)
  "unroll sort of short vectors"
  (maybe-emit-unrolled-merge-sort node sequence key))

(deftransform stable-sort ((sequence predicate &key key)
                           * * :node node)
  "unroll stable-sort of short vectors"
  (maybe-emit-unrolled-merge-sort node sequence key))

The two deftransforms at the end define new rules that match on calls to CL:SORT and CL:STABLE-SORT, with arbitrary argument types and return types: basic type checks are performed elsewhere, and maybe-emit-unrolled-merge-sort does the rest. Transforms are identified by the docstring (which also improve compiler notes), and the argument and return types, so the forms are reevaluation-safe.

All the logic lies in maybe-emit-unrolled-merge-sort. The policy form checks that the optimisation policy at the call node has speed greater than space, and gives up on the transformation otherwise. The next step is to make sure the sequence argument is an array, and that its dimensions are known and define a vector (its dimension list is a list of one number). The final guard makes sure we only specialise on small sorts (at most *unrolled-vector-sort-max-length*).

Finally, we get to code generation itself. A vector of length 1 or 0 is trivially pre-sorted. I could also directly emit the inner inline-sort form, but SBCL has some stuff to hoist out computations related to hairy arrays. with-array-data takes a potentially complex array (e.g. displaced, or not a vector), and binds the relevant variables to the underlying simple array of rank 1, and the start and end indices corresponding to the range we defined (defaulting to the whole range of the input array). Bound checks are eliminated because static information ensures the accesses are safe (or the user lied and asked not to insert type checks earlier), and the start index is declared to be small enough that we can add to it without overflow – Python doesn’t implement the sort of sophisticated shape analyses that could figure that out. Finally, a straight inline-sort form can be emitted.

That machinery means we’ll get quicker and shorter inline sort code when the size is known ahead of time. For example, a quick disassembly shows the following is about one fourth the size of the size-generic inline code (with (simple-array double-float (*)), and (optimize speed (space 0)).

CL-USER> (lambda (x)
           (declare (type (simple-array double-float (4)) x)
                    (optimize speed))
           (sort x #'<))
#<FUNCTION (lambda (x)) {100BFF457B}>
CL-USER> (disassemble *)
; disassembly for (lambda (x))
; 0BFF45CF:       F20F105A01       movsd XMM3, [rdx+1]        ; no-arg-parsing entry point
;      5D4:       F20F104209       movsd XMM0, [rdx+9]
;      5D9:       F20F104A11       movsd XMM1, [rdx+17]
;      5DE:       F20F105219       movsd XMM2, [rdx+25]
;      5E3:       660F2FC3         comisd XMM0, XMM3
;      5E7:       7A0E             jp L0
;      5E9:       730C             jnb L0
;      5EB:       660F28E3         movapd XMM4, XMM3
;      5EF:       660F28D8         movapd XMM3, XMM0
;      5F3:       660F28C4         movapd XMM0, XMM4
[...]

Even for vectors of length 8 (the default limit), the specialised merge sort is shorter than SBCL’s inlined heapsort, and about three times as fast on shuffled vectors.

That’s it

It took me much longer to write this up than to code the generator, but I hope this can be useful to other people. One thing I’d like to note is that sorting networks are much harder to get right than this generator, and pessimise performance: without branches, there must be partially redundant computations on non-power-of-two sizes. In the absence of solid compiler support for conditional swaps, I doubt the additional overhead of optimal sorting networks can be compensated by the simpler control flow, never mind the usual odd-even or bitonic networks.