SBCL was accepted as a mentoring organisation for Google’s Summer of
Code 2013 (our list of project suggestion is
here). This will be our first
time, so that’s really great news. I’m also extremely surprised by
the number of people who’ve expressed interest in working with us. I
was going to reply to a bunch of emails individually, but I figure I
should also centralise some of the stuff here.
EDIT: There’s a section with a bunch of general references.
EDIT 2: Added a note on genesis when playing with core formats.
Getting started
Setting up the basic tools
The first step is probably to install git, and clone our repo
(the github mirror works well, and
lets you fork to your own github account for easy publication). Then,
building from source and installing SBCL (a local installation to
$HOME works fine) is obviously useful experience, and will be useful
to explore the source. Reading INSTALL should be enough to get
started on Linux or OS X and x86[-64]. Other platforms may need more
work, and might not be the best choice if you’re not interested in
improving the port itself (although I’m told FreeBSD and Solaris work
very well on x86[-64]). To build SBCL from source, you’ll need an
SBCL binary (bootstrapping from other CLs should work, but support
regularly bitrots away), and the usual C development tools
(e.g. build-essential on debian). A fancy build (./make.sh --fancy)
is probably the best choice for development.
You’ll also want to run the test suite often; better try it out now
(cd tests; sh run-tests.sh) to make sure you can get it working.
The test suite will barf if there’s non-ASCII characters in the
environment. Oh my Zsh’s
git customisation systematically trips me up, for example (I currently
kludge it and start a bash from ~ and then run the tests).
Once SBCL HEAD is working and installed, it’s probably best to install
emacs and SLIME. Quicklisp’s
quicklisp-slime-helper can take care of installing SLIME. It is
possible to work on SBCL without SLIME. However, SLIME has a lot of
useful extensions, if only to explore the code base. If you’re not
(and don’t wish to become) comfortable with emacs, it’s probably best
to nevertheless use emacs and SLIME for the REPL, debugger, inspector,
etc. and write code in your favourite editor. Later on, it’ll be
useful to figure out how to make SLIME connect to a freshly-built
SBCL.
Exploring the source
I often see newcomers try to read the source like a book, and, once
they realise there’s a lot of code, try to figure out a good order to
read the source. I don’t think that’s the best approach. SBCL is
pretty huge, and I doubt anyone ever simultaneously holds the complete
system in their head.
RAM’s “The Python Compiler for CMU Common Lisp”
is still useful as an overview, and
SBCL’s internals manual is a
good supplement. Once you get close to bootstrapping logic,
Christophe Rhodes’s “SBCL: a Sanely-Bootstrappable Common Lisp”
helps understand the exclamation marks. Past that, I believe it’s
preferrable to start out small, learn just enough to get the current
task done, and accept that some things just work, without asking how
(for now).
In that spirit, I’d say M-. (Alt period, Command period on some OS X
emacsen) is the best way to explore most of SBCL’s source. SBCL’s
build process preserves a lot of source location information, and
M-. queries that information to jump to the definitions for any given
symbol (M-, will pop back up to the previous location). For example,
if you type “(truncate” at the REPL and hit M-. (with the point on or
just after “truncate”), you’ll find the out of line definition for
truncate in (mostly) regular Common Lisp, optimisation rules regarding
truncate, and VOPs, assembly language templates, for truncate called
with a few sets of argument and return types. The out of line
definition isn’t that interesting. The transforms, however, are.
(VOPs aren’t useful if one isn’t comfortable with the platform’s
assembly language, and mostly self-explanatory otherwise.)
The one to “convert integer division to multiplication” is a very good
example. One could M-. on deftransform, and go down a very long
chain of definitions. Instead, I think it’s only essential to see
that the form defines a new rule, like a compiler macro, such that
compile-time values (lvars) that represent its two arguments are bound
to x and y, and the rule only triggers if its first argument is known
to be an unsigned word, and its second a constant unsigned word. If
that’s satisfied, the transformation still only triggers if the speed
optimisation quality is higher than both compilation speed and space
(code size).
Then, the constant value for y is extracted and bound to y, and a
conservative bound on the maximum value that x can take at runtime is
computed. If truncate by y should be handled elsewhere, the transform
gives up. Otherwise, it returns a form that will be wrapped in
(lambda (x y) ...) and spliced in the call, instead of truncate.
To extend SBCL’s support for division by constants, it’s not necessary
to understand more of SBCL’s compiler than the above. There’s no need
to try and understand how deftransform works, only that it defines a
rule to simplify calls to truncate. Similarly for lvar-value and
lvar-type: the former extracts the value for constant lvars, and the
latter the static type derived for that lvar (value at a program
point). With time, knowledge will slowly accrete. However it’s
possible, if not best, to start hacking without understanding the
whole system. This approach will lead to a bit of cargo culting, but
mentors and people on IRC will help make sure it doesn’t do any harm,
and can explain more stuff if it’s interesting or à propos.
Finding where the compiler lives
Working on the compiler itself is a bit more work. I think the best
approach is to go in src/compiler/main.lisp and look for
compile-component. ir1-phases loops on a component and performs
high-level optimisations until fixpoint (or we get tired of waiting),
while %compile-component handles the conversion to IR2 and then to
machine code. The compilation pipeline hasn’t really changed since
the Python paper was written, and the subphases each have their own
function (and file). M-. on stuff that sounds interesting is probably
the best approach at the IR2 level.
Runtime stuff
The C and assembly runtime lives in src/runtime/. There’s a lot of
stuff that’s symlinked or generated during the build, so it’s probably
best to look at it after a successful build. Sadly, we don’t track
source locations there, but {c,e,whatever}tags works; so does grep.
GC stuff is in the obvious suspects (gc-common, gencgc, gc, etc.), but
may end up affecting core loading/saving (core, coreparse, save).
Depending on what in the core loading code is affected, code in
genesis (the initial bootstrap that reads fasl files from the cross
compiler and builds the initial core file) might also have to be
modified (mostly in src/compiler/generic/genesis.lisp). That’s…
more work. Like the project suggestions list says, when we change
things in the runtime, it sometimes ends up affecting a lot of other
components.
GDB tends to be less than useful, because of the many tricks SBCL
plays on itself. It’s usually hard to beat pen, paper, and printf.
At least, rebuilding the C runtime is quick: if the feature
:sb-after-xc-core is enabled (which already happens for --fancy
builds), slam.sh should be able to rebuild only the C runtime, and
then continue the bootstrap with the rest of SBCL from the previous
build. That mostly leaves PCL to build, so the whole thing should
takes less than a minute on a decent machine.
Some references
I was replying to an email when I realised that some general compiler
references would be useful, in addition to project- and SBCL- specific
tips.
Christian Queinnec’s
Lisp in Small Pieces
gives a good overview of issues regarding compiling Lisp-like
languages. Andrew Appel’s
Modern Compiler Implementation in ML
is more, well, modern (I hear the versions in C and Java have the same
text, but the code isn’t as nice… and ML is a very nice language for
writing compilers). I also remember liking Appel’s
Compiling with Continuations,
but I don’t know if it’s particularly useful for CL or the projects we
suggest.
For more complicated stuff, I believe Stephen Muchnick’s
Advanced Compiler Design and Implementation
would have been really nice to have, instead of slogging through code
and dozens of papers. Allen & Kennedy’s
Optimizing Compilers for Modern Architectures: A Dependence-based Approach
is another really good read, but I’m not sure how useful it would be
when working on SBCL: we still have a lot of work to do before
reaching for the really sophisticated stuff (and what sophistication
there is is fairly non-standard).
I believe the Rabbit and Orbit compilers have influenced the design of
CMUCL and SBCL. The
Lambda papers provide some
historical perspective, and the RABBIT and ORBIT theses are linked
here.
What little magic remains in SBCL and CMUCL is the type derivation
(constraint propagation) pass, and how it’s used to exploit a
repository of source-to-source transformations (deftransforms). The
rest is bog-standard tech from the 70s or 80s. When trying to
understand SBCL’s type derivation pass at a very high level, I
remember finding Henry Baker’s
The Nimble Type Inferencer for Common Lisp-84
very useful, even though it describes a scheme that doesn’t quite work
for Common Lisp (it’s very hard to propagate information backward
while respecting the final standard). Kaplan and Ullman’s
A Scheme for the Automatic Inference of Variable Types
was also helpful.
Getting help
Over the years, I’ve seen a couple of people come in with great
ambition, and give up after some time, seemingly without having made
any progress. I believe a large part of the problem is that they
tried to understand all of SBCL instead of just learning the bare
minimum to get hacking, and that their goal was too big. I already
wrote that SBCL is probably best approached bit by bit, with some
guidance from people who’ve been there before, and I hope the projects
we suggest can all lead to visible progress quickly, after a couple
days or two weeks of work at most.
Still, before investing my time, I like to see the other person also
give some of theirs to SBCL. This is why, as I wrote on the mailing
list last week, I’m much more inclined to help someone who’s already
built SBCL on their own and has submitted a patch that’s been
committed or is being improved on the mailing list. I absolutely do
not care what the patch is; it can be new code, a bugfix for a highly
unlikely corner case, better documentation, or spelling and grammar
corrections in comments. The bugs
tagged as easy in our bugtracker
may provide some inspiration. However trivial a patch might seem,
it’s still a sign that someone is willing to put the work in to
concretely make SBCL better, and I like that… it’s also a minimal test
to make sure the person is able to work with our toolchain. (This
isn’t SBCL policy for GSoC. It’s simply how I feel about these
things.)
Again, I’m amazed by the number of people who wish to hack on SBCL
this summer (as part of Google’s Summer of Code or otherwise).
Because of that, I think it’s important to note that this is our first
year, and so we’ll likely not have more than two or three spots.
However, I always like seeing more contributors, and I hope anyone
who’d like to contribute will always be guided, GSoC or not.
Finally, I’ll note that Google’s Summer of Code program was only a
good excuse to write up our
list of projects: they’re
simply suggestions to incite programmers to see what they can do that
is useful for SBCL and, most importantly, is interesting for them.
Anyone should feel welcome to work on any of these projects, even if
they’re not eligible or chosen for GSoC. They’re also only
suggestions; if someone has their own idea, we can likely help them
out just the same.
I just spent a few CPU-months to generate
these text files.
They catalogue all the “interesting” (from an efficiency and accuracy
point of view) polynomial approximations of degree 16 or lower for a
couple transcendental functions, over small but useful ranges, in
single and double float arithmetic. This claim seems to raise many
good questions when people hear it.
What’s wrong with Taylor approximations? Why the need to specify a
range?
Why are the results different for single and double floating point
arithmetic? Doesn’t rounding each coefficient to the closest float
suffice?
Why do I deem only certain approximations to be interesting and
others not, and how can there be so few?
In this post, I attempt to provide answers to these interrogations,
and sketch how I exploited classic
operations research (OR) tools to
enter the numerical analysts’ playground.
The final section describe how I’d interpret the catalogue when coding
quick and slightly inaccurate polynomial approximations. Such lossy
approximations seem to be used a lot in machine learning, game
programming and signal processing: for these domains, it makes sense
to allow more error than usual, in exchange for faster computations.
The CL code is all up in the repository for
rational-simplex,
but it’s definitely research-grade. Readers beware.
Minimax approximations
The natural way to approximate functions with polynomials
(particularly if one spends time with physicists or engineers) is to
use truncated Taylor series. Taylor approximations are usually easy to compute, but only provide good approximations over an infinitesimal range. For example, the degree-1 Taylor approximation for \(\exp\)
centered around 0 is \(1 + x\). It’s also obviously suboptimal, if
one wishes to minimise the worst-case error: the exponential function
is convex, and gradients consistently under-approximate convex functions.
Another affine function, \(1.26 + 1.18x\), intersects \(\exp\) in
two points and is overall much closer over \([-1, 1]\). In fact,
this latter approximation (roughly) minimises the maximal absolute
error over that range: it’s clearly not as good as the Taylor
approximation in the vicinity of 0, but it’s also much better in the
worst case (all bets are off outside \([-1, 1]\)). That’s why I’m
(or anyone’s libm, most likely) not satisfied by Taylor polynomials,
and instead wish to compute approximations that minimise the error
over a known range; function-specific identities can be exploited to
reduce any input to such a range.
Computing minimax polynomials
As far as I know, the typical methods to find polynomial
approximation that minimise the maximal error (minimax polynomials)
are iterative algorithms in the style of the Remez exchange algorithm.
These methods exploit real analysis results to reduce the problem to
computing minimax polynomials over very few points (one per coefficient,
i.e. one more than the degree): once an approximation is found by
solving a linear equation system, error extrema are computed and used
as a basis to find the next approximation. Given arbitrary-precision
arithmetic and some properties on the approximated function, the
method converges. It’s elegant, but depends on high-precision
arithmetic.
Instead, I
reduce the approximation problem to a sequence of linear optimisation programs.
Exchange algorithms solve minimax subproblems over exactly as many
points as there are coefficients: the fit can then be solved as a
linear equation. I find it simpler to use many more constraints than
there are coefficients, and solve the resulting optimisation problem
subject to linear inequalities directly, as a
linear program.
There are obviously no cycling problems in this cutting planes
approach (the set of points grows monotonically), and all the
machinery from exchange algorithms can be reused: there is the same
need for a good set of initial points and for determining error
extrema. The only difference is that points can always be added to
the subproblem without having to remove any, and that we can restrict
points to correspond to floating values (i.e. values we might actually
get as input) without hindering convergence. The last point seems
pretty important when looking at high-precision approximations.
For example, let’s approximate the exponential function over \([-1, 1]\)
with an affine function. The initial points could
simply be the bounds, -1 and 1. The result is the line that passes by
\(\exp(-1)\) and \(\exp(1)\), approximately \(1.54 + 1.18x\).
The error is pretty bad around 0; solving for the minimax line over
three points (-1, 0 and 1) yields a globally optimal solution,
approximately \(1.16 + 1.18x\).
There’s a lot of meat to wrap around this bone. I use a
computable reals
package in Common Lisp to pull arbitrary-precision rational
approximations for arithmetic expressions; using libm directly would
approximate an approximation, and likely cause strange results. A
variant of Newton’s algorithm
(with bisection steps) converges to error extrema (points which can be
added to the linear subproblem); arbitrary precision real arithmetic
is very useful to ensure convergence down to machine precision here.
Each linear subproblem is solved with an
exact simplex algorithm in rational arithmetic,
and convergence is declared when the value of the error extrema found
in the Newton steps correspond to that estimated by the subproblem.
Finally, the fact that the input are floating-point values is
exploited by ensuring that all the points considered in the linear
subproblems correspond exactly to FP values: rather than only rounding
a point to the nearest FP value, its two immediately neighbouring
(predecessor and successor) FP values were also added to the
subproblem. Adding immediate neighbours helps skip iterations in
which extrema move only by one ULP.
Initialising the method with a good set of points is essential to
obtain reasonable performance. The
Chebyshev nodes are
known to yield
nearly optimal [PDF]
initial approximations. The LP-based approach can exploit a large
number of starting points, so I went with very many initial Chebyshev
nodes (256, for polynomials of degree at most 16), and, again,
adjoined three neighbouring FP values for each point. It doesn’t seem
useful to me to determine the maximal absolute error very precisely,
and I declared convergence when the value predicted by the LP
relaxation was off by less than 0.01%. Also key to the performance
were tweaks in the polynomial evaluation function to avoid rational
arithmetic until the very last step.
Exactly represented coefficients
The previous section gives one reason why there are different tables
for single and double float approximations: the ranges of input
considered during optimisation differ. There’s another reason that’s
more important, particularly for single float coefficients:
rounding coefficients can result in catastrophic error blowups.
For example, rounding \(2.5 x\sp{2}\) to the nearest integer
coefficient finds either \(2 x\sp{2}\) or \(3 x\sp{2}\). However \(2
x\sp{2} + x\sp{3}\) is also restricted to integer coefficients, but
more accurate over \([0, .5]\). Straight coefficient-wise
rounding yields an error of \(\frac{1}{2} x\sp{2}\), versus
\(\frac{1}{2} x\sp{2} - x\sp{3}\) for the degree-3 approximation.
As the following graph shows, the degree-3 approximation is much
closer for the range we’re concerned with.
Floating point values are more densely distributed than the integers
for the range of values we usually encounter in polynomial
approximations, but still discrete, and the same phenomenon crops up,
at a smaller scale.
The
state of the art
for this version of the approximation problem seems to be based on an
initial Remez step, followed by a reduction to a
CVP.
The cutting planes approach points to a natural solution from the OR
world: a branch and cut method. The issue with the cutting planes
solution is that, while I only generate points that correspond to FP
values, the decision variables (coefficients) are free to take
arbitrary rational values.
Branching can be used to split up a decision variable’s range and
eliminate from consideration a range of non-FP values. The decision
variables’ upper and lower bounds are tightened in a large number of
subproblems; solutions gradually converge to all-FP values and their
optimality is then proven.
Embedding the cutting planes method in a branch and bound lets us find
polynomial approximations with float coefficients that minimise the
maximal error over float arguments, with correctly rounded powers.
The only remaining simplification is that we assume that the dot
product in the polynomial evaluation is error-free. Sadly, removing
this simplification results in a horrible discrete optimisation
problem with a very bumpy objective function. I’m not sure that any
exact approach can solve this in reasonable time. Still, there are
ways to evaluate FP polynomials very accurately, and I’m mostly
interested in approximations to trade accuracy for speed, so rounding
errors may well be negligible compared to approximation errors.
A branch-and-cut method for polynomial approximations
As in all
branch and bound
methods, a problem is split in subproblems by restricting the range of
some coefficient to exclude some infeasible (non-FP) values. For
example, in the previous example (in which we’re looking for integer
coefficients), the 2.5 coefficient would lead to two subproblems, one
in which the degree-2 coefficient is at most 2 \((\lfloor 2.5\rfloor)\),
and another in which it’s at least 3 \((\lceil2.5 \rceil)\),
excluding all the fractional values between 2 and 3. For floats,
we restrict to the closest floats that under- or over- approximate
the current value.
However, instead of solving the full continuous relaxation (which is
impractical, given the large number of potential argument values), our
subproblems are solved with cutting planes. The trick is that cuts
(constraints, which correspond to argument values) from any branch can
be used everywhere else. Thus, the global point pool is shared
between all subproblems, rather than re-generating it from scratch for
each subproblem. In practice, this lifting of cutting planes to the
root problem seems essential for efficiency; that’s certainly what it
took for branch-and-cut MIP solvers to take off.
I generate cuts at the root, but afterwards only when a subproblem
yields all-FP values. This choice was made for efficiency reasons.
Finding error extrema is relatively slow, but, more importantly,
adding cuts really slows down re-optimisation: the state of the
simplex algorithm can be preserved between invocations, and
warm-starting tends to be extremely efficient when only some
variables’ bounds have been modified. However, it’s also necessary to
add constraints when an incumbent is all-FP, lest we prematurely
declare victory.
There are two really important choices when designing branch and bound
algorithms: how the branching variable is chosen, and the order in
which subproblems are explored. With polynomial approximations, it
seems to make sense to branch on the coefficient corresponding to the
lowest degree first: I simply scan the solution and choose the
least-degree coefficient that isn’t represented exactly as a float.
Nodes are explored in a hybrid depth-first/best-first order: when a
subproblem yields children (its objective value is lower than the
current best feasible solution, and it isn’t feasible itself), the
next node is its child corresponding to the bound closest to the
branching variable’s value, otherwise the node with the least
predicted value is chosen. The depth-first/closest-first dives
quickly converge to decent feasible solutions, while the best-first
choice will increase the lower bound, bringing us closer to proving
optimality.
A randomised rounding heuristic is also used to provide initial
feasible (all-FP) solutions, and the incumbent is considered close
enough to optimal when it’s less than 5% off from the best-known lower
bound. When the method returns, we have a polynomial approximation
with coefficients that are exactly representable as single or double
floats (depending on the setting), and which (almost) minimises the
approximation error on float arguments.
Interesting approximations
The branch and cut can be used to determine an (nearly-)
optimally accurate FP polynomial approximation given a maximum degree.
However, the degree isn’t the only tunable to accelerate polynomial
evaluation: some coefficients are nicer than others. Multiplying by 0
is obviously very easy (nothing to do), while multiplication by +/-1
is pretty good (no multiplication), and by +/- 2 not too bad either
(strength-reduce the multiplication into an addition). It would be
possible to consider other integers, but it doesn’t seem to make
sense: on current X86, floating point multiplication is only 33-66%
slower (more latency) than FP addition, and fiddling with the exponent
field means ping-ponging between the FP and integer domains.
There are millions of such approximations with a few “nice”
coefficients, even when restricting the search to low degrees (e.g. 10
or lower). However, the vast majority of them will be wildly
inaccurate. I decided to only consider approximations that are at
least as accurate as the best approximation of degree three lower:
e.g. a degree-3 polynomial with nice coefficients is only
(potentially) interesting if it’s at least as accurate as a constant.
Otherwise, it’s most likely even quicker to just use a lower-degree
approximation.
That’s not enough: this filter still leaves thousands of polynomials.
Most of the approximations will be dominated by another; what’s the point in looking at a degree-4 polynomial with one coefficient equal to 0 if there’s a degree-3 with one zero that’s just as accurate? The relative importance
of the degree, and the number of zeroes, ones and twos will vary
depending on the environment and evaluation technique. However, it
seems reasonable to assume that zeroes are always at least as quick to
work with as ones, and ones as quick as twos. The constant
offset ought to be treated distinctly from other coefficients. It’s
only added rather than multiplied, so any speed-up is lower, but an
offset of 0 is still pretty nice, and, on some architectures, there is
special support to load constants like 1 or 2.
This lets me construct a simple but robust performance model: a
polynomial is more quickly evaluated than another if it’s of lower or
same degree, doesn’t have more non-zero, non-{zero, one} or non-{zero,
one, two} multipliers, and if its constant offset is a nicer integer
(0 is nicer than +/- 1 is nicer than +/- 2 is nicer than arbitrary
floats).
With these five metrics, in addition to accuracy, we have a
multi-objective optimisation problem. In some situations, humans may
be able to determine a weighting procedure to bring the dimension down
to a scalar objective value, but the procedure would be highly
domain-specific. Instead, we can use this partial order to report
solutions on the Pareto front of accuracy and efficiency: it’s only
worth reporting an approximation if there is no approximation that’s
better or equal in accuracy and in all the performance metrics. The
performance and accuracy characteristics are strongly correlated
(decreasing the degree or forcing nice coefficients tends to decrease
the accuracy, and a zero coefficient is also zero-or-one, etc.), so
it’s not too surprising that there are so few non-dominated solutions.
The branch and cut can be used to find the most accurate
approximation, given an assignment for a few values (e.g. the constant
term is 1, or the first degree coefficient 0). I’ll use it as a
subproblem solver, in a more exotic branch and bound approach.
We wish to enumerate all the partial assignments that correspond to
not-too-horrible solutions, and save those. A normal branch and bound
can’t be applied directly, as one of the choices is to leave a given
variable free to take any value. However, bounding still works: if a
given partial assignment leads to an approximation that’s too
inaccurate, the accuracy won’t improve by fixing even more
coefficients.
I started with a search in which children were generated by adjoining
one fixed value to partial assignments. So, after the root node,
there could be one child with the constant term fixed to 0, 1 or 2
(and everything else free), another with the first coefficient fixed
to 0, 1 or 2 (everything else left free), etc.
Obviously, this approach leads to a search graph: fixing the constant
term to 0 and then the first coefficient to 0, or doing in the reverse
order leads to the same partial assignment. A hash table ensures that
no partial assignment is explored twice. There’s still a lot of
potential for wasted computation: if bounding lets us determine that
fixing the constant coefficient to 0 is worthless, we will still
generate children with the constant fixed to 0 in other branches!
I borrowed a trick from the SAT solving and constraint programming
communities,
nogood recording.
Modern SAT solvers go far beyond strict branching search: in practice,
it seems that good branching orders will lead to quick solutions,
while a few bad ones take forever to solve. Solvers thus frequently
reset the search tree to a certain extent. However, information is
still communicated across search trees via learned clauses. When the
search backtracks (an infeasible partial assignment is found), a nogood
set of conflicting partial assignments can be computed: no feasible
solution will include this nogood assignment.
Some time ago, Vasek Chvatal introduced
Resolution Search
to port this idea to 0/1 optimisation, and Marius Posta, a friend at
CIRRELT and Université de Montréal,
extended it to general discrete optimisation [PDF].
The complexity mostly comes from the desire to generate partial
assignments that, if they’re bad, will merge well with the current set
of learned clauses. This way, the whole set of nogoods (or, rather,
an approximation that suffices to guarantee convergence) can be
represented and queried efficiently.
There’s no need to be this clever here: each subproblem involves an
exact (rational arithmetic) branch and cut. Simply scanning the set
of arbitrary nogoods to look for a match significantly accelerates the
search. The process is sketched below.
The search is further accelerated by executing multiple branch and
cuts and nogood scans in parallel.
The size of the search graph is also reduced by only fixing
coefficients to a few values. There’s no point in forcing a
coefficient to take the value of 0, 1 or 2 (modulo sign) if it already
does. Thus, a coefficient with a value of 0 is left free; otherwise a
child extending the partial assignment with 0 is created. Similarly,
a child extended with a value of 1 is only generated if the
coefficient isn’t already at 0 or 1 (we suppose that multiplication by
0 is at least as efficient as by 1), and similarly for 2. Finally,
coefficients between 0 and 1 are only forced to 0 or 1, and those
between 1 and 3 to 0, 1 or 2. If a coefficient takes a greater
absolute value than 3, fixing it most probably degrades the
approximation too strongly, and it’s left free – then again such
coefficients only seem to happen on really hard-to-approximate
functions like \(\log\) over \([1, 2]\). Also, the last
coefficient is never fixed to zero (that would be equivalent to
looking at lower-degree approximations, which was done in previous
searches). Of course, with negative coefficients, the fixed values
are negated as well.
This process generates a large number of potentially interesting
polynomial approximations. The non-dominated ones are found with a
straight doubly nested loop, with a slight twist: accuracy is computed
with machine floating point arithmetic, thus taking rounding into
account. The downside is that it’s actually approximated, by sampling
fairly many (the FP neighbourhood of 8K Chebyshev nodes) points;
preliminary testing indicates that’s good enough for a relative error
(on the absolute error estimate) lower than 1e-5. There tends to be a
few exactly equivalent polynomials (all the attributes are the same,
including accuracy – they only differ by a couple ULP in a few
coefficients); in that case, one is chosen arbitrarily. There’s
definitely some low-hanging fruit to better capture the performance
partial order; the error estimate is an obvious candidate. The hard
part was generating all the potentially interesting approximations,
though, so one can easily re-run the selection algorithm with tweaked
criteria later.
Exploiting the approximation indices
I’m not sure what functions are frequently approximated over what
ranges, so I went with the obvious ones: \(\cos\) and \(\sin\)
over \([-\pi/2, \pi/2]\), \(\exp\) and arctan over \([-1, 1]\)
or \([0, 1]\), \(\log\) over \([1, 2]\), and \(\log 1+x\) and
\(\log\sb{2} 1+x\) over \([0, 1]\). This used up a fair amount of
CPU time, so I stopped at degree 16.
Each file reports the accuracy and efficient metrics, then the
coefficients in floating point and rational form, and a hash of the
coefficients to identify the approximation. The summary columns are
all aligned, but each line is very long, so the files are best read
without line wrapping.
The columns report the accuracy and efficiency metrics, in the order
used to sort approximations lexicographically:
the approximation’s degree;
the floor of the negated base-2 logarithm of the maximum error
(roughly bits of absolute accuracy, rounded up);
the number of non-zero multipliers;
the number of non-{-1, 0, 1} multipliers;
whe number of non-{-2, -1, 0, 1, 2} multipliers;
whether the constant’s absolute value is 0, 1, 2, or other (in which
case the value is 3); and
the maximum error.
After that, separated by pipes, come the coefficients in float form,
then in rational form, and the MD5 hash of the coefficients in a float
vector (in a contiguous vector, in increasing order of degree, with
X86’s little-endian sign-magnitude representation). The hash might
also be useful if you’re worried that your favourite implementation
isn’t parsing floats right.
There are three polynomials with degree 3, and they all offer
approximately the same accuracy (lb_error = 10). I’d choose between
the most accurate polynomial
The columns are in the order used for the lexicographic sort:
the number of bits of accuracy;
the approximation’s degree;
the maximum error;
the number of non-zero multipliers;
the number of non-{-1, 0, 1} multipliers;
the number of non-{-2, -1, 0, 1, 2} multipliers; and
whether the constant’s absolute value is 0, 1, 2, or other (in which
case the value is 3).
An error around 1e-4 would be reasonable for my needs, and
log1px-6AE509 seems interesting: maximum error is around 1.5e-4,
it’s degree 4, the constant offset is 0 and the first multiplier 1.
If I needed a bit more accuracy (7.1e-5), I’d consider log1px-E8200B:
it’s degree 4 as well, and the constant is still 0.
It seems to me optimisation tools like approximation generators are
geared toward fire and forget usage. I don’t believe that’s a
realistic story: very often, the operator will have a fuzzy range of
acceptable parameters, and presenting a small number of solutions with
fairly close matches lets them exploit domain-specific insights. In
this case, rather than specifying fixed coefficients and degree or
accuracy goals, users can scan the indices and decide whether each
trade-off is worth it or not. That’s particularly true of the single
float approximations, for which the number of possibilities tends to
be tiny (e.g. eight non-dominated approximations for
\(\sin\)).
EDIT: 2012-08-29: I added a section to compare comparison counts with
known bounds for general comparison sorts and sorting networks.
In an
earlier post,
I noted how tedious coding unrolled sorts can be. Frankly, that’s the
main reason I stopped at leaf sorts of size three. Recently,
Neil Toronto
wrote a nice post on the generation of size-specialised merge sorts.
The post made me think about that issue a bit more, and I now have a
neat way to generate unrolled/inlined merge sorts that are
significantly smaller than the comparison and size “-optimal” inlined
merge sorts.
The code is up as a
single-file library,
and sorts short vectors faster than SBCL’s inlined heapsort by a
factor of two to three… and compiles to less machine code. The
generator is a bit less than 100 LOC, so I’m not sure I want to
include it in the mainline yet. If someone wants to add support for
more implementations, I’d be happy to extend Tabasco sort, and might
even consider letting it span multiple files ;)
Differently-optimal sorts
The inlined merge sort for three values (a, b, and c) is copied
below. It has to detect between \(3! = 6\) permutation, and does so
with an optimal binary search tree. That scheme leads to code with
\(n! - 1\) comparisons to sort n values, and for which each
execution only goes through two or three comparisons (\(\approx \lg n!\)).
“optimal” inlined merge sort (n = 3)
1234567891011
(if (< b c)
(if (< a b)
(values a b c)
(if (< a c)
(values b a c)
(values b c a)))
(if (< a c)
(values a c b)
(if (< a b)
(values c a b)
(values c b a))))
An optimal sorting network for three values needs only three
comparisons, and always executes those three comparisons.
optimal sorting network (n = 3)
1234567
(progn
(when (< c b)
(rotatef b c))
(when (< b a)
(rotatef a b))
(when (< c b)
(rotatef b c)))
Finally, the leaf sort I used in SBCL is smaller than the inlined
merge sort (three comparisons), but sometimes executes fewer than
three comparisons. It’s superoptimal ;)
“super-optimal” inlined merge sort (n = 3)
12345678
(progn
(when (< c b)
(rotatef b c))
(if (< b a)
(if (< c a)
(values b c a)
(values b a c))
(values a b c)))
The optimal merge sort is larger than the optimal sorting network, and
the optimal sorting network performs potentially more comparisons than
the optimal merge sort…
Each implementation is optimal for different search spaces: the
optimal merge sort never merges continuations, and the sorting
network’s only control dependencies are in the conditional swaps.
The “super-optimal” merge sort does better by allowing itself both
assignments (or tail-calls) and non-trivial control flow: it’s smaller
than the inlined merge sort (but performs more data movement), and
potentially executes fewer comparisons than the sorting network (with
a larger total code size). And, essential attribute in practice, it’s
easy to generate. This contrasts with optimal sorting networks, for
which we do not have any generation method short of brute force
search; in fact, in practice, sorting networks tend to exploit
suboptimal (by a factor of \(\log n\)) schemes like
bitonic sort
or
odd-even merge sort.
Then again, we’re only concerned with tiny sorts, and asymptotics can
be misleading: Batcher’s odd-even merge sort happens to be optimal for
\(n\leq 8\). The issue with sorting networks remains:
data-oblivious control flow pessimises their comparison count.
Generalising from size 3
What the last merge sort does is to first sort both halves of the
values (a is trivially sorted, and b c needs one conditional
swap), and then, assuming that each half (a and b c) is sorted,
find the right permutation with which to merge them. Rather than
\(n!\) permutations, a merge only needs to distinguish between
\(C(n, \lfloor n/2\rfloor) = \frac{n!}{\lfloor n/2\rfloor!\lceil n/2\rceil!}\)
permutations, and the recursive sorts are negligible compared to the
merge step. That’s a huge reduction in code size!
A simple merge generator fits in half a
screenful.
unrolled merge generator
12345678910111213141516171819202122
(defun emit-permute (destinations sources)
;; (setf values) is parallel assignment
`(setf (values ,@destinations) (values ,@sources)))
(defun emit-merge-1 (destinations left right acc)
"Build a search tree to determine the right permutation to
merge LEFT and RIGHT, given that each is pre-sorted."
(cond ((null left)
(emit-permute destinations (append (reverse acc) right)))
((null right)
(emit-permute destinations (append (reverse acc) left)))
(t
`(if (< ,(first right) ,(first left)) ; stable sort
,(emit-merge-1 destinations
left (rest right)
(cons (first right) acc))
,(emit-merge-1 destinations
(rest left) right
(cons (first left) acc))))))
(defun emit-merge (left right)
(emit-merge-1 (append left right) left right nil))
Given two lists of sorted variables, emit-merge calls emit-merge-1
to generate code that finds the right permutation, and executes it at
the leaf. A binary search tree is generated by keeping track of the
merged list in a reverse-order (to enable tail-sharing) accumulator of
variable names. As expected, when merging a list of length one with
another of length two, we get pretty much the code I wrote by hand
earlier.
CL-USER> (emit-merge '(a) '(b c))
(if (< b a)
(if (< c a)
(setf (values a b c) (values b c a))
(setf (values a b c) (values b a c)))
(setf (values a b c) (values a b c)))
There’s one striking weakness: we generate useless code for the
identity permutation. We could detect that case, or, more generally,
we could find the cycle decomposition of each permutation and use it
to minimise temporary values; that’d implicitly take care of cases
like (setf (values a b c) (values b a c)), in which some values are
left unaffected.
A smarter permutation generator
I’ll represent permutations as associative lists, from source to
destination. Finding a cycle is easy: just walk the permutation from
an arbitrary value until we loop back.
extract a single cycle from a linearly-represented permutation
1234567891011121314151617
(defun find-cycle (mapping)
"Extract an arbitrary cycle from a non-empty mapping,
returning both the cycle and the rest of the mapping."
(assert mapping)
(let* ((head (pop mapping))
(cycle (list (cdr head))))
(loop
(let* ((next-source (first cycle))
(pair (assoc next-source mapping)))
(cond (pair
(push (cdr pair) cycle)
;; if this sucks enough to matter, the output
;; will be humongous anyway
(setf mapping (remove pair mapping)))
(t
(assert (eql next-source (first head)))
(return (values cycle mapping))))))))
To generate the code corresponding to a permutation, I can extract all
the cycles, execute each cycle with a rotatef.
cycle-decomposition-based permute generator
1234567891011121314
(defun emit-permute (destinations sources)
"Emit a [destinations <- sources] permutation via its
cycle decomposition"
;; source -> destination alist, minus trivial pairs
(let ((mapping (remove-if (lambda (pair)
(eql (car pair) (cdr pair)))
(pairlis sources destinations))))
`(progn
,@(loop while mapping
collect
(multiple-value-bind (cycle new-mapping)
(find-cycle mapping)
(setf mapping new-mapping)
`(rotatef ,@cycle))))))
The merge step for a sort of size three is now a bit more explicit,
but likely compiles to code that uses fewer registers as well. It
probably doesn’t matter on good SSA-based backends, but those are the
exception rather than the norm in the Lisp world.
CL-USER> (emit-merge '(a) '(b c))
(if (< b a)
(if (< c a)
(progn (rotatef a b c))
(progn (rotatef a b)))
(progn))
Adding the recursive steps
The only thing missing for a merge sort is to add base cases and
recursion. The base case is easy: lists of length one are sorted.
Inlining recursion is trivial, as is usually the case when generating
Lisp code.
The resulting three-value sorter looks good; there are some
redundancies with nested or empty progns, but any half-decent
compiler will take care of that. Python certainly does a goob job on
that code.
CL-USER> (emit-sort '(a b c))
(progn
nil
(progn
nil
nil
(if (< c b)
(progn (rotatef b c))
(progn)))
(if (< b a)
(if (< c a)
(progn (rotatef a b c))
(progn (rotatef a b)))
(progn)))
CL-USER> (disassemble (lambda (a b c)
(declare (type fixnum a b c))
(inline-sort a b c)))
; disassembly for (lambda (a b c))
; 0E88F150: 498BD0 mov rdx, r8 ; no-arg-parsing entry point
; 53: 498BC9 mov rcx, r9
; 56: 498BDA mov rbx, r10
; 59: 4D39CA cmp r10, r9
; 5C: 7C30 jl L3
; 5E: L0: 4C39C1 cmp rcx, r8
; 61: 7D0B jnl L1
; 63: 4C39C3 cmp rbx, r8
; 66: 7C1B jl L2
; 68: 488BD1 mov rdx, rcx
; 6B: 498BC8 mov rcx, r8
; 6E: L1: 488BF9 mov rdi, rcx
; 71: 488BF3 mov rsi, rbx
; 74: 488D5D10 lea rbx, [rbp+16]
; 78: B906000000 mov ecx, 6
; 7D: F9 stc
; 7E: 488BE5 mov rsp, rbp
; 81: 5D pop rbp
; 82: C3 ret
; 83: L2: 488BD1 mov rdx, rcx
; 86: 488BCB mov rcx, rbx
; 89: 498BD8 mov rbx, r8
; 8C: EBE0 jmp L1
; 8E: L3: 498BCA mov rcx, r10
; 91: 498BD9 mov rbx, r9
; 94: EBC8 jmp L0
I thought about generating calls to values in the final merge,
rather than permuting, but decided against: I know SBCL doesn’t
generate clever code when permuting registers, and that’d result in
avoidable spills. I also considered generating code in CPS rather
than emitting assignments; again, I decided against because I can’t
depend on SBCL to emit clever permutation code. The transformation
would make sense in a dialect with weaker support (both compiler and
social) for assignment.
How good is the generated code?
Both this inline merge sort and the original, permutation-free (except
at the leaves), one actually define the exact same algorithms. For
any input, both (should) execute the same comparisons in the same
order: the original inline merge sort simply inlines the whole set of
execution traces, without even merging control flow.
The permutation-free sort guarantees that it never performs redundant
comparisons. Whether it performs the strict minimum number of
comparisons, either on average or in the worst case, is another question.
At first, I thought that \(\lg n!\) should be a good estimate, since
the search tree seems optimally balanced. The problem is \(n!\)
tends to have many other prime factors than 2, and we can thus expect
multiple comparisons to extract less than 1 bit of information, for
each execution. The lower bound can thus be fairly far from the
actual value… Still, this question is only practically relevant for
tiny sorts, so the discrepancy shouldn’t be too large.
A simple way to get the minimum, average or maximum comparison count
would be to annotate the permutation-free generator to compute the
shortest, average and longest path as it generates the search tree.
I’ll instead mimic the current generator.
The first step is to find the number of comparisons to perform a
merge of m and n values.
If either m or n is zero, merging the sequences is trivial.
Otherwise, the minimum number of comparisons is (min m n): the
sequences are pre-sorted, and the shortest sequence comes first. The
maximum is (1- (+ m n)). I couldn’t find a simple expression for
the average over all permutations. Instead, I iterate over all
possible combinations and count the number of comparisons until either
subsequence is exhausted.
compute the number of comparisons during merges
1234567891011121314151617181920212223242526272829
(defun merge-count (m n)
;; return min, expected, max comparison to merge
;; presorted subsequences of m and n values
(if (zerop (min m n))
(values 0 0 0)
(let ((comparisons 0)
(count 0)
(min most-positive-fixnum)
(max 0))
(dotimes (i (ash 1 (+ m n))
(values min
(/ comparisons
count)
max))
;; only consider combinations with m ones
;; (and n zeros)
(when (= (logcount i) m)
(let ((cmp (1- (+ m n)))
(mask i))
;; no more comparison needed until two consecutive
;; elements from distinct subsequences
(loop while (eql (logbitp 0 mask)
(logbitp 1 mask))
do (decf cmp)
(setf mask (ash mask -1)))
(setf min (min min cmp)
max (max max cmp))
(incf comparisons cmp)
(incf count)))))))
Counting the number of comparisons in sorts is then trivial, with a
recursive function. I didn’t even try to memoise repeated
computations: the generated code is ludicrously long when sorting as
few as 13 or 14 values.
compute the number of comparisons in merge sort
1234567891011121314151617
(defun sort-count (n)
(if (<= n 1) ; trivially sorted
(values 0 0 0)
(let ((min 0)
(avg 0)
(max 0))
(flet ((inc (min- avg- max-)
(incf min min-)
(incf avg avg-)
(incf max max-)))
;; accumulate min/avg/max count from sorting the left and
;; right subsequences, and merging
(multiple-value-call #'inc (sort-count (floor n 2)))
(multiple-value-call #'inc (sort-count (ceiling n 2)))
(multiple-value-call #'inc (merge-count (floor n 2)
(ceiling n 2))))
(values min avg max))))
I annotated the output with comments (marked with semicolons). The
columns are, from left to right, the sort size, the theoretical lower
bound (on the average or maximum number of comparisons), the minimum
number of comparisons (depending on the input permutation), the
average (over all input permutations), and the maximum count. I added
two columns by hand: the optimal worst-case (maximum) comparison
counts (over all sorting methods, copied from the
OEIS), and the optimal size for sorting
networks, when known (lifted from a table
here [pdf]).
Inexact (potentially higher than the optimum) bounds are marked with a
question mark.
For the input size the inline merge sort can reasonably tackle (up to
ten or so), its worst-case is reasonably close to the best possible,
and its average case tends to fall between the lower bound and the
best possible. Over all these sizes, the merge sort’s worst case
performs fewer comparisons than the optimal or best-known sorting
networks. The current best upper bounds on the minimal worst-case
comparison count seem to be based on insertion sort passes that
minimise the number of comparisons with a binary search. I don’t
believe that’s directly useful for the current use case, but a similar
trick might be useful to reduce the number of comparisons, at the
expense of reasonably more data movement.
Making it CL-strength
That’s already a decent proof of concept. It’s also far too plain to
fit in the industrial-strength Common Lisp way. An inline sorting
macro worthy of CL should be parameterised on both comparator and key
functions, and work with arbitrary places rather than only variables.
Parameterising the comparator is trivial. The key could be handled by
calling it at each comparison, but that’s wasteful. We’re generating
code; might as well go for glory. Just like in the list merge sort,
I’ll cache calls to the key function in the merge tree. I’ll also use
special variables instead of passing a half-dozen parameters around in
the generator.
(defvar *inline-sort-comparator*)
(defvar *inline-sort-key*)
(defvar *inline-sort-destinations*)
(defvar *inline-sort-left-head*)
(defvar *inline-sort-right-head*)
(defun emit-merge-1 (left right acc)
"Build a search tree to determine the right permutation to
merge LEFT and RIGHT, given that each is pre-sorted."
;; stability trickery
`(if (funcall ,*inline-sort-comparator* ,*inline-sort-right-head*
,*inline-sort-left-head*)
,(let* ((acc (cons (first right) acc))
(right (rest right)))
;; pop from RIGHT, and recurse if RIGHT isn't empty.
(if right
`(let ((,*inline-sort-right-head*
(funcall ,*inline-sort-key* ,(first right))))
,(emit-merge-1 left right acc))
(emit-permute *inline-sort-destinations*
(append (reverse acc) left))))
;; same
,(let* ((acc (cons (first left) acc))
(left (rest left)))
(if left
`(let ((,*inline-sort-left-head*
(funcall ,*inline-sort-key* ,(first left))))
,(emit-merge-1 left right acc))
(emit-permute *inline-sort-destinations*
(append (reverse acc) right))))))
(defun emit-merge (left right)
"Caching calls to KEY means we have to special-case empty lists
(which doesn't happen when we sort, anyway)"
(cond ((null left)
(emit-permute right right))
((null right)
(emit-permute left left))
(t
(let ((*inline-sort-destinations* (append left right))
(*inline-sort-left-head* (gensym "LEFT-HEAD-KEY"))
(*inline-sort-right-head* (gensym "RIGHT-HEAD-KEY")))
`(let ((,*inline-sort-left-head* (funcall ,*inline-sort-key*
,(first left)))
(,*inline-sort-right-head* (funcall ,*inline-sort-key*
,(first right))))
,(emit-merge-1 left right nil))))))
(defun emit-sort-1 (values length)
"Unrolled and inlined recursive merge sort generator.
Lists of length 1 or less are trivially sorted; recurse
on the rest."
(when (> length 1)
(let* ((split (truncate length 2))
(left (subseq values 0 split))
(right (subseq values split)))
`(progn
,(emit-sort-1 left split)
,(emit-sort-1 right (- length split))
,(emit-merge left right)))))
(defun emit-sort (values *inline-sort-comparator* *inline-sort-key*)
(emit-sort-1 values (length values)))
Finally, handling arbitrary places, thus letting the macro take care
of writing results back to the places, is just regular macrology.
(defmacro inline-sort ((comparator &key key (overwrite t))
&body values
&environment env)
"Sorts all VALUES in increasing order with respect to COMPARATOR and
KEY. COMPARATOR should be a strict order, like CL:<, and KEY defaults
to NIL (which is interpreted as the identity). By default, the result
is written back to the places; that's skipped if OVERWRITE is NIL. A
literal NIL value for overwrite will avoid generating any write.
The SORT form always evaluates to the sorted values, in order."
(let (vars vals
store-vars writer-forms
reader-forms
temps
(_comparator (gensym "COMPARATOR"))
(_key (gensym "KEY"))
(_overwrite (gensym "OVERWRITE")))
(loop for value in (reverse values) do
(push (gensym "TEMP") temps)
;; only use the setf expansion if we might write to the place.
(if (not overwrite)
(push value reader-forms)
(multiple-value-bind (var val store-var writer reader)
(get-setf-expansion value env)
(setf vars (append var vars)
vals (append val vals))
(push store-var store-vars)
(push writer writer-forms)
(push reader reader-forms))))
`(let* ((,_comparator ,comparator)
(,_comparator (if (functionp ,_comparator)
,_comparator
(symbol-function ,_comparator)))
(,_key ,(or key '#'identity))
(,_key (if (functionp ,_key)
,_key
(symbol-function ,_key)))
(,_overwrite ,overwrite)
,@(mapcar 'list vars vals)
,@(mapcar 'list temps reader-forms))
(declare (ignorable ,_comparator ,_key ,_overwrite))
,(emit-sort temps _comparator _key)
,(and overwrite
`(when ,_overwrite
,@(loop
for value in values
for store-var-list in store-vars
for writer in writer-forms
for temp in temps
collect
(progn
(unless (= 1 (length store-var-list))
(error "Can't sort multiple-value place ~S" value))
`(let ((,(first store-var-list) ,temp))
,writer)))))
(values ,@temps))))
Now, the macro can be used to sort, e.g., vectors of double floats
“in-place” (inasmuch as copying everything to registers can be
considered in-place).
The inline sort supports the same options as CL:SORT, so it’d be
really interesting to opportunistically compile calls to the latter
into size-specialised inline sort. The usual, portable, way to code
that sort of macro qua source-to-source optimiser in CL is with
compiler macros; compiler macros have access to all the usual
macroexpansion-time utility, but the function definition is left in
place. That way the user can still use the function as a first-class
function, and the compiler-macro can decline the transformation if a
regular call would work better (and the compiler can ignore any
compiler macro). That’s not enough for our needs, though… and there
can only be one compiler macro per function, so adding one to code we
don’t own is a bad idea.
Python’s first internal representation (ir1) is optimised by
iteratively deriving tighter type information, and (mostly) pattern
matching on the type of function calls. Its DEFTRANSFORM form lets us
add new rules, and there may be an arbitrary number of such rules for
each function.
(in-package "SB-C")
(defvar *unrolled-vector-sort-max-length* 8)
(defun maybe-emit-unrolled-merge-sort (node sequence key)
(unless (policy node (> speed space))
(give-up-ir1-transform))
(let* ((sequence-type (lvar-type sequence))
(dimensions (array-type-dimensions-or-give-up
sequence-type)))
(unless (typep dimensions '(cons number null))
(give-up-ir1-transform
"~@<sequence argument isn't a vector of known length~:@>"))
(let ((length (first dimensions)))
(when (> length *unrolled-vector-sort-max-length*)
(give-up-ir1-transform
"~@<sequence argument too long for unrolled sort ~
(length ~S greater than ~S)~:@>"
length *unrolled-vector-sort-max-length*))
(if (<= length 1)
'sequence
`(with-array-data ((array sequence)
(start)
(end))
(declare (optimize (insert-array-bounds-checks 0))
(ignore end))
(inline-sort
((%coerce-callable-to-fun predicate)
:key ,(if key
'(%coerce-callable-to-fun key)
'#'identity))
,@(loop for i below length
collect `(aref array (+ start ,i))))
sequence)))))
(deftransform sort ((sequence predicate &key key)
* * :node node)
"unroll sort of short vectors"
(maybe-emit-unrolled-merge-sort node sequence key))
(deftransform stable-sort ((sequence predicate &key key)
* * :node node)
"unroll stable-sort of short vectors"
(maybe-emit-unrolled-merge-sort node sequence key))
The two deftransforms at the end define new rules that match on calls
to CL:SORT and CL:STABLE-SORT, with arbitrary argument types and
return types: basic type checks are performed elsewhere, and
maybe-emit-unrolled-merge-sort does the rest. Transforms are
identified by the docstring (which also improve compiler notes), and
the argument and return types, so the forms are reevaluation-safe.
All the logic lies in maybe-emit-unrolled-merge-sort. The policy
form checks that the optimisation policy at the call node has speed
greater than space, and gives up on the transformation otherwise.
The next step is to make sure the sequence argument is an array, and
that its dimensions are known and define a vector (its dimension list
is a list of one number). The final guard makes sure we only
specialise on small sorts (at most
*unrolled-vector-sort-max-length*).
Finally, we get to code generation itself. A vector of length 1 or 0
is trivially pre-sorted. I could also directly emit the inner
inline-sort form, but SBCL has some stuff to hoist out computations
related to hairy arrays. with-array-data takes a potentially
complex array (e.g. displaced, or not a vector), and binds the
relevant variables to the underlying simple array of rank 1, and the
start and end indices corresponding to the range we defined
(defaulting to the whole range of the input array). Bound checks are
eliminated because static information ensures the accesses are safe
(or the user lied and asked not to insert type checks earlier), and
the start index is declared to be small enough that we can add to it
without overflow – Python doesn’t implement the sort of sophisticated
shape analyses that could figure that out. Finally, a straight
inline-sort form can be emitted.
That machinery means we’ll get quicker and shorter inline sort code
when the size is known ahead of time. For example, a quick
disassembly shows the following is about one fourth the size of the
size-generic inline code (with (simple-array double-float (*)), and
(optimize speed (space 0)).
Even for vectors of length 8 (the default limit), the specialised
merge sort is shorter than SBCL’s inlined heapsort, and about three
times as fast on shuffled vectors.
That’s it
It took me much longer to write this up than to code the generator,
but I hope this can be useful to other people. One thing I’d like to
note is that sorting networks are much harder to get right than this
generator, and pessimise performance: without branches, there must be
partially redundant computations on non-power-of-two sizes. In the
absence of solid compiler support for conditional swaps, I doubt the
additional overhead of optimal sorting networks can be compensated
by the simpler control flow, never mind the usual odd-even or bitonic
networks.