<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title><![CDATA[Paul Khuong mostly on Lisp]]></title>
  <link href="http://www.pvk.ca/atom.xml" rel="self"/>
  <link href="http://www.pvk.ca/"/>
  <updated>2013-04-14T06:53:03+02:00</updated>
  <id>http://www.pvk.ca/</id>
  <author>
    <name><![CDATA[Paul Khuong]]></name>
    <email><![CDATA[pvk@pvk.ca]]></email>
  </author>
  <generator uri="http://octopress.org/">Octopress</generator>

  
  <entry>
    <title type="html"><![CDATA[Starting to hack on SBCL]]></title>
    <link href="http://www.pvk.ca/Blog/2013/04/13/starting-to-hack-on-sbcl/"/>
    <updated>2013-04-13T22:39:00+02:00</updated>
    <id>http://www.pvk.ca/Blog/2013/04/13/starting-to-hack-on-sbcl</id>
    <content type="html"><![CDATA[<p>SBCL was accepted as a mentoring organisation for Google&#8217;s Summer of
Code 2013 (our list of project suggestions is
<a href="http://www.sbcl.org/gsoc2013/ideas">here</a>).  This will be our first
time, so that&#8217;s really great news.  I&#8217;m also extremely surprised by
the number of people who&#8217;ve expressed interest in working with us.  I
was going to reply to a bunch of emails individually, but I figure I
should also centralise some of the stuff here.</p>

<p>EDIT: There&#8217;s a section with a bunch of general references.</p>

<p>EDIT 2: Added a note on genesis when playing with core formats.</p>

<h2>Getting started</h2>

<h3>Setting up the basic tools</h3>

<p>The first step is probably to install git, and clone our repo
(<a href="https://github.com/sbcl/sbcl">the github mirror</a> works well, and
lets you fork to your own github account for easy publication).  Then,
building from source and installing SBCL (a local installation to
$HOME works fine) is obviously useful experience, and will be useful
to explore the source.  Reading INSTALL should be enough to get
started on Linux or OS X and x86[-64].  Other platforms may need more
work, and might not be the best choice if you&#8217;re not interested in
improving the port itself (although I&#8217;m told FreeBSD and Solaris work
very well on x86[-64]).  To build SBCL from source, you&#8217;ll need an
SBCL binary (bootstrapping from other CLs should work, but support
regularly bitrots away), and the usual C development tools
(e.g. build-essential on debian).  A fancy build (<code>./make.sh --fancy</code>)
is probably the best choice for development.</p>

<p>You&#8217;ll also want to run the test suite often; better try it out now
(<code>cd tests; sh run-tests.sh</code>) to make sure you can get it working.
The test suite will barf if there are non-ASCII characters in the
environment.  <a href="https://github.com/robbyrussell/oh-my-zsh">Oh my Zsh</a>&#8217;s
git customisation systematically trips me up, for example (I currently
kludge it and start a bash from ~ and then run the tests).</p>

<p>Once SBCL HEAD is working and installed, it&#8217;s probably best to install
emacs and SLIME.  <a href="http://www.quicklisp.org/beta/">Quicklisp</a>&#8217;s
quicklisp-slime-helper can take care of installing SLIME.  It is
possible to work on SBCL without SLIME.  However, SLIME has a lot of
useful extensions, if only to explore the code base.  If you&#8217;re not
(and don&#8217;t wish to become) comfortable with emacs, it&#8217;s probably best
to nevertheless use emacs and SLIME for the REPL, debugger, inspector,
etc. and write code in your favourite editor.  Later on, it&#8217;ll be
useful to figure out how to make SLIME connect to a freshly-built
SBCL.</p>

<h3>Exploring the source</h3>

<p>I often see newcomers try to read the source like a book, and, once
they realise there&#8217;s a lot of code, try to figure out a good order to
read the source.  I don&#8217;t think that&#8217;s the best approach.  SBCL is
pretty huge, and I doubt anyone ever simultaneously holds the complete
system in their head.
<a href="http://www.cs.cmu.edu/~ram/pub/lfp.ps">RAM&#8217;s &#8220;The Python Compiler for CMU Common Lisp&#8221;</a>
is still useful as an overview, and
<a href="http://www.sbcl.org/sbcl-internals/">SBCL&#8217;s internals manual</a> is a
good supplement.  Once you get close to bootstrapping logic,
<a href="http://www.doc.gold.ac.uk/~mas01cr/papers/s32008/sbcl.pdf">Christophe Rhodes&#8217;s &#8220;SBCL: a Sanely-Bootstrappable Common Lisp&#8221;</a>
helps understand the exclamation marks.  Past that, I believe it&#8217;s
preferable to start out small, learn just enough to get the current
task done, and accept that some things just work, without asking how
(for now).</p>

<p>In that spirit, I&#8217;d say M-. (Alt period, Command period on some OS X
emacsen) is the best way to explore most of SBCL&#8217;s source.  SBCL&#8217;s
build process preserves a lot of source location information, and
M-. queries that information to jump to the definitions for any given
symbol (M-, will pop back up to the previous location).  For example,
if you type &#8220;(truncate&#8221; at the REPL and hit M-. (with the point on or
just after &#8220;truncate&#8221;), you&#8217;ll find the out of line definition for
truncate in (mostly) regular Common Lisp, optimisation rules regarding
truncate, and VOPs, assembly language templates, for truncate called
with a few sets of argument and return types.  The out of line
definition isn&#8217;t that interesting.  The transforms, however, are.
(VOPs aren&#8217;t useful if one isn&#8217;t comfortable with the platform&#8217;s
assembly language, and mostly self-explanatory otherwise.)</p>

<p>The one to &#8220;convert integer division to multiplication&#8221; is a very good
example.  One could M-. on <code>deftransform</code>, and go down a very long
chain of definitions.  Instead, I think it&#8217;s only essential to see
that the form defines a new rule, like a compiler macro, such that
compile-time values (lvars) that represent its two arguments are bound
to x and y, and the rule only triggers if its first argument is known
to be an unsigned word, and its second a constant unsigned word.  If
that&#8217;s satisfied, the transformation still only triggers if the speed
optimisation quality is higher than both compilation speed and space
(code size).</p>

<p>Then, the constant value for y is extracted and bound to y, and a
conservative bound on the maximum value that x can take at runtime is
computed.  If truncate by y should be handled elsewhere, the transform
gives up.  Otherwise, it returns a form that will be wrapped in
<code>(lambda (x y) ...)</code> and spliced in the call, instead of truncate.</p>
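
<p>To make the arithmetic behind such a rule concrete, here is a minimal Python sketch of the classic multiply-and-shift trick for dividing unsigned words by a constant.  It is not SBCL&#8217;s transform: the names <code>magic_for_divisor</code> and <code>truncate_by_constant</code> are hypothetical, and the round-up multiplier is the textbook choice, assumed here only for illustration.</p>

```python
def magic_for_divisor(y, word_bits=64):
    """Return (multiplier, shift) such that x // y == (x * multiplier) >> shift
    for every 0 <= x < 2**word_bits (textbook round-up multiplier)."""
    if y <= 0:
        raise ValueError("the divisor must be a positive constant")
    # shift = word_bits + ceil(log2(y)); (y - 1).bit_length() is ceil(log2(y)).
    shift = word_bits + (y - 1).bit_length()
    # Ceiling of 2**shift / y, in exact integer arithmetic.
    multiplier = -((-(1 << shift)) // y)
    return multiplier, shift


def truncate_by_constant(x, multiplier, shift):
    """Quotient of x by the constant baked into (multiplier, shift)."""
    return (x * multiplier) >> shift


multiplier, shift = magic_for_divisor(10, word_bits=16)
print(truncate_by_constant(12345, multiplier, shift))  # same as 12345 // 10
```

<p>Python integers never overflow, so the sketch sidesteps the real difficulty: the multiplier can need one bit more than a machine word, which is the kind of case a transform may prefer to leave to another rule.</p>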

<p>To extend SBCL&#8217;s support for division by constants, it&#8217;s not necessary
to understand more of SBCL&#8217;s compiler than the above.  There&#8217;s no need
to try and understand <em>how</em> deftransform works, only that it defines a
rule to simplify calls to truncate.  Similarly for lvar-value and
lvar-type: the former extracts the value for constant lvars, and the
latter the static type derived for that lvar (value at a program
point).  With time, knowledge will slowly accrete.  However, it&#8217;s
possible, if not best, to start hacking without understanding the
whole system.  This approach will lead to a bit of cargo culting, but
mentors and people on IRC will help make sure it doesn&#8217;t do any harm,
and can explain more stuff if it&#8217;s interesting or à propos.</p>

<h3>Finding where the compiler lives</h3>

<p>Working on the compiler itself is a bit more work.  I think the best
approach is to go in <code>src/compiler/main.lisp</code> and look for
<code>compile-component</code>.  <code>ir1-phases</code> loops on a component and performs
high-level optimisations until fixpoint (or we get tired of waiting),
while <code>%compile-component</code> handles the conversion to IR2 and then to
machine code.  The compilation pipeline hasn&#8217;t really changed since
the Python paper was written, and the subphases each have their own
function (and file).  M-. on stuff that sounds interesting is probably
the best approach at the IR2 level.</p>

<h3>Runtime stuff</h3>

<p>The C and assembly runtime lives in <code>src/runtime/</code>.  There&#8217;s a lot of
stuff that&#8217;s symlinked or generated during the build, so it&#8217;s probably
best to look at it after a successful build.  Sadly, we don&#8217;t track
source locations there, but {c,e,whatever}tags works; so does grep.</p>

<p>GC stuff is in the obvious suspects (gc-common, gencgc, gc, etc.), but
may end up affecting core loading/saving (core, coreparse, save).
Depending on what in the core loading code is affected, code in
genesis (the initial bootstrap that reads fasl files from the cross
compiler and builds the initial core file) might also have to be
modified (mostly in <code>src/compiler/generic/genesis.lisp</code>).  That&#8217;s…
more work.  Like the project suggestions list says, when we change
things in the runtime, it sometimes ends up affecting a lot of other
components.</p>

<p>GDB tends to be less than useful, because of the many tricks SBCL
plays on itself.  It&#8217;s usually hard to beat pen, paper, and printf.
At least, rebuilding the C runtime is quick: if the feature
<code>:sb-after-xc-core</code> is enabled (which already happens for <code>--fancy</code>
builds), <code>slam.sh</code> should be able to rebuild only the C runtime, and
then continue the bootstrap with the rest of SBCL from the previous
build.  That mostly leaves PCL to build, so the whole thing should
take less than a minute on a decent machine.</p>

<h2>Some references</h2>

<p>I was replying to an email when I realised that some general compiler
references would be useful, in addition to project- and SBCL-specific
tips.</p>

<p>Christian Queinnec&#8217;s
<a href="http://pagesperso-systeme.lip6.fr/Christian.Queinnec/WWW/LiSP.html">Lisp in Small Pieces</a>
gives a good overview of issues regarding compiling Lisp-like
languages.  Andrew Appel&#8217;s
<a href="http://www.cs.princeton.edu/~appel/modern/ml/">Modern Compiler Implementation in ML</a>
is more, well, modern (I hear the versions in C and Java have the same
text, but the code isn&#8217;t as nice… and ML is a very nice language for
writing compilers).  I also remember liking Appel&#8217;s
<a href="http://www.amazon.ca/Compiling-Continuations-Andrew-W-Appel/dp/052103311X">Compiling with Continuations</a>,
but I don&#8217;t know if it&#8217;s particularly useful for CL or the projects we
suggest.</p>

<p>For more complicated stuff, I believe Stephen Muchnick&#8217;s
<a href="http://www.amazon.com/Advanced-Compiler-Design-Implementation-Muchnick/dp/1558603204">Advanced Compiler Design and Implementation</a>
would have been really nice to have, instead of slogging through code
and dozens of papers.  Allen &amp; Kennedy&#8217;s
<a href="http://www.amazon.ca/Optimizing-Compilers-Modern-Architectures-Dependence-based/dp/1558602860">Optimizing Compilers for Modern Architectures: A Dependence-based Approach</a>
is another really good read, but I&#8217;m not sure how useful it would be
when working on SBCL: we still have a lot of work to do before
reaching for the really sophisticated stuff (and what sophistication
there is is fairly non-standard).</p>

<p>I believe the Rabbit and Orbit compilers have influenced the design of
CMUCL and SBCL.  The
<a href="http://library.readscheme.org/page1.html">Lambda papers</a> provide some
historical perspective, and the RABBIT and ORBIT theses are linked
<a href="http://library.readscheme.org/page8.html">here</a>.</p>

<p>What little magic remains in SBCL and CMUCL is the type derivation
(constraint propagation) pass, and how it&#8217;s used to exploit a
repository of source-to-source transformations (deftransforms).  The
rest is bog-standard tech from the 70s or 80s.  When trying to
understand SBCL&#8217;s type derivation pass at a very high level, I
remember finding Henry Baker&#8217;s
<a href="http://home.pipeline.com/~hbaker1/TInference.html">The Nimble Type Inferencer for Common Lisp-84</a>
very useful, even though it describes a scheme that doesn&#8217;t quite work
for Common Lisp (it&#8217;s very hard to propagate information backward
while respecting the final standard).  Kaplan and Ullman&#8217;s
<a href="http://pdf.aminer.org/000/546/423/a_general_scheme_for_the_automatic_inference_of_variable_types.pdf">A Scheme for the Automatic Inference of Variable Types</a>
was also helpful.</p>

<h2>Getting help</h2>

<p>Over the years, I&#8217;ve seen a couple of people come in with great
ambition, and give up after some time, seemingly without having made
any progress.  I believe a large part of the problem is that they
tried to understand all of SBCL instead of just learning the bare
minimum to get hacking, and that their goal was too big.  I already
wrote that SBCL is probably best approached bit by bit, with some
guidance from people who&#8217;ve been there before, and I hope the projects
we suggest can all lead to visible progress quickly, after a couple
days or two weeks of work at most.</p>

<p>Still, before investing my time, I like to see the other person also
give some of theirs to SBCL.  This is why, as I wrote on the mailing
list last week, I&#8217;m much more inclined to help someone who&#8217;s already
built SBCL on their own and has submitted a patch that&#8217;s been
committed or is being improved on the mailing list.  I absolutely do
not care what the patch is; it can be new code, a bugfix for a highly
unlikely corner case, better documentation, or spelling and grammar
corrections in comments.  The bugs
<a href="https://bugs.launchpad.net/sbcl/+bugs?field.tag=easy">tagged as easy in our bugtracker</a>
may provide some inspiration.  However trivial a patch might seem,
it&#8217;s still a sign that someone is willing to put the work in to
concretely make SBCL better, and I like that… it&#8217;s also a minimal test
to make sure the person is able to work with our toolchain.  (This
isn&#8217;t SBCL policy for GSoC.  It&#8217;s simply how I feel about these
things.)</p>

<p>Again, I&#8217;m amazed by the number of people who wish to hack on SBCL
this summer (as part of Google&#8217;s Summer of Code or otherwise).
Because of that, I think it&#8217;s important to note that this is our first
year, and so we&#8217;ll likely not have more than two or three spots.
However, I always like seeing more contributors, and I hope anyone
who&#8217;d like to contribute will always be guided, GSoC or not.</p>

<p>Finally, I&#8217;ll note that Google&#8217;s Summer of Code program was only a
good excuse to write up our
<a href="http://www.sbcl.org/gsoc2013/ideas/">list of projects</a>: they&#8217;re
simply suggestions to incite programmers to see what they can do that
is useful for SBCL and, most importantly, is interesting for them.
Anyone should feel welcome to work on any of these projects, even if
they&#8217;re not eligible or chosen for GSoC.  They&#8217;re also only
suggestions; if someone has their own idea, we can likely help them
out just the same.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[The eight useful polynomial approximations of `sinf(3)']]></title>
    <link href="http://www.pvk.ca/Blog/2012/10/07/the-eight-useful-polynomial-approximations-of-sinf-3/"/>
    <updated>2012-10-07T23:35:00+02:00</updated>
    <id>http://www.pvk.ca/Blog/2012/10/07/the-eight-useful-polynomial-approximations-of-sinf-3</id>
    <content type="html"><![CDATA[<p>I just spent a few CPU-months to generate
<a href="https://github.com/pkhuong/polynomial-approximation-catalogue">these text files</a>.
They catalogue all the &#8220;interesting&#8221; (from an efficiency and accuracy
point of view) polynomial approximations of degree 16 or lower for a
couple transcendental functions, over small but useful ranges, in
single and double float arithmetic.  This claim seems to raise many
good questions when people hear it.</p>

<p>What&#8217;s wrong with Taylor approximations? Why the need to specify a
range?</p>

<p>Why are the results different for single and double floating point
arithmetic? Doesn&#8217;t rounding each coefficient to the closest float
suffice?</p>

<p>Why do I deem only certain approximations to be interesting and
others not, and how can there be so few?</p>

<p>In this post, I attempt to provide answers to these interrogations,
and sketch how I exploited classic
<a href="http://www.scienceofbetter.org/">operations research (OR)</a> tools to
enter the numerical analysts&#8217; playground.</p>

<p>The final section describes how I&#8217;d interpret the catalogue when coding
quick and slightly inaccurate polynomial approximations.  Such lossy
approximations seem to be used a lot in machine learning, game
programming and signal processing: for these domains, it makes sense
to allow more error than usual, in exchange for faster computations.</p>

<p>The CL code is all up in the repository for
<a href="https://github.com/pkhuong/rational-simplex/tree/master/demo/branch-and-cut-fit">rational-simplex</a>,
but it&#8217;s definitely research-grade. Readers beware.</p>

<h1>Minimax approximations</h1>

<p><span class='pullquote-right' data-pullquote='Taylor approximations are usually easy to compute, but only provide good approximations over an infinitesimal range. '>
The natural way to approximate functions with polynomials
(particularly if one spends time with physicists or engineers) is to
use truncated Taylor series.  Taylor approximations are usually easy to compute, but only provide good approximations over an infinitesimal range.  For example, the degree-1 Taylor approximation for \(\exp\)
centered around 0 is \(1 + x\).  It&#8217;s also obviously suboptimal, if
one wishes to minimise the worst-case error: the exponential function
is convex, and gradients consistently under-approximate convex functions.
</span></p>

<p><img class="center" src="http://www.pvk.ca/images/2012-10-07-the-eight-useful-polynomial-approximations-of-sinf-3/minimax.png"></p>

<!-- ggplot(data.frame(x=c(-1, 1)), aes(x)) + stat_function(fun=exp, aes(colour='exp')) + stat_function(fun=function(x) { 1+x}, aes(color='1 + x')) + stat_function(fun=function(x) { 1.26+1.18*x}, aes(colour='1.26 + 1.18x')) + scale_colour_manual("Function", value=c("black","blue", "red"), breaks=c("exp", "1 + x", "1.26 + 1.18x")) -->


<p>Another affine function, \(1.26 + 1.18x\), intersects \(\exp\) in
two points and is overall much closer over \([-1, 1]\).  In fact,
this latter approximation (roughly) minimises the maximal absolute
error over that range: it&#8217;s clearly not as good as the Taylor
approximation in the vicinity of 0, but it&#8217;s also much better in the
worst case (all bets are off outside \([-1, 1]\)).  That&#8217;s why I&#8217;m
(or anyone&#8217;s libm, most likely) not satisfied by Taylor polynomials,
and instead wish to compute approximations that minimise the error
over a known range; function-specific identities can be exploited to
reduce any input to such a range.</p>
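
<p>A quick way to check the comparison numerically is to sample both affine functions over the range and keep the worst absolute error.  The sketch below does just that in Python; the helper name <code>max_abs_error</code> and the sampling density are assumptions made for the example, and a dense grid only approximates the true maximum.</p>

```python
import math


def max_abs_error(approx, f=math.exp, lo=-1.0, hi=1.0, samples=10001):
    """Largest |f(x) - approx(x)| over an evenly spaced grid on [lo, hi]."""
    step = (hi - lo) / (samples - 1)
    return max(abs(f(lo + i * step) - approx(lo + i * step))
               for i in range(samples))


def taylor(x):
    return 1.0 + x


def minimax(x):
    return 1.26 + 1.18 * x


err_taylor = max_abs_error(taylor)    # worst at x = 1, about 0.72
err_minimax = max_abs_error(minimax)  # about 0.29, nearly level across the range
print(err_taylor, err_minimax)
```

<p>The Taylor line wins near 0, but its worst case at the right end of the range is more than twice that of the second line.</p>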

<h2>Computing minimax polynomials</h2>

<p>As far as I know, the typical methods to find polynomial
approximations that minimise the maximal error (minimax polynomials)
are iterative algorithms in the style of the <a href="http://en.wikipedia.org/wiki/Remez_algorithm">Remez exchange algorithm</a>.
These methods exploit real analysis results to reduce the problem to
computing minimax polynomials over very few points (one more than the
number of coefficients, i.e. two more than the degree): once an approximation is found by
solving a linear equation system, error extrema are computed and used
as a basis to find the next approximation.  Given arbitrary-precision
arithmetic and some properties on the approximated function, the
method converges.  It&#8217;s elegant, but depends on high-precision
arithmetic.</p>

<p>Instead, I
<a href="http://pvk.ca/Blog/2012/05/24/fitting-polynomials-by-generating-linear-constraints/">reduce the approximation problem to a sequence of linear optimisation programs</a>.
Exchange algorithms solve minimax subproblems over exactly as many
points as there are unknowns (the coefficients, plus the error level): the fit can then be solved as a
linear equation system.  I find it simpler to use many more constraints than
there are coefficients, and solve the resulting optimisation problem
subject to linear inequalities directly, as a
<a href="http://en.wikipedia.org/wiki/Linear_programming">linear program</a>.</p>
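
<p>Spelled out, with notation that is assumed for this note rather than taken from the code (\(f\) is the function to approximate, \(c\sb{0}, \ldots, c\sb{d}\) the coefficients, \(x\sb{1}, \ldots, x\sb{n}\) the current points and \(\varepsilon\) the error bound), each subproblem would look something like:</p>

<p>\[\min \varepsilon \quad \text{subject to} \quad -\varepsilon \le f(x\sb{i}) - \sum\sb{j=0}\sp{d} c\sb{j} x\sb{i}\sp{j} \le \varepsilon, \qquad i = 1, \ldots, n.\]</p>

<p>Both the coefficients and \(\varepsilon\) are decision variables, and every point contributes two linear inequalities, so adding a point only ever adds constraints.</p>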

<p>There are obviously no cycling problems in this cutting planes
approach (the set of points grows monotonically), and all the
machinery from exchange algorithms can be reused: there is the same
need for a good set of initial points and for determining error
extrema.  The only difference is that points can always be added to
the subproblem without having to remove any, and that we can restrict
points to correspond to floating values (i.e. values we might actually
get as input) without hindering convergence.  The last point seems
pretty important when looking at high-precision approximations.</p>

<p>For example, let&#8217;s approximate the exponential function over \([-1, 1]\)
with an affine function.  The initial points could
simply be the bounds, -1 and 1.  The result is the line that passes through
\(\exp(-1)\) and \(\exp(1)\), approximately \(1.54 + 1.18x\).
The error is pretty bad around 0; solving for the minimax line over
three points (-1, 0 and 1) yields a nearly optimal solution,
approximately \(1.27 + 1.18x\), and adding the new error extremum (around
0.16) brings it to the globally optimal \(1.26 + 1.18x\).</p>
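
<p>The loop is small enough to sketch in Python for this affine case.  This is a toy under stated assumptions, not the code in the repository: the function names are hypothetical, plain doubles replace exact arithmetic, error extrema are found on a fixed grid rather than by Newton steps, and brute-force enumeration of three-point subsets stands in for the linear program (for a line, the minimax fit over finitely many points is attained on some three of them).</p>

```python
import itertools
import math


def fit_line_on_triple(f, triple):
    """Line a + b*x whose error f(x) - (a + b*x) alternates +e, -e, +e
    on three points; returns (a, b, e)."""
    x0, x1, x2 = sorted(triple)
    b = (f(x2) - f(x0)) / (x2 - x0)
    e = (f(x0) - f(x1) + b * (x1 - x0)) / 2.0
    a = f(x0) - b * x0 - e
    return a, b, e


def minimax_line(f, lo, hi, grid=2001, rel_tol=1e-4):
    """Cutting-plane loop: fit on a growing point set, add the worst point."""
    candidates = [lo + (hi - lo) * i / (grid - 1) for i in range(grid)]
    points = [lo, 0.5 * (lo + hi), hi]
    while True:
        # Subproblem: keep the triple with the largest levelled error; its
        # line is the minimax fit over the current points.
        a, b, e = max(
            (fit_line_on_triple(f, t)
             for t in itertools.combinations(points, 3)),
            key=lambda fit: abs(fit[2]),
        )
        worst = max(candidates, key=lambda x: abs(f(x) - (a + b * x)))
        worst_error = abs(f(worst) - (a + b * worst))
        if worst in points or worst_error <= abs(e) * (1.0 + rel_tol):
            return a, b, abs(e), points
        points.append(worst)  # new cut; no point is ever removed


a, b, err, points = minimax_line(math.exp, -1.0, 1.0)
print(round(a, 3), round(b, 3), round(err, 3), sorted(points))
```

<p>Enumerating triples is only workable for a handful of points; with hundreds of points and a dozen coefficients, an actual LP solver is the sensible tool.</p>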

<p><img class="center" src="http://www.pvk.ca/images/2012-10-07-the-eight-useful-polynomial-approximations-of-sinf-3/cutting-planes.png"></p>

<!-- ggplot(data.frame(x=c(-1, 1)), aes(x)) + stat_function(fun=exp, aes(colour='exp')) + stat_function(fun=function(x) { 1.54+1.18*x}, aes(color='1.54 + 1.18x')) + stat_function(fun=function(x) { 1.26+1.18*x}, aes(colour='1.26 + 1.18x')) + scale_colour_manual("Function", value=c("black","blue", "red"), breaks=c("exp", "1.54 + 1.18x", "1.26 + 1.18x")) -->


<p>There&#8217;s a lot of meat to wrap around this bone.  I use a
<a href="https://bitbucket.org/tarballs_are_good/computable-reals">computable reals</a>
package in Common Lisp to pull arbitrary-precision rational
approximations for arithmetic expressions; using libm directly would
approximate an approximation, and likely cause strange results.  A
<a href="https://github.com/pkhuong/rational-simplex/blob/master/demo/branch-and-cut-fit/newton.lisp">variant of Newton&#8217;s algorithm</a>
(with bisection steps) converges to error extrema (points which can be
added to the linear subproblem); arbitrary precision real arithmetic
is very useful to ensure convergence down to machine precision here.
Each linear subproblem is solved with an
<a href="https://github.com/pkhuong/rational-simplex">exact simplex algorithm in rational arithmetic</a>,
and convergence is declared when the value of the error extrema found
in the Newton steps corresponds to that estimated by the subproblem.
Finally, the fact that the inputs are floating-point values is
exploited by ensuring that all the points considered in the linear
subproblems correspond exactly to FP values: rather than only rounding
a point to the nearest FP value, its two immediately neighbouring
(predecessor and successor) FP values are also added to the
subproblem.  Adding immediate neighbours helps skip iterations in
which extrema move only by one <a href="http://en.wikipedia.org/wiki/Unit_in_the_last_place">ULP</a>.</p>

<p>Initialising the method with a good set of points is essential to
obtain reasonable performance.  The
<a href="http://en.wikipedia.org/wiki/Chebyshev_nodes">Chebyshev nodes</a> are
known to yield
<a href="http://www.uta.edu/faculty/rcli/papers/li2004.pdf">nearly optimal [PDF]</a>
initial approximations.  The LP-based approach can exploit a large
number of starting points, so I went with very many initial Chebyshev
nodes (256, for polynomials of degree at most 16), and, again,
adjoined three neighbouring FP values for each point.  It doesn&#8217;t seem
useful to me to determine the maximal absolute error very precisely,
and I declared convergence when the value predicted by the LP
relaxation was off by less than 0.01%.  Also key to the performance
were tweaks in the polynomial evaluation function to avoid rational
arithmetic until the very last step.</p>
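
<p>Building such an initial pool is straightforward.  Here is a rough Python sketch for double floats, with hypothetical helper names; it assumes Python 3.9 or later for <code>math.nextafter</code>, and takes the nodes to be Chebyshev nodes of the first kind mapped to the target range.</p>

```python
import math


def chebyshev_nodes(count, lo=-1.0, hi=1.0):
    """Chebyshev nodes of the first kind, mapped from [-1, 1] to [lo, hi]."""
    mid, half = 0.5 * (lo + hi), 0.5 * (hi - lo)
    return sorted(
        mid + half * math.cos((2 * k + 1) * math.pi / (2 * count))
        for k in range(count)
    )


def with_fp_neighbours(points):
    """Each double together with its predecessor and successor."""
    pool = set()
    for x in points:
        pool.update((math.nextafter(x, -math.inf), x,
                     math.nextafter(x, math.inf)))
    return sorted(pool)


nodes = chebyshev_nodes(256)
pool = with_fp_neighbours(nodes)
print(len(nodes), len(pool))
```

<p>Every value in the pool is exactly representable, so each one is an argument the approximation might really receive.</p>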

<h1>Exactly represented coefficients</h1>

<p><span class='pullquote-right' data-pullquote='rounding coefficients can result in catastrophic error blowups. '>
The previous section gives one reason why there are different tables
for single and double float approximations: the ranges of input
considered during optimisation differ.  There&#8217;s another reason that&#8217;s
more important, particularly for single float coefficients:
rounding coefficients can result in catastrophic error blowups.
</span></p>

<p>For example, rounding \(2.5 x\sp{2}\) to the nearest integer
coefficient finds either \(2 x\sp{2}\) or \(3 x\sp{2}\).  However \(2
x\sp{2} + x\sp{3}\) is also restricted to integer coefficients, but
more accurate over \([0, .5]\).  Straight coefficient-wise
rounding yields an error of \(\frac{1}{2} x\sp{2}\), versus
\(\frac{1}{2} x\sp{2} - x\sp{3}\) for the degree-3 approximation.
As the following graph shows, the degree-3 approximation is much
closer for the range we&#8217;re concerned with.</p>
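
<p>The toy example is easy to verify numerically.  The Python sketch below samples the three integer-coefficient candidates against \(2.5 x\sp{2}\) over \([0, .5]\); the helper <code>max_abs_gap</code> and the grid size are assumptions made for the illustration.</p>

```python
def max_abs_gap(p, q, lo=0.0, hi=0.5, samples=5001):
    """Largest |p(x) - q(x)| over an evenly spaced grid on [lo, hi]."""
    xs = (lo + (hi - lo) * i / (samples - 1) for i in range(samples))
    return max(abs(p(x) - q(x)) for x in xs)


def target(x):
    return 2.5 * x ** 2


def rounded_down(x):
    return 2.0 * x ** 2


def rounded_up(x):
    return 3.0 * x ** 2


def with_cubic_term(x):
    return 2.0 * x ** 2 + x ** 3


err_down = max_abs_gap(target, rounded_down)      # 1/8, reached at x = 0.5
err_up = max_abs_gap(target, rounded_up)          # also 1/8
err_cubic = max_abs_gap(target, with_cubic_term)  # about 1/54, near x = 1/3
print(err_down, err_up, err_cubic)
```

<p>Neither rounding direction does better than 0.125, while the polynomial with an extra integer coefficient stays under 0.02.</p>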

<p><img class="center" src="http://www.pvk.ca/images/2012-10-07-the-eight-useful-polynomial-approximations-of-sinf-3/rounding.png"></p>

<!-- ggplot(data.frame(x=c(0, .5)), aes(x)) + stat_function(fun=function(x){.5*x^2}, aes(colour='|2.5 x^2 - 2 x^2|')) + stat_function(fun=function(x) { .5*x^2-x^3}, aes(color='|2.5 x^2 - (2 x^2 + x^3)|')) + scale_colour_manual("Error", breaks=c('|2.5 x^2 - 2 x^2|', '|2.5 x^2 - (2 x^2 + x^3)|'), value=c('red', 'blue')) -->


<p>Floating point values are more densely distributed than the integers
for the range of values we usually encounter in polynomial
approximations, but still discrete, and the same phenomenon crops up,
at a smaller scale.</p>

<p>The
<a href="http://www.marc.mezzarobba.net/m2/summary_chevillard.pdf">state of the art</a>
for this version of the approximation problem seems to be based on an
initial Remez step, followed by a reduction to a
<a href="http://en.wikipedia.org/wiki/Lattice_problem#Closest_vector_problem_.28CVP.29">CVP</a>.</p>

<p>The cutting planes approach points to a natural solution from the OR
world: a branch and cut method.  The issue with the cutting planes
solution is that, while I only generate points that correspond to FP
values, the decision variables (coefficients) are free to take
arbitrary rational values.</p>

<p>Branching can be used to split up a decision variable&#8217;s range and
eliminate from consideration a range of non-FP values.  The decision
variables&#8217; upper and lower bounds are tightened in a large number of
subproblems; solutions gradually converge to all-FP values and their
optimality is then proven.</p>

<p>Embedding the cutting planes method in a branch and bound lets us find
polynomial approximations with float coefficients that minimise the
maximal error over float arguments, with correctly rounded powers.
The only remaining simplification is that we assume that the dot
product in the polynomial evaluation is error-free.  Sadly, removing
this simplification results in a horrible discrete optimisation
problem with a very bumpy objective function.  I&#8217;m not sure that any
exact approach can solve this in reasonable time.  Still, there are
ways to evaluate FP polynomials very accurately, and I&#8217;m mostly
interested in approximations to trade accuracy for speed, so rounding
errors may well be negligible compared to approximation errors.</p>

<h2>A branch-and-cut method for polynomial approximations</h2>

<p>As in all
<a href="http://en.wikipedia.org/wiki/Branch_and_bound">branch and bound</a>
methods, a problem is split into subproblems by restricting the range of
some coefficient to exclude some infeasible (non-FP) values.  For
example, in the previous example (in which we&#8217;re looking for integer
coefficients), the 2.5 coefficient would lead to two subproblems, one
in which the degree-2 coefficient is at most 2 \((\lfloor 2.5\rfloor)\),
and another in which it&#8217;s at least 3 \((\lceil2.5 \rceil)\),
excluding all the fractional values between 2 and 3.  For floats,
we restrict to the closest floats that under- or over- approximate
the current value.</p>
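
<p>For doubles, the two branching bounds can be computed from an exact rational value as in the Python sketch below.  The function name is hypothetical, the sketch assumes Python 3.9 or later for <code>math.nextafter</code>, and single floats would need an extra rounding step that is not shown.</p>

```python
import math
from fractions import Fraction


def float_branch_bounds(value):
    """Return (below, above): the closest doubles with below <= value <= above.
    Both are equal when the rational value is already a double."""
    nearest = float(value)     # Fraction to float rounds to the nearest double
    exact = Fraction(nearest)  # the double's exact rational value
    if exact == value:
        return nearest, nearest
    if exact < value:
        return nearest, math.nextafter(nearest, math.inf)
    return math.nextafter(nearest, -math.inf), nearest


below, above = float_branch_bounds(Fraction(1, 10))
print(below, above)
```

<p>One child subproblem would then cap the coefficient at <code>below</code>, and the other would force it to be at least <code>above</code>, just like the floor and ceiling in the integer example.</p>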

<p>However, instead of solving the full continuous relaxation (which is
impractical, given the large number of potential argument values), our
subproblems are solved with cutting planes.  The trick is that cuts
(constraints, which correspond to argument values) from any branch can
be used everywhere else.  Thus, the global point pool is shared
between all subproblems, rather than re-generating it from scratch for
each subproblem.  In practice, this lifting of cutting planes to the
root problem seems essential for efficiency; that&#8217;s certainly what it
took for branch-and-cut MIP solvers to take off.</p>

<p>I generate cuts at the root, but afterwards only when a subproblem
yields all-FP values.  This choice was made for efficiency reasons.
Finding error extrema is relatively slow, but, more importantly,
adding cuts really slows down re-optimisation: the state of the
simplex algorithm can be preserved between invocations, and
warm-starting tends to be extremely efficient when only some
variables&#8217; bounds have been modified.  However, it&#8217;s also necessary to
add constraints when an incumbent is all-FP, lest we prematurely
declare victory.</p>

<p>There are two really important choices when designing branch and bound
algorithms: how the branching variable is chosen, and the order in
which subproblems are explored.  With polynomial approximations, it
seems to make sense to branch on the coefficient corresponding to the
lowest degree first: I simply scan the solution and choose the
least-degree coefficient that isn&#8217;t represented exactly as a float.
Nodes are explored in a hybrid depth-first/best-first order: when a
subproblem yields children (its objective value is lower than the
current best feasible solution, and it isn&#8217;t feasible itself), the
next node is its child corresponding to the bound closest to the
branching variable&#8217;s value, otherwise the node with the least
predicted value is chosen.  The depth-first/closest-first dives
quickly converge to decent feasible solutions, while the best-first
choice will increase the lower bound, bringing us closer to proving
optimality.</p>

<p>A randomised rounding heuristic is also used to provide initial
feasible (all-FP) solutions, and the incumbent is considered close
enough to optimal when it&#8217;s less than 5% off from the best-known lower
bound.  When the method returns, we have a polynomial approximation
with coefficients that are exactly representable as single or double
floats (depending on the setting), and which (almost) minimises the
approximation error on float arguments.</p>

<h1>Interesting approximations</h1>

<p>The branch and cut can be used to determine a (nearly)
optimally accurate FP polynomial approximation given a maximum degree.
However, the degree isn&#8217;t the only tunable to accelerate polynomial
evaluation: some coefficients are nicer than others.  Multiplying by 0
is obviously very easy (nothing to do), while multiplication by +/-1
is pretty good (no multiplication), and by +/- 2 not too bad either
(strength-reduce the multiplication into an addition).  It would be
possible to consider other integers, but it doesn&#8217;t seem to make
sense: on current x86, floating point multiplication is only 33-66%
slower (more latency) than FP addition, and fiddling with the exponent
field means ping-ponging between the FP and integer domains.</p>

<p>There are millions of such approximations with a few &#8220;nice&#8221;
coefficients, even when restricting the search to low degrees (e.g. 10
or lower).  However, the vast majority of them will be wildly
inaccurate.  I decided to only consider approximations that are at
least as accurate as the best approximation of degree three lower:
e.g. a degree-3 polynomial with nice coefficients is only
(potentially) interesting if it&#8217;s at least as accurate as a constant.
Otherwise, it&#8217;s most likely even quicker to just use a lower-degree
approximation.</p>

<p><span class='pullquote-right' data-pullquote='what&#8217;s the point in looking at a degree-4 polynomial with one coefficient equal to 0 if there&#8217;s a degree-3 with one zero that&#8217;s just as accurate? '>
That&#8217;s not enough: this filter still leaves thousands of polynomials.
Most of the approximations will be dominated by another; what&#8217;s the point in looking at a degree-4 polynomial with one coefficient equal to 0 if there&#8217;s a degree-3 with one zero that&#8217;s just as accurate?  The relative importance
of the degree, and the number of zeroes, ones and twos will vary
depending on the environment and evaluation technique.  However, it
seems reasonable to assume that zeroes are always at least as quick to
work with as ones, and ones at least as quick as twos.  The constant
offset ought to be treated distinctly from other coefficients.  It&#8217;s
only added rather than multiplied, so any speed-up is lower, but an
offset of 0 is still pretty nice, and, on some architectures, there is
special support to load constants like 1 or 2.
</span></p>

<p>This lets me construct a simple but robust performance model: a
polynomial is more quickly evaluated than another if it&#8217;s of lower or
same degree, doesn&#8217;t have more non-zero, non-{zero, one} or non-{zero,
one, two} multipliers, and if its constant offset is a nicer integer
(0 is nicer than +/- 1 is nicer than +/- 2 is nicer than arbitrary
floats).</p>

<p>With these five metrics, in addition to accuracy, we have a
multi-objective optimisation problem.  In some situations, humans may
be able to determine a weighting procedure to bring the dimension down
to a scalar objective value, but the procedure would be highly
domain-specific.  Instead, we can use this partial order to report
solutions on the Pareto front of accuracy and efficiency: it&#8217;s only
worth reporting an approximation if there is no approximation that&#8217;s
better or equal in accuracy and in all the performance metrics.  The
performance and accuracy characteristics are strongly correlated
(decreasing the degree or forcing nice coefficients tends to decrease
the accuracy, and a zero coefficient is also zero-or-one, etc.), so
it&#8217;s not too surprising that there are so few non-dominated solutions.</p>
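
<p>As a rough sketch of that rule (not the code behind the catalogue),
suppose each candidate is summarised by a list of six numbers, all
oriented so that smaller is better: maximum error, degree, the three
multiplier counts, and the class of the constant.  That representation
and the names <code>dominates-p</code> and <code>pareto-front</code> are assumptions made
for the example.</p>

<pre><code>(defun dominates-p (a b)
  "True if A is at least as good as B on every metric, and
   differs from B on at least one.  Smaller is better."
  (and (every #'&lt;= a b)
       (notevery #'= a b)))

(defun pareto-front (candidates)
  "Keep the candidates that no other candidate dominates."
  (remove-if (lambda (candidate)
               (some (lambda (other)
                       (dominates-p other candidate))
                     candidates))
             candidates))
</code></pre>

<p>Exact ties survive this filter, so a real selection pass would still
have to pick one representative among candidates with identical
metrics.</p>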

<h2>Enumerating potentially-interesting approximations</h2>

<p>The branch and cut can be used to find the most accurate
approximation, given an assignment for a few values (e.g. the constant
term is 1, or the first-degree coefficient is 0).  I&#8217;ll use it as a
subproblem solver, in a more exotic branch and bound approach.</p>

<p>We wish to enumerate all the partial assignments that correspond to
not-too-horrible solutions, and save those.  A normal branch and bound
can&#8217;t be applied directly, as one of the choices is to leave a given
variable free to take any value.  However, bounding still works: if a
given partial assignment leads to an approximation that&#8217;s too
inaccurate, the accuracy won&#8217;t improve by fixing even more
coefficients.</p>

<p>I started with a search in which children were generated by adjoining
one fixed value to partial assignments.  So, after the root node,
there could be one child with the constant term fixed to 0, 1 or 2
(and everything else free), another with the first coefficient fixed
to 0, 1 or 2 (everything else left free), etc.</p>

<p>Obviously, this approach leads to a search graph: fixing the constant
term to 0 and then the first coefficient to 0, or doing so in the reverse
order, leads to the same partial assignment.  A hash table ensures that
no partial assignment is explored twice.  There&#8217;s still a lot of
potential for wasted computation: if bounding lets us determine that
fixing the constant coefficient to 0 is worthless, we will still
generate children with the constant fixed to 0 in other branches!</p>
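
<p>The deduplication can be sketched as follows, assuming a partial
assignment is an alist from coefficient index to fixed value.  The
representation and the names <code>*seen*</code>, <code>canonical-key</code> and
<code>first-visit-p</code> are hypothetical; the point is only that sorting by
index makes the key independent of the order in which coefficients
were fixed.</p>

<pre><code>(defvar *seen* (make-hash-table :test #'equal)
  "Canonical partial assignments that were already explored.")

(defun canonical-key (assignment)
  "ASSIGNMENT is an alist of (coefficient-index . fixed-value).
   Sort a copy by index so that fixing order doesn't matter."
  (sort (copy-list assignment) #'&lt; :key #'car))

(defun first-visit-p (assignment)
  "True the first time an assignment is seen, false afterwards."
  (let ((key (canonical-key assignment)))
    (unless (gethash key *seen*)
      (setf (gethash key *seen*) t))))
</code></pre>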

<p>I borrowed a trick from the SAT solving and constraint programming
communities,
<a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.2870">nogood recording</a>.
Modern SAT solvers go far beyond strict branching search: in practice,
it seems that good branching orders will lead to quick solutions,
while a few bad ones take forever to solve.  Solvers thus frequently
reset the search tree to a certain extent.  However, information is
still communicated across search trees via learned clauses.  When the
search backtracks (an infeasible partial assignment is found), a nogood
set of conflicting partial assignments can be computed: no feasible
solution will include this nogood assignment.</p>

<p>Some time ago, Vasek Chvatal introduced
<a href="http://dimacs.rutgers.edu/TechnicalReports/abstracts/1995/95-14.html">Resolution Search</a>
to port this idea to 0/1 optimisation, and Marius Posta, a friend at
CIRRELT and Université de Montréal,
<a href="https://www.cirrelt.ca/DocumentsTravail/CIRRELT-2009-16.pdf">extended it to general discrete optimisation [PDF]</a>.
The complexity mostly comes from the desire to generate partial
assignments that, if they&#8217;re bad, will merge well with the current set
of learned clauses.  This way, the whole set of nogoods (or, rather,
an approximation that suffices to guarantee convergence) can be
represented and queried efficiently.</p>

<p>There&#8217;s no need to be this clever here: each subproblem involves an
exact (rational arithmetic) branch and cut.  Simply scanning the set
of arbitrary nogoods to look for a match significantly accelerates the
search.  The process is sketched below.</p>

<p><a href="http://www.pvk.ca/images/2012-10-07-the-eight-useful-polynomial-approximations-of-sinf-3/search-tree.jpg"><img class="center" src="http://www.pvk.ca/images/2012-10-07-the-eight-useful-polynomial-approximations-of-sinf-3/search-tree-small.jpg"></a></p>
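
<p>The linear scan itself can be very simple.  In this sketch, partial
assignments and nogoods are both alists from coefficient index to fixed
value (an assumption, as are the function names), and a nogood rules
out every assignment that extends it: fixing more coefficients can&#8217;t
improve the accuracy.</p>

<pre><code>(defun extends-p (assignment nogood)
  "True if ASSIGNMENT fixes every coefficient that NOGOOD fixes,
   and to the same value."
  (every (lambda (fixing)
           (equal fixing (assoc (car fixing) assignment)))
         nogood))

(defun ruled-out-p (assignment nogoods)
  "Linear scan: does ASSIGNMENT extend any recorded nogood?"
  (some (lambda (nogood) (extends-p assignment nogood))
        nogoods))
</code></pre>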

<p>The search is further accelerated by executing multiple branch and
cuts and nogood scans in parallel.</p>

<p>The size of the search graph is also reduced by only fixing
coefficients to a few values.  There&#8217;s no point in forcing a
coefficient to take the value of 0, 1 or 2 (modulo sign) if it already
does.  Thus, a coefficient with a value of 0 is left free; otherwise a
child extending the partial assignment with 0 is created.  Similarly,
a child extended with a value of 1 is only generated if the
coefficient isn&#8217;t already at 0 or 1 (we suppose that multiplication by
0 is at least as efficient as by 1), and similarly for 2.  Finally,
coefficients between 0 and 1 are only forced to 0 or 1, and those
between 1 and 3 to 0, 1 or 2.  If a coefficient takes a greater
absolute value than 3, fixing it most probably degrades the
approximation too strongly, and it&#8217;s left free &#8211; then again such
coefficients only seem to happen on really hard-to-approximate
functions like \(\log\) over \([1, 2]\).  Also, the last
coefficient is never fixed to zero (that would be equivalent to
looking at lower-degree approximations, which was done in previous
searches).  Of course, with negative coefficients, the fixed values
are negated as well.</p>
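
<p>Here is one way to write those rules down, as a sketch rather than
the actual search code.  The function name and its arguments are made
up, and the treatment of coefficients that are exactly 1, 2 or 3 in
magnitude is my reading of the description above.</p>

<pre><code>(defun candidate-fixings (value last-p)
  "Nice values to which a coefficient currently equal to VALUE may
   be fixed.  LAST-P is true for the highest-degree coefficient."
  (let* ((magnitude (abs value))
         (sign (if (minusp value) -1 1))
         (targets (cond ((zerop magnitude) '())     ; already 0
                        ((= magnitude 1) '(0))      ; already 1
                        ((= magnitude 2) '(0 1))    ; already 2
                        ((&lt; magnitude 1) '(0 1))
                        ((&lt;= magnitude 3) '(0 1 2))
                        (t '()))))                  ; too large
    (loop for target in targets
          ;; the last coefficient is never fixed to zero
          unless (and last-p (zerop target))
            collect (* sign target))))
</code></pre>

<p>For instance, a coefficient of -1.7 yields the candidates 0, -1 and
-2, while a leading coefficient of 0.28 can only be fixed to 1.</p>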

<p>This process generates a large number of potentially interesting
polynomial approximations.  The non-dominated ones are found with a
straight doubly nested loop, with a slight twist: accuracy is computed
with machine floating point arithmetic, thus taking rounding into
account.  The downside is that the accuracy is itself only approximated, by
sampling fairly many points (the FP neighbourhood of 8K Chebyshev nodes);
preliminary testing indicates that&#8217;s good enough for a relative error
(on the absolute error estimate) lower than 1e-5.  There tend to be a
few exactly equivalent polynomials (all the attributes are the same,
including accuracy &#8211; they only differ by a couple ULP in a few
coefficients); in that case, one is chosen arbitrarily.  There&#8217;s
definitely some low-hanging fruit to better capture the performance
partial order; the error estimate is an obvious candidate.  The hard
part was generating all the potentially interesting approximations,
though, so one can easily re-run the selection algorithm with tweaked
criteria later.</p>

<h1>Exploiting the approximation indices</h1>

<p>I&#8217;m not sure what functions are frequently approximated over what
ranges, so I went with the obvious ones: \(\cos\) and \(\sin\)
over \([-\pi/2, \pi/2]\), \(\exp\) and \(\arctan\) over \([-1, 1]\)
or \([0, 1]\), \(\log\) over \([1, 2]\), and \(\log(1+x)\) and
\(\log\sb{2}(1+x)\) over \([0, 1]\).  This used up a fair amount of
CPU time, so I stopped at degree 16.</p>

<p>Each file reports the accuracy and efficiency metrics, then the
coefficients in floating point and rational form, and a hash of the
coefficients to identify the approximation.  The summary columns are
all aligned, but each line is very long, so the files are best read
without line wrapping.</p>

<p>For example, if I were looking for a fairly good approximation of
degree 3 for \(\exp\) in single floats, I&#8217;d look at
<a href="https://github.com/pkhuong/polynomial-approximation-catalogue/blob/master/single/exp-degree-lb_error-non_zero-non_one-non_two-constant-error">exp-degree-lb_error-non_zero-non_one-non_two-constant-error</a>.</p>

<p>The columns report the accuracy and efficiency metrics, in the order
used to sort approximations lexicographically:</p>

<ol>
<li>the approximation&#8217;s degree;</li>
<li>the floor of the negated base-2 logarithm of the maximum error
(roughly bits of absolute accuracy, rounded down);</li>
<li>the number of non-zero multipliers;</li>
<li>the number of non-{-1, 0, 1} multipliers;</li>
<li>the number of non-{-2, -1, 0, 1, 2} multipliers;</li>
<li>whether the constant&#8217;s absolute value is 0, 1, 2, or other (in which
case the value is 3); and</li>
<li>the maximum error.</li>
</ol>
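
<p>The second column is easy to recompute from the last one.  A minimal
sketch, with a function name of my own choosing:</p>

<pre><code>(defun lb-error (max-error)
  "Whole bits of absolute accuracy: the floor of -log2(MAX-ERROR)."
  (floor (- (log max-error 2))))
</code></pre>

<p>A maximum error of 5e-4, say, lies between \(2\sp{-11}\) and
\(2\sp{-10}\), so it is reported as 10 bits.</p>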


<p>After that, separated by pipes, come the coefficients in float form,
then in rational form, and the MD5 hash of the coefficients in a float
vector (in a contiguous vector, in increasing order of degree, with
X86&#8217;s little-endian sign-magnitude representation).  The hash might
also be useful if you&#8217;re worried that your favourite implementation
isn&#8217;t parsing floats right.</p>

<p>There are three polynomials with degree 3, and they all offer
approximately the same accuracy (lb_error = 10).  I&#8217;d choose between
the most accurate polynomial</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>exp(x) = 0.9994552 + 1.0166024 x + 0.42170283 x**2 + 0.2799766 x**3 # exp-74F7B9B7E0E73A804ABF6AC6C006BD98</span></code></pre></td></tr></table></div></figure>


<p>or one with a nicer multiplier that doesn&#8217;t even double the maximum error</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>exp(x) = 1.0009761 + x + 0.4587815 x**2 + 0.2575481 x**3 # exp-D4C349D8F2C45EC0BE2154D1052EAA03</span></code></pre></td></tr></table></div></figure>
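
<p>To make the comparison concrete, here is a sketch of both
polynomials as single-float functions, with a crude sampled error
estimate.  The function names are mine, and so is the choice of
\([0, 1]\) as the range to sample: it&#8217;s an inference from the
coefficients, which clearly don&#8217;t fit \(\exp\) over \([-1, 1]\).</p>

<pre><code>(defun exp-most-accurate (x)
  "First polynomial above: four arbitrary floats, Horner form."
  (+ 0.9994552
     (* x (+ 1.0166024
             (* x (+ 0.42170283
                     (* x 0.2799766)))))))

(defun exp-unit-multiplier (x)
  "Second polynomial above: the degree-1 coefficient is exactly 1,
   so the linear term is X itself."
  (+ 1.0009761
     x
     (* x x (+ 0.4587815
               (* x 0.2575481)))))

(defun sampled-max-error (approximation)
  "Largest |approximation(x) - exp(x)| over 101 evenly spaced
   single floats in [0, 1]."
  (loop for i from 0 to 100
        for x = (/ i 100.0)
        maximize (abs (- (funcall approximation x) (exp x)))))
</code></pre>

<p>Both <code>(sampled-max-error #'exp-most-accurate)</code> and
<code>(sampled-max-error #'exp-unit-multiplier)</code> should come out below
1e-3, with the second less than twice the first.  Whether the unit
multiplier actually pays off depends on the evaluation scheme and the
target machine.</p>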


<p>On the other hand, if I were looking for an accurate-enough
approximation of \(\log(1+x)\), I&#8217;d open
<a href="https://github.com/pkhuong/polynomial-approximation-catalogue/blob/master/single/log1px-lb_error-degree-error-non_zero-non_one-non_two-constant">log1px-lb_error-degree-error-non_zero-non_one-non_two-constant</a>.</p>

<p>The columns are in the order used for the lexicographic sort:</p>

<ol>
<li>the number of bits of accuracy;</li>
<li>the approximation&#8217;s degree;</li>
<li>the maximum error;</li>
<li>the number of non-zero multipliers;</li>
<li>the number of non-{-1, 0, 1} multipliers;</li>
<li>the number of non-{-2, -1, 0, 1, 2} multipliers; and</li>
<li>whether the constant&#8217;s absolute value is 0, 1, 2, or other (in which
case the value is 3).</li>
</ol>


<p>An error around 1e-4 would be reasonable for my needs, and
<code>log1px-6AE509</code> seems interesting: maximum error is around 1.5e-4,
it&#8217;s degree 4, the constant offset is 0 and the first multiplier 1.
If I needed a bit more accuracy (7.1e-5), I&#8217;d consider <code>log1px-E8200B</code>:
it&#8217;s degree 4 as well, and the constant is still 0.</p>

<p>It seems to me optimisation tools like approximation generators are
geared toward fire and forget usage.  I don&#8217;t believe that&#8217;s a
realistic story: very often, the operator will have a fuzzy range of
acceptable parameters, and presenting a small number of solutions with
fairly close matches lets them exploit domain-specific insights.  In
this case, rather than specifying fixed coefficients and degree or
accuracy goals, users can scan the indices and decide whether each
trade-off is worth it or not.  That&#8217;s particularly true of the single
float approximations, for which the number of possibilities tends to
be tiny (e.g. eight non-dominated approximations for
<a href="https://github.com/pkhuong/polynomial-approximation-catalogue/blob/master/single/sin-degree-lb_error-non_zero-non_one-non_two-constant-error">\(\sin\)</a>).</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Tabasco Sort: a super-optimal merge sort]]></title>
    <link href="http://www.pvk.ca/Blog/2012/08/27/tabasco-sort-super-optimal-merge-sort/"/>
    <updated>2012-08-27T10:30:00+02:00</updated>
    <id>http://www.pvk.ca/Blog/2012/08/27/tabasco-sort-super-optimal-merge-sort</id>
    <content type="html"><![CDATA[<p>EDIT: 2012-08-29: I added a section to compare comparison counts with
known bounds for general comparison sorts and sorting networks.</p>

<p>In an
<a href="http://pvk.ca/Blog/2012/08/13/engineering-a-list-merge-sort/">earlier post</a>,
I noted how tedious coding unrolled sorts can be.  Frankly, that&#8217;s the
main reason I stopped at leaf sorts of size three.  Recently,
<a href="http://blog.racket-lang.org/2012/08/fully-inlined-merge-sort.html">Neil Toronto</a>
wrote a nice post on the generation of size-specialised merge sorts.
The post made me think about that issue a bit more, and I now have a
neat way to generate unrolled/inlined merge sorts that are
significantly smaller than the comparison and size &#8220;-optimal&#8221; inlined
merge sorts.</p>

<p>The code is up as a
<a href="http://discontinuity.info/~pkhuong/tabasco-sort.lisp">single-file library</a>,
and sorts short vectors faster than SBCL&#8217;s inlined heapsort by a
factor of two to three… and compiles to less machine code.  The
generator is a bit less than 100 LOC, so I&#8217;m not sure I want to
include it in the mainline yet.  If someone wants to add support for
more implementations, I&#8217;d be happy to extend Tabasco sort, and might
even consider letting it span multiple files ;)</p>

<h2>Differently-optimal sorts</h2>

<p>The inlined merge sort for three values (<code>a</code>, <code>b</code>, and <code>c</code>) is copied
below.  It has to distinguish between \(3! = 6\) permutations, and does so
with an optimal binary search tree.  That scheme leads to code with
\(n! - 1\) comparisons to sort \(n\) values, and for which each
execution only goes through about \(\lg n!\) comparisons (two or three for \(n = 3\)).</p>

<figure class='code'><figcaption><span>&#8220;optimal&#8221; inlined merge sort (n = 3) </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(if (&lt; b c)
</span><span class='line'>    (if (&lt; a b)
</span><span class='line'>        (values a b c)
</span><span class='line'>        (if (&lt; a c)
</span><span class='line'>            (values b a c)
</span><span class='line'>            (values b c a)))
</span><span class='line'>    (if (&lt; a c)
</span><span class='line'>        (values a c b)
</span><span class='line'>        (if (&lt; a b)
</span><span class='line'>            (values c a b)
</span><span class='line'>            (values c b a))))</span></code></pre></td></tr></table></div></figure>


<p>An optimal sorting network for three values needs only three
comparisons, and always executes those three comparisons.</p>

<figure class='code'><figcaption><span>optimal sorting network (n = 3) </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(progn
</span><span class='line'>  (when (&lt; c b)
</span><span class='line'>    (rotatef b c))
</span><span class='line'>  (when (&lt; b a)
</span><span class='line'>    (rotatef a b))
</span><span class='line'>  (when (&lt; c b)
</span><span class='line'>    (rotatef b c)))</span></code></pre></td></tr></table></div></figure>


<p>Finally, the leaf sort I used in SBCL is smaller than the inlined
merge sort (three comparisons in the code rather than five) and, unlike
the sorting network, sometimes executes fewer than three comparisons.
It&#8217;s superoptimal ;)</p>

<figure class='code'><figcaption><span>&#8220;super-optimal&#8221; inlined merge sort (n = 3) </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(progn
</span><span class='line'>  (when (&lt; c b)
</span><span class='line'>    (rotatef b c))
</span><span class='line'>  (if (&lt; b a)
</span><span class='line'>      (if (&lt; c a)
</span><span class='line'>          (values b c a)
</span><span class='line'>          (values b a c))
</span><span class='line'>      (values a b c)))</span></code></pre></td></tr></table></div></figure>
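
<p>With only six inputs to try, that claim is easy to check
exhaustively.  The harness below is a sketch: <code>sort3</code> is the leaf sort
above wrapped in a function, with <code>&lt;</code> replaced by a counting
comparator, and all the names are made up for the occasion.</p>

<pre><code>(defvar *comparisons* 0
  "Number of comparisons executed since the last reset.")

(defun counted-less-p (x y)
  (incf *comparisons*)
  (&lt; x y))

(defun sort3 (a b c)
  "The three-value leaf sort, instrumented to count comparisons."
  (when (counted-less-p c b)
    (rotatef b c))
  (if (counted-less-p b a)
      (if (counted-less-p c a)
          (values b c a)
          (values b a c))
      (values a b c)))

(defun check-sort3 ()
  "Sort every permutation of (1 2 3); return the smallest and
   largest comparison counts observed."
  (loop for (a b c) in '((1 2 3) (1 3 2) (2 1 3)
                         (2 3 1) (3 1 2) (3 2 1))
        do (setf *comparisons* 0)
           (assert (equal '(1 2 3)
                          (multiple-value-list (sort3 a b c))))
        minimize *comparisons* into fewest
        maximize *comparisons* into most
        finally (return (values fewest most))))
</code></pre>

<p><code>(check-sort3)</code> returns 2 and 3: inputs in which <code>a</code> is already
the smallest value are sorted with two comparisons, and every other
input takes three.</p>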


<p>The optimal merge sort is larger than the optimal sorting network, and
the optimal sorting network performs potentially more comparisons than
the optimal merge sort…</p>

<p>Each implementation is optimal for different search spaces: the
optimal merge sort never merges continuations, and the sorting
network&#8217;s only control dependencies are in the conditional swaps.</p>

<p>The &#8220;super-optimal&#8221; merge sort does better by allowing itself both
assignments (or tail-calls) and non-trivial control flow: it&#8217;s smaller
than the inlined merge sort (but performs more data movement), and
potentially executes fewer comparisons than the sorting network (with
a larger total code size).  And, essential attribute in practice, it&#8217;s
easy to generate.  This contrasts with optimal sorting networks, for
which we do not have any generation method short of brute force
search; in fact, in practice, sorting networks tend to exploit
suboptimal (by a factor of \(\log n\)) schemes like
<a href="http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/bitonic/bitonicen.htm">bitonic sort</a>
or
<a href="http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/networks/oemen.htm">odd-even merge sort</a>.
Then again, we&#8217;re only concerned with tiny sorts, and asymptotics can
be misleading: Batcher&#8217;s odd-even merge sort happens to be optimal for
\(n\leq 8\).  The issue with sorting networks remains:
data-oblivious control flow pessimises their comparison count.</p>

<h2>Generalising from size 3</h2>

<p>What the last merge sort does is to first sort both halves of the
values (<code>a</code> is trivially sorted, and <code>b c</code> needs one conditional
swap), and then, assuming that each half (<code>a</code> and <code>b c</code>) is sorted,
find the right permutation with which to merge them.  Rather than
\(n!\) permutations, a merge only needs to distinguish between
\(C(n, \lfloor n/2\rfloor) = \frac{n!}{\lfloor n/2\rfloor!\lceil n/2\rceil!}\)
permutations, and the recursive sorts are negligible compared to the
merge step.  That&#8217;s a huge reduction in code size!</p>
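
<p>To get a feel for the numbers, the two leaf counts can be tabulated
directly.  This is only a sketch of the arithmetic; <code>factorial</code> and
<code>merge-leaves</code> are names I picked for the example.</p>

<pre><code>(defun factorial (n)
  (if (zerop n)
      1
      (* n (factorial (1- n)))))

(defun merge-leaves (n)
  "Number of interleavings that the final merge must tell apart:
   C(n, floor(n/2))."
  (let ((half (floor n 2)))
    (/ (factorial n)
       (* (factorial half) (factorial (- n half))))))

(defun leaf-counts (max-n)
  "List of (n n! C(n, floor(n/2))) for n from 2 to MAX-N."
  (loop for n from 2 to max-n
        collect (list n (factorial n) (merge-leaves n))))
</code></pre>

<p>For \(n = 3\), the final merge has 3 leaves instead of 6; for
\(n = 8\), it&#8217;s 70 instead of 40320.</p>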

<p>A simple merge generator fits in half a
<a href="http://wry.me/~darius/hacks/screenfuls/screen3.html">screenful</a>.</p>

<figure class='code'><figcaption><span>unrolled merge generator </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(defun emit-permute (destinations sources)
</span><span class='line'>  ;; (setf values) is parallel assignment
</span><span class='line'>  `(setf (values ,@destinations) (values ,@sources)))
</span><span class='line'>
</span><span class='line'>(defun emit-merge-1 (destinations left right acc)
</span><span class='line'>  "Build a search tree to determine the right permutation to
</span><span class='line'>   merge LEFT and RIGHT, given that each is pre-sorted."
</span><span class='line'>  (cond ((null left)
</span><span class='line'>         (emit-permute destinations (append (reverse acc) right)))
</span><span class='line'>        ((null right)
</span><span class='line'>         (emit-permute destinations (append (reverse acc) left)))
</span><span class='line'>        (t
</span><span class='line'>         `(if (&lt; ,(first right) ,(first left)) ; stable sort
</span><span class='line'>              ,(emit-merge-1 destinations
</span><span class='line'>                             left (rest right)
</span><span class='line'>                             (cons (first right) acc))
</span><span class='line'>              ,(emit-merge-1 destinations
</span><span class='line'>                             (rest left) right
</span><span class='line'>                             (cons (first left) acc))))))
</span><span class='line'>
</span><span class='line'>(defun emit-merge (left right)
</span><span class='line'>  (emit-merge-1 (append left right) left right nil))</span></code></pre></td></tr></table></div></figure>


<p>Given two lists of sorted variables, <code>emit-merge</code> calls <code>emit-merge-1</code>
to generate code that finds the right permutation, and executes it at
the leaf.  A binary search tree is generated by keeping track of the
merged list in a reverse-order (to enable tail-sharing) accumulator of
variable names.  As expected, when merging a list of length one with
another of length two, we get pretty much the code I wrote by hand
earlier.</p>

<pre><code>CL-USER&gt; (emit-merge '(a) '(b c))
(if (&lt; b a)
    (if (&lt; c a)
        (setf (values a b c) (values b c a))
        (setf (values a b c) (values b a c)))
    (setf (values a b c) (values a b c)))
</code></pre>

<p>There&#8217;s one striking weakness: we generate useless code for the
identity permutation.  We could detect that case, or, more generally,
we could find the cycle decomposition of each permutation and use it
to minimise temporary values; that&#8217;d implicitly take care of cases
like <code>(setf (values a b c) (values b a c))</code>, in which some values are
left unaffected.</p>

<h2>A smarter permutation generator</h2>

<p>I&#8217;ll represent permutations as association lists (alists), from source to
destination.  Finding a cycle is easy: just walk the permutation from
an arbitrary value until we loop back.</p>

<figure class='code'><figcaption><span>extract a single cycle from a linearly-represented permutation </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(defun find-cycle (mapping)
</span><span class='line'>  "Extract an arbitrary cycle from a non-empty mapping,
</span><span class='line'>   returning both the cycle and the rest of the mapping."
</span><span class='line'>  (assert mapping)
</span><span class='line'>  (let* ((head  (pop mapping))
</span><span class='line'>         (cycle (list (cdr head))))
</span><span class='line'>    (loop
</span><span class='line'>     (let* ((next-source (first cycle))
</span><span class='line'>            (pair        (assoc next-source mapping)))
</span><span class='line'>       (cond (pair
</span><span class='line'>              (push (cdr pair) cycle)
</span><span class='line'>              ;; if this sucks enough to matter, the output
</span><span class='line'>              ;; will be humongous anyway
</span><span class='line'>              (setf mapping (remove pair mapping)))
</span><span class='line'>             (t
</span><span class='line'>              (assert (eql next-source (first head)))
</span><span class='line'>              (return (values cycle mapping))))))))</span></code></pre></td></tr></table></div></figure>


<p>To generate the code corresponding to a permutation, I extract all
the cycles and execute each one with a <code>rotatef</code>.</p>

<figure class='code'><figcaption><span>cycle-decomposition-based permute generator </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(defun emit-permute (destinations sources)
</span><span class='line'>  "Emit a [destinations &lt;- sources] permutation via its
</span><span class='line'>   cycle decomposition"
</span><span class='line'>  ;; source -> destination alist, minus trivial pairs
</span><span class='line'>  (let ((mapping (remove-if (lambda (pair)
</span><span class='line'>                              (eql (car pair) (cdr pair)))
</span><span class='line'>                            (pairlis sources destinations))))
</span><span class='line'>    `(progn
</span><span class='line'>       ,@(loop while mapping
</span><span class='line'>               collect
</span><span class='line'>               (multiple-value-bind (cycle new-mapping)
</span><span class='line'>                   (find-cycle mapping)
</span><span class='line'>                 (setf mapping new-mapping)
</span><span class='line'>                 `(rotatef ,@cycle))))))</span></code></pre></td></tr></table></div></figure>


<p>The merge step for a sort of size three is now a bit more explicit,
but likely compiles to code that uses fewer registers as well. It
probably doesn&#8217;t matter on good SSA-based backends, but those are the
exception rather than the norm in the Lisp world.</p>

<pre><code>CL-USER&gt; (emit-merge '(a) '(b c))
(if (&lt; b a)
    (if (&lt; c a)
        (progn (rotatef a b c))
        (progn (rotatef a b)))
    (progn))
</code></pre>
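
<p>As a sanity check that a cycle really performs the parallel
assignment it replaces, the two forms can be compared on plain
variables.  The wrapper functions are only there for the
demonstration.</p>

<pre><code>(defun permute-with-setf (a b c)
  "Parallel assignment, as emitted by the first generator."
  (setf (values a b c) (values b c a))
  (list a b c))

(defun permute-with-rotatef (a b c)
  "The same permutation, executed as a single three-cycle."
  (rotatef a b c)
  (list a b c))
</code></pre>

<p>Both <code>(permute-with-setf 1 2 3)</code> and <code>(permute-with-rotatef 1 2 3)</code>
return <code>(2 3 1)</code>.</p>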

<h2>Adding the recursive steps</h2>

<p>The only thing missing for a merge sort is to add base cases and
recursion.  The base case is easy: lists of length one are sorted.
Inlining recursion is trivial, as is usually the case when generating
Lisp code.</p>

<figure class='code'><figcaption><span>&#8220;super-optimal&#8221; inlined merge sort </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(defun emit-sort-1 (values length)
</span><span class='line'>  (when (> length 1)
</span><span class='line'>    (let* ((split (truncate length 2))
</span><span class='line'>           (left  (subseq values 0 split))
</span><span class='line'>           (right (subseq values split)))
</span><span class='line'>      `(progn
</span><span class='line'>         ,(emit-sort-1 left  split)
</span><span class='line'>         ,(emit-sort-1 right (- length split))
</span><span class='line'>         ,(emit-merge left right)))))
</span><span class='line'>
</span><span class='line'>(defun emit-sort (values)
</span><span class='line'>  (emit-sort-1 values (length values)))
</span><span class='line'>
</span><span class='line'>(defmacro inline-sort (&rest values)
</span><span class='line'>  (let* ((pairs (loop for value in values
</span><span class='line'>                      collect `(,(gensym "TEMP") ,value)))
</span><span class='line'>         (temps (mapcar #'first pairs)))
</span><span class='line'>    `(let ,pairs
</span><span class='line'>       ,(emit-sort temps)
</span><span class='line'>       (values ,@temps))))</span></code></pre></td></tr></table></div></figure>


<p>The resulting three-value sorter looks good; there are some
redundancies with nested or empty <code>progn</code>s, but any half-decent
compiler will take care of that.  Python, SBCL&#8217;s compiler, certainly does a good job on
that code.</p>

<pre><code>CL-USER&gt; (emit-sort '(a b c))
(progn
 nil
 (progn
  nil
  nil
  (if (&lt; c b)
      (progn (rotatef b c))
      (progn)))
 (if (&lt; b a)
     (if (&lt; c a)
         (progn (rotatef a b c))
         (progn (rotatef a b)))
     (progn)))
CL-USER&gt; (disassemble (lambda (a b c)
                        (declare (type fixnum a b c))
                        (inline-sort a b c)))
; disassembly for (lambda (a b c))
; 0E88F150:       498BD0           mov rdx, r8                ; no-arg-parsing entry point
;       53:       498BC9           mov rcx, r9
;       56:       498BDA           mov rbx, r10
;       59:       4D39CA           cmp r10, r9
;       5C:       7C30             jl L3
;       5E: L0:   4C39C1           cmp rcx, r8
;       61:       7D0B             jnl L1
;       63:       4C39C3           cmp rbx, r8
;       66:       7C1B             jl L2
;       68:       488BD1           mov rdx, rcx
;       6B:       498BC8           mov rcx, r8
;       6E: L1:   488BF9           mov rdi, rcx
;       71:       488BF3           mov rsi, rbx
;       74:       488D5D10         lea rbx, [rbp+16]
;       78:       B906000000       mov ecx, 6
;       7D:       F9               stc
;       7E:       488BE5           mov rsp, rbp
;       81:       5D               pop rbp
;       82:       C3               ret
;       83: L2:   488BD1           mov rdx, rcx
;       86:       488BCB           mov rcx, rbx
;       89:       498BD8           mov rbx, r8
;       8C:       EBE0             jmp L1
;       8E: L3:   498BCA           mov rcx, r10
;       91:       498BD9           mov rbx, r9
;       94:       EBC8             jmp L0
</code></pre>

<p>I thought about generating calls to <code>values</code> in the final merge,
rather than permuting, but decided against it: I know SBCL doesn&#8217;t
generate clever code when permuting registers, and that&#8217;d result in
avoidable spills.  I also considered generating code in CPS rather
than emitting assignments; again, I decided against it because I can&#8217;t
depend on SBCL to emit clever permutation code.  The transformation
would make sense in a dialect with weaker support (both compiler and
social) for assignment.</p>

<h2>How good is the generated code?</h2>

<p>This inline merge sort and the original, permutation-free (except
at the leaves) one actually define the exact same algorithm.  For
any input, both (should) execute the same comparisons in the same
order: the original inline merge sort simply inlines the whole set of
execution traces, without even merging control flow.</p>

<p>The permutation-free sort guarantees that it never performs redundant
comparisons.  Whether it performs the strict minimum number of
comparisons, either on average or in the worst case, is another question.
At first, I thought that \(\lg n!\) should be a good estimate, since
the search tree seems optimally balanced.  The problem is that \(n!\)
tends to have many prime factors other than 2, so we can expect
several comparisons in each execution to extract less than one bit of
information.  The lower bound can thus be fairly far from the
actual value… Still, this question is only practically relevant for
tiny sorts, so the discrepancy shouldn&#8217;t be too large.</p>
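
<p>A small worked case shows the gap.  For \(n = 3\) there are
\(3! = 6\) permutations to tell apart, so \(\lg 3! \approx 2.58\).  A
binary decision tree with six leaves can&#8217;t be perfectly balanced: the
best it can do is two leaves at depth 2 and four at depth 3, so the
average is \((2 \cdot 2 + 4 \cdot 3) / 6 = 8/3 \approx 2.67\)
comparisons and the worst case is 3.  Both exceed \(\lg 3!\), because
6 isn&#8217;t a power of two.</p>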

<p>A simple way to get the minimum, average or maximum comparison count
would be to annotate the permutation-free generator to compute the
shortest, average and longest path as it generates the search tree.</p>

<p>I&#8217;ll instead mimic the current generator.</p>

<p>The first step is to find the number of comparisons to perform a
merge of <code>m</code> and <code>n</code> values.</p>

<p>If either <code>m</code> or <code>n</code> is zero, merging the sequences is trivial.</p>

<p>Otherwise, the minimum number of comparisons is <code>(min m n)</code>: this
happens when the two sequences are already in order relative to each
other, and the shorter sequence comes first.  The
maximum is <code>(1- (+ m n))</code>.  I couldn&#8217;t find a simple expression for
the average over all permutations.  Instead, I iterate over all
possible interleavings of the two subsequences and count the number of
comparisons until either subsequence is exhausted.</p>

<figure class='code'><figcaption><span>compute the number of comparisons during merges </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
<span class='line-number'>23</span>
<span class='line-number'>24</span>
<span class='line-number'>25</span>
<span class='line-number'>26</span>
<span class='line-number'>27</span>
<span class='line-number'>28</span>
<span class='line-number'>29</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(defun merge-count (m n)
</span><span class='line'>  ;; return the min, expected and max number of comparisons
</span><span class='line'>  ;; needed to merge presorted subsequences of m and n values
</span><span class='line'>  (if (zerop (min m n))
</span><span class='line'>      (values 0 0 0)
</span><span class='line'>      (let ((comparisons 0)
</span><span class='line'>            (count 0)
</span><span class='line'>            (min   most-positive-fixnum)
</span><span class='line'>            (max   0))
</span><span class='line'>        (dotimes (i (ash 1 (+ m n))
</span><span class='line'>                    (values min
</span><span class='line'>                            (/ comparisons
</span><span class='line'>                               count)
</span><span class='line'>                            max))
</span><span class='line'>          ;; only consider combinations with m ones
</span><span class='line'>          ;; (and n zeros)
</span><span class='line'>          (when (= (logcount i) m)
</span><span class='line'>            (let ((cmp (1- (+ m n)))
</span><span class='line'>                  (mask i))
</span><span class='line'>              ;; no more comparison needed until two consecutive
</span><span class='line'>              ;; elements from distinct subsequences
</span><span class='line'>              (loop while (eql (logbitp 0 mask)
</span><span class='line'>                               (logbitp 1 mask))
</span><span class='line'>                    do (decf cmp)
</span><span class='line'>                       (setf mask (ash mask -1)))
</span><span class='line'>              (setf min (min min cmp)
</span><span class='line'>                    max (max max cmp))
</span><span class='line'>              (incf comparisons cmp)
</span><span class='line'>              (incf count)))))))</span></code></pre></td></tr></table></div></figure>
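
<p>The bit-twiddling in <code>merge-count</code> can be cross-checked with a
more literal simulation.  The sketch below is not part of the original
code: <code>merge-count-by-simulation</code> is a hypothetical helper that
walks every way the merge can unfold, charging one comparison for each
output element while both runs are non-empty.  It assumes, like
<code>merge-count</code>, that every interleaving is equally likely.</p>

<pre><code>(defun merge-count-by-simulation (m n)
  "Return the min, average and max number of comparisons a two-way
   merge performs on presorted runs of M and N values, by walking
   every interleaving.  (Hypothetical helper, for cross-checking.)"
  (let ((total  0)
        (leaves 0)
        (min    most-positive-fixnum)
        (max    0))
    (labels ((walk (left right comparisons)
               (if (or (zerop left) (zerop right))
                   ;; one run is exhausted: the rest is copied for free,
                   ;; and this leaf stands for exactly one interleaving
                   (setf total  (+ total comparisons)
                         leaves (1+ leaves)
                         min    (min min comparisons)
                         max    (max max comparisons))
                   ;; otherwise, one comparison picks the next head
                   (progn
                     (walk (1- left) right (1+ comparisons))
                     (walk left (1- right) (1+ comparisons))))))
      (walk m n 0))
    (values min (/ total leaves) max)))
</code></pre>

<p>For example, <code>(merge-count-by-simulation 1 2)</code> returns 1, 5/3
and 2: of the three interleavings, one needs a single comparison and
the other two need two.</p>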


<p>Counting the number of comparisons in sorts is then trivial, with a
recursive function.  I didn&#8217;t even try to memoise repeated
computations: the generated code is ludicrously long when sorting as
few as 13 or 14 values.</p>

<figure class='code'><figcaption><span>compute the number of comparisons in merge sort </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(defun sort-count (n)
</span><span class='line'>  (if (&lt;= n 1)       ; trivially sorted
</span><span class='line'>      (values 0 0 0)
</span><span class='line'>      (let ((min 0)
</span><span class='line'>            (avg 0)
</span><span class='line'>            (max 0))
</span><span class='line'>        (flet ((inc (min- avg- max-)
</span><span class='line'>                 (incf min min-)
</span><span class='line'>                 (incf avg avg-)
</span><span class='line'>                 (incf max max-)))
</span><span class='line'>          ;; accumulate min/avg/max count from sorting the left and
</span><span class='line'>          ;; right subsequences, and merging
</span><span class='line'>          (multiple-value-call #'inc (sort-count (floor n 2)))
</span><span class='line'>          (multiple-value-call #'inc (sort-count (ceiling n 2)))
</span><span class='line'>          (multiple-value-call #'inc (merge-count (floor n 2)
</span><span class='line'>                                                  (ceiling n 2))))
</span><span class='line'>        (values min avg max))))</span></code></pre></td></tr></table></div></figure>


<pre><code>CL-USER&gt; (loop for i from 2 upto 16
               do (multiple-value-bind (min avg max)
                      (sort-count i)
                    (format t "~4D ~6,2F ~6D ~6,2F ~6D~%"
                            i (log (! i) 2d0)
                            min (float avg) max)))
;; n  lg(n!)   min   avg     max ;   best   network
   2   1.00      1   1.00      1 ;   1      1
   3   2.58      2   2.67      3 ;   3      3
   4   4.58      4   4.67      5 ;   5      5
   5   6.91      5   7.17      8 ;   7      9
   6   9.49      7   9.83     11 ;   10     12
   7  12.30      9  12.73     14 ;   13     16
   8  15.30     12  15.73     17 ;   16     19
   9  18.47     13  19.17     21 ;   19     25?
  10  21.79     15  22.67     25 ;   22     29?
  11  25.25     17  26.29     29 ;   26     35?
  12  28.84     20  29.95     33 ;   30     39?
  13  32.54     22  33.82     37 ;   34     45?
  14  36.34     25  37.72     41 ;   38     51?
  15  40.25     28  41.69     45 ;   42     56?
  16  44.25     32  45.69     49 ;   46?    60?
</code></pre>

<p>I annotated the output with comments (marked with semicolons).  The
columns are, from left to right, the sort size, the theoretical lower
bound (on the average or maximum number of comparisons), the minimum
number of comparisons (depending on the input permutation), the
average (over all input permutations), and the maximum count.  I added
two columns by hand: the optimal worst-case (maximum) comparison
counts (over all sorting methods, copied from the
<a href="http://oeis.org/A036604">OEIS</a>), and the optimal size for sorting
networks, when known (lifted from a table
<a href="http://nn.cs.utexas.edu/downloads/papers/valsalam.utcstr11.pdf">here [pdf]</a>).
Inexact (potentially higher than the optimum) bounds are marked with a
question mark.</p>
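
<p>The session above calls a helper named <code>!</code> that isn&#8217;t shown.
Assuming it is plain factorial, a minimal definition could look like
the sketch below.  <code>comparison-lower-bound</code> is a hypothetical
addition: it rounds \(\lg n!\) up to an integer with exact integer
arithmetic, which gives the information-theoretic bound on the
worst-case comparison count.</p>

<pre><code>(defun ! (n)
  "Factorial of the non-negative integer N (assumed meaning of !)."
  (let ((product 1))
    (loop for i from 2 to n
          do (setf product (* product i)))
    product))

(defun comparison-lower-bound (n)
  "Smallest integer no less than lg(N!), computed exactly:
   for x &gt;= 1, (integer-length (1- x)) is the ceiling of lg(x)."
  (integer-length (1- (! n))))
</code></pre>

<p>For instance, <code>(comparison-lower-bound 5)</code> returns 7, since
\(\lg 120 \approx 6.91\).</p>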

<p>For the input sizes the inline merge sort can reasonably tackle (up to
ten or so), its worst case is reasonably close to the best possible,
and its average case tends to fall between the lower bound and the
best possible.  Over all these sizes, the merge sort&#8217;s worst case
performs fewer comparisons than the optimal or best-known sorting
networks.  The current best upper bounds on the minimal worst-case
comparison count seem to be based on insertion sort passes that
minimise the number of comparisons with a binary search.  I don&#8217;t
believe that&#8217;s directly useful for the current use case, but a similar
trick might be useful to reduce the number of comparisons, at the
expense of somewhat more data movement.</p>

<h2>Making it CL-strength</h2>

<p>That&#8217;s already a decent proof of concept.  It&#8217;s also far too plain to
fit in the industrial-strength Common Lisp way.  An inline sorting
macro worthy of CL should be parameterised on both comparator and key
functions, and work with arbitrary places rather than only variables.</p>

<p>Parameterising the comparator is trivial.  The key could be handled by
calling it at each comparison, but that&#8217;s wasteful.  We&#8217;re generating
code; might as well go for glory.  Just like in the list merge sort,
I&#8217;ll cache calls to the key function in the merge tree.  I&#8217;ll also use
special variables instead of passing a half-dozen parameters around in
the generator.</p>

<figure class='code'><figcaption><span>key/comparator-aware merge generator </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
<span class='line-number'>23</span>
<span class='line-number'>24</span>
<span class='line-number'>25</span>
<span class='line-number'>26</span>
<span class='line-number'>27</span>
<span class='line-number'>28</span>
<span class='line-number'>29</span>
<span class='line-number'>30</span>
<span class='line-number'>31</span>
<span class='line-number'>32</span>
<span class='line-number'>33</span>
<span class='line-number'>34</span>
<span class='line-number'>35</span>
<span class='line-number'>36</span>
<span class='line-number'>37</span>
<span class='line-number'>38</span>
<span class='line-number'>39</span>
<span class='line-number'>40</span>
<span class='line-number'>41</span>
<span class='line-number'>42</span>
<span class='line-number'>43</span>
<span class='line-number'>44</span>
<span class='line-number'>45</span>
<span class='line-number'>46</span>
<span class='line-number'>47</span>
<span class='line-number'>48</span>
<span class='line-number'>49</span>
<span class='line-number'>50</span>
<span class='line-number'>51</span>
<span class='line-number'>52</span>
<span class='line-number'>53</span>
<span class='line-number'>54</span>
<span class='line-number'>55</span>
<span class='line-number'>56</span>
<span class='line-number'>57</span>
<span class='line-number'>58</span>
<span class='line-number'>59</span>
<span class='line-number'>60</span>
<span class='line-number'>61</span>
<span class='line-number'>62</span>
<span class='line-number'>63</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(defvar *inline-sort-comparator*)
</span><span class='line'>(defvar *inline-sort-key*)
</span><span class='line'>(defvar *inline-sort-destinations*)
</span><span class='line'>(defvar *inline-sort-left-head*)
</span><span class='line'>(defvar *inline-sort-right-head*)
</span><span class='line'>
</span><span class='line'>(defun emit-merge-1 (left right acc)
</span><span class='line'>  "Build a search tree to determine the right permutation to
</span><span class='line'>   merge LEFT and RIGHT, given that each is pre-sorted."
</span><span class='line'>  ;; stability trickery
</span><span class='line'>  `(if (funcall ,*inline-sort-comparator* ,*inline-sort-right-head*
</span><span class='line'>                                          ,*inline-sort-left-head*)
</span><span class='line'>       ,(let* ((acc        (cons (first right) acc))
</span><span class='line'>               (right      (rest right)))
</span><span class='line'>          ;; pop from RIGHT, and recurse if RIGHT isn't empty.
</span><span class='line'>          (if right
</span><span class='line'>              `(let ((,*inline-sort-right-head*
</span><span class='line'>                       (funcall ,*inline-sort-key* ,(first right))))
</span><span class='line'>                 ,(emit-merge-1 left right acc))
</span><span class='line'>              (emit-permute *inline-sort-destinations*
</span><span class='line'>                            (append (reverse acc) left))))
</span><span class='line'>       ;; same
</span><span class='line'>       ,(let* ((acc  (cons (first left) acc))
</span><span class='line'>               (left (rest left)))
</span><span class='line'>          (if left
</span><span class='line'>              `(let ((,*inline-sort-left-head*
</span><span class='line'>                       (funcall ,*inline-sort-key* ,(first left))))
</span><span class='line'>                 ,(emit-merge-1 left right acc))
</span><span class='line'>              (emit-permute *inline-sort-destinations*
</span><span class='line'>                            (append (reverse acc) right))))))
</span><span class='line'>
</span><span class='line'>(defun emit-merge (left right)
</span><span class='line'>  "Caching calls to KEY means we have to special-case empty lists
</span><span class='line'>   (which doesn't happen when we sort, anyway)"
</span><span class='line'>  (cond ((null left)
</span><span class='line'>         (emit-permute right right))
</span><span class='line'>        ((null right)
</span><span class='line'>         (emit-permute left left))
</span><span class='line'>        (t
</span><span class='line'>         (let ((*inline-sort-destinations* (append left right))
</span><span class='line'>               (*inline-sort-left-head*  (gensym "LEFT-HEAD-KEY"))
</span><span class='line'>               (*inline-sort-right-head* (gensym "RIGHT-HEAD-KEY")))
</span><span class='line'>           `(let ((,*inline-sort-left-head*  (funcall ,*inline-sort-key*
</span><span class='line'>                                                      ,(first left)))
</span><span class='line'>                  (,*inline-sort-right-head* (funcall ,*inline-sort-key*
</span><span class='line'>                                                      ,(first right))))
</span><span class='line'>              ,(emit-merge-1 left right nil))))))
</span><span class='line'>
</span><span class='line'>(defun emit-sort-1 (values length)
</span><span class='line'>  "Unrolled and inlined recursive merge sort generator.
</span><span class='line'>   Lists of length 1 or less are trivially sorted; recurse
</span><span class='line'>   on the rest."
</span><span class='line'>  (when (> length 1)
</span><span class='line'>    (let* ((split (truncate length 2))
</span><span class='line'>           (left  (subseq values 0 split))
</span><span class='line'>           (right (subseq values split)))
</span><span class='line'>      `(progn
</span><span class='line'>         ,(emit-sort-1 left  split)
</span><span class='line'>         ,(emit-sort-1 right (- length split))
</span><span class='line'>         ,(emit-merge left right)))))
</span><span class='line'>
</span><span class='line'>(defun emit-sort (values *inline-sort-comparator* *inline-sort-key*)
</span><span class='line'>  (emit-sort-1 values (length values)))</span></code></pre></td></tr></table></div></figure>


<p>Finally, handling arbitrary places, thus letting the macro take care
of writing results back to the places, is just regular macrology.</p>

<figure class='code'><figcaption><span>CL-style inline sort macro </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
<span class='line-number'>23</span>
<span class='line-number'>24</span>
<span class='line-number'>25</span>
<span class='line-number'>26</span>
<span class='line-number'>27</span>
<span class='line-number'>28</span>
<span class='line-number'>29</span>
<span class='line-number'>30</span>
<span class='line-number'>31</span>
<span class='line-number'>32</span>
<span class='line-number'>33</span>
<span class='line-number'>34</span>
<span class='line-number'>35</span>
<span class='line-number'>36</span>
<span class='line-number'>37</span>
<span class='line-number'>38</span>
<span class='line-number'>39</span>
<span class='line-number'>40</span>
<span class='line-number'>41</span>
<span class='line-number'>42</span>
<span class='line-number'>43</span>
<span class='line-number'>44</span>
<span class='line-number'>45</span>
<span class='line-number'>46</span>
<span class='line-number'>47</span>
<span class='line-number'>48</span>
<span class='line-number'>49</span>
<span class='line-number'>50</span>
<span class='line-number'>51</span>
<span class='line-number'>52</span>
<span class='line-number'>53</span>
<span class='line-number'>54</span>
<span class='line-number'>55</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(defmacro inline-sort ((comparator &key key (overwrite t))
</span><span class='line'>                       &body values
</span><span class='line'>                       &environment env)
</span><span class='line'>  "Sorts all VALUES in increasing order with respect to COMPARATOR and
</span><span class='line'>   KEY.  COMPARATOR should be a strict order, like CL:&lt;, and KEY defaults
</span><span class='line'>   to NIL (which is interpreted as the identity).  By default, the result
</span><span class='line'>   is written back to the places; that's skipped if OVERWRITE is NIL. A
</span><span class='line'>   literal NIL value for overwrite will avoid generating any write.
</span><span class='line'>   The SORT form always evaluates to the sorted values, in order."
</span><span class='line'>  (let (vars vals
</span><span class='line'>        store-vars writer-forms
</span><span class='line'>        reader-forms
</span><span class='line'>        temps
</span><span class='line'>        (_comparator (gensym "COMPARATOR"))
</span><span class='line'>        (_key        (gensym "KEY"))
</span><span class='line'>        (_overwrite  (gensym "OVERWRITE")))
</span><span class='line'>    (loop for value in (reverse values) do
</span><span class='line'>      (push (gensym "TEMP") temps)
</span><span class='line'>      ;; only use the setf expansion if we might write to the place.
</span><span class='line'>      (if (not overwrite)
</span><span class='line'>          (push value reader-forms)
</span><span class='line'>          (multiple-value-bind (var val store-var writer reader)
</span><span class='line'>              (get-setf-expansion value env)
</span><span class='line'>            (setf vars (append var vars)
</span><span class='line'>                  vals (append val vals))
</span><span class='line'>            (push store-var store-vars)
</span><span class='line'>            (push writer writer-forms)
</span><span class='line'>            (push reader reader-forms))))
</span><span class='line'>    `(let* ((,_comparator ,comparator)
</span><span class='line'>            (,_comparator (if (functionp ,_comparator)
</span><span class='line'>                              ,_comparator
</span><span class='line'>                              (symbol-function ,_comparator)))
</span><span class='line'>            (,_key        ,(or key '#'identity))
</span><span class='line'>            (,_key        (if (functionp ,_key)
</span><span class='line'>                              ,_key
</span><span class='line'>                              (symbol-function ,_key)))
</span><span class='line'>            (,_overwrite  ,overwrite)
</span><span class='line'>            ,@(mapcar 'list vars vals)
</span><span class='line'>            ,@(mapcar 'list temps reader-forms))
</span><span class='line'>       (declare (ignorable ,_comparator ,_key ,_overwrite))
</span><span class='line'>       ,(emit-sort temps _comparator _key)
</span><span class='line'>       ,(and overwrite
</span><span class='line'>             `(when ,_overwrite
</span><span class='line'>                ,@(loop
</span><span class='line'>                    for value in values
</span><span class='line'>                    for store-var-list in store-vars
</span><span class='line'>                    for writer in writer-forms
</span><span class='line'>                    for temp in temps
</span><span class='line'>                    collect
</span><span class='line'>                    (progn
</span><span class='line'>                      (unless (= 1 (length store-var-list))
</span><span class='line'>                        (error "Can't sort multiple-value place ~S" value))
</span><span class='line'>                      `(let ((,(first store-var-list) ,temp))
</span><span class='line'>                         ,writer)))))
</span><span class='line'>       (values ,@temps))))</span></code></pre></td></tr></table></div></figure>


<p>Now, the macro can be used to sort, e.g., vectors of double floats
&#8220;in-place&#8221; (inasmuch as copying everything to registers can be
considered in-place).</p>

<pre><code>CL-USER&gt; (macroexpand-1 `(inline-sort (#'&lt; :key #'-)
                           (aref array 0) (aref array 1) (aref array 2)))
(let* ((#:comparator1184 #'&lt;)
       (#:comparator1184
        (if (functionp #:comparator1184)
            #:comparator1184
            (symbol-function #:comparator1184)))
       (#:key1185 #'-)
       (#:key1185
        (if (functionp #:key1185)
            #:key1185
            (symbol-function #:key1185)))
       (#:overwrite1186 t)
       (#:array1195 array)
       (#:array1192 array)
       (#:array1189 array)
       (#:temp1193 (aref #:array1195 0))
       (#:temp1190 (aref #:array1192 1))
       (#:temp1187 (aref #:array1189 2)))
  (declare (ignorable #:comparator1184 #:key1185 #:overwrite1186))
  (progn
   nil
   (progn
    nil
    nil
    (let ((#:left-head-key1196 (funcall #:key1185 #:temp1190))
          (#:right-head-key1197 (funcall #:key1185 #:temp1187)))
      (if (funcall #:comparator1184 #:right-head-key1197 #:left-head-key1196)
          (progn (rotatef #:temp1190 #:temp1187))
          (progn))))
   (let ((#:left-head-key1198 (funcall #:key1185 #:temp1193))
         (#:right-head-key1199 (funcall #:key1185 #:temp1190)))
     (if (funcall #:comparator1184 #:right-head-key1199 #:left-head-key1198)
         (let ((#:right-head-key1199 (funcall #:key1185 #:temp1187)))
           (if (funcall #:comparator1184 #:right-head-key1199
                        #:left-head-key1198)
               (progn (rotatef #:temp1193 #:temp1190 #:temp1187))
               (progn (rotatef #:temp1193 #:temp1190))))
         (progn))))
  (when #:overwrite1186
    (let ((#:new1194 #:temp1193))
      (sb-kernel:%aset #:array1195 0 #:new1194))
    (let ((#:new1191 #:temp1190))
      (sb-kernel:%aset #:array1192 1 #:new1191))
    (let ((#:new1188 #:temp1187))
      (sb-kernel:%aset #:array1189 2 #:new1188)))
  (values #:temp1193 #:temp1190 #:temp1187))
t
CL-USER&gt; (disassemble (lambda (array)
                        (declare (type (simple-array double-float (3)) array))
                        (inline-sort (#'&lt; :key #'-)
                          (aref array 0) (aref array 1) (aref array 2))
                        array))
; disassembly for (lambda (array))
; 0C5A5661:       F20F105201       movsd XMM2, [rdx+1]        ; no-arg-parsing entry point
;      666:       F20F104209       movsd XMM0, [rdx+9]
;      66B:       F20F104A11       movsd XMM1, [rdx+17]
;      670:       660F28E0         movapd XMM4, XMM0
;      674:       660F5725A4000000 xorpd XMM4, [rip+164]
;      67C:       660F28D9         movapd XMM3, XMM1
;      680:       660F571D98000000 xorpd XMM3, [rip+152]
;      688:       660F2FDC         comisd XMM3, XMM4
;      68C:       7A02             jp L0
;      68E:       7267             jb L3
;      690: L0:   660F28DA         movapd XMM3, XMM2
;      694:       660F571D84000000 xorpd XMM3, [rip+132]      ; negate double-float
;      69C:       660F28E0         movapd XMM4, XMM0
;      6A0:       660F572578000000 xorpd XMM4, [rip+120]
;      6A8:       660F2FE3         comisd XMM4, XMM3
;      6AC:       7A26             jp L1
;      6AE:       7324             jnb L1
;      6B0:       660F28E1         movapd XMM4, XMM1
;      6B4:       660F572564000000 xorpd XMM4, [rip+100]
;      6BC:       660F2FE3         comisd XMM4, XMM3
;      6C0:       7A27             jp L2
;      6C2:       7325             jnb L2
;      6C4:       660F28DA         movapd XMM3, XMM2
;      6C8:       660F28D0         movapd XMM2, XMM0
;      6CC:       660F28C1         movapd XMM0, XMM1
;      6D0:       660F28CB         movapd XMM1, XMM3
;      6D4: L1:   F20F115201       movsd [rdx+1], XMM2
;      6D9:       F20F114209       movsd [rdx+9], XMM0
;      6DE:       F20F114A11       movsd [rdx+17], XMM1
;      6E3:       488BE5           mov rsp, rbp
;      6E6:       F8               clc
;      6E7:       5D               pop rbp
;      6E8:       C3               ret
;      6E9: L2:   660F28DA         movapd XMM3, XMM2
;      6ED:       660F28D0         movapd XMM2, XMM0
;      6F1:       660F28C3         movapd XMM0, XMM3
;      6F5:       EBDD             jmp L1
;      6F7: L3:   660F28D8         movapd XMM3, XMM0
;      6FB:       660F28C1         movapd XMM0, XMM1
;      6FF:       660F28CB         movapd XMM1, XMM3
;      703:       EB8B             jmp L0
</code></pre>
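
<p>To see the place-handling machinery in isolation, here is a minimal
sketch of the same idea for exactly two places.  The macro
<code>sort-two-places</code> is hypothetical and much simpler than
<code>inline-sort</code> (no key, no <code>overwrite</code> option); it only
relies on the standard <code>get-setf-expansion</code>, and evaluates each
place&#8217;s subforms once.</p>

<pre><code>(defmacro sort-two-places (comparator place-a place-b &amp;environment env)
  "Hypothetical two-place cousin of INLINE-SORT: leave the smaller
   value (with respect to COMPARATOR) in PLACE-A, the other in PLACE-B."
  (multiple-value-bind (vars-a vals-a stores-a writer-a reader-a)
      (get-setf-expansion place-a env)
    (multiple-value-bind (vars-b vals-b stores-b writer-b reader-b)
        (get-setf-expansion place-b env)
      (unless (and (= 1 (length stores-a)) (= 1 (length stores-b)))
        (error "Can't sort multiple-value places ~S and ~S"
               place-a place-b))
      (let ((cmp (gensym "COMPARATOR"))
            (a   (gensym "A"))
            (b   (gensym "B")))
        `(let* ((,cmp ,comparator)
                ;; evaluate the subforms of each place exactly once
                ,@(mapcar #'list vars-a vals-a)
                ,@(mapcar #'list vars-b vals-b)
                ;; read both places into temporaries
                (,a ,reader-a)
                (,b ,reader-b))
           ;; swap only on a strict inversion
           (when (funcall ,cmp ,b ,a)
             (rotatef ,a ,b))
           ;; write the temporaries back through the store variables
           (let ((,(first stores-a) ,a)
                 (,(first stores-b) ,b))
             ,writer-a
             ,writer-b)
           (values ,a ,b))))))
</code></pre>

<p>With <code>v</code> bound to <code>(vector 3 1)</code>, evaluating
<code>(sort-two-places #'&lt; (aref v 0) (aref v 1))</code> leaves
<code>v</code> as <code>#(1 3)</code>.</p>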

<h2>Bonus: Hooking in SBCL</h2>

<p>The inline sort supports the same options as <code>CL:SORT</code>, so it&#8217;d be
really interesting to opportunistically compile calls to the latter
into size-specialised inline sorts.  The usual, portable way to code
that sort of macro qua source-to-source optimiser in CL is with
compiler macros; compiler macros have access to all the usual
macroexpansion-time utilities, but the function definition is left in
place.  That way the user can still use the function as a first-class
function, and the compiler macro can decline the transformation if a
regular call would work better (and the compiler can ignore any
compiler macro).  That&#8217;s not enough for our needs, though… and there
can only be one compiler macro per function, so adding one to code we
don&#8217;t own is a bad idea.</p>
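
<p>For readers who haven&#8217;t met compiler macros, here is a minimal sketch
on a function we do own.  <code>sum-list</code> and its rewrite rule are
hypothetical, chosen only to show the two behaviours described above:
the function stays available as a first-class value, and the compiler
macro declines by returning the original form.</p>

<pre><code>(defun sum-list (list)
  "Sum the numbers in LIST."
  (reduce #'+ list))

(define-compiler-macro sum-list (&amp;whole form arg)
  ;; Only rewrite calls whose argument is a literal (list ...) form
  ;; with at most four elements; otherwise decline by returning FORM.
  (if (and (consp arg)
           (eq (first arg) 'list)
           (&lt;= (length (rest arg)) 4))
      `(+ ,@(rest arg))
      form))
</code></pre>

<p>A call like <code>(sum-list (list a b))</code> may be compiled as
<code>(+ a b)</code>, while <code>(sum-list xs)</code> and
<code>(funcall #'sum-list xs)</code> still go through the regular
function.</p>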

<p>Python&#8217;s first internal representation (ir1) is optimised by
iteratively deriving tighter type information, and (mostly) pattern
matching on the type of function calls.  Its DEFTRANSFORM form lets us
add new rules, and there may be an arbitrary number of such rules for
each function.</p>

<figure class='code'><figcaption><span>Hooking our sort generator in SBCL </span></figcaption>
<div class="highlight"><table><tr><td class='code'><pre><code class=''><span class='line'>(in-package "SB-C")
</span><span class='line'>(defvar *unrolled-vector-sort-max-length* 8)
</span><span class='line'>
</span><span class='line'>(defun maybe-emit-unrolled-merge-sort (node sequence key)
</span><span class='line'>  (unless (policy node (> speed space))
</span><span class='line'>    (give-up-ir1-transform))
</span><span class='line'>  (let* ((sequence-type (lvar-type sequence))
</span><span class='line'>         (dimensions (array-type-dimensions-or-give-up
</span><span class='line'>                      sequence-type)))
</span><span class='line'>    (unless (typep dimensions '(cons number null))
</span><span class='line'>      (give-up-ir1-transform
</span><span class='line'>       "~@&lt;sequence argument isn't a vector of known length~:@>"))
</span><span class='line'>    (let ((length (first dimensions)))
</span><span class='line'>      (when (> length *unrolled-vector-sort-max-length*)
</span><span class='line'>        (give-up-ir1-transform
</span><span class='line'>         "~@&lt;sequence argument too long for unrolled sort ~
</span><span class='line'>              (length ~S greater than ~S)~:@>"
</span><span class='line'>         length *unrolled-vector-sort-max-length*))
</span><span class='line'>      (if (&lt;= length 1)
</span><span class='line'>          'sequence
</span><span class='line'>          `(with-array-data ((array sequence)
</span><span class='line'>                             (start)
</span><span class='line'>                             (end))
</span><span class='line'>             (declare (optimize (insert-array-bounds-checks 0))
</span><span class='line'>                      (ignore end))
</span><span class='line'>             (inline-sort
</span><span class='line'>                 ((%coerce-callable-to-fun predicate)
</span><span class='line'>                  :key ,(if key
</span><span class='line'>                            '(%coerce-callable-to-fun key)
</span><span class='line'>                            '#'identity))
</span><span class='line'>               ,@(loop for i below length
</span><span class='line'>                       collect `(aref array (+ start ,i))))
</span><span class='line'>             sequence)))))
</span><span class='line'>
</span><span class='line'>(deftransform sort ((sequence predicate &key key)
</span><span class='line'>                    * * :node node)
</span><span class='line'>  "unroll sort of short vectors"
</span><span class='line'>  (maybe-emit-unrolled-merge-sort node sequence key))
</span><span class='line'>
</span><span class='line'>(deftransform stable-sort ((sequence predicate &key key)
</span><span class='line'>                           * * :node node)
</span><span class='line'>  "unroll stable-sort of short vectors"
</span><span class='line'>  (maybe-emit-unrolled-merge-sort node sequence key))</span></code></pre></td></tr></table></div></figure>


<p>The two deftransforms at the end define new rules that match on calls
to <code>CL:SORT</code> and <code>CL:STABLE-SORT</code>, with arbitrary argument types and
return types: basic type checks are performed elsewhere, and
<code>maybe-emit-unrolled-merge-sort</code> does the rest.  Transforms are
identified by the docstring (which also improves compiler notes), and
the argument and return types, so the forms are reevaluation-safe.</p>

<p>All the logic lies in <code>maybe-emit-unrolled-merge-sort</code>.  The <code>policy</code>
form checks that the optimisation policy at the call node has <code>speed</code>
greater than <code>space</code>, and gives up on the transformation otherwise.
The next step is to make sure the sequence argument is an array, and
that its dimensions are known and define a vector (its dimension list
is a list of one number).  The final guard makes sure we only
specialise on small sorts (at most
<code>*unrolled-vector-sort-max-length*</code>).</p>

<p>Finally, we get to code generation itself.  A vector of length 1 or 0
is trivially pre-sorted.  I could also directly emit the inner
<code>inline-sort</code> form, but SBCL has some stuff to hoist out computations
related to hairy arrays.  <code>with-array-data</code> takes a potentially
complex array (e.g. displaced, or not a vector), and binds the
relevant variables to the underlying simple array of rank 1, and the
start and end indices corresponding to the range we defined
(defaulting to the whole range of the input array).  Bound checks are
eliminated because static information ensures the accesses are safe
(or the user lied and asked not to insert type checks earlier), and
the <code>start</code> index is declared to be small enough that we can add to it
without overflow &#8211; Python doesn&#8217;t implement the sort of sophisticated
shape analyses that could figure that out.  Finally, a straight
<code>inline-sort</code> form can be emitted.</p>

<p>That machinery means we&#8217;ll get quicker and shorter inline sort code
when the size is known ahead of time.  For example, a quick
disassembly shows the following is about one fourth the size of the
size-generic inline code (with <code>(simple-array double-float (*))</code>, and
<code>(optimize speed (space 0))</code>).</p>

<pre><code>CL-USER&gt; (lambda (x)
           (declare (type (simple-array double-float (4)) x)
                    (optimize speed))
           (sort x #'&lt;))
#&lt;FUNCTION (lambda (x)) {100BFF457B}&gt;
CL-USER&gt; (disassemble *)
; disassembly for (lambda (x))
; 0BFF45CF:       F20F105A01       movsd XMM3, [rdx+1]        ; no-arg-parsing entry point
;      5D4:       F20F104209       movsd XMM0, [rdx+9]
;      5D9:       F20F104A11       movsd XMM1, [rdx+17]
;      5DE:       F20F105219       movsd XMM2, [rdx+25]
;      5E3:       660F2FC3         comisd XMM0, XMM3
;      5E7:       7A0E             jp L0
;      5E9:       730C             jnb L0
;      5EB:       660F28E3         movapd XMM4, XMM3
;      5EF:       660F28D8         movapd XMM3, XMM0
;      5F3:       660F28C4         movapd XMM0, XMM4
[...]
</code></pre>

<p>Even for vectors of length 8 (the default limit), the specialised
merge sort is shorter than SBCL&#8217;s inlined heapsort, and about three
times as fast on shuffled vectors.</p>

<h2>That&#8217;s it</h2>

<p>It took me much longer to write this up than to code the generator,
but I hope this can be useful to other people.  One thing I&#8217;d like to
note is that sorting networks are much harder to get right than this
generator, and pessimise performance: without branches, there must be
partially redundant computations on non-power-of-two sizes.  In the
absence of solid compiler support for conditional swaps, I doubt the
additional overhead of <em>optimal</em> sorting networks can be compensated
by the simpler control flow, never mind the usual odd-even or bitonic
networks.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[On the importance of keeping microbenchmarks honest]]></title>
    <link href="http://www.pvk.ca/Blog/2012/08/23/on-the-importance-of-keeping-microbenchmarks-honest/"/>
    <updated>2012-08-23T13:01:00+02:00</updated>
    <id>http://www.pvk.ca/Blog/2012/08/23/on-the-importance-of-keeping-microbenchmarks-honest</id>
    <content type="html"><![CDATA[<p><a href="http://tapoueh.org/blog/2012/08/20-performance-the-easiest-way.html">A recent post</a> tried to extract information from a microbenchmark, but the author
absolutely did not care whether the programs computed the right, or
even the same, thing.</p>

<p>The result? Pure noise.</p>

<p><code>(expt 10 10)</code> overflows 32 bit <em>signed</em> integers, so the C version
wound up going through 1410065408 iterations instead.  In fact, signed
overflow is undefined in C, so a sufficiently devious compiler could
cap the iteration count to 65536 and still be standard compliant.</p>
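
<p>Assuming the usual two&#8217;s complement truncation to 32 bits (C itself promises nothing here), the wrapped iteration count is easy to check:</p>

<pre><code>(let ((wrapped (ldb (byte 32 0) (expt 10 10))))
  ;; Bit 31 is clear, so the value is also positive as a signed int.
  (values wrapped (logbitp 31 wrapped)))
;; returns 1410065408 and NIL
</code></pre>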

<p>On SBCL/x86-64, we can do the following and explicitly ask for machine
unsigned arithmetic:</p>

<pre><code>CL-USER&gt; (lambda (max)
           (declare (type (unsigned-byte 64) max)
                    (optimize speed))
           (let ((sum 0))
             (declare (type (unsigned-byte 64) sum))
             (dotimes (i max sum)
               (setf sum (ldb (byte 64 0) (+ sum i))))))
#&lt;FUNCTION (LAMBDA (MAX)) {1004DA3D6B}&gt;
CL-USER&gt; (disassemble *)
; disassembly for (LAMBDA (MAX))
; 04DA3E02:       31C9             XOR ECX, ECX               ; no-arg-parsing entry point
;       04:       31C0             XOR EAX, EAX
;       06:       EB0E             JMP L1
;       08:       0F1F840000000000 NOP
;       10: L0:   4801C1           ADD RCX, RAX
;       13:       48FFC0           INC RAX
;       16: L1:   4839D0           CMP RAX, RDX
;       19:       72F5             JB L0
[ function epilogue ]
</code></pre>

<p>Now that <code>ldb</code> <em>portably</em> ensures modular arithmetic, we
virtually get the exact same thing as what GCC outputs, down to
alignment.  It&#8217;s still slower than the C version because it goes
through 1e10 iterations of the lossy sum, rather than
1.4e9.</p>
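
<p>To make the wrap-around concrete, here&#8217;s a tiny sketch; <code>wrapping-add-64</code> is a hypothetical helper, not something SBCL exports:</p>

<pre><code>(defun wrapping-add-64 (x y)
  "Add X and Y modulo 2^64 by keeping only the low 64 bits of the sum."
  (ldb (byte 64 0) (+ x y)))

(wrapping-add-64 (1- (expt 2 64)) 5) ; returns 4
</code></pre>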

<p>Microbenchmarks are useful to improve our understanding of complex
systems.  Microbenchmarks whose results we completely discard not so
much: if there&#8217;s nothing keeping us or the compiler honest, we might
as well get them to compile to no-ops.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[A one-instruction write barrier]]></title>
    <link href="http://www.pvk.ca/Blog/2012/08/14/a-one-instruction-write-barrier/"/>
    <updated>2012-08-14T21:25:00+02:00</updated>
    <id>http://www.pvk.ca/Blog/2012/08/14/a-one-instruction-write-barrier</id>
    <content type="html"><![CDATA[<p><a href="http://www.hoelzle.org/publications/write-barrier.pdf">Hölzle&#8217;s two-instruction write barrier [PDF]</a>
for garbage collectors looks like</p>

<pre><code> addr = destination
 offset = addr&gt;&gt;k;
 (cards-(heap_base&gt;&gt;k))[offset] = 1 ; mark one byte
 write to addr
</code></pre>

<p>Some SBCL users allocate Lisp object lookalikes in the C heap, and we
have stack-allocated values; I have to test whether the address is in
range or mask the offset to avoid overflows.</p>
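
<p>Here&#8217;s a rough sketch of the masked variant, simulated on a small byte vector rather than on real memory.  The card size, the table size and the function names are all made up for illustration, and the sketch drops <code>heap_base</code>, on the assumption that masking simply makes the table wrap around:</p>

<pre><code>(defparameter *card-shift* 9)    ; assumed: 512-byte cards
(defparameter *card-count* 1024) ; assumed: a power of two, so masking works

(defun make-card-table ()
  (make-array *card-count* :element-type '(unsigned-byte 8)
                           :initial-element 0))

(defun mark-card (cards addr)
  "Mark the card that covers ADDR; the mask keeps any address in range."
  (let ((index (logand (ash addr (- *card-shift*))
                       (1- *card-count*))))
    (setf (aref cards index) 1)
    index))
</code></pre>

<p>Addresses outside the Lisp heap then dirty some unrelated card, which only costs the collector a bit of extra scanning.</p>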

<p>Or, we could exploit X86&#8217;s bit-addressing instructions:</p>

<pre><code> addr = destination
 bts cards, addr
 write to addr
</code></pre>

<p>where <code>cards</code> is a vector of 256 or 512MB (there&#8217;s some trickery to handle
negative offsets). <code>bts</code> will index into that vector of 4G bits, and
set the corresponding bit to 1.  On X86-64, we can force <code>cards</code> to be
in the lower 4GB, and stick to 32 bit addressing: the instruction will
also implicitly mask out the upper 32 bits of <code>addr</code> before indexing
into <code>cards</code>.  Too bad it&#8217;s around twice or thrice as slow as a shift
and a byte write (or even shift, mask and byte write) and really sucks
with SMP.</p>

<p>There are also <a href="http://weinholt.se/scheme/alignment-check.pdf">hacks [PDF]</a>
to
<a href="https://groups.google.com/d/msg/comp.lang.lisp/qQdpmfHhJj8/43LfCzBiCJAJ">abuse</a>
alignment checking as hardware lowtag (tag data in the lower bits of
addresses) checks.  Who says that contemporary machines don&#8217;t support
safe languages well? (:</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Engineering a list merge sort]]></title>
    <link href="http://www.pvk.ca/Blog/2012/08/13/engineering-a-list-merge-sort/"/>
    <updated>2012-08-13T02:50:00+02:00</updated>
    <id>http://www.pvk.ca/Blog/2012/08/13/engineering-a-list-merge-sort</id>
    <content type="html"><![CDATA[<p>Back in November 2011, <a href="https://github.com/sile/">Takeru Ohta</a>
submitted a
<a href="http://article.gmane.org/gmane.lisp.steel-bank.devel/16416">very nice patch</a>
to replace our (SBCL&#8217;s) in-place stable merge sort on linked lists
with a simpler,
<a href="http://article.gmane.org/gmane.lisp.steel-bank.devel/16789">much more efficient</a>
implementation.  It took me until last May to whip myself into running
a bunch of tests to estimate the performance improvements and make
sure there weren&#8217;t any serious regressions, and finally commit the
patch.  This post summarises what happened as I tried to find further
improvements.  The result is an implementation that&#8217;s linear-time on
nearly sorted or reverse-sorted lists, around 4 times as fast on
slightly shuffled lists, and up to 30% faster on completely shuffled
lists, thanks to design choices guided by statistically significant
effects on performance (… on one computer, my dual
2.8 GHz X5660).</p>

<p>I believe the approach I used to choose the implementation can be
applied in other contexts, and the tiny tweak to adapt the sort to
nearly-sorted inputs is simple (much simpler than explicitly detecting runs
like
<a href="http://svn.python.org/projects/python/trunk/Objects/listsort.txt">Timsort</a>),
if a bit weak, and works with pretty much any merge sort.</p>

<h2>A good starting point</h2>

<p>The original code is reproduced below.  The sort is parameterised on
two functions: a comparator (test) and a key function that extracts
the property on which data are compared.  The key function is often
the identity, but having it available is more convenient than having
to pull the calls into the comparator.  The sort is also stable, so we
use it for both stable and regular sorting; I&#8217;d like to keep things
that way to minimise maintenance and testing efforts.  This
implementation seems like a good foundation to me: it&#8217;s simple but
pretty good (both in runtime and in number of comparisons).  Trying to
modify already-complicated code is no fun, and there&#8217;s little point
trying to improve an implementation that doesn&#8217;t get the basics right.</p>

<figure class='code'><figcaption><span>merge function </span></figcaption>
<div class="highlight"><table><tr><td class='code'><pre><code class=''><span class='line'>(defun merge-lists* (head list1 list2 test key &aux (tail head))
</span><span class='line'>  (declare (type cons head list1 list2)
</span><span class='line'>           (type function test key)
</span><span class='line'>           (optimize speed))
</span><span class='line'>  (macrolet ((merge-one (l1 l2)
</span><span class='line'>               `(progn
</span><span class='line'>                  (setf (cdr tail) ,l1
</span><span class='line'>                        tail       ,l1)
</span><span class='line'>                  (let ((rest (cdr ,l1)))
</span><span class='line'>                    (cond (rest
</span><span class='line'>                           (setf ,l1 rest))
</span><span class='line'>                          (t
</span><span class='line'>                           (setf (cdr ,l1) ,l2)
</span><span class='line'>                           (return (cdr head))))))))
</span><span class='line'>    (loop
</span><span class='line'>     (if (funcall test (funcall key (car list2))  ; this way, equivalent
</span><span class='line'>                       (funcall key (car list1))) ; values are first popped
</span><span class='line'>         (merge-one list2 list1)                  ; from list1
</span><span class='line'>         (merge-one list1 list2)))))</span></code></pre></td></tr></table></div></figure>




<figure class='code'><figcaption><span>sort function </span></figcaption>
<div class="highlight"><table><tr><td class='code'><pre><code class=''><span class='line'>(defun stable-sort-list (list test key &aux (head (cons :head list)))
</span><span class='line'>  (declare (type list list)
</span><span class='line'>           (type function test key)
</span><span class='line'>           (dynamic-extent head))
</span><span class='line'>  (labels ((recur (list size)
</span><span class='line'>             (declare (optimize speed)
</span><span class='line'>                      (type cons list)
</span><span class='line'>                      (type (and fixnum unsigned-byte) size))
</span><span class='line'>             (if (= size 1)
</span><span class='line'>                 (values list (shiftf (cdr list) nil))
</span><span class='line'>                 (let ((half (ash size -1))) ; TRUNCATE would have worked
</span><span class='line'>                   (multiple-value-bind (list1 rest)
</span><span class='line'>                       (recur list half)
</span><span class='line'>                     (multiple-value-bind (list2 rest)
</span><span class='line'>                         (recur rest (- size half))
</span><span class='line'>                       (values (merge-lists* head list1 list2 test key)
</span><span class='line'>                               rest)))))))
</span><span class='line'>    (when list
</span><span class='line'>      (values (recur list (length list))))))</span></code></pre></td></tr></table></div></figure>


<p>There are a few obvious improvements to try out: larger base cases,
recognition of sorted subsequences, converting branches to conditional
moves, finding some way to avoid the initial call to length (which
must traverse the whole linked list), … But first, what would
interesting performance metrics be, and on what inputs?</p>

<h2>Brainstorming an experiment up</h2>

<p>I think it works better to first determine our objective, then the
inputs to consider, and, last, the algorithmic variants to try and
compare (decision variables).  That&#8217;s more or less the reverse order
of what&#8217;s usually suggested when defining mathematical models.  The
difference is that, in the current context, the spaces of inputs and
algorithms are usually so large that we have to winnow them down by
taking earlier choices into account.</p>

<h3>Objective functions</h3>

<p>A priori, three basic performance metrics seem interesting: runtimes,
number of calls to the comparator, and number of calls to the key
functions.  On further thought, the last one doesn&#8217;t seem useful: if
it really matters, a
<a href="http://en.wikipedia.org/wiki/Schwartzian_transform">schwartzian transform</a>
suffices to reduce these calls to a minimum, regardless of the sort
implementation.</p>
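
<p>For reference, a minimal sketch of that transform on lists; <code>sort-with-cached-keys</code> is a hypothetical name, and the key is computed exactly once per element before the pairs are handed to <code>CL:STABLE-SORT</code>:</p>

<pre><code>(defun sort-with-cached-keys (list test key)
  "Decorate-sort-undecorate: KEY is called exactly once per element."
  (let ((decorated (mapcar (lambda (x) (cons (funcall key x) x)) list)))
    (mapcar #'cdr (stable-sort decorated test :key #'car))))

(sort-with-cached-keys (list "pear" "fig" "apple") #'&lt; #'length)
;; returns ("fig" "pear" "apple")
</code></pre>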

<p>There are some complications when looking at runtimes.  The universe
of test and key functions is huge, and the sorts can be inlined, which
sometimes enables further specialisation on the test and key.  I&#8217;ve
already decided that calls to key don&#8217;t matter directly.  Let&#8217;s
suppose it&#8217;s very simple, the identity function.  The number of
comparisons will correlate nicely with performance when comparisons
are slow.  Again, let&#8217;s suppose that the comparator is simple, a
straight <code>&lt;</code> of fixnums.  The performance of sorts, especially with a
trivial key and a simple comparator, can vary a lot depending on
whether the sort is specialised or not, and both cases are relevant in
practice.  I&#8217;ll have to test for both cases: inlined comparator and
key functions, and generic sorts with unknown functions.</p>

<p>This process led to a set of three objective functions: the number of
calls to the comparator, the runtime (number of cycles) of normal,
generic sort, and the number of cycles for a specialised sort.</p>

<h3>Inputs</h3>

<p>The obvious thing to vary in the input (the list to sort) is the
length of the list.  The lengths should probably span a wide range of
values, from short lists (e.g. 32 elements) to long ones (a few
million elements).  Programs that are sort-bound on very long lists
should probably use vectors, if only around the sort, and then invest
in a sophisticated sort.</p>

<p>In real programs, sort is sometimes called on nearly sorted or
reverse-sorted sequences, and it&#8217;s useful to sort such inputs faster,
or with fewer comparisons.  However, it&#8217;s probably not that
interesting if the adaptivity comes at the cost of worse performance
on fully shuffled lists.  I decided to test on sorted and fully
shuffled inputs.  I also interpolated between the two by flipping
randomly-selected subranges of the list a few times.</p>
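
<p>The interpolation can be sketched as follows.  This is only a guess at the procedure (the function name and the way the endpoints are drawn are assumptions), but it shows the idea: start from a sorted list and reverse a few random subranges.</p>

<pre><code>(defun flip-random-subranges (list flips)
  "Return a copy of LIST after reversing FLIPS randomly-chosen subranges."
  (let* ((vector (coerce list 'vector))
         (n (length vector)))
    (dotimes (i flips (coerce vector 'list))
      (let* ((a (random (1+ n)))
             (b (random (1+ n)))
             (start (min a b))
             (end (max a b)))
        ;; Reverse the half-open range [START, END) in place.
        (replace vector (reverse (subseq vector start end))
                 :start1 start)))))
</code></pre>

<p>With zero flips the input comes back unchanged, and each additional flip moves the list a bit further away from sorted.</p>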

<p>Finally, linked lists are different from vectors in one key manner:
contiguous elements can be arbitrarily scattered around memory.
SBCL&#8217;s allocation scheme ensures that consecutively-allocated objects
will tend to be located next to each other in memory, and the copying
garbage collector is hacked to copy the spine of cons lists in order.
However, a list can still temporarily exhibit bad locality, for
example after an in-place sort.  Again, I decided to go for ordered
conses, fully scattered conses (only the conses were shuffled, not the
list&#8217;s values), and to interpolate, this time by swapping
randomly-selected pairs of consecutive subranges a couple times.</p>

<h3>Code tweaks</h3>

<p>The textbook way to improve a recursive algorithm is to increase the
base cases&#8217; sizes.  In the initial code the base case is a sublist of
size one; such a list is trivially sorted.  We can easily increase
that to two (a single conditional swap suffices), and an optimal
sorting network for three values is only slightly more complicated.  I
decided to stop there, with base cases of size one to three.  These
simple sorts are implemented as a series of conditional swaps
(i.e. pairs of max/min computations), and these can be executed
branch-free, with only conditional moves.  There&#8217;s a bit of overhead,
and conditional moves introduce more latency than well predicted
branches, but it might be useful for the specialised sort on shuffled
inputs, and otherwise not hurt too much.</p>
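
<p>To illustrate, here&#8217;s a sketch of the three-value network on bare numbers, rather than on conses with test and key functions.  <code>sort3-network</code> is a hypothetical name; each <code>min</code>/<code>max</code> pair is one conditional swap, and a compiler is free to turn them into conditional moves.</p>

<pre><code>(defun sort3-network (a b c)
  "Sort three reals with three conditional swaps; returns three values."
  (let ((lo (min a b))
        (hi (max a b)))              ; conditional swap of A and B
    (let ((mid (min hi c))
          (top (max hi c)))          ; conditional swap of HI and C
      (values (min lo mid)           ; conditional swap of LO and MID
              (max lo mid)
              top))))
</code></pre>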

<p>The merge loop could cache the result of calls to the key functions.
This won&#8217;t be useful in specialised sorts, and won&#8217;t affect the number
of comparisons, but it&#8217;ll probably help the generic sort without
really affecting performance otherwise.</p>

<p>With one more level of indirection, the merge loop can be branch-free:
<code>merge-one</code> can be executed on references to the list heads, and these
references can be swapped with conditional moves.  Again, the
additional complexity makes it hard to guess if the change would be a
net improvement.</p>

<p>Like I hinted back in May, we can accelerate the sort on pre-sorted
inputs by keeping track of the last cons in each list, and tweaking
the merge function: if the first value in one list is greater than (or
equal to) the last in the other, we can directly splice them in order.
Stability means we have to add a bit of complexity to handle equal
values correctly, but nothing major.  With this tiny tweak, merge sort
is linear-time on sorted or reverse-sorted lists (the recursive step
is constant-time, and merge sort recurses on both halves); it also
works on recursively-processed sublists, and the performance is thus
improved on nearly-sorted inputs in general.  There&#8217;s little point
going through additional comparisons to accelerate the merger of two
tiny lists; a minimal length check is in order.  In addition to the
current version, without any quick merge, I decided to try quick
merges when the length of the two sublists summed to at least 8, 16 or
32.  I didn&#8217;t try limits lower than 8 because any improvement would
probably be marginal: trying to detect opportunities for quicker merge
introduces two additional comparisons when it fails.</p>
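
<p>A simplified sketch of the quick merge, for illustration only: <code>quick-merge</code> is a hypothetical function that finds the last elements with <code>LAST</code> (which walks the list, whereas tracking the last cons keeps the check constant-time) and falls back to <code>CL:MERGE</code> instead of the merge loop above.</p>

<pre><code>(defun quick-merge (list1 list2 test)
  "Destructively merge two sorted, non-empty lists; stable, LIST1 first."
  (let ((last1 (car (last list1)))
        (last2 (car (last list2))))
    (cond ((not (funcall test (car list2) last1))
           ;; Nothing in LIST2 must precede anything in LIST1, ties included.
           (nconc list1 list2))
          ((funcall test last2 (car list1))
           ;; All of LIST2 is strictly before LIST1: no ties to reorder.
           (nconc list2 list1))
          (t
           ;; Both quick checks failed, at the cost of two comparisons.
           (merge 'list list1 list2 test)))))
</code></pre>

<p>The asymmetry between the two checks is what preserves stability: equal values may only be spliced with <code>list1</code> first.</p>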

<p>Finally, I tried to see if the initial call to <code>length</code> (which has to
traverse the whole list) could be avoided, e.g. by switching to a
bottom-up sort.  The benchmarks I ran in May made me realise that&#8217;s
probably a bad idea.  Such a merge sort almost assuredly has to split
its inputs into chunks of power-of-two (or some other base) sizes.
These splits are suboptimal on non-power-of-two inputs; for example,
when sorting a list of length <code>(1+ (ash 1 n))</code>, the final merge is
between a list of length <code>(ash 1 n)</code> and a list of length … one.
Knowing the exact length of the list means we can split optimally on
recursive calls, and that eliminates bumps in runtimes and in the
number of comparisons around &#8220;round&#8221; lengths.</p>
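
<p>A quick sketch of the difference, with hypothetical helper names, and assuming a bottom-up sort that merges power-of-two chunks: each function returns the sizes of the two lists in the final merge, for lists of length at least two.</p>

<pre><code>(defun top-down-final-merge (length)
  "Sizes of the final merge when the list is split in halves."
  (let ((half (ash length -1)))
    (list half (- length half))))

(defun bottom-up-final-merge (length)
  "Sizes of the final merge when chunks have power-of-two sizes."
  ;; CHUNK is the largest power of two strictly below LENGTH.
  (let ((chunk (ash 1 (1- (integer-length (1- length))))))
    (list chunk (- length chunk))))

(top-down-final-merge 1025)  ; returns (512 513)
(bottom-up-final-merge 1025) ; returns (1024 1)
</code></pre>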

<h2>How can we compare all these possibilities?</h2>

<p>I usually don&#8217;t try to do anything clever, and simply run a large
number of repetitions for all the possible implementations and
families of inputs, and then choose a few interesting statistical
tests or sic an
<a href="http://en.wikipedia.org/wiki/Analysis_of_variance">ANOVA</a> on it.  The
problem is, I&#8217;d maybe want to test with ten lengths (to span the wide
range between 32 and a couple million), a couple shuffledness values
(say, four, between sorted and shuffled), a couple scatteredness
values (say, four, again), and around 48 implementations (size-1 base
case, size-3 base case, size-3 base case with conditional moves, times
cached or uncached key, times branchful or branch-free merge loop,
times four possibilities for the quick merge).  That&#8217;s a total of 7680
sets of parameter values.  If I repeated each possibility 100 times, a
reasonable sample size, I&#8217;d have to wait around 200 hours, given an
average time of 1 second/execution (a generous estimate, given how
slow shuffling and scattering lists can be)… and I&#8217;d have to do that
separately when testing for comparison counts, generic sort runtimes
and specialised sort runtimes!</p>
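
<p>The arithmetic behind these figures, spelled out with the counts given above:</p>

<pre><code>(let* ((implementations (* 3 2 2 4)) ; base cases, key cache, merge loop, quick merge
       (parameter-sets (* 10 4 4 implementations)) ; lengths, shuffledness, scatteredness
       (seconds (* parameter-sets 100 1))) ; 100 repetitions at 1 second each
  (values implementations parameter-sets (round seconds 3600)))
;; returns 48, 7680 and 213 (hours)
</code></pre>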

<p>I like working on SBCL, but not enough to give its merge sort multiple
CPU-weeks.</p>

<p>Executing multiple repetitions of the full cross product is overkill:
that actually gives us enough data to extract information about
the interaction between arbitrary pairs (or arbitrary subsets, in
fact) of parameters (e.g. shuffledness and the minimum length at which
we try to merge in constant-time).  The thing is, I&#8217;d never even try
to interpret all these crossed effects: there are way too many pairs,
triplets, etc.  I could instead try to determine interesting crosses
ahead of time, and find a design that fits my simple needs.</p>

<p>Increasing the length of the list will lead to longer runtimes and
more comparisons.  Scattering the cons cells around will also slow the
sorts down, particularly on long lists.  Hopefully, the sorts are
similar enough to be affected comparably by the length of the list and
by how its conses are scattered in memory.</p>

<p>Pre-sorted lists should be quicker to sort than shuffled ones, even
without any clever merge step: all the branches that depend on
comparisons are trivially predicted.  Hopefully, the effect is more
marked when sorted pairs of sublists are merged in constant time.</p>

<p>Finally, the interaction between the remaining algorithmic tweaks is
pretty hard to guess, and there are only 12 combinations.  I feel
it&#8217;s reasonable to cross the three parameter sets.</p>

<p>That&#8217;s three sets of crossed effects (length and scatteredness,
shuffledness and quick merge switch-over, remaining algorithmic
tweaks), but I&#8217;m not interested in any further interaction, and am
actually hoping these interactions are negligible.  A Latin square
design can help bring the sample to a much more reasonable size.</p>

<h3>Quadrata Latina pro victoria</h3>

<p>An NxN <a href="http://mathworld.wolfram.com/LatinSquare.html">Latin square</a>
is a square of NxN cells, with one of N symbols in each cell, with the
constraint that each symbol appears once in each row and column; it&#8217;s
a relaxed Sudoku.</p>

<p>When a first set of parameter values is associated with the rows, a
second with the columns, and a third with the symbols, a Latin square
defines N<sup>2</sup> triplets that cover each pair of parameters between the
three sets exactly once.  As long as interactions are absent or
negligible, that&#8217;s enough information to separate the effect of each
set of parameters.  The approach is interesting because there are only
N<sup>2</sup> cells (i.e. trials), instead of N<sup>3</sup>.  Better, the design can cope
with very low repetition counts, as low as a single trial per cell.</p>

<p>Latin squares are also fairly easy to generate.  It suffices to fill
the first column with the symbols in arbitrary order, the second in
the same order, rotated by one position, the third with a rotation by
two, etc.  The square can be further randomised by shuffling the rows
and columns (with Fisher-Yates, for example).  That procedure doesn&#8217;t
sample from the full universe of Latin squares, but it&#8217;s supposed to
be good enough to uncover pairwise interactions.</p>
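
<p>A minimal sketch of that procedure, assuming the symbols are simply the integers below N; <code>shuffle</code> and <code>random-latin-square</code> are hypothetical names.  Shuffling the row and column indices of the cyclic square is the same as shuffling its rows and columns.</p>

<pre><code>(defun shuffle (vector)
  "Fisher-Yates shuffle of VECTOR, in place."
  (loop for i from (1- (length vector)) downto 1
        do (rotatef (aref vector i) (aref vector (random (1+ i)))))
  vector)

(defun random-latin-square (n)
  "N by N array; cell (ROW, COLUMN) holds a symbol index below N."
  (let ((rows (shuffle (coerce (loop for i below n collect i) 'vector)))
        (columns (shuffle (coerce (loop for i below n collect i) 'vector)))
        (square (make-array (list n n))))
    (dotimes (i n square)
      (dotimes (j n)
        ;; Cyclic square: each column is the previous one rotated by one.
        (setf (aref square i j)
              (mod (+ (aref rows i) (aref columns j)) n))))))
</code></pre>

<p>Each triplet (row, column, symbol) then describes one trial of the experiment.</p>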

<p>Latin squares only make sense when all three sets of parameters are
the same size.  Latin rectangles can be used when one of the sets is
smaller than the two others, by simply removing rows or columns from a
random Latin square.  Some pairs are then left unexplored, but the
data still suffices for uncrossed linear fits, and generating
independent rectangles helps cover more possibilities.</p>

<p>I&#8217;ll treat all the variables as categorical, even though some take
numerical values: it&#8217;ll work better on non-linear effects (and I have
no clue what functional form to use).</p>

<h3>Optimising for comparison counts</h3>

<p>Comparison counts are easier to analyse.  They&#8217;re oblivious to
micro-optimisation issues like conditional moves or scattered conses,
and results are deterministic for fixed inputs.  There are far fewer
possibilities to consider, and less noise.</p>

<p>Four values for the minimum length before checking for constant-time
merger (8, 16, 32 or never), and ten shuffledness values (sorted,
one, two, five, ten, 50, 100, 500 or 1000 flips, and full shuffle)
seem reasonable; when the number of flips is equal to or exceeds the
list length, a full shuffle is performed instead.  That&#8217;s 40 values
for one parameter set.</p>

<p>There are only two interesting values for the remaining algorithmic
tweaks: size-3 base cases or not (only size-1).</p>

<p>This means there should be 40 list lengths to balance the design.  I
chose to interpolate from 32 to 16M (inclusively) with a geometric
sequence, rounded to the nearest integer.</p>
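<p>A minimal sketch of that interpolation, in Python, assuming that 16M means 16 x 2<sup>20</sup> and using a hypothetical helper name:</p>

```python
def geometric_lengths(shortest, longest, count):
    """count lengths from shortest to longest (inclusive), in geometric
    progression, each rounded to the nearest integer."""
    ratio = longest / shortest
    return [round(shortest * ratio ** (i / (count - 1))) for i in range(count)]


# Assumption: "16M" is read as 16 * 2**20 elements.
lengths = geometric_lengths(32, 16 * 2**20, 40)
print(lengths[:3], lengths[-1])
```

<p>Under that assumption, the first three lengths come out as 32, 45 and 63, which agrees with the Size.Scatter labels in the coefficient table for comparison counts.</p>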

<p>The resulting Latin rectangles comprise 80 cells.  Each scenario was
repeated five times (starting from the same five PRNG states), and 30
independent rectangles were generated.  In total, that&#8217;s 12 000
executions.  There are probably smarter ways to do this that better
exploit the fact that there are only two algorithmic tweak variants;
I stuck to a very thin Latin rectangle to stay closer to the next two
settings.  Still, a full cross product with 100 repetitions would have
called for 320 000 executions, nearly 30 times as many.</p>

<p>I wish to understand the effect of these various parameters on the
number of times the comparison function is called to sort a list.
Simple models tend to suppose additive effects.  That doesn&#8217;t look
like it&#8217;d work well here.  I expect multiplicative effects: enabling
quick merge shouldn&#8217;t add or subtract to the number of comparisons,
but scale it (hopefully by less than one).  A logarithmic
transformation will convert these multiplications into additions.  The
ANOVA method and the linear regression I&#8217;ll use are parametric methods
that suppose that the mean of experimental noise roughly follows a
normal distribution.  It seems like a reasonable hypothesis:
variations will be caused by a sum of many small differences caused by
the shuffling, and we&#8217;re working with many repetitions, hopefully
enough for the central limit theorem to kick in.</p>
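<p>The point of the logarithmic transformation can be illustrated with made-up numbers (everything in this Python sketch is hypothetical): a tweak that scales the comparison count by a constant factor changes the raw count by a different amount at each list size, but shifts the base-2 logarithm by the same amount everywhere.</p>

```python
import math


def gaps(base_count, scale):
    """Raw and log2 differences caused by scaling base_count by scale."""
    tweaked = base_count * scale
    raw_gap = tweaked - base_count
    log_gap = math.log2(tweaked) - math.log2(base_count)
    return raw_gap, log_gap


# Hypothetical comparison counts for a short and a long list, and a
# hypothetical tweak that divides the count by four.
small = gaps(9000.0, 0.25)
large = gaps(950000.0, 0.25)
print(small)  # raw gap depends on the size, log gap is log2(0.25)
print(large)
```

<p>An additive model on the raw counts would need a different coefficient at each size; on the log scale, a single coefficient of -2 describes the tweak.</p>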

<p>The Latin square method also depends on the absence of crossed
interactions between rows and columns, rows and symbols, or columns
and symbols.  If that constraint is violated, the design is highly
vulnerable to Type I errors: variations caused by interactions between
rows and columns could be assigned to rows or columns, for example.</p>

<p>My first step is to look for such interaction effects.</p>

<figure class='code'><figcaption><span>two-way ANOVA for comparison counts </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>> anova(lm(log(Count, 2) ~ Size.Scatter*Shuffle.Quick
</span><span class='line'>                         + Size.Scatter*Leaf.Cache.BranchMerge
</span><span class='line'>                         + Shuffle.Quick*Leaf.Cache.BranchMerge
</span><span class='line'>                         + 0, # 0 y-intercept
</span><span class='line'>           data))
</span><span class='line'>Analysis of Variance Table
</span><span class='line'>
</span><span class='line'>Response: log(Count, 2)
</span><span class='line'>                                        Df  Sum Sq Mean Sq  F value  Pr(>F)
</span><span class='line'>Size.Scatter                            40 3935314   98383 6.90e+06 &lt; 2e-16 ***
</span><span class='line'>Shuffle.Quick                           39    7739     198 1.39e+04 &lt; 2e-16 ***
</span><span class='line'>Leaf.Cache.BranchMerge                   1       8       8 5.96e+02 &lt; 2e-16 ***
</span><span class='line'>Size.Scatter:Shuffle.Quick            1159     770       1 4.66e+01 &lt; 2e-16 ***
</span><span class='line'>Size.Scatter:Leaf.Cache.BranchMerge     39       1       0 1.03e+00    0.41
</span><span class='line'>Shuffle.Quick:Leaf.Cache.BranchMerge    39       1       0 2.10e+00 7.5e-05 ***
</span><span class='line'>Residuals                            10683     152       0
</span><span class='line'>---
</span><span class='line'>Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1</span></code></pre></td></tr></table></div></figure>


<p>The main effects are statistically significant (in order, list length,
shuffling and quick merge limit, and the algorithmic tweaks), with p &lt;
2e-16.  That&#8217;s reassuring: the odds of observing such results if they
had no effect are negligible.  Two of the pairwise interactions are
statistically significant as well.  Their
effects, on the other hand, don&#8217;t seem meaningful.  The <code>Sum Sq</code>
column reports how much of the variance in the data set is explained
when the parameters corresponding to each row (one for each degree of
freedom <code>Df</code>) are introduced in the fit.  Only the
Size.Scatter:Shuffle.Quick row really improves the fit, and that&#8217;s
with 1159 degrees of freedom; the mean improvement in fit, <code>Mean Sq</code>
(per degree of freedom) is tiny.</p>
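<p>The <code>Mean Sq</code> column is simply <code>Sum Sq</code> divided by <code>Df</code>.  A quick Python check on the (rounded) values printed in the table above shows how little each interaction parameter contributes:</p>

```python
# (Df, Sum Sq) pairs copied from the two-way ANOVA table above; the sums
# of squares are rounded in R's output, so the results are approximate.
rows = {
    "Size.Scatter": (40, 3935314),
    "Shuffle.Quick": (39, 7739),
    "Leaf.Cache.BranchMerge": (1, 8),
    "Size.Scatter:Shuffle.Quick": (1159, 770),
    "Size.Scatter:Leaf.Cache.BranchMerge": (39, 1),
    "Shuffle.Quick:Leaf.Cache.BranchMerge": (39, 1),
    "Residuals": (10683, 152),
}

mean_sq = {name: sum_sq / df for name, (df, sum_sq) in rows.items()}
for name, value in mean_sq.items():
    print(f"{name:38s} {value:12.3f}")
```

<p>The Size.Scatter:Shuffle.Quick interaction explains about 0.66 per degree of freedom, against roughly 198 for the Shuffle.Quick main effect.</p>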

<p>The additional assumption that interaction effects are negligible
seems reasonably satisfied.  The linear model should be valid, but,
more importantly, we can analyse each set of parameters independently.
Let&#8217;s look at a regression with only the main effects.</p>

<figure class='code'><figcaption><span>one-way ANOVA for comparison counts </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>> fit &lt;- lm(log(Count, 2) ~ Size.Scatter + Shuffle.Quick
</span><span class='line'>                          + Leaf.Cache.BranchMerge + 0, data)
</span><span class='line'>> anova(fit)
</span><span class='line'>Analysis of Variance Table
</span><span class='line'>
</span><span class='line'>Response: log(Count, 2)
</span><span class='line'>                          Df  Sum Sq Mean Sq F value Pr(>F)
</span><span class='line'>Size.Scatter              40 3935314   98383 1269019 &lt;2e-16 ***
</span><span class='line'>Shuffle.Quick             39    7739     198    2560 &lt;2e-16 ***
</span><span class='line'>Leaf.Cache.BranchMerge     1       8       8     110 &lt;2e-16 ***
</span><span class='line'>Residuals              11920     924       0
</span><span class='line'>---
</span><span class='line'>Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1</span></code></pre></td></tr></table></div></figure>


<p>The fit is only slightly worse than with pairwise interactions.  The
coefficient table follows.  What we see is that half of the
observations fall within 12% of the linear model&#8217;s prediction (the
worst case is off by more than 100%), and that nearly all the
coefficients are statistically significantly different from zero.</p>

<figure class='code'><figcaption><span>one-way coefficients for comparison counts </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
<span class='line-number'>23</span>
<span class='line-number'>24</span>
<span class='line-number'>25</span>
<span class='line-number'>26</span>
<span class='line-number'>27</span>
<span class='line-number'>28</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>> summary(fit)
</span><span class='line'>Call:
</span><span class='line'>lm(formula = log(Count, 2) ~ Size.Scatter + Shuffle.Quick
</span><span class='line'>                           + Leaf.Cache.BranchMerge + 0,
</span><span class='line'>   data = data)
</span><span class='line'>
</span><span class='line'>Residuals:
</span><span class='line'>    Min      1Q  Median      3Q     Max
</span><span class='line'>-1.1105 -0.1746  0.0052  0.1440  1.3586
</span><span class='line'>
</span><span class='line'>Coefficients:
</span><span class='line'>                            Estimate Std. Error t value Pr(>|t|)
</span><span class='line'>Size.Scatter32xF             7.20776    0.02279  316.23  &lt; 2e-16 ***
</span><span class='line'>Size.Scatter45xF             7.98704    0.02276  350.87  &lt; 2e-16 ***
</span><span class='line'>Size.Scatter63xF             8.52335    0.02243  380.06  &lt; 2e-16 ***
</span><span class='line'>[...]
</span><span class='line'>Shuffle.QuickFx8            -2.77266    0.02284 -121.39  &lt; 2e-16 ***
</span><span class='line'>Shuffle.QuickFx16           -2.45345    0.02293 -106.98  &lt; 2e-16 ***
</span><span class='line'>Shuffle.QuickFx32           -2.14230    0.02289  -93.59  &lt; 2e-16 ***
</span><span class='line'>Shuffle.QuickFxF            -0.55217    0.02289  -24.12  &lt; 2e-16 ***
</span><span class='line'>[...]
</span><span class='line'>Leaf.Cache.BranchMergeTxFxT  0.05320    0.00508   10.47  &lt; 2e-16 ***
</span><span class='line'>---
</span><span class='line'>Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</span><span class='line'>
</span><span class='line'>Residual standard error: 0.278 on 11920 degrees of freedom
</span><span class='line'>Multiple R-squared:    1,       Adjusted R-squared:    1
</span><span class='line'>F-statistic: 6.36e+05 on 80 and 11920 DF,  p-value: &lt;2e-16</span></code></pre></td></tr></table></div></figure>
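<p>Since the response is a base-2 logarithm, each residual in the table above translates to a ratio between observed and predicted counts.  A small Python sketch of that conversion, assuming the usual convention that a residual is the observed value minus the fitted one:</p>

```python
# Residual quantiles copied from the summary above (log2 scale).
residuals = {
    "Min": -1.1105,
    "1Q": -0.1746,
    "Median": 0.0052,
    "3Q": 0.1440,
    "Max": 1.3586,
}

# observed / predicted = 2 ** residual
ratios = {name: 2.0 ** value for name, value in residuals.items()}
for name, ratio in ratios.items():
    print(f"{name:6s} observed/predicted = {ratio:.3f}")
```

<p>The quartiles map to ratios of about 0.89 and 1.10, hence the middle half of the observations falling within roughly 12% of the prediction, while the largest residual corresponds to an observation about 2.6 times the predicted count.</p>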


<p>The Size.Scatter coefficients are plotted below.  The number of
comparisons grows with the length of the lists.  The logarithmic factor
shows in the curve&#8217;s slight convexity (compare to the linear
interpolation in blue).</p>

<p><a href="http://www.pvk.ca/images/2012-08-13-engineering-a-list-merge-sort/count-size-large.png"><img class="center" src="http://www.pvk.ca/images/2012-08-13-engineering-a-list-merge-sort/count-size.png"></a></p>

<p>The Shuffle.Quick values are the coefficients for the crossed effect
of the level of shuffling and the minimum length (cutoff) at which
constant-time merge may be executed; their values are reported in the
next histogram, with error bars corresponding to one standard
deviation.  Hopefully, a shorter cutoff lowers the number of
comparisons when lists are nearly pre-sorted, and doesn&#8217;t increase it
too much when lists are fully shuffled.  On very nearly sorted lists,
looking for pre-sorted inputs as soon as eight or more values are
merged divides the number of comparisons by a factor of 4 (these are
base-2 logarithms), and the advantage smoothly tails off as lists are
shuffled better.  Overall, cutting off at eight seems to never do
substantially worse than the other choices, and is even roughly
equivalent to vanilla merges on fully shuffled inputs.</p>

<p><a href="http://www.pvk.ca/images/2012-08-13-engineering-a-list-merge-sort/count-shuffle-large.png"><img class="center" src="http://www.pvk.ca/images/2012-08-13-engineering-a-list-merge-sort/count-shuffle.png"></a></p>
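<p>The factors quoted above come from differences between coefficients on the log scale.  As a sketch in Python, assuming that the label <code>Fx8</code> means unshuffled input with a cutoff of eight and <code>FxF</code> means unshuffled input without any check (my reading of the labels, not something the output states):</p>

```python
# Shuffle.Quick coefficients for unshuffled inputs, copied from the
# coefficient table (log2 scale).
coefficients = {
    "Fx8": -2.77266,
    "Fx16": -2.45345,
    "Fx32": -2.14230,
    "FxF": -0.55217,
}

# Ratio of comparison counts: no check (FxF) over each cutoff.
baseline = coefficients["FxF"]
factors = {
    name: 2.0 ** (baseline - value)
    for name, value in coefficients.items()
    if name != "FxF"
}
for name, factor in factors.items():
    print(f"{name}: {factor:.2f} times fewer comparisons than FxF")
```

<p>Under that reading, the cutoff at eight divides the number of comparisons by a bit more than four and a half on sorted inputs, and the cutoff at 32 by about three.</p>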

<p>The coefficient table tells us that nearly all of the Shuffle.Quick
coefficients are statistically significant.  The statistical
significance values are for a null hypothesis that each of these
coefficients is actually zero: the observation would be extremely
unlikely if that were the case.  That test tells us nothing about the
relationship between two coefficients.</p>

<p>Comparing differences with standard deviations helps us detect hugely
significant differences, but we can use statistical tests to try and
make finer distinctions.
<a href="http://en.wikipedia.org/wiki/Tukey's_range_test">Tukey&#8217;s Honest Significant Difference (HSD)</a>
method gives intervals on the difference between two coefficients for
a given confidence level.  For example, the 99.99% confidence interval
between cutoff at 8 and 32 on lists that were flipped 50 times is
[-0.245, -0.00553].  This result means that, if the hypotheses for
Tukey&#8217;s HSD method are satisfied, the probability of observing the
results I found is less than .01% when the actual difference in effect
between cutoff at 8 and 32 is outside that interval.  Since even the
upper bound is negative, it also means that the odds of observing the
current results are less than .01% if the real value for cutoff at 8
isn&#8217;t lower than that of cutoff at 32: it&#8217;s pretty sure that looking
for quick merges as early as length eight pays off compared to only
doing so for merges of length 32 or more.  One could also just prove
that&#8217;s the case.  Overall, cutting off at length eight results in
fewer comparisons than the other options at nearly all shuffling
levels (with very high confidence), and the few cases where it doesn&#8217;t
aren&#8217;t statistically significant at a 99.99% confidence level &#8211; of
course, absence of evidence isn&#8217;t evidence of absence, but the
differences between these estimates tend to be tiny anyway.</p>

<p>The last row, Leaf.Cache.BranchMergeTxFxT, reports the effect of
adding base cases that sort lists of length 2 and 3.  Doing so causes
4% more comparisons.  That&#8217;s a bit surprising: adding specialised base
cases usually improves performance.  The issue is that the sorting
networks are only optimal for data-oblivious executions.  Sorting
three values requires, in theory, 2.58 (\(\lg 3!\)) bits of
information (comparisons).  A sorting network can&#8217;t do better than the
ceiling of that, three comparisons, but if control flow can depend on
the comparisons, some lists can be sorted in two comparisons.</p>
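<p>To illustrate the gap, here is a Python sketch that counts comparisons over all six orderings of three distinct values.  Both routines are illustrative stand-ins, not the base cases used in the actual sort:</p>

```python
from itertools import permutations


def network_sort3(a, b, c, counter):
    """Data-oblivious: always performs the same three compare-exchanges."""
    counter[0] += 1
    if a > b:
        a, b = b, a
    counter[0] += 1
    if b > c:
        b, c = c, b
    counter[0] += 1
    if a > b:
        a, b = b, a
    return (a, b, c)


def adaptive_sort3(a, b, c, counter):
    """Control flow depends on earlier comparisons: two or three of them."""
    counter[0] += 1
    if a > b:
        a, b = b, a
    counter[0] += 1
    if b > c:
        # c goes somewhere before b; one more comparison places it.
        counter[0] += 1
        if a > c:
            return (c, a, b)
        return (a, c, b)
    return (a, b, c)  # already in order after two comparisons


network_count, adaptive_count = [0], [0]
for values in permutations((1, 2, 3)):
    assert network_sort3(*values, network_count) == (1, 2, 3)
    assert adaptive_sort3(*values, adaptive_count) == (1, 2, 3)

print(network_count[0] / 6, adaptive_count[0] / 6)
```

<p>The network always uses three comparisons, while the adaptive version averages 16/6, about 2.67, which is close to the information-theoretic bound of 2.58.</p>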

<p>It seems that, if I wish to minimise the number of comparisons, I
should avoid sorting networks for the size-3 base case, and try to
detect opportunities for constant-time list merges.  Doing so as soon
as the merged list will be of length eight or more seems best.</p>

<h3>Optimising the runtime of generic sorts</h3>

<p>I decided to keep the same general shape of 40x40xM parameter values
when looking at the cycle count for generic sorts.  This time,
scattering conses around in memory will affect the results.  I went
with conses laid out linearly, 10 swaps, 500 swaps, and full
randomisation of addresses.  These 4 scattering values leave 10 list
lengths, in a geometric progression from 32 to 16M.  Now, it makes
sense to try all the other micro-optimisations: trivial base case or
base cases of size up to 3, with branches or conditional moves (3
choices), cached calls to key during merge (2 choices), and branches
or conditional moves in the merge loop (2 choices).  This calls for
Latin rectangles of size 40x12; I generated 10 rectangles, and
repeated each cell 5 times (starting from the same 5 PRNG seeds).  In
total, that&#8217;s 24 000 executions.  A full cross product, without any
repetition, would require 19 200 executions; the Latin square design
easily saved a factor of 10 in terms of sample size (and computation
time) for equivalent power.</p>
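<p>The sample sizes above follow directly from the dimensions of the design; a short Python sketch of the arithmetic (variable names are only for illustration):</p>

```python
# Algorithmic tweaks: 3 leaf choices, 2 key-caching choices, 2 merge loops.
tweak_variants = 3 * 2 * 2

# Size.Scatter: 10 lengths times 4 scattering levels.
size_scatter_values = 10 * 4
# Shuffle.Quick: 10 shuffling levels times 4 cutoffs, as before.
shuffle_quick_values = 10 * 4

cells_per_rectangle = 40 * tweak_variants  # 40x12 Latin rectangle
rectangles = 10
repetitions = 5
latin_executions = cells_per_rectangle * rectangles * repetitions

# One run of every combination, with no repetition at all.
full_cross_product = size_scatter_values * shuffle_quick_values * tweak_variants

print(tweak_variants, latin_executions, full_cross_product)
```

<p>A full cross product with the same 50 observations per combination that each rectangle cell receives across repetitions and rectangles would be far larger than the 24 000 runs used here.</p>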

<p>I&#8217;m interested in execution times, so I generated the inputs ahead of
time, before sorting them; during both generation and sorting, the
garbage collector was disabled to avoid major slowdowns caused by the
mprotect-based write barrier.</p>

<p>Again, I have to apply a logarithmic transformation for the additive
model to make sense, and first look at the interaction effects.
The situation is similar to the previous section on comparison
counts: one of the crossed effects is statistically significant, but
it&#8217;s not overly meaningful.  A quick look at the coefficients reveals
that the speed-ups caused by processing nearly-sorted lists in close
to linear time are overestimated on short lists and slightly
underestimated on long ones.</p>

<figure class='code'><figcaption><span>two-way ANOVA for generic sort runtime </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>> anova(lm(log(Cycles, 2) ~ Size.Scatter*Shuffle.Quick
</span><span class='line'>                          + Size.Scatter*Leaf.Cache.BranchMerge
</span><span class='line'>                          + Shuffle.Quick*Leaf.Cache.BranchMerge
</span><span class='line'>                          + 0,
</span><span class='line'>           data))
</span><span class='line'>Analysis of Variance Table
</span><span class='line'>
</span><span class='line'>Response: log(Cycles, 2)
</span><span class='line'>                                        Df   Sum Sq Mean Sq  F value Pr(>F)    
</span><span class='line'>Size.Scatter                            40 15666210  391655 3.97e+07 &lt;2e-16 ***
</span><span class='line'>Shuffle.Quick                           39    13396     343 3.48e+04 &lt;2e-16 ***
</span><span class='line'>Leaf.Cache.BranchMerge                  11       58       5 5.37e+02 &lt;2e-16 ***
</span><span class='line'>Size.Scatter:Shuffle.Quick            1477     1620       1 1.11e+02 &lt;2e-16 ***
</span><span class='line'>Size.Scatter:Leaf.Cache.BranchMerge    429        4       0 8.90e-01   0.95    
</span><span class='line'>Shuffle.Quick:Leaf.Cache.BranchMerge   429        2       0 5.40e-01   1.00    
</span><span class='line'>Residuals                            21575      213       0                    
</span><span class='line'>---
</span><span class='line'>Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1</span></code></pre></td></tr></table></div></figure>


<p>We can basically read the ANOVA with only main effects by skipping the
rows corresponding to crossed effects and instead adding their values
to the residuals.  There are statistically significant coefficients in
there, and they&#8217;re reported below.  Again, I&#8217;m quite happy to be able
to examine each set of parameters independently, rather than having to
understand how, e.g., scattering cons cells around affects quick merges
differently than the vanilla merge.  Maybe I just didn&#8217;t choose the
right parameters, or was really unlucky; I&#8217;m just trying to do the
best I can with induction.</p>

<figure class='code'><figcaption><span>one-way coefficients for generic sort runtime </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
<span class='line-number'>23</span>
<span class='line-number'>24</span>
<span class='line-number'>25</span>
<span class='line-number'>26</span>
<span class='line-number'>27</span>
<span class='line-number'>28</span>
<span class='line-number'>29</span>
<span class='line-number'>30</span>
<span class='line-number'>31</span>
<span class='line-number'>32</span>
<span class='line-number'>33</span>
<span class='line-number'>34</span>
<span class='line-number'>35</span>
<span class='line-number'>36</span>
<span class='line-number'>37</span>
<span class='line-number'>38</span>
<span class='line-number'>39</span>
<span class='line-number'>40</span>
<span class='line-number'>41</span>
<span class='line-number'>42</span>
<span class='line-number'>43</span>
<span class='line-number'>44</span>
<span class='line-number'>45</span>
<span class='line-number'>46</span>
<span class='line-number'>47</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>> summary(lm(log(Cycles, 2) ~ Size.Scatter + Shuffle.Quick
</span><span class='line'>                            + Leaf.Cache.BranchMerge + 0, data))
</span><span class='line'>Call:
</span><span class='line'>lm(formula = log(Cycles, 2) ~ Size.Scatter + Shuffle.Quick + 
</span><span class='line'>    Leaf.Cache.BranchMerge + 0, data = data)
</span><span class='line'>
</span><span class='line'>Residuals:
</span><span class='line'>    Min      1Q  Median      3Q     Max 
</span><span class='line'>-0.9658 -0.1637 -0.0036  0.1446  1.4650 
</span><span class='line'>
</span><span class='line'>Coefficients:
</span><span class='line'>                               Estimate Std. Error t value Pr(>|t|)    
</span><span class='line'>Size.Scatter32x10              14.96752    0.01684  888.96  &lt; 2e-16 ***
</span><span class='line'>Size.Scatter32x500             15.01394    0.01677  895.22  &lt; 2e-16 ***
</span><span class='line'>Size.Scatter32xF               15.05236    0.01703  884.05  &lt; 2e-16 ***
</span><span class='line'>Size.Scatter32xT               15.03750    0.01702  883.43  &lt; 2e-16 ***
</span><span class='line'>Size.Scatter138x10             17.44223    0.01703 1024.09  &lt; 2e-16 ***
</span><span class='line'>Size.Scatter138x500            17.45624    0.01696 1029.29  &lt; 2e-16 ***
</span><span class='line'>Size.Scatter138xF              17.45541    0.01696 1029.31  &lt; 2e-16 ***
</span><span class='line'>Size.Scatter138xT              17.48772    0.01710 1022.69  &lt; 2e-16 ***
</span><span class='line'>[...]
</span><span class='line'>Shuffle.QuickFx8               -2.52552    0.01607 -157.17  &lt; 2e-16 ***
</span><span class='line'>Shuffle.QuickFx16              -2.20712    0.01606 -137.42  &lt; 2e-16 ***
</span><span class='line'>Shuffle.QuickFx32              -1.92409    0.01606 -119.81  &lt; 2e-16 ***
</span><span class='line'>Shuffle.QuickFxF               -0.58187    0.01606  -36.24  &lt; 2e-16 ***
</span><span class='line'>Shuffle.Quick1x8               -1.84568    0.01607 -114.86  &lt; 2e-16 ***
</span><span class='line'>Shuffle.Quick1x16              -1.64170    0.01607 -102.19  &lt; 2e-16 ***
</span><span class='line'>Shuffle.Quick1x32              -1.50046    0.01606  -93.42  &lt; 2e-16 ***
</span><span class='line'>Shuffle.Quick1xF               -0.48014    0.01607  -29.87  &lt; 2e-16 ***
</span><span class='line'>[...]
</span><span class='line'>Leaf.Cache.BranchMergeCMOVxFxT -0.03229    0.00877   -3.68  0.00023 ***
</span><span class='line'>Leaf.Cache.BranchMergeCMOVxTxF -0.04451    0.00877   -5.08  3.9e-07 ***
</span><span class='line'>Leaf.Cache.BranchMergeCMOVxTxT -0.08632    0.00877   -9.84  &lt; 2e-16 ***
</span><span class='line'>Leaf.Cache.BranchMergeFxFxF     0.09094    0.00877   10.37  &lt; 2e-16 ***
</span><span class='line'>Leaf.Cache.BranchMergeFxFxT    -0.01637    0.00877   -1.87  0.06193 .  
</span><span class='line'>Leaf.Cache.BranchMergeFxTxF     0.01617    0.00877    1.84  0.06519 .  
</span><span class='line'>Leaf.Cache.BranchMergeFxTxT    -0.06236    0.00877   -7.11  1.2e-12 ***
</span><span class='line'>Leaf.Cache.BranchMergeTxFxF     0.03694    0.00877    4.21  2.5e-05 ***
</span><span class='line'>Leaf.Cache.BranchMergeTxFxT    -0.03665    0.00877   -4.18  2.9e-05 ***
</span><span class='line'>Leaf.Cache.BranchMergeTxTxF    -0.03920    0.00877   -4.47  7.9e-06 ***
</span><span class='line'>Leaf.Cache.BranchMergeTxTxT    -0.08567    0.00877   -9.77  &lt; 2e-16 ***
</span><span class='line'>---
</span><span class='line'>Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
</span><span class='line'>
</span><span class='line'>Residual standard error: 0.277 on 23910 degrees of freedom
</span><span class='line'>Multiple R-squared:    1, Adjusted R-squared:    1 
</span><span class='line'>F-statistic: 2.27e+06 on 90 and 23910 DF,  p-value: &lt;2e-16</span></code></pre></td></tr></table></div></figure>
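<p>The shortcut of folding the crossed effects into the residuals can be checked against the summary above.  This Python sketch pools the (rounded) interaction rows of the two-way ANOVA for generic sorts with its residual row:</p>

```python
import math

# (name, Df, Sum Sq) for the crossed effects in the two-way ANOVA table
# for generic sort runtime; Sum Sq values are rounded in R's output.
interactions = [
    ("Size.Scatter:Shuffle.Quick", 1477, 1620),
    ("Size.Scatter:Leaf.Cache.BranchMerge", 429, 4),
    ("Shuffle.Quick:Leaf.Cache.BranchMerge", 429, 2),
]
residual_df, residual_sum_sq = 21575, 213

# Dropping the interaction terms moves their degrees of freedom and their
# sums of squares into the residuals of the main-effects model.
pooled_df = residual_df + sum(df for _name, df, _ss in interactions)
pooled_sum_sq = residual_sum_sq + sum(ss for _name, _df, ss in interactions)
residual_standard_error = math.sqrt(pooled_sum_sq / pooled_df)

print(pooled_df, pooled_sum_sq, round(residual_standard_error, 3))
```

<p>The pooled values reproduce the 23910 residual degrees of freedom and, up to rounding, the residual standard error of 0.277 reported for the main-effects fit.</p>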


<p>The coefficients for list length crossed with scattering level are
plotted below. Sorting seems to be slower on longer lists (surprise!),
especially when the cons cells are scattered; sorting long scattered
lists is about twice as slow as sorting nicely laid-out lists of the
same length.  The difference between linear and slightly scattered
lists isn&#8217;t statistically significant.</p>

<p><a href="http://www.pvk.ca/images/2012-08-13-engineering-a-list-merge-sort/normal-size-large.png"><img class="center" src="http://www.pvk.ca/images/2012-08-13-engineering-a-list-merge-sort/normal-size.png"></a></p>

<p>Just as with comparison counts, sorting pre-sorted lists is faster,
with or without special logic.  Looking for sorted inputs before
merging pays off even on short lists, when the input is nearly sorted:
the effect of looking for pre-sorted inputs even on sublists of length
eight is consistently more negative (i.e. reduces runtimes) than for
the other cutoffs.  The difference is statistically significant at
nearly all shuffling levels, and never significantly positive.</p>

<p><a href="http://www.pvk.ca/images/2012-08-13-engineering-a-list-merge-sort/normal-shuffle-large.png"><img class="center" src="http://www.pvk.ca/images/2012-08-13-engineering-a-list-merge-sort/normal-shuffle.png"></a></p>

<p>Finally, the three algorithmic tweaks.  Interestingly, the
coefficients tell us that, overall, the additional overhead of the
branch-free merge loop slows it down by 5%.  The fastest combination
seems to be larger base cases, with or without conditional moves (CMOV
or T), cached calls to key (T), and branchful merge loop (T); the
differences are statistically significant against nearly all other
combinations, except FxTxT (no leaf sort, cached key, and branchful
merge loop).  Compared with the current code (FxFxT), the speed up is
on the order of 5%, and at least 2% with 99.99% confidence.</p>

<p><a href="http://www.pvk.ca/images/2012-08-13-engineering-a-list-merge-sort/normal-tweak-large.png"><img class="center" src="http://www.pvk.ca/images/2012-08-13-engineering-a-list-merge-sort/normal-tweak.png"></a></p>
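<p>The ranking of the twelve combinations can be read off the coefficient table.  In this Python sketch, I assume that CMOVxFxF, the one combination missing from the table, is the reference level with an implicit coefficient of zero:</p>

```python
# Leaf.Cache.BranchMerge coefficients from the generic sort summary
# (log2 of cycles); CMOVxFxF is assumed to be the reference level.
coefficients = {
    "CMOVxFxF": 0.0,
    "CMOVxFxT": -0.03229,
    "CMOVxTxF": -0.04451,
    "CMOVxTxT": -0.08632,
    "FxFxF": 0.09094,
    "FxFxT": -0.01637,
    "FxTxF": 0.01617,
    "FxTxT": -0.06236,
    "TxFxF": 0.03694,
    "TxFxT": -0.03665,
    "TxTxF": -0.03920,
    "TxTxT": -0.08567,
}

# Most negative coefficient first, i.e. fastest first.
ranking = sorted(coefficients, key=coefficients.get)

# Estimated speed-up of the fastest combination over the current code.
best, current = ranking[0], "FxFxT"
speedup = 1.0 - 2.0 ** (coefficients[best] - coefficients[current])

print(ranking[:3])
print(f"{best} vs {current}: {100 * speedup:.1f}% fewer cycles")
```

<p>The point estimate is a speed-up of a little under 5%; this sketch ignores the standard errors, so it says nothing about which differences are significant.</p>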

<p>If I want to improve the performance of generic sorts, it looks like I
want to test for pre-sorted inputs when merging into a list of length
8 or more, probably implement larger base cases, cache calls to the
key function, and keep the merge loop branchful.</p>

<h3>Optimising the runtime of specialised sorts</h3>

<p>I kept the exact same plan as for generic sorts.  The only difference
is that independent Latin rectangles were re-generated from scratch.
With the overhead from generic indirect calls removed, I&#8217;m hoping to
see more important effects from the micro-optimisations.</p>

<figure class='code'><figcaption><span>two-way ANOVA for specialised sort runtime </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>> anova(lm(log(Cycles, 2) ~ Size.Scatter*Shuffle.Quick
</span><span class='line'>                          + Size.Scatter*Leaf.Cache.BranchMerge
</span><span class='line'>                          + Shuffle.Quick*Leaf.Cache.BranchMerge
</span><span class='line'>                          + 0,
</span><span class='line'>           data))
</span><span class='line'>Analysis of Variance Table
</span><span class='line'>
</span><span class='line'>Response: log(Cycles, 2)
</span><span class='line'>                                        Df   Sum Sq Mean Sq  F value Pr(>F)    
</span><span class='line'>Size.Scatter                            40 12531049  313276 2.92e+07 &lt;2e-16 ***
</span><span class='line'>Shuffle.Quick                           39    12324     316 2.94e+04 &lt;2e-16 ***
</span><span class='line'>Leaf.Cache.BranchMerge                  11     3365     306 2.85e+04 &lt;2e-16 ***
</span><span class='line'>Size.Scatter:Shuffle.Quick            1475     2952       2 1.86e+02 &lt;2e-16 ***
</span><span class='line'>Size.Scatter:Leaf.Cache.BranchMerge    429      391       1 8.49e+01 &lt;2e-16 ***
</span><span class='line'>Shuffle.Quick:Leaf.Cache.BranchMerge   429      150       0 3.25e+01 &lt;2e-16 ***
</span><span class='line'>Residuals                            21577      232       0                    
</span><span class='line'>---
</span><span class='line'>Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
</span><span class='line'>Analysis of Variance Table</span></code></pre></td></tr></table></div></figure>


<p>Here as well, all the main and crossed effects are statistically
significant.  The effects of the micro-optimisations
(Leaf.Cache.BranchMerge) are now about as influential as the fast
merge minimum length.  It&#8217;s also even clearer that the crossed
effects are much less important than the main ones, and that it&#8217;s
probably not too bad to ignore the former.</p>

<figure class='code'><figcaption><span>one-way coefficients for specialised sort runtime </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
<span class='line-number'>23</span>
<span class='line-number'>24</span>
<span class='line-number'>25</span>
<span class='line-number'>26</span>
<span class='line-number'>27</span>
<span class='line-number'>28</span>
<span class='line-number'>29</span>
<span class='line-number'>30</span>
<span class='line-number'>31</span>
<span class='line-number'>32</span>
<span class='line-number'>33</span>
<span class='line-number'>34</span>
<span class='line-number'>35</span>
<span class='line-number'>36</span>
<span class='line-number'>37</span>
<span class='line-number'>38</span>
<span class='line-number'>39</span>
<span class='line-number'>40</span>
<span class='line-number'>41</span>
<span class='line-number'>42</span>
<span class='line-number'>43</span>
<span class='line-number'>44</span>
<span class='line-number'>45</span>
<span class='line-number'>46</span>
<span class='line-number'>47</span>
<span class='line-number'>48</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>> summary(lm(log(Cycles, 2) ~ Size.Scatter + Shuffle.Quick
</span><span class='line'>                            + Leaf.Cache.BranchMerge + 0, data))
</span><span class='line'>
</span><span class='line'>Call:
</span><span class='line'>lm(formula = log(Cycles, 2) ~ Size.Scatter + Shuffle.Quick + 
</span><span class='line'>    Leaf.Cache.BranchMerge + 0, data = data)
</span><span class='line'>
</span><span class='line'>Residuals:
</span><span class='line'>    Min      1Q  Median      3Q     Max 
</span><span class='line'>-1.5650 -0.2395 -0.0186  0.2233  2.1683 
</span><span class='line'>
</span><span class='line'>Coefficients:
</span><span class='line'>                               Estimate Std. Error t value Pr(>|t|)    
</span><span class='line'>Size.Scatter32x10              12.60101    0.02431  518.30  &lt; 2e-16 ***
</span><span class='line'>Size.Scatter32x500             12.65034    0.02414  523.98  &lt; 2e-16 ***
</span><span class='line'>Size.Scatter32xF               12.57545    0.02431  517.19  &lt; 2e-16 ***
</span><span class='line'>Size.Scatter32xT               12.64220    0.02423  521.78  &lt; 2e-16 ***
</span><span class='line'>Size.Scatter138x10             14.76927    0.02422  609.70  &lt; 2e-16 ***
</span><span class='line'>Size.Scatter138x500            14.81878    0.02433  609.04  &lt; 2e-16 ***
</span><span class='line'>Size.Scatter138xF              14.76646    0.02440  605.16  &lt; 2e-16 ***
</span><span class='line'>Size.Scatter138xT              14.86926    0.02441  609.04  &lt; 2e-16 ***
</span><span class='line'>[...]
</span><span class='line'>Shuffle.QuickFx8               -2.13508    0.02287  -93.37  &lt; 2e-16 ***
</span><span class='line'>Shuffle.QuickFx16              -1.87832    0.02285  -82.19  &lt; 2e-16 ***
</span><span class='line'>Shuffle.QuickFx32              -1.71007    0.02284  -74.86  &lt; 2e-16 ***
</span><span class='line'>Shuffle.QuickFxF               -0.64569    0.02286  -28.24  &lt; 2e-16 ***
</span><span class='line'>Shuffle.Quick1x8               -1.60291    0.02286  -70.12  &lt; 2e-16 ***
</span><span class='line'>Shuffle.Quick1x16              -1.48993    0.02284  -65.22  &lt; 2e-16 ***
</span><span class='line'>Shuffle.Quick1x32              -1.39072    0.02282  -60.95  &lt; 2e-16 ***
</span><span class='line'>Shuffle.Quick1xF               -0.55361    0.02284  -24.24  &lt; 2e-16 ***
</span><span class='line'>[...]
</span><span class='line'>Leaf.Cache.BranchMergeCMOVxFxT -0.64106    0.01248  -51.37  &lt; 2e-16 ***
</span><span class='line'>Leaf.Cache.BranchMergeCMOVxTxF  0.02932    0.01248    2.35   0.0188 *  
</span><span class='line'>Leaf.Cache.BranchMergeCMOVxTxT -0.65298    0.01248  -52.32  &lt; 2e-16 ***
</span><span class='line'>Leaf.Cache.BranchMergeFxFxF     0.30856    0.01248   24.72  &lt; 2e-16 ***
</span><span class='line'>Leaf.Cache.BranchMergeFxFxT    -0.43997    0.01248  -35.25  &lt; 2e-16 ***
</span><span class='line'>Leaf.Cache.BranchMergeFxTxF     0.29306    0.01248   23.48  &lt; 2e-16 ***
</span><span class='line'>Leaf.Cache.BranchMergeFxTxT    -0.44372    0.01248  -35.55  &lt; 2e-16 ***
</span><span class='line'>Leaf.Cache.BranchMergeTxFxF     0.01703    0.01248    1.36   0.1725    
</span><span class='line'>Leaf.Cache.BranchMergeTxFxT    -0.70801    0.01248  -56.73  &lt; 2e-16 ***
</span><span class='line'>Leaf.Cache.BranchMergeTxTxF     0.02323    0.01248    1.86   0.0627 .  
</span><span class='line'>Leaf.Cache.BranchMergeTxTxT    -0.68746    0.01248  -55.08  &lt; 2e-16 ***
</span><span class='line'>---
</span><span class='line'>Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
</span><span class='line'>
</span><span class='line'>Residual standard error: 0.395 on 23910 degrees of freedom
</span><span class='line'>Multiple R-squared:    1, Adjusted R-squared:    1 
</span><span class='line'>F-statistic: 8.95e+05 on 90 and 23910 DF,  p-value: &lt;2e-16</span></code></pre></td></tr></table></div></figure>


<p>The general aspect of the coefficients is pretty much the same as for
generic sorts, except that differences are amplified now that the
constant overhead of indirect calls is eliminated.</p>

<p>The coefficients for the crossed list length and scattering level
are plotted below.  The graph shows that fully shuffling long lists
around slows the sort down by a factor of 8.  The initial check for
crossed effects gave good reasons to believe that this effect is
fairly homogeneous throughout all implementations.</p>

<p><a href="http://www.pvk.ca/images/2012-08-13-engineering-a-list-merge-sort/inline-size-large.png"><img class="center" src="http://www.pvk.ca/images/2012-08-13-engineering-a-list-merge-sort/inline-size.png"></a></p>
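<p>Since the models are fitted to <code>log(Cycles, 2)</code>, the gap between two coefficients of the same factor translates into a runtime ratio of two to the power of that gap, everything else being equal; a factor of 8 is a gap of 3.  Here is a minimal sketch of the conversion, applied to two of the coefficients printed above.  The function name is made up for this illustration and is not part of the analysis scripts.</p>

```lisp
(defun log2-gap->ratio (slow fast)
  "Runtime ratio implied by two log2(cycles) coefficients, SLOW and
FAST, taken from the same factor of an additive model."
  (expt 2d0 (- slow fast)))

;; A gap of 3 on the log2 scale is a factor of 8.
(log2-gap->ratio 3d0 0d0)

;; Two Leaf.Cache.BranchMerge coefficients from the table above:
;; FxFxF (0.30856) against TxFxT (-0.70801), roughly a factor of 2.
(log2-gap->ratio 0.30856d0 -0.70801d0)
```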

<p>Checking for sorted inputs before merge still helps, even on short
lists (of length 8 or more).  In fact, even on completely shuffled
lists, looking for quick merges on short lists very probably
accelerates the sort compared to not looking for pre-sorted inputs,
although the speed-up compared to other cutoff values isn&#8217;t
significant at a 99.99% confidence level.</p>

<p><a href="http://www.pvk.ca/images/2012-08-13-engineering-a-list-merge-sort/inline-shuffle-large.png"><img class="center" src="http://www.pvk.ca/images/2012-08-13-engineering-a-list-merge-sort/inline-shuffle.png"></a></p>

<p>The key function is the identity, and is inlined into nothing in these
measurements.  It&#8217;s not surprising that the difference between cached
and uncached key values is tiny.  The versions with larger base cases
(C or T) and a branchful merge are quicker than the others at the
99.99% confidence level; compared to the initial code, they&#8217;re at
least 13% faster with 99.99% confidence.</p>

<p><a href="http://www.pvk.ca/images/2012-08-13-engineering-a-list-merge-sort/inline-tweak-large.png"><img class="center" src="http://www.pvk.ca/images/2012-08-13-engineering-a-list-merge-sort/inline-tweak.png"></a></p>

<p>When the sort is specialised, I probably want to use a merge function
that checks for pre-sorted inputs very early, to implement larger base
cases (with conditional moves or branches), and to keep the merge loop
branchful.</p>

<h2>Putting it all together</h2>

<p>Comparison counts are minimised by avoiding sorting networks, and by
enabling opportunistic constant-time merges as early as possible.
Generic sorts are fastest with larger base cases (with or without
branches), cached calls to the key function, a branchful merge loop
and early checks for constant-time merges.  Specialised sorts are,
similarly, fastest with larger base cases, a branchful merge loop
and early checks when merging (without positive or negative effect
from caching calls to the key function, even if it&#8217;s the identity).</p>

<p>Overall, these results point me toward one implementation: branchful
size-2 and size-3 base cases that let me avoid redundant comparisons,
cached calls to the key function, a branchful merge loop, and checks
for constant-time merges when the result is of length eight or more.</p>

<p>The compound effect of these choices is linear time complexity on
sorted inputs, speed-ups (and reduction in comparison counts) by
factors of 2 to 4 on nearly-sorted inputs, and by 5% to 30% on
shuffled lists.</p>
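<p>Comparison counts like the ones discussed here are easy to collect by wrapping the test function in a counting closure.  The helper below is a hypothetical sketch, not the harness used for these measurements, and it exercises the standard <code>stable-sort</code> so that it runs on its own; the exact count depends on the implementation.</p>

```lisp
(defun count-comparisons (sort-function list test)
  "Number of calls to TEST made by SORT-FUNCTION, a function of a list
and a predicate, while sorting a fresh copy of LIST."
  (let ((count 0))
    (funcall sort-function
             (copy-list list)
             (lambda (a b)
               (incf count)
               (funcall test a b)))
    count))

;; Already sorted input: a correct sort cannot get away with fewer
;; than 999 comparisons on 1000 elements.
(count-comparisons #'stable-sort
                   (loop for i below 1000 collect i)
                   #'<)
```

<p>Passing <code>(lambda (list test) (stable-sort-list list test #'identity))</code> as the first argument would count comparisons for the sort defined in this post instead, once its code is loaded.</p>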

<p>The resulting code follows.</p>

<figure class='code'><figcaption><span>new merge function </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(defun merge-lists* (head list1 list2 test key &aux (tail head))
</span><span class='line'>  (declare (type cons head list1 list2)
</span><span class='line'>           (type function test key)
</span><span class='line'>           (optimize speed))
</span><span class='line'>  (let ((key1 (funcall key (car list1)))
</span><span class='line'>        (key2 (funcall key (car list2))))
</span><span class='line'>    (macrolet ((merge-one (l1 k1 l2)
</span><span class='line'>                 `(progn
</span><span class='line'>                    (setf (cdr tail) ,l1
</span><span class='line'>                          tail       ,l1)
</span><span class='line'>                    (let ((rest (cdr ,l1)))
</span><span class='line'>                      (cond (rest
</span><span class='line'>                             (setf ,l1 rest
</span><span class='line'>                                   ,k1 (funcall key (first rest))))
</span><span class='line'>                            (t
</span><span class='line'>                             (setf (cdr ,l1) ,l2)
</span><span class='line'>                             (return (cdr head))))))))
</span><span class='line'>      (loop
</span><span class='line'>       (if (funcall test key2           ; this way, equivalent
</span><span class='line'>                         key1)          ; values are first popped
</span><span class='line'>           (merge-one list2 key2 list1) ; from list1
</span><span class='line'>           (merge-one list1 key1 list2))))))</span></code></pre></td></tr></table></div></figure>




<figure class='code'><figcaption><span>size-specialised sort functions </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
<span class='line-number'>23</span>
<span class='line-number'>24</span>
<span class='line-number'>25</span>
<span class='line-number'>26</span>
<span class='line-number'>27</span>
<span class='line-number'>28</span>
<span class='line-number'>29</span>
<span class='line-number'>30</span>
<span class='line-number'>31</span>
<span class='line-number'>32</span>
<span class='line-number'>33</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(defun stable-sort-list-2 (list test key)
</span><span class='line'>  (declare (type cons list)
</span><span class='line'>           (type function test key))
</span><span class='line'>  (let ((second (cdr list)))
</span><span class='line'>    (declare (type cons second))
</span><span class='line'>    (when (funcall test (funcall key (car second))
</span><span class='line'>                        (funcall key (car list)))
</span><span class='line'>      (rotatef (car list) (car second)))
</span><span class='line'>    (values list second (shiftf (cdr second) nil))))
</span><span class='line'>
</span><span class='line'>(defun stable-sort-list-3 (list test key)
</span><span class='line'>  (declare (type cons list)
</span><span class='line'>           (type function test key))
</span><span class='line'>  (let* ((second (cdr list))
</span><span class='line'>         (third  (cdr second))
</span><span class='line'>         (x (car list))
</span><span class='line'>         (y (car second))
</span><span class='line'>         (z (car third)))
</span><span class='line'>    (declare (type cons second third))
</span><span class='line'>    (when (funcall test (funcall key y)
</span><span class='line'>                        (funcall key x))
</span><span class='line'>      (rotatef x y))
</span><span class='line'>    (let ((key-z (funcall key z)))
</span><span class='line'>      (when (funcall test key-z
</span><span class='line'>                          (funcall key y))
</span><span class='line'>        (if (funcall test key-z
</span><span class='line'>                          (funcall key x))
</span><span class='line'>            (rotatef x z y)
</span><span class='line'>            (rotatef z y))))
</span><span class='line'>    (setf (car list)   x
</span><span class='line'>          (car second) y
</span><span class='line'>          (car third)  z)
</span><span class='line'>    (values list third (shiftf (cdr third) nil))))</span></code></pre></td></tr></table></div></figure>




<figure class='code'><figcaption><span>new sort function </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
<span class='line-number'>23</span>
<span class='line-number'>24</span>
<span class='line-number'>25</span>
<span class='line-number'>26</span>
<span class='line-number'>27</span>
<span class='line-number'>28</span>
<span class='line-number'>29</span>
<span class='line-number'>30</span>
<span class='line-number'>31</span>
<span class='line-number'>32</span>
<span class='line-number'>33</span>
<span class='line-number'>34</span>
<span class='line-number'>35</span>
<span class='line-number'>36</span>
<span class='line-number'>37</span>
<span class='line-number'>38</span>
<span class='line-number'>39</span>
<span class='line-number'>40</span>
<span class='line-number'>41</span>
<span class='line-number'>42</span>
<span class='line-number'>43</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(defconstant +stable-sort-fast-merge-limit+ 8)
</span><span class='line'>
</span><span class='line'>(defun stable-sort-list (list test key &aux (head (cons :head list)))
</span><span class='line'>  (declare (type list list)
</span><span class='line'>           (type function test key)
</span><span class='line'>           (dynamic-extent head))
</span><span class='line'>  (labels ((merge* (size list1 tail1 list2 tail2 rest)
</span><span class='line'>             (declare (optimize speed)
</span><span class='line'>                      (type (and fixnum unsigned-byte) size)
</span><span class='line'>                      (type cons list1 tail1 list2 tail2))
</span><span class='line'>             (when (>= size +stable-sort-fast-merge-limit+)
</span><span class='line'>               (cond ((not (funcall test (funcall key (car list2))   ; stability
</span><span class='line'>                                         (funcall key (car tail1)))) ; trickery
</span><span class='line'>                      (setf (cdr tail1) list2)
</span><span class='line'>                      (return-from merge* (values list1 tail2 rest)))
</span><span class='line'>                     ((funcall test (funcall key (car tail2))
</span><span class='line'>                                    (funcall key (car list1)))
</span><span class='line'>                      (setf (cdr tail2) list1)
</span><span class='line'>                      (return-from merge* (values list2 tail1 rest)))))
</span><span class='line'>             (values (merge-lists* head list1 list2 test key)
</span><span class='line'>                     (if (null (cdr tail1))
</span><span class='line'>                         tail1
</span><span class='line'>                         tail2)
</span><span class='line'>                     rest))
</span><span class='line'>           (recur (list size)
</span><span class='line'>             (declare (optimize speed)
</span><span class='line'>                      (type cons list)
</span><span class='line'>                      (type (and fixnum unsigned-byte) size))
</span><span class='line'>             (cond ((> size 3)
</span><span class='line'>                    (let ((half (ash size -1)))
</span><span class='line'>                      (multiple-value-bind (list1 tail1 rest)
</span><span class='line'>                          (recur list half)
</span><span class='line'>                        (multiple-value-bind (list2 tail2 rest)
</span><span class='line'>                            (recur rest (- size half))
</span><span class='line'>                          (merge* size list1 tail1 list2 tail2 rest)))))
</span><span class='line'>                   ((= size 3)
</span><span class='line'>                    (stable-sort-list-3 list test key))
</span><span class='line'>                   ((= size 2)
</span><span class='line'>                    (stable-sort-list-2 list test key))
</span><span class='line'>                   (t ; (= size 1)
</span><span class='line'>                    (values list list (shiftf (cdr list) nil))))))
</span><span class='line'>    (when list
</span><span class='line'>      (values (recur list (length list))))))</span></code></pre></td></tr></table></div></figure>


<p>It&#8217;s somewhat longer than the original, but not much more complicated:
the extra code mostly comes from the tedious but simple leaf sorts.
Particularly satisfying is the absence of conditional move hacks: SBCL
only recognizes trivial forms, and only on X86 or X86-64, so such code
tends to be ugly and sometimes sacrifices performance on other
platforms.  SBCL&#8217;s poor support for conditional moves may explain the
lack of any speed-up from converting branches to select expressions:
the conditional swaps had to be implemented as pairs of independent
tests with T/NIL and conditional moves.  Worse, when the comparison is
inlined, an additional conditional move converted the result of the
integer comparison to a boolean value; in total, three pairs of
comparison/conditional move were then executed instead of one
comparison and two conditional moves.
<a href="http://www.cphstl.dk/Paper/Quicksort/sea12.pdf">Previous work [PDF]</a>
on out-of-place array merge sort in C found it useful to switch to a
conditional move-based merge loop and sorting networks.  Some of the
difference is probably caused by SBCL&#8217;s weaker code generation, but
the additional overhead inherent to linked list manipulations
(compared to linear accesses in arrays) may also play a part.</p>

<p>Another code generation issue is caused by the way the initial version
called the comparison function in exactly one place.  This meant that
arbitrary comparators would almost always be inlined in the
specialised sort&#8217;s single call site.  We lose that property with
accelerated merge and larger base cases.  That issue doesn&#8217;t worry me
too much: functions can be declared inline explicitly, and the key
function was already called from multiple sites.</p>

<p>I&#8217;m a bit surprised that neither the sorting networks nor the merge loop
was appreciably sped up by rewriting them with conditional moves.  I&#8217;m
a lot more surprised by the fact that it pays off to try and detect
pre-sorted lists even on tiny merges, and even when the comparator is
inlined.  The statistical tests were useful here, with results that
defy my initial expectations and let me keep the code simpler.  I
would be pleasantly surprised if complex performance improvement
patches, in SBCL and otherwise, went through similar testing.  Code is
a long-term liability, and we ought to be convinced the additional
complexity is worth the trouble.</p>

<p>Independently of that, the Latin square design was very helpful: it
easily saved me a couple CPU-weeks, and I can see myself using it
regularly in the future.  The approach only works if we already have a
rough (and simple) performance model, but I have a hard time
interpreting complex models with hundreds of interacting parameters
anyway.  Between a simplistic, but still useful, model and a complex
one with a much stronger fit, I&#8217;ll usually choose the former… as long
as I can be fairly certain the simple model isn&#8217;t showing me a mirage.</p>

<p>More generally, research domains that deal with the real world have
probably already hit the kind of scaling issues we&#8217;re now facing when
we try to characterise how computers and digital systems function.
Brute forcing is easier with computers than with interns, but it can
still pay off to look elsewhere.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Binary search is a pathological case for caches]]></title>
    <link href="http://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-pathological-case-for-caches/"/>
    <updated>2012-07-30T01:30:00+02:00</updated>
    <id>http://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-pathological-case-for-caches</id>
    <content type="html"><![CDATA[<p>Programmers tend to like round numbers, i.e. powers of two. So do
hardware designers. Sadly, this shared value doesn&#8217;t always work to
our advantage. One
<a href="http://yarchive.net/comp/cache_thrashing.html">common</a>
<a href="http://www.realworldtech.com/forum/?threadid=54793&amp;curpostid=54839">issue</a>
is that of cache line aliasing induced by alignment.</p>

<p>Binary search suffers from a related ailment when executed on medium
or large vectors of almost power-of-two size (in bytes), but it can be
cured.  Once that is done, searching a sorted vector can be as fast
as searches with a well-tuned hash table, for a few realistic access
patterns.</p>

<p>The task is interesting to me because I regularly work with static, or
almost static, sets: sets for which there&#8217;s a majority of lookups,
while updates are either rare or batchable.  For such sets, the
improved performance of explicit balanced search trees on insertions
is rarely worth the slowdown on lookups or the additional space
usage.  Replacing binary search with slightly off-center binary or
quaternary (four-way) searches only adds a bit more code to provide
even quicker, more consistent lookup times.</p>

<h2>Analyze this</h2>

<p>The following jittered scatterplot reports the runtimes, on my dual
2.8 GHz X5660, for binary searches for the first value in a single
sorted vector of size 16 to 1G elements (32-bit <code>unsigned</code>s), and is
overlaid with the median in green.  For each power-of-two size, the
benchmark program was executed 10 times, with 100 000 binary searches
per execution (and each search was timed individually, with <code>rdtscp</code>).
Finally,
<a href="http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access">NUMA</a> effects
were eliminated by forcing the processes to run and acquire memory
from the same un-loaded node.  1 000 000 points are hard to handle, so
the plot is only based on a 0.1% random sample (i.e. 1 000 points for
each size).</p>

<p><img class="center" src="http://www.pvk.ca/images/2012-07-30-binary-search-is-a-pathological-case-for-caches/bsearch-first-cycles.png"></p>

<p>Two things bother me here.  The performance degrades much more quickly
when \(\lg(n)\geq 17\) (i.e. the vector is 512 KB or larger), but, worse,
it shows a lot of variation on larger vectors.  The variation seems
more critical to me. Examining the data set reveals that the variation
is purely inter-execution: each set of 100 000 repetitions shows
nearly constant timings.</p>

<p>The degradation after \(\lg(n)=17\) is worrying because a drop in
performance when the input is 512 KB or larger looks a lot like a
cache issue.  However, the workload is perfectly cacheable: even on
vectors of 1G values, binary search is always passed the same key and
vector, and thus always reads from the same 30 locations.  This should
have no trouble fitting in L1 cache, for the whole range of sizes
considered here.</p>
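<p>To make the size of that working set concrete, here is a sketch that lists the indices a textbook binary search reads when the key is the first value of the vector.  The loop shape (halving a half-open range) is an assumption made for illustration; the benchmarked code may differ in its details.</p>

```lisp
(defun probes-for-first-value (n)
  "Indices read by a textbook binary search over N sorted elements
when the key is the smallest one, so every comparison goes left.
The half-open [lo, hi) loop is an assumed, illustrative variant."
  (let ((lo 0)
        (hi n)
        (probes '()))
    (loop while (> (- hi lo) 1)
          do (let ((mid (+ lo (ash (- hi lo) -1))))
               (push mid probes)
               (setf hi mid)))          ; key < v[mid]: keep the left half
    (nreverse probes)))

(probes-for-first-value 16)                  ; => (8 4 2 1)
(length (probes-for-first-value (ash 1 30))) ; => 30
```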

<p><span class='pullquote-right' data-pullquote='If the definition of insanity is doing the same thing over and over again and expecting different results, contemporary computers are quite ill. '>
The variation is even stranger.  It happens between executions of
identical binaries for a deterministic program, on the same machine
and OS, but never in a single execution.  The variation makes no
sense.  If the definition of insanity is doing the same thing over and over again and expecting different results, contemporary computers are quite ill.
Worse, the variation would never have been uncovered by
repetitions in a single process.  I was lucky to look at the
results from multiple executions, otherwise I could easily have
concluded that binary search for the same element in a vector of 1G
32-bit elements almost always takes around 700 cycles, or almost
always 1400ish cycles.
</span></p>

<p>What&#8217;s behind the speed bump and the extreme variations?</p>

<p>The next graph reports the number of cache misses at the first,
second, third and translation lookaside buffer levels.  The average
count per search over 100 000 repetitions was recorded in 10
independent executions for each size.  The points correspond to
medians of ten average counts, and the vertical lines span the minimum
and maximum counts.</p>

<p><img class="center" src="http://www.pvk.ca/images/2012-07-30-binary-search-is-a-pathological-case-for-caches/bsearch-first-cache-misses.png"></p>

<p>The miss counts between executions are pretty much constant for L3 and
TLB, and reasonably consistent for L1.  L1 and TLB misses seem to
cause the slowdown, even though the working sets are tiny.  L2 misses
on \(\lg(n)\geq 24\) are all over the place, and seem to be behind
the variability.</p>

<h3>Yet another meaning for aliasing</h3>

<p>Something&#8217;s wrong with the caches.  We&#8217;ve found an issue with classic
binary search &#8211; that splits the range at each iteration in the middle
&#8211; on vectors of (nearly-)power-of-two size, in bytes.</p>

<p><span class='pullquote-right' data-pullquote='nicely-aligned data will often map to the same cache lines. '>
Nearly all (cacheful) processors to this day map memory addresses to
cache locations by taking the modulo with a power of two.  For
example, with 1K cache lines of 64 bytes, <code>0xabcd = 43981</code> is mapped
to \((43981\div 64)\mod 1024\).  This is as simple to understand as
it is to implement: just select a short range of bits from the
address.  It also has the side effect that nicely-aligned data will often map to the same cache lines.
Two vectors of 64KB each, allocated at 0x10000 and 0x20000, have the
unfortunate property that, for each index, the data at that index in
both vectors will map to the same cache lines in a direct-mapped cache
with 1024 lines of 64 bytes each. When only one cache line is allowed
per location, iterating through both vectors in the same loop will
ping-pong the data from both vectors between cache and memory, without
ever benefiting from caching.  The addresses alias each other, and
that&#8217;s bad.
</span></p>
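<p>The mapping is short enough to write down.  The following sketch assumes the direct-mapped cache used in the example above (1024 lines of 64 bytes); the function name and keyword parameters are mine.</p>

```lisp
(defun cache-line-index (address &key (line-size 64) (line-count 1024))
  "Slot of ADDRESS in a direct-mapped cache of LINE-COUNT lines of
LINE-SIZE bytes each: drop the offset bits, keep the next few."
  (mod (floor address line-size) line-count))

(cache-line-index #xabcd)               ; => 687

;; With power-of-two sizes, this is just a bit field of the address.
(ldb (byte 10 6) #xabcd)                ; => 687

;; The same element in two 64KB vectors allocated at #x10000 and
;; #x20000 lands on the same cache line.
(let ((offset (* 4 1000)))              ; element 1000 of 32-bit values
  (= (cache-line-index (+ #x10000 offset))
     (cache-line-index (+ #x20000 offset)))) ; => T
```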

<p>There are workarounds, on both the hardware and software sides.</p>

<p>Current X86s tend to avoid fast, direct-mapped, caches even at the
first level:
<a href="http://duartes.org/gustavo/blog/post/intel-cpu-caches">set-associative caches</a>
are designed to handle a small number of collisions (like fixed-size
buckets in hash tables), from 2-4 in the fastest levels to 8, 16 or
more for last level caches.  This ensures that a program can access
two similarly-aligned vectors in lockstep, without having each access
to one vector evicting cached data from the other.  It&#8217;s far from
perfect: a 2-way associative cache enables a loop over two vectors to
work fine, but, as soon as the loop iterates over three or more
vectors, it suddenly becomes much slower. Arguably more robust
solutions involving prime moduli, hash functions, or other clever
address-to-cache-line mappings have been proposed since the 80&#8217;s
(<a href="http://www.kharbutli.com/Papers/Kharbutli_HPCA_04.pdf">this [pdf]</a> is a
recent one, with a nice set of references to older work), but I don&#8217;t
think any has ever gained traction.</p>
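<p>A toy simulation shows how abrupt that cliff is.  The sketch below models a single set-associative cache with LRU replacement, and nothing else: no prefetching, no other cache levels, and a geometry (1024 sets of 64-byte lines) chosen for illustration rather than taken from a real processor.  All the names are hypothetical.</p>

```lisp
(defun count-misses (addresses &key (ways 2) (sets 1024) (line-size 64))
  "Misses for the byte ADDRESSES, in order, in a set-associative cache
with LRU replacement."
  (let ((cache (make-array sets :initial-element '()))
        (misses 0))
    (dolist (address addresses misses)
      (let* ((line (floor address line-size))
             (index (mod line sets))
             (resident (aref cache index))) ; most recently used first
        (unless (member line resident)
          (incf misses))
        (let ((updated (cons line (remove line resident))))
          (setf (aref cache index)
                (if (> (length updated) ways)
                    (subseq updated 0 ways) ; evict the least recently used
                    updated)))))))

(defun lockstep-addresses (bases length &key (stride 4))
  "Addresses read when walking LENGTH bytes of each vector in BASES in
the same loop, STRIDE bytes at a time."
  (loop for offset from 0 below length by stride
        append (loop for base in bases collect (+ base offset))))

;; Two similarly-aligned vectors, direct-mapped: every read misses.
(count-misses (lockstep-addresses '(#x10000 #x20000) 4096) :ways 1)
;; => 2048

;; Same traversal, 2-way: only the first touch of each line misses.
(count-misses (lockstep-addresses '(#x10000 #x20000) 4096) :ways 2)
;; => 128

;; A third vector brings the 2-way cache back to missing every time.
(count-misses (lockstep-addresses '(#x10000 #x20000 #x30000) 4096) :ways 2)
;; => 3072
```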

<p>On the software end, smart memory allocators try to alleviate the
problem by offsetting large allocations away from the beginning of
pages.  Inserting a variable number of cache lines as padding
(<em>before</em> allocated data) minimises the odds that vector traversals
suffer from aliasing.  Sadly, that&#8217;s not yet the case for SBCL. In
fact, as well as pretty much guaranteeing that large vectors will
alias each other, SBCL&#8217;s allocator also makes cache-aware traversals
hard to write: the allocator aligns the vector&#8217;s header (two words for
a length and type descriptor) rather than its data.  That&#8217;s the kind
of consideration that is too easily disregarded when designing a
runtime system; getting it right after the fact probably isn&#8217;t hard,
but may call for some unsavory hacks.</p>

<h3>Aliasing in binary search</h3>

<p>Binary search doesn&#8217;t involve lockstep traversals over multiple
vectors. However, when the vector to search into is of power-of-two
length (or almost), the first few accesses are to similarly-aligned
addresses (e.g. in a 64KB vector of 32-bit values, it could access
byte offsets 0x8000, 0x4000, 0x2000, 0x1000, etc.). Similar issues crop up
when working with power-of-two-sized matrices, or when bit-reversing a
vector.  We can easily convince ourselves that the access patterns
will be similarly structured when binary searching vectors of size
slightly (on the order of one or two cache lines) shorter or longer
than powers of two.</p>

<p>On my X5660, the L1D cache has 512 lines of 64 bytes each, L2 4096
lines, and L3 196608 lines (12 MB).  Crucially, the L1 and L2 caches are
8-way set-associative, and the L3 16-way.  The
<a href="http://en.wikipedia.org/wiki/Translation_lookaside_buffer">TLBs</a> are
4-way associative, and the first level has 64 entries and the second
512.</p>

<p><span class='pullquote-right' data-pullquote='so many reads are to locations that map to the same cache lines that a tiny working set exceeds the L1D&#8217;s wayness. '>
The number of L1 cache misses explodes on vectors of 512K elements and
more.  The L1D is designed so that only contiguous ranges of
\(64\times 512\div 8 = 4 \textrm{KB}\) (one normal page) can be cached without
collisions.  For \(\lg(n)=19\), binary search needs 10 reads that all
map to the same set of cache lines before reducing the candidate range
below 4 KB.   Once the first level cache&#8217;s 8-way associativity has
been exhausted, binary search on a vector of 512K 4-byte
elements still has to read two more locations before hitting a different set
(bucket) of cache lines.  This accounts for the L1 misses:
so many reads are to locations that map to the same cache lines that a tiny working set exceeds the L1D&#8217;s wayness.
To make things worse,
<a href="http://en.wikipedia.org/wiki/Cache_algorithms#Least_Recently_Used">LRU</a>-style
replacement strategies are useless.  We have a similar issue on TLB
misses (plus, filling the TLB on misses means additional data cache
misses to walk the page table).
</span></p>
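
<p>To make the aliasing arithmetic concrete, here is a minimal sketch of how a byte address would map to an L1D set, assuming the geometry quoted above (64-byte lines, 512 lines, 8 ways, hence 64 sets).  The names <code>l1d_set_index</code> and <code>same_l1d_set</code> are made up for this illustration, and real hardware may well index its caches differently.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Assumed L1D geometry, taken from the figures quoted in the text. */
enum {
        LINE_BYTES = 64,
        L1D_LINES  = 512,
        L1D_WAYS   = 8,
        L1D_SETS   = L1D_LINES / L1D_WAYS   /* 64 sets */
};

/* Set (bucket) that a byte address maps to: drop the offset within
 * the line, then keep the low-order bits of the line number. */
static unsigned l1d_set_index(uintptr_t address)
{
        /* Sanity check: one way covers 64 * 64 = 4 KB of addresses. */
        assert(LINE_BYTES * L1D_SETS == 4096);
        return (unsigned)((address / LINE_BYTES) % L1D_SETS);
}

/* Do two byte addresses compete for the same set?  Addresses whose
 * distance is a multiple of 4 KB always do. */
static int same_l1d_set(uintptr_t a, uintptr_t b)
{
        return l1d_set_index(a) == l1d_set_index(b);
}
```

<p>With 4-byte elements, indices 0x40000 and 0x20000 are 512 KB apart, so <code>same_l1d_set(4 * 0x40000, 4 * 0x20000)</code> holds, and it keeps holding when the same base address is added to both offsets.</p>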

<p>In contrast, the L3 is large enough and has a high-enough
associativity to accommodate the binary search.</p>

<p><span class='pullquote-right' data-pullquote='Sometimes the OS hands us a range of physical pages that works well for our workloads, other times it doesn&#8217;t. '>
That leaves the wildly-varying miss counts for the second level cache.
The L1D cache is able to fulfill some reads, which explains why L2
misses only appear on larger searches.  The variation is surprising.
What happens is that, on the X5660, caching depends on physical memory
addresses, and those are fully under the OS&#8217;s control and hidden from
userspace.  The exception is the L1D cache, in which the <em>sets</em> only
cover 4 KB (one normal page).  It&#8217;s just small enough that the bits used
to determine the set are always the same for both virtual and physical
addresses, and the CPU can thus perform part of the lookup in parallel with
address translation. This hidden variable explains the large
discrepancies in L2 miss counts that are then reflected in
runtimes. Sometimes the OS hands us a range of physical pages that works well for our workloads, other times it doesn&#8217;t.
Not only is there nothing we can do about it, but we also can&#8217;t even
know about it, except indirectly, or with unportable hacks.  Finally,
the OS can&#8217;t change the mapping without copying data, so that happens
very rarely, and memory allocators tend to minimise syscalls by keeping
previously-allocated address space around.  In effect, an application
that&#8217;s unlucky in its allotment of physical addresses can do little
about it and can&#8217;t really tell except by logging bad performance.
</span></p>

<h2>Searching for a solution</h2>

<p>Classic binary search on power-of-two sized vectors is a pathological
case for small (i.e. fast) caches and TLBs, and its performance varies
a lot based on physical address allocation, something over which we
have no control.  The previous section only covered a trivial case,
and real workloads will be even worse.  We could try to improve
performance and consistency by introducing padding, either after each
element or between chunks of elements, but I&#8217;d prefer a different
solution that preserves the optimal density and simplicity of sorted
vectors.</p>

<p><span class='pullquote-right' data-pullquote='The only remaining knob to improve the performance and consistency of binary search is the division factor. '>
There are many occurrences of two as a magic number in our test cases.
On the hardware side, cache lines are power-of-two sizes, as are L1
and L2 set (buckets) counts.  On the software side, each unsigned is
4-bytes long, the vectors are of power-of-two lengths, and binary
search divides the size of the candidate range by two.  There&#8217;s little
we can do about hardware, and changing our interface to avoid round
element sizes or counts isn&#8217;t always possible.  The only remaining knob to improve the performance and consistency of binary search is the division factor.
</span></p>

<p>The successor of two is three.  How can we divide a power-of-two size
vector in three subranges? One way is to vary the size of the
subdivided range, and keep track of it at runtime.  There&#8217;s a simpler
workaround: it&#8217;s all right, if suboptimal, to search over a wider
range than strictly necessary, as long as no out-of-bounds access is
performed.  For example, a vector of length 8 can be split at indices
\(\lceil 8/3\rceil = 3\) and \(8-3 = 5\); the next range will be either
[0, 3), [3, 5) or [5, 8). The middle range is of length 2 rather than 3,
but the search is still correct if it&#8217;s executed over [3, 6) instead of [3, 5).</p>
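
<p>The splitting rule above can be sketched as a runtime loop.  This is only an illustration under the same assumptions as the generated code (a sorted vector whose first element is no greater than the key); the function name <code>ternary_lower_bound</code> is hypothetical, and the loop uses plain conditionals where the generated functions use unrolled macro steps.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Index of the last element <= key in a sorted vector of n > 0
 * elements, assuming vector[0] <= key. */
static size_t ternary_lower_bound(unsigned key, const unsigned *vector, size_t n)
{
        const unsigned *base = vector;

        assert(n > 0);
        while (n > 2) {
                size_t third = (n + 2) / 3;          /* ceil(n / 3) */
                const unsigned *b1 = base + third;
                const unsigned *b2 = base + (n - third);

                if (*b1 <= key) base = b1;
                if (*b2 <= key) base = b2;
                /* All three subranges are searched as if they held
                 * `third` elements; the middle one may be shorter, but
                 * base + third never runs past the current range. */
                n = third;
        }
        /* A range of two elements is finished with one binary step. */
        if (n == 2 && base[1] <= key)
                base++;
        return (size_t)(base - vector);
}
```

<p>For a vector of length 8, the first iteration probes indices 3 and 5 and continues on a range of length 3, exactly as in the example above.</p>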

<p>Ternary search functions were generated for each vector size; when the last
range was of length 2, a binary search step was executed instead.  The example
below is for a vector of length 32K. It implements a lower bound search,
assuming that the key is greater than or equal to the first element in the
vector (otherwise, the first element is spuriously considered as the lower bound).</p>

<figure class='code'><figcaption><span>ternary search over 32768 elements  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
</pre></td><td class='code'><pre><code class='c'><span class='line'><span class="kt">size_t</span> <span class="nf">t_15</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">key</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="o">*</span> <span class="n">vector</span><span class="p">)</span>
</span><span class='line'><span class="p">{</span>
</span><span class='line'>     <span class="kt">unsigned</span> <span class="o">*</span> <span class="n">base</span> <span class="o">=</span> <span class="n">vector</span><span class="p">;</span>
</span><span class='line'>     <span class="n">TERNARY</span><span class="p">(</span><span class="mi">10923u</span><span class="p">,</span> <span class="mi">21845u</span><span class="p">);</span>       <span class="cm">/*      32768 -&gt; 10923      */</span>
</span><span class='line'>     <span class="n">TERNARY</span><span class="p">(</span><span class="mi">3641u</span><span class="p">,</span> <span class="mi">7282u</span><span class="p">);</span>         <span class="cm">/*      10923 -&gt; 3641       */</span>
</span><span class='line'>     <span class="n">TERNARY</span><span class="p">(</span><span class="mi">1214u</span><span class="p">,</span> <span class="mi">2427u</span><span class="p">);</span>         <span class="cm">/*       3641 -&gt; 1214       */</span>
</span><span class='line'>     <span class="n">TERNARY</span><span class="p">(</span><span class="mi">405u</span><span class="p">,</span> <span class="mi">809u</span><span class="p">);</span>           <span class="cm">/*       1214 -&gt; 405        */</span>
</span><span class='line'>     <span class="n">TERNARY</span><span class="p">(</span><span class="mi">135u</span><span class="p">,</span> <span class="mi">270u</span><span class="p">);</span>           <span class="cm">/*        405 -&gt; 135        */</span>
</span><span class='line'>     <span class="n">TERNARY</span><span class="p">(</span><span class="mi">45u</span><span class="p">,</span> <span class="mi">90u</span><span class="p">);</span>             <span class="cm">/*        135 -&gt; 45         */</span>
</span><span class='line'>     <span class="n">TERNARY</span><span class="p">(</span><span class="mi">15u</span><span class="p">,</span> <span class="mi">30u</span><span class="p">);</span>             <span class="cm">/*         45 -&gt; 15         */</span>
</span><span class='line'>     <span class="n">TERNARY</span><span class="p">(</span><span class="mi">5u</span><span class="p">,</span> <span class="mi">10u</span><span class="p">);</span>              <span class="cm">/*         15 -&gt; 5          */</span>
</span><span class='line'>     <span class="n">TERNARY</span><span class="p">(</span><span class="mi">2u</span><span class="p">,</span> <span class="mi">3u</span><span class="p">);</span>               <span class="cm">/*          5 -&gt; 2          */</span>
</span><span class='line'>     <span class="n">BINARY</span><span class="p">(</span><span class="mi">1u</span><span class="p">);</span>                    <span class="cm">/*          2 -&gt; 1          */</span>
</span><span class='line'>     <span class="k">return</span> <span class="n">base</span><span class="o">-</span><span class="n">vector</span><span class="p">;</span>
</span><span class='line'><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure>


<p>Similar code was generated for binary searches.  We still have to define the expansion
for each <code>TERNARY</code> or <code>BINARY</code> step.</p>

<p><code>BINARY</code> is easy: we simply reuse the logic from the previous post on binary search and
branch mispredictions.</p>

<figure class='code'><figcaption><span>binary splitting step  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
</pre></td><td class='code'><pre><code class='c'><span class='line'><span class="cp">#define BARRIER do { __asm__ volatile(&quot;&quot; ::: &quot;memory&quot;); } while (0)</span>
</span><span class='line'>
</span><span class='line'><span class="cp">#define BINARY(I) do {                                          \</span>
</span><span class='line'><span class="cp">                base = ((base)[I] &lt;= key)?base+I:base;          \</span>
</span><span class='line'><span class="cp">                BARRIER;                                        \</span>
</span><span class='line'><span class="cp">        } while (0)</span>
</span></code></pre></td></tr></table></div></figure>


<p>The compiler barrier was the most straightforward way I found to
prevent gcc from noticing potentially-redundant accesses and trying to
optimise them away by inserting branches.</p>

<p><code>TERNARY</code> presents a choice: do we try to expose
<a href="http://en.wikipedia.org/wiki/Memory-level_parallelism">memory-level parallelism (MLP)</a>
and read from both locations in parallel before comparing them to the
key, or do we instead avoid useless memory traffic and make the second
load depend on the result of the first comparison?</p>

<p>The first implementation is simple.</p>

<figure class='code'><figcaption><span>ternary splitting step with concurrent loads  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
</pre></td><td class='code'><pre><code class='c'><span class='line'><span class="cp">#define TERNARY(I, J) do {                                      \</span>
</span><span class='line'><span class="cp">                unsigned * b1 = base+I, *b2 = base+J;           \</span>
</span><span class='line'><span class="cp">                unsigned mid1 = *b1, mid2 = *b2;                \</span>
</span><span class='line'><span class="cp">                BARRIER;                                        \</span>
</span><span class='line'><span class="cp">                base = (mid1 &lt;= key)?b1:base;                   \</span>
</span><span class='line'><span class="cp">                base = (mid2 &lt;= key)?b2:base;                   \</span>
</span><span class='line'><span class="cp">        } while(0)</span>
</span></code></pre></td></tr></table></div></figure>


<p>The second is only a bit more convoluted:</p>

<figure class='code'><figcaption><span>ternary splitting step without useless loads  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
</pre></td><td class='code'><pre><code class='c'><span class='line'><span class="cp">#define TERNARY(I, J) do {                                      \</span>
</span><span class='line'><span class="cp">                unsigned * b1 = base+I, *b2 = base+J;           \</span>
</span><span class='line'><span class="cp">                b1 = (*b2 &lt;= key)?b2:b1;                        \</span>
</span><span class='line'><span class="cp">                BARRIER;                                        \</span>
</span><span class='line'><span class="cp">                base = (*b1 &lt;= key)?b1:base;                    \</span>
</span><span class='line'><span class="cp">                BARRIER;                                        \</span>
</span><span class='line'><span class="cp">        } while(0)</span>
</span></code></pre></td></tr></table></div></figure>


<p>If the first comparison succeeds, the second will repeat the exact
same operation, and store the pointer in <code>base</code>.</p>

<p>Finally, a quaternary search was also implemented: it enables
memory-level parallelism by reading from three locations at each
iteration to divide the range&#8217;s size by four.  As the code snippet
below shows, the high-level access patterns are otherwise exactly the
same as binary search, and a two-way division is invoked when the
range is down to two elements.</p>

<figure class='code'><figcaption><span>quaternary search over 32768 elements  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
</pre></td><td class='code'><pre><code class='c'><span class='line'><span class="cp">#define QUATERNARY(I, J, K) do {                                        \</span>
</span><span class='line'><span class="cp">                unsigned * b1 = base+I, *b2 = base+J, *b3 = base+K;     \</span>
</span><span class='line'><span class="cp">                unsigned mid1 = *b1, mid2 = *b2, mid3 = *b3;            \</span>
</span><span class='line'><span class="cp">                BARRIER;                                                \</span>
</span><span class='line'><span class="cp">                base = (mid1 &lt;= key)?b1:base;                           \</span>
</span><span class='line'><span class="cp">                base = (mid2 &lt;= key)?b2:base;                           \</span>
</span><span class='line'><span class="cp">                base = (mid3 &lt;= key)?b3:base;                           \</span>
</span><span class='line'><span class="cp">        } while(0)</span>
</span><span class='line'>
</span><span class='line'><span class="kt">size_t</span> <span class="nf">q_15</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">key</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="o">*</span> <span class="n">vector</span><span class="p">)</span>
</span><span class='line'><span class="p">{</span>
</span><span class='line'>     <span class="kt">unsigned</span> <span class="o">*</span> <span class="n">base</span> <span class="o">=</span> <span class="n">vector</span><span class="p">;</span>
</span><span class='line'>     <span class="n">QUATERNARY</span><span class="p">(</span><span class="mi">8192u</span><span class="p">,</span> <span class="mi">16384u</span><span class="p">,</span> <span class="mi">24576u</span><span class="p">);</span> <span class="cm">/*      32768 -&gt; 8192       */</span>
</span><span class='line'>     <span class="n">QUATERNARY</span><span class="p">(</span><span class="mi">2048u</span><span class="p">,</span> <span class="mi">4096u</span><span class="p">,</span> <span class="mi">6144u</span><span class="p">);</span> <span class="cm">/*       8192 -&gt; 2048       */</span>
</span><span class='line'>     <span class="n">QUATERNARY</span><span class="p">(</span><span class="mi">512u</span><span class="p">,</span> <span class="mi">1024u</span><span class="p">,</span> <span class="mi">1536u</span><span class="p">);</span> <span class="cm">/*       2048 -&gt; 512        */</span>
</span><span class='line'>     <span class="n">QUATERNARY</span><span class="p">(</span><span class="mi">128u</span><span class="p">,</span> <span class="mi">256u</span><span class="p">,</span> <span class="mi">384u</span><span class="p">);</span>  <span class="cm">/*        512 -&gt; 128        */</span>
</span><span class='line'>     <span class="n">QUATERNARY</span><span class="p">(</span><span class="mi">32u</span><span class="p">,</span> <span class="mi">64u</span><span class="p">,</span> <span class="mi">96u</span><span class="p">);</span>     <span class="cm">/*        128 -&gt; 32         */</span>
</span><span class='line'>     <span class="n">QUATERNARY</span><span class="p">(</span><span class="mi">8u</span><span class="p">,</span> <span class="mi">16u</span><span class="p">,</span> <span class="mi">24u</span><span class="p">);</span>      <span class="cm">/*         32 -&gt; 8          */</span>
</span><span class='line'>     <span class="n">QUATERNARY</span><span class="p">(</span><span class="mi">2u</span><span class="p">,</span> <span class="mi">4u</span><span class="p">,</span> <span class="mi">6u</span><span class="p">);</span>        <span class="cm">/*          8 -&gt; 2          */</span>
</span><span class='line'>     <span class="n">BINARY</span><span class="p">(</span><span class="mi">1u</span><span class="p">);</span>                    <span class="cm">/*          2 -&gt; 1          */</span>
</span><span class='line'>     <span class="k">return</span> <span class="n">base</span><span class="o">-</span><span class="n">vector</span><span class="p">;</span>
</span><span class='line'><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure>


<p>Comparing all four implementations (binary search, ternary search with
parallel loads, ternary search with sequential loads, and quaternary
search with parallel loads) helps us understand the respective impact
of minimising aliasing issues and exploiting memory-level parallelism.</p>

<p>Theoretically, the serial ternary search is the second best
implementation after binary search, which is optimal in terms of reads
per reduction in candidate range size: it executes (expected count,
for uniformly-distributed searches) 1.667 reads to divide the
range by three, so \(1.6\overline{6}/\lg(3) \approx 1.05\) times as
many comparisons as binary search, versus approximately 1.26 and 1.5
times as many for parallel ternary and quaternary searches.  On the
other hand, if we can expect independent reads to be executed in
parallel, parallel ternary search will only perform \(1/\lg(3)
\approx 0.63\) times as many batches of reads as binary search, and
quaternary search half as many.</p>
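
<p>Spelling out the arithmetic behind these ratios (a worked restatement, still assuming uniformly distributed keys): the serial ternary step touches one location when the key falls in the upper third of the range and two otherwise, so it performs \(\frac{1}{3}\cdot 1 + \frac{2}{3}\cdot 2 = \frac{5}{3}\) reads per step.  Each ternary step yields \(\lg(3) \approx 1.585\) bits and each quaternary step exactly 2, so the reads per bit of range reduction are \(\frac{5/3}{\lg(3)} \approx 1.05\), \(\frac{2}{\lg(3)} \approx 1.26\) and \(\frac{3}{2} = 1.5\) for serial ternary, parallel ternary and quaternary search respectively, against 1 for binary search.</p>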

<p>The next graph reports the performance of the methods when searching
for the first element in sorted vectors, just to make sure we actually
handle that case right.  One would hope that the ternary searches
(&#8216;st&#8217; for the serial loads, and &#8216;ter&#8217; for the concurrent loads)
improve on the binary (&#8216;bin&#8217;) or quaternary (&#8216;quat&#8217;) searches (that
both suffer from aliasing issues); this is the reason we&#8217;re interested
in them.  We do observe a marked improvement in runtimes, and the second
graph suggests that it is indeed due to more successful caching
(beware, the colours correspond to cache levels there, not search
algorithms; there&#8217;s one facet for each search algorithm instead).
Interestingly, \(\lg(n)=30\) is a bit faster than \(\lg(n)=29\)
for the ternary searches; I can&#8217;t even hazard a guess as to what&#8217;s
going on there.</p>

<p><img class="center" src="http://www.pvk.ca/images/2012-07-30-binary-search-is-a-pathological-case-for-caches/prelim-first-cycles.png"></p>

<p><img class="center" src="http://www.pvk.ca/images/2012-07-30-binary-search-is-a-pathological-case-for-caches/prelim-first-cache-misses.png"></p>

<p><span class='pullquote-right' data-pullquote='Additional MLP when the accesses cause many more cache misses doesn&#8217;t help. '>
Additional MLP when the accesses cause many more cache misses doesn&#8217;t help. If
anything, the performance of quaternary search is worse and
less consistent than that of binary search.  However, it does accelerate
ternary search, for which caches work fine.  The serial ternary search
(&#8216;st&#8217;) only improves on the binary or quaternary search for
\(\lg(n)\geq 20\), where the classic searches begin to really suffer
from aliasing issues, but the ternary search with parallel loads
(&#8216;ter&#8217;) is always on par with the classic searches, or quicker.
</span></p>

<p>The next set of graphs compares the performance of the methods when
searching for random (present) keys; the same seed was used in each
execution and for each method to minimise differences.</p>

<p><img class="center" src="http://www.pvk.ca/images/2012-07-30-binary-search-is-a-pathological-case-for-caches/prelim-random-cycles.png"></p>

<p><img class="center" src="http://www.pvk.ca/images/2012-07-30-binary-search-is-a-pathological-case-for-caches/prelim-random-cache-misses.png"></p>

<p>On this less easily cached workload, we find the same general effect
on performance.  Ternary search works better on larger vectors, and
ternary search with parallel loads is faster even on small vectors,
although the quaternary search seems to fare better here.
Reassuringly, the ternary searches are also much more consistent, both
in runtimes (narrower spreads in the runtime jittered plot), and in
average cache misses, even with different physical memory allocation.
The cache miss graph shows that, not surprisingly, the versions with
parallel loads (<em>quat</em>ernary and <em>ter</em>nary searches) incur more misses
than the comparable methods (<em>bin</em>ary and <em>s</em>erial <em>t</em>ernary), but
that doesn&#8217;t prevent them from executing faster (see the runtime
graph).</p>

<p>I draw two conclusions from these preliminary experiments: avoiding
aliasing issues with slightly suboptimal subdivision (ternary
searches) is a win, and so is exposing memory level parallelism, even
when it results in more cache misses.</p>

<h2>Enough with the preliminaries</h2>

<p>The initial results let us exclude straight binary and quaternary
search from consideration, as well as serial ternary search.  We
also know we&#8217;re looking for variants that avoid aliasing and expose
memory-level parallelism.</p>

<p>Binary search is theoretically optimal (one bit of information to
divide the range in two), so we could also try a very close
approximation that avoids aliasing issues caused by exact division in
halves. I implemented a binary search with an offset
midpoint. Rather than guiding the recursion by the exact middle, the
offset binary search reads from the \(31/63\)rd index (rounded
up), and iterates on a subrange \(32/63\)rd as long.  This is only
slightly suboptimal: each lookup results in a division by \(63/32\),
so that the off-center search goes through approximately 2.3% more
lookups than a classic binary search.  Comparing other methods with
the offset binary search should be equivalent to comparing them with
a hypothetical classic binary search free of aliasing issues.</p>
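
<p>Here is a rough sketch of what such an off-center search could look like as a runtime loop.  The function names are hypothetical, and the fallback to plain halving on ranges shorter than 63 elements is my assumption for the sketch (it keeps the rounded-up probe from overshooting); the text above does not describe how short ranges were handled.</p>

```c
#include <assert.h>
#include <stddef.h>

/* ceil(31 * n / 63), computed without overflowing size_t. */
static size_t offset_midpoint(size_t n)
{
        return (n / 63) * 31 + ((n % 63) * 31 + 62) / 63;
}

/* Index of the last element <= key in a sorted vector of n > 0
 * elements, assuming vector[0] <= key. */
static size_t offset_binary_lower_bound(unsigned key, const unsigned *vector, size_t n)
{
        const unsigned *base = vector;

        assert(n > 0);
        while (n > 1) {
                /* Off-center probe on large ranges; plain halving once
                 * the range is too short for the rounding to be safe. */
                size_t probe = (n >= 63) ? offset_midpoint(n) : n / 2;

                if (base[probe] <= key)
                        base += probe;
                /* probe <= n - probe, so either side fits in n - probe
                 * elements without leaving the current range. */
                n -= probe;
        }
        return (size_t)(base - vector);
}
```

<p>On a range of 64 elements, the probe lands at index 32 like a classic search would, but at 126 elements it lands at index 62 rather than 63, and the drift away from power-of-two offsets grows with the range.</p>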

<p>Aliasing mostly happens at the beginning of the search, and ternary
search is (theoretically) suboptimal. It may be faster to execute
ternary search only at the beginning of the recursion (when aliasing
stands in the way of caching frequently-used locations), and otherwise
go for a classic binary search.  I generated ternary/binary searches
by using ternary search steps until the candidate range size was less
than the square root of the original range size, and then switching to
regular binary search steps.</p>
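
<p>The switch-over rule can be sketched as follows.  Again, this is a runtime loop with a hypothetical name (<code>hybrid_lower_bound</code>) rather than the generated straight-line code, and it tests for the square root with an integer division instead of calling <code>sqrt</code>.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Index of the last element <= key in a sorted vector of n > 0
 * elements, assuming vector[0] <= key. */
static size_t hybrid_lower_bound(unsigned key, const unsigned *vector, size_t n)
{
        const unsigned *base = vector;
        const size_t total = n;

        assert(n > 0);
        /* Ternary steps while n >= sqrt(total), i.e. while
         * n >= ceil(total / n). */
        while (n > 2 && n >= (total + n - 1) / n) {
                size_t third = (n + 2) / 3;          /* ceil(n / 3) */
                const unsigned *b1 = base + third;
                const unsigned *b2 = base + (n - third);

                if (*b1 <= key) base = b1;
                if (*b2 <= key) base = b2;
                n = third;
        }
        /* Classic halving steps for the remainder of the search. */
        while (n > 1) {
                size_t half = n / 2;

                if (base[half] <= key)
                        base += half;
                n -= half;
        }
        return (size_t)(base - vector);
}
```

<p>Only the first few steps, where the probes are far apart and likely to alias, pay the ternary premium; the short tail of the search runs on a range small enough that aliasing should matter much less.</p>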

<p>Finally, quaternary search is interesting because it goes through
roughly the same access patterns as binary search, but exposes more
memory-level parallelism.  Again, a very close approximation suffices
to avoid aliasing issues; I went with reads at \(16/63\),
\(32/63\) and \(1-16/63 = 47/63\), and iteration on a subrange
\(16/63\)rd as long.  Depending on the actual capacity for memory level
parallelism during execution, it will perform between 51% and 152% as
much work as binary search.</p>
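
<p>For reference, both bounds follow from the same reads-per-bit arithmetic as before: each off-center quaternary step divides the range by \(63/16\), so it yields \(\lg(63/16) \approx 1.977\) bits for three reads.  If the three reads fully overlap, that is \(1/\lg(63/16) \approx 0.51\) batches of reads per bit; if they are fully serialised, it is \(3/\lg(63/16) \approx 1.52\) reads per bit, against exactly one for binary search.</p>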

<p>I also compared the performance of g++&#8217;s (4.7)
<a href="http://www.sgi.com/tech/stl/set.html">STL set</a>, and google&#8217;s
<a href="http://sparsehash.googlecode.com/svn/trunk/doc/dense_hash_set.html">dense_hash_set</a>,
which don&#8217;t search sorted vectors, but do work with sets, ordered or
not.</p>

<p><span class='pullquote-right' data-pullquote='For our data (32-bit unsigned), that&#8217;s about 700% additional space for each value. '>
The <code>set</code> in g++&#8217;s STL is a
<a href="http://en.wikipedia.org/wiki/Red%E2%80%93black_tree">red-black tree</a>
with a fair amount of space overhead: each node has three pointers
(parent, and left and right children) and a colour enum. For our data (32-bit unsigned), that&#8217;s about 700% additional space for each value.
Even before considering malloc overhead, the set uses eight times as
much memory as a sorted vector.  Obviously, this affects the size of
datasets that can fit in faster storage (data caches, TLB, or main
memory)… on multi-socket machines, it also affects the size of
datasets that can be processed before having to use memory assigned to
a far-away socket.  This is why the comparison for set was only
performed up to 128M 32-bit elements; after that, there wasn&#8217;t enough
memory (a bit less than 12 GB) on the socket.  In theory, the
red-black tree is a bit less efficient than binary search, and
probably in the same range as the ternary or quaternary searches:
red-black trees trade potentially imperfect balancing (with some
leaves at depth up to twice the optimum) for much quicker insertions.
An interesting advantage of the red-black tree is that it naturally
avoids aliasing issues: nodes will not tend to be laid out in memory
following a depth-first, in-order traversal.
</span></p>
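
<p>The space overhead is easy to check with a toy node definition.  The struct below is a hypothetical layout written for this illustration, not libstdc++&#8217;s actual node type, whose field order and padding may differ.</p>

```c
#include <assert.h>
#include <stddef.h>

enum colour { RED, BLACK };

/* Hypothetical node layout, for illustration only; not libstdc++'s
 * actual definition. */
struct rb_node {
        struct rb_node *parent;
        struct rb_node *left;
        struct rb_node *right;
        enum colour colour;
        unsigned value;          /* the 32-bit payload */
};

/* Bytes of bookkeeping per stored value, as a percentage of the
 * payload's size. */
static size_t rb_overhead_percent(void)
{
        size_t payload = sizeof(unsigned);

        assert(sizeof(struct rb_node) >= payload);
        return 100 * (sizeof(struct rb_node) - payload) / payload;
}
```

<p>On a typical 64-bit platform with 8-byte pointers and 4-byte enums, this node should take 32 bytes, so <code>rb_overhead_percent()</code> comes out at 700, before any malloc overhead.</p>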

<p>Google&#8217;s <code>dense_hash_set</code> is an open-addressed hash table with
quadratic probing.  It seems to target about 25% to 50% utilisation,
so ends up using between twice and four times as much memory as the
sorted vector.  I&#8217;ve seen horrible implementations of <code>tr1::hash</code> for
unsigned values (it&#8217;s the identity function on my mac), so I wrote
a quick <a href="http://arxiv.org/abs/1011.5200">tabular hash function</a>.  It&#8217;s
very similar to the obvious implementation, but the matrix is
transposed: when some octets are identical (e.g. all-zero high-order
bits), the table lookups will hit close-by addresses.  On the
workloads considered here (small consecutive integers), the identity
hash function would actually perform admirably well, but it would fail
very hard on more realistic inputs.  The tabular hash function is
closer to the ideal, and makes the performance of the hash table
nearly independent of the values constituting the set.  It&#8217;s also
pretty fast: we can expect the lookup table to fit in L2, and most
probably in L1D, for the microbenchmarks used here.</p>

<figure class='code'><figcaption><span>tabular hash functor for 32-bit unsigned  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
<span class='line-number'>23</span>
<span class='line-number'>24</span>
</pre></td><td class='code'><pre><code class='c'><span class='line'><span class="k">struct</span> <span class="n">hash_u32</span>
</span><span class='line'><span class="p">{</span>
</span><span class='line'>        <span class="kt">size_t</span> <span class="n">tables</span><span class="p">[</span><span class="mi">256</span><span class="p">][</span><span class="mi">8</span><span class="p">];</span>
</span><span class='line'>
</span><span class='line'>        <span class="n">hash_u32</span><span class="p">()</span>
</span><span class='line'>        <span class="p">{</span>
</span><span class='line'>                <span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">256</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
</span><span class='line'>                        <span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="mi">8</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
</span><span class='line'>                                <span class="kt">size_t</span> <span class="n">x</span> <span class="o">=</span> <span class="n">random</span><span class="p">(),</span> <span class="n">y</span> <span class="o">=</span> <span class="n">random</span><span class="p">();</span>
</span><span class='line'>                                <span class="n">tables</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="o">^</span><span class="p">(</span><span class="n">y</span><span class="o">&lt;&lt;</span><span class="mi">32</span><span class="p">);</span>
</span><span class='line'>                        <span class="p">}</span>
</span><span class='line'>                <span class="p">}</span>
</span><span class='line'>        <span class="p">}</span>
</span><span class='line'>
</span><span class='line'>        <span class="kt">size_t</span> <span class="n">operator</span><span class="p">()</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span><span class="p">)</span> <span class="k">const</span>
</span><span class='line'>        <span class="p">{</span>
</span><span class='line'>                <span class="kt">unsigned</span> <span class="n">x0</span> <span class="o">=</span> <span class="n">x</span><span class="o">&amp;</span><span class="mh">0xff</span><span class="p">,</span>
</span><span class='line'>                        <span class="n">x1</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span><span class="o">&gt;&gt;</span><span class="mi">8</span><span class="p">)</span><span class="o">&amp;</span><span class="mh">0xff</span><span class="p">,</span>
</span><span class='line'>                        <span class="n">x2</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span><span class="o">&gt;&gt;</span><span class="mi">16</span><span class="p">)</span><span class="o">&amp;</span><span class="mh">0xff</span><span class="p">,</span>
</span><span class='line'>                        <span class="n">x3</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span><span class="o">&gt;&gt;</span><span class="mi">24</span><span class="p">)</span><span class="o">&amp;</span><span class="mh">0xff</span><span class="p">;</span>
</span><span class='line'>                <span class="k">return</span> <span class="n">tables</span><span class="p">[</span><span class="n">x0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span> <span class="o">^</span> <span class="n">tables</span><span class="p">[</span><span class="n">x1</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span>
</span><span class='line'>                        <span class="o">^</span> <span class="n">tables</span><span class="p">[</span><span class="n">x2</span><span class="p">][</span><span class="mi">2</span><span class="p">]</span> <span class="o">^</span> <span class="n">tables</span><span class="p">[</span><span class="n">x3</span><span class="p">][</span><span class="mi">3</span><span class="p">];</span>
</span><span class='line'>        <span class="p">}</span>
</span><span class='line'><span class="p">};</span>
</span></code></pre></td></tr></table></div></figure>


<p><span class='pullquote-right' data-pullquote='The dense hash table is a valid data point: it helps inform choices that happen in practice. '>
Comparing ordered sets with an unordered (hash) set may seem unfair:
unordered sets expose fewer operations, and hash tables exploit that
additional freedom to offer (usually) faster insertions and lookups.
I don&#8217;t care. When I use a container, I don&#8217;t necessarily want an ordered or an unordered set.
I wish to implement an algorithm, and that algorithm could well be
adapted to exploit the strengths of either ordered or unordered sets.
The dense hash table is a valid data point: it helps inform choices that happen in practice.
</span></p>

<p>First, let&#8217;s compare these implementations &#8211; the hash table (&#8216;goog&#8217;),
the offseted binary search (&#8216;ob&#8217;), the offseted quaternary search
(&#8216;oq&#8217;), the red-black tree (&#8216;stl&#8217;), the ternary/binary search (&#8216;tb&#8217;),
and the ternary search (&#8216;ter&#8217;) &#8211; on the two extreme cases we&#8217;ve been
considering so far: constant searches for the least element, and
searches for random elements.  None of the algorithms suffer from
aliasing issues, so the spread, if any, is smoothly covered.  This
allows me to avoid the rather busy jittered scatter plot and instead
only interpolate between median cycle counts, with shaded regions
corresponding to the values between the 10th and 90th percentile.
The lines for &#8216;goog&#8217; and &#8216;stl&#8217; end early: these data structures could
not represent the largest sets in the 12GB of local memory available
in each processor.  Also, now that we&#8217;ve dealt with the aliasing
anomalies, I don&#8217;t feel it&#8217;s useful to graph the cache and TLB misses
per size and implementation anymore.</p>
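<p>For reference, the median and the 10th and 90th percentiles could be
extracted from raw cycle counts along the following lines.  This is only a
minimal sketch, not the code behind the graphs: the <code>percentile</code>
helper and the nearest-rank definition of a percentile are assumptions made
for illustration.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparison callback for qsort: ascending order of cycle counts. */
static int compare_cycles(const void *a, const void *b)
{
        unsigned long long x = *(const unsigned long long *)a;
        unsigned long long y = *(const unsigned long long *)b;

        return (x > y) - (x < y);
}

/*
 * Nearest-rank percentile of n samples, for 0 < pct <= 100.
 * Sorts the samples in place.  Hypothetical helper, not the post's code.
 */
unsigned long long
percentile(unsigned long long *samples, size_t n, unsigned pct)
{
        assert(n > 0 && pct > 0 && pct <= 100);
        qsort(samples, n, sizeof(*samples), compare_cycles);

        size_t rank = (n * pct + 99) / 100; /* ceil(n * pct / 100), >= 1 */
        return samples[rank - 1];
}
```

<p>Calling it with 10, 50 and 90 on the timings for one set size would give
the lower edge of a shaded region, the interpolated line, and the upper edge,
respectively.</p>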

<p><img class="center" src="http://www.pvk.ca/images/2012-07-30-binary-search-is-a-pathological-case-for-caches/real-first-cycles.png"></p>

<p>When we always search for the least element in the set, the hash table
obviously works very well: it only reads the same one or two cache
lines, and they&#8217;re always in cache.  Surprisingly, the red-black tree
is also a bit faster than the searches in sorted vectors… until it
exhausts all available memory.  Looking at the behaviour on tiny
sizes, it seems likely that the allocation sequence yields a nice
layout for the first few steps of the search.  The offseted binary
search is both faster and more consistent than the classic binary
search (shown in earlier graphs), but never faster than the offseted
quaternary search. The ternary/binary search is similarly never faster
than the straight ternary search.</p>

<p><img class="center" src="http://www.pvk.ca/images/2012-07-30-binary-search-is-a-pathological-case-for-caches/real-random-cycles.png"></p>

<p><span class='pullquote-right' data-pullquote='Again, we find that the offseted quaternary search dominates the offseted binary search, and the ternary search the hybrid ternary/binary search. '>
On fully random searches, the hash table is obviously slower than when
it keeps reading the same one or two words, but still much faster than
any of the sorted sets.  The red-black tree is now significantly
slower than all the sorted vector searches (1000 cycles versus 200 to
400 cycles on vectors of 128K elements, for example).  Again, we find that the offseted quaternary search dominates the offseted binary search, and the ternary search the hybrid ternary/binary search.
</span></p>

<p>For the rest of the post, I&#8217;ll only consider google&#8217;s dense hash
table, g++&#8217;s red-black tree, the offseted quaternary search and the
ternary search.  Note that both the offseted quaternary and ternary
searches exhibit consistently better performance than the offseted
binary search, for which aliasing isn&#8217;t an issue.  These alternative
ways to search sorted vectors can thus be expected to improve on
binary search even on vectors of lengths that are far from powers of
two.</p>

<h3>More interesting workloads</h3>

<p>So far, I&#8217;ve only reported runtimes for two workloads: a trivial one,
in which we always search for the same element, and a worst-case
scenario, in which all the keys are chosen independently and randomly
from all the values in the set (missing keys might be even worse for
some hash tables).  I see two dimensions along which it&#8217;s useful to
interpolate between these workloads.</p>

<p>The first dimension is spatial locality.  We sometimes work with very
large sorted sets, but almost exclusively search for values in a small
contiguous subrange.  Within that subrange, however, there is no
obvious relationship between search keys; we might as well be choosing
them randomly, with uniform probability.  The constant-element search
is a degenerate case with a subrange of length one, and the fully
random choice another degenerate case, with a subrange spanning the
whole set.  For all vector sizes from 1K to 1G elements, I ran the
program with subranges beginning at the least value in the set and of
size 1K, 16K, 64K, 256K, 1M and 2M (or size equal to the whole set if
it&#8217;s too small).  These values are interesting because they roughly
correspond to working set sizes that bracket the cache and TLB sizes
for all four implementations.</p>
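<p>A key generator for this kind of workload might look like the sketch
below.  The names (<code>xorshift64</code>, <code>next_subrange_rank</code>)
and the choice of pseudo-random generator are assumptions for illustration;
the benchmark&#8217;s own generator is not shown in the post.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small xorshift64 generator; the state must be non-zero. */
static uint64_t xorshift64(uint64_t *state)
{
        uint64_t x = *state;

        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        return *state = x;
}

/*
 * Rank of the next search key: uniform (up to a small modulo bias) over
 * the first `subrange` ranks, or over the whole set when it is smaller.
 */
size_t
next_subrange_rank(uint64_t *state, size_t set_size, size_t subrange)
{
        assert(set_size > 0 && subrange > 0 && *state != 0);

        size_t span = (subrange < set_size) ? subrange : set_size;
        return (size_t)(xorshift64(state) % span);
}
```

<p>With a subrange of one, this degenerates to the constant search for the
least element; with a subrange at least as large as the set, it becomes the
fully random workload.</p>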

<p>We can expect all methods to benefit from spatial locality, but not to
the same extent.  The hash table works well because it scatters
everything pseudo-randomly in a large vector; when the subrange is
sufficiently small compared to the set, each key in that subrange can
be expected to use its own cache line, and the hash table&#8217;s
performance will be the same as on randomly-chosen keys on relatively
small subranges.  Similarly, the red-black tree uses about eight times
as much space as sorted vectors for the same data, and will also
exceed the same cache level much more quickly.</p>

<p><img class="center" src="http://www.pvk.ca/images/2012-07-30-binary-search-is-a-pathological-case-for-caches/real-subrange-small-cycles.png"></p>

<p><img class="center" src="http://www.pvk.ca/images/2012-07-30-binary-search-is-a-pathological-case-for-caches/real-subrange-large-cycles.png"></p>

<p><span class='pullquote-right' data-pullquote='runtimes for the sorted vector searches grow much more slowly, and are more stable than for the red-black tree. '>
The red-black tree was slightly faster than the sorted vector searches
when the key was always the same.  This isn&#8217;t the case when the keys
are selected from a small subrange.  In all six scenarios and across
all sizes, the 10th percentile for &#8216;stl&#8217; is slower than the 90th for
either &#8216;oq&#8217; or &#8216;ter&#8217;.  As expected, the runtimes for the hash table
(&#8216;goog&#8217;) evolve in plateaus, as the working set outstrips each level
of caching.  When it exceeds even the L3 (256K values in distinct
cache lines is just barely larger than L3) its performance is more or
less comparable to that of the sorted vector searches, as long as the
subrange is below 2M items.  All the sorted set implementations
exhibit the same roughly sigmoidal shape for runtimes as functions of
the set size.  However, except for a few spikes on very large vectors,
runtimes for the sorted vector searches grow much more slowly, and are more stable than for the red-black tree.
In general, the offseted quaternary search seems to execute a bit
faster than the ternary search, but the difference is minimal.  More
importantly, when working with medium subranges (around 64K to 1M
elements) in medium or large sets (1M elements or more), the sorted
vector searches are competitive with a hash table, in addition to
being more compact.
</span></p>

<p>The second dimension is temporal locality.  I simulated workloads
exhibiting temporal locality with strided accesses. For example, with a
stride of 17, the search keys are, in order, at rank 0, 17, 34, 51,
etc. in the sorted set.  The constant search is a special case with
stride zero.  For all vector sizes from 1K to 1G elements, I ran the
programs with various strides: 1, 17, 65, 129, 257, and 1K+1.  The
ranks were taken modulo the set&#8217;s size, so these odd strides ensure
that every key is generated for small vectors (i.e. they&#8217;re mod-1 step
sizes, for <a href="http://sal.discontinuity.info/">corewar</a> fans).  The
ordered sets should exploit this setting well (the red-black tree
probably less so than the vector searches), and the hash table not at
all: a good hash function guarantees that similar keys don&#8217;t hash to
close-by values any more than very different keys, and the hash table
will behave exactly as if keys were chosen randomly.</p>
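<p>The strided key sequence, and the reason odd strides matter, can be
sketched as follows.  The helper names are hypothetical, and the claim about
coverage assumes set sizes that are powers of two: an odd stride is then
coprime with the size, so the ranks only repeat after every element has been
visited.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Rank of the i-th search key for a given stride, wrapped around the set. */
size_t strided_rank(size_t i, size_t stride, size_t set_size)
{
        assert(set_size > 0);
        return (size_t)(((unsigned long long)i * stride) % set_size);
}

/*
 * Counts how many distinct ranks are hit by the first set_size keys.
 * `seen` must point to set_size zeroed bytes.
 */
size_t distinct_ranks(size_t stride, size_t set_size, unsigned char *seen)
{
        size_t count = 0;

        for (size_t i = 0; i < set_size; i++) {
                size_t rank = strided_rank(i, stride, set_size);

                if (!seen[rank]) {
                        seen[rank] = 1;
                        count++;
                }
        }

        return count;
}
```

<p>On a set of 1024 elements, a stride of 17 reaches all 1024 ranks, whereas
an even stride such as 16 would only ever touch 64 of them.</p>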

<p><img class="center" src="http://www.pvk.ca/images/2012-07-30-binary-search-is-a-pathological-case-for-caches/real-stride-short-cycles.png"></p>

<p><img class="center" src="http://www.pvk.ca/images/2012-07-30-binary-search-is-a-pathological-case-for-caches/real-stride-medium-cycles.png"></p>

<p><span class='pullquote-right' data-pullquote='The sorted vector searches are always faster, and in general slightly more stable. '>
The hash table behaves almost exactly like it did with random keys; in
particular, there is a large increase in runtimes when the set
comprises at least 1M elements: the hash table then exceeds L3.  Even
on these very easy inputs, the red-black tree is only faster than the
hash table on accesses with stride one, the very best case.  Worse, the
red-black tree already exhibits very wide variations (consider the height of
the shaded areas for &#8216;stl&#8217;) on such traversals.  The spread is
even wider on longer strides in large vectors, but the red-black tree
is then also so slow that it&#8217;s completely off-graph.  The sorted vector searches are always faster, and in general slightly more stable.
In fact, they seem comparable with the hash table, if not faster, for
nearly all vector sizes, as long as the access stride is around 129 or
less.
</span></p>

<h2>From data to knowledge</h2>

<p><span class='pullquote-right' data-pullquote='Each execution of the test program showed very consistent timings, but different executions yielded timings that varied by a factor of two. '>
I only worked on binary search here, and, even with this very simple
algorithm, there are fairly obscure pitfalls.  The issue with physical
address allocation affecting cache line aliasing, particularly at the
L2 level, is easy to miss, and can have a huge impact on results.
Each execution of the test program showed very consistent timings, but different executions yielded timings that varied by a factor of two.
We&#8217;ve been <a href="http://www.multicoreinfo.com/research/papers/2009/asplos09-producing-data.pdf">warned [pdf]</a>
about such hidden variables in the past, but that was a pretty
stealthy instance with a serious effect.
</span></p>

<p>Even without that hurdle, the slowdown caused by aliasing between
cache lines when executing binary searches on vectors of
(nearly-)power-of-two sizes is alarming.  The ratio of runtimes
between the classic binary search and the offseted quaternary search
is on the order of two to ten, depending on the test case.  Some of
that is due to MLP, but the offseted binary search offers only
slightly lesser improvements: aliasing really is an issue.  And yet, I
believe I haven&#8217;t seen anyone try to compensate for that when
comparing fancier search methods that often work best on
near-power-of-two sizes (e.g. breadth-first or van Emde Boas layouts)
with simple binary search.</p>

<p>Also worrying is the fact that, even with all the repetitions and
performance counters, some oddities (especially spikes in runtimes on
very large vectors) are still left unexplained.</p>

<p>Nevertheless, these experiments can probably teach us something useful.</p>

<p><span class='pullquote-right' data-pullquote='I don&#8217;t think that basing decisions on ill-understood phenomena is a good practice. '>
I don&#8217;t think that basing decisions on ill-understood phenomena is a good practice.
This is why I try to minimise the number of variables and start with a
logical hypothesis (in this case, &#8220;aliasing hurts and MLP helps&#8221;),
rather than just being a slave to the p-value.  Hopefully, this
approach results in more robust conclusions, particularly in conjunction
with quantile-based comparisons and a lot of visualisation.  In the
current case, the most useful conclusions seem to be:
</span></p>

<ol>
<li>use offseted binary or quaternary search instead of binary search.
Quaternary search is faster if the vectors are expected to be in
cache, even when aliasing isn&#8217;t an issue, and both methods offer more
robust performance;</li>
<li>good sorted sets can be faster than hash tables when reads tend to
be strided, even with respectable strides on the order of 64 or 128
elements;</li>
<li>if the searches exhibit spatial locality over a couple hundred
thousand bytes, searches in sorted vectors can be comparable to hash
sets.</li>
</ol>


<p>These raise two obvious questions: could we go further than quaternary
search, and how do the results map to loopy implementations?  I don&#8217;t
think it&#8217;s useful to try and go further than quaternary search: caches
can only handle a few concurrent requests, and quaternary search&#8217;s
three requests is probably close to, if not past, the maximum.  In an
implementation that&#8217;s not fully unrolled, the offseted searches are
probably most easily adapted by multiplying the range by \(33/64 =
1/2+1/64\) or \(17/64 = 1/4+1/64\), with two shifts and a few adds.
In either case, L2 cache misses tend to outweigh branch
mispredictions, which themselves are usually more important than
micro-optimisations at the instruction level.  The speed-ups brought
about by eliminating aliasing and branches, while exposing
memory-level parallelism, ought to be notable, even in a loopy
implementation.</p>
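<p>As an illustration of the \(33/64\) split, here is one way a loopy offseted
binary search could be written.  It is a sketch under assumptions rather than
a tested implementation from these experiments: the function name is
hypothetical, it returns the index of the last element that is less than or
equal to the key (or <code>-1</code> if there is none), and it keeps a
conditional where a tuned version would want a conditional move.</p>

```c
#include <assert.h>
#include <stddef.h>

/*
 * Index of the last element <= key, or (size_t)-1 if there is none.
 * Each step splits the remaining range near 33/64 instead of 1/2.
 */
size_t
offset_binary_search(unsigned key, const unsigned *vector, size_t size)
{
        assert(vector != NULL || size == 0);
        if (size == 0 || vector[0] > key)
                return (size_t)-1;

        /* Invariant: vector[lo] <= key and the answer is in [lo, lo+n). */
        size_t lo = 0, n = size;

        while (n > 1) {
                /* ~33/64 of n: two shifts and an add, in [1, n-1]. */
                size_t mid = (n >> 1) + (n >> 6);

                if (vector[lo + mid] <= key) {
                        lo += mid;
                        n -= mid;
                } else {
                        n = mid;
                }
        }

        return lo;
}
```

<p>The quaternary variant would probe three such points per iteration, at
roughly \(17/64\), \(34/64\) and \(51/64\) of the range, before narrowing
it down.</p>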
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Binary search *eliminates* branch mispredictions]]></title>
    <link href="http://www.pvk.ca/Blog/2012/07/03/binary-search-star-eliminates-star-branch-mispredictions/"/>
    <updated>2012-07-03T18:30:00+02:00</updated>
    <id>http://www.pvk.ca/Blog/2012/07/03/binary-search-star-eliminates-star-branch-mispredictions</id>
    <content type="html"><![CDATA[<p>Searching in a short, sorted vector is a fairly common task.  A lot of
the time, data sets are just small enough that it&#8217;s quicker (and simpler)
to store them in a sorted vector than to use an alternative with
asymptotically faster insertions.  Even when we do use more complex
data structures, it often makes sense to specialise the leaves and
pack more data in a single node.</p>

<p>How should we implement such a search?  There are two basic options:
linear and binary search.  After running a bunch of tests, I claim
that a decent binary search offers good and robust performance, and is
nearly always the right choice.  An interpolation search could be
appropriate, but it&#8217;s a tad too complicated for short vectors, and can
fail hard if the input is arranged just wrong.</p>

<p>When the data is expected to be in cache, binary search is pretty much
always preferable.  When it isn&#8217;t, binary search is faster for short
(at most one or two cache lines) and long vectors, but a good linear
search can be the best option for data that fits in a couple cache
lines.</p>

<p>I&#8217;ll focus on one specific case that is of interest to me: searching a
short or medium-length sorted vector of known size.  I&#8217;m interested in
micro-optimisations, so I&#8217;ll only work on vectors of 32-bit <code>unsigned</code>
values.  I&#8217;m also assuming we&#8217;re looking for a
<a href="http://www.sgi.com/tech/stl/lower_bound.html">lower bound</a>, because
that&#8217;s what I need.
Some
<a href="https://schani.wordpress.com/2010/04/30/linear-vs-binary-search/">other people</a>
have looked into this, but I consider more cases and use
size-specialised code.  I also try to represent the whole distribution
of runtimes, which helps us take long tails into account: I&#8217;m
usually willing to take a small hit in average performance for an
improvement in consistency.</p>

<h2>Branch mispredictions and entropy</h2>

<p>X86 processors have been pipelined for quite a while now, and
(mispredicted) conditional branches are well known for wrecking the
performance of programs: when a branch is predicted wrong, all the
work that&#8217;s been speculatively done so far must be thrown out.  In
effect, the processor&#8217;s massive execution resources have then been
completely wasted from the moment the branch is decoded until it is
retired (out-of-order execution improves on things a bit, but the end
result is the same).</p>

<p>Of course, things aren&#8217;t as bad as they once were with the old
<a href="http://en.wikipedia.org/wiki/NetBurst_%28microarchitecture%29">P4</a>.
That one exhibited an awesome combination of a preternaturally long
pipeline, slow conditional moves, slow shifts, and extra latency on
any instruction that used the condition flags.  I don&#8217;t think I can
convey how frustrating that microarchitecture could be… Mispredicted
branches killed IPC, but conditional moves were only slightly better,
and more portable bit-twiddling tricks (generating masks from the sign
bit with <code>sar</code>, or from the carry flag via <code>sbb</code>) just as slow.  I
believe Linus Torvalds once posted quite the rant on that topic on
<a href="http://www.realworldtech.com/">RWT</a> (:</p>

<p><span class='pullquote-right' data-pullquote='&#8220;It&#8217;s better to execute many nearly useless branches.&#8221; (!?) '>
Still, I think we can tell the period has left its mark on many
programmers who will go to great lengths to avoid anything that looks
like a branch.  One
<a href="http://www.drdobbs.com/article/print?articleId=184405848&amp;siteSectionName=">common</a>
<a href="http://stackoverflow.com/questions/5239055/more-bit-twiddling-efficiently-implementing-a-binary-search-over-a-fixed-size-a">symptom</a>
is the notion that it&#8217;s <em>normal</em> for a linear search to execute faster
than a binary search, even on vectors of 32 or 64 values.  The
reasoning is that the branches in a linear search are trivially
predicted: the comparison always fails, except for the very last one.
In contrast, the comparisons in a binary search are hard to predict:
there are only \(\lg n\) of them, so each one extracts one full bit
of information from the vector.  In other words, linear search is
expected to run better because each conditional extracts less
information: the predictors have less entropy to deal with, and are
right more often… &#8220;It&#8217;s better to execute many nearly useless branches.&#8221; (!?)
</span></p>
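<p>A rough way to quantify that reasoning, assuming the search has \(n\)
equally likely outcomes: identifying the outcome takes \(\lg n\) bits in
total, however the comparisons are arranged.  Binary search gets there in
\(\lg n\) comparisons, and linear search in about \(n/2\) on average, so the
information extracted per conditional is approximately</p>

\[
\frac{\lg n}{\lg n} = 1 \text{ bit (binary search)},
\qquad
\frac{\lg n}{n/2} = \frac{2 \lg n}{n} \text{ bits (linear search)}.
\]

<p>For \(n = 32\), that is one bit per comparison against roughly
\(10/32 \approx 0.3\) bits: each branch in the linear search is much easier to
predict, but there are many more of them.</p>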

<h2>Not all conditionals are created equal</h2>

<p><span class='pullquote-right' data-pullquote='Fixed-length binary searches are naturally executed without any branch! '>
Sometimes, it really is faster to perform redundant work to avoid
mispredicted branches.  However, that&#8217;s definitely not the
case for linear versus binary search.  Linear search pretty much has
to be implemented with conditional branches: we really want to abort
early once we&#8217;ve found what we&#8217;re looking for.  It&#8217;s not so clear-cut
with binary search.  I think it&#8217;s a standard exercise to ask why it
doesn&#8217;t help that much to leave binary search early when we have an
exact match.  So, when the size is known ahead of time, the only
remaining conditional is when we update the upper or the
lower bound;  that&#8217;s actually a conditional move.
Fixed-length binary searches are naturally executed without any branch! It doesn&#8217;t get more prediction-friendly than that.
</span></p>

<p>In short, we actually expect linear search to be faster than binary
search only when conditional branches allow the former to leave the
search loop early.  Linear search can hope to overtake binary search
<em>because</em> of its conditional branches!</p>

<h2>Microbenchmarking for mispredictions</h2>

<p>I tested this hypothesis on my 2.8 GHz X5660, with GCC.  First, the
linear (lower bound) search:</p>

<figure class='code'><figcaption><span>inverse linear search  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
</pre></td><td class='code'><pre><code class='c'><span class='line'><span class="n">ALWAYS_INLINE</span> <span class="kt">size_t</span>
</span><span class='line'><span class="nf">linear_search</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">key</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="o">*</span> <span class="n">vector</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">)</span>
</span><span class='line'><span class="p">{</span>
</span><span class='line'>        <span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="n">size</span><span class="p">;</span> <span class="n">i</span> <span class="o">--&gt;</span> <span class="mi">0</span><span class="p">;)</span> <span class="p">{</span>
</span><span class='line'>                <span class="kt">unsigned</span> <span class="n">v</span> <span class="o">=</span> <span class="n">vector</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
</span><span class='line'>                <span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="o">&lt;=</span> <span class="n">key</span><span class="p">)</span> <span class="k">return</span> <span class="n">i</span><span class="p">;</span>
</span><span class='line'>        <span class="p">}</span>
</span><span class='line'>
</span><span class='line'>        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
</span><span class='line'><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure>


<p>Size-specialised versions were generated by calling it with a constant
<code>size</code> argument.  GCC unrolled the loop for size 16 and lower, which
seems more than reasonable to me.  For larger sizes, the loop was not
unrolled at all.  Again, that seems reasonable: current X86 processors have
special logic to accelerate such tiny (compare, conditional jump,
decrement) loop bodies.</p>

<figure class='code'><figcaption><span>compiled unrolled linear search (n=4) </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
<span class='line-number'>23</span>
<span class='line-number'>24</span>
<span class='line-number'>25</span>
<span class='line-number'>26</span>
<span class='line-number'>27</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>lsearch_4:
</span><span class='line'>.LFB41:
</span><span class='line'>        .cfi_startproc
</span><span class='line'>        cmpl    12(%rsi), %edi
</span><span class='line'>        jae     .L12
</span><span class='line'>        cmpl    8(%rsi), %edi
</span><span class='line'>        jae     .L13
</span><span class='line'>        cmpl    4(%rsi), %edi
</span><span class='line'>        jae     .L14
</span><span class='line'>        cmpl    (%rsi), %edi
</span><span class='line'>        sbbq    %rax, %rax
</span><span class='line'>        ret
</span><span class='line'>        .p2align 4,,10
</span><span class='line'>        .p2align 3
</span><span class='line'>.L12:
</span><span class='line'>        movl    $3, %eax
</span><span class='line'>        ret
</span><span class='line'>        .p2align 4,,10
</span><span class='line'>        .p2align 3
</span><span class='line'>.L13:
</span><span class='line'>        movl    $2, %eax
</span><span class='line'>        ret
</span><span class='line'>        .p2align 4,,10
</span><span class='line'>        .p2align 3
</span><span class='line'>.L14:
</span><span class='line'>        movl    $1, %eax
</span><span class='line'>        ret</span></code></pre></td></tr></table></div></figure>
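<p>A symbol like <code>lsearch_4</code> in the listing above could come from
a thin wrapper that passes a constant size to the generic search.  The sketch
below shows one way to generate such wrappers; the
<code>ALWAYS_INLINE</code> definition and the <code>DEFINE_LSEARCH</code>
macro are assumptions (the post shows neither), and the wrapper signature is
inferred from the registers used in the listing.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Assumed definition (GCC/Clang); the post does not show its macro. */
#define ALWAYS_INLINE static inline __attribute__((always_inline))

/* Backwards linear search: index of the last element <= key, or -1. */
ALWAYS_INLINE size_t
linear_search(unsigned key, const unsigned *vector, size_t size)
{
        for (size_t i = size; i-- > 0;) {
                if (vector[i] <= key)
                        return i;
        }

        return (size_t)-1;
}

/*
 * Hypothetical generator for size-specialised entry points.  Once
 * linear_search is inlined, N is a compile-time constant and the
 * compiler is free to unroll the loop.  The assert is compiled out
 * with -DNDEBUG.
 */
#define DEFINE_LSEARCH(N)                                        \
        size_t lsearch_##N(unsigned key, const unsigned *vector) \
        {                                                        \
                assert(vector != NULL);                          \
                return linear_search(key, vector, N);            \
        }

DEFINE_LSEARCH(4)
DEFINE_LSEARCH(16)
```

<p>Whether the loop actually gets unrolled for a given size depends on the
compiler and its flags.</p>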


<p>The linear search walks the data backwards, from high to low
addresses, which is unusual.  I also tested a slightly more
complicated version that searches the data forward:</p>

<figure class='code'><figcaption><span>forward linear search  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
</pre></td><td class='code'><pre><code class='c'><span class='line'><span class="n">ALWAYS_INLINE</span> <span class="kt">size_t</span>
</span><span class='line'><span class="nf">fwd_search</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">key</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="o">*</span> <span class="n">vector</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">)</span>
</span><span class='line'><span class="p">{</span>
</span><span class='line'>        <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">size</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
</span><span class='line'>                <span class="kt">unsigned</span> <span class="n">v</span> <span class="o">=</span> <span class="n">vector</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
</span><span class='line'>                <span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="o">&gt;=</span> <span class="n">key</span><span class="p">)</span>
</span><span class='line'>                        <span class="k">return</span> <span class="p">(</span><span class="n">v</span> <span class="o">&gt;</span> <span class="n">key</span><span class="p">)</span><span class="o">?</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="o">:</span><span class="n">i</span><span class="p">;</span>
</span><span class='line'>        <span class="p">}</span>
</span><span class='line'>
</span><span class='line'>        <span class="k">return</span> <span class="n">size</span><span class="o">-</span><span class="mi">1</span><span class="p">;</span>
</span><span class='line'><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure>


<p>The same size-specialisation trick was used for the binary search:</p>

<figure class='code'><figcaption><span>binary search  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
</pre></td><td class='code'><pre><code class='c'><span class='line'><span class="cm">/* log_2 ceiling */</span>
</span><span class='line'><span class="n">ALWAYS_INLINE</span> <span class="kt">unsigned</span> <span class="nf">lb</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">x</span><span class="p">)</span>
</span><span class='line'><span class="p">{</span>
</span><span class='line'>        <span class="k">if</span> <span class="p">(</span><span class="n">x</span> <span class="o">&lt;=</span> <span class="mi">1</span><span class="p">)</span> <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span><span class='line'>        <span class="k">return</span> <span class="p">(</span><span class="mi">8</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span><span class="p">))</span><span class="o">-</span><span class="n">__builtin_clzl</span><span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="mi">1</span><span class="p">);</span>
</span><span class='line'><span class="p">}</span>
</span><span class='line'>
</span><span class='line'><span class="n">ALWAYS_INLINE</span> <span class="kt">size_t</span>
</span><span class='line'><span class="nf">binary_search</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">key</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="o">*</span> <span class="n">vector</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">)</span>
</span><span class='line'><span class="p">{</span>
</span><span class='line'>        <span class="kt">unsigned</span> <span class="o">*</span> <span class="n">low</span> <span class="o">=</span> <span class="n">vector</span><span class="p">;</span>
</span><span class='line'>
</span><span class='line'>        <span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="n">lb</span><span class="p">(</span><span class="n">size</span><span class="p">);</span> <span class="n">i</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
</span><span class='line'>                <span class="n">size</span> <span class="o">/=</span> <span class="mi">2</span><span class="p">;</span>
</span><span class='line'>                <span class="kt">unsigned</span> <span class="n">mid</span> <span class="o">=</span> <span class="n">low</span><span class="p">[</span><span class="n">size</span><span class="p">];</span>
</span><span class='line'>                <span class="k">if</span> <span class="p">(</span><span class="n">mid</span> <span class="o">&lt;=</span> <span class="n">key</span><span class="p">)</span>
</span><span class='line'>                        <span class="n">low</span> <span class="o">+=</span> <span class="n">size</span><span class="p">;</span>
</span><span class='line'>        <span class="p">}</span>
</span><span class='line'>
</span><span class='line'>        <span class="k">return</span> <span class="p">(</span><span class="o">*</span><span class="n">low</span> <span class="o">&gt;</span> <span class="n">key</span><span class="p">)</span><span class="o">?</span> <span class="o">-</span><span class="mi">1</span><span class="o">:</span> <span class="n">low</span> <span class="o">-</span> <span class="n">vector</span><span class="p">;</span>
</span><span class='line'><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure>


<p>I assume that the size is a power of two, so I only have to track the
lower bracket of the search: the upper one is always at a known
offset.  The return value checks for a key that&#8217;s smaller than all the
elements in the vector, rather than assuming the presence of a
sentinel.  That case doesn&#8217;t happen in the tests, and the conditional
branch is always predicted right.  I tried taking it out, and
the difference was pretty much noise, on the order of one cycle.</p>

<p>Note that other sizes could be handled by special-casing the first
iteration and setting the midpoint to <code>size - (1ul&lt;&lt;(lb(size)-1))</code>.
If the midpoint becomes the lower bound, the rest of the search is on
a power-of-two&#8211;sized range; otherwise, the rest of the search covers a
wider range than necessary, but is still correct.  In both cases, the
number of iterations is the same as for a regular binary search.
There could be a slight loss of performance due to worse locality, but
the access patterns aren&#8217;t <em>that</em> different.</p>
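<p>Here is a minimal sketch of that special-cased first iteration.  The
name <code>binary_search_any</code> is hypothetical and the code is not
part of the benchmarks; it only assumes the same <code>lb</code> helper and
the same return convention as the power-of-two search (index of the last
element not greater than the key, or <code>-1</code>).</p>

```c
#include <assert.h>
#include <stddef.h>

/* log_2 ceiling, same definition as lb in the binary search listing */
static unsigned lb(unsigned long x)
{
        if (x <= 1) return 0;
        return (unsigned)((8*sizeof(unsigned long)) - __builtin_clzl(x-1));
}

/* Hypothetical generalisation to any size > 0: index of the last
 * element <= key, or (size_t)-1 if every element is greater. */
static size_t
binary_search_any(unsigned key, const unsigned *vector, size_t size)
{
        assert(size > 0);
        const unsigned *low = vector;
        size_t n = 1;   /* length of the range that remains to be searched */

        if (size > 1) {
                /* Special-cased first iteration: afterwards, n is a
                 * power of two, whichever way the comparison goes. */
                n = (size_t)1 << (lb(size) - 1);
                size_t mid = size - n;
                if (vector[mid] <= key)
                        low += mid;
        }

        /* Same loop body as the power-of-two search. */
        while (n > 1) {
                n /= 2;
                if (low[n] <= key)
                        low += n;
        }

        return (*low > key) ? (size_t)-1 : (size_t)(low - vector);
}
```

<p>When the first comparison fails, the loop runs over
<code>vector[0..n-1]</code>, which is wider than strictly needed but never
reads past the end of the vector.</p>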

<p>All sizes (up to 64, so six iterations) were fully unrolled.  This is
expected, as the body is a simple <code>cmp</code>/<code>lea</code>/<code>cmov</code>.</p>

<figure class='code'><figcaption><span>compiled unrolled binary search (n=16) </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
<span class='line-number'>23</span>
<span class='line-number'>24</span>
<span class='line-number'>25</span>
<span class='line-number'>26</span>
<span class='line-number'>27</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>bsearch_16:
</span><span class='line'>.LFB54:
</span><span class='line'>        .cfi_startproc
</span><span class='line'>        cmpl    32(%rsi), %edi
</span><span class='line'>        leaq    32(%rsi), %rax
</span><span class='line'>        cmovb   %rsi, %rax
</span><span class='line'>        cmpl    16(%rax), %edi
</span><span class='line'>        leaq    16(%rax), %rdx
</span><span class='line'>        cmovae  %rdx, %rax
</span><span class='line'>        cmpl    8(%rax), %edi
</span><span class='line'>        leaq    8(%rax), %rdx
</span><span class='line'>        cmovae  %rdx, %rax
</span><span class='line'>        cmpl    4(%rax), %edi
</span><span class='line'>        leaq    4(%rax), %rdx
</span><span class='line'>        cmovb   %rax, %rdx
</span><span class='line'>        movq    $-1, %rax
</span><span class='line'>        cmpl    (%rdx), %edi
</span><span class='line'>        jae     .L130
</span><span class='line'>        rep
</span><span class='line'>        ret
</span><span class='line'>        .p2align 4,,10
</span><span class='line'>        .p2align 3
</span><span class='line'>.L130:
</span><span class='line'>        movq    %rdx, %rax
</span><span class='line'>        subq    %rsi, %rax
</span><span class='line'>        sarq    $2, %rax
</span><span class='line'>        ret</span></code></pre></td></tr></table></div></figure>


<p>Finally, I also tested a vectorised linear search, using SSE
instructions.  The implementation exploits vectorised comparisons to
perform at most one test per cache line (16 values).  In fact, for
vectors of length 16 or less, there is no (mispredicted) branch at
all.  The implementation is actually broken, as it performs signed
rather than unsigned comparisons, but that&#8217;s not an issue in the tests
performed here.</p>
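<p>For reference, a common way to get unsigned comparisons out of the
signed <code>pcmpgtd</code> is to flip the sign bit of both operands
first.  The sketch below is not the code that was benchmarked: it uses the
standard SSE2 intrinsics rather than GCC builtins, and both function names
are hypothetical.</p>

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: bit i of the result is set iff key > vals[i],
 * with both sides compared as unsigned 32-bit integers. */
static int
unsigned_gt_mask(unsigned key, const unsigned *vals)
{
        /* XOR with 0x80000000 maps unsigned order onto signed order. */
        const __m128i bias = _mm_set1_epi32(INT_MIN);
        __m128i k = _mm_xor_si128(_mm_set1_epi32((int)key), bias);
        __m128i v = _mm_xor_si128(_mm_loadu_si128((const __m128i *)vals),
                                  bias);
        /* pcmpgtd is a signed compare; movmskps gathers the sign bits. */
        return _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpgt_epi32(k, v)));
}

/* Hypothetical four-element search built on the helper: index of the
 * first element >= key (the last index if there is none), or
 * (size_t)-1 if the first element is already greater than key. */
static size_t
vsearch_4_unsigned(unsigned key, const unsigned *vector)
{
        if (vector[0] > key) return (size_t)-1;
        int mask = (unsigned_gt_mask(key, vector) ^ 0xf) | (1 << 3);
        return (size_t)__builtin_ctz((unsigned)mask);
}
```

<p>The two extra <code>pxor</code> instructions are the price of
correctness for keys or values with the top bit set; the one on the key
can be hoisted when the same key is reused.</p>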

<figure class='code'><figcaption><span>vectorised search  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
<span class='line-number'>23</span>
<span class='line-number'>24</span>
<span class='line-number'>25</span>
<span class='line-number'>26</span>
<span class='line-number'>27</span>
<span class='line-number'>28</span>
<span class='line-number'>29</span>
<span class='line-number'>30</span>
<span class='line-number'>31</span>
<span class='line-number'>32</span>
<span class='line-number'>33</span>
<span class='line-number'>34</span>
<span class='line-number'>35</span>
<span class='line-number'>36</span>
</pre></td><td class='code'><pre><code class='c'><span class='line'><span class="k">typedef</span> <span class="kt">int</span> <span class="n">v4si</span> <span class="n">__attribute__</span> <span class="p">((</span><span class="n">vector_size</span> <span class="p">(</span><span class="mi">16</span><span class="p">)));</span>
</span><span class='line'><span class="k">typedef</span> <span class="kt">float</span> <span class="n">v4sf</span> <span class="n">__attribute__</span> <span class="p">((</span><span class="n">vector_size</span> <span class="p">(</span><span class="mi">16</span><span class="p">)));</span>
</span><span class='line'>
</span><span class='line'><span class="cp">#define PCMP(KEYS, VALS) (__builtin_ia32_movmskps                       \</span>
</span><span class='line'><span class="cp">                          ((v4sf)__builtin_ia32_pcmpgtd128((KEYS),      \</span>
</span><span class='line'><span class="cp">                                                           (VALS))))</span>
</span><span class='line'>
</span><span class='line'><span class="kt">size_t</span>
</span><span class='line'><span class="nf">vsearch_4</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">key</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="o">*</span> <span class="n">vector</span><span class="p">)</span>
</span><span class='line'><span class="p">{</span>
</span><span class='line'>        <span class="k">if</span> <span class="p">(</span><span class="n">vector</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&gt;</span> <span class="n">key</span><span class="p">)</span> <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
</span><span class='line'>        <span class="n">v4si</span> <span class="n">keys</span> <span class="o">=</span> <span class="p">{</span><span class="n">key</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">key</span><span class="p">};</span>
</span><span class='line'>        <span class="n">v4si</span> <span class="n">vals</span> <span class="o">=</span> <span class="o">*</span><span class="p">(</span><span class="n">v4si</span><span class="o">*</span><span class="p">)</span><span class="n">vector</span><span class="p">;</span>
</span><span class='line'>        <span class="kt">int</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="n">PCMP</span><span class="p">(</span><span class="n">keys</span><span class="p">,</span> <span class="n">vals</span><span class="p">)</span><span class="o">^</span><span class="mh">0xf</span><span class="p">)</span><span class="o">|</span><span class="p">(</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">3</span><span class="p">);</span>
</span><span class='line'>        <span class="k">return</span> <span class="n">__builtin_ctz</span><span class="p">(</span><span class="n">mask</span><span class="p">);</span>
</span><span class='line'><span class="p">}</span>
</span><span class='line'>
</span><span class='line'><span class="cp">#define TEST2(I, V1, V2, BIT) do {                              \</span>
</span><span class='line'><span class="cp">                unsigned mask                                   \</span>
</span><span class='line'><span class="cp">                        = PCMP(keys, V1)                        \</span>
</span><span class='line'><span class="cp">                        | PCMP(keys,V2)&lt;&lt;4;                     \</span>
</span><span class='line'><span class="cp">                mask = (mask^0xff) | (BIT)&lt;&lt;7;                  \</span>
</span><span class='line'><span class="cp">                if (mask)                                       \</span>
</span><span class='line'><span class="cp">                        return 4*(I)+__builtin_ctz(mask);       \</span>
</span><span class='line'><span class="cp">        } while (0)</span>
</span><span class='line'>
</span><span class='line'><span class="kt">size_t</span>
</span><span class='line'><span class="nf">vsearch_8</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">key</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="o">*</span> <span class="n">vector</span><span class="p">)</span>
</span><span class='line'><span class="p">{</span>
</span><span class='line'>        <span class="k">if</span> <span class="p">(</span><span class="n">vector</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&gt;</span> <span class="n">key</span><span class="p">)</span> <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
</span><span class='line'>        <span class="n">v4si</span> <span class="n">keys</span> <span class="o">=</span> <span class="p">{</span><span class="n">key</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">key</span><span class="p">};</span>
</span><span class='line'>        <span class="n">v4si</span> <span class="o">*</span> <span class="n">vals</span> <span class="o">=</span> <span class="p">(</span><span class="n">v4si</span><span class="o">*</span><span class="p">)</span><span class="n">vector</span><span class="p">;</span>
</span><span class='line'>        <span class="n">v4si</span> <span class="n">v0</span> <span class="o">=</span> <span class="n">vals</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">v1</span> <span class="o">=</span> <span class="n">vals</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
</span><span class='line'>        <span class="cm">/* BIT = 1, the branch is compiled away */</span>
</span><span class='line'>        <span class="n">TEST2</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">v0</span><span class="p">,</span> <span class="n">v1</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
</span><span class='line'><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure>


<p>The preliminary test is always correctly predicted, and only ends up
prefetching data that&#8217;ll be used a couple of instructions later.  A
trick similar to <code>TEST2</code> is used to generate a bitmask from 16
comparisons.  Finally, even when there are loops, they&#8217;re fully
unrolled for all sizes (up to 64).  The disassembly below shows how the
conditional branch around <code>return</code> is compiled away in the last
iteration.</p>

<figure class='code'><figcaption><span>compiled vectorised linear search (n=8) </span></figcaption>
<div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
<span class='line-number'>23</span>
<span class='line-number'>24</span>
<span class='line-number'>25</span>
<span class='line-number'>26</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>vsearch_8:
</span><span class='line'>.LFB62:
</span><span class='line'>        .cfi_startproc
</span><span class='line'>        cmpl    %edi, (%rsi)
</span><span class='line'>        movq    $-1, %rax
</span><span class='line'>        jbe     .L210
</span><span class='line'>        rep
</span><span class='line'>        ret
</span><span class='line'>        .p2align 4,,10
</span><span class='line'>        .p2align 3
</span><span class='line'>.L210:
</span><span class='line'>        movl    %edi, -12(%rsp)
</span><span class='line'>        movd    -12(%rsp), %xmm1
</span><span class='line'>        pshufd  $0, %xmm1, %xmm0
</span><span class='line'>        movdqa  %xmm0, %xmm1
</span><span class='line'>        pcmpgtd 16(%rsi), %xmm0
</span><span class='line'>        pcmpgtd (%rsi), %xmm1
</span><span class='line'>        movmskps        %xmm0, %eax
</span><span class='line'>        movmskps        %xmm1, %edx
</span><span class='line'>        sall    $4, %eax
</span><span class='line'>        orl     %edx, %eax
</span><span class='line'>        xorb    $-1, %al
</span><span class='line'>        orb     $-128, %al
</span><span class='line'>        bsfl    %eax, %eax
</span><span class='line'>        cltq
</span><span class='line'>        ret</span></code></pre></td></tr></table></div></figure>
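<p>The 16-comparison variant mentioned above is not listed.  One way to
build such a bitmask, sketched here under the assumption that it simply
extends <code>TEST2</code> to four <code>PCMP</code> results, is shown
below.  The names <code>pcmp</code> and <code>vsearch_16_sketch</code> are
hypothetical, the code uses SSE2 intrinsics instead of the GCC builtins,
and it keeps the signed comparison of the original.</p>

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Bit i set iff key > v[i] (signed compare, like the PCMP macro). */
static int
pcmp(__m128i keys, const unsigned *v)
{
        __m128i vals = _mm_loadu_si128((const __m128i *)v);
        return _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpgt_epi32(keys,
                                                                vals)));
}

/* Hypothetical 16-element search in the style of vsearch_8. */
static size_t
vsearch_16_sketch(unsigned key, const unsigned *vector)
{
        if (vector[0] > key) return (size_t)-1;
        __m128i keys = _mm_set1_epi32((int)key);
        /* Four 4-bit masks packed into one 16-bit mask. */
        unsigned mask = (unsigned)pcmp(keys, vector)
                | (unsigned)pcmp(keys, vector + 4) << 4
                | (unsigned)pcmp(keys, vector + 8) << 8
                | (unsigned)pcmp(keys, vector + 12) << 12;
        /* Flip to "v[i] >= key", and force the top bit so that the
         * mask is never zero and no branch is needed. */
        mask = (mask ^ 0xffffu) | (1u << 15);
        return (size_t)__builtin_ctz(mask);
}
```

<p>As in <code>vsearch_8</code>, forcing the last bit makes the final test
unconditional, so a single cache line is searched without any
data-dependent branch after the preliminary check.</p>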


<h2>Graphics time!</h2>

<p>The first scenario was a search for random keys on an in-cache vector.
As in all other tests, 10 000 searches were performed, and the
overhead of the benchmarking loop was estimated with the minimal
iteration time when calling an empty function (44 cycles), and
subtracted from the measured values.  Finally, extreme outliers
(values larger than the 99th percentile for each scenario, size and
algorithm) were removed.  There&#8217;s a bit of overhead (<code>rdtscp</code> isn&#8217;t
exactly lightweight), so the shorter searches don&#8217;t reveal much,
except how complex current CPUs can be.  I also tried measuring a
couple of searches at a time, but that lost interesting information, and
let out-of-order execution mess up the results.</p>
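<p>The clean-up step described above can be sketched as follows.  This is
an illustration, not the script that produced the plots: the helper name
<code>clean_samples</code>, the nearest-rank definition of the 99th
percentile, and the clamping at zero are all assumptions.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int
cmp_u64(const void *a, const void *b)
{
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
        return (x > y) - (x < y);
}

/* Hypothetical helper.  Sorts the cycle counts in place, subtracts the
 * benchmarking overhead (clamping at zero), and returns how many of
 * the sorted samples are at or below the 99th percentile. */
static size_t
clean_samples(uint64_t *cycles, size_t n, uint64_t overhead)
{
        assert(n > 0);
        qsort(cycles, n, sizeof *cycles, cmp_u64);
        for (size_t i = 0; i < n; i++)
                cycles[i] = (cycles[i] > overhead) ? cycles[i] - overhead : 0;

        /* Nearest-rank percentile: 1-based rank ceil(0.99 * n). */
        size_t rank = (99 * n + 99) / 100;
        uint64_t cutoff = cycles[rank - 1];
        size_t kept = rank;
        while (kept < n && cycles[kept] == cutoff)
                kept++;         /* keep ties with the cutoff value */
        return kept;
}
```

<p>After the call, <code>cycles[0..kept-1]</code> is sorted, so the median
and quartiles can be read off directly by index.</p>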

<p><img class="center" src="http://www.pvk.ca/images/2012-07-03-binary-search-star-reduces-star-branch-mispredictions/box-random-cached.png"></p>

<p>The graphics overlay a jittered scatter plot and a
<a href="http://en.wikipedia.org/wiki/Box_plot">box plot</a>.  The density of
points reveals how often each execution time (in clock cycles) was
observed.  The box plots give us the median and the first and third
quartiles; the median is a more robust single-number summary than the
mean on strange distributions.  Still, I also included the arithmetic
mean (without outliers) as a thin grey line.  The search algorithms
are, from left to right, forward linear search (&#8220;lin&#8221;), backward
linear search (&#8220;inv&#8221;), vectorised forward linear search (&#8220;vec&#8221;), and
binary search (&#8220;bin&#8221;).  The size of the vector is reported at the
top.</p>

<p>In general, interesting transitions will happen around length 16
(especially for forward and backward linear searches), and 32 for
binary search.  Scalar linear searches longer than 16 elements aren&#8217;t
unrolled, the vectorised search works in chunks of up to 16 elements,
and 16 <code>unsigned</code>s fit exactly in one cache line. At length 32, binary
search may access one or two cache lines, depending on the data, and
vectorised search starts executing conditional branches.</p>

<p><span class='pullquote-right' data-pullquote='binary search is always faster than linear search: the single mispredicted branch in the latter hurts. '>
The box plots are pretty clear.  Except for vectors of length two or four
(for which the benchmarking infrastructure probably dominates
everything), binary search is always faster than linear search: the single mispredicted branch in the latter hurts.  The
vectorised search is always a bit quicker than the scalar versions
and performs nearly identically to the binary search up
to size 16.  After that, it has to execute a few unpredictable
branches.  The distribution of runtimes for the vectorised search
exhibits very distinct peaks, so the quartiles might not be as robust
as they usually are.
</span></p>

<p>The scatter plots are also interesting.  First, they raise the
question of what exactly is going on with the bimodal
distribution on vectors of length two or four.  I have no
clue. Current CPUs are awfully complex stateful systems with strange
interactions; I would guess a combination of back-to-back call and
return, and the benchmarking code interfering with itself.</p>

<p>Over all vector sizes, the distribution of latencies for linear
searches is distinctly multimodal: searches that end quickly execute
much faster.  The effect of batching by the cache line in the
vectorised search is also visible.  Unpredictable branches are only
executed for length 32 or 64; the median times (and spread, obviously)
then increase quite a lot.  There&#8217;s also some variation between the
forward and backward linear search; I frankly don&#8217;t know why the
access pattern has that impact, or if it might not just be noise.</p>

<p>What happens if more regular queries help eliminate branch
mispredictions?  Obviously, binary search will not be affected, but we
can expect the linear searches to execute more quickly.</p>

<p><img class="center" src="http://www.pvk.ca/images/2012-07-03-binary-search-star-reduces-star-branch-mispredictions/box-fixed-cached.png"></p>

<p>The leftmost four plots above show what happens when the linear searches
always find the key after eight iterations (one for the vectorised
search).  The remaining two (&#8220;32*&#8221; and &#8220;64*&#8221;) instead place the key
at the very last position that is searched.</p>

<p><span class='pullquote-right' data-pullquote='Conditional branches speed linear search up. '>
In all cases, binary search remains the same.  When the key is found
quickly (first four plots), scalar linear searches are approximately
on par with binary search, and the vectorised search faster.  On the
other hand, when the whole vector must be traversed (two rightmost
plots), the linear searches take more than thrice as much time as the
binary search! Even the vectorised search is at best comparable to the
binary search.  Comparing the top of the distribution in the first
plot to this one shows that mispredicted branches can slow linear
searches by up to 50ish cycles.  However, it is still quicker overall
to take the hit and break out of the search early: compare the
medians.  Conditional branches speed linear search up.
</span></p>

<p>The effect isn&#8217;t as marked on the vectorised search, but it&#8217;s still
slower than a binary search on the &#8220;32*&#8221; and &#8220;64*&#8221; plots.  The
branches are always predicted correctly, and we end up with a good
approximation of a fully vectorised search.  Rather than eliminating
all the branches from the vectorised linear search (by always
processing the whole vector), it&#8217;s simpler and faster to go for a
binary search.</p>

<p>Mispredictions slow linear searches down, and traversing the data
completely even more. The next plot summarises the timings for
searches that succeed after the first or second iteration of (scalar)
linear search.  That&#8217;s almost the best environment I can think of for
linear search, short of always searching for one value that&#8217;s found in
the very first iteration.</p>

<p><img class="center" src="http://www.pvk.ca/images/2012-07-03-binary-search-star-reduces-star-branch-mispredictions/box-split-cached.png"></p>

<p>Even when they always succeed after at most two iterations, the linear
searches are only slightly faster than a binary search, with a few
very slow runs.  The vectorised search is shown in its best light: it
always succeeds on the first iteration and is, overall, comparable
with the linear searches (with a tighter spread).  Much like the
binary search, vectorising minimises branches, but does so at the cost
of additional work.</p>

<p><span class='pullquote-right' data-pullquote='Even with very regular queries, linear searches are, at best, slightly faster than binary search, and otherwise significantly slower. '>
The &#8220;64*&#8221; plot exercises the vectorised search better: the element is
found on the third or last iteration (48th or 49th for the linear
searches).  The binary search isn&#8217;t affected at all, and the linear
searches become appreciably (by a factor of two or four) slower.  Even with very regular queries, linear searches are, at best, slightly faster than binary search, and otherwise significantly slower.
</span></p>

<p>When the data is expected to be cached, I believe it&#8217;s fair to say
that binary search should always be used instead of linear searches,
even unrolled vectorised (fully or not) versions.  In the worst
case, scalar linear searches can be more than 50 or 100 cycles slower
than a decent binary search.  In the best case, a partially vectorised
linear search can save about 10 cycles.  That sounds like a horrible
bargain.</p>

<h2>Let&#8217;s throw in some cache misses</h2>

<p><span class='pullquote-right' data-pullquote='linear searches will eke speed-ups out by skipping some accesses, thanks to their conditional branches. '>
When the data is cached, linear search only wins if we keep hammering
the same query, and it&#8217;s found in the first iterations. What if
the data isn&#8217;t cached?  I ran another set of tests, this time hopping
around 1 GB&#8217;s worth of short vectors.  The vectors were aligned to
their size, so the vectorised search worked naturally, and we can also
easily think about cache lines.  The vectors were chosen by picking
vectors in different 4 KB pages, at every potential offset from the
start of the page, and shuffling them randomly to foil prefetching.
This is pretty much a worst-case scenario: there&#8217;s both a TLB and an
L3 miss (but the latency of a TLB miss varies a lot depending on which
parts of the page table are still cached).  I could also have tried to
use huge pages (2 MB) to eliminate TLB misses.  I don&#8217;t think that&#8217;s
realistic: huge pages aren&#8217;t widely used, and TLB misses tend
to be an issue even with working sets that are still big
enough for L2.  We can expect two things here: the timings will be
orders of magnitude slower,
and linear searches will eke speed-ups out by skipping some accesses, thanks to their conditional branches.
</span></p>
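<p>One way to lay out such a test set is sketched below.  The function
name <code>pick_vectors</code>, the round-robin choice of offsets and the
use of <code>rand</code> are assumptions made for illustration; the text
only says that vectors sit in different pages, cover every offset, and are
visited in shuffled order.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { PAGE_BYTES = 4096 };

/* Hypothetical helper: fills out[0..npages-1] with byte offsets into a
 * large buffer, one vector per 4 KB page, cycling through every
 * size-aligned slot of the page, then shuffles the visiting order. */
static void
pick_vectors(size_t *out, size_t npages, size_t vec_bytes, unsigned seed)
{
        assert(vec_bytes > 0 && PAGE_BYTES % vec_bytes == 0);
        size_t slots = PAGE_BYTES / vec_bytes;

        for (size_t p = 0; p < npages; p++)
                out[p] = p * PAGE_BYTES + (p % slots) * vec_bytes;

        /* Fisher-Yates shuffle, so that consecutive searches do not
         * follow a pattern the prefetcher could pick up.  rand() is
         * only good enough for a sketch: it has modulo bias, and
         * RAND_MAX may be smaller than npages on some platforms. */
        srand(seed);
        for (size_t i = npages; i > 1; i--) {
                size_t j = (size_t)rand() % i;
                size_t tmp = out[i - 1];
                out[i - 1] = out[j];
                out[j] = tmp;
        }
}
```

<p>With 1 GB of data, <code>npages</code> would be 262144, and
<code>vec_bytes</code> four times the number of elements in each
vector.</p>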

<p>The first set of plots corresponds to random searches in out-of-cache
vectors.  Searches in even the shortest vectors are slow enough to be
somewhat meaningful.  We can observe that the latency of two accesses
to the same cache line varies a lot, from 200 to 700 cycles.  The
distributions are strongly multimodal, and it&#8217;s not clear to me that
comparing the quartiles is always useful: they sometimes fall in
nearly-empty zones between two peaks, so I would expect tiny changes
to affect them a lot.  The plots for length up to eight are all very
similar, so I&#8217;ll just drop sizes two and four in the sequel.</p>

<p><img class="center" src="http://www.pvk.ca/images/2012-07-03-binary-search-star-reduces-star-branch-mispredictions/box-random-uncached.png"></p>

<p><span class='pullquote-right' data-pullquote='In the real world, chips don&#8217;t load whole cache lines atomically. '>
The forward linear searches (scalar and vectorised) are, overall,
a bit faster than the backward one.  That&#8217;s a bit counter-intuitive:
vectors of size up to 16 fit in one cache line.  That the latency from
cache misses is always the same as soon as we touch a cache line is a
useful lie: it&#8217;s usually a good enough approximation.  In the real world, chips don&#8217;t load whole cache lines atomically.
Current sockets have hundreds of pins, but they can&#8217;t all be dedicated
to communicating with DIMMs.  On the X5660 where the tests were
executed, single-core bandwidth is 8 GB/s.  That&#8217;s roughly the
bandwidth from one of its DDR3-1066 modules.  DDR3 sends data in
chunks of 64 bits, so that&#8217;s how one cache line is loaded: 64 bits at a
time, in eight transfers for a 64-byte line.  The CPU&#8217;s memory controller first loads the chunk corresponding
to the instruction that triggered a cache miss, and then the rest of
the cache line.  The controller simply seems to have been tuned for
programs that load data from low to high addresses.
</span></p>

<p>The binary search seems a tiny bit quicker than the other searches on
sizes up to 32, and fares badly on 64-element vectors.  16 <code>unsigned</code>s is
exactly one cache line, so it&#8217;s not surprising that 16 or fewer
(correctly aligned) elements can be binary searched quickly.  The
size-32 case benefits from an implicit data dependency in binary
search.  When the key falls in the latter half of the vector, only
that half will be read from: the initial midpoint is the first element
of that latter half.  32 <code>unsigned</code>s fit exactly in two cache lines, and
binary search thus behaves very similarly to linear searches: it reads
from only one cache line half the time, and otherwise from two.  I
can&#8217;t explain the solid peak of occurrences around 600 cycles though.
Finally, the data dependency doesn&#8217;t help so much on vectors of length
64 (half of four cache lines is still two), and binary search is then
significantly slower.</p>
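
<p>To make the cache-line argument concrete, here is a small sketch in Common Lisp (not the C code that was benchmarked) that counts the distinct cache lines read by a power-of-two binary search.  It assumes a sorted vector of distinct 4-byte values aligned on a 64-byte boundary, so that 16 elements share a line, and a search loop that probes <code>low + half</code> and moves <code>low</code> up when the probed value is at most the key.  The names <code>probed-indices</code>, <code>cache-lines-touched</code> and <code>mean-cache-lines</code> are made up for this illustration.</p>

<pre><code>(defun probed-indices (size target)
  "Indices read while searching for the element stored at index
TARGET in a sorted vector of SIZE (a power of two) distinct values."
  (let ((low 0)
        (probes '()))
    (loop for half = (floor size 2) then (floor half 2)
          while (plusp half)
          do (let ((mid (+ low half)))
               (push mid probes)
               ;; The probed value is at most the key exactly when
               ;; MID does not exceed TARGET.
               (unless (minusp (- target mid))
                 (setf low mid))))
    (nreverse probes)))

(defun cache-lines-touched (size target)
  "Number of distinct 16-element cache lines among the probes."
  (length (remove-duplicates
           (mapcar (lambda (index) (floor index 16))
                   (probed-indices size target)))))

(defun mean-cache-lines (size)
  "Average over all key positions, as an exact rational."
  (/ (loop for target below size
           sum (cache-lines-touched size target))
     size))
</code></pre>

<p>Under these assumptions, <code>(mean-cache-lines 16)</code> is 1, <code>(mean-cache-lines 32)</code> is 3/2 (one line half the time, two otherwise), and <code>(mean-cache-lines 64)</code> is 9/4: every key now costs at least two lines, and a quarter of them cost three.</p>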

<p>Some of the variation in execution times must still be caused by
branch mispredictions.  The following plots show what happens if the
key is always found after eight iterations of the scalar linear
searches (one iteration of the vectorised search).  The binary search
is also slightly advantaged on sizes 16 and up, as the key is found in
the latter half of the vector.  The rightmost two plots fix the key so
that it is found at the very last iteration of the searches, and
disadvantage binary search by searching for the minimum value in the
data.</p>

<p><img class="center" src="http://www.pvk.ca/images/2012-07-03-binary-search-star-reduces-star-branch-mispredictions/box-fixed-uncached.png"></p>

<p>When the key is found early, the forward linear search seems a tiny
bit faster than the rest, even the vectorised search (which trades
more memory accesses for fewer branches).  Interestingly, when the key
is always found at the end, the linear searches aren&#8217;t affected that
much; speculative execution seems to be doing its job at hiding
latencies.</p>

<p>The last plots correspond to the situation when the key is found after
one or two iterations (equiprobably) for the linear searches (one for
the vectorised version, and in the last cache line for binary search).
The &#8220;64*&#8221; column shows the latencies for a worse scenario: the key is
found after 48 or 49 iterations of the linear searches (three or four
for the vectorised search), and in the first quarter of data for
binary search.</p>

<p><img class="center" src="http://www.pvk.ca/images/2012-07-03-binary-search-star-reduces-star-branch-mispredictions/box-split-uncached.png"></p>

<p>In the good cases, all searches seem to be doing similarly, except for
vectors of length 64 (on which the binary search really suffers).  In
fact, comparing the &#8220;64&#8221; and &#8220;64*&#8221; plots reveals that binary search
is affected by variations in the query just as much as linear
searches: bad memory access patterns can hurt a lot more than
mispredicted branches.</p>

<p>In the end it looks like all searches behave pretty much the same when
searching uncached vectors spanning at most two cache lines.  After
that, linear searches have better access patterns (and after that,
asymptotics catch up).</p>

<h2>Wrapping up</h2>

<p><span class='pullquote-right' data-pullquote='Linear search should only be used when it&#8217;s expected to bail out quickly. '>
When performance matters, linear search shouldn&#8217;t be preferred to
binary search to minimise branch mispredictions.  Linear search should only be used when it&#8217;s expected to bail out quickly.
The situations in which that property is most useful involve
uncached, medium-size vectors.  In general, I&#8217;d just stick to binary
search: it&#8217;s easy to get good and consistent performance from a
simple, portable implementation.
</span></p>

<p>In some cases, it might be worth the trouble to go for a
heroically optimised linear search, but maintenance may be a bitch.
If I expected to be working with larger, potentially uncached,
datasets, I&#8217;d instead try to exploit caches more smartly… But that&#8217;s a
topic for another post!</p>

<p>A friend of mine pointed out one case when a fully vectorised linear
search, without early exit, may be the only (high-performance) option:
cryptographic code.  When the data isn&#8217;t cached, even binary search
can leak some information, via the number of distinct cache lines that
are accessed.</p>

<p>I also completely ignored the question of shared resources (memory
channels, particularly) on multicore chips: what happens to throughput
when multiple cores are executing that kind of search workload?  I&#8217;ll
probably leave that to someone else ;)</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Fitting polynomials by generating linear constraints]]></title>
    <link href="http://www.pvk.ca/Blog/2012/05/24/fitting-polynomials-by-generating-linear-constraints/"/>
    <updated>2012-05-24T17:35:00+02:00</updated>
    <id>http://www.pvk.ca/Blog/2012/05/24/fitting-polynomials-by-generating-linear-constraints</id>
    <content type="html"><![CDATA[<p><small>(There were some slight edits since the original publication.
The only major change is that the time to solve each linear program
probably grows cubically, not linearly, with the number of
points.)</small></p>

<p>This should be the first part in a series of posts exploring the
generation of efficient and precise polynomial approximations.  Yes,
there is Lisp code hidden in the math (: See
<a href="https://github.com/pkhuong/rational-simplex">rational-simplex</a> and
<a href="https://github.com/pkhuong/rational-simplex/tree/master/demo/vanilla-fit">vanilla-fit</a>.</p>

<p>Here&#8217;s the context: we want to approximate some non-trivial function,
like \(\log\), \(\exp\) or \(\sin\) on a computer, in floating
point arithmetic.  The textbook way to do this is to use a truncated
Taylor series: evaluating polynomials is pretty quick on computers.
<a href="http://lol.zoy.org/blog/2011/12/21/better-function-approximations">Sam Hocevar</a>&#8217;s
post on the topic does a very good job of illustrating how
wrong-headed the approach is in most cases.  We should instead try to
find an approximation that minimises the maximal error over a range
(e.g. \([-\pi,\pi]\) for \(\sin\)).  He even has
<a href="http://lol.zoy.org/wiki/oss/lolremez">LolRemez</a> to automatically
compute such approximations for given functions.  (While we&#8217;re kind of
on the topic, the very neat
<a href="http://www2.maths.ox.ac.uk/chebfun/">Chebfun</a> applies closely-related
ideas to try and be to continuous functions what floating point
numbers are to reals.)</p>

<p>The LolRemez docs point out two clever tricks to get more efficient
approximations:
<a href="http://lol.zoy.org/wiki/doc/maths/remez/tutorial-changing-variables">forcing odd or even polynomials</a>,
and
<a href="http://lol.zoy.org/wiki/doc/maths/remez/tutorial-fixing-parameters">fixing parameters to simple values</a>.
Could we automate this process?</p>

<p>The goal of this series is to show that we can, by applying Operations
Research (<a href="http://www.scienceofbetter.org/">&#8220;The Science of Better&#8221;</a>)
to this micro-optimisation challenge.  OR is what I do during the day,
as a PhD student, and I believe that there are a lot of missed
opportunities in nifty program optimisation or analysis.  (I also
think that language design has a lot to offer to OR people who cope
with modeling languages that have only changed incrementally since the
seventies. :)</p>

<h2>Problem statement</h2>

<p>For now, let&#8217;s try and solve the original problem: find, in a family
of functions \(F\) (e.g. polynomials of degree 4), an approximation
\(\tilde{f}\) of a function \(f\) over some bounded range
\(B\subset \mathbb{R}\) such that \(\tilde{f}\) minimises the maximal error
over that range:</p>

<div markdown="0">
$$\tilde{f} = \mathop{\arg\min}_{g \in F}\max_{x\in B} |g(x)-f(x)|.$$
</div>


<p>\(B\) is a dense set, with infinitely, if not uncountably, many
elements, so it&#8217;s far from obvious that we can solve this, even up to
negligible errors.</p>

<h2>The Remez algorithm</h2>

<p>The
<a href="http://en.wikipedia.org/wiki/Remez_algorithm">Remez exchange algorithm</a>
shows one way to solve the minimax approximation problem, from the realm
of mathematical analysis.</p>

<p>Let&#8217;s pretend we do not know \(\tilde{f}\), but do know the points
where its error from \(f\) is maximised.  In fact, if we are looking
for polynomials of degree \(n\), we only need \(n+2\) such points.
Let&#8217;s further impose that the errors alternate in sign: i.e., if the
points are ordered (\(x_0 &lt; \ldots &lt; x_i &lt; \ldots &lt; x_{n+1}\)), then
\(\tilde{f}(x_0) &lt; f(x_0)\), \(\tilde{f}(x_1) > f(x_1)\), etc., or
vice versa.  Nifty analysis results tell us that these points exist,
and that they exhibit the same absolute error \(|\epsilon|\).  We
can now find the coefficients \(a_0,a_1,\ldots,a_n\) of
\(\tilde{f}\) by solving a small linear system:</p>

<div markdown="0">
$$\sum_{j=0}^n a_j x_0^j = f(x_0) + \epsilon$$
$$\sum_{j=0}^n a_j x_1^j = f(x_1) - \epsilon$$
<p><center>etc. for each &#92;(x_i&#92;).</center></p>
</div>


<p>There are \(n+2\) variables (the \(n+1\) coefficients \(a_j\)
and \(\epsilon\)), and \(n+2\) constraints (one for each
extremum), so it&#8217;s a straight linear solve.</p>
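
<p>As a hedged illustration of that linear solve, the sketch below builds the \(n+2\) rows for a given list of reference points and solves them with a plain Gauss-Jordan elimination over rationals.  None of this comes from LolRemez or from the code discussed later in this post; <code>reference-system</code>, <code>solve-linear-system</code> and <code>solve-reference</code> are hypothetical names, and the solver assumes a nonsingular system.</p>

<pre><code>(defun reference-system (f points)
  "One row per point x_i, encoding
sum_j a_j x_i^j - s_i eps = f(x_i), with s_i = +1, -1, +1, ...
Each row lists the coefficients of a_0 ... a_n and eps, then f(x_i)."
  (let ((degree (- (length points) 2)))
    (loop for x in points
          for sign = 1 then (- sign)
          collect (append (loop for j from 0 to degree
                                collect (expt x j))
                          (list (- sign) (funcall f x))))))

(defun solve-linear-system (rows)
  "Gauss-Jordan elimination on the augmented matrix ROWS.
Exact as long as every entry is a rational."
  (let* ((n (length rows))
         (m (make-array (list n (1+ n)) :initial-contents rows)))
    (dotimes (col n)
      (let ((pivot (loop for row from col below n
                         unless (zerop (aref m row col))
                           return row)))
        (assert pivot () "singular system")
        ;; Bring the pivot row in place and normalise it.
        (dotimes (k (1+ n))
          (rotatef (aref m col k) (aref m pivot k)))
        (let ((scale (aref m col col)))
          (dotimes (k (1+ n))
            (setf (aref m col k) (/ (aref m col k) scale))))
        ;; Cancel this column in every other row.
        (dotimes (row n)
          (unless (= row col)
            (let ((factor (aref m row col)))
              (dotimes (k (1+ n))
                (decf (aref m row k)
                      (* factor (aref m col k)))))))))
    (loop for row below n collect (aref m row n))))

(defun solve-reference (f points)
  "Return the list (a_0 ... a_n eps) for the reference POINTS."
  (solve-linear-system (reference-system f points)))
</code></pre>

<p>For instance, <code>(solve-reference (lambda (x) (* x x)) '(-1 0 1))</code> returns <code>(1/2 0 -1/2)</code>: the best degree-1 fit to \(x^2\) over \([-1, 1]\) is the constant 1/2, with \(|\epsilon| = 1/2\).</p>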

<p>Of course, we don&#8217;t actually know these extremal error points
\(x_i\).  The Remez algorithm works because we can initialise them
with an educated guess, find the corresponding coefficients, construct
a new approximation from these coefficients, and update (some of) the
extremal points from the new approximation.  In theory this process
converges (infinitely) toward the optimal coefficients.  In practice,
numerical issues abound: solving a linear system is usually reasonably
stable, but finding exact function extrema is quite another issue.
The upside is that smart people like
<a href="http://lol.zoy.org/wiki/oss/lolremez">Sam</a> and the
<a href="http://www.boost.org/doc/libs/1_36_0/libs/math/doc/sf_and_dist/html/math_toolkit/toolkit/internals2/minimax.html">boost::math team</a>
have already done the hard work for us.</p>

<h2>A linear minimax formulation</h2>

<p>The minimax objective can be reformulated as an optimisation problem
over reals (or rationals, really) with linear constraints and
objective, a
<a href="http://en.wikipedia.org/wiki/Linear_programming">linear program</a>:</p>

<div markdown="0"> $$\min_{e\in\mathbb{R}, a\in\mathbb{R}^{n+1}} e$$
subject to $$\sum_{i=0}^n a_ix^i \leq f(x) + e\qquad\forall x\in B$$
$$\sum_{i=0}^n a_ix^i \geq f(x) - e\qquad\forall x\in B$$ </div>


<p>In words, we&#8217;re trying to find an error value \(e\) and \(n+1\)
coefficients, such that the error \(|\tilde{f}(x)-f(x)|\quad\forall x\in
B\) is at most \(e\): \(-e \leq \tilde{f}(x)-f(x)\leq e\).  The
goal is to minimise that error value \(e\).</p>

<p>We&#8217;re still left with the issue that \(B\) is very large: infinite,
or even uncountable.  It doesn&#8217;t make sense to have two constraints
for each \(x\in B\)!  Our goal is to find approximations in floating
point arithmetic, which helps a bit: there&#8217;s actually a finite number
of floating point (single, double or otherwise) values in any range.
That number, albeit finite, still tends to be impractically large.
We&#8217;ll have to be clever and find a good-enough subset of constraints.
Note that, unlike the Remez algorithm, we can have as many constraints
as we want; we never have to remove constraints, so the quality of the
approximation can only improve.</p>
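
<p>Here is a minimal sketch of what a subset of constraints means in practice: a pair \((a, e)\) satisfies the two constraints for every sampled \(x\) exactly when the largest sampled error is at most \(e\).  The helper names (<code>evaluate-polynomial</code>, <code>worst-sample</code>) are assumptions made for this example, not functions from rational-simplex.</p>

<pre><code>(defun evaluate-polynomial (coefficients x)
  "Horner evaluation; COEFFICIENTS lists a_0, a_1, ..., a_n."
  (reduce (lambda (coefficient accumulator)
            (+ coefficient (* x accumulator)))
          coefficients
          :from-end t :initial-value 0))

(defun worst-sample (coefficients f samples)
  "Largest |p(x) - f(x)| over SAMPLES, and a sample that attains it."
  (let ((worst-error 0)
        (worst-x nil))
    (dolist (x samples (values worst-error worst-x))
      (let ((err (abs (- (evaluate-polynomial coefficients x)
                         (funcall f x)))))
        ;; Only a strictly larger error replaces the current worst.
        (when (plusp (- err worst-error))
          (setf worst-error err
                worst-x x))))))
</code></pre>

<p>With \(f(x) = x^2\) and the constant polynomial 1/8, <code>(worst-sample '(1/8) (lambda (x) (* x x)) '(-1/2 0 1/2))</code> reports an error of 1/8, but adding the samples -1 and 1 raises it to 7/8.  A solution that looks fine on too small a subset can violate constraints that were left out.</p>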

<p>This method is an interesting alternative to the Remez algorithm
because it always converges, finitely, even with (or rather, due to)
floating-point arithmetic.  It will also make our future work on a
branch-and-cut method marginally easier.</p>

<h2>Rational simplex for exact solutions to linear programs</h2>

<p>We&#8217;ve reformulated our problem as a finite linear program, assuming
that we know for which points \(x\in B\) we should generate
constraints.  How can we find solutions?</p>

<p>LPs are usually solved with variants of the
<a href="http://en.wikipedia.org/wiki/Simplex_algorithm">simplex algorithm</a>, a
<em>combinatorial</em> algorithm.  This is the key part: while most of the
algorithm is concerned with numerical computations (solving linear
systems, etc.), these are fed to predicates that compare values, and
the results of the comparisons drive each iteration of the algorithm.
The key insight in the simplex algorithm is that there exists at least
one optimal solution in which all but a small set of variables (the
basis) are set to zero, and the value of the basic variables can then
be deduced from the constraints.  Better: every suboptimal (but
feasible) basis can be improved by only substituting one variable in
the basis.  Thus, each iteration of the algorithm moves from one basis
to a similar, but better, one, and numerical operations only determine
whether a basis is feasible and is an improvement.  Very little
information has to persist across iterations: only the set of basic
variables.  This means that there is no inherent accumulation of
round-off errors, and that we can borrow tricks from the computational
geometry crowd.</p>

<p><a href="http://www.dii.uchile.cl/~daespino">Daniel Espinoza</a> implemented
that, and much more, for his PhD thesis on exact linear and integer
program solvers.
<a href="http://www.dii.uchile.cl/~daespino/ESolver_doc/main.html">QSopt-Exact</a>
is a GPL fork of a fairly sophisticated simplex and branch and cut
program.  It exploits the fact that the simplex algorithm is
combinatorial to switch precision on the fly, using hardware floats
(single, double and long), software quad floats, multiprecision
floats, and rational arithmetic.  Unfortunately, the full solver
segfaults when reporting solution values on my linux/x86-64 machine.
Still, the individual simplex solvers (with double floats, quad
floats, multiprecision floats and rationals) work fine.  What I
decided to do was to sidestep the issue and call the solvers in
increasing order of precision, while saving the basis between calls.
The double float implementation is executed first, and when it has
converged (within epsilon), a more precise solver is called, starting
from the basis on which the previous solver converged, etc.  The final
rational solver is exact, but very slow.  Hopefully, previous solvers,
while inexact, will have converged to very nearly-optimal bases,
leaving only a couple iterations for the exact solver.</p>

<p><a href="https://github.com/pkhuong/rational-simplex">Rational-simplex</a> is a
one-file CL system that wraps QSopt-Exact and offers a trivial
modeling language.  We&#8217;ll use it a lot in the sequel.</p>

<h2>Solving the minimax LP with rational-simplex</h2>

<p><a href="https://github.com/pkhuong/rational-simplex/blob/master/demo/vanilla-fit/linf-fit.lisp">linf-fit.lisp</a>
implements a minimax linear fit in 60 LOC (plus a couple kLOC in QSopt
:).</p>

<p>Each
<a href="https://github.com/pkhuong/rational-simplex/blob/master/demo/vanilla-fit/linf-fit.lisp#L20">point</a>
is defined by <code>loc</code>, the original value for <code>x</code>, a sequence of
parameters (in this case, powers of <code>x</code> as we&#8217;re building
polynomials), and the <code>value</code> to approximate, \(f(x)\).</p>

<p><a href="https://github.com/pkhuong/rational-simplex/blob/master/demo/vanilla-fit/linf-fit.lisp#L41">solve-fit</a>
takes a sequence of such points, and solves the corresponding LP with
QSopt-Exact:</p>

<pre><code>(defun solve-fit (points)
  (lp:with-model (:name "linf fit" :sense :minimize)
    (let ((diff  (lp:var :name "diff" :obj 1))
          (coefs (loop for i below *dimension* collect
                       (lp:var :name (format nil "coef ~A" i)
                               :lower nil))))
      (map nil (lambda (point)
                 (let ((lhs (linexpr-dot coefs
                                         (point-parameters point)))
                       (rhs (point-value point)))
                   (lp:post&lt;= lhs (lp:+ rhs diff))
                   (lp:post&gt;= lhs (lp:- rhs diff))))
           points)
      (multiple-value-bind (status obj values)
          (lp:solve)
        (assert (eql status :optimal))
        (values obj (map 'simple-vector
                         (lambda (var)
                           (gethash var values 0))
                         coefs))))))
</code></pre>

<p>With a fresh model in scope in which the goal is to minimise the
objective function, a variable, <code>diff</code>, is created to represent the
error term \(e\), along with one variable for each coefficient in the
polynomial (the <code>coefs</code> list); each coefficient is unbounded both from
below and above.</p>

<p>Then, for each point, the two corresponding constraints are added to
the current model.</p>

<p>Finally, the model is solved, and a vector of optimal coefficients is
extracted from the <code>values</code> hash table.</p>

<p>We can easily use <code>solve-fit</code> to <em>exactly</em> solve minimax instances
with a couple thousand points in seconds:</p>

<pre><code>CL-USER&gt; (let ((*dimension* 5) ; degree = 4
               (*loc-value* (lambda (x)
                              (rational (exp (float x 1d0))))))
           ;; function to fit: EXP. Converting the argument to
           ;; a double-float avoids the implicit conversion to
           ;; single, and we translate the result into a rational
           ;; to avoid round-off
           (solve-fit (loop for x from -1 upto 1 by 1/2048
                       ;; create a point for x in 
                       ;; -1 ... -1/2048,0,1/2048, ... 1
                            collect (make-point x))))
dbl_solver: 0.510 sec 0.0 (optimal)      ; solve with doubles
ldbl_solver: 0.260 sec 0.0 (optimal)     ;   with long doubles
float128_solver: 0.160 sec 0.0 (optimal) ;   with 128 bit quads
mpf_solver: 1.110 sec 0.0 (optimal)      ;   with multi-precision floats
mpq_solver: 0.570 sec 0.0 (optimal)      ;   with rationals
3192106304786396545997180544707814405/5839213465929014357942937289929760178176
#(151103246476511404268507157826110201499879/151089648430913246511773502376932544610304
  28253082057367857587607065964265169406677/28329309080796233720957531695674852114432
  13800443618507633486386515045130833755/27665340899215071993122589546557472768
  9582651302802548702442907342912775/54033868943779437486567557708120064
  2329973989305264632365349185439/52767450140409606920476130574336)
</code></pre>

<p>Or, as more readable float values:</p>

<pre><code>CL-USER&gt; (values (float * 1d0) (map 'simple-vector (lambda (x)
                                                     (float x 1d0))
                                     (second /)))
5.466671707434376d-4 ; (estimated) error bound
#(1.0000899998493569d0 0.997309252293765d0 0.49883511895923116d0
  0.177345274179293d0 0.04415551600665573d0) ; coefficients
</code></pre>

<p>So, we have an approximation that&#8217;s very close to what
<a href="http://lol.zoy.org/wiki/doc/maths/remez/tutorial-exponential">LolRemez</a>
finds for \(\exp\) and degree 4.  However, we&#8217;re reporting a maximal
error of 5.466672e-4, while LolRemez finds 5.466676e-4.  Who&#8217;s wrong?</p>

<h2>Tightening the model</h2>

<p>As is usually the case, we&#8217;re (slightly) wrong! The complete LP
formulation has constraints for each possible argument \(x\in B\).
We only considered 4k equidistant such values in the example above.
The model we solved is missing constraints: it is a
<a href="http://en.wikipedia.org/wiki/Relaxation_%28approximation%29">relaxation</a>.
We&#8217;re solving that relaxation exactly, so the objective value it
reports is a lower bound on the real error, and on the real optimal
error.  At least, we don&#8217;t have any design constraint, and the
coefficients are usable directly.</p>

<p>We shall not settle for an approximate solution.  We must simply add
constraints to the model to make it closer to the real problem.</p>

<p>Obviously, adding constraints that are already satisfied by the
optimal solution to the current relaxation won&#8217;t change anything.
Adding constraints can only remove feasible solutions, and the current
optimal solution remains feasible; since it was optimal before
removing some candidates, it still is after.</p>

<p>So, we&#8217;re looking for constraints that are violated to add them to our
current model.  In the optimisation lingo, we&#8217;re doing constraint (or
row, or cut) generation.  In our problem, this means that we&#8217;re
looking for points such that the error from the current approximation
is greater than the estimate reported as our solution value.  We&#8217;ll
take a cue from the Remez algorithm and look for extremal violation
points.</p>

<p>Extrema of differentiable functions are found where the first
derivative is zero.  Our approximation function is a polynomial and is
trivially differentiated.  We&#8217;ll assume that the user provides the
derivative of the function to approximate.  The (signed) error term is
a difference of differentiable functions \(\tilde{f}-f\), and its
first derivative is simply the difference of the derivatives.</p>

<p>Assuming that the derivative is continuous, we can use the
<a href="http://en.wikipedia.org/wiki/Intermediate_value_theorem">intermediate value theorem</a>
to find its roots: for each root that corresponds to an extremum, there
is a (non-empty, open) neighbourhood such that the derivative is negative
to the left of the root and positive to the right, or vice versa.</p>

<p>This is what
<a href="https://github.com/pkhuong/rational-simplex/blob/master/demo/vanilla-fit/find-extrema.lisp">find-extrema.lisp</a>
implements.  In optimisation-speak, that&#8217;s our separation algorithm:
we use it to find (maximally-) violated constraints that we ought to
add to our relaxation.</p>

<p><a href="https://github.com/pkhuong/rational-simplex/blob/master/demo/vanilla-fit/find-extrema.lisp#L71">find-root</a>
takes a function, a lower bound, and an upper bound on the root, and
first performs a
<a href="http://en.wikipedia.org/wiki/Bisection_method">bisection search</a>
until the range is down to 256 distinct double values.  Then, the 256
values in the range are scanned linearly to find the minimal absolute
value.  I first tried to use
<a href="http://en.wikipedia.org/wiki/Newton's_method">Newton&#8217;s method</a>, but
it doesn&#8217;t play very well with rationals: although convergence is
quick, denominators grow even more quickly.  There are also potential
issues with even slightly inexact derivatives.  This is why the final
step is a linear search.  The method will work as long as we have
correctly identified a tiny interval around the extremum; the interval
can be explored exhaustively without involving the derivative.</p>
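
<p>The two-phase idea can be sketched as follows.  This is a simplification, not the code in find-extrema.lisp: where <code>find-root</code> narrows the range down to 256 consecutive doubles, this sketch stops bisecting once the bracket is at most <code>width</code> wide and then scans <code>steps + 1</code> evenly spaced rationals.  The name <code>bisect-then-scan</code> and its parameters are assumptions made for the example.</p>

<pre><code>(defun bisect-then-scan (f low high width steps)
  "Locate a root of F in [LOW, HIGH], where F changes sign.
Bisect with exact rationals until the bracket is at most WIDTH wide,
then return the point with the smallest |F| among STEPS + 1 equally
spaced points of the final bracket."
  (assert (not (plusp (* (signum (funcall f low))
                         (signum (funcall f high))))))
  (loop while (plusp (- (- high low) width))
        do (let ((mid (/ (+ low high) 2)))
             ;; Keep the half whose end points still disagree in sign.
             (if (plusp (* (signum (funcall f low))
                           (signum (funcall f mid))))
                 (setf low mid)
                 (setf high mid))))
  ;; Final phase: exhaustive scan, no derivative involved.
  (let ((best low))
    (loop for i from 0 to steps
          for x = (+ low (* i (/ (- high low) steps)))
          do (when (minusp (- (abs (funcall f x))
                              (abs (funcall f best))))
               (setf best x)))
    best))
</code></pre>

<p>For example, <code>(bisect-then-scan (lambda (x) (- (* x x) 2)) 1 2 1/1024 16)</code> returns a rational close to \(\sqrt{2}\).  Each bisection step at most doubles the denominator, which is much gentler than what Newton iterations do to rationals.</p>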

<p><a href="https://github.com/pkhuong/rational-simplex/blob/master/demo/vanilla-fit/find-extrema.lisp#L90">%approximation-error-extrema</a>
finds candidate ranges.  The sign of the derivative at each pair of
consecutive points currently considered in the model is examined; when
they differ, the range is passed to <code>find-root</code>.  If we find a new
extremum, it is pushed on a list.  Once all the pairs have been
examined, the maximal error is returned, along with the list of new
extrema, and the maximal distance between new extrema and the
corresponding bounds.  As the method converges, extrema should be
enclosed more and more tightly by points already considered in our
constraints.</p>
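
<p>The candidate search can be sketched like this, under the assumption that the approximation is a list of coefficients \(a_0, \ldots, a_n\) and that the user supplies the derivative <code>df</code> of the function to fit.  The helper names are hypothetical, and the real <code>%approximation-error-extrema</code> does more (it also calls <code>find-root</code> and tracks the maximal error).</p>

<pre><code>(defun evaluate-polynomial (coefficients x)
  "Horner evaluation; COEFFICIENTS lists a_0, a_1, ..., a_n."
  (reduce (lambda (coefficient accumulator)
            (+ coefficient (* x accumulator)))
          coefficients
          :from-end t :initial-value 0))

(defun polynomial-derivative (coefficients)
  "Coefficients of the derivative of a_0 + a_1 x + ... + a_n x^n."
  (loop for coefficient in (rest coefficients)
        for power from 1
        collect (* power coefficient)))

(defun sign-change-brackets (coefficients df points)
  "Pairs (left . right) of consecutive POINTS at which the derivative
of the signed error, p'(x) - f'(x), has different signs."
  (let ((dp (polynomial-derivative coefficients)))
    (flet ((error-slope-sign (x)
             (signum (- (evaluate-polynomial dp x)
                        (funcall df x)))))
      (loop for (left right) on points
            while right
            when (/= (error-slope-sign left)
                     (error-slope-sign right))
              collect (cons left right)))))
</code></pre>

<p>With the constant fit 1/2 to \(x^2\), <code>(sign-change-brackets '(1/2 0) (lambda (x) (* 2 x)) '(-1 -1/2 1/4 1))</code> returns <code>((-1/2 . 1/4))</code>, the one interval that encloses the error extremum at 0.</p>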

<p>There are frequent calls to
<a href="https://github.com/pkhuong/rational-simplex/blob/master/demo/vanilla-fit/utility.lisp#L18">round-to-double</a>:
this function is used to take an arbitrary real, convert it to the
closest double value, and convert that value back into a rational.  The
reason we do that is that we don&#8217;t want to generate constraints that
correspond to values that cannot be represented as double floats: not
only are they superfluous, but they also tend to have large
denominators, and those really slow down the exact solver.</p>
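
<p>A one-line sketch of that round trip, assuming the implementation&#8217;s <code>float</code> rounds a rational to the nearest double (the standard does not spell out the rounding).  The name <code>nearest-double-as-rational</code> is made up here; see the linked utility.lisp for the actual <code>round-to-double</code>.</p>

<pre><code>(defun nearest-double-as-rational (x)
  "Convert the real X to a double float, then back to the exact
rational value of that double."
  (rational (float x 1d0)))
</code></pre>

<p>Values that are already doubles, such as 1/2, come back unchanged, while 1/3 becomes a nearby fraction whose denominator is a power of two.</p>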

<p>Finally, the two pieces are put together in
<a href="https://github.com/pkhuong/rational-simplex/blob/master/demo/vanilla-fit/driver.lisp">driver.lisp</a>.
The constraints are initialised from 4096 equidistant points in the
range over which we optimise.  Then, for each iteration, the
relaxation is solved, new extrema are found and added to the
relaxation, until convergence.  Convergence is easy to determine: when
the error reported by the LP (the relaxation) over the constraint
subset is the same as the actual error, we are done.</p>

<pre><code>CL-USER&gt; (time
          (let ((*trace-output* (make-broadcast-stream)))
            ;; muffle the simplex solver log
            (find-approximation 4 -1 1
                                (lambda (x)
                                  (rational (exp (float x 1d0))))
                                (lambda (x)
                                  (rational (exp (float x 1d0)))))))
          ;  predicted error  actual       difference           log distance of new points
Iteration    1: 5.4666716e-4 5.466678e-4 [6.5808034e-10] (4 new extrema, delta 40.42 bit)
Iteration    2: 5.4666746e-4 5.4666775e-4 [3.25096e-10] (4 new extrema, delta 41.15 bit)
Iteration    3: 5.466676e-4 5.466677e-4 [9.041601e-11] (4 new extrema, delta 40.89 bit)
Iteration    4: 5.4666764e-4 5.4666764e-4 [2.4131829e-11] (4 new extrema, delta 38.30 bit)
Iteration    5: 5.4666764e-4 5.4666764e-4 [5.3669834e-12] (4 new extrema, delta 39.04 bit)
Iteration    6: 5.4666764e-4 5.4666764e-4 [2.1303722e-12] (4 new extrema, delta 36.46 bit)
Iteration    7: 5.4666764e-4 5.4666764e-4 [8.5113746e-13] (4 new extrema, delta 37.19 bit)
Iteration    8: 5.4666764e-4 5.4666764e-4 [5.92861e-14] (4 new extrema, delta 36.93 bit)
Iteration    9: 5.4666764e-4 5.4666764e-4 [8.7291163e-14] (4 new extrema, delta 34.34 bit)
Iteration   10: 5.4666764e-4 5.4666764e-4 [2.7814624e-14] (4 new extrema, delta 35.08 bit)
Iteration   11: 5.4666764e-4 5.4666764e-4 [1.22935355e-14] (4 new extrema, delta 33.82 bit)
Iteration   12: 5.4666764e-4 5.4666764e-4 [6.363092e-15] (4 new extrema, delta 34.56 bit)
Iteration   13: 5.4666764e-4 5.4666764e-4 [2.6976767e-15] (4 new extrema, delta 31.97 bit)
Iteration   14: 5.4666764e-4 5.4666764e-4 [6.2273127e-16] (4 new extrema, delta 31.71 bit)
Iteration   15: 5.4666764e-4 5.4666764e-4 [3.1716093e-16] (4 new extrema, delta 32.44 bit)
Iteration   16: 5.4666764e-4 5.4666764e-4 [8.9307225e-17] (4 new extrema, delta 29.86 bit)
Iteration   17: 5.4666764e-4 5.4666764e-4 [2.986885e-17] (4 new extrema, delta 29.60 bit)
Iteration   18: 5.4666764e-4 5.4666764e-4 [8.047477e-17] (4 new extrema, delta 30.33 bit)
Iteration   19: 5.4666764e-4 5.4666764e-4 [0.0e+0] (4 new extrema, delta 27.75 bit)
Evaluation took:
  153.919 seconds of real time    ; includes time spent in the LP solvers
  43.669636 seconds of total run time (39.078488 user, 4.591148 system)
  [ Run times consist of 2.248 seconds GC time, and 41.422 seconds non-GC time. ]
  28.37% CPU
  245,610,372,128 processor cycles
  4,561,238,704 bytes consed

4046879766238553594956027226378928284211150354828640757733860080712233114910799/7402816194767429570294393430906636383377554724586974324818463099198772886301048832
#(925435306122623901577988826638313570454904960173764226165130358296546098660735739/925352024345928696286799178863329547922194340573371790602307887399846610787631104
  6557329860867414742326673804552068829044003331586895788335382835917/6575021589199030076761681434499916661609079970484769650357093605918
  1639925832159462899796569909691517124754835951733786891469617405863/3287510794599515038380840717249958330804539985242384825178546802959
  583024503858777554811473359621837746265838166280012537010086877260/3287510794599515038380840717249958330804539985242384825178546802959
  145161740826348143502214435074903924097159484388538508594721873208/3287510794599515038380840717249958330804539985242384825178546802959)
</code></pre>

<p>Note that we found new extrema in the last iteration.  However, the
error for those extrema wasn&#8217;t actually larger than for the points we
were already considering, so the solution was still feasible and
optimal.</p>

<p>The fractions will be more readable in floats:</p>

<pre><code>CL-USER&gt; (values (float * 1d0) (map 'simple-vector
                                    (lambda (x) (float x 1d0))
                                    (second /)))
5.466676005138464d-4
#(1.0000900001021278d0 0.9973092516744465d0 0.4988351170902357d0
  0.177345274368841d0 0.044155517622880315d0)
</code></pre>

<p>Now, we have nearly exactly (up to a couple bits) the same values as
<a href="http://lol.zoy.org/wiki/doc/maths/remez/tutorial-exponential">LolRemez&#8217;s example</a>.
Each iteration takes less than 10 seconds to execute; we could easily
initialise the relaxation with even more than 4096 points.  We should
expect the impact on the solution time to be cubic: most of the
simplex iterations take place in the floating point solvers, and the
number of iterations is usually linear in the number of variables and
constraints, but the constraint matrix is dense so the complexity of
each iteration should grow almost quadratically.  What really slows
down the rational solver isn&#8217;t the number of variables or constraints,
but the size of the fractions.  On the other hand, we see that some of
the points considered by our constraints differ by \(\approx 30\)
bits in their double representation: we would need a very fine grid
(and very many points) to hit that.</p>

<p>In addition to being more easily generalised than the Remez algorithm,
the LP minimax has no convergence issue: we will always observe
monotone progress in the quality of the approximation and, except for
potential issues with the separation algorithm, we could use the
approach on any function.  There is however (in theory, at least) a
gaping hole in the program: however improbable, all the extrema could
be sandwiched between two points for which the derivative has the same
sign.  In that case, the program will finish (there is no constraint
to add), but report the issue.  A finer initial grid would be one
workaround.  A better solution would be to use a more robust
root-finding algorithm.</p>

<h2>What&#8217;s next?</h2>

<p>The minimax polynomial fitting code is on github, as a
<a href="https://github.com/pkhuong/rational-simplex/tree/master/demo/vanilla-fit">demo</a>
for the
<a href="https://github.com/pkhuong/rational-simplex">rational-simplex</a>
system.  The code can easily be adapted to any other basis (e.g. the
Fourier basis), as long as the coefficients are used linearly.</p>

<p>In the next instalment, I plan to use the cut (constraint) generation
algorithm we developed here in a
<a href="http://en.wikipedia.org/wiki/Branch_and_cut">branch-and-cut</a>
algorithm.  The new algorithm will let us handle side-constraints on
the coefficients: e.g., fix as many as possible to zero, 1, -1 or
other easy multipliers.</p>

<p>If you found the diversion into linear programming and cut generation
interesting, that makes me very happy!  One of my great hopes is to
see programmers, particularly compiler writers, better exploit the
work we do in mathematical optimisation.  Sure, &#8220;good enough&#8221; is nice,
but wouldn&#8217;t it be better to have solutions that are provably optimal
or within a few percent of optimal?</p>

<p>If you want more of that stuff, I find that Chvatal&#8217;s book
<a href="http://books.google.com/books?id=DN20_tW_BV0C">Linear Programming</a> is
a very nice presentation of linear programming from both the usage and
implementation points of view; in particular, it avoids the trap of
simplex tableaux, and directly exposes the meaning of the operations
(its presentation also happens to be closer to the way real solvers
work).  Works like Wolsey&#8217;s
<a href="http://books.google.com/books?id=x7RvQgAACAAJ">Integer Programming</a>
build on the foundation of straight linear programs to tackle more
complex problems, and introduce techniques we will use (or have used)
here, like branch-and-bound or cut generation.  The classic
undergraduate-level text on operations research seems to be Hillier
and Lieberman&#8217;s
<a href="http://books.google.com/books?id=SrfgAAAAMAAJ">Introduction to operations research</a>
(I used it, and so did my father nearly 40 years ago)&#8230; I&#8217;m not sure
that I&#8217;m a fan though.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Finally! Napa-FFT3 is ready for users]]></title>
    <link href="http://www.pvk.ca/Blog/2012/02/22/finally-napa-fft3-is-ready-for-users/"/>
    <updated>2012-02-22T20:32:00+01:00</updated>
    <id>http://www.pvk.ca/Blog/2012/02/22/finally-napa-fft3-is-ready-for-users</id>
    <content type="html"><![CDATA[<p><a href="https://github.com/pkhuong/Napa-FFT3">Napa-FFT3</a> is in the latest
<a href="http://www.quicklisp.org/beta/">Quicklisp</a> release.  Unlike previous
attempts that were really proofs of concept, this one feels solid
enough for actual use.</p>

<p>This third version is extremely different from the first two: rather
than trying to compute in-order FFTs without blowing caches, it
generates code for bit-reversed FFTs.  The idea came from
<a href="http://www.math.purdue.edu/~lucier/">Brad Lucier</a>, who sent me a
couple emails and showed how nicely his FFT scaled (it&#8217;s used in
<a href="http://dynamo.iro.umontreal.ca/~gambit/wiki/index.php/Main_Page">gambit</a>&#8217;s
bignum code).  Bit-reversed FFTs don&#8217;t have to go through any special
contortion to enjoy nice access patterns: everything is naturally
sequential.  The downside is that the output is in the wrong order (in
<a href="http://en.wikipedia.org/wiki/Bit-reversal_permutation">bit-reversed</a>
order).  However, it might still be an overall win over directly
computing DFTs in order: we only need to execute one bit-reversal
pass, and we can also provide FFT routines that work directly on
bit-reversed inputs.</p>
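
<p>For readers who haven&#8217;t met the permutation before, here is a minimal
sketch of what &#8220;bit-reversed order&#8221; means, in Python rather than Lisp.
The function names are hypothetical and this is not how Napa-FFT3
performs its reversal pass; it only shows where each element ends up.</p>

```python
def bit_reverse(index, bits):
    """Reverse the low `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversal_permute(values):
    """Return a copy of `values` (power-of-two length) in bit-reversed order."""
    n = len(values)
    bits = n.bit_length() - 1
    assert n == 1 << bits, "length must be a power of two"
    return [values[bit_reverse(i, bits)] for i in range(n)]
```

<p>The permutation is its own inverse, so the same pass converts
bit-reversed output back to natural order.</p>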

<p>My hope when I started writing Napa-FFT3 was that I could get away
with a single generator that&#8217;d work well at all sizes, and that
bit-reversing would either not be too much of an issue, or usually not
needed (e.g., for people who only want to perform convolutions or
filtering).</p>

<h2>Overview of the code</h2>

<p>The
<a href="https://github.com/pkhuong/Napa-FFT3/blob/master/forward.lisp">forward</a>
and
<a href="https://github.com/pkhuong/Napa-FFT3/blob/master/inverse.lisp">inverse</a>
transform generators are pretty simple implementations of the
<a href="http://en.wikipedia.org/wiki/Split-radix_FFT_algorithm">split-radix FFT</a>.</p>

<p>Generators for &#8220;flat&#8221; base cases output code for a specialised compiler
geared toward large basic blocks.  The specialised compiler takes
potentially very long traces of simple operations on array elements,
and performs two optimisations: array elements are cached in variables
(registers), and variables are explicitly spilled back into arrays,
following <a href="http://en.wikipedia.org/wiki/Cache_algorithms#B.C3.A9l.C3.A1dy.27s_Algorithm">Belady&#8217;s algorithm</a>.  That allows us to easily exploit the
register file, without taking its size directly into account in the
domain-specific generators, and even when we have to cope with a
relatively naïve machine code generator like SBCL&#8217;s.</p>
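
<p>The eviction rule itself is short.  The Python sketch below is a
simplification under stated assumptions: the trace is just a list of
array element indices, writes are ignored, and
<code>belady_loads</code> is a hypothetical name, not something from
Napa-FFT3.  It counts how many loads from the array are needed when
the cached value whose next use is farthest away is the one evicted.</p>

```python
def belady_loads(trace, num_registers):
    """Count loads for `trace` when evicting the value reused farthest in the future."""
    cached = set()
    loads = 0
    for position, element in enumerate(trace):
        if element in cached:
            continue  # already in a register, no load needed
        loads += 1
        if len(cached) == num_registers:
            def next_use(candidate):
                try:
                    return trace.index(candidate, position + 1)
                except ValueError:
                    return len(trace)  # never used again: ideal victim
            cached.remove(max(cached, key=next_use))
        cached.add(element)
    return loads
```

<p>Because the whole trace is known when the code is generated, the
&#8220;future&#8221; that the rule needs is available, which is rarely true for a
hardware cache.</p>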

<p>Larger input sizes instead use a generator that outputs almost-normal
recursive code; there&#8217;s one routine for each input size, which helps
move as much address computation as possible to compile-time.</p>

<p>Even with code to handle scaling and convolution/filtering, I feel
that the generators are easily understood and modified.  They
currently only support in-order input for the forward transform, and
in-order output for the inverse, but the generators are simple enough
that adding code for all four combinations (in-order input or output,
forward or inverse transform) would be reasonable!  I believe that&#8217;s a
win.</p>

<p>Better: it seems my hope that we can execute bit reversals quickly was
more than justified.  I&#8217;m not quite sure how to describe it, but the
<a href="https://github.com/pkhuong/Napa-FFT3/blob/master/bit-reversal.lisp">code</a>
is based on recursing on the indices from the middle bits toward the
least and most significant bits.  The result is that there&#8217;s
exactly one swap at each leaf of the recursion, and that, when cache
associativity is high enough (as is the case for the x86 chips I use),
all the cache misses are mandatory.  Better, the recursiveness ensures
that the access patterns are also TLB optimal, when the TLB
associativity is high enough (or infinite, as for my x86oids).</p>

<p>There&#8217;s one issue with that recursive scheme: it&#8217;s really heavy in
integer arithmetic to compute indices.  Again, I generate large basic
blocks to work around that issue.  The last couple levels (three, by
default) of the recursion are unrolled and compiled into a long
sequence of swaps.  The rest of the recursion is executed by looping
over a vector that contains indices that were computed at
compile-time.</p>

<h2>Correctness</h2>

<p>I have a hard time convincing myself that <em>code generators</em> are
correct, especially without a nice static type system.  Instead, I
heavily tested the final generated code.  I&#8217;m using Common Lisp, so
array accesses were all checked automatically, which was very useful
early in the development process.  Once I was certain that
all accesses were correct, I turned bounds and type checking off.  The
<a href="https://github.com/pkhuong/Napa-FFT3/blob/master/ergun-test.lisp">first test file</a>
implements a set of randomised tests proposed by
<a href="http://www.cs.sfu.ca/~funda/publications.html">Funda Ergün</a>.  That
was enough for me to assume that the FFTs themselves were correct.  I
then turned to a
<a href="https://github.com/pkhuong/Napa-FFT3/blob/master/tests.lisp">second set of tests</a>
to try and catch issues in the rest of the code that builds on
straight FFTs.</p>
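
<p>To give an idea of what randomised property checks on a transform
can look like, here is a Python sketch.  The three properties
(linearity, unit impulse, time shift) are an assumption about the
flavour of such tests and are not transcribed from the test file; the
naive <code>dft</code> only stands in for the transform under test, and
all names are hypothetical.</p>

```python
import cmath
import random


def dft(x):
    """Naive O(n^2) DFT, standing in for the transform under test."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def random_vector(n, rng):
    return [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]


def close(a, b, tol=1e-9):
    return all(abs(u - v) <= tol for u, v in zip(a, b))


def check_linearity(fft, n, rng):
    # fft(a*x + b*y) must equal a*fft(x) + b*fft(y).
    x, y = random_vector(n, rng), random_vector(n, rng)
    a, b = random_vector(2, rng)
    combined = fft([a * u + b * v for u, v in zip(x, y)])
    fx, fy = fft(x), fft(y)
    return close(combined, [a * u + b * v for u, v in zip(fx, fy)])


def check_impulse(fft, n):
    # The transform of a unit impulse at index 0 is all ones.
    return close(fft([1] + [0] * (n - 1)), [1] * n)


def check_shift(fft, n, rng):
    # Rotating the input by one multiplies bin k by exp(-2*pi*i*k/n).
    x = random_vector(n, rng)
    shifted = fft(x[-1:] + x[:-1])
    expected = [v * cmath.exp(-2j * cmath.pi * k / n)
                for k, v in enumerate(fft(x))]
    return close(shifted, expected)
```

<p>Each check costs only a few transforms, so it stays cheap even at
sizes where comparing against a quadratic-time reference would not be.</p>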

<p>The process did catch a couple bugs, and makes me feel confident
enough to let other people use Napa-FFT3 in their programs.</p>

<h2>Performance</h2>

<p>Napa-FFT and Napa-FFT2 managed to come reasonably close to FFTW&#8217;s
performance.  When I started working on Napa-FFT3, I hoped that it
could come as close, with much less complexity.  In fact, it performs
even better than expected: Napa-FFT3 is faster than Napa-FFT(2) at
nearly all sizes, and outperforms FFTW&#8217;s default planner for
out-of-cache transforms (even with the bit-reversal pass).</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Fixed Points and Strike Mandates]]></title>
    <link href="http://www.pvk.ca/Blog/2012/02/19/fixed-points-and-strike-mandates/"/>
    <updated>2012-02-19T12:10:00+01:00</updated>
    <id>http://www.pvk.ca/Blog/2012/02/19/fixed-points-and-strike-mandates</id>
    <content type="html"><![CDATA[<p>Many tasks in compilation and program analysis (in symbolic
computation in general, I suppose) amount to finding solutions to
systems of the form \(x = f(x)\).  However, when asked to define
algorithms to find such fixed points, we rarely stop and ask &#8220;which
fixed point are we looking for?&#8221;</p>

<p>In practice, we tend to be interested in fixed points of monotone
functions: given a partial order \((\prec)\), we have \(a \prec b
\Rightarrow f(a)\prec f(b)\).  Now, in addition to being a fairly
reasonable hypothesis, this condition usually lets us exploit
<a href="http://en.wikipedia.org/wiki/Knaster%E2%80%93Tarski_theorem">Tarski&#8217;s fixed point theorem</a>.
If the domain of \(f\) (with \(\prec\)) forms a
<a href="http://mathworld.wolfram.com/CompleteLattice.html">complete lattice</a>,
so does the set of fixpoints of \(f\) !  As a corollary, there then
exists exactly one least and one greatest fixed point under \(\prec\).</p>

<p>This is extremely useful, because we can usually define useful meet
and join operations, and enjoy a complete lattice.  For example, for a
domain that&#8217;s the power set of a given set, we can use \(\subset\)
as the order relation, \(\cup\) as join, and \(\cap\) as meet.
However, what I find interesting to note is that, when we don&#8217;t pay
attention to which fixpoint we wish to find, humans seem to
consistently develop algorithms that converge to the least or greatest
one, depending on the problem.  It&#8217;s as though we all have a <em>common</em>
blind spot covering one of the extreme fixed points.</p>

<p>A simple example is dead value (useless variable) elimination.  When I
ask people how they&#8217;d identify such variables in a program, the naïve
solutions tend to be very similar.  They exploit the observation that
a value is useless if it&#8217;s only used to compute values that are
themselves useless.  The routines start out with every value live
(used), and prune away useless values, until there&#8217;s nothing left to
remove.</p>

<p>These algorithms converge to solutions that are correct, but
suboptimal (except for cycle-free code).  We wish to identify as many
useless values as possible, to eliminate as many computations as
possible.  Yet, if we start by assuming that all values are live, our
algorithm will fail to identify some obviously-useless values, like
<code>x</code> in:</p>

<pre><code>for (...)
  x = x
</code></pre>

<p>We could keep adding more special cases.  However, the correct
(simplest) solution is to try and identify live values, rather than
dead ones.  A value is live if it&#8217;s used to compute a live value.
Moreover, return values and writes to memory are always live.  Our
routine now starts out by assuming that only the latter values are
live, and adjoins live values as it finds them, until there&#8217;s nothing
left to add.</p>

<p>In this case, the intuitive solution converges to the greatest fixed
point, but we&#8217;re looking for the least fixed point.  Setting the right
initial value ensures convergence to the right fixed point.</p>
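
<p>The two approaches are easy to compare side by side.  In the Python
sketch below, a program is reduced to a hypothetical
<code>operands</code> map from each value to the values used to compute
it, and <code>roots</code> holds the values that are always live; both
names are assumptions made for illustration.</p>

```python
def live_by_pruning(operands, roots):
    """Start with everything live; drop values no live value uses (greatest fixed point)."""
    live = set(operands)
    changed = True
    while changed:
        changed = False
        for value in sorted(live):
            used = any(value in operands[user] for user in live)
            if value not in roots and not used:
                live.remove(value)
                changed = True
    return live


def live_by_marking(operands, roots):
    """Start with only the roots live; add whatever they use (least fixed point)."""
    live = set()
    worklist = list(roots)
    while worklist:
        value = worklist.pop()
        if value not in live:
            live.add(value)
            worklist.extend(operands[value])
    return live


# `x` is only used to compute itself, as in the loop above.
program = {"x": ["x"], "a": [], "t": ["a"], "result": ["a"]}
```

<p>On <code>program</code> with <code>result</code> as the only root,
pruning keeps <code>x</code> alive because <code>x</code> uses it, while
marking never reaches it.</p>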

<p>Other common instances of this pattern are
<a href="http://en.wikipedia.org/wiki/Reference_counting">reference counting</a>
instead of
<a href="http://www.memorymanagement.org/glossary/m.html#marking">marking</a>, or
performing type propagation by initially assigning the top type to all
values (like SBCL).</p>

<p>
<a href="#strike-algorithm" name="strike-algorithm">#</a>
I recently found a use for fixed point computations outside of math
and computer science.
</p>


<p>Most university or <a href="http://en.wikipedia.org/wiki/CEGEP">CEGEP</a> student
unions in Québec will vote (or already have voted) on strike mandates
to help organize protests against rising university tuition fees this
winter and spring.  There are hundreds of such unions across the
province representing, in total, around four hundred thousand
students.  The vast majority of these unions comprise a couple hundred
(or fewer) students, and many feel it would be counter-productive for
only a tiny number of students to be on strike.  Thus, strike mandates
commonly include conditions regarding the minimal number of other
students who also hold strike mandates, along with additional lower
bounds on the number of unions and universities or colleges involved.
As far as I know, all the mandates adopted so far are monotone: if
they are satisfied by a set of striking unions, they are also satisfied
by all of its supersets.</p>

<p>Tarski&#8217;s theorem applies (again, with \((\subset, \cup, \cap)\) on the
power set of the set of student unions).  Which fixed point are we
looking for?</p>

<p>It&#8217;s clear to me that we&#8217;re looking for the fixed point with the
largest set of striking unions.  In some situations, the least fixed
point could trivially be the empty set (or all unions that did not
adopt any lower bound).  Moreover, the mandates are usually presented
with an explanation to the effect that, if unions representing at
least \(n_0\) students adopt the same mandate, then all unions that
have adopted the mandate will go on strike simultaneously.</p>

<p>I asked fellow graduate students in computer science to sketch an
algorithm to determine which unions should go on strike given their
mandates; they started with the set of student unions currently on
strike, and adjoined unions for which all the conditions were met.
Such algorithms will converge toward the least fixed point.  For
example, there could be two unions, each comprising 5 000 students,
with the same strike floor of 10 000 students, and these algorithms
would have both unions deadlocked, waiting for the other to go on
strike.</p>

<p>Instead, we should start by assuming that all the unions (with a
strike mandate) are on strike, and iteratively remove unions whose
conditions are not all met, until we hit the greatest fixed point.
I&#8217;m fairly sure this will end up being a purely theoretical concern,
but it&#8217;s a pretty neat case of abstract mathematics helping us
interpret a real-world situation.</p>
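
<p>The deadlock example can be replayed in a few lines of Python.  This
is a sketch under simplifying assumptions: each mandate is reduced to a
single floor on the total number of striking students, and the
dictionary layout and function names are hypothetical.</p>

```python
def step(unions, striking):
    """One round: the unions whose floor is met if `striking` is on strike."""
    students = sum(unions[name]["members"] for name in striking)
    return {name for name, u in unions.items() if students >= u["floor"]}


def fixed_point(unions, start):
    """Iterate `step` from `start` until the set of striking unions is stable."""
    striking = set(start)
    while True:
        following = step(unions, striking)
        if following == striking:
            return striking
        striking = following


unions = {
    "A": {"members": 5000, "floor": 10000},
    "B": {"members": 5000, "floor": 10000},
    "C": {"members": 300, "floor": 20000},
}
least = fixed_point(unions, set())        # nobody moves first
greatest = fixed_point(unions, unions)    # start with everyone on strike
```

<p>Starting from the empty set leaves A and B waiting for each other;
starting from the full set drops only C, whose floor can never be met.</p>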

<p>This pattern of intuitively converging toward a suboptimal solution
seems to come up a lot when computing fixed points.  It&#8217;s not
necessarily a bad choice: conservative initial values tend to lead to
faster convergence, and often have the property that intermediate
solutions are always correct (feasible).  When we need quick results,
it may make sense to settle for suboptimal solutions.  However, it
ought to be a deliberate choice, rather than a consequence of failing
to consider other possibilities.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Migration and Synopsis]]></title>
    <link href="http://www.pvk.ca/Blog/2012/01/18/migration-and-synopsis/"/>
    <updated>2012-01-18T18:56:00+01:00</updated>
    <id>http://www.pvk.ca/Blog/2012/01/18/migration-and-synopsis</id>
    <content type="html"><![CDATA[<p>This blog has been going for five years.  Back then, it seemed like
the only widely-used static blog generators were
<a href="http://www.blosxom.com/">Blosxom</a> or
<a href="http://pyblosxom.bluesock.org/">pyBlosxom</a>.  They weren&#8217;t that hard
to set up, but getting everything <em>right</em> rather than good enough is a
lot of work.  Latex and MathML support was also very weak, so I wound
up using an (insane) one-off hack with
<a href="http://tug.org/tex4ht/">tex4ht</a>.  I feel like
<a href="http://octopress.org/">Octopress</a> and
<a href="http://www.mathjax.org/">MathJax</a> now do everything I need out of the
box, better than anything I could design by myself.</p>

<p>The permalinks from the old blog are still around, but not the rss
feeds or the date-based links.</p>

<p>I figure this is a good opportunity to make sure the (marginally
useful) permalinks are available somewhere else than via google.</p>

<h2>Lisp-related posts</h2>

<p><a href="http://pvk.ca/Blog/Lisp/accumulating_data_in_vectors.html">Another way to accumulate data in vectors</a>
describes a copying-free extendable vector.  The advantage over the
usual geometric growth with copy is that the performance with respect
to the number of elements added is much smoother.  Runtimes are then
more easily predictable, and sometimes improved (e.g. right when a
copy would be needed).  It&#8217;s also more amenable to a lock-free
adaptation, while preserving O(1) operation complexity (assuming that
<code>integer-length</code> on machine integers is constant time), as shown in
<a href="http://www2.research.att.com/~bs/lock-free-vector.pdf">Dechev et al&#8217;s &#8220;Lock-free Dynamically Resizable Arrays&#8221;</a>.</p>
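
<p>As an illustration of why <code>integer-length</code> shows up, here
is a Python sketch of one common copy-free layout: segments of size 1,
2, 4, and so on, with <code>int.bit_length</code> playing the role of
<code>integer-length</code>.  The exact layout used in the post may
differ, and the class and method names are hypothetical.</p>

```python
class SegmentedVector:
    """Growable vector made of segments of size 1, 2, 4, ...; elements never move."""

    def __init__(self):
        self.segments = []
        self.length = 0

    @staticmethod
    def locate(index):
        # Element `index` lives in segment floor(log2(index + 1)).
        position = index + 1
        segment = position.bit_length() - 1
        return segment, position - (1 << segment)

    def push(self, value):
        segment, offset = self.locate(self.length)
        if segment == len(self.segments):
            # Growing allocates a fresh segment; nothing is copied.
            self.segments.append([None] * (1 << segment))
        self.segments[segment][offset] = value
        self.length += 1

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        segment, offset = self.locate(index)
        return self.segments[segment][offset]
```

<p>Since existing elements never move, a reference to a slot stays
valid across growth, which is part of what makes a lock-free variant
plausible.</p>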

<p><a href="http://pvk.ca/Blog/Lisp/CommonCold/">Common Cold</a> is a really old
hack to get serialisable closures in SBCL, with serialisable
continuations built on top of that.  Nowadays, I&#8217;d do the closure part
differently, without any macro or change to the source.</p>

<p><a href="http://pvk.ca/Blog/Lisp/concurrency_with_mvars.html">Concurrency with MVars</a>
has short and simple(istic) code for
<a href="http://www.haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/Control-Concurrent-MVar.html">mvars</a>,
and uses it to implement same-fringe with threads.</p>

<p><a href="http://pvk.ca/Blog/Lisp/constraint-sets.html">Constraint sets in SBCL: preliminary exploration</a>
summarises some statistics on how constraint sets (internal SBCL data
structures) are used by SBCL&#8217;s compiler.</p>

<p><a href="http://pvk.ca/Blog/Lisp/flow_sensitive_analysis_in_sbcl.html">SBCL&#8217;s flow sensitive analysis pass</a>
explores what operations on constraint sets actually mean.  This,
along with the stats from the previous post, guided a rewrite, not of
constraint sets, but of the analysis pass that uses them.  The
frequency of slow operations or bad usage patterns is reduced enough
to take care of most (all?) performance regressions associated with
the original switch to bit-vector-based constraint sets, without
penalising the common case.</p>

<p><a href="http://pvk.ca/Blog/Lisp/finalizing_foreign_pointers_just_late_enough.html">Finalizing foreign pointers just late enough</a>
is a short reminder that attaching finalizers to system area pointers
isn&#8217;t a good idea: SAPs are randomly unboxed and consed back, like
numbers.</p>

<p><a href="http://pvk.ca/Blog/Lisp/hacking_SSE_intrinsics-part_1.html">Hacking SSE Intrinsics in SBCL (part 1)</a>
walks through an SBCL branch that adds support for SSE operations.
Alexander Gavrilov has kept a fork on life support
<a href="https://github.com/angavrilov/sbcl-old">on github</a>.  There&#8217;s still no
part 2, in which the branch is polished enough to merge it in the
mainline.</p>

<p>In the meantime,
<a href="http://pvk.ca/Blog/Lisp/SSE_complexes.html">Complex float improvements for sbcl 1.0.30/x86-64</a>
built upon the original work on SSE intrinsics to implement operations
on <code>(complex single-float)</code> and <code>(complex double-float)</code> with SIMD
code on x86-64.  That sped up most complex arithmetic operations by
100%.  That work also came with support for references to unboxed
constants on x86oids; this significantly improved floating point
performance as well, for both real and complex values.</p>

<p><a href="http://pvk.ca/Blog/Lisp/modular-struct-initialisation.html">Initialising structure objects modularly</a>
is a solution to a problem that I hit, trying to implement non-trivial
initialisation for structures, while allowing inheritance.  Tobias
Rittweiler points out that the protocol is very similar to a common
CLOS pattern where, instead of functions that allocate objects, class
designators are passed.  It also looks a bit like the way Factor
libraries seem to do struct initialisation, but with actual
initialisation instead of assignment (which matters for read-only
slots).</p>

<p><a href="http://pvk.ca/Blog/Lisp/persistent_dictionary.html">An Impure Persistent Dictionary</a>
is an example of a technique I find really useful to implement
persistent versions of side-effectful data structures.  Henry Baker
has a <a href="http://home.pipeline.com/~hbaker1/ShallowArrays.html">paper</a>
that shows how shallow binding can be used to implement persistent
arrays on top of mutable arrays, with constant-time overhead for
operations on the latest version.  It&#8217;s a really nice generalisation
of trailing in backtracking searches.  Here, I use it to get
persistent hash tables in only a couple dozen lines of code.</p>
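
<p>The rerooting idea fits in a short Python sketch.  This is an
illustration of the general technique, not the code from the post: one
mutable dictionary is shared by all versions, the most recently used
version owns it, and every other version is a small diff pointing
toward the owner.  All names are hypothetical, and the sketch is not
thread-safe.</p>

```python
_MISSING = object()


class PersistentDict:
    """Persistent dictionary on top of one mutable dict, via rerooting."""

    def __init__(self):
        self._table = {}      # only the root version holds the real table
        self._diff = None     # otherwise: (key, value, next_version)

    def _reroot(self):
        # Walk to the current root, then reverse the chain so that `self`
        # becomes the root and the versions on the way become diff nodes.
        path, node = [], self
        while node._diff is not None:
            path.append(node)
            node = node._diff[2]
        table = node._table
        for version in reversed(path):
            key, value, parent = version._diff
            previous = table.get(key, _MISSING)
            if value is _MISSING:
                table.pop(key, None)
            else:
                table[key] = value
            parent._table, parent._diff = None, (key, previous, version)
            version._table, version._diff = table, None
        return table

    def get(self, key, default=None):
        value = self._reroot().get(key, _MISSING)
        return default if value is _MISSING else value

    def set(self, key, value):
        """Return a new version; `self` stays valid and unchanged."""
        table = self._reroot()
        previous = table.get(key, _MISSING)
        table[key] = value
        new = PersistentDict.__new__(PersistentDict)
        new._table, new._diff = table, None
        self._table, self._diff = None, (key, previous, new)
        return new
```

<p>Operations on the latest version touch the real table directly; only
reads of older versions pay for undoing and redoing diffs.</p>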

<p><a href="http://pvk.ca/Blog/Lisp/Pipes/">Pipes</a> is an early attempt to develop
a DSL for stream processing, like an 80%
<a href="http://series.sourceforge.net/">SERIES</a>.  I&#8217;ve refocused my efforts
on <a href="http://pvk.ca/Blog/Lisp/Xecto/">Xecto</a>, which only handles
vectors, rather than potentially unbounded streams.  The advantage is
that, to me, Xecto looks like it has the potential to be simpler while
achieving near-peak performance; the main downside is that
vectors don&#8217;t allow us to represent control flow as data via lazy
evaluation&#8230; and I&#8217;m not sure that&#8217;s such a bad thing.</p>

<p>The post on
<a href="http://pvk.ca/Blog/Lisp/string_case_bis.html">string-case</a> is an
overview of how I structured a CL dispatch macro that compares with
<code>string=</code> instead of <code>eql</code>.  If I were to do this again, I&#8217;d probably
try and improve <code>string=</code>; I later tested an SSE comparison routine,
and it ended up being, in a lot of cases, faster and simpler (with a
linear search) than the search tree generated by <code>string-case</code>.</p>

<p><a href="http://pvk.ca/Blog/Lisp/type_lower_bound.html">The type-lower-bound branch</a>
describes early work on a branch that provides a way to shut the
compiler up about certain failed type-directed optimisations.  A lot
of the output from SBCL&#8217;s compiler amounts to reports of optimisations
that couldn&#8217;t be performed (e.g. converting multiplication by a
constant power of two to a shift), and why (e.g. the variant argument
isn&#8217;t known to be small enough).  Sometimes, there&#8217;s nothing we can do
about it: we can&#8217;t show the compiler that the argument is small enough
because we know that it will sometimes be too large!  Yet, CL&#8217;s type
system (like most) does not let us express that information.
Programmers are expected to provide upper bounds on the best static
type of values (e.g. we can specify that a value is always a <code>fixnum</code>,
although it may really only be integers between 0 and 1023).  We would
like a way to specify lower bounds as well: &#8220;I know that this will
take arbitrary <code>fixnum</code> values.&#8221;  Once we have that, the compiler can
skip reporting optimisations that we know can&#8217;t be performed (as
opposed to those for which we don&#8217;t know whether they can be performed).</p>

<p>Finally,
<a href="http://pvk.ca/Blog/Lisp/yet_another_way_to_fake_continuations.html">Yet another way to fake continuations</a>
sketches a simple but somewhat inefficient way to implement
continuations for pure programs.  It may be useful for IO-heavy
applications (web programming), or in certain cases similar to
backtracking search, but in which most of the work is performed
outside of backtracking (e.g. during constraint propagation).</p>

<h2>General low-level programming issues</h2>

<p><a href="http://www.pvk.ca/Blog/LowLevel/SWAR-some-zerop.html">SWAR implementation of (some #&#8217;zerop &#8230;)</a>
sketches how we can use SIMD-within-a-register techniques to have fast
search for patterns of sub-word size.  A degenerate case is when we
look for 0 or 1 in bit vectors; in these cases, it&#8217;s clear how we can
test whole words at a time.  The idea can be extended to testing
vectors of 2, 4, 8 (or any size) -bit elements.  I haven&#8217;t found time
to move this in SBCL&#8217;s runtime library (yet), but it would probably
be a neat and feasible first project.</p>
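
<p>For 8-bit elements, the classic zero-byte test gives the flavour of
the technique.  The Python sketch below uses the well-known
subtract-and-mask trick on 64-bit words; it is an illustration with
hypothetical names, not the code from the post, and Python integers
only simulate the machine words.</p>

```python
LOW_BITS = 0x0101010101010101    # least significant bit of each byte
HIGH_BITS = 0x8080808080808080   # most significant bit of each byte


def has_zero_byte(word):
    """True iff at least one of the 8 bytes in the 64-bit `word` is zero."""
    return ((word - LOW_BITS) & ~word & HIGH_BITS) != 0


def some_zerop(data):
    """Like (some #'zerop data) for a bytes object, testing 8 bytes at a time."""
    full = len(data) - len(data) % 8
    for start in range(0, full, 8):
        if has_zero_byte(int.from_bytes(data[start:start + 8], "little")):
            return True
    # Leftover bytes that do not fill a word are checked one by one.
    return any(byte == 0 for byte in data[full:])
```

<p>Subtracting 1 from a byte sets its top bit only if the byte was zero
or already had its top bit set; masking with <code>~word</code> removes
the second case.</p>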

<p><a href="http://www.pvk.ca/Blog/LowLevel/VM_tricks_safepoints.html">Revisiting VM tricks for safepoints</a>
explores the performance impact of switching from instrumented
pseudo-atomic code sequences to safepoints.  The bottom line is that
it&#8217;s noise.  However, some members of the Russian Lisp mafia have used
it as inspiration, and have managed to implement seemingly solid
<a href="https://github.com/akovalenko/sbcl-win32-threads/wiki">threaded SBCL on Windows</a>!
It&#8217;s still a third-party fork for now, but some committers are working
on merging it with the mainline.</p>

<p><a href="http://www.pvk.ca/Blog/LowLevel/fast-integer-division.html">Fast Constant Integer Division</a>
has some stuff on integer division by constants.  It&#8217;s mostly
superseded by Lutz Euler&#8217;s work to implement the same algorithm as
GCC.  There are some interesting identities that can be used to
improve on that algorithm a tiny bit and, more interestingly, to
implement truncated multiplication by arbitrary fractions.  I only
stumbled upon those a long time after I wrote the post; I&#8217;ll try and
come back to this topic in the coming months.</p>
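
<p>The basic multiply-and-shift idea can be sketched in Python.  This
is the textbook round-up multiplier for unsigned division, not
necessarily the exact variant used by SBCL or GCC, and the function
names are hypothetical.  Python&#8217;s unbounded integers also hide the fact
that the multiplier may need one more bit than the word size, which
real implementations must handle.</p>

```python
def magic(divisor, bits=32):
    """Multiplier and shift such that x // divisor == (x * m) >> shift for 0 <= x < 2**bits."""
    assert divisor >= 1
    shift = bits + (divisor - 1).bit_length()     # bits + ceil(log2(divisor))
    multiplier = -(-(1 << shift) // divisor)      # ceil(2**shift / divisor)
    return multiplier, shift


def divide(x, multiplier, shift):
    """Truncated division by the constant that `multiplier` and `shift` encode."""
    return (x * multiplier) >> shift
```

<p>For example, <code>magic(10)</code> yields a multiplier and shift
that replace every unsigned 32-bit division by ten with one
multiplication and one shift.</p>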

<p><a href="http://www.pvk.ca/Blog/LowLevel/more_to_locality_than_cache.html">There&#8217;s more to locality than caches</a>
tracks my attempts to understand why a data structure designed to be
cache-efficient did not perform as well as expected.  It turns out
that cache lines aren&#8217;t exactly read atomically (so reading two
adjacent addresses may be significantly slower than only one), and
that sometimes L2 matters less than TLB.  The latter point was an
important lesson for me.  TLBs are used to accelerate the translation
of virtual addresses to physical; <em>every</em> memory access must be
translated.  TLBs are usually fully associative (behave like
content-addressed memory or hash tables, basically), but with a small
fixed size, on the order of 512 pages for the slower level.  With
normal (on x86oids) 4KB pages, that&#8217;s only enough for 2 MB of data!
Even worse: a cache miss results in a single access to main memory,
which is equivalent to ~60-100 cycles at most; a TLB miss, however,
results in a lookup in a 4 level page table on x86-64, which often
takes on the order of 200-300 cycles.  Luckily, there are workarounds,
like using 2 MB pages.</p>

<p><a href="http://www.pvk.ca/Blog/LowLevel/napa-fft2-implementation-notes.html">Napa-FFT(2) implementation notes</a>
is where I try to make the code I wrote for a Fast Fourier transform
understandable, especially <em>why</em> it does what it does.  Napa-FFT and
Napa-FFT2 are vastly faster than Bordeaux-FFT (and than all other CL
FFT codes I know, on SBCL), but it&#8217;s still around 20-50% slower than
the usual benchmark, FFTW.  Napa-FFT3 is coming, and it&#8217;s a completely
different approach which manages to be within a couple percentage points
of FFTW, and is faster on some operations.</p>

<p><a href="http://www.pvk.ca/Blog/LowLevel/software-reciprocal.html">0x7FDE623822FC16E6 : a magic constant for double float reciprocal</a>
is a surprisingly popular post.  I was trying to approximate
reciprocals as fast as possible for a mathematical optimization
method.  The usual way to do that is to use a hardware-provided
approximation and then improve it with a couple iterations of Newton&#8217;s
method.  The post shows how we can instead use the way floats are laid
out in memory to provide a surprisingly accurate guess with an integer
subtraction.  I actually think the interesting part was that it made
for a practical use case for the golden section search&#8230;</p>
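
<p>The trick is easy to try out in Python with the constant from the
title.  The sketch assumes a positive, normal double of moderate
magnitude; the function names are hypothetical, and
<code>struct</code> is only used to reinterpret the bits.</p>

```python
import struct

MAGIC = 0x7FDE623822FC16E6


def approx_reciprocal(x):
    """Rough 1/x for a positive, normal double, via one integer subtraction."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (guess,) = struct.unpack("<d", struct.pack("<Q", MAGIC - bits))
    return guess


def refine(x, guess):
    """One Newton iteration for f(y) = 1/y - x."""
    return guess * (2.0 - x * guess)
```

<p>Each Newton step roughly squares the relative error, so a guess that
is off by a few percent becomes usable after one or two iterations.</p>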

<p><a href="http://www.pvk.ca/Blog/LowLevel/some-notes-on-warren.html">Some notes on Warren</a>
has a couple notes about stuff in Warren&#8217;s book
<a href="http://www.hackersdelight.org/">Hacker&#8217;s Delight</a>.  The sign
extension bit probably deserves more attention; it seems like someone
on #lisp asks how they can sign-extend unsigned integers at least once
a month.</p>
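
<p>For the record, one branch-free formulation of the kind collected in
Hacker&#8217;s Delight is sketched below in Python.  Whether it is the
variant discussed in the post is an assumption, and
<code>sign_extend</code> is a hypothetical name.</p>

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's complement integer."""
    sign = 1 << (bits - 1)
    value &= (1 << bits) - 1       # keep only the field
    return (value ^ sign) - sign   # flip the sign bit, then subtract its weight
```

<p>Flipping the sign bit and subtracting its weight leaves non-negative
fields unchanged and maps fields with the top bit set to their
negative value.</p>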

<p><a href="http://www.pvk.ca/Blog/LowLevel/two-neat-tricks.html">Two variations on old themes</a>
has some stuff on Linux&#8217;s ticket spinaphores, and is the beginning of
my looking into Robin Hood hashing with linear probing for
cache-friendly hash tables.</p>

<p><a href="http://www.pvk.ca/Blog/numerical_experiments_in_hashing.html">Interlude: Numerical experiments in hashing</a>
covers a first stab at designing a hash table that exploits cache
memory.  2-left hashing looks interesting, but its performance was
worse than expected, for various reasons, mostly related to the fact
that caches can be surprisingly complicated.  Two years later,
<a href="http://www.pvk.ca/Blog/more_numerical_experiments_in_hashing.html">More numerical experiments in hashing: a conclusion</a>
revisits the question, and settles on Robin Hood hashing with linear
probing.  It&#8217;s a tiny tweak to normal open addressing (insertions can
bump previously-inserted items farther from their insertion point),
but it suffices to greatly improve the worst and average probing
length, while preserving the nice practical characteristics of linear
probing.  I&#8217;ve also started some work on implementing SBCL&#8217;s hash
table this way, but there are practical issues with weak hash
functions, GC and mutations.</p>
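
<p>The tweak is small enough to show in full.  The Python sketch below
is a fixed-capacity illustration with hypothetical names, without
resizing or deletion, and is not the SBCL work mentioned above.  On
insertion, whichever entry is farther from its home slot keeps the
slot, and the other one continues probing.</p>

```python
class RobinHoodTable:
    """Fixed-capacity open-addressing table with Robin Hood linear probing."""

    def __init__(self, capacity=16):
        self.slots = [None] * capacity   # each slot: (key, value) or None
        self.count = 0

    def _home(self, key):
        return hash(key) % len(self.slots)

    def _distance(self, key, index):
        # How far `index` is from the key's home slot, wrapping around.
        return (index - self._home(key)) % len(self.slots)

    def insert(self, key, value):
        if self.count == len(self.slots):
            raise RuntimeError("table full (no resizing in this sketch)")
        index, entry = self._home(key), (key, value)
        while True:
            resident = self.slots[index]
            if resident is None:
                self.slots[index] = entry
                self.count += 1
                return
            if resident[0] == entry[0]:
                self.slots[index] = entry      # overwrite existing key
                return
            # Robin Hood rule: the entry farther from home keeps the slot.
            if self._distance(resident[0], index) < self._distance(entry[0], index):
                self.slots[index], entry = entry, resident
            index = (index + 1) % len(self.slots)

    def lookup(self, key, default=None):
        index = self._home(key)
        for distance in range(len(self.slots)):
            resident = self.slots[index]
            # Stop early: a present key is never farther from home than
            # the resident of any slot on its probe path.
            if resident is None or self._distance(resident[0], index) < distance:
                return default
            if resident[0] == key:
                return resident[1]
            index = (index + 1) % len(self.slots)
        return default
```

<p>The early exit in <code>lookup</code> is what keeps unsuccessful
searches short: probing stops as soon as it meets an entry that is
closer to home than the missing key would have been.</p>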

<h2>Miscellaneous stuff</h2>

<p>In
<a href="http://www.pvk.ca/Blog/Coding/deadline-vs-timeout.html">Specify absolute deadlines, not relative timeouts</a>
and
<a href="http://www.pvk.ca/Blog/Coding/deadline-vs-timeout-part-2.html">the sequel</a>,
I argue that we should have interfaces that allow users to specify an
absolute deadline, with respect to a monotonic clock.  Timeouts are
convenient, but don&#8217;t compose well: how do we implement a timeout
version of an operation that sequences two calls to functions that
only offer timeouts as well?  Any solution will be full of race
conditions.  PHK disagrees; I&#8217;m not sure if all of his complaints can
be addressed by using a monotonic clock.</p>
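
<p>A small Python sketch of the composition problem, with hypothetical
function names and with the convention (an assumption) that each
operation takes its time limit as its only argument.  Deadlines are
readings of <code>time.monotonic()</code>.</p>

```python
import time


def remaining(deadline):
    """Seconds left before `deadline`, a reading of time.monotonic()."""
    return max(0.0, deadline - time.monotonic())


def run_both(first, second, deadline):
    """Sequence two deadline-taking operations; the deadline is just passed along."""
    return first(deadline), second(deadline)


def run_both_with_timeouts(first, second, timeout):
    """Same sequencing when the callees only accept relative timeouts."""
    deadline = time.monotonic() + timeout
    a = first(remaining(deadline))
    # The budget must be recomputed by hand; any time that elapses between
    # this computation and the callee starting its own clock is lost.
    b = second(remaining(deadline))
    return a, b
```

<p>With deadlines, the wrapper has nothing to compute and nothing to
get wrong; with timeouts, every layer has to redo the subtraction and
each one adds a little slack.</p>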

<p>Finally,
<a href="http://www.pvk.ca/Blog/Implementation/SSA_in_practices.html">Space-complexity of SSA in practices</a>
has some early thoughts on how Static single assignment scales for
typical functional programs.  It&#8217;s fairly clear that many compilers
for functional languages have inefficient (wrt compiler
performance) internal representations; however, it&#8217;s not as clear that
the industry standard, SSA, would fare much better.</p>
]]></content>
  </entry>
  
</feed>
