<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title><![CDATA[Paul Khuong mostly on Lisp]]></title>
  <link href="http://www.pvk.ca/atom.xml" rel="self"/>
  <link href="http://www.pvk.ca/"/>
  <updated>2012-02-19T12:35:24-05:00</updated>
  <id>http://www.pvk.ca/</id>
  <author>
    <name><![CDATA[Paul Khuong]]></name>
    <email><![CDATA[pvk@pvk.ca]]></email>
  </author>
  <generator uri="http://octopress.org/">Octopress</generator>

  
  <entry>
    <title type="html"><![CDATA[Fixed Points and Strike Mandates]]></title>
    <link href="http://www.pvk.ca/Blog/2012/02/19/fixed-points-and-strike-mandates/"/>
    <updated>2012-02-19T12:10:00-05:00</updated>
    <id>http://www.pvk.ca/Blog/2012/02/19/fixed-points-and-strike-mandates</id>
    <content type="html"><![CDATA[<p>Many tasks in compilation and program analysis (in symbolic
computation in general, I suppose) amount to finding solutions to
systems of the form \(x = f(x)\).  However, when asked to define
algorithms to find such fixed points, we rarely stop and ask &#8220;which
fixed point are we looking for?&#8221;</p>

<p>In practice, we tend to be interested in fixed points of monotone
functions: given a partial order \((\prec)\), we have \(a \prec b
\Rightarrow f(a)\prec f(b)\).  Now, in addition to being a fairly
reasonable hypothesis, this condition usually lets us exploit
<a href="http://en.wikipedia.org/wiki/Knaster%E2%80%93Tarski_theorem">Tarski&#8217;s fixed point theorem</a>.
If the domain of \(f\) (with \(\prec\)) forms a
<a href="http://mathworld.wolfram.com/CompleteLattice.html">complete lattice</a>,
so does the set of fixpoints of \(f\)!  As a corollary, \(f\) then has
a unique least and a unique greatest fixed point under \(\prec\).</p>

<p>This is extremely useful, because we can usually define useful meet
and join operations, and enjoy a complete lattice.  For example, for a
domain that&#8217;s the power set of a given set, we can use \(\subset\)
as the order relation, \(\cup\) as join, and \(\cap\) as meet.
However, what I find interesting to note is that, when we don&#8217;t pay
attention to which fixpoint we wish to find, humans seem to
consistently develop algorithms that converge to the least or greatest
one, depending on the problem.  It&#8217;s as though we all have a <em>common</em>
blind spot covering one of the extreme fixed points.</p>

<p>A simple example is dead value (useless variable) elimination.  When I
ask people how they&#8217;d identify such variables in a program, the naïve
solutions tend to be very similar.  They exploit the observation that
a value is useless if it&#8217;s only used to compute values that are
themselves useless.  The routines start out with every value live
(used), and prune away useless values, until there&#8217;s nothing left to
remove.</p>

<p>These algorithms converge to solutions that are correct, but
suboptimal (except for cycle-free code).  We wish to identify as many
useless values as possible, to eliminate as many computations as
possible.  Yet, if we start by assuming that all values are live, our
algorithm will fail to identify some obviously-useless values, like
<code>x</code> in:</p>

<pre><code>for (...)
  x = x
</code></pre>

<p>We could keep adding more special cases.  However, the correct
(simplest) solution is to try and identify live values, rather than
dead ones.  A value is live if it&#8217;s used to compute a live value.
Moreover, return values and writes to memory are always live.  Our
routine now starts out by assuming that only the latter values are
live, and adjoins live values as it finds them, until there&#8217;s nothing
left to add.</p>

<p>In this case, the intuitive solution converges to the greatest fixed
point, but we&#8217;re looking for the least fixed point.  Setting the right
initial value ensures convergence to the right fixed point.</p>
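
<p>To make the difference concrete, here is a minimal Python sketch of
both iterations on a toy program.  The <code>uses</code> map, the
<code>roots</code> set and the value names are hypothetical stand-ins
for a real intermediate representation; <code>x</code> plays the role
of the self-assigned variable in the loop above.</p>

```python
def live_by_growing(uses, roots):
    """Least fixed point: start from the roots and add live values."""
    live = set(roots)
    changed = True
    while changed:
        changed = False
        for value, users in uses.items():
            # A value is live if it is used to compute a live value.
            if value not in live and not users.isdisjoint(live):
                live.add(value)
                changed = True
    return live


def live_by_pruning(uses, roots):
    """Greatest fixed point: start with everything live and remove dead values."""
    live = set(uses) | set(roots)
    changed = True
    while changed:
        changed = False
        for value in sorted(live):
            # A value is dead if it is not a root and no live value uses it.
            if value not in roots and uses.get(value, set()).isdisjoint(live):
                live.remove(value)
                changed = True
    return live


# Toy program (made up for illustration): b feeds a, a feeds the return
# value, c is unused, and x is only used to compute itself ("x = x").
uses = {"a": {"ret"}, "b": {"a"}, "c": set(), "x": {"x"}, "ret": set()}
roots = {"ret"}

grown = live_by_growing(uses, roots)
pruned = live_by_pruning(uses, roots)
print(sorted(grown))
print(sorted(pruned))
```

<p>On this toy input, pruning from &#8220;everything is live&#8221; removes
<code>c</code> but keeps <code>x</code>, whereas growing from the roots never
marks <code>x</code> as live.</p>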

<p>Other common instances of this pattern are
<a href="http://en.wikipedia.org/wiki/Reference_counting">reference counting</a>
instead of
<a href="http://www.memorymanagement.org/glossary/m.html#marking">marking</a>, or
performing type propagation by initially assigning the top type to all
values (like SBCL).</p>

<p>
<a href="#strike-algorithm" name="strike-algorithm">#</a>
I recently found a use for fixed point computations outside of math
and computer science.
</p>


<p>Most university or <a href="http://en.wikipedia.org/wiki/CEGEP">CEGEP</a> student
unions in Québec will vote (or already have voted) on strike mandates
to help organize protests against rising university tuition fees this
winter and spring.  There are hundreds of such unions across the
province representing, in total, around four hundred thousand
students.  The vast majority of these unions comprise a couple hundred
(or fewer) students, and many feel it would be counter-productive for
only a tiny number of students to be on strike.  Thus, strike mandates
commonly include conditions regarding the minimal number of other
students who also hold strike mandates, along with additional lower
bounds on the number of unions and universities or colleges involved.
As far as I know, all the mandates adopted so far are monotone: if
they are satisfied by a set of striking unions, they are also satisfied
by all of its supersets.</p>

<p>Tarski&#8217;s theorem applies (again, with \((\subset, \cup, \cap)\) on the
power set of the set of student unions).  Which fixed point are we
looking for?</p>

<p>It&#8217;s clear to me that we&#8217;re looking for the fixed point with the
largest set of striking unions.  In some situations, the least fixed
point could trivially be the empty set (or all unions that did not
adopt any lower bound).  Moreover, the mandates are usually presented
with an explanation to the effect that, if unions representing at
least \(n_0\) students adopt the same mandate, then all unions that
have adopted the mandate will go on strike simultaneously.</p>

<p>I asked fellow graduate students in computer science to sketch an
algorithm to determine which unions should go on strike given their
mandates; they started with the set of student unions currently on
strike, and adjoined unions for which all the conditions were met.
Such algorithms will converge toward the least fixed point.  For
example, there could be two unions, each comprising 5 000 students,
with the same strike floor of 10 000 students, and these algorithms
would have both unions deadlocked, waiting for the other to go on
strike.</p>

<p>Instead, we should start by assuming that all the unions (with a
strike mandate) are on strike, and iteratively remove unions whose
conditions are not all met, until we hit the greatest fixed point.
I&#8217;m fairly sure this will end up being a purely theoretical concern,
but it&#8217;s a pretty neat case of abstract mathematics helping us
interpret a real-world situation.</p>
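
<p>The two iterations can be sketched in a few lines of Python.  This is
a simplified model, not the actual mandates: each union is assumed to
have a single floor on the total number of students in striking unions
(itself included), and the union names and numbers are invented.</p>

```python
def step(striking, sizes, floors):
    """One application of f: the unions whose floor is met if `striking` strike."""
    total = sum(sizes[union] for union in striking)
    return {union for union in sizes if total >= floors[union]}


def iterate(start, sizes, floors):
    """Apply `step` repeatedly from `start` until the set stops changing."""
    current = set(start)
    while True:
        following = step(current, sizes, floors)
        if following == current:
            return current
        current = following


# Hypothetical data: A and B mirror the two-union example above;
# C has a floor that can never be met.
sizes = {"A": 5000, "B": 5000, "C": 1000}
floors = {"A": 10000, "B": 10000, "C": 50000}

from_nobody = iterate(set(), sizes, floors)        # least fixed point
from_everyone = iterate(set(sizes), sizes, floors)  # greatest fixed point
print(sorted(from_nobody))
print(sorted(from_everyone))
```

<p>Starting from nobody on strike leaves A and B deadlocked; starting
from everyone drops only C and keeps A and B on strike.  Because the
floors are monotone, each run moves in one direction only and must stop.</p>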

<p>This pattern of intuitively converging toward a suboptimal solution
seems to come up a lot when computing fixed points.  It&#8217;s not
necessarily a bad choice: conservative initial values tend to lead to
faster convergence, and often have the property that intermediate
solutions are always correct (feasible).  When we need quick results,
it may make sense to settle for suboptimal solutions.  However, it
ought to be a deliberate choice, rather than a consequence of failing
to consider other possibilities.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Migration and Synopsis]]></title>
    <link href="http://www.pvk.ca/Blog/2012/01/18/migration-and-synopsis/"/>
    <updated>2012-01-18T18:56:00-05:00</updated>
    <id>http://www.pvk.ca/Blog/2012/01/18/migration-and-synopsis</id>
    <content type="html"><![CDATA[<p>This blog has been going for five years.  Back then, it seemed like
the only widely-used static blog generators were
<a href="http://www.blosxom.com/">Blosxom</a> or
<a href="http://pyblosxom.bluesock.org/">pyBlosxom</a>.  They weren&#8217;t that hard
to set up, but getting everything <em>right</em> rather than good enough is a
lot of work.  LaTeX and MathML support was also very weak, so I wound
up using an (insane) one-off hack with
<a href="http://tug.org/tex4ht/">tex4ht</a>.  I feel like
<a href="http://octopress.org/">Octopress</a> and
<a href="http://www.mathjax.org/">MathJax</a> now do everything I need out of the
box, better than anything I could design by myself.</p>

<p>The permalinks from the old blog are still around, but not the RSS
feeds or the date-based links.</p>

<p>I figure this is a good opportunity to make sure the (marginally
useful) permalinks are available somewhere other than via Google.</p>

<h2>Lisp-related posts</h2>

<p><a href="http://pvk.ca/Blog/Lisp/accumulating_data_in_vectors.html">Another way to accumulate data in vectors</a>
describes a copying-free extendable vector.  The advantage over the
usual geometric growth with copy is that the performance with respect
to the number of elements added is much smoother.  Runtimes are then
more easily predictable, and sometimes improved (e.g. right when a
copy would be needed).  It&#8217;s also more amenable to a lock-free
adaptation, while preserving O(1) operation complexity (assuming that
<code>integer-length</code> on machine integers is constant time), as shown in
<a href="http://www2.research.att.com/~bs/lock-free-vector.pdf">Dechev et al.&#8217;s &#8220;Lock-free Dynamically Resizable Arrays&#8221;</a>.</p>

<p><a href="http://pvk.ca/Blog/Lisp/CommonCold/">Common Cold</a> is a really old
hack to get serialisable closures in SBCL, with serialisable
continuations built on top of that.  Nowadays, I&#8217;d do the closure part
differently, without any macro or change to the source.</p>

<p><a href="http://pvk.ca/Blog/Lisp/concurrency_with_mvars.html">Concurrency with MVars</a>
has short and simple(istic) code for
<a href="http://www.haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/Control-Concurrent-MVar.html">mvars</a>,
and uses it to implement same-fringe with threads.</p>

<p><a href="http://pvk.ca/Blog/Lisp/constraint-sets.html">Constraint sets in SBCL: preliminary exploration</a>
summarises some statistics on how constraint sets (internal SBCL data
structures) are used by SBCL&#8217;s compiler.</p>

<p><a href="http://pvk.ca/Blog/Lisp/flow_sensitive_analysis_in_sbcl.html">SBCL&#8217;s flow sensitive analysis pass</a>
explores what operations on constraint sets actually mean.  This,
along with the stats from the previous post, guided a rewrite, not of
constraint sets, but of the analysis pass that uses them.  The
frequency of slow operations or bad usage patterns is reduced enough
to take care of most (all?) performance regressions associated with
the original switch to bit-vector-based constraint sets, without
penalising the common case.</p>

<p><a href="http://pvk.ca/Blog/Lisp/finalizing_foreign_pointers_just_late_enough.html">Finalizing foreign pointers just late enough</a>
is a short reminder that attaching finalizers to system area pointers
isn&#8217;t a good idea: SAPs are randomly unboxed and consed back, like
numbers.</p>

<p><a href="http://pvk.ca/Blog/Lisp/hacking_SSE_intrinsics-part_1.html">Hacking SSE Intrinsics in SBCL (part 1)</a>
walks through an SBCL branch that adds support for SSE operations.
Alexander Gavrilov has kept a fork on life support
<a href="https://github.com/angavrilov/sbcl-old">on github</a>.  There&#8217;s still no
part 2, in which the branch is polished enough to be merged into the
mainline.</p>

<p>In the meantime,
<a href="http://pvk.ca/Blog/Lisp/SSE_complexes.html">Complex float improvements for sbcl 1.0.30/x86-64</a>
built upon the original work on SSE intrinsics to implement operations
on <code>(complex single-float)</code> and <code>(complex double-float)</code> with SIMD
code on x86-64.  That sped up most complex arithmetic operations by
100%.  That work also came with support for references to unboxed
constants on x86oids; this significantly improved floating point
performance as well, for both real and complex values.</p>

<p><a href="http://pvk.ca/Blog/Lisp/modular-struct-initialisation.html">Initialising structure objects modularly</a>
is a solution to a problem that I hit, trying to implement non-trivial
initialisation for structures, while allowing inheritance.  Tobias
Rittweiler points out that the protocol is very similar to a common
CLOS pattern where, instead of functions that allocate objects, class
designators are passed.  It also looks a bit like the way Factor
libraries seem to do struct initialisation, but with actual
initialisation instead of assignment (which matters for read-only
slots).</p>

<p><a href="http://pvk.ca/Blog/Lisp/persistent_dictionary.html">An Impure Persistent Dictionary</a>
is an example of a technique I find really useful to implement
persistent versions of side-effectful data structures.  Henry Baker
has a <a href="http://home.pipeline.com/~hbaker1/ShallowArrays.html">paper</a>
that shows how shallow binding can be used to implement persistent
arrays on top of functional arrays, with constant-time overhead for
operations on the latest version.  It&#8217;s a really nice generalisation
of trailing in backtracking searches.  Here, I use it to get
persistent hash tables in only a couple dozen lines of code.</p>

<p><a href="http://pvk.ca/Blog/Lisp/Pipes/">Pipes</a> is an early attempt to develop
a DSL for stream processing, like an 80%
<a href="http://series.sourceforge.net/">SERIES</a>.  I&#8217;ve refocused my efforts
on <a href="http://pvk.ca/Blog/Lisp/Xecto/">Xecto</a>, which only handles
vectors, rather than potentially unbounded streams.  The advantage is
that, to me, Xecto looks like it has the potential to be simpler while
achieving near-peak performance; the main downside is that
vectors don&#8217;t allow us to represent control flow as data via lazy
evaluation&#8230; and I&#8217;m not sure that&#8217;s such a bad thing.</p>

<p>The post on
<a href="http://pvk.ca/Blog/Lisp/string_case_bis.html">string-case</a> is an
overview of how I structured a CL dispatch macro that compares with
<code>string=</code> instead of <code>eql</code>.  If I were to do this again, I&#8217;d probably
try and improve <code>string=</code>; I later tested an SSE comparison routine,
and it ended up being, in a lot of cases, faster and simpler (with a
linear search) than the search tree generated by <code>string-case</code>.</p>

<p><a href="http://pvk.ca/Blog/Lisp/type_lower_bound.html">The type-lower-bound branch</a>
describes early work on a branch that provides a way to shut the
compiler up about certain failed type-directed optimisations.  A lot
of the output from SBCL&#8217;s compiler amounts to reports of optimisations
that couldn&#8217;t be performed (e.g. converting multiplication by a
constant power of two to a shift), and why (e.g. the variant argument
isn&#8217;t known to be small enough).  Sometimes, there&#8217;s nothing we can do
about it: we can&#8217;t show the compiler that the argument is small enough
because we know that it will sometimes be too large!  Yet, CL&#8217;s type
system (like most) does not let us express that information.
Programmers are expected to provide upper bounds on the best static
type of values (e.g. we can specify that a value is always a <code>fixnum</code>,
although it may really only be integers between 0 and 1023).  We would
like a way to specify lower bounds as well: &#8220;I know that this will
take arbitrary <code>fixnum</code> values.&#8221;  Once we have that, the compiler can
skip reporting optimisations that we know can&#8217;t be performed (as
opposed to those that may or may not be possible).</p>

<p>Finally,
<a href="http://pvk.ca/Blog/Lisp/yet_another_way_to_fake_continuations.html">Yet another way to fake continuations</a>
sketches a simple but somewhat inefficient way to implement
continuations for pure programs.  It may be useful for IO-heavy
applications (web programming), or in certain cases similar to
backtracking search, but in which most of the work is performed
outside of backtracking (e.g. during constraint propagation).</p>

<h2>General low-level programming issues</h2>

<p><a href="http://www.pvk.ca/Blog/LowLevel/SWAR-some-zerop.html">SWAR implementation of (some #&#8217;zerop &#8230;)</a>
sketches how we can use SIMD-within-a-register techniques to have fast
search for patterns of sub-word size.  A degenerate case is when we
look for 0 or 1 in bit vectors; in this case, it&#8217;s clear how we can
test whole words at a time.  The idea can be extended to testing
vectors of 2-, 4-, or 8-bit (or any size) elements.  I haven&#8217;t found time
to move this into SBCL&#8217;s runtime library (yet), but it would probably
be a neat and feasible first project.</p>

<p><a href="http://www.pvk.ca/Blog/LowLevel/VM_tricks_safepoints.html">Revisiting VM tricks for safepoints</a>
explores the performance impact of switching from instrumented
pseudo-atomic code sequences to safepoints.  The bottom line is that
it&#8217;s noise.  However, some members of the Russian Lisp mafia have used
it as inspiration, and have managed to implement seemingly solid
<a href="https://github.com/akovalenko/sbcl-win32-threads/wiki">threaded SBCL on Windows</a>!
It&#8217;s still a third-party fork for now, but some committers are working
on merging it with the mainline.</p>

<p><a href="http://www.pvk.ca/Blog/LowLevel/fast-integer-division.html">Fast Constant Integer Division</a>
has some stuff on integer division by constants.  It&#8217;s mostly
superseded by Lutz Euler&#8217;s work to implement the same algorithm as
GCC.  There are some interesting identities that can be used to
improve on that algorithm a tiny bit and, more interestingly, to
implement truncated multiplication by arbitrary fractions.  I only
stumbled upon those a long time after I wrote the post; I&#8217;ll try and
come back to this topic in the coming months.</p>

<p><a href="http://www.pvk.ca/Blog/LowLevel/more_to_locality_than_cache.html">There&#8217;s more to locality than caches</a>
tracks my attempts to understand why a data structure designed to be
cache-efficient did not perform as well as expected.  It turns out
that cache lines aren&#8217;t exactly read atomically (so reading two
adjacent addresses may be significantly slower than only one), and
that sometimes L2 matters less than TLB.  The latter point was an
important lesson for me.  TLBs are used to accelerate the translation
of virtual addresses to physical; <em>every</em> memory access must be
translated.  TLBs are usually fully associative (behave like
content-addressed memory or hash tables, basically), but with a small
fixed size, on the order of 512 pages for the slower level.  With
normal (on x86oids) 4KB pages, that&#8217;s only enough for 2 MB of data!
Even worse: a cache miss results in a single access to main memory,
which is equivalent to ~60-100 cycles at most; a TLB miss, however,
results in a lookup in a 4 level page table on x86-64, which often
takes on the order of 200-300 cycles.  Luckily, there are workarounds,
like using 2 MB pages.</p>

<p><a href="http://www.pvk.ca/Blog/LowLevel/napa-fft2-implementation-notes.html">Napa-FFT(2) implementation notes</a>
is where I try to make the code I wrote for a Fast Fourier transform
understandable, especially <em>why</em> it does what it does.  Napa-FFT and
Napa-FFT2 are vastly faster than Bordeaux-FFT (and than all other CL
FFT codes I know, on SBCL), but it&#8217;s still around 20-50% slower than
the usual benchmark, FFTW.  Napa-FFT3 is coming, and it&#8217;s a completely
different approach which manages to be within a couple of percentage points
of FFTW, and is faster on some operations.</p>

<p><a href="http://www.pvk.ca/Blog/LowLevel/software-reciprocal.html">0x7FDE623822FC16E6 : a magic constant for double float reciprocal</a>
is a surprisingly popular post.  I was trying to approximate
reciprocals as fast as possible for a mathematical optimization
method.  The usual way to do that is to use a hardware-provided
approximation and then improve it with a couple iterations of Newton&#8217;s
method.  The post shows how we can instead use the way floats are laid
out in memory to provide a surprisingly accurate guess with an integer
subtraction.  I actually think the interesting part was that it made
for a practical use case for the golden section search&#8230;</p>
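
<p>As a rough illustration of the idea, the Python sketch below
reinterprets a double as a 64-bit integer, subtracts it from the
constant in the post title, and reinterprets the result as a double.
The helper names are invented here, and the sketch assumes positive,
normal inputs; the result is only a first guess, which one Newton
iteration then tightens.</p>

```python
import struct

MAGIC = 0x7FDE623822FC16E6  # constant taken from the post title


def approx_reciprocal(x):
    """Crude guess at 1/x for a positive, normal double (assumption)."""
    bits = struct.unpack("=Q", struct.pack("=d", x))[0]
    return struct.unpack("=d", struct.pack("=Q", MAGIC - bits))[0]


def refine(x, guess):
    """One Newton iteration for the reciprocal: y * (2 - x * y)."""
    return guess * (2.0 - x * guess)


x = 3.0
guess = approx_reciprocal(x)
better = refine(x, guess)
print(guess, better, 1.0 / x)
```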

<p><a href="http://www.pvk.ca/Blog/LowLevel/some-notes-on-warren.html">Some notes on Warren</a>
has a couple notes about stuff in Warren&#8217;s book
<a href="http://www.hackersdelight.org/">Hacker&#8217;s Delight</a>.  The sign
extension bit probably deserves more attention; it seems like someone
on #lisp asks how they can sign-extend unsigned integers at least once
a month.</p>
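
<p>For reference, one standard branch-free identity for sign extension
flips the sign bit and then subtracts it.  The Python sketch below
assumes the input already fits in <code>width</code> bits; the function
name is made up, and this is not necessarily the formulation the post
discusses.</p>

```python
def sign_extend(value, width):
    """Interpret the `width`-bit unsigned integer `value` as two's complement."""
    sign_bit = 2 ** (width - 1)
    # Flipping the sign bit and subtracting it maps [0, 2**width) to
    # [-2**(width - 1), 2**(width - 1)).
    return (value ^ sign_bit) - sign_bit


print(sign_extend(0xFF, 8), sign_extend(0x7F, 8), sign_extend(0x80, 8))
```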

<p><a href="http://www.pvk.ca/Blog/LowLevel/two-neat-tricks.html">Two variations on old themes</a>
has some stuff on Linux&#8217;s ticket spinaphores, and is the beginning of
my looking into Robin Hood hashing with linear probing for
cache-friendly hash tables.</p>

<p><a href="http://www.pvk.ca/Blog/numerical_experiments_in_hashing.html">Interlude: Numerical experiments in hashing</a>
covers a first stab at designing a hash table that exploits cache
memory.  2-left hashing looks interesting, but its performance was
worse than expected, for various reasons, mostly related to the fact
that caches can be surprisingly complicated.  Two years later,
<a href="http://www.pvk.ca/Blog/more_numerical_experiments_in_hashing.html">More numerical experiments in hashing: a conclusion</a>
revisits the question, and settles on Robin Hood hashing with linear
probing.  It&#8217;s a tiny tweak to normal open addressing (insertions can
bump previously-inserted items farther from their insertion point),
but it suffices to greatly improve the worst and average probing
length, while preserving the nice practical characteristics of linear
probing.  I&#8217;ve also started some work on implementing SBCL&#8217;s hash
table this way, but there are practical issues with weak hash
functions, GC and mutations.</p>
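
<p>The bumping rule is easy to state in code.  The following Python
sketch is a toy insertion routine for a set of keys, under my own
assumptions (a fixed-size list with <code>None</code> for empty slots,
no deletion or resizing, no duplicate check); it is not the SBCL
implementation.</p>

```python
def robin_hood_insert(table, key):
    """Insert `key` into an open-addressed table with linear probing."""
    if None not in table:
        raise RuntimeError("table is full")
    size = len(table)
    index = hash(key) % size
    distance = 0  # how far `key` currently sits from its home slot
    while True:
        resident = table[index]
        if resident is None:
            table[index] = key
            return
        resident_distance = (index - hash(resident)) % size
        if distance > resident_distance:
            # The resident is closer to its home slot than we are:
            # take its slot and carry on inserting the resident instead.
            table[index], key = key, resident
            distance = resident_distance
        index = (index + 1) % size
        distance += 1


table = [None] * 8
for key in (0, 8, 1, 16):
    robin_hood_insert(table, key)
print(table)
```

<p>With small integer keys (which hash to themselves in CPython), the
last insertion bumps <code>1</code> one slot farther so that
<code>16</code> does not end up three slots away from its home.</p>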

<h2>Miscellaneous stuff</h2>

<p>In
<a href="http://www.pvk.ca/Blog/Coding/deadline-vs-timeout.html">Specify absolute deadlines, not relative timeouts</a>
and
<a href="http://www.pvk.ca/Blog/Coding/deadline-vs-timeout-part-2.html">the sequel</a>,
I argue that we should have interfaces that allow users to specify an
absolute deadline, with respect to a monotonic clock.  Timeouts are
convenient, but don&#8217;t compose well: how do we implement a timeout
version of an operation that sequences two calls to functions that
only offer timeouts as well?  Any solution will be full of race
conditions.  PHK disagrees; I&#8217;m not sure if all of his complaints can
be addressed by using a monotonic clock.</p>
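
<p>A small Python sketch of the composition argument: when every layer
takes an absolute deadline on the monotonic clock, sequencing two waits
just means passing the same deadline along.  The function names are
hypothetical, and the conversion to a relative timeout happens only at
the lowest level, because that is all <code>threading.Event.wait</code>
offers.</p>

```python
import threading
import time


def wait_until(event, deadline):
    """Wait for `event` until `deadline`, an absolute time.monotonic() value."""
    remaining = max(0.0, deadline - time.monotonic())
    return event.wait(timeout=remaining)


def wait_for_both(first, second, deadline):
    """Sequence two waits under one shared deadline."""
    return wait_until(first, deadline) and wait_until(second, deadline)


first, second = threading.Event(), threading.Event()
first.set()
second.set()
ok = wait_for_both(first, second, time.monotonic() + 1.0)
timed_out = wait_for_both(first, threading.Event(), time.monotonic() + 0.05)
print(ok, timed_out)
```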

<p>Finally,
<a href="http://www.pvk.ca/Blog/Implementation/SSA_in_practices.html">Space-complexity of SSA in practices</a>
has some early thoughts on how static single assignment scales for
typical functional programs.  It&#8217;s fairly clear that many compilers
for functional languages have inefficient (with respect to compiler
performance) internal representations; however, it&#8217;s not as clear that
the industry standard, SSA, would fare much better.</p>
]]></content>
  </entry>
  
</feed>

