I find most developers are vastly more comfortable dealing with
pointerbased data structures than with implicit ones. I blame our
courses, which focus on the former and completely fail to show us how
to reason about the latter. For example, the typical presentation of
the binary heap introduces an indexing scheme to map from parent to
children – the children of heap[i]
are heap[2 * i]
and
heap[2 * i + 1]
, with onebased indices – that is hard to
generalise to ternary or kary heaps (Knuth’s presentation, with
parents at \(\lfloor i/2 \rfloor\), is no better). The reason it’s
so hard to generalise is that the indexing scheme hides a simple
bijection between paths in kway trees and natural numbers.
I find it ironic that I first encountered the idea of describing data structures or algorithms in terms of number systems, through Okasaki’s and Hinze’s work on purely functional data structures: that vantage point seems perfectly suited to the archetypal mutable data structure, the array! I’ll show how number systems help us understand implicit data structures with two examples: a simple indexing scheme for kary heaps and compositions of specially structured permutations to implement inplace bitreversal and (some) matrix transpositions.
The
classical
way
to present implicit binary heaps is to work with onebased array
indices and to root an implicit binary tree at heap[1]
. For any
node heap[i]
, the children live at heap[2 * i]
and
heap[2 * i + 1]
. My issue with this presentation is that it’s
unclear how to extend the scheme to ternary or kary trees: if the
children of heap[1]
are heap[3 * 1 ... 3 * 1 + 2]
, i.e.,
heap[3, 4, 5]
,
what do we do with heap[2]
? We end up with k  1
parallel kary
trees stashed in the same array. Knuth’s choice of mapping children
to parents with a floored integer division suffers the same fate.
Onebased arrays hides the beauty of the binary indexing scheme. With
zerobased arrays, the children of heap[i]
are heap[2 * i + 1]
and
heap[2 * i + 2]
. This is clearly isomorphic to the onebased scheme
for binary heaps. The difference is that the extension to kway
trees is obvious: the children of heap[i]
are heap[k * i + 1 ... k * i + k]
.
A couple examples like the one below fail to show any problem. However, even a large number of tests is no proof. Thinking in terms of number systems leads to a nice demonstration that the scheme creates a bijection between infinite kary trees and the naturals.
There is a unique finite path from the root to any vertex in an infinite kary tree. This path can be described as a finite sequence \((c\sb{1}, c\sb{2}, \ldots, c\sb{n})\) of integers between 1 and k (inclusively). If \(c\sb{1} = 1\), we first went down the first child of the root node; if \(c\sb{1} = 2\), we instead went down the second child; etc. If \(c\sb{2} = 3\), we then went down the third child, and so on. We can recursively encode such finite paths as naturals (in pidgin ML):
path_to_nat [] = 0
path_to_nat [c : cs] = k * path_to_nat cs + c
Clearly, this is an injection from finite paths in kary trees to the
naturals. There’s only a tiny difference with the normal positional
encoding of naturals in base k
: there is no 0, and digits instead
include k
. This prevents us from padding a path with zeros, which
would map multiple paths to the same natural.
We only have to show that path_to_nat
is a bijection between finite
paths and naturals. I’ll do that by constructing an inverse that is
total on the naturals.
nat_to_path 0 = []
nat_to_path n = let c = pop n
in c : nat_to_path (n  c) / k
where pop
is a version of the mod k
that returns k
instead of 0:
pop n = let c = n `mod` k
in
if (c != 0)
then c
else k
The base case is nat_to_path 0 = []
.
In the induction step, we can assume that path_to_nat cs = n
and
that nat_to_path n = cs
. We only have to show that, for any
\(1 \leq c \leq k\), nat_to_path path_to_nat c:cs = c:cs
. Let
n' = path_to_nat c:cs = k * n + c
.
\[n\sp{\prime} = kn + c \equiv c\quad \mod k,\]
so pop n'
will correctly return c
(and k
rather than 0). It’s
then a tiny bit of algebra to show that (n'  c) / k = n
, and we
fall back to the induction hypothesis.
This scheme is so simple that I wound up coding a version of
heapsort(3)
that lets callers choose the heap’s arity at runtime.
Higher arity heaps perform more comparisons but fewer swaps than
binary heaps; the tradeoff is profitable when sorting large items.
It seems to me that, for decades, we’ve been presenting implicit heaps
and heapsort in a way that marginally simplifies the binary case at
the expense of obscuring the elegant general case.
Bit reversing an array of length \(2\sp{l}\) sends the element x[i]
to
x[j]
, where the binary representation of i
(including padding up
to l
bits) is the reverse of j
. For example, in an array of
length 16, 3 = 0011
becomes 1100 = 12
.
Reversing a fixedwidth integer’s binary representation is its selfinverse, so bit reversing an array is a sequence of swaps. This means that the permutation can be performed inplace, as a series of independent swaps. Bit reversal used to be slow on cached machines: contiguous elements (with indices that only vary in their low order bits) swap with faroff elements (indices that only vary in their high order bits). Worse, the stride between the latter elements is a large power of two, which causes all sorts of aliasing issues. Workarounds (see Zhang 99 (PDF)) mostly end up implementing a software cache with explicit buffers. Nowadays, even L1 caches have such a high associativity that aliasing is a much less important issue.
NapaFFT3 implements bit reversals by calling a few specialised functions that only swap the lower and higher bits; the main routine iterates over an array of precomputed middle bit reversals (similar to various publications of Elster’s, but recursing on the middle bits first). In this implementation, the number of L1 cache misses incurred by bit reversing an array is only slightly greater than the compulsory misses. Bit reversal isn’t free, but it’s also not clear that autosorting FFTs are quicker than outoforder FFTs followed by a bit reversal pass.
Bit reversal is the only array permutation I’ve seen described in terms of its effect on indices. I think it’s a fruitful avenue for other inplace permutations.
For example, the viewpoint makes it clear how to transpose a matrix of dimensions \(2\sp{m} \times 2\sp{n}\) with a sequence of inplace bit reversals (each \(i\sb{k}\) and \(j\sb{l}\) is a bit in the index’s binary representation).
For a rowmajor layout, the sketch above corresponds to:
Bit reversal, like all other permutations, is a reversible linear operation. We can thus change the order of operation if we want to. For example, it’s not necessarily preferable to bitreverse contiguous rows first. We could also flip the highorder bits of the indices: rather than swapping scalars, we would swap rows. Separately bit reversing contiguous rows works best when each row fits in cache. Bit reversing columns instead amortises the bad access pattern inherent to bit reversal by spending more time on each swap: swapping rows is slower than swapping scalars, but also very efficients with regards to (streaming!) I/O.
This is interesting because inplace transposition of rectangular matrices is hard, and transposition is already a bad fit for caches. Transposing matrices with a sequence of bit reversals might just be practical. In fact, that’s what I intend to do in NapaFFT3 for multidimensional DFTs: we can fuse all but the middle wholevector bit reversal with mixedradix FFTs (and the latter might similarly benefit from operating on [sub]rows rather than scalars).
One obvious question now appears: can we generalise the trick to general dimensions? It’s pretty clear that we can do it for any other base \(b\) and matrices of dimension \(b\sp{m} \times b\sp{n}\) (it’s interesting how highly composite numbers dimensions are easy to transpose, and, IIRC, so are coprime ones). What if there’s no such factorisation? The best I can do is “more or less.”
For arbitrary matrix dimensions \(m \times n\), I think it’s best to decompose indices in a mixed radix (but still positional) number system. For example, a \(63 \times 21\) matrix might have indices in radix \(3,7\ \ 3,7,3\). Given this number system, matrix transposition is
It’s a small generalisation to let the radices be \(a,b\ \ a,b,a\), for a matrix of dimension \(ab \times a\sp{2}b\). We can then perform most of a matrix transposition by swapping positions of identical weight: first a full mixedradix digit reversal (the weights are palindromic), followed by another mixedradix reversal on the first three positions.
This leaves the last chunk \(b\sb{2},a\sb{3}\), which should instead be \(a\sb{3},b\sb{2}\). That’s another rectangular matrix tranpose, but smaller than the original one. It might be practical to execute that last step with a straightforward outofplace transpose: a smaller transpose needs less scratch space and may fit in cache. We can also apply the same trick as for bit reversals and apply the transpose before everything else, by permuting rows rather than scalars. The simplest way to do that is to transpose a matrix of pointers before replicating the permutation on the actual data (glibc’s mergesort references Knuth vol. 3, exercise 5.210).
Finally, this also works for \(a\sp{2}b \times ab\): matrix transposition is its own inverse, so we only have execute the inverse of each step, in reverse order.
Definitely mixed results, but at least we have some intuition on why general rectangular transpositions are hard to perform in place: they’re hard to express as sequences of swaps.
This post is the more theoretical prologue to
a lowlevel look
at qsort(3)
: I really wanted to make sure the nice implicit tree
layout in the first section had the space it deserves.
I tried to make inorder implicit trees fit in this number system approach. I can’t see how. The problem is that inorder trees associate ranges (rather than indices) with nodes; for example, at what depth is index 1000? It depends on the size of the search tree. It might be the root (in a tree of 2000 vertices) or a leaf (in a tree of 1001 vertices).
]]>I don’t really follow the compression scene, and only pay minimal attention to machine learning. Nevertheless, the “Compression is Learning” slogan feels more and more right to me. In this post, I’ll explore another relationship that I find surprising: one between compression and compilation.
Five years ago, I took Marc Feeley’s compiler class, and he let us choose our term project. I settled on generating traces from recursive algorithms (typical of cacheoblivious algorithms) and reordering them to get iterative code that was better suited to the first level of cache. I came up with a gnarly CLaccented researchquality prototype, but the result was surprisingly effective. Sadly, I never really managed to recover loops or function calls from the fully inlined trace, so even mediumsize kernels would exhaust the instruction cache.
I believe I now see a practical and scalable solution, thanks to Artur Jez’s work on “A really simple approximation of smallest grammar.” His work might lead to a functionoriented analogue of trace compilation. Trace compilers are notoriously weak on recursion (recursive function calls don’t map well to loops), so it would be nifty to have an alternative that identifies functions rather than loops.
This makes me quip that “(Trace) Compilation is Compression.” We can see trace compilers as lazily unpacking traces from the source program, rewriting them a bit, and recompressing traces in an executable format. Admittedly, the analogy is a bit of a stretch for classical compilers: they are less aggressive in decompressing source code and directly translate from one compressed representation (a small source file may describe billions of execution paths) to another.
Anyway… this is extremely hypothetical and Jez’s work is fun independently of my weekend speculations.
Once we cast a program trace as a sequence of opcodes (perhaps without operands, to expose more redundancy), it’s obvious that reducing code size is “just” compression, and LZtype algorithms quickly come to mind: they compress strings by referring to earlier substrings.
The heart of LZ77 is a loop that streams through the input sequence and finds the longest common subsequence earlier in the input. In practice, the search usually considers a fixedsize window; when I write LZ77 here, I instead refer to the theoretical approach with an unbounded search window. Repeated LCS searches on an unbounded window sounds slow, but, in sophisticated implementations, the bottleneck is sorting the input string to generate a suffix array.
I don’t feel like being sophisticated and will implement a quadratictime LCS search.
1 2 3 4 5 6 7 8 9 10 11 12 

Some backreferences clearly look like function calls.
1 2 3 4 5 6 

It’s a bit surprising at first, but some also correspond to repeat
loops.
1 2 3 

This last patch is strange because it’s selfreferential. However, if
we decompress character by character, its meaning becomes obvious: we
replicate the substring in [0, 1]
to generate exactly 5 characters.
This is basically a loop; we can handle partial copies with, e.g.,
rotation and entering the middle of the loop (like Duff’s device).
Lempel and Ziv’s result tells us that we can look for references greedily in an unbounded window and obtain asymptotically optimal (proportional to the string’s entropy) compression. Again, there are algorithms to do that in linear time – via radix sort – but I’ll just bruteforce my way in cubic time.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 

Here’s the problem with converting an LZ77 factorisation into calls and loops: there is no guarantee that patches nest sanely.
1 2 3 

This factorisation encodes “ababacbac” by repeating “ab” two and a half times, inserting “c”, and then repeating “bac.” The issue is that we wish to convert “ababa” into a loop, but the last patch would then need to enter that loop in its last iteration. We could also have the opposite problem, with a patch that only covers a prefix of an earlier patch (a few iterations of a loop). The sketch below shows how we might even have to deal with both issues in the same factor.
Jez’s paper analyses a method to recover a CFL with a single member, the string we wish to compress. The CFL is described by a fully deterministic (it produces exactly one string) CFG in Chomsky normal form, i.e., a straightline program.
This “really simple” approach takes a step back from LZ compression and instead starts with the easiest, greedy, way to generate a straightline program from a string: just pair characters as you encounter them, lefttoright.
For example, on “abab,” this would pair “(ab)(ab)” and then ”((ab)(ab)),” and generate the following grammar:
G > XX
X > ab
We got lucky: the repetition synced up with the greedy pairing. That’s not always the case; for example, “abcab” becomes “(ab)(ca)b” – which exposes no redundancy – rather than “(ab)c(ab).”
Jez addresses this issue by using LZ77 factorisation as an oracle to synchronise pairings.
First, we simplify the algorithm by assuming that there are no factors that begin exactly one character before their output. Such factors correspond to repetitions of that one character (e.g., “aaaa” = “a” x 4) and we can easily turn them into repetitions of a pair (e.g., “aa” x 2).
It’s then simply a matter of not pairing when it would prevent the next character from synchronising with its backreference. For example, in “abcab,” we’d skip pairing “ca” so that “ab” could pair like the first occurrence did. We also don’t force a pair when the backreference only includes the first letter, but not the second. Finally, we opportunistically match consecutive unpaired letters.
Each such pass creates a new, shorter, string of literals and production rules. We iteratively apply the guided pairing loop until we’re left with a single production rule.
That’s it. Jez’s paper is mostly concerned with showing that we only have to run LZ compression once and with proving suboptimality bounds. The key is that we can convert factors for the input into factors for the output, and that each pass shrinks the string by a multiplicative factor. I’ll simplify things further by recomputing a factorisation (in cubic time!) from scratch in each pass.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 

My implementation is exceedingly naïve and runs in cubic time in the worst case. With a lot more work, the bottleneck (LZ77 factorisation) runs at the speed of sort… but constructing a suffix array would only get in the way of this post. Let’s just see what pairing the code generates.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 

In short, “abcab” becomes “(ab)c(ab)”, “abababa” “(ab)(ab)(ab)a”, and “ababacbac” “(ab)(ab)(ac)()b(ac)”.
So far, we’ve only made pairing decisions. The next step is to translate them into production rules (i.e., functions), and to merge equivalent rules to actually save space.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 

On the examples above, we find:
1 2 3 4 5 6 7 8 9 10 11 

We only have to pair production rules further until the sequence consists of a single rule (or literal).
1 2 3 4 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 

The “call graph” below shows there’s a lot of sharing for such a short string.
I started by writing about compiling program traces, but so far I’ve only been working with strings of characters. Let’s pretend the following PostScript file was written by someone who’s never heard of functions or loops.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 

Phew, programming is hard! That’s a lot of copypaste to produce a simple pattern.
Let’s see what our (simplified) implementation of Jez’s algorithm can do.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 

35 functions that each contain two literals or function calls. Not bad (: It’s an actual SMOP to convert this grammar into a PostScript program.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 

This second PostScript program comprises 211 tokens, rather than the original’s 156, but 105 of those are “{ } def” noise to define functions. Arguably, the number of meaningful tokens was reduced from 156 to slightly more than 100. Crucially, the output remains the same.
Stack programs seem particularly well suited to this rewriting approach: explicit operands (e.g., registers) introduce trivial discrepancies. In a VM, I would consider “compressing” the opcode stream separately from the operands. Compiling only the former to native code would already reduce interpretative overhead, and extracting function calls from the instruction stream should avoid catastrophic size explosions.
There are several obvious deficiencies in the direct LZ77guided
pairing. Just breaking Chomsky normalisation to inline singleuse
function would already help. It might also make sense to specialcase
repetitions as repeat n
loops, instead of hiding that in
\(\log n\) levels of recursion. In a real program, it would also be
useful to inline the bottommost rules so as not to drown in control
flow. The thing is, destroying structure in the name of performance
is well understood compiler tech; recovering it efficiently was the
hard part.
It’s already Google Summer of Code season; like spring, it’s slightly earlier this year. For multiple reasons – including the accelerated schedule and my transition to having a Real Job – I did not manage to guide any of the students who sent in applications. After an initial ranking, we’re left with a fair number of decent proposals for a Summer of SBCL. They’re of comparable quality, and all leave me a bit uneasy. In this post, I’ll try to explain what I look for in GSoC proposals… or, really, in action plans in general.
I care about the quality of GSoC plans for two reasons. First, because it reassures me that the applicant understands the difficulties they’ll have to overcome to complete their project. Second, because, like TDD, a good plan makes it easier to feel when we’re making concrete progress. To me, these two factors are necessary conditions for a good success rate.
We’ll probably allow additional time to update the proposals we have already received, so hopefully this post can be immediately useful. I have a single vote, and other mentors no doubt have different criteria. I only hope that this post can give ideas to applicants looking to review their proposal.
This is a common objection to any form of planning for software projects. I don’t understand where it comes from.
Software development should be easier to plan than most endeavours: we have to deal with relatively few people, and, when we do, it’s rarely in an adversarial setting… and I’m willing to bet most of us can better forecast the behaviour of programs than that of people. There’s currently an electoral campaign in Québec, and I’m certain that, before the date was even announced, every single party had a precise plan for each day, at the riding, regional, and provincial levels. Politicians can do it; why can’t software developers?
Of course, there will be constant adjustments to the initial plan. The reason I think it’s important to spend time and thought on that plan isn’t to stick to it, but primarily because it forces us to systematically consider all sorts of potential issues, challenges, failure modes, etc. This exercise becomes useful when problems crop up: rather than making decisions in the heat of the moment, we can rely on our past (well rested and composed) self’s research, notes, and ideas.
I also find plans tend to become much more handwavy as we get farther in the future. That’s a normal tendency. I think we’re usually pretty good at shortterm planning; most people can cook a meal without sitting down to figure out how they’ll cause awesome chemical reactions to occur without burning their house down. It’s a different matter when it comes to monthlong camping trips or feeding a thousand people.
The difference is one of scale, and it’s exactly when we’re out of our comfort zone that mindful planning becomes useful. I expect an action plan to be as thorough in its last week as in its first. Most of it will be fiction, but, as I wrote above, the thinking involved in writing that fiction is what I’m after.
There are tons of ways to describe a plan. I like this one, but the main thing is what I have in mind when I read each section. For instance, last year’s proposal for register allocation via graph colouring (PDF) had:
It’s a different structure than what I suggest below, but both mostly present the same information. My wish is to feel confident that the student has done the work to have that information.
First, a description of the goal. SBCL offers prefab proposals, and it’s tempting to just copypaste. However, I feel much more confident in the intrinsic motivation of a student who’s made the project their own. Many of our suggestions are skeletons for projects; it should be easy to come up with ways to extend and better specify them according to the student’s interests. Others are already fleshed out; those are a bit harder, but even small variations on the initial theme are encouraging. In all cases, if I don’t feel like the student owns the project, I probably won’t rank the proposal highly, and certainly won’t be inclined to mentor them. The problem statement is a simple way to demonstrate one’s understanding of – and interest for – the project.
Second, a review of the current environment. What currently exists, in SBCL or elsewhere? Can we exploit interesting research or code? Has there been previous efforts, and how/why did they not pan out? Perhaps there’s a relevant postmortem from a similar project that will help understand the problem domain. What potential difficulties do we know of? What don’t we know (known unknowns)? How bad do we estimate unknown unknowns to be, and why? What persons or resources could be helpful? To a certain extent, this is simply demonstrating that one’s done their due diligence. However, it also tells other people (e.g., potential mentors) how they can best work with the applicant. This section should be a good resource to reread when bad things happen.
Third, the suggested course of action at a high level. Not only what would happen in a perfect execution, but, ideally, ways to address (ahead of time or when/if they occur) some of the difficulties and unknowns listed in the previous section.
Fourth, success indicators, a translation of the initial goal description into a few checkboxes. Ideally, the criteria are falsifiable… much like an experiment, with the null hypothesis being project failure. I find that checking even a few of these boxes gives me a good sense of closure on a project.
Finally the calendar, to show how one might execute the third section and satisfy the success indicators in the fourth section. The first few weeks are usually easy to envision, but proposals tend to become fuzzier with time. I’m more confident in proposals that only use one or two week periods, with actions and a few testable for each period. Shorter periods make fuzziness more obvious, but, more importantly, they combine with the milestones to let us tell that we’re (or aren’t) getting traction.
In the unfortunate case that we’re somehow not making the expected progress, frequent milestones means we can tell more quickly that something is off. We only have to make up for a one week setback rather than for a month, i.e., one quarter of the GSoC program. The odds of outright project failure are lower, and we don’t risk feeling (as) down for wasting a month of work.
One thing that I find particularly scary in project calendars are “Magic Happens” steps. This seems more common with longer planning periods. Many proposals are quite detailed with respect to preliminary work and research, and to final integration and documentation. The issue is that most of the work is stashed in a 35 week interval during which the actual project becomes completed. It’s not clear what the student will do to get there (or even who will do the work ;). Planning each week individually makes such steps obvious.
I know that it’s hard to determine how we’ll achieve a goal when we’ve never done something similar before. But we can pretend. I want to feel that an applicant has a clear idea how to execute the task they propose, if only in an artificial perfect world. Otherwise, why should I believe they can do so in reality?
That’s it. I’m told that writers should show rather than tell. I feel the same about plans. I like to be shown how goals could be achieved in ideal and lessthanideal worlds rather than told when steps will be completed. I think the most important point is that I don’t appreciate detailed plans because we should (or can) stick to them. Rather, I believe that the exercise of coming up with such a plan is an efficient way to ensure we can fruitfully respond to what’ll actually happen when we execute it.
The regalloc proposal was one of the longest ones last year, but I felt confident that the student had a good idea of what challenges made the task nontrivial. In the end, we followed only a small part of the plan: it’s hard to come up with a suite of bad regalloc examples, visualisation turned out not to scale to real code, and there was too little time to work on live range splitting. However, every time there was a change, Alexandra had no problem determining how to adjust the schedule and figuring out the next step. On my end, I could easily follow the modified schedule and see that we were making progress.
This planning exercise is not easy, and it’s certainly not quick. (FWIW, I find it useful to get someone else, even without relevant experience, to look at the plan and ask questions.) Action plans have helped me countless times: when I needed to react to problems; when I felt like I was running in circles but could tell for sure that I was making progress; when a project forcibly came to an end but I still had a sense of closure… Hopefully, they can do the same for SBCL GSoC students.
]]>