Paul Khuong mostly on Lisp

rss feed

Mon, 12 Apr 2010

 

The type-lower-bound branch

Nikodemus was intrigued by the type-lower-bound branch I pushed on repo.or.cz a few weeks ago. It’s a convenience hack that wound up being slightly more intricate and interesting than planned: a user was complaining that there was no nice way to muffle some optimisation notes. Take

(lambda (x)  

  (declare (type integer x)  

           (optimize speed))  

  (1+ x))  

 

; in: LAMBDA (X)  

;     (1+ X)  

; ==>  

;   (+ X 1)  

;  

; note: forced to do GENERIC-+ (cost 10)  

;       unable to do inline fixnum arithmetic (cost 1) because:  

;       The first argument is a INTEGER, not a FIXNUM.  

;       The result is a (VALUES INTEGER &OPTIONAL), not a (VALUES FIXNUM &REST T).  

;       unable to do inline fixnum arithmetic (cost 2) because:  

;       The first argument is a INTEGER, not a FIXNUM.  

;       The result is a (VALUES INTEGER &OPTIONAL), not a (VALUES FIXNUM &REST T).  

;       etc.

If the user already knows that the best they can do is declare that x is an integer, there is no way to muffle only the notes that amount to wishing that the type of x was more precisely known.

My first (failed) attempt tagged variables as hopeless (any note mentioning these variables’ types would then be muffled). I forgot why I couldn’t make it work, but I believe it’s because invisible copies are very common, so that transforms manipulated lvars that were mere untagged copies of the tagged variable.

Luckily, we have a well-established mechanism to make properties flow across copies: the type (propagation) system. Again, my first attempt was to create a type for hopelessly vague types. However, types must implement all three set operations (negation, intersection and union), and making that work with such an ad hoc type wasn’t an attractive prospect.

This is where type lower bounds come in. In CL, types, as exposed through declarations, are always upper bounds: they are treated as conservative approximations of the set of values the annotated form, variable, etc. can take. The conservativeness is necessary because the exact static type is generally undecidable. In other words, the meaning of (the integer x) is that x can evaluate to any integer; it could actually only ever evaluate to a fixnum (or any other subset of integers), which is what the note in the original example asks for.

If we had a way to also denote lower bounds (e.g. that x can take (not fixnum) values) on the exact static type of forms, compiler notes could be tested against these lower bounds to determine when the programmer knows that the declared or derived type cannot be improved enough for a transform to fire. In the original example, this would amount to declaring that x’s exact static type is between fixnum (exclusively) and integer (inclusively), or, equivalently, that x will always be an integer, and will sometimes not be a fixnum. The note is then obviously moot, and can be muffled by the compiler.

Unlike ad hoc “hopeless” types, intersection and union of lower bound types (really, range of types, since lower bound types are always accompanied by an upper bound) are straightforward and theoretically sound. The only issue is with type negation. Since this is all for convenience, I decided to just punt and drop the lower bound before negating.

The branch, as pushed on repo.or.cz, seems to be working. In order to keep changes to a minimum, regular types are treated as having an implicit lower bound of nil, and range types (with a non-trivial lower bound) are aggressively converted to regular types. This gives the muffling effect for some interesting simple cases, and reverts to the old behaviour very quickly. There are probably hidden bugs (both in the code and in the design), but since they could only be triggered by using the extension, I’m not too worried.

N.B. This isn’t actually useful for compilation speed

I originally thought type lower bounds could be useful to improve compilation speed: by keeping around both lower and upper bounds, we are able to overapproximate types even in the presence of type negation. Once I implemented a quick prototype, ir1-widening, I realised we don’t need lower bounds at all: we only have to make sure we only ever approximate types once we’re sure they’ll never be negated. Actually, what would be even more useful is a way to compute approximate unions and intersections quickly, instead of widening types after the fact.

posted at: 20:44 | /Lisp | permalink

Mon, 28 Dec 2009

 

Constraint sets in SBCL: preliminary exploration

Part of what makes SBCL’s static analyses powerful is their flow-sensitivity: the system can express the fact that something is true only in one of the branches of a conditional. At the heart of that capability are sets of constraints that represent that knowledge; e.g. (eql x y) to assert that x and y hold the same value in a given branch.

Until September 2008, we implemented these sets as specialised hash tables. In commit 1.0.20.4, Richard M. Kreuter replaced the hash tables with bit vectors. On most workloads, the change was beneficial, if only because the compiler now spends less time in GC. However, in some cases (I would expect, with a large number of variables and conditional branches), users have reported severe performance regressions (e.g. https://bugs.launchpad.net/sbcl/+bug/394206), regarding not only compilation times, but also memory usage.

Exploring alternative representations for constraint sets (consets) looks like good way to get started with SBCL: it’s interesting as a data structure challenge, involves “real world” engineering trade-offs, and will have an immediate observable impact on the performance of the compiler. I’ve said as much before. Maybe this post will spur some interest.

As a first step, I instrumented the conset implementation to output some profiling information on each operation: the size of the constraint universe (approximately, the number of unique constraints in the compilation unit), and, for each conset argument, the size of the bit vector and the size of the set. I’m only using a random sample of 1% of the output from a full SBCL build.

Before graphing out histograms for the population and size of the consets, I counted the operations that were actually performed to see what was important:



operation frequency (%)


member 53
do 19
adjoin 14
copy 4.4
do-intersect 3.1
equal 2.7
union 1.4
empty 0.98
intersection 0.88
difference 0.38


All the operations should be obvious, except maybe do, which iterates over the constraints in the set, do-intersect, which iterates over the intersection of two sets, and empty, which tests for emptiness.

Of the 53% calls to member, 38% come from do-intersect, and another 14% from adjoin. There are actually very few calls to member per se. A large number of the calls to member returned T: 29%. That’s unfortunate, since failure is much easier to fast path. do is also slightly over-represented, since do-intersect also count as calls to do.

Graphs galore

PIC

The spikes are powers of 2, and the last non-zero value is 4096. Kreuter reported that the largest universe he’d seen was 8k elements in an email. We don’t really have to worry about enormous universes except to avoid disastrous blow-ups.

PIC

PIC

PIC

We can also see that the representation as bitvector isn’t very dense. However, keep in mind that a bit is much smaller than a word. It takes very sparse sets for alternatives such as hash tables to save space.

PIC

The very long tails here means that while there are some fairly large consets, the vast majority of the operations only touch sets that have less than a half dozen elements.

PIC

PIC

PIC

That’s also true for calls to member, but even more for adjoin and empty.

PIC

The profile for binary operations (union, intersection, difference and equal) is very similar to that of adjoin. In detail, we have:

PIC

PIC

PIC

PIC

equal only shows a heavier tail. intersection and difference, however, both peak around 20 elements, instead of 1 to 3. Finally, we can see that union is pretty much a non-issue.

It’s also interesting to see the difference between the size of the largest and smallest argument for binary operations.

PIC

PIC

The heavier tails on binary operations probably come from the large arguments, while the smaller arguments seem to follow the overall distribution.

Finally, iteration through the set’s elements (do or do-intersect) is interesting because the ratio between the set’s population and the bit vector’s size is a good indicator of that operation’s performance. For do-intersect only the set that’s iterated over is counted (not the one on which member tests are then executed).

PIC

PIC

PIC

Making sense of the data

I’m not sure what information we can extract from this data, and especially not how to extrapolate to the cases reported by users who experience serious regressions.

One obvious way to speed do up is to work with vectors of fixnum or machine words, and do the rest ourselves. That would help with long spans of 0s, and would simplify the inner loop. We could do something similar for empty for implementations that don’t special-case FIND on bit-vectors (e.g. SBCL).

This does nothing to reduce memory usage. We probably want to switch to another representation for really sparse sets, a hash table or a (sorted) vector. But, if we do that, we run the risk of switching to a sparse implementation only because we have constraints with very high and very low indices, but nothing in the middle. A fat and shallow (two-level?) tree would help avoid that situation, and let us choose locally between a few representations.

posted at: 22:16 | /Lisp | permalink

Mon, 29 Jun 2009

 

Complex float improvements for sbcl 1.0.30/x86-64

Complex single- and double- floats used to be represented as pairs of SSE registers on the x86-64 port. Doing so simplified the porting work, since more code could be shared with other backends. Representing complexes packed in SSE registers as would be obviously preferable was left for the future. Luckily, most of my work on SSE intrinsics is directly applicable to that task: all the SSE{1,2} instructions were finally defined, and modifying a pre-existing primitive object type is much simpler than adding a new one.

The initial work was mostly straightforward, since, by default, the only operations on complexes are moves (to/from registers, the stack/arguments, vectors, or the heap [as boxed objects]), creation from scalar components, and extraction of components. The modifications mostly entailed modifying the way register allocation manipulates complexes (they only need a single SSE register, like scalar values), and making the boxed representation more SSE-friendly (alignment, packing single floats for MOVQ). There's only one problem I can remember in that part of the code: the stack isn't guaranteed to be aligned on 16 byte boundaries! That means that I'm forced to use MOVUPD when moving complex double floats between the stack and an SSE register.

Already, using half the registers and half the memory accesses is interesting. The real pay-off with packed complexes, however, is the ability to use vector instructions in arithmetic operations. The obvious cases are addition and subtraction (both complex-complex and complex-real), which exhibit natural horizontal parallelism: each pair of components (real-real or imaginary-imaginary) can be processed separately. For complex-real multiplication or division, things are only slightly more complex: the real operand has to be duplicated before operating in parallel. I encountered two important bugs in that code.

First, scalar values were still passed around with narrow register-register moves (movsd or movss). When we only ever manipulated scalars, that wasn't an issue. Complex-real addition and subtraction, however, really want to exploit the fact that the high part of a scalar value is filled with 0. That's not the case when a narrow move is used to move a single float value into a register that used to hold a double... Using full register moves instead (for both complex and scalar values) not only made the code correct, but also improved performance, due to their being simpler for the SSE pipeline to handle.

Second, packed single float division executes four divisions, not two. Thus, it's not enough to duplicate the real operand in a complex-real division before executing a packed division: depending on the exception mask, the 0.0/0.0 in the higher half of the result may trigger an arithmetic exception. The simplest solution I found was to perform the same operation in both halves of the operands. Obviously, an exception will be signalled iff operating only on the lower half would have signalled too. The cost is one additional shuffle step for the complex value: the real value always has to be duplicated, and the high half of the result must be reset to 0 anyway.

That leaves complex-complex multiplication, and complex-complex/real-complex division. Complex multiplication is a fairly common operation; common enough for Intel to pretty much dedicate an instruction to that operation (ADDSUBP{S,D}) in SSE3. Even when restricting ourselves to SSE2, the code is fairly simple. The most unnatural part is computing the conjugate, given the absence of ADDSUB (more on that later). Complex division, however, is a much more complicated beast. I decided to stick to an implementation in CL, but rewritten to use block (complex-at-a-time) instead of component-wise arithmetic. In the end, it still wanted an additional operation: swapping the real and imaginary components of a complex.

Finally, that leaves non-arithmetic operations, like equality tests (both bitwise and as numbers), conjugate and negation. Once there's a guarantee that the unused components in a SSE register are 0-filled (a guarantee that is supported by most load-from-memory instructions), testing for equality is very straightforward. Computing the conjugate of a complex or negating one, on the other hand, isn't so nice. One look at the way floats are laid out in register shows that they may be implemented as simple bit-wise operations. That leaves the problem of loading the relevant constants in. That's where a later patch comes in.

Negating and taking the absolute value of a real requires constants for bit-wise operations, like conjugating or negating a complex. Until now, SBCL generated code that loaded the constant in a GPR, and then moved it from the GPR into an SSE register. Generating the constant usually required some shifting (often slow), and moving data between the GPR and SSE register files takes a relatively long time on current microarchitectures. Loading such constants from memory directly into SSE registers would be preferable. However, by the time assembly code is generated, it's too late to add constants, and some of these constants would have to be boxed (be it as fixnums or bignums). In other words, both not feasible and not always a good idea.

A second patch adds support for unboxed constants stored inline, after the executable code. The system must already support code relocation, so the moving GC isn't an issue more than it already is (RIP-relative addressing on x86-64, and post-GC fixups on x86). I haven't bothered to look at the situation for other platforms yet, but, with some luck, things will be similarly simple. Storing constants in the code segments was already possible, with a small bit of hacking. What the patch mostly offers is the ability to pack the constants together. The advantage is that they can be stored far enough away from code (one cache line) to not confuse caches, and that a more global outlook allows the code generator to merge identical constants together and to reorder them to guarantee sufficient alignment without wasting too much space.

With that infrastructure in place, it was easy to generate much simpler and faster code for the previous bitwise operations on floats. It also became interesting to store some unboxed constants inline. On x86-64, I implemented that for constant floats (both real and complex) and fixnums that fit in registers, but can't be assembled as immediates (so +/- 2^29 and wider). On x86, fixnums can always be assembled as immediates, so only floats are stored as unboxed inline constants. Float constants are interesting to store inline because they're frequently found in speed-oriented code. Wider-than-fixnum constants, on the other hand, not so much. There's also a certain trade-off to consider: unboxed constants cause additional consing (and code bloat) each time they must be boxed.

Unboxed constants open the door for another set of improvements for x86oids: many instructions can load one operand directly from memory. Using that feature may save a register, and always makes the code smaller. The latter can often have a heavier impact on performance than the former. Using fewer registers mostly matters when the compiler has to spill values; that does not happen so often in inner loops on the x86-64. Smaller code, on the other hand, helps with memory (fits better in cache, needs less bandwidth) and reduces the pressure on the instruction decoder. Finally, when a constant shows up as an operand, it pretty much always pays off to store is inline, so even wider-than-fixnum integer constants are stored inline in that case.

The union of these changes transparently speeds up most complex float operations by up to ~100% on my Core 2. In the current version of Bordeaux-FFT, that translates into a 110% speed-up for 1024-point FFTs. Scalar operations also show significant improvements. Nikodemus Siivola's implementation of the (broken) mandelbrot routine at http://random-state.net/log/3452921796.html shows a 55% speed-up for hand-rolled complex arithmetic (within 20% of the execution time of the fastest g++ version). The complex-ful version shows a lesser speed-up, 38%, probably because a good portion of the runtime is spent operating element-wise to compute the 2-norm of complexes. If you have performance-sensitive code that performs a lot of floating point computations, you might be pleasantly surprised by this month's release.

P.S. Considering the size of these patches, I would probably be even more pleasantly surprised if they were bug-free. Bug reports (to sbcl-bugs, or, more realistically, directly to sbcl-devel) are always welcome, especially during the code freeze period starting today!

posted at: 02:00 | /Lisp | permalink

Sun, 22 Mar 2009

 

Hacking SSE Intrinsics in SBCL (part 1)

I think I managed to add sane support for SSE values (octwords) and operations to SBCL this weekend (all that because of a discussion on PRNGs which lead to remembering that SFMT wasn't too hard to implement, which lead to SSE intrinsics). It wasn't too hard, but clearly the path could have been better documented. The interesting part is the creation of a new primitive data type in the compiler; once that is done, we only have to define new VOPs. Note that since I'm only targetting x86-64, alignment isn't an issue: objects are aligned to 128 bit by default.

The compiler has to be informed of the new data type at two levels:

  • The front/middle -end, to create a new type (as in CL:TYPEP), and to define how to map from that type to the primitive type (which is more concerned about representation than semantics) used in the back-end;
  • The back-end, which has to be informed about the existence of a new primitive type, and must also be modified to correctly map that primitive type to machine registers or stack locations, and to know how to move from one to the other or how to box such values in a GC-friendly object on the heap.

Obviously the runtime was also be modified to be able to GC the new type of objects.

I believe it makes more sense to do this by starting low-level, in the back-end, and then building things up to the front-end, so I'll try and explain it that way.

Hacking the machine definition

SBCL's backend has a sort of type system at the VOP / TN level (virtual operations/registers). Two elements of type information are associated with each TN: the primitive type and the storage class. As its name implies, the primitive type is a low-level, C-style type (e.g. SIMPLE-ARRAY-UNSIGNED-BYTE-64 or DOUBLE-FLOAT), which is almost entirely concerned with representation. Apart from bad puns, there is no subtyping; COMPLEX and COMPLEX-DOUBLE-FLOAT are disjoint types. However, that still leaves some leeway to the back-end. A DOUBLE-FLOAT may be stored in a FP register, on the stack, or as a boxed value. Thus, TN are also assigned a storage class before generating code.

The first step was to define new storage classes for SSE values: sse-reg and sse-stack for SSE values in XMM registers and on the stack, respectively. That's done in src/compiler/x86-64/vm.lisp, !define-storage-classes:

  (sse-stack stack :element-size 2 :alignment 2)

[...]

  (sse-reg float-registers
           :locations #.(loop for i from 0 below 15 collect i)
           :constant-scs ()
           :save-p t
           :alternate-scs (sse-stack))

The first form defines a new storage class (SC) that's stored on the stack (the storage base, SB), where each element takes two locations (in this case, 64 bit words) and requires an alignment of two locations. The second form defines a new storage class that uses the float-registers storage base, and may take any of the first 15 locations in that SB (xmm15 is reserved). Values in that SC must be saved to the stack when needed. The sse-stack SC is to be used when there aren't enough registers or when saving registers (e.g. for a call). Some more modifications were needed in the assembler to make it aware of the new SC, but that's a mostly orthogonal concern.

Adding a new primitive type

That's enough to define a new primitive type in src/compiler/generic/primtype.lisp:

(!def-primitive-type sse-value (sse-reg descriptor-reg))

sse-value can be stored in sse-reg or descriptor-reg (as boxed values), or in any of their alternate SC.

Defining a new kind of primitive object

I've defined a new primitive type that can be stored as a boxed value. However, I haven't yet defined how that boxed value should be represented. I allocated one of the unused widetags to sse-values in src/compiler/generic/early-objdef.lisp:

  #!-x86-64
  unused02
  #!+x86-64
  sse-value                         ; 01100010

That's not necessary, but I would then have to define my own typechecking VOPs, adapt the GC somehow, etc. It will define a new constant, SB!VM::SSE-VALUE-WIDETAG. I use it in the primitive object definition (src/compiler/generic/objdef.lisp):

(define-primitive-object (sse-value
                          :lowtag other-pointer-lowtag
                          :widetag sse-value-widetag)
  (filler) ; preserve the natural 128 bit alignment
  (lo-value :c-type "long" :type (unsigned-byte 64))
  (hi-value :c-type "long" :type (unsigned-byte 64)))

The macro will also define useful constants, e.g. SSE-VALUE-SIZE, as well as slot offsets for the low and high values. That information will be needed by the GC. Genesis only exports CL constants to C when the symbols are external to the package. Thus, I had to add some symbols to the export list for SB!VM in ./package-data-list.lisp-expr:

               #!+x86-64 "SSE-VALUE"
               #!+x86-64 "SSE-VALUE-P" ; will be defined later on
               #!+x86-64 "SSE-VALUE-HI-VALUE-SLOT"
               #!+x86-64 "SSE-VALUE-LO-VALUE-SLOT"
               #!+x86-64 "SSE-VALUE-SIZE"
               #!+x86-64 "SSE-VALUE-WIDETAG"

and similarly for SB!KERNEL:

               #!+x86-64
               "OBJECT-NOT-SSE-VALUE-ERROR" ; so will this

Adapting the GC

I chose a very simple representation for sse-values: the header word will contain the right widetag, obviously, and the rest will encode the size of the object (in words). That's a common scheme in SBCL, and well supported by generic code everywhere. The garbage collector (a copying Cheney GC) has three important tables that are used to dispatch to the correct function given an object's widetag: scavtab to scavenge objects for pointers, transother to copy objects to the new space and sizetab to compute the size of an object. They're all initialised in src/runtime/gc-common.c, gc_init_tables. The representation is standard, so I only had to add pointers to predefined functions, scav_unboxed, trans_unboxed and size_unboxed:

#ifdef SSE_VALUE_WIDETAG
    scavtab[SSE_VALUE_WIDETAG] = scav_unboxed;
#endif

[...]

#ifdef SSE_VALUE_WIDETAG
    transother[SSE_VALUE_WIDETAG] = trans_unboxed;
#endif

[...]

#ifdef SSE_VALUE_WIDETAG
    sizetab[SSE_VALUE_WIDETAG] = size_unboxed;
#endif

Adding utility VOPs to the compiler

That's enough code to be able to use the sse-value primitive type, use our new storage classes in VOP definitions, and pass boxed sse-values around without crashing the GC. The first VOPs to define are probably those that let the compiler move sse-values around, from an sse-reg, sse-stack or descriptor-reg to another. Not all VOPs in the cartesian product must be defined; the compiler can figure out how to piece together move VOPs to a certain extent.

(define-move-fun (load-sse-value 2) (vop x y)
  ((sse-stack) (sse-reg))
  (inst movdqa y (ea-for-sse-stack x)))

(define-move-fun (store-sse-value 2) (vop x y)
  ((sse-reg) (sse-stack))
  (inst movdqa (ea-for-sse-stack y) x))

are enough to define how to move between the stack and xmm registers (ea-for-sse-stack is a helper function that generates an effective address from an sse-value 's location in the compile-time stack frame).

(define-vop (sse-move)
  (:args (x :scs (sse-reg)
            :target y
            :load-if (not (location= x y))))
  (:results (y :scs (sse-reg)
               :load-if (not (location= x y))))
  (:note "sse move")
  (:generator 0
    (unless (location= y x)
      (inst movdqa y x))))
(define-move-vop sse-move :move (sse-reg) (sse-reg))

provides and registers code to move from one xmm register to another.

(define-vop (move-from-sse)
  (:args (x :scs (sse-reg)))
  (:results (y :scs (descriptor-reg)))
  (:node-var node)
  (:note "sse to pointer coercion")
  (:generator 13
     (with-fixed-allocation (y
                             sse-value-widetag
                             sse-value-size
                             node)
       (inst movdqa (make-ea :qword
                             :base y
                             :disp 1)
             x))))
(define-move-vop move-from-sse :move
  (sse-reg) (descriptor-reg))

(define-vop (move-to-sse)
  (:args (x :scs (descriptor-reg)))
  (:results (y :scs (sse-reg)))
  (:note "pointer to sse coercion")
  (:generator 2
    (inst movdqa y (make-ea :qword :base x :disp 1))))
(define-move-vop move-to-sse :move (descriptor-reg) (sse-reg))

will be used to move from an xmm register to a boxed representation and vice-versa.

Finally,

(define-vop (move-sse-arg)
  (:args (x :scs (sse-reg) :target y)
         (fp :scs (any-reg)
             :load-if (not (sc-is y sse-reg))))
  (:results (y))
  (:note "SSE argument move")
  (:generator 4
     (sc-case y
       (sse-reg
        (unless (location= x y)
          (inst movdqa y x)))
       (sse-stack
        (inst movdqa (ea-for-sse-stack y fp) x)))))
(define-move-vop move-sse-arg :move-arg
  (sse-reg descriptor-reg) (sse-reg))

defines how arguments are loaded before calling a function.

Creating a new CL type and associated functions

Now that pretty much all the groundwork has been done in the backend, it's time to inform the frontend about the new data type. The system's built-in classes are defined in src/code/class.lisp, right after (defvar *built-in-classes*). I only had to insert another sublist in the list of built-in classes.

     (sb!vm:sse-value
      :codes (#.sb!vm:sse-value-widetag))

That takes care of defining a new class for the middle-end, and of mapping the class's name, in the front-end, to the class object in the middle-end.

SBCL tries pretty hard to always provide a safe language by default, so I have to make sure the type sse-value can be checked. First, a new error kind is defined in src/code/interr.lisp:

(deferr object-not-sse-value-error (object)
  (error 'type-error
         :datum object
         :expected-type 'sb!vm:sse-value))

The mapping from internal error number to a meaningful message is specified in src/compiler/generic/interr.lisp, define-internal-errors: (object-not-sse-value "Object is not of type SSE-VALUE."). Since I use a normal representation, I can use preexisting machinery to define the type checking VOPs in src/compiler/generic/late-type-vops.lisp:

(!define-type-vops sse-value-p check-sse-value sse-value
    object-not-sse-value-error
  (sse-value-widetag))

That only creates a VOP for sse-value-p; we'd sometimes like to have a real function. That's created in src/code/pred.lisp, with (def-type-predicate-wrapper sb!vm:sse-value-p). Moreover, when a VOP is defined as a translation for a function, that function must be defknown ed to the compiler. I do that in src/compiler/generic/vm-fndb.lisp, with (defknown sb!vm:sse-value-p (t) boolean (foldable flushable)).

TYPEP must also be informed to use that function. That's set up in src/compiler/generic/vm-typetran.lisp, with (define-type-predicate sb!vm:sse-value-p sb!vm:sse-value).

Making the middle and the back meet

The modifications above added a new primitive type to the back-end, along with some machinery to represent and manipulate values of that type. They also added a new built-in class to the front and middle -end and some type-checking code for that new class. The only thing left is to make sure the new class is mapped to the correct primitive type. The default is to map everything to the primitive type T, a boxed value. That's done at the bottom of src/compiler/generic/primtype.lisp, in primitive-type-aux. I only had to modify the case of translating built-in class(oids) to primitive types: sse-value are treated like complex, function, system-area-pointer or weak-pointer. The class sse-value is mapped to the primitive type sse-value. That way, a function that is defknown to, e.g., take an argument of type sse-value (the class) can be translated by a VOP that takes an argument of primitive type sse-value, which can be stored in an sse-reg (an xmm register).

Now what?

We have the data type definitions. To make them useful we still have to define a lot of functions and VOPs (and SSE instructions). However, that's much closer to regular development and doesn't require as much digging around in the source. I'll leave that for another post.

posted at: 18:25 | /Lisp | permalink

Mon, 02 Jun 2008

 

Yet another way to fake continuations

The developers of the COMET constraint programming library faced an interesting challenge when they decided to provide means to distribute computations. Their search model is, at the bottom level, based on continuations (implemented by copying the C stack in the sequential or threaded cases). To distribute work, they had to come up with a way to send continuations across the wire. Obviously, a copy of the stack isn't likely to work here, with pointers to the heap, randomised address space, multiple architectures, etc. Instead, they send a log of the non-deterministic choices taken so far, in order to make the same choices on the receiving machine (and then, hopefully, different ones).

"Continuations represent the rest of the computation" is a common way to try to explain continuations. One approach to represent the rest of the computation is simply to store all the actions performed on the way to getting there. It is easy to see that only actions that result from non referentially-transparent code must be stored. In COMET's case, that means the values returned at choice points (the program is assumed to be otherwise referentially transparent).

The approach is much more practical than it sounds: in a good constraint solver, most of the work is done outside the search itself, to propagate information through the constraint graph. Thus, to avoid the vast majority of recomputations, it suffices to send the constraint graph along with the 'continuation'. In general, it would also be possible to treat computationally expensive operations as non RT and save their results... they do take an observable amount of resources to evaluate after all.

In some ways, stateful network interactions are similar. While they do have to somehow save and restore their computational state, most computationally expensive code is completely executed between receiving a request and sending a response back. The results of these computations are also easily serialised. On the other hand, code that must use continuations (because it waits for responses) is often nearly pure IO and flow control, directing the data to and from code that does the heavy lifting. In other words, if, as Cox wants us to, we instead used a state machine, we'd have a simple state machine with a couple fat states.

One problem with this approach is that restoring a continuation takes time proportional to the number of actions previously executed. This can be alleviated by logging the result of forms that execute many actions; once the forms have been fully evaluated, there is no need to save or replay the intermediate actions. Still, that's not very good when suspending a potentially infinite loop. I see the approach more as a way to implement the part of the aforementioned state machine that is painful to do by hand (with many sequential or nested states). For the outer, driving loop, I would stick to a real state machine.

A toy implementation of such continuations would look like:

(defconstant +in-eval+ '+in-eval+
  "Used as a flag for calls that have been captured
before completion")
(defparameter *action-log* ()
  "Stack of executed actions.")
(defparameter *replay* ()
  "Stack of actions to replay.")
(defun log-action (action)
  (push action *action-log*)
  (values-list (first action)))

(defmacro /log (&body body)
  `(if (and *replay*
            (if (eq (car *replay*) +in-eval+)
                (prog1 nil
                  (pop *replay*))
                t))
       (values-list (pop *replay*))
       (log-action (multiple-value-list
                    (let ((*action-log* (cons +in-eval+
                                              *action-log*)))
                      ;; rebind, since we don't need to replay
                      ;; intermediate actions if we can simulate
                      ;; the complete evaluation of body
                      ,@body)))))

(defun call-with-log (fn &rest args)
  (let ((*replay*     ())
        (*action-log* ()))
    (apply fn args)))

(defun capture-log ()
  (reverse *action-log*))

(defun replay-log (log fn &rest args)
  (let ((*replay*     log)
        (*action-log* ()))
    (apply fn args)))

With these, we can implement a simple backtracking search.

(defparameter *search-states* ()
  "Stack of states to explore.")
(defparameter *next-choices* ()
  "List of choices to take in the first uncommitted choice point")
(defun fail ()
  (throw 'fail nil))

(defun choose (&rest choices)
  (/log                  ; capture itself is non RT!
    (when *next-choices* ; override the choices if we must
      (shiftf choices *next-choices* nil))
    (cond ((null choices)
           (fail))
          ((null (rest choices))
           (first choices))
          (t
           (push (cons (rest choices) (capture-log))
                 *search-states*)
           ;; this call to choose still hasn't returned in the
           ;; captured log, so will be entered when replayed.
           (first choices)))))

(defun execute-search (fn &rest args)
  (let ((*search-states* ())
        (count           0))
    (flet ((body ()
             (incf count)
             (catch 'fail
               (multiple-value-list
                (apply fn args)))))
      (call-with-log #'body)
      (values (loop while *search-states*
                    for (*next-choice* . log) = (pop *search-states*)
                    for values = (replay-log log #'body)
                    when values collect values)
              count))))

Finally, this can be used to stupidly search for pythagorean triples:

(defun iota ()
  (let ((n (/log
             (format t "How many integers? ")
             (parse-integer (read-line)))))
    (loop for i below n collect i)))

(defun pythagorean-triples ()
  (let* ((nats (/log (iota)))
         (x    (1+ (apply 'choose nats)))
         (y    (1+ (apply 'choose nats)))
         (z    (1+ (apply 'choose nats))))
    (if (and (< x y)
             (= (* z z) (+ (* x x) (* y y))))
        (values x y z)
        (fail))))

(execute-search #'pythagorean-triples)

This will prompt, once, for an integer, n, and then find all the x, y, z in [1, n] such that z^2 = x^2 + y^2. Note how the two non-referentially transparent operations, the IO and choose, are wrapped in /log. What happens if we replace the /log in iota with a progn? The program prompts for a number at every replay (continuation invocation). What about removing the /log around the call to iota? With /log still around the IO, iota is referentially transparent (but most probably not pure), so we can't easily see any difference. However, /log still ensures it is only called once, instead of at every replay, sensibly affecting how much the program conses.

The logging code manipulates actions, not just return values. In fact, it already manipulates two types of actions: entering an expression, and returning from it. In general, it is possible to track arbitrary effects of the logged expressions (e.g., assignment to variables), as long as the effects can be replayed.

In the context of suspending and serialising a computation's state between communications, this approach seems especially interesting. The disadvantages (slow replay of long journals, potential replay of expensive operations) can be worked around, by using continuations where they have the most to offer and by logging expensive operations when possible. More importantly, the model is simple to understand, as are the performance consequences and the way to address them. It also does not itself depend on serialising closures. Finally, it is hard to misuse! Only specific operations (those that affect global state or otherwise communicate with the outside world) have to be specially annotated for correctness; all the rest is an optimisation. The control flow isn't restructured, so third-party functions can be safely used, as long as they're referentially transparent... even higher-order functions!

posted at: 23:50 | /Lisp | permalink

Thu, 15 May 2008

 

An Impure Persistent Dictionary

A persistent data structure is one in which it is possible to revert to previous versions of the structure. A simple way to achieve that is simply to always work with copies when modifying values. More commonly and realistically, purely functional variants are used. There is then no need for bulk copying; only local rewriting. Old data is never overwritten, but rather implicitly not referenced by updated versions.

For tree-like structures with small nodes, there often is a straightforward purely functional variant. However, for large flat structures such as arrays, things aren't as obvious. With some cleverness, one can find alternative ways that are not pure, but still persistent, without having to fully copy old versions either (which could be considered pure, modulo linearity). A classic (and unfortunately little known) approach is described in Baker's "Shallow Binding Makes Functional Arrays Fast".

Shallow binding is a technique that was originally developed to implement dynamic scoping. It is in some ways the dual of the natural technique of deep binding, in which bindings are stored in an alist (or, rather, an associative stack, since bindings are only pushed and popped). In deep binding, creating a new a binding is simple (constant time), but lookups must traverse the alist to find a match. In shallow binding, the "associative stack" is used to store not the new binding, but the previous value of that binding. Active (current, non-shadowed) bindings are stored in a flat array, in which each symbol is uniquely allocated a known index. Thus shadowing is still in O(1), but lookups are now also done in constant time! (A disadvantage is that switching contexts, e.g. between green threads, isn't in O(1) as in deep binding.)

As Baker points out, the technique of using a side-effectful data structure and logging enough information to reverse the side-effects can also be used for persistent arrays or, here, dictionaries.

The simple cases are accesses to the latest version. Reading from the latest version simply queries the associated hash table directly. Writing to the latest version creates a new version, in which the overwritten key-value association is saved, and updates the hash table.

The interesting case is when older versions are accessed. The approach suggested by Baker "reroots" the tree of versions (one vertex per version): the tree is rewritten as though the version we want to access was the latest. In order to do that, each version points to the next version (not the previous, parent, version). It makes sense to point to (at most) a single child version because we reroot before writing: each time we create a new child version from an old parent version, rerooting to the parent ensures that the latter has no child (until one is created). Note that since versions point to newer ones, old unreferenced versions are not artificially kept alive by references from newer versions.

Rerooting can be executed in two phases. First, reverse the single-linked list that begins at the version to which we wish to reroot and ends at the very latest version. To make these changes globally visible, the in-place reversal algorithm should be used. Second, walk the reversed list to undo the changes; these undoings themselves should be logged in place of the previous undo information.

Thus, with persistent tables represented by structures such as

(defstruct ptable
  (key        +unbound+
              :read-only t)
  (prev-value +unbound+)
  (next       (make-hash-table))) ; either the *child* ptable
                                  ; or the backing hash table

the reversal can be done with the classic algorithm

(defun reverse-ptable-list (ptable-list)
  (declare (type ptable ptable-list)
           (optimize speed))
  (labels ((inner (list next)
             (if (ptable-p list)
                 (inner (shiftf (ptable-next list) next)
                        list)
                 (values next list))))
    (multiple-value-bind (reversed-list terminal)
        (inner ptable-list ptable-list)
      (setf (ptable-next ptable-list) terminal)
      (values reversed-list terminal))))

and the undoing pass with a simple traversal (table-value is pretty much like gethash, but associates values of +unbound+ to absent entries)

(defun undo-changes (change-list table)
  (declare (type ptable change-list)
           (type hash-table table)
           (optimize speed))
  (do ((change change-list next)
       (next   (ptable-next change-list)
               (ptable-next next)))
      ((not (ptable-p next))
       table)
    (rotatef (ptable-prev-value change)
             (table-value table
                          (ptable-key change)))))

Finally reroot becomes a very short

(defun reroot (ptable)
  (declare (type ptable ptable))
  (when (ptable-p (ptable-next ptable)) ; fast-path the
    (multiple-value-call #'undo-changes ; pre-rooted case
      (reverse-ptable-list ptable)))
  ptable)

reroot lets us reduce any case to the trivial case. A read can simply reroot the persistent table that is read, and then access the associated hash table (ptable-next). A write may reroot the ptable to update, record the previous value in a new ptable node, update the associated hash table, and set the next slot of the updated ptable to the new ptable and that of the new ptable to the hash table.

This logging approach has the significant advantage of being mostly generic (the only specialised part is the representation of undo information) and of passing most of the coding and algorithmic complexity to a normal side-effectful data structure. Such data structures have been extensively studied for decades, unlike purely functional ones, which currently represent a sensibly smaller niche. It is likely that the regular (side-effectful) structure uses simpler and more efficient algorithms than the closest purely functional alternative. In an impure language like Common Lisp, the former is also more likely to have already been implemented, either in a library or by the implementation.

For example, on SBCL 1.0 (FreeBSD/1.8 GHz O265), inserting all the integers from 0 to 2^20 in an eql hash table takes 1.128 seconds (0.204 sec of GC) and conses 165 MB. Wrapping an identical hash table in a shallow binding scheme to make it persistent makes that go to 4.362 seconds (2.89 sec of GC) while consing 215 MB. Reads take almost twice as much time in the persistent version (but still very little). While hash table performance has greatly improved in newer versions of SBCL, that should only make the relative difference for writes even greater. It is easy to see that most of it is due to additional GC pressure.

Unfortunately, there are limits to the amount of allocation/GC overhead that can be shaved from the persistent version (in the general case): enough information to recover any arbitrary version must be preserved! Ideally, linearity (single reference count) would be exploited to reuse storage implicitly instead of allocating a fresh ptable while rendering another one dead. Barring that, however, it seems likely that a good part of the overhead on writes for the shallow binding version is simply due to the nigh unavoidable need to preserve more information (and thus allocate more memory) for persistence. That much (all, when only the latest version is referenced) of that information is short-lived is a good thing: it means that the space overhead is within a constant factor of the original data structure. Moreover, that is exactly the sort of workloads generational GCs are designed to exploit.

It may be possible to design a purely functional data structure with the same complexity on reads and writes as hash tables or the persistent hash tables described above. However, rarely is anything free. In some ways, it seems to me that one must often pay for the implicit O(1) version switch of purely functional data structures with additional complexity in other operations.

There are some issues with the naive technique described here. The current data structure isn't thread safe, nor is it obvious how to make it so. I believe it is be possible to achieve thread-safety by associating a lock with the hash table. Whenever one wishes to operate (reroot or update) on a backing hash table or associated ptable nodes, the lock must be taken. That leaves the problem of "ping-ponging" between versions. If a program alternatively operates between two versions of the hash table that are far apart, the constant rerooting will tend to dominate times. A common way to address that is to introduce a probability of making a copy of the hash table (instead of rerooting back and forth) at each reroot. To keep (amortised) constant time writes and reads, the probability of making a copy should be inversely proportional to the number of entries in the hash table.

The barely tested source code used to perform the measurements may be found at http://www.discontinuity.info/~pkhuong/persistent-table.lisp. As usual, it is distributed under a modified BSD license.

posted at: 22:34 | /Lisp | permalink

Mon, 07 Apr 2008

 

Malmesure, quand tu nous tiens

I would say that the most important thing I learned during my short stint in health science is that quality experiment plans are essential. It's true in the most empirical of (pseudo-)sciences, and also in that other pseudo-science, benchmarking. We have less problems with sample size and getting accurate numbers. The fundamental question still remains: are we really measuring what we believe we are?

Leslie Polzer found the following results, comparing his seqrnd and my random-shuffle:

Here are the results, comparing Paul Khuong's
RANDOM-SHUFFLE with SEQRND.

[...]

(time (benchmark #'random-shuffle))
(time (benchmark #'seqrnd))
CLISP 2.41 (interpreted):
Real time: 25.72314 sec.
Run time: 10.665971 sec.
Space: 86264000 Bytes
GC: 164, GC time: 0.669959 sec.

Real time: 72.53122 sec.
Run time: 29.028109 sec.
Space: 79356432 Bytes
GC: 152, GC time: 0.5233 sec.

Bottom line: RANDOM-SHUFFLE 3x faster,
SEQRND slightly less space.
CLISP 2.41 (compiled)

Real time: 19.819485 sec.
Run time: 8.232797 sec.
Space: 86016072 Bytes
GC: 164, GC time: 0.663291 sec.

Real time: 28.287846 sec.
Run time: 10.196001 sec.
Space: 79108984 Bytes
GC: 151, GC time: 0.519969 sec.

Bottom line: RANDOM-SHUFFLE 1.35x faster,
SEQRND slightly less space.
SBCL 1.0.15:

Evaluation took:
13.861 seconds of real time
5.013006 seconds of user run time
0.126658 seconds of system run time
[Run times include 0.074 seconds GC run time.]
0 calls to %EVAL
0 page faults and
52,039,552 bytes consed.

Evaluation took:
4.639 seconds of real time
2.263186 seconds of user run time
0.026665 seconds of system run time
[Run times include 0.009 seconds GC run time.]
0 calls to %EVAL
0 page faults and
30,302,776 bytes consed.

Bottom line: SEQRND 3x faster, SEQRND 3/5 space.

That's interesting: completely different results depending on the implementation. What's also interesting is the presence of a few obvious flaws in the experiment. Let's keep in mind what we want to decide: which of using a EQ hash-table or storing values in cons-cells in-place is more space or time efficient.

First, the benchmark harness generates a list of random numbers for each run. That's pure noise. We could instead pass a copy of the same sequence for each iteration. It might introduce bias, but, looking at the sources, that seems unlikely. So, from

(defun benchmark (fn)
 (dotimes (x 1000)
   (let ((seq (loop for i from 1 to 1000 collect (random i))))
     (funcall fn seq))))
we go to
(defparameter *random-sequence*
  (loop repeat 1000
        collect (random 1.0)))

(defun test-shuffle (fn list n)
  (dotimes (i n)
    (funcall fn (copy-seq list))))

We can also see an immediate difference between seqrnd and random-shuffle: seqrnd generates single floats, random-shuffle double floats. Considering we'll not be shuffling millions of items, single floats should be good enough. Obviously, if we're to measure time efficiency, we want to turn optimisations on. Thus, we get these functions:

(defvar *random-id-ht* nil)

(defun initialize-memo ()
  (setf *random-id-ht* (make-hash-table :test #'eq)))

(declaim (sb-ext:maybe-inline consistent-random-id))
(defun consistent-random-id (obj)
  (declare (optimize speed))
  (multiple-value-bind (val found-p) (gethash obj *random-id-ht*)
    (if found-p val
      (setf (gethash obj *random-id-ht*)
            (random 1.0)))))

(defun seqrnd (seq)
  "Randomize the elements of a sequence. Destructive on SEQ."
  (declare (optimize speed)
           (inline consistent-random-id))
  (initialize-memo) ; need to clear between runs
  (sort seq #'> :key (lambda (x) (consistent-random-id x))))
(defun random-shuffle (sequence)
  (declare (optimize speed))
  (map-into sequence #'car
            (sort (map 'vector (lambda (x)
                                 (cons x (random 1.0)))
                       sequence)
                  #'< :key #'cdr)))

For 1000 shuffles of the same 1000-element list, we find 1.82s for seqrnd and 4.11s for random-shuffle. From that, we could conclude that it is faster to use a hash table than to take the car of a cons. The question remains, are we really measuring the relative efficiency of the two methods? In the last test case, random-shuffle does most of its work on vectors, seqrnd on lists. As an experiment, let's shuffle a 1000-element vector. The figures are now reversed! seqrnd takes 3.41s, and random-shuffle 1.47s. It seems rather unlikely that we are really measuring the effects of the two ways to associate random numbers to keys.

Let's time sorts of lists and vectors, passing (lambda (x) (sort x #'< :key #'identity)) to test-shuffle. Sorting lists takes .436s, compared to 1.37s for vectors. At least part of the discrepancy between our two methods can be attributed to SORT. Why is there a difference? SBCL's SORT calls different routines depending on the sequence's type: lists are passed to a mergesort, vectors to an in-place heapsort. Their performance envelopes are wildly different. The heapsort is clearly ahead on large datasets, but it varies on smaller inputs.

We have a hint that working with slightly different data structures can lead to large performance differences in third-party code. What effect can it have on MAP-INTO? Let's test with (lambda (x) (map-into x #'identity x)). Mapping into a list (1000 elements, 1000 times) takes 7.78s, 0.114s to a vector. The difference is enormous. The reason is simple. Quite clearly, MAP-INTO for lists sucks. How bad can it suck? ELT and LENGTH -using bad, surprisingly. SBCL's MAP-INTO is in need of some TLC if someone has the time.

Since we seem to be primarily interested in shuffling medium-sized lists, we could add a special case for such inputs, to keep working on lists and avoid MAP-INTO, something like:

(defun random-shuffle (sequence)
  (declare (optimize speed))
  (etypecase sequence
    (list (mapcar #'car
                  (sort (mapcar (lambda (x)
                                  (cons x (random 1.0)))
                                sequence)
                        #'< :key #'cdr)))
    (t    (map-into sequence #'car
                    (sort (map 'vector
                               (lambda (x)
                                 (cons x (random 1.0)))
                               sequence)
                          #'< :key #'cdr)))))

With this specialised code, we find random-shuffle takes 0.839s for 1000 shuffles of 1000-elements lists (1.82s for seqrnd). It also conses less, 64MB instead 81MB.

What happens if we make it in-place? First, let's write a non-sucky equivalent to MAP-INTO (something like that could be patched in SBCL):

(defun list-map-into (output function input)
  (declare (optimize speed))
  (let ((function (coerce function 'function))) ; FIXME: that's slightly wrong...
    (loop for outs on output
          for x in input
          do (setf (car outs)
                   (funcall function x))
          finally (return output))))

We can then use it in random-shuffle, being careful that the conses passed to SORT can be reused arbitrarily.

(defun random-shuffle (sequence)
  (declare (optimize speed))
  (etypecase sequence
    (list
     (let ((result
            (sort (list-map-into sequence
                                 (lambda (x)
                                   (cons x (random 1.0)))
                                 sequence)
                  #'< :key #'cdr)))
       (list-map-into result #'car result)))
    (t    (map-into sequence #'car
                    (sort (map 'vector
                               (lambda (x)
                                 (cons x (random 1.0)))
                               sequence)
                          #'< :key #'cdr)))))

Like memoisation, it could be argued that a good MAP-INTO should be part of every lisper's toolbox (... as part of the implementation). With this new MAP-INTO look-alike, 1000 random-shuffle of 1000-element lists takes 0.829s and conses 32MB (1.82s and 81MB for seqrnd).

After some optimisations, guided by a better knowledge of our implementation's library, we see that the correct approach (that doesn't fail on sequences with repeated elements) is also faster and more space-efficient (1.47s, 64MB for random-shuffle of vectors and 3.41s, 74MB for seqrnd). We can also observe the importance of understanding the performance of third-party, magic-pixie-dust-style code when trying to understand that of our own code, lest we be lead to misconclude.

posted at: 15:49 | /Lisp | permalink

Sun, 06 Apr 2008

 

On Magic Pixie Dust

Someone is *wrong* on the internet

(Thank you xkcd)

Common Lisp isn't one of them scripting languages in which lists have O(1) random dereferencing or arrays O(1) insertion at both ends, but it's still possible to forget about invisible costs.

Still on the topic of random shuffling, Leslie Polzer lists three alternatives, and suggests another, one that feels like a "Good Solution". Unfortunately, the only one not to work is the latter.

It seems to me the best solution is the one that looks the worst to some. Fisher-Yates is a nice, simple shuffling algorithm. It could have been cuter, but I don't see what's wrong with imperative code.

(defun fisher-yates (sequence)
  (let ((vector (coerce sequence 'vector)))
    (loop for i downfrom (1- (length vector)) to 0
          do (rotatef (aref vector i)
                      (aref vector (random (1+ i)))))
    (replace sequence vector)))

The other sort-based solutions also work, except the final cleverer one that uses memoisation. It looks good. Unfortunately, it's not thread-safe and, contrary to the advertisement, it conses (probably more so than the other implementations). Worse, it doesn't work.

It's obviously not thread-safe, since each call to seqrnd reinitialises the global memoisation table. A better solution would use dynamic scoping instead of assignment. If the code isn't meant to be used in a threaded environment, then CLRHASH would probably be a better idea.

Generic memoisation is often based on a hash table. Such a table must save the key-value association somewhere. It needs at least two words for the key and the value. Usually, the overhead is greater than that, for various implementation-level reasons. Compare that to the solutions that sort a sequence of conses: with a vector, the space overhead's usually a bit more than 3 words/value (4 for a list). Specialised data structures can often bring the cost down, by being lossy or using some attribute of the set of associations. Generic memoisation is a nice hack, but is often replaced by specialised code as development goes on.

The shuffling code also does not work. A memoisation table will associate random numbers to values, not (initial) positions. If the sequence to shuffle contains the same element several times, all the references/copies will be bundled sequentially! For example, the list (a a b b) can only be shuffled two ways: (a a b b) or (b b a a). There are also issues with numbers and characters, for which EQ is free to return NIL regardless of the arguments, but that's not exactly common.

(Mis)Using existing code is a very good idea, especially for quick hacks or boilerplate that's only in the way of the essence of a project. However, it's still important to be aware of what's going on underneath it all. Excessive use of opaque magic leads to beautiful broken code.

Extra! Extra! We can abuse CLISP's arbitrary precision floats to generate a lot of digits of Pi for us:

[1]> (setf (long-float-digits) 3500)
3500
[2]> (* 4 (atan 1L0))
3.14159265358979323846264338327950288419...

posted at: 17:50 | /Lisp | permalink

Sat, 05 Apr 2008

 

Trivial uniform shuffling

Leslie Polzer suggested this short and simple code to randomly shuffle a sequence:

(defun seqrnd (seq)
  "Randomize the elements of a sequence. Destructive on SEQ."
  (sort seq #'> :key (lambda (x) (random 1.0))))

Associating a random key with the values to shuffle, and sorting on these keys works well, as long as there aren't too many items (otherwise, there are birthday paradox issues with the finite mantissa). Unfortunately, quoting the CLHS on SORT: "There is no guarantee on the number of times the key will be called." In particular, the key may be called once/element/comparison; it should probably be (nearly) referentially transparent.

In this case, calling the key for each comparison reduces the above to sorting with a random comparison predicate, which is known to not yield an uniform distribution. For example, in a typical quicksort, the probability of displacing one of the first elements of an n element sequence to its end is in 1/2^n, not 1/n. This is not a purely hypothetical issue: SBCL's SORT calls key for each comparison, as do some other implementations, I'm sure. Considering the common case for key functions (reader functions), the decision makes sense.

So, to actually associate a single random key to each element, something like this could be used:

(defun random-shuffle (sequence)
  (map-into sequence #'car
            (sort (map 'vector (lambda (x)
                                 (cons x (random 1d0)))
                       sequence)
                  #'< :key #'cdr)))

posted at: 19:00 | /Lisp | permalink

Fri, 25 Jan 2008

 

Fast and simple sparse vectors

For my serialisation package (more on that later; the goals are to maintain sharing/circularity, minimise the number of copies for unboxed data and to support closures), I needed a way to quickly map from objects to some bookkeeping information. At first, that was a hash table. Unfortunately, there is considerable space overhead (SBCL uses chained buckets, so there's the array, the chain, and the key), which can easily destroy runtime on large enough data sets.

Finally, I found a rather hacky way to pin the serialised objects in place, which makes it possible to use raw addresses. For numbers, I stuck to the hash table. For everything else, however, I was looking for a relatively fast and reasonably space-efficient way to map from integers to objects. Cuckoo hash tables spend too much time rehashing on grows. Similarly, sparse vectors as a combination of a bitmask and a dense vector take too much time copying data into larger containers.

To avoid the problem of expensive grows, using some sort of tree structure seemed like a nice solution: each node contains much less data than the complete tree. Unfortunately, trees often have a low branching factor, so a query ends up reading from many locations. As soon as the tree becomes larger than cache, things get pretty ugly. I wanted something with fat nodes, like B-trees. However, finding the right child in a B-tree node isn't particularly cache-friendly either: a binary search is horrible for locality. A correct solution would be to use a cache-aware (ideally cache-oblivious, since I don't like fiddling with magic constants) layout, like van Emde Boas trees, or Bender, Demaine and Farach-Colton's cache-oblivious B-tree.

A lazier solution is to take advantage of the fact that we're dealing with bits, and not just some abstract set of totally-ordered values, and treat our keys (addresses) like strings of bits of constant length (modulo leading zeros). Tries are then natural... But they too tend to have low branching factors. With an arbitrary length of 40 bits (that's 1 TB's worth of addresses, or, rather, since I shift the 4 least significant bits out first, 16 TB), something like using the 20 MSB in the first level, then 10 and 10 in the second and third level seemed natural: an array of 1 million elements isn't that large, as long as there aren't thousands of them, and 1024 elements really isn't that bad either. Moreover, restricting the trie to 3 levels reduces the number of cache misses.

Obviously, it wouldn't be feasible to have a dense array of 1024 or 1M elements for each trie node, when most node will have only a few children, if any. We need another sparse array, but this time with different goals: space efficiency and fast growth aren't such issues anymore, but read and write speeds are.

I borrowed the concept of 'critical bit' from PATRICIA tries, and pluralized it to get a class of rather cheap hash functions that should, on typical workloads, do fine. The idea is to identify bit positions that are identical across all keys, and form an index from the other positions by concatenating them together. In the best case, it takes lg n bits to index from n keys (multiples of a power of 2), and, in the worst case, n bits for 2 keys (no common bit). Fortunately, we'd expect addresses to be closer to the former than the latter.

A lookup then consists of making sure the searched key corresponds to the identical bits (mask and compare), and then concatenating the variable bits together. By looping over each byte and using look-up tables for the popcount and partial concatenating, that operation takes on the order of 30-100 cycle on my 2.16 GHz Core 2.

Insertion, is, as always, a bit hairier. The simple case is when the common bits fit (again, mask and compare). If so, all that is needed is to concatenate the new key to find its index and write in the dense array. Otherwise, a new mask must be computed, and elements from the old dense vector reinserted in the new (at least twice larger) one with the correct indices. Computing a new mask is surprisingly simple: (logandc1 (logxor new-value common-value) common-mask).

I haven't managed to find a simpler way to do reinsertion than to recurse over the non-common bit positions to generate all the values that fit with the common bit positions (and their values). At least, that way it's possible to compute the old concatenated index on the fly (I haven't found how to do that for the new concatenated index yet, unfortunately). It obviously pays off to add a special case when there is only one key, and store it directly, instead of using a vector of length 1. What was surprisingly also important to performance was to explicitly reuse the vectors that are freed on reinsertions.

By glueing these two pieces together, we get an implementation of sparse vectors that should be able to take advantage of locality in keys (empty or nearly empty spans take very little space), and of (some) patterns in contiguous keys to save space. It's also reasonably performant, since it's expected to incur very few (a half dozen) misses for random access. Interestingly, it also makes predecessor/successor queries efficient, since the concatenating operation preserves ordering (bits are concatenated in order). In comparison with SBCL's hash tables (which aren't the best in the west, but not completely horrible either):

  • Inserting all the integers in [0, 2^20)
    • takes 0.12s (VS 0.83s),
    • conses 8.5 MB (VS 167 MB);
  • Inserting 2^20 random integers in [0, 2^20)
    • takes 0.18s (VS 0.81)
    • conses 8.5 MB (vs 83 MB)
  • Inserting all the multiples of 5 in [0, 5*2^20)
    • takes 0.40s (VS 0.9s)
    • conses 42 MB (VS 167 MB);
  • Inserting all the multiples of 16 in [0, 16*2^20)
    • takes 0.14s (VS 0.82 s)
    • conses 9.6 MB (VS 167 MB);
  • Inserting 2^20 random multiples of 256 in [0, 256*2^20)
    • takes 0.72s (VS 1.1s)
    • conses 25 MB (VS 83 MB);
  • Inserting 2^20 random multiples of 256 in [0, 256*2^22)
    • takes 0.86s (VS 0.85s)
    • conses 49 MB (VS 83 MB);
  • Inserting 2^20 random integers in [0, 2^30)
    • takes 11s, 8.7 of which in GC, (VS 0.69s)
    • conses 364 MB (VS 83 MB).

For truly random keys, a hash table is better. For my needs, however, the sparse trie is preferable: addresses usually won't be scattered all over the heap, so I can expect less space usage, less consing and faster execution times. Moreover, objects close together will often be traversed together, and the sparse trie can exploit that locality, unlike hash tables. I only compared insertions because it seems to be much slower (often by a factor of 2 or more) than look-ups.

posted at: 20:30 | /Lisp | permalink

Made with PyBlosxom Contact me by email: pvk@pvk.ca.