Paul Khuong mostly on Lisp


Tue, 20 Dec 2011

 

Xecto, a new project - the plumbing

Xecto is another (SBCL-only) project of mine, available in its very preliminary state on github. It started out as a library for regular parallel operations on arrays (reduce, scan, map, etc.), with the goal not of hitting as close to peak performance as possible through whatever tricks are necessary, but rather of having a simple execution model, and thus a simple performance model as well. These days, I’d rather have a tool that can’t hit the theoretical peak performance, but has a simple enough performance model that I can program close to the tool’s peak, than a tool that usually gives me excellent performance through ill-explained voodoo (and sometimes suffers from performance glass jaw issues).

I like to think of it as preferring to map operations to the BLAS instead of writing loops in C and hoping the compiler autovectorise everything correctly. When performance is really critical, smart people can study the problem and come up with a judicious mixture of high-level optimisations, specialised code generators and hand-rolled assembly (ATLAS, GotoBLAS, FFTW, MKL, etc). When performance isn’t that important, going for a tool that’s easy to understand at the expense of some runtime performance (e.g. by punting to more generic but heavily optimised libraries from the previous list) is probably a good engineering trade-off. It seems to me it’s only in a strange no man’s land of applications that must really run quickly, but don’t justify the development of a bespoke solution, that complex automagic optimisation shines.

Xecto represents the nth time I’ve written most of the components: work queues, task-stealing deques, futures, optimising simple array-processing loops, etc. I’m very aggressively fighting the second system effect, and going for a simple, often simplistic, design. My hope is that this will yield a robust parallel-processing library upon which others can build as well.

I’ll try to document the state of the project as it goes. There’s currently a basic working prototype: parallel processing infrastructure, arrays as vectors and shape metadata, a minimal loop nest optimiser, and parallel (task, data and SIMD) execution of vector operations. I’m now trying to go from prototype to useful code.

Note that Xecto is SBCL-only: I use atomic primitives a lot, and don’t care that much about portability (yet).

1 Parallel processing plumbing

The base of the parallelisation infrastructure is a thread pool; spawning one thread for each task is a silly waste of resources and often leads to CPUs constantly switching context between threads.

The thread pool (thread-pool.lisp) spawns a fixed number of worker threads ahead of time, and jobs are submitted via a queue of work units. A job can simply be a function or a symbol to call, but that’s not very useful: we usually want to know when the job is finished and what the result was.

A task structure (work-units.lisp) can be used for that. The function slot of the structure will be called, with the task as its single argument. With inheritance, we can add a slot to hold the result, and another to wait for status changes (status.lisp). status.lisp implements a simple way to wait on a status, optimised for the common case when tasks are completed before the program waits on them: a status slot usually only stores status symbols, but waiters upgrade it to a reference to a structure with a mutex/waitqueue pair for more efficient waiting.
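
As a rough sketch of the idea (the structure and slot names here are mine, for illustration; they are not Xecto’s actual definitions in work-units.lisp):

(defstruct task
  ;; called by a worker, with the task itself as the single argument
  (function nil :type (or symbol function)))

(defstruct (simple-task (:include task))
  result   ; filled in by the task's function
  status)  ; see status.lisp: a symbol, upgraded on demand to a
           ; structure with a mutex/waitqueue pair

(defun run-task (task)
  ;; roughly what a worker does with a unit of work
  (funcall (task-function task) task))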

We often have a large number of small independent work units that together represent a larger logical unit of work. A bulk task structure (work-units.lisp) implements that case. A vector of subtasks is provided, and a function is called for each subtask; when all the subtasks have finished executing, a cleanup function is called. In the case of vector processing (and in many others), there is a certain locality between adjacent work units. The work queue exploits that by allowing multiple threads to work on the same bulk task, while ensuring that each thread tends to only execute its own range of subtasks.

Work units also often recursively generate additional work units. They could simply go through the regular work unit queue. In practice, however, it’s much more interesting to skip that and execute them on the same worker, in stack order (work-stack.lisp). We avoid some synchronisation overhead, and the LIFO evaluation order tends to improve temporal locality. If there were only private evaluation stacks, we could easily find ourselves with a lot of available tasks, but all assigned to a few workers. That’s why tasks can be stolen by idle workers when the queue is empty.

Finally, tasks also have dependencies: we want to only start executing a given task after all its dependencies have completed their own execution.

The thread pool supports recursive waiting: work units can wait for other work units to complete. The worker thread will then continue executing tasks (on its stack first, then from the shared queue or by stealing) until the criterion (all dependencies fully executed) is met. This can waste a lot of stack space compared to implementations like Cilk that can steal and release waiting stack frames when the dependencies have all executed. However, the implementation is simple, and the space overhead is reasonable: a quadratic increase, in the worst case, I believe. If the serial program’s stack usage is decent (e.g. polylogarithmic in the input), it shouldn’t be an issue.

There’s also some machinery to move the dependency logic into the thread pool and eliminate the disadvantages of recursive waiting. Dependencies are registered between futures (futures.lisp, parallel-futures.lisp), and, when a future’s last dependency has just finished executing, it’s marked for execution (on the current worker’s stack). Because of my needs for vector processing, futures are bulk tasks with a function that’s called before executing the subtasks, and another one after.

It’s nothing special, and a lot of it is clearly suboptimal, but it’s mostly decent and can be improved later.

2 General-purpose parallel processing primitives

The code described in the previous section is already sufficient for a few common parallel processing primitives (parallel-primitives.lisp). Xecto doesn’t use them, but they’re still pretty useful.

2.1 Promises, parallel:let

These are simple tasks that are pushed on the worker’s stack when possible, and enqueued on the current parallel execution context otherwise.

parallel:promise takes a function and a list of arguments, and pushes/enqueues a task that calls the function with these arguments and saves the result.

parallel:promise-value will wait for the promise’s value, while parallel:promise-value* will recursively wait on chains of promises.
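
For instance (a hedged sketch that reads the calling convention literally from the description above; not tested against parallel-primitives.lisp):

(let ((p (parallel:promise #'expt (list 2 10))))
  ;; ... other work can proceed here, possibly on the same worker ...
  (parallel:promise-value p)) ; => 1024, waiting only if necessary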

parallel:let uses promises to implement something like Qlisp’s qlet. The syntax is the same as let (except for bindings without value forms), and the bound values are computed in parallel. A binding clause for :parallel defines a predicate value: if the value is true, the clauses are evaluated in parallel (the default), otherwise it’s normal serial execution.

This, along with the fact that waiting in workers doesn’t stop parallel execution, means that we can easily parallelise a recursive procedure like quicksort.

All of the following code is a normal quicksort.

(deftype index () 
  `(mod ,most-positive-fixnum))
 
(declaim (inline selection-sort partition find-pivot)) 
(defun partition (vec begin end pivot) 
  (declare (type (simple-array fixnum 1) vec) 
           (type index begin end) 
           (type fixnum pivot) 
           (optimize speed)) 
  (loop while (> end begin) 
        do (if (<= (aref vec begin) pivot) 
               (incf begin) 
               (rotatef (aref vec begin) 
                        (aref vec (decf end)))) 
        finally (return begin))) 
 
(defun selection-sort (vec begin end) 
  (declare (type (simple-array fixnum 1) vec) 
           (type index begin end) 
           (optimize speed)) 
  (loop for dst from begin below end 
        do 
           (let ((min   (aref vec dst)) 
                 (min-i dst)) 
             (declare (type fixnum min) 
                      (type index min-i)) 
             (loop for i from (1+ dst) below end 
                   do (let ((x (aref vec i))) 
                        (when (< x min) 
                          (setf min   x 
                                min-i i)))) 
             (rotatef (aref vec dst) (aref vec min-i))))) 
 
(defun find-pivot (vec begin end) 
  (declare (type (simple-array fixnum 1) vec) 
           (type index begin end) 
           (optimize speed)) 
  (let ((first  (aref vec begin)) 
        (last   (aref vec (1- end))) 
        (middle (aref vec (truncate (+ begin end) 2)))) 
    (declare (type fixnum first last middle))                                                                                                                                     
    (when (> first last) 
      (rotatef first last)) 
    (cond ((< middle first) 
           first) 
          ((> middle last) 
           last) 
          (t 
           middle))))

Here, the only difference is that the recursive calls happen via parallel:let.

(defun pqsort (vec) 
  (declare (type (simple-array fixnum 1) vec) 
           (optimize speed)) 
  (labels ((rec (begin end) 
             (declare (type index begin end)) 
             (when (<= (- end begin) 8) 
               (return-from rec (selection-sort vec begin end))) 
             (let* ((pivot (find-pivot vec begin end)) 
                    (split (partition vec begin end pivot))) 
               (declare (type fixnum pivot) 
                        (type index  split)) 
               (cond ((= split begin) 
                      (let ((next (position pivot vec 
                                            :start    begin 
                                            :end      end 
                                             :test-not #'eql)))
                        (assert (> next begin)) 
                        (rec next end))) 
                     ((= split end) 
                      (let ((last (position pivot vec 
                                            :start    begin 
                                            :end      end 
                                            :from-end t 
                                             :test-not #'eql)))
                        (assert last) 
                        (rec begin last))) 
                     (t 
                      (parallel:let ((left  (rec begin split)) 
                                     (right (rec split end)) 
                                     (:parallel (>= (- end begin) 512))) 
                        (declare (ignore left right)))))))) 
    (rec 0 (length vec)) 
    vec))

Mostly thanks to the coarse grain of the parallel recursion (we only recurse in parallel for subranges of 512 elements or more), the overhead compared to the serial version is tiny.

We can test the performance (and scaling) on random vectors of fixnums. I also compared with SBCL’s heapsort to make sure the constant factors were decent, but the only reasonable conclusion seems to be that our heapsort is atrocious on largish vectors.

(defun shuffle (vector) 
  (declare (type vector vector)) 
  (let ((end (length vector))) 
    (loop for i from (- end 1) downto 0 
          do (rotatef (aref vector i) 
                      (aref vector (random (+ i 1))))) 
    vector)) 
 
(defun test-pqsort (nproc size) 
  (let ((vec (shuffle (let ((i 0)) 
                        (map-into (make-array size 
                                              :element-type 'fixnum)
                                  (lambda () 
                                    (incf i))))))) 
    (parallel-future:with-context (nproc) ; create an independent thread 
      (time (pqsort vec)))                ; pool 
    (loop for i below (1- (length vec)) 
          do (assert (<= (aref vec i) (aref vec (1+ i)))))))

Without parallelism (:parallel is nil), we find

* (test-pqsort 1 (ash 1 25)) 
 
Evaluation took: 
  6.245 seconds of real time 
  6.236389 seconds of total run time (6.236389 user, 0.000000 system) 
  99.86% CPU 
  17,440,707,947 processor cycles 
  0 bytes consed

With the parallel clause above, we instead have

* (test-pqsort 1 (ash 1 25)) 
 
Evaluation took: 
  6.420 seconds of real time 
  6.416401 seconds of total run time (6.416401 user, 0.000000 system) 
  99.94% CPU 
  17,930,818,675 processor cycles 
  45,655,456 bytes consed

All that parallel processing bookkeeping and additional indirection conses up 45 MB, but the net effect on computation time is negligible. Better: on my workstation, I observe nearly linear scaling until 4-5 threads, and there is still some acceleration by going to 8 or even 11 threads.

2.2 Futures

These are very similar to the futures I described above: they are created (parallel:future) with a vector of dependencies, a callback, and, optionally, a vector of subtasks (each subtask is a function to call) and a cleanup function.

parallel:future-value[*] will wait until the future has finished executing, and the * variant recursively forces chains of futures.

parallel:bind binds variables, similarly to let, to the values of futures (if a value isn’t a future, it’s used directly), but waits by going through the work queue. When the first form in the body is :wait, the macro waits for the body to finish executing; otherwise, the future is returned directly.
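
A hedged sketch of what that can look like (the syntax follows my reading of the description above; future-x stands for some future, e.g. one returned by parallel:dotimes without :wait):

(parallel:bind ((x future-x)  ; waits for future-x via the work queue
                (y 42))       ; not a future: used directly
  :wait                       ; also wait for the body itself
  (+ x y))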

These aren’t very useful directly, but are needed for parallel dotimes.

2.3 parallel:dotimes

Again, the syntax for parallel:dotimes is very similar to that of dotimes. The only difference is the lack of implicit block and tagbody: the implicit block doesn’t make sense in a parallel setting, and I was too lazy for tagbody (but it could easily be added).

The body will be executed for each integer from 0 below the count value. There’s no need to adapt the iteration count to the number of threads: the macro generates code to make sure the number of subtasks is at most the square of the worker count, and the thread pool ensures that adjacent subtasks tend to be executed by the same worker. When all the iterations have been executed, the result value is computed, again by a worker thread (which allows, e.g., pushing work units recursively).

Again, when the first form in the body is :wait, the macro inserts code to wait for the future’s completion; otherwise, the future is returned directly.

In other words, this macro implements a parallel for:

(parallel:dotimes (i (length vector) vector) 
  :wait 
  (setf (aref vector i) (f i)))

2.4 map, reduce, etc.

Given parallel:dotimes, it’s easy to implement parallel:map, parallel:reduce and parallel:map-group-reduce (it’s something like map-reduce).

These are higher-order functions, and can waste a lot of performance executing generic code; inline them (or ask for high speed/low-space optimisation) to avoid that.

2.4.1 parallel:map

(parallel:map type function sequence &key (wait t)) coerces the argument sequence to a simple vector, and uses parallel:dotimes to call the function on each value. If type is nil, it’s only executed for effect, otherwise the result is stored in a temporary simple vector and coerced to the right type. If wait, the value is returned, otherwise a future is returned.
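
Hedged examples of that signature (untested; frobnicate-record and records are placeholders):

(parallel:map 'simple-vector #'isqrt #(1 4 9 16 25))     ; => #(1 2 3 4 5)
(parallel:map nil #'frobnicate-record records :wait nil) ; for effect; returns a future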

2.4.2 parallel:reduce

(parallel:reduce function sequence seed &key (wait t) key) again coerces to a simple vector and then works via parallel:dotimes. It also exploits work-queue:worker-id to perform most of the reduction in thread-local accumulators, only merging them at the very end. function should be associative and commutative, and seed a neutral element for the function. key, if provided, implements a fused map/reduce step. Again, wait determines whether the reduced value or a future is returned.
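
For example (again hedged, untested sketches of that interface):

(parallel:reduce #'+ #(1 2 3 4 5) 0)             ; => 15
(parallel:reduce #'+ #(1 2 3 4 5) 0
                 :key (lambda (x) (* x x)))      ; fused map step; => 55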

2.4.3 parallel:map-group-reduce

parallel:map-group-reduce implements a hash-based map/group-by/reduce, or closer to google’s map/reduce, a mapcan/group-by/reduce. Again, it is implemented with thread-local accumulators and parallel:dotimes.

(map-group-reduce sequence map reduce &key group-test group-by) maps over the sequence to compute a mapped value and a group-by value. The latter is the map function’s second return value if group-by is nil, and group-by’s value for the current input otherwise. All the values with the same (according to group-test) group-by key are reduced with the reduce function, and, finally, a vector of conses is returned: the cars are the group-by values, and the cdrs the associated reduced values.

When &key master-table is true, a hash table with the same associations is returned as a second value, and when it’s instead :quick, a simpler to compute version of that table is returned (the value associated with each key is a cons, as in the primary return value).

When &key fancy is true, we have something more like the original map/reduce. map is passed a value to process and a binary function: the first argument is the group-by key, and the second the value.

The function can be used, for instance, to process records in a vector.

(parallel-future:with-context (4) 
  (parallel:map-group-reduce rows #’row-age #’+ 
                             :group-by #’row-first-name 
                             :group-test #’equal)) 
=> #(("Bob" . 40) ("Alice" . 35) ...)

This uses up to four threads to sum the age of rows for each first name, comparing them with equal.

A more classic example, counting the occurrences of words in a set of documents, would use the :fancy interface and look like:

(defun count-words (documents) 
  (parallel:map-group-reduce documents 
                             (lambda (document accumulator) 
                               (map nil (lambda (word) 
                                          (funcall accumulator word 1)) 
                                    document)) 
                             #'+
                             :group-test #’equal 
                             :fancy t))

2.5 Multiple-producer single-consumer queue

mpsc-queue.lisp is a very simple (56 LOC) lock-free queue for multiple producers (enqueuers), but only a single consumer (dequeuer). It’s a really classic construction that takes a lock-free stack and reverses it from time to time to pop items in queue order. This (building a queue from stacks) must be one of the very few uses for those CS trick questions hiring committees seem to adore.
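
For reference, here is a hedged sketch of that construction; it is my own illustration of the trick, not Xecto’s mpsc-queue.lisp, and it leans on SBCL’s sb-ext:compare-and-swap on car:

(defstruct (mpsc (:constructor make-mpsc ()))
  (stack (list nil)) ; producers CAS new conses onto (car stack)
  (queue '()))       ; consumer-private list, already in FIFO order

(defun mpsc-enqueue (q item)
  (let ((cell (mpsc-stack q)))
    (loop ; classic CAS loop: retry until our push isn't raced out
      (let ((old (car cell)))
        (when (eq old (sb-ext:compare-and-swap (car cell) old
                                               (cons item old)))
          (return item))))))

(defun mpsc-dequeue (q) ; single consumer; returns NIL when empty
  (when (null (mpsc-queue q))
    ;; grab the whole shared stack at once, then reverse it into queue order
    (let ((cell (mpsc-stack q)))
      (loop
        (let ((old (car cell)))
          (when (eq old (sb-ext:compare-and-swap (car cell) old nil))
            (setf (mpsc-queue q) (nreverse old))
            (return))))))
  (pop (mpsc-queue q)))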

It’s not related at all to the rest of Xecto, but it’s cute.

3 Next

I think the next step will be generating the vector-processing inner loops in C, and selecting the most appropriate one to use. Once that’s working, the arrays could be extended to more types than just double-float, and allocated on the foreign heap.

In the meantime, I hope the parallel primitives can already be useful. Please have some fun and enjoy the fruits of Nikodemus’s work on threads in SBCL.

P.S. The organisation of the code probably looks a bit weird. I’m trying something out: a fairly large number of tiny internal packages that aren’t meant to be :used.

posted at: 02:10 | /Lisp/Xecto | permalink

Thu, 15 Dec 2011

 

Initialising structure objects modularly

I use defstruct a lot, even when execution speed or space usage isn’t an issue: they’re better suited to static analysis than standard-objects and that makes code easier to reason about, both for me and SBCL. In particular, I try to exploit read-only and typed slots as much as possible.

Structures also allow single-inheritance – and Christophe has a branch to allow multiple subclassing – but don’t come with a default protocol like CLOS’s make-instance/initialize-instance. Instead, we can only provide default value forms for each slot (we can also define custom constructors, but they don’t carry over to subtypes).

What should we do when we want to allow inheritance but need complex initialisation which would usually be hidden in a hand-written constructor function?

I’ll use the following as a completely artificial example. The key parts are that I have typed and read-only slots, that the initialisation values depend on arguments, and that we also have some post-processing that needs a reference to the newly-constructed structure (finalization is a classic).

(defstruct (foo 
            (:constructor %make-foo)) 
  (x (error "Missing arg") :type cons 
                           :read-only t) 
  (y (error "Missing arg") :read-only t)) 
 
(defun make-foo (x y) 
  ;; non-trivial initial values 
  (multiple-value-bind (x y) (%frobnicate x y) 
    (let ((foo (%make-foo :x x :y y))) 
      ;; post-processing 
      (%quuxify foo) 
      foo)))

The hand-written constructor is a good, well-known, way to hide the complexity, as long as we don’t want to allow derived structure types. But what if we do want inheritance?

One way to work around the issue is to instead have an additional slot for arbitrary extension data. I’m not a fan.

Another way is to move the complexity from make-foo into initialize-foo, which mutates an already-allocated instance of foo (or a subtype). I’m even less satisfied by this approach than by the previous one. It means that I lose read-only slots, and, when I don’t have sane default values, typed slots as well. I also have to track whether or not each object is fully initialised, adding yet more state to take into account.
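
Concretely, that second approach would look something like the sketch below (my own illustration of the pattern, not code I use); note how the slots can no longer be :read-only or usefully typed:

(defstruct foo ; no more :type cons or :read-only t
  x
  y)

(defun initialize-foo (foo x y)
  ;; mutates an already-allocated FOO (or an instance of a subtype)
  (multiple-value-bind (x y) (%frobnicate x y)
    (setf (foo-x foo) x
          (foo-y foo) y)
    (%quuxify foo)
    foo))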

For now, I’ve settled on an approach that parameterises allocation. Instead of calling %make-foo directly, an allocation function is received as an argument. The hand-written constructor becomes:

(defun make-foo (x y 
                 &optional (allocator '%make-foo)
                 &rest     arguments) 
  ;; the &rest list avoids having to build (a chain of) 
  ;; closures in the common case 
  (multiple-value-bind (x y) (%frobnicate x y) 
    ;; allocation is parameterised 
    (let ((foo (apply allocator :x x :y y arguments))) 
      (%quuxify foo) 
      foo)))

This way, I can define a subtype and still easily initialize it:

(defstruct (bar 
            (:constructor %make-bar) 
            (:include foo)) 
  z) 
 
(defun make-bar (x y z) 
  (make-foo x y '%make-bar :z z))

The pattern is nicely applied recursively as well:

(defun make-bar (x y z 
                 &optional (allocator '%make-bar)
                 &rest     arguments) 
  (apply 'make-foo x y allocator :z z arguments))

Consing-averse people will note that the &rest arguments are only used for apply. SBCL (and many other implementations, most likely) handles this case specially and doesn’t even allocate a list: the arguments are used directly on the call stack.

I’m sure others have encountered this issue. What other solutions are there? How can the pattern of parameterising allocation be improved or generalised?

posted at: 18:29 | /Lisp | permalink

Sun, 13 Nov 2011

 

Finalizing foreign pointers just late enough

SBCL exposes a low-level type, system-area-pointers (SAPs), which are roughly equivalent to void * pointers in C. Since it’s so low level, we allow ourselves a lot of tricks to ensure performance. In particular, SAPs may be represented as raw addresses in machine registers. In order to simplify the implementation of this fairly useful optimization, they are given the same leeway as numbers with respect to EQ: SAPs that represent the same address, even ones resulting from multiple evaluations of the same binding or value, are not guaranteed to be EQ. When SAPs are compiled to machine registers, this lets us simply re-create a type-generic heap value as needed.

CFFI chose to directly expose SAPs in its user-facing interface. Finalizing SAPs is obviously a no-no: multiple references to the same (semantically) SAP can randomly be transformed into references to an arbitrary number of (physically) different objects.

If you want to finalize potentially-strange system-provided types, it’s probably better to wrap them in a read-only structure, and finalize that structure; for example:

(defstruct (wrapper 
            (:constructor make-wrapper (pointer))) 
  (pointer nil :read-only t)) 
 
(defun gced-foreign-alloc (type &rest rest) 
  (let* ((ptr (apply #'foreign-alloc type rest))
         (wrapper (make-wrapper ptr))) 
    (tg:finalize wrapper 
                 (lambda () 
                   (foreign-free ptr)))))
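
Hypothetical usage, assuming a package that uses CFFI as in the snippet above: go through the wrapper for the raw pointer, and keep the wrapper (not just the SAP) live; once the wrapper becomes unreachable, the finalizer frees the foreign memory.

(let ((buffer (gced-foreign-alloc :double :count 16)))
  (setf (mem-aref (wrapper-pointer buffer) :double 0) 42d0)
  (mem-aref (wrapper-pointer buffer) :double 0)) ; => 42.0d0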

posted at: 00:48 | /Lisp | permalink

Mon, 15 Aug 2011

 

An issue nobody cares about

CL-USER> (values (truncate (ash 1 60) pi)) 
366986312910250176 
CL-USER> (values (truncate (+ 4 (ash 1 60)) pi)) 
366986312910250176

Obviously, it’s just another floating-point issue.

It’s also an extremely minor optimisation obstacle: when only the first return value is needed, it can be faster to implement the division exactly with machine integers than to convert to FP, divide, and convert back to an integer. Unfortunately, doing so would be too exact: we would get different results for equivalent programs depending on the level of optimisation (or the constant-ness of the divisor).
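
To make the “too exact” point concrete, here is a sketch of that alternative (mine, not a proposed compiler transformation; an optimised version would use machine-integer arithmetic rather than rationals, but would compute the same quotient):

(defun truncate-by-exact-pi (n)
  ;; Divide by the exact rational value of the double-float PI instead
  ;; of rounding N to a double and dividing in floating point; the
  ;; quotient can differ from (truncate n pi) in the last few digits.
  (values (truncate n (rational pi))))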

Plus, who does that, or cares about that operation?

posted at: 15:55 | /Lisp | permalink

Fri, 29 Jul 2011

 

An implementation of (one of) Nesterov's accelerated gradient method

I’m a PhD student at Université de Montréal. Contrary to what I suppose most people would expect from the contents of this blog, I don’t (officially) work on compilers. Instead, I chose to focus on mathematical optimisation; specifically, large-scale combinatorial optimisation. Although I sometimes wonder if I made the right choice, the basic rationale is that I’m hoping to contribute to both the compilers and optimisation worlds by bringing them closer.

Sketching algorithms makes up a large fraction of my work, and, lately, Matlisp has been very useful. If you work with algorithms that depend on numerical tools like the BLAS, LAPACK, FFTs, or even just special functions like erf or gamma, you too might like it. In addition to foreign function wrappers and a couple of f2cled routines, Matlisp includes a nice reader macro for matrices, and a few convenience functions to manipulate them. Plus, it’s reasonably efficient: it doesn’t try to roll its own routines, can use platform-specific libraries for BLAS and LAPACK, and passes Lisp vectors without copying. It doesn’t look like it was written with threads in mind, unfortunately, but it’s always been good enough for me so far.

The latest project in which I’ve used Matlisp is the prototype implementation of an accelerated gradient method for composite functions [PDF]. The rest of this post more or less closely follows the code in my implementation.

I think the program makes a useful, small and self-contained example of how Matlisp can be used; if you need to minimise something that fits in the method’s framework, it might even work well.

1 An accelerated gradient method for composite functions

1.1 The theory

In his paper on gradient methods for composite functions, Nesterov describes a family of efficient algorithms to minimise convex functions of the form

ϕ(x) = f(x) + Ψ(x)

where f is a “nice” black box, and Ψ not as nice, but better-known.

Here, “efficient” means that, in addition to being a gradient method (so each iteration is fairly inexpensive computationally), the method converges at a rate of O(1/k²), where k is the number of iterations.

f is nice because it must be convex, continuous and differentiable, with a Lipschitz-continuous gradient (continuous and also not too variable). Fortunately, the Lipschitz constant itself doesn’t have to be known a priori. On the other hand, we only assume that we have access to oracles to compute the function’s value and gradient at arbitrary points.

Ψ simply has to be convex. However, in addition to value and (sub)gradient oracles, we also assume the existence of an oracle to solve problems of the form (with s,a > 0)

argmin_x (s/2)·x′x + b′x + a·Ψ(x)

Ψ can, for instance, be used to express domain constraints or barriers. In the former case, the last oracle amounts to solving a projection problem.

1.2 The programming interface

My implementation exposes four generic functions to implement one’s own f and Ψ.

value and grad both take a function description as the first argument, and a point (a Matlisp column vector) as the second argument. They return, respectively, the value and the gradient of the function at that point.

project takes a Ψ function description, a distance function description and the multiplicative factor a as arguments; the distance function (s/2)·x′x + b′x is represented by a distance struct with three slots: n (the size of the vectors), s and b. It returns the minimiser of (s/2)·x′x + b′x + a·Ψ(x).

An additional generic function map-gradient is also exposed to solve problems of the type argmin_x g′(x−y) + (L/2)·∥x−y∥² + Ψ(x), but the default method re-formulates this expression in terms of project.

Computing the minimiser of a distance function (without any Ψ term) is a common operation, so it’s exposed as distance-min.

The rest of the code isn’t really complicated, but the best explanation is probably the original paper (n.b. although the definitions always come before their use, the distance separating them may be surprisingly large).

Hooks are available to instrument the execution of the method and, e.g., count the number of function or gradient evaluations, or the number of projections. It should be obvious how to use them.

1.3 Example implementations of Ψ

1.3.1 :unbounded function

The simplest Ψ is probably Ψ(x) = 0. The accelerated gradient method is then a gradient method for regular differentiable convex optimisation.

value and grad are trivial.

project isn’t much harder: a necessary and sufficient condition for minimising (s/2)·x′x + b′x is sx* + b = 0, so x* = −(1/s)·b.
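
A hedged sketch of that projection, mirroring the method signature of the :non-negative example below (distance-min is the oracle mentioned earlier):

(defmethod project ((psi (eql :unbounded)) distance Ak)
  (declare (type distance distance)
           (ignore Ak))
  ;; sx* + b = 0  =>  x* = -(1/s)b, i.e. the distance function's minimiser
  (distance-min distance))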

1.3.2 :non-negative function

To constrain the feasible space to non-negative vectors, we only have to let Ψ(x) = 0 if x ≥ 0 (componentwise), and Ψ(x) = ∞ otherwise.

Again, value and grad are trivial: project will ensure the method only works with non-negative x, and any subgradient suffices.

As for project, a little algebra shows that it’s equivalent to finding the feasible x that minimises the distance to −(1/s)·b (that’s actually true for any Ψ that represents constraints). For the non-negative orthant, we only have to clamp each negative component of −(1/s)·b to 0:

(defun clamp (matrix) 
  (map-matrix! (lambda (x) 
                 (declare (type double-float x)) 
                 (max x 0d0)) 
               matrix)) 
 
(defmethod project ((psi (eql :non-negative)) distance Ak) 
  (declare (type distance distance) 
           (ignore Ak)) 
  (clamp (distance-min distance)))

1.4 A sample “nice” function f: linear least squares

The linear least squares problem consists of finding a vector x that minimizes (the multiplication by half is just to get a nicer derivative)

(1/2)·∥Ax − b∥²

This kind of problem can come up when fitting a line or a hyperplane to a set of observations.

value is straightforward:

(defmethod value ((lls linear-least-square) x) 
  ;; A set of convenient matrix operation is available 
  (let ((diff (m- (m* (lls-A lls) x) 
                  (lls-b lls)))) 
    (* .5d0 (ddot diff diff)))) ; ddot ~= a call to MATLISP:DOT

A little algebra shows that the gradient can be computed with

A′Ax − A′b

The constructor make-lls precomputes A′A and A′b, so we get

(defmethod grad ((lls linear-least-square) x) 
  ;; we can also call BLAS routines directly 
  (axpy! -1 (lls-Atb lls) 
         (m* (lls-AtA lls) x)))

LAPACK has built-in functions to solve that problem, for example gelsy (that’s on my puny macbook air):

;; Create an LLS instance with 2000 rows (observations) 
;; and 400 columns (variables). 
AGM> (defparameter *lls* (make-random-lls 2000 400)) 
*LLS* 
AGM> (time (gelsy (lls-a *lls*) (lls-b *lls*) (* 1000 (expt 2d0 -52)))) 
Evaluation took: 
  0.829 seconds of real time 
  0.827092 seconds of total run time (0.806912 user, 0.020180 system) 
  [ Run times consist of 0.036 seconds GC time, and 0.792 seconds non-GC time. ] 
  99.76% CPU 
  1,323,883,144 processor cycles 
  6,425,552 bytes consed 
 
#<REAL-MATRIX  400 x 1 
   -2.20134E-2 
    1.92827E-2 
   -7.19030E-3 
   -4.01685E-2 
     : 
    7.19374E-3 > 
400 
AGM> (value *lls* *) 
66.5195278164384d0

Although we have very solid specialised methods to solve least squares problems, it makes for a nice non-trivial but easily-explained example.

;; 400 variables, with an LLS instance for f and 
;; a default Psi of :unbounded 
AGM> (time (solve (make-agm-instance 400 *lls*))) 
Evaluation took: 
  32.426 seconds of real time 
  31.266698 seconds of total run time (31.003772 user, 0.262926 system) 
  [ Run times consist of 0.799 seconds GC time, and 30.468 seconds non-GC time. ] 
  96.43% CPU 
  51,779,907,728 processor cycles 
  795,491,552 bytes consed 
 
66.51952785790199d0 
#<REAL-MATRIX  400 x 1 
   -2.20108E-2 
    1.92838E-2 
   -7.19136E-3 
   -4.01690E-2 
     : 
    7.19797E-3 > 
[...]

Pretty horrible, even more so when we add the 0.9 seconds it takes to precompute A′b and A′A.

Here’s an interesting bit:

;; Use the optimal value as the starting point 
AGM> (time (solve (make-agm-instance 400 *lls* :x0 (second /)))) 
Evaluation took: 
  0.043 seconds of real time 
  0.043683 seconds of total run time (0.043445 user, 0.000238 system) 
  102.33% CPU 
  69,793,776 processor cycles 
  1,525,136 bytes consed 
 
66.51952785758927d0 
#<REAL-MATRIX  400 x 1 
   -2.20108E-2 
    1.92838E-2 
   -7.19136E-3 
   -4.01690E-2 
     : 
    7.19795E-3 > 
[...]

This method naturally exploits warm starts close to the optimal solution, which is useful when you have to solve a lot of similar instances.

Another interesting property is shown here:

AGM> (time (solve (make-agm-instance 400 *lls*) 
                  :min-gradient-norm 1d0)) 
Evaluation took: 
  1.961 seconds of real time 
[...] 
66.52381619339305d0 ; delta: 1/2.3e2 
[...] 
AGM> (time (solve (make-agm-instance 400 *lls*) 
                  :min-gradient-norm 1d-1)) 
Evaluation took: 
  7.328 seconds of real time 
[...] 
66.51957600391108d0 ; delta: 1/2.1e4 
[...] 
AGM> (time (solve (make-agm-instance 400 *lls*) 
                  :min-gradient-norm 1d-2)) 
Evaluation took: 
  19.275 seconds of real time 
[...] 
66.51952840584295d0 ; delta: 1/1.7e6 
[...] 
AGM> (time (solve (make-agm-instance 400 *lls*) 
                  :min-gradient-norm 1d-4)) 
Evaluation took: 
  31.146 seconds of real time 
[...] 
66.51952785790199d0 ; delta: 1/2.4e7 

The run time (and the number of iterations or function/gradient evaluations) scales nicely with the precision.

1.5 Efficiency

The implementation wasn’t written with efficiency in mind, but rather to rapidly get a working prototype. However, nearly all the cost is incurred by calls to the BLAS. If performance were an issue, I’d mostly focus on minimising consing and copies by reusing storage and calling (destructive versions of) BLAS functions to fuse operations.

posted at: 00:19 | /Lisp/Math | permalink

Tue, 21 Jun 2011

 

SBCL's flow sensitive analysis pass

As I noted a year and a half ago, the constraint propagation pass in SBCL can slow compilation down a lot (lp#394206, lp#792363). To a large extent, I feel like that issue has been fixed in HEAD, but the lack of documentation on that pass made progress slower than it could have been. So, this post is a half-organised version of my notes, before I forget it all.

1 What constraint propagation does

The constraint propagation pass is executed as part of the IR1 optimisation loop, which operates on Lisp-level basic blocks. It’s only concerned with propagating information between “real” variables (LAMBDA-VARiable), variables that have a name and can be set and referred to multiple times, as opposed to temporaries (LVARs).

The distinction is useful because there are (usually) far fewer full-blown variables than temporaries, and only the former directly benefit from exploiting previously executed type checks and predicates to deduce information (flow sensitivity).

Instead of handling all the operations found in basic blocks, constraint propagation only considers a few key operations, and otherwise exploits the type information previously derived in a bottom-up manner (i.e. propagating the types of arguments to result types) by the other IR1 passes. The latter, conversely, only have access to the results of constraint propagation through the type derived for each REFerence to a lambda-var.

So, constraint propagation goes through each basic block, updating its state (set of known true constraints) when it encounters BIND nodes (LET bindings, more or less), references to a lambda-var, CAST nodes ([usually] checked type assertions), CSET nodes (assignment), or branches on a few key predicates (EQL, <, > or TYPEP, mostly).

As would be expected, the state at the end of each basic block is propagated to its successors, join points take the intersection of their predecessors’ states, and entry points are initialised with empty states... and this is all executed iteratively until convergence.

2 What constraint propagation isn’t

It feels a lot like an abstract interpretation algorithm. However, a major difference is that constraint propagation doesn’t converge on a least fixed point. Take the following trivial example:

CL-USER> (lambda () 
           (let ((x 0.0)) 
             (dotimes (i 10 x) 
               (setf x (/ x 2.0))))) 
#<FUNCTION (LAMBDA ()) {10048E20C9}> 
CL-USER> (sb-kernel:%simple-fun-type *) 
(FUNCTION NIL (VALUES NUMBER &OPTIONAL))

It’s obvious to us that X will always be a single float. Unfortunately, while SBCL’s type analyses converge to a fixed point, they’re not always seeded with an underapproximation of a least fixed point.

When there are assignments to a variable, the type of that variable is initialised with its declared type (T if none), and, after that initial type has been used to derive the type of the values that could be assigned to it, the union of these types is taken (and of the initially bound value). In the example above, the return type of (/ [T] [single-float]) is NUMBER. Once we have an over-approximation of the least fixed point, we shouldn’t expect to tighten it back much.

In a proper abstract interpretation pass, X would be initialised with the bottom type or the singleton type (EQL 0.0) (or some approximation thereof), and iteratively widened with the derived type of the division and of its initialisation value, up to convergence. That would easily deduce that X is always 0.0, or, at least, a single-float.
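
The usual workaround is to seed the tight type by hand; with a declaration like the one below, the derived function type should mention SINGLE-FLOAT rather than NUMBER (a hedged, untested variant of the example above):

(lambda ()
  (let ((x 0.0))
    (declare (type single-float x)) ; hand-provided "least fixed point"
    (dotimes (i 10 x)
      (setf x (/ x 2.0)))))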

It’s also not like any SSA variant. Yes, Continuation-passing Style is equivalent to Static single assignment form. One key part of that equivalence is that both styles rename variables at join points (via arguments to the continuation or phi functions). IR1 does mention continuations, but these continuations don’t perform the essential renaming of variables. That is the very reason why we have to use constraint sets to represent what we know about a variable’s contents at a given program point.

3 How constraint propagation does it

3.1 Constraints

At the heart of the pass are constraint sets (CONSETs) that are updated as the analysis walks through a basic block. Each constraint in a conset represents a small fact that is known to be true about a given lambda-var: that its contents are (or are not) of a certain type, are greater or less than values of a given type, EQL to a constant, or EQL to another variable.

As a pun, EQL constraints can also link a lambda-var and an lvar: since combinations (function calls) work with lvars, these special constraints are used to translate information about a combination’s arguments into information about lambda-vars (which is what constraint propagation handles).

EQL constraints are special. When two variables are EQL, it’s essential for good performance to extend that EQLness to all the variables that are EQL to either of them (i.e. compute the transitive closure of the EQL relation). This is useful because, when something is true of a variable (e.g. it’s TYPEP FIXNUM), it’s also true of all the variables EQL to it.

However, we only do that for one of the operands in a non-EQL constraint between variables, and not at all for EQL constraints involving constants (overly eager constant propagation can hurt since our codegen doesn’t handle repeated references to constants that well) or lvars (these aren’t really constraints, but rather pure bookkeeping information punned as constraints).

3.2 Propagation

To begin propagation, entry points are initialised with empty constraint sets (we know nothing about the inputs), and the consets are updated by walking down the basic blocks in execution order, in CONSTRAINT-PROPAGATE-IN-BLOCK:

  • BIND nodes represent things like LET bindings. If we have a useful (more precise than T) type for the bound value, the new variable is (currently) of that type. Also, if the bound value is a lambda-var’s current value, then the newly introduced variable is EQL to it.
  • REF nodes are references to a lambda-var. The LVAR that receives that value is EQL to the lambda-var. Additionally, if we’re on the last iteration, it’s useful to store all that we know about the referenced variable at that program point for other IR1 passes to exploit.
  • CAST nodes are type checks. If the check fails, execution will not continue, so we can simply assume that the check is successful, and propagate the type to the lambda-var at the end of the chain of casts (if any).
  • CSET nodes represent assignments. The most important thing to do is forget all that is known about the assigned variable at that program point: after the assignment, it’s bound to a brand new value. Still, before that, if we want a lot of propagation, it’s useful to propagate some information that’s always true of that variable to EQL nodes. For now, it only makes sense to propagate the fact that, if the assigned variable is known to always be of a given type, all the variables that are currently EQL to it are also of that type. Then, as with bind nodes, it makes sense to add type constraints from the new value, and to add EQL constraints if applicable.

Finally, if the block ends with a conditional branch, it’s useful to look at the predicate guiding the branch to propagate different constraints to the consequent and alternative successors.

If the predicate is simply a reference to a variable, then it’s known to be NIL in the alternative, and something else in the consequent.

If it’s a function call, only a few functions are interpreted:

  • TYPEP-like functions result in constraints to the effect that a given lambda-var is or is not of a given type.
  • EQ and EQL result in additional EQL constraints (between variables, or to a constant) in the consequent, and the reverse constraints (not EQL) in the alternative.
  • < and > are used to derive tighter bounds on numeric types. However, we do not track relationships between variables (except for EQLity and non-EQLity), and only note that a given variable is less or greater than a value of some type.
  • A few other type predicates also result in TYPEP constraints, when specially marked for the compiler.

Once a basic block has been constraint propagated through, the information is used by its successors. Basic blocks are processed in an order such that at least one of each block’s predecessors has been propagated through before the block itself. The analysis is run with an initial conset made of the intersection of the consets (that have already been computed) at the end of its predecessors, taking care to use the right one if some predecessor ends in a conditional branch (COMPUTE-BLOCK-IN). And that’s repeated on blocks for which the initial conset might have changed, until we hit a fixed point.

Once that’s done, we’re only interested in storing all the flow-sensitive information we have about each variable in the relevant REFerences to it. So, constraint propagation is executed one last time (otherwise we only have the consets at the beginning and end of each basic block), and, when a ref node is encountered, the set of constraints related to the referenced variables is extracted and converted into a type, in CONSTRAIN-REF-TYPE.

4 What’s been and what could be done

We currently represent consets as bitsets. Without too much surprise, the only operation that is horrible with bitsets is iterating through the members of a given set. Fortunately, we only used this in a few functions, and it seems like bitsets are still the best option, when supplemented with a few indices.

The heaviest timesink was FIND-CONSTRAINT: constraints are hash-consed so that equivalent constraints are always EQ. It used to be we’d perform a linear search over the set of constraints associated with a given variable to find constraints. We now use hash tables in each variable to speed the search up.

The other major slowdowns were related to functions that need to iterate through the intersection of the current set of constraints and certain subsets of the constraints associated with a given variable (e.g. all the EQL constraints for a given variable in the current constraint set). Instead of iterating through the intersection of two bitvectors, we represent each subset of the constraints associated with each variable as vectors of constraints (the hash-consing step already ensures there won’t be any duplicate). Now, we only have to iterate through these vectors and check for membership in a bitset.

These simple changes are enough to bring compile times for a couple test cases down from hundreds of seconds to a few seconds or less.

There are definitely more low-hanging fruits in this area, be it for more precise analyses (converging to a proper least fixed point), stronger interaction with the rest of the IR1 passes or quicker compilation. Hopefully, this post can serve as a high level guide to would-be src/compiler/constraint.lisp hackers.

posted at: 15:45 | /Lisp | permalink

Fri, 01 Apr 2011

 

Introducing Pipes, a lightweight stream fusion EDSL

Fusion is cool. It lets us write bulk data processing programs modularly, with each pass or computation separate from the rest, while retaining the efficiency of code that executes all the passes simultaneously rather than building large intermediate values. In the last couple of years, most of the attention has been coming from the Haskell crowd (e.g. Data.List.Stream), but Lispers aren’t immune to that siren’s call. In the late 80’s, Richard Waters worked on SERIES, a Common Lisp package to transform “obviously synchronizable series expressions” into buffer-less loops. That package is still available, and even lightly maintained.

I believe that, while the goal is noble, the approach taken in SERIES is too magical; I’d argue that it’s even more true of stream fusion in GHC.

SERIES goes to a lot of trouble to make its transformation work on code expressed as regular CL, and even across function definitions. The problem is, SERIES can only handle a subset of CL, so users can be surprised when a seemingly minor refactoring suddenly breaks their code.

GHC’s fusion stuff has the virtue of using the language itself to handle the dirty work of looking just like regular Haskell. Unfortunately, it also depends on inlining and on rewrite rules (well, one), which don’t warn when they can’t be applied.

So, instead, I believe that we should be using an EDSL. It’s embedded, so it’s easy to use with the host language (both around and inside the DSL). It’s also domain specific, so it should be clear what the language does and doesn’t handle, and, in the worst case, the implementation can warn the user instead of silently reverting to slowness.

The design space is huge, but I think Pipes is a decent point. I haven’t been able to spend even one minute on it since a hackathon during the holidays; I hope someone else will be able to work on Pipes a bit.

The idea behind Pipes

The simplest fusion schemes attempt to handle cases like

(mapcar f (mapcar g list)) -> (mapcar (compose f g) list)

These cases are simple because each operation has at most one consumer (output) and one producer (input), and produces exactly one output value for each input value. The obvious ways to generalize allow arbitrarily many output values per input value, multiple consumers, or multiple producers.

Most common are transformations that handle an arbitrary number of output values per input value (or even arbitrary types!). foldr fusion is probably the prime example. I like to think of these rules as based on a specialised CPS conversion: instead of consing a list up, producers call their continuations on each item that would have been in that list.
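
A toy illustration of that view (mine, not Pipes code): (reduce #'+ (mapcar #'1+ list)) becomes a single pass once the producer calls a continuation instead of consing an intermediate list.

(defun sum-mapped (f list)
  (let ((acc 0))
    (flet ((yield (x)          ; the consumer, passed as a continuation
             (incf acc x)))
      (dolist (item list acc)  ; the producer never conses a result list
        (yield (funcall f item))))))

;; (sum-mapped #'1+ '(1 2 3)) => 9, with no intermediate list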

I’m not sure I can find examples of schemes that handle multiple consumers in the literature. We don’t see them as often in code as multiple producers, and they’re hard to handle with rewrite rules. Still, it’s not hard to treat those cases with a dedicated code generator: just compile a “push” dataflow engine. Such a compiler could easily be extended to allow many outputs (or none) per input.

Multiple producers are much more common, if only because zipWith is a standard list operation. Stream fusion achieves that, and a lot more: concatenated and nested streams, mostly. It can also handle functions like filter easily, by inserting “skip” values in the stream. In fact, it can allow functions to yield an arbitrary number of output values per input. That’s compatible with what a “pull” dataflow engine can achieve.

SERIES is somewhere else in the solution space. It allows multiple producers and consumers, but not, as far as I can tell, more than one output per input (it too uses “skip” values to implement functions like filter). It achieves that by compiling to a loop that advances the state of each node in the dataflow graph exactly once per iteration. Thus, the amount of implicit buffering is constant (one output value for each node), and the loop’s body is generated by a simple traversal of the graph.

Pipes is mostly a “push” engine that handles multiple consumers, but uses a type system to recognize cases similar to what SERIES handles, and then allows multiple producers as well. It’s a different design choice than stream fusion, but I believe that it’s essential to allow multiple consumers instead of forcing users to build temporary values and then traverse them multiple times. Like SERIES, it compiles to a loop whose body corresponds to a traversal of the graph. However, instead of advancing each node exactly once per iteration, some subgraphs can advance multiple times per iteration, or even compile to nested loops.

I’ve been trying to find a good tradeoff for almost 5 years now, and it feels like there’s a rule here, something like “arbitrary outputs per input, multiple consumers, multiple producers: choose two.” Stream fusion chooses multiple producers and arbitrary output, SERIES multiple producers and consumers, and Pipes multiple consumers and arbitrary output, plus a dash of multiple producers in certain cases. The fact that it almost manages to get that third feature is what I originally found exciting about this design.

That’s it for now

Frankly, I’m pretty much rediscovering my own code, but I remember thinking that it was almost ready for people to start playing with it. Again, it can be downloaded at https://github.com/pkhuong/Pipes.

Here’s a simple example:

CL-USER> 
(lambda (x y) 
  (declare (type (simple-array double-float 1) x y) 
           (optimize speed (sb-c::insert-array-bounds-checks 0))) 
  (pipes (let* ((x (- (from-vector x) (from-vector y))) 
                ; x is bound to the difference at each iteration 
                (_ sum-x (scan + x 0d0 double-float)) 
                ; sum-x is bound to the final sum at the end of 
                ; the loop 
                (_ sum-x^2 (scan + (* x x) 0d0 double-float)) 
                (_ min   (scan min x 
                               double-float-positive-infinity 
                               double-float)) 
                (_ max   (scan max x 
                               double-float-negative-infinity 
                               double-float)))) 
    (values (length x) sum-x sum-x^2 min max))) 
#<FUNCTION (LAMBDA (X Y)) {1004B7BAB9}> 
CL-USER> 
(funcall * 
         (map-into (make-array 10 :element-type 'double-float)
                   (let ((x 0d0)) 
                     (lambda () 
                       (incf x)))) 
         (map-into (make-array 10 :element-type 'double-float)
                   (let ((x 0d0)) 
                     (lambda () 
                       (decf x))))) 
10 
110.0d0 
1540.0d0 
2.0d0 
20.0d0 
CL-USER> (disassemble **) 
; disassembly for (LAMBDA (X Y)) 
; 04125107:       488B4AF9         MOV RCX, [RDX-7]           ; no-arg-parsing entry point 
;      10B:       488B5FF9         MOV RBX, [RDI-7] 
;      10F:       4C8BC3           MOV R8, RBX 
;      112:       488BF1           MOV RSI, RCX 
;      115:       4839D9           CMP RCX, RBX 
;      118:       488BCE           MOV RCX, RSI                                                                                                                                     
;      11B:       490F4FC8         CMOVNLE RCX, R8 ; find the min length 
;      11F:       488BD9           MOV RBX, RCX 
;      122:       31C9             XOR ECX, ECX 
;      124:       660F57E4         XORPD XMM4, XMM4 
;      128:       660F57ED         XORPD XMM5, XMM5 
;      12C:       F20F101554020000 MOVSD XMM2, [RIP+596] 
;      134:       F20F101D54020000 MOVSD XMM3, [RIP+596] 
;      13C:       EB46             JMP L3 
;      13E:       90               NOP 
;      13F:       90               NOP 
;      140: L0:   F20F104C0A01     MOVSD XMM1, [RDX+RCX+1] 
;      146:       F20F10740F01     MOVSD XMM6, [RDI+RCX+1] 
;      14C:       F20F5CCE         SUBSD XMM1, XMM6 
;      150:       F20F58E1         ADDSD XMM4, XMM1 
;      154:       660F28F1         MOVAPD XMM6, XMM1 
;      158:       F20F59F1         MULSD XMM6, XMM1 
;      15C:       F20F58EE         ADDSD XMM5, XMM6 
;      160:       660F2FD1         COMISD XMM2, XMM1 
;      164:       0F8A52010000     JP L14 
;      16A:       0F834C010000     JNB L14 ; rest of cmp/movapd 
;      170: L1:   660F2FD9         COMISD XMM3, XMM1 
;      174:       0F8A2D010000     JP L12 
;      17A:       0F8627010000     JBE L12 ; same 
;      180: L2:   4883C108         ADD RCX, 8 
;      184: L3:   4839D9           CMP RCX, RBX 
;      187:       7CB7             JL L0 
[...]

This subtracts two vectors element-wise, and returns the sum of the differences, the sum of the squared differences, and the minimum and maximum differences. It’s definitely not the best example, as it doesn’t explicitly exploit the freedom to compute many outputs per input, but it’s something that I’ve had to write by hand before.

If you want to play with Pipes and have questions, ping me on #lisp.

posted at: 21:20 | /Lisp/Pipes | permalink

Tue, 28 Dec 2010

 

Another way to accumulate data in vectors

From time to time, some #lisp-ers debate whether it’s faster to accumulate a sequence of values using cons/nreverse, or cons/(setf cdr). To me, if that’s an issue, you simply shouldn’t use a linked list; as Slava once wrote, “single linked lists are simply the wrong data structure in 99% of cases.” A vector won’t be much or any slower to build, and traversing a long vector is often far quicker than traversing an equally long linked list, if only because the vector uses half as much space (and thus memory bandwidth). The obvious question is then: how should we accumulate data in vectors? (clearly, building a linked list and coercing it to a vector, as SBCL sometimes does, isn’t ideal)

1 The simplest way

Obviously, the ideal way is to know the size of the final vector (or a reasonable overapproximation) ahead of time. The whole vector can then be preallocated, and, if needed, trimmed a posteriori. In terms of memory accesses, this pretty much achieves the lower bound of exactly one access per vector element. It’s also rarely applicable.

2 The usual way

The usual way to approach the problem of an unknown vector size is to begin with an underapproximation of the final size, and grow the vector geometrically (and copy from the old one) when needed. That’s the approach usually shown in textbooks, and almost necessarily used by std::vector.

Unless the initial guess is very accurate (or too high), this method has a noticeable overhead in terms of memory access per vector element. Asymptotically, the best case is when the last grow is exactly the size of the final vector. For instance, when growing temporary vectors by powers of two, this happens when the final vector size is also a power of two. Then, in addition to the one write per element, half are copied once, a quarter once again, etc., for a total of 2 writes per element. The worst case is when the final vector has exactly one more element; for powers of two, all but one vector element are then copied once, half of that once again, etc., resulting in 3 writes per element.

C programs can often do much better by using realloc (or even mremap) to avoid copies. High level languages can, in principle, do as well by also hooking in the memory allocator. C++’s std::vector, however, can’t exploit either: it must go through constructors and operator= (or std::swap, or std::move). A variant of realloc that fails instead of moving the data could be useful in those situations; I don’t know of allocators that provide such a function.

3 A lazier way

Another way to build a vector incrementally is to postpone copying to a single contiguous vector until the very end. Instead, a sequence of geometrically larger temporary vectors can be accumulated, and only copied once to a final vector of exactly the right size. Since the temporary vectors grow geometrically, there aren’t too many of them (proportional to the logarithm of the final vector size), and the sequence can be represented simply, as a linked list, a flat preallocated vector, or a vector that’s copied when resized.

This yields, in all cases, exactly one write and one copy per element, and O(lg n) bookkeeping operations for the sequence, for a total of 2 writes per element.
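
Here’s a sketch of the same accumulation with lazy copying (again a made-up example, not any library’s actual code): chunks double in size, and each element is copied exactly once, into the final vector.

(defun collect-evens/lazy (input)
  (declare (type vector input))
  (let ((chunks '())                   ; full chunks, most recent first
        (chunk  (make-array 4))
        (index  0)
        (total  0))
    (loop for x across input
          when (evenp x)
            do (when (= index (length chunk))
                 (push chunk chunks)
                 (setf chunk (make-array (* 2 (length chunk)))
                       index 0))
               (setf (aref chunk index) x)
               (incf index)
               (incf total))
    ;; single copy of each element into a vector of exactly the right size
    (let ((result (make-array total))
          (start  0))
      (dolist (c (reverse chunks))
        (replace result c :start1 start)
        (incf start (length c)))
      (replace result chunk :start1 start :end2 index)
      result)))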

4 Testing the hypothesis

I used my trusty X5660 to test all three implementations, in C, when building a vector of around 2^24 unsigned elements (enough to overflow the caches). The table below summarizes the results, in cycles per vector element (median of 256 runs). The rows for 2^24 - 1 and 2^24 elements represent the best case for the usual growable vector, 2^24 + 1 the worst case, and (3/2) 2^24 an intermediate value (usual case). The relative values were similar with implementations in SBCL, both on a X5660 and on a Core 2.

size (n)       preallocate (c/n)   grow (c/n)   lazy (c/n)
2^24 - 1       7.52                13.39        15.01
2^24           7.55                13.39        15.29
2^24 + 1       7.55                19.63        13.91
(3/2) 2^24     7.54                15.51        15.35

The numbers for preallocated vectors are reassuringly stable, at 7.5 cycles per vector element (2 cycles per byte). The usual way to grow a vector ends up using a bit less than twice as many cycles in the best cases, almost exactly twice in the usual case, and nearly three times as many cycles in the worst case. In contrast, lazy copying consistently uses twice as many cycles as the preallocated vector. These figures fit nicely with the theoretical cost in memory accesses. The main discrepancy is that growable vectors seem to do slightly better than expected for the best cases; this is probably due to the data caches, which significantly accelerate copying the first few temporary vectors.

5 Conclusion

The usual way to implement growable vectors has issues with performance on vectors slightly longer than powers of two; depending on the computation-to-bandwidth cost ratio, a task could become 50% slower when the number of outputs crosses a power of two. Lazy copying, on the other hand, has very similar performance in most cases, without the glass jaw around powers of two. The effect is even more important when copying is more expensive, be it due to copy constructors, or to slower implementations of memcpy (e.g. in SBCL).

I’ve been working on a library for SERIES-style fusion of functional sequence operations lately (in fact, that’s what led me to benchmark lazy copying), and decided to spend the time needed to implement accumulation of values with lazily-copied vectors. The cost is mostly incurred by me, the implementer: there isn’t any difference in the interface, and the user notices, at worst, a slight slowdown compared to the usual implementation. For instance,

(pipes (let ((x (from-vector x))) 
         (to-vector (remove-if-not (plusp x) x))))

will accumulate only the strictly positive values in x, without any unexpected major slowdown around powers of two.

Growable vectors are often provided by (standard) libraries, and used by programmers without giving explicit thought to their implementation. Glass jaws should probably be avoided as much as possible in this situation, even at the cost of additional code off the hot path and a little pessimisation in the best cases.

posted at: 00:49 | /Lisp | permalink

Sat, 04 Dec 2010

 

Concurrency with MVars

[Updated twice to improve readability on planet lisp, and again to remove some useless package qualifiers.]

I originally intended to remind users of condition variables (waitqueues) in SBCL that condition-wait can return spuriously. In other words, you’re likely misusing condition variables if your program isn’t correct with (basically) the following definition of condition-wait:

(defun condition-wait (queue mutex) 
  (declare (ignore queue)) 
  (sb-thread:release-mutex mutex) 
  ;; sleep and/or assume a fair mutex 
  (sb-thread:get-mutex mutex))

One way to almost always use condition variables correctly is to insert calls to condition-wait in a busy-polling loop. I was going to code up a rough example of how to do that for coroutines, but it seemed better to exploit MVars instead.
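
For reference, the busy-polling idiom looks like the following; the one-element slot and all the names are made up for illustration.

(defvar *slot-lock* (sb-thread:make-mutex))
(defvar *slot-cvar* (sb-thread:make-waitqueue))
(defvar *slot* nil)

(defun wait-for-slot ()
  (sb-thread:with-mutex (*slot-lock*)
    ;; re-test the condition in a loop: a spurious return from
    ;; condition-wait only costs an extra test.
    (loop until *slot*
          do (sb-thread:condition-wait *slot-cvar* *slot-lock*))
    (shiftf *slot* nil)))

(defun fill-slot (value)
  (sb-thread:with-mutex (*slot-lock*)
    (setf *slot* value)
    (sb-thread:condition-notify *slot-cvar*)))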

1 MVars

(The code for this section is available at http://discontinuity.info/~pkhuong/mvar.lisp.)

The GHC docs describe an MVar as “a synchronising variable, used for communication between concurrent threads. It can be thought of as a a [sic] box, which may be empty or full” (and pronounced “em-var”). Another way to see them is as bounded (one-element) message queues.

To use MVars, we need to be able to create one (make), consume a value (take) and put one in the queue (put). It’s also useful to expose a type (mvar) and a type test (mvar-p).

(defpackage "MVAR" 
    (:use "CL" "SB-THREAD") 
  (:export "MVAR" "MVAR-P" "MAKE" "VALUE" "TAKE" "PUT")) 
 
(in-package "MVAR")

To implement an MVar, we obviously need some way to denote emptiness, a mutable box, and a mutex to protect against concurrent accesses.

(defconstant +empty+ '+empty+)
 
(defstruct (mvar 
             (:constructor %make-mvar)) 
  (mutex      (make-mutex)     :type mutex 
   :read-only t) 
  ;; signaled when reads are possible 
  (read-cvar  (make-waitqueue) :type waitqueue 
   :read-only t) 
  ;; signaled when writes are possible 
  (write-cvar (make-waitqueue) :type waitqueue 
   :read-only t) 
  (value  +empty+))

(defun make (&optional (value +empty+)) 
  (%make-mvar :value value))

The two condition variables (waitqueues) are used to reduce busy-looping. It would also be possible to only have a single condition variable for both reads and writes, but that would result in even more spurious wake-ups. Instead, the code can use condition-notify to only wake a single waiter at a time.

(defun take (mvar) 
  (declare (type mvar mvar)) 
  (let ((mutex (mvar-mutex mvar)) 
        (cvar  (mvar-read-cvar mvar))) 
    (with-mutex (mutex) 
      (loop for value = (mvar-value mvar) 
            do (cond ((eq value +empty+) 
                      (condition-wait cvar mutex)) 
                     (t 
                      (setf (mvar-value mvar) +empty+) 
                      (condition-notify (mvar-write-cvar mvar)) 
                      (return value)))))))

(defun put (mvar new-value) 
  (declare (type mvar mvar)) 
  (assert (not (eq new-value +empty+))) 
  (let ((mutex (mvar-mutex mvar)) 
        (cvar  (mvar-write-cvar mvar))) 
    (with-mutex (mutex) 
      (loop for value = (mvar-value mvar) 
            do (cond ((eq value +empty+) 
                      (setf (mvar-value mvar) new-value) 
                      (condition-notify (mvar-read-cvar mvar)) 
                      (return new-value)) 
                     (t 
                      (condition-wait cvar mutex)))))))

Finally, tiny setf wrappers never hurt:

(declaim (inline value (setf value))) 
(defun value (mvar) 
  (take mvar)) 
 
(defun (setf value) (value mvar) 
  (put mvar value))
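
A quick usage sketch (assuming the package above is loaded): a producer thread pushes integers through a single MVar, and the main thread takes them one at a time. put blocks until the previous value has been taken, so the producer never runs ahead.

(defvar *box* (mvar:make))

(sb-thread:make-thread
 (lambda ()
   (dotimes (i 5)
     (mvar:put *box* i))))

(loop repeat 5
      collect (mvar:take *box*)) ; => (0 1 2 3 4)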

2 Implementing coroutines with MVars

(The code for this section is available at http://discontinuity.info/~pkhuong/coroutine.lisp.)

Coroutines are like functions, except that they allow multiple returns and (re-) entries. Users should be able to create coroutines (coroutine), yield values from coroutines, and grab the next values from a coroutine.

(defpackage "COROUTINE" 
    (:use "CL" "SB-THREAD" "SB-EXT") 
  (:export "COROUTINE" "YIELD" "NEXT" "+DEAD+")) 
 
(in-package "COROUTINE")

To implement that, a coroutine only needs a thread and two MVars, one for arguments and another for return values:

(defstruct (coroutine 
             (:constructor %make-coroutine (thread in out))) 
  (thread nil :type thread    :read-only t) 
  (in     nil :type mvar:mvar :read-only t) 
  (out    nil :type mvar:mvar :read-only t))

next simply has to put fresh argument values, and take return values:

(defun next (coroutine &rest values) 
  (mvar:put (coroutine-in coroutine) values) 
  (values-list (mvar:take (coroutine-out coroutine))))

yield shouldn’t be used outside coroutines, so it’s defined as a stub and a compiler macro:

(defun yield (&rest values) 
  (declare (ignore values)) 
  (error "~S used outside ~S" 'yield 'coroutine)) 

(define-compiler-macro yield (&whole whole &rest values) 
  (declare (ignore values)) 
  (warn "~S used outside ~S" 'yield 'coroutine) 
  whole)

Finally, coroutines are just threads with a local yield function.

(defconstant +dead+ '+dead+) 
 
(defmacro coroutine (&body body) 
  (let ((_in    (gensym "IN")) 
        (_out   (gensym "OUT")) 
        (_block (gensym "BLOCK"))) 
    `(make-coroutine 
      (lambda (,_in ,_out) 
        "IN is the input MVAR and OUT the output MVAR." 
        (lambda () 
          (block ,_block 
            (flet ((yield (&rest values) 
                     (mvar:put ,_out values) 
                     (let ((in (mvar:take ,_in))) 
                       (when (eq in +dead+) 
                         (return-from ,_block)) 
                       (values-list in)))) 
              ;; signal that initialization is complete 
              (yield) 
              (locally 
                  ,@body))) 
          (mvar:put ,_out (list +dead+))))))) 
 
(defun make-coroutine (builder) 
  (let* ((in     (mvar:make)) 
         (out    (mvar:make)) 
         (thread (make-thread (funcall builder in out))) 
         (coroutine (%make-coroutine thread in out))) 
    ;; the coroutine thread and the finalizer don’t hold references 
    ;; to the coroutine struct, so finalize isn’t useless. 
    (finalize coroutine 
              (lambda () 
                (mvar:put in +dead+) 
                (join-thread thread))) 
    ;; return the coroutine and the first yielded values 
    (multiple-value-call #’values 
      coroutine 
      (values-list (mvar:take out)))))
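
Here’s what I’d expect a trivial use of that interface to look like (in a package that uses COROUTINE):

(defvar *counter*
  (coroutine
    (yield 1)
    (yield 2)))

(next *counter*) ; => 1
(next *counter*) ; => 2
(next *counter*) ; => +DEAD+, the body has returned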

3 Same-fringe with coroutines

A classic toy application of coroutines (or, actually, of generators, since information only flows out of the coroutine) is the same-fringe problem. We can implement it by first enumerating the leaves of a cons tree:

(defun leaves (tree) 
  (coroutine 
    (labels ((walk (tree) 
               (cond ((consp tree) 
                      (walk (car tree)) 
                      (walk (cdr tree))) 
                     (t 
                      (yield tree))))) 
      (walk tree))))

Then, we only have to read the leaves from the input trees:

(defun same-fringe (tree1 tree2) 
  (loop with leaves1 = (leaves tree1) 
        with leaves2 = (leaves tree2) 
        for leaf1 = (next leaves1) 
        for leaf2 = (next leaves2) 
        do 
     (cond ((and (eq leaf1 +dead+) 
                 (eq leaf2 +dead+)) 
            (return t)) 
           ((not (eql leaf1 leaf2)) 
            (return nil)))))
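
For example (note that NIL list terminators are leaves too, so I use dotted trees to keep the fringes directly comparable):

(same-fringe '((1 . 2) . (3 . 4)) '(1 2 3 . 4)) ; => T
(same-fringe '((1 . 2) . 3)       '(1 3 . 2))   ; => NIL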

posted at: 17:22 | /Lisp | permalink

Mon, 12 Apr 2010

 

The type-lower-bound branch

Nikodemus was intrigued by the type-lower-bound branch I pushed on repo.or.cz a few weeks ago. It’s a convenience hack that wound up being slightly more intricate and interesting than planned: a user was complaining that there was no nice way to muffle some optimisation notes. Take

(lambda (x) 
  (declare (type integer x) 
           (optimize speed)) 
  (1+ x)) 
 
; in: LAMBDA (X) 
;     (1+ X) 
; ==> 
;   (+ X 1) 
; 
; note: forced to do GENERIC-+ (cost 10) 
;       unable to do inline fixnum arithmetic (cost 1) because: 
;       The first argument is a INTEGER, not a FIXNUM. 
;       The result is a (VALUES INTEGER &OPTIONAL), not a (VALUES FIXNUM &REST T). 
;       unable to do inline fixnum arithmetic (cost 2) because: 
;       The first argument is a INTEGER, not a FIXNUM. 
;       The result is a (VALUES INTEGER &OPTIONAL), not a (VALUES FIXNUM &REST T). 
;       etc.

If the user already knows that the best they can do is declare that x is an integer, there is no way to muffle only the notes that amount to wishing that the type of x was more precisely known.

My first (failed) attempt tagged variables as hopeless (any note mentioning these variables’ types would then be muffled). I forget exactly why I couldn’t make it work, but I believe it’s because invisible copies are very common: transforms ended up manipulating lvars that were mere untagged copies of the tagged variables.

Luckily, we have a well-established mechanism to make properties flow across copies: the type (propagation) system. Again, my first attempt was to create a type for hopelessly vague types. However, types must implement all three set operations (negation, intersection and union), and making that work with such an ad hoc type wasn’t an attractive prospect.

This is where type lower bounds come in. In CL, types, as exposed through declarations, are always upper bounds: they are treated as conservative approximations of the set of values the annotated form, variable, etc. can take. The conservativeness is necessary because the exact static type is generally undecidable. In other words, the meaning of (the integer x) is that x can evaluate to any integer; it could actually only ever evaluate to a fixnum (or any other subset of integers), which is what the note in the original example asks for.

If we had a way to also denote lower bounds (e.g. that x can take (not fixnum) values) on the exact static type of forms, compiler notes could be tested against these lower bounds to determine when the programmer knows that the declared or derived type cannot be improved enough for a transform to fire. In the original example, this would amount to declaring that x’s exact static type is between fixnum (exclusively) and integer (inclusively), or, equivalently, that x will always be an integer, and will sometimes not be a fixnum. The note is then obviously moot, and can be muffled by the compiler.

Unlike ad hoc “hopeless” types, intersection and union of lower-bound types (really, ranges of types, since a lower bound is always accompanied by an upper bound) are straightforward and theoretically sound. The only issue is with type negation. Since this is all for convenience, I decided to just punt and drop the lower bound before negating.

The branch, as pushed on repo.or.cz, seems to be working. In order to keep changes to a minimum, regular types are treated as having an implicit lower bound of nil, and range types (with a non-trivial lower bound) are aggressively converted to regular types. This gives the muffling effect for some interesting simple cases, and reverts to the old behaviour very quickly. There are probably hidden bugs (both in the code and in the design), but since they could only be triggered by using the extension, I’m not too worried.

N.B. This isn’t actually useful for compilation speed

I originally thought type lower bounds could be useful to improve compilation speed: by keeping around both lower and upper bounds, we are able to overapproximate types even in the presence of type negation. Once I implemented a quick prototype, ir1-widening, I realised we don’t need lower bounds at all: we only have to make sure we only ever approximate types once we’re sure they’ll never be negated. Actually, what would be even more useful is a way to compute approximate unions and intersections quickly, instead of widening types after the fact.

posted at: 20:44 | /Lisp | permalink

Made with PyBlosxom Contact me by email: pvk@pvk.ca.