Fresh in SBCL 1.1.8: SSE Intrinsics!

One feature I’m particularly excited about is the basic support for SSE intrinsics on x86-64. It’s only the fundamental infrastructure needed for the compiler and runtime to manipulate and compile operations on SIMD packs (the platform-independent name suggested by Kevin Reid). However, now that it’s merged in, we can easily add new functionality that maps (more or less) directly to SSE assembly in a running image, like the sb-rotate-byte contrib does for bitwise rotations.

There is currently no such contrib in SBCL, although Alexander Gavrilov has been maintaining cl-simd, an extension for SBCL and ECL (he also kept an old SSE/SBCL branch of mine on life support since 2008). I’m really not sure what the right interface will look like for Common Lisp, so I decided to only provide the mechanism to implement such an interface; when something emerges that looks sane and has been used in anger, we can merge it in.

In the meantime, we get to define our own interface and see where that experience leads us. In this post, I’ll use a classic example to guide my first stab at an interface: the Mandelbrot set. I’ll only define as much of an interface as I need, and adapt it along the way if I see the need for more sophistication; I prefer to build programs and libraries bottom-up and iteratively, rather than trying to divine my requirements ahead of time. A consequence of this approach is that I have to/am able to maintain a dynamic development style, even when working with (and working on) SSE intrinsics. I know many people use REPLs as desktop calculators; maybe others can be persuaded to also use SBCL as an SIMD calculator, to doodle on SSE code sequences (:

Packed floating-point arithmetic
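The listing isn't reproduced here, but in sketch form (the package name mandelbrot matches the VOP names later on; the exact flag spellings are SBCL-version dependent), the forms look something like:

```lisp
(defpackage "MANDELBROT"
  (:use "CL"))

(in-package "SB-VM") ; defknown and friends live in SBCL's internals

(defknown (mandelbrot::f4+ mandelbrot::f4* mandelbrot::f4-)
    ((simd-pack single-float) (simd-pack single-float))
    (simd-pack single-float)
    (movable flushable always-translatable)
  :overwrite-fndb-silently t)
```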

The last form adds three new functions (f4+, f4* and f4-) to the compiler’s database (fndb). Each function accepts two packs of single floats as arguments, and returns another pack of single floats (the result of element-wise addition, multiplication or subtraction). Moreover, the functions can be reordered and flushed as dead code by the compiler, and, whenever they appear in code, they must ultimately be translated to assembly. The last bit tells the system to silently overwrite the database entries if the functions are already known, rather than warn at load-time.

Next, we add a template (VOP) to convert calls to f4+ into assembly code. This task is so deeply tied to the internals that it makes sense to just do it from SB-VM.
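A sketch of that template, evaluated with *package* set to SB-VM (the generator cost is a guess):

```lisp
(define-vop (mandelbrot::f4+)
  (:translate mandelbrot::f4+)
  (:policy :fast-safe)
  (:args (x :scs (single-sse-reg) :target r)
         (y :scs (single-sse-reg)))
  (:arg-types simd-pack-single simd-pack-single)
  (:results (r :scs (single-sse-reg)))
  (:result-types simd-pack-single)
  (:generator 4
    (cond ((location= r y)
           ;; r and y share a register: exploit commutativity
           (inst addps r x))
          (t
           ;; move is a no-op when r and x already coincide
           (move r x)
           (inst addps r y)))))
```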

The first line defines a VOP with name mandelbrot::f4+ (it’s an arbitrary symbol and the important bit is only to avoid collisions, but good naming is useful when reading compiler traces or optimisation notes).

Then, we specify that it’s to be automatically considered when there is a call to mandelbrot::f4+, and that it can be used both in fast and in safe code.

The template converts a call with two arguments, x and y, that must both be in SSE registers of single floats. Both values must be represented as single floats in SIMD packs: when defining VOPs, types refer to primitive types, and primitive types are concerned with representation (like C types) rather than only sets of values (a given CL value or type can map to many primitive types depending on the context). We also declare a preference for x to be allocated in the same register as the result r, if possible.

The template has a single result, which is also an SIMD pack of single floats allocated in an SSE register of single floats.

Finally, we have code to emit assembly. If r and y were packed (register-allocated) in the same register, we exploit commutativity to add x into y. Otherwise, we move x into r if necessary (the move function takes care of checking for equivalent locations), and emit a packed single float addition (addps) of y into r (which has just been overwritten with the contents of x).

The reason we have to specify that the SSE registers hold single float values (rather than doubles or integers) is so that functions like move emit, for example, FP rather than integer SSE register moves. However, this additional information doesn’t really affect register allocation: the three storage classes (SC) – single-sse-reg, double-sse-reg and int-sse-reg – all map to the same storage base (SB), and the same SSE register can be used for an simd-pack-single value at one program point, an simd-pack-int at another, and a double-float at yet another.

This VOP is the bare minimum for a useful packed addition of single floats: in the future, it will probably make sense to add support for a constant argument, which could be directly loaded from a RIP-relative address to save a register (SBCL isn’t clever enough to determine when it makes sense to keep a constant in a register across a basic block or a loop).

We do the same for multiplication and subtraction. There’s one last twist for f4-. We can’t exploit commutativity there, so we have to make sure y and r aren’t packed in the same register; this is achieved by extending the live range of r from the end of the (default) live range for x. There are obvious opportunities for macro-isation, but, again, I can refactor when I have a clear need (e.g. when I decide to add a dozen more intrinsics).
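For f4-, the only changes from the addition template are the instruction and the result’s live range; a sketch:

```lisp
(define-vop (mandelbrot::f4-)
  (:translate mandelbrot::f4-)
  (:policy :fast-safe)
  (:args (x :scs (single-sse-reg) :target r)
         (y :scs (single-sse-reg)))
  (:arg-types simd-pack-single simd-pack-single)
  ;; r is live from the first argument on, so it can't share y's register
  (:results (r :scs (single-sse-reg) :from (:argument 0)))
  (:result-types simd-pack-single)
  (:generator 4
    (move r x)
    (inst subps r y)))
```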

Now that we’ve declared the functions and defined one way to convert each to machine code, we can define stubs to interact with them from interpreted code or as normal first-class functions. The compiler already has all the information needed to convert any call to f4+ et al to machine code; however, no such function is actually defined. We can define them with what looks like infinite recursion: the defknown will ensure the compiler inserts type checks for (simd-pack single-float) as needed and infers that the result is also an (simd-pack single-float), so the templates will be applicable.
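The stubs are as trivial as promised:

```lisp
(in-package "MANDELBROT")

;; Each body compiles to the corresponding VOP, not to a recursive call.
(defun f4+ (x y) (f4+ x y))
(defun f4* (x y) (f4* x y))
(defun f4- (x y) (f4- x y))
```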

Now, we can call our new functions at the REPL, pass them to higher-order functions, etc., like any normal function.
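Something along these lines (printed representations elided, since they depend on the pack’s element type):

```lisp
(sb-ext:%make-simd-pack-single 1.0 1.0 1.0 1.0)
;; => a SIMD pack of four single float 1.0s
(mandelbrot::f4+ * *)        ; * is the previous REPL value
;; => a SIMD pack of four 2.0s
(disassemble #'mandelbrot::f4*)
;; a short function wrapping MULPS, not an endless recursion
```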

The first form makes an simd-pack from four single float values. We must already track at compile-time whether each SIMD pack value is a pack of singles, doubles or integers, so it makes sense to do the same at runtime: that way, we preserve type information when constant folding, and we can print packs nicely. Thus, we can call f4+ on that value, and see that 1.0 + 1.0 = 2.0. We can also disassemble the function f4* to confirm that it is a normal function that can be passed around (in this case to disassemble), and that it doesn’t endlessly recurse.

A first stab at the Mandelbrot inner loop

The inner loop body to approximate the Mandelbrot set is simply z' = z*z + c, where z and c are complex values. If we expand the complexes into their real and imaginary parts (zr + i zi, cr + i ci), we find z*z+c = (zr*zr - zi*zi + cr) + i(2*zr*zi + ci). That’s easily vectorised if we have a pack of four zr values, another for the zi, and similarly for cr and ci.
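Assuming each argument is a pack of four single floats, the loop body is a direct transcription (note 2*zr computed as zr+zr, which avoids materialising a constant pack):

```lisp
(in-package "MANDELBROT")

(defun %mandelbrot-iter (zr zi cr ci)
  (declare (type (sb-ext:simd-pack single-float) zr zi cr ci))
  (values (f4+ (f4- (f4* zr zr) (f4* zi zi)) cr)  ; zr*zr - zi*zi + cr
          (f4+ (f4* (f4+ zr zr) zi) ci)))         ; 2*zr*zi + ci
```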

This is a straight translation of the scalar formula above, but it computes four values in parallel (that’s why struct-of-arrays layouts lend themselves to easy vectorisation). Crucially, a disassembly reveals we get the code we expect:

The beauty of Python’s (that’s SBCL’s compiler) representation selection capabilities is that we can specify multiple representations (unboxed in a register or on the stack, or boxed as a generic pointer to a heap-allocated value) for SIMD packs, and the right (usually…) one will be chosen depending on context. In this case, the function receives boxed SIMD packs, unboxes them into SSE registers in the prologue, performs its FP arithmetic only on SSE registers, and converts the results back into boxed SIMD packs.

Even better: it seems to compute the right thing (:

The first return value is a pack of real components and the second a pack of imaginary components; the values do correspond to the scalar computation in normal complex arithmetic.

Comparisons and packed integers

The next step is to compute the (squared) norm of complex values, in order to detect escape from the orbit.
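That’s one line on top of the operations we already have:

```lisp
(defun %norm^2 (r i)
  (declare (type (sb-ext:simd-pack single-float) r i))
  (f4+ (f4* r r) (f4* i i)))
```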

Again, a couple of smoke tests and a disassembly confirm we get what we expect.
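For instance, the squared norm of 3 + 4i in every lane:

```lisp
(%norm^2 (sb-ext:%make-simd-pack-single 3.0 3.0 3.0 3.0)
         (sb-ext:%make-simd-pack-single 4.0 4.0 4.0 4.0))
;; => a pack of four 25.0s
```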

Finally, we have to compare floats (squared norms) against a limit (4.0), and perform some integer arithmetic to count the number of iterations until escape. f4-sign-mask is the first instance of a function that accepts arbitrary SIMD pack types: whether the pack holds singles, integers, or even doubles, it extracts the sign bits of the four 32-bit chunks.

The VOPs are really straightforward and somewhat repetitive, again: we only want the bare minimum to get something working, and fancy special cases can wait until they actually show up.
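As a representative example, f4-sign-mask boils down to movmskps; the unspecialised simd-pack argument maps to the integer primitive type, and any SSE storage class is acceptable on input. A sketch:

```lisp
;; In SB-VM again.
(defknown mandelbrot::f4-sign-mask (simd-pack) (unsigned-byte 4)
    (movable flushable always-translatable)
  :overwrite-fndb-silently t)

(define-vop (mandelbrot::f4-sign-mask)
  (:translate mandelbrot::f4-sign-mask)
  (:policy :fast-safe)
  (:args (x :scs (int-sse-reg single-sse-reg double-sse-reg)))
  (:arg-types simd-pack-int)
  (:results (r :scs (unsigned-reg)))
  (:result-types unsigned-num)
  (:generator 3
    (inst movmskps r x)))
```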

Again, we add stubs to easily work at the REPL
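Same pattern as before:

```lisp
(in-package "MANDELBROT")

(defun f4<= (x y) (f4<= x y))
(defun i4- (x y) (i4- x y))
(defun f4-sign-mask (x) (f4-sign-mask x))
```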

And some smoke tests
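Roughly along these lines (printed pack representations again elided):

```lisp
(in-package "MANDELBROT")

(defvar *x* (sb-ext:%make-simd-pack-single 1.0 2.0 3.0 4.0))
(defvar *two* (sb-ext:%make-simd-pack-single 2.0 2.0 2.0 2.0))

(defvar *mask* (f4<= *x* *two*))
;; lanes 0 and 1 are <= 2.0: #xffffffff #xffffffff 0 0

(f4-sign-mask *mask*)
;; => 3 (the two low sign bits are set)

(sb-ext:%simd-pack-ub32s
 (i4- (sb-ext:%make-simd-pack-ub32 0 0 0 0) *mask*))
;; => 1, 1, 0, 0 -- subtracting -1 increments
```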

f4<= compares two packs of single floats element by element, and returns a pack of integers with all 32 bits set to 1 for pairs that are <=, and 0 otherwise. i4- subtracts 32-bit signed integers element-wise; subtracting -1 (#xffff...) is equivalent to adding 1. Finally, f4-sign-mask extracts the topmost (sign) bit from each chunk of 32 bits, and returns the 4 bits as a normal integer (i.e. in a general purpose register). The function sb-ext:%simd-pack-ub32s is useful to extract the four 32-bit unsigned values from an SIMD pack – in this case the result of subtracting the comparison masks from zero – and work on them in a scalar context.

A complete Mandelbrot inner loop

The idea is that we’ll want to compare the result of %norm^2 with the limit (4.0), and stop iterating when the maximal iteration count is reached, or when all four norms are greater than 4.0 (all sign bits are zero). Until then, we can subtract from the counter to increment the number of unescaped iterations. When we’re done, we can easily extract individual iteration counts from the packed counters.
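Putting it together, along these lines (starting from z = c and counting updates that stay in orbit is one plausible convention):

```lisp
(in-package "MANDELBROT")

(defun mandelbrot-escape (cr ci max-iter)
  (declare (type (sb-ext:simd-pack single-float) cr ci))
  (let ((zr cr) ; z_1 = c
        (zi ci)
        (counter (sb-ext:%make-simd-pack-ub32 0 0 0 0))
        (limit (sb-ext:%make-simd-pack-single 4.0 4.0 4.0 4.0)))
    (dotimes (i max-iter counter)
      (setf (values zr zi) (%mandelbrot-iter zr zi cr ci))
      (let ((still-in-orbit (f4<= (%norm^2 zr zi) limit)))
        (when (zerop (f4-sign-mask still-in-orbit))
          (return counter))
        ;; subtracting all-ones masks increments the unescaped lanes
        (setf counter (i4- counter still-in-orbit))))))
```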

This yields mostly the machine code we’d hope for.

We can run a couple random tests and compare with a straightforward scalar version:
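A sketch of both, assuming the same counting convention as the packed loop (escaped lanes keep iterating in the packed version, but their norms keep growing, so the counts agree):

```lisp
(in-package "MANDELBROT")

(defun scalar-escape (cr ci max-iter)
  (declare (type single-float cr ci))
  (let ((zr cr)
        (zi ci)
        (count 0))
    (dotimes (i max-iter count)
      ;; psetf computes both new values from the old zr/zi
      (psetf zr (+ (- (* zr zr) (* zi zi)) cr)
             zi (+ (* 2.0 zr zi) ci))
      (if (<= (+ (* zr zr) (* zi zi)) 4.0)
          (incf count)
          (return count)))))

(defun test-escape (&optional (max-iter 100))
  (let* ((crs (loop repeat 4 collect (1- (random 2.0))))
         (cis (loop repeat 4 collect (1- (random 2.0))))
         (packed (mandelbrot-escape
                  (apply #'sb-ext:%make-simd-pack-single crs)
                  (apply #'sb-ext:%make-simd-pack-single cis)
                  max-iter)))
    (equal (multiple-value-list (sb-ext:%simd-pack-ub32s packed))
           (mapcar (lambda (cr ci) (scalar-escape cr ci max-iter))
                   crs cis))))
```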

So, there are some register allocation oddities (mostly because the VOPs don’t handle the case when both arguments are the same, I think), but it’s pretty good. The one issue that bothers me is (zerop (f4-sign-mask still-in-orbit)): it’s a common operation, and it compiles slightly badly (the result of f4-sign-mask is converted to a fixnum before the comparison… an instance of bad representation selection). These two reasons – factoring code out and improving code generation for a medium-level operation – are enough for me to add a specialised function to test whether all the sign bits are zero.

A new predicate

We define predicates simply as functions that return boolean values (this one is prefixed with f4- because the instruction officially works on single floats).
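In sketch form (I’ll call the predicate f4-sign-zerop here; any f4-prefixed name would do):

```lisp
;; In SB-VM.
(defknown mandelbrot::f4-sign-zerop (simd-pack) boolean
    (movable flushable always-translatable)
  :overwrite-fndb-silently t)
```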

However, VOPs for such functions shouldn’t return values; instead, they can define how to determine that their result is true: in this case, the result is true when the :z (zero) flag is set. Code generation takes care of using that information in conditional branches or conditional moves.
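A sketch of the conditional VOP (again assuming the predicate is named f4-sign-zerop):

```lisp
(define-vop (mandelbrot::f4-sign-zerop)
  (:translate mandelbrot::f4-sign-zerop)
  (:policy :fast-safe)
  (:args (x :scs (int-sse-reg single-sse-reg double-sse-reg)))
  (:arg-types simd-pack-int)
  (:temporary (:sc unsigned-reg) bits)
  (:conditional :z)   ; result is true iff the zero flag ends up set
  (:generator 3
    (inst movmskps bits x)
    (inst test bits bits)))
```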

For example, in this stub, the compiler will emit the template, followed by a conditional move to return T or NIL.
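The stub itself is the usual fake recursion (with the f4-sign-zerop name assumed above):

```lisp
(in-package "MANDELBROT")

(defun f4-sign-zerop (x)
  (f4-sign-zerop x))
```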

Wrapping up

Now, we can slightly adjust mandelbrot-escape to exploit this new function:
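With the predicate (called f4-sign-zerop in these sketches) substituted for the zerop/f4-sign-mask pair, the loop becomes:

```lisp
(in-package "MANDELBROT")

(defun mandelbrot-escape (cr ci max-iter)
  (declare (type (sb-ext:simd-pack single-float) cr ci))
  (let ((zr cr)
        (zi ci)
        (counter (sb-ext:%make-simd-pack-ub32 0 0 0 0))
        (limit (sb-ext:%make-simd-pack-single 4.0 4.0 4.0 4.0)))
    (dotimes (i max-iter counter)
      (setf (values zr zi) (%mandelbrot-iter zr zi cr ci))
      (let ((still-in-orbit (f4<= (%norm^2 zr zi) limit)))
        ;; the predicate compiles to movmskps + test + branch, with no
        ;; intermediate fixnum
        (when (f4-sign-zerop still-in-orbit)
          (return counter))
        (setf counter (i4- counter still-in-orbit))))))
```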

This all looks like a lot of work, but the vast majority of it is reusable code to define SSE operations: 22 LOC of defknown, 83 LOC for assembly templates, 20 LOC in stubs and utility code. There’s only 31 LOC for the SIMD mandelbrot loop (%mandelbrot-iter, %norm^2 and mandelbrot-escape), 6 for the naïve scalar loop, and 12 for the random test harness. I uploaded all the files on github to simplify further experimentation; just fork and play around! The function names could certainly be improved, for one.

Miscellaneous notes

My request for extra-hard bug shaking in SBCL 1.1.8 didn’t work as well as I’d wished: only a few hours after the release (and after a week of code freeze), we received a new bug report (with fixes committed very rapidly, at least). We observed the same phenomenon with 1.1.6, and wound up being extremely conservative for 1.1.7. I’m not sure how best to encourage testing (binary release candidates during the freeze?), but I am starting to see the point of Linux’s old odd/even policy. For now, the result is that 1.1.8 offers a lot of exciting improvements, but also more than its share of bugs; for production code, it’s probably best to either build an early 1.1.8.git-commit, or wait for 1.1.9.

Long-time readers might remember that I first blogged about SSE intrinsics in 2008. The problem was that I hit some issues with representing type information (whether an SIMD pack contains integer, single float or double float data), and didn’t manage to find a nice resolution. I also had trouble with type operations (intersection, union, negation) on (simd-pack [type]). For the first problem, I bit the bullet and created multiple storage classes and primitive types; as shown above, a function can accept arbitrary simd-packs, and so can VOPs: when converting to primitive types, unspecialised simd-packs are treated like integer simd-packs. The second problem I solved by moving to a type upgrading system inspired by array types. There are really only three specialised simd-pack types: on integers, single floats and double floats. Any specialised simd-pack type specifier must be specialised on a recognisable subtype of one of these three types. It’s a hack, but all type operations become trivial and easy to think about.

The reason I finally sat down, re-read my code and thought hard for a few days is Google’s Summer of Code. A student proposed to build on that work to vectorise standard sequence operations in CL, and I really didn’t want to be blocking their work. In the end we were only allotted two slots, and had to select a pair of extremely strong proposals from ten interesting submissions… more on that soon!

In the meantime, I hope the recent burst of activity and outside interest can motivate others to contribute to SBCL, even without Google support. We already have one person working on a contrib to exploit GMP for bignum and rational arithmetic, instead of our hand-rolled routines; I expect to see it merged in 1.1.9 or in 1.1.10.

P.S. I recently stumbled on Glassman’s FFT and it’s really nifty: an in-order general-size FFT (not so fast when prime factors are huge) in approximately 50 LOC. The performance isn’t awesome on cached architectures, but shouldn’t be atrocious either… and it handles arbitrary DFT sizes. There are issues with numerical stability when computing the twiddle factors on the fly, and the access patterns could be improved, but it’s a very simple and interesting foundation… especially given that the inner loop is naturally vectorisable.