Paul Khuong mostly on Lisp


Tue, 23 Oct 2007


Fast Constant Integer Division

Integer division (and its kissing cousins, remainder and modulo) is an operation that regularly pops up in surprising places. The first time I encountered the problem was a few years ago, when I wanted to implement a corewar VM in SBCL (probably my first attempt at a real project in CL). My program was much slower than I expected: MOD with constant divisors was compiled to actual division instructions. More recently, I hit the same problem when trying to implement hash functions and PRNGs efficiently. The main technique to make such operations decently fast approximates the division with a multiplication and a shift (in other words, a multiplication by a fraction). However, in more restricted situations, other tricks can sometimes yield simpler code.

Expressing division as a multiplication is actually simple when you think of it in terms of fractions. Division (and then truncation) by x is like multiplication by 1/x. Since the only kind of division we can easily express is division by a power of 2, we want to approximate multiplication by 1/x with a multiplication by y/2^k. Moreover, since the result will be truncated, we want to overapproximate (in absolute value) the fraction. We thus find (for x positive) the folklore formula y = ceiling(2^k/x) (or 1 + floor(2^k/x), which suffers from an off-by-one when x is a power of 2). Note that since the multiplier is rounded up, any error will be one of overapproximation. If the sign of the input is known (or the range is otherwise asymmetric), it is possible to use a small additive constant to bring the result of the multiplication closer to 0 before dividing and taking the floor (with a shift), and thus reduce the overapproximation.
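
To make the formula concrete, here's a minimal sketch in C (the function name and the 16-bit input range are my choice, not from the post). With inputs this narrow relative to k = 32, the overapproximation can never reach the next integer, so the result is exact; for a constant x, the compiler would fold y into a literal.

    #include <stdint.h>

    /* Truncated division of a 16-bit z by a 16-bit x > 0, via the
     * folklore formula y = ceiling(2^32/x).  For a constant divisor,
     * y would be precomputed at compile time. */
    static uint32_t div_const(uint16_t z, uint16_t x)
    {
        uint64_t y = (((uint64_t)1 << 32) + x - 1) / x; /* ceiling(2^32 / x) */
        return (uint32_t)(((uint64_t)z * y) >> 32);     /* floor(z * y / 2^32) */
    }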

The paper at http://www.discontinuity.info/~pkhuong/pseudoinverse.pdf gives a nice way to figure out the smallest z such that floor(z/x) differs from floor(z*y/2^k). If the input is known to fit in that range (which is often on the order of a machine int), then it is safe to convert the integer division into a multiplication. Unfortunately, the paper doesn't even consider the possibility of a (constant) additive fixup between the multiplication and the division/flooring.
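
As a throwaway illustration (brute force, not the paper's method), exhaustive search finds that smallest z; x = 7 with k = 32 happens to be a case where the bare multiplier is not precise enough for the full 32-bit range:

    #include <stdint.h>
    #include <stdio.h>

    /* Search for the smallest z where floor(z*y/2^k) diverges from
     * floor(z/x), for x = 7 and k = 32. */
    int main(void)
    {
        const uint64_t x = 7, k = 32;
        const uint64_t y = (((uint64_t)1 << k) + x - 1) / x; /* ceiling(2^k/x) */

        for (uint64_t z = 0; z < ((uint64_t)1 << 32); z++)
            if ((z * y) >> k != z / x) {
                printf("first mismatch at z = %llu\n", (unsigned long long)z);
                return 0;
            }
        puts("exact over the whole 32-bit range");
        return 0;
    }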

When implementing this on a machine that offers BxB -> B, B multiplication (e.g., 32x32 -> 32, 32 on IA32, or 64x64 -> 64, 64 on x86-64), or even just the most significant half (like Itanium or Alpha), there are two main interesting values for k (as in y/2^k): B and B + floor(lg x). If we take k = B, then we can simply use the most significant word of the product as an implicitly shifted result. If that isn't precise enough, then B + floor(lg x) is the largest denominator such that y (skipping leading zeros) still fits in B bits; the result must then be shifted. In both cases, but especially when shifting, it is essential to ensure that the result won't overflow the space available in the significant half of the product. When a machine offers several multiplication widths, it can be useful to try the smaller ones first: wider multiplications are sometimes slower, and larger constants always take up more space.
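
A sketch of the k = B + floor(lg x) case in C (the divisor and name are my choice): with B = 32 and x = 10, we get k = 35, and y = ceiling(2^35/10) = 0xCCCCCCCD still fits in 32 bits. Taking the high word of the 32x32 -> 64 product shifts by 32 implicitly, and a final shift by floor(lg 10) = 3 finishes the job; this particular multiplier happens to be exact for every 32-bit input.

    #include <stdint.h>

    /* Division by the constant 10, with k = B + floor(lg x) = 35. */
    static uint32_t div10(uint32_t z)
    {
        uint64_t product = (uint64_t)z * 0xCCCCCCCDu; /* 32x32 -> 64 multiply */
        return (uint32_t)(product >> 35); /* high word (>> 32), then >> floor(lg 10) */
    }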

Sometimes, we're only interested in the remainder (MOD is slightly more complex and can be expressed in terms of REM). In the general case, it is impossible to avoid a division, or at least a multiplication by a pseudoinverse, followed by a multiplication and a subtraction. However, when the input range is restricted, some nice tricks are available. It's much simpler to assume that the input is always positive (or always negative). Doing otherwise requires a non-trivial amount of mask generation and subsequent masking, making the tricks less attractive.
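
In code, the general case might look like this sketch (reusing the hypothetical divide-by-10 multiplier above): the quotient comes from a reciprocal multiplication, and the remainder costs one extra multiplication and a subtraction.

    #include <stdint.h>

    /* Remainder via the quotient: z mod 10 = z - 10 * floor(z/10). */
    static uint32_t rem10(uint32_t z)
    {
        uint32_t q = (uint32_t)(((uint64_t)z * 0xCCCCCCCDu) >> 35); /* z / 10 */
        return z - q * 10u; /* one multiplication and one subtraction */
    }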

Apart from powers of 2, the most obvious trick applies when we want (rem z x) and z is known to be smaller than 2x. Express x as 2^n - e. We have (rem z x) = z if z < x, and (rem z x) = z - x otherwise. It would be simple to compare, generate a mask and do a masked subtraction. However, when z + e might overflow (and n is the width of the register), a simple way to perform the comparison is to add e to z and watch for overflow. If the sum exceeds 2^n - 1, then z >= x (and we still have z < 2^(n+1)). Luckily, the implicit wraparound (masking off everything but the low n bits) is exactly the subtraction we need: since e was previously added, wrapping around 2^n subtracts a net 2^n - e = x. If z < x (the overflow bit isn't set), we only have to subtract e back. This logic is easily expressed with a masked subtraction.
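
A sketch in C, for a hypothetical x = 2^32 - 5 (so n = 32 and e = 5; every 32-bit z then satisfies z < 2x). The branch stands in for the masked subtraction described above:

    #include <stdint.h>

    #define E 5u /* hypothetical: x = 2^32 - 5 */

    /* (rem z x) for x = 2^32 - E, assuming z < 2x. */
    static uint32_t rem_near_pow2(uint32_t z)
    {
        uint32_t sum = z + E;     /* wraps mod 2^32 exactly when z >= x */
        return sum < z ? sum      /* wrapped: z + E - 2^32 = z - x */
                       : sum - E; /* didn't wrap: z < x, so take E back out */
    }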

A slightly more widely applicable trick is also based on representing the modulus x as 2^k - e. We can split z into 2 (or more) parts: z = z1*2^k + z0. Since 2^k is congruent to e modulo x, (rem z x) = (rem (z1*e + z0) x). If the product z1*e is small enough, we can then use the previous trick instead of computing the remainder the usual way (it would also be possible to recurse). By adding e to z1*e and then adding that sum to z0, we can guarantee that any overflow happens only in the last addition, which makes it easy to use the hardware flags.
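
My example here (it's the standard reduction for Lehmer PRNGs) is x = 2^31 - 1, with k = 31 and e = 1: reducing a 32-bit z costs a shift, a mask, an add, and one conditional subtraction from the previous trick.

    #include <stdint.h>

    /* z mod (2^31 - 1): split z = z1*2^31 + z0; since 2^31 = x + 1,
     * z is congruent to z1 + z0 mod x, and that sum is at most x + 1. */
    static uint32_t mod_mersenne31(uint32_t z)
    {
        uint32_t z1 = z >> 31;         /* high part */
        uint32_t z0 = z & 0x7FFFFFFFu; /* low 31 bits */
        uint32_t t  = z1 + z0;         /* one conditional subtraction fixes it */
        return t >= 0x7FFFFFFFu ? t - 0x7FFFFFFFu : t;
    }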

Official SBCL should be able to express some of these tricks (at the very least reciprocal multiplication) some time soon. The neat thing about optimising divisions away is that they're so expensive (on the order of 100-200 clock cycles on modern x86) that even branchy replacements are pretty much guaranteed to be an improvement, even in the worst case :)

posted at: 00:49 | /LowLevel | permalink
