### 0x7FDE623822FC16E6 : a magic constant for double float reciprocal

*EDIT: I never expected this post to garner so much interest. Just to be clear, using the constant is a very bad idea in almost every situation I can think of. Even instructions like* `RCPSS` *should only be used by people who are convinced that the tradeoffs are worth it (or don’t have any other choice). Subtracting from the constant above is even worse, and handles even fewer corner cases.*

I’ve been working on a specialised discrete optimisation algorithm, and `perf` tells me that it’s spending around 25% of its time on one (double float) division instruction: the algorithm has to repeatedly compute *x*^{-1} for a couple hundred thousand values in a vector. If the denominator were constant, it would be simple to speed that up. Unfortunately, the only shortcut for reciprocals seems to consist of computing an approximation and refining it with a couple of Newton-Raphson steps.

Surprisingly, it’s still possible to achieve significant speed-ups even when as many as 3 refinement steps are needed: a division unit takes a lot of space, so, unlike multiplications and additions, hardware divisions can rarely be executed in parallel. In other words, each division takes several cycles (around 7-14, according to Agner Fog’s tables [PDF]) and must wait for the previous division to be fully computed. Since I’m dividing a large number of values one after the other, I care about throughput more than latency, and it can pay off to perform more operations, as long as they’re executed in parallel or in a pipeline.
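For reference, here’s what one refinement step looks like; each step roughly doubles the number of correct bits, and only costs two multiplications and a subtraction, which pipeline well (the function name is mine):

```c
/* One Newton-Raphson step for y ~ 1/x: y' = y * (2 - x * y),
 * obtained by applying Newton's method to f(y) = 1/y - x. */
static double refine(double x, double y)
{
    return y * (2.0 - x * y);
}
```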

The textbook way to compute an approximation of the reciprocal is probably to use hardware support. On `x86[-64]`, that would be `RCPSS`, which computes a 12 bit approximation of the reciprocal of a single float. For my use case, I also have to convert to and from doubles (`CVTSD2SS` and `CVTSS2SD`). My Westmere machine can dispatch one of these instructions per cycle.
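With SSE intrinsics, that path looks something like the sketch below; the wrapper name is mine, and a real implementation would probably stay in vector registers rather than round-trip through a scalar:

```c
#include <emmintrin.h> /* SSE2; also pulls in the SSE intrinsics */

/* ~12-bit initial guess for 1/x via RCPSS: convert the double
 * to single precision, take the approximate reciprocal, and
 * convert back. */
static double rcpss_guess(double x)
{
    __m128d xd = _mm_set_sd(x);
    __m128  xs = _mm_cvtsd_ss(_mm_setzero_ps(), xd); /* CVTSD2SS */
    __m128  rs = _mm_rcp_ss(xs);                     /* RCPSS    */
    return _mm_cvtsd_f64(_mm_cvtss_sd(xd, rs));      /* CVTSS2SD */
}
```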

But then again, it seems like it would be possible to do something like the classic fast inverse square root. Simply taking the opposite value of the exponent field gives us a decent approximation: double floats on x86 (and on most other machines) are represented as

```
sign  : exponent : significand
1 bit : 11 bits  : 52 bits (and an implicit leading 1)
```

The exponent is signed, but represented as an unsigned number with a positive bias: an exponent of 0 is represented by 1023, -1 by 1022, 1 by 1024, etc. The maximal biased value for a normal number is 2046 (2047 is used for infinities and NaNs). So, to get the opposite of the exponent, it suffices to subtract its biased value from 2046. That gives us a single bit of precision which, while sufficient to guarantee convergence of the Newton-Raphson steps, is far from useful.
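In code, negating the exponent is a bit cast, a subtraction on the exponent field, and a cast back; the helper names (and the use of `memcpy` for the type pun) are mine:

```c
#include <stdint.h>
#include <string.h>

/* Bit casts between double and uint64_t. */
static uint64_t from_double(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

static double to_double(uint64_t bits)
{
    double x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* 1-bit guess for 1/x: subtract the biased exponent from 2046
 * (i.e. from 2046 << 52), dropping the significand entirely. */
static double exponent_guess(double x)
{
    uint64_t exponent = from_double(x) & (0x7FFULL << 52);
    return to_double((2046ULL << 52) - exponent);
}
```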

Instead, it would be interesting to also exploit the significand to get a couple more
bits of precision from a subtraction like to-double(*a* - from-double(*x*)). The
Newton-Raphson steps can diverge when the initial guess is too large, so the constant
*a* should be of the form 2046 ⋅ 2^{52} - *b*, where *b* ∈ [0, 2^{52} - 1]. I graphed the
maximal error on a range of inputs (2^{17} equidistant values between 0 and 3),
and it looks unimodal (decreasing until it hits the global minimum, then
increasing).
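The error function I graphed looks roughly like the sketch below, reusing `refine()`, `from_double()` and `to_double()` from earlier; the exact grid is my reconstruction of “2^{17} equidistant values between 0 and 3”:

```c
#include <math.h>
#include <stdint.h>

/* Maximal relative error of to_double(a - from_double(x)),
 * refined with three Newton-Raphson steps, over the test grid. */
static double max_error(uint64_t a)
{
    double worst = 0.0;
    for (long i = 1; i <= 1L << 17; i++) {
        double x = 3.0 * i / (1L << 17); /* equidistant in (0, 3] */
        double y = to_double(a - from_double(x));
        y = refine(x, y);
        y = refine(x, y);
        y = refine(x, y);
        double error = fabs(x * y - 1.0);
        if (error > worst)
            worst = error;
    }
    return worst;
}
```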

I want to minimise a function that seems unimodal, but isn’t (a priori)
differentiable and is fairly expensive to compute. That seemed like a perfect
application for the golden section search; to avoid any problem with floating point
rounding errors, I used an exhaustive search once the range was down to a
couple thousand values (a sketch of the whole search follows below). I waited a couple of seconds, and my REPL spat out
`0x7FDE623822FC16E6` for the best value of *a*: the maximal relative error after 3 steps
is on the order of 4 ⋅ 10^{-11} (≈ 35 bits), and the average around 1 ⋅ 10^{-11}.
The convergence rate of the Newton-Raphson method is quadratic, so the
magic constant yields slightly more than 4(!!) bits of precision after a single
subtraction.
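My actual search wasn’t in C, but a sketch of the same idea follows: golden section until the bracket is down to a couple thousand candidates, then an exhaustive scan. (A textbook implementation would also reuse one of the two probe evaluations per iteration instead of recomputing both.)

```c
/* Search for the b minimising max_error((2046 << 52) - b),
 * returning the best constant a = (2046 << 52) - b. */
static uint64_t best_constant(void)
{
    const double inv_phi = 0.6180339887498949; /* 1 / golden ratio */
    uint64_t lo = 0, hi = (1ULL << 52) - 1;

    /* Golden section: shrink [lo, hi] by a factor of 1/phi
     * per iteration. */
    while (hi - lo > 4096) {
        uint64_t m1 = hi - (uint64_t)((hi - lo) * inv_phi);
        uint64_t m2 = lo + (uint64_t)((hi - lo) * inv_phi);
        if (max_error((2046ULL << 52) - m1)
            <= max_error((2046ULL << 52) - m2))
            hi = m2; /* the minimum lies in [lo, m2] */
        else
            lo = m1; /* the minimum lies in [m1, hi] */
    }

    /* Exhaustive scan of the remaining candidates. */
    uint64_t best = lo;
    for (uint64_t b = lo + 1; b <= hi; b++)
        if (max_error((2046ULL << 52) - b)
            < max_error((2046ULL << 52) - best))
            best = b;
    return (2046ULL << 52) - best;
}
```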
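Put together, the whole approximation is one subtraction and three refinement steps (same helpers as above):

```c
/* Approximate 1/x to ~35 bits: the magic-constant guess gives
 * slightly more than 4 bits, and each Newton-Raphson step
 * roughly doubles that. */
static double fast_reciprocal(double x)
{
    double y = to_double(0x7FDE623822FC16E6ULL - from_double(x));
    y = refine(x, y);
    y = refine(x, y);
    y = refine(x, y);
    return y;
}
```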

In the end, going through `RCPSS` is still faster, as the initial guess is almost 3
times as precise, but on another machine without hardware support for approximate
reciprocals, or on which the Newton-Raphson steps execute faster, it might be useful
to keep `0x7FDE623822FC16E6` in mind. Granted, that doesn’t seem to have often been
the case so far: I can’t find that constant, or any value close to it, on Google.