In the words of a friend and former colleague:
Two years of my life in one repository….
— John Wittrock (@johnwittrock) December 19, 2017
Congrats @pkhuong @arexus and all! https://t.co/jPFnYrc5V4
If you don’t want to read more about what’s in ACF and why I feel
it’s important to open source imperfect repositories,
jump to the section on fast itoa
.
ACF contains the base data structure and runtime library code we use to build production services, in C that targets Linux/x8664. Some of it is correctly packaged, most of it just has the raw files from our internal repository. Ironically, after settling on the project’s name, we decided not to publish the most “frameworky” bits of code: it’s unclear why anyone else would want to use it. The data structures are in C, and tend to be readoptimised, with perhaps some support for nonblocking singlewriter/multireader concurrency. There’s also nonblocking algorithms to support the data structures, and basic HTTP server code that we find useful to run CPUintensive or mixed CPU/networkintensive services.
Publishing this internal code took a long time because we were trying to open a project that didn’t exist yet, despite being composed of code that we use every day. AppNexus doesn’t sell code or binaries. Like many other companies, AppNexus sells services backed by inhouse code. Our code base is full of informal libraries (I would be unable to make sense of the code base if it wasn’t organised that way), but enforcing a clean separation between pseudolibraries can be a lot of extra work for questionable value.
These fuzzy demarcations are made worse by the way we imported some ideas directly from Operating Systems literature, in order to support efficient concurrent operations. That had a snowball effect: everything, even basic data structures, ends up indirectly depending on runtime system/framework code specialised for our use case. The usual initial offenders are the safe memory reclamation module, and the tracking memory allocator (with a bump pointer mode); both go deep in internals that probably don’t make sense outside AppNexus.
Back in 2015, we looked at our support code (i.e., code that doesn’t directly run the business) and decided we should share it. We were–and still are–sure that other people face similar challenges, and exchanging ideas, if not directly trading code, can only be good for us and for programming in general. We tried to untangle the “Common” (great name) support library from the rest of the code base, and to decouple it from the more opinionated parts of our code, while keeping integration around (we need it), but purely optin.
That was hard. Aiming for a separate shared object and a real Debian package made it even harder than it had to be. The strong separation between packaged ACF code and the rest of repo added a lot of friction, and the majority of the support code remained intree.
Maybe we made a mistake when we tried to librarify our internals. We want a library of reusable code; that doesn’t have to mean a literal shared object. I’m reminded of the two definitions of portable code: code sprinkled with platform conditionals, or code that can be made to run on a new machine with minimal effort. Most of the time, I’d rather have the latter. Especially when code mostly runs on a single platform, or is integrated in few programs, I try to reduce overhead for the common case, while making reuse possible and easy enough that others can benefit.
And that’s how we got the ACF effort out of the door: we accepted that the result would not be as polished as our favourite open source libraries, and that most of the code wouldn’t even be packaged or disentangled from internals. That’s far from an ideal state, but it’s closer to our goals than keeping the project private and on the backburner. We got it out by “feature” boxing the amount of work–paring it down to figuring out what would never be useful to others, and tracking down licenses and provenance–before pushing the partial result out to a public repository. Unsurprisingly, once that was done, we completed more tasks on ACF in a few days than we have in the past year.
Now that ACF is out, we still have to figure out the best way to help others coopt our code, to synchronise the public repository with our internal repository, and, in my dreams, to accept patches for the public repo and have them also work for the internal one. In the end, what’s important is that the code is out there with a clear license, and that someone with similar problems can easily borrow our ideas, if not our code.
The source isn’t always pretty, and is definitely not as well packaged and easily reusable as we’d like it to be, but it has proved itself in production (years of use on thousands of cores), and builds to our real needs. The code also tries to expose correct and efficient enough code in ways that make correct usage easy, and, ideally, misuse hard. Since we were addressing specific concrete challenges, we were able to tweak contracts and interfaces a bit, even for standard functionality like memory allocation.
The last two things are what I’m really looking for when exploring other people’s support code: how did usage and development experience drive interface design, and what kind of nonstandard tradeoffs allowed them to find new lowhanging fruits?
If anyone else is in the same situation, please give yourself the permission to open source something that’s not yet fully packaged. As frustrating as that can be, it has to be better than keeping it closed. I’d rather see real, flawed but productiontested, code from which I can take inspiration than nothing at all.
¶ The
integer to string conversion file (an_itoa
)
is one instance of code that relaxes the usual [u]itoa
contract
because it was written for a specific problem (which also gave us
real data to optimise for). The relaxation stems from the fact that
callers should reserve up to 10 chars to convert 32 bit (unsigned)
integers, and 20 chars for 64 bit ones: we let the routines write
garbage (0
/NUL
bytes) after the converted string, as long as it’s
in bounds. This allowance, coupled with a smidge of thinking, let us
combine a few cute ideas to solve the depressingly common problem of
needing to print integers quickly.
Switching to an_itoa
might be a quick win for someone else, so I cleaned it up
and packaged it immediately after making the repository public.
We wrote an_itoa
in July 2014. Back then, we had an application
with a moderate deployment (a couple racks on three continents) that
was approaching capacity. While more machines were in the pipeline, a
quick perf
run showed it was spending a lot of time converting
strings to integers and back. We already had a fastish string to
integer function. Converting machine integers back to string however,
is a bit more work, and took up around 20% of total CPU time.
Of course, the real solution here is to not have this problem. We shouldn’t have been using a humanreadable format like JSON in the first place. We had realised the format would be a problem a long time ago, and were actually in the middle of a transition to protobuf, after a first temporary fix (replacing a piece of theoretically reconfigurable JavaScript that was almost never reconfigured with hardcoded C that performed the same JSON manipulation). But, there we were, in the middle of this slow transition involving terabytes of valuable persistent data, and we needed another speed boost until protobuf was ready to go.
When you’re stuck with C code that was manually converted, line by line, from JavaScript, you don’t want to try and make high level changes to the code. The only reasonable quick win was to make the conversion from integer to string faster.
Human readable formats wasting CPU cycles to print integers is a
common problem, and we quickly found a few promising approaches and
libraries. Our baseline was the radix10 code in
stringencoders.
This post about Lwan
suggested using radix10, but generating strings backward instead of
reversing like the stringencoders
library. Facebook apparently hit
a similar problem in 2013, which lead to
this solution
by Andrei Alexandrescu. The Facebook code combines two key
ideas: radix100 encoding, and finding the length of the string with
galloping search to write the result backward, directly where it
should go.
Radix100 made sense, although I wasn’t a fan of the 200byte lookup
table. I was also dubious of the galloping search; it’s a lot of
branches, and not necessarily easy to predict. The kind of memmove
we need to fixup after conversion is small and easy to specialise on
x86, so we might not need to predict the number of digits at all.
I then looked at the microbenchmarks for Andrei’s code, and they made it look like the code was either tested on integers with a fixed number of digits (e.g., only 4digit integers), or randomly picked with uniform probability over a large range.
If the number of digits is fixed, the branchiness of galloping search
isn’t an issue. When sampling uniformly… it’s also not an issue
because most integers are large! If I pick an integer at random in
[0, 1e6)
, 90% of the integers have 6 digits, 99% 5 or 6, etc.
Sometimes, uniform selection is representative of the real workload (e.g., random uids or sequential object ids). Often, not so much. In general, small numbers are more common; for example, small counts can be expected to roughly follow a Poisson distribution.
I was also worried about data cache footprint with the larger lookup
table for radix100 encoding, but then realised we were converting
integers in tight loops, so the lookup table should usually be hot.
That also meant we could afford a lot of instruction bytes; a multiKB
atoi
function wouldn’t be acceptable, but a couple hundred bytes was
fine.
Given these known solutions, John and I started doodling for a bit. Clearly, the radix100 encoding was a good idea. We now had to know if we could do better.
Our first attempt was to find the number of decimal digits more quickly than with the galloping search. It turns out that approximating \(\log\sb{10}\) is hard, and we gave up ;)
We then realised we didn’t need to know the number of decimal digits. If we generated the string in registers, we could find the length after the fact, slide bytes with bitwise shifts, and directly write to memory.
I was still worried about the lookup table: the random accesses in the
200 byte table for radix100 encoding could hurt when converting short
arrays of small integers. I was more comfortable with some form of
arithmetic that would trade bestcase speed for consistent, if slightly
suboptimal, performance. As it turns out, it’s easy to convert values
between 0 and 100 to
unpacked BCD
with a reciprocal multiplication by \( 1/10 \) and some inregister
bit twiddling. Once we have a string of BCD bytes buffered in a
general purpose register, we can vertically add '0'
to every byte in
the register to convert to ASCII characters. We can even do the whole
conversion on a pair of such values at once, with SIMD within a
register.
The radix100 approach is nice because it chops up the input two digits at a time; the makespan for a given integer is roughly half as long, since modern CPUs have plenty of execution units for the body.
The dependency graph for radix10 encoding of 12345678
looks like
the following, with 7 serial steps.
Going for radix100 halves the number of steps, to 4. The steps are
still serial, except for the conversion of integers in [0, 100)
to
strings.
Could we expose even more ILP than the radix100 loop?
The trick is to divide and conquer: divide by 10000 (1e4
) before
splitting each group of four digits with a radix100 conversion.
Recursive encoding gives us fewer steps, and 2 of the 3 steps can execute in parallel. However, that might not always be worth the trouble for small integers, and we know that small numbers are common. Even if we have a good divideandconquer approach for larger integers, we must also implement a fast path for small integers.
The fast path for small integers (or the most significant limb of
larger integers) converts a 2 or 4 digit integer to unpacked BCD,
bitscans for the number of leading zeros, converts the BCD to ASCII by
adding '0'
(0x30
) to each byte, and shifts out any leading zero;
we assume that trailing noise is acceptable, and it’s all NUL
bytes
anyway.
For 32bit integers
an_itoa
(really an_uitoa
) looks like:
if number < 100:
execute specialised 2digit function
if number < 10000:
execute specialised 4digit function
partition number with first 4 digits, next 4 digits, and remainder.
convert first 2 groups of 4 digits to string.
If the number is < 1e8: # remainder is 0!
shift out leading zeros, print string.
else:
print remainder # at most 100, since 2^32 < 1e10
print strings for the first 2 groups of 4 digits.
The 64 bit version,
an_ltoa
(really an_ultoa
) is more of the same, with differences when the
input number exceeds 1e8
.
I’ve already concluded that cache footprint was mostly not an issue, but we should still made sure we didn’t get anything too big.
an_itoa
: 400 bytes.an_ltoa
: 880 bytesfb_itoa
: 426 bytes + 200 byte LUTfb_constant_itoa
(without the galloping search): 172 bytes + 200 byte LUTlwan_itoa
(radix10, backward generation): 60 bytes.modp_uitoa10
: 91 bytes.The galloping search in Facebook’s converter takes a lot of space
(there’s a ton of conditional branches, and large numbers must be
encoded somewhere). Even if we disregard the lookup table, an_itoa
is smaller than fb_itoa
, and an_ltoa
(which adds code for > 32
bit integers) is only 254 bytes larger than fb_itoa
(+ LUT). Now,
Facebook’s galloping search attempts to make small integers go faster
by checking for them first; if we convert small numbers, we don’t
expect to use all ~250 bytes in the galloping search. However,
an_itoa
and an_ltoa
are similar: the code is setup such that
larger numbers jump forward over specialised subroutines for small
integers. Small integers thus fall through to only execute code at
the beginning of the functions. 400 or 800 bytes are sizable
footprints compared to the 60 or 90 bytes of the radix10 functions,
but acceptable when called in tight loops.
Now that we feel like the code and lookup table sizes are reasonable (something that microbenchmarks rarely highlight), we can look at speed.
I first ran the conversion with random integers in each digit count
class from 1 digit (i.e., numbers in [0, 10)
) to 19 (numbers in
[1e8, 1e9)
). The instruction cache was hot, but the routines were
not warmed on that size class of numbers (more realistic that way).
The results are cycle counts (with the minimum overhead for a noop conversion subtracted from the raw count), on an unloaded 2.4 GHz Xeon E52630L, a machine that’s similar to our older production hardware.
We have data for:
an_itoa
, our 32 bit conversion routine;an_ltoa
, our 64 bit conversion routine;fb_constant_itoa
, Facebook’s code, with the galloping search
stubbed out;fb_itoa
, Facebook’s radix100 code;itoa
, GNU libc conversion (via sprintf);lw_itoa
, Lwan’s backward radix10 converter;modp
, stringencoder’s radix10 / strreverse
converter.I included fb_constant_itoa
to serve as a lower bound on the
radix100 approach: the conversion loop stops as soon as it hits 0
(same as fb_itoa
), but the data is written at a fixed offset, like
lw_itoa
does. In both fb_constant_itoa
’s and lw_itoa
’s cases,
we’d need another copy to slide the part of the output buffer that was
populated with characters over the unused padding (that’s why
fb_itoa
has a galloping search).
When I chose these functions back in 2014, they were all I could find that was reasonable. Since then, I’ve seen one other divide and conquer implementation, although it uses a lookup table instead of arithmetic to convert radix100 limbs to characters, and an SSE2 implementation that only pays off for larger integers (32 bits or more).
Some functions only go up to UINT32_MAX
, in which case we have no
data after 9 digits. The raw data is here; I used
this R script to generate the plot.
The solid line is the average time per conversion (in cycles), over 10K data points, while the shaded region covers the 10th percentile to the 90th percentile.
(GNU) libc’s conversion is just wayy out there. The straightforward
modp
(stringencoders) code overlaps with Facebook’s itoa
; it’s
slightly slower, but so much smaller.
We then have two incomplete string encoders: neither
fb_constant_itoa
nor lw_itoa
generates their output where it should
go. They fill a buffer from the end, and something else (not
benchmarked) is responsible for copying the valid bytes where they
belong. If an incomplete implementation suffices, Lwan’s radix10
approach is already competitive with, arguably faster than, the
Facebook code. The same backward loop, but in radix100, is
definitely faster than Facebook’s full galloping search/radix100
converter.
Finally, we have an_itoa
and an_ltoa
, that are neck and neck with
one another, faster than both modp
and fb_itoa
on small and large
integers, and even comparable with or faster than the incomplete
converters. Their runtime is also more reliable (less variance) than
modp
’s and fb_itoa
’s: modp
pays for the second variable length
loop in strreverse
, and fb_itoa
for the galloping search. There
are more code paths in an_itoa
and an_ltoa
, but no loop, so the
number of (unpredictable) conditional branches is lower.
What have we learned from this experiment?
sprintf
. That makes sense,
since that code is so generic. However, in practice, we only
convert to decimal, some hex, even less octal, and the rest is
noise. Maybe we can afford to special case these bases.modp_uitoa10
hurts. It does make sense
to avoid that by generating backward, ideally in the right spot
from the start.fb_constant_itoa
is
faster than lwan_itoa
).an_itoa
and an_ltoa
are faster for small values).an_ltoa
is flatter for large
integers).With results that made sense for an easily understood microbenchmark, I decided to try a bunch of distributions. Again, the code was hot, the predictors lukewarm, and we gathered 10K cycle counts per distribution/function. The raw data is here, and I used this R script to generate the plot.
The independent variables are all categorical here, so I use one facet per distribution, and, in each facet, a boxplot per conversion function, as well as a jittered scatter plot to show the distribution of cycle counts.
Clearly, we can disregard glibc’s sprintf
(itoa
).
The first facet generated integers by choosing uniformly between
\(100, 1000, 10\sp{4}, \ldots, 10\sp{8}\). That’s a semirealistic
variation on the earlier dataset, which generated a bunch of numbers in
each size class, and serves as an easily understood worstcase for
branch prediction. Both an_itoa
and an_ltoa
are faster than the
other implementations, and branchier implementations (fb_itoa
and
modp
) show their variance. Facebook’s fb_itoa
isn’t even faster
than modp
’s radix10/strreverse
encoder. The galloping search
really hurts: fb_constant_itoa
, without that component, is slightly
faster than the radix10 lw_itoa
.
The second facet is an even harder case for branch predictors: random
values skewed with an exponential (pow(2, 64.0 * random() / RAND_MAX)
),
to simulate realworld counts. Both an_itoa
and
an_ltoa
are faster than the other implementations, although
an_ltoa
less so: an_itoa
only handles 32 bit integers, so it deals
with less entropy. Between the 32bit implementations, an_itoa
is
markedly faster and more consistent than lw_itoa
(which is
incomplete) and modp
. Full 64bit converters generally exhibit more
variance in runtime (their input is more randomised), but an_ltoa
is
still visibly faster than fb_itoa
, and even than the incomplete
fb_constant_itoa
. We also notice that fb_itoa
’s runtimes are more
spread out than fb_constant_itoa
: the galloping search adds overhead
in time, but also a lot of variance. That makes me think that the
Facebook code is more sensitive than others to difference in data
distribution between microbenchmarks and production.
The third facet should be representative of printing internal
sequential object ids: uniform integers in [0, 256K)
. As expected,
every approach is tighter than with the skewed “counts” distribution
(most integers are large). The an_itoa
/an_ltoa
options are faster
than the rest, and it’s far from clear that fb_itoa
is preferable to
even modp
. The range was also chosen because it’s somewhat of a
worst case for an_itoa
: the code does extra work for values between
\(10\sp{4}\) and \(10\sp{8}\) to have more to do before the
conditional branch for x < 1e8
. That never pays off in the range
tested here. However, even with this weakness, an_itoa
still seems
preferable to fb_itoa
, and even to the simpler modp_uitoa10
.
The fourth facet (first of the second row) shows what happens when we
choose random integers in [0, 20)
. That test case is interesting
because it’s small, thus semirepresentative of some of our counts,
and because it needs 1 or 2 digits with equal probability. Everything
does pretty well, and runtime distributions are overall tight; branch
predictors can do a decent job when there are only two options. I’m
not sure why there’s such a difference between an_itoa
and
an_ltoa
’s distribution. Although the code for any value less than
100 is identical at the C level, there are small difference in code
generation… but I can’t pinpoint where the difference might come from.
The fifth facet, for random integers in [100, 200)
is similar, with
a bit more variance.
The sixth facet generates unix timestamps around a date in 2014 with
uniform selection plus or minus one million second. It’s meant to be
representative of printing timestamps. Again, an_itoa
and an_ltoa
are faster than the rest, with an_itoa
being slightly faster and
more consistent. Radix100 (fb_constant_itoa
) is faster and more
consistent than radix10 (lw_itoa
), but it’s not clear if fb_itoa
is preferable to modp
. The variance for modp
is larger than for
the other implementations, even fb_itoa
: that’s the cost of a
radix10 loop and of the additional strreverse
.
This set of results shows that conditional branches are an issue when
converting integers to strings, and that the impact of branches
strongly depends on the distribution. The Facebook approach, with a
galloping search for the number of digits, seems particularly
sensitive to the distribution. Running something like fb_itoa
because it does well in microbenchmark is thus only a good idea if we
know that the microbenchmark is representative of production.
Bigger numbers take more time to convert, but the divide and conquer
approach of an_itoa
and an_ltoa
is consistently faster at the high
end, while their unrolled SIMDwithinaregister fast path does well for
small numbers.
s[n]printf
The correct solution to the “integer printing is too slow” problem is simple: don’t do that. After all, remember the first rule of high performance string processing: “DON’T.” When there’s no special requirement, I find Protobuf does very well as a better JSON.
However, once you find yourself in this bad spot, it’s trivial to do better than generic libc conversion code. This makes it a dangerously fun problem in a way… especially given that the data distribution can matter so much. No benchmark is perfect, but various implementations are affected differently by flaws in microbenchmarks. It’s thus essential not to overfit on the benchmark data, probably even more important than improving performance by another factor of 10% or 20% (doing 45x better than libc code is already a given). That’s why I prefer integer conversion code with more consistent cycle counts: there’s less room for differences due to the distribution of data.
Finally, if, like 2014AppNexus, you find yourself converting a lot of
integers to strings in tight loops (on x8664 machines),
try an_itoa
or an_ltoa
!
The whole repository is Apache 2.0,
and it should be easy to copy and paste all the dependencies to pare
it down to two files. If you do snatch our code, note that the
functions use their destination array (up to 10 bytes for an_itoa
,
and 20
for an_ltoa
) as scratch space, even for small integers.
Thank you for reviewing drafts, John, Ruchir, Shreyas, and Andrew.
]]>Whenever I mention a data or work distribution problem where I ideally want everything related to a given key to hit the same machine, everyone jumps to consistent hashing. I don’t know how this technique achieved the mindshare it has, although I suspect Amazon’s 2007 Dynamo DB paper is to blame (by introducing the problem to many of us, and mentioning exactly one decentralised solution)… or maybe some Google interview prep package.
Karger et al’s paper doesn’t help, since they introduce the generic concept of a consistent hash function and call their specific solution… “consistent hashing.” I’m not sure where I first encountered rendezvous hashing, but I vaguely remember a technical report by Karger, so it’s probably not some MIT vs UMich thing.
Regardless of the reason for consistent hashing’s popularity, I feel the goto technique should instead be rendezvous hashing. Its basic form is simple enough to remember without really trying (one of those desert island algorithms), it is more memory efficient than consistent hashing in practice, and its downside–a simple implementation assigns a location in time linear in the number of hosts–is not a problem for small deployments, or even medium (a couple racks) scale ones if you actually think about failure domains.
Side question: why did rendezvous have to lose its hyphen to cross the Channel?
Basic rendezvous hashing takes a distribution key (e.g., a filename),
and a set of destinations (e.g., hostnames). It then uses a hash function
to pseudorandomly map each (distribution_key, destination)
pair to a
value in [0, 1)
or [0, 2^64  1)
, and picks the destination that
gives the minimal hash value. If it needs k
destinations for
redundancy, it can pick the destinations that yield the least k
hash
values. If there are ties (unlikely with a good hash function), it
breaks them arbitrarily but consistently, e.g., by imposing a total
order on hostnames.
A Python implementation could look like the following.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 

We only need to store the list of destinations, and we can convince ourselves that data distribution is pretty good (close to uniform) and that small changes in the set of destinations only affects a small fraction of keys (those going to destinations added/removed), either with pen and paper or with a few simulations. That compares positively with consistent hashing, where a practical implementation has to create a lot (sometimes hundreds) of pseudonodes for each real destination in order to mitigate clumping in the hash ring.
The downside is that we must iterate over all the nodes, while consistent hashing is easily \(\mathcal{O}(\log n)\) time, or even \(\mathcal{O}(\log \log n)\), with respect to the number of (pseudo)nodes. However, that’s only a problem if you have a lot of nodes, and rendezvous hashing, unlike consistent hashing, does not inflate the number of nodes.
Another thing I like about rendezvous hashing is that it naturally handles weights. With consistent hashing, if I want a node to receive ten times as much load as another, I create ten times more pseudonodes. As the greatest common divisor of weights shrinks, the number of pseudonode per node grows, which makes distribution a bit slower, and, more importantly, increases memory usage (linear in the number of pseudonodes). Worse, if you hit the fundamental theorem of arithmetic (as a coworker once snarked out in a commit message), you may have to rescale everything, potentially causing massive data movement.
Rendezvous hashing generates pseudorandom scores by hashing, and ranks
them to find the right node(s). Intuitively, we want to use weights
so that the distribution of pseudorandom scores generated for a node A
with twice the weight as another node B has the same shape as that of
node B, but is linearly stretched so that the average hash value for A is
twice that for B. We also want the distribution to cover [0, infty)
,
otherwise a proportion of hashes will always go to the heavier node,
regardless of what the lighter node hashes to, and that seems wrong.
The trick,
as explained by Jason Resch
at Cleversafe, is to map our hashes from uniform in [0, 1)
to
[0, infty)
not as an exponential, but with weight / log(h)
. If you
simulate just using an exponential, you can quickly observe that it
doesn’t reweigh things correctly: while the mean is correctly scaled,
the mass of the probability density function isn’t shifted quite right.
Resch’s proof of correctness for this tweaked exponential fits on a
single page.
The Python code becomes something like:
1 2 3 4 5 6 7 8 9 10 11 12 13 

There are obvious microoptimisations here (for example, computing the inverse of the score lets us precompute the reciprocal of each destination’s weight), but that’s all details. The salient part to me is that space and time are still linear in the number of nodes, regardless of the weights; consistent hashing instead needs space pseudolinear(!) in the weights, and is thus a bit slower than its \(\mathcal{O}(\log n)\) runtime would have us believe.
The lineartime computation for weighted rendezvous hashing is also CPU friendly. The memory accesses are all linear and easily prefetchable (load all metadata from an array of nodes), and the computational kernel is standard vectorisable floating point arithmetic.
In practice, I’m also not sure I ever really want to distribute between hundreds of machines: what kind of failure/resource allocation domain encompasses that many equivalent nodes? For example, when distributing data, I would likely want a hierarchical consistent distribution scheme, like Ceph’s CRUSH: something that first assigns data to sections of a datacenter, then to racks, and only then to individual machines. I should never blindly distribute data across hundreds of machines; I need to distribute between a handful of sections of the network, then one of a dozen racks, and finally to one of twenty machines. The difference between linear and logarithmic time at each level of this “failure trie” is marginal and is easily compensated by a bit of programming.
The simplicity of basic rendezvous hashing, combined with its minimal space usage and the existence of a weighted extension, makes me believe it’s a better initial/default implementation of consistent hash functions than consistent hashing. Moreover, consistent hashing’s main advantage, sublineartime distribution, isn’t necessarily compelling when you think about the whole datacenter (or even many datacenters) as a resilient system of failureprone domains. Maybe rendezvous hashing deserves a rebranding campaign (:
]]>However, if we had something simple enough to implement natively in the compiler, we could hope for the maintainers to understand what the ILP solver is doing. This seems realistic to me mostly because the generic complexity tends to lie in the continuous optimisation part. Branching, bound propagation, etc. is basic, sometimes domain specific, combinatorial logic; cut generation is probably the most prominent exception, and even that tends to be fairly combinatorial. (Maybe that’s why we seem to be growing comfortable with SAT solvers: no scary analysis.) So, for the past couple years, I’ve been looking for simple enough specialised solvers I could use in branchandbound for large 0/1 ILP.
Some stuff with augmented lagrangians and specialised methods for boxconstrained QP almost panned out, but nested optimisation sucks when the inner solver is approximate: you never know if you should be more precise in the lower level or if you should aim for more outer iterations.
A subroutine in Chubanov’s polynomialtime linear programming algorithm [PDF] (related journal version) seems promising, especially since it doesn’t suffer from the numerical issues inherent to log barriers.
Chubanov’s “Basic Subroutine” accepts a problem of the form \(Ax = 0\), \(x > 0\), and either:

1. finds a strictly positive solution;
2. proves that no such solution exists; or
3. determines that, in any solution scaled to \(x \leq \mathbf{1}\), some component \(x\sb{j} \leq \frac{1}{2}\).

The class of homogeneous problems seems useless (never mind the non-deterministic return value), but we can convert “regular” 0/1 problems to that form with a bit of algebra.
Let’s start with \(Ax = b\), \(0 \leq x \leq 1\); we can reformulate that in the homogeneous form:

\[Ax - by = 0,\] \[x + s - \mathbf{1}y = 0,\] \[x, s, y \geq 0.\]
Any solution to the original problem in \([0, 1]\) may be translated to the homogeneous form (let \(y = 1\) and \(s = \mathbf{1} - x\)). Crucially, any 0/1 (binary) solution to the original problem is still 0/1 in the homogeneous form. In the other direction, any solution with \(y > 0\) may be converted to the box-constrained problem by dividing everything by \(y\).
If we try to solve the homogeneous form with Chubanov’s subroutine, we may get:

1. a strictly positive solution, which we can scale (divide by \(y\)) into a fractional solution of the original problem;
2. a proof that the homogeneous system, and thus the original problem, is infeasible;
3. a component that is at most \(\frac{1}{2}\) in any solution, i.e., a variable we can fix in 0/1 solutions before recursing on the smaller problem.
As soon as we invoke the third case to recursively solve a smaller problem, we end up solving an interesting ill-specified relaxation of the initial 0/1 linear program: it’s still a valid relaxation of the binary problem, but is stricter than the usual box linear relaxation.
That’s more than enough to drive a branch-and-bound process. In practice, branch-and-bound is much more about proving the (near) optimality of an existing solution than coming up with strong feasible solutions. That’s why the fact that the subroutine “only” solves feasibility isn’t a blocker. We only need to prove the absence of 0/1 solutions (much) better than the incumbent solution, and that’s a constraint on the objective value. If we get such a proof, we can prune away that whole search subtree; if we don’t, the subroutine might have fixed some variables to 0 or 1 (always useful), and we definitely have a fractional solution. That solution to the relaxation could be useful for primal heuristics, and will definitely be used for branching (solving the natural LP relaxation of constraint satisfaction problems ends up performing basic propagation for us, so we get some domain propagation for free by only branching on variables with fractional values).
At the root, if we don’t have any primal solution yet, we should probably run a binary search on the objective value and feed the resulting fractional solutions to rounding heuristics. However, we can’t use the variables fixed by the subroutine: until we have a feasible binary solution with objective value \(Z\sp{\star}\), we can’t assume that we’re only interested in binary solutions with objective value \(Z < Z\sp{\star}\), so the subroutine might fix some variables simply because there is no 0/1 solution that satisfies \(Z < k\) (case 3 is vacuously valid if there is no 0/1 solution to the homogeneous problem).
That suffices to convince me of correctness. I still have to understand Chubanov’s “Basic Subroutine.”
This note by Cornelis/Kees Roos helped me understand what makes the subroutine tick.
The basic procedure updates a dual vector \(y\) (not the same \(y\) as the one I had in the reformulation… sorry) such that \(y \geq 0\) and \(\|y\|\sb{1} = 1\), and constantly derives from the dual vector a tentative solution \(z = P\sb{A}y\), where \(P\sb{A}\) projects (orthogonally) in the null space of the homogeneous constraint matrix \(A\) (the tentative solution is \(x\) in Chubanov’s paper).
At any time, if \(z > 0\), we have a solution to the homogeneous system.
If \(z = P\sb{A}y = 0\), we can exploit the fact that, for any feasible solution \(x\), \(x = P\sb{A}x\): any feasible solution is already in the null space of \(A\). We have
\[x\sp{\top}y = x\sp{\top}P\sb{A}y = x\sp{\top}\mathbf{0} = 0\]
(the projection matrix is symmetric). The solution \(x\) is strictly positive and \(y\) is nonnegative, so this must mean that, for every component \(y\sb{k} > 0\), we have \(x\sb{k} = 0\). There is at least one such component since \(\|y\|\sb{1} = 1\).
The last condition is how we bound the number of iterations. For any feasible solution \(x\) and any component \(j\),
\[y\sb{j}x\sb{j} \leq y\sp{\top}x = y\sp{\top}P\sb{A}x \leq \|x\| \|P\sb{A}y\| \leq \sqrt{n}\|z\|.\]
Let’s say the max element of \(y\), \(y\sb{j} \geq 2 \sqrt{n}\|z\|\). In that case, we have \[x\sb{j} \leq \frac{\sqrt{n}\|z\|}{y\sb{j}} \leq \frac{1}{2}.\]
Chubanov uses this criterion, along with a potential argument on \(\|z\|\), to bound the number of iterations. However, we can apply the result at any iteration where we find that \(x\sp{\top}z < y\sb{j}\): any such \(x\sb{j} = 0\) in binary solutions. In general, we may upper bound the left-hand side with \(x\sp{\top}z \leq \|x\|\|z\| \leq \sqrt{n}\|z\|\), but we can always exploit the structure of the problem to obtain a tighter bound (e.g., by encoding clique constraints \(x\sb{1} + x\sb{2} + … = 1\) directly in the homogeneous reformulation).
The rest is mostly applying lines 9–12 of the basic procedure in Kees’s note. Find the set \(K\) of all indices such that \(\forall k\in K,\ z\sb{k} \leq 0\) (Kees’s criterion is more relaxed, but that’s what he uses in experiments), project the vector \(\frac{1}{|K|} \sum\sb{k\in K}e\sb{k}\) in the null space of \(A\) to obtain \(p\sb{K}\), and update \(y\) and \(z\).
The potential argument here is that after updating \(z\), \(\frac{1}{\|z\|\sp{2}}\) has increased by at least \(|K| \geq 1\). We also know that \(\max y \geq \frac{1}{n}\), so we can fix a variable to 0 as soon as \(\sqrt{n} \|z\| < \frac{1}{n}\), or, equivalently, \(\frac{1}{\|z\|} > n\sp{3/2}\). We need to increase \(\frac{1}{\|z\|\sp{2}}\) to at most \(n\sp{3}\), so we will go through at most \(1 + n\sp{3}\) iterations of the basic procedure before it terminates; if the set \(K\) includes more than one coordinate, we should need fewer iterations to reach the same limit.
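To make the update step concrete, here’s a toy sketch of the basic procedure for a two-variable system, assuming the caller supplies the dense projection matrix \(P\sb{A}\) and eliding the \(x\sb{j} \leq \frac{1}{2}\) cut detection; the step size is the norm-minimising choice from Kees’s note:

```c
#include <math.h>
#include <stddef.h>

/*
 * Toy sketch of the basic procedure, n = 2.  P is the orthogonal
 * projection onto the null space of A.  Returns 1 when z = Py is
 * strictly positive (feasible direction), 0 when Py ~ 0 (y >= 0 then
 * certifies that no x > 0 exists), and -1 on iteration blow-up.
 */
static int
basic_procedure(double P[2][2], double y[2], double z[2])
{
        for (int iter = 0; iter < 100; iter++) {
                double p_K[2] = {0, 0}, e_K[2] = {0, 0};
                size_t K = 0;

                for (size_t i = 0; i < 2; i++)
                        z[i] = P[i][0] * y[0] + P[i][1] * y[1];

                if (z[0] > 0 && z[1] > 0)
                        return 1;       /* strictly positive solution */
                if (fabs(z[0]) + fabs(z[1]) < 1e-12)
                        return 0;       /* Py = 0: infeasibility certificate */

                /* K = { k : z_k <= 0 }; average and project the e_k. */
                for (size_t i = 0; i < 2; i++)
                        if (z[i] <= 0)
                                K++;
                for (size_t i = 0; i < 2; i++)
                        if (z[i] <= 0)
                                e_K[i] = 1.0 / K;
                for (size_t i = 0; i < 2; i++)
                        p_K[i] = P[i][0] * e_K[0] + P[i][1] * e_K[1];

                /* Pick alpha to minimise || alpha z + (1 - alpha) p_K ||. */
                {
                        double d[2] = {z[0] - p_K[0], z[1] - p_K[1]};
                        double denom = d[0] * d[0] + d[1] * d[1];
                        double alpha;

                        if (denom < 1e-30)
                                return -1;
                        alpha = (p_K[0] * (p_K[0] - z[0]) +
                                 p_K[1] * (p_K[1] - z[1])) / denom;
                        if (alpha < 0) alpha = 0;
                        if (alpha > 1) alpha = 1;
                        for (size_t i = 0; i < 2; i++)
                                y[i] = alpha * y[i] + (1 - alpha) * e_K[i];
                }
        }

        return -1;
}
```

For \(A = (1\ {-1})\), the projection is onto \(\mathrm{span}\{(1,1)\}\) and the very first \(z\) is already strictly positive; for \(A = (1\ 1)\), one update drives \(z\) to zero and \(y \geq 0\) certifies infeasibility of \(x > 0\).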
Chubanov shows how to embed the basic procedure in a basic iterative method to solve binary LPs. The interesting bit is that we reuse the dual vector \(y\) as much as we can in order to bound the total number of iterations in the basic procedure. We fix at least one variable to \(0\) after a call to the basic procedure that does not yield a fractional solution; there are thus at most \(n\) such calls.
In contrast to regular numerical algorithms, the number of iterations and calls so far have all had exact (non-asymptotic) bounds. The asymptotics hide in the projection step, where we average elementary unit vectors and project them in the null space of \(A\). We know there will be few (at most \(n\)) calls to the basic procedure, so we can expend a lot of time on matrix factorisation. In fact, Chubanov outright computes the projection matrix in \(\mathcal{O}(n\sp{3})\) time to get his complexity bound of \(\mathcal{O}(n\sp{4})\). In practice, this approach is likely to suffer a lot of fill-in (turn zeros into nonzeros), and thus run out of RAM.
I’d start with the sparse projection code in SuiteSparse. The direct sparse solver spends less time on precomputation than fully building the projection matrix (good if we don’t expect to always hit the worst-case iteration bound), and should preserve sparsity (good for memory usage). In return, computing projections is slower, which brings the worst-case complexity to something like \(\mathcal{O}(n\sp{5})\), but that can be parallelised, should be more proportional to the number of nonzeros in the constraint matrix (\(\mathcal{O}(n)\) in practice), and may even exploit sparsity in the right-hand side. Moreover, we can hope that the \(n\sp{3}\) iteration bound is pessimistic; that certainly seems to be the case for most experiments with random matrices.
The worst-case complexity, between \(\mathcal{O}(n\sp{4})\) and \(\mathcal{O}(n\sp{5})\), doesn’t compare that well to interior point methods (\(\mathcal{O}(\sqrt{n})\) sparse linear solutions). However, that’s all worst-case (even for IPMs). We also have different goals when embedding linear programming solvers in branch-and-bound methods. Warm starts and the ability to find solutions close to their bounds are key to efficient branch-and-bound; that’s why we still use simplex methods there. Chubanov’s projection routine seems like it might come close to the simplex’s good fit in branch-and-bound, while improving efficiency and parallelisability on large LPs.
The hard part about locking tends not to be the locking itself, but preemption. For example, if you structure a memory allocator like jemalloc, you want as few arenas as possible; one per CPU would be ideal, while one per thread would affect fragmentation and make some operations scale linearly with the number of threads. However, you don’t want to get stuck when a thread is preempted while it owns an arena. The usual fix is two-pronged: scale the number of arenas with the number of CPUs (rather than threads), and guard each arena with a lock.
The first tweak isn’t that bad; scaling the number of arenas, stats regions, etc. with the number of CPUs is better than scaling with the number of threads. The second one really hurts performance: each allocation must acquire a lock with an interlocked write. Even if the arena is (mostly) CPU-local, the atomic wrecks your pipeline.
It would be nice to have locks that a thread can acquire once per scheduling quantum, and benefit from ownership until the thread is scheduled out. We could then have a few arenas per CPU (if only to handle migration), but amortise lock acquisition over the timeslice.
That’s not a new idea. Dice and Garthwaite described this exact application in 2002 (PDF) and refer to older work for uniprocessors. However, I think the best exposition of the idea is Harris and Fraser’s Revocable locks for non-blocking programming, published in 2005 (PDF). Harris and Fraser want revocable locks for non-blocking multi-writer code; our problem is easier, but only marginally so. Although the history of revocable locks is pretty Solaris-centric, Linux is catching up. Google, Facebook, and EfficiOS (LTTng) have been pushing for restartable sequences, which are essentially OS support for sections that are revoked on context switches. Facebook even has a pure userspace implementation with Rseq; they report good results for jemalloc.
Facebook’s Rseq implements almost exactly what I described above, for the exact same reason (speeding up a memory allocator or replacing miscellaneous per-thread structs with ~per-CPU data). However, they’re trying to port a kernel idiom directly to userspace: restartable sequences implement strict per-CPU data. With kernel support, that makes sense. Without such support though, strict per-CPU data incurs a lot of extra complexity when a thread migrates to a new CPU: Rseq needs an asymmetric fence to ensure that the evicted thread observes its eviction and publishes any write it performed before being evicted.
I’m not sure that’s the best fit for userspace. We can avoid a lot of complexity by instead dynamically allocating a few arenas (exclusive data) per CPU and assuming only a few threads at a time will be migrated while owning arenas.
Here’s the relaxed revocable locks interface I propose:
Each thread has a thread state struct. That state struct has:

1. a single-writer (SPMC) generation counter, which the owner advances to revoke all of its locks in bulk;
2. a multi-writer (MPMC) cancellation sequence;
3. a multi-writer signal sequence, for signal-based cancellation;
4. a critical section flag.
Locks are owned by a pair of thread state struct and generation counter (ideally packed in one word, but two words are doable). Threads acquire locks with a normal compare-and-swap, but may bulk revoke every lock they own by advancing their generation counter.
Threads may execute any number of conditional stores per lock acquisition. Lock acquisition returns an ownership descriptor (pair of thread state struct and generation counter), and rlock_store_64(descriptor, lock, dst, value) stores value in dst if the descriptor still owns the lock and the ownership has not been cancelled.
Threads do not have to release lock ownership to let others make progress: any thread may attempt to cancel another thread’s ownership of a lock. After rlock_owner_cancel(descriptor, lock) returns successfully, the victim will not execute a conditional store under the notion that it still owns lock with descriptor.
The only difference from Rseq is that rlock_owner_cancel may fail. In practice, it will only fail if a thread on CPU A attempts to cancel ownership for a thread that’s currently running on another CPU B. That could happen after migration, but also when an administrative task iterates through every (pseudo-)per-CPU struct without changing its CPU mask. Being able to iterate through all available pseudo-per-CPU data without migrating to the CPU is a big win for slow paths; another advantage of not assuming strict per-CPU affinity.
Rather than failing on migration, Rseq issues an asymmetric fence to ensure both its writes and the victim’s writes are visible. At best, that’s implemented with interprocessor interrupts (IPIs) that scale linearly with the number of CPUs… for a point-to-point signal. I oversubscribed a server with 24x more threads than CPUs, and thread migrations happened at a constant frequency per CPU. Incurring O(#CPU) IPIs for every migration makes the per-CPU overhead of Rseq linear with the number of CPUs (cores) in the system. I’m also wary of the high rate of self/cross code modification in Rseq: mprotect incurs IPIs when downgrading permissions, so Rseq must leave some code page with writes enabled. These downsides (potential for IPI storms and lack of W^X) aren’t unique to Rseq. I think they’re inherent to emulating unpreempted per-CPU data in userspace without explicit OS support.
When rlock_owner_cancel fails, I expect callers to iterate down the list of pseudo-per-CPU structs associated with the CPU and eventually append a new struct to that list. In theory, we could end up with as many structs in that list as the peak number of threads on that CPU; in practice, it should be a small constant since rlock_owner_cancel only fails after thread migration.
I dumped my code as a gist, but it is definitely hard to follow, so I’ll try to explain it here.
Bit-packed ownership records must include the address of the owner struct and a sequence counter. Ideally, we’d preallocate some address space and only need 20–30 bits to encode the address. For now, I’m sticking to 64-byte-aligned allocations and rely on x86-64’s 48 bits of address space. With 64-bit owner/sequence records, an rlock is a 64-bit spinlock.
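The gist has the real definitions; a hypothetical sketch of the bit-packing, assuming 64-byte-aligned owner structs and 48-bit virtual addresses (which leaves 16 bits for the sequence), might look like:

```c
#include <stdint.h>

/*
 * Sketch (not the gist's exact layout): a 48-bit, 64-byte-aligned
 * pointer has 42 significant bits, leaving 16 bits for the sequence
 * counter in a single 64-bit word.
 */
typedef uint64_t rlock_owner_seq_t;

static inline rlock_owner_seq_t
rlock_pack(uintptr_t owner, uint16_t sequence)
{
        /* owner is 64-byte aligned: the low 6 bits are zero. */
        return ((uint64_t)(owner >> 6) << 16) | sequence;
}

static inline uintptr_t
rlock_packed_owner(rlock_owner_seq_t word)
{
        return (uintptr_t)(word >> 16) << 6;
}

static inline uint16_t
rlock_packed_sequence(rlock_owner_seq_t word)
{
        return (uint16_t)word;
}
```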
In the easy case, acquiring an rlock means:

1. reading the owner field (with a 64-bit load);
2. CASing in our own rlock_owner_seq_t.

But first, we must canonicalise our own owner struct.
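The lazy thread-local allocation described below can be sketched as follows; the field layout is assumed for illustration (the real structure lives in the gist):

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical field layout; the real struct lives in the gist. */
struct rlock_owner {
        uint64_t sequence;        /* SPMC: only the owner advances it */
        uint64_t cancel_sequence; /* MPMC: cancellation requests */
        uint64_t signal_sequence; /* MPMC: signal-backed requests */
        uint64_t in_critical;     /* critical section flag */
};

static __thread struct rlock_owner *rlock_self;

/*
 * Lazily allocate this thread's owner struct, 64-byte aligned so it
 * packs into an ownership descriptor.  A production version would
 * recycle structs through a type-stable freelist instead of leaking
 * them at thread exit.
 */
static struct rlock_owner *
rlock_get_self(void)
{
        if (rlock_self == NULL) {
                void *mem;

                if (posix_memalign(&mem, 64, sizeof(struct rlock_owner)) != 0)
                        abort();
                rlock_self = mem;
                /* Normal state: MPMC sequences == SPMC sequence - 1. */
                *rlock_self = (struct rlock_owner) { .sequence = 1 };
        }

        return rlock_self;
}
```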
Rlock lazily allocates an rlock_owner per thread and stores it in TLS; we can’t free that memory without some safe memory reclamation scheme (and I’d like to use Rlock to implement SMR), but it is possible to use a type-stable freelist.
Regardless of the allocation/reuse strategy, canonicalising an rlock means making sure we observe any cancellation request.
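Under the invariant described further down (the MPMC sequences trail the SPMC sequence by one in the normal state), canonicalisation can be sketched like this, with the same assumed field names as above:

```c
#include <stdint.h>

/* Hypothetical field layout, as in the previous sketch. */
struct rlock_owner {
        uint64_t sequence;        /* SPMC: only the owner advances it */
        uint64_t cancel_sequence; /* MPMC: cancellation requests */
};

/*
 * Canonicalise: if someone advanced cancel_sequence up to our current
 * sequence, a cancellation was requested.  Advancing the SPMC
 * sequence revokes every lock tagged with the old generation in one
 * step and restores the normal state (cancel_sequence == sequence - 1).
 * Returns the canonical sequence value.
 */
static uint64_t
rlock_canonicalize(struct rlock_owner *self)
{
        uint64_t seq = __atomic_load_n(&self->sequence, __ATOMIC_RELAXED);
        uint64_t cancel =
            __atomic_load_n(&self->cancel_sequence, __ATOMIC_ACQUIRE);

        if (cancel == seq) {
                seq++;
                __atomic_store_n(&self->sequence, seq, __ATOMIC_RELEASE);
        }

        return seq;
}
```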
To acquire a lock we observe the current owner, attempt to cancel its ownership, and (if we did cancel ownership) CAS in our own owner/sequence descriptor.
Most of the trickiness hides in rlock_owner_cancel.
The fancy stuff begins around ensure_cancel_sequence(victim, sequence).
Our code maintains the invariant that the MPMC sequences (cancel_sequence, signal_sequence) are either the SPMC sequence - 1 (normal state), or exactly equal to the SPMC sequence (cancellation request).

ensure_cancel_sequence CASes the cancel_sequence field from its expected value of owner.sequence - 1 to owner.sequence. If the actual value is neither of these, the owner has already advanced to a new sequence value, and we’re done.
Otherwise, we have to hope the victim isn’t running.
Now comes the really tricky stuff. Our CAS is immediately visible globally. The issue is that the victim might already be in the middle of a critical section. When a writer executes a critical section, it:

1. sets its critical section flag;
2. checks that it still owns the lock (no cancellation happened);
3. performs the conditional store.
It’s really hard to guarantee that the write in step 1 is visible (without killing performance in the common case), and if it is, that the victim isn’t about to execute step 3.
We get that guarantee by determining that the victim hasn’t been continuously executing since the time we attempted to CAS the cancel_sequence forward. That’s (hopefully) enough of a barrier to order the CAS, step 1, and our read of the critical section flag.
That’s not information that Linux exposes directly. However, we can borrow a trick from Rseq and read /proc/self/task/[tid]/stat. The contents of that file include whether the task is (R)unnable (or (S)leeping, waiting for (D)isk, etc.), and the CPU on which the task last executed.
If the task isn’t runnable, it definitely hasn’t been running continuously since the CAS. If the task is runnable but last ran on the CPU the current thread is itself running on (and the current thread wasn’t migrated in the middle of reading the stat file), it’s not running now.
If the task is runnable on another CPU, we can try to look at /proc/sched_debug: each CPU has a .curr->pid line that tells us the PID of the task that’s currently running (0 for none). That file has a lot of extra information, so reading it is really slow, but we only need to do that after migrations.
Finally, the victim might really be running. Other proposals would fire an IPI; we instead ask the caller to allocate a few more pseudo-per-CPU structs.
Assuming we did get a barrier out of the scheduler, we hopefully observe that the victim’s critical section flag is clear. If that happens, we had:

1. our CAS on the victim’s cancel_sequence;
2. a barrier (the victim’s context switch);
3. our read of a clear critical section flag.
This guarantees that the victim hasn’t been in the same critical section since the CAS in step 1. Either it’s not in a critical section, or if it is, it’s a fresh one that will observe the CAS. It’s safe to assume the victim has been successfully evicted.
The less happy path happens when we observe that the victim’s critical section flag is set. We must assume that it was scheduled out in the middle of a critical section. We’ll send a (POSIX) signal to the victim: the handler will skip over the critical section if the victim is still in one. Once that signal is sent, we know that the first thing Linux will do when the victim resumes execution is run the handler. If the victim is still not running after tgkill returned, we’re good to go: if the victim is still in the critical section, the handler will fire when it resumes execution.
Otherwise, the victim might have been scheduled in between the CAS and the signal; we still have the implicit barrier given by the context switch between CAS and signal, but we can’t rely on signal execution. We can only hope to observe that the victim has noticed the cancellation request and advanced its sequence, or that it cleared its critical section flag.
The rest is straightforward. rlock_store_64 must observe any cancellation, ensure that it still holds the lock, and enter the critical section:

1. set the critical section flag;
2. observe any cancellation request (canonicalise);
3. confirm that the descriptor still owns the lock;
4. execute the store.
Once it leaves the critical section, rlock_store_64 clears the critical section flag, looks for any cancellation request, and returns success/failure. The critical section is in inline assembly for the signal handler: executing the store in step 4 implicitly marks the end of the critical section.
Finally, the signal handler for rlock cancellation requests iterates through the rlock_store_list section until it finds a record that strictly includes the instruction pointer. If there is such a record, the thread is in a critical section, and we can skip it by overwriting RIP (to the end of the critical section) and setting RAX to 1.
On my 2.9 GHz Sandy Bridge, a baseline loop to increment a counter a billion times takes 6.9 cycles per increment, which makes sense given that I use inline assembly loads and stores to prevent any compiler cleverness.
The same loop with an interlocked store (xchg) takes 36 cycles per increment.
Interestingly, an xchg-based spinlock around normal increments only takes 31.7 cycles per increment (0.44 IPC). If we wish to back our spinlocks with futexes, we must unlock with an interlocked write; releasing the lock with a compare-and-swap brings us to 53.6 cycles per increment (0.30 IPC)! Atomics really mess with pipelining: unless they’re separated by dozens or even hundreds of instructions, their barrier semantics (that we usually need) practically force an in-order, barely pipelined, execution.
FWIW, 50-ish cycles per transaction is close to what I see in microbenchmarks for Intel’s RTM/HLE. So, while the overhead of TSX is non-negligible for very short critical sections, it seems more than reasonable for adaptive locks (and TSX definitely helps when preemption happens, as shown by Dice and Harris in Lock Holder Preemption Avoidance via Transactional Lock Elision).
Finally, the figure that really matters: when incrementing with rlock_store_64, we need 13 cycles per increment. That loop hits 2.99 IPC, so I think the bottleneck is just the number of instructions in rlock_store_64. The performance even seems independent of the number of worker threads, as long as they’re all on the same CPU.
In tabular form:
| Method                     | Cycle / increment | IPC  |
|----------------------------|-------------------|------|
| Vanilla                    | 6.961             | 1.15 |
| xchg                       | 36.054            | 0.22 |
| FAS spinlock               | 31.710            | 0.44 |
| FAS-CAS lock               | 53.656            | 0.30 |
| Rlock, 1 thread            | 13.044            | 2.99 |
| Rlock, 4 threads / 1 CPU   | 13.099            | 2.98 |
| Rlock, 256 threads / 1 CPU | 13.952            | 2.96 |
| Rlock, 2 threads / 2 CPUs  | 13.047            | 2.99 |
Six more cycles per write versus thread-private storage really isn’t that bad (accessing TLS in a shared library might add as much overhead)… especially compared to 25–50 cycles (in addition to indirect slowdowns from the barrier semantics) with locked instructions.
I also have a statistics-gathering mode that lets me vary the fraction of cycles spent in critical sections. On my server, the frequency of context switches between CPU-intensive threads scheduled on the same CPU increases in steps until seven or eight threads; at that point, the frequency tops out at one switch per jiffy (250 Hz). Apart from this scheduling detail, evictions act as expected (same logic as for sampled profiles). The number of evictions is almost equal to the number of context switches, which is proportional to the runtime. However, the number of hard evictions (with the victim in a critical section) is always proportional to the number of critical sections executed: roughly one in five million critical sections is preempted. That’s even less than the one in two million we’d expect from the ~six cycles per critical section; that kind of makes sense with out-of-order execution, given that the critical section should easily flow through the pipeline and slip past timer interrupts.
The main tradeoff is that rlocks do not attempt to handle thread migrations: when a thread migrates to another CPU, we let it assume (temporary) exclusive ownership of its pseudo-per-CPU struct instead of issuing IPIs. That’s good for simplicity, and also – arguably – for scaling. The scaling argument is weak, given how efficient IPIs seem to be. However, IPIs feel like one of those operations for which most of the cost is indirect and hard to measure. The overhead isn’t only (or even mostly) incurred by the thread that triggers the IPIs: each CPU must stop what it’s currently doing, flush the pipeline, switch to the kernel to handle the interrupt, and resume execution. A scheme that relies on IPIs to handle events like thread migrations (rare, but happening at a non-negligible base rate) will scale badly to really large CPU counts, and, more importantly, may make it hard to identify when the IPIs hurt overall system performance.
The other important design decision is that rlocks use signals instead of cross-modifying code. I’m not opposed to cross-modifying code, but I cringe at the idea of leaving writable and executable pages lying around just for performance. Again, we could mprotect around cross-modification, but mprotect triggers IPIs, and that’s exactly what we’re trying to avoid. Also, if we’re going to mprotect in the common case, we might as well just mmap in different machine code; that’s likely a bit faster than two mprotect calls and definitely safer (I would use this mmap approach for revocable multi-CPU locks à la Harris and Fraser).
The downside of using signals is that they’re more invasive than crossmodifying code. If user code expects any (async) signal, its handlers must either mask the rlock signal away and not use rlocks, or call the rlock signal handler… not transparent, but not exacting either.
Rlocks really aren’t that much code (560 LOC), and that code is fairly reasonable (no mprotect or self-modification trick, just signals). After more testing and validation, I would consider merging them in Concurrency Kit for production use.
Next step: either mmap-based strict revocable locks for non-blocking concurrent code, or a full implementation of pseudo-per-CPU data based on relaxed rlocks.
(The bit_hash listing was not preserved.)
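Based on the description further down (a packed random table, the hash of zero as a parameter, and a hardware popcount), a reconstruction of the one-bit hash might look like:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Reconstructed sketch of the lost listing: hash x to a single bit.
 * `table` is the random table packed into one 64-bit word; `bit` is
 * the hash value for zero.
 */
static bool
bit_hash(uint64_t x, uint64_t table, bool bit)
{
        return ((unsigned)__builtin_popcountll(x & table) & 1) ^ bit;
}
```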
With hardware popcount, this compiles to something like the following.
(The assembly listing was not preserved.)
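The lost listing was the compiled output; with hardware popcount it plausibly looks something like the following (illustrative, not the exact compiler output):

```nasm
; bit_hash(x = rdi, table = rsi, bit = edx) -> eax
and     rdi, rsi        ; keep the bits selected by the table
popcnt  rax, rdi        ; count them...
and     eax, 1          ; ...modulo 2: the parity
xor     eax, edx        ; flip by the hash of zero
ret
```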
This should raise a few questions:
Someone with a passing familiarity with x86 would also ask why we use popcnt instead of checking the parity flag after xor. Unfortunately, the parity flag only considers the least significant byte of the result (:
When implementing something like the hashing trick or count sketches (PDF), you need two sets of provably strong hash functions: one to pick the destination bucket, and another to decide whether to increment or decrement by the sketched value.
Onebit hash functions are ideal for the latter use case.
The bitwise operations in bit_hash implement a degenerate form of tabulation hashing. It considers the 64-bit input value x as a vector of 64 bits, and associates two intermediate output values with each index. The naïve implementation would be something like the following.
(The naïve listing was not preserved.)
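A reconstruction of the naïve loop, with the representation as described (one pair of random bits per input bit index):

```c
#include <stdbool.h>
#include <stdint.h>

/* One pair of random intermediate values per input bit index. */
static bool random_table[64][2];

/*
 * Reconstructed naïve degenerate tabulation hash: xor together the
 * table entry selected by each of the 64 input bits.
 */
static bool
bit_hash_naive(uint64_t x)
{
        bool acc = false;

        for (int i = 0; i < 64; i++)
                acc ^= random_table[i][(x >> i) & 1];
        return acc;
}
```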
Of course, the representation of random_table is inefficient, and we should hand-roll a bitmap. However, the loop itself is a problem.
The trick is to notice that we can normalise the table so that the value for random_table[i][0] is always 0; in order to do so, we have to fix the initial value for acc to a random bit. That initial value is the hash value for 0, and the values in random_table[i][1] now encode whether a nonzero bit i in x flips the hash value or leaves it as is.
The table argument for bit_hash is simply the 64 bits in random_table[i][1], and bit is the hash value for 0. If bit i in table is 0, bit i is irrelevant to the hash. If bit i in table is 1, the hash flips when bit i in x is 1. Finally, the parity counts how many times the hash was flipped.
I don’t think so. Whenever we need a hash bit, we also want a hash bucket; we might as well steal one bit from that wider hash. Worse, we usually want a few such bucket/bit pairs, so we could also compute a wider hash and carve out individual bits.
I only thought about this trick because I’ve been reading a few empirical evaluations of sketching techniques, and a few authors find it normal that computing a hash bit doubles the CPU time spent on hashing. It seems to me the right way to do this is to map columns/features to not-too-small integers (e.g., universal hashing to [0, n^2) if we have n features), and apply strong hashing to these integers. Hashing machine integers is fast, and we can always split strong hashes into multiple values.
In the end, this family of onebit hash functions seems like a good solution to a problem no one should ever have. But it’s still a cute trick!