However, if we had something simple enough to implement natively in the compiler, we could hope for the maintainers to understand what the ILP solver is doing. This seems realistic to me mostly because the generic complexity tends to lie in the continuous optimisation part. Branching, bound propagation, etc. is basic, sometimes domain specific, combinatorial logic; cut generation is probably the most prominent exception, and even that tends to be fairly combinatorial. (Maybe that’s why we seem to be growing comfortable with SAT solvers: no scary analysis.) So, for the past couple years, I’ve been looking for simple enough specialised solvers I could use in branch-and-bound for large 0/1 ILP.
Some stuff with augmented lagrangians and specialised methods for box-constrained QP almost panned out, but nested optimisation sucks when the inner solver is approximate: you never know if you should be more precise in the lower level or if you should aim for more outer iterations.
A subroutine in Chubanov’s polynomial-time linear programming algorithm [PDF] (related journal version) seems promising, especially since it doesn’t suffer from the numerical issues inherent to log barriers.
Chubanov’s “Basic Subroutine” accepts a problem of the form \(Ax = 0\), \(x > 0\), and either:
The class of homogeneous problems seems useless (never mind the nondeterministic return value), but we can convert “regular” 0/1 problems to that form with a bit of algebra.
Let’s start with \(Ax = b\), \(0 \leq x \leq 1\). We can reformulate that in the homogeneous form:
\[Ax - by = 0,\] \[x + s - \mathbf{1}y = 0,\] \[x, s, y \geq 0.\]
Any solution to the original problem in \([0, 1]\) may be translated to the homogeneous form (let \(y = 1\) and \(s = 1 - x\)). Crucially, any 0/1 (binary) solution to the original problem is still 0/1 in the homogeneous form. In the other direction, any solution with \(y > 0\) may be converted to the box-constrained problem by dividing everything by \(y\).
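The reformulation is mechanical enough to sketch in a few lines of C. This is only an illustration under my own assumptions (dense row-major matrices, and the hypothetical names `homogenise` and `residual`); a real solver would keep everything sparse.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Build H such that H w = 0, w = (x, s, y) >= 0 encodes
 * A x = b, 0 <= x <= 1, for a dense row-major m-by-n matrix A:
 *
 *   [ A  0  -b ]
 *   [ I  I  -1 ]
 *
 * H is (m + n)-by-(2n + 1), row-major, and preallocated by the caller.
 */
void homogenise(size_t m, size_t n, const double *A, const double *b, double *H)
{
    size_t cols = 2 * n + 1;

    assert(A != NULL && b != NULL && H != NULL);
    for (size_t i = 0; i < (m + n) * cols; i++)
        H[i] = 0.0;
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++)
            H[i * cols + j] = A[i * n + j];
        H[i * cols + 2 * n] = -b[i];      /* A x - b y = 0 */
    }
    for (size_t j = 0; j < n; j++) {
        double *row = &H[(m + j) * cols];

        row[j] = 1.0;                     /* x_j */
        row[n + j] = 1.0;                 /* + s_j */
        row[2 * n] = -1.0;                /* - y = 0 */
    }
}

/* Largest absolute entry of H w; zero when w satisfies the equalities. */
double residual(size_t rows, size_t cols, const double *H, const double *w)
{
    double worst = 0.0;

    for (size_t i = 0; i < rows; i++) {
        double acc = 0.0;

        for (size_t j = 0; j < cols; j++)
            acc += H[i * cols + j] * w[j];
        if (acc < 0.0)
            acc = -acc;
        if (acc > worst)
            worst = acc;
    }
    return worst;
}
```

For example, with the single constraint \(x\sb{1} + x\sb{2} = 1\), the binary point \(x = (1, 0)\) maps to \(w = (x, s, y) = (1, 0, 0, 1, 1)\), which has a zero residual.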
If we try to solve the homogeneous form with Chubanov’s subroutine, we may get:
As soon as we invoke the third case to recursively solve a smaller problem, we end up solving an interesting ill-specified relaxation of the initial 0/1 linear program: it’s still a valid relaxation of the binary problem, but is stricter than the usual box linear relaxation.
That’s more than enough to drive a branch-and-bound process. In practice, branch-and-bound is much more about proving the (near) optimality of an existing solution than coming up with strong feasible solutions. That’s why the fact that the subroutine “only” solves feasibility isn’t a blocker. We only need to prove the absence of 0/1 solutions (much) better than the incumbent solution, and that’s a constraint on the objective value. If we get such a proof, we can prune away that whole search subtree; if we don’t, the subroutine might have fixed some variables 0 or 1 (always useful), and we definitely have a fractional solution. That solution to the relaxation could be useful for primal heuristics, and will definitely be used for branching (solving the natural LP relaxation of constraint satisfaction problems ends up performing basic propagation for us, so we get some domain propagation for free by only branching on variables with fractional values).
At the root, if we don’t have any primal solution yet, we should probably run some binary search on the objective value and feed the resulting fractional solutions to rounding heuristics. However, we can’t use the variables fixed by the subroutine: until we have a feasible binary solution with objective value \(Z\sp{\star}\), we can’t assume that we’re only interested in binary solutions with objective value \(Z < Z\sp{\star}\), so the subroutine might fix some variables simply because there is no 0/1 solution that satisfies \(Z < k\) (case 3 is vacuously valid if there is no 0/1 solution to the homogeneous problem).
That suffices to convince me of correctness. I still have to understand Chubanov’s “Basic Subroutine.”
This note by Cornelis/Kees Roos helped me understand what makes the subroutine tick.
The basic procedure updates a dual vector \(y\) (not the same \(y\) as the one I had in the reformulation… sorry) such that \(y \geq 0\) and \(\|y\|_1 = 1\), and constantly derives from the dual vector a tentative solution \(z = P\sb{A}y\), where \(P\sb{A}\) projects (orthogonally) in the null space of the homogeneous constraint matrix \(A\) (the tentative solution is \(x\) in Chubanov’s paper).
At any time, if \(z > 0\), we have a solution to the homogeneous system.
If \(z = P\sb{A}y = 0\), we can exploit the fact that, for any feasible solution \(x\), \(x = P\sb{A}x\): any feasible solution is already in the null space of \(A\). We have
\[x\sp{\top}y = x\sp{\top}P\sb{A}y = x\sp{\top}\mathbf{0} = 0\]
(the projection matrix is symmetric). The solution \(x\) is strictly positive and \(y\) is nonnegative, so this must mean that, for every component \(k\) with \(y\sb{k} > 0\), \(x\sb{k} = 0\). There is at least one such component since \(\|y\|_1 = 1\).
The last condition is how we bound the number of iterations. For any feasible solution \(x\) and any component \(j\),
\[y\sb{j}x\sb{j} \leq y\sp{\top}x = y\sp{\top}P\sb{A}x \leq \|x\| \|P\sb{A}y\| \leq \sqrt{n} \|z\|.\]
Let’s say the max element of \(y\), \(y\sb{j} \geq 2 \sqrt{n}\|z\|\). In that case, we have \[x\sb{j} \leq \frac{\sqrt{n}\|z\|}{y\sb{j}} \leq \frac{1}{2}.\]
Chubanov uses this criterion, along with a potential argument on \(\|z\|\), to bound the number of iterations. However, we can apply the result at any iteration where we find that \(x\sp{\top}z < y\sb{j}\): any such \(x\sb{j} = 0\) in binary solutions. In general, we may upper bound the left-hand side with \(x\sp{\top}z \leq \|x\|\|z\| \leq \sqrt{n}\|z\|\), but we can always exploit the structure of the problem to have a tighter bound (e.g., by encoding clique constraints \(x\sb{1} + x\sb{2} + … = 1\) directly in the homogeneous reformulation).
The rest is mostly applying lines 9-12 of the basic procedure in Kees’s note. Find the set \(K\) of all indices such that \(\forall k\in K,\ z\sb{k} \leq 0\) (Kees’s criterion is more relaxed, but that’s what he uses in experiments), project the vector \(\frac{1}{|K|} \sum\sb{k\in K}e\sb{k}\) in the null space of \(A\) to obtain \(p\sb{K}\), and update \(y\) and \(z\).
The potential argument here is that after updating \(z\), \(\frac{1}{\|z\|\sp{2}}\) has increased by at least \(|K| \geq 1\). We also know that \(\max y \geq \frac{1}{n}\), so we can fix a variable to 0 as soon as \(\sqrt{n} \|z\| < \frac{1}{n}\), or, equivalently, \(\frac{1}{\|z\|} > n\sp{3/2}\). We need to increment \(\frac{1}{\|z\|\sp{2}}\) to at most \(n\sp{3}\), so we will go through at most \(1 + n\sp{3}\) iterations of the basic procedure before it terminates; if the set \(K\) includes more than one coordinate, we should need fewer iterations to reach the same limit.
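To spell out the arithmetic behind that bound (a sketch, assuming the normalisation is \(\|y\|_1 = 1\) with \(y \geq 0\), and that each update increases \(1/\|z\|\sp{2}\) by at least 1): an orthogonal projection never lengthens a vector, so initially

\[\|z\| = \|P\sb{A}y\| \leq \|y\|_2 \leq \|y\|_1 = 1 \quad\Longrightarrow\quad \frac{1}{\|z\|\sp{2}} \geq 1.\]

After \(t\) updates, \(1/\|z\|\sp{2} \geq 1 + t\). The fixing test is

\[\sqrt{n}\|z\| < \frac{1}{n} \iff \frac{1}{\|z\|\sp{2}} > n\sp{3},\]

which must hold as soon as \(1 + t > n\sp{3}\), i.e., after at most \(n\sp{3}\) updates. At that point, the largest component of \(y\) satisfies

\[y\sb{j}x\sb{j} \leq \sqrt{n}\|z\| < \frac{1}{n} \leq y\sb{j},\]

so \(x\sb{j} < 1\) in every feasible solution, and thus \(x\sb{j} = 0\) in every binary one.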
Chubanov shows how to embed the basic procedure in a basic iterative method to solve binary LPs. The interesting bit is that we reuse the dual vector \(y\) as much as we can in order to bound the total number of iterations in the basic procedure. We fix at least one variable to \(0\) after a call to the basic procedure that does not yield a fractional solution; there are thus at most \(n\) such calls.
In contrast to regular numerical algorithms, the number of iterations and calls so far have all had exact (non-asymptotic) bounds. The asymptotics hide in the projection step, where we average elementary unit vectors and project them in the null space of \(A\). We know there will be few (at most \(n\)) calls to the basic procedure, so we can expend a lot of time on matrix factorisation. In fact, Chubanov outright computes the projection matrix in \(\mathcal{O}(n\sp{3})\) time to get his complexity bound of \(\mathcal{O}(n\sp{4})\). In practice, this approach is likely to fill a lot of zeros in, and thus run out of RAM.
I’d start with the sparse projection code in SuiteSparse. The direct sparse solver spends less time on precomputation than fully building the projection matrix (good if we don’t expect to always hit the worst case iteration bound), and should preserve sparsity (good for memory usage). In return, computing projections is slower, which brings the worst-case complexity to something like \(\mathcal{O}(n\sp{5})\), but that can be parallelised, should be more proportional to the number of non-zeros in the constraint matrix (\(\mathcal{O}(n)\) in practice), and may even exploit sparsity in the right-hand side. Moreover, we can hope that the \(n\sp{3}\) iteration bound is pessimistic; that certainly seems to be the case for most experiments with random matrices.
The worst-case complexity, between \(\mathcal{O}(n\sp{4})\) and \(\mathcal{O}(n\sp{5})\), doesn’t compare that well to interior point methods (\(\mathcal{O}(\sqrt{n})\) sparse linear solutions). However, that’s all worst-case (even for IPMs). We also have different goals when embedding linear programming solvers in branch-and-bound methods. Warm starts and the ability to find solutions close to their bounds are key to efficient branch-and-bound; that’s why we still use simplex methods in such methods. Chubanov’s projection routine seems like it might come close to the simplex’s good fit in branch-and-bound, while improving efficiency and parallelisability on large LPs.
The hard part about locking tends not to be the locking itself, but preemption. For example, if you structure a memory allocator like jemalloc, you want as few arenas as possible; one per CPU would be ideal, while one per thread would affect fragmentation and make some operations scale linearly with the number of threads. However, you don’t want to get stuck when a thread is preempted while it owns an arena. The usual fix is two-pronged:
The first tweak isn’t that bad; scaling the number of arenas, stats regions, etc. with the number of CPUs is better than scaling with the number of threads. The second one really hurts performance: each allocation must acquire a lock with an interlocked write. Even if the arena is (mostly) CPU-local, the atomic wrecks your pipeline.
It would be nice to have locks that a thread can acquire once per scheduling quantum, and benefit from ownership until the thread is scheduled out. We could then have a few arenas per CPU (if only to handle migration), but amortise lock acquisition over the timeslice.
That’s not a new idea. Dice and Garthwaite described this exact application in 2002 (PDF) and refer to older work for uniprocessors. However, I think the best exposition of the idea is Harris and Fraser’s Revocable locks for non-blocking programming, published in 2005 (PDF). Harris and Fraser want revocable locks for non-blocking multi-writer code; our problem is easier, but only marginally so. Although the history of revocable locks is pretty Solaris-centric, Linux is catching up. Google, Facebook, and EfficiOS (LTTng) have been pushing for restartable sequences, which is essentially OS support for sections that are revoked on context switches. Facebook even has a pure userspace implementation with Rseq; they report good results for jemalloc.
Facebook’s Rseq implements almost exactly what I described above, for the exact same reason (speeding up a memory allocator or replacing miscellaneous per-thread structs with ~per-CPU data). However, they’re trying to port a kernel idiom directly to userspace: restartable sequences implement strict per-CPU data. With kernel support, that makes sense. Without such support though, strict per-CPU data incurs a lot of extra complexity when a thread migrates to a new CPU: Rseq needs an asymmetric fence to ensure that the evicted thread observes its eviction and publishes any write it performed before being evicted.
I’m not sure that’s the best fit for userspace. We can avoid a lot of complexity by instead dynamically allocating a few arenas (exclusive data) per CPU and assuming only a few threads at a time will be migrated while owning arenas.
Here’s the relaxed revocable locks interface I propose:
Each thread has a thread state struct. That state struct has:
Locks are owned by a pair of thread state struct and generation counter (ideally packed in one word, but two words are doable). Threads acquire locks with normal compare-and-swap, but may bulk revoke every lock they own by advancing their generation counter.
Threads may execute any number of conditional stores per lock acquisition. Lock acquisition returns an ownership descriptor (pair of thread state struct and generation counter), and rlock_store_64(descriptor, lock, dst, value) stores value in dst if the descriptor still owns the lock and the ownership has not been cancelled.
Threads do not have to release lock ownership to let others make progress: any thread may attempt to cancel another thread’s ownership of a lock. After rlock_owner_cancel(descriptor, lock) returns successfully, the victim will not execute a conditional store under the notion that it still owns lock with descriptor.
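To make the ownership rules above concrete, here is a toy model of descriptors and generation counters. It is a sketch under loud assumptions: every `toy_*` name and the 32/32 bit split are made up for illustration, and the check-then-store in `toy_store_64` is not safe against a concurrent cancellation. Closing that race is exactly what the real critical-section machinery is for.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical packing: owner id in the high 32 bits, generation in the low 32. */
typedef uint64_t toy_descriptor_t;

struct toy_thread {
    uint32_t id;          /* non-zero; 0 is reserved for "unowned" */
    uint32_t generation;  /* only advanced by the thread itself */
};

struct toy_lock {
    _Atomic uint64_t owner;  /* a toy_descriptor_t, or 0 when free */
};

toy_descriptor_t toy_descriptor(const struct toy_thread *self)
{
    return ((uint64_t)self->id << 32) | self->generation;
}

/* Acquire a free lock with a plain compare-and-swap; returns 0 on failure. */
toy_descriptor_t toy_acquire(struct toy_lock *lock, const struct toy_thread *self)
{
    uint64_t expected = 0;
    toy_descriptor_t mine = toy_descriptor(self);

    assert(self->id != 0);
    if (atomic_compare_exchange_strong(&lock->owner, &expected, mine))
        return mine;
    return 0;
}

/* Store only if the descriptor is current and still names the lock's owner. */
bool toy_store_64(const struct toy_thread *self, toy_descriptor_t descriptor,
                  struct toy_lock *lock, uint64_t *dst, uint64_t value)
{
    if (descriptor != toy_descriptor(self) ||
        atomic_load(&lock->owner) != descriptor)
        return false;
    *dst = value;  /* NOT atomic with the check above: toy model only */
    return true;
}

/* Bulk revocation: every descriptor minted with the old generation goes stale. */
void toy_revoke_all(struct toy_thread *self)
{
    self->generation++;
}
```

A single increment of the generation counter invalidates every lock the thread held, without touching the locks themselves; stale owner words are then fair game for the next acquirer.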
The only difference from Rseq is that rlock_owner_cancel may fail. In practice, it will only fail if a thread on CPU A attempts to cancel ownership for a thread that’s currently running on another CPU B. That could happen after migration, but also when an administrative task iterates through every (pseudo-)per-CPU struct without changing its CPU mask. Being able to iterate through all available pseudo-per-CPU data without migrating to the CPU is a big win for slow paths; another advantage of not assuming strict per-CPU affinity.
Rather than failing on migration, Rseq issues an asymmetric fence to ensure both its writes and the victim’s writes are visible. At best, that’s implemented with inter-processor interrupts (IPIs) that scale linearly with the number of CPUs… for a point-to-point signal. I oversubscribed a server with 24x more threads than CPUs, and thread migrations happened at a constant frequency per CPU. Incurring O(#CPU) IPIs for every migration makes the per-CPU overhead of Rseq linear with the number of CPUs (cores) in the system. I’m also wary of the high rate of code self/cross modification in Rseq: mprotect incurs IPIs when downgrading permissions, so Rseq must leave some code page with writes enabled. These downsides (potential for IPI storm and lack of W^X) aren’t unique to Rseq. I think they’re inherent to emulating unpreempted per-CPU data in userspace without explicit OS support.
When rlock_owner_cancel fails, I expect callers to iterate down the list of pseudo-per-CPU structs associated with the CPU and eventually append a new struct to that list. In theory, we could end up with as many structs in that list as the peak number of threads on that CPU; in practice, it should be a small constant since rlock_owner_cancel only fails after thread migration.
I dumped my code as a gist, but it is definitely hard to follow, so I’ll try to explain it here.
Bit-packed ownership records must include the address of the owner struct and a sequence counter. Ideally, we’d preallocate some address space and only need 20-30 bits to encode the address. For now, I’m sticking to 64 byte aligned allocations and rely on x86-64’s 48 bits of address space. With 64 bit owner/sequence records, an rlock is a 64 bit spinlock.
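The packing arithmetic can be sketched as follows. This is my own illustration, not the code from the gist: the `owner_seq_*` names are hypothetical, and the 42/22 split is simply what falls out of 48-bit addresses with 64-byte alignment (48 - 6 = 42 address bits, leaving 22 bits for the sequence).

```c
#include <assert.h>
#include <stdint.h>

#define OWNER_ALIGN_BITS 6    /* 64-byte aligned owner structs */
#define OWNER_ADDR_BITS 48    /* x86-64 virtual address width */
#define OWNER_FIELD_BITS (OWNER_ADDR_BITS - OWNER_ALIGN_BITS)  /* 42 */
#define SEQ_BITS (64 - OWNER_FIELD_BITS)                       /* 22 */
#define SEQ_MASK ((UINT64_C(1) << SEQ_BITS) - 1)

/* Pack an aligned owner address and a (truncated) sequence in one word. */
uint64_t owner_seq_pack(const void *owner, uint64_t sequence)
{
    uint64_t address = (uint64_t)(uintptr_t)owner;

    /* The low 6 bits and everything above bit 47 must be clear. */
    assert((address & ((UINT64_C(1) << OWNER_ALIGN_BITS) - 1)) == 0);
    assert((address >> OWNER_ADDR_BITS) == 0);
    return ((address >> OWNER_ALIGN_BITS) << SEQ_BITS) | (sequence & SEQ_MASK);
}

/* Recover the owner struct's address from a packed record. */
void *owner_seq_address(uint64_t bits)
{
    return (void *)(uintptr_t)((bits >> SEQ_BITS) << OWNER_ALIGN_BITS);
}

/* Recover the (truncated) sequence counter from a packed record. */
uint64_t owner_seq_sequence(uint64_t bits)
{
    return bits & SEQ_MASK;
}
```

With everything in one 64-bit word, a lock’s owner can be read with a single load and replaced with a single compare-and-swap. The price is that the sequence wraps around after \(2\sp{22}\) increments.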
In the easy case, acquiring an rlock means:
owner field (with a 64 bit load);
rlock_owner_seq_t.
But first, we must canonicalise our own owner struct.
Rlock lazily allocates an rlock_owner per thread and stores it in TLS; we can’t free that memory without some safe memory reclamation scheme (and I’d like to use Rlock to implement SMR), but it is possible to use a type-stable freelist.
Regardless of the allocation/reuse strategy, canonicalising an rlock means making sure we observe any cancellation request.
To acquire a lock we observe the current owner, attempt to cancel its ownership, and (if we did cancel ownership) CAS in our own owner/sequence descriptor.
Most of the trickiness hides in rlock_owner_cancel.
The fancy stuff begins around ensure_cancel_sequence(victim, sequence);.
Our code maintains the invariant that the MPMC sequences (cancel_sequence, signal_sequence) are either the SPMC sequence - 1 (normal state), or exactly the SPMC sequence (cancellation request).
ensure_cancel_sequence CASes the cancel_sequence field from its expected value of owner.sequence - 1 to owner.sequence. If the actual value is neither of them, the owner has already advanced to a new sequence value, and we’re done.
Otherwise, we have to hope the victim isn’t running.
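The CAS described above is small enough to sketch with C11 atomics. This is not the code from the gist: `struct sketch_owner`, the 32-bit field widths, and the boolean return convention are my assumptions, chosen only to show the two-state invariant (sequence - 1 in the normal state, sequence once cancellation is requested).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_owner {
    _Atomic uint32_t sequence;         /* SPMC: written only by the owner */
    _Atomic uint32_t cancel_sequence;  /* MPMC: written by any canceller */
};

/*
 * Returns true if a cancellation request for `sequence` is now posted
 * (by us or by a racing canceller), and false if the field holds some
 * other value, i.e., the owner has already advanced past `sequence`.
 */
bool sketch_ensure_cancel_sequence(struct sketch_owner *victim, uint32_t sequence)
{
    uint32_t expected = sequence - 1;

    assert(victim != NULL);
    if (atomic_compare_exchange_strong(&victim->cancel_sequence,
                                       &expected, sequence))
        return true;  /* normal state -> cancellation requested */
    /* On failure, `expected` now holds the value actually observed. */
    return expected == sequence;
}
```

A `false` return is the “we’re done” case; a `true` return only means the request is posted, and the caller still has to establish that the victim will observe it.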
Now comes the really tricky stuff. Our CAS is immediately visible globally. The issue is that the victim might already be in the middle of a critical section. When writers execute a critical section, they:
It’s really hard to guarantee that the write in step 1 is visible (without killing performance in the common case), and if it is, that the victim isn’t about to execute step 3.
We get that guarantee by determining that the victim hasn’t been continuously executing since the time we attempted to CAS the cancel_sequence forward. That’s (hopefully) enough of a barrier to order the CAS, step 1, and our read of the critical section flag.
That’s not information that Linux exposes directly. However, we can borrow a trick from Rseq and read /proc/self/task/[tid]/stat. The contents of that file include whether the task is (R)unnable (or (S)leeping, waiting for (D)isk, etc.), and the CPU on which the task last executed.
If the task isn’t runnable, it definitely hasn’t been running continuously since the CAS. If the task is runnable but last ran on the CPU the current thread is itself running on (and the current thread wasn’t migrated in the middle of reading the stat file), it’s not running now.
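For reference, pulling the state and last CPU out of the stat file could look like the sketch below. The function names are hypothetical and this is not the gist’s parser; the field positions (state is field 3, last CPU is field 39) are the ones documented in proc(5). The command name in field 2 is parenthesised and may contain spaces or parentheses, hence the search for the last `)`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one line in /proc/[pid]/stat format. Stores the state character
 * (field 3) in *state and the CPU last executed on (field 39) in *cpu.
 * Returns 0 on success, -1 on malformed input.
 */
int parse_task_stat(const char *line, char *state, int *cpu)
{
    const char *cursor = strrchr(line, ')');

    assert(state != NULL && cpu != NULL);
    if (cursor == NULL)
        return -1;
    cursor++;  /* field 3 starts after the command name */
    for (int field = 3;; field++) {
        while (*cursor == ' ')
            cursor++;
        if (*cursor == '\0' || *cursor == '\n')
            return -1;  /* ran out of fields */
        if (field == 3)
            *state = *cursor;
        if (field == 39)
            return (sscanf(cursor, "%d", cpu) == 1) ? 0 : -1;
        while (*cursor != ' ' && *cursor != '\0')
            cursor++;   /* skip to the end of this field */
    }
}

/* Read and parse the stat file for one thread of the current process. */
int read_task_stat(int tid, char *state, int *cpu)
{
    char path[64];
    char line[1024];
    FILE *file;
    int ret = -1;

    snprintf(path, sizeof(path), "/proc/self/task/%d/stat", tid);
    file = fopen(path, "r");
    if (file == NULL)
        return -1;
    if (fgets(line, sizeof(line), file) != NULL)
        ret = parse_task_stat(line, state, cpu);
    fclose(file);
    return ret;
}
```

A caller would compare `*state` against `'R'` and `*cpu` against its own CPU (e.g., from `sched_getcpu()`), re-checking its own CPU afterwards to detect a migration in the middle of the read.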
If the task is runnable on another CPU, we can try to look at /proc/sched_debug: each CPU has a .curr->pid line that tells us the PID of the task that’s currently running (0 for none). That file has a lot of extra information so reading it is really slow, but we only need to do that after migrations.
Finally, the victim might really be running. Other proposals would fire an IPI; we instead ask the caller to allocate a few more pseudo-per-CPU structs.
Assuming we did get a barrier out of the scheduler, we hopefully observe that the victim’s critical section flag is clear. If that happens, we had:
This guarantees that the victim hasn’t been in the same critical section since the CAS in step 1. Either it’s not in a critical section, or if it is, it’s a fresh one that will observe the CAS. It’s safe to assume the victim has been successfully evicted.
The less happy path happens when we observe that the victim’s critical section flag is set. We must assume that it was scheduled out in the middle of a critical section. We’ll send a (POSIX) signal to the victim: the handler will skip over the critical section if the victim is still in one. Once that signal is sent, we know that the first thing Linux will do is execute the handler when the victim resumes execution. If the victim is still not running after tgkill returned, we’re good to go: if the victim is still in the critical section, the handler will fire when it resumes execution.
Otherwise, the victim might have been scheduled in between the CAS and the signal; we still have the implicit barrier given by the context switch between CAS and signal, but we can’t rely on signal execution. We can only hope to observe that the victim has noticed the cancellation request and advanced its sequence, or that it cleared its critical section flag.
The rest is straightforward. The rlock_store_64 must observe any cancellation, ensure that it still holds the lock, and enter the critical section:
Once it leaves the critical section, rlock_store_64 clears the critical section flags, looks for any cancellation request, and returns success/failure. The critical section is in inline assembly for the signal handler: executing the store in step 4 implicitly marks the end of the critical section.
Finally, the signal handler for rlock cancellation requests iterates through the rlock_store_list section until it finds a record that strictly includes the instruction pointer. If there is such a record, the thread is in a critical section, and we can skip it by overwriting RIP (to the end of the critical section) and setting RAX to 1.
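The lookup half of that handler is easy to sketch in portable C. The record layout and names below are assumptions for illustration (the real records live in a dedicated linker section); I read “strictly includes” as excluding both endpoints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: one [begin, end] address range per critical section. */
struct store_record {
    uintptr_t begin;  /* address of the section's first instruction */
    uintptr_t end;    /* address right after the final store */
};

/*
 * Returns the address at which execution should resume if `ip` lies
 * strictly inside one of the recorded critical sections, and 0 otherwise.
 */
uintptr_t find_resume_address(const struct store_record *records, size_t count,
                              uintptr_t ip)
{
    assert(records != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (records[i].begin < ip && ip < records[i].end)
            return records[i].end;
    }
    return 0;
}
```

In an actual `SA_SIGINFO` handler on Linux/x86-64, `ip` would come from the `ucontext_t` passed as the third argument, and a non-zero result would be written back to the saved RIP, with the saved RAX set to 1 to report the interruption.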
On my 2.9 GHz Sandy Bridge, a baseline loop to increment a counter a billion times takes 6.9 cycles per increment, which makes sense given that I use inline assembly loads and stores to prevent any compiler cleverness.
The same loop with an interlocked store (xchg) takes 36 cycles per increment.
Interestingly, an xchg-based spinlock around normal increments only takes 31.7 cycles per increment (0.44 IPC). If we wish to back our spinlocks with futexes, we must unlock with an interlocked write; releasing the lock with a compare-and-swap brings us to 53.6 cycles per increment (0.30 IPC)! Atomics really mess with pipelining: unless they’re separated by dozens or even hundreds of instructions, their barrier semantics (that we usually need) practically forces an in-order, barely pipelined, execution.
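For readers who want to see what is being compared, here is roughly what the two lock variants look like with C11 atomics. This is a sketch of the shape of the benchmark, not the code that produced the numbers above: the names are mine, and I assume the plain spinlock releases with an ordinary store, since only the futex-friendly variant is described as unlocking with an interlocked write.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    _Atomic uint32_t word;  /* 0: free, 1: held */
} spin_t;

/* Acquire with fetch-and-store; compilers emit xchg for this on x86. */
void spin_lock(spin_t *lock)
{
    while (atomic_exchange_explicit(&lock->word, 1, memory_order_acquire) != 0)
        ;  /* spin until the previous value was "free" */
}

/* Plain spinlock: release with an ordinary (release-ordered) store. */
void spin_unlock_store(spin_t *lock)
{
    atomic_store_explicit(&lock->word, 0, memory_order_release);
}

/* Futex-friendly variant: release with an interlocked compare-and-swap. */
void spin_unlock_cas(spin_t *lock)
{
    uint32_t expected = 1;
    bool released = atomic_compare_exchange_strong(&lock->word, &expected, 0);

    assert(released);
    (void)released;  /* keep NDEBUG builds warning-free */
}

/* Benchmark body: `count` increments, each one under the lock. */
uint64_t locked_increments(spin_t *lock, uint64_t count, bool unlock_with_cas)
{
    volatile uint64_t counter = 0;  /* volatile: keep the loads and stores */

    for (uint64_t i = 0; i < count; i++) {
        spin_lock(lock);
        counter = counter + 1;
        if (unlock_with_cas)
            spin_unlock_cas(lock);
        else
            spin_unlock_store(lock);
    }
    return counter;
}
```

The second variant executes two interlocked instructions per increment instead of one, which lines up with the jump from about 32 to about 54 cycles.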
FWIW, 50ish cycles per transaction is close to what I see in microbenchmarks for Intel’s RTM/HLE. So, while the overhead of TSX is non-negligible for very short critical sections, it seems more than reasonable for adaptive locks (and TSX definitely helps when preemption happens, as shown by Dice and Harris in Lock Holder Preemption Avoidance via Transactional Lock Elision).
Finally, the figure that really matters: when incrementing with rlock_store_64, we need 13 cycles per increment. That loop hits 2.99 IPC, so I think the bottleneck is just the number of instructions in rlock_store_64. The performance even seems independent of the number of worker threads, as long as they’re all on the same CPU.
In tabular form:
| Method | Cycles / increment | IPC |
|---|---|---|
| Vanilla | 6.961 | 1.15 |
| xchg | 36.054 | 0.22 |
| FAS spinlock | 31.710 | 0.44 |
| FAS-CAS lock | 53.656 | 0.30 |
| Rlock, 1 thd | 13.044 | 2.99 |
| Rlock, 4 thd / 1 CPU | 13.099 | 2.98 |
| Rlock, 256 / 1 | 13.952 | 2.96 |
| Rlock, 2 / 2 | 13.047 | 2.99 |
Six more cycles per write versus thread-private storage really isn’t that bad (accessing TLS in a shared library might add as much overhead)… especially compared to 25-50 cycles (in addition to indirect slowdowns from the barrier semantics) with locked instructions.
I also have a statistics-gathering mode that lets me vary the fraction of cycles spent in critical sections. On my server, the frequency of context switches between CPU-intensive threads scheduled on the same CPU increases in steps until seven or eight threads; at that point, the frequency tops out at one switch per jiffy (250 Hz). Apart from this scheduling detail, evictions act as expected (same logic as for sampled profiles). The number of evictions is almost equal to the number of context switches, which is proportional to the runtime. However, the number of hard evictions (with the victim in a critical section) is always proportional to the number of critical sections executed: roughly one in five million critical sections is preempted. That’s even less than the one in two million we’d expect from the ~six cycles per critical section: that kind of makes sense with out of order execution, given that the critical section should easily flow through the pipeline and slip past timer interrupts.
The main trade-off is that rlocks do not attempt to handle thread migrations: when a thread migrates to another CPU, we let it assume (temporary) exclusive ownership of its pseudo-per-CPU struct instead of issuing IPIs. That’s good for simplicity, and also – arguably – for scaling. The scaling argument is weak, given how efficient IPIs seem to be. However, IPIs feel like one of these operations for which most of the cost is indirect and hard to measure. The overhead isn’t only (or even mostly) incurred by the thread that triggers the IPIs: each CPU must stop what it’s currently doing, flush the pipeline, switch to the kernel to handle the interrupt, and resume execution. A scheme that relies on IPIs to handle events like thread migrations (rare, but happens at a non-negligible base rate) will scale badly to really large CPU counts, and, more importantly, may make it hard to identify when the IPIs hurt overall system performance.
The other important design decision is that rlocks use signals instead of cross-modifying code. I’m not opposed to cross-modifying code, but I cringe at the idea of leaving writable and executable pages lying around just for performance. Again, we could mprotect around cross-modification, but mprotect triggers IPIs, and that’s what we’re trying to avoid. Also, if we’re going to mprotect in the common case, we might as well just mmap in different machine code; that’s likely a bit faster than two mprotect and definitely safer (I would use this mmap approach for revocable multi-CPU locks à la Harris and Fraser).
The downside of using signals is that they’re more invasive than cross-modifying code. If user code expects any (async) signal, its handlers must either mask the rlock signal away and not use rlocks, or call the rlock signal handler… not transparent, but not exacting either.
Rlocks really aren’t that much code (560 LOC), and that code is fairly reasonable (no mprotect or self-modification trick, just signals). After more testing and validation, I would consider merging them in Concurrency Kit for production use.
Next step: either mmap-based strict revocable locks for non-blocking concurrent code, or a full implementation of pseudo-per-CPU data based on relaxed rlocks.
With hardware popcount, this compiles to something like the following.
This should raise a few questions:
Someone with a passing familiarity with x86 would also ask why we use popcnt instead of checking the parity flag after xor. Unfortunately, the parity flag only considers the least significant byte of the result (:
When implementing something like the hashing trick or count sketches (PDF), you need two sets of provably strong hash functions: one to pick the destination bucket, and another to decide whether to increment or decrement by the sketched value.
One-bit hash functions are ideal for the latter use case.
The bitwise operations in bit_hash implement a degenerate form of tabulation hashing. It considers the 64 bit input value x as a vector of 64 bits, and associates two intermediate output values with each index. The naïve implementation would be something like the following.
Of course, the representation of random_table is inefficient, and we should hand-roll a bitmap. However, the loop itself is a problem.
The trick is to notice that we can normalise the table so that the value for random_table[i][0] is always 0: in order to do so, we have to fix the initial value for acc to a random bit. That initial value is the hash value for 0, and the values in random_table[i][1] now encode whether a non-zero bit i in x flips the hash value or leaves it as is.
The table argument for bit_hash is simply the 64 bits in random_table[i][1], and bit is the hash value for 0. If bit i in table is 0, bit i is irrelevant to the hash. If bit i in table is 1, the hash flips when bit i in x is 1. Finally, the parity counts how many times the hash was flipped.
I don’t think so. Whenever we need a hash bit, we also want a hash bucket; we might as well steal one bit from the latter wider hash. Worse, we usually want a few such bucket/bit pairs, so we could also compute a wider hash and carve out individual bits.
I only thought about this trick because I’ve been reading a few empirical evaluations of sketching techniques, and a few authors find it normal that computing a hash bit doubles the CPU time spent on hashing. It seems to me the right way to do this is to map columns/features to not-too-small integers (e.g., universal hashing to [0, n^2) if we have n features), and apply strong hashing to these integers. Hashing machine integers is fast, and we can always split strong hashes in multiple values.
In the end, this family of one-bit hash functions seems like a good solution to a problem no one should ever have. But it’s still a cute trick!
In July 2012, I started really looking into searching in static sorted sets, and found the literature disturbingly useless. I reviewed a lot of code, and it turned out that most binary searches out there are not only unsafe for overflow, but also happen to be badly micro-optimised for small arrays with simple comparators. That led to Binary Search eliminates Branch Mispredictions, a reaction to popular assertions that binary search has bad constant factors (compared to linear search or a breadth-first layout) on modern microarchitecture, mostly due to branch mispredictions. That post has code for really riced up searches on fixed array sizes, so here’s the size-generic inner loop I currently use.
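A loop of that kind could look like the sketch below. It is my own minimal version of a branch-free, overflow-safe binary search, not the original listing: the name `lower_bound`, the `uint32_t` element type, and the “first element not less than the key” convention are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns the index of the first element >= key in the sorted array,
 * or n if every element is smaller than key.
 */
size_t lower_bound(const uint32_t *array, size_t n, uint32_t key)
{
    const uint32_t *base = array;

    assert(array != NULL || n == 0);
    if (n == 0)
        return 0;
    while (n > 1) {
        size_t half = n / 2;

        /* Compiles to a conditional move with a simple comparator. */
        base = (base[half] < key) ? base + half : base;
        n -= half;
    }
    return (size_t)(base - array) + (*base < key);
}
```

The search window is tracked as a base pointer and a length, so there is no `(lo + hi) / 2` to overflow. The number of iterations depends only on `n`, never on the data, which leaves the loop branch with nothing hard to predict.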
The snippet above implements a binary search, instead of dividing by three to avoid aliasing issues. That issue only shows up with array sizes that are (near) powers of two. I know of two situations where that happens a lot:
The fix for the first case is to do proper benchmarking on a wide range of input sizes. Ternary or offset binary search are only really useful in the second case. There’s actually a third case: when I’m about to repeatedly search in the same array, I dispatch to unrolled ternary searches, with one routine for each power of two. I can reduce any size to a power of two with one initial iteration on an off-center “midpoint.” Ternary search has a high overhead for small arrays, unless we can precompute offsets by unrolling the whole thing.
My work on binary search taught me how to implement binary search not stupidly–unlike real implementations–and that most experiments on searching in array permutations seem broken in their very design (they focus on full binary trees).
I don’t think I ever make that explicit, but the reason I even started looking into binary search is that I wanted to have a fast implementation of searching in a van Emde Boas layout! However, none of the benchmarks (or analyses) I found were convincing, and I kind of lost steam as I improved sorted arrays: sortedness tends to be useful for operations other than predecessor/successor search.
Some time in May this year, I found Pat Morin’s fresh effort on the exact question I had abandoned over the years: how do popular permutations work in terms of raw CPU time? The code was open, and even good by research standards! Pat had written the annoying part (building the permutations), generated a bunch of tests I could use to check correctness, and avoided obvious microbenchmarking pitfalls. He also found a really nice way to find the return value for BFS searches from the location where the search ends with fast bit operations (j = (i + 1) >> __builtin_ffs(~(i + 1));, which he explains in the paper).
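To show where that expression fits, here is a small sketch of a search in a breadth-first layout. The loop around the quoted formula is my reconstruction (children of slot i in slots 2i + 1 and 2i + 2, descending left when key <= tree[i]); the name `bfs_search` and the size limit are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Search a breadth-first layout. Returns the slot (in the permuted array)
 * of the smallest element >= key, or n if there is none.
 */
unsigned bfs_search(const uint32_t *tree, unsigned n, uint32_t key)
{
    unsigned i = 0;
    unsigned j;

    assert(n < (1u << 29));  /* keeps the shift count below 32 */
    while (i < n)
        i = (key <= tree[i]) ? 2 * i + 1 : 2 * i + 2;
    /*
     * In 1-based numbering, going left appends a 0 bit and going right
     * appends a 1 bit. Dropping the trailing 1s and the 0 before them
     * rewinds to the last node where the search went left.
     */
    j = (i + 1) >> __builtin_ffs((int)~(i + 1));
    return (j == 0) ? n : j - 1;
}
```

For the sorted values 1, 3, 5, 7, 9, 11, 13, the breadth-first layout is {7, 3, 11, 1, 5, 9, 13}. Searching for 6 walks slots 0, 1, 4 and falls off the tree at i = 10; then i + 1 = 11 is 1011 in binary, the shift is by 3, and j - 1 = 0 points at 7, the smallest element >= 6.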
I took that opportunity to improve constant factors for all the implementations, and to really try and explain in detail the performance of each layout with respect to each other, as well as how they respond to the size of the array. That sparked a very interesting back and forth with Pat from May until September (!). Pat eventually took the time to turn our informal exchange into a coherent paper. More than 3 years after I started spending time on the question of array layouts for implicit search trees, I found the research I was looking for… all it took was a bit of collaboration (:
Bonus: the results were unexpected! Neither of the usual suspects (B-tree or van Emde Boas) came out on top, even for very large arrays. I was also surprised to see the breadth-first layout perform much better than straight binary search: none of the usual explanations made sense to me. It turns out that the improved performance (when people weren’t testing on round, power of two, array sizes) was probably an unintended consequence of bad code! Breadth-first is fast, faster than layouts with better cache efficiency, because it prefetches well enough to hide latency even when it extracts only one bit of information from each cache line; its performance has nothing to do with cacheability. Our code prefetches explicitly, but slower branchy implementations in the wild get implicit prefetching, thanks to speculative execution.
Conclusion: if you need to do a lot of comparison-based searches in > L2-sized arrays, use a breadth-first order and prefetch. If you need sorted arrays, consider sticking some prefetches in a decent binary search loop. If only I’d known that in 2012!
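For the sorted array case, a sketch of what “sticking some prefetches” in the loop might look like, assuming int elements and GCC/Clang’s __builtin_prefetch. The function name is hypothetical and the code has not been tuned: it is a conditional-move-friendly lower bound that requests both candidate locations for the next probe before it knows which one it needs.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the first element >= x in sorted a[0..n), or n. */
static size_t lower_bound_prefetch(const int *a, size_t n, int x)
{
    assert(a != NULL || n == 0);
    if (n == 0)
        return 0;

    const int *base = a;
    size_t len = n;
    while (len > 1) {
        size_t half = len / 2;
        /* The next probe lands near one of these two addresses,
         * depending on the comparison below; fetch both early. */
        __builtin_prefetch(&base[half / 2]);
        __builtin_prefetch(&base[half + half / 2]);
        base = (base[half] < x) ? base + half : base;
        len -= half;
    }
    return (size_t)(base - a) + (*base < x);
}
```

The prefetches are only hints, so the slight mismatch between the prefetched address and the actual next probe when len is odd affects speed, not correctness.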
A couple months ago, I found LZ77-like Compression with Fast Random Access by Kreft and Navarro. They describe a Lempel-Ziv approach that is similar to LZ77, but better suited to decompressing arbitrary substrings. The hard part about applying LZ77 compression to (byte)code is that parses may reuse any substring that happens to appear earlier in the original text. That’s why I had to use Jez’s algorithm to convert the LZ77 parse into a (one word) grammar.
LZ-End fixes that.
Kreft and Navarro improve random access decompression by restricting the format of “back-references” in the LZ parse. The parse decomposes the original string into a sequence of phrases; concatenating the phrases yields the string back, and phrases have a compact representation. In LZ77, phrases are compressed because they refer back to substrings in prior phrases. LZ-End adds another constraint: the back-references cannot end in the middle of phrases.
For example, LZ77 might have a back-reference like
[abc][def][ghi]

to represent “cdefg.” LZ-End would be forced to end the new phrase at “f”
[abc][def][ghi]

and only represent “cdef.” The paper shows that this additional restriction has a marginal impact on compression rate, and uses the structure to speed up operations on compressed data. (The formal definition also forbids the cute/annoying self-reference-as-loop idiom of LZ77, without losing too much compression power!)
We can apply the same idea to compress code. Each phrase is now a subroutine with a return at the end. A back-reference is a series of calls to subroutines; the first call might begin in the middle, but matches always end on return, exactly like normal code does! A phrase might begin in the middle of a phrase that itself consists of calls. That’s still implementable: the referrer can see through the indirection and call in the middle of the callee’s callee (etc.), and then go back to the callee for a suitably aligned submatch.
That last step looks like it causes a space blowup, and I can’t bound it (yet).
But that’s OK, because I was only looking into compressing traces as a last resort. I’m much more interested in expression trees, but couldn’t find a way to canonicalize sets (e.g., arguments to integer +) and sequences (e.g., floating point *) so that similar collections have similar subtrees… until I read Hammer et al’s work on Nominal Adapton, which solves a similar problem in a different context.
They want a tree representation for lists and tries (sets/maps) such that a small change in the list/trie causes a small change in the tree that mostly preserves identical subtrees. They also want the representation to be a deterministic function of the list/trie. That way, they can efficiently reuse computations after incremental changes to inputs.
That’s exactly my sequence/set problem! I want a tree-based representation for sequences (lists) and sets (tries) such that similar sequences and sets have mostly identical subtrees for which I can reuse pre-generated code.
Nominal Adapton uses a hash-based construction described by Pugh and Teitelbaum in 1989 (Incremental computation via function caching) to represent lists, and extends the idea for tries. I can “just” use the same trick to canonicalise lists and sets into binary trees, and (probabilistically) get common subexpressions for free, even across expression trees! It’s not perfect, but it should scale pretty well.
That’s what I’m currently exploring when it comes to using compression to reduce cache footprint while doing aggressive specialisation. Instead of finding redundancy in linearised bytecode after the fact, induce identical subtrees for similar expressions, and directly reuse code fragments for subexpressions.
I thought I’d post a snippet on the effect of alignment and virtual memory tricks on TLBs, but couldn’t find time for that. Perhaps later this week. In the meantime, I have to prepare a short talk on the software transactional memory system we built at AppNexus. Swing by 23rd Street on December 15 if you’re in New York!
What do memory allocation, histograms, and event scheduling have in common? They all benefit from rounding values to predetermined buckets, and the same bucketing strategy combines acceptable precision with reasonable space usage for a wide range of values. I don’t know if it has a real name; I had to come up with the (confusing) term “linear-log bucketing” for this post! I also used it twice last week, in otherwise unrelated contexts, so I figure it deserves more publicity.
I’m sure the idea is old, but I first came across this strategy in jemalloc’s binning scheme for allocation sizes. The general idea is to simplify allocation and reduce external fragmentation by rounding allocations up to one of a few bin sizes. The simplest scheme would round up to the next power of two, but experience shows that’s extremely wasteful: in the worst case, an allocation for \(k\) bytes can be rounded up to \(2k - 2\) bytes, for almost 100% space overhead! Jemalloc further divides each power-of-two range into 4 bins, reducing the worst-case space overhead to 25%.
This sub-power-of-two binning covers medium and large allocations. We still have to deal with small ones: the ABI forces alignment on every allocation, regardless of its size, and we don’t want to have too many small bins (e.g., 1 byte, 2 bytes, 3 bytes, …, 8 bytes). Jemalloc adds another constraint: bins are always multiples of the allocation quantum (usually 16 bytes).
The sequence for bin sizes thus looks like: 16, 32, 48, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, 384, … (0 is special because malloc must either return NULL [bad for error checking] or treat it as a full blown allocation).
I like to think of this sequence as a special initial range with 4 linearly spaced subbins (0 to 63), followed by power-of-two ranges that are again split in 4 subbins (i.e., almost logarithmic binning). There are thus two parameters: the size of the initial linear range, and the number of subbins per range. We’re working with integers, so we also know that the linear range is at least as large as the number of subbins (it’s hard to subdivide 8 integers in 16 bins).
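To make that reading of the sequence concrete, here is a small C sketch that computes the n-th bin size directly from the two-part description. The function name bin_size is hypothetical, and the 16-byte quantum and 4 subbins are the values assumed in the text, not something read out of jemalloc’s source.

```c
#include <assert.h>
#include <stdint.h>

/* Size of the index-th bin (0-based) in 16, 32, 48, 64, 80, 96, ... */
static uint64_t bin_size(unsigned index)
{
    const uint64_t quantum = 16; /* assumed allocation quantum */
    const unsigned subbins = 4;  /* bins per range */

    /* Initial linear range: 16, 32, 48, 64. */
    if (index < subbins)
        return quantum * (index + 1);

    /* Afterwards, each power-of-two range (64, 128], (128, 256], ...
     * is split in `subbins` equally spaced bins. */
    unsigned range = (index - subbins) / subbins;
    unsigned step_index = (index - subbins) % subbins;
    assert(range < 57); /* keep the shift and the result within 64 bits */

    uint64_t base = (quantum * subbins) << range; /* 64, 128, 256, ... */
    uint64_t step = base / subbins;
    return base + step * (step_index + 1);
}
```

Indices 0 through 13 give 16, 32, 48, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, 384, the sequence listed above.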
Assuming both parameters are powers of two, we can find the bucket for any value with only a couple x86 instructions, and no conditional jump or lookup in memory. That’s a lot simpler than jemalloc’s implementation; if you’re into Java, HdrHistogram’s binning code is nearly identical to mine.
As always when working with bits, I first doodled in SLIME/SBCL: CL’s bit manipulation functions are more expressive than C’s, and a REPL helps exploration.
Let linear be the \(\log\sb{2}\) of the linear range, and subbin the \(\log\sb{2}\) of the number of subbins per range, with linear >= subbin.
The key idea is that we can easily find the power of two range (with a BSR), and that we can determine the subbin in that range by shifting the value right to only keep its subbin most significant (nonzero) bits.
I clearly need something like \(\lfloor\log\sb{2} x\rfloor\); the rest of this post calls that function lb.
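In C, one way to get such an lb is to count leading zeros. This is a sketch that assumes GCC or Clang (for __builtin_clzll) and a nonzero 64-bit argument; on x86, it should compile down to a bit scan or an LZCNT.

```c
#include <assert.h>
#include <stdint.h>

/* floor(log2(x)): the index of the most significant set bit of x. */
static unsigned lb(uint64_t x)
{
    assert(x != 0); /* __builtin_clzll(0) is undefined */
    return 63u - (unsigned)__builtin_clzll(x);
}
```

The bucketing code never passes 0 to lb, because it always ORs in 1 << linear first.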

I’ll also want to treat values smaller than 2**linear as though they were about 2**linear in size. We’ll do that with
nbits := (lb (logior x (ash 1 linear))) === (max linear (lb x))
We now want to shift away all but the top subbin bits of x
shift := (- nbits subbin)
subindex := (ash x (- shift))
For a memory allocator, the problem is that the last rightward shift rounds down! Let’s add a small mask to round things up:
mask := (ldb (byte shift 0) -1) ; that's `shift` 1 bits
rounded := (+ x mask)
subindex := (ash rounded (- shift))
We have the top subbin bits (after rounding) in subindex. We only need to find the range index
range := (- nbits linear) ; nbits >= linear
Finally, we combine these two together by shifting range left by subbin bits
index := (+ (ash range subbin) subindex)
Extra! Extra! We can also find the maximum value for the bin with
size := (logandc2 rounded mask)
Assembling all this yields a function, bucket, that takes x, linear, and subbin, and returns two values: the bucket index and the rounded-up size.
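The steps above translate almost line for line to C. The sketch below is an illustration under a few assumptions: 64-bit unsigned values, a GCC/Clang builtin for lb, and hypothetical names (struct bucket, bucket_up) that are not taken from any library.

```c
#include <assert.h>
#include <stdint.h>

struct bucket {
    uint64_t index; /* which bin x falls in */
    uint64_t size;  /* largest value that maps to that bin */
};

/* floor(log2(x)) for x != 0, via a GCC/Clang builtin. */
static unsigned lb(uint64_t x)
{
    return 63u - (unsigned)__builtin_clzll(x);
}

/* Round x up to its bin, with 2**linear as the size of the linear
 * range and 2**subbin bins per power-of-two range. */
static struct bucket bucket_up(uint64_t x, unsigned linear, unsigned subbin)
{
    assert(subbin <= linear && linear < 63);

    unsigned nbits = lb(x | (UINT64_C(1) << linear)); /* max(linear, lb(x)) */
    unsigned shift = nbits - subbin;
    uint64_t mask = (UINT64_C(1) << shift) - 1;       /* `shift` low 1 bits */
    uint64_t rounded = x + mask;                      /* wraps for x near 2**64 */
    uint64_t subindex = rounded >> shift;
    uint64_t range = nbits - linear;

    struct bucket ret = { (range << subbin) + subindex, rounded & ~mask };
    return ret;
}
```

There is no branch and no table lookup: only a bit scan, a few shifts, and some arithmetic. With linear = 6 and subbin = 2, the sizes come out as multiples of 16 up to 64, then 80, 96, 112, 128, 160, and so on.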

Let’s look at what happens when we want \(2\sp{2} = 4\) subbins per range, and a linear progression over \([0, 2\sp{4} = 16)\).
CL-USER> (bucket 0 4 2)
0 ; 0 gets bucket 0 and rounds up to 0
0
CL-USER> (bucket 1 4 2)
1 ; 1 gets bucket 1 and rounds up to 4
4
CL-USER> (bucket 4 4 2)
1 ; so does 4
4
CL-USER> (bucket 5 4 2)
2 ; 5 gets the next bucket
8
CL-USER> (bucket 9 4 2)
3
12
CL-USER> (bucket 15 4 2)
4
16
CL-USER> (bucket 17 4 2)
5
20
CL-USER> (bucket 34 4 2)
9
40
The sequence is exactly what we want: 0, 4, 8, 12, 16, 20, 24, 28, 32, 40, 48, …!
The function is marginally simpler if we can round down instead of up: we can skip the mask and shift x directly.
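A matching C sketch for the round-down variant, under the same assumptions as before (64-bit unsigned values, a GCC/Clang builtin, and hypothetical names). Without the rounding mask, the bin’s lower bound is just the subindex shifted back into place.

```c
#include <assert.h>
#include <stdint.h>

struct bucket {
    uint64_t index; /* which bin x falls in */
    uint64_t size;  /* smallest value that maps to that bin */
};

/* floor(log2(x)) for x != 0, via a GCC/Clang builtin. */
static unsigned lb(uint64_t x)
{
    return 63u - (unsigned)__builtin_clzll(x);
}

/* Round x down to its bin, with 2**linear as the size of the linear
 * range and 2**subbin bins per power-of-two range. */
static struct bucket bucket_down(uint64_t x, unsigned linear, unsigned subbin)
{
    assert(subbin <= linear && linear < 64);

    unsigned nbits = lb(x | (UINT64_C(1) << linear)); /* max(linear, lb(x)) */
    unsigned shift = nbits - subbin;
    uint64_t subindex = x >> shift;                   /* truncates: rounds down */
    uint64_t range = nbits - linear;

    struct bucket ret = { (range << subbin) + subindex, subindex << shift };
    return ret;
}
```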

CL-USER> (bucket-down 0 4 2)
0 ; 0 still gets the 0th bucket
0 ; and rounds down to 0
CL-USER> (bucket-down 1 4 2)
0 ; but now so does 1
0
CL-USER> (bucket-down 3 4 2)
0 ; and 3
0
CL-USER> (bucket-down 4 4 2)
1 ; 4 gets its bucket
4
CL-USER> (bucket-down 7 4 2)
1 ; and 7 shares it
4
CL-USER> (bucket-down 15 4 2)
3 ; 15 gets the 3rd bucket for [12, 15]
12
CL-USER> (bucket-down 16 4 2)
4
16
CL-USER> (bucket-down 17 4 2)
4
16
CL-USER> (bucket-down 34 4 2)
8
32
That’s the same sequence of bucket sizes, but rounded down in size instead of up.

I first implemented this code to mimic jemalloc’s binning scheme: in a memory allocator, a linear-logarithmic sequence gives us alignment and bounded space overhead (bounded internal fragmentation), while keeping the number of size classes down (controlling external fragmentation).
High dynamic range histograms use the same class of sequences to bound the relative error introduced by binning, even when recording latencies that vary between microseconds and hours.
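A short derivation, in the notation of the bucketing pseudo-code earlier in this post, shows where that bound comes from. Take a value \(x \geq 2\sp{\mathtt{linear}}\) and let \(n = \lfloor\log\sb{2} x\rfloor\), so that shift is \(n - \mathtt{subbin}\). Rounding up adds at most mask \(= 2\sp{\mathtt{shift}} - 1\), and rounding down removes at most as much, so

\[|\mathrm{size} - x| \leq 2\sp{n - \mathtt{subbin}} - 1 < \frac{2\sp{n}}{2\sp{\mathtt{subbin}}} \leq \frac{x}{2\sp{\mathtt{subbin}}}.\]

The relative error is thus strictly less than \(2\sp{-\mathtt{subbin}}\): under 25% with 4 subbins per range, and under 6.25% with 16. Below \(2\sp{\mathtt{linear}}\), the bound is absolute instead: the error is at most \(2\sp{\mathtt{linear} - \mathtt{subbin}} - 1\).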
I’m currently considering this binning strategy to handle a large number of timeout events, when an exact priority queue is overkill. A timer wheel would work, but tuning memory usage is annoying. Instead of going for a hashed or hierarchical timer wheel, I’m thinking of binning events by timeout, with one FIFO per bin: events may be late, but never by more than, e.g., 10% of their timeout. I also don’t really care about sub-millisecond precision, but wish to treat zero specially; that’s all taken care of by the “round up” linear-log binning code.
In general, if you ever think to yourself that dispatching on the bitwidth of a number would mostly work, except that you need more granularity for large values, and perhaps less for small ones, linear-logarithmic binning sequences may be useful. They let you tune the granularity at both ends, and we know how to round values and map them to bins with simple functions that compile to fast and compact code!
P.S. If a chip out there has fast int->FP conversion and slow bit scans(!?), there’s another approach: convert the integer to FP, scale by, e.g., \(1.0 / 16\), add 1, and shift/mask to extract the bottom of the exponent and the top of the significand. That’s not slow, but unlikely to be faster than a bit scan and a couple shifts/masks.