Lock-free programming for the masses11 Jun 2016
Efficient concurrent programming libraries are essential for taking advantage of fine-grained parallelism on multicore hardware. In this post, I will introduce reagents, a composable, lock-free concurrency library for expressing fine-grained parallel programs on Multicore OCaml. Reagents offer a high-level DSL for experts to specify efficient concurrency libraries, but also allows the consumers of the libraries to extend them further without knowing the details of the underlying implementation.
Designing and implementing scalable concurrency libraries is an enormous undertaking. Decades of research and industrial effort has led to state-of-the-art concurrency libraries such as java.util.concurrent (JUC) for the JVM and System.Collections.Concurrent (SCC) for the .NET framework. These libraries are often written by experts and have subtle invariants, which makes them hard to maintain and improve. Moreover, it is hard for the library user to safely combine multiple atomic operations. For example, while JUC and SCC provide atomic operations on stacks and queues, such atomic operations cannot be combined into larger atomic operations.
On the other hand software transactional memory (STM) offers composability, but STM based data structures are generally less efficient than their lock-free counterparts, especially when there is moderate to high levels of contention. Aaron Turon introduced reagents, an expressive and composable library which retains the performance and scalability of lock-free programming. Reagents allow isolated atomic updates to shared state, as well as message passing communication over channels. Furthermore, reagents provide a set of combinators for sequential composition à la STM, parallel composition à la Join calculus, and selective communication à la Concurrent ML, while being lock-free. Reagents occupy this sweet-spot between expressivity and performance, and we believe it could serve as a great default1 for writing fine-grained concurrent programs in Multicore OCaml.
The basic reagents combinators are presented below.
A reagent value with type
('a,'b) t represents an atomic transaction that
takes an input of type
'a and returns a value of type
'b. The basic atomic
operations are exchanging message on an endpoint of a channel through
updating a shared reference through
swap operation blocks the
calling thread until a matching swap operation is available on the dual
The atomic reference update operation
upd, takes a function which is applied
to the current value of the reference (of type
'a) and the input value (of
'b), and is expected to return an optional pair of the new value for the
reference and a return value (of type
'c). If the update function returns
None, then the invoking thread blocks until the reference is updated. Reagent
implementation takes care of the blocking and signalling necessary for thread
The most important feature of reagents is that it allows composition of reagent
transactions in sequence
>>> and in parallel
<*>, and also to selectively
choose one of the available operations
<+>. Furthermore, these combinators
being arrows, enable
optimisations that cover the common case and help reagents achieve performance
commensurate to hand-written implementations. Reagents library also exposes
monadic combinators for convenience, at the cost of forgoing optimisation
A lock-free stack
The following is a reagent implementation of the Treiber lock-free stack.
We utilise a shared reference of type
'a list ref to represent the stack and
upd operation to perform atomic operations on the stack. The important
take away from this snippet is that the code is no more complicated than a
sequential stack implementation. The logic for backoff, retry, blocking and
signalling are hidden behind the reagents implementation. In particular, the
pop operation blocks the calling thread until the stack is non-empty. Thus,
the experts can write efficient concurrency libraries using reagents while
preserving readability (and as a consequence maintainability) of code.
Furthermore, since the stack interface is exposed as reagents, the individual
operations can be further composed. For example, given two Treiber stacks
pop s1 >>> push s2 transfers elements atomically between the stacks,
pop s1 <*> pop s2 consumes elements atomically from both of the stacks, and
pop s1 <+> pop s2 consumes an element from either of the stacks. Importantly,
the composition preserves the optimisations and blocking/signalling behaviours,
allowing the users of the library to arbitrarily combine and extend the
functionality without knowing about the underlying implementation.
Feeding the philosophers
The parallel composition combinator provides an elegant way to solve the Dining Philosophers problem. The problem imagines a set of philosophers seated around a circular table, forever alternating between thinking and eating. Forks are placed between adjacent philosophers, and each philosopher can only eat after obtaining both the left and right forks. The goal is to design a solution where no philosopher will starve. The problem highlights the issues of deadlock and fairness in concurrent programming.
One way to solve this problem is to model each fork as a pair of endpoints, one for taking and another for dropping the fork.
Now, the solution for a single round of eating can be implemented as follows:
take l_fork <*> take r_fork to atomically take both of the forks.
Reagents ensure that the protocol does not deadlock. After eating, we release
the forks by spawning lightweight threads. In the next round, the philosophers
race for the available forks. If the thread scheduler is fair, then the
protocol provides fairness among the philosophers. The complete solution is
The key idea behind the implementation is that the reagent transaction executes in two phases. The first phase involves collecting all the compare-and-swap (CAS) operations necessary for the transaction, and the second phase is invoking a k-CAS operation (emulated in software). The failure to gather all the available CASes constitutes a permanent failure, causing the thread to explore other alternatives in the case of a selective communication or block otherwise. The failure in the second phase means that there is active interference from other concurrent threads, in which case the transaction is retried.
Performance of the Reagents depends critically on having fine-grained control over threads and schedulers for implementing backoff loops, blocking and signalling. However, one of the main ideas of multicore OCaml is not to bake in the thread scheduler into the compiler but rather describe them as libraries. To this end, the reagents library is functorized over the following generic scheduler interface:
The interface itself only describes the scheduler’s effects, whose behaviour is
defined by the
perform (Suspend f) applies
f to the current continuation, and allows the
Reagent library to stash the thread on the unavailable resource’s wait queue.
The return type of
f is an option to handle the case when the resource might
have become available while suspending. If
None, then the control
returns to the scheduler. Once the resource becomes available, the reagent
library performs the
Resume effect to resume the suspended thread.
Comparison to STM
Reagents are less expressive than STM, which provides serializability. But in
return, Reagents provide stronger progress guarantee (lock-freedom) over STM
(obstruction-freedom)2. A reagent transaction operating more than once on
the same memory location will fail at runtime. Abstractly, this behaviour is
disallowed since it cannot be represented as a k-CAS operation. Due to this
restriction, the transaction
pop s1 >>> push s1 always fails, and prohibits
important patterns such as atomically pushing or popping multiple values from
the same stack. I am currently working on extending the reagents semantics to
relax this invariant. The resultant behaviour will be similar to a version of
snapshot isolation. While this is weaker than serializability semantics offered
by the STM, we will retain the benefit of lock-freedom.
Using the reagents library, we have implemented a collection of composable concurrent data and synchronization structures such as stacks, queues, countdown latches, reader-writer locks, condition variables, exchangers, atomic counters, etc. There is great opportunity here to build a standard library for fine-grained parallelism for Multicore OCaml, incorporating the latest developments in lock-free data structures. There is still work to be done optimising the implementation to remove allocations in the fast path, and fine-tuning the reagents core.
Contributions to the library are most welcome, and is a great way to contribute to the Multicore OCaml effort. Please do file those issues and submit pull-requests.