Lock-free programming, first steps: mostly lock-free fixed-size vector

I recently got interested in lock-free programming techniques, and began playing with the idea of a lock-free, fixed size vector. While the final result isn’t completely lock-free, it is an optimization over a naively locked version. The following is a graph showing the amount of time taken for x threads to complete three high-contention benchmarks on a shared vector:

The Interface

The (unsurprising) interface is specified by an serial reference implementaion, which is implemented subject to respecting the ACID properties in a multithreaded context:

      FixedVectorSemantics(T of Type, Size of Integer) {
        Fields:
          Buffer of Array(size = Size, type = T)
          Length of Integer

        Methods:
          push_back of (Value of Pointer(T)) → Integer {
            if Length < Size then:
              Buffer[Length] ← Value
              OldLength ← Length
              Length ← (Length + 1)
              return OldLength
            else:
              return -1
          }

          pop_back of (Pointer(T) × Integer) {
            if Length > 0 then:
              Length ← Length - 1
              return (Buffer[Length], Length)
            else:
              return OutOfRange
          }

          get of (Index of Integer) → Pointer(T) {
            if Index < Length:
              return Buffer[Index]
            else:
              return OutOfRange
          }
      }
    

There are a few things worth noting about the interface:

Pointer-only store

A FixedVector only stores pointers. This isn’t incidental –the atomic primitives it uses only operate on word-sized data. Moreover, the implementation steals the least significant bit of pointers in some cases. FixedVector can be conceivably be used to store any data that is, in size, smaller than or equal to a pointer, as long as the least significant bit of the datum is always guaranteed to be 0.

push_back and pop_back return indices

In a serial environment, the index at which an element has been inserted by push_back and the index of the element retrieved by pop_back can be inferred by inspecting Length immediately afterwards. In a multi-threaded environment, however, this is racy.

Absence of a put method

Unlike the previous two issue, is only incidental — nothing prevents the addition of a put method, I just haven’t gotten around to it yet.

The ACID Properties

The FixedVector implementation satisfies the ACID properties under the following interpretations:

Atomicity

A method either succeeds (and the change is then insured by durability) or fails (and effects no observable change).

Consistency

FixedVector doesn’t provide any property that I could group under this heading. Note that it does not provide the guarantee that invoking get on an index less than Length will not return OutOfRange. Providing this guarantee will force FixedVector to not provide other, more important guarantees (see the implementation of push_back).

Isolation

Let N requests, R0, R1RN executed concurrently on a FixedVector; and let those requests take the FixedVector from state S0 to S1 and return values V0, V1VN respectively. Then, there is a way to order the requests such that the change in state can also be obtained by serially invoking the requests in some order and have each request return the same values as earlier. Maintaining this property was the toughest aspect of the implementation.

Durability

A request that has gone through (the function called has returned) can’t be reordered with a request that comes in "later". This is a rather loose constraint, since different processors may see different views of the same FixedVector (even though x86’s cache coherency ensures every processor usually gets a consistent view).

As an example, consider the following four operations invoked concurrently on an empty stack:

  1. push_back(5)
  2. push_back(6)
  3. push_back(7)
  4. pop_back

Once all the four requests have been completed, the state of the vector may be one of the following:

Vector contents Value returned from pop_back
Any permutation of [5, 6, 7] OutOfRange
Any permutation of [5, 7] 6
Any permutation of [6, 7] 5
Any permutation of [5, 6] 7

No other state is legal.

The Implementation

I approach thinking about a certain problem by asking myself two, mutually exclusive questions: Is this trivial? If so, how? and Is this non-trivial? If so, why?. These, followed up with some experimentation, usually lead to a solution.

The implementation makes heavy use of atomic reads and writes, and the usual compare and swap. The operations are mostly lock-free, only push_back does some fine grained locking.

Implementing push_back

The simplest code that could possibly work for push_back looks like this:

  Len = AtomicRead(Length)
  if CompareAndSwap(&Length, Len, Len + 1)
    if failed: Retry
  AtomicWrite(Buffer[Len], { value to push })
      

Turns out, this is good enough, with one modification:

  Len = AtomicRead(Length)
  CompareAndSwap(&Length, Len, Len + 1)
    if failed: Retry
  MemoryFence()
  AtomicWrite(Buffer[Len], { value to push })
      

The MemoryFence ensures that write to Buffer[Length] is not reordered ahead of the write to Length. The CPU is normally free to do such a reordering since this reordering preserves the semantics of a single-threaded program. The problem with this reordering is easy to see (the line executing in a particular step is marked with an asterisk):

Thread 0 Thread 1
 push_back(5):
*  Len = Length
   CompareAndSwap(
     &Length, Len, Len + 1)
   Buffer[Len] = 5
	  
 push_back(10):
*  Len = Length
   CompareAndSwap(
     &Length, Len, Len + 1)
   Buffer[Len] = 10
	  
Vector State:

Length = 0
Data = []
	  
 push_back(5):
   Len = Length
     Len is 0 now
*  CompareAndSwap(
    &Length, 0, 1)
   Buffer[0] = 5
	  
 [Thread Stalled]
 push_back(10):
   Len = Length
     Len is 0 now
   CompareAndSwap(
     &Length, 0, 1)
*  Buffer[0] = 10
	  
Vector State:

Length = 0
Data = []
	  
 push_back(5):
   Len = Length
     Len is 0 now
   CompareAndSwap(
     &Length, 0, 1)
*  Buffer[0] = 5
	  
 [Thread Stalled]
 push_back(10):
   Len = Length
     Len is 0 now
   CompareAndSwap(
     &Length, 0, 1)
*  Buffer[0] = 10
	  
Vector State:

Length = 1
Data = []
	  
 push_back(5):
   Successful
	  
 push_back(10):
   Len = Length
     Len is 0 now
   CompareAndSwap(
     &Length, 0, 1)
*  Buffer[0] = 10
	  
Vector State:

Length = 1
Data = [5]
	  
 push_back(5):
   Successful
	  
 push_back(10):
   Len = Length
     Len is 0 now
*  CompareAndSwap(
     &Length, 0, 1)
   Buffer[0] = 10
	  
Vector State:

Length = 1
Data = [10]
	  
 push_back(5):
   Successful
	  
 push_back(10):
   Len = Length
   CompareAndSwap(
     &Length, 0, 1)
     Fails, retries
   Buffer[0] = 10
	  
Vector State:

Length = 1
Data = [10]
	  
 push_back(5):
   Successful
	  
 push_back(10):
   Successful in second
   try, execution skipped
   for brevity
	  
Vector State:

Length = 2
Data = [10, 10]
	  

The state of an empty vector after a serial push_back(5) and a push_back(10) (in either of the two possible orders) can never by [10, 10] — a missing fence breaks isolation. However, with the memory fence in place, the CAS is always executed (and fails when required) before the atomic write and this problem is avoided.

There is another subtlety, however. Consider a concurrent invocation of n + m + 1 (m >= 0) pushes and n pops on a vector with k elements (with k >= n). Assume the first push_back sets the length to k + t + 1 (with t >= 0) and stalls, after which all the other requests execute in a way that the length of the vector when the first request is resumed is greater than k. Since the first request, on resumption, writes a value to Buffer[k], it essentially overwrites a value that was written to Buffer by another push_back! I found this troubling, more so because, in this case, we end up pop_backing a random value that hasn’t yet been pushed! Fortunately, this issue can be solved with no modification to push_back at all, as we shall see. We’ll return to this issue after discussing the implementation for pop_back.

Implementing pop_back

I found implementing pop_back a great deal harder than implementing push_back, primarily because of the ABA problem. The problem stems from the fact that a push_back implementation needs to atomically both modify Length and read Buffer[Length - 1]. If we modify Length first, we might just allow a pop_back and read and return a value different from the one we popped, and if we read Buffer[Length - 1] first, we might just allow a pop_back followed by a push_back to overwrite the value we are about to pop and, again, return a value different from what we popped.

An example of the first issue (when we modify Length before reading Buffer[Length - 1]):

Thread 0 Thread 1
 pop_back:
*  Len = Length
   CompareAndSwap(
     &Length, Len, Len - 1)
   OldValue = Buffer[Len]
   Return OldValue
      
 push_back(10):
*  Len = Length
   CompareAndSwap(
     &Length, Len, Len + 1)
   MemoryFence()
   Buffer[Len] = 10
      
Vector State:

Length = 1
Data = [42]
      
 pop_back:
   Len = Length
*  CompareAndSwap(
     &Length, 1, 0)
   OldValue = Buffer[Len]
   Return OldValue
      
 push_back(10):
*  Len = Length
   CompareAndSwap(
     &Length, Len, Len + 1)
   MemoryFence()
   Buffer[Len] = 10
      
Vector State:

Length = 1
Data = [42]
      
 pop_back:
   Len = Length
   CompareAndSwap(
     &Length, 1, 0)
*  OldValue = Buffer[Len]
   Return OldValue
      
 push_back(10):
*  Len = Length
   CompareAndSwap(
     &Length, Len, Len + 1)
   MemoryFence()
   Buffer[Len] = 10
      
Vector State:

Length = 0
Data = []
      
 pop_back:
   Len = Length
   CompareAndSwap(
     &Length, 1, 0)
*  OldValue = Buffer[Len]
   Return OldValue
      
 push_back(10):
    Succeeds
      
Vector State:

Length = 1
Data = [10]
      
 pop_back:
    "Succeeds", return 10
      
 push_back(10):
    Succeeds
      
Vector State:

Length = 1
Data = [10]
      

We end up breaking isolation in this case — no ordering of push_back(10) and pop_back on a vector [42] can have the pop_back return 10 and change the state of the vector to [10]. A similar example can be made for the case where we read Buffer[Length - 1] before modifying Length.

I eventually had to settle in for a (fine-grained) locking implementation. The lock here is a bit set in the word about to be popped. You cannot pop and return a word that has the lock bit set. The algorithm looks like this (Lock(word) returns word with the lock bit set):

  Len = AtomicRead(Length)
  if Len == 0 then return OutOfRange
  OldValue = AtomicRead(Buffer[Len - 1])
  if OldValue has lock bit set: Retry -- (A)

  CompareAndSwap(&Buffer[Len - 1], OldValue, Lock(OldValue))
    if failed: Retry
  MemoryFence()

  CompareAndSwap(&Length, Len, Len - 1)
    if failed:
      AtomicStore(&Buffer[Len - 1], OldValue)
      Retry
  return OldValue
      

First of all, pop_back isn’t lock free, because of the Retry in (A). In short, a pop_back requested when the length of the vector is k locks all the elements from pop_back with index <= k — since popping an element with index <= k involves popping the element at index k, which (A) forbids us to do until the pop_back either succeeds or fails. I have a hunch that making pop_back completely lock-free will be a significant performance boost.

Note that the lock bit isn’t reset if the pop_back succeeds — this is crucial to the correctness of this algorithm. This prevents a pop_back between a push_back CAS on Length and the subsequent write to Buffer[OldLength]. You see, since we’re pushing a value at location i (say), there must have been a successful pop from location i + 1 to i at some point. And that pop_back must have left the lock bit set for Buffer[i], and this is how the pop_back is locked out till the push_back is finished. This takes care of the situation discussed earlier in the push_back section. We take care of case when a push_back is done at a new index, one which hasn’t ever seen a value by initializing the entire buffer with a sentinel value at allocation. This sentinel value has the lock bit set.

The MemoryFence serves a purpose similar to that in push_back — if the CAS on Length is reordered ahead of the CAS on the lock bit, we might end up returning a different value than we popped.

Implementing get

Implementing get is the easiest of all, since it has the luxury of being a read-only operation:

  get(index):
    Len = AtomicRead(Length)
    if index >= Len: return OutOfRange
    Value = AtomicLoad(&Buffer[index])
    if Value != Sentinel:
      return Value
    else:
      return OutOfRange
      

In the case we do see a value that is being concurrently popped, we can pretend as if the get was reordered to before that particular pop_back request.

My Conclusions

Lock-free programming seems like a different world — multiple cores make ordinary things behave in unexpected ways. I have tested my implementing on an x86 (Intel i7), a coherent paradise, I’m sure testing on a looser architecture will reveal subtle, overlooked issues. And just so you know, there are better ways to effect a lock-free vector, written by people far smarter than me. I’m still learning. 🙂

Advertisements
This entry was posted in Computers and tagged , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s