Understanding Linux Kernel Circular Buffer - linux

There is an article at http://lwn.net/Articles/378262/ that describes the Linux kernel's circular buffer implementation. I have some questions:
Here is the "producer":
spin_lock(&producer_lock);
unsigned long head = buffer->head;
unsigned long tail = ACCESS_ONCE(buffer->tail);
if (CIRC_SPACE(head, tail, buffer->size) >= 1) {
/* insert one item into the buffer */
struct item *item = buffer[head];
produce_item(item);
smp_wmb(); /* commit the item before incrementing the head */
buffer->head = (head + 1) & (buffer->size - 1);
/* wake_up() will make sure that the head is committed before
* waking anyone up */
wake_up(consumer);
}
spin_unlock(&producer_lock);
Questions:
Since this code explicitly deals with memory ordering and atomicity what is the point of the spin_lock()?
So far, my understanding is that ACCESS_ONCE stops compiler reordering, true?
Does produce_item(item) simply issue all of the writes associated with the item?
I believe smp_wmb() guarantees that all of the writes in produce_item(item) complete BEFORE the "publishing" write that follows it. true?
The commentary on the page where I got this code seems to imply that an smp_wmb() would
normally be needed after updating the head index, but wake_up(consumer) does this, so it's not necessary. Is that true? If so, why?
Here is the "consumer":
spin_lock(&consumer_lock);
unsigned long head = ACCESS_ONCE(buffer->head);
unsigned long tail = buffer->tail;
if (CIRC_CNT(head, tail, buffer->size) >= 1) {
/* read index before reading contents at that index */
smp_read_barrier_depends();
/* extract one item from the buffer */
struct item *item = buffer[tail];
consume_item(item);
smp_mb(); /* finish reading descriptor before incrementing tail */
buffer->tail = (tail + 1) & (buffer->size - 1);
}
spin_unlock(&consumer_lock);
Questions specific to "consumer":
What does smp_read_barrier_depends() do? From some comments in a forum it seems like you could have issued an smp_rmb() here, but on some architectures this is unnecessary (x86) and too expensive, so smp_read_barrier_depends() was created to do this optionally... That said, I don't really understand why smp_rmb() is ever necessary!
Is the smp_mb() there to guarantee that all of the reads before it complete before the write after it?

For the producer:
The spin_lock() here is to prevent two producers from trying to modify the queue at the same time.
ACCESS_ONCE does prevent compiler reordering; it also prevents the compiler from reloading the value later. (There's an article about ACCESS_ONCE on LWN that expands on this further.)
Correct.
Also correct.
The (implied) write barrier here is needed before waking the consumer as otherwise the consumer might not see the updated head value.
Consumer:
smp_read_barrier_depends() is a data dependency barrier, which is a weaker form of a read barrier (see 2). The effect in this case is to ensure that buffer->tail is read before using it as an array index in buffer[tail].
smp_mb() here is a full memory barrier, ensuring all reads and writes are committed by this point.
Additional references:
Linux kernel documentation on memory barriers
(Note: I'm not entirely sure about my answers for 5 in the producer and 1 for the consumer, but I believe they're a fair approximation of the facts. I highly recommend reading the documentation page about memory barriers, as it's more comprehensive than anything I could write here.)
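For readers who want to experiment outside the kernel, here is a minimal user-space sketch of the same single-producer/single-consumer pattern in C++. It is not kernel code: the class name `SpscRing` and the helpers `circ_cnt`/`circ_space` are hypothetical, and the mapping is an assumption on my part. A release store on the index stands in for the barrier plus the index update, and an acquire load stands in for ACCESS_ONCE plus the read-side barrier. The helpers use the same arithmetic as the kernel's CIRC_CNT and CIRC_SPACE macros, so the size must be a power of two and one slot always stays empty.

```cpp
#include <atomic>
#include <cstddef>

// Same arithmetic as the kernel's CIRC_CNT / CIRC_SPACE; size must be a power of two.
constexpr std::size_t circ_cnt(std::size_t head, std::size_t tail, std::size_t size) {
    return (head - tail) & (size - 1);
}
constexpr std::size_t circ_space(std::size_t head, std::size_t tail, std::size_t size) {
    return circ_cnt(tail, head + 1, size);
}

// Hypothetical ring for exactly one producer thread and one consumer thread.
template <typename T, std::size_t Size>
class SpscRing {
    static_assert(Size >= 2 && (Size & (Size - 1)) == 0, "Size must be a power of two");
    T slots_[Size];
    std::atomic<std::size_t> head_{0};  // written only by the producer
    std::atomic<std::size_t> tail_{0};  // written only by the consumer

public:
    bool try_push(const T& value) {
        std::size_t head = head_.load(std::memory_order_relaxed);
        std::size_t tail = tail_.load(std::memory_order_acquire);
        if (circ_space(head, tail, Size) < 1) return false;  // full
        slots_[head] = value;
        // Release: the item is committed before the new head becomes visible.
        head_.store((head + 1) & (Size - 1), std::memory_order_release);
        return true;
    }

    bool try_pop(T& out) {
        std::size_t head = head_.load(std::memory_order_acquire);
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (circ_cnt(head, tail, Size) < 1) return false;  // empty
        out = slots_[tail];
        // Release: the slot has been read before it is handed back to the producer.
        tail_.store((tail + 1) & (Size - 1), std::memory_order_release);
        return true;
    }
};
```

With several producers or several consumers you would still need the spinlocks from the original code; the atomics only order one producer against one consumer.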

Related

What is the minimum hardware support required for mutual exclusion of competing threads from a critical section?

When several threads share common data, to avoid race conditions when it is being modified, mutual exclusion is required. These can be implemented if the hardware supports atomic test-and-set instruction.
But can we go even simpler? By having just atomic read operation and atomic write operation, is it possible to achieve mutual exclusion? Dekker's algorithm and Peterson's algorithm are some of the algorithms that can achieve mutual exclusion between just 2 processes if there exists atomic read and atomic write operations.
I have seen that Peterson's algorithm can be extended to involve N processes. The algorithm for that is like this:
lock(for Process i):
/* repeat for all partners */
for (count = 0; count < (NUMPROCS-1); count++) {
flags[i] = count; // I think I'm in position "count" in the queue
turn[count] = i; // and I'm the most recent process to think I'm in position "count"
"wait until // wait until
(for all k != i, flags[k]<count) // everyone thinks they're behind me
or (turn[count] != i)" // or someone later than me thinks they're in position "count"
// now I can update my estimated position to "count"+1
} // now I'm at the head of the queue so I can start my critical section
Unlock (for Process i):
/* tell everyone we are finished */
flags[i] = -1; // I'm not in the queue anymore
As far as I can tell, this algorithm only requires atomic reads and atomic writes. But the above algorithm is for cases where N is known. It cannot be extended to the dynamic N case, since there the concurrent array insertion/allocation would itself have to be protected.
So, is there any known algorithm that can provide mutual exclusion among dynamic N threads, in a preemptive, multi-core environment with no test-and-set instruction? What if the starvation requirement is not there? Or, is it proven that this cannot be done without atomic test-and-set?
Sequentially consistent memory model is assumed, but mention if this is also not required. I think every hardware supports in some way to write a sequentially consistent program.
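To make the N-process algorithm above concrete, here is a rough C++ sketch of it for a fixed N. The names `FilterLock`, `kNumProcs` and `run_demo` are made up for illustration, and the sketch assumes that default (sequentially consistent) `std::atomic` loads and stores are an acceptable stand-in for the "atomic read and atomic write" primitives in the question. No read-modify-write instruction is used anywhere in the lock itself.

```cpp
#include <atomic>
#include <thread>
#include <vector>

constexpr int kNumProcs = 3;  // hypothetical fixed N, known up front

class FilterLock {
    std::atomic<int> flags_[kNumProcs];     // position each process claims; -1 = not queued
    std::atomic<int> turn_[kNumProcs - 1];  // most recent process to claim each position

    bool someone_at_or_ahead(int i, int count) const {
        for (int k = 0; k < kNumProcs; ++k)
            if (k != i && flags_[k].load() >= count) return true;
        return false;
    }

public:
    FilterLock() {
        for (auto& f : flags_) f.store(-1);
        for (auto& t : turn_) t.store(0);
    }

    void lock(int i) {
        for (int count = 0; count < kNumProcs - 1; ++count) {
            flags_[i].store(count);  // I think I'm in position "count"
            turn_[count].store(i);   // and I'm the most recent to think so
            // Wait until everyone is behind me or someone later claimed this position.
            while (turn_[count].load() == i && someone_at_or_ahead(i, count))
                std::this_thread::yield();
        }
    }

    void unlock(int i) { flags_[i].store(-1); }  // I'm not in the queue anymore
};

// Each thread increments a plain (non-atomic) counter under the lock.
long run_demo(int iterations) {
    FilterLock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int i = 0; i < kNumProcs; ++i)
        threads.emplace_back([&lock, &counter, iterations, i] {
            for (int n = 0; n < iterations; ++n) {
                lock.lock(i);
                ++counter;
                lock.unlock(i);
            }
        });
    for (auto& t : threads) t.join();
    return counter;
}
```

If mutual exclusion holds, `run_demo(1000)` returns 3000. The arrays are sized at compile time, which is exactly the fixed-N limitation the question is asking about.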

Memory barrier in the implementation of single producer single consumer

The following implementation from Wikipedia:
volatile unsigned int produceCount = 0, consumeCount = 0;
TokenType buffer[BUFFER_SIZE];
void producer(void) {
while (1) {
while (produceCount - consumeCount == BUFFER_SIZE)
sched_yield(); // buffer is full
buffer[produceCount % BUFFER_SIZE] = produceToken();
// a memory_barrier should go here, see the explanation above
++produceCount;
}
}
void consumer(void) {
while (1) {
while (produceCount - consumeCount == 0)
sched_yield(); // buffer is empty
consumeToken(buffer[consumeCount % BUFFER_SIZE]);
// a memory_barrier should go here, the explanation above still applies
++consumeCount;
}
}
says that a memory barrier must be used between the line that accesses the buffer and the line that updates the Count variable.
This is done to prevent the CPU from reordering the instructions above the fence along-with that below it. The Count variable shouldn't be incremented before it is used to index into the buffer.
If a fence is not used, won't this kind of reordering violate the correctness of code? The CPU shouldn't perform increment of Count before it is used to index into buffer. Does the CPU not take care of data dependency while instruction reordering?
Thanks
If a fence is not used, won't this kind of reordering violate the correctness of code? The CPU shouldn't perform increment of Count before it is used to index into buffer. Does the CPU not take care of data dependency while instruction reordering?
Good question.
In C++, unless some form of memory barrier is used (atomic, mutex, etc), the compiler assumes that the code is single-threaded. In that case, the as-if rule says that the compiler may emit whatever code it likes, provided that the overall observable effect is 'as if' your code was executed sequentially.
As mentioned in the comments, volatile does not necessarily alter this, being merely an implementation-defined hint that the variable may change between accesses (this is not the same as being modified by another thread).
So if you write multi-threaded code without memory barriers, you get no guarantees that changes to a variable in one thread will even be observed by another thread, because as far as the compiler is concerned that other thread should not be touching the same memory, ever.
What you will actually observe is undefined behaviour.
It seems that your question is "can incrementing Count and the assignment to buffer be reordered without changing code behavior?".
Consider the following code transformation:
int count1 = produceCount++;
buffer[count1 % BUFFER_SIZE] = produceToken();
Notice that code behaves exactly as original one: one read from volatile variable, one write to volatile, read happens before write, state of program is the same. However, other threads will see different picture regarding order of produceCount increment and buffer modifications.
Both compiler and CPU can do that transformation without memory fences, so you need to force those two operations to be in correct order.
If a fence is not used, won't this kind of reordering violate the correctness of code?
Nope. Can you construct any portable code that can tell the difference?
The CPU shouldn't perform increment of Count before it is used to index into buffer. Does the CPU not take care of data dependency while instruction reordering?
Why shouldn't it? What would the payoff be for the costs incurred? Things like write combining and speculative fetching are huge optimizations and disabling them is a non-starter.
If you're thinking that volatile alone should do it, that's simply not true. The volatile keyword has no defined thread synchronization semantics in C or C++. It might happen to work on some platforms and it might happen not to work on others. In Java, volatile does have defined thread synchronization semantics, but they don't include providing ordering for accesses to non-volatiles.
However, memory barriers do have well-defined thread synchronization semantics. We need to make sure that no thread can see that data is available before it sees that data. And we need to make sure that a thread that marks data as able to be overwritten is not seen before the thread is finished with that data.
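As an illustration of where the ordering goes, here is one way the Wikipedia code could be written with C++11 atomics instead of volatile plus a comment. This is a sketch, not the article's code: `kBufferSize`, the `int` token type, the bounded loops and `run_pipeline` are my own assumptions so that the example terminates. The release store on each counter is the "memory barrier should go here" from the original, and the acquire load on the other side pairs with it.

```cpp
#include <atomic>
#include <thread>

constexpr unsigned kBufferSize = 8;  // assumed capacity (a power of two)
int buffer[kBufferSize];
std::atomic<unsigned> produceCount{0}, consumeCount{0};

void producer(int n_tokens) {
    for (int token = 1; token <= n_tokens; ++token) {
        unsigned p = produceCount.load(std::memory_order_relaxed);  // only this thread writes it
        while (p - consumeCount.load(std::memory_order_acquire) == kBufferSize)
            std::this_thread::yield();  // buffer is full
        buffer[p % kBufferSize] = token;
        // Release: the buffer write above is visible before the new count is.
        produceCount.store(p + 1, std::memory_order_release);
    }
}

long consumer(int n_tokens) {
    long sum = 0;
    for (int i = 0; i < n_tokens; ++i) {
        unsigned c = consumeCount.load(std::memory_order_relaxed);  // only this thread writes it
        while (produceCount.load(std::memory_order_acquire) - c == 0)
            std::this_thread::yield();  // buffer is empty
        sum += buffer[c % kBufferSize];
        // Release: the buffer read above is done before the slot is handed back.
        consumeCount.store(c + 1, std::memory_order_release);
    }
    return sum;
}

// Runs one producer and one consumer; returns the sum of the tokens 1..n_tokens.
long run_pipeline(int n_tokens) {
    std::thread t(producer, n_tokens);
    long sum = consumer(n_tokens);
    t.join();
    return sum;
}
```

Without the release/acquire pairs, both the compiler and the CPU would be free to make the count update visible before the buffer access, which is the reordering discussed above.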

How/when to release memory in wait-free algorithms

I'm having trouble figuring out a key point in wait-free algorithm design. Suppose a data structure has a pointer to another data structure (e.g. linked list, tree, etc). How can the right time for releasing a data structure be determined?
The problem is this, there are separate operations that can't be executed atomically without a lock. For example one thread reads the pointer to some memory, and increments the use count for that memory to prevent free while this thread is using the data, which might take long, and even if it doesn't, it's a race condition. What prevents another thread from reading the pointer, decrementing the use count and determining that it's no longer used and freeing it before the first thread incremented the use count?
The main issue is that current CPUs only have a single word CAS (compare & swap). Alternatively the problem is that I'm clueless about waitfree algorithms and data structures and after reading some papers I'm still not seeing the light.
IMHO garbage collection can't be the answer. Either the GC would have to be prevented from running while any single thread is inside an atomic block (which would mean there is no guarantee that the GC will ever run again), or the problem is simply pushed to the GC. In the latter case, please explain how the GC would figure out that the data is in the silly state (a pointer has been read [e.g. stored in a local variable] but the use count hasn't been incremented yet).
PS, references to advanced tutorials on wait-free algorithms for morons are welcome.
Edit: You should assume that the problem is being solved in a non-managed language, like C or C++. After all, if it were Java, we'd have no need to worry about releasing memory. Further assume that the compiler may generate code that will store temporary references to objects in registers (invisible to other threads) right before the usage counter increment, and that a thread can be interrupted between loading the object address and incrementing the counter. This of course doesn't mean that the solution must be limited to C or C++, rather that the solution should give a set of primitives that allow the implementation of wait-free algorithms on linked data structures. I'm interested in the primitives and how they solve the problem of designing wait-free algorithms. With such primitives a wait-free algorithm can be implemented equally well in C++ and Java.
After some research I learned this.
The problem is not trivial to solve and there are several solutions each with advantages and disadvantages. The reason for the complexity comes from inter CPU synchronization issues. If not done right it might appear to work correctly 99.9% of the time, which isn't enough, or it might fail under load.
The approaches I found are: 1) hazard pointers, 2) quiescence period based reclamation (used by the Linux kernel in its RCU implementation), 3) reference counting techniques, 4) other techniques, and 5) combinations of the above.
Hazard pointers work by saving the currently active references in a well-known per thread location, so any thread deciding to free memory (when the counter appears to be zero) can check if the memory is still in use by anyone. An interesting improvement is to buffer request to release memory in a small array and free them up in a batch when the array is full. The advantage of using hazard pointers is that it can actually guarantee an upper bound on unreclaimed memory. The disadvantage is that it places extra burden on the reader.
Quiescence period based reclamation works by delaying the actual release of the memory until it's known that each thread has had a chance to finish working on any data that may need to be released. The way to know that this condition is satisfied is to check whether each thread has passed through a quiescent period (not in a critical section) after the object was removed. In the Linux kernel this means something like each task making a voluntary task switch. In a user space application it would be the end of a critical section. This can be achieved with a simple counter: whenever the counter is even the thread is not in a critical section (reading shared data), and whenever the counter is odd the thread is inside a critical section. To move into a critical section or back out, all the thread needs to do is atomically increment the counter. Based on this the "garbage collector" can determine whether each thread has had a chance to finish. There are several approaches. A simple one is to queue up the requests to free memory (e.g. in a linked list or an array), each tagged with the current generation (managed by the GC). When the GC runs, it checks the state of the threads (their state counters) to see whether each has passed to the next generation (its counter is higher than the last time, or is the same and even); any memory can be reclaimed one generation after it was freed. The advantage of this approach is that it places the least burden on the reading threads. The disadvantage is that it can't guarantee an upper bound for the memory waiting to be released (e.g. one thread spending 5 minutes in a critical section, while the data keeps changing and memory isn't released), but in practice it works out all right.
There is a number of reference counting solutions, many of them require double compare and swap, which some CPUs don't support, so can't be relied upon. The key problem remains though, taking a reference before updating the counter. I didn't find enough information to explain how this can be done simply and reliably though. So .....
There are of course a number of "Other" solutions, it's a very important topic of research with tons of papers out there. I didn't examine all of them. I only need one.
And of course the various approaches can be combined, for example hazard pointers can solve the problems of reference counting. But there's a nearly infinite number of combinations, and in some cases a spin lock might theoretically break wait-freedom, but doesn't hurt performance in practice. Somewhat like another tidbit I found in my research: an update built on a simple compare-and-swap retry loop is theoretically not wait-free (it is only lock-free), because in theory (purely in theory) a CAS based update might keep failing an unbounded number of times (imagine a million threads on a million cores each trying to increment and decrement the same counter using CAS). In reality however it rarely fails more than a few times (I suspect it's because the CPUs spend more clocks away from CAS than there are CPUs, but I think if the algorithm returned to the same CAS on the same location every 50 clocks and there were 64 cores there could be a chance of a major problem, then again, who knows, I don't have a hundred core machine to try this). Another result of my research is that designing and implementing wait-free algorithms and data structures is VERY challenging (even if some of the heavy lifting is outsourced, e.g. to a garbage collector [e.g. Java]), and might perform less well than a similar algorithm with carefully placed locks.
So, yeah, it's possible to free memory even without delays. It's just tricky. And if you forget to make the right operations atomic, or to place the right memory barrier, oh, well, you're toast. :-) Thanks everyone for participating.
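The even/odd counter scheme described under quiescence period based reclamation can be sketched roughly as follows. This is only an illustration of the bookkeeping under my own assumptions, not how the Linux kernel implements RCU: `ReaderState`, `take_snapshot`, `has_quiesced` and `safe_to_free` are hypothetical names, the number of reader threads is fixed, and the snapshot is assumed to be taken after the object has been unlinked from the shared structure.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Per-thread state: even = quiescent, odd = inside a read-side critical section.
struct ReaderState {
    std::atomic<std::uint64_t> counter{0};
    void enter() { counter.fetch_add(1); }  // even -> odd
    void leave() { counter.fetch_add(1); }  // odd -> even
};

// Taken by the reclaimer right after an object has been unlinked.
template <std::size_t N>
std::array<std::uint64_t, N> take_snapshot(const std::array<ReaderState, N>& readers) {
    std::array<std::uint64_t, N> snap{};
    for (std::size_t i = 0; i < N; ++i) snap[i] = readers[i].counter.load();
    return snap;
}

// "Counter is higher than the last time, or is the same and even."
inline bool has_quiesced(std::uint64_t snapshot, std::uint64_t current) {
    return current > snapshot || (current == snapshot && current % 2 == 0);
}

// True once no reader can still hold a reference obtained before the snapshot.
template <std::size_t N>
bool safe_to_free(const std::array<ReaderState, N>& readers,
                  const std::array<std::uint64_t, N>& snap) {
    for (std::size_t i = 0; i < N; ++i)
        if (!has_quiesced(snap[i], readers[i].counter.load())) return false;
    return true;
}
```

A reader that entered its critical section before the snapshot blocks reclamation until it leaves; a reader that enters afterwards does not, because it can no longer reach the unlinked object. That asymmetry is why the burden on readers is just two atomic increments.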
I think atomic operations for increment/decrement and compare-and-swap would solve this problem.
Idea:
All resources have a counter which is modified with atomic operations. The counter is initially zero.
Before using a resource: "Acquire" it by atomically incrementing its counter. The resource can be used if and only if the incremented value is greater than zero.
After using a resource: "Release" it by atomically decrementing its counter. The resource should be disposed/freed if and only if the decremented value is equal to zero.
Before disposing: Atomically compare-and-swap the counter value with the minimum (negative) value. Dispose will not happen if a concurrent thread "Acquired" the resource in between.
You haven't specified a language for your question. Here is an example in C#:
class MyResource
{
// Counter is initially zero. Resource will not be disposed until it has
// been acquired and released.
private int _counter;
public bool Acquire()
{
// Atomically increment counter.
int c = Interlocked.Increment(ref _counter);
// Resource is available if the resulting value is greater than zero.
return c > 0;
}
public bool Release()
{
// Atomically decrement counter.
int c = Interlocked.Decrement(ref _counter);
// We should never reach a negative value
Debug.Assert(c >= 0, "Resource was released without being acquired");
// Dispose when we reach zero
if (c == 0)
{
// Mark as disposed by setting counter its minimum value.
// Only do this if the counter remain at zero. Atomic compare-and-swap operation.
if (Interlocked.CompareExchange(ref _counter, int.MinValue, c) == c)
{
// TODO: Run dispose code (free stuff)
return true; // tell caller that resource is disposed
}
}
return false; // released but still in use
}
}
Usage:
// "r" is an instance of MyResource
bool acquired = false;
try
{
if (acquired = r.Acquire())
{
// TODO: Use resource
}
}
finally
{
if (acquired)
{
if (r.Release())
{
// Resource was disposed.
// TODO: Nullify variable or similar to let GC collect it.
}
}
}
I know this is not the best way but it works for me:
for shared dynamic data-structure lists I use usage counter per item
for example:
struct _data
{
    DWORD cnt;      // usage counter
    bool deleted;   // true while the slot is free
    // here add your data
    _data() { cnt=0; deleted=true; }
};
const int MAX = 1024;
_data data[MAX];
now when an item starts being used somewhere:
// start use of data[i]
data[i].cnt++;
and when it is no longer used:
// stop use of data[i]
data[i].cnt--;
if you want to add new item to list then
// add item
for (i=0;i<MAX;i++) // find first deleted item
if (data[i].deleted)
{
data[i].deleted=false;
data[i].cnt=0;
// copy/set your data
break;
}
and now in the background once in a while (on a timer or whatever)
scan data[] and mark all undeleted items with cnt == 0 as deleted (+ free their dynamic memory if they have any)
[Note]
to avoid multi-thread access problems implement a single global lock per data list
and program it so you cannot scan data while any data[i].cnt is changing
one bool and one DWORD suffice for this if you do not want to use OS locks
// globals
bool data_cnt_locked=false;
DWORD data_cnt=0;
now any change of data[i].cnt modify like this:
// start use of data[i]
while (data_cnt_locked) Sleep(1);
data_cnt++;
data[i].cnt++;
data_cnt--;
and modify delete scan like this
while (data_cnt) Sleep(1);
data_cnt_locked=true;
Sleep(1);
if (data_cnt==0) // just to be sure
for (i=0;i<MAX;i++) // here scan for items to delete ...
if (!data[i].cnt)
if (!data[i].deleted)
{
data[i].deleted=true;
data[i].cnt=0;
// release your dynamic data ...
}
data_cnt_locked=false;
PS.
do not forget to play with the sleep times a little to suit your needs
lock free algorithm sleep times are sometimes dependent on the OS task scheduler
this is not really a lock-free implementation
because while the GC is at work everything is locked
but other than that, concurrent accesses do not block each other
so if you do not run the GC too often you are fine

How are "nonblocking" data structures possible?

I'm having trouble understanding how any data structure can be "nonblocking".
Say you're making a "nonblocking" hashtable. At some point or another, your hashtable gets too full, so you have to re-hash into a larger table.
This implies you need to allocate memory, which is a global resource. So it seems that you must obtain some sort of lock to prevent global corruption of the heap... irrespective of possible problems with your data structure itself!
But then that means every other thread must block while you allocate your memory...
What am I missing here?
(How) can you allocate memory without blocking another thread which is doing the same?
Two examples for non blocking designs are optimistic design and Transactional Memory.
The idea of this is - in most of the cases, the blocking is redundant - since two OPs can concurrently occur without interrupting each other. However, sometimes when 2 OPs occur concurrently and the data becomes corrupted because of it - you can roll back to your previous state, and retry.
There might still be locks in these designs, but the time the data is locked is significantly shorter, and is limited only to the critical time where the effect of the OP is taking place.
Just for some definitions, additional information and to distinguish between non-blocking, lock-free and wait-free terms, I recommend reading the following article (I won't copy the relevant passages here as it's too long):
Definitions of Non-blocking, Lock-free and Wait-free
Most strategies have one fundamental pattern in common. They use a compare and swap (CAS) operation in a loop until it succeeds.
For example, let's consider a stack implemented with a linked list. I chose a linked list implementation because it is easy to make concurrent with a CAS, but there are other ways to do it. I will use C-like pseudocode.
Push(T item)
{
Node node = new Node(); // allocate node memory
Node initial;
do
{
initial = head;
node.Value = item;
node.Next = initial;
}
while (CompareAndSwap(head, node, initial) != initial);
}
Pop()
{
Node node;
Node initial;
do
{
initial = head;
node = initial.Next;
}
while (CompareAndSwap(head, node, initial) != initial);
T value = initial.Value;
delete initial; // deallocate node memory
return value;
}
In the above code CompareAndSwap is a non-blocking atomic operation that replaces the value at a memory address with a new value only if the current value matches the expected value, and returns the old value either way. If the old value does not match the expected value then you spin through the loop and try it all again.
All that non-blocking means is that you never wait indefinitely, not that you never wait at all. As long as your heap is also implemented using a non-blocking algorithm, you can implement other non-blocking algorithms on top of it.
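For reference, the CAS loop from the pseudocode might look like this with `std::atomic` in C++. Treat it as a sketch under stated assumptions rather than a production stack: `LockFreeStack` and `push_then_drain` are hypothetical names, `push` may be called from any number of threads, but `pop` as written is only safe when a single thread pops at a time, because freeing nodes under concurrent pops runs into the ABA and memory reclamation problems that need extra machinery.

```cpp
#include <atomic>
#include <thread>
#include <vector>

template <typename T>
class LockFreeStack {
    struct Node { T value; Node* next; };
    std::atomic<Node*> head_{nullptr};

public:
    // Safe to call from many threads concurrently.
    void push(const T& item) {
        Node* node = new Node{item, head_.load(std::memory_order_relaxed)};
        // On failure, compare_exchange_weak reloads the current head into node->next.
        while (!head_.compare_exchange_weak(node->next, node,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {}
    }

    // Assumption: at most one thread pops at a time (no ABA protection here).
    bool pop(T& out) {
        Node* node = head_.load(std::memory_order_acquire);
        // On failure, compare_exchange_weak reloads the current head into node.
        while (node && !head_.compare_exchange_weak(node, node->next,
                                                    std::memory_order_acquire,
                                                    std::memory_order_acquire)) {}
        if (!node) return false;  // stack was empty
        out = node->value;
        delete node;
        return true;
    }
};

// Several threads push 1..per_thread each; the caller then drains and sums.
long push_then_drain(int n_threads, int per_thread) {
    LockFreeStack<int> stack;
    std::vector<std::thread> pushers;
    for (int t = 0; t < n_threads; ++t)
        pushers.emplace_back([&stack, per_thread] {
            for (int v = 1; v <= per_thread; ++v) stack.push(v);
        });
    for (auto& t : pushers) t.join();
    long sum = 0;
    int v = 0;
    while (stack.pop(v)) sum += v;
    return sum;
}
```

No thread ever waits on a lock held by another thread: a failed CAS only means some other thread made progress, so the loser retries with the fresh head.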

Is there a way I can make two reads atomic?

I'm running into a situation where I need the atomic sum of two values in memory. The code I inherited goes like this:
int a = *MemoryLocationOne;
memory_fence();
int b = *MemoryLocationTwo;
return (a + b) == 0;
The individual reads of a and b are atomic, and all writes elsewhere in the code to these two memory locations are also lockless atomic. However the problem is that the values of the two locations can and do change between the two reads.
So how do I make this operation atomic? I know all about CAS, but it tends to only involve making read-modify-write operations atomic and that's not quite what I want to do here.
Is there a way to do it, or is the best option to refactor the code so that I only need to check one value?
Edit: Thanks, I didn't mention that I wanted to do this locklessly in the first revision, but some people picked up on it after my second revision. I know no one believes people when they say things like this, but I can't use locks practically. I'd have to emulate a mutex with atomics and that'd be more work than refactoring the code to keep track of one value instead of two.
For now my method of investigation involves taking advantage of the fact that the values are consecutive and grabbing them atomically with a 64 bit read, which I'm assured are atomic on my target platforms. If anyone has new ideas, please contribute! Thanks.
If you truly need to ensure that a and b don't change while you are doing this test, then you need to use the same synchronization for all access to a and b. That's your only choice. Each read and each write to either of these values needs to use the same memory fence, synchronizer, semaphore, timeslice lock, or whatever mechanism is used.
With this, you can ensure that if you:
memory_fence_start();
int a = *MemoryLocationOne;
int b = *MemoryLocationTwo;
int test = (a + b) == 0;
memory_fence_stop();
return test;
then a will not change while you are reading b. But again, you have to use the same synchronization mechanism for all access to a and to b.
To reflect a later edit to your question that you are looking for a lock-free method, well, it depends entirely on the processor you are using and on how long a and b are and on whether or not these memory locations are consecutive and aligned properly.
Assuming these are consecutive in memory and 32 bits each and that your processor has an atomic 64-bit read, then you can issue an atomic 64-bit read to read the two values in, parse the two values out of the 64-bit value, do the math and return what you want to return. Assuming you never need an atomic update to "a and b at the same time" but only atomic updates to "a" or to "b" in isolation, then this will do what you want without locks.
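A portable way to sketch that idea in C++ is to keep both 32-bit values inside one `std::atomic<std::uint64_t>`, so that a single load returns a consistent pair. Note that this is a variation, not exactly the approach above: standard C++ has no way to mix 32-bit writes with a 64-bit atomic read of the same memory, so in this sketch writers update their half with a compare-exchange loop. The names `packed`, `set_a`, `set_b` and `sum_is_zero` are made up for the example.

```cpp
#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> packed{0};  // low 32 bits hold a, high 32 bits hold b

// Replace only the low half (a), leaving b untouched.
void set_a(std::int32_t a) {
    std::uint64_t old = packed.load(std::memory_order_relaxed);
    std::uint64_t desired;
    do {
        desired = (old & 0xFFFFFFFF00000000ull) | static_cast<std::uint32_t>(a);
    } while (!packed.compare_exchange_weak(old, desired,
                                           std::memory_order_release,
                                           std::memory_order_relaxed));
}

// Replace only the high half (b), leaving a untouched.
void set_b(std::int32_t b) {
    std::uint64_t old = packed.load(std::memory_order_relaxed);
    std::uint64_t desired;
    do {
        desired = (old & 0x00000000FFFFFFFFull) |
                  (static_cast<std::uint64_t>(static_cast<std::uint32_t>(b)) << 32);
    } while (!packed.compare_exchange_weak(old, desired,
                                           std::memory_order_release,
                                           std::memory_order_relaxed));
}

// One atomic load yields both values as they existed at the same instant.
bool sum_is_zero() {
    std::uint64_t snapshot = packed.load(std::memory_order_acquire);
    auto a = static_cast<std::int32_t>(static_cast<std::uint32_t>(snapshot));
    auto b = static_cast<std::int32_t>(static_cast<std::uint32_t>(snapshot >> 32));
    return static_cast<std::int64_t>(a) + b == 0;  // widen to avoid signed overflow
}
```

Whether the 64-bit atomic is actually lock-free on a given target can be checked with `packed.is_lock_free()`.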
You would have to ensure that everywhere either of the two values is read or written, the access is protected by the same lock (or critical section).
// all reads...
lock(lockProtectingAllAccessToMemoryOneAndTwo)
{
a = *MemoryLocationOne;
b = *MemoryLocationTwo;
}
...
// all writes...
lock(lockProtectingAllAccessToMemoryOneAndTwo)
{
*MemoryLocationOne = someValue;
*MemoryLocationTwo = someOtherValue;
}
If you are targeting x86, you can use the 64-bit compare/exchange support and pack both int's into a single 64-bit word.
On Windows, you would do this:
// Skipping ensuring padding.
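union Data
{
    struct
    {
        int a;
        int b;
    } members;  // named member, so captured.members.a is valid
    LONGLONG _64bitData;
};
Data* data;
Data captured;
int result;  // declared outside the loop so it is usable afterwards
do
{
    captured = *data;
    result = captured.members.a + captured.members.b;
    // The compare/exchange writes back the same value; it only confirms
    // that both halves were unchanged while we read them.
} while (InterlockedCompareExchange64((LONGLONG*)&data->_64bitData,
    captured._64bitData,
    captured._64bitData) != captured._64bitData);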
Really ugly. I'd suggest using a lock - much more maintainable.
EDIT:
To update and read the individual parts:
data->members.a = 0;
fence();
data->members.b = 0;
fence();
int capturedA = data->members.a;
int capturedB = data->members.b;
There really is no way to do this without a lock. No processors have a double atomic read, as far as I know.
