Is there a way I can make two reads atomic? - multithreading

I'm running into a situation where I need the atomic sum of two values in memory. The code I inherited goes like this:
int a = *MemoryLocationOne;
memory_fence();
int b = *MemoryLocationTwo;
return (a + b) == 0;
The individual reads of a and b are atomic, and all writes elsewhere in the code to these two memory locations are also lockless atomic. However the problem is that the values of the two locations can and do change between the two reads.
So how do I make this operation atomic? I know all about CAS, but it tends to only involve making read-modify-write operations atomic and that's not quite what I want to do here.
Is there a way to do it, or is the best option to refactor the code so that I only need to check one value?
Edit: Thanks, I didn't mention that I wanted to do this locklessly in the first revision, but some people picked up on it after my second revision. I know no one believes people when they say things like this, but I can't use locks practically. I'd have to emulate a mutex with atomics and that'd be more work than refactoring the code to keep track of one value instead of two.
For now my method of investigation involves taking advantage of the fact that the values are consecutive and grabbing them atomically with a 64 bit read, which I'm assured are atomic on my target platforms. If anyone has new ideas, please contribute! Thanks.

If you truly need to ensure that a and b don't change while you are doing this test, then you need to use the same synchronization for all access to a and b. That's your only choice. Each read and each write to either of these values needs to be guarded by the same mechanism: a mutex, critical section, semaphore, or whatever you choose. Note that a memory fence alone only orders accesses; it does not provide the mutual exclusion you need here.
With this, you can ensure that if you:
lock();   // the same lock every reader and writer of these locations must take
int a = *MemoryLocationOne;
int b = *MemoryLocationTwo;
int test = (a + b) == 0;
unlock();
return test;
then a will not change while you are reading b. But again, you have to use the same synchronization mechanism for all access to a and to b.
To reflect a later edit to your question that you are looking for a lock-free method: well, it depends entirely on the processor you are using, on the size of a and b, and on whether or not these memory locations are consecutive and aligned properly.
Assuming these are consecutive in memory and 32 bits each and that your processor has an atomic 64-bit read, then you can issue an atomic 64-bit read to read the two values in, parse the two values out of the 64-bit value, do the math and return what you want to return. Assuming you never need an atomic update to "a and b at the same time" but only atomic updates to "a" or to "b" in isolation, then this will do what you want without locks.
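To make the packed-read approach concrete, here is a minimal C++11 sketch (my own illustration; the std::atomic packing and all names are assumptions, not the asker's actual code):
#include <atomic>
#include <cstdint>

// Two 32-bit values packed into one 64-bit atomic word, so that a single
// load observes a consistent snapshot of both.
std::atomic<uint64_t> packed{0};  // low 32 bits = a, high 32 bits = b

bool sum_is_zero()
{
    uint64_t snapshot = packed.load();  // one atomic 64-bit read
    int32_t a = static_cast<int32_t>(snapshot & 0xFFFFFFFFu);
    int32_t b = static_cast<int32_t>(snapshot >> 32);
    return (a + b) == 0;
}

// A writer that updates only one half must preserve the other half,
// so it uses a small CAS loop instead of a plain store.
void set_a(int32_t value)
{
    uint64_t old = packed.load();
    uint64_t desired;
    do {
        desired = (old & 0xFFFFFFFF00000000u) | static_cast<uint32_t>(value);
    } while (!packed.compare_exchange_weak(old, desired));
}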

You would have to ensure that everywhere either of the two values was read or written, the access was protected by the same lock (mutex or critical section).
// all reads...
lock(lockProtectingAllAccessToMemoryOneAndTwo)
{
    a = *MemoryLocationOne;
    b = *MemoryLocationTwo;
}
...
// all writes...
lock(lockProtectingAllAccessToMemoryOneAndTwo)
{
    *MemoryLocationOne = someValue;
    *MemoryLocationTwo = someOtherValue;
}

If you are targeting x86, you can use the 64-bit compare/exchange support and pack both ints into a single 64-bit word.
On Windows, you would do this:
// Note: this skips over padding/alignment concerns.
union Data
{
    struct
    {
        int a;
        int b;
    } members;
    LONGLONG _64bitData;
};

Data* data;
Data captured;
int result;
do
{
    // Non-atomic copy, then verify it was consistent: the CAS stores back
    // the same value and only succeeds if the word still equals the copy.
    captured = *data;
    result = captured.members.a + captured.members.b;
} while (InterlockedCompareExchange64(&data->_64bitData,
                                      captured._64bitData,
                                      captured._64bitData) != captured._64bitData);
Really ugly. I'd suggest using a lock - much more maintainable.
EDIT:
To update and read the individual parts:
data->members.a = 0;
fence();
data->members.b = 0;
fence();
int capturedA = data->members.a;
int capturedB = data->members.b;

There really is no general way to do this without a lock: no processor I know of has an atomic read of two arbitrary, unrelated memory locations. The tricks above work only because the two values are adjacent, so a single wide read can cover both.


Peterson's solution using just one variable

For Pi:
do {
    turn = i;           // prepare to enter section
    while (turn == j);  // wait
    // critical section
    turn = j;           // exit section
} while (true);
For Pj:
do {
    turn = j;           // prepare to enter section
    while (turn == i);  // wait
    // critical section
    turn = i;           // exit section
} while (true);
In this simplified algorithm, if process i wants to enter its critical section, it sets "turn = i" (unlike Peterson's solution, which sets "turn = j"). This algorithm does not seem to cause deadlock or starvation, so why isn't Peterson's algorithm simplified like this?
Another question: as far as I know, mutual exclusion mechanisms such as semaphore P/V operations require atomicity (P must test sem.value and decrement it as one step). But why does the algorithm above, which uses just the one variable turn, not seem to require atomicity (turn = i and the test turn == j are not atomic together)?
Before you ask whether the algorithm avoids deadlock and starvation, you first have to verify that it still locks. With your version, even assuming sequential consistency, the operations could be sequenced like this:
Pi                              Pj
turn = i;
while (turn == j);  // exits immediately
                                turn = j;
                                while (turn == i);  // exits immediately
// critical section             // critical section
and you have a mutual exclusion violation: both processes are in their critical sections at the same time.
To your second question: it depends on what you mean by "atomicity". You do need it to be the case that when one thread stores turn = i; then the other thread loading turn will only read i or j and not anything else. On some machines, depending on the type of turn and the values of i and j, you could get tearing and load an entirely different value. So whatever language you are using may require you to declare turn as "atomic" in some fashion to avoid this. In C++ in particular, if turn isn't declared std::atomic, then any concurrent read/write access is a data race, and the behavior of the entire program becomes undefined (that's bad).
Besides the need to avoid tearing and data races, Peterson's algorithm also requires strict memory ordering (sequential consistency), which on many systems / languages is not guaranteed unless specially requested, again perhaps by declaring the variable as atomic in some fashion.
It is true that unlike more typical lock algorithms, Peterson doesn't require an atomic read-modify-write, only atomic sequentially consistent loads and stores. That's precisely what makes it an interesting and clever algorithm. But there's a substantial tradeoff in complexity and performance, especially if you want more than two threads, and most real-life systems do have reasonably efficient atomic RMW instructions, so Peterson is rarely used in practice.
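For reference, here is what the real two-thread Peterson lock looks like in C++11, using std::atomic with the default sequentially consistent loads and stores (a sketch of my own; the variable names are not from the question):
#include <atomic>

// Two-thread Peterson lock. Thread ids are 0 and 1.
std::atomic<bool> flag[2] = {{false}, {false}};
std::atomic<int> turn{0};

void lock(int me)
{
    int other = 1 - me;
    flag[me].store(true);  // I want to enter
    turn.store(other);     // but I let the other go first (note: other, not me)
    // Wait only while the other thread wants in and it is its turn.
    while (flag[other].load() && turn.load() == other) { /* spin */ }
}

void unlock(int me)
{
    flag[me].store(false);
}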

Is there a data race?

class Test {
    struct hazard_pointer {
        std::atomic<void*> hp;
        std::atomic<std::thread::id> id;
    };
    hazard_pointer hazard_pointers[max_hazard_pointers];
    std::atomic<void*>& get_hazard_pointer_for_current_thread() {
        std::thread::id id = std::this_thread::get_id();
        for (int i = 0; i < max_hazard_pointers; i++) {
            if (hazard_pointers[i].id.load() == id) {
                hazard_pointers[i].id.store(id);
                return hazard_pointers[i].hp;
            }
        }
        std::atomic<nullptr> ptr;
        return ptr;
    }
};
int main() {
    Test* t = new Test();
    std::thread t1([&t]() { while (1) t->get_hazard_pointer_for_current_thread(); });
    std::thread t2([&t]() { while (1) t->get_hazard_pointer_for_current_thread(); });
    t1.join();
    t2.join();
    return 0;
}
The function get_hazard_pointer_for_current_thread can be executed in parallel. Is there a data race? To my eye there is no data race because of the atomic operations, but I am not sure.
So, please reassure me, or explain why there is a data race (or several).
Let's assume that hazard_pointers array elements are initialized.
There are a few errors in the code:
get_hazard_pointer_for_current_thread may not return any value - undefined behaviour.
hazard_pointers array elements are not initialized.
if(hazard_pointers[i].id.load() == id) hazard_pointers[i].id.store(id); does not make any sense.
And yes, there is a data race. Between statement if(hazard_pointers[i].id.load() == id) and hazard_pointers[i].id.store(id); another thread may change hazard_pointers[i].id. You probably need to use a compare-and-swap instruction.
I don't think you have any C++ UB from concurrent access to non-atomic data, but it looks like you do have the normal kind of race condition in your code.
if (x==a) x = b almost always needs to be an atomic read-modify-write (instead of separate atomic loads and atomic stores) in lock-free algorithms, unless there's some reason why it's ok to still store b if x changed to something other than a between the check and the store.
(In this case, the only thing that can ever be stored is the value that was already there, as @MargaretBloom points out. So there's no "bug", just a bunch of useless stores if this is the only code that touches the array. I'm assuming that you didn't really intend to write a useless example, so I'm considering it a bug.)
Lock-free programming is not easy, even if you do it the low-performance way with the default std::memory_order_seq_cst for all the stores so the compiler has to MFENCE everywhere. Making everything atomic only avoids C++ UB; you still have to carefully design the logic of your algorithm to make sure it's correct even if multiple stores/loads from other thread(s) become visible between every one of your own operations, and stuff like that. (e.g. see Preshing's lock-free hash table.)
Being UB-free is necessary (at least in theory) but definitely not sufficient for code to be correct / safe. Being race-free means no (problematic) races even between atomic accesses. This is a stronger but still not sufficient part of being bug-free.
I say "in theory" because in practice a lot of code with UB happens to compile the way we expect, and will only bite you on other platforms, or with future compilers, or with different surrounding code that exposes the UB during optimization.
Testing can't easily detect all bugs, esp. if you only test on strongly-ordered x86 hardware, but a simple bug like this should be easily detectable with testing.
The problem with your code, in more detail:
You do a non-atomic compare-exchange, with an atomic load and a separate atomic store:
if (hazard_pointers[i].id.load() == id) {
    // a store from another thread can become visible here
    hazard_pointers[i].id.store(id);
    return hazard_pointers[i].hp;
}
The .store() should be a std::compare_exchange_strong, so the value isn't modified if a store from another thread changed the value between your load and your store. (Putting it inside an if on a relaxed or acquire load is still a good idea; I think a branch to avoid a lock cmpxchg is a good idea if you expect the value to not match most of the time. That should let the cache lines stay Shared when no thread finds a match on those elements.)
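A sketch of that fix, keeping the question's member names (my own illustration; it follows the common hazard-pointer pattern of claiming a free slot, marked by a default-constructed std::thread::id, rather than the question's self-comparison):
std::atomic<void*>& get_hazard_pointer_for_current_thread()
{
    std::thread::id id = std::this_thread::get_id();
    for (int i = 0; i < max_hazard_pointers; i++) {
        std::thread::id expected;  // default-constructed id means "no owner"
        // Atomically claim the slot; succeeds only if it is still free.
        if (hazard_pointers[i].id.compare_exchange_strong(expected, id)) {
            return hazard_pointers[i].hp;
        }
    }
    throw std::runtime_error("no free hazard-pointer slot");  // needs <stdexcept>
}
In real code each thread would claim its slot once (e.g. via a thread_local owner object) and then reuse hp, rather than claiming a fresh slot on every call.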

How/when to release memory in wait-free algorithms

I'm having trouble figuring out a key point in wait-free algorithm design. Suppose a data structure has a pointer to another data structure (e.g. linked list, tree, etc.): how can the right time for releasing that data structure be determined?
The problem is this: there are separate operations involved that can't be executed atomically without a lock. For example, one thread reads the pointer to some memory and increments the use count for that memory to prevent it being freed while the thread is using the data, which might take long; and even if it doesn't, it's a race condition. What prevents another thread from reading the pointer, decrementing the use count, determining that the memory is no longer used, and freeing it before the first thread has incremented the use count?
The main issue is that current CPUs only have a single-word CAS (compare & swap). Alternatively, the problem is that I'm clueless about wait-free algorithms and data structures, and after reading some papers I'm still not seeing the light.
IMHO garbage collection can't be the answer, because either the GC would have to be prevented from running whenever any thread is inside an atomic block (which would mean it can't be guaranteed that the GC will ever run again), or the problem is simply pushed to the GC, in which case, please explain how the GC would figure out whether the data is in the silly state (a pointer has been read [e.g. stored in a local variable] but the use count hasn't been incremented yet).
PS, references to advanced tutorials on wait-free algorithms for morons are welcome.
Edit: You should assume that the problem is being solved in a non-managed language, like C or C++. After all, if it were Java, we'd have no need to worry about releasing memory. Further assume that the compiler may generate code that stores temporary references to objects in registers (invisible to other threads) right before the usage counter increment, and that a thread can be interrupted between loading the object address and incrementing the counter. This of course doesn't mean that the solution must be limited to C or C++; rather, the solution should give a set of primitives that allow the implementation of wait-free algorithms on linked data structures. I'm interested in the primitives and how they solve the problem of designing wait-free algorithms. With such primitives, a wait-free algorithm could be implemented equally well in C++ or Java.
After some research I learned this.
The problem is not trivial to solve, and there are several solutions, each with advantages and disadvantages. The complexity comes from inter-CPU synchronization issues. If not done right, it might appear to work correctly 99.9% of the time, which isn't enough, or it might fail under load.
The solutions I found are: 1) hazard pointers, 2) quiescence-period-based reclamation (used by the Linux kernel in its RCU implementation), 3) reference counting techniques, 4) other approaches, and 5) combinations of the above.
Hazard pointers work by saving the currently active references in a well-known per-thread location, so any thread deciding to free memory (when the counter appears to be zero) can check whether the memory is still in use by anyone. An interesting improvement is to buffer requests to release memory in a small array and free them in a batch when the array is full. The advantage of hazard pointers is that they can actually guarantee an upper bound on unreclaimed memory. The disadvantage is that they place an extra burden on the reader.
Quiescence-period-based reclamation works by delaying the actual release of the memory until it's known that each thread has had a chance to finish working on any data that may need to be released. The way to know that this condition is satisfied is to check whether each thread has passed through a quiescent period (a point where it is not in a critical section) after the object was removed. In the Linux kernel this means something like each task making a voluntary task switch. In a user-space application it would be the end of a critical section. This can be tracked with a simple per-thread counter: when the counter is even the thread is not in a critical section (not reading shared data), and when it is odd the thread is inside one; to enter or leave a critical section, all the thread needs to do is atomically increment its counter. Based on this, the "garbage collector" can determine whether each thread has had a chance to finish. There are several approaches; a simple one is to queue up the requests to free memory (e.g. in a linked list or an array), each tagged with the current generation (managed by the GC). When the GC runs, it checks the state of the threads (their state counters) to see whether each has passed to the next generation (its counter is higher than last time, or is the same and even); any memory can be reclaimed one generation after it was freed. The advantage of this approach is that it places the least burden on the reading threads. The disadvantage is that it can't guarantee an upper bound on the memory waiting to be released (e.g. one thread could spend 5 minutes in a critical section while the data keeps changing and memory isn't released), but in practice it works out all right.
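A minimal sketch of that even/odd counter scheme in C++ (my own illustration; production implementations such as the kernel's RCU are considerably more elaborate):
#include <atomic>

const int MAX_THREADS = 8;

// Per-thread state counter: even = quiescent, odd = inside a read-side
// critical section. Each thread increments only its own slot.
std::atomic<unsigned> state[MAX_THREADS];  // globals are zero-initialized

void reader_enter(int tid) { state[tid].fetch_add(1); }  // becomes odd
void reader_exit(int tid)  { state[tid].fetch_add(1); }  // becomes even

// Called by the reclaimer after unlinking an object: wait until every
// thread has either been quiescent or left the section it was in.
void wait_for_grace_period()
{
    unsigned snap[MAX_THREADS];
    for (int i = 0; i < MAX_THREADS; i++)
        snap[i] = state[i].load();
    for (int i = 0; i < MAX_THREADS; i++) {
        if (snap[i] % 2 == 0) continue;          // was quiescent at snapshot
        while (state[i].load() == snap[i]) { }   // spin until it moves on
    }
    // Every object unlinked before the snapshot is now safe to free.
}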
There are a number of reference counting solutions; many of them require a double-width compare-and-swap, which some CPUs don't support, so they can't be relied upon everywhere. The key problem remains, though: taking a reference before updating the counter. I didn't find enough information to explain how this can be done simply and reliably. So .....
There are of course a number of "other" solutions; it's a very important topic of research with tons of papers out there. I didn't examine all of them. I only need one.
And of course the various approaches can be combined; for example, hazard pointers can solve the problems of reference counting. But there's a nearly infinite number of combinations, and in some cases a spin lock might theoretically break wait-freedom while not hurting performance in practice. Somewhat related, another tidbit I found in my research: it's theoretically not possible to implement wait-free algorithms using compare-and-swap, because in theory (purely in theory) a CAS-based update might keep failing for a non-deterministically long time (imagine a million threads on a million cores, each trying to increment and decrement the same counter using CAS). In reality, however, it rarely fails more than a few times (I suspect it's because the CPUs spend more clocks away from the CAS than there are CPUs, though I think if the algorithm returned to the same CAS on the same location every 50 clocks and there were 64 cores, there could be a chance of a major problem; then again, who knows, I don't have a hundred-core machine to try this). Another result of my research is that designing and implementing wait-free algorithms and data structures is VERY challenging (even if some of the heavy lifting is outsourced, e.g. to a garbage collector [e.g. in Java]), and they might perform less well than a similar algorithm with carefully placed locks.
So, yeah, it's possible to free memory even without delays. It's just tricky. And if you forget to make the right operations atomic, or to place the right memory barrier, oh, well, you're toast. :-) Thanks everyone for participating.
I think atomic operations for increment/decrement and compare-and-swap would solve this problem.
Idea:
All resources have a counter which is modified with atomic operations. The counter is initially zero.
Before using a resource: "Acquire" it by atomically incrementing its counter. The resource can be used if and only if the incremented value is greater than zero.
After using a resource: "Release" it by atomically decrementing its counter. The resource should be disposed/freed if and only if the decremented value is equal to zero.
Before disposing: Atomically compare-and-swap the counter value with the minimum (negative) value. Dispose will not happen if a concurrent thread "Acquired" the resource in between.
You haven't specified a language for your question. Here is an example in C#:
class MyResource
{
    // Counter is initially zero. Resource will not be disposed until it has
    // been acquired and released.
    private int _counter;

    public bool Acquire()
    {
        // Atomically increment the counter.
        int c = Interlocked.Increment(ref _counter);
        // The resource is available if the resulting value is greater than zero.
        return c > 0;
    }

    public bool Release()
    {
        // Atomically decrement the counter.
        int c = Interlocked.Decrement(ref _counter);
        // We should never reach a negative value.
        Debug.Assert(c >= 0, "Resource was released without being acquired");
        // Dispose when we reach zero.
        if (c == 0)
        {
            // Mark as disposed by setting the counter to its minimum value,
            // but only if the counter is still zero: an atomic
            // compare-and-swap operation.
            if (Interlocked.CompareExchange(ref _counter, int.MinValue, c) == c)
            {
                // TODO: Run dispose code (free stuff)
                return true; // tell the caller that the resource is disposed
            }
        }
        return false; // released but still in use
    }
}
Usage:
// "r" is an instance of MyResource
bool acquired = false;
try
{
if (acquired = r.Acquire())
{
// TODO: Use resource
}
}
finally
{
if (acquired)
{
if (r.Release())
{
// Resource was disposed.
// TODO: Nullify variable or similar to let GC collect it.
}
}
}
I know this is not the best way, but it works for me: for shared dynamic data-structure lists I use a usage counter per item.
For example:
struct _data
{
    DWORD cnt;      // usage counter
    bool deleted;
    // here add your data
    _data() { cnt=0; deleted=true; }
};
const int MAX = 1024;
_data data[MAX];
now when an item starts to be used somewhere:
// start use of data[i]
data[i].cnt++;
and after it is no longer used:
// stop use of data[i]
data[i].cnt--;
if you want to add a new item to the list:
// add item
for (i=0;i<MAX;i++) // find first deleted item
    if (data[i].deleted)
    {
        data[i].deleted=false;
        data[i].cnt=0;
        // copy/set your data
        break;
    }
and now in the background, once in a while (on a timer or whatever),
scan data[] and mark any undeleted item with cnt == 0 as deleted (+ free its dynamic memory if it has any)
[Note]
To avoid multi-threaded access problems, implement a single global lock per data list,
and program it so you cannot scan the data while any data[i].cnt is changing.
One bool and one DWORD suffice for this if you do not want to use OS locks:
// globals
bool data_cnt_locked=false;
DWORD data_cnt=0;
now wrap any change of data[i].cnt like this:
// start use of data[i]
while (data_cnt_locked) Sleep(1);
data_cnt++;
data[i].cnt++;
data_cnt--;
and modify the delete scan like this:
while (data_cnt) Sleep(1);
data_cnt_locked=true;
Sleep(1);
if (data_cnt==0) // just to be sure
    for (i=0;i<MAX;i++) // here scan for items to delete ...
        if (!data[i].cnt)
            if (!data[i].deleted)
            {
                data[i].deleted=true;
                data[i].cnt=0;
                // release your dynamic data ...
            }
data_cnt_locked=false;
PS.
Do not forget to play with the sleep times a little to suit your needs; sleep times in schemes like this are sometimes dependent on the OS task scheduler.
This is not really a lock-free implementation, because while the GC is at work everything is locked. But other than that, concurrent accesses do not block each other, so if you do not run the GC too often you are fine.

Benign data race condition with two threads

Is this a race condition?
class A {
    int x;
    void update() {
        x = 5;
    }
    void retrieve() {
        int y = x;
    }
}
If update() and retrieve() are called by two different threads without any locks being held, then, given that at least one of the two accesses to the shared variable is a write, this can be classified as a race condition. But is this truly a problem at runtime?
Without locks, three things can happen:
y gets the new value of x (5).
y gets the old value of x (most likely 0).
if writes into int are not atomic, then y can get any other value.
In Java, reads of an int are atomic, so the third option cannot happen. There is no such guarantee of atomicity in other languages.
With locking, the first two options can happen as well.
There is an extra challenge depending on the memory model, though. In Java, if a write is not synchronized, it can be arbitrarily delayed up until the next synchronisation point (the end of a synchronized block or an access to a volatile field). Similarly, reads can be arbitrarily served from a stale cache of the value since the previous synchronisation point (the start of a synchronized block or an access to a volatile field). This can easily result in problems arising from stale caches. The end effect is that the second option can happen even when the first one was supposed to.
In Java, always use volatile with fields that can be accessed from other threads, or you'll be facing hard-to-debug race conditions arising from memory access reordering. The same warning applies in other languages that use a memory model similar to the one in Java - you may need to tell the compiler to not do these optimisations.
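The C++ analogue of that advice is std::atomic (a sketch of my own; the question itself names no language):
#include <atomic>

class A {
    std::atomic<int> x{0};  // shared between threads
public:
    void update() { x.store(5); }        // store becomes visible to other threads
    int retrieve() { return x.load(); }  // load never sees a torn or stale-forever value
};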

When should the Win32 InterlockedExchange function be used?

I came across the function InterlockedExchange and was wondering when I should use it. In my opinion, setting a 32-bit value on an x86 processor should always be atomic, shouldn't it?
In the case where I want to use the function, the new value does not depend on the old value (it is not an increment operation).
Could you provide an example where this method is mandatory? (I'm not looking for InterlockedCompareExchange.)
InterlockedExchange is both a write and a read -- it returns the previous value.
This is necessary to ensure another thread didn't write a different value just after you did. For example, say you're trying to increment a variable. You can read the value, add 1, then set the new value with InterlockedExchange. The value returned by InterlockedExchange must match the value you originally read, otherwise another thread probably incremented it at the same time, and you need to loop around and try again.
As well as writing the new value, InterlockedExchange also reads and returns the previous value; this whole operation is atomic. This is useful for lock-free algorithms.
(Incidentally, 32-bit writes are not guaranteed to be atomic. Consider the case where the write is unaligned and straddles a cache-line boundary, for instance.)
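One classic concrete use, as a sketch of my own rather than anything from the question: a simple spinlock. InterlockedExchange atomically writes 1 and hands back the previous value, so exactly one thread can observe the 0-to-1 transition:
#include <windows.h>

LONG g_lock = 0;  // 0 = free, 1 = held

void spin_lock(void)
{
    // Atomically set the flag to 1 and read the old value. Old value 0
    // means we took the lock; old value 1 means someone else holds it.
    while (InterlockedExchange(&g_lock, 1) != 0)
    {
        YieldProcessor();  // hint to the CPU that we are spinning
    }
}

void spin_unlock(void)
{
    InterlockedExchange(&g_lock, 0);  // release the lock
}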
In a multi-processor or multi-core machine each core has its own cache, so each core may have its own, potentially different "view" of what the content of system memory is.
Thread synchronization mechanisms take care of synchronizing between cores; for more information, look at http://blogs.msdn.com/oldnewthing/archive/2008/10/03/8969397.aspx or search for acquire and release semantics.
Setting a 32-bit value to a constant compiles to a single store, but b = a is 2 operations:
mov eax,dword ptr [a]
mov dword ptr [b],eax
Theoretically there could be some interruption between the first and second operation.
Writing a value is not, in general, atomic by default. When you write a computed value to a variable, several machine instructions may be generated, and with modern, preemptive OSes, the OS might switch to another thread between the individual operations of the write.
This is even more of a problem on multi-processor machines, where several threads could be executing at the same time and trying to write to a single memory location simultaneously.
Interlocked operations avoid this by using specialized instructions to make the write (x86 has dedicated instructions for this kind of situation), which do the read-modify-write in one instruction. These instructions also lock the affected memory across all processors (historically the whole memory bus, on modern CPUs typically just the cache line), to ensure that no other executing thread could be writing to the value at the same time.
InterlockedExchange makes sure that the change of a variable and the return of its original value are not interrupted by other threads.
So, if 'i' is an int, these calls (taken individually) do not need InterlockedExchange around 'i':
a = i;
i = 9;
i = a;
i = a + 9;
a = i + 9;
if(0 == i)
None of these statements rely upon BOTH the initial AND final values of 'i'. But these following calls DO need InterlockedExchange around 'i':
a = i++; //a = InterlockedExchange(&i, i + 1);
Without it, two threads running through this same code might get the same value of 'i' assigned to 'a' or 'a' may unexpectedly skip two or more numbers.
if(0 == i++) //if(0 == InterlockedExchange(&i, i + 1))
Two threads may both execute the code that is only supposed to happen once.
etc.
wow, so many conflicting answers. Hard to sift through who's right, who's wrong, and what information is misleading.
I'm unsure of the answer too, given the above half-answers, but I think it works like this, I may be wrong, and it will be interesting to find out if I am:
32-bit read & writes ARE atomic, but depending on your code, that may not mean much.
don't worry about non-aligned read/writes. ALL 32-bit writes to a 32-bit variable have to be aligned or the machine page-faults.
don't worry about a write wrapping around the end of a cached page, that can't happen.
If you need to write-then-read on one thread, and you're writing on another thread, then you need to use InterlockedExchange. If you're simply reading the value on one thread, and writing it on another, then you don't need to use it, but those values may be wiggly because of multithreading.
