Mutually Exclusive code - multithreading

How can I tell if code is mutually exclusive in the critical section? I understand the concept, but when it comes to code tracing I'm having a hard time identifying mutually exclusive code. Here's the code segment in question...
// flag[1] and flag[2] start as true
Thread 1
for(;;) {
flag[1] = false;
while(flag[2] == false)
flag[1] = true;
flag[1] = false;
// critical section
flag[1] = true;
// exit critical
}
Thread 2
for(;;) {
flag[2] = false;
while(flag[1] == false);
// critical section
flag[2] = true;
// exit critical
}
This sucks so any general insight into mutual exclusion is appreciated.

I'm having a hard time identifying mutually exclusive code
As a rule of thumb, when you are browsing code, you can spot mutually exclusive code (a.k.a critical sections) by thinking of these questions:
Do I have publicly accessible data here? (Data is everything: instances of objects, std containers, etc.)
Who can access it?
Can they access it while I'm accessing it?
If the answer to all these questions is yes, you probably have to take care and protect this data using some kind of synchronization mechanism.
For primitive types, the simplest way to go is probably atomic variables.
For more complicated types, e.g. container of some sort, you should probably use mutexes.
For more complicated scenarios, such as drivers, interrupts and more, you should read more about spinlocks and read-write-locks and more advanced mechanisms.
Note: This is an oversimplification of the issue, but I believe it paints a pretty good picture for someone who is starting to tackle this complicated issue.
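To make the atomic-versus-mutex rule of thumb concrete, here is a minimal C++ sketch. All names (hit_count, log_entries, worker, run_workers) are made up for illustration; it assumes C++11 or later.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Primitive shared state: an atomic makes each increment indivisible.
std::atomic<int> hit_count{0};

// Compound shared state: a mutex guards every access to the container.
std::mutex log_mutex;
std::vector<int> log_entries;

void worker(int id, int iterations) {
    for (int i = 0; i < iterations; ++i) {
        hit_count.fetch_add(1);                        // atomic read-modify-write
        std::lock_guard<std::mutex> guard(log_mutex);  // critical section starts
        log_entries.push_back(id);                     // one thread at a time
    }                                                  // guard unlocks each pass
}

// Runs two workers concurrently and returns the final counter value.
int run_workers(int iterations) {
    hit_count = 0;
    log_entries.clear();
    std::thread a(worker, 1, iterations);
    std::thread b(worker, 2, iterations);
    a.join();
    b.join();
    assert(log_entries.size() == static_cast<std::size_t>(2 * iterations));
    return hit_count.load();
}
```

With a plain int and an unguarded vector, the same program would have a data race on both; here the counter and the container each get the lightest mechanism that fits.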

Related

Peterson's solution with just one variable

For Pi:
do {
turn = i; // prepare enter section
while(turn==j);
//critical section
turn = j; //exit section.
} while(true);
For Pj:
do {
turn = j; // prepare enter section
while(turn==i);
//critical section
turn = i; //exit section.
} while(true);
In this simplified algorithm, if process i wants to enter its critical section, it sets "turn = i" (unlike Peterson's solution, which sets "turn = j"). This algorithm does not seem to cause deadlock or starvation, so why isn't Peterson's algorithm simplified like this?
Another question: as far as I know, mutual exclusion mechanisms such as semaphore P/V operations require atomicity (P must test sem.value and perform sem.value-- as one indivisible step). So why does the algorithm above, which uses just the one variable turn, not seem to require atomicity (the store turn = i and the test turn == j are not done atomically)?

Before you ask whether the algorithm avoids deadlock and starvation, you first have to verify that it still locks. With your version, even assuming sequential consistency, the operations could be sequenced like this:
Pi                                        Pj
turn = i;
while (turn == j); // exits immediately
                                          turn = j;
                                          while (turn == i); // exits immediately
// critical section                       // critical section
and you have a lock violation.
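That schedule can be replayed step by step in a single thread. This is only a sketch: the constants I and J and the function both_enter are hypothetical, and a local int stands in for the shared variable turn.

```cpp
// Hypothetical ids standing in for i and j in the question.
constexpr int I = 0;
constexpr int J = 1;

// Replays the schedule: Pi runs its entry code, then Pj runs its entry code.
// Returns true if both end up inside the critical section together.
bool both_enter() {
    int turn = J;                    // arbitrary starting value

    turn = I;                        // Pi: turn = i;
    bool pi_inside = !(turn == J);   // Pi: while (turn == j); falls through

    turn = J;                        // Pj: turn = j;
    bool pj_inside = !(turn == I);   // Pj: while (turn == i); falls through

    return pi_inside && pj_inside;   // Pi has not left yet, so both are inside
}
```

Each process only ever waits on the value it has just overwritten, so its own store is what lets it through.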
To your second question: it depends on what you mean by "atomicity". You do need it to be the case that when one thread stores turn = i; then the other thread loading turn will only read i or j and not anything else. On some machines, depending on the type of turn and the values of i and j, you could get tearing and load an entirely different value. So whatever language you are using may require you to declare turn as "atomic" in some fashion to avoid this. In C++ in particular, if turn isn't declared std::atomic, then any concurrent read/write access is a data race, and the behavior of the entire program becomes undefined (that's bad).
Besides the need to avoid tearing and data races, Peterson's algorithm also requires strict memory ordering (sequential consistency), which on many systems / languages is not guaranteed unless specially requested, again perhaps by declaring the variable as atomic in some fashion.
It is true that unlike more typical lock algorithms, Peterson doesn't require an atomic read-modify-write, only atomic sequentially consistent loads and stores. That's precisely what makes it an interesting and clever algorithm. But there's a substantial tradeoff in complexity and performance, especially if you want more than two threads, and most real-life systems do have reasonably efficient atomic RMW instructions, so Peterson is rarely used in practice.

Is there a data race?

class Test {
struct hazard_pointer {
std::atomic<void*> hp;
std::atomic<std::thread::id> id;
};
hazard_pointer hazard_pointers[max_hazard_pointers];
std::atomic<void*>& get_hazard_pointer_for_current_thread(){
std::thread::id id = std::this_thread::get_id();
for( int i =0; i < max_hazard_pointers; i++){
if( hazard_pointers[i].id.load() == id){
hazard_pointers[i].id.store(id);
return hazard_pointers[i].hp;
}
}
std::atomic<nullptr> ptr;
return ptr;
}
};
int main() {
Test* t =new Test();
std::thread t1([&t](){ while(1) t->get_hazard_pointer_for_current_thread();});
std::thread t2([&t](){ while(1) t->get_hazard_pointer_for_current_thread();});
t1.join();
t2.join();
return 0;
}
The function get_hazard_pointer_for_current_thread can be executed in parallel. Is there a data race? As far as I can see there is no data race, because the operations are atomic, but I am not sure.
So please either confirm this or explain why there is a data race (or several).
Let's assume that hazard_pointers array elements are initialized.

There are a few errors in the code:
get_hazard_pointer_for_current_thread may not return any value - undefined behaviour.
hazard_pointers array elements are not initialized.
if(hazard_pointers[i].id.load() == id) hazard_pointers[i].id.store(id); does not make any sense.
And yes, there is a data race. Between statement if(hazard_pointers[i].id.load() == id) and hazard_pointers[i].id.store(id); another thread may change hazard_pointers[i].id. You probably need to use a compare-and-swap instruction.

I don't think you have any C++ UB from concurrent access to non-atomic data, but it looks like you do have the normal kind of race condition in your code.
if (x==a) x = b almost always needs to be an atomic read-modify-write (instead of separate atomic loads and atomic stores) in lock-free algorithms, unless there's some reason why it's ok to still store b if x changed to something other than a between the check and the store.
(In this case, the only thing that can ever be stored is the value that was already there, as @MargaretBloom points out. So there's no "bug", just a bunch of useless stores if this is the only code that touches the array. I'm assuming that you didn't really intend to write a useless example, so I'm considering it a bug.)
Lock-free programming is not easy, even if you do it the low-performance way with the default std::memory_order_seq_cst for all the stores so the compiler has to MFENCE everywhere. Making everything atomic only avoids C++ UB; you still have to carefully design the logic of your algorithm to make sure it's correct even if multiple stores/loads from other thread(s) become visible between every one of your own operations, and stuff like that. (e.g. see Preshing's lock-free hash table.)
Being UB-free is necessary (at least in theory) but definitely not sufficient for code to be correct / safe. Being race-free means no (problematic) races even between atomic accesses. This is a stronger but still not sufficient part of being bug-free.
I say "in theory" because in practice a lot of code with UB happens to compile the way we expect, and will only bite you on other platforms, or with future compilers, or with different surrounding code that exposes the UB during optimization.
Testing can't easily detect all bugs, esp. if you only test on strongly-ordered x86 hardware, but a simple bug like this should be easily detectable with testing.
The problem with your code, in more detail:
You do a non-atomic compare-exchange, with an atomic load and a separate atomic store:
if( hazard_pointers[i].id.load() == id){
// a store from another thread can become visible here
hazard_pointers[i].id.store(id);
return hazard_pointers[i].hp;
}
The .store() should be a compare_exchange_strong (a member function of std::atomic), so the value isn't modified if a store from another thread changed the value between your load and your store. (Putting it inside an if on a relaxed or acquire load is still a good idea; I think a branch to avoid a lock cmpxchg is a good idea if you expect the value to not match most of the time. That should let the cache lines stay Shared when no thread finds a match on those elements.)
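A sketch of what the slot lookup might look like with a real compare-exchange. This is a guess at the intent (claim a free slot for the calling thread), not the asker's actual design: the function name claim_hazard_pointer, the array size, and the convention that a default-constructed std::thread::id marks an unowned slot are all assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

constexpr int max_hazard_pointers = 8;  // assumed size

struct hazard_pointer {
    std::atomic<void*> hp{nullptr};
    std::atomic<std::thread::id> id{std::thread::id()};  // default id = unowned
};

hazard_pointer hazard_pointers[max_hazard_pointers];

// Returns the calling thread's slot, claiming a free one if it has none yet.
// Returns nullptr when every slot is owned by some other thread.
std::atomic<void*>* claim_hazard_pointer() {
    const std::thread::id me = std::this_thread::get_id();
    assert(me != std::thread::id());  // a running thread never has the default id

    for (auto& slot : hazard_pointers) {
        if (slot.id.load() == me) return &slot.hp;  // already ours
    }
    for (auto& slot : hazard_pointers) {
        std::thread::id expected;  // only succeed if the slot is still unowned
        if (slot.id.compare_exchange_strong(expected, me)) return &slot.hp;
    }
    return nullptr;
}
```

The check and the store now happen as one atomic step, so two threads can never both believe they claimed the same slot.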

Why is threading dangerous?

I've always been told to put locks around variables that multiple threads will access. I've always assumed that this was because you want to make sure that the value you are working with doesn't change before you write it back,
i.e.
mutex.lock()
int a = sharedVar
a = someComplexOperation(a)
sharedVar = a
mutex.unlock()
And that makes sense that you would lock that. But in other cases I don't understand why I can't get away with not using Mutexes.
Thread A:
sharedVar = someFunction()
Thread B:
localVar = sharedVar
What could possibly go wrong in this instance? Especially if I don't care that Thread B reads any particular value that Thread A assigns.

It depends a lot on the type of sharedVar, the language you're using, any framework, and the platform. In many cases, it's possible that assigning a single value to sharedVar may take more than one instruction, in which case you may read a "half-set" copy of the value.
Even when that's not the case, and the assignment is atomic, you may not see the latest value without a memory barrier in place.
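If the language is C++, one way to rule out both problems for a small value is std::atomic. The sketch below is illustrative only (shared_var, make_value and count_torn_reads are made-up names): the writer stores 64-bit values whose two halves always match, and the reader counts any load where they do not, which would indicate a half-set value.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

std::atomic<std::uint64_t> shared_var{0};

// Builds a value whose upper and lower 32 bits are identical.
std::uint64_t make_value(std::uint32_t n) {
    return (static_cast<std::uint64_t>(n) << 32) | n;
}

// Thread A keeps storing, thread B keeps loading; returns the torn-read count.
int count_torn_reads(int iterations) {
    shared_var.store(0);
    int torn = 0;
    std::thread writer([&] {
        for (int i = 1; i <= iterations; ++i) shared_var.store(make_value(i));
    });
    std::thread reader([&] {
        for (int i = 0; i < iterations; ++i) {
            const std::uint64_t v = shared_var.load();
            if ((v >> 32) != (v & 0xFFFFFFFFu)) ++torn;  // halves disagree
        }
    });
    writer.join();
    reader.join();
    assert(shared_var.load() == make_value(iterations));  // last store is visible
    return torn;
}
```

With std::atomic the count is always zero. With a plain 64-bit variable the program would have a data race, and on some 32-bit targets the reader could see mismatched halves.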

MSDN Magazine has a good explanation of different problems you may encounter in multithreaded code:
Forgotten Synchronization
Incorrect Granularity
Read and Write Tearing
Lock-Free Reordering
Lock Convoys
Two-Step Dance
Priority Inversion
The code in your question is particularly vulnerable to Read/Write Tearing. Because it has neither locks nor memory barriers, it is also subject to Lock-Free Reordering, in which side effects become visible to a second thread in a different order from how they appear in your source code. This may include speculative writes, where thread B reads a value that thread A never stored.
It goes on to describe some known design patterns which avoid these problems:
Immutability
Purity
Isolation
The article is available here

The main problem is that the assignment operator (operator= in C++) is not always guaranteed to be atomic (not even for primitive, built-in types). In plain English, that means an assignment may be carried out in several steps rather than as one indivisible operation. If the thread gets interrupted in the middle of that, the current value of the variable might be corrupted.
Let me build off of your example:
Let's say sharedVar is some object with operator= defined like this:
object& operator=(const object& other) {
ready = false;
doStuff(other);
if (other.value == true) {
value = true;
doOtherStuff();
} else {
value = false;
}
ready = true;
return *this;
}
If thread A from your example is interrupted in the middle of this function, ready will still be false when thread B starts to run. This could mean that the object is only partially copied over, or is in some intermediate, invalid state when thread B attempts to copy it into a local variable.
For a particularly nasty example of this, think of a data structure with a removed node being deleted, then interrupted before it could be set to NULL.
(For some more information regarding structures that don't need a lock (aka, are atomic), here is another question that talks a bit more about that.)

This could go wrong, because threads can be suspended and resumed by the thread scheduler, so you can't be sure about the order these instructions are executed. It might just as well be in this order:
Thread B:
localVar = sharedVar
Thread A:
sharedVar = someFunction()
In which case localVar will be null or 0 (or some completely unexpected value in an unsafe language), which is probably not what you intended.
Mutexes actually won't fix this particular issue, by the way. The example you supply does not lend itself well to parallelization.

Is CCriticalSection usable in production?

We are a couple of newbies in MFC and we are building a multi-threaded application. We came across the article at the URL below, which warns against using CCriticalSection because its implementation is broken. We would like to know whether anyone has experience using CCriticalSection, and whether you have come across any problems or bugs. Is CCriticalSection usable and production-ready if we use VC++ 2008 to build our application?
http://www.flounder.com/avoid_mfc_syncrhonization.htm
thx

I think that article is based on a fundamental misunderstanding of what CSingleLock is for and how to use it.
You cannot lock the same CSingleLock multiple times, but you are not supposed to. CSingleLock, as its name suggests, is for locking something ONCE.
Each CSingleLock just manages one lock on some other object (e.g. a CCriticalSection which you pass it during construction), with the aim of automatically releasing that lock when the CSingleLock goes out of scope.
If you want to lock the underlying object multiple times you would use multiple CSingleLocks; you would not use a single CSingleLock and try to lock it multiple times.
Wrong (his example):
CCriticalSection crit;
CSingleLock lock(&crit);
lock.Lock();
lock.Lock();
lock.Unlock();
lock.Unlock();
Right:
CCriticalSection crit;
CSingleLock lock1(&crit);
CSingleLock lock2(&crit);
lock1.Lock();
lock2.Lock();
lock2.Unlock();
lock1.Unlock();
Even better (so you get RAII):
CCriticalSection crit;
// Scope the objects
{
CSingleLock lock1(&crit, TRUE); // TRUE means it (tries to) lock immediately.
// Do stuff which needs the lock (if IsLocked returns success)
CSingleLock lock2(&crit, TRUE);
// Do stuff which needs the lock (if IsLocked returns success)
}
// crit is unlocked now.
(Of course, you would never intentionally get two locks on the same underlying critical section in a single block like that. That'd usually only happen as a result of calling functions which get a lock while inside something else that already has its own lock.)
(Also, you should check CSingleLock.IsLocked to see if the lock was successful. I've left those checks out for brevity, and because they were left out of the original example.)
If CCriticalSection itself suffers from the same problem then that certainly is a problem, but he's presented no evidence of that that I can see. (Maybe I missed something. I can't find the source to CCriticalSection in my MFC install to verify that way, either.)
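For readers without MFC at hand, the same "one guard object per hold" pattern can be sketched in standard C++. This is an analogy, not MFC code: std::recursive_mutex stands in for CCriticalSection, std::unique_lock stands in for CSingleLock, and owns_lock() plays the role of IsLocked(). The functions inner and outer are made up for the example.

```cpp
#include <cassert>
#include <mutex>

std::recursive_mutex crit;  // stand-in for CCriticalSection
int protected_value = 0;

// A function that takes its own lock, as library code often does.
void inner() {
    std::unique_lock<std::recursive_mutex> lock2(crit);  // second guard, same thread
    assert(lock2.owns_lock());                           // analogous to IsLocked()
    ++protected_value;
}   // lock2 releases its one hold here

// A caller that already holds the lock when it calls inner().
int outer() {
    std::unique_lock<std::recursive_mutex> lock1(crit);  // first guard
    ++protected_value;
    inner();                                             // re-enters crit safely
    return protected_value;
}   // lock1 releases here; crit is fully unlocked
```

Each guard locks exactly once and unlocks exactly once, which is the usage the "Right" example above is describing.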

That article suggests that simple uses of those primitives are fine, but that their implementation violates the semantics that should be expected of them.
Basically, it suggests that if you use it as a non-recursive lock, and take care to always ensure that the lock is valid (i.e., not abandoned), then you should be fine.
The article does complain, however, that the limitations are inexcusable.

Primitive synchronization primitives -- safe?

On constrained devices, I often find myself "faking" locks between 2 threads with 2 bools. Each is only read by one thread, and only written by the other. Here's what I mean:
bool quitted = false, is_paused = false;
bool should_quit = false, should_pause = false;
void downloader_thread() {
quitted = false;
while(!should_quit) {
fill_buffer(bfr);
if(should_pause) {
is_paused = true;
while(should_pause) sleep(50);
is_paused = false;
}
}
quitted = true;
}
void ui_thread() {
// new Thread(downloader_thread).start();
// ...
should_pause = true;
while(!is_paused) sleep(50);
// resize buffer or something else non-thread-safe
should_pause = false;
}
Of course on a PC I wouldn't do this, but on constrained devices, it seems reading a bool value would be much quicker than obtaining a lock. Of course I trade off for slower recovery (see "sleep(50)") when a change to the buffer is needed.
The question -- is it completely thread-safe? Or are there hidden gotchas I need to be aware of when faking locks like this? Or should I not do this at all?

Using bool values to communicate between threads can work as you intend, but there are indeed two hidden gotchas as explained in this blog post by Vitaliy Liptchinsky:
Cache Coherency
A CPU does not always fetch memory values from RAM. Fast memory caches on the die are one of the tricks used by CPU designers to work around the Von Neumann bottleneck. On some multi-cpu or multi-core architectures (like Intel's Itanium) these CPU caches are not shared or automatically kept in sync. In other words, your threads may be seeing different values for the same memory address if they run on different CPU's.
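To avoid this you need to declare your variables as volatile (C#, Java), or do explicit volatile reads/writes, or make use of locking mechanisms. In C++, volatile alone does not provide atomicity or inter-thread ordering; use std::atomic or a lock instead.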
Compiler Optimizations
The compiler or JITter may perform optimizations which are not safe if multiple threads are involved. See the linked blog post for an example. Again, you must make use of the volatile keyword or other mechanisms to inform your compiler.
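In C++ specifically, the "other mechanism" would normally be std::atomic. A minimal sketch, with made-up names (payload, ready, wait_for_payload, publish_and_read), of a flag that is safe against both gotchas:

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // ordinary data, written before the flag
std::atomic<bool> ready{false};  // the flag itself must be atomic

// Waits for the flag, then reads the data that was written before it.
int wait_for_payload() {
    // An atomic load cannot be hoisted out of the loop by the compiler.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;  // the acquire load makes the earlier write visible
}

int publish_and_read() {
    ready.store(false);
    int seen = 0;
    std::thread reader([&] { seen = wait_for_payload(); });
    payload = 42;                                  // 1. write the data
    ready.store(true, std::memory_order_release);  // 2. then raise the flag
    reader.join();
    return seen;
}
```

The release store paired with the acquire load is what guarantees the reader sees the payload written before the flag, on any hardware the compiler targets.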

Unless you understand the memory architecture of your device in detail, as well as the code generated by your compiler, this code is not safe.
Just because it seems that it would work, doesn't mean that it will. "Constrained" devices, like the unconstrained type, are getting more and more powerful. I wouldn't bet against finding a dual-core CPU in a cell phone, for instance. That means I wouldn't bet that the above code would work.

Concerning the sleep call, you could always just do sleep(0) or the equivalent call that pauses your thread and gives the next thread in line a turn.
Concerning the rest, this is thread safe if you know the implementation details of your device.

Answering the questions.
Is this completely thread safe? I would answer no, this is not thread safe, and I would just not do this at all. Without knowing the details of your device and compiler, if this is C++, the compiler is free to reorder and optimize things away as it sees fit. E.g. you wrote:
is_paused = true;
while(should_pause) sleep(50);
is_paused = false;
but the compiler may choose to reorder this into something like this:
sleep(50);
is_paused = false;
This probably won't work even on a single-core device, as others have said.
Rather than taking a lock, you may do better to just do less on the UI thread instead of yielding in the middle of processing UI messages. If you think that you have spent too much time on the UI thread, then find a way to cleanly exit and register an asynchronous callback.
If you call sleep on a UI thread (or try to acquire a lock or do anything that may block) you open the door to hangs and glitchy UIs. A 50ms sleep is enough for a user to notice. And if you try to acquire a lock or do any other blocking operation (like I/O), you need to deal with the reality of waiting for an indeterminate amount of time, which tends to turn a glitch into a hang.

This code is unsafe under almost all circumstances. On multi-core processors you will not have cache coherency between cores because bool reads and writes are not atomic operations. This means each core is not guaranteed to have the same value in its cache, or even to read the latest value from memory if the cache holding the last write hasn't been flushed.
However, even on resource-constrained single-core devices this is not safe, because you do not have control over the scheduler. Here is an example; for simplicity I'm going to pretend these are the only two threads on the device.
When the ui_thread runs, the following lines of code could be run in the same timeslice.
// new Thread(downloader_thread).start();
// ...
should_pause = true;
The downloader_thread runs next and in its time slice the following lines are executed:
quitted = false;
while(!should_quit)
{
fill_buffer(bfr);
The scheduler preempts the downloader_thread before fill_buffer returns and then activates the ui_thread, which runs:
while(!is_paused) sleep(50);
// resize buffer or something else non-thread-safe
should_pause = false;
The resize buffer operation is done while the downloader_thread is in the process of filling the buffer. This means the buffer is corrupted and you'll likely crash soon. It won't happen every time, but the fact that you fill the buffer before you set is_paused to true makes it more likely. Even if you switched the order of those two operations on the downloader_thread you would still have a race condition, but you'd likely deadlock instead of corrupting the buffer.
Incidentally, this is a type of spinlock; it just doesn't work. Spinlocks aren't very good for wait times that are likely to span many time slices, because they spin the processor. Your implementation does sleep, which is a bit nicer, but the scheduler still has to run your thread, and thread context switches aren't cheap. If you are waiting on a critical section or semaphore, the scheduler doesn't activate your thread again until the resource has become free.
You might be able to get away with this in some form on a specific platform/architecture, but it is really easy to make a mistake that is very hard to track down.
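For comparison, a sketch of the same downloader/UI interaction with a real mutex around the buffer, assuming C++11 threads are available on the device. The body of fill_buffer is a hypothetical stand-in, and resize_while_downloading is a made-up driver that plays the role of ui_thread.

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

std::mutex buffer_mutex;               // guards every access to bfr
std::vector<char> bfr(1024);
std::atomic<bool> should_quit{false};

// Hypothetical stand-in for the real download code.
void fill_buffer(std::vector<char>& b) {
    for (char& c : b) c = 'x';
}

void downloader_thread() {
    while (!should_quit.load()) {
        {
            std::lock_guard<std::mutex> guard(buffer_mutex);
            fill_buffer(bfr);          // nobody can resize while this runs
        }
        std::this_thread::yield();     // give a waiting thread a chance at the lock
    }
}

// Plays the UI thread's part: resize the buffer while the downloader runs.
std::size_t resize_while_downloading(std::size_t new_size) {
    should_quit.store(false);
    std::thread downloader(downloader_thread);
    {
        std::lock_guard<std::mutex> guard(buffer_mutex);  // waits for fill_buffer
        bfr.resize(new_size);
    }
    should_quit.store(true);
    downloader.join();
    return bfr.size();
}
```

The resize can no longer overlap fill_buffer, and the waiting thread is blocked by the scheduler rather than polled every 50 ms. Blocking on a lock from a real UI thread still carries the responsiveness concerns raised in the other answers.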
