I have the following code fragment that runs on multiple cores using OpenMP. I found that, especially at higher core counts, the application hangs in a strange way.
......
while (true) {
    compute1(...);
    #pragma omp barrier
    if (_terminated) {
        ...... // <- at least one thread reaches here (L1)
        break;
    }
    ......
    compute2(...);
    ......
    if (_terminated) {
        cout << thread_id << endl; // <- when it hangs, this always prints the last thread id
    }
    #pragma omp barrier // <- one thread is stuck here (L2)
    ......
}
......
I observe that at least one thread is able to reach L1 (assume this is the case, and assume the application quits successfully if all threads reach L1). But sometimes not all threads reach L1. In fact, breaking into the debugger when the application hangs shows that at least one thread is stuck at the barrier at L2. I put a print statement right above L2 and it always prints the last thread number (15 when using 16 threads, 7 when using 8 threads, etc.).
This is very strange because the fact that at least one thread can reach L1 indicates that it has moved past the first barrier above L1, which implies that all threads should have reached the same barrier. Therefore, all threads should reach L1 (_terminated is a global shared variable), but in reality, this is not the case. I encounter this issue frequently at higher core counts. It almost never happens when the number of cores is lower than the inherent parallelism in compute1 and compute2.
I am so confused by this issue that I am quite certain it is either that 1) I fundamentally misunderstand some aspect of OpenMP semantics, or 2) this is a bug in OpenMP. Any suggestions are much appreciated!
You have a race condition (or several) in your code. When you write a shared variable (e.g. _terminated) in one thread and read it in a different one, a data race may occur. Note that a data race is undefined behaviour in C/C++. One possible (and efficient) way to avoid the data race is to use atomic operations:
To write it, use
#pragma omp atomic write seq_cst
_terminated = ...;
To read it, use
bool local_terminated;
#pragma omp atomic read seq_cst
local_terminated = _terminated;
if (local_terminated) {
    ...
If this does not solve your problem, please provide a minimal reproducible example.
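For illustration only, here is a minimal, self-contained sketch of that pattern applied to a barrier-synchronized loop like yours. This is not your code: the flag name, the thread-0 writer, and the step == 3 trigger are made up purely to drive the example.
// Sketch: a shared flag written atomically by one thread, read atomically by all,
// with barriers placed so every thread takes the same branch in the same iteration.
#include <cstdio>
#include <omp.h>

int main() {
    bool terminated = false;                    // shared flag, analogous to _terminated

    #pragma omp parallel
    {
        for (int step = 0; ; ++step) {
            // ... compute1(...) would go here ...

            if (omp_get_thread_num() == 0 && step == 3) {
                #pragma omp atomic write seq_cst
                terminated = true;              // atomic write of the shared flag
            }

            #pragma omp barrier                 // every thread reaches this each iteration

            bool local_terminated;
            #pragma omp atomic read seq_cst
            local_terminated = terminated;      // atomic read into a thread-local copy

            if (local_terminated)
                break;                          // all threads read the same value, so all break together

            // ... compute2(...) would go here ...

            #pragma omp barrier
        }
        std::printf("thread %d done\n", omp_get_thread_num());
    }
    return 0;
}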
int l_thread = 4;
int max = 1; // 4
int i;
#pragma omp parallel for num_threads(l_thread) private(i)
for (i = 0; i < max; i++)
    ; // some operation
In this case, 4 threads will be created by OpenMP. I want to know: since the for loop runs only 1 iteration (in this case), will it be executed by only one of the 4 threads, with the other threads in an idle state? In practice I am seeing that the CPU usage of all 4 threads is nearly the same. What might be the reason? Shouldn't only one thread show high usage while the others stay low?
Your take on this is correct. If max=1 and you have more than one thread, thread 0 will execute the single loop iteration and the other threads will wait at the end of the parallel region for thread 0 to catch up. The reason you're seeing the n-1 other threads causing load on the system is because they spin-wait at the end of regions, because that is much faster when the threads have to wake up and notice that the parallel work (or in your case: not so parallel work :-)) is completed.
You can change this behavior via the OMP_WAIT_POLICY environment variable. See the OpenMP specification for a full description.
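As a toy illustration (not your code; the iteration counts are arbitrary), you can run something like the following and compare the CPU usage of the idle threads under OMP_WAIT_POLICY=active versus OMP_WAIT_POLICY=passive:
// One long iteration of real work, three idle threads. Run it once with
// OMP_WAIT_POLICY=passive and once with OMP_WAIT_POLICY=active and watch
// the CPU usage of the non-working threads.
#include <cstdio>
#include <omp.h>

int main() {
    const int max = 1;                            // only one loop iteration
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < max; ++i) {
        double s = 0.0;
        for (long k = 0; k < 500000000L; ++k)     // keep the one working thread busy for a while
            s += k * 1e-9;
        std::printf("iteration %d ran on thread %d (s=%f)\n",
                    i, omp_get_thread_num(), s);
    }
    return 0;
}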
I have an assignment and the following was mentioned:
Parallelism adds further complications, as the random number
generator function must behave correctly when two or more threads call
it simultaneously, i.e., it must be thread-safe.
However, I can't understand why that is a problem in the first place. Random generators are usually called with a seed as a parameter and output a number by doing multiple operations on it. I understand that we need each thread to use a different seed, but other than that, where does the problem come from? I have also realized that calling the random generator from a parallel region instead of a serial one really worsens performance, but I can't understand why this would happen, since by the looks of it the random number generator should run concurrently without any problems given that there are no dependencies.
Any help in understanding the theory behind this would be appreciated.
Aside from getting wrong values from the race conditions (pointed out by #MitchWheat), the code will be less efficient because of cache-line sharing between cores on mainstream x86 processors.
Here is an example of (pretty bad but simple) 32-bit random generator (written in C):
#include <stdint.h>
#include <stddef.h>

uint32_t seed = 0xD1263AA2;

uint32_t customRandom() {
    uint32_t old = seed;
    seed = (uint32_t)(((uint64_t)seed * 0x2E094C40) >> 24);
    return old;
}

void generateValues(uint32_t* arr, size_t size) {
    for (size_t i = 0; i < size; ++i)
        arr[i] = customRandom();
}
If you run this example sequentially (see the result here), the state seed will likely be kept in memory by mainstream compilers (i.e. GCC and Clang). This 32-bit block of memory will be read/written in the L1 cache, which is very close to the core executing the code.
When you parallelize the loop naively, using for example #pragma omp parallel for in OpenMP, the state is read/written concurrently by multiple threads. There is a race condition: the state value seed can be read by multiple threads in parallel and written in parallel. Consequently, the same value can be generated by multiple threads, while the results are supposed to be random. Race conditions are bad and must be fixed. You can fix this one using a thread-local state here.
Assuming you do not fix the code because you want to understand the impact of the race condition on the resulting performance, you should see a performance drop in parallel. The issue comes from the cache coherence protocol used by mainstream x86 processors. Indeed, seed is shared between all the threads executing on the different cores, so the processor works to keep the cores' caches coherent. This process is very expensive (much slower than reading/writing in the L1 cache). More specifically, when a thread on a given core writes to seed, the processor invalidates the copy of seed held in the caches of the other cores. Each of those threads must then fetch the updated seed (typically from the much slower L3 cache) when seed is read. You can find more information here.
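Here is a minimal sketch of the thread-local fix mentioned above, assuming OpenMP. The per-thread seeding scheme is made up purely for illustration and is not a good way to derive independent streams.
// Thread-local fix: each thread keeps its own seed, so there is no shared
// state to race on and no cache-line ping-pong between cores.
#include <stdint.h>
#include <stddef.h>
#include <omp.h>

static uint32_t customRandomLocal(uint32_t* state) {
    uint32_t old = *state;
    *state = (uint32_t)(((uint64_t)(*state) * 0x2E094C40) >> 24);
    return old;
}

void generateValuesParallel(uint32_t* arr, size_t size) {
    #pragma omp parallel
    {
        // Per-thread seed derived from the thread id so the streams differ
        // (a real program should use a proper stream-splitting scheme).
        uint32_t seed = 0xD1263AA2u ^ (uint32_t)(omp_get_thread_num() * 0x9E3779B9u);

        #pragma omp for
        for (size_t i = 0; i < size; ++i)
            arr[i] = customRandomLocal(&seed);
    }
}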
I have some doubts about the C++11/C11 memory model that I was wondering if anyone can clarify. These are questions about the model/abstract machine, not about any real architecture.
Are acquire/release effects guaranteed to "cascade" from one thread to the next?
Here is a pseudo code example of what I mean (assume all variables start as 0)
[Thread 1]
store_relaxed(x, 1);
store_release(a, 1);
[Thread 2]
while (load_acquire(a) == 0);
store_release(b, 1);
[Thread 3]
while (load_acquire(b) == 0);
assert(load_relaxed(x) == 1);
Thread 3's acquire syncs with Thread 2's release, which comes after Thread 2's acquire which syncs with Thread 1's release. Therefore, Thread 3 is guaranteed to see the value that Thread 1 set to x, correct? Or do we need to use seq cst here in order to be guaranteed that the assert will not fire? I have a feeling acquire/release is enough, but I can't quite find any simple explanation that guarantees it. Most explanations of acquire/release mainly focus on the acquiring thread receiving all the stores made by the releasing thread. However in the example above, Thread 2 never touches variable x, and Thread 1/Thread 3 do not touch the same atomic variable. It's obvious that if Thread 2 were to load x, it would see 1, but is that state guaranteed to cascade over into other threads which subsequently do an acquire/release sync with Thread 2? Or does Thread 3 also need to do an acquire on variable a in order to receive Thread 1's write to x?
According to https://en.cppreference.com/w/cpp/atomic/memory_order:
All writes in the current thread are visible in other threads that acquire the same atomic variable
All writes in other threads that release the same atomic variable are visible in the current thread
Since Thread 1 and Thread 3 don't touch the same atomic variable, I'm not sure if acquire/release alone is enough for the above case. There's probably an answer hiding in the formal description, but I can't quite work it out.
*EDIT: Didn't notice until after the fact, but there is an example at the link I posted ("The following example demonstrates transitive release-acquire ordering...") that is almost the same as my example, but it uses the same atomic variable across all three threads, which seems like it might be significant. I am specifically asking about the case where the variables are not the same.
Am I right in believing that according to the standard, there must always be a pair of non-relaxed atomic operations, one in each thread, in order for any kind of memory ordering at all to be guaranteed?
Imagine there is a function "get_data" that allocates a buffer, writes some data to it, and returns a pointer to the buffer. And there is a function "use_data" that takes the pointer to the buffer and does something with the data. Thread 1 gets a buffer from get_data and passes it to Thread 2 using a relaxed atomic store to a global atomic pointer. Thread 2 does relaxed atomic loads in a loop until it gets the pointer, and then passes it off to use_data:
int* get_data() {...}
void use_data(int* buf) {...}
int* global_ptr = nullptr;
[Thread 1]
int* buf = get_data();
super_duper_memory_fence();
store_relaxed(global_ptr, buf);
[Thread 2]
int* buf = nullptr;
while ((buf = load_relaxed(global_ptr)) == nullptr);
use_data(buf);
Is there any kind of operation at all that can be put in "super_duper_memory_fence", that will guarantee that by the time use_data gets the pointer, the data in the buffer is also visible? It is my understanding that there is not a portable way to do this, and that Thread 2 must have a matching fence or other atomic operation in order to guarantee that it receives the writes made into the buffer and not just the pointer value. Is this correct?
Thread 3's acquire syncs with Thread 2's release, which comes after Thread 2's acquire which syncs with Thread 1's release. Therefore, Thread 3 is guaranteed to see the value that Thread 1 set to x, correct?
Yes, this is correct. The acquire/release operations establish synchronize-with relations - i.e., store_release(a) synchronizes-with load_acquire(a) and store_release(b) synchronizes-with load_acquire(b). And load_acquire(a) is sequenced-before store_release(b). synchronize-with and sequenced-before are both part of the happens-before definition, and the happens-before relation is transitive. Therefore, store_relaxed(x, 1) happens-before load_relaxed(x).
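For reference, here is the question's pseudo code spelled out with std::atomic (nothing new, just the same three threads made concrete; the assert cannot fire):
// Transitive release/acquire chain across three threads.
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, a{0}, b{0};

int main() {
    std::thread t1([] {
        x.store(1, std::memory_order_relaxed);
        a.store(1, std::memory_order_release);               // release a
    });
    std::thread t2([] {
        while (a.load(std::memory_order_acquire) == 0) {}    // acquire a
        b.store(1, std::memory_order_release);               // release b
    });
    std::thread t3([] {
        while (b.load(std::memory_order_acquire) == 0) {}    // acquire b
        assert(x.load(std::memory_order_relaxed) == 1);      // guaranteed by transitivity
    });
    t1.join();
    t2.join();
    t3.join();
    return 0;
}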
Am I right in believing that according to the standard, there must always be a pair of non-relaxed atomic operations, one in each thread, in order for any kind of memory ordering at all to be guaranteed?
This question is a bit too broad, but overall I would tend to say "yes". In general you have to ensure that there is a proper happens-before relation when operating on some (non-atomic) shared data. If one thread writes some shared data and some other thread should read that data, you have to ensure that the write happens-before the read. There are different ways to achieve this - atomics with the correct memory orderings are just one way (although one could argue that almost all other methods (like std::mutex) also boil down to atomic operations).
Fences also have to be combined with other fences or atomic operations. Your example would work if super_duper_memory_fence() were a std::atomic_thread_fence(std::memory_order_release) and you put another std::atomic_thread_fence(std::memory_order_acquire) before your call to use_data.
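Spelled out, that fence pairing looks roughly like this (a sketch with stand-in definitions for the question's get_data/use_data):
// Release fence before the relaxed store, acquire fence after the relaxed load.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int*> global_ptr{nullptr};

int* get_data() {                       // stand-in: fills the buffer with non-atomic writes
    static int data[4];
    for (int i = 0; i < 4; ++i)
        data[i] = i + 1;
    return data;
}

void use_data(int* buf) {               // stand-in: reads the buffer
    std::printf("first value: %d\n", buf[0]);
}

int main() {
    std::thread t1([] {
        int* buf = get_data();
        std::atomic_thread_fence(std::memory_order_release);    // plays the role of super_duper_memory_fence()
        global_ptr.store(buf, std::memory_order_relaxed);
    });
    std::thread t2([] {
        int* buf = nullptr;
        while ((buf = global_ptr.load(std::memory_order_relaxed)) == nullptr) {}
        std::atomic_thread_fence(std::memory_order_acquire);     // the matching fence on the reader side
        use_data(buf);
    });
    t1.join();
    t2.join();
    return 0;
}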
For more details I can recommend this paper which I have co-authored: Memory Models for C/C++ Programmers
I have two threads sharing a uint64_t variable. The first thread only reads from the variable while the other thread only writes to it. If I don't synchronize them using a mutex/spinlock/atomic operations etc., is there any possibility of reading a value other than one the writing thread wrote? It is not a problem if the reader sees an old value that was written by the writing thread.
As an example, the writing thread increments the variable between 0 and 100, and the reading thread prints the value. So, is there any possibility of seeing a value on the screen outside the [0-100] range? Currently I don't see any such value, but I'm not sure whether this can cause a race condition.
Thanks in advance.
On a 64-bit processor, a naturally aligned 64-bit variable is transferred 64 bits at a time, so you will see logically consistent values, i.e. you won't see 32 bits from before the write and 32 bits from after the write. This is obviously not true of 32-bit processors.
The kinds of issues you will see are things like: if the two threads are running on different cores, the reading thread will not see changes made by the writing thread until the writing thread's core flushes its cache. Also, optimisation may make either thread not bother to read memory at all in the loop. For example, if you have:
#include <stdint.h>

uint64_t x = 0;

void increment()
{
    for (int i = 0; i < 100; ++i)
    {
        x++;
    }
}
It is possible that the compiler will generate code that reads x into a register at the start of the loop and not write it back to memory until the loop exits. You need things like volatile and memory barriers.
All bad things can happen if you have a race condition on such a variable.
The correct tool for this in modern C is atomics. Just declare your variable
uint64_t _Atomic counter;
Then all your operations (load, store, increment, ...) will be atomic, that is, indivisible, uninterruptible and linearizable. No mutex or other protection mechanism is necessary.
This was introduced in C11, and recent C compilers, e.g. gcc and clang, support it out of the box.
I'm an OpenMP beginner and from what I've read #pragma omp parallel:
It creates a team of N threads ..., all of which execute the next
statement ... After the statement, the threads join back into one.
I cannot imagine an example where this could be useful without the for keyword after the directive written above. What I mean is that the for keyword splits the iterations between the threads of the team, whereas with the directive above the following block/statement will be executed by all the threads, so there is no performance improvement. Can you please help me clarify this?
You can provide your own mechanism that splits the job into parallel pieces, but relies on OpenMP for parallelism.
Here’s a hypothetical example that uses OpenMP to dequeue some operations and run them in parallel:
#pragma omp parallel
{
    operation op;
    while( queue.tryDequeue( &op ) )
        op.run();
}
The implementation of queue.tryDequeue must be thread-safe, i.e. guarded by a critical section/mutex, or a lock-free implementation.
To be efficient, the implementation of op.run() must be CPU-heavy, taking much longer than queue.tryDequeue(). Otherwise, you’ll spend most of the time contending on that queue and not doing the parallelizable work.
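For a self-contained sketch of this pattern, the "queue" below is just an atomic counter over a fixed array of tasks, so the dequeue is trivially lock-free; the task array and the printf stand in for real work.
// Parallel consumers: an atomic index acts as the lock-free "queue" head.
#include <atomic>
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    std::vector<int> tasks(64);
    for (int i = 0; i < (int)tasks.size(); ++i)
        tasks[i] = i;

    std::atomic<int> next{0};                    // next unclaimed task index

    #pragma omp parallel
    {
        int idx;
        // "tryDequeue": claim the next unprocessed task index, if any remain.
        while ((idx = next.fetch_add(1, std::memory_order_relaxed)) < (int)tasks.size()) {
            // "op.run()": the CPU-heavy work would go here.
            std::printf("task %d handled by thread %d\n", tasks[idx], omp_get_thread_num());
        }
    }
    return 0;
}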
The for keyword by itself does not divide the work!
Remember that "dividing the work" means each thread executes a section of your loop. If you insist on using #pragma omp parallel, then it looks like this:
#pragma omp parallel
{
    #pragma omp for
    for (int i = 1; i <= 100; i++)
    {
    }
}
What the above code does is divide the for loop among the n threads; anything declared inside the #pragma omp for loop is a private variable of the thread that executes it. This ensures thread safety, but it also means you are responsible for gathering the results yourself, e.g. using reduction operations.
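As a small example of that last point, here is a sketch of gathering a per-thread result with a reduction clause (the sum is just a placeholder for real work):
// Summing 1..100 across threads: each thread accumulates a private copy of
// "sum", and OpenMP combines the copies at the end of the loop.
#include <cstdio>

int main() {
    long sum = 0;
    #pragma omp parallel
    {
        #pragma omp for reduction(+ : sum)
        for (int i = 1; i <= 100; i++)
            sum += i;
    }
    std::printf("sum = %ld\n", sum);   // prints 5050
    return 0;
}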