Shared variables in OpenMP - multithreading

I have a very basic question (maybe stupid) regarding shared variables in OpenMP. Consider the following code:
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int numthreads;
    #pragma omp parallel default(none) shared(numthreads)
    {
        numthreads = omp_get_num_threads();
        printf("%d\n", numthreads);
    }
    return 0;
}
Now the value of numthreads is the same for all threads. Is there a possibility that, since various threads are writing the same value to the same variable, the value might get garbled/mangled? Or is this operation on a primitive data type guaranteed to be atomic?

As per the standard, this is not safe:
A single access to a variable may be implemented with multiple load or store instructions, and
hence is not guaranteed to be atomic with respect to other accesses to the same variable.
[...]
If multiple threads write without synchronization to the same memory unit, including cases due to
atomicity considerations as described above, then a data race occurs. [...] If a data race occurs then the result of the program is unspecified.
I strongly recommend reading 1.4.1 Structure of the OpenMP Memory Model. While it's not the easiest read, it's very specific and quite clear. By far better than I could describe it here.
Two things need to be considered about shared variables in OpenMP: atomicity of access and the temporary view of memory.
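To make that concrete, here is a minimal sketch (my own suggestion, not taken from the answer above) of one way to remove the race: let only one thread perform the write and rely on the implicit barrier, which also implies a flush, at the end of the single construct before any thread reads the value.
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int numthreads = 0;
    #pragma omp parallel default(none) shared(numthreads)
    {
        #pragma omp single
        numthreads = omp_get_num_threads();  // only one thread writes
        // implicit barrier at the end of 'single': the write is flushed and
        // visible to every thread before the read below
        printf("%d\n", numthreads);
    }
    return 0;
}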

Related

Will mutex protection fail under register promotion?

In an article about the C++11 memory model, the author shows an example arguing that "a threads library will not work in C++03":
for (...) {
    ...
    if (mt) pthread_mutex_lock(...);
    x = ... x ...;
    if (mt) pthread_mutex_unlock(...);
}
// should not have a data race,
// but if a "clever" compiler uses a technique called
// "register promotion", the code becomes:
r = x;
for (...) {
    ...
    if (mt) {
        x = r; pthread_mutex_lock(...); r = x;
    }
    r = ... r ...;
    if (mt) {
        x = r; pthread_mutex_unlock(...); r = x;
    }
}
x = r;
There are three questions:
1. Does this promotion break mutex protection only in C++03? What about the C language?
2. Do C++03 thread libraries simply not work?
3. Could any other optimization cause the same problem?
If the example is wrong and thread libraries do work, what about the paper "Threads Cannot Be Implemented as a Library" by Hans Boehm?
The POSIX functions pthread_mutex_lock and pthread_mutex_unlock are memory barriers: the compiler and/or CPU cannot reorder loads and stores around them. Otherwise the mutexes would be useless. That article is probably inaccurate.
See POSIX 4.12 Memory Synchronization:
Applications shall ensure that access to any memory location by more than one thread of control (threads or processes) is restricted such that no thread of control can read or modify a memory location while another thread of control may be modifying it. Such access is restricted using functions that synchronize thread execution and also synchronize memory with respect to other threads. The following functions synchronize memory with respect to other threads: [see the list on the website]
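As a minimal sketch of what the quoted guarantee buys you (the variable and function names here are only illustrative), the lock/unlock pair is what makes the update below both exclusive and visible to other threads:
#include <pthread.h>

static pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;
static long x = 0;

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&x_lock);    /* acquires the lock and synchronizes memory */
    x = x + 1;                      /* no other thread can observe a half-done update */
    pthread_mutex_unlock(&x_lock);  /* releases the lock and synchronizes memory */
    return NULL;
}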
For single-threaded code, the state of the abstract machine is not directly observable: objects that aren't volatile are not guaranteed to have any particular state when you pause the only thread with a signal and observe it via ptrace or the equivalent. The only requirement is that the program execution has the same observable behavior as one possible execution of the abstract machine.
The observables are the interactions with the external world; basically, input/output on streams and actions on volatile objects.
A compiler for single-threaded code can generate code that performs operations on global variables or other objects that happen to be shared between threads, as long as the single-threaded semantics are respected. This is obviously the case if a global variable is changed in such a way that it gets back its original value.
For example, a compiler might emit code that increments then decrements a variable, at least in some rare cases; the goal would be to emit simple code, at the cost of the occasional few unneeded operations.
Such writes to shared variables that don't exist in the abstract machine would obviously break multithreaded code that concurrently performs a real operation on them: such code has no race condition on its own accesses to the shared variable, which are properly serialized, but the generated code introduces a race that breaks the program.
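As a purely hypothetical illustration of the kind of transformation described above (no particular compiler is claimed to do this), imagine the compiler using a shared global as scratch space and restoring its value afterwards:
// What the programmer wrote: other threads write 'flags' only under a lock
// (elsewhere); this function merely reads it, so the source has no data race.
int flags = 0;

int flags_plus_one(void)
{
    return flags + 1;
}

// What a hypothetical single-thread-only compiler could emit instead,
// temporarily modifying 'flags' and putting the original value back:
int flags_plus_one_transformed(void)
{
    flags = flags + 1;       // transient write that does not exist in the source
    int result = flags;
    flags = flags - 1;       // restore the original value
    return result;
}

// Single-threaded, both versions behave identically; with another thread
// updating 'flags' under its lock, the transient write is a data race and
// can clobber that thread's update.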

Does this code need synchronization?

I plan on writing a multithreaded part in my game-project:
Thread A: loads a bunch of objects from disk, which takes up to several seconds. Each object loaded increments a counter.
Thread B: a game loop, in which I either display loading screen with number of loaded objects, or start to manipulate objects once loading is done.
In code I believe it will look as follows:
Counter = 0;
Objects;

THREAD A:
for (i = 0; i < ObjectsToLoad; ++i) {
    Objects.push(LoadObject());
    ++Counter;
}
return;

THREAD B:
...
while (true) {
    ...
    C = Counter;
    if (C < ObjectsToLoad)
        RenderLoadscreen(C);
    else
        WorkWithObjects(Objects);
    ...
}
...
Technically, this can be counted as a race condition: the object may be loaded but the counter not incremented yet, so B reads an old value. I also need to cache the counter in B so its value won't change between the check and the rendering.
Now the question is: should I implement any synchronization mechanics here, like making the counter atomic or introducing a mutex or condition variable? The point here is that I can safely sacrifice an iteration of the loop until the counter changes. And from what I get, as long as A only writes the value and B only checks it, everything is fine.
I've been discussing this question with a friend but we couldn't come to an agreement, so we decided to ask for the opinion of someone more competent in multithreading. The language is C++, if it helps.
You have to consider memory visibility / caching. Without memory barriers this can very well lead to delays of several seconds until the data is visible to Thread B(1).
This applies to both kinds of data: the Counter and the Objects list.
The C++11 standard(2) guarantees that multithreaded programs are executed correctly only if you don't introduce race conditions. Without synchronization your program basically has undefined behaviour(3). However, in practice it might work without.
Yes, use a mutex and synchronize access to Counter and Objects.
(1) This is because each CPU core has its own registers and cache. If you don't tell the CPU Core A that some other Core B might be interested in the data, it can do optimizations by e.g. leaving the data in a register. Core A has to write the data to a higher level memory region (L2/L3 Cache or RAM) so Core B can load the changes.
(2) Versions before C++11 did not address multithreading at all. There was support for mutexes, atomics, etc. through third-party libraries, but the language itself was thread-agnostic.
See: C++11 introduced a standardized memory model. What does it mean? And how is it going to affect C++ programming?
(3) The problem is that your code can be reordered (for more efficient execution) at different stages: At the compiler, the assembler and also at the CPU. You must tell the computer which instructions need to stay in that order by adding memory barriers through atomics or mutexes. This works the same in most languages.
I'd recommend watching these very interesting videos about the C++11 memory model:
atomic<> weapons by Herb Sutter
IMO: if you identify data that is accessed by multiple threads, use synchronization. Multithreading bugs are hard to track down and reproduce, so it's better to avoid them altogether.
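As a minimal sketch of that advice (the type and member names are made up for illustration), one mutex can guard both the counter and the container, so thread B either sees a loaded object together with its count, or sees neither:
#include <mutex>
#include <utility>
#include <vector>

struct Object {};  // placeholder for whatever the game actually loads

struct Loader {
    std::mutex m;
    std::vector<Object> objects;
    int counter = 0;

    // Called from thread A after each object is loaded.
    void add(Object obj) {
        std::lock_guard<std::mutex> lock(m);
        objects.push_back(std::move(obj));
        ++counter;
    }

    // Called from thread B each frame to decide what to render.
    int loaded() {
        std::lock_guard<std::mutex> lock(m);
        return counter;
    }
};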
A race condition typically arises only when two threads concurrently perform a non-atomic read-modify-write on the same datum. In this case, only one thread writes (thread A), while the other thread reads (thread B).
The only "incorrectness" you'll encounter is, as you said, if the object has been loaded but the counter hasn't been incremented. This causes B to read stale data, as the load-and-increment operation was not executed atomically.
If you don't mind this innocent anomaly, then it works just fine. :)
If this annoys you, then you need to execute all of the load-and-increment statements in one go (by using locks or any other synchronization primitive).
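For completeness, a sketch of my own (under C++11 assumptions, not part of the original answers) of how the counter itself can act as the publication point, provided the storage for the objects is allocated up front so the two threads never race on the container:
#include <atomic>
#include <cstddef>
#include <cstdio>

struct Object {};                         // placeholder for the real object type
Object LoadObject() { return Object{}; }  // stands in for the disk load

constexpr std::size_t ObjectsToLoad = 100;
Object objects[ObjectsToLoad];            // pre-allocated: thread B only reads slots
                                          // that the counter says are ready
std::atomic<std::size_t> counter{0};

// Thread A
void loader() {
    for (std::size_t i = 0; i < ObjectsToLoad; ++i) {
        objects[i] = LoadObject();
        counter.store(i + 1, std::memory_order_release);  // publishes objects[0..i]
    }
}

// Thread B
void gameLoopTick() {
    std::size_t n = counter.load(std::memory_order_acquire);  // objects[0..n-1] visible
    if (n < ObjectsToLoad)
        std::printf("loaded %zu objects\n", n);
    // else: all objects are ready to use
}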

What is the use-case of atomic read

I understand that an atomic read serializes the read operations performed by multiple threads.
What I don't understand is: what is the use case?
More interestingly, I've found an implementation of atomic read, which is
static inline int32_t ASMAtomicRead32(volatile int32_t *pi32)
{
    return *pi32;
}
where the only distinction from a regular read is the volatile qualifier. Does it mean that an atomic read is the same as a volatile read?
I understand that an atomic read serializes the read operations performed by multiple threads.
That's rather wrong. How can you ensure the order of reads if there is no write which stores a different value? Even when you have both a read and a write, they are not necessarily serialized unless the correct memory semantics are used in conjunction with both operations, e.g. 'store-with-release' and 'load-with-acquire'. In your particular example, the memory semantics are relaxed. Though on x86, one can assume acquire semantics for each load and release semantics for each store (unless non-temporal stores are used).
What I don't understand is: what is the use case?
Atomic reads must ensure that the data is read in one shot and that no other thread can store a part of the data in between. Thus an atomic read usually requires the atomic variable to be properly aligned (since a read of an aligned machine word is atomic) or works around the non-aligned case using heavier instructions. And finally, it ensures that the read is neither optimized out by the compiler nor reordered across other operations in the same thread (according to the memory semantics).
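A sketch of what a C++11 atomic read gives you that the volatile cast above does not (the names here are illustrative, not from the question): no torn reads even for a type wider than the native word on some targets, no elision of the load in a polling loop, and a defined ordering when paired with a release store.
#include <atomic>
#include <cstdint>

// 64-bit value: a plain read could tear on a 32-bit CPU.
std::atomic<int64_t> shared_value{0};

// Writer thread
void publish(int64_t v) {
    shared_value.store(v, std::memory_order_release);
}

// Reader thread: sees either the old or the new value in full, never a mix
// of halves, and the load cannot be hoisted out of a loop by the compiler.
int64_t read_current() {
    return shared_value.load(std::memory_order_acquire);
}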
Does it mean that atomic read is the same as volatile read?
In a few words, volatile was not intended for this use case, but it can sometimes be abused for it when the other requirements are met as well. For your example, my analysis is the following:
int32_t is likely a machine word or less - ok.
usually everything is aligned at least on a 4-byte boundary, though there is no guarantee of that in your example
volatile ensures the read is not optimized out
there is no guarantee it will not be reordered, either by the processor (ok on x86) or by the compiler (bad)
Please refer to Arch's blog and Concurrency: Atomic and volatile in C++11 memory model for the details.

lookup tables in C++ 11 with multithreading

I have 2 similar situations in multithreaded C++11 software:
an array that I'm using as a lookup table inside a method declaration
an array that I'm using as a lookup table declared outside a method and used by several different methods, by reference or with pointers.
Now if we forget for a minute about these LUTs and just consider C++11 and a multithreaded approach for a generic method, the most appropriate qualifier for these methods in terms of storage duration is probably thread_local.
This way, if I feed a method foo() that is thread_local to 3 threads, I basically end up with 3 instances of foo(), one per thread. This move "solves" the problem of foo() being shared and accessed by 3 different threads, avoiding cache misses, but I basically get 3 possibly different behaviours from my foo(). For example, if the same PRNG is implemented in foo() and I provide a seed that is time-dependent with a really high resolution, I will probably get 3 different results, one per thread, and a real mess in terms of consistency.
But let's say that I'm fine with how thread_local works; how can I express the fact that I need to keep a LUT always ready and cached for my methods?
I read something about relaxed and less relaxed memory models, but in C++11 I have never seen a keyword or a practical technique that can force the caching of an array/LUT.
I'm on x86 or ARM.
Basically, I probably need something that is the opposite of volatile.
If the LUTs are read-only, so that you can share them without locks, you should just use a single shared copy (i.e. declare them static).
Threads do not have their own caches. But even if they did (cores typically have their own L1 cache, and you might be able to lock a thread to a core), there would be no problem for two different threads to cache different parts of the same memory structure.
"Thread-local storage" does not mean that the memory is somehow physically tied to the thread. Rather, it's a way to let the same name refer to a different object in each thread. In no way does it restrict the ability of any thread to access the object, if given its address.
The CPU cache is not programmable. It uses its own internal logic to determine which memory regions to cache. Typically it will cache the memory that either has just been accessed by the CPU, or its prediction logic determines will shortly be accessed by the CPU. In a multiprocessor system, each CPU may have its own cache, or different CPUs may share a cache. If there are multiple caches, a memory region may be cached in more than one simultaneously.
If all threads must see the same values in the look-up tables, then a single table would be best. This could be achieved with a variable with static storage duration. If the data can be modified then you would probably also need a std::mutex to protect accesses to the table and avoid data races. Read-only data can be shared without additional synchronization; in this case it is best to declare it const to make the read-only nature explicit and avoid accidental modifications.
void foo() {
    static const int lut[] = {...};
}
You use thread_local where each thread must have its own copy of the data, usually because each copy will be modified independently. For example, you may choose to use thread_local for your random-number generator, so that each thread has its own RNG which is independent of the other threads, and does not require synchronization.
void bar() {
    thread_local RandomNumberGenerator rng; // one instance per thread
    auto val = rng.nextRandomNumber();      // use the instance for the current thread
}

How do threaded systems cope with shared data being cached by different CPUs?

I'm coming largely from a c++ background, but I think this question applies to threading in any language. Here's the scenario:
We have two threads (ThreadA and ThreadB), and a value x in shared memory
Assume that access to x is appropriately controlled by a mutex (or other suitable synchronization control)
If the threads happen to run on different processors, what happens if ThreadA performs a write operation, but its processor places the result in its L2 cache rather than the main memory? Then, if ThreadB tries to read the value, will it not just look in its own L1/L2 cache / main memory and then work with whatever old value was there?
If that's not the case, then how is this issue managed?
If that is the case, then what can be done about it?
Your example would work just fine.
Multiple processors use a coherency protocol such as MESI to ensure that data remains in sync between the caches. With MESI, each cache line is considered to be either modified, exclusively held, shared between CPUs, or invalid. Writing a cache line that is shared between processors forces it to become invalid in the other CPUs, keeping the caches in sync.
However, this is not quite enough. Different processors have different memory models, and most modern processors support some level of re-ordering memory accesses. In these cases, memory barriers are needed.
For instance if you have Thread A:
DoWork();
workDone = true;
And Thread B:
while (!workDone) {}
DoSomethingWithResults()
With both running on separate processors, there is no guarantee that the writes done within DoWork() will be visible to thread B before the write to workDone and DoSomethingWithResults() would proceed with potentially inconsistent state. Memory barriers guarantee some ordering of the reads and writes - adding a memory barrier after DoWork() in Thread A would force all reads/writes done by DoWork to complete before the write to workDone, so that Thread B would get a consistent view. Mutexes inherently provide a memory barrier, so that reads/writes cannot pass a call to lock and unlock.
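In C++11 terms, a minimal sketch of the barrier the answer describes (the function names are the placeholders from the example above, declared here only so the snippet is self-contained): releasing and acquiring an atomic flag orders the writes in DoWork() before the reads in DoSomethingWithResults().
#include <atomic>

void DoWork();                  // stands in for the work from the example
void DoSomethingWithResults();  // stands in for consuming the results

std::atomic<bool> workDone{false};

// Thread A
void producer() {
    DoWork();                                         // all writes done here...
    workDone.store(true, std::memory_order_release);  // ...happen-before this store
}

// Thread B
void consumer() {
    while (!workDone.load(std::memory_order_acquire)) { /* spin until published */ }
    DoSomethingWithResults();  // guaranteed to see everything DoWork() wrote
}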
In your case, one processor would signal to the others that it dirtied a cache line and force the other processors to reload from memory. Acquiring the mutex to read and write the value guarantees that the change to memory is visible to the other processor in the order expected.
Most locking primitives like mutexes imply memory barriers. These force a cache flush and reload to occur.
For example,
ThreadA {
    x = 5;         // probably writes to cache
    unlock mutex;  // forcibly writes local CPU cache to global memory
}
ThreadB {
    lock mutex;    // discards data in local cache
    y = x;         // x must be read from global memory
}
In general, the compiler understands shared memory, and takes considerable effort to assure that shared memory is placed in a sharable place. Modern compilers are very complicated in the way that they order operations and memory accesses; they tend to understand the nature of threading and shared memory. That's not to say that they're perfect, but in general, much of the concern is taken care of by the compiler.
C# has some built-in support for this kind of problem.
You can mark a variable with the volatile keyword, which forces it to be synchronized across all CPUs.
public static volatile int loggedUsers;
The other part is a syntactic wrapper around the .NET methods Threading.Monitor.Enter(x) and Threading.Monitor.Exit(x), where x is a variable to lock. This causes other threads trying to lock x to wait until the locking thread calls Exit(x).
public List<User> users;  // 'User' stands in for the application's user type

// In some function:
System.Threading.Monitor.Enter(users);
try {
    // do something with users
}
finally {
    System.Threading.Monitor.Exit(users);
}
