Lookup tables in C++11 with multithreading

I have two similar situations in multithreaded C++11 software:
an array that I'm using as a lookup table inside a method body
an array that I'm using as a lookup table declared outside a method and used by several different methods, by reference or through pointers.
Now, if we forget about these LUTs for a minute and just consider C++11 and a multithreaded approach for a generic method, the most appropriate storage-duration qualifier for these methods is probably thread_local.
This way, if I feed a method foo() that uses thread_local storage to 3 threads, I basically end up with one instance of foo()'s data per thread. This "solves" the problem of foo() being shared and accessed by 3 different threads and avoids cache misses, but I basically get 3 possibly different behaviours from foo(): for example, if foo() implements a PRNG and I provide a time-dependent seed with a really high resolution, I will probably get a different result in each thread and a real mess in terms of consistency.
But let's say that I'm fine with how thread_local works: how can I express the fact that I need to keep a LUT always ready and cached for my methods?
I have read something about relaxed (and less relaxed) memory models, but in C++11 I have never seen a keyword or a practical construct that can force an array/LUT to be cached.
I'm on x86 or ARM.
I probably need something that is basically the opposite of volatile.

If the LUTs are read-only, so that you can share them without locks, you should just use a single shared instance (i.e. declare them static).
Threads do not have their own caches. But even if they did (cores typically have their own L1 cache, and you might be able to lock a thread to a core), there would be no problem for two different threads to cache different parts of the same memory structure.
"Thread-local storage" does not mean that the memory is somehow physically tied to the thread. Rather, it's a way to let the same name refer to a different object in each thread. In no way does it restrict the ability of any thread to access the object, if given its address.

The CPU cache is not programmable. It uses its own internal logic to determine which memory regions to cache. Typically it will cache the memory that either has just been accessed by the CPU, or its prediction logic determines will shortly be accessed by the CPU. In a multiprocessor system, each CPU may have its own cache, or different CPUs may share a cache. If there are multiple caches, a memory region may be cached in more than one simultaneously.
If all threads must see the same values in the look-up tables, then a single table would be best. This could be achieved with a variable with static storage duration. If the data can be modified then you would probably also need a std::mutex to protect accesses to the table and avoid data races. Read-only data can be shared without additional synchronization; in this case it is best to declare it const to make the read-only nature explicit and avoid accidental modifications.
void foo(){
    static const int lut[]={...};
}
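If instead the table can be modified at run time, a minimal sketch (lut and lut_mutex are illustrative names) would guard every access with a std::mutex:

#include <array>
#include <cstddef>
#include <mutex>

std::array<int, 256> lut{};    // shared, mutable lookup table
std::mutex lut_mutex;          // guards every read and write of lut

int lookup(std::size_t i) {
    std::lock_guard<std::mutex> guard(lut_mutex);
    return lut[i];
}

void update(std::size_t i, int value) {
    std::lock_guard<std::mutex> guard(lut_mutex);
    lut[i] = value;
}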
You use thread_local where each thread must have its own copy of the data, usually because each copy will be modified independently. For example, you may choose to use thread_local for your random-number generator, so that each thread has its own RNG which is independent of the other threads, and does not require synchronization.
void bar(){
    thread_local RandomNumberGenerator rng; // one per thread
    auto val=rng.nextRandomNumber(); // use the instance for the current thread
}

Related

Does a variable only read by one thread, read and written by another, need synchronization?

Motive:
I am just learning the fundamentals of multithreading, not close to finishing them, but I'd like to ask a question this early in my learning journey to guide me toward the topics most relevant to the project I'm working on.
Main:
a. If a process has two threads, one that edits a set of variables and the other that only reads said variables and never edits their values, then do we need any sort of synchronization to guarantee the validity of the values read by the reading thread?
b. Is it possible for the OS scheduling these two threads to cause the reading thread to read a variable in a memory location at the exact same moment the writing thread is writing into the same memory location, or is that just a hardware/bus situation that will never be allowed to happen and that a software designer should never care about? What if the variable is a large struct instead of a little int or char?
a. If a process has two threads, one that edits a set of variables and the other that only reads said variables and never edits their values, then do we need any sort of synchronization to guarantee the validity of the values read by the reading thread?
In general, yes. Otherwise, the thread editing the value could change it only locally, so that the other thread never sees the value change. This can happen because of compilers (which may keep variables in registers for reads/stores) but also because of the hardware (depending on the cache coherence mechanism used on the target platform). Generally, locks, atomic variables and memory barriers are used to perform such synchronization.
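As a minimal sketch of the "atomic variables" option (the names are invented): making the shared variable a std::atomic guarantees that the reader eventually sees the writer's store, with no locks needed.

#include <atomic>
#include <thread>

std::atomic<int> shared_value{0};   // written by one thread, read by another

void writer() {
    shared_value.store(42);         // atomic store, visible to other threads
}

void reader() {
    while (shared_value.load() == 0) {
        // spin until the writer's store becomes visible
    }
}

int main() {
    std::thread w(writer), r(reader);
    w.join();
    r.join();
}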
b. Is it possible for the OS scheduling these two threads to cause the reading thread to read a variable in a memory location at the exact same moment the writing thread is writing into the same memory location, or is that just a hardware/bus situation that will never be allowed to happen and that a software designer should never care about? What if the variable is a large struct instead of a little int or char?
In general, there is no guarantee that accesses are done atomically. Theoretically, two cores, each executing one thread, can load/store the same variable at the same time (though this is often not the case in practice). It is very dependent on the target platform.
For processors having (coherent) caches (i.e. all modern mainstream processors), cache lines (i.e. chunks of typically 64 or 128 bytes) have a huge impact on the implicit synchronization between threads. This is a complex topic, but you can first read more about cache coherence in order to understand how the memory hierarchy works on modern platforms.
The cache coherence protocol prevents two loads/stores from being done at exactly the same time in the same cache line. If the variable crosses multiple cache lines, then there is no such protection.
On widespread x86/x86-64 platforms, variables of primitive types of <= 8 bytes can be modified atomically (because the bus, the DRAM and the cache support that) assuming the address is correctly aligned (so it does not cross cache lines). However, this does not mean all such accesses are atomic. You need to specify this to the compiler/interpreter/etc. so that it produces/executes the correct instructions. Note that there is also an extension for 16-byte atomics, as well as an instruction set extension for transactional memory. For wider types (or possibly composite ones) you likely need a lock or an atomic state to control the atomicity of the access to the target variable.
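A C++ sketch of the two cases (names are invented for illustration): an aligned 8-byte integer can be made atomic directly, typically lock-free on x86-64, whereas a wider composite type needs a lock.

#include <atomic>
#include <cstdint>
#include <mutex>

std::atomic<std::uint64_t> small_counter{0};  // usually lock-free; check with is_lock_free()

struct BigRecord {                            // much wider than a machine word
    double values[8];
};

BigRecord record{};
std::mutex record_mutex;                      // a lock controls atomicity for the whole struct

void bump() {
    small_counter.fetch_add(1);               // atomic read-modify-write
}

void set_record(const BigRecord& r) {
    std::lock_guard<std::mutex> guard(record_mutex);
    record = r;
}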

Mutexes. What even?

I am learning about computer architecture and how operating systems work. I have a few questions about how mutexes work.
Question 1
add_to_list(&list, &elem):
    mutex m;
    lock_mutex(m);
    ...

remove_from_list(&list):
    mutex m;
    lock_mutex(m);
    ...
These two functions each instantiate their own mutex, which means the two mutexes live at different places in memory, so locking one does not lock the other, and this doesn't accomplish what we want: protecting list.
How do we get two different functions to use the same mutex? Do we define a global variable? If so, how do you share this global variable throughout an entire program that is potentially spread throughout multiple files?
Question 2
mutex m;

modify_A():
    lock_mutex(m);
    A += 1;

modify_B():
    lock_mutex(m);
    B += 1;
These two functions modify different places in memory. Does that mean I need a unique mutex for each function / piece of data? If I were to have a global mutex variable that I used for both functions, a thread calling modify_A() would block another thread trying to call modify_B().
Which brings me to my last question...
Question 3
A mutex seems like it just blocks a thread from running a piece of code until whatever thread is currently running that same code finishes. This is to create atomicity and protect the integrity of the data being used by a thread. However, the same piece of memory can be modified from many different places in a program. Which makes me think we have to use one mutex throughout an entire program, which would result in a lot of needless blocking of other threads.
Considering that pretty much every function in a given program is going to be modifying data, if we use a single mutex throughout a program, that means each function call will be blocked while that mutex is in use by another thread, even if the data it needs to access is unrelated.
Doesn't that effectively eliminate the gains from having multiple threads? If only one thread can run at a given time?
I feel like I'm totally misunderstanding how mutexes work, so please ELI5!
Thanks in advance.
Yes, you make it a global variable, or otherwise accessible to the required functions through some kind of convenience method or whatever. Global variables can be shared between translation units too, but that's language/system dependent. In C you'd just put an extern mutex m in a header that everyone shares and then define that mutex as mutex m in exactly one of your translation units.
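A C++ sketch of the same idea (file and symbol names are made up): declare the mutex in a header that every file includes, define it in exactly one translation unit, and have both functions lock that one object.

// list_lock.h -- included by every file that touches the list
#pragma once
#include <mutex>
extern std::mutex list_mutex;

// list_lock.cpp -- the one and only definition
#include "list_lock.h"
std::mutex list_mutex;

// list_ops.cpp -- both functions lock the same mutex
#include "list_lock.h"
void add_to_list(/* list, elem */)  { std::lock_guard<std::mutex> g(list_mutex); /* ... */ }
void remove_from_list(/* list */)   { std::lock_guard<std::mutex> g(list_mutex); /* ... */ }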
If you don't want changes to B to block other threads from modifying A, yes, you'd use two different mutexes. If you want to lock both at the same time, you would share the mutex.
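Sketch of the one-mutex-per-piece-of-data approach (names invented): A and B each get their own mutex, so updating A never blocks a thread that only touches B.

#include <mutex>

int A = 0;
int B = 0;
std::mutex mutex_A;   // protects A only
std::mutex mutex_B;   // protects B only

void modify_A() { std::lock_guard<std::mutex> g(mutex_A); A += 1; }
void modify_B() { std::lock_guard<std::mutex> g(mutex_B); B += 1; }  // independent of mutex_A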
Multiple threads can run at the same time as long as no two of them are inside the critical section protected by a certain mutex at the same time. That's the whole point - everything goes on nice and parallel, but you use the mutex to serialize access to a specific resource or critical section you need protected.
You typically use a mutex to protect some particular piece of shared data. If the vast majority of your code's time is spent accessing one single piece of shared data, then you won't get much of a performance improvement from threads precisely because only one thread can safely access that piece of shared data at a time.
If you happen to fall into this situation, there are more complex techniques than mutexes. Fortunately, it's fairly rare (unless you're implementing operating systems or low-level libraries) so you can get away with using mutexes for a very large fraction of your synchronization needs.

Is synchronization for a variable change cheaper than for something else?

In a multi-threading environment, isn’t it that every operation on the RAM must be synchronized?
Let’s say, I have a variable, which is a pointer to another memory address:
foo 12345678
Now, if a second thread sets that variable to another memory address (let's say 89ABCDEF) while the first thread reads the variable, couldn't the first thread read complete garbage from the variable if the access weren't synchronized (at some system level)?
foo 12345678 (before)
89ABCDEF (new data)
••••• (writing thread progress)
89ABC678 (memory content)
Since I have never seen these things happen, I assume that there is some system-level synchronization when writing variables. I assume that this is why it is called an 'atomic' operation. As I found here, this problem is actually a real topic and not something I just made up.
On the other hand, I read everywhere that synchronization has a significant impact on performance. (Aside from threads that must wait because they cannot enter the lock; I mean just the action of locking and unlocking.) Like here:
synchronized adds a significant overhead to the methods […]. These operations are quite expensive […] it has an extreme impact on the program performance. […] the expensive synchronized operations that cause the code to be so terribly slow.
How does this go together? Why is locking for changing a variable unnoticeably fast, but locking for anything else so expensive? Or is it equally expensive, and there should be a big warning sign when using, let's say, long and double, because they always implicitly require synchronization?
Concerning your first point, when a processor writes some data to memory, this data is always properly written and cannot be "trashed" by other writes from threads, processes, the OS, etc. It is not a matter of synchronization, just a requirement to ensure proper hardware behaviour.
Synchronization is a software concept that requires hardware support. Assume that you just want to acquire a lock. It is supposed to be free when at 0 and locked when at 1.
The basic method to do that is
got_the_lock = 0
while (!got_the_lock)
    fetch lock value from memory
    set lock value in memory to 1
    got_the_lock = (fetched value from memory == 0)
done
print "I got the lock!!"
The problem is that if other threads do the same thing at the same time and read lock value before it has been set to 1, several threads may think they got the lock.
To avoid that, one needs atomic memory access. An atomic access is typically a read-modify-write cycle on a datum in memory that cannot be interrupted and that forbids access to that location until completion. So not all accesses are atomic, only specific read-modify-write operations, and they are realized thanks to specific processor support (see the test-and-set or fetch-and-add instructions, for instance). Most accesses do not need it and can be regular accesses. Atomic access is mostly used to synchronize threads and to ensure that only one thread is in a critical section.
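In C++11 that read-modify-write primitive is exposed, for instance, through std::atomic_flag::test_and_set; a minimal spinlock sketch built on it looks like this:

#include <atomic>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void acquire() {
    // Atomically set the flag and return its previous value;
    // keep retrying until the previous value was "clear", i.e. we got the lock.
    while (lock_flag.test_and_set(std::memory_order_acquire)) {
        // spin
    }
}

void release() {
    lock_flag.clear(std::memory_order_release);
}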
So why are atomic accesses expensive? There are several reasons.
The first one is that one must ensure proper ordering of instructions. You probably know that instruction execution order may differ from program order, provided the semantics of the program are respected. This is heavily exploited to improve performance: the compiler reorders instructions, the processor executes them out of order, write-back caches write data to memory in any order, and memory write buffers do the same thing. This reordering can lead to improper behavior.
1 while (x--) ; // random and silly loop
2 f(y);
3 while(test_and_set(important_lock)) ; //spinlock to get a lock
4 g(z);
Obviously instruction 1 is not constraining and 2 can be executed before (and probably 1 will be removed by an optimizing compiler). But if 4 is executed before 3, the behavior will not be as expected.
To avoid that, an atomic access flushes the instruction and memory buffers, which requires tens of cycles (see memory barrier).
The second reason is that, without pipelining, you pay the full latency of the operation: read the data from memory, modify it and write it back. This latency always exists, but for regular memory accesses you can do other work during that time, which largely hides the latency.
An atomic access requires at least 100-200 cycles on modern processors and is accordingly extremely expensive.
How does this go together? Why is locking for changing a variable unnoticeably fast, but locking for anything else so expensive? Or is it equally expensive, and there should be a big warning sign when using, let's say, long and double, because they always implicitly require synchronization?
Regular memory accesses are not atomic. Only the specific synchronization instructions are expensive.
Synchronization always has a cost. And the cost increases with contention: threads wake up and fight for the lock, only one gets it, and the rest go back to sleep, resulting in a lot of context switches.
However, such contention can be kept to a minimum by synchronizing at a much more granular level, as with a CAS (compare-and-swap) operation by the CPU, or a memory barrier when reading a volatile variable. A far better option is to avoid synchronization altogether without compromising safety.
Consider the following code:
synchronized(this) {
// a DB call
}
This block of code will take several seconds to execute as it is doing IO, and therefore runs a high chance of creating contention among other threads wanting to execute the same block. The duration is enough to build up a massive queue of waiting threads in a busy system.
This is the reason non-blocking algorithms like the Treiber stack and the Michael-Scott queue exist. They do their task (which we'd otherwise do using a much larger synchronized block) with the minimum amount of synchronization.
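As a rough sketch of the idea (only the push side, and ignoring the memory reclamation and ABA problems a real implementation has to handle), a Treiber-style stack replaces a big lock with one compare-and-swap on the head pointer:

#include <atomic>

template <typename T>
class TreiberStack {
    struct Node { T value; Node* next; };
    std::atomic<Node*> head{nullptr};
public:
    void push(T value) {
        Node* n = new Node{std::move(value), head.load(std::memory_order_relaxed)};
        // If another thread changed head in the meantime, compare_exchange_weak
        // reloads the current head into n->next and we simply retry.
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
        }
    }
};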
isn’t it that every operation on the RAM must be synchronized?
No. Most of the "operations on RAM" will target memory locations that are only used by one thread. For example, in most programming languages, none of a thread's function arguments or local variables will be shared with other threads; and often, a thread will use heap objects that it does not share with any other thread.
You need synchronization when two or more threads communicate with one another through shared variables. There are two parts to it:
mutual exclusion
You may need to prevent "race conditions." If some thread T updates a data structure, it may have to put the structure into a temporary, invalid state before the update is complete. You can use mutual exclusion (i.e., mutexes/semaphores/locks/critical sections) to ensure that no other thread U can see the data structure when it is in that temporary, invalid state.
cache consistency
On a computer with more than one CPU, each processor typically has its own memory cache. So, when two different threads running on two different processors both access the same data, they may each be looking at their own, separately cached copy. Thus, when thread T updates that shared data structure, it is important to ensure that all of the variables it updated make it into thread U's cache before thread U is allowed to see any of them.
It would totally defeat the purpose of the separate caches if every write by one processor invalidated every other processor's cache, so there typically are special hardware instructions to do that only when it's needed, and typical mutex/lock implementations execute those instructions on entering or leaving a protected block of code.
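A small C++ sketch of that pattern (names are made up): the unlock at the end of the writer and the lock at the start of the reader are exactly where those instructions run, so the reader is guaranteed to see everything the writer stored before unlocking.

#include <mutex>
#include <vector>

std::vector<int> results;     // shared data structure
std::mutex results_mutex;

void producer() {
    std::lock_guard<std::mutex> g(results_mutex);
    results.push_back(42);    // may sit in the writer's cache for a while...
}                             // ...but the unlock publishes it to other threads

void consumer() {
    std::lock_guard<std::mutex> g(results_mutex);   // the lock makes prior writes visible
    if (!results.empty()) {
        int last = results.back();
        (void)last;           // use the value
    }
}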

Will mutex protection fail because of register promotion?

In an article about the C++11 memory order, the author shows an example to argue that "thread libs will not work in C++03":
for (...){
    ...
    if (mt) pthread_mutex_lock(...);
    x=...x...
    if (mt) pthread_mutex_unlock(...);
}
// should not have a data race,
// but if a "clever" compiler uses a technique called
// "register promotion", the code becomes like this:
r = x;
for (...){
    ...
    if (mt) {
        x=r; pthread_mutex_lock(...); r=x;
    }
    r=...r...
    if (mt) {
        x=r; pthread_mutex_unlock(...); r=x;
    }
}
x=r;
There are 3 questions:
1. Does this promotion break mutex protection only in C++03? What about the C language?
2. Do C++03 thread libraries become unworkable?
3. Could any other such promotion cause the same problem?
If this is a wrong example and thread libraries do work, what about the paper "Threads Cannot Be Implemented as a Library" by Hans Boehm?
The POSIX functions pthread_mutex_lock and pthread_mutex_unlock are memory barriers: the compiler and/or the CPU cannot reorder loads and stores across them. Otherwise the mutexes would be useless. That article is probably inaccurate.
See POSIX 4.12 Memory Synchronization:
Applications shall ensure that access to any memory location by more than one thread of control (threads or processes) is restricted such that no thread of control can read or modify a memory location while another thread of control may be modifying it. Such access is restricted using functions that synchronize thread execution and also synchronize memory with respect to other threads. The following functions synchronize memory with respect to other threads: [see the list on the website]
For single-threaded code, the state of the abstract machine is not directly observable: objects that aren't volatile are not guaranteed to have any particular state when you pause the only thread with a signal and observe it via ptrace or the equivalent. The only requirement is that the program execution has the same observable behavior as one possible execution of the abstract machine.
The observables are the interactions with external world; basically, input/output on streams and actions on volatile objects.
A compiler for single-threaded code can generate code that performs operations on global variables or other objects that happen to be shared between threads, as long as the single-threaded semantics are respected. This is obviously the case if a global variable is changed in such a way that it gets back its original value.
For example, a compiler might emit code that increments then decrements a variable, at least in some rare cases; the goal would be to emit simple code, at the cost of an occasional few unneeded operations.
Such changes to shared variables, which don't exist in the abstract machine, would obviously break multithreaded code that concurrently performs a real operation; such code has no race condition on the accesses of the shared variable, which are properly serialized, but the generated code introduces a race that breaks the program.
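For comparison, a C++11 sketch of the original loop (the names are invented): under the C++11 memory model the compiler is not allowed to invent writes to a potentially shared object, so it may not promote x to a register across the lock/unlock calls and write it back outside the critical section.

#include <mutex>

int x = 0;            // shared between threads
std::mutex x_mutex;
bool mt = true;       // "are we multithreaded?" flag, as in the original example

void work(int n) {
    for (int i = 0; i < n; ++i) {
        if (mt) x_mutex.lock();
        x = x + 1;    // stays inside the critical section; no speculative store to x
        if (mt) x_mutex.unlock();
    }
}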

How do threaded systems cope with shared data being cached by different CPUs?

I'm coming largely from a c++ background, but I think this question applies to threading in any language. Here's the scenario:
We have two threads (ThreadA and ThreadB), and a value x in shared memory
Assume that access to x is appropriately controlled by a mutex (or other suitable synchronization control)
If the threads happen to run on different processors, what happens if ThreadA performs a write operation, but its processor places the result in its L2 cache rather than the main memory? Then, if ThreadB tries to read the value, will it not just look in its own L1/L2 cache / main memory and then work with whatever old value was there?
If that's not the case, then how is this issue managed?
If that is the case, then what can be done about it?
Your example would work just fine.
Multiple processors use a coherency protocol such as MESI to ensure that data remains in sync between the caches. With MESI, each cache line is considered to be either modified, exclusively held, shared between CPUs, or invalid. Writing a cache line that is shared between processors forces it to become invalid in the other CPUs, keeping the caches in sync.
However, this is not quite enough. Different processors have different memory models, and most modern processors support some level of re-ordering memory accesses. In these cases, memory barriers are needed.
For instance if you have Thread A:
DoWork();
workDone = true;
And Thread B:
while (!workDone) {}
DoSomethingWithResults()
With both running on separate processors, there is no guarantee that the writes done within DoWork() will be visible to thread B before the write to workDone and DoSomethingWithResults() would proceed with potentially inconsistent state. Memory barriers guarantee some ordering of the reads and writes - adding a memory barrier after DoWork() in Thread A would force all reads/writes done by DoWork to complete before the write to workDone, so that Thread B would get a consistent view. Mutexes inherently provide a memory barrier, so that reads/writes cannot pass a call to lock and unlock.
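A C++ sketch of the same example (DoWork and workDone are the names from the pseudocode above; the rest is invented): making the flag a std::atomic<bool> with release/acquire ordering provides exactly that barrier.

#include <atomic>
#include <thread>

std::atomic<bool> workDone{false};
int result = 0;                         // ordinary data produced by DoWork()

void threadA() {
    result = 123;                                        // DoWork()
    workDone.store(true, std::memory_order_release);     // publish after the work
}

void threadB() {
    while (!workDone.load(std::memory_order_acquire)) {
        // spin until the flag, and everything written before it, is visible
    }
    // safe to use 'result' here: DoSomethingWithResults()
    (void)result;
}

int main() {
    std::thread a(threadA), b(threadB);
    a.join();
    b.join();
}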
In your case, one processor would signal to the others that it dirtied a cache line and force the other processors to reload from memory. Acquiring the mutex to read and write the value guarantees that the change to memory is visible to the other processor in the order expected.
Most locking primitives like mutexes imply memory barriers. These force a cache flush and reload to occur.
For example,
ThreadA {
    x = 5; // probably writes to cache
    unlock mutex; // forcibly writes local CPU cache to global memory
}
ThreadB {
    lock mutex; // discards data in local cache
    y = x; // x must read from global memory
}
In general, the compiler understands shared memory, and takes considerable effort to assure that shared memory is placed in a sharable place. Modern compilers are very complicated in the way that they order operations and memory accesses; they tend to understand the nature of threading and shared memory. That's not to say that they're perfect, but in general, much of the concern is taken care of by the compiler.
C# has some built-in support for this kind of problem.
You can mark a variable with the volatile keyword, which forces it to be synchronized across all CPUs.
public static volatile int loggedUsers;
The other part is a syntactic wrapper around the .NET methods Threading.Monitor.Enter(x) and Threading.Monitor.Exit(x), where x is the variable to lock. This causes other threads trying to lock x to wait until the locking thread calls Exit(x).
public list users;
// In some function:
System.Threading.Monitor.Enter(users);
try {
    // do something with users
}
finally {
    System.Threading.Monitor.Exit(users);
}
