Why do random number generators cause problems when executed in parallel? - multithreading

I have an assignment and the following was mentioned:
Parallelism adds further complications, as the random number
generator function must behave correctly when two or more threads call
it simultaneously, i.e., it must be thread-safe.
However, I can't understand why that is a problem in the first place. Random generators are usually called with a seed as a parameter and output a number by doing multiple operations on it, I understand that we need each thread to use a different seed but other than that where does the problem come from? I have realized that spawning and calling the random generator in a parallel region instead of a serial one also really worsens performance but I can't understand why this would happen as by the looks of it, the random number generator should run concurrently without any problems since they are no dependencies.
Any help in understand the theory behind this would be appreciated.

Aside from getting wrong values from the race conditions (pointed out by #MitchWheat), the code will be less efficient because of cache-line sharing between cores on mainstream x86 processors.
Here is an example of (pretty bad but simple) 32-bit random generator (written in C):
uint32_t seed = 0xD1263AA2;
uint32_t customRandom() {
uint32_t old = seed;
seed = (uint32_t)(((uint64_t)seed * 0x2E094C40) >> 24);
return old;
}
void generateValues(uint32_t* arr, size_t size) {
for(size_t i=0 ; i<size ; ++i)
arr[i] = customRandom();
}
If you run this example in sequential (see the result here), the state seed will likely be stored in the memory hierarchy using mainstream compilers (ie. GCC and Clang). This 32-bit block of memory will be read/written in the L1 cache which is very close to the core executing the code.
When you parallelize the loop naively, using for example #pragma omp parallel for in OpenMP, the state is read/written concurrently by multiple threads. There is a race condition: the state value seed can be read by multiple thread in parallel and be written in parallel. Consequently, the same value can be generated by multiple threads while the results are supposed to be random. Race condition are bad and must be fixed. You can fix that using a thread-local state here.
Assuming you do not fix the code because you want to understand the impact of the race condition on the resulting performance of this code, you should see a performance drop in parallel. The issue comes from the cache coherence protocol used by x86 mainstream processors. Indeed, seed is shared between all the threads executed on each core so the processor will try to synchronize the cache of the core so they are kept coherent. This process is very expensive (much slower than reading/writing in the L1 cache). More specifically, when a thread on a given core write in seed, the processor invalidate the seed stored in all the other threads located in the caches of the other cores. Each thread must then fetch the updated seed (typically from the much slower L3 cache) when seed is read. You can find more information here.

Related

Accessing a 64-bit variable in different threads without synchronization or atomicity

I have two threads sharing an uint64_t variable. The first thread just reads from the variable while the other thread just writes into. If I don't synchronize them using mutex/spinlock/atomic operations etc.., is there any possibility of reading another value from the writing thread wrote into? It is not important to read an old-value which was written by writing thread.
As an example, the writing thread increases the variable between 0 and 100, and the reading thread prints the value. So, is there any possibility to see a value in the screen different than [0-100] range. Currently I don't see any different value but I'm not sure it can cause a race condition.
Thanks in advance.
On a 64 bit processor, the data transfers are 64 bits at a time, so you will see logically consistent values i.e. you won't see 32 bits from before the write and 32 bits after the write. This is obviously not true of 32 bit processors.
The kind of issues you will see are things like, if the two threads are running on different cores, the reading thread will not see changes made by the writing thread until the writing thread's core flushes its cache. Also, optimisation may make either thread not bother to read memory at all in the loop. For example, if you have:
uint64_t x = 0;
void increment()
{
for (int i = 0 ; i < 100 ; ++i)
{
x++;
}
}
It is possible that the compiler will generate code that reads x into a register at the start of the loop and not write it back to memory until the loop exits. You need things like volatile and memory barriers.
All bad things can happen if you have a race condition on such a variable.
The correct tool with modern C for this are atomics. Just declare your variable
uint64_t _Atomic counter;
Then, all your operations (load, store, increment...) will be atomic, that is indivisible, uninterruptible and linearizable. No mutex or other protection mechanism is necessary.
This has been introduced in C11, and recent C compilers e.g gcc and clang, now support this out of the box.

When can't computation speed be improved by parallelization?

Similar questions have been asked before but I couldn't find an answer that was more about the low level mechanics of threads themselves.
Problem
I have a physical modeling project in which I need to apply a function to 160 billion data points.
for(int i=0; i < N(160,000,000,000); i++){
physicalModal(input[i]); //Linear function, just additions and subtractions
}
function physicalModal(x){
A*x +B*x +C*x + D*x......... //An over simplification but you get the idea. A linear function
}
Given the nature of this problem am I correct in thinking a single thread on a single core, or 1 thread per core, would be the fastest way to solve this? That using extra threads beyond the number of cores would not help me here?
My Logic (Please correct where my assumptions are wrong)
Threads on a single core don't really work independently, they just share processor time which can be beneficial when one thread is waiting on perhaps a socket response and other threads are processing requests. In the example I posted above I figure the CPU could go to 100% on one thread so using multiple threads would just disturb the computation. Is this correct?
What then determines when threading is useful?
If my above assumption is correct, whats the key factor in determining when other threads would be useful? My guess would be simultaneous operations that have varying completion times, waiting, etc...But thats based on my initial premise which may be incorrect.
I need to apply a function to 160 billion data points.
I assume that your function has no side effects (no writes to global/static variables; no disk/network access; no service to many remote users) and just do some arithmetics on its input (on single point of input or several nearby points as for stencil (it is stencil kernel):
for(int i=0; i < 160_000_000_000; i++){
//Linear function, just additions and subtractions
output[i] = physicalModel(input[i] /* possibly also input[i-1], input[i+1] .. */);
}
Then you have to check how efficient this function works on single CPU. Can you (or your compiler) unroll your loop and convert it to SIMD parallelism?
for(int i=0+1; i < 160_000_000_000-1; i++){
output[i] = A*input[i-1]+ B*input[i] + C*input[i+1];
}
// unrolled 4 times; if input is float, compiler may load 4 floats
// into single SSE2 reg and do 4 operations from one asm command
for(int i=0+4; i < 160_000_000_000-4; i+=4){
output[i+0] = A*input[i-1]+ B*input[i+0] + C*input[i+1];
output[i+1] = A*input[i+0]+ B*input[i+1] + C*input[i+2];
output[i+2] = A*input[i+1]+ B*input[i+2] + C*input[i+3];
output[i+3] = A*input[i+2]+ B*input[i+3] + C*input[i+4];
}
When your function has good single-threaded performance, you can add thread or process parallelism (using OpenMP/MPI or other method). With our assumptions, there are no threads blocking on some external resource like reading from HDD or from network, so every thread you started can run at any time. Then we should start no more than 1 thread per CPU core. If we started several threads, each will run for some amount of time and displaced by other, having less performance than in case of 1 thread per cpu core.
In C/C++ adding of OpenMP thread level parallelism (https://en.wikipedia.org/wiki/OpenMP, http://www.openmp.org/) can be as easy as adding one line just before your for loop (and adding -fopenmp/-openmp option to your compilation); compiler and library will split your for loop into parts and distribute them between threads ([0..N/4], [N/4..N/2], [N/2..N*3/4], [N*3/4..N] for 4 threads or other split scheme; you can give hints with schedule option)
#pragma omp parallel for
for(int i=0+1; i < 160_000_000_000-1; i++){
output[i] = physicalModel(input[i]);;
}
Thread count will be determined in runtime by OpenMP library (gomp in gcc - https://gcc.gnu.org/onlinedocs/libgomp/index.html). By default it is "one thread per CPU is used" (per logical cpu core). You can change number of threads with OMP_NUM_THREADS environment variable (export OMP_NUM_THREADS=5; ./program).
On CPU with hardware multithreading on single cpu cores (Intel HT, other variants of SMT: you have 4 physical cores and 8 "logical") in some cases you should use 1 thread per logical core, and in other cases 1 thread per physical core (with correct thread binding), as some resources (FPU units) are shared between logical cores. Do some experiments if your code will be used several (many) times.
If your threads (model) are limited by speed of memory (Memory Bound; they loads many data from memory and does very simple operation on every float), you may want to run less threads than cpu core count, as additional threads will not get addition memory bandwidth.
If your threads do lot of computations for every element loaded from memory, use better SIMD and more threads (compute bound). When you have very good and wide SIMD (full-width AVX), you will have no speedup from using HT, as full-width AVX unit is shared between logical cores (but every physical core has one, so use it); in this case you will also have lower cpu frequency, as full-width AVX unit is very hot under full load.
Illustration of memory and compute limited applications: https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/
https://crd.lbl.gov/assets/Uploads/FTG/Projects/Roofline/_resampled/ResizedImage600300-rooflineai.png

Does this code need synchronization?

I plan on writing a multithreaded part in my game-project:
Thread A: loads bunch of objects from disk, which takes up to several seconds. Each object loaded increments a counter.
Thread B: a game loop, in which I either display loading screen with number of loaded objects, or start to manipulate objects once loading is done.
In code I believe it will look as following:
Counter = 0;
Objects;
THREAD A:
for (i = 0; i < ObjectsToLoad; ++i) {
Objects.push(LoadObject());
++Counter;
}
return;
THREAD B:
...
while (true) {
...
C = Counter;
if (C < ObjectsToLoad)
RenderLoadscreen(C);
else
WorkWithObjects(Objects)
...
}
...
Technically, this can be counted as a race condition - the object may be loaded but counter is not incremented yet, so B reads an old value. I also need to cache counter in B so its value won't change between check and rendering.
Now the question is - should I implement any synchronization mechanics here, like making counter atomic or introducing some mutex or conditional variable? The point here is that I can safely sacrifice an iteration of loop until the counter changes. And from what I get, as long as A only writes the value and B only checks it, everything is fine.
I've been discussing this question with a friend but we couldn't get to agreement, so we decided to ask for opinion of someone more competent in multithreading. The language is C++, if it helps.
You have to consider memory visibility / caching. Without memory barriers this can very well lead to delays of several seconds until the data is visible to Thread B(1).
This applies to both kind of data: The Counter and the Objects list.
The C++11 standard(2) guarantees that multithreaded programs are executed correctly only if you don't introduce race conditions. Without synchronization your program basically has undefined behaviour(3). However, in practice it might work without.
Yes, use a mutex and synchronize access to Counter and Objects.
(1) This is because each CPU core has its own registers and cache. If you don't tell the CPU Core A that some other Core B might be interested in the data, it can do optimizations by e.g. leaving the data in a register. Core A has to write the data to a higher level memory region (L2/L3 Cache or RAM) so Core B can load the changes.
(2) Any version before C++11 did not care about multithreading. There was support for mutexes, atomics etc. through third-party libraries but the language itself was thread-agnostic.
See: C++11 introduced a standardized memory model. What does it mean? And how is it going to affect C++ programming?
(3) The problem is that your code can be reordered (for more efficient execution) at different stages: At the compiler, the assembler and also at the CPU. You must tell the computer which instructions need to stay in that order by adding memory barriers through atomics or mutexes. This works the same in most languages.
I'd recommend watching these very interesting videos about the C++11 memory model:
atomic<> weapons by Herb Sutter
IMO: If you identify data that is accessed by multiple threads, use synchronization. Multithreading-bugs are hard to track and reproduce, so better avoid them all together.
Race condition is typically only when two threads try to non-atomically read-modify-write concurrently the same datum. In this case, only one thread writes (thread A), while the other thread reads (thread B).
The only "incorrectness" you'll encounter is, as you said, if the object has been loaded but the counter hasn't been incremented. This causes B to read stale data, as the load-and-increment operation was not executed atomically.
If you don't mind this innocent anomaly, then it works just fine. :)
If this annoys you, then you need to execute all of the load-and-increment statements in one go (by using locks or any other synchronization primitive).

Out-of-order execution and reordering: can I see what after barrier before the barrier?

According to wikipedia: A memory barrier, also known as a membar, memory fence or fence instruction, is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction. This typically means that operations issued prior to the barrier are guaranteed to be performed before operations issued after the barrier.
Usually, articles talking about something like (I will use monitors instead of membars):
class ReadWriteExample {
int A = 0;
int Another = 0;
//thread1 runs this method
void writer () {
lock monitor1; //a new value will be stored
A = 10; //stores 10 to memory location A
unlock monitor1; //a new value is ready for reader to read
Another = 20; //#see my question
}
//thread2 runs this method
void reader () {
lock monitor1; //a new value will be read
assert A == 10; //loads from memory location A
print Another //#see my question
unlock monitor1;//a new value was just read
}
}
But I wonder is it possible that compiler or cpu will shuffle the things around in a such way that code will print 20? I don't need guarantee.
I.e. by definition operations issued prior to barrier can't be pushed down by compiler, but is it possible that operations issued after barrier would be occasionally seen before barrier? (just a probability)
Thanks
My answer below only addresses Java's memory model. The answer really can't be made for all languages as each may define the rules differently.
But I wonder is it possible that compiler or cpu will shuffle the things around in a such way that code will print 20? I don't need guarantee.
Your answer seems to be "Is it possible for the store of A = 20, be re-ordered above the unlock monitor?"
The answer is yes, it can be. If you look at the JSR 166 Cookbook, the first grid shown explains how re-orderings work.
In your writer case the first operation would be MonitorExit the second operation would be NormalStore. The grid explains, yes this sequence is permitted to be re-ordered.
This is known as Roach Motel ordering, that is, memory accesses can be moved into a synchronized block but cannot be moved out
What about another language? Well, this question is too broad to answer all questions as each may define the rules differently. If this is the case you would need to refine your question.
In Java there is the concept of happens-before. You can read all the details about it on in the Java Specification. A Java compiler or runtime engine can re-order code but it must abide by the happens-before rules. These rules are important for a Java developer that wants to have detailed control on how their code is re-ordered. I myself have been burnt by re-ordering code, turns out I was referencing the same object via two different variables and the runtime engine re-ordered my code not realizing that the operations were on the same object. If I had either a happens-before (between the two operations) or used the same variable, then the re-ordering would not have occurred.
Specifically:
It follows from the above definitions that:
An unlock on a monitor happens-before every subsequent lock on that monitor.
A write to a volatile field (§8.3.1.4) happens-before every subsequent
read of that field.
A call to start() on a thread happens-before any actions in the
started thread.
All actions in a thread happen-before any other thread successfully
returns from a join() on that thread.
The default initialization of any object happens-before any other
actions (other than default-writes) of a program.
Short answer - yes. This is very compiler and CPU architecture dependent. You have here the definition of a Race Condition. The scheduling Quantum won't end mid-instruction (can't have two writes to same location). However - the quantum could end between instructions - plus how they are executed out-of-order in the pipeline is architecture dependent (outside of the monitor block).
Now comes the "it depends" complications. The CPU guarantees little (see race condition). You might also look at NUMA (ccNUMA) - it is a method to scale CPU & Memory access by grouping CPUs (Nodes) with local RAM and a group owner - plus a special bus between Nodes.
The monitor doesn't prevent the other thread from running. It only prevents it from entering the code between the monitors. Therefore when the Writer exits the monitor-section it is free to execute the next statement - regardless of the other thread being inside the monitor. Monitors are gates that block access. Also - the quantum could interrupt the second thread after the A== statement - allowing Another to change value. Again - the quantum won't interrupt mid-instruction. Always think of threads executing in perfect parallel.
How do you apply this? I'm a bit out of date (sorry, C#/Java these days) with current Intel processors - and how their Pipelines work (hyperthreading etc). Years ago I worked with a processor called MIPS - and it had (through compiler instruction ordering) the ability to execute instructions that occurred serially AFTER a Branch instruction (Delay Slot). On this CPU/Compiler combination - YES - what you describe could happen. If Intel offers the same - then yes - it could happen. Esp with the NUMA (both Intel & AMD have this, I'm most familiar with AMD implementation).
My point - if threads were running across NUMA nodes - and access was to the common memory location then it could occur. Of course the OS tries hard to schedule operations within the same node.
You might be able to simulate this. I know C++ on MS allows access to NUMA technology (I've played with it). See if you can allocate memory across two nodes (placing A on one, and Another on the other). Schedule the threads to run on specific Nodes.
What happens in this model is that there are two pathways to RAM. I suppose this isn't what you had in mind - probably only a single path/Node model. In which case I go back to the MIPS model I described above.
I assumed a processor that interrupts - there are others that have a Yield model.

Cache consistency & spawning a thread

Background
I've been reading through various books and articles to learn about processor caches, cache consistency, and memory barriers in the context of concurrent execution. So far though, I have been unable to determine whether a common coding practice of mine is safe in the strictest sense.
Assumptions
The following pseudo-code is executed on a two-processor machine:
int sharedVar = 0;
myThread()
{
print(sharedVar);
}
main()
{
sharedVar = 1;
spawnThread(myThread);
sleep(-1);
}
main() executes on processor 1 (P1), while myThread() executes on P2.
Initially, sharedVar exists in the caches of both P1 and P2 with the initial value of 0 (due to some "warm-up code" that isn't shown above.)
Question
Strictly speaking – preferably without assuming any particular type of CPU – is myThread() guaranteed to print 1?
With my newfound knowledge of processor caches, it seems entirely possible that at the time of the print() statement, P2 may not have received the invalidation request for sharedVar caused by P1's assignment in main(). Therefore, it seems possible that myThread() could print 0.
References
These are the related articles and books I've been reading:
Shared Memory Consistency Models: A Tutorial
Memory Barriers: a Hardware View for Software Hackers
Linux Kernel Memory Barriers
Computer Architecture: A Quantitative Approach
Strictly speaking – preferably without assuming any particular type of CPU – is myThread() guaranteed to print 1?
Theoretically, it can print either 0 or 1, even on x86, since stores can move after loads on almost any architecture.
In practice, it would be hard to make myThread() print 0.
Spawning a thread will most likely function as an implicit store/release memory barrier, since it would probably:
- have at least one instruction along the execution path that results in a memory barrier - interlocked instructions, explicit memory barrier instructions, etc.,
- or the store would simply be retired/drained from the store buffer by the time myThread() is called, since setting up a new thread results in executing many instructions - among them, many stores.
I'll speak only to Java here: myThread() is guaranteed to print 1, due to the happens before definition from the The Java Language Specification (Section 17.4.5).
The write to sharedVar in main() happens before spawning the thread with function myThread() because the variable assignment comes first in program order. Next, spawning a thread happens before any actions in the thread being started. By the transitivity of the definition in Section 17.4.5 (hb(x, y) and hb(y, z) implies hb(x, z)), writing to the variable sharedVar happens before print() reads sharedVar in myThread().
You might also enjoy reading Brian Goetz's article Java theory and practice: Fixing the Java Memory Model, Part 2 covering this subject, as well as his book Java Concurrency in Practice.

Resources