Cache consistency & spawning a thread - multithreading

Background
I've been reading through various books and articles to learn about processor caches, cache consistency, and memory barriers in the context of concurrent execution. So far though, I have been unable to determine whether a common coding practice of mine is safe in the strictest sense.
Assumptions
The following pseudo-code is executed on a two-processor machine:
int sharedVar = 0;

myThread()
{
    print(sharedVar);
}

main()
{
    sharedVar = 1;
    spawnThread(myThread);
    sleep(-1);
}
main() executes on processor 1 (P1), while myThread() executes on P2.
Initially, sharedVar exists in the caches of both P1 and P2 with the value 0 (due to some "warm-up code" that isn't shown above).
Question
Strictly speaking – preferably without assuming any particular type of CPU – is myThread() guaranteed to print 1?
With my newfound knowledge of processor caches, it seems entirely possible that at the time of the print() statement, P2 may not have received the invalidation request for sharedVar caused by P1's assignment in main(). Therefore, it seems possible that myThread() could print 0.
References
These are the related articles and books I've been reading:
Shared Memory Consistency Models: A Tutorial
Memory Barriers: a Hardware View for Software Hackers
Linux Kernel Memory Barriers
Computer Architecture: A Quantitative Approach

Strictly speaking – preferably without assuming any particular type of CPU – is myThread() guaranteed to print 1?
Theoretically, it can print either 0 or 1, even on x86, since stores can be reordered after later loads on almost any architecture.
In practice, it would be hard to make myThread() print 0.
Spawning a thread will most likely function as an implicit store/release memory barrier, since it would probably either:
- have at least one instruction along its execution path that results in a memory barrier (interlocked instructions, explicit memory-barrier instructions, etc.),
- or simply have the store retired/drained from the store buffer by the time myThread() runs, since setting up a new thread executes many instructions, among them many stores (see the C++11 sketch below).
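
For comparison, here is the pseudo-code rendered in C++11 (a sketch only; the question itself is language-agnostic). In standard C++, the completion of the std::thread constructor synchronizes with the start of the new thread, so the spawned thread is guaranteed to observe the earlier store:

#include <cstdio>
#include <thread>

int sharedVar = 0;

void myThread() {
    // The std::thread constructor in main() synchronizes with the start of
    // this thread, so this is guaranteed to print 1 in standard C++.
    std::printf("%d\n", sharedVar);
}

int main() {
    sharedVar = 1;
    std::thread t(myThread);  // acts as the release point for the store above
    t.join();                 // stand-in for the original sleep(-1)
}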

I'll speak only to Java here: myThread() is guaranteed to print 1, due to the happens-before definition in the Java Language Specification (Section 17.4.5).
The write to sharedVar in main() happens-before the spawning of the thread that runs myThread(), because the variable assignment comes first in program order. Next, spawning a thread happens-before any actions in the thread being started. By the transitivity of the definition in Section 17.4.5 (hb(x, y) and hb(y, z) implies hb(x, z)), the write to sharedVar happens-before print() reads sharedVar in myThread().
You might also enjoy reading Brian Goetz's article Java theory and practice: Fixing the Java Memory Model, Part 2 covering this subject, as well as his book Java Concurrency in Practice.

Related

Does this code need synchronization?

I plan on writing a multithreaded part of my game project:
Thread A: loads a bunch of objects from disk, which takes up to several seconds. Each object loaded increments a counter.
Thread B: a game loop, in which I either display a loading screen with the number of loaded objects, or start to manipulate the objects once loading is done.
In code I believe it will look as follows:
Counter = 0;
Objects;

THREAD A:
for (i = 0; i < ObjectsToLoad; ++i) {
    Objects.push(LoadObject());
    ++Counter;
}
return;

THREAD B:
...
while (true) {
    ...
    C = Counter;
    if (C < ObjectsToLoad)
        RenderLoadscreen(C);
    else
        WorkWithObjects(Objects);
    ...
}
...
Technically, this can be counted as a race condition: an object may already be loaded while the counter is not yet incremented, so B reads an old value. I also need to cache the counter in B so its value won't change between the check and the rendering.
Now the question is: should I add any synchronization mechanism here, like making the counter atomic or introducing a mutex or condition variable? The point is that I can safely sacrifice an iteration of the loop until the counter changes. And from what I understand, as long as A only writes the value and B only reads it, everything is fine.
I've been discussing this question with a friend, but we couldn't reach an agreement, so we decided to ask for the opinion of someone more competent in multithreading. The language is C++, if it helps.
You have to consider memory visibility / caching. Without memory barriers, this can very well lead to delays of several seconds until the data becomes visible to Thread B (1).
This applies to both kinds of data: the Counter and the Objects list.
The C++11 standard (2) guarantees that multithreaded programs are executed correctly only if you don't introduce data races. Without synchronization your program basically has undefined behaviour (3). However, in practice it might still appear to work.
Yes, use a mutex and synchronize access to Counter and Objects.
(1) This is because each CPU core has its own registers and caches. If you don't tell core A that some other core B might be interested in the data, it can apply optimizations such as leaving the data in a register. Core A has to write the data to a higher-level memory region (L2/L3 cache or RAM) so core B can load the changes.
(2) Versions before C++11 did not address multithreading at all. There was support for mutexes, atomics, etc. through third-party libraries, but the language itself was thread-agnostic.
See: C++11 introduced a standardized memory model. What does it mean? And how is it going to affect C++ programming?
(3) The problem is that your code can be reordered (for more efficient execution) at different stages: by the compiler, the assembler, and the CPU. You must tell the computer which instructions need to stay in that order by adding memory barriers through atomics or mutexes. This works the same way in most languages.
I'd recommend watching these very interesting videos about the C++11 memory model:
atomic<> weapons by Herb Sutter
IMO: if you identify data that is accessed by multiple threads, use synchronization. Multithreading bugs are hard to track down and reproduce, so it's better to avoid them altogether.
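
Putting the advice above into code: a minimal C++11 sketch of the loader/renderer pattern using std::atomic with release/acquire ordering (one of the options mentioned in footnote 3; the names follow the question, and the stubs are placeholders for the real game code):

#include <atomic>
#include <vector>

// Placeholder stubs standing in for the question's game code.
struct Object {};
Object LoadObject() { return {}; }
void RenderLoadscreen(int loaded) { (void)loaded; }
void WorkWithObjects(std::vector<Object>& objects) { (void)objects; }

constexpr int ObjectsToLoad = 100;
std::vector<Object> Objects;      // written only by thread A
std::atomic<int> Counter{0};      // shared progress indicator

void threadA() {
    Objects.reserve(ObjectsToLoad);   // avoid reallocation while B is running
    for (int i = 0; i < ObjectsToLoad; ++i) {
        Objects.push_back(LoadObject());
        // Release store: everything written above becomes visible to a
        // thread that reads this counter value with an acquire load.
        Counter.store(i + 1, std::memory_order_release);
    }
}

void threadB() {
    for (;;) {
        int c = Counter.load(std::memory_order_acquire);
        if (c < ObjectsToLoad) {
            RenderLoadscreen(c);
        } else {
            WorkWithObjects(Objects); // all of A's loads happened-before this point
            break;
        }
    }
}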
A race condition typically arises only when two threads try to non-atomically read-modify-write the same datum concurrently. In this case, only one thread writes (thread A), while the other thread reads (thread B).
The only "incorrectness" you'll encounter is, as you said, if the object has been loaded but the counter hasn't been incremented. This causes B to read stale data, as the load-and-increment operation was not executed atomically.
If you don't mind this innocent anomaly, then it works just fine. :)
If this annoys you, then you need to execute all of the load-and-increment statements in one go (by using locks or any other synchronization primitive).

Out-of-order execution and reordering: can I see something issued after a barrier before the barrier?

According to wikipedia: A memory barrier, also known as a membar, memory fence or fence instruction, is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction. This typically means that operations issued prior to the barrier are guaranteed to be performed before operations issued after the barrier.
Usually, articles talk about something like this (I will use monitors instead of membars):
class ReadWriteExample {

    int A = 0;
    int Another = 0;

    // thread1 runs this method
    void writer() {
        lock monitor1;    // a new value will be stored
        A = 10;           // stores 10 to memory location A
        unlock monitor1;  // a new value is ready for the reader to read
        Another = 20;     // #see my question
    }

    // thread2 runs this method
    void reader() {
        lock monitor1;    // a new value will be read
        assert A == 10;   // loads from memory location A
        print Another;    // #see my question
        unlock monitor1;  // a new value was just read
    }
}
But I wonder: is it possible that the compiler or the CPU will shuffle things around in such a way that the code prints 20? I don't need a guarantee.
I.e., by definition, operations issued prior to the barrier can't be pushed down past it by the compiler, but is it possible that operations issued after the barrier would occasionally be seen before it? (Just as a possibility.)
Thanks
My answer below only addresses Java's memory model. The answer really can't be given for all languages, as each may define the rules differently.
But I wonder is it possible that compiler or cpu will shuffle the things around in a such way that code will print 20? I don't need guarantee.
Your question seems to be: "Is it possible for the store Another = 20 to be re-ordered above the monitor unlock?"
The answer is yes, it can be. If you look at the JSR-133 Cookbook, the first grid shown explains how re-orderings work.
In your writer case, the first operation would be a MonitorExit and the second a NormalStore. The grid shows that, yes, this sequence is permitted to be re-ordered.
This is known as roach-motel ordering: memory accesses can be moved into a synchronized block, but they cannot be moved out of it.
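
The same one-way ("roach motel") behaviour can be illustrated in C++ as well. A sketch of the writer/reader pair with std::mutex (not from the question; note that a plain int for Another would make the program a data race and thus undefined behaviour in C++, so the sketch uses a relaxed atomic to ask the same question legally):

#include <atomic>
#include <cstdio>
#include <mutex>

std::mutex monitor1;
int A = 0;
std::atomic<int> Another{0};   // relaxed atomic: carries no ordering of its own

void writer() {                              // thread 1
    {
        std::lock_guard<std::mutex> g(monitor1);
        A = 10;
    }                                        // unlock: a release (one-way) barrier
    Another.store(20, std::memory_order_relaxed);
}

void reader() {                              // thread 2
    std::lock_guard<std::mutex> g(monitor1);
    // Assume, as in the question, that the writer has already left its
    // critical section, so A is 10 here.
    std::printf("%d\n", Another.load(std::memory_order_relaxed));
    // Printing 20 is allowed: the unlock in writer() only prevents earlier
    // accesses from moving down past it, not later accesses from moving up.
}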
What about other languages? Well, the question is too broad to answer for every language, as each may define the rules differently. If that is what you are asking, you would need to refine your question.
In Java there is the concept of happens-before. You can read all the details about it in the Java Language Specification. A Java compiler or runtime engine can re-order code, but it must abide by the happens-before rules. These rules are important for a Java developer who wants detailed control over how their code is re-ordered. I myself have been burnt by re-ordering: it turned out I was referencing the same object via two different variables, and the runtime engine re-ordered my code, not realizing that the operations were on the same object. If I had had either a happens-before relationship between the two operations, or had used the same variable, the re-ordering would not have occurred.
Specifically:
It follows from the above definitions that:
- An unlock on a monitor happens-before every subsequent lock on that monitor.
- A write to a volatile field (§8.3.1.4) happens-before every subsequent read of that field.
- A call to start() on a thread happens-before any actions in the started thread.
- All actions in a thread happen-before any other thread successfully returns from a join() on that thread.
- The default initialization of any object happens-before any other actions (other than default-writes) of a program.
Short answer: yes. This is very compiler- and CPU-architecture dependent. What you have here is the definition of a race condition. The scheduling quantum won't end mid-instruction (you can't have two writes to the same location at once). However, the quantum could end between instructions, and how they are executed out of order in the pipeline is architecture dependent (outside of the monitor block).
Now come the "it depends" complications. The CPU guarantees little (see race condition). You might also look at NUMA (ccNUMA): it is a method to scale CPU and memory access by grouping CPUs (nodes) with local RAM and a group owner, plus a special bus between nodes.
The monitor doesn't prevent the other thread from running. It only prevents it from entering the code between the monitors. Therefore, when the writer exits the monitor section it is free to execute the next statement, regardless of whether the other thread is inside the monitor. Monitors are gates that block access. Also, the quantum could interrupt the second thread after the assert A == 10 statement, allowing Another to change value. Again, the quantum won't interrupt mid-instruction. Always think of the threads as executing in perfect parallel.
How do you apply this? I'm a bit out of date with current Intel processors (sorry, C#/Java these days) and how their pipelines work (hyperthreading etc.). Years ago I worked with a processor called MIPS, and it had (through compiler instruction ordering) the ability to execute instructions that occurred serially AFTER a branch instruction (the delay slot). On that CPU/compiler combination, yes, what you describe could happen. If Intel offers the same, then yes, it could happen, especially with NUMA (both Intel and AMD have this; I'm most familiar with the AMD implementation).
My point: if the threads were running across NUMA nodes, and the access was to a common memory location, then it could occur. Of course the OS tries hard to schedule operations within the same node.
You might be able to simulate this. I know C++ on MS allows access to NUMA technology (I've played with it). See if you can allocate memory across two nodes (placing A on one, and Another on the other), and schedule the threads to run on specific nodes.
What happens in this model is that there are two pathways to RAM. I suppose this isn't what you had in mind; you probably meant a single-path/single-node model, in which case I go back to the MIPS model I described above.
I assumed a processor that interrupts; there are others that have a yield model.

Thread visibility among one process

I've been reading the book Cracking the Coding Interview recently, but there's one paragraph on page 257 that confuses me a lot:
A thread is a particular execution path of a process; when one thread modifies a process resource, the change is immediately visible to sibling threads.
IIRC, if one thread makes a change to a variable, the change will first be saved in the CPU cache (say, the L1 cache), and is not guaranteed to be synchronized to other threads unless the variable is declared volatile.
Am I right?
Nope, you're wrong. But this is a very common misunderstanding.
Every modern multi-core CPU has hardware cache coherence. The L1 and similar caches are invisible to the programmer. CPU caches like the L1 cache have nothing to do with memory visibility.
Changes are visible immediately when a thread modifies a process resource. The issue is optimizations that cause process resources not to be modified in precisely the order the code specifies.
If your code has k = j; i = 4; if (j == 2) foo(); an optimizer might see that your first assignment reads the value of j. So it might not bother reading it again when you compare it to 2 since it "knows" that it can't have changed. However, another thread might have changed it. So optimizations of some kinds need to be disabled when synchronization between threads is required. That's what things like volatile do.
If compilers and CPUs made no optimizations and executed a program precisely as it was written, volatile would never be needed. Memory visibility is about optimizations in code (some done by the compiler, some by the CPU), not caches.
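
A minimal C++ sketch of the kind of optimization being described (hypothetical code; in Java, the analogous tool for the second loop would be a volatile field):

#include <atomic>

bool plainFlag = false;                // plain variable: no visibility guarantee
std::atomic<bool> atomicFlag{false};   // atomic: loads cannot be optimized away

void waitPlain() {
    // Broken on purpose: the compiler may read plainFlag once, "know" that
    // nothing in this loop changes it, and turn this into while (true) {}.
    while (!plainFlag) { }
}

void waitAtomic() {
    // Each iteration performs a real load, so a store from another thread
    // (atomicFlag.store(true)) will eventually be observed.
    while (!atomicFlag.load(std::memory_order_acquire)) { }
}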
I think the text you are quoting is incorrect. The whole idea of the Java Memory Model is to deal with the complex optimizations made by modern software and hardware, so that programmers can determine which writes are visible to the respective reads in other threads.
Unless a program in Java is properly synchronized, you can't guarantee that changes by one thread are immediately visible to other threads. Maybe the text refers to a very specific (and weak) memory model.
Usage of volatile variables is just one way to synchronize threads, and it's not suitable for all scenarios.
--Edit--
I think I understand the confusion now... I agree with David Schwartz, assuming that:
1) "modifies a process resource" means the actual change of the resource, not just the execution of a write instruction written in some high level computer language.
2) "is immediately visible to sibling threads" means that other threads are able to see it; it doesn't mean that a thread in your program will necessarily see it. You may still need to use synchronization tools in order to disable optimizations that bypass the actual access to the resource.

How is locking implemented?

I have the following code:
while (lock)
    ;
lock = 1;
// critical section
lock = 0;
As reading or changing the lock value is in itself a multi-instruction sequence:
- read the lock
- change the value
- write it back
it can happen like this:
1) One thread reads the lock and stops there.
2) Another thread reads it, sees it is free, locks it, and gets halfway through its work.
3) The first thread wakes up and goes into the critical section too.
So how would locking be implemented in the system?
Placing variables on top of other variables is not right: it would be like guarding the guard?
Stopping the other processors' threads is also not right?
It is 100% platform specific. Generally, the CPU provides some form of atomic operation such as exchange or compare-and-swap. A typical lock might work like this (a C++11 sketch follows the list):
1) Create: Store 0 (unlocked) in the variable.
2) Lock: Atomically attempt to switch the value of the variable from 0 (unlocked) to 1 (locked). If we failed (because it wasn't unlocked to begin with), let the CPU rest a bit, and then retry. Use a memory barrier to ensure no future memory operations sneak behind this one.
3) Unlock: Use a memory barrier to ensure previous memory operations don't sneak past this one. Atomically write 0 (unlocked) to the variable.
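
These three steps map almost directly onto C++11 atomics. A minimal test-and-set spinlock sketch (illustrative only; as noted below, in real code you should use the primitives your platform provides):

#include <atomic>
#include <thread>

class SpinLock {
    std::atomic<int> state{0};   // 0 = unlocked, 1 = locked

public:
    void lock() {
        int expected = 0;
        // Atomically switch 0 -> 1; acquire ordering keeps later memory
        // operations from moving above the successful exchange.
        while (!state.compare_exchange_weak(expected, 1,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed)) {
            expected = 0;
            std::this_thread::yield();   // "let the CPU rest a bit", then retry
        }
    }

    void unlock() {
        // Release ordering keeps earlier memory operations from moving
        // below the store that publishes "unlocked".
        state.store(0, std::memory_order_release);
    }
};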
Note that you really don't need to understand this unless you want to design your own synchronization primitives. And if you want to do that, you need to understand an awful lot more. It's certainly a good idea for every programmer to have a general idea of what he's making the hardware do. But this is an area filled with seriously heavy wizardry. There are so many, many ways this can go horribly wrong. So just use the locking primitives provided by the geniuses who made your platform, compiler, and threading library. Here be dragons.
For example, SMP Pentium Pro systems have an erratum that requires special handling in the unlock operation. A naive implementation of the lock algorithm will cause the branch prediction logic to expect the operation to keep spinning, incurring a massive performance penalty at the worst possible time -- when you first acquire the lock. A naive implementation of the lock algorithm may cause two cores each waiting for the same lock to saturate the bus, slowing the CPU that needs to get work done in order to release the lock to a crawl. These all require heavy wizardry and deep understanding of the hardware to deal with.
In a course I studied at uni, a possible firmware solution for implementing locks was presented in the form of an "atomicity bit" associated with each memory operation initiated by a processor.
Basically, when locking, you'll notice that you have a sequence of operations that need to be executed atomically: test the value of the flag and, if it is not set, set it to locked; otherwise, try again. This sequence can be made atomic by associating a bit with each memory request sent by the CPU. The first N-1 operations will have the bit set, while the last one will have it unset, to mark the end of the atomic sequence.
When the memory module (there can be several modules) where the flag is stored receives the request for the first operation in the sequence (whose bit is set), it will serve it and not take requests from any other CPU until the CPU that initiated the atomic sequence sends a request with an unset atomicity bit (since these transactions are usually short, a coarse-grained approach like this is acceptable). Note that this is usually made easier by the architecture providing specialized "compare-and-set" instructions that do exactly what I described above.

Atomic Instructions and Variable Update visibility

On most common platforms (the most important being x86; I understand that some platforms have extremely difficult memory models that provide almost no guarantees useful for multithreading, but I don't care about rare counter-examples), is the following code safe?
Thread 1:
    someVariable = doStuff();
    atomicSet(stuffDoneFlag, 1);

Thread 2:
    while (!atomicRead(stuffDoneFlag)) {}  // Wait for stuffDoneFlag to be set.
    doMoreStuff(someVariable);
Assuming standard, reasonable implementations of atomic ops:
Is Thread 1's assignment to someVariable guaranteed to complete before atomicSet() is called?
Is Thread 2 guaranteed to see the assignment to someVariable before calling doMoreStuff() provided it reads stuffDoneFlag atomically?
Edits:
The implementation of atomic ops I'm using contains the x86 LOCK instruction in each operation, if that helps.
Assume stuffDoneFlag is properly cleared somehow. How isn't important.
This is a very simplified example. I created it this way so that you wouldn't have to understand the whole context of the problem to answer it. I know it's not efficient.
If your actual x86 code has the store to someVariable before the store in atomicSet in Thread 1 and load of someVariable after the load in atomicRead in Thread 2, then you should be fine. Intel's Software Developer's Manual Volume 3A specifies the memory model for x86 in Section 8.2, and the intra-thread store-store and load-load constraints should be enough here.
However, there may not be anything preventing your compiler from reordering the instructions generated from whatever higher-level language you are using across the atomic operations.
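
One way to make both the hardware and the compiler ordering explicit in portable C++11 (a sketch; atomicSet and atomicRead in the question are assumed to be wrappers around comparable operations, and the stubs are placeholders):

#include <atomic>

int someVariable = 0;
std::atomic<int> stuffDoneFlag{0};

int doStuff() { return 42; }           // placeholder
void doMoreStuff(int) {}               // placeholder

void thread1() {
    someVariable = doStuff();
    // Release store: the write to someVariable cannot be reordered after it,
    // by either the compiler or the CPU.
    stuffDoneFlag.store(1, std::memory_order_release);
}

void thread2() {
    // Acquire load: reads that follow it cannot be reordered before it.
    while (!stuffDoneFlag.load(std::memory_order_acquire)) { }
    doMoreStuff(someVariable);         // guaranteed to see thread1's value
}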
1) Yes.
2) Yes.
Both work.
This code looks thread safe, but I question the efficiency of your spinlock (the while loop) unless you are only spinning for a very short amount of time. There is no guarantee on any given system that Thread 2 won't completely hog all processing time.
I would recommend using some actual synchronization primitives (looks like boost::condition_variable is what you want here) instead of relying on the spin lock.
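
For reference, a sketch of the same handshake with standard C++11 primitives (std::condition_variable rather than Boost's; the names mirror the question, and the 42 is a stand-in for doStuff()):

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool stuffDone = false;
int someVariable = 0;

void thread1() {
    someVariable = 42;                         // the result of doStuff()
    {
        std::lock_guard<std::mutex> lk(m);
        stuffDone = true;                      // publish the flag under the lock
    }
    cv.notify_one();
}

void thread2() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return stuffDone; });     // sleeps instead of spinning
    // someVariable is now guaranteed to be visible; safe to use it here.
}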
The atomic instructions ensure that thread 2 waits for thread 1 to finish setting the variable before thread 2 proceeds. There are, however, two key issues:
1) someVariable must be declared 'volatile' to ensure that the compiler does not optimise its allocation, e.g. by keeping it in a register or deferring a write.
2) the second thread is busy-waiting for the signal (termed spinlocking). Your platform probably provides much better locking and signalling primitives and mechanisms, but a relatively straightforward improvement would be to simply sleep() in thread 2's while() body.
dsimcha wrote: "Assume stuffDoneFlag is properly cleared somehow. How isn't important."
This is not true!
Let's look at a scenario:
Thread 2 checks stuffDoneFlag; if it's 1, it starts reading someVariable.
Before thread 2 finishes reading, the task scheduler interrupts its task and suspends it for some time.
Thread 1 accesses someVariable again and changes the memory contents.
The task scheduler switches thread 2 back on and it continues the job, but the memory contents of someVariable have changed!

Resources