What does "reads before reads" mean in memory ordering? - multithreading

Let's consider this sentence (Total Store Ordering):
reads are ordered before reads, writes before writes, and reads before writes, but not writes before reads.
I think I almost get the basics:
Each thread has its own program order (code as it is written)
In general, CPU may reorder instructions and we must constrain it to exclude incorrect orderings
CPU may also reorder memory loads and stores and we must constrain those as well
Current hardware implementation has "serializing instructions" like mfence which are invoked by all threads to address both of the problems.
Hardware typically allows only one dirty cache, so it is all about flushing that cache:
Storing thread flushes dirty cache
Loading thread requests and blocks until there is no dirty cache
Kernel developers care about devices other than CPU accessing memory but I don't.
Yet I still fail to understand what "reads before reads" really means. It probably means that there are implicit barriers and serializing instructions in those architectures, but I can't really tell.

I am so sure I have heard that in the OS course at my uni in Greece, damn, I read it with the voice of the prof. :) Since nobody answered, I will attempt to answer.
Imagine you are the OS, and every thread/program wants to perform reads and writes. Now, since we are talking about multithreading, a thread may read something another thread has written, like the value of a variable.
Now if thread 1 wants to perform a read of a memory address x, and thread 2 wants to perform a read x too, it's OK to allow them to read x in any order. That's what it means, I think!
Hope it helps somehow, since I know it's not the best answer one could give! :/

Related

Is synchronization for a variable change cheaper than for something else?

In a multi-threading environment, isn’t it that every operation on the RAM must be synchronized?
Let’s say, I have a variable, which is a pointer to another memory address:
foo 12345678
Now, if one thread sets that variable to another memory address (let’s say 89ABCDEF), meanwhile the first thread reads the variable, couldn’t it be that the first thread reads totally trash from the variable if access wouldn’t be synchronized (on some system level)?
foo 12345678 (before)
89ABCDEF (new data)
••••• (writing thread progress)
89ABC678 (memory content)
Since I have never seen those things happen, I assume that there is some system-level synchronization when writing variables. I assume that this is why it is called an ‘atomic’ operation. As I found here, this problem is actually a real topic and not totally fictitious on my part.
On the other hand, I read everywhere that synchronizing has a significant impact on performance. (Aside from threads that must wait bc. they cannot enter the lock; I mean just the action of locking and unlocking.) Like here:
synchronized adds a significant overhead to the methods […]. These operations are quite expensive […] it has an extreme impact on the program performance. […] the expensive synchronized operations that cause the code to be so terribly slow.
How does this go together? Why is locking for changing a variable unnoticeable fast, but locking for anything else so expensive? Or, is it equally expensive, and there should be a big warning sign when using—let’s say—long and double because they always implicitly require synchronization?
Concerning your first point: when a processor writes some data to memory, this data is always properly written and cannot be "trashed" by other writes from threads, processes, the OS, etc. It is not a matter of synchronization, just something required to ensure proper hardware behaviour.
Synchronization is a software concept that requires hardware support. Assume that you just want to acquire a lock. It is supposed to be free when at 0 and locked when at 1.
The basic method to do that is
got_the_lock = 0
while (!got_the_lock)
    fetch lock value from memory
    set lock value in memory to 1
    got_the_lock = (fetched value from memory == 0)
done
print "I got the lock!!"
The problem is that if other threads do the same thing at the same time and read lock value before it has been set to 1, several threads may think they got the lock.
To avoid that, one needs atomic memory access. An atomic access is typically a read-modify-write cycle on data in memory that cannot be interrupted and that forbids any other access to this data until completion. So not all accesses are atomic, only specific read-modify-write operations, and this is realized thanks to specific processor support (see test-and-set or fetch-and-add instructions, for instance). Most accesses do not need it and can be regular accesses. Atomic access is mostly used to synchronize threads, to ensure that only one thread is in a critical section.
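As a rough illustration of the test-and-set idea, here is a minimal C++ sketch of a spinlock built on std::atomic_flag. The class name SpinLock and the helper count_with_spinlock are made up for this example; it is not how any particular OS or library implements its locks.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Minimal spinlock: test_and_set atomically sets the flag and returns its old value,
// so the "fetch" and the "set to 1" can no longer be separated by another thread.
class SpinLock {
public:
    void lock() {
        // Keep trying while the previous value was "locked".
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // busy-wait
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Hypothetical helper: several threads increment a plain counter under the lock.
long count_with_spinlock(int threads, int iterations) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;  // critical section
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

With the lock in place no increment is lost, so count_with_spinlock(4, 10000) returns 40000; with the naive fetch-then-set loop, two threads could both believe they hold the lock.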
So why are atomic accesses expensive? There are several reasons.
The first one is that one must ensure a proper ordering of instructions. You probably know that the execution order of instructions may differ from program order, provided the semantics of the program are respected. This is heavily exploited to improve performance: compilers reorder instructions, processors execute them out of order, write-back caches write data to memory in any order, and memory write buffers do the same thing. This reordering can lead to improper behavior.
1 while (x--) ; // random and silly loop
2 f(y);
3 while(test_and_set(important_lock)) ; //spinlock to get a lock
4 g(z);
Obviously instruction 1 is not constraining and 2 can be executed before it (and 1 will probably be removed by an optimizing compiler). But if 4 is executed before 3, the behavior will not be as expected.
To avoid that, an atomic access flushes the instruction and memory buffers, which requires tens of cycles (see memory barrier).
Without pipelining, you pay the full latency of the operation: read data from memory, modify it and write it back. This latency always happens, but for regular memory accesses you can do other work during that time, which largely hides the latency.
An atomic access can require 100-200 cycles on modern processors, particularly under contention, and is accordingly very expensive.
How does this go together? Why is locking for changing a variable unnoticeable fast, but locking for anything else so expensive? Or, is it equally expensive, and there should be a big warning sign when using—let’s say—long and double because they always implicitly require synchronization?
Regular memory accesses are not atomic in this sense. Only the specific synchronization instructions are expensive.
Synchronization always has a cost involved. And the cost increases with contention: threads wake up and fight for the lock, only one gets it, and the rest go back to sleep, resulting in a lot of context switches.
However, such contention can be kept to a minimum by synchronizing at a much finer granularity, as in a CAS (compare-and-swap) operation by the CPU, or a memory barrier to read a volatile variable. A far better option is to avoid synchronization altogether without compromising safety.
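To show what synchronizing at the granularity of a single CAS can look like, here is a small C++ sketch that keeps a shared maximum up to date without any lock. The function names atomic_store_max and max_of_thread_ids are invented for the example.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Lock-free "store the maximum": retry the CAS until it succeeds
// or until the stored value is already at least as large as ours.
void atomic_store_max(std::atomic<int>& target, int candidate) {
    int observed = target.load(std::memory_order_relaxed);
    while (observed < candidate &&
           !target.compare_exchange_weak(observed, candidate,
                                         std::memory_order_relaxed)) {
        // On failure, compare_exchange_weak refreshes `observed`
        // with the value currently stored in `target`.
    }
}

// Hypothetical driver: each thread proposes its own id as the maximum.
int max_of_thread_ids(int threads) {
    std::atomic<int> maximum{0};
    std::vector<std::thread> pool;
    for (int id = 1; id <= threads; ++id) {
        pool.emplace_back([&maximum, id] { atomic_store_max(maximum, id); });
    }
    for (auto& th : pool) th.join();
    return maximum.load();
}
```

No thread ever sleeps or blocks here; a thread that loses the race simply retries with the fresh value.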
Consider the following code:
synchronized(this) {
// a DB call
}
This block of code may take several seconds to execute because it is doing I/O, and therefore runs a high chance of creating contention among other threads wanting to execute the same block. That duration is enough to build up a massive queue of waiting threads in a busy system.
This is the reason non-blocking algorithms like the Treiber stack and the Michael-Scott queue exist. They do their tasks (which we'd otherwise do using a much larger synchronized block) with the minimum amount of synchronization.
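For reference, a Treiber stack can be sketched in a few lines of C++. This is a simplified version under one loud assumption: popped nodes are never freed or reused (they leak), which sidesteps the ABA and memory-reclamation problems that a production implementation has to solve. The names TreiberStack and push_then_drain are made up for this example.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Sketch of a Treiber stack. ASSUMPTION: nodes are leaked on pop,
// so no node address is ever reused while the stack is alive.
class TreiberStack {
    struct Node {
        int value;
        Node* next;
    };
    std::atomic<Node*> head_{nullptr};

public:
    void push(int value) {
        Node* node = new Node{value, head_.load(std::memory_order_relaxed)};
        // Swing head_ to the new node; on failure node->next is refreshed
        // with the current head and we retry.
        while (!head_.compare_exchange_weak(node->next, node,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
        }
    }

    bool pop(int& out) {
        Node* top = head_.load(std::memory_order_acquire);
        // Swing head_ to the second node; on failure `top` is refreshed.
        while (top != nullptr &&
               !head_.compare_exchange_weak(top, top->next,
                                            std::memory_order_acquire,
                                            std::memory_order_acquire)) {
        }
        if (top == nullptr) return false;
        out = top->value;  // node intentionally leaked, see assumption above
        return true;
    }
};

// Hypothetical driver: several threads push 1s, then one thread sums them up.
long push_then_drain(int threads, int per_thread) {
    TreiberStack stack;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&stack, per_thread] {
            for (int i = 0; i < per_thread; ++i) stack.push(1);
        });
    }
    for (auto& th : pool) th.join();
    long total = 0;
    int value = 0;
    while (stack.pop(value)) total += value;
    return total;
}
```

The only synchronization is one CAS per push or pop, instead of a lock held for the whole operation.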
isn’t it that every operation on the RAM must be synchronized?
No. Most of the "operations on RAM" will target memory locations that are only used by one thread. For example, in most programming languages, none of a thread's function arguments or local variables will be shared with other threads; and often, a thread will use heap objects that it does not share with any other thread.
You need synchronization when two or more threads communicate with one another through shared variables. There are two parts to it:
mutual exclusion
You may need to prevent "race conditions." If some thread T updates a data structure, it may have to put the structure into a temporary, invalid state before the update is complete. You can use mutual exclusion (i.e., mutexes/semaphores/locks/critical sections) to ensure that no other thread U can see the data structure when it is in that temporary, invalid state.
cache consistency
On a computer with more than one CPU, each processor typically has its own memory cache. So, when two different threads running on two different processors both access the same data, they may each be looking at their own, separately cached copy. Thus, when thread T updates that shared data structure, it is important to ensure that all of the variables it updated make it into thread U's cache before thread U is allowed to see any of them.
It would totally defeat the purpose of the separate caches if every write by one processor invalidated every other processor's cache, so there typically are special hardware instructions to do that only when it's needed, and typical mutex/lock implementations execute those instructions on entering or leaving a protected block of code.

Single write - single read big memory buffer sharing without locks

Let's suppose I have a big memory buffer used as a framebuffer, which is constantly written by a thread (or even multiple threads, with the guarantee that no two threads write the same byte concurrently). These writes are nondeterministic in time, scattered through the codebase, and cannot be blocked.
I have another single thread which periodically reads out (copies) the whole buffer to generate a display frame. This read should not be blocked either. Tearing is not a problem in my case. In other words, my only goal is that every change made by the writer thread(s) should eventually appear in the reading thread. The ordering, or some delay (negligible compared to a display refresh rate), does not matter.
Reading and writing the same memory location concurrently is a data race, which results in undefined behavior in C++11, and this article lists some really dreadful examples where the optimizing compiler generates code for a memory read that alters the memory contents in the presence of a data race.
Still, I need some solution that does not require completely redesigning this legacy code. Any advice counts as long as it is safe from a practical standpoint, whether or not it is theoretically correct. I am also open to not-fully-portable solutions.
Apart from the data race, I can easily force the visibility of the buffer changes in the reading thread by establishing a synchronizes-with relation between the threads (acquire-release on an atomic guard variable, used for nothing else), or by adding platform-specific memory fence calls at key points in the writer thread(s).
My ideas to target the data race:
1. Use assembly for the reading thread. I would try to avoid that.
2. Make the memory buffer volatile, thus preventing the compiler from performing the nasty optimizations described in the referenced article.
3. Put the reading thread's code in a separate compilation unit, and compile it with -O0.
+1. Leave everything as is, and cross my fingers (as I currently do not notice any issues) :)
What is the safest from the list above? Do you see a better solution?
FYI, the target platform is ARM (with multiple cores) and x86 (for testing).
(This question makes a previous one, which was a little too generic, more concrete.)
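For readers unfamiliar with the synchronizes-with relation mentioned in the question, here is a minimal C++ sketch of an acquire-release guard variable. The function name publish_and_read and the single int standing in for the buffer are assumptions made for the example. Note that this only orders writes made before the release store; it does not by itself remove the data race for writes that keep happening while the reader copies.

```cpp
#include <atomic>
#include <thread>

// One writer fills a (stand-in) buffer, then publishes it through a guard variable.
int publish_and_read() {
    int pixel = 0;                   // stands in for the non-atomic framebuffer
    std::atomic<bool> ready{false};  // guard variable, used for nothing else

    std::thread writer([&pixel, &ready] {
        pixel = 42;                                    // plain write
        ready.store(true, std::memory_order_release);  // publish everything written before
    });

    // The acquire load that observes `true` synchronizes with the release store,
    // so the write to `pixel` is guaranteed to be visible afterwards.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    int seen = pixel;

    writer.join();
    return seen;
}
```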

Why doesn't the instruction reorder issue occur on a single CPU core?

From this post:
Two threads being timesliced on a single CPU core won't run into a reordering problem. A single core always knows about its own reordering and will properly resolve all its own memory accesses. Multiple cores however operate independently in this regard and thus won't really know about each other's reordering.
Why can't the instruction reorder issue occur on a single CPU core? This article doesn't explain it.
EXAMPLE:
The following pictures are picked from Memory Reordering Caught in the Act:
Below is recorded:
I think the recorded instructions can also cause an issue on a single CPU, because both r1 and r2 aren't 1.
A single core always knows about its own reordering and will properly resolve all its own memory accesses.
A single CPU core does reorder, but it knows its own reordering, and can do clever tricks to pretend it's not. Thus, things go faster, without weird side effects.
Multiple cores however operate independently in this regard and thus won't really know about each other's reordering.
When a CPU reorders, the other CPUs can't compensate for this. Imagine CPU #1 is waiting for a write to variableA, and then reads from variableB. If CPU #2 writes to variableB and then to variableA, as the code says, no problems occur. If CPU #2 reorders and writes to variableA first, then CPU #1 doesn't know and tries to read from variableB before it has a value. This can cause crashes or any "random" behavior. (Intel chips have more magic that makes this not happen.)
Two threads being timesliced on a single CPU core won't run into a reordering problem.
If both threads are on the same CPU, then it doesn't matter which order the writes happen in, because if they're reordered, then they're both in progress, and the CPU won't really switch until both are written, in which case they're safe to read from the other thread.
Example
For the code to have a problem on a single core, it would have to rearrange the two instructions from process 1, be interrupted by process 2, and execute that between the two instructions. But if interrupted between them, it knows it has to abort both of them, since it knows about its own reordering and knows it's in a dangerous state. So it will either do them in order, or do both before switching to process 2, or do neither before switching to process 2. All of these avoid the reordering problem.
There are multiple effects at work, but they are modeled as just one effect. Makes it easier to reason about them. Yes, a modern core already re-orders instructions by itself. But it maintains logical flow between them, if two instructions have an inter-dependency between them then they stay ordered so the logic of the program does not change. Discovering these inter-dependencies and preventing an instruction from being issued too early is the job of the reorder buffer in the execution engine.
This logic is solid and can be relied upon, it would be next to impossible to write a program if that wasn't the case. But that same guarantee cannot be provided by the memory controller. It has the un-enviable job of giving multiple processors access to the same shared memory.
First is the prefetcher, it reads data from memory ahead of time to ensure the data is available by the time a read instruction executes. Ensures the core won't stall waiting for the read to complete. With the problem that, because memory was read early, it might be a stale value that was changed by another core between the time the prefetch was done and the read instruction executes. To an outside observer it looks like the instruction executed early.
And the store buffer, it takes the data of a write instruction and writes it lazily to memory. Later, after the instruction executed. Ensures the core won't stall waiting on the memory bus write cycle to complete. To an outside observer, it just looks like the instruction executed late.
Modeling the effects of the prefetcher and store buffer as instruction reordering effects is very convenient. You can write that down on a piece of paper easily and reason about the side-effects.
To the core itself, the effects of the prefetcher and store buffer are entirely benign and it is oblivious to them. As long as there isn't another core that's also changing memory content. A machine with a single core always has that guarantee.
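The store-buffer effect described above is commonly demonstrated with a two-thread litmus test: each thread stores to one variable and then loads the other. The C++ sketch below is an illustration under my own assumptions (the function name count_both_zero_seq_cst is made up, and the variable names x, y, r1, r2 follow the usual convention rather than any specific source). It uses sequentially consistent atomics, for which the C++ memory model forbids the outcome r1 == 0 && r2 == 0. With plain or relaxed accesses on a multi-core machine, that outcome may show up because each store can still be sitting in its core's store buffer when the other core loads.

```cpp
#include <atomic>
#include <thread>

// Runs the store-then-load litmus test `trials` times with seq_cst atomics
// and counts how often both threads read 0 (which should never happen).
int count_both_zero_seq_cst(int trials) {
    int both_zero = 0;
    for (int i = 0; i < trials; ++i) {
        std::atomic<int> x{0};
        std::atomic<int> y{0};
        int r1 = -1;
        int r2 = -1;

        // Default memory order for store() and load() is memory_order_seq_cst.
        std::thread t1([&] { x.store(1); r1 = y.load(); });
        std::thread t2([&] { y.store(1); r2 = x.load(); });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

On a single core the both-zero outcome cannot appear either way, because the core always sees its own buffered stores.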

What to lock and what to not lock in a multithreaded environment (semaphores and shared memory)

I was implementing some simple Producer/Consumer program that had some semaphores and shared memory. To keep things simple, let's assume that there's just a block of shared memory and a semaphore in my program.
At first, I thought that I only had to treat as critical sections the bits of code that try to write to the shared memory block. But as the shared memory block consists of, let's say, 1024 bytes, I can't read all the data at the same time (it's not an atomic operation), so it is indeed possible that while I'm reading from it, the Producer comes and starts writing to it, so the reader will get half old data, half new data. From this, I can only conclude that I also have to put the shared memory reading logic inside a "semaphore" block.
Now, I have lots of code that looks like this:
if (sharedMemory[0] == '0') { ... }
In this case, I am just looking at a single char in memory. I guess I don't have to worry about putting a semaphore around this, do I?
And what if instead I have something like
if (sharedMemory[0] == '0' && sharedMemory[1] == '1') { ... }
From my perspective, since these are 2 operations, I'd have to consider this a critical section, and thus put a semaphore around it. Am I right?
Thanks!
Technically, on a multicore or multiprocessor system the only things that are atomic are assembly opcodes which are specifically documented as being atomic. Even reading a single byte presents a (quite small) chance that another processor will come along and modify it before you're done reading it, except in some cases that deal with the CPU cache and aligned chunks of memory (Fun thread: http://software.intel.com/en-us/forums/showthread.php?t=76744, Interesting read: http://www.corensic.com/CorensicBlog/tabid/101/EntryId/8/Memory-Consistency-Models.aspx)
You must either use types which internally guarantee atomicity or specifically protect accesses on multithreaded multicore systems.
(The answer may change slightly on IL platforms like .NET and JVMs since they make their own guarantees about what's atomic and what isn't).
Definitely lock around non-atomic operations, and checking two different values counts as a non-atomic operation, although there are tricks you can use to check up to four bytes or more, provided your processor doesn't cache the results. You have to consider how your data is used. But basically, any access to shared memory should have a semaphore around it.
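Here is a small C++ sketch of the two-byte check done under a lock. It uses std::mutex in place of the semaphore from the question, and all names (SharedBlock, write_pair, read_pair, count_torn_reads) are invented for the example. The writer always updates both bytes together, and the reader takes the same lock, so it can never observe one old byte and one new byte.

```cpp
#include <mutex>
#include <thread>
#include <utility>

struct SharedBlock {
    std::mutex guard;        // plays the role of the semaphore
    char data[1024] = {};    // the shared memory block
};

// Producer side: both bytes change as one unit.
void write_pair(SharedBlock& block, char first, char second) {
    std::lock_guard<std::mutex> hold(block.guard);
    block.data[0] = first;
    block.data[1] = second;
}

// Consumer side: both bytes are read under the same lock.
std::pair<char, char> read_pair(SharedBlock& block) {
    std::lock_guard<std::mutex> hold(block.guard);
    return {block.data[0], block.data[1]};
}

// Hypothetical check: the writer alternates between "00" and "11";
// a torn read would show up as "01" or "10".
int count_torn_reads(int iterations) {
    SharedBlock block;
    write_pair(block, '0', '0');
    std::thread producer([&block, iterations] {
        for (int i = 0; i < iterations; ++i) {
            char c = (i % 2 != 0) ? '1' : '0';
            write_pair(block, c, c);
        }
    });
    int torn = 0;
    for (int i = 0; i < iterations; ++i) {
        std::pair<char, char> seen = read_pair(block);
        if (seen.first != seen.second) ++torn;
    }
    producer.join();
    return torn;
}
```

Reading the two bytes with two separate lock acquisitions would reintroduce the problem, which is why read_pair fetches both inside one critical section.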

What's the point of cache coherency?

On CPUs like x86, which provide cache coherency, how is this useful from a practical perspective? I understand that the idea is to make memory updates done on one core immediately visible on all other cores. This is a useful property. However, one can't rely too heavily on it if not writing in assembly language, because the compiler can store variable assignments in registers and never write them to memory. This means that one must still take explicit steps to make sure that stuff done in other threads is visible in the current thread. Therefore, from a practical perspective, what has cache coherency achieved?
The short story is, non-cache-coherent systems are exceptionally difficult to program, especially if you want to maintain efficiency, which is also the main reason even most NUMA systems today are cache-coherent.
If the caches weren't coherent, the "explicit steps" would have to enforce the coherency. Explicit steps are usually things like critical sections/mutexes (e.g. volatile in C/C++ is rarely enough). It's quite hard, if not impossible, for services such as mutexes to keep track of only the memory that has changed and needs to be updated in all the caches. It would probably have to update all the memory, and that is if it could even track which cores have what pieces of that memory in their caches.
Presumably the hardware can do a much better and more efficient job of tracking the memory addresses/ranges that have been changed, and keeping them in sync.
And imagine a process that is running on core 1 and gets preempted. When it gets scheduled again, it is scheduled on core 2.
This would be pretty fatal if the caches weren't coherent, as there might be remnants of the process data in the cache of core 1 which don't exist in core 2's cache. For systems working that way, the OS would have to enforce the cache coherency as threads are scheduled, which would probably be an "update all the memory in caches between all the cores" operation, or perhaps it could track dirty pages with the help of the MMU and only sync the memory pages that have been changed. Again, the hardware likely keeps the caches coherent in a more fine-grained and efficient way.
There are some nuances not covered by the great responses from the other authors.
First off, consider that a CPU doesn't deal with memory byte-by-byte, but with cache lines. A line might have 64 bytes. Now, if I allocate a 2-byte piece of memory at location P, and another CPU allocates an 8-byte piece of memory at location P + 8, and both P and P + 8 live on the same cache line, observe that without cache coherence the two CPUs can't concurrently update P and P + 8 without clobbering each other's changes! Because each CPU does read-modify-write on the cache line, they might both write out a copy of the line that doesn't include the other CPU's changes! The last writer would win, and one of your modifications to memory would have "disappeared"!
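To make the P / P + 8 scenario concrete, here is a small C++ sketch. The struct and function names are made up, and the 64-byte line size is an assumption. Two threads each update their own field of a struct that fits in one 64-byte-aligned block. In C++ the two fields are distinct memory locations, so this is not a data race, and on cache-coherent hardware neither thread's updates are lost even though both fields share a line.

```cpp
#include <cstdint>
#include <thread>

// ASSUMPTION: 64-byte cache lines. alignas(64) keeps both fields in one line.
struct alignas(64) Neighbors {
    std::uint16_t small = 0;  // the 2-byte object at offset 0 ("P")
    std::uint64_t large = 0;  // the 8-byte object at offset 8 ("P + 8")
};

// Each thread touches only its own field; no locks or atomics are needed
// because the fields are separate memory locations.
Neighbors update_concurrently(int iterations) {
    Neighbors n;
    std::thread a([&n, iterations] {
        for (int i = 0; i < iterations; ++i) {
            n.small = static_cast<std::uint16_t>(n.small + 1);
        }
    });
    std::thread b([&n, iterations] {
        for (int i = 0; i < iterations; ++i) {
            n.large += 1;
        }
    });
    a.join();
    b.join();
    return n;
}
```

Sharing a line this way is correct but can be slow, since the line bounces between the two cores' caches; that performance effect is usually called false sharing.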
The other thing to bear in mind is the distinction between coherency and consistency. Because even x86 derived CPUs use store buffers, there aren't the guarantees you might expect that instructions that have already finished have modified memory in such a way that other CPUs can see those modifications, even if the compiler has decided to write the value back to memory (maybe because of volatile?). Instead the mods may be sitting around in store buffers. Pretty much all CPUs in general use are cache coherent, but very few CPUs have a consistency model that is as forgiving as the x86's. Check out, for example, http://www.cs.nmsu.edu/~pfeiffer/classes/573/notes/consistency.html for more information on this topic.
Hope this helps, and BTW, I work at Corensic, a company that's building a concurrency debugger that you may want to check out. It helps pick up the pieces when assumptions about concurrency, coherence, and consistency prove unfounded :)
Imagine you do this:
lock(); //some synchronization primitive e.g. a semaphore/mutex
globalint = somevalue;
unlock();
If there were no cache coherence, that last unlock() would have to ensure that globalint is now visible everywhere; with cache coherence, all you need to do is write it to memory and let the hardware do the magic. A software solution would have to keep track of which memory exists in which caches, on which cores, and somehow make sure they're atomically in sync.
You'd win an award if you could find a software solution that keeps track of all the pieces of memory in the caches that need to be kept in sync and that is more efficient than a current hardware solution.
Cache coherency becomes extremely important when you are dealing with multiple threads and are accessing the same variable from multiple threads. In that particular case, you have to ensure that all processors/cores do see the same value if they access the variable at the same time, otherwise you'll have wonderfully non-deterministic behaviour.
It's not needed for locking. The locking code would include cache flushing if that was needed. It's mainly needed to ensure that concurrent updates by different processors to different variables in the same cache line aren't lost.
Cache coherency is implemented in hardware so that the programmer doesn't have to worry about making sure all threads see the latest value of a memory location while operating in a multicore/multiprocessor environment. Cache coherence gives the abstraction that all cores/processors are operating on a single unified cache, though every core/processor has its own individual cache.
It also makes sure that legacy multi-threaded code works as is on new processor models/multiprocessor systems, without any code changes to ensure data consistency.

Resources