I was wondering if this scenario is possible, or does the CPU make any guarantees that this won't happen at such a low level:
Say there is a value that is misaligned and requires 2 fetches to get the whole value (a 32-bit value misaligned on a 32-bit system). Both threads are executing only one instruction each: thread 1 a mov that reads from memory, and thread 2 an atomic mov that writes to memory.
Thread 1 fetches first half of Value
Thread 2 atomically writes to Value
Thread 1 fetches second half of Value
So now Thread 1 will contain two halves of different values.
Is this scenario possible, or does the CPU make any guarantees that this won't happen?
Here's my non-expert answer...
It is rather complex to make misaligned accesses atomic, so I believe most architectures don't give any sort of guarantee of atomicity in this case. (I don't know of any architecture that can do atomic misaligned fetches, but they just might exist. I don't know enough to tell.)
It's even complex just to fetch misaligned data, so architectures that want to keep things really simple don't even allow misaligned access (for example, the old, very "RISC-y" Alpha architecture).
It might be possible on some architectures to do it by somehow locking (or protecting, see below) two cache lines simultaneously in a local cache, but such things are AFAIK usually not available in 'user-land', i.e. non-OS threads.
See http://en.wikipedia.org/wiki/Load-link/store-conditional for the modern way to achieve load-store atomicity for one (aligned) word (i.e. NOT a tear-free read of two misaligned areas). Now if a thread were somehow allowed to issue two connected (atomic) 'protect' instructions like that, I suppose it could be done, but then again, that would be complex. I don't know if that exists on any CPU.
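To make the retry-loop idea concrete, here is a rough C++ sketch of mine (not from the original answer): compare_exchange_weak is typically compiled to an LL/SC loop (e.g. LDREX/STREX) on ARM or POWER, and to lock cmpxchg on x86, and either way it only covers one aligned word.

#include <atomic>

// Sketch: atomically add 1 to an aligned word using a CAS / LL-SC retry loop.
// Note that this gives atomicity for ONE aligned word only; it does not give
// a tear-free read of a value that straddles two words or cache lines.
int increment(std::atomic<int>& counter) {
    int expected = counter.load();
    // compare_exchange_weak may fail spuriously (like a failed store-conditional),
    // in which case 'expected' is refreshed with the current value and we retry.
    while (!counter.compare_exchange_weak(expected, expected + 1)) {
        // retry
    }
    return expected + 1;   // the value we successfully wrote
}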
Motive:
I am just learning the fundamentals of multithreading and am not close to finishing them, but I'd like to ask a question this early in my learning journey to guide me toward the topics most relevant to the project I'm working on.
Main:
a. If a process has two threads, one that edits a set of variables and the other that only reads said variables and never edits their values, do we need any sort of synchronization to guarantee the validity of the values read by the reading thread?
b. Is it possible for the OS scheduling of these two threads to cause the reading thread to read a variable in a memory location at the exact same moment the writing thread is writing into that same memory location, or is that a hardware/bus situation that will never be allowed to happen and that a software designer should never care about? What if the variable is a large struct instead of a little int or char?
a. If a process has two threads, one that edits a set of variables and the other that only reads said variables and never edits their values, do we need any sort of synchronization to guarantee the validity of the values read by the reading thread?
In general, yes. Otherwise, the thread editing the value could change it only locally, so the other thread would never see the value change. This can happen because of compilers (which may keep variables in registers for reads/stores) but also because of the hardware (depending on the cache coherence mechanism used on the target platform). Generally, locks, atomic variables and memory barriers are used to perform such synchronization.
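As a minimal C++ sketch of my own (names are mine, not from the question): with an atomic flag the reader is guaranteed to eventually see the writer's update, whereas with a plain bool the compiler could keep the value in a register and the loop might never exit (and the access would be a data race).

#include <atomic>
#include <thread>

std::atomic<bool> stop{false};   // with a plain 'bool' the compiler may cache the
                                 // value in a register and the loop below may never exit

void worker() {
    while (!stop.load()) {
        // do some work
    }
}

int main() {
    std::thread t(worker);
    stop.store(true);   // the atomic store is guaranteed to become visible to 'worker'
    t.join();
}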
b. Is it possible for the OS scheduling of these two threads to cause the reading thread to read a variable in a memory location at the exact same moment the writing thread is writing into that same memory location, or is that a hardware/bus situation that will never be allowed to happen and that a software designer should never care about? What if the variable is a large struct instead of a little int or char?
In general, there is no guarantee that accesses are done atomically. Theoretically, two cores each executing one thread can load/store the same variable at the same time (though often not in practice). It is very dependent on the target platform.
For processors having (coherent) caches (i.e. all modern mainstream processors), cache lines (i.e. chunks of typically 64 or 128 bytes) have a huge impact on the implicit synchronization between threads. This is a complex topic, but you can first read more about cache coherence in order to understand how the memory hierarchy works on modern platforms.
The cache coherence protocol prevents two loads/stores from being done exactly at the same time on the same cache line. If the variable crosses multiple cache lines, then there is no such protection.
On widespread x86/x86-64 platforms, variables having primitive types of <= 8 bytes can be modified atomically (because the bus supports that, as do the DRAM and the cache) assuming the address is correctly aligned (i.e. it does not cross cache lines). However, this does not mean all such accesses are atomic: you need to specify this to the compiler/interpreter/etc. so it produces/executes the correct instructions. Note that there is also an extension for 16-byte atomics, and an instruction set extension for the support of transactional memory. For wider types (or possibly composite ones), you likely need a lock or an atomic state to control the atomicity of the access to the target variable.
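For instance (a sketch of mine, in C++, with made-up struct names), you can ask the implementation whether a given atomic type is actually lock-free on the target platform; wide types fall back to an internal lock.

#include <atomic>
#include <cstdio>

struct Pair   { void* a; void* b; };   // 16 bytes on x86-64; may use cmpxchg16b
struct Bigger { char bytes[64]; };     // too wide for hardware atomics

int main() {
    // Note: with GCC this may need to be linked with -latomic, and 16-byte
    // lock-freedom may additionally require building with -mcx16.
    std::printf("atomic<int>    lock-free: %d\n", std::atomic<int>{}.is_lock_free());
    std::printf("atomic<Pair>   lock-free: %d\n", std::atomic<Pair>{}.is_lock_free());
    std::printf("atomic<Bigger> lock-free: %d\n", std::atomic<Bigger>{}.is_lock_free()); // falls back to a lock
}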
In a multi-threading environment, isn’t it that every operation on the RAM must be synchronized?
Let’s say, I have a variable, which is a pointer to another memory address:
foo 12345678
Now, if one thread sets that variable to another memory address (let's say 89ABCDEF) while the first thread reads the variable, couldn't it be that the first thread reads complete garbage from the variable if access weren't synchronized (on some system level)?
foo 12345678 (before)
    89ABCDEF (new data)
    •••••    (writing thread progress: only the first five digits written so far)
    89ABC678 (memory content a reader could observe: a half-old, half-new value)
Since I have never seen those things happen, I assume that there is some system-level synchronization when writing variables. I assume that this is why it is called an 'atomic' operation. As I found here, this problem is actually a real topic and not something I made up.
On the other hand, I read everywhere that synchronization has a significant impact on performance. (Aside from threads that must wait because they cannot enter the lock; I mean just the act of locking and unlocking.) Like here:
synchronized adds a significant overhead to the methods […]. These operations are quite expensive […] it has an extreme impact on the program performance. […] the expensive synchronized operations that cause the code to be so terribly slow.
How does this go together? Why is locking for changing a variable unnoticeably fast, but locking for anything else so expensive? Or, is it equally expensive, and there should be a big warning sign when using—let's say—long and double because they always implicitly require synchronization?
Concerning your first point, when a processor writes some data to memory, this data is always properly written and cannot be "trashed" by other writes from threads, processes, the OS, etc. It is not a matter of synchronization, just something required to ensure proper hardware behaviour.
Synchronization is a software concept that requires hardware support. Assume that you just want to acquire a lock. It is supposed to be free when at 0 and locked when at 1.
The basic method to do that is
got_the_lock = 0
while (!got_the_lock)
    fetch lock value from memory
    set lock value in memory to 1
    got_the_lock = (fetched value from memory == 0)
done
print "I got the lock!!"
The problem is that if other threads do the same thing at the same time and read the lock value before it has been set to 1, several threads may think they got the lock.
To avoid that, one needs atomic memory access. An atomic access is typically a read-modify-write cycle on a piece of data in memory that cannot be interrupted and that forbids access to this information until completion. So not all accesses are atomic, only specific read-modify-write operations, and they are realized thanks to specific processor support (see the test-and-set or fetch-and-add instructions, for instance). Most accesses do not need it and can be regular accesses. Atomic access is mostly used to synchronize threads, to ensure that only one thread is in a critical section.
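As a hedged C++ sketch of mine (not part of the original answer), the same spinlock idea written with an atomic test-and-set: the hardware read-modify-write closes the window that broke the naive pseudocode above.

#include <atomic>

std::atomic_flag lock_flag = ATOMic_FLAG_INIT;

void acquire() {
    // test_and_set is an atomic read-modify-write: no other thread can slip in
    // between the "fetch" and the "set to 1" steps of the naive pseudocode above.
    while (lock_flag.test_and_set(std::memory_order_acquire)) {
        // spin: someone else holds the lock
    }
}

void release() {
    lock_flag.clear(std::memory_order_release);
}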
So why are atomic accesses expensive? There are several reasons.
The first one is that one must ensure a proper ordering of instructions. You probably know that the execution order of instructions may be different from the program order, provided the semantics of the program are respected. This is heavily exploited to improve performance: compilers reorder instructions, processors execute them out of order, write-back caches write data to memory in any order, and memory write buffers do the same thing. This reordering can lead to improper behavior.
1 while (x--) ; // random and silly loop
2 f(y);
3 while(test_and_set(important_lock)) ; //spinlock to get a lock
4 g(z);
Obviously instruction 1 is not constraining and 2 can be executed before (and probably 1 will be removed by an optimizing compiler). But if 4 is executed before 3, the behavior will not be as expected.
To avoid that, an atomic access flushes the instruction and memory buffers, which requires tens of cycles (see memory barrier).
The other reason is that an atomic operation cannot be fully pipelined: you pay the full latency of the operation (read the data from memory, modify it, write it back). This latency always exists, but for regular memory accesses you can do other work during that time, which largely hides the latency.
An atomic access requires at least 100-200 cycles on modern processors and is accordingly extremely expensive.
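To see roughly what that costs, here is a crude micro-benchmark sketch of mine (not from the original answer); the absolute numbers vary wildly per CPU, the point is only that the atomic (lock-prefixed on x86) increment is several times slower even with no other thread contending.

#include <atomic>
#include <chrono>
#include <cstdio>

int main() {
    const long N = 100000000;
    volatile long plain = 0;          // volatile so the plain loop is not optimized away
    std::atomic<long> atom{0};

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < N; ++i) plain = plain + 1;              // regular read-modify-write
    auto t1 = std::chrono::steady_clock::now();
    for (long i = 0; i < N; ++i) atom.fetch_add(1, std::memory_order_relaxed); // atomic RMW
    auto t2 = std::chrono::steady_clock::now();

    std::printf("plain : %lld ms\n",
        (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
    std::printf("atomic: %lld ms\n",
        (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count());
}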
How does this go together? Why is locking for changing a variable unnoticeably fast, but locking for anything else so expensive? Or, is it equally expensive, and there should be a big warning sign when using—let's say—long and double because they always implicitly require synchronization?
Regular memory accesses are not atomic. Only the specific synchronization instructions are expensive.
Synchronization always has a cost involved, and the cost increases with contention: threads wake up, fight for the lock, only one gets it, and the rest go back to sleep, resulting in a lot of context switches.
However, such contention can be kept to a minimum by using synchronization at a much finer-grained level, as in a CAS (compare-and-swap) operation by the CPU, or a memory barrier to read a volatile variable. A far better option is to avoid synchronization altogether without compromising safety.
Consider the following code:
synchronized(this) {
    // a DB call
}
This block of code will take several seconds to execute as it is doing I/O, and therefore it runs a high chance of creating contention among other threads wanting to execute the same block. The duration is enough to build up a massive queue of waiting threads in a busy system.
This is the reason non-blocking algorithms like the Treiber stack and the Michael-Scott queue exist. They do their work (which we'd otherwise do using a much larger synchronized block) with the minimum amount of synchronization.
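For illustration, a rough sketch of mine (in C++, not the answer's code) of the core of a Treiber-style lock-free push using compare-and-swap; real implementations also deal with memory reclamation and the ABA problem, which this ignores.

#include <atomic>

struct Node {
    int   value;
    Node* next;
};

std::atomic<Node*> top{nullptr};

// Lock-free push: the only "synchronization" is one CAS on the head pointer,
// instead of a coarse synchronized/locked block around the whole operation.
void push(int v) {
    Node* n = new Node{v, top.load()};
    // If another thread pushed in the meantime, the CAS fails, 'n->next' is
    // refreshed with the current top, and we simply retry.
    while (!top.compare_exchange_weak(n->next, n)) {
        // retry
    }
}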
isn’t it that every operation on the RAM must be synchronized?
No. Most of the "operations on RAM" will target memory locations that are only used by one thread. For example, in most programming languages, none of a thread's function arguments or local variables will be shared with other threads; and often, a thread will use heap objects that it does not share with any other thread.
You need synchronization when two or more threads communicate with one another through shared variables. There are two parts to it:
mutual exclusion
You may need to prevent "race conditions." If some thread T updates a data structure, it may have to put the structure into a temporary, invalid state before the update is complete. You can use mutual exclusion (i.e., mutexes/semaphores/locks/critical sections) to ensure that no other thread U can see the data structure when it is in that temporary, invalid state.
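A minimal sketch of that idea (mine, in C++ for consistency with the other examples, with a made-up Range type): the lock keeps other threads from seeing the structure while its invariant is temporarily broken.

#include <mutex>

struct Range {
    int lo = 0;
    int hi = 0;   // invariant: lo <= hi
};

std::mutex m;
Range r;

void update(int new_lo, int new_hi) {
    std::lock_guard<std::mutex> guard(m);
    r.lo = new_lo;   // between these two stores the invariant may be broken,
    r.hi = new_hi;   // but no reader can observe it while we hold the mutex
}

Range read() {
    std::lock_guard<std::mutex> guard(m);
    return r;        // always sees a consistent {lo, hi} pair
}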
cache consistency
On a computer with more than one CPU, each processor typically has its own memory cache. So, when two different threads running on two different processors both access the same data, they may each be looking at their own, separately cached copy. Thus, when thread T updates that shared data structure, it is important to ensure that all of the variables it updated make it into thread U's cache before thread U is allowed to see any of them.
It would totally defeat the purpose of the separate caches if every write by one processor invalidated every other processor's cache, so there typically are special hardware instructions to do that only when it's needed, and typical mutex/lock implementations execute those instructions on entering or leaving a protected block of code.
I read about CAS in https://en.wikipedia.org/wiki/Compare-and-swap, and got some doubts:
Even though a single lock operation is implemented in a single instruction, if 2 threads run on 2 different processors, the 2 instructions could execute at the same time. Isn't that a race condition?
I saw the following sentence in Linux Kernel Development, 3rd edition, page 168.
because a process can execute on only one processor at a time
I doubt that; I'm not sure it means what it literally says. What if the process has multiple threads? Can't they run on multiple processors at a time?
Can anyone help explain these doubts? Thanks.
The CPU has a cache for memory, typically organized in so-called cache lines of 64 bytes each. It operates on chunks of that size. In particular, when executing lock cmpxchg or similar instructions, the hardware thread you execute this on will negotiate exclusive access to that 64-byte portion of memory with the other threads. And that's why it works.
In general, you want to read this book: https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
This particular bit is explained on page 21.
Regarding the LKD quote, there is no context provided. It is safe to assume they meant threads and were updating a thread-local counter.
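To make the first doubt concrete, a hedged C++ sketch of mine: even if both increments are issued "at the same time" on two cores, the cache coherence protocol serializes the two lock-prefixed read-modify-writes, so no update is lost.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<long> counter{0};

void work() {
    for (int i = 0; i < 1000000; ++i)
        counter.fetch_add(1);   // compiles to a lock-prefixed add on x86
}

int main() {
    std::thread t1(work), t2(work);
    t1.join(); t2.join();
    // Always prints 2000000: the two RMWs on the same cache line are serialized
    // by the coherence protocol, so no increment is lost.
    std::printf("%ld\n", counter.load());
}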
Assume I have two separate threads, each running on its own CPU.
Both attempt to write to a shared piece of memory at the same time; how do I know which value will be stored into memory first, or which will end up as the most up-to-date version?
Or will certain memory consistency models make sure this is impossible?
If so, how?
"At the same time" does not really have meaning in multi-core systems. Do you mean the moment when the store instruction is executed (actually a span of time), the moment the value is put into the core-local store cache, the moment the store cache is flushed to the CPU-shared cache (note that there may be CPUs where a subset of cores has a shared cache), or when the cache is flushed to memory? Which core's view of the system do you want to use for observing the result? (In quad-core systems, it could well be that when core 1 and 2 both write values, cores 3 and 4 observe the writes in different orders.)
CPUs have ways of ensuring that two writes have a guaranteed global order (memory barriers and interlocked load/store operations) - unless you use those, the results are unspecified and guaranteed to be surprising. And in the case of the C11 and C++11 memory model, they're fully undefined.
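As an illustration (my own sketch, not the answer's), an interlocked read-modify-write gives both writers a well-defined place in a single global modification order on the variable:

#include <atomic>
#include <thread>
#include <cstdio>

std::atomic<int> shared{0};

void writer(int my_value) {
    // exchange is an interlocked (atomic read-modify-write) store:
    // it returns the value it replaced, so each writer knows whether
    // it was first or second in the single modification order of 'shared'.
    int previous = shared.exchange(my_value);
    std::printf("wrote %d, replaced %d\n", my_value, previous);
}

int main() {
    std::thread t1(writer, 1), t2(writer, 2);
    t1.join(); t2.join();
    // The final value is whichever write came second in that order, 1 or 2,
    // but it is never some mixture of the two.
}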
If two threads are writing to the same memory at the same time, there's no way to know the order the writes happen in, by definition (otherwise it wouldn't be "at the same time").
Even if there was, what would happen if one of them was delayed by a nanosecond?
Assume we have a multithreaded C program (pthreads), and the (unsynchronized) shared variable accesses of the individual threads are not reordered by the compiler. Does an x86 CPU respect the order of the shared variable accesses (within a single thread), or is it possible that it reorders some memory accesses?
Unsynchronized shared variable accesses are dangerous, and out-of-order execution is one reason for it.
The x86 keeps writes in order (within a thread), but not reads.
This can get you into trouble, if you assume the order remains. For example:
Thread A writes to x and then to y. Assuming the compiler didn't reorder it, the CPU won't reorder it (x86 won't, others might).
Thread B reads y and then x. You might think that if it got y's new value, then surely it will get x's new value as well.
Not so. The CPU may reorder thread B's reads, so that x is actually read before y.
EDIT: as "Man of One Way" pointed out, in this case, x86 (but not all processors!) guarantees ordering.
I quote the Intel software developer's manual:
Writes by a single processor are observed in the same order by all processors.
This isn't true for writes by multiple processors - they may seem to be ordered differently by different processors.
However, I highly recommend not relying on this, and using proper synchronization instead.
The synchronization primitives are implemented with atomic operations and/or barriers, which keep you safe.
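Applied to the x/y example above, a hedged C++11 sketch of mine: making the y accesses release/acquire (or using a mutex) rules out the problematic reordering on every architecture, not just x86.

#include <atomic>

int x = 0;
std::atomic<int> y{0};

void thread_A() {
    x = 1;                                      // plain write
    y.store(1, std::memory_order_release);      // release: 'x = 1' cannot move after this
}

void thread_B() {
    if (y.load(std::memory_order_acquire) == 1) {   // acquire: later reads cannot move before this
        // Guaranteed: x == 1 here, on x86 and on weakly ordered CPUs alike.
        int observed = x;
        (void)observed;
    }
}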
Some reorderings are possible, due to the presence of a store buffer. See e.g. https://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf
However, the reorderings are visible only across several threads; within a single thread, all accesses from that thread appear to happen in order.
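For example (my own sketch), the classic store-buffer litmus test: each thread writes one variable and then reads the other, and on x86 both reads may see 0 because each store can still be sitting in its core's store buffer.

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread_1() {
    x.store(1, std::memory_order_relaxed);
    r1 = y.load(std::memory_order_relaxed);   // may read 0: the store to x is still in the store buffer
}

void thread_2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);   // may read 0 as well
}

// The outcome r1 == 0 && r2 == 0 is allowed on x86 (store->load reordering).
// Making all four accesses memory_order_seq_cst forbids it, because that
// ordering inserts the necessary full barrier (e.g. mfence or a locked instruction).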