Compare and swap - What if 2 processors execute locking simultaneously? - multithreading

I read about CAS in https://en.wikipedia.org/wiki/Compare-and-swap, and have some doubts:
Even though a single lock operation is implemented as a single instruction, if two threads run on two different processors, the two instructions could happen at the same time. Isn't that a race condition?
I saw the following sentence in Linux Kernel Development, 3rd edition, page 168:
because a process can execute on only one processor at a time
I doubt that; I am not sure it means what it literally says. What if the process has multiple threads? Can't they run on multiple processors at the same time?
Can anyone help explain these doubts? Thanks.

The CPU has a cache for memory, typically organized in so-called cache lines of 64 bytes each, and it operates on chunks of that size. In particular, when executing lock cmpxchg or similar instructions, the hardware thread you run it on negotiates exclusive access to the 64-byte portion of memory with the other threads. And that's why it works.
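To make that concrete, here is a minimal C++ sketch (my own illustration, not from the answer; the names lock_word and try_take are made up) in which two threads issue a compare-and-swap on the same word at the same time. The hardware serializes the two operations on the cache line, so exactly one of them sees the expected value 0 and succeeds:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> lock_word{0};   // 0 = free, 1 = taken

void try_take(const char* name) {
    int expected = 0;
    // On x86 this typically compiles to lock cmpxchg: compare lock_word with
    // 'expected' and, only if they are equal, store 1 -- all in one atomic step.
    if (lock_word.compare_exchange_strong(expected, 1))
        std::printf("%s won the CAS\n", name);
    else
        std::printf("%s lost the CAS (saw %d)\n", name, expected);
}

int main() {
    std::thread a(try_take, "thread A");
    std::thread b(try_take, "thread B");
    a.join();
    b.join();   // exactly one "won" line is ever printed, never two
}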
In general, you want to read this book: https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
This particular bit is explained on page 21.
Regarding the LKD quote, there is no context provided. It is safe to assume they meant threads and were updating a thread-local counter.

Related

Is synchronization for variable change cheaper than for something else?

In a multithreading environment, isn't it the case that every operation on RAM must be synchronized?
Let's say I have a variable which is a pointer to another memory address:
foo = 12345678
Now, if one thread sets that variable to another memory address (let's say 89ABCDEF) while the first thread reads the variable, couldn't the first thread read total garbage from the variable if the access weren't synchronized (on some system level)?
foo = 12345678   (before)
      89ABCDEF   (new data)
      •••••      (writing thread's progress)
      89ABC678   (memory content seen by the reader)
Since I have never seen this happen, I assume that there is some system-level synchronization when writing variables. I assume that this is why it is called an 'atomic' operation. As I found here, this problem is actually a real topic and not something I made up.
On the other hand, I read everywhere that synchronizing has a significant impact on performance. (Aside from threads that must wait because they cannot enter the lock; I mean just the act of locking and unlocking.) Like here:
synchronized adds a significant overhead to the methods […]. These operations are quite expensive […] it has an extreme impact on the program performance. […] the expensive synchronized operations that cause the code to be so terribly slow.
How does this go together? Why is locking for changing a variable unnoticeably fast, but locking for anything else so expensive? Or is it equally expensive, and should there be a big warning sign when using, let's say, long and double, because they always implicitly require synchronization?
Concerning your first point, when a processor writes some data to memory, this data is always properly written and cannot be "trashed" by other writes from threads, processes, the OS, etc. It is not a matter of synchronization; it is simply required to ensure proper hardware behaviour.
Synchronization is a software concept that requires hardware support. Assume that you just want to acquire a lock. It is supposed to be free when at 0 and locked when at 1.
The basic method to do that is:
got_the_lock = 0
while (!got_the_lock)
    fetch lock value from memory
    set lock value in memory to 1
    got_the_lock = (fetched value from memory == 0)
done
print "I got the lock!!"
The problem is that if other threads do the same thing at the same time and read lock value before it has been set to 1, several threads may think they got the lock.
To avoid that, one needs atomic memory accesses. An atomic access is typically a read-modify-write cycle on a piece of data in memory that cannot be interrupted and that forbids any other access to this location until completion. So not all accesses are atomic; only specific read-modify-write operations are, and they rely on specific processor support (see test-and-set or fetch-and-add instructions, for instance). Most accesses do not need atomicity and can be regular accesses. Atomic accesses are mostly used to synchronize threads, to ensure that only one thread is in a critical section.
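As an illustration (a sketch of my own, not part of the answer), here is the corrected version of the pseudocode above written with C++'s std::atomic_flag, whose test_and_set() is exactly the kind of atomic read-modify-write instruction being described:

#include <atomic>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void lock() {
    // test_and_set() atomically reads the old value and writes 1 (set),
    // so only the thread that observed "clear" proceeds; everyone else spins.
    while (lock_flag.test_and_set(std::memory_order_acquire)) {
        // spin
    }
}

void unlock() {
    lock_flag.clear(std::memory_order_release);
}

With this version the read and the write of the lock value can no longer be separated by another thread, which removes the race in the naive loop.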
So why are atomic accesses expensive? There are several reasons.
The first one is that one must ensure a proper ordering of instructions. You probably know that instruction execution order may differ from program order, provided the semantics of the program are respected. This is heavily exploited to improve performance: compilers reorder instructions, processors execute them out of order, write-back caches write data to memory in any order, and memory write buffers do the same thing. This reordering can lead to improper behavior.
1 while (x--) ; // random and silly loop
2 f(y);
3 while(test_and_set(important_lock)) ; //spinlock to get a lock
4 g(z);
Obviously instruction 1 is not constraining and 2 can be executed before it (and 1 will probably be removed by an optimizing compiler). But if 4 is executed before 3, the behavior will not be as expected.
To avoid that, an atomic access flushes the instruction and memory buffers, which requires tens of cycles (see memory barrier).
The second reason is that an atomic operation cannot be overlapped with other work in the pipeline: you pay the full latency of the operation (read the data from memory, modify it, and write it back). This latency exists for regular memory accesses too, but there the processor can do other work during that time, which largely hides it.
An atomic access requires at least 100-200 cycles on modern processors and is accordingly extremely expensive.
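The difference is easy to observe. Below is a rough micro-benchmark sketch in C++ (my own; the names and the run_ns timing helper are made up). The exact numbers depend heavily on the CPU and on whether the cache line is contended; the point is only that the atomic read-modify-write is consistently much slower than the plain increment:

#include <atomic>
#include <chrono>
#include <cstdio>

volatile int plain = 0;             // volatile only so the loop is not optimized away
std::atomic<int> counter{0};

template <typename F>
static long long run_ns(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

int main() {
    const int N = 10000000;
    long long t_plain  = run_ns([&] { for (int i = 0; i < N; ++i) plain = plain + 1; });
    long long t_atomic = run_ns([&] { for (int i = 0; i < N; ++i) counter.fetch_add(1); });
    std::printf("plain:  %lld ns\natomic: %lld ns\n", t_plain, t_atomic);
}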
How does this go together? Why is locking for changing a variable unnoticeably fast, but locking for anything else so expensive? Or is it equally expensive, and should there be a big warning sign when using, let's say, long and double, because they always implicitly require synchronization?
Regular memory access are not atomic. Only specific synchronization instructions are expensive.
Synchronization always has a cost involved. And the cost increases with contention, due to threads waking up, fighting for the lock (only one gets it), and the rest going back to sleep, resulting in a lot of context switches.
However, such contention can be kept to a minimum by synchronizing at a much finer granularity, as with a CAS (compare-and-swap) operation by the CPU, or a memory barrier when reading a volatile variable. A far better option is to avoid synchronization altogether without compromising safety.
Consider the following code:
synchronized(this) {
    // a DB call
}
This block of code will take several seconds to execute as it is doing I/O, and therefore it runs a high chance of creating contention among the other threads wanting to execute the same block. The time duration is enough to build up a massive queue of waiting threads in a busy system.
This is the reason non-blocking algorithms like the Treiber stack and the Michael-Scott queue exist. They do their task (which we'd otherwise do inside a much larger synchronized block) with the minimum amount of synchronization.
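For a feel of what "minimum amount of synchronization" means, here is a sketch (mine, in C++; these structures are usually presented in Java) of the push operation of a Treiber-style stack. The only synchronization is a single CAS on the head pointer; pop is omitted because it additionally has to deal with the ABA problem and memory reclamation:

#include <atomic>

struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> head{nullptr};

void push(int v) {
    Node* n = new Node{v, head.load(std::memory_order_relaxed)};
    // If another thread changed head in the meantime, the CAS fails and
    // rewrites n->next with the current head, so we simply retry.
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
        // retry
    }
}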
isn't it the case that every operation on RAM must be synchronized?
No. Most of the "operations on RAM" will target memory locations that are only used by one thread. For example, in most programming languages, none of a thread's function arguments or local variables will be shared with other threads; and often, a thread will use heap objects that it does not share with any other thread.
You need synchronization when two or more threads communicate with one another through shared variables. There are two parts to it:
mutual exclusion
You may need to prevent "race conditions." If some thread T updates a data structure, it may have to put the structure into a temporary, invalid state before the update is complete. You can use mutual exclusion (i.e., mutexes/semaphores/locks/critical sections) to ensure that no other thread U can see the data structure when it is in that temporary, invalid state.
cache consistency
On a computer with more than one CPU, each processor typically has its own memory cache. So, when two different threads running on two different processors both access the same data, they may each be looking at their own, separately cached copy. Thus, when thread T updates that shared data structure, it is important to ensure that all of the variables it updated make it into thread U's cache before thread U is allowed to see any of them.
It would totally defeat the purpose of the separate caches if every write by one processor invalidated every other processor's cache, so there typically are special hardware instructions to do that only when it's needed, and typical mutex/lock implementations execute those instructions on entering or leaving a protected block of code.
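A minimal C++ sketch (my own, with made-up names) showing both parts at once: the lock provides mutual exclusion, and, under the usual mutex semantics, releasing it also publishes the writes so that the next thread to acquire the lock sees them:

#include <cstddef>
#include <mutex>
#include <vector>

std::mutex m;
std::vector<int> shared_data;       // shared between threads T and U

void writer(int v) {                // thread T
    std::lock_guard<std::mutex> g(m);
    shared_data.push_back(v);       // no other thread can see the vector mid-update
}                                   // unlocking makes the update visible to later lockers

std::size_t reader() {              // thread U
    std::lock_guard<std::mutex> g(m);
    return shared_data.size();      // sees everything written before the last unlock
}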

Why doesn't the instruction reorder issue occur on a single CPU core?

From this post:
Two threads being timesliced on a single CPU core won't run into a reordering problem. A single core always knows about its own reordering and will properly resolve all its own memory accesses. Multiple cores however operate independently in this regard and thus won't really know about each other's reordering.
Why can't the instruction reorder issue occur on a single CPU core? This article doesn't explain it.
EXAMPLE:
The following example is taken from Memory Reordering Caught in the Act. Two threads run concurrently; with X and Y initially 0, one thread executes X = 1 followed by r1 = Y, while the other executes Y = 1 followed by r2 = X.
What the article records is that, every so often, a run ends with r1 == 0 and r2 == 0.
I think the recorded instructions can also cause this issue on a single CPU, because neither r1 nor r2 is 1.
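For reference, here is a single-shot C++ sketch of that experiment (my own reconstruction; the article's actual code uses semaphores and a loop so the rare event can be caught). The relaxed atomics permit exactly the store/load reordering being discussed:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> X{0}, Y{0};
int r1 = 0, r2 = 0;                 // only read after join(), so plain ints are fine

void worker1() {
    X.store(1, std::memory_order_relaxed);
    r1 = Y.load(std::memory_order_relaxed);
}

void worker2() {
    Y.store(1, std::memory_order_relaxed);
    r2 = X.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(worker1), b(worker2);
    a.join();
    b.join();
    // On multi-core hardware r1 == 0 && r2 == 0 is a permitted (if rare) outcome:
    // each store can still sit in its core's store buffer when the other core's
    // load executes. A full (StoreLoad) barrier between the store and the load
    // in each thread, e.g. std::atomic_thread_fence(std::memory_order_seq_cst),
    // rules that outcome out.
    std::printf("r1=%d r2=%d\n", r1, r2);
}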
A single core always knows about its own reordering and will properly resolve all its own memory accesses.
A single CPU core does reorder, but it knows its own reordering and can do clever tricks to pretend it is not reordering. Thus, things go faster, without weird side effects.
Multiple cores however operate independently in this regard and thus won't really know about each other's reordering.
When a CPU reorders, the other CPUs can't compensate for this. Imagine CPU #1 is waiting for a write to variableA, and then it reads from variableB. If CPU #2 writes to variableB, then variableA, as the code says, no problem occurs. If CPU #2 reorders and writes to variableA first, then CPU #1 doesn't know, and tries to read from variableB before it has a value. This can cause crashes or any "random" behavior. (Intel chips have more magic that makes this not happen.)
Two threads being timesliced on a single CPU core won't run into a reordering problem.
If both threads are on the same CPU, then it doesn't matter which order the writes happen in, because if they're reordered, then they're both in progress, and the CPU won't really switch until both are written, in which case they're safe to read from the other thread.
Example
For the code to have a problem on a single core, it would have to rearrange the two instructions from process 1 and be interrupted by process 2, which executes in between the two instructions. But if it is interrupted between them, it knows it has to abort both of them, since it knows about its own reordering and knows it is in a dangerous state. So it will either do them in order, or do both before switching to process 2, or do neither before switching to process 2. All of which avoid the reordering problem.
There are multiple effects at work, but they are modeled as just one effect. Makes it easier to reason about them. Yes, a modern core already re-orders instructions by itself. But it maintains logical flow between them, if two instructions have an inter-dependency between them then they stay ordered so the logic of the program does not change. Discovering these inter-dependencies and preventing an instruction from being issued too early is the job of the reorder buffer in the execution engine.
This logic is solid and can be relied upon; it would be next to impossible to write a program if that wasn't the case. But that same guarantee cannot be provided by the memory controller. It has the unenviable job of giving multiple processors access to the same shared memory.
First is the prefetcher: it reads data from memory ahead of time to ensure the data is available by the time a read instruction executes, so the core won't stall waiting for the read to complete. The problem is that, because the memory was read early, it might be a stale value that was changed by another core between the time the prefetch was done and the time the read instruction executes. To an outside observer it looks like the instruction executed early.
Second is the store buffer: it takes the data of a write instruction and writes it lazily to memory, later, after the instruction has executed, so the core won't stall waiting on the memory bus write cycle to complete. To an outside observer, it just looks like the instruction executed late.
Modeling the effects of the prefetcher and store buffer as instruction reordering effects is very convenient. You can write that down on a piece of paper easily and reason about the side-effects.
To the core itself, the effects of the prefetcher and store buffer are entirely benign and it is oblivious to them. As long as there isn't another core that's also changing memory content. A machine with a single core always has that guarantee.

Multi Threading Non Atomic Operations With Atomic

I was wondering if this scenario is possible, or does the CPU make any guarantees that this won't happen at such a low level:
Say there is a value that is misaligned and requires 2 fetches to get the whole value (a 32-bit value misaligned on a 32-bit system). So both threads are only executing one instruction: thread 1 a mov that is reading from memory, and thread 2 an atomic mov that is writing to memory.
Thread 1 fetches first half of Value
Thread 2 atomically writes to Value
Thread 1 fetches second half of Value
So now Thread 1 will contain two halves of two different values.
Is this scenario possible, or does the CPU make any guarantees that this won't happen?
Here's my non-expert answer...
It is rather complex to make misaligned accesses atomic, so I believe most architectures don't give any sort of guarantee of atomicity in this case. (I don't know of any architecture that can do atomic misaligned fetches, but they just might exist. I don't know enough to tell.)
It's even complex just to fetch misaligned data; so architectures that want to keep things really simple don't even allow misaligned access (For example the very "RISC-y" old Alpha architecture).
It might be possible on some architectures to do it by somehow locking (or protecting, see below) two cache lines simultaneously in a local cache, but such things are AFAIK usually not available in 'user-land', i.e. non-OS threads.
See http://en.wikipedia.org/wiki/Load-link/store-conditional for the modern way to achieve load-store atomicity for one (aligned) word (which is NOT a tear-free read of two misaligned areas). Now, if a thread were somehow allowed to issue two connected (atomic) 'protect' instructions like that, I suppose it could be done, but then again, that would be complex. I don't know if that exists on any CPU.
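In practice the portable answer is to avoid the situation entirely: keep the value naturally aligned and use an atomic type, and the torn read described in the question cannot happen. A small C++ sketch (mine) of what that looks like:

#include <atomic>
#include <cstdint>
#include <cstdio>

// A naturally aligned 32-bit atomic never straddles two words or cache lines,
// so its loads and stores are tear-free on mainstream architectures.
std::atomic<std::uint32_t> value{0};

int main() {
    std::printf("alignment: %zu, lock-free: %d\n",
                alignof(std::atomic<std::uint32_t>),
                static_cast<int>(value.is_lock_free()));
    value.store(0xDEADBEEF);          // whole 32-bit store, no tearing
    std::uint32_t v = value.load();   // whole 32-bit load, no tearing
    std::printf("value: %08X\n", static_cast<unsigned>(v));
}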

Thread Cooperation on Dual-CPU Machines

I remember in a course I took in college, one of my favorite examples of a race condition was one in which a simple main() method started two threads, one of which incremented a shared (global) variable by one, the other decrementing it. Pseudo code:
static int i = 10;

main() {
    new Thread(thread_run1).start();
    new Thread(thread_run2).start();
    waitForThreads();
    print("The value of i: " + i);
}

thread_run1 {
    i++;
}

thread_run2 {
    i--;
}
The professor then asked what the value of i is after a million billion zillion runs. (If it would ever be anything other than 10, essentially.) Students unfamiliar with multithreading systems responded that 100% of the time, the print() statement would always report i as 10.
This was in fact incorrect, as our professor demonstrated that each increment/decrement statement was actually compiled (to assembly) as 3 statements:
1: move value of 'i' into register x
2: add 1 to value in register x
3: move value of register x into 'i'
Thus, the value of i could be 9, 10, or 11. (I won't go into specifics.)
My Question:
It was (is?) my understanding that the set of physical registers is processor-specific. When working with dual-CPU machines (note the difference between dual-core and dual-CPU), does each CPU have its own set of physical registers? I had assumed the answer is yes.
On a single-CPU (multithreaded) machine, context switching allows each thread to have its own virtual set of registers. Since there are two physical sets of registers on a dual-CPU machine, couldn't this result in even more potential for race conditions, since you can literally have two threads operating simultaneously, as opposed to 'virtual' simultaneous operation on a single-CPU machine? (Virtual simultaneous operation in reference to the fact that register states are saved/restored each context switch.)
To be more specific - if you were running this on an 8-CPU machine, each CPU with one thread, are race conditions eliminated? If you expand this example to use 8 threads, on a dual-CPU machine, each CPU having 4 cores, would the potential for race conditions increase or decrease? How does the operating system prevent step 3 of the assembly instructions from being run simultaneously on two different CPUs?
Yes, the introduction of dual-core CPUs made a significant number of programs with latent threading races fail quickly. Single-core CPUs multitask by the scheduler rapidly switching the threading context between threads, which eliminates a class of threading bugs associated with a stale CPU cache.
The example you give can fail on a single core as well, though, when the thread scheduler interrupts the thread just as it has loaded the value of the variable into a register in order to increment it. It just won't fail nearly as frequently, because the odds that the scheduler interrupts the thread exactly there aren't that great.
There's an operating system feature to allow these programs to limp along anyway instead of crashing within minutes, called 'processor affinity'; it is available as the /AFFINITY command line option for start.exe on Windows and as SetProcessAffinityMask() in the winapi. Review the Interlocked class for helper methods that atomically increment and decrement variables.
You'd still have a race condition - it doesn't change that at all. Imagine two cores both performing an increment at the same time - they'd both load the same value, increment to the same value, and then store the same value... so the overall increment from the two operations would be one instead of two.
There are additional causes of potential problems where memory models are concerned - where step 1 may not really retrieve the latest value of i, and step 3 may not immediately write the new value of i in a way which other threads can see.
Basically, it all becomes very tricky - which is why it's generally a good idea to either use synchronization when accessing shared data or to use lock-free higher level abstractions which have been written by experts who really know what they're doing.
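To tie this back to the professor's example: the standard fix is to replace the three-step load/add/store with a single atomic read-modify-write (Interlocked.Increment in .NET, AtomicInteger in Java, std::atomic in C++). A C++ sketch (mine) of the same program, which now prints 10 every time:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> i{10};

int main() {
    std::thread t1([] { i.fetch_add(1); });   // one indivisible read-modify-write
    std::thread t2([] { i.fetch_sub(1); });
    t1.join();
    t2.join();
    std::printf("The value of i: %d\n", i.load());   // always 10
}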
First, dual processor versus dual core has no real effect. A dual-core processor still has two completely separate processors on the chip. They may share some cache, and they do share a common bus to memory/peripherals, but the processors themselves are entirely separate. A dual-threaded single core, such as with Hyperthreading, is a third variation, but it has a set of registers per virtual processor as well. The two processors share a single set of execution resources, but they retain completely separate register sets.
Second, there are really only two cases that are genuinely interesting: a single thread of execution, and everything else. Once you have more than one thread (even if all threads run on a single processor), you have the same potential problems as if you're running on some huge machine with thousands of processors. Now, it's certainly true that you're likely to see the problems manifest themselves a lot sooner when the code runs on more processors (up to as many as you've created threads), but the problems themselves haven't changed at all.
From a practical viewpoint, having more cores is useful from a testing viewpoint. Given the granularity of task switching on a typical OS, it's pretty easy to write code that will run for years without showing problems on a single processor, but that will crash and burn in a matter of hours or even minutes when you run it on two or more physical processors. The problem hasn't really changed, though; it's just a lot more likely to show up a lot more quickly when you have more processors.
Ultimately, a race condition (or deadlock, livelock, etc.) is about the design of the code, not about the hardware it runs on. The hardware can make a difference in what steps you need to take to enforce the conditions involved, but the relevant differences have little to do with simple number of processors. Rather, they're about things like concessions made when you have not simply a single machine with multiple processors, but multiple machines with completely separate address spaces, so you may have to take extra steps to assure that when you write a value to memory that it becomes visible to the CPUs on other machines that can't see that memory directly.

Critical sections with multicore processors

With a single-core processor, where all your threads are run from the one single CPU, the idea of implementing a critical section using an atomic test-and-set operation on some mutex (or semaphore, etc.) in memory seems straightforward enough; because your processor is executing a test-and-set from one spot in your program, it necessarily can't be doing one from another spot in your program disguised as some other thread.
But what happens when you do actually have more than one physical processor? It seems that simple instruction-level atomicity wouldn't be sufficient, because with two processors potentially executing their test-and-set operations at the same time, what you really need to maintain atomicity on is access to the shared memory location of the mutex. (And if the shared memory location is loaded into cache, there's the whole cache consistency thing to deal with, too.)
This seems like it would incur far more overhead than the single core case, so here's the meat of the question: How much worse is it? Is it worse? Do we just live with it? Or sidestep it by enforcing a policy that all threads within a process group have to live on the same physical core?
Multi-core/SMP systems are not just several CPUs glued together. There's explicit support for doing things in parallel. All the synchronization primitives are implemented with the help of hardware, along the lines of an atomic CAS. The instruction either locks the bus shared by the CPUs and the memory controller (and devices that do DMA) and updates the memory, or just updates the memory relying on cache snooping. This in turn causes the cache coherency algorithm to kick in, forcing all involved parties to flush their caches. Disclaimer: this is a very basic description; there are more interesting things here, like virtual vs. physical caches, cache write-back policies, memory models, fences, etc.
If you want to know more about how an OS might use these hardware facilities, here's an excellent book on the subject.
Vendors of multi-core CPUs have to take care that the different cores coordinate themselves when executing instructions which guarantee atomic memory access.
On Intel chips, for instance, you have the 'cmpxchg' instruction. It compares the value stored at a memory location to an expected value and exchanges it for the new value if the two match. If you precede it with the 'lock' prefix, it is guaranteed to be atomic with respect to all cores.
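A sketch of how that instruction is usually reached from portable code (my own illustration; the names mutex_word and try_lock are made up): a compare-exchange on a std::atomic typically compiles down to lock cmpxchg on x86:

#include <atomic>

std::atomic<int> mutex_word{0};      // 0 = free, 1 = locked

bool try_lock() {
    int expected = 0;
    // Compare mutex_word with 'expected'; if they match, store 1.
    // Atomic with respect to all cores (lock cmpxchg on x86).
    return mutex_word.compare_exchange_strong(expected, 1,
                                              std::memory_order_acquire);
}

void unlock() {
    mutex_word.store(0, std::memory_order_release);
}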
You would need a test-and-set that forces the processor to notify all the other cores of the operation so that they are aware. Yes, that introduces an overhead and you have to live with it. It's a reason to design multithreaded applications in such a way that they don't wait for synchronization primitives too often.
Or sidestep it by enforcing a policy that all threads within a process group have to live on the same physical core?
That would defeat the whole point of multithreading. When you are using a lock, semaphore, or other synchronization techniques, you are relying on the OS to make sure that these operations are interlocked, no matter how many cores you are using.
The time to switch to a different thread after a lock has been released is mostly determined by the cost of a context switch. This SO thread deals with the context switching overhead, so you might want to check that.
There are some other interesting threads also:
What are the differences between various threading synchronization options in C#?
Threading best practices
You should read this MSDN article also: Understanding the Impact of Low-Lock Techniques in Multithreaded Apps.
Memory accesses are handled by the memory controller, which should take care of multi-core issues, i.e. it shouldn't allow simultaneous access to the same addresses (probably handled on either a memory-page or memory-line basis). So you can use a flag to indicate whether another processor is updating the memory contents of some block (this is to avoid a type of dirty read where part of the record is updated, but not all of it).
A more elegant solution is to use a HW semaphore block if the processor has such a feature. A HW semaphore is a simple queue which could be of size no_of_cores -1. This is how it is in TI's 6487/8 processor. You can either query the semaphore directly (and loop until it is released) or do an indirect query which will result in an interrupt once your core gets the resource. The requests are queued and served in the order they were made. A semaphore query is an atomic operation.
Cache consistency is another issue and you might need to do cache writebacks and refreshes in some cases. But this is a very cache implementation specific thing. With 6487/8 we needed to do that on a few operations.
Well, depending on what type of computers you have lying around the house, do the following: write a simple multithreaded application, run it on a single core (Pentium 4 or Core Solo), and then run it on a multicore processor (Core 2 Duo or similar) and see how big the speed-up is.
Granted these are unfair comparisons since Pentium 4 and Core Solo are much slower regardless of cores than a Core 2 Duo. Maybe compare between a Core 2 Duo and a Core 2 Quad with an application that can use 4 or more threads.
You raise a number of valid points. Multiple processors introduce a lot of headache and overhead. However, we just have to live with them, because the speed boost of parallelism can far outweigh them, if the critical sections are made long enough.
As for your final suggestion about having all threads on the same physical core, that completely defeats the point of a multi-core computer!

Resources