Mutual Exclusion Conditions - multithreading

I was reading about the Mutual Exclusion Conditions, which are as follows:
No two processes may be inside their critical sections at the same moment.
No assumptions are made about relative speeds of processes or number of CPUs.
No process running outside its critical section should block other processes.
No process should have to wait arbitrarily long to enter its critical section.
Can someone explain the meaning of the 2nd point?

To me, it means that you cannot decide something is correct just because it is only a small number of instructions. A process may be pre-empted, and a CPU may be suspended, take an interrupt, or suffer some other delay that defeats those assumptions.
Concurrent code has to be correct with any possible instruction interleaving.

Let's assume you know you have one processor. Let's also assume that your processor has an atomic instruction BBSS (Branch on Bit Set and Set) that cannot be interrupted: it branches if the bit is already set, and if the bit is clear it sets the bit and does not branch.
You can then do your locking using such an instruction:
BBSS DID_NOT_GET_LOCK, #1, LOCK_LOCATION ; branch if the lock bit was already set, otherwise set it and fall through
; Critical section
; ...
MOV #0, LOCK_LOCATION ; End critical section: release the lock
DID_NOT_GET_LOCK:
Locking becomes simple to implement in such a single processor system.
If you add multiple CPUs into the mix, that scheme of locking fails miserably. The instruction I described has at least two memory accesses:
If (bit is set)     ; memory test
    Goto destination
Else
    Set bit         ; memory set
If you have multiple processors, more than one processor could see that the bit is clear at the same time, and more than one process could enter the critical section.
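For contrast, here is a minimal sketch of a spin lock that does work with multiple processors, assuming C11 atomics are available (the names are illustrative, not from the answer above); atomic_flag_test_and_set performs the test and the set as one indivisible read-modify-write, so two CPUs cannot both observe the flag clear:

#include <stdatomic.h>

static atomic_flag lock_location = ATOMIC_FLAG_INIT;

void acquire_lock(void)
{
    /* Atomically read the old value and set the flag in one step;
       spin while another CPU already holds the lock. */
    while (atomic_flag_test_and_set_explicit(&lock_location, memory_order_acquire))
        ;   /* busy-wait */
}

void release_lock(void)
{
    /* Clear the flag so another CPU can take the lock. */
    atomic_flag_clear_explicit(&lock_location, memory_order_release);
}

The key difference from the split test/set sequence above is that the read-modify-write happens as one bus- or cache-locked operation, so the race described in this answer cannot occur.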

It means that nowadays CPUs come with multiple cores, so multiprogramming is possible: one CPU package can run multiple programs simultaneously.
But when you are learning OS basics, you usually assume that the CPU has only one core and that only one program can execute at a time.
That is why it is written that no assumptions are made about the number of CPUs (or cores).

Related

Compare and swap - what if 2 processors execute the locking simultaneously?

I read about CAS at https://en.wikipedia.org/wiki/Compare-and-swap, and I have some doubts:
Even though a single lock operation is implemented in a single instruction, if 2 threads run on 2 different processors, the 2 instructions could happen at the same time. Isn't that a race condition?
I saw the following sentence in Linux Kernel Development, 3rd edition, page 168:
because a process can execute on only one processor at a time
I doubt that, and I'm not sure it means what it literally says. What if the process has multiple threads; can't they run on multiple processors at a time?
Can anyone help explain these doubts? Thanks.
The CPU has a cache for memory, organized in so-called cache lines, typically 64 bytes each, and it does its work in chunks of that size. In particular, when executing lock cmpxchg or similar instructions, the hardware thread you execute it on negotiates exclusive access to that 64-byte portion of memory with the other threads. And that's why it works.
In general, you want to read this book: https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
This particular bit is explained on page 21.
Regarding the LKD quote, there is no context provided. It is safe to assume they meant threads and were updating a thread-local counter.
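To make the "two processors at the same time" case concrete, here is a sketch of an increment built on compare-and-swap, assuming C11's atomic_compare_exchange_weak (an illustration, not code from the kernel or the book). If two processors attempt the CAS at the same moment, the cache-coherence protocol lets only one of them own the cache line for the update; the other sees its CAS fail and simply retries with the fresh value, so no update is lost:

#include <stdatomic.h>

void cas_increment(atomic_int *counter)
{
    int expected = atomic_load(counter);

    /* If another CPU changed *counter between our load and the CAS,
       the CAS fails, 'expected' is refreshed with the current value,
       and we retry until our CAS wins. */
    while (!atomic_compare_exchange_weak(counter, &expected, expected + 1))
        ;
}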

Why doesn't the instruction reorder issue occur on a single CPU core?

From this post:
Two threads being timesliced on a single CPU core won't run into a reordering problem. A single core always knows about its own reordering and will properly resolve all its own memory accesses. Multiple cores however operate independently in this regard and thus won't really know about each other's reordering.
Why can't the instruction reorder issue occur on a single CPU core? This article doesn't explain it.
EXAMPLE:
The following example is taken from Memory Reordering Caught in the Act (the original post shows the two threads and the recorded results as images): each thread stores 1 into one shared variable and then loads the other variable into r1 or r2, and the recorded runs show both r1 and r2 coming out 0.
I think the recorded instructions could also cause this issue on a single CPU, because both r1 and r2 end up not being 1.
A single core always knows about its own reordering and will properly resolve all its own memory accesses.
A single CPU core does reorder, but it knows about its own reordering and can do clever tricks to pretend it isn't reordering. Thus, things go faster, without weird side effects.
Multiple cores however operate independently in this regard and thus won't really know about each other's reordering.
When a CPU reorders, the other CPUs can't compensate for it. Imagine CPU #1 is waiting for a write to variableA, after which it reads from variableB. If CPU #2 writes to variableB and then to variableA, as the code says, no problem occurs. If CPU #2 reorders and writes to variableA first, then CPU #1 doesn't know and tries to read from variableB before it has a value. This can cause crashes or any "random" behavior. (Intel chips have more magic that makes some of this not happen.)
Two threads being timesliced on a single CPU core won't run into a reordering problem.
If both threads run on the same CPU, it doesn't matter in which order the writes happen: if they're reordered, both are already in flight, and the core won't really switch threads until both have completed, at which point they're safe for the other thread to read.
Example
For the code to have a problem on a single core, it would have to rearrange the two instructions from process 1, be interrupted by process 2, and execute process 2 between the two instructions. But if it is interrupted between them, it knows it has to abort both of them, since it knows about its own reordering and knows it is in a dangerous state. So it will either do them in order, do both before switching to process 2, or do neither before switching to process 2. All of these avoid the reordering problem.
There are multiple effects at work, but they are modeled as just one effect, which makes them easier to reason about. Yes, a modern core already re-orders instructions by itself. But it maintains the logical flow between them: if two instructions have an inter-dependency, they stay ordered so the logic of the program does not change. Discovering these inter-dependencies and preventing an instruction from being issued too early is the job of the reorder buffer in the execution engine.
This logic is solid and can be relied upon; it would be next to impossible to write a program if that weren't the case. But the same guarantee cannot be provided by the memory controller. It has the unenviable job of giving multiple processors access to the same shared memory.
First is the prefetcher: it reads data from memory ahead of time to ensure the data is available by the time a read instruction executes, so the core won't stall waiting for the read to complete. The problem is that, because memory was read early, it might be a stale value that was changed by another core between the time the prefetch was done and the time the read instruction executes. To an outside observer it looks like the instruction executed early.
Second is the store buffer: it takes the data of a write instruction and writes it to memory lazily, after the instruction has executed, so the core won't stall waiting for the memory-bus write cycle to complete. To an outside observer, it just looks like the instruction executed late.
Modeling the effects of the prefetcher and store buffer as instruction reordering effects is very convenient. You can write that down on a piece of paper easily and reason about the side-effects.
To the core itself, the effects of the prefetcher and store buffer are entirely benign and it is oblivious to them. As long as there isn't another core that's also changing memory content. A machine with a single core always has that guarantee.
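Here is a sketch of the store/load reordering experiment the question refers to, assuming C11 atomics and pthreads (an illustration under those assumptions, not a guaranteed reproduction). With relaxed ordering on two cores, a run can occasionally record r1 == 0 and r2 == 0; uncommenting the sequentially consistent fences, or pinning both threads to one core, rules that outcome out:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int X, Y;
int r1, r2;

void *writer1(void *arg)
{
    (void)arg;
    atomic_store_explicit(&X, 1, memory_order_relaxed);
    /* atomic_thread_fence(memory_order_seq_cst); */
    r1 = atomic_load_explicit(&Y, memory_order_relaxed);
    return NULL;
}

void *writer2(void *arg)
{
    (void)arg;
    atomic_store_explicit(&Y, 1, memory_order_relaxed);
    /* atomic_thread_fence(memory_order_seq_cst); */
    r2 = atomic_load_explicit(&X, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    for (int iter = 0; iter < 1000000; iter++) {
        atomic_store(&X, 0);
        atomic_store(&Y, 0);
        pthread_t t1, t2;
        pthread_create(&t1, NULL, writer1, NULL);
        pthread_create(&t2, NULL, writer2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        if (r1 == 0 && r2 == 0)
            printf("reordering observed on iteration %d\n", iter);
    }
    return 0;
}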

Thread visibility within one process

I've been reading the book Cracking the Coding Interview recently, and there's one paragraph on page 257 that confuses me a lot:
A thread is a particular execution path of a process; when one thread modifies a process resource, the change is immediately visible to sibling threads.
IIRC, if one thread makes a change to a variable, the change is first saved in the CPU cache (say, the L1 cache) and is not guaranteed to become visible to other threads unless the variable is declared volatile.
Am I right?
Nope, you're wrong. But this is a very common misunderstanding.
Every modern multi-core CPU has hardware cache coherence. The L1 and similar caches are invisible to software; CPU caches like the L1 have nothing to do with memory visibility.
Changes are visible immediately when a thread modifies a process resource. The issue is optimizations that cause process resources not to be modified in precisely the order the code specifies.
If your code has k = j; i = 4; if (j == 2) foo(); an optimizer might see that your first assignment reads the value of j. So it might not bother reading it again when you compare it to 2 since it "knows" that it can't have changed. However, another thread might have changed it. So optimizations of some kinds need to be disabled when synchronization between threads is required. That's what things like volatile do.
If compilers and CPUs made no optimizations and executed a program precisely as it was written, volatile would never be needed. Memory visibility is about optimizations in code (some done by the compiler, some by the CPU), not caches.
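A C sketch of the same optimization issue the answer above describes (the question is about Java, where volatile plays an analogous role); the names are made up for illustration. With a plain flag, the compiler is allowed to read it once and spin forever; with an atomic, every check performs a real load that another thread's store becomes visible to:

#include <stdatomic.h>
#include <stdbool.h>

bool plain_done;          /* plain flag: the optimizer may hoist the load out of the loop */
atomic_bool atomic_done;  /* atomic flag: every iteration reloads from memory             */

void wait_plain(void)
{
    /* A data race; the optimizer may compile this as if it were
       'if (!plain_done) for (;;);' because it "knows" the value
       cannot change inside this function. */
    while (!plain_done)
        ;
}

void wait_atomic(void)
{
    /* Each iteration performs an actual load, so a store with release
       semantics in another thread will eventually be observed. */
    while (!atomic_load_explicit(&atomic_done, memory_order_acquire))
        ;
}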
I think the text you are quoting is incorrect. The whole idea of the Java Memory Model is to deal with the complex optimizations made by modern software and hardware, so that programmers can determine which writes are visible to the corresponding reads in other threads.
Unless a program in Java is properly synchronized, you can't guarantee that changes by one thread are immediately visible to other threads. Maybe the text refers to a very specific (and weak) memory model.
Usage of volatile variables is just one way to synchronize threads, and it's not suitable for all scenarios.
--Edit--
I think I understand the confusion now... I agree with David Schwartz, assuming that:
1) "modifies a process resource" means the actual change of the resource, not just the execution of a write instruction written in some high level computer language.
2) "is immediately visible to sibling threads" means that other threads are able to see it; it doesn't mean that a thread in your program will necessarily see it. You may still need to use synchronization tools in order to disable optimizations that bypass the actual access to the resource.

How is locking implemented?

I have the following code:
while(lock)
;
lock = 1;
// critical section
lock = 0;
Reading and then changing the lock value is itself a multi-instruction sequence:
read lock
change value
write it
If it happens like this:
1) One thread reads the lock and stops there
2) Another thread reads it, sees it is free, locks it, and gets halfway through its work
3) The first thread wakes up and also enters the critical section
So how would locking actually be implemented in the system?
Guarding the lock variable with yet another variable isn't right: that would be like guarding the guard.
Stopping the other processors or threads doesn't seem right either.
It is 100% platform specific. Generally, the CPU provides some form of atomic operation such as exchange or compare and swap. A typical lock might work like this:
1) Create: Store 0 (unlocked) in the variable.
2) Lock: Atomically attempt to switch the value of the variable from 0 (unlocked) to 1 (locked). If we failed (because it wasn't unlocked to begin with), let the CPU rest a bit, and then retry. Use a memory barrier to ensure no future memory operations sneak behind this one.
3) Unlock: Use a memory barrier to ensure previous memory operations don't sneak past this one. Atomically write 0 (unlocked) to the variable.
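A sketch of those three steps using C11 atomics, where the acquire/release memory orders play the role of the barriers described above (a pedagogical sketch only; real lock implementations add back-off, fairness, and platform-specific handling):

#include <stdatomic.h>
#include <sched.h>        /* sched_yield(): "let the CPU rest a bit" */

atomic_int the_lock;      /* 0 = unlocked, 1 = locked */

void lock_create(void)
{
    /* 1) Create: store 0 (unlocked) in the variable. */
    atomic_store(&the_lock, 0);
}

void lock_acquire(void)
{
    /* 2) Lock: atomically swap in 1; if the old value was already 1,
       someone else holds the lock, so back off and retry.
       memory_order_acquire keeps later memory operations from
       sneaking above this point. */
    while (atomic_exchange_explicit(&the_lock, 1, memory_order_acquire) == 1)
        sched_yield();
}

void lock_release(void)
{
    /* 3) Unlock: memory_order_release keeps earlier memory operations
       from sneaking below this point; then store 0 (unlocked). */
    atomic_store_explicit(&the_lock, 0, memory_order_release);
}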
Note that you really don't need to understand this unless you want to design your own synchronization primitives. And if you want to do that, you need to understand an awful lot more. It's certainly a good idea for every programmer to have a general idea of what he's making the hardware do. But this is an area filled with seriously heavy wizardry. There are so many, many ways this can go horribly wrong. So just use the locking primitives provided by the geniuses who made your platform, compiler, and threading library. Here be dragons.
For example, SMP Pentium Pro systems have an erratum that requires special handling in the unlock operation. A naive implementation of the lock algorithm will cause the branch prediction logic to expect the operation to keep spinning, incurring a massive performance penalty at the worst possible time -- when you first acquire the lock. A naive implementation of the lock algorithm may cause two cores each waiting for the same lock to saturate the bus, slowing the CPU that needs to get work done in order to release the lock to a crawl. These all require heavy wizardry and deep understanding of the hardware to deal with.
In a course I studied at Uni, a possible firmware solution for implementing locks was presented in the form of the "atomicity bit" associated to a memory operation initiated by a processor.
Basically, when locking, you'll notice that you have a sequence of operations that need to be executed atomically: test the value of the flag and, if not set, set it to locked; otherwise try again. This sequence can be made atomic by associating a bit with each memory request sent by the CPU. The first N-1 operations will have the bit set, while the last one will have it unset, to mark the end of the atomic sequence.
When the memory module (there can be several modules) where the flag data is stored receives the request for the first operation in the sequence (whose bit is set), it will serve it and not take requests from any other CPU until the CPU that initiated the atomic sequence sends a request with the atomicity bit unset (since these sequences are usually short, a coarse-grained approach like this is acceptable). Note that this is usually made easier by the architecture providing specialized "compare-and-set" instructions that do exactly what I described above.
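As a user-level analogue of that "test the flag and, if not set, set it" sequence, a compare-and-set style operation can be used like this (assuming C11's compare-exchange as a stand-in for the hardware facility described above; a sketch, not the course's firmware design):

#include <stdatomic.h>
#include <stdbool.h>

atomic_int flag;   /* 0 = free, 1 = taken */

bool try_acquire(void)
{
    int expected = 0;
    /* Succeeds only if the flag is currently 0, and in that case sets
       it to 1 as part of the same indivisible operation. */
    return atomic_compare_exchange_strong(&flag, &expected, 1);
}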

Thread Cooperation on Dual-CPU Machines

I remember in a course I took in college, one of my favorite examples of a race condition was one in which a simple main() method started two threads, one of which incremented a shared (global) variable by one, the other decrementing it. Pseudo code:
static int i = 10;

main() {
    new Thread(thread_run1).start();
    new Thread(thread_run2).start();
    waitForThreads();
    print("The value of i: " + i);
}

thread_run1 {
    i++;
}

thread_run2 {
    i--;
}
The professor then asked what the value of i is after a million billion zillion runs. (If it would ever be anything other than 10, essentially.) Students unfamiliar with multithreading systems responded that 100% of the time, the print() statement would always report i as 10.
This was in fact incorrect, as our professor demonstrated that each increment/decrement statement was actually compiled (to assembly) as 3 statements:
1: move value of 'i' into register x
2: add 1 to value in register x
3: move value of register x into 'i'
Thus, the value of i could be 9, 10, or 11. (I won't go into specifics.)
My Question:
It was (is?) my understanding that the set of physical registers is processor-specific. When working with dual-CPU machines (note the difference between dual-core and dual-CPU), does each CPU have its own set of physical registers? I had assumed the answer is yes.
On a single-CPU (multithreaded) machine, context switching allows each thread to have its own virtual set of registers. Since there are two physical sets of registers on a dual-CPU machine, couldn't this result in even more potential for race conditions, since you can literally have two threads operating simultaneously, as opposed to 'virtual' simultaneous operation on a single-CPU machine? (Virtual simultaneous operation in reference to the fact that register states are saved/restored each context switch.)
To be more specific - if you were running this on an 8-CPU machine, each CPU with one thread, are race conditions eliminated? If you expand this example to use 8 threads, on a dual-CPU machine, each CPU having 4 cores, would the potential for race conditions increase or decrease? How does the operating system prevent step 3 of the assembly instructions from being run simultaneously on two different CPUs?
Yes, the introduction of dual-core CPUs made a significant number of programs with latent threading races fail quickly. Single-core CPUs multitask by having the scheduler rapidly switch the threading context between threads, which eliminates a class of threading bugs associated with a stale CPU cache.
The example you give can fail on a single core as well, though: when the thread scheduler interrupts the thread just after it has loaded the value of the variable into a register in order to increment it. It just won't fail nearly as often, because the odds of the scheduler interrupting the thread at exactly that point aren't that great.
There's an operating system feature that lets these programs limp along anyway instead of crashing within minutes, called 'processor affinity'; it is available as the /AFFINITY command-line option for start.exe on Windows and as SetProcessAffinityMask() in the winapi. Also review the Interlocked class for helper methods that atomically increment and decrement variables.
You'd still have a race condition - it doesn't change that at all. Imagine two cores both performing an increment at the same time - they'd both load the same value, increment to the same value, and then store the same value... so the overall increment from the two operations would be one instead of two.
There are additional causes of potential problems where memory models are concerned - where step 1 may not really retrieve the latest value of i, and step 3 may not immediately write the new value of i in a way which other threads can see.
Basically, it all becomes very tricky - which is why it's generally a good idea to either use synchronization when accessing shared data or to use lock-free higher level abstractions which have been written by experts who really know what they're doing.
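To tie that back to the professor's example, here is a hedged sketch of the same program with the shared counter updated by atomic read-modify-write operations (assuming C11 atomics and pthreads), so the three assembly steps can no longer interleave and the printed value is always 10 on any number of cores:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int i = 10;

void *thread_run1(void *arg) { (void)arg; atomic_fetch_add(&i, 1); return NULL; }
void *thread_run2(void *arg) { (void)arg; atomic_fetch_sub(&i, 1); return NULL; }

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread_run1, NULL);
    pthread_create(&t2, NULL, thread_run2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* The atomic increment and decrement cannot be split, so the result
       does not depend on scheduling or on the number of CPUs. */
    printf("The value of i: %d\n", atomic_load(&i));
    return 0;
}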
First, dual processor versus dual core has no real effect. A dual-core processor still has two completely separate processors on the chip. They may share some cache and do share a common bus to memory and peripherals, but the processors themselves are entirely separate. (A dual-threaded single core, such as with Hyperthreading, is a third variation -- but it has a set of registers per virtual processor as well. The two virtual processors share a single set of execution resources, but they retain completely separate register sets.)
Second, there are really only two cases that are really interesting: a single thread of execution, and everything else. Once you have more than one thread (even if all threads run on a single processor), you have the same potential problems as if you're running on some huge machine with thousands of processors. Now, it's certainly true that you're likely to see the problems manifest themselves a lot sooner when the code runs on more processors (up to as many as you've created threads), but the problems themselves don't change at all.
From a practical viewpoint, having more cores is useful from a testing standpoint. Given the granularity of task switching on a typical OS, it's pretty easy to write code that will run for years without showing problems on a single processor, but that will crash and burn in a matter of hours or even minutes when you run it on two or more physical processors. The problem hasn't really changed though -- it's just a lot more likely to show up a lot more quickly when you have more processors.
Ultimately, a race condition (or deadlock, livelock, etc.) is about the design of the code, not about the hardware it runs on. The hardware can make a difference in which steps you need to take to enforce the conditions involved, but the relevant differences have little to do with the simple number of processors. Rather, they're about things like the concessions made when you have not simply a single machine with multiple processors, but multiple machines with completely separate address spaces, so you may have to take extra steps to ensure that when you write a value to memory, it becomes visible to the CPUs on other machines that can't see that memory directly.
