Thread Cooperation on Dual-CPU Machines - multithreading

I remember in a course I took in college, one of my favorite examples of a race condition was one in which a simple main() method started two threads, one of which incremented a shared (global) variable by one, the other decrementing it. Pseudo code:
static int i = 10;
main() {
    new Thread(thread_run1).start();
    new Thread(thread_run2).start();
    waitForThreads();
    print("The value of i: " + i);
}
thread_run1 {
    i++;
}
thread_run2 {
    i--;
}
The professor then asked what the value of i is after a million billion zillion runs. (If it would ever be anything other than 10, essentially.) Students unfamiliar with multithreading systems responded that 100% of the time, the print() statement would always report i as 10.
This was in fact incorrect, as our professor demonstrated that each increment/decrement statement was actually compiled (to assembly) as 3 statements:
1: move value of 'i' into register x
2: add 1 to value in register x
3: move value of register x into 'i'
Thus, the value of i could be 9, 10, or 11. (I won't go into specifics.)
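To see how 9 or 11 come about in practice, here is a small sketch in C with POSIX threads (the names and loop count are only illustrative; looping many times simply makes the lost updates likely enough to observe). Compile with -pthread:
#include <pthread.h>
#include <stdio.h>

static int i = 10;                     /* shared, deliberately unsynchronized */
#define N 1000000

static void *thread_run1(void *arg) {
    (void)arg;
    for (int k = 0; k < N; k++)
        i++;                           /* compiles to load / add / store, not atomic */
    return NULL;
}

static void *thread_run2(void *arg) {
    (void)arg;
    for (int k = 0; k < N; k++)
        i--;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread_run1, NULL);
    pthread_create(&t2, NULL, thread_run2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("The value of i: %d\n", i); /* rarely 10 once the loops overlap */
    return 0;
}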
My Question:
It was (is?) my understanding that the set of physical registers is processor-specific. When working with dual-CPU machines (note the difference between dual-core and dual-CPU), does each CPU have its own set of physical registers? I had assumed the answer is yes.
On a single-CPU (multithreaded) machine, context switching allows each thread to have its own virtual set of registers. Since there are two physical sets of registers on a dual-CPU machine, couldn't this result in even more potential for race conditions, since you can literally have two threads operating simultaneously, as opposed to 'virtual' simultaneous operation on a single-CPU machine? (Virtual simultaneous operation in reference to the fact that register states are saved/restored each context switch.)
To be more specific - if you were running this on an 8-CPU machine, each CPU with one thread, are race conditions eliminated? If you expand this example to use 8 threads, on a dual-CPU machine, each CPU having 4 cores, would the potential for race conditions increase or decrease? How does the operating system prevent step 3 of the assembly instructions from being run simultaneously on two different CPUs?

Yes, the introduction of dual-core CPUs made a significant number of programs with latent threading races fail quickly. Single-core CPUs multitask by having the scheduler rapidly switch the thread context between threads, which eliminates a class of threading bugs associated with a stale CPU cache.
The example you give can fail on a single core as well, though: the thread scheduler can interrupt the thread just after it has loaded the value of the variable into a register in order to increment it. It just won't fail nearly as often, because the odds that the scheduler interrupts the thread at exactly that point aren't that great.
There's an operating system feature that allows these programs to limp along anyway instead of crashing within minutes, called 'processor affinity'. It's available as the /AFFINITY command line option for start.exe on Windows, and as SetProcessAffinityMask() in the winapi. Review the Interlocked class for helper methods that atomically increment and decrement variables.
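The Interlocked class is a .NET API; the same idea in portable C11 atomics looks roughly like this (a sketch, not the actual Interlocked implementation):
#include <stdatomic.h>

static atomic_int counter = 10;

void increment(void) {
    atomic_fetch_add(&counter, 1);  /* one indivisible read-modify-write */
}

void decrement(void) {
    atomic_fetch_sub(&counter, 1);
}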

You'd still have a race condition - it doesn't change that at all. Imagine two cores both performing an increment at the same time - they'd both load the same value, increment to the same value, and then store the same value... so the overall increment from the two operations would be one instead of two.
There are additional causes of potential problems where memory models are concerned - where step 1 may not really retrieve the latest value of i, and step 3 may not immediately write the new value of i in a way which other threads can see.
Basically, it all becomes very tricky - which is why it's generally a good idea to either use synchronization when accessing shared data or to use lock-free higher level abstractions which have been written by experts who really know what they're doing.
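As a minimal sketch of the 'use synchronization' option, assuming POSIX threads (the names are illustrative):
#include <pthread.h>

static int i = 10;
static pthread_mutex_t i_lock = PTHREAD_MUTEX_INITIALIZER;

void safe_increment(void) {
    pthread_mutex_lock(&i_lock);    /* only one thread at a time past this point */
    i++;
    pthread_mutex_unlock(&i_lock);  /* release before another thread may enter */
}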

First, dual processor versus dual core has no real effect. A dual-core processor still has two completely separate processors on the chip. They may share some cache, and they share a common bus to memory/peripherals, but the processors themselves are entirely separate. A dual-threaded single core (such as Hyperthreading) is a third variation -- but it has a set of registers per virtual processor as well. The two virtual processors share a single set of execution resources, but they retain completely separate register sets.
Second, there are really only two cases that are truly interesting: a single thread of execution, and everything else. Once you have more than one thread (even if all threads run on a single processor), you have the same potential problems as if you were running on some huge machine with thousands of processors. Now, it's certainly true that you're likely to see the problems manifest themselves a lot sooner when the code runs on more processors (up to as many as you've created threads), but the problems themselves don't change at all.
From a practical viewpoint, having more cores is useful from a testing viewpoint. Given the granularity of task switching on a typical OS, it's pretty easy to write code that will run for years without showing problems on a single processor, but that will crash and burn in a matter of hours or even minutes when you run it on two or more physical processors. The problem hasn't really changed though -- it's just a lot more likely to show up a lot more quickly when you have more processors.
Ultimately, a race condition (or deadlock, livelock, etc.) is about the design of the code, not about the hardware it runs on. The hardware can make a difference in what steps you need to take to enforce the conditions involved, but the relevant differences have little to do with the simple number of processors. Rather, they're about things like concessions made when you have not simply a single machine with multiple processors, but multiple machines with completely separate address spaces, so you may have to take extra steps to assure that when you write a value to memory it becomes visible to the CPUs on other machines that can't see that memory directly.

Related

Do multicore processors really perform work in parallel?

I am dealing with threading and related topics like processes and context switching...
I understand that on a system with one multicore processor, real parallel work by more than one process isn't possible; we just have an illusion of such work because of process context switching.
But what about threads within one process running on a multicore processor? Do they really work simultaneously, or is that also just an illusion? Can a processor with 2 hardware cores work on two threads at a time? If not, what is the point of multicore processors?
Can a processor with 2 hardware cores work on two threads at a time?
Yes,...
...But, Imagine yourself back in Victorian times, hiring a bunch of clerks to perform a complex computation. All of the data that they need are in one book, and they're supposed to write all of their results back into the same book.
The book is like a computer's memory, and the clerks are like individual CPUs. Since only one clerk can use the book at any given time, then it might seem as if there's no point in having more than one of them,...
... Unless, you give each clerk a notepad. They can go to the book, copy some numbers, and then work for a while just from their own notepad, before they return to copy a partial result from their notepad into the book. That allows other clerks to do some useful work when any one clerk is at the book.
The notepads are like a computer's Level 1 caches—relatively small areas of high-speed memory that are associated with a single CPU, and which hold copies of data that have been read from, or need to be written back to the main memory. The computer hardware automatically copies data between main memory and the cache as needed, so the program does not necessarily need to be aware that the cache even exists. (see https://en.wikipedia.org/wiki/Cache_coherence)
But the programmer should be aware: if you can structure your program so that different threads spend most of their time reading and writing private variables, and relatively little time accessing variables that are shared with other threads, then most of the private-variable accesses will go no further than the L1 cache, and the threads will be able to truly run in parallel. If, on the other hand, the threads all try to use the same variables at the same time, or if they all try to iterate over large amounts of data (too large to fit in the cache), then they will have much less ability to work in parallel.
See also:
https://en.wikipedia.org/wiki/Cache_hierarchy
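To make the notepad-versus-book advice concrete, here is a sketch in C with POSIX threads where each thread accumulates into a private local variable and touches shared memory only once at the end (names and workload are illustrative):
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define WORK 10000000

static long long totals[N_THREADS];   /* one slot per thread */

static void *worker(void *arg) {
    int id = *(int *)arg;
    long long local = 0;              /* private: stays in a register / L1 cache */
    for (long long k = 0; k < WORK; k++)
        local += k % 7;
    totals[id] = local;               /* touch shared memory once, at the end */
    return NULL;
}

int main(void) {
    pthread_t threads[N_THREADS];
    int ids[N_THREADS];
    long long grand_total = 0;
    for (int t = 0; t < N_THREADS; t++) {
        ids[t] = t;
        pthread_create(&threads[t], NULL, worker, &ids[t]);
    }
    for (int t = 0; t < N_THREADS; t++)
        pthread_join(threads[t], NULL);
    for (int t = 0; t < N_THREADS; t++)
        grand_total += totals[t];
    printf("total: %lld\n", grand_total);
    return 0;
}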
Multiple cores do actually perform work in parallel (at least on all mainstream modern CPU architectures). Processes have one or more threads. The OS scheduler schedules active tasks, which are generally threads, onto the available cores. When there are more active tasks than available cores, the OS uses preemption to execute tasks concurrently on each core.
In practice, software applications can perform synchronization that may cause some cores to be inactive for a given period of time. Hardware operations can also cause this (e.g. waiting for memory data to be retrieved, or doing an atomic operation).
Moreover, on modern processors, physical cores are often split into multiple hardware threads that can each execute different tasks. This is called SMT (aka Hyper-Threading). On fairly recent x86 processors, the 2 hardware threads of the same core can simultaneously execute 2 tasks in parallel. The tasks share parts of the physical core, like the execution units, so using 2 hardware threads can be faster than 1 for some tasks (typically the ones not fully using the core's resources).
Having 2 hardware threads that cannot truly run in parallel, but run concurrently at a fine granularity, can still be beneficial for performance. In fact, that was the case for a long time (during the last decade). For example, when a task is latency bound (e.g. waiting for data to be retrieved from RAM), another task can be scheduled to do some work, improving overall efficiency. This was the initial goal of SMT. The same is true for preempted tasks on the same core (though the granularity needs to be much bigger): one process can perform a networking operation and be preempted so another process can do some work, before being preempted again when data arrives from the network.
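One consequence worth knowing: what the OS reports as "processors" usually counts hardware threads, not physical cores. A small sketch, assuming Linux/glibc (sysconf with _SC_NPROCESSORS_ONLN is an extension, not standard C):
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Counts logical processors, i.e. hardware threads: an SMT-enabled
       4-core CPU will typically report 8 here. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online logical processors: %ld\n", n);
    return 0;
}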

What exactly happens when two threads try to increment an atomic integer at the same time?

Consider the following scenario:
Thread 1 calls get and gets the value 1.
Thread 1 calculates next to be 2.
Thread 2 calls get and gets the value 1.
Thread 2 calculates next to be 2.
Both threads try to write the value.
Now because of atomics - only one thread will succeed; the other will receive false from compareAndSet and go around again.
I got stuck at "because of atomics". What if two threads reach the compareAndSet method at the same time? I am looking for practical examples rather than theory.
Hardware interlocks will ensure that if two or more threads attempt a compareAndSet simultaneously, one will be selected as "winning" and all others will "lose". Typically this will be done by using a common clock for all cores, so that every core will see a discrete sequence of execution steps (called "cycles" at the hardware level) in which
various things happen. In a vastly over-simplified execution model where cores don't have caches but instead use a multi-port memory, each core could report to every other core on each cycle whether it is performing the "read" portion of a compareAndSet. Each core would then hold off on starting a compareAndSet on the cycle after it has seen another thread start one, and each core could defer and restart its own compareAndSet if a lower-numbered core starts one with the same address on the same cycle.
The net result is that it's impossible for two cores to "successfully" perform compareAndSet operations on the same storage at the same time. Instead, hardware will delay one of the actions so that they occur sequentially.
It is the hardware, specifically the cache coherence protocol (MESI, etc.), that ensures the consistency of atomic operations done concurrently from different CPU cores (which run the respective concurrent threads). There is a good read called "Memory Barriers: a Hardware View for Software Hackers" which I can highly recommend on the subject.
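To make the "one succeeds, the other goes around again" behaviour concrete, here is the retry loop sketched with C11 atomics (Java's AtomicInteger does the same thing internally, but this is only an illustration, not its actual source):
#include <stdatomic.h>

static atomic_int value = 1;

/* Same shape as the scenario above: read, compute, then try to publish.
   If another thread published first, the compare-exchange fails and we retry. */
int increment_and_get(void) {
    int current = atomic_load(&value);
    for (;;) {
        int next = current + 1;
        /* Success: value was still `current` and is now `next`.
           Failure: `current` is reloaded with whatever the other thread wrote. */
        if (atomic_compare_exchange_weak(&value, &current, next))
            return next;
    }
}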

Maximum number of threads and multithreading

I'm tripping up on the multithreading concept.
For example, my processor has 2 cores and, with hyper-threading, 2 threads per core, totaling 4 threads. So does this mean my CPU can execute four separate instructions simultaneously? Is each thread capable of being a multi-thread?
So does this mean my CPU can execute four separate instructions simultaneously? Is each thread capable of being a multi-thread?
In short to both, yes.
A CPU can only execute 1 instruction at a time. Due to certain factors like pipelining, a CPU might be able to pass multiple instructions through the different phases of a single clock cycle, and the frequency of the clock might be extremely fast, but it's still only 1 instruction at a time.
As an example, NOP is an x86 assembly instruction which the CPU interprets as "no operation this cycle"; that's 1 instruction out of the hundreds or thousands (and more) that are executed by something even as simple as:
int main(void)
{
    while (1) { /* eat CPU */ }
    return 0;
}
A CPU thread of execution is one in which a series of instructions (a thread of instructions) is being executed; it does not matter from what "application" the instructions are coming, since a CPU does not know about high level concepts (like applications) - that's a function of the OS.
So if you have a computer with 2 (or 4/8/128/etc.) CPU's that share the same memory (cache/RAM), then you can have 2 (or more) CPU's that can run 2 (or more) instructions at (literally) the exact same time. Keep in mind that these are machine instructions that are running at the same time (i.e. the physical side of the software).
An OS level thread is something a bit different. While the CPU handles the physical side of the execution, the OS handles the logical side. The above code breaks down into more than 1 instruction and, when executed, can actually get run on more than 1 CPU (in a multi-CPU aware environment). Even though it's a single "thread" (at the OS level), the OS schedules when to run the next instructions and on which CPU (based on the OS's thread scheduling policy, which differs among the various OS's). So the above code will eat up 100% CPU usage for a given "time slice" on the CPU it's running on.
This "slicing" of "time" (also known as preemptive multitasking) is why an OS can run multiple applications "at the same time". It's not literally1 at the same time, since a CPU can only handle 1 instruction at a time, but to a human (who can barely comprehend the length of 1 second) it appears to be "at the same time".
1) except in the case of a multi-CPU setup, where it might literally be at the same time.
When an application is run, the kernel (the OS) actually spawns a separate thread (a kernel thread) to run the application on. Additionally, the application can request to create another external thread (i.e. spawning another process or forking), or create an internal thread by calling the OS's (or programming language's) API's, which in turn call lower level kernel routines that spawn and maintain the context switching of the spawned thread. Any created thread is also capable of calling the same API's to spawn other separate threads (thus a thread is capable of being "multi-threaded").
Multi-threading (in the sense of applications and operating systems), is not necessarily portable, so while you might learn Java or C# and use their API's (i.e. Thread.Start or Runnable), utilizing the actual OS API's as provided (i.e. CreateThread or pthread_create and the slew of other concurrency functions) opens a different door for design decisions (i.e. "does platform X support thread library Y"); just something to keep in mind as you explore the different API's.
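A sketch of that spawning chain with the POSIX API (the Win32 and Java APIs follow the same pattern; the function names here are illustrative):
#include <pthread.h>
#include <stdio.h>

static void *inner(void *arg) {
    (void)arg;
    printf("inner thread, spawned by another thread\n");
    return NULL;
}

/* Any thread can create further threads: the outer thread, itself created
   from main(), spawns and joins an inner one. */
static void *outer(void *arg) {
    (void)arg;
    pthread_t t;
    pthread_create(&t, NULL, inner, NULL);
    pthread_join(t, NULL);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, outer, NULL);
    pthread_join(t, NULL);
    return 0;
}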
I hope that can help add some clarity.
I actually researched this very topic in my Operating Systems class.
When using threads, a good rule of thumb for increased performance of CPU bound processes is to use an equal number of threads as cores, except on a hyper-threaded system, in which case one should use twice as many threads as cores. The other rule of thumb is for I/O bound processes: quadruple the number of threads per core, or, on a hyper-threaded system, quadruple the number of threads per hardware thread.

Do I really need a lock on a count shared by multiple threads on one CPU core?

If I have some code that looks like this (please ignore the syntax, I want to understand it without a specific language):
count = 0
def countDown():
    count += 1
if __name__ == '__main__':
    thread1(countDown)
    thread2(countDown)
    thread3(countDown)
Here I have a CPU with only one core. Do I really need a lock on the variable count, in case it gets overwritten by other threads?
I don't know, but if the answer depends a lot on the language, please explain it for Java, C and Python. Many thanks.
Thanks guys, I now understand that I do need a lock. But here's another question: when do I need to use multiple threads?
Since the CPU will execute only one instruction at a time, it seems that multiple threads will just spend more time managing the thread switches and can't save any calculation time.
Technically, in general, yes. Maybe not in this particular example, but imagine your supposedly atomic function actually consisted of several instructions. The operating system can and does execute many threads at once: it executes some steps of one, then switches back to the OS, which chooses which process/thread to continue. It can start all of your threads and switch between them, even on one CPU. All the threads would then operate on the same memory addresses and share variables.
Edit: Answer to 2nd question.
When you have one core, I can imagine only one case where you would need multithreading: when one of your threads can block and you need to monitor for that or do something else in the meantime. One practical example would be a server. If you want to serve multiple clients at the same time, you need to switch between them; if you served them in a queue, one bad client could hang the whole process.
If you are doing computations, you might use it to split I/O and computation, but it would need to be a very extreme case for that to be useful or needed.
Yes, you probably still need a lock. Your countDown code probably compiles to something like this:
load global variable "count" into register x
x = x + 1
save register x into global variable "count"
If there is a thread switch in the middle there, then you're in trouble. You don't actually need a second core to get the bad behavior.
Sometimes countDown might compile to an atomic instruction. For instance, there are such instructions on x86, but there's no way I know of to guarantee that the compiler uses them (except to write the assembly yourself).
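On x86 with GCC, "writing the assembly yourself" can look like the following sketch (GCC-specific inline assembly; an assumption about the toolchain, not something the answer above prescribes):
static int count = 0;

void countDown(void) {
    /* Forces a locked increment regardless of what the compiler would emit
       on its own. Only valid on x86/x86-64 with GCC or Clang. */
    __asm__ __volatile__("lock; incl %0" : "+m" (count) : : "cc");
}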
For simple things like incrementing a counter, instead of using locks, in C you can find atomic functions which do the operation in a thread-safe way. GCC defines these atomic builtin functions, which are usually wrapped in a public function call in whatever your particular environment is:
http://gcc.gnu.org/onlinedocs/gcc-4.5.0/gcc/Atomic-Builtins.html
Mac OS X defines these too, for example: https://developer.apple.com/library/mac/#documentation/cocoa/Conceptual/Multithreading/ThreadSafety/ThreadSafety.html
These have the potential to be more efficient than a lock because they are more limited in functionality than a lock.
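For illustration, here is the counter from the question using the GCC __sync builtin family documented at the link above (newer GCC also offers __atomic_fetch_add and C11 <stdatomic.h>); a sketch with the question's names:
static int count = 0;

void countDown(void) {
    __sync_fetch_and_add(&count, 1);  /* atomic read-modify-write, no lock */
}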
For the simplest example, we create multiple threads sharing a single variable and performing a single atomic instruction on it. No matter where any thread is interrupted, its state is either completely before or completely after the instruction on the shared resource.
In this case, a single x86 increment instruction cannot be interrupted partway by a thread switch, so on one core it is effectively atomic and therefore thread safe; you would not need a lock to maintain consistency. (Across multiple cores, the increment would also need a LOCK prefix to remain atomic.)
When do you need multi-threading?
To me there are two distinct applications:
Parallel processing, when several threads - ideally just one per core - work on a small part of the overall problem for an extended period of time. The required code and data are small and - in the best of worlds - will fit in the core's L1 and L2 caches. The bottleneck here - if performance is important - will be memory bandwidth and how to use as little of it as possible.
The other is when there are distinct components of a program that operate more or less independently of one another and where the processing requirements vary over time. One example could be a mail (SMTP) server, which has at least three independent components: an SMTP server to receive mail from SMTP clients, an SMTP client to send mail to other SMTP servers, and a name client to look up the real addresses to which the SMTP client should send the mail.
The lock issue has already been well explained by the other posters.
The other question is fairly easy too - most apps are multithreaded to improve I/O performance with multiple I/O streams that can block. I'm typing in one now. The browser must respond to network activity and user input at the mouse and keyboard, and often it must do both 'at the same time'. User input and network comms are very slow and slow, respectively - both block. So the GUI and network comms run on different threads. This needs to happen even with only one CPU core, and not doing so results in old 'Windows 3.1' style 'hourglass apps' where the GUI is often non-responsive. Note that this issue of requiring multiple threads also applies to async I/O - something that can seem like it runs on one thread, but is supported by kernel threads/pools - most of the blocking is moved into the kernel.
That's it for a single-core box. You cannot use multiple threads to speed up CPU-intensive calculations (in fact, you will slow them down, as you realise), but you can use them for high-performance I/O. Many apps were multithreaded back when we all had single-core Pentiums and Windows 95 - to optimize I/O, not to speed up calculations.

Critical Sections that Spin on Posix?

The Windows API provides critical sections in which a waiting thread will spin a limited amount of times before context switching, but only on a multiprocessor system. These are implemented using InitializeCriticalSectionAndSpinCount. (See http://msdn.microsoft.com/en-us/library/ms682530.aspx.) This is efficient when you have a critical section that will often only be locked for a short period of time and therefore contention should not immediately trigger a context switch. Two related questions:
For a high-level, cross-platform threading library or an implementation of a synchronized block, is having a small amount of spinning before triggering a context switch a good default?
What, if anything, is the equivalent to InitializeCriticalSectionAndSpinCount on other OS's, especially Posix?
Edit: Of course no spin count will be optimal for all cases. I'm only interested in whether using a nonzero spin count would be a better default than not using one.
My opinion is that the optimal "spin-count" for best application performance is too hardware-dependent for it to be an important part of a cross-platform API, and you should probably just use mutexes (in posix, pthread_mutex_init / destroy / lock / trylock) or spin-locks (pthread_spin_init / destroy / lock / trylock). Rationale follows.
What's the point of the spin count? Basically, if the lock owner is running simultaneously with the thread attempting to acquire the lock, then the lock owner might release the lock quickly enough that the EnterCriticalSection caller could avoid giving up CPU control in acquiring the lock, improving that thread's performance, and avoiding context switch overhead. Two things:
1: obviously this relies on the lock owner running in parallel to the thread attempting to acquire the lock. This is impossible on a single execution core, which is almost certainly why Microsoft treats the count as 0 in such environments. Even with multiple cores, it's quite possible that the lock owner is not running when another thread attempts to acquire the lock, and in such cases the optimal spin count (for that attempt) is still 0.
2: with simultaneous execution, the optimal spin count is still hardware dependent. Different processors will take different amounts of time to perform similar operations. They have different instruction sets (the ARM I work with most doesn't have an integer divide instruction), different cache sizes, the OS will have different pages in memory... Decrementing the spin count may take a different amount of time on a load-store architecture than on an architecture in which arithmetic instructions can access memory directly. Even on the same processor, the same task will take different amounts of time, depending on (at least) the contents and organization of the memory cache.
If the optimal spin count with simultaneous execution is infinite, then the pthread_spin_* functions should do what you're after. If it is not, then use the pthread_mutex_* functions.
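For reference, the two POSIX options named above, in sketch form (the wrapper functions are illustrative, not part of any library):
#include <pthread.h>

/* Option 1: plain mutex -- blocks (and lets the scheduler run something
   else) if the lock is contended. */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

void with_mutex(void (*critical)(void)) {
    pthread_mutex_lock(&mtx);
    critical();
    pthread_mutex_unlock(&mtx);
}

/* Option 2: spin lock -- burns CPU until the lock is free, which is only
   sensible when the critical section is very short and the owner is
   running on another core. */
static pthread_spinlock_t spin;

void spin_setup(void) {
    pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE);
}

void with_spinlock(void (*critical)(void)) {
    pthread_spin_lock(&spin);
    critical();
    pthread_spin_unlock(&spin);
}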
For a high-level, cross-platform threading library or an implementation of a synchronized block, is having a small amount of spinning before triggering a context switch a good default?
One would think so. Many moons ago, Solaris 2.x implemented adaptive locks, which did exactly this - spin for a while if the mutex is held by a thread executing on another CPU, or block otherwise.
Obviously, it makes no sense to spin on single-CPU systems.
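On Linux/glibc there is also a rough counterpart to InitializeCriticalSectionAndSpinCount: the non-portable adaptive mutex type, which spins briefly before putting the waiter to sleep. A sketch (PTHREAD_MUTEX_ADAPTIVE_NP is a glibc extension, and unlike the Windows call you cannot choose the spin count yourself):
#include <pthread.h>

/* Initializes *m as a glibc "adaptive" mutex: it spins a short,
   implementation-chosen time before blocking. The _NP suffix marks it
   as non-portable. */
int init_adaptive_mutex(pthread_mutex_t *m) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    int rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}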
