Differences beteween Threading in Nvidias GPUs and CPUs - multithreading

I am trying to understand the difference between the threading techniques used by Nividia GPUs and normal (multi threading) CPUs. In particular my two questions are:
Which part of the system is respondsible for the thread scheduling and according to which aspects are they scheduled?
Are threads processed synchronously?

CUDA cores and CPU cores are literally a complete different thing - the name is more a marketing thing;
What do you mean with responsible for thread scheduling? Its mostly both Software and Hardware. For instance the pure CPU has little to do with the actual thread scheduling, but provides the necessary functionality to implement a thread-scheduler as a part of the OS. So the scheduling parameter are defined by the software. Hence you should adopt your question to a specific OS.
One thing the CPU provides are the so called hardware-threads. Each hardware-thread allows the "parallel" execution of one software-thread. (Note: With Hyperthreading, the execution is not really parallel more interleaving). The scheduler distributes all running threads on these hardware-threads.
This is basically a MIMD-System.
The scheduling on graphic-cards are way more complicated. In short:
You have a few thousands CUDA-cores - but in contrast to the CPU you cannot assign a unique application to each of them. The CUDA-cores are organized in groups (so called warps) and all CUDA-cores inside the same group execute the same thread simultaneously.
This is called SIMT

Related

About multithreading, concurrency, and parallelism

Recently I had confusion with understanding concepts: multithreading, concurrency, and parallelism. In order to reduce confusion, I've tried to organize my understanding about these and drawn my conclusion. My question is,
Is there a misunderstanding or something wrong from conclusion below?
References I took can be found here.
1. Concurrency and parallelism are different level of category.
It is not either concurrent or parallelism. It is either concurrent or not, and either parallel or not.
For example,
Not concurrent (Sequential) / Not parallel
Not concurrent (Sequential) / Parallel
Concurrent / Not parallel
Concurrent / Parallel
2. Parallelism is not subset of concurrency.
3. How does threading or multithreading relates to concurrency and parallelism?
Definition of thread clarifies this. Thread is "unit of execution flow". This "execution flow" can be managed independently by a scheduler, which is typically a part of the operating system.
Having a thread means having one unit of execution flow.
Having multiple threads (Multithreading) means having multiple units of execution flow.
And,
Having multiple units of execution flow is having multiple things making progress, which is definition of concurrency.
And,
Multiple units of execution flow is done by time slicing in single core hardware environment.
Multiple units of execution flow is done in parallel in multi core hardware environment.
4. Is multithreading concurrent or parallel?
Multithreading, or having multiple units of execution flow, is concurrent.
Multithreading itself is just having multiple units of execution flow. This has nothing to do with parallelism.
How operating system deals with multiple units of execution flow relates to parallelism.
Parallelism is achieved by operating system and hardware environment.
"Code can be concurrent, but not parallel." (Parallelism implies concurrency but not the other way round right?, stackexchange)
Detailed description will be truly appreciated.
Parallelism Refers to any system in which a single application can make use of more computing hardware than a single CPU can provide. There are a number of different types of parallel computing architecture, but when people say "parallelism" they often are talking about one in particular...
...A Symmetric MultiProcessing (SMP) system is a computer with one memory system, and two or more traditional CPUs that have equal access to it. Most modern workstations, most mobile devices, and many server systems* are SMP.
Multithreading is a model of concurrent computing.** A computer scientist might tell you that two threads run concurrently when the order in which the operations they perform are interleaved is not strictly determined by the program itself. A software developer is more likely to say that two threads run concurrently with each other when both threads have been started and neither of them has finished.
One way to achieve parallelism in an application running on an SMP system is to use multiple concurrent threads.
* Some servers are NUMA, which is a close cousin to SMP. In a NUMA system, the CPUs all access the same memory system, just like in SMP, except that each CPU "owns" part of the physical memory space, and it can access its own memory locations more quickly than it can access memory locations that are owned by other CPUs.
** There are other models of concurrent computing. Some, such as Actors, are used in production software. Others are mostly of academic interest.

How does multithreading utilizes multiple cores?

So recently I've learned some basic knowledge about multithreading. What I've understood is that thread is a lightweight process that runs under processes by sharing memory, while one process is running under one CPU core.
Yet by this perspective I couldn't understand some saying that threads utilizes multiple cores and make the whole program executes more effective. From what I've known, threads created by one process should run only under that specific process, which means that it should only run under that very one CPU core. If we want to utilize multiple cores, we should actually use multiprocess to run parallelly. Most of what I've researched is only about the conclusion, i.e multithreading utilizes multiple cores, but none of them explains my question. Did I think anything wrong? Thanks!
Your confusion lies here:
[...] while one process is running under one CPU core.
[...] threads created by one process should run only under that specific process, which means that it should only run under that very one CPU core.
This is not true. I think what the various explanations you have read meant that any process have at least one thread (where a 'thread' is a sequence of instructions ran by a CPU core).
If you have a multithreaded program, the process will have several threads (sequences of instructions ran by a CPU core) that can run concurrently on different CPU cores.
There are many processes executing on your computer at any given time. The Operating System (OS) is the program that allocates the hardware resources (CPU cores) to all these processes and decides which process can use which cores for what amount of time before another process gets to use the CPU. Whether or not a process gets to use multiple cores is not entirely up to the process. More confusing still, multithreaded programs can use more threads than there are cores on the computer's CPU. In that case you can be certain that all your threads do not run in parallel.
One more thing:
[...] threads utilizes multiple cores and make the whole program executes more effective
I am going to sound very pedantic, but it is more complicated than that. It depends on what you mean by "effective". Are we talking about total computation time, energy consumption ..?
A sequential (1 thread) program may be very effective in terms of power consumption but taking a very long time to compute. If you are able to use multiple threads, you may be able to reduce that computation time but it will probably incur new costs (synchronization between threads, additional protection mechanisms against concurrent accesses ...).
Also, multithreading cannot help for certain tasks that fall outside of the CPU realm. For example, unless you have some very specific hardware support, reading a file from the hard-drive with 2 or more concurrent threads cannot be parallelized efficiently.

erlang threading and OS threads correlation [duplicate]

Erlang is known for being able to support MANY lightweight processes; it can do this because these are not processes in the traditional sense, or even threads like in P-threads, but threads entirely in user space.
This is well and good (fantastic actually). But how then are Erlang threads executed in parallel in a multicore/multiprocessor environment? Surely they have to somehow be mapped to kernel threads in order to be executed on separate cores?
Assuming that that's the case, how is this done? Are many lightweight processes mapped to a single kernel thread?
Or is there another way around this problem?
Answer depends on the VM which is used:
1) non-SMP: There is one scheduler (OS thread), which executes all Erlang processes, taken from the pool of runnable processes (i.e. those who are not blocked by e.g. receive)
2) SMP: There are K schedulers (OS threads, K is usually a number of CPU cores), which executes Erlang processes from the shared process queue. It is a simple FIFO queue (with locks to allow simultaneous access from multiple OS threads).
3) SMP in R13B and newer: There will be K schedulers (as before) which executes Erlang processes from multiple process queues. Each scheduler has it's own queue, so process migration logic from one scheduler to another will be added. This solution will improve performance by avoiding excessive locking in shared process queue.
For more information see this document prepared by Kenneth Lundin, Ericsson AB, for Erlang User Conference, Stockholm, November 13, 2008.
I want to ammend previous answers.
Erlang, or rather the Erlang runtime system (erts), defaults the number of schedulers (OS threads) and the number of runqueues to number of processing elements on your platform. That is processors cores or hardware threads. You can change these settings in runtime using:
erlang:system_flag(schedulers_online, NP) -> PrevNP
The Erlang processes does not have any affinity to any schedulers yet. The logic balancing the processes between the schedulers follows two rules. 1) A starving scheduler will steal work from another scheduler. 2) Migration paths are setup to push processes from schedulers with lots of processes to schedulers with less work. This is done to assure fairness in reduction count (execution time) for each process.
Schedulers however can be locked to specific processing elements. This not done by default. To let erts do the scheduler->core affinity use:
erlang:system_flag(scheduler_bind_type, default_bind) -> PrevBind
Several other bind types can be found in the documentation. Using affinity can greatly improve performance in heavy load situations! Especially in high lock contention situations. Also, the linux kernel cannot handle hyperthreads to say the least. If you have hyperthreads on your platform you should really use this feature in erlang.
I'm purely guessing here, but I'd imagine that there's a small number of threads, which pick processes from a common process pool for execution. Once a process hits a blocking operation, the thread executing it puts it aside and picks another. When a process being executed causes another process to become unblocked, that newly unblocked process gets placed into the pool. I suppose a thread might also stop execution of a process even when it's not blocked at certain points to serve other processes.
I would like to add some input to what was described in the accepted answer.
Erlang Scheduler is the essential part of the Erlang Runtime System and provides its own abstraction and implementation of the conception of lightweight processes atop the OS threads.
Each Scheduler runs within a single OS thread. Normally, there are as many schedulers as CPU (cores) are on he hardware (it is configurable though and naturally does not bring much value when number of schedulers exceeds those of hardware cores). The system might also be configured that scheduler will not jump between OS threads.
Now, when the Erlang process is being created it is entirely the responsibility of the ERTS and Scheduler to manage life cycle and resources consumption as well as its memory footprint etc.
One of the core implementation details is that each process has a time budget of 2000 reductions available when the Scheduler picks up that process from the run queue. Each progress in the system (even I/O) is guaranteed to have a reductions budget. That is what actually makes ERTS a system with preemptive multitasking.
I would recommend a great blog post on that topic by Jesper Louis Andersen http://jlouisramblings.blogspot.com/2013/01/how-erlang-does-scheduling.html
As the short answer: Erlang processes are not OS threads and do not map to them directly. Erlang Schedulers are what runs on the OS threads and provide smart implementation of more finely grained Erlang processes hiding those details behind programmer's eyes.

Threads & Processes Vs MultiThreading & Multi-Core/MultiProcessor : How they are mapped?

I was very confused but the following thread cleared my doubts:
Multiprocessing, Multithreading,HyperThreading, Multi-core
But it addresses the queries from the hardware point of view. I want to know how these hardware features are mapped to software?
One thing that is obvious is that there is no difference between MultiProcessor(=Mutlicpu) and MultiCore other than that in multicore all cpus reside on one chip(die) where as in Multiprocessor all cpus are on their own chips & connected together.
So, mutlicore/multiprocessor systems are capable of executing multiple processes (firefox,mediaplayer,googletalk) at the "sametime" (unlike context switching these processes on a single processor system) Right?
If it correct. I'm clear so far. But the confusion arises when multithreading comes into picture.
MultiThreading "is for" parallel processing. right?
What are elements that are involved in multithreading inside cpu? diagram? For me to exploit the power of parallel processing of two independent tasks, what should be the requriements of CPU?
When people say context switching of threads. I don't really get it. because if its context switching of threads then its not parallel processing. the threads must be executed "scrictly simultaneously". right?
My notion of multithreading is that:
Considering a system with single cpu. when process is context switched to firefox. (suppose) each tab of firefox is a thread and all the threads are executing strictly at the same time. Not like one thread has executed for sometime then again another thread has taken until the context switch time is arrived.
What happens if I run a multithreaded software on a processor which can't handle threads? I mean how does the cpu handle such software?
If everything is good so far, now question is HOW MANY THREADS? It must be limited by hardware, I guess? If hardware can support only 2 threads and I start 10 threads in my process. How would cpu handle it? Pros/Cons? From software engineering point of view, while developing a software that will be used by the users in wide variety of systems, Then how would I decide should I go for multithreading? if so, how many threads?
First, try to understand the concept of 'process' and 'thread'. A thread is a basic unit for execution: a thread is scheduled by operating system and executed by CPU. A process is a sort of container that holds multiple threads.
Yes, either multi-processing or multi-threading is for parallel processing. More precisely, to exploit thread-level parallelism.
Okay, multi-threading could mean hardware multi-threading (one example is HyperThreading). But, I assume that you just say multithreading in software. In this sense, CPU should support context switching.
Context switching is needed to implement multi-tasking even in a physically single core by time division.
Say there are two physical cores and four very busy threads. In this case, two threads are just waiting until they will get the chance to use CPU. Read some articles related to preemptive OS scheduling.
The number of thread that can physically run in concurrent is just identical to # of logical processors. You are asking a general thread scheduling problem in OS literature such as round-robin..
I strongly suggest you to study basics of operating system first. Then move on multithreading issues. It seems like you're still unclear for the key concepts such as context switching and scheduling. It will take a couple of month, but if you really want to be an expert in computer software, then you should know such very basic concepts. Please take whatever OS books and lecture slides.
Threads running on the same core are not technically parallel. They only appear to be executed in parallel, as the CPU switches between them very fast (for us, humans). This switch is what is called context switch.
Now, threads executing on different cores are executed in parallel.
Most modern CPUs have a number of cores, however, most modern OSes (windows, linux and friends) usually execute much larger number of threads, which still causes context switches.
Even if no user program is executed, still OS itself performs context switches for maintanance work.
This should answer 1-3.
About 4: basically, every processor can work with threads. it is much more a characteristic of operating system. Thread is basically: memory (optional), stack and registers, once those are replaced you are in another thread.
5: the number of threads is pretty high and is limited by OS. Usually it is higher than regular programmer can successfully handle :)
The number of threads is dictated by your program:
is it IO bound?
can the task be divided into a number of smaller tasks?
how small is the task? the task can be too small to make it worth to spawn threads at all.
synchronization: if extensive synhronization is required, the penalty might be too heavy and you should reduce the number of threads.
Multiple threads are separate 'chains' of commands within one process. From CPU point of view threads are more or less like processes. Each thread has its own set of registers and its own stack.
The reason why you can have more threads than CPUs is that most threads don't need CPU all the time. Thread can be waiting for user input, downloading something from the web or writing to disk. While it is doing that, it does not need CPU, so CPU is free to execute other threads.
In your example, each tab of Firefox probably can even have several threads. Or they can share some threads. You need one for downloading, one for rendering, one for message loop (user input), and perhaps one to run Javascript. You cannot easily combine them because while you download you still need to react to user's input. However, download thread is sleeping most of the time, and even when it's downloading it needs CPU only occasionally, and message loop thread only wakes up when you press a button.
If you go to task manager you'll see that despite all these threads your CPU use is still quite low.
Of course if all your threads do some number-crunching tasks, then you shouldn't create too many of them as you get no performance benefit (though there may be architectural benefits!).
However, if they are mainly I/O bound then create as many threads as your architecture dictates. It's hard to give advice without knowing your particular task.
Broadly speaking, yeah, but "parallel" can mean different things.
It depends what tasks you want to run in parallel.
Not necessarily. Some (indeed most) threads spend a lot of time doing nothing. Might as well switch away from them to a thread that wants to do something.
The OS handles thread switching. It will delegate to different cores if it wants to. If there's only one core it'll divide time between the different threads and processes.
The number of threads is limited by software and hardware. Threads consume processor and memory in varying degrees depending on what they're doing. The thread management software may impose its own limits as well.
The key thing to remember is the separation between logical/virtual parallelism and real/hardware parallelism. With your average OS, a system call is performed to spawn a new thread. What actually happens (whether it is mapped to a different core, a different hardware thread on the same core, or queued into the pool of software threads) is up to the OS.
Parallel processing uses all the methods not just multi-threading.
Generally speaking, if you want to have real parallel processing, you need to perform it in hardware. Take the example of the Niagara, it has up to 8-cores each capable of executing 4-threads in hardware.
Context switching is needed when there are more threads than is capable of being executed in parallel in hardware. Even then, when executed in series (switching between one thread to the next), they are considered concurrent because there is no guarantee on the order of switching. So, it may go T0, T1, T2, T1, T3, T0, T2 and so on. For all intents and purposes, the threads are parallel.
Time slicing.
That would be up to the OS.
Multithreading is the execution of more than one thread at a time. It can happen both on single core processors and the multicore processor systems. For single processor systems, context switching effects it. Look!Context switching in this computational environment refers to time slicing by the operating system. Therefore do not get confused. The operating system is the one that controls the execution of other programs. It allows one program to execute in the CPU at a time. But the frequency at which the threads are switched in and out of the CPU determines the transparency of parallelism exhibited by the system.
For multicore environment,multithreading occurs when each core executes a thread.Though,in multicore again,context switching can occur in the individual cores.
I think answers so far are pretty much to the point and give you a good basic context. In essence, say you have quad core processor, but each core is capable of executing 2 simultaneous threads.
Note, that there is only slight (or no) increase of speed if you are running 2 simultaneous threads on 1 core versus you run 1st thread and then 2nd thread vertically. However, each physical core adds speed to your general workflow.
Now, say you have a process running on your OS that has multiple threads (i.e. needs to run multiple things in "parallel") and has some kind of stack of tasks in a queue (or some other system with priority rules). Then software sends tasks to a queue and your processor attempts to execute them as fast as it can. Now you have 2 cases:
If a software supports multiprocessing, then tasks will be sent to any available processor (that is not doing anything or simply finished doing some other job and job send from your software is 1st in a queue).
If your software does not support multiprocessing, then all of your jobs will be done in a similar manner, but only by one of your cores.
I suggest reading Wikipedia page on thread. Very first picture there already gives you a nice insight. :)

How, if at all, do Erlang Processes map to Kernel Threads?

Erlang is known for being able to support MANY lightweight processes; it can do this because these are not processes in the traditional sense, or even threads like in P-threads, but threads entirely in user space.
This is well and good (fantastic actually). But how then are Erlang threads executed in parallel in a multicore/multiprocessor environment? Surely they have to somehow be mapped to kernel threads in order to be executed on separate cores?
Assuming that that's the case, how is this done? Are many lightweight processes mapped to a single kernel thread?
Or is there another way around this problem?
Answer depends on the VM which is used:
1) non-SMP: There is one scheduler (OS thread), which executes all Erlang processes, taken from the pool of runnable processes (i.e. those who are not blocked by e.g. receive)
2) SMP: There are K schedulers (OS threads, K is usually a number of CPU cores), which executes Erlang processes from the shared process queue. It is a simple FIFO queue (with locks to allow simultaneous access from multiple OS threads).
3) SMP in R13B and newer: There will be K schedulers (as before) which executes Erlang processes from multiple process queues. Each scheduler has it's own queue, so process migration logic from one scheduler to another will be added. This solution will improve performance by avoiding excessive locking in shared process queue.
For more information see this document prepared by Kenneth Lundin, Ericsson AB, for Erlang User Conference, Stockholm, November 13, 2008.
I want to ammend previous answers.
Erlang, or rather the Erlang runtime system (erts), defaults the number of schedulers (OS threads) and the number of runqueues to number of processing elements on your platform. That is processors cores or hardware threads. You can change these settings in runtime using:
erlang:system_flag(schedulers_online, NP) -> PrevNP
The Erlang processes does not have any affinity to any schedulers yet. The logic balancing the processes between the schedulers follows two rules. 1) A starving scheduler will steal work from another scheduler. 2) Migration paths are setup to push processes from schedulers with lots of processes to schedulers with less work. This is done to assure fairness in reduction count (execution time) for each process.
Schedulers however can be locked to specific processing elements. This not done by default. To let erts do the scheduler->core affinity use:
erlang:system_flag(scheduler_bind_type, default_bind) -> PrevBind
Several other bind types can be found in the documentation. Using affinity can greatly improve performance in heavy load situations! Especially in high lock contention situations. Also, the linux kernel cannot handle hyperthreads to say the least. If you have hyperthreads on your platform you should really use this feature in erlang.
I'm purely guessing here, but I'd imagine that there's a small number of threads, which pick processes from a common process pool for execution. Once a process hits a blocking operation, the thread executing it puts it aside and picks another. When a process being executed causes another process to become unblocked, that newly unblocked process gets placed into the pool. I suppose a thread might also stop execution of a process even when it's not blocked at certain points to serve other processes.
I would like to add some input to what was described in the accepted answer.
Erlang Scheduler is the essential part of the Erlang Runtime System and provides its own abstraction and implementation of the conception of lightweight processes atop the OS threads.
Each Scheduler runs within a single OS thread. Normally, there are as many schedulers as CPU (cores) are on he hardware (it is configurable though and naturally does not bring much value when number of schedulers exceeds those of hardware cores). The system might also be configured that scheduler will not jump between OS threads.
Now, when the Erlang process is being created it is entirely the responsibility of the ERTS and Scheduler to manage life cycle and resources consumption as well as its memory footprint etc.
One of the core implementation details is that each process has a time budget of 2000 reductions available when the Scheduler picks up that process from the run queue. Each progress in the system (even I/O) is guaranteed to have a reductions budget. That is what actually makes ERTS a system with preemptive multitasking.
I would recommend a great blog post on that topic by Jesper Louis Andersen http://jlouisramblings.blogspot.com/2013/01/how-erlang-does-scheduling.html
As the short answer: Erlang processes are not OS threads and do not map to them directly. Erlang Schedulers are what runs on the OS threads and provide smart implementation of more finely grained Erlang processes hiding those details behind programmer's eyes.

Resources