Linux: Processes and Threads in a Multi-core CPU

Is it true that threads, compared to processes, are less likely to benefit from a multi-core processor? In other words, would the kernel make the decision of executing threads on a single core rather than on multiple cores?
I'm talking about threads belonging to the same process.

I don't know how the (various) Linux schedulers handle this, but inter-thread communication gets more expensive when threads are running on different cores.
So the scheduler may decide to run the threads of a process on the same CPU if there are other processes needing CPU time.
E.g., with a dual-core CPU, if there are two processes with two threads each and all of them use all the CPU time they get, it is better to run the two threads of the first process on the first core and the two threads of the other process on the second core.

That's news to me. Linux in particular makes little distinction between threads and processes. They are really just processes that share their address-space.
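A rough way to see the "processes that share their address space" point from user space is sketched below. It assumes a POSIX system (it uses `os.fork`), and `counter`, `bump`, `bump_in_thread` and `bump_in_child_process` are names made up for this illustration. A thread's write to `counter` is visible to the caller, while a forked child only changes its own copy.

```python
import os
import threading

# Illustrative shared state; the name is hypothetical.
counter = {"value": 0}


def bump():
    counter["value"] += 1


def bump_in_thread():
    """Run bump() in a thread: same address space, so the change is visible."""
    t = threading.Thread(target=bump)
    t.start()
    t.join()
    return counter["value"]


def bump_in_child_process():
    """Run bump() in a forked child: it modifies a private copy (POSIX only)."""
    pid = os.fork()
    if pid == 0:
        bump()          # changes only the child's copy of counter
        os._exit(0)     # leave the child immediately, skipping cleanup handlers
    os.waitpid(pid, 0)  # wait for the child to finish
    return counter["value"]
```

In a fresh interpreter, calling `bump_in_thread()` and then `bump_in_child_process()` should return 1 both times: the thread's increment sticks, the child's does not.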

Multiple single-threaded processes are more expensive to the system than a single multi-threaded one, but they will benefit from a multi-core CPU with the same efficiency. Plus, inter-thread communication is much cheaper than inter-process communication. If these threads really form a single application, I vote for multithreading.

Shared-memory multithreading imposes huge complexity costs on everything from your tool-chain, to development, to debugging, reasoning, and testing your code. NEVER use shared-memory multithreading where you can reasonably use a multi-process design.
@Marcelo is right: any decent OS will treat threads and processes very similarly. Some CPU affinity for threads may reduce the multi-processor usage of a multi-threaded process, but you should see that with any two processes that share a common .text segment as well.
Pick threads vs. processes based on complexity and architectural design constraints, speed will almost never come into it.

It actually all depends on the scheduler, type of multiprocessing, and current running environment.
Assume nothing, test, test, test!
If you're the only multi-threaded process on the system, multi-threading is generally a good idea.
However, from the perspective of the ease of development, sometimes you want separate address spaces and shared data, especially in NUMA systems.
One thing for sure: If it's a 'HyperThreaded' system, threads are much more efficient by virtue of close memory sharing.
If it is a regular multi-core system, it should be similar.
If it is a NUMA system, you're better off keeping data shared and code separate. Again, it's all architecture dependent, and it doesn't matter performance-wise unless you're in the HPC business.
If you are in the HPC (supercomputing) business, TEST! It's all machine dependent (and the benefits are 10-25% on average, which matters if you're talking days of difference).

Whereas Windows uses fibres and threads I sometimes think Linux uses processes and twine.
I've found that in writing multi-threaded processes you really have to be rigorous, pedantic, disciplined and bloody-minded in designing threaded processes in order for them to achieve a balance of benefit in using whatever number of cores are available on the machine that the process is to run on.
Is it true, on Linux, that threads, compared to processes, are less likely to benefit from a multi-core processor? No one knows.


Are Tcl threads multi process/multi core

I'm new to using threads in Tcl but thought it was a nice way to solve a problem I'm having.
I was trying to read through the Tcl thread documentation but I can't quite figure out whether Tcl threads are spread across multiple CPU cores or are all kept within the CPU core on which the master process was started.
Tcl's threads are threads as supported by the operating system's standard libraries (e.g., they're normal POSIX threads on Linux and OSX), and so are entirely capable of running over as many cores as the OS allows.
Tcl takes care to limit the use of locks in its implementation as much as possible, so as to make multi-core operation as efficient as possible; this came from experience supporting high-performance application servers in the 1990s, where it turned out that reducing the sharing of resources was a big win as hardware scaled up the number of cores.
It also means that you've got a non-shared memory model based on structured message passing; it scales well, but it was very different to what most programmers knew at the time. It's a little bit more mainstream now because shared-memory parallelism remains annoyingly troublesome on modern hardware.

How are multiple CPU cores used by the OS

There are a lot of articles that discuss the multi-core myth: that, in order to really benefit from multiple cores, one needs to write parallel algorithms. Many of them mention Amdahl's law.
Let's assume for simplicity that we have a desktop computer with a 4-core commodity CPU. And assume that the goal is to improve our application's performance, as well as overall system performance.
I wonder how CPU cores are used to perform tasks.
Whether threads from a single process are allocated all cores
Or threads from different processes are scheduled to run on different cores.
If the latter is the case, then why is the myth even discussed? Won't multitasking OSes always benefit from multi-core CPUs, even if all the processes are single threaded? Are threads from the same process more likely to be scheduled at the same time on multiple cores?
What are some factors that matter? CPU cache maybe? Some application related maybe? Why?
Why would you ever want to use parallel libraries/algorithms? After all, CPU resources are shared between all running processes and there are always enough of them.
Is there an "active process" notion? i.e. process that gets most attention from the scheduler. If so, then how much more attention does this process usually get?
Whether threads from a single process are allocated all cores
Or threads from different processes are scheduled to run on different cores.
Yes, that too.
If the latter is the case, then why is the myth even discussed? Won't multitasking OSes always benefit from multi-core CPUs, even if all the processes are single threaded?
To some extent, yes. But if that process is doing a lot of computation and the only one we care about at some particular time, the benefit will be pretty low.
On the other hand, it also means the process won't be as likely to be interrupted just because the OS has to do something like handle a disk interrupt, an arriving network packet, or something like that. Interrupting a process to handle some hardware task not only reduces the CPU time the process gets, it also pollutes the CPU caches, causing the process to run more slowly when it resumes. So multi-core CPUs can allow a single-threaded process to command a core for a higher percentage of the time and in longer bursts.
Are threads from the same process more likely to be scheduled at the same time on multiple cores?
Typically no. Why would you want to do that? That would tend to degrade overall system performance as threads from the same process are more likely to step on each other's toes. You want the system to get other processes' work done efficiently so you get the CPU back.
Is there an "active process" notion?
To some extent. Windows has precisely such a notion -- a "foreground process". Most OSes don't. But they do have a "dynamic priority boost" feature. Basically, if a process is sitting around doing nothing and then needs to do something, it is given some priority as a "reward". This allows a process that sits around waiting for work to be done to get its work done quickly and makes the system feel more interactive and responsive. It often makes little sense on servers, but it's helpful on desktops. Whether this is implemented on threads individually or on all the threads of a process as a group is implementation specific.
If you run separate processes or threads that don't need to interact with each other, then it will be far better to have 4 cores rather than just 1.
As soon as the processes or threads need to share some data, you will get the overhead of serializing access to the shared data.
A lot depends on how well an application is written to run on a multi-core CPU. In the worst case, running an application on a 4-core CPU may even be slower than running it on a single-core CPU; more likely, the increase in performance will be far less than 100%.
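Amdahl's law, mentioned in the question, puts a number on this: if a fraction p of the work can run in parallel on n cores, the best possible speedup is 1 / ((1 - p) + p / n). A small sketch follows; the function name `amdahl_speedup` is just for illustration.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of the work can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Half of the work parallelizable, 4 cores: 1 / (0.5 + 0.125) = 1.6x at best.
print(amdahl_speedup(0.5, 4))
```

So on the 4-core desktop from the question, a program that is only half parallelizable cannot get more than about 1.6x faster, however well the OS schedules it.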

Technically, why are processes in Erlang more efficient than OS threads?

Erlang's Characteristics
From Erlang Programming (2009):
Erlang concurrency is fast and scalable. Its processes are lightweight in that the Erlang virtual machine does not create an OS thread for every created process. They are created, scheduled, and handled in the VM, independent of underlying operating system. As a result, process creation time is of the order of microseconds and independent of the number of concurrently existing processes. Compare this with Java and C#, where for every process an underlying OS thread is created: you will get some very competitive comparisons, with Erlang greatly outperforming both languages.
From Concurrency oriented programming in Erlang (pdf) (slides) (2003):
We observe that the time taken to create an Erlang process is constant 1µs up to 2,500 processes; thereafter it increases to about 3µs for up to 30,000 processes. The performance of Java and C# is shown at the top of the figure. For a small number of processes it takes about 300µs to create a process. Creating more than two thousand processes is impossible.
We see that for up to 30,000 processes the time to send a message between two Erlang processes is about 0.8µs. For C# it takes about 50µs per message, up to the maximum number of processes (which was about 1800 processes). Java was even worse: for up to 100 processes it took about 50µs per message; thereafter it increased rapidly to 10ms per message when there were about 1000 Java processes.
My thoughts
I don't fully understand technically why Erlang processes are so much more efficient in spawning new processes and have much smaller memory footprints per process. Both the OS and Erlang VM have to do scheduling, context switches, and keep track of the values in the registers and so on...
Simply why aren't OS threads implemented in the same way as processes in Erlang? Do they have to support something more? And why do they need a bigger memory footprint? And why do they have slower spawning and communication?
Technically, why are processes in Erlang more efficient than OS threads when it comes to spawning and communication? And why can't threads in the OS be implemented and managed in the same efficient way? And why do OS threads have a bigger memory footprint, plus slower spawning and communication?
More reading
Inside the Erlang VM with focus on SMP (2008)
Concurrency in Java and in Erlang (pdf) (2004)
Performance Measurements of Threads in Java and Processes in Erlang (1998)
There are several contributing factors:
Erlang processes are not OS processes. They are implemented by the Erlang VM using a lightweight cooperative threading model (preemptive at the Erlang level, but under the control of a cooperatively scheduled runtime). This means that it is much cheaper to switch context, because they only switch at known, controlled points and therefore don't have to save the entire CPU state (normal, SSE and FPU registers, address space mapping, etc.).
Erlang processes use dynamically allocated stacks, which start very small and grow as necessary. This permits the spawning of many thousands — even millions — of Erlang processes without sucking up all available RAM.
Erlang used to be single-threaded, meaning that there was no requirement to ensure thread-safety between processes. It now supports SMP, but the interaction between Erlang processes on the same scheduler/core is still very lightweight (there are separate run queues per core).
After some more research I found a presentation by Joe Armstrong.
From Erlang - software for a concurrent world (presentation) (at 13 min):
[Erlang] is a concurrent language – by that I mean that threads are part of the programming language, they do not belong to the operating system. That's really what's wrong with programming languages like Java and C++. Their threads aren't in the programming language, threads are something in the operating system – and they inherit all the problems that they have in the operating system. One of the problems is granularity of the memory management system. The memory management in the operating system protects whole pages of memory, so the smallest size that a thread can be is the smallest size of a page. That's actually too big.
If you add more memory to your machine – you have the same number of bits that protects the memory so the granularity of the page tables goes up – you end up using say 64kB for a process you know running in a few hundred bytes.
I think it answers if not all, at least a few of my questions
I've implemented coroutines in assembler, and measured performance.
Switching between coroutines, a.k.a. Erlang processes, takes about 16 instructions and 20 nanoseconds on a modern processor. Also, you often know the process you are switching to (example: a process receiving a message in its queue can be implemented as straight hand-off from the calling process to the receiving process) so the scheduler doesn't come into play, making it an O(1) operation.
To switch OS threads, it takes about 500-1000 nanoseconds, because you're calling down to the kernel. The OS thread scheduler might run in O(log(n)) or O(log(log(n))) time, which will start to be noticeable if you have tens of thousands, or even millions of threads.
Therefore, Erlang processes are faster and scale better because both the fundamental operation of switching is faster, and the scheduler runs less often.
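To make the idea of switching only at known, controlled points concrete, here is a minimal sketch using Python generators as stand-ins for lightweight processes. This is not how the Erlang VM is implemented; `worker` and `run_cooperatively` are hypothetical names, and everything runs inside a single OS thread.

```python
from collections import deque


def worker(pid, steps, log):
    """A lightweight 'process': it can only be switched out at the yield."""
    for step in range(steps):
        log.append((pid, step))
        yield  # the single, known switch point


def run_cooperatively(n_procs, steps):
    """Round-robin over generator-based processes in one OS thread."""
    log = []
    ready = deque(worker(pid, steps, log) for pid in range(n_procs))
    switches = 0
    while ready:
        proc = ready.popleft()
        try:
            next(proc)          # resume the process until its next yield
        except StopIteration:
            continue            # the process finished; drop it
        ready.append(proc)      # still alive, so put it back in the queue
        switches += 1
    return log, switches
```

`run_cooperatively(2, 2)` interleaves the two workers as `(0, 0), (1, 0), (0, 1), (1, 1)`. Creating tens of thousands of such workers is cheap because each one is only a small heap object, not a kernel thread with its own stack.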
Erlang processes correspond (approximately) to green threads in other languages; there's no OS-enforced separation between the processes. (There may well be language-enforced separation, but that's a lesser protection despite Erlang doing a better job than most.) Because they're so much lighter-weight, they can be used far more extensively.
OS threads on the other hand are able to be simply scheduled on different CPU cores, and are (mostly) able to support independent CPU-bound processing. OS processes are like OS threads, but with much stronger OS-enforced separation. The price of these capabilities is that OS threads and (even more so) processes are more expensive.
Another way to understand the difference is this. Supposing you were going to write an implementation of Erlang on top of the JVM (not a particularly crazy suggestion) then you'd make each Erlang process be an object with some state. You'd then have a pool of Thread instances (typically sized according to the number of cores in your host system; that's a tunable parameter in real Erlang runtimes BTW) which run the Erlang processes. In turn, that will distribute the work that is to be done across the real system resources available. It's a pretty neat way of doing things, but relies utterly on the fact that each individual Erlang process doesn't do very much. That's OK of course; Erlang is structured to not require those individual processes to be heavyweight since it is the overall ensemble of them which execute the program.
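The same thought experiment can be sketched in Python instead of on the JVM. This is only an illustration under my own assumptions (the class `LightProcess`, its mailbox design and `run_demo` are invented here, not taken from any Erlang runtime): each lightweight process is an object with state and a mailbox, and a pool of OS threads sized to the core count runs whichever processes have pending messages.

```python
import os
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class LightProcess:
    """An object with private state and a mailbox; it owns no OS thread."""

    def __init__(self, pool, handler, state):
        self._pool = pool
        self._handler = handler        # handler(state, message) -> new state
        self.state = state
        self._mailbox = deque()
        self._lock = threading.Lock()
        self._scheduled = False        # True while a pool worker owns us

    def send(self, message):
        with self._lock:
            self._mailbox.append(message)
            if self._scheduled:
                return                 # a worker will pick the message up
            self._scheduled = True
        self._pool.submit(self._drain)

    def _drain(self):
        # Only one worker at a time runs a given process, so the state
        # itself needs no locking.
        while True:
            with self._lock:
                if not self._mailbox:
                    self._scheduled = False
                    return
                message = self._mailbox.popleft()
            self.state = self._handler(self.state, message)


def run_demo(n_processes=1000, messages_each=10):
    workers = os.cpu_count() or 1      # pool sized to the number of cores
    with ThreadPoolExecutor(max_workers=workers) as pool:
        procs = [LightProcess(pool, lambda total, msg: total + msg, 0)
                 for _ in range(n_processes)]
        for _ in range(messages_each):
            for proc in procs:
                proc.send(1)
    # Leaving the with-block waits for all submitted work to finish.
    return [proc.state for proc in procs]
```

With the defaults, a thousand "processes" are served by a handful of OS threads, and every process ends up with a state equal to the number of messages it received.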
In many ways, the real problem is one of terminology. The things that Erlang calls processes (and which correspond strongly to the same concept in CSP, CCS, and particularly the π-calculus) are simply not the same as the things that languages with a C heritage (including C++, Java, C#, and many others) call a process or a thread. There are some similarities (all involve some notion of concurrent execution) but there's definitely no equivalence. So be careful when someone says “process” to you; they might understand it to mean something utterly different…
I think Jonas wanted some numbers on comparing OS threads to Erlang processes. The author of Programming Erlang, Joe Armstrong, a while back tested the scalability of the spawning of Erlang processes to OS threads. He wrote a simple web server in Erlang and tested it against multi-threaded Apache (since Apache uses OS threads). There's an old website with the data dating back to 1998. I've managed only to find that site exactly once. So I can't supply a link. But the information is out there. The main point of the study showed that Apache maxed out just under 8K processes, while his hand written Erlang server handled 10K+ processes.
Because the Erlang interpreter only has to worry about itself, whereas the OS has many other things to worry about.
One of the reasons is that an Erlang process is created not in the OS but in the EVM (Erlang virtual machine), so the cost is smaller.

Developing Kernels to support Multiple CPUs

I am looking to get into operating system kernel development and figured my contribution would be to extend the SANOS operating system to support multi-core machines. I have been reading books on operating systems (Tanenbaum) as well as studying how BSD and Linux have tackled this challenge, but I am still stuck on several concepts.
Does SANOS need to have more sophisticated scheduling algorithms when it runs on multiple CPUs or will what is currently in place work fine?
I know that it is a good idea for threads to have affinity to a core that they were started on, but is this handled via scheduling or by changing the implementation of how threads are created?
What would need to be considered so that SANOS could run on a machine with hundreds of cores? From what I can tell, BSD and Linux at best only support a maximum of a dozen cores.
Your reading material is good, so no problems there. Also take a peek at the downloadable CS lectures on operating system design from Stanford.
The scheduling algorithm may need to be more sophisticated. This depends on the types of applications running and how greedy they are. Do they yield by themselves, or are they forced to? That kind of thing. This is more a question of what your processes want, or expect. An RTOS will have more complex scheduling than a desktop.
Threads should have an affinity to one core, because 2 threads in one process can execute in parallel ... but not at the same real time on the same core. Putting them on different cores allows them to really run in parallel. Also, caching can be optimized for core affinity. This is really a mix of your thread implementation and scheduler. The scheduler may want to ensure threads are started at the same time on different cores, rather than ad hoc, to reduce the amount of time threads wait on each other. If your thread library is in user space, maybe it assigns the core, or lets the scheduler decide based on capacity or recent deaths.
Scalability is often a kernel limit (which can be arbitrary). In Linux, if I recall, the limits are due to static sizing of the arrays that hold CPU information structs in the scheduler. Hence they are a fixed size. This can be changed by recompiling the kernel. Most good scheduling algorithms will support a very large number of cores. As your core or processor count gets higher, you need to be careful that you don't fragment a process's execution too much. If a program has 2 threads, try to schedule them in close time proximity, because causation may exist (through shared data) between them.
You also need to decide how your threads are implemented, and how a process is represented (be it heavy or lightweight) in the kernel. Are threads kernel managed? User-space managed? These things all have an impact on scheduler design. Look at how POSIX threads are implemented in various operating systems. There is just so much for you to think about :)
In short, there are not really any clear-cut answers as to where the logic does, or should, reside. It is all down to design, application expectations, time constraints (on the programs) and so on.
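As a toy illustration of handling affinity on the scheduler side, here is a sketch with one run queue per core and a "soft" affinity rule. Everything in it is an assumption of mine rather than how SANOS, BSD or Linux actually do it: the class name `AffinityScheduler`, the `max_imbalance` threshold and the migration rule are all hypothetical. A thread goes back to the core it last ran on unless that core's queue is much longer than the shortest one.

```python
from collections import deque


class AffinityScheduler:
    """Toy model: per-core run queues with soft affinity (hypothetical policy)."""

    def __init__(self, n_cores, max_imbalance=2):
        self.queues = [deque() for _ in range(n_cores)]
        self.last_core = {}            # thread id -> core it was last queued on
        self.max_imbalance = max_imbalance

    def enqueue(self, thread_id):
        # Find the least loaded core (lowest index wins ties).
        shortest = min(range(len(self.queues)),
                       key=lambda core: len(self.queues[core]))
        # Prefer the previous core so its caches may still be warm.
        preferred = self.last_core.get(thread_id, shortest)
        backlog = len(self.queues[preferred]) - len(self.queues[shortest])
        if backlog > self.max_imbalance:
            preferred = shortest       # too unbalanced: migrate the thread
        self.queues[preferred].append(thread_id)
        self.last_core[thread_id] = preferred
        return preferred

    def pick(self, core):
        """Next thread for this core to run, or None if its queue is empty."""
        return self.queues[core].popleft() if self.queues[core] else None
```

The trade-off this is meant to show is the one described above: staying on the same core helps caching, but the scheduler still has to migrate threads when one core gets overloaded.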
Hope this helps, I am not an expert here however.

Threads & Processes Vs MultiThreading & Multi-Core/MultiProcessor : How they are mapped?

I was very confused but the following thread cleared my doubts:
Multiprocessing, Multithreading,HyperThreading, Multi-core
But it addresses the queries from the hardware point of view. I want to know how these hardware features are mapped to software?
One thing that is obvious is that there is no difference between multiprocessor (= multi-CPU) and multi-core systems, other than that in a multi-core system all CPUs reside on one chip (die), whereas in a multiprocessor system each CPU is on its own chip and they are connected together.
So, multi-core/multiprocessor systems are capable of executing multiple processes (firefox, mediaplayer, googletalk) at the "same time" (unlike context switching these processes on a single-processor system). Right?
If that is correct, I'm clear so far. But the confusion arises when multithreading comes into the picture.
Multithreading "is for" parallel processing. Right?
What are the elements involved in multithreading inside the CPU? Is there a diagram? For me to exploit the power of parallel processing of two independent tasks, what should the requirements of the CPU be?
When people talk about context switching of threads, I don't really get it, because if it's context switching of threads then it's not parallel processing. The threads must be executed "strictly simultaneously". Right?
My notion of multithreading is that:
Consider a system with a single CPU. When the CPU is context switched to firefox, (suppose) each tab of firefox is a thread and all the threads are executing strictly at the same time, not with one thread executing for some time and then another thread taking over until the context switch time arrives.
What happens if I run multithreaded software on a processor which can't handle threads? I mean, how does the CPU handle such software?
If everything is good so far, now the question is HOW MANY THREADS? It must be limited by hardware, I guess? If the hardware can support only 2 threads and I start 10 threads in my process, how would the CPU handle it? Pros/cons? From a software engineering point of view, while developing software that will be used by users on a wide variety of systems, how would I decide whether I should go for multithreading? If so, how many threads?
First, try to understand the concepts of 'process' and 'thread'. A thread is the basic unit of execution: a thread is scheduled by the operating system and executed by the CPU. A process is a sort of container that holds multiple threads.
Yes, both multi-processing and multi-threading are for parallel processing. More precisely, they exploit thread-level parallelism.
Okay, multi-threading could mean hardware multi-threading (one example is HyperThreading). But I assume that you just mean multithreading in software. In this sense, the CPU should support context switching.
Context switching is needed to implement multi-tasking, even on a physically single core, by time division.
Say there are two physical cores and four very busy threads. In this case, two threads are just waiting until they get the chance to use a CPU. Read some articles related to preemptive OS scheduling.
The number of threads that can physically run concurrently is identical to the number of logical processors. You are asking about a general thread scheduling problem from the OS literature, such as round-robin.
I strongly suggest that you study the basics of operating systems first, then move on to multithreading issues. It seems like you're still unclear on key concepts such as context switching and scheduling. It will take a couple of months, but if you really want to be an expert in computer software, then you should know such very basic concepts. Pick up any OS book and lecture slides.
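If you want to see that number on your own machine, a quick sketch is below. `os.cpu_count()` reports logical processors; `os.sched_getaffinity` only exists on some platforms (Linux among them), so the code checks for it first. The function names are made up for this example.

```python
import os


def logical_cpus():
    """Logical processors the OS reports (cores x hardware threads)."""
    return os.cpu_count() or 1


def usable_cpus():
    """CPUs this process may actually be scheduled on, where the OS says so."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))   # 0 means the current process
    return logical_cpus()


print("logical CPUs:", logical_cpus())
print("usable by this process:", usable_cpus())
```

Any runnable threads beyond that count simply wait their turn in the scheduler's queue.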
Threads running on the same core are not technically parallel. They only appear to be executed in parallel, as the CPU switches between them very fast (for us, humans). This switch is what is called context switch.
Now, threads executing on different cores are executed in parallel.
Most modern CPUs have a number of cores; however, most modern OSes (Windows, Linux and friends) usually execute a much larger number of threads, which still causes context switches.
Even if no user program is executing, the OS itself still performs context switches for maintenance work.
This should answer 1-3.
About 4: basically, every processor can work with threads; it is much more a characteristic of the operating system. A thread is basically memory (optional), a stack and registers; once those are replaced, you are in another thread.
5: the number of threads is pretty high and is limited by the OS. Usually it is higher than a regular programmer can successfully handle :)
The number of threads is dictated by your program:
is it IO bound?
can the task be divided into a number of smaller tasks?
how small is the task? The task can be too small to make it worth spawning threads at all.
synchronization: if extensive synchronization is required, the penalty might be too heavy and you should reduce the number of threads.
Multiple threads are separate 'chains' of commands within one process. From the CPU's point of view, threads are more or less like processes. Each thread has its own set of registers and its own stack.
The reason why you can have more threads than CPUs is that most threads don't need the CPU all the time. A thread can be waiting for user input, downloading something from the web or writing to disk. While it is doing that, it does not need the CPU, so the CPU is free to execute other threads.
In your example, each tab of Firefox probably can even have several threads. Or they can share some threads. You need one for downloading, one for rendering, one for message loop (user input), and perhaps one to run Javascript. You cannot easily combine them because while you download you still need to react to user's input. However, download thread is sleeping most of the time, and even when it's downloading it needs CPU only occasionally, and message loop thread only wakes up when you press a button.
If you go to task manager you'll see that despite all these threads your CPU use is still quite low.
Of course if all your threads do some number-crunching tasks, then you shouldn't create too many of them as you get no performance benefit (though there may be architectural benefits!).
However, if they are mainly I/O bound then create as many threads as your architecture dictates. It's hard to give advice without knowing your particular task.
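Here is a small sketch of why I/O-bound threads can outnumber cores. `time.sleep` stands in for a blocking download or disk write (that substitution, and the names `fake_io` and `run_io_tasks`, are my own assumptions for the demo). Eight tasks that each "wait" 0.2 seconds finish in roughly 0.2 seconds with eight threads instead of 1.6 seconds, because a sleeping thread needs no CPU.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(seconds):
    """Stand-in for blocking I/O: the thread waits and uses no CPU."""
    time.sleep(seconds)
    return seconds


def run_io_tasks(n_tasks, seconds, n_threads):
    """Run n_tasks waits on n_threads threads; return (count, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(fake_io, [seconds] * n_tasks))
    return len(results), time.perf_counter() - start


done, elapsed = run_io_tasks(n_tasks=8, seconds=0.2, n_threads=8)
print(f"{done} tasks finished in {elapsed:.2f}s")
```

Swap `fake_io` for a number-crunching function and the picture changes: the threads then compete for the cores, and extra threads beyond the core count stop helping.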
Broadly speaking, yeah, but "parallel" can mean different things.
It depends what tasks you want to run in parallel.
Not necessarily. Some (indeed most) threads spend a lot of time doing nothing. Might as well switch away from them to a thread that wants to do something.
The OS handles thread switching. It will delegate to different cores if it wants to. If there's only one core it'll divide time between the different threads and processes.
The number of threads is limited by software and hardware. Threads consume processor and memory in varying degrees depending on what they're doing. The thread management software may impose its own limits as well.
The key thing to remember is the separation between logical/virtual parallelism and real/hardware parallelism. With your average OS, a system call is performed to spawn a new thread. What actually happens (whether it is mapped to a different core, a different hardware thread on the same core, or queued into the pool of software threads) is up to the OS.
Parallel processing uses all the methods not just multi-threading.
Generally speaking, if you want to have real parallel processing, you need to perform it in hardware. Take the example of the Niagara: it has up to 8 cores, each capable of executing 4 threads in hardware.
Context switching is needed when there are more threads than can be executed in parallel in hardware. Even then, when executed in series (switching from one thread to the next), they are considered concurrent because there is no guarantee on the order of switching. So, it may go T0, T1, T2, T1, T3, T0, T2 and so on. For all intents and purposes, the threads are parallel.
Time slicing.
That would be up to the OS.
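Time slicing on a single core can be pictured as a round-robin queue. The sketch below is a simplified model, not what any particular OS does: each thread has some amount of work left, runs for one quantum, and goes to the back of the queue if it is not finished. The function name and the work amounts are made up for the example.

```python
from collections import deque


def round_robin(work, quantum=1):
    """Return the order in which threads get the single core."""
    ready = deque(work.items())        # (thread name, remaining work units)
    order = []
    while ready:
        name, remaining = ready.popleft()
        order.append(name)             # this thread runs for one quantum
        remaining -= quantum
        if remaining > 0:
            ready.append((name, remaining))   # not done: back of the queue
    return order


print(round_robin({"T0": 3, "T1": 1, "T2": 2}))
# → ['T0', 'T1', 'T2', 'T0', 'T2', 'T0']
```

A real scheduler also reacts to priorities, blocking I/O and interrupts, which is why the actual order is not predictable from the program's point of view.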
Multithreading is the execution of more than one thread at a time. It can happen both on single-core processors and on multi-core processor systems. On single-processor systems, it is achieved through context switching. Note that context switching in this environment refers to time slicing by the operating system, so do not get confused. The operating system is the one that controls the execution of other programs. It allows one program to execute on the CPU at a time, but the frequency at which the threads are switched in and out of the CPU determines how convincing the apparent parallelism of the system is.
In a multi-core environment, multithreading occurs when each core executes a thread. Though, again, in a multi-core system context switching can still occur on the individual cores.
I think answers so far are pretty much to the point and give you a good basic context. In essence, say you have quad core processor, but each core is capable of executing 2 simultaneous threads.
Note that there is only a slight (or no) increase in speed if you run 2 simultaneous threads on 1 core versus running the 1st thread and then the 2nd thread one after the other. However, each physical core adds speed to your general workflow.
Now, say you have a process running on your OS that has multiple threads (i.e. needs to run multiple things in "parallel") and has some kind of stack of tasks in a queue (or some other system with priority rules). The software sends tasks to the queue and your processor attempts to execute them as fast as it can. Now you have 2 cases:
If the software supports multiprocessing, then tasks will be sent to any available processor (one that is not doing anything, or has just finished some other job while the job sent from your software is first in the queue).
If your software does not support multiprocessing, then all of your jobs will be done in a similar manner, but only by one of your cores.
I suggest reading the Wikipedia page on threads. The very first picture there already gives you a nice insight. :)
