Technically, why are processes in Erlang more efficient than OS threads? - multithreading

Erlang's Characteristics
From Erlang Programming (2009):
Erlang concurrency is fast and scalable. Its processes are lightweight in that the Erlang virtual machine does not create an OS thread for every created process. They are created, scheduled, and handled in the VM, independent of underlying operating system. As a result, process creation time is of the order of microseconds and independent of the number of concurrently existing processes. Compare this with Java and C#, where for every process an underlying OS thread is created: you will get some very competitive comparisons, with Erlang greatly outperforming both languages.
From Concurrency oriented programming in Erlang (pdf) (slides) (2003):
We observe that the time taken to create an Erlang process is constant 1µs up to 2,500 processes; thereafter it increases to about 3µs for up to 30,000 processes. The performance of Java and C# is shown at the top of the figure. For a small number of processes it takes about 300µs to create a process. Creating more than two thousand processes is impossible.
We see that for up to 30,000 processes the time to send a message between two Erlang processes is about 0.8µs. For C# it takes about 50µs per message, up to the maximum number of processes (which was about 1800 processes). Java was even worse, for up to 100 process it took about 50µs per message thereafter it increased rapidly to 10ms per message when there were about 1000 Java processes.
My thoughts
I don't fully understand technically why Erlang processes are so much more efficient in spawning new processes and have much smaller memory footprints per process. Both the OS and Erlang VM have to do scheduling, context switches, and keep track of the values in the registers and so on...
Simply why aren't OS threads implemented in the same way as processes in Erlang? Do they have to support something more? And why do they need a bigger memory footprint? And why do they have slower spawning and communication?
Technically, why are processes in Erlang more efficient than OS threads when it comes to spawning and communication? And why can't threads in the OS be implemented and managed in the same efficient way? And why do OS threads have a bigger memory footprint, plus slower spawning and communication?
More reading
Inside the Erlang VM with focus on SMP (2008)
Concurrency in Java and in Erlang (pdf) (2004)
Performance Measurements of Threads in Java and Processes in Erlang (1998)

There are several contributing factors:
Erlang processes are not OS processes. They are implemented by the Erlang VM using a lightweight cooperative threading model (preemptive at the Erlang level, but under the control of a cooperatively scheduled runtime). This means that it is much cheaper to switch context, because they only switch at known, controlled points and therefore don't have to save the entire CPU state (normal, SSE and FPU registers, address space mapping, etc.).
Erlang processes use dynamically allocated stacks, which start very small and grow as necessary. This permits the spawning of many thousands — even millions — of Erlang processes without sucking up all available RAM.
Erlang used to be single-threaded, meaning that there was no requirement to ensure thread-safety between processes. It now supports SMP, but the interaction between Erlang processes on the same scheduler/core is still very lightweight (there are separate run queues per core).
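To make the first point concrete, here is a minimal sketch of cooperative scheduling in user space, written in Python rather than Erlang. It is not how the Erlang VM is implemented; the names `worker` and `run` are made up for illustration. Each task only gives up control at a known point (`yield`), so the "context" that has to be kept is just the generator's own frame, and the whole thing runs on one OS thread.

```python
from collections import deque

def worker(name, steps, log):
    """A toy 'process': does a little work, then yields at a known point."""
    for i in range(steps):
        log.append((name, i))
        yield  # controlled switch point: only this generator's frame is kept

def run(tasks):
    """Round-robin over a run queue until every task has finished."""
    run_queue = deque(tasks)
    while run_queue:
        task = run_queue.popleft()
        try:
            next(task)              # resume the task until its next yield
            run_queue.append(task)  # not finished: back of the queue
        except StopIteration:
            pass                    # finished: drop it

log = []
run([worker("a", 2, log), worker("b", 3, log)])
print(log)
```

The two tasks interleave without the OS scheduler being involved in the switches at all.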

After some more research I found a presentation by Joe Armstrong.
From Erlang - software for a concurrent world (presentation) (at 13 min):
[Erlang] is a concurrent language – by that I mean that threads are part of the programming language, they do not belong to the operating system. That's really what's wrong with programming languages like Java and C++. It's threads aren't in the programming language, threads are something in the operating system – and they inherit all the problems that they have in the operating system. One of the problems is granularity of the memory management system. The memory management in the operating system protects whole pages of memory, so the smallest size that a thread can be is the smallest size of a page. That's actually too big.
If you add more memory to your machine – you have the same number of bits that protects the memory so the granularity of the page tables goes up – you end up using say 64kB for a process you know running in a few hundred bytes.
I think it answers if not all, at least a few of my questions

I've implemented coroutines in assembler, and measured performance.
Switching between coroutines, a.k.a. Erlang processes, takes about 16 instructions and 20 nanoseconds on a modern processor. Also, you often know the process you are switching to (example: a process receiving a message in its queue can be implemented as straight hand-off from the calling process to the receiving process) so the scheduler doesn't come into play, making it an O(1) operation.
To switch OS threads, it takes about 500-1000 nanoseconds, because you're calling down to the kernel. The OS thread scheduler might run in O(log(n)) or O(log(log(n))) time, which will start to be noticeable if you have tens of thousands, or even millions of threads.
Therefore, Erlang processes are faster and scale better because both the fundamental operation of switching is faster, and the scheduler runs less often.
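If you want to see the shape of this difference yourself, the rough sketch below times user-space switches (resuming a generator) against hand-offs between two OS threads. It is Python, so interpreter overhead dominates and the absolute numbers will not match the assembler figures above; the function names are my own and only the relative cost is of interest.

```python
import threading
import time

def generator_switches(n):
    """Resume a generator n times; every switch stays in user space."""
    def ping():
        while True:
            yield
    g = ping()
    start = time.perf_counter()
    for _ in range(n):
        next(g)
    return time.perf_counter() - start

def thread_switches(n):
    """Hand control back and forth between two OS threads n times."""
    ping, pong = threading.Semaphore(0), threading.Semaphore(0)

    def partner():
        for _ in range(n):
            ping.acquire()   # wait for the main thread
            pong.release()   # hand control back

    t = threading.Thread(target=partner)
    t.start()
    start = time.perf_counter()
    for _ in range(n):
        ping.release()
        pong.acquire()
    elapsed = time.perf_counter() - start
    t.join()
    return elapsed

if __name__ == "__main__":
    n = 10_000
    print("generator:", generator_switches(n))
    print("threads:  ", thread_switches(n))
```

Results depend heavily on the machine, OS and Python version, so treat any single run as an illustration rather than a measurement.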

Erlang processes correspond (approximately) to green threads in other languages; there's no OS-enforced separation between the processes. (There may well be language-enforced separation, but that's a lesser protection despite Erlang doing a better job than most.) Because they're so much lighter-weight, they can be used far more extensively.
OS threads on the other hand are able to be simply scheduled on different CPU cores, and are (mostly) able to support independent CPU-bound processing. OS processes are like OS threads, but with much stronger OS-enforced separation. The price of these capabilities is that OS threads and (even more so) processes are more expensive.
Another way to understand the difference is this. Supposing you were going to write an implementation of Erlang on top of the JVM (not a particularly crazy suggestion) then you'd make each Erlang process be an object with some state. You'd then have a pool of Thread instances (typically sized according to the number of cores in your host system; that's a tunable parameter in real Erlang runtimes BTW) which run the Erlang processes. In turn, that will distribute the work that is to be done across the real system resources available. It's a pretty neat way of doing things, but relies utterly on the fact that each individual Erlang process doesn't do very much. That's OK of course; Erlang is structured to not require those individual processes to be heavyweight since it is the overall ensemble of them which execute the program.
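The thought experiment above can be sketched in a few lines. This is Python instead of the JVM, and `Proc` and `run_all` are hypothetical names, not anything from a real Erlang runtime: each "process" is just an object with private state and a mailbox, and a small pool of OS threads pulls runnable objects off a shared queue and lets each handle one message at a time.

```python
import queue
import threading

class Proc:
    """A toy 'Erlang process': an object with private state and a mailbox."""
    def __init__(self, pid):
        self.pid = pid
        self.mailbox = queue.SimpleQueue()
        self.total = 0  # only touched by whichever thread is running this Proc

    def step(self):
        """Handle one message; return False when the mailbox is empty."""
        try:
            msg = self.mailbox.get_nowait()
        except queue.Empty:
            return False
        self.total += msg
        return True

def run_all(procs, n_threads=4):
    """Run every Proc to completion on a fixed pool of OS threads."""
    runnable = queue.Queue()
    for p in procs:
        runnable.put(p)

    def scheduler():
        while True:
            try:
                p = runnable.get_nowait()
            except queue.Empty:
                return
            if p.step():
                runnable.put(p)  # may still have work: back on the run queue

    threads = [threading.Thread(target=scheduler) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

procs = [Proc(i) for i in range(1000)]
for p in procs:
    for msg in (1, 2, 3):
        p.mailbox.put(msg)
run_all(procs)
print(all(p.total == 6 for p in procs))
```

A thousand "processes" are served by four threads here, and a `Proc` is only ever on the run queue once at a time, so its state needs no lock of its own.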
In many ways, the real problem is one of terminology. The things that Erlang calls processes (and which correspond strongly to the same concept in CSP, CCS, and particularly the π-calculus) are simply not the same as the things that languages with a C heritage (including C++, Java, C#, and many others) call a process or a thread. There are some similarities (all involve some notion of concurrent execution) but there's definitely no equivalence. So be careful when someone says “process” to you; they might understand it to mean something utterly different…

I think Jonas wanted some numbers on comparing OS threads to Erlang processes. The author of Programming Erlang, Joe Armstrong, a while back tested the scalability of the spawning of Erlang processes to OS threads. He wrote a simple web server in Erlang and tested it against multi-threaded Apache (since Apache uses OS threads). There's an old website with the data dating back to 1998. I've managed only to find that site exactly once. So I can't supply a link. But the information is out there. The main point of the study showed that Apache maxed out just under 8K processes, while his hand written Erlang server handled 10K+ processes.

Because the Erlang runtime only has to worry about itself, whereas the OS has many other things to worry about.

One of the reasons is that an Erlang process is created not in the OS but in the Erlang virtual machine, so the cost is smaller.

Related

Are Tcl threads multi process/multi core

I'm new to using threads in tcl but thought it was a nice way to solve a problem I'm having
I was trying to read through the Tcl thread documentation, but I can't quite figure out whether Tcl spreads its threads across multiple CPU cores or tries to keep all threads on the CPU core from which the master process was started.
Tcl's threads are threads as supported by the operating system's standard libraries (e.g., they're normal POSIX threads on Linux and OSX), and so are entirely capable of running over as many cores as the OS allows.
Tcl takes care to limit the use of locks in its implementation as much as possible, so as to make multi-core operation as efficient as possible; this came from experience supporting high-performance application servers in the 1990s, where it turned out that reducing the sharing of resources was a big win as hardware scaled up the number of cores.
It also means that you've got a non-shared memory model based on structured message passing; it scales well, but it was very different to what most programmers knew at the time. It's a little bit more mainstream now because shared-memory parallelism remains annoyingly troublesome on modern hardware.
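For readers who have not met the non-shared, message-passing style before, here is a small sketch of the idea. It is written in Python with `queue.Queue`, not with Tcl's own thread commands, and the `squarer` worker is invented for the example: the worker thread never touches the caller's data directly and only communicates through messages.

```python
import queue
import threading

def squarer(inbox, outbox):
    """Worker that owns its own state and only talks through messages."""
    while True:
        msg = inbox.get()
        if msg is None:      # sentinel message: shut down
            break
        outbox.put(msg * msg)

inbox, outbox = queue.Queue(), queue.Queue()
t = threading.Thread(target=squarer, args=(inbox, outbox))
t.start()

for n in (1, 2, 3):
    inbox.put(n)
inbox.put(None)
t.join()

results = [outbox.get() for _ in range(3)]
print(results)
```

Because nothing mutable is shared apart from the queues themselves, there is no user-level locking to get wrong.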

What is the difference between multicore and concurrent programming

Can anyone help me out? I am working on a presentation and would like to include a bit about 'The difference between multicore and concurrent programming'. I have googled a bit but am not turning up many good descriptions. Any help appreciated! :)
Thanks,
Eamonn
Concurrent (occurring or existing simultaneously) implies that different code MAY execute at the exact same cycle. It means that things can possibly happen in parallel if multiple processors or a processor with multiple cores is available and the program is crafted correctly. Just adding threads does not imply concurrent execution.
The reason I say MAY and possibly is that any time the program's separate threads need to share volatile/mutable state, other threads that need access to that state cannot continue executing and will have to wait their turn to access that state, and things start happening serially again.
Typically this is implemented in a single program as more than one thread executing code at the same exact cycle as another thread, provided there is no resource contention as described above. This requires multiple physical processors or cores. Other models run multiple heavyweight OS processes that can execute concurrently.
Concurrent programming is very hard to do correctly with mutable shared state.
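A small sketch of the "wait their turn" point, in Python with a made-up function name: several threads update one shared counter, and a lock forces the updates to happen one at a time. The lock is what keeps the result correct, and it is also exactly where execution becomes serial again.

```python
import threading

def count_with_lock(n_threads=4, increments=10_000):
    """Increment one shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def work():
        nonlocal counter
        for _ in range(increments):
            with lock:        # only one thread at a time gets past this line
                counter += 1  # read-modify-write on shared mutable state

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

print(count_with_lock())
```

Without the lock the read-modify-write could interleave between threads and lose updates, which is the kind of bug that makes shared mutable state hard.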
You can write a concurrent program that runs serially on a single-core processor, but scales up to execute more things at the same time when more processors or cores, or even multiple processors with multiple cores, are present.
You can also cause single-threaded programs to appear concurrent on a multi-core / multi-processor system if they can operate on independent ranges of input data at the same time. Example: on a dual-core machine, a single-threaded 3D rendering program can run as 2 separate instances, the first rendering all the odd frames and the second rendering all the even frames, as long as they don't try to share any mutable resources.
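The odd/even split is just a partition of the input with no overlap. As a tiny sketch (the helper name `frames_for_instance` is hypothetical), each instance can work out its own frames from its index alone, so the instances never need to talk to each other:

```python
def frames_for_instance(instance, n_instances, total_frames):
    """1-based frame numbers rendered by one instance, with no overlap."""
    return list(range(instance + 1, total_frames + 1, n_instances))

print(frames_for_instance(0, 2, 6))  # first instance: odd frames
print(frames_for_instance(1, 2, 6))  # second instance: even frames
```

The same function works unchanged for more than two instances.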
Multi-core means that a single CPU has multiple Processor cores that can execute threads or processes concurrently and typically appears as multiple processors to mainstream operating systems.
It does NOT imply that programs that are single threaded gain any concurrency behaviors or benefits from the additional processor cores available.
Concurrent Programming is more broad - it just refers to writing software that will run "concurrently" - ie: more than one thing will happen at a time.
"Multi-core" programming is really referring to a specific subset of concurrent programming, in which you are targetting multiple available CPU cores on a specific machine. This is the most common form of concurrent programming (typically single process running on a single computer), but still only one form of concurrent programming.
You can do concurrent programming on a machine that has only a single CPU core. The operating system provides the illusion that more than one thread is running at the same time, it rapidly switches back-and-forth between them.
A machine with multiple cores simply needs to do this context switching less often, since two threads can run at the same time on two cores. It is only a bit special because threading bugs can make your life difficult much quicker. The odds that two threads try to access a shared memory location at the same time are much higher.
At a high level, multi-core is an attribute of the processor chip in your computer. Multi-core means it has multiple processing cores. There are several types of multi-processor computers: the old-style supercomputers with thousands of computers connected via Ethernet, systems with more than one processor (like 2 Pentium 4s), and contemporary multi-core systems where every processor package has multiple processing cores (like Intel i7). The third type is often called multi-core or Chip Multiprocessor (CMP).
Concurrent programming is an attribute of software. Concurrent programming is about writing code which is split into multiple tasks that can execute concurrently if processors are available. While concurrent programs do leverage multi-core, concurrent programming is broader in two dimensions:
Concurrent programs can run on a single core or multiple cores.
Concurrent programs can be used on any type of multi-processors I mentioned above.
Thus, to summarize:
Concurrent programming is about software that can use multiple processors if available. Those processors can be on the same chip (multi-core or Chip Multiprocessor) or on different chips (often known as SMP). You can have systems where you put two multi-core chips in the same machine, making it a CMP and an SMP at the same time. Concurrent programming will work for that as well.
Concurrent programming regards operations that appear to overlap and is primarily concerned with the complexity that arises due to non-deterministic control flow. The quantitative costs associated with concurrent programs are typically both throughput and latency. Concurrent programs are often IO bound but not always, e.g. concurrent garbage collectors are entirely on-CPU. The pedagogical example of a concurrent program is a web crawler. This program initiates requests for web pages and accepts the responses concurrently as the results of the downloads become available, accumulating a set of pages that have already been visited. Control flow is non-deterministic because the responses are not necessarily received in the same order each time the program is run. This characteristic can make it very hard to debug concurrent programs. Some applications are fundamentally concurrent, e.g. web servers must handle client connections concurrently. Erlang, F# asynchronous workflows and Scala's Akka library are perhaps the most promising approaches to highly concurrent programming.
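The crawler example can be sketched without any network access. In the Python below, `fetch` is a stand-in that just sleeps for a random short time (an assumption for illustration, not a real downloader), and `crawl` accepts responses in whatever order they complete. The set of visited pages is always the same, but the arrival order may differ from run to run, which is the non-determinism described above.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    """Stand-in for a download: sleeps a random short time, returns the URL."""
    time.sleep(random.uniform(0, 0.01))
    return url

def crawl(urls):
    """Issue all requests, then accept responses as they become available."""
    visited, arrival_order = set(), []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(fetch, u) for u in urls]
        for fut in as_completed(futures):  # completion order, not submit order
            page = fut.result()
            arrival_order.append(page)
            visited.add(page)
    return visited, arrival_order

urls = ["page%d" % i for i in range(8)]
visited, order = crawl(urls)
print(order)
```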
Multicore programming is a special case of parallel programming. Parallel programming concerns operations that are overlapped for the specific goal of improving throughput. The difficulties of concurrent programming are evaded by making control flow deterministic. Typically, programs spawn sets of child tasks that run in parallel and the parent task only continues once every subtask has finished. This makes parallel programs much easier to debug than concurrent programs. The hard part of parallel programming is performance optimization with respect to issues such as granularity and communication. The latter is still an issue in the context of multicores because there is a considerable cost associated with transferring data from one cache to another. Dense matrix-matrix multiply is a pedagogical example of parallel programming and it can be solved efficiently by using Strassen's divide-and-conquer algorithm and attacking the sub-problems in parallel. Cilk is perhaps the most promising approach for high-performance parallel programming on multicores and it has been adopted in both Intel's Threading Building Blocks and Microsoft's Task Parallel Library (in .NET 4).
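Here is a minimal fork-join sketch of the matrix example in Python. To keep it short it splits the left matrix into row blocks rather than using Strassen's algorithm, and the function names are my own. The parent hands out the blocks, waits for every child task, and then stitches the results together in a fixed order, so the output is deterministic however the tasks are scheduled. Note that in standard CPython builds with the GIL, threads will not actually speed up this pure-Python arithmetic; the point is the structure, not the timing.

```python
from concurrent.futures import ThreadPoolExecutor

def multiply_rows(rows, b):
    """Multiply a block of rows of A by the whole of B."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in rows]

def parallel_matmul(a, b, n_tasks=2):
    """Fork one task per row block, join them all, then combine in order."""
    chunk = max(1, (len(a) + n_tasks - 1) // n_tasks)
    blocks = [a[i:i + chunk] for i in range(0, len(a), chunk)]
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        # map() returns results in submission order, whatever the finish order
        parts = list(pool.map(multiply_rows, blocks, [b] * len(blocks)))
    return [row for part in parts for row in part]

print(parallel_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```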

Concurrency: Processes vs Threads

What are the main advantages of using a model for concurrency based on processes over one
based on threads and in what contexts is the latter appropriate?
Fault-tolerance and scalability are the main advantages of using Processes vs. Threads.
A system that relies on shared memory or some other kind of technology that is only available when using threads, will be useless when you want to run the system on multiple machines. Sooner or later you will need to communicate between different processes.
When using processes you are forced to deal with communication via messages, for example, this is the way Erlang handles communication. Data is not shared, so there is no risk of data corruption.
Another advantage of processes is that they can crash and you can feel relatively safe in the knowledge that you can just restart them (even across network hosts). However, if a thread crashes, it may crash the entire process, which may bring down your entire application. To illustrate: If an Erlang process crashes, you will only lose that phone call, or that webrequest, etc. Not the whole application.
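The crash-isolation point is easy to demonstrate with OS processes. In this Python sketch (the `run_job` helper is made up for the example) each job runs in its own child interpreter; the first job dies with an uncaught exception, and the parent simply records the failure and carries on with the next one.

```python
import subprocess
import sys

def run_job(code):
    """Run a job in its own OS process and return its exit status."""
    completed = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode

statuses = [
    run_job("raise RuntimeError('boom')"),  # this job crashes
    run_job("print('still serving')"),      # later jobs are unaffected
]
print(statuses)
```

A non-zero status tells the parent that a job failed, and it is free to restart it, which is roughly the role a supervisor plays in Erlang.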
In saying all this, OS processes also have many drawbacks that can make them harder to use, like the fact that it takes forever to spawn a new process. However, Erlang has its own notion of processes, which are extremely lightweight.
With that said, this discussion is really a topic of research. If you want to get into more of the details, you can give Joe Armstrong's paper on fault-tolerant systems a read; it explains a lot about Erlang and the philosophy that drives it.
The disadvantage of using a process-based model is that it will be slower. You will have to copy data between the concurrent parts of your program.
The disadvantage of using a thread-based model is that you will probably get it wrong. It may sound mean, but it's true-- show me code based on threads and I'll show you a bug. I've found bugs in threaded code that has run "correctly" for 10 years.
The advantages of using a process-based model are numerous. The separation forces you to think in terms of protocols and formal communication patterns, which means it's far more likely that you will get it right. Processes communicating with each other are easier to scale out across multiple machines. Multiple concurrent processes allow one process to crash without necessarily crashing the others.
The advantage of using a thread-based model is that it is fast.
It may be obvious which of the two I prefer, but in case it isn't: processes, every day of the week and twice on Sunday. Threads are too hard: I haven't ever met anybody who could write correct multi-threaded code; those that claim to be able to usually don't know enough about the space yet.
In this case Processes are more independent of each other, while Threads share some resources, e.g. memory. But in the general case Threads are more lightweight than Processes.
Erlang Processes are not the same thing as OS Processes. Erlang Processes are very lightweight, and Erlang can have many Erlang Processes within the same OS Thread. See Technically, why are processes in Erlang more efficient than OS threads?
First and foremost, processes differ from threads mostly in the way their memory is handled:
Process = n*Thread + memory region (n>=1)
Processes have their own isolated memory.
Processes can have multiple threads.
Processes are isolated from each other on the operating system level.
Threads share their memory with their peers in the process.
(This is often undesirable. There are libraries and methods out there to remedy this, but that is usually an artificial layer over operating system threads.)
The memory thing is the most important discerning factor, as it has certain implications:
Exchanging data between processes is slower than between threads. Breaking the process isolation always requires some involvement of kernel calls and memory remapping.
Threads are more lightweight than processes. The operating system has to allocate resources and do memory management for each process.
Using processes gives you memory isolation and synchronization. Common problems with access to memory shared between threads do not concern you. Since you have to make a special effort to share data between processes, you will most likely sync automatically with that.
Using processes gives you good (or ultimate) encapsulation. Since inter process communication needs special effort, you will be forced to define a clean interface. It is a good idea to break certain parts of your application out of the main executable. Maybe you can split dependencies like that.
e.g. Process_RobotAI <-> Process_RobotControl
The AI will have vastly different dependencies compared to the control component. The interface might be simple: Process_RobotAI --DriveXY--> Process_RobotControl.
Maybe you change the robot platform. You only have to implement a new RobotControl executable with that simple interface. You don't have to touch or even recompile anything in your AI component.
It will also, for the same reasons, speed up compilation in most cases.
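As a rough sketch of what such a narrow interface could look like, the Python below plays the AI side and starts a separate control process, sending it DriveXY messages as JSON lines over a pipe. Everything here is an assumption made for illustration: the JSON message format, the `drive` helper and the tiny control program in `CONTROL_SRC` are invented, and a real robot stack would use its own IPC or RPC layer.

```python
import json
import subprocess
import sys

# Hypothetical control process: reads DriveXY messages, acknowledges each one.
CONTROL_SRC = r'''
import json, sys
for line in sys.stdin:
    msg = json.loads(line)
    if msg["type"] == "DriveXY":
        print(json.dumps({"ack": [msg["x"], msg["y"]]}), flush=True)
'''

def drive(commands):
    """AI side: send DriveXY messages to a separate control process."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CONTROL_SRC],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    payload = "".join(
        json.dumps({"type": "DriveXY", "x": x, "y": y}) + "\n"
        for x, y in commands
    )
    out, _ = proc.communicate(payload)  # send everything, close stdin, wait
    return [json.loads(line)["ack"] for line in out.splitlines()]

print(drive([(1, 2), (3, 4)]))
```

Swapping the robot platform then means replacing only the program behind `CONTROL_SRC`, as long as it keeps speaking the same messages.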
Edit: Just for completeness I will shamelessly add what the others have reminded me of :
A crashing process does not (necessarily) crash your whole application.
In General:
1. Want to create something highly concurrent or synchronous, like an algorithm with n>>1 instances running in parallel and sharing data? Use threads.
2. Have a system with multiple components that do not need to share data or algorithms, and that do not exchange data too often? Use processes. If you use an RPC library for the inter-process communication, you get a network-distributable solution at no extra cost.
1 and 2 are the extreme and no-brainer scenarios; everything in between must be decided individually.
For a good (or awesome) example of a system that uses IPC/RPC heavily, have a look at ROS.

Linux: Processes and Threads in a Multi-core CPU

Is it true that threads, compared to processes, are less likely to benefit from a multi-core processor? In other words, would the kernel make the decision of executing threads on a single core rather than on multiple cores?
I'm talking about threads belonging to the same process.
I don't know how the (various) Linux schedulers handle this, but inter-thread communication gets more expensive when threads are running on different cores.
So the scheduler may decide to run threads of a process on the same CPU if there are other processes needing CPU time.
Eg with a Dual-Core CPU, if there are two processes with two threads and all are using all CPU time they get, it is better to run the two threads of the first process on the first Core and the two threads of the other process on the second core.
That's news to me. Linux in particular makes little distinction between threads and processes. They are really just processes that share their address-space.
Multiple single-threaded processes are more expensive to the system than a single multi-threaded one, but they will benefit from a multicore CPU with the same efficiency. Plus, inter-thread communication is much cheaper than inter-process communication. If these threads really form a single application, I vote for multithreading.
Shared-memory multithreading imposes huge complexity costs on everything from your tool-chain, to development, to debugging, reasoning, and testing your code. NEVER use shared-memory multithreading where you can reasonably use a multi-process design.
@Marcelo is right: any decent OS will treat threads and processes very similarly. Some CPU affinity for threads may reduce the multi-processor usage of a multi-threaded process, but you should see that with any two processes that share a common .text segment as well.
Pick threads vs. processes based on complexity and architectural design constraints, speed will almost never come into it.
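If you want to check what the kernel will actually allow before guessing, you can ask for the current affinity mask. The sketch below uses Python's `os.sched_getaffinity`, which is available on Linux; the fallback branch and the `allowed_cpus` name are my own additions for platforms that lack it. Threads of the same process start out with the same mask unless something changes it.

```python
import os

def allowed_cpus():
    """CPUs the calling process may be scheduled on."""
    if hasattr(os, "sched_getaffinity"):        # present on Linux
        return sorted(os.sched_getaffinity(0))  # 0 means the calling process
    # Assumed fallback elsewhere: every CPU the OS reports.
    return list(range(os.cpu_count() or 1))

print(allowed_cpus())
```

If the list contains more than one CPU, nothing stops the scheduler from placing two threads of one process on different cores.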
It actually all depends on the scheduler, type of multiprocessing, and current running environment.
Assume nothing, test, test, test!
If you're the only multi-threaded process on the system, multi-threading is generally a good idea.
However, from the perspective of the ease of development, sometimes you want separate address spaces and shared data, especially in NUMA systems.
One thing for sure: If it's a 'HyperThreaded' system, threads are much more efficient by virtue of close memory sharing.
If it is regular multi-core processing, it should be similar.
If it is a NUMA system, you're better off keeping data shared and code separate. Again, it's all architecture dependent, and it doesn't matter performance-wise unless you're in the HPC business.
If you are in the HPC (supercomputing) business, TEST!. It's all machine dependent (and benefits are 10-25% on average, it matters if you're talking days of difference)
Whereas Windows uses fibres and threads I sometimes think Linux uses processes and twine.
I've found that in writing multi-threaded processes you really have to be rigorous, pedantic, disciplined and bloody-minded in designing threaded processes in order for them to achieve a balance of benefit in using whatever number of cores are available on the machine that the process is to run on.
Is it true, on Linux, that threads, compared to processes, are less likely to benefit from a multi-core processor? No one knows.

Developing Kernels to support Multiple CPUs

I am looking to get into operating system kernel development and figured my contribution would be to extend the SANOS operating system in order to support multiple core machines. I have been reading books on operating systems (Tannenbaum) as well as studying how BSD and Linux have tackled this challenge but still am stuck on several concepts.
Does SANOS need to have more sophisticated scheduling algorithms when it runs on multiple CPUs or will what is currently in place work fine?
I know that it is a good idea for threads to have affinity to a core that they were started on, but is this handled via scheduling or by changing the implementation of how threads are created?
What would need to be considered such that SANOS could run on a machine with hundreds of cores? From what I can tell, BSD and Linux at best only support a maximum of a dozen of cores.
Your reading material is good, so no problems there. Also take a peek at the downloadable CS lectures on operating system design from Stanford.
The scheduling algorithm may need to be more sophisticated. This depends on the types of applications running and how greedy they are. Do they yield themselves or are they forced to. That kind of thing. This is more a question of what your processes want, or expect. A RTOS will have more complex scheduling than a desktop.
Threads should have an affinity to one core, because 2 threads in one process can execute in parallel ... but not at the same real time on the same core. Putting them on different cores allows them to really run in parallel. Also, caching can be optimized for core affinity. This is really a mix of your thread implementation and scheduler. The scheduler may want to ensure threads are started at the same time on different cores, rather than ad hoc, to reduce the amount of time threads wait on each other. If your thread library is user-space, maybe it assigns the core, or lets the scheduler decide based on capacity or recent deaths.
Scalability is often a kernel limit (which can be arbitrary). In Linux, if I recall, the limits are due to static sizing of arrays that hold CPU information structs in the scheduler. Hence they are a fixed size. This can be changed by recompiling the kernel. Most good scheduling algorithms will support a very large number of cores. As your core or processor count gets higher, you need to be careful that you don't fragment a process's execution too much. If a program has 2 threads, try to schedule them in close time proximity, because causation may exist (through shared data) between them.
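To give the affinity question something concrete to hang on, here is a toy model in Python of per-core run queues with a simple affinity rule. It is only a sketch of one possible policy, not how SANOS, Linux or BSD do it, and `ToyScheduler` and its methods are hypothetical names. In this model affinity lives entirely in the scheduler: a thread that has run before goes back to its previous core, and a new thread goes to the shortest queue.

```python
from collections import deque

class ToyScheduler:
    """Per-core run queues; threads prefer the core they last ran on."""
    def __init__(self, n_cores):
        self.queues = [deque() for _ in range(n_cores)]
        self.last_core = {}  # thread id -> core it was last queued on

    def enqueue(self, thread_id):
        core = self.last_core.get(thread_id)
        if core is None:
            # New thread: pick the least loaded core (lowest index on ties).
            core = min(range(len(self.queues)),
                       key=lambda c: len(self.queues[c]))
        self.queues[core].append(thread_id)
        self.last_core[thread_id] = core
        return core

    def pick_next(self, core):
        """Next thread to run on this core, or None if its queue is empty."""
        q = self.queues[core]
        return q.popleft() if q else None

sched = ToyScheduler(2)
first = sched.enqueue("t1")   # empty system: lands on core 0
second = sched.enqueue("t2")  # core 0 is busier: lands on core 1
ran = sched.pick_next(0)      # core 0 runs t1
again = sched.enqueue("t1")   # t1 returns to core 0 because of affinity
print(first, second, ran, again)
```

A real kernel would also need load balancing between queues and locking around them, which is where most of the multi-core design effort goes.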
You also need to decide how your threads are implemented, and how a process is represented (be it heavy or lightweight) in the kernel. Are threads kernel managed? user-space managed? These things all have an impact on scheduler design. Look at how POSIX threads are implemented in various operating systems. There is just so much for you to think about :)
In short, there are not really any clear-cut answers as to where the logic does, or should, reside. It is all down to design, application expectations, time constraints (on the programs) and so on.
Hope this helps, I am not an expert here however.

Resources