Why are GHC threads extremely lightweight? - multithreading

I am reading Simon Marlow's tutorial on parallel programming in Haskell, which says:
a thread typically costs less than a hundred bytes plus the space for the stack, so the runtime supports literally millions of them, limited only by the available memory, unlike OS threads ....
I had a different impression of thread costs. The kernel structures used for a single thread exceed 4MB (the thread stack), so a 32-bit process address space can hold no more than 1000 threads, not literally millions.
I think he is referring to the part that GHC controls, but if the OS already maintains a stack for each thread, why would GHC need to do that again?

Haskell uses "green threads" (managed by the Haskell runtime) in addition to, and on top of, operating system threads (you still need those to make use of multiple CPU cores).
if the OS already maintains a stack for each thread, why would GHC need to do that again?
Precisely for the reason you mention: an OS thread is heavy. A green thread can be very lightweight.
If you are familiar with Java, this is roughly equivalent to using Thread versus submitting a task to an ExecutorService backed by a thread pool.
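To make the Thread versus thread pool analogy concrete, here is a minimal sketch in Python (not Java or Haskell). The function name run_on_pool and the task counts are made up for illustration. It only shows that many small tasks can be multiplexed onto a handful of OS threads; it says nothing about how GHC's scheduler is actually implemented.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_on_pool(num_tasks: int, pool_size: int) -> set[int]:
    """Run num_tasks small tasks on at most pool_size OS threads.

    Returns the set of OS-level thread identifiers that executed the tasks.
    (run_on_pool is a hypothetical helper for this illustration.)
    """
    def task(_: int) -> int:
        # Each task reports which pool thread executed it.
        return threading.get_ident()

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return set(pool.map(task, range(num_tasks)))


if __name__ == "__main__":
    used = run_on_pool(num_tasks=10_000, pool_size=4)
    print(f"10000 tasks ran on {len(used)} OS thread(s)")
```

The tasks here are the cheap unit (a function call and a queue entry), while the pool threads are the expensive unit that the OS knows about.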

Related

User thread, Kernel thread, software thread and hardware thread

I'm studying thread and multithreading concepts and I ran into different kinds of thread:
User thread: supported above the kernel and are managed without the kernel.
Kernel thread: supported and managed directly by the operating system.
Software thread: threads of execution managed by the operating system.
Hardware thread: a feature of some processors that allow better utilization of the processor under some circumstances.
Can anyone clarify the difference between these types of threads (I'm confused)?
Thanks
Hardware threads are what allow you to actually run things in parallel (which is not the same as concurrently). They correspond to the number of your CPU cores (with nuances like hyperthreading, which can expose two hardware threads per core).
On top of that are OS (kernel) threads. They are an abstraction provided by your OS. The OS maps them to hardware threads. It does this via its internal scheduler, and we have little to no control over that. Note that in theory there may be arbitrarily many OS threads (if there are not enough cores to handle them, they simply wait for the CPU), although the price of the so-called context switch limits it to a few thousand, maybe more.
User threads (a.k.a. green threads, coroutines, etc.; they have many names) are an abstraction provided by your software (e.g. the programming language and its runtime). They run on top of OS threads, and are mapped to them via an internal (but user-space) scheduler. They tend to perform better than OS threads (especially with I/O-bound tasks) because they have lower context switch overhead, plus they can take advantage of async APIs (e.g. nonblocking sockets) without spawning OS threads (which is costly as well). Since they are lightweight, you can spawn lots of them. Some people claim to run millions of such threads at a time. I've seen tens of thousands without issues.
I've never seen the term "software thread" though. Depending on context, it means either a user thread or a kernel thread. It is unlikely to mean anything else.
Btw, no real code can run without some OS support. It can be limited, if for example you don't want things to run in parallel. But as soon as you want true parallelism, there is no escape from OS threads. The internal scheduler for user threads has to spawn OS threads and map user threads to them in some way, although typically that is an invisible implementation detail.
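The idea of a user-space scheduler described above can be sketched in a few lines of Python, using generators as the user threads. This is a toy cooperative round-robin scheduler running on a single OS thread; the names worker and run_round_robin are invented for the example, and real runtimes add preemption, I/O integration and multiple OS threads on top of this.

```python
from collections import deque
from typing import Generator, Iterable


def worker(name: str, steps: int, log: list[str]) -> Generator[None, None, None]:
    """A user thread: does one unit of work, then yields control voluntarily."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # cooperative switch point


def run_round_robin(threads: Iterable[Generator[None, None, None]]) -> None:
    """User-space scheduler: resume each runnable thread in turn until all finish."""
    ready = deque(threads)
    while ready:
        current = ready.popleft()
        try:
            next(current)          # run until the next yield
            ready.append(current)  # still runnable, so requeue it
        except StopIteration:
            pass                   # finished, so drop it


if __name__ == "__main__":
    log: list[str] = []
    run_round_robin([worker("a", 2, log), worker("b", 3, log)])
    print(log)  # ['a:0', 'b:0', 'a:1', 'b:1', 'b:2']
```

A "context switch" here is just a function return plus a deque operation; the kernel is never involved, which is where the low overhead comes from.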
"Hardware thread" is a bad name. It was chosen as a term of art by CPU designers, without much regard for what software developers think "thread" means.
When an operating system interrupts a running thread so that some other thread may be allowed to use the CPU, it must save enough of the state of the CPU so that the thread can be resumed again later on. Mostly that saved state consists of the program counter, the stack pointer, and other CPU registers that are part of the programmer's model of the CPU.
A so-called "hyperthreaded CPU" has two or more complete sets of those registers. That allows it to execute instructions on behalf of two or more program threads without any need for the operating system to intervene.
Experts in the field like nice, short names for things. Instead of talking about "complete sets of context registers," they just call them "hardware threads."

Memory management while using threads

1) I tried searching for how memory is allocated when a program uses threads, but couldn't find the answer. The question "What and where are the stack and heap?" explains how the stack and heap work for a single-threaded program. But what happens in a program with threads?
2) Using an OpenMP parallel region creates threads, and the parallel code is executed concurrently in each thread. Does this allocate more memory than the same code would occupy with sequential execution?
In general, yes, [user-space] stacks are one per thread, whereas the heap is usually shared by all threads. See for example this Linux question. However, on some operating systems (OS), on Windows in particular, even a single-threaded app may use more than one heap. Using OpenMP for threading doesn't change these basics, which are mostly dependent on the operating system. So unless you narrow your question to a specific OS, more can't be said at this level of generality.
Since I'm too lazy to draw this myself, here's the comparative illustration from PThreads Programming by Nichols et al. (1996)
A somewhat more detailed (and alas potentially a bit more confusing) diagram is found in the free LLNL POSIX Threads Programming tutorial by B. Barney.
And yes, as you correctly suspected, running more threads does consume more stack memory. You can actually exhaust the virtual address space of a process just with thread stacks if you make enough of them. Various implementations of OpenMP have a stack-size environment variable (the OpenMP specification standardizes OMP_STACKSIZE) that controls how much stack OpenMP allocates for a thread.
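A rough back-of-the-envelope sketch of the "exhaust the address space with stacks" point, in Python. The 2 GiB user address space and the 1 MiB / 8 MiB stack sizes are assumptions chosen for illustration, not properties of any particular OS. The second helper uses Python's threading.stack_size, which plays a role loosely similar to a stack-size setting in other runtimes; it can raise ValueError or RuntimeError on platforms that reject the requested size.

```python
import threading

MIB = 1024 * 1024


def max_threads_by_stack(address_space_bytes: int, stack_bytes: int) -> int:
    """Upper bound on thread count if stacks alone filled the address space."""
    return address_space_bytes // stack_bytes


def run_with_stack_size(stack_bytes: int) -> bool:
    """Start one thread with a requested stack size, then restore the old setting."""
    previous = threading.stack_size(stack_bytes)  # applies to threads created afterwards
    try:
        done = []
        t = threading.Thread(target=lambda: done.append(True))
        t.start()
        t.join()
        return bool(done)
    finally:
        threading.stack_size(previous)  # 0 means "platform default"


if __name__ == "__main__":
    user_space = 2 * 1024 * MIB  # assumed 2 GiB of usable 32-bit address space
    print(max_threads_by_stack(user_space, 1 * MIB))  # 2048
    print(max_threads_by_stack(user_space, 8 * MIB))  # 256
    print(run_with_stack_size(256 * 1024))
```

The real limit is lower than this bound, because code, heap and shared libraries also need address space.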
Regarding Z boson's question/suggestion about Thread Local Storage (TLS): roughly (i.e. conceptually) speaking, Thread Local Storage is a per-thread heap. There are differences from the per-process heap in the API used to manipulate it, at the very least because each thread needs its own separate pointer to its own TLS, but basically you have a heap-like chunk of the process address space that's reserved to each thread. TLS is optional, you don't have to use it. OpenMP provides its own abstraction/directive for TLS-like persistent per-thread data, called THREADPRIVATE. It's not necessary that the OpenMP THREADPRIVATE uses the operating system's TLS support, however there's a Linux-focused paper which says that such an implementation gave the best performance, at least in that environment.
And here is a subtlety (or why I said "roughly speaking" when I compared TLS to per-thread heaps): assume you want a per-thread heap, say, in order to reduce locking contention on the main heap. You don't actually have to store an entire per-thread heap in each thread's TLS. It suffices to store in each thread's TLS a different head pointer to heaps allocated in the shared per-process space. Identifying and automatically using per-thread heaps in a program (in order to reduce locking contention on the main heap) is a fairly difficult CS problem. Heap allocators which do this automatically are called scalable/parallel[izing] heap allocators or thereabout. For example, Intel TBB provides one such allocator, and it can be used in your program even if you use nothing else from TBB. Although some people seem to believe Intel's TBB allocator contains black magic, it's in fact not really different from the aforementioned basic idea of using TLS to point to some thread-local heap, which in turn is made of several doubly-linked lists segregated by block/object-size, as the following diagrams from the Intel paper on TBB illustrate:
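The basic idea (TLS holds only a head pointer to per-thread free lists segregated by size, with a locked shared heap as the fallback) can be sketched as a toy buffer pool in Python. This is not how the Intel TBB allocator is implemented; the class ThreadCachedPool and its counter are hypothetical and exist only to show the fast path and the contended path.

```python
import threading
from collections import defaultdict


class ThreadCachedPool:
    """Toy buffer pool: per-thread free lists by size, with a locked shared fallback."""

    def __init__(self) -> None:
        self._tls = threading.local()        # each thread sees its own attributes
        self._shared_lock = threading.Lock()
        self.shared_allocations = 0          # how often the contended path was taken

    def _bins(self) -> defaultdict:
        # The TLS slot only holds a reference ("head pointer") to this thread's bins.
        bins = getattr(self._tls, "bins", None)
        if bins is None:
            bins = self._tls.bins = defaultdict(list)
        return bins

    def alloc(self, size: int) -> bytearray:
        free_list = self._bins()[size]
        if free_list:
            return free_list.pop()           # fast path: no lock taken
        with self._shared_lock:              # slow path: the shared "main heap"
            self.shared_allocations += 1
            return bytearray(size)

    def free(self, buf: bytearray) -> None:
        # Return the block to the calling thread's bin for its size class.
        self._bins()[len(buf)].append(buf)


if __name__ == "__main__":
    pool = ThreadCachedPool()
    a = pool.alloc(64)
    pool.free(a)
    b = pool.alloc(64)                       # served from the thread-local bin
    print(b is a, pool.shared_allocations)   # True 1
```

Only the first allocation of a given size touches the lock; later allocations of freed blocks stay entirely within the calling thread.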
IBM has something rather similar for AIX 7.1, but a bit more complex. You can tell its (default) allocator to use a fixed number of heaps for multi-threaded applications, e.g. MALLOCOPTIONS=multiheap:3. AIX 7.1 also has another option (which can be combined with multiheap), MALLOCOPTIONS=threadcache, which appears somewhat similar to what Intel TBB does, in that it keeps a per-thread cache of deallocated regions, from which future allocation requests can be serviced with less global heap contention. Besides those options for the default allocator, AIX 7.1 also has a (non-default) "Watson2" allocator which "uses a thread-specific mechanism that uses a varying number of heap structures, which depend on the behavior of the program. Therefore no configuration options are required." (But you do need to select this allocator explicitly with MALLOCTYPE=Watson2.) Watson2's operation sounds even closer to what the Intel TBB allocator does.
The two examples detailed above (Intel TBB and AIX) are just meant as concrete examples, and shouldn't be understood as holding some exclusive sauce. The idea of a per-thread or per-CPU heap cache/arena/magazine is fairly widespread. The BSDcan jemalloc paper cites a 1998 MS Research paper as the first to have systematically evaluated arenas for this purpose. The aforementioned MS paper does cite the ptmalloc web page as "visited on May 11, 1998" and summarizes ptmalloc's working as follows: "It uses a linked list of subheaps where each subheap has a lock, 128 free lists, and some memory to manage. When a thread needs to allocate a block, it scans the list of subheaps and grabs the first unlocked one, allocates the required block, and returns. If it can't find an unlocked subheap, it creates a new one and adds it to the list. In this way, a thread never waits on a locked subheap."
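The subheap selection described in that quote can be sketched roughly as follows, in Python. This is only a model of the quoted summary, not ptmalloc's code: the classes Subheap and SubheapList are hypothetical, the "allocation" is just a byte counter, and the sketch adds a short lock around growing the list, which the quote does not mention.

```python
import threading


class Subheap:
    def __init__(self) -> None:
        self.lock = threading.Lock()
        self.allocated = 0  # stand-in for the free lists and managed memory


class SubheapList:
    """First-unlocked-subheap selection, modelled on the quoted summary."""

    def __init__(self) -> None:
        self._list_lock = threading.Lock()  # guards growth of the list (an assumption)
        self.subheaps = [Subheap()]

    def acquire(self) -> Subheap:
        # Scan for a subheap whose lock can be taken without waiting.
        for heap in list(self.subheaps):
            if heap.lock.acquire(blocking=False):
                return heap
        # Every subheap is busy: create a new one, already locked for the caller.
        heap = Subheap()
        heap.lock.acquire()
        with self._list_lock:
            self.subheaps.append(heap)
        return heap

    def allocate(self, size: int) -> None:
        heap = self.acquire()
        try:
            heap.allocated += size  # the actual block allocation would happen here
        finally:
            heap.lock.release()


if __name__ == "__main__":
    heaps = SubheapList()
    first = heaps.acquire()       # takes the initial subheap
    second = heaps.acquire()      # the first is locked, so a new one is created
    print(len(heaps.subheaps))    # 2
    first.lock.release()
    second.lock.release()
```

The number of subheaps therefore grows to roughly the peak number of threads that were allocating at the same time.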

Haskell lightweight threads overhead and use on multicores

I've been reading the "Real World Haskell" book, the chapter on concurrency and parallelism. My question is as follows:
Since Haskell threads are really just multiple "virtual" threads inside one "real" OS thread, does this mean that creating a lot of them (like 1000) will not have a drastic impact on performance? I.e., can we say that the overhead incurred from creating a Haskell thread with forkIO is (almost) negligible? Please give practical examples if possible.
Doesn't the concept of lightweight threads prevent us from using the benefits of multicore architectures? As I understand it, it is not possible for two Haskell threads to execute concurrently on two separate cores, because they are really one single thread from the operating system's point of view. Or does the Haskell runtime do some clever tricks to ensure that multiple CPUs can be made use of?
GHC's runtime provides an execution environment supporting billions of sparks, thousands of lightweight threads, which may be distributed over multiple hardware cores. Compile with -threaded and use the +RTS -N4 flags to set your desired number of cores.
Specifically:
does this mean that creating a lot of them (like 1000) will not have a drastic impact on performance?
Well, creating 1,000,000 of them is certainly possible. 1000 is so cheap it won't even show up. You can see in thread creation benchmarks, such as "thread ring" that GHC is very, very good.
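For readers who have not seen it, the "thread ring" benchmark passes a token around a ring of lightweight threads. Below is a small sketch of the same shape in Python using asyncio tasks instead of GHC threads, purely as an analogy: asyncio tasks are cooperative coroutines on one OS thread, so this says nothing about GHC's actual numbers. The function run_ring and its parameters are invented for the example.

```python
import asyncio


async def run_ring(num_nodes: int, hops: int) -> int:
    """Pass a token around a ring of tasks; return the id of the node that sees 0."""
    queues = [asyncio.Queue() for _ in range(num_nodes)]
    finished = asyncio.get_running_loop().create_future()

    async def node(node_id: int) -> None:
        inbox = queues[node_id]
        outbox = queues[(node_id + 1) % num_nodes]
        while True:
            token = await inbox.get()
            if token == 0:
                finished.set_result(node_id)  # the ring stops here
                return
            await outbox.put(token - 1)       # hand the token to the next node

    tasks = [asyncio.create_task(node(i)) for i in range(num_nodes)]
    await queues[0].put(hops)                 # inject the token at node 0
    winner = await finished
    for t in tasks:
        t.cancel()                            # the remaining nodes are still waiting
    await asyncio.gather(*tasks, return_exceptions=True)
    return winner


if __name__ == "__main__":
    # 1000 lightweight tasks, 10000 hand-offs; the token ends at node 10000 % 1000.
    print(asyncio.run(run_ring(num_nodes=1000, hops=10_000)))  # 0
```

Creating the thousand tasks is a small fraction of the work; almost all the time goes into the hand-offs, which is what the benchmark is designed to stress.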
Doesn't the concept of lightweight threads prevent us from using the benefits of multicore architectures?
Not at all. GHC has been running on multicores since 2004. The current status of the multicore runtime is tracked here.
How does it do it? The best place to read up on this architecture is in the paper, "Runtime Support for Multicore Haskell":
The GHC runtime system supports millions of lightweight threads
by multiplexing them onto a handful of operating system threads,
roughly one for each physical CPU. ...
Haskell threads are executed by a set of operating system
threads, which we call worker threads. We maintain roughly one
worker thread per physical CPU, but exactly which worker thread
may vary from moment to moment ...
Since the worker thread may change, we maintain exactly one
Haskell Execution Context (HEC) for each CPU. The HEC is a
data structure that contains all the data that an OS worker thread
requires in order to execute Haskell threads
You can monitor your threads being created, and where they're executing, via ThreadScope. Here, e.g., running the binary-trees benchmark:
The Warp webserver uses these lightweight threads extensively to get really good performance. Note that the other Haskell web servers also smoke the competition: this is more of a "Haskell is good" than "Warp is good."
Haskell provides a multithreaded runtime which can distribute lightweight threads across multiple system threads. It works very well for up to 4 cores. Past that, there are some performance issues, though those are being actively worked on.
Creating 1000 threads is relatively lightweight; don't worry about doing it. As for performance, you should just benchmark it.
As has been pointed out before, multiple cores work just fine. Several Haskell threads can run at the same time by being scheduled on different OS threads.

Technically, why are processes in Erlang more efficient than OS threads?

Erlang's Characteristics
From Erlang Programming (2009):
Erlang concurrency is fast and scalable. Its processes are lightweight in that the Erlang virtual machine does not create an OS thread for every created process. They are created, scheduled, and handled in the VM, independent of underlying operating system. As a result, process creation time is of the order of microseconds and independent of the number of concurrently existing processes. Compare this with Java and C#, where for every process an underlying OS thread is created: you will get some very competitive comparisons, with Erlang greatly outperforming both languages.
From Concurrency oriented programming in Erlang (pdf) (slides) (2003):
We observe that the time taken to create an Erlang process is constant 1µs up to 2,500 processes; thereafter it increases to about 3µs for up to 30,000 processes. The performance of Java and C# is shown at the top of the figure. For a small number of processes it takes about 300µs to create a process. Creating more than two thousand processes is impossible.
We see that for up to 30,000 processes the time to send a message between two Erlang processes is about 0.8µs. For C# it takes about 50µs per message, up to the maximum number of processes (which was about 1800 processes). Java was even worse: for up to 100 processes it took about 50µs per message; thereafter it increased rapidly to 10ms per message when there were about 1000 Java processes.
My thoughts
I don't fully understand technically why Erlang processes are so much more efficient in spawning new processes and have much smaller memory footprints per process. Both the OS and Erlang VM have to do scheduling, context switches, and keep track of the values in the registers and so on...
Simply put, why aren't OS threads implemented in the same way as processes in Erlang? Do they have to support something more? And why do they need a bigger memory footprint? And why do they have slower spawning and communication?
Technically, why are processes in Erlang more efficient than OS threads when it comes to spawning and communication? And why can't threads in the OS be implemented and managed in the same efficient way? And why do OS threads have a bigger memory footprint, plus slower spawning and communication?
More reading
Inside the Erlang VM with focus on SMP (2008)
Concurrency in Java and in Erlang (pdf) (2004)
Performance Measurements of Threads in Java and Processes in Erlang (1998)
There are several contributing factors:
Erlang processes are not OS processes. They are implemented by the Erlang VM using a lightweight cooperative threading model (preemptive at the Erlang level, but under the control of a cooperatively scheduled runtime). This means that it is much cheaper to switch context, because they only switch at known, controlled points and therefore don't have to save the entire CPU state (normal, SSE and FPU registers, address space mapping, etc.).
Erlang processes use dynamically allocated stacks, which start very small and grow as necessary. This permits the spawning of many thousands — even millions — of Erlang processes without sucking up all available RAM.
Erlang used to be single-threaded, meaning that there was no requirement to ensure thread-safety between processes. It now supports SMP, but the interaction between Erlang processes on the same scheduler/core is still very lightweight (there are separate run queues per core).
After some more research I found a presentation by Joe Armstrong.
From Erlang - software for a concurrent world (presentation) (at 13 min):
[Erlang] is a concurrent language – by that I mean that threads are part of the programming language, they do not belong to the operating system. That's really what's wrong with programming languages like Java and C++. Their threads aren't in the programming language, threads are something in the operating system – and they inherit all the problems that they have in the operating system. One of the problems is granularity of the memory management system. The memory management in the operating system protects whole pages of memory, so the smallest size that a thread can be is the smallest size of a page. That's actually too big.
If you add more memory to your machine – you have the same number of bits that protects the memory so the granularity of the page tables goes up – you end up using say 64kB for a process you know running in a few hundred bytes.
I think it answers, if not all, at least a few of my questions.
I've implemented coroutines in assembler, and measured performance.
Switching between coroutines, a.k.a. Erlang processes, takes about 16 instructions and 20 nanoseconds on a modern processor. Also, you often know the process you are switching to (example: a process receiving a message in its queue can be implemented as straight hand-off from the calling process to the receiving process) so the scheduler doesn't come into play, making it an O(1) operation.
To switch OS threads, it takes about 500-1000 nanoseconds, because you're calling down to the kernel. The OS thread scheduler might run in O(log(n)) or O(log(log(n))) time, which will start to be noticeable if you have tens of thousands, or even millions of threads.
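If you want to get a feel for the gap yourself, here is a crude measurement sketch in Python. Interpreter overhead and the GIL dominate both numbers, so the results are not comparable to the assembler and kernel figures quoted above; the sketch only contrasts a user-space switch (resuming a generator) with a hand-off between two OS threads through semaphores. Both function names are made up for the example.

```python
import threading
import time


def coroutine_roundtrip_ns(rounds: int) -> float:
    """Average time to switch into a generator and back, in nanoseconds."""
    def pong():
        while True:
            yield

    gen = pong()
    next(gen)  # prime the generator so it is parked at its yield
    start = time.perf_counter_ns()
    for _ in range(rounds):
        next(gen)  # user-space switch: no kernel involvement
    return (time.perf_counter_ns() - start) / rounds


def thread_roundtrip_ns(rounds: int) -> float:
    """Average time for a ping-pong between two OS threads, in nanoseconds."""
    ping, pong = threading.Semaphore(0), threading.Semaphore(0)

    def partner() -> None:
        for _ in range(rounds):
            ping.acquire()   # wait for the main thread
            pong.release()   # wake the main thread

    t = threading.Thread(target=partner)
    t.start()
    start = time.perf_counter_ns()
    for _ in range(rounds):
        ping.release()       # wake the partner thread
        pong.acquire()       # block until it answers
    elapsed = time.perf_counter_ns() - start
    t.join()
    return elapsed / rounds


if __name__ == "__main__":
    print(f"generator round trip: {coroutine_roundtrip_ns(100_000):.0f} ns")
    print(f"OS thread round trip: {thread_roundtrip_ns(10_000):.0f} ns")
```

Treat the output as a rough ratio on your machine, nothing more.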
Therefore, Erlang processes are faster and scale better because both the fundamental operation of switching is faster, and the scheduler runs less often.
Erlang processes correspond (approximately) to green threads in other languages; there's no OS-enforced separation between the processes. (There may well be language-enforced separation, but that's a lesser protection despite Erlang doing a better job than most.) Because they're so much lighter-weight, they can be used far more extensively.
OS threads on the other hand are able to be simply scheduled on different CPU cores, and are (mostly) able to support independent CPU-bound processing. OS processes are like OS threads, but with much stronger OS-enforced separation. The price of these capabilities is that OS threads and (even more so) processes are more expensive.
Another way to understand the difference is this. Supposing you were going to write an implementation of Erlang on top of the JVM (not a particularly crazy suggestion) then you'd make each Erlang process be an object with some state. You'd then have a pool of Thread instances (typically sized according to the number of cores in your host system; that's a tunable parameter in real Erlang runtimes BTW) which run the Erlang processes. In turn, that will distribute the work that is to be done across the real system resources available. It's a pretty neat way of doing things, but relies utterly on the fact that each individual Erlang process doesn't do very much. That's OK of course; Erlang is structured to not require those individual processes to be heavyweight since it is the overall ensemble of them which execute the program.
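The "processes as objects run by a pool of threads" design described above can be sketched like this, in Python rather than on the JVM. Everything here (Process, Runtime, send, wait_idle) is a hypothetical illustration, not how any real Erlang runtime works: each process is an object with a mailbox, and a fixed pool of worker threads handles one message per scheduling turn.

```python
import queue
import threading
from collections import deque
from typing import Callable


class Process:
    """A lightweight process: private state, a mailbox, and a message handler."""

    def __init__(self, handler: Callable[["Process", object], None]) -> None:
        self.handler = handler
        self.state: dict = {}
        self.mailbox: deque = deque()
        self.lock = threading.Lock()
        self.scheduled = False  # True while queued for, or running on, a worker


class Runtime:
    def __init__(self, num_workers: int) -> None:
        self.run_queue: queue.Queue = queue.Queue()
        for _ in range(num_workers):
            threading.Thread(target=self._work, daemon=True).start()

    def send(self, proc: Process, msg: object) -> None:
        with proc.lock:
            proc.mailbox.append(msg)
            if proc.scheduled:
                return            # a worker will pick the message up later
            proc.scheduled = True
        self.run_queue.put(proc)  # make the process runnable

    def _work(self) -> None:
        while True:
            proc = self.run_queue.get()
            with proc.lock:
                msg = proc.mailbox.popleft()
            proc.handler(proc, msg)            # one message per turn
            with proc.lock:
                requeue = bool(proc.mailbox)   # more mail arrived meanwhile?
                proc.scheduled = requeue
            if requeue:
                self.run_queue.put(proc)
            self.run_queue.task_done()

    def wait_idle(self) -> None:
        self.run_queue.join()  # returns once every queued turn has completed


if __name__ == "__main__":
    def add(proc: Process, msg: object) -> None:
        proc.state["n"] = proc.state.get("n", 0) + msg

    rt = Runtime(num_workers=4)
    procs = [Process(add) for _ in range(10_000)]
    for p in procs:
        rt.send(p, 1)
    rt.wait_idle()
    print(sum(p.state["n"] for p in procs))  # 10000
```

The scheduled flag guarantees that a given process runs on at most one worker at a time, so handlers can touch their own state without locking. An error inside a handler would kill its worker in this sketch; a real runtime would need supervision for that.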
In many ways, the real problem is one of terminology. The things that Erlang calls processes (and which correspond strongly to the same concept in CSP, CCS, and particularly the π-calculus) are simply not the same as the things that languages with a C heritage (including C++, Java, C#, and many others) call a process or a thread. There are some similarities (all involve some notion of concurrent execution) but there's definitely no equivalence. So be careful when someone says “process” to you; they might understand it to mean something utterly different…
I think Jonas wanted some numbers comparing OS threads to Erlang processes. The author of Programming Erlang, Joe Armstrong, a while back tested the scalability of spawning Erlang processes against OS threads. He wrote a simple web server in Erlang and tested it against multi-threaded Apache (since Apache uses OS threads). There's an old website with the data dating back to 1998. I've managed to find that site exactly once, so I can't supply a link. But the information is out there. The main point of the study was that Apache maxed out at just under 8K processes, while his hand-written Erlang server handled 10K+ processes.
Because the Erlang interpreter only has to worry about itself, while the OS has many other things to worry about.
One of the reasons is that an Erlang process is created not in the OS but in the EVM (Erlang virtual machine), so the cost is smaller.

Heavy weight and light weight thread

What are the Light weight and heavy weight threads in terms of Java?
It's related to the amount of "context" associated with a thread, and consequently the amount of time it takes to perform a "context switch".
Heavyweight threads (usually kernel/OS-level threads) have a lot of context (hardware registers, kernel stacks, etc.), so it takes a lot of time to switch between threads. Heavyweight threads may also have restrictions on them; for example, on some OSes kernel threads cannot be pre-empted, which means they can't forcibly be switched out until they give up control.
Lightweight threads, on the other hand (usually user-space threads), have much less context. They essentially share the same hardware context and only need to store the context of the user stack, hence the time taken to switch lightweight threads is much shorter.
On most OSes, any threads you create as a programmer in user space will be lightweight in comparison to the kernel space threads. There is no formal definition of heavyweight and lightweight, it's just more of a comparison between threads with more context and threads with less context. Don't forget that every OS has its own different implementation of threads, and the lines between heavy and light threads are not necessarily clearly defined. In some programming languages and frameworks, when you create a "Thread" you might not even be getting a full thread, you might just be getting some abstraction that hides the real number of threads underneath.
[Some OSes allow threads to share address space, so threads that would usually be quite heavy, are slightly lighter]
Java standard threads are reasonably heavy in comparison to Erlang threads which are very light spawnable processes. Erlang demonstrates a distributed finite state machine.
However, as an example, http://kilim.malhar.net/ , a Java extension library based on the Actor model of concurrency, proposes a construct for lightweight threads in Java. Instead of a Thread implementing run(), a Kilim thread extends a class from the Kilim library and implements an execute() method. Apparently it shows Java's runtime outperforming Erlang's (at least in a local environment, AFAIK). Java did actually have such things in the original language spec, called 'green threads', but subsequent Java versions dropped them in favor of native threads.
In most systems, lightweight threads are the normal threads you create with the help of a library, like pthreads on Linux.
Heavyweight, in some systems, refers to a system process, with its own virtual memory and a more complex structure, such as information about the process's performance/statistics.
For more information:
http://www.computerworld.com/s/article/66405/Processes_and_Threads
http://msdn.microsoft.com/en-us/library/ms684841(VS.85).aspx
