Why does GHC have a stack for each thread?

It's my understanding that GHC gives each thread a stack. Why is this necessary? Doesn't GHC compile to CPS? Isn't a thread expressed concisely as a closure?

There are several aspects to your question.
The key reference for the design decisions in the GHC runtime is the paper ''Runtime Support for Multicore Haskell''.
Recall that
The GHC runtime system supports millions of lightweight threads
by multiplexing them onto a handful of operating system threads,
roughly one for each physical CPU.
And:
Each Haskell thread runs on a finite-sized stack, which is allocated in
the heap. The state of a thread, together with its stack, is kept in a
heap-allocated thread state object (TSO). The size of a TSO is around
15 words plus the stack, and constitutes the whole state of a Haskell
thread. A stack may grow by copying the TSO into a larger area, and
may subsequently shrink again.
GHC does not compile via CPS. Each thread makes recursive calls, and those calls must allocate stack frames. Representing the stack as a heap-allocated object keeps this machinery simple.
A thread is more than a closure.
As a thread executes, it starts allocating to the heap and stack. Thus:
The stack of a thread, and hence its TSO, is mutable. When a
thread executes, the stack will accumulate pointers to new objects,
and so if the TSO resides in an old generation it must be added to
the remembered set [of the GC].
Garbage collection of the objects a stack points to can be optimized by ensuring that the GC traverses each stack on the same physical CPU that executed the corresponding thread.
Furthermore, when the garbage collector runs,
it is highly desirable that the TSOs that have been executed on a
given CPU are traversed by the garbage collector on the same CPU,
because the TSO and data it refers to are likely to be in the local
cache of that CPU.
So, GHC has a stack for each thread because its compilation model requires that threads have access to a stack and a heap. By giving each thread its own stack, threads can execute in parallel more efficiently. And threads are more than "just a closure", since they carry a mutable stack.

Related

preemptions due to malloc

I am thinking of the following scenario and I want to double check it with you.
One Linux process with two or more threads running in parallel on different cores. Let's say that they both call malloc with the same amount, such that malloc will not have to invoke mmap. In other words, the heap is big enough and was previously increased by other sbrk invocations. In such a case, the memory allocations happen entirely in user space. By looking on GitHub I have seen that there is a mutex protecting the internal data structures that malloc uses.
My question is: can a thread be preempted by the kernel given that both threads try to acquire the same lock? In other words, will one of the threads suffer a penalty in its execution due to the fact that the other holds that lock?
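For concreteness, here is a minimal sketch of the scenario being described (my illustration, not part of the question; the allocation size and iteration count are arbitrary): two threads repeatedly make small allocations that the existing heap can satisfy, so any blocking would happen on malloc's internal lock rather than in mmap/sbrk. (Note that modern glibc actually gives threads separate arenas, so the two workers here may not contend at all.)

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    enum { ITERATIONS = 1000000, SIZE = 64 };

    /* Both threads allocate the same small size, so the requests are
       serviced from the existing heap with no mmap/sbrk involved. */
    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < ITERATIONS; i++) {
            void *p = malloc(SIZE);
            if (!p) abort();
            free(p);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        puts("done");
        return 0;
    }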

How can there be multiple call stacks allocated at the same time? How does the stack pointer change between threads?

Summary of my understanding:
The top memory addresses are used for the call stack (I initially thought there was only one call stack), and the stack grows downwards (What and where are the stack and heap?)
However, each thread gets its own stack allocated, so there should be multiple call stacks in memory (https://stackoverflow.com/a/80113/2415178)
Applications can share threads (e.g., the key application is using the main thread), but several threads can be running at the same time.
There is a CPU register called sp that holds the stack pointer, which marks the top of the current call stack.
So here's my confusion:
Do all of the call stacks necessary for an application (if this is even possible to know) get allocated when the application gets launched? Or do call stacks get allocated/de-allocated dynamically as applications spin off new threads? And if that is the case, (I know stacks have a fixed size), do the new stacks just get allocated right below the previous stacks-- So you would end up with a stack of stacks in the top addresses of memory? Or am I just fundamentally misunderstanding how call stacks are being created/used?
I am an OS X application developer, so my visual reference for how call stacks are created comes from Xcode's stack debugger.
Now I realize that how things are here are more than likely unique to OS X, but I was hoping that conventions would be similar across operating systems.
It appears that each application can execute code on multiple threads, and even spin off new worker threads that belong to the application-- and every thread needs a call stack to keep track of the stack frames.
Which leads me to my last question:
How does the sp register work if there are multiple call stacks? Is it only used for the main call stack? (Presumably the top-most call stack in memory, and associated with the main thread of the OS) [https://stackoverflow.com/a/1213360/2415178]
Do all of the call stacks necessary for an application (if this is even possible to know) get allocated when the application gets launched?
No. Typically, each thread's stack is allocated when that thread is created.
Or do call stacks get allocated/de-allocated dynamically as applications spin off new threads?
Yes.
And if that is the case, (I know stacks have a fixed size), do the new stacks just get allocated right below the previous stacks-- So you would end up with a stack of stacks in the top addresses of memory? Or am I just fundamentally misunderstanding how call stacks are being created/used?
It varies. But the stack just has to be at the top of a large enough chunk of available address space in the memory map for that particular process. It doesn't have to be at the very top. If you need 1MB for the stack, and you have 1MB, you can just reserve that 1MB and have the stack start at the top of it.
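As a sketch of "just reserve that 1MB" (my example, assuming POSIX threads): the stack size of a new thread can be set explicitly through thread attributes, and the system places the resulting stack wherever a large enough chunk of address space is free.

    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg) {
        (void)arg;
        puts("running on a 1MB stack");
        return NULL;
    }

    int main(void) {
        pthread_attr_t attr;
        pthread_t t;
        pthread_attr_init(&attr);
        /* Reserve exactly 1MB for this thread's stack
           (must be at least PTHREAD_STACK_MIN). */
        pthread_attr_setstacksize(&attr, 1024 * 1024);
        pthread_create(&t, &attr, worker, NULL);
        pthread_join(t, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }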
How does the sp register work if there are multiple call stacks? Is it only used for the main call stack?
A CPU has as many register sets as threads that can run at a time. When the running thread is switched, the leaving thread's stack pointer is saved and the new thread's stack pointer is restored -- just like all other registers.
There is no "main thread of the OS". There are some kernel threads that do only kernel tasks, but also user-space threads also run in kernel space to run the OS code. Pure kernel threads have their own stacks somewhere in kernel memory. But just like normal threads, it doesn't have to be at the very top, the stack pointer just has to start at the highest address in the chunk used for that stack.
There is no such thing as the "main thread of the OS". Every process has its own set of threads, and those threads are specific to that process, not shared. Typically, at any given point in time, most threads on a system will be suspended awaiting input.
Every thread in a process has its own stack, which is allocated when the thread is created. Most operating systems will leave some space between each stack to allow them to grow if needed, and to prevent them from colliding with each other.
Every thread also has its own set of CPU registers, including a stack pointer (pointing to a location in that thread's stack).
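A small sketch of those last two points (my illustration, using POSIX threads): each thread prints the address of a local variable, and the printed addresses fall in disjoint regions, one per stack.

    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *name) {
        int local = 0;   /* allocated on this thread's own stack */
        printf("%s: &local = %p\n", (const char *)name, (void *)&local);
        return NULL;
    }

    int main(void) {
        int main_local = 0;
        pthread_t t1, t2;
        printf("main: &main_local = %p\n", (void *)&main_local);
        pthread_create(&t1, NULL, worker, "thread 1");
        pthread_create(&t2, NULL, worker, "thread 2");
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }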

Memory management while using threads

1) I tried searching for how memory is allocated when we use threads in a program, but couldn't find the answer. What and where are the stack and heap? covers how the stack and heap work when a single program runs. But what happens with a program that uses threads?
2) Using an OpenMP parallel region creates threads, and the parallel code is executed concurrently in each thread. Does this occupy more memory than the same code executed sequentially?
In general, yes, [user-space] stacks are one per thread, whereas the heap is usually shared by all threads. See for example this Linux question. However, on some operating systems (OS), on Windows in particular, even a single-threaded app may use more than one heap. Using OpenMP for threading doesn't change these basics, which are mostly dependent on the operating system. So unless you narrow your question to a specific OS, more can't be said at this level of generality.
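A minimal OpenMP sketch of that division (my example): a local declared inside the parallel region lives on each thread's private stack, while a heap object allocated before the region is shared by every thread.

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        int *shared_heap = malloc(sizeof *shared_heap);  /* one heap object, shared */
        *shared_heap = 42;

        #pragma omp parallel
        {
            int local = omp_get_thread_num();  /* on this thread's private stack */
            printf("thread %d: &local = %p, shared_heap = %p (*shared_heap = %d)\n",
                   local, (void *)&local, (void *)shared_heap, *shared_heap);
        }

        free(shared_heap);
        return 0;
    }

Compile with something like cc -fopenmp; the &local addresses differ per thread, while shared_heap is the same everywhere.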
Since I'm too lazy to draw this myself, here's the comparative illustration from PThreads Programming by Nichols et al. (1996).
A somewhat more detailed (and alas potentially a bit more confusing) diagram is found in the free LLNL POSIX Threads Programming tutorial by B. Barney.
And yes, as you correctly suspected, running more threads does consume more stack memory. You can actually exhaust the virtual address space of a process just with thread stacks if you make enough of them. Various implementations of OpenMP have a STACKSIZE environment variable (or thereabout) that controls how much stack OpenMP allocates for a thread.
Regarding Z boson's question/suggestion about Thread Local Storage (TLS): roughly (i.e. conceptually) speaking, Thread Local Storage is a per-thread heap. There are differences from the per-process heap in the API used to manipulate it, at the very least because each thread needs its own separate pointer to its own TLS, but basically you have a heap-like chunk of the process address space that's reserved to each thread. TLS is optional; you don't have to use it. OpenMP provides its own abstraction/directive for TLS-like persistent per-thread data, called THREADPRIVATE. OpenMP's THREADPRIVATE does not have to be implemented on top of the operating system's TLS support; however, there's a Linux-focused paper which says that such an implementation gave the best performance, at least in that environment.
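As a minimal language-level sketch of TLS (my example, using C11's _Thread_local; GCC's older __thread is equivalent): each thread sees its own instance of the variable, and updates in one thread are invisible to the others.

    #include <pthread.h>
    #include <stdio.h>

    static _Thread_local int tls_counter = 0;   /* one instance per thread */

    static void *worker(void *name) {
        for (int i = 0; i < 3; i++)
            tls_counter++;
        printf("%s: tls_counter = %d\n", (const char *)name, tls_counter);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, "thread 1");
        pthread_create(&t2, NULL, worker, "thread 2");
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("main: tls_counter = %d\n", tls_counter);  /* still 0 here */
        return 0;
    }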
And here is a subtlety (or why I said "roughly speaking" when I compared TLS to per-thread heaps): assume you want a per-thread heap, say, in order to reduce locking contention on the main heap. You don't actually have to store an entire per-thread heap in each thread's TLS. It suffices to store in each thread's TLS a different head pointer to heaps allocated in the shared per-process space. Identifying and automatically using per-thread heaps in a program (in order to reduce locking contention on the main heap) is a fairly difficult CS problem. Heap allocators which do this automatically are called scalable/parallel[izing] heap allocators or thereabout. For example, Intel TBB provides one such allocator, and it can be used in your program even if you use nothing else from TBB. Although some people seem to believe Intel's TBB allocator contains black magic, it's in fact not really different from the aforementioned basic idea of using TLS to point to some thread-local heap, which in turn is made of several doubly-linked lists segregated by block/object size, as the diagrams in the Intel paper on TBB illustrate.
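A toy sketch of that head-pointer idea (illustrative only; real scalable allocators such as TBB's are far more elaborate): each thread keeps the head of its own free list in TLS, so reusing a freed block never touches the shared heap's lock.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct block { struct block *next; } block;

    enum { BLOCK_SIZE = 64 };                /* fixed size, >= sizeof(block) */

    static _Thread_local block *free_head;   /* per-thread free-list head, in TLS */

    static void *my_alloc(void) {
        if (free_head) {                     /* fast path: this thread's own list */
            block *b = free_head;
            free_head = b->next;
            return b;
        }
        return malloc(BLOCK_SIZE);           /* slow path: shared (locked) heap */
    }

    /* Toy simplification: a block freed by a thread stays on that
       thread's list, even if another thread allocated it. */
    static void my_free(void *p) {
        block *b = p;
        b->next = free_head;
        free_head = b;
    }

    static void *worker(void *arg) {
        (void)arg;
        void *p = my_alloc();
        my_free(p);
        void *q = my_alloc();                /* served lock-free from the list */
        printf("reused locally: %s\n", p == q ? "yes" : "no");
        my_free(q);
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        pthread_join(t, NULL);
        return 0;
    }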
IBM has something rather similar for AIX 7.1, but a bit more complex. You can tell its (default) allocator to use a fixed number of heaps for multi-threaded applications, e.g. MALLOCOPTIONS=multiheap:3. AIX 7.1 also has another option (which can be combined with multiheap), MALLOCOPTIONS=threadcache, which appears somewhat similar to what Intel TBB does, in that it keeps a per-thread cache of deallocated regions, from which future allocation requests can be serviced with less global heap contention. Besides those options for the default allocator, AIX 7.1 also has a (non-default) "Watson2" allocator which "uses a thread-specific mechanism that uses a varying number of heap structures, which depend on the behavior of the program. Therefore no configuration options are required." (But you do need to select this allocator explicitly with MALLOCTYPE=Watson2.) Watson2's operation sounds even closer to what the Intel TBB allocator does.
The two examples detailed above (Intel TBB and AIX) are just meant as concrete illustrations, and shouldn't be understood as holding some exclusive secret sauce. The idea of a per-thread or per-CPU heap cache/arena/magazine is fairly widespread. The BSDCan jemalloc paper cites a 1998 MS Research paper as the first to have systematically evaluated arenas for this purpose. That MS paper cites the ptmalloc web page as "visited on May 11, 1998" and summarizes ptmalloc's working as follows: "It uses a linked list of subheaps where each subheap has a lock, 128 free lists, and some memory to manage. When a thread needs to allocate a block, it scans the list of subheaps and grabs the first unlocked one, allocates the required block, and returns. If it can't find an unlocked subheap, it creates a new one and adds it to the list. In this way, a thread never waits on a locked subheap."

Functions and variable space with threading using clone

I currently intend to implement threading using clone(), and my question is: if all my threads use the same memory space, will each thread use a different part of memory when the same function is called, or do I have to do something to ensure this happens?
Each thread will be using the same memory map overall but a different, separate thread-local stack for function calls. When different threads have called the same function (which itself lives at the same executable memory location), any local variables will not be shared, because they are allocated on the stack upon entry into the function / as needed, and each thread has its own stack by default.
References to any static/global memory (i.e., anything not allocated on the thread-local stack, such as globals or references to the heap or mmap memory regions passed to / visible to a thread after calling clone, and in general, the full memory map and process context itself) will, of course, be shared and subject to the usual host of multithreading issues (i.e., synchronization of shared state).
Note that you have to set up this thread-local stack space yourself before calling clone. From the manpage:
The child_stack argument specifies the location of the stack used by
the child process. Since the child and calling process may share
memory, it is not possible for the child process to execute in the
same stack as the calling process. The calling process must therefore
set up memory space for the child stack and pass a pointer to this
space to clone().
Stacks grow downward on all processors that run Linux (except the HP
PA processors), so child_stack usually points to the topmost address
of the memory space set up for the child stack.
The child_stack parameter is the second argument to clone. It's the responsibility of the caller (i.e., the parent thread) to ensure that each child thread created via clone receives a separate and non-overlapping chunk of memory for its stack.
Note that allocating and setting up this thread-local stack memory region is not at all simple. Ensure that your allocations are page-aligned (the start address is on a 4K boundary), a multiple of the page size (4K), amply sized (if you only have a few threads, 2MB is safe), and ideally include a "guard" section following the usable space. The stack guard is some number of pages with no access privileges -- no reading, no writing -- placed after the main stack region to protect the rest of the virtual address space should a thread dynamically exceed its stack size (e.g., with a bunch of recursion, or functions with very large temporary buffers as local variables) and try to grow into the guard region: the thread fails early, being served a SIGSEGV right away, rather than insidiously corrupting memory. The stack guard is technically optional. You should probably use mmap to allocate your stacks, although posix_memalign would do as well.
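Putting that recipe together, here is a minimal sketch (mine, not from the manpage; error handling is kept short, and the flag set is a bare-bones choice rather than everything pthreads would pass): a page-aligned, amply-sized region from mmap, a PROT_NONE guard page at the bottom, and the topmost address handed to clone().

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define STACK_SIZE (2 * 1024 * 1024)   /* 2MB of usable stack, per the text */
    #define GUARD_SIZE 4096                /* one inaccessible guard page */

    static int thread_fn(void *arg) {
        printf("child says: %s\n", (const char *)arg);
        return 0;
    }

    int main(void) {
        /* mmap returns page-aligned memory; reserve guard + stack together. */
        char *base = mmap(NULL, GUARD_SIZE + STACK_SIZE,
                          PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* Stacks grow downward, so the guard goes at the *bottom*:
           overflowing the stack faults immediately with SIGSEGV. */
        if (mprotect(base, GUARD_SIZE, PROT_NONE) == -1) { perror("mprotect"); return 1; }

        /* clone() wants the topmost address of the child's stack. */
        char *stack_top = base + GUARD_SIZE + STACK_SIZE;

        pid_t tid = clone(thread_fn, stack_top,
                          CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD,
                          "running on my own mmap'd stack");
        if (tid == -1) { perror("clone"); return 1; }

        waitpid(tid, NULL, 0);             /* reap the child */
        munmap(base, GUARD_SIZE + STACK_SIZE);
        return 0;
    }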
All that said, I've got to ask: why try to implement threading with clone to begin with? There are some very challenging problems here, and the POSIX threading library has solved them (in a portable way, as well). If it's the fine-grained control of clone you want, then check out the pthread_attr_* functions; they pretty much cover every non-obscure use case (such as allowing you to allocate your own stack if you like -- though from the previous discussion, I'd think you wouldn't want to). The very performant, general Linux implementation, amongst other things, fully wraps clone and a large variety of other heinous system calls relevant to threading -- many of which do not even have C library wrappers and must be invoked via syscall. It all depends on what you want to do.

Which components of program state is shared across threads in a multi-threaded process?

Which of the following components of program state is shared across threads in a multi-threaded process?
Register values
Heap Memory
Global Variables
Stack memory
My suggestion: only global variables. Global variables are allocated on the heap, right? So heap memory and global variables. Is this correct?
Heap memory: always shared.
Global variables: depends on the platform; usually they are shared.
Stack and registers: thread-specific.
It depends on the language and the thread implementation. For example, I don't think that even C lets you directly access the CPU registers, so it's rather moot whether, say, pthreads shares registers (which, for the record, I am fairly certain it does not). Also in C, global variables are not in fact allocated on the heap, though they may be in other languages.
The stack is more complicated. In C/pthreads, each thread has its own stack, but in other languages and threading models, the situation could be far more complicated simply because the underlying stack models may not be so simple.
stack: no
registers: no
heap: yes (if you have to choose yes or no; the true answer is "it depends")
globals: yes
Global variables and heap memory are shared across a multithreaded process. Register values and stack memory are private to each thread.
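A small sketch of that summary (my example, using POSIX threads and C11 atomics): both threads update the same global and the same heap object, while each thread's local variable lives on its own private stack.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    static atomic_int global;                          /* shared by both threads */

    static void *worker(void *heap_val) {
        int local = 1;                                 /* private: this thread's stack */
        atomic_fetch_add(&global, local);
        atomic_fetch_add((atomic_int *)heap_val, 1);   /* shared heap object */
        return NULL;
    }

    int main(void) {
        atomic_int *heap_val = malloc(sizeof *heap_val);
        atomic_init(heap_val, 0);
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, heap_val);
        pthread_create(&t2, NULL, worker, heap_val);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Both counters read 2: each thread's update is visible to the other. */
        printf("global = %d, *heap_val = %d\n",
               atomic_load(&global), atomic_load(heap_val));
        free(heap_val);
        return 0;
    }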
