Functions and variable space with threading using clone - multithreading

I currently intend to implement threading using clone(), and my question is this: if all threads use the same memory space, will each thread use a different part of memory when it calls the same function, or do I have to do something to ensure this happens?

Each thread will be using the same memory map overall but a different, separate thread-local stack for function calls. When different threads have called the same function (which itself lives at the same executable memory location), any local variables will not be shared, because they are allocated on the stack upon entry into the function / as needed, and each thread has its own stack by default.
References to any static/global memory (i.e., anything not allocated on the thread-local stack, such as globals or references to the heap or mmap memory regions passed to / visible to a thread after calling clone, and in general, the full memory map and process context itself) will, of course, be shared and subject to the usual host of multithreading issues (i.e., synchronization of shared state).
Note that you have to set up this thread-local stack space yourself before calling clone. From the manpage:
The child_stack argument specifies the location of the stack used by
the child process. Since the child and calling process may share
memory, it is not possible for the child process to execute in the
same stack as the calling process. The calling process must therefore
set up memory space for the child stack and pass a pointer to this
space to clone().
Stacks grow downward on all processors that run Linux (except the HP
PA processors), so child_stack usually points to the topmost address
of the memory space set up for the child stack.
The child_stack parameter is the second argument to clone. It's the responsibility of the caller (i.e., the parent thread) to ensure that each child thread created via clone receives a separate and non-overlapping chunk of memory for its stack.
Note that allocating and setting up this thread-local stack memory region is not at all simple. Ensure that your allocations are page-aligned (the start address is on a page boundary, typically 4K), a multiple of the page size, amply sized (if you only have a few threads, 2MB is safe), and ideally include a "guard" section just past the end of the usable space in the direction of growth, which for a downward-growing stack means at the lowest addresses of the region. The stack guard is some number of pages with no access privileges (no reading, no writing) that protects the rest of the virtual address space should a thread exceed its stack size (e.g., with deep recursion or functions with very large temporary buffers as local variables) and try to keep growing into the guard region. That access fails early: the thread is served a SIGSEGV right away rather than insidiously corrupting neighboring memory. The stack guard is technically optional. You should probably be using mmap to allocate your stacks, although posix_memalign would do as well.
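As a rough illustration of the setup described above, here is a minimal sketch that maps a 2MB stack plus one guard page with mmap, protects the guard with mprotect, and runs a child on that stack via clone. The names run_child_on_private_stack, child_main and shared_value are made up for this example, and the flag set (CLONE_VM | SIGCHLD) is just one simple choice that lets the parent wait with waitpid; a real thread library passes many more flags.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (2 * 1024 * 1024)  /* usable stack: 2MB, as suggested above */

static int shared_value;  /* global: visible to the child because of CLONE_VM */

/* Hypothetical child entry point; its locals live on the stack we mapped. */
static int child_main(void *arg)
{
    int local = *(int *)arg;   /* 'local' sits on the child's own stack */
    shared_value = local + 1;  /* this write lands in memory shared with the parent */
    return 0;
}

/* Maps guard page + stack, runs child_main on it, waits for it, and
 * returns shared_value afterwards (or -1 on failure). */
int run_child_on_private_stack(int input)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t total = STACK_SIZE + (size_t)page;

    /* mmap hands back page-aligned memory; MAP_STACK is only a hint. */
    char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* The lowest page becomes the guard: the stack grows down toward it. */
    if (mprotect(base, (size_t)page, PROT_NONE) != 0) {
        munmap(base, total);
        return -1;
    }

    /* Stacks grow downward, so clone gets the topmost address. */
    char *stack_top = base + total;

    pid_t pid = clone(child_main, stack_top, CLONE_VM | SIGCHLD, &input);
    if (pid == -1) {
        munmap(base, total);
        return -1;
    }

    int status;
    pid_t waited = waitpid(pid, &status, 0);
    munmap(base, total);  /* the child has exited, so its stack can go */
    return waited == pid ? shared_value : -1;
}
```

Calling run_child_on_private_stack(41) from main should return 42 if the clone succeeds: the child reads the argument on its own stack and writes the result into the shared global.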
All that said, I've got to ask: why try to implement threading with clone in the first place? There are some very challenging problems here, and the POSIX threading library has solved them (in a portable way as well). If it's the fine-grained control of clone you want, then check out the pthread_attr_* functions; they pretty much cover every non-obscure use case (such as allowing you to allocate your own stack if you like, although from the previous discussion I'd think you wouldn't). The very performant, general Linux implementation, amongst other things, fully wraps clone and a large variety of other heinous system calls relevant to threading, many of which do not even have C library wrappers and must be called via syscall. It all depends upon what you want to do.
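For comparison, a minimal sketch of the pthread_attr_* route: the library allocates the stack (and its guard) itself, and the caller only states the size. The function name run_worker_with_stack and the 1MB buffer are assumptions chosen for illustration.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE (1024 * 1024)  /* 1MB local buffer, placed on the thread's stack */

/* Fills a large local buffer and reports the sum of its bytes. */
static void *worker(void *arg)
{
    char buf[BUF_SIZE];
    memset(buf, 1, sizeof buf);

    long sum = 0;
    for (size_t i = 0; i < sizeof buf; i++)
        sum += buf[i];

    *(long *)arg = sum;
    return NULL;
}

/* Hypothetical helper: runs 'worker' on a stack of the requested size.
 * Returns 0 on success or a pthread error number. */
int run_worker_with_stack(size_t stack_size, long *result)
{
    pthread_attr_t attr;
    pthread_t tid;

    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    rc = pthread_attr_setstacksize(&attr, stack_size);
    if (rc == 0)
        rc = pthread_create(&tid, &attr, worker, result);
    pthread_attr_destroy(&attr);  /* safe once the thread has been created */
    if (rc != 0)
        return rc;

    return pthread_join(tid, NULL);
}
```

With a 4MB stack, the 1MB buffer fits comfortably. pthread_attr_setstack is the variant to look at if you really do want to supply the memory yourself.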

Related

How can there be multiple call stacks allocated at the same time? How does the stack pointer change between threads?

Summary of my understanding:
The top memory addresses are used for the stack (I initially thought there was only one call stack), and the stack grows downwards (see "What and where are the stack and heap?")
However, each thread gets its own stack allocated, so there should be multiple call stacks in memory (https://stackoverflow.com/a/80113/2415178)
Applications can share threads (e.g., the key application is using the main thread), but several threads can be running at the same time.
There is a CPU register called sp that tracks the stack pointer, the current stack frame of a call stack.
So here's my confusion:
Do all of the call stacks necessary for an application (if this is even possible to know) get allocated when the application gets launched? Or do call stacks get allocated/de-allocated dynamically as applications spin off new threads? And if that is the case, (I know stacks have a fixed size), do the new stacks just get allocated right below the previous stacks-- So you would end up with a stack of stacks in the top addresses of memory? Or am I just fundamentally misunderstanding how call stacks are being created/used?
I am an OS X application developer, so my visual reference for how call stacks are created comes from Xcode's stack debugger.
Now I realize that how things are here are more than likely unique to OS X, but I was hoping that conventions would be similar across operating systems.
It appears that each application can execute code on multiple threads, and even spin off new worker threads that belong to the application-- and every thread needs a call stack to keep track of the stack frames.
Which leads me to my last question:
How does the sp register work if there are multiple call stacks? Is it only used for the main call stack? (Presumably the top-most call stack in memory, and associated with the main thread of the OS) [https://stackoverflow.com/a/1213360/2415178]
Do all of the call stacks necessary for an application (if this is even possible to know) get allocated when the application gets launched?
No. Typically, each thread's stack is allocated when that thread is created.
Or do call stacks get allocated/de-allocated dynamically as applications spin off new threads?
Yes.
And if that is the case, (I know stacks have a fixed size), do the new stacks just get allocated right below the previous stacks-- So you would end up with a stack of stacks in the top addresses of memory? Or am I just fundamentally misunderstanding how call stacks are being created/used?
It varies. But the stack just has to be at the top of a large enough chunk of available address space in the memory map for that particular process. It doesn't have to be at the very top. If you need 1MB for the stack, and you have 1MB, you can just reserve that 1MB and have the stack start at the top of it.
How does the sp register work if there are multiple call stacks? Is it only used for the main call stack?
A CPU has as many register sets as threads that can run at a time. When the running thread is switched, the leaving thread's stack pointer is saved and the new thread's stack pointer is restored -- just like all other registers.
There is no "main thread of the OS". There are some kernel threads that do only kernel tasks, but user-space threads also enter kernel space to run OS code. Pure kernel threads have their own stacks somewhere in kernel memory. But just as with normal threads, such a stack doesn't have to be at the very top of memory; the stack pointer just has to start at the highest address in the chunk used for that stack.
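The save-and-restore of the stack pointer can be imitated in user space with the ucontext functions that glibc provides. This is only an analogy to what the kernel scheduler does, not its actual mechanism: swapcontext stores the current registers (including the stack pointer) in one context and loads them from another. The names run_context_switch_demo and coroutine, and the 64KB stack size, are assumptions made for this sketch.

```c
#include <stdlib.h>
#include <ucontext.h>

#define CO_STACK_SIZE (64 * 1024)  /* arbitrary size for the second stack */

static ucontext_t main_ctx, co_ctx;
static int trace[4];   /* order in which the steps executed */
static int trace_len;

/* Runs on its own malloc'ed stack. */
static void coroutine(void)
{
    trace[trace_len++] = 2;
    /* Save our registers (stack pointer included) and resume main. */
    swapcontext(&co_ctx, &main_ctx);
    trace[trace_len++] = 4;
    /* Returning here resumes uc_link, i.e. main_ctx. */
}

/* Hypothetical demo: alternates between two stacks and copies the
 * observed step order into 'order'. Returns 0 on success, -1 on failure. */
int run_context_switch_demo(int order[4])
{
    trace_len = 0;

    void *stack = malloc(CO_STACK_SIZE);
    if (stack == NULL)
        return -1;

    if (getcontext(&co_ctx) == -1) {
        free(stack);
        return -1;
    }
    co_ctx.uc_stack.ss_sp = stack;
    co_ctx.uc_stack.ss_size = CO_STACK_SIZE;
    co_ctx.uc_link = &main_ctx;          /* where to go when coroutine returns */
    makecontext(&co_ctx, coroutine, 0);

    trace[trace_len++] = 1;
    swapcontext(&main_ctx, &co_ctx);     /* switch to the second stack */
    trace[trace_len++] = 3;
    swapcontext(&main_ctx, &co_ctx);     /* resume the coroutine where it left off */

    for (int i = 0; i < 4; i++)
        order[i] = trace[i];
    free(stack);
    return 0;
}
```

Each swapcontext call is the user-space counterpart of "save the leaving thread's registers, restore the new thread's registers"; the two flows of control interleave as steps 1, 2, 3, 4 even though they use different stacks.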
There is no such thing as the "main thread of the OS". Every process has its own set of threads, and those threads are specific to that process, not shared. Typically, at any given point in time, most threads on a system will be suspended awaiting input.
Every thread in a process has its own stack, which is allocated when the thread is created. Most operating systems will leave some space between each stack to allow them to grow if needed, and to prevent them from colliding with each other.
Every thread also has its own set of CPU registers, including a stack pointer (pointing to a location in that thread's stack).
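One way to see the separate stacks is to have two threads that are alive at the same time record the address of a local variable in the same function. A minimal sketch with pthreads follows; the function names and the mutex used as a "gate" to keep both threads alive are choices made for this example.

```c
#include <pthread.h>
#include <stdint.h>

/* Held by the creating thread so that both workers stay alive together. */
static pthread_mutex_t gate = PTHREAD_MUTEX_INITIALIZER;

/* Stores the address of a local variable, then waits for the gate. */
static void *record_local_address(void *arg)
{
    int local = 0;
    *(uintptr_t *)arg = (uintptr_t)&local;

    pthread_mutex_lock(&gate);    /* blocks until the creator opens the gate */
    pthread_mutex_unlock(&gate);
    return NULL;
}

/* Hypothetical check: returns 1 if the two threads saw different addresses
 * for 'local', 0 if they saw the same one, -1 if thread creation failed. */
int locals_live_on_distinct_stacks(uintptr_t *first, uintptr_t *second)
{
    pthread_t a, b;

    pthread_mutex_lock(&gate);
    int rc_a = pthread_create(&a, NULL, record_local_address, first);
    int rc_b = pthread_create(&b, NULL, record_local_address, second);
    pthread_mutex_unlock(&gate);  /* open the gate, even on partial failure */

    if (rc_a == 0)
        pthread_join(a, NULL);
    if (rc_b == 0)
        pthread_join(b, NULL);
    if (rc_a != 0 || rc_b != 0)
        return -1;

    return *first != *second;
}
```

Both threads execute the same code at the same address, yet the two recorded addresses differ because each thread's stack pointer points into its own stack region.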

Stack for threads of a process in Linux

How is stack space allocated (in the same address space) to each thread of a process in Linux or any other OS for that matter?
It depends on the type of thread library. A user-space library like pthreads allocates memory and divides it into thread stacks. On the OS side, each thread gets a kernel stack.
On creation of a new thread, the operating system reserves space in the stack segment for the current (parent) thread, where the parent's future automatic variables and function-call data will live. It then allocates one guard page (to prevent the parent from colliding with the child's stack, though this may vary between operating systems). Once this is done, the stack for the child thread is created (typically one or two pages).
This process is repeated if the parent spawns multiple threads. All of these stacks live in the stack segment of the address space of the process to which the threads belong.

where thread is implemented in memory?

We know that thread has its own stack it's implemented within the process. But my question is that when thread is implemented in his own stack that time it is the same stack which used by process or any other function?
One more doubt that thread share it's global variable,file descriptor, signal handler etc. But how it's share all these parameters within same address where all the threads executed?
Brief explanation will be appreciated.
when thread is implemented in his own stack that time it is the same stack which used by process or any other?
Can't quite parse this but I get the gist I think.
In most cases, under Linux in a multithreaded application, all of the threads share the same address space. Each thread if it is running on a separate processor may have local cached memory but the overall address space is shared by all threads. Even per-thread stack space is shared by all threads -- just that each thread gets a different contiguous memory area.
But how it's share all these parameters within same address?
This is also true of the global variables, file descriptors, etc. They are all shared.
Most thread implementations running under Linux use the clone(2) syscall to create new thread processes. To quote from the clone man page:
clone() creates a new process, in a manner similar to fork(2). It is actually a library function layered on top of the underlying clone() system call, hereinafter referred to as sys_clone. A description of sys_clone is given toward the end of this page.
Unlike fork(2), these calls allow the child process to share parts of its execution context with the calling process, such as the memory space, the table of file descriptors, and the table of signal handlers.
You can see the cloned processes by using ps -eLf under Linux.

Why does GHC have a stack for each thread?

It's my understanding that GHC gives each thread a stack. Why is this necessary? Doesn't GHC compile to CPS? Isn't a thread expressed concisely as a closure?
There are several aspects to your question.
The key reference for the design decisions in the GHC runtime is the paper "Runtime Support for Multicore Haskell".
Recall that
The GHC runtime system supports millions of lightweight threads
by multiplexing them onto a handful of operating system threads,
roughly one for each physical CPU.
And:
Each Haskell thread runs on a finite-sized stack, which is allocated in
the heap. The state of a thread, together with its stack, is kept in a
heap-allocated thread state object (TSO). The size of a TSO is around
15 words plus the stack, and constitutes the whole state of a Haskell
thread. A stack may grow by copying the TSO into a larger area, and
may subsequently shrink again
GHC does not compile via CPS. Each thread makes recursive calls, and they must allocate on the stack. By representing the stack as a heap-allocated object, things are made simpler.
A thread is more than a closure.
As a thread executes, it starts allocating to the heap and stack. Thus:
The stack of a thread, and hence its TSO, is mutable. When a
thread executes, the stack will accumulate pointers to new objects,
and so if the TSO resides in an old generation it must be added to
the remembered set [of the GC].
Garbage collection of objects pointed at by stacks can be optimized so that it takes place on the same CPU that ran the thread.
Furthermore, when the garbage collector runs,
it is highly desirable that the TSOs that have been executed on a
given CPU are traversed by the garbage collector on the same CPU,
because the TSO and data it refers to are likely to be in the local
cache of that CPU.
So, GHC has a stack for each thread because its compilation scheme requires that threads have access to a stack and a heap. By giving each thread its own stack, threads can execute in parallel more efficiently. Threads are more than "just a closure", since they have a mutable stack.

Which components of program state is shared across threads in a multi-threaded process?

Which of the following components of program state is shared across threads in a multi-threaded process?
Register values
Heap Memory
Global Variables
Stack memory
My guess: only global variables. But are global variables allocated on the heap? If so, the answer would be Heap Memory and Global Variables. Is this correct?
Heap memory always.
Global variables depend on the platform; usually they are shared.
Stack is thread-specific, as well as registers.
It depends on the language and the thread implementation. For example, I don't think that even C lets you directly access the CPU registers, so it's rather moot whether, say, pthreads shares registers (which, for the record, I am fairly certain it does not). Also in C, global variables are not in fact allocated on the heap, though they may be in other languages.
The stack is more complicated. In C/pthreads, each thread has its own stack, but in other languages and threading models, the situation could be far more complicated simply because the underlying stack models may not be so simple.
stack : no
registers: no
heap: yes (if you have to choose y or n; the true answer is that it depends)
globals: yes
Global variables and heap memory are shared across the threads of a multithreaded process. Register values and stack memory are private to each thread.
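A small pthreads sketch of that split, assuming four threads and 1000 iterations each (both numbers, and all names below, are arbitrary choices for illustration): every thread updates the same global and the same heap object, which therefore need a lock, while each thread's local counter is private to it.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_THREADS 4
#define ITERATIONS 1000

static long global_counter;  /* global: one copy shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long *heap_counter = arg;  /* heap object: shared through the pointer */
    long local_count = 0;      /* stack variable: private to this thread */

    for (int i = 0; i < ITERATIONS; i++) {
        local_count++;             /* no lock needed, nobody else can see it */
        pthread_mutex_lock(&lock); /* shared state needs synchronization */
        global_counter++;
        (*heap_counter)++;
        pthread_mutex_unlock(&lock);
    }
    return (void *)(intptr_t)local_count;
}

/* Hypothetical demo: reports the final global and heap counters and the
 * local count returned by the first thread. Returns 0 on success. */
int run_sharing_demo(long *global_out, long *heap_out, long *local_out)
{
    pthread_t tids[NUM_THREADS];
    int created = 0;

    long *heap_counter = calloc(1, sizeof *heap_counter);
    if (heap_counter == NULL)
        return -1;
    global_counter = 0;

    while (created < NUM_THREADS &&
           pthread_create(&tids[created], NULL, worker, heap_counter) == 0)
        created++;

    for (int i = 0; i < created; i++) {
        void *ret;
        pthread_join(tids[i], &ret);
        if (i == 0)
            *local_out = (long)(intptr_t)ret;
    }

    *global_out = global_counter;
    *heap_out = *heap_counter;
    free(heap_counter);
    return created == NUM_THREADS ? 0 : -1;
}
```

After a successful run, the global and heap counters both hold NUM_THREADS * ITERATIONS, while the local count from any single thread is only ITERATIONS.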

Resources