Stack for threads of a process in Linux - linux

How is stack space allocated (in the same address space) to each thread of a process in Linux or any other OS for that matter?

It depends on the type of thread library, a user space library like pthreads would allocate memory and divide it into thread stacks. On the OS side each thread would get a kernel stack.

On creation of new thread, the operating system reserves space in stack segment for current thread (parent), where the future auto variables and function call data of parent will live. Then, it allocates one guard page (this is to prevent the parent colliding into child stack, but this may vary with different operating systems). Once this is done, the stack frame for child thread is created (which is typically one-two page(s)).
This process is repeated in case the parent spawns multiple threads. All these stack frames live in stack segment of address space of process whose all these threads are part of.

Related

How can there be multiple call stacks allocated at the same time? How does the stack pointer change between threads?

Summary of my understanding:
The top memory addresses are used for the? (I initially thought there was only one call stack) stack, and the? stack grows downwards (What and where are the stack and heap?)
However, each thread gets it's own stack allocated, so there should be multiple call stacks in memory (https://stackoverflow.com/a/80113/2415178)
Applications can share threads (e.g, the key application is using the main thread), but several threads can be running at the same time.
There is a CPU register called sp that tracks the stack pointer, the current stack frame of a call stack.
So here's my confusion:
Do all of the call stacks necessary for an application (if this is even possible to know) get allocated when the application gets launched? Or do call stacks get allocated/de-allocated dynamically as applications spin off new threads? And if that is the case, (I know stacks have a fixed size), do the new stacks just get allocated right below the previous stacks-- So you would end up with a stack of stacks in the top addresses of memory? Or am I just fundamentally misunderstanding how call stacks are being created/used?
I am an OS X application developer, so my visual reference for how call stacks are created come from Xcode's stack debugger:
Now I realize that how things are here are more than likely unique to OS X, but I was hoping that conventions would be similar across operating systems.
It appears that each application can execute code on multiple threads, and even spin off new worker threads that belong to the application-- and every thread needs a call stack to keep track of the stack frames.
Which leads me to my last question:
How does the sp register work if there are multiple call stacks? Is it only used for the main call stack? (Presumably the top-most call stack in memory, and associated with the main thread of the OS) [https://stackoverflow.com/a/1213360/2415178]
Do all of the call stacks necessary for an application (if this is even possible to know) get allocated when the application gets launched?
No. Typically, each thread's stack is allocated when that thread is created.
Or do call stacks get allocated/de-allocated dynamically as applications spin off new threads?
Yes.
And if that is the case, (I know stacks have a fixed size), do the new stacks just get allocated right below the previous stacks-- So you would end up with a stack of stacks in the top addresses of memory? Or am I just fundamentally misunderstanding how call stacks are being created/used?
It varies. But the stack just has to be at the top of a large enough chunk of available address space in the memory map for that particular process. It doesn't have to be at the very top. If you need 1MB for the stack, and you have 1MB, you can just reserve that 1MB and have the stack start at the top of it.
How does the sp register work if there are multiple call stacks? Is it only used for the main call stack?
A CPU has as many register sets as threads that can run at a time. When the running thread is switched, the leaving thread's stack pointer is saved and the new thread's stack pointer is restored -- just like all other registers.
There is no "main thread of the OS". There are some kernel threads that do only kernel tasks, but also user-space threads also run in kernel space to run the OS code. Pure kernel threads have their own stacks somewhere in kernel memory. But just like normal threads, it doesn't have to be at the very top, the stack pointer just has to start at the highest address in the chunk used for that stack.
There is no such thing as the "main thread of the OS". Every process has its own set of threads, and those threads are specific to that process, not shared. Typically, at any given point in time, most threads on a system will be suspended awaiting input.
Every thread in a process has its own stack, which is allocated when the thread is created. Most operating systems will leave some space between each stack to allow them to grow if needed, and to prevent them from colliding with each other.
Every thread also has its own set of CPU registers, including a stack pointer (pointing to a location in that thread's stack).

Where would the CPU context interrupted by ptrace be, userspace stack or kernel stack?

On Linux x86_64, when I use ptrace to stop a process, would all the threads' CPU contexts would be saved, or just the process's CPU context be saved?
Is the context on process's userspace stack or kernel stack? Or somewhere else? Or multiple copies?
For other situations (not ptrace), where could the interrupted (including exception and syscall) CPU context saved, kernel stack, userspace stack or somewhere else?
Is ptrace an interrupt?
Update
It seems that, ptrace's context pt_regs_x86_t, where to save is determined by the programmers. But would the kernel also stores a copy for the interrupted context?
Yes, the kernel will always store context for any thread that is not currently executing. That context is largely the same whether the thread is being ptrace'd or not. The difference is in how/whether the thread can be scheduled anew -- if it's being ptraced, the tracing process will decide when it can be resumed.
The thread's user-space context is stored on the kernel stack (but it's important to note that there is a separate kernel stack area for each thread). And that will be the same whether the thread entered the kernel by executing a system call, or was suspended due to an interrupt -- and ultimately, those are the only two ways a thread can be suspended.
As you discovered, when a process is ptrace'd, the tracing program is given access to the traced threads' registers in the state they had when the threads last executed. That is accomplished by simply copying the saved registers from the traced thread's kernel stack.
Finally, it's worth noting that if you look at linux kernel code, you won't find any concrete representation of a process. A process is just a group of related threads that share various parts of their state: process ID, address space, file descriptors, etc.

where thread is implemented in memory?

We know that thread has its own stack it's implemented within the process. But my question is that when thread is implemented in his own stack that time it is the same stack which used by process or any other function?
One more doubt that thread share it's global variable,file descriptor, signal handler etc. But how it's share all these parameters within same address where all the threads executed?
Brief explanation will be appreciated.
when thread is implemented in his own stack that time it is the same stack which used by process or any other?
Can't quite parse this but I get the gist I think.
In most cases, under Linux in a multithreaded application, all of the threads share the same address space. Each thread if it is running on a separate processor may have local cached memory but the overall address space is shared by all threads. Even per-thread stack space is shared by all threads -- just that each thread gets a different contiguous memory area.
But how it's share all these parameters within same address?
This is also true of the global variables, file descriptors, etc.. They are all shared.
Most thread implementations running under Linux use the clone(2) syscall to create new thread processes. To quote from the clone man page:
clone() creates a new process, in a manner similar to fork(2). It is actually a library function layered on top of the underlying clone() system call, hereinafter referred to as sys_clone. A description of sys_clone is given toward the end of this page.
Unlike fork(2), these calls allow the child process to share parts of its execution context with the calling process, such as the memory space, the table of file descriptors, and the table of signal handlers.
You can see the cloned processes by using ps -eLf under Linux.

Functions and variable space with threading using clone

I currently intend to implement threading using clone() and a question is, if I have all threads using the same memory space, with each function I call in a given thread, will each thread using a different part of memory when the same function is called, or do I do have todo something to ensure this happens?
Each thread will be using the same memory map overall but a different, separate thread-local stack for function calls. When different threads have called the same function (which itself lives at the same executable memory location), any local variables will not be shared, because they are allocated on the stack upon entry into the function / as needed, and each thread has its own stack by default.
References to any static/global memory (i.e., anything not allocated on the thread-local stack, such as globals or references to the heap or mmap memory regions passed to / visible to a thread after calling clone, and in general, the full memory map and process context itself) will, of course, be shared and subject to the usual host of multithreading issues (i.e., synchronization of shared state).
Note that you have to setup this thread-local stack space yourself before calling clone. From the manpage:
The child_stack argument specifies the location of the stack used by
the child process. Since the child and calling process may share
memory, it is not possible for the child process to execute in the
same stack as the calling process. The calling process must therefore
set up memory space for the child stack and pass a pointer to this
space to clone().
Stacks grow downward on all processors that run Linux (except the HP
PA processors), so child_stack usually points to the topmost address
of the memory space set up for the child stack.
The child_stack parameter is the second argument to clone. It's the responsibility of the caller (i.e., the parent thread) to ensure that each child thread created via clone receives a separate and non-overlapping chunk of memory for its stack.
Note that allocating and setting-up this thread-local stack memory region is not at all simple. Ensure that your allocations are page-aligned (start address is on a 4K boundary), a multiple of the page size (4K), amply-sized (if you only have a few threads, 2MB is safe), and ideally contains a "guard" section following the usable space. The stack guard is some number of pages with no access privileges-- no reading, no writing-- following the main stack region to guard the rest of the virtual memory address space should a thread dynamically exceed its stack size (e.g., with a bunch of recursion or functions with very large temporary buffers as local variables) and try to continue to grow into the stack guard region, which will fail-early as the thread will be served a SIGSEGV right away rather than insidiously corrupting. The stack guard is technically optional. You should probably be using mmap to allocate your stacks, although posix_memalign would do as well.
All that said, I've got to ask: why try to implement threading with clone to start? There are some very challenging problems here, and the POSIX threading library has solved them (in a portable way as well). If it's the fine-grained control of clone you want, then checkout the pthread_attr_* functions; they pretty much cover every non-obscure use case (such as allowing you to allocate your own stack if you like-- from the previous discussion, I'd think you wouldn't). The very performant, general Linux implementation, amongst other things, fully wraps clone and a large variety of other heinous system calls relevant to threading-- many of which do not even have C library wrappers and must be called via syscall. It all depends upon what you want to do.

What is the Difference B/W TCB(Thread control block) & PCB(Process)

A process control block (PCB) and a Thread Control Block (TCB) are both used in linux kernels to have time on the CPU delegated to them. What are the difference between the two?
What information is generally maintained in a process control bloc (PCB)?
Some notable fields that the PCB could contain are the process id, process group id, the parent process and child processes, the heap pointer, program counter, scheduling state (running, ready, blocked), permissions (what system resources the process is allowed to access), content of the general purpose registers, and open files.
TCB has a few of the same fields as the PCB (register values, stack pointer, program counter, scheduling state), in addition to a few specific values like the thread id and a pointer to the process that contains that thread. Note that there is not protection between threads.
In Linux there is a struct task_struct that stores information about a thread or process. It is declared in sched.h.
'A process control block (PCB) and a Thread Control Block (TCB) are both used in kernels to have time on the CPU delegated to them' - not normally, no. A PCB will have one or more TCB's linked to it. The TCB describes an execution context, (eg. stack pointer), the PCB an environment context, (eg. memory segments and permissions).
The PCB stores information about the kernel process. Like adressspaces etc...
A process can include different kernel threads.
Both are managed by the dispatcher and scheduler.
The TCB includes thread specific information.

Resources