kernel stack for linux process - linux

Is the kernel stack for all process shared or there is a seperate kernel stack for each process? If it is seperate for each process where is this stack pointer stored? In task_struct ?

There is just one common kernel memory. In it each process has it's own task_struct + kernel stack (by default 8K).
In a context switch the old stack pointer is saved somewhere and the actual stack pointer is made to point to the top of the stack (or bottom depending on the hardware architecture) of the new process which is going to run.

This old article says that each process has its own kernel stack. See comments to why that seems to be a very good design.
I tried reading the current source to make sure, but since the kernel stack is "implicit", it's not visible in the task_struct. This is mentioned in the article.
This answer was edited to incorporate wisdom from comments. Thanks.

The book "Linux kernel Development" of Robert Love has a good explanation about process kernel stack.
And yes, each process has its own kernel stack and if I´m not wrong its pointer is stored on thread_info structure. But I´m not really sure about it, and the struct task_struct is stored on beginning or the end of process kernel stack, depending of CPU architecture.
Cheers.
Carlos Maiolino

I think each process has its own kernel mode stack. Driver is executing in the kernel mode, the process sometimes will be blocked while executing driver routine. and the operating system can schedule another process to run. The scheduled process can call driver routine again. If kernel stack is shared, 2 processes are using the kernel stack, things will be mixed up. I'm puzzled by this question for a long time. At first I think kernel stack is shared, some books say that. After I read the Linux kernel Development, and see some driver code, I begin to think the kernel stack is not shared.

Related

How does Linux remember its Kernel Stack Pointer?

I know that there are two types of stack in Linux : user stack for each user threads and Kernel Stack for kernel threads (but 1 process). The interruptions, more precisely, the interruption procedures, are the bridges between this two modes, kernel (0) and user (3). The interrupt vector table let the processor loads the right instruction address in the PC register, but how is the stack pointer register changed when it switches in kernel mode? Does the subroutines indicate where's the kernel stack just before its first instruction? Or does the processor uses two stack pointer registers (I really doubt it)?
How does the "return from interrupt" knows where to return? Is the PCB saved in kernel stack or elsewhere?
Please don't hesitate to rectify it anything I've said to be true is not.
Thanks a lot for your help.
Kernel mode stack in Linux kernel is stored in task_struct->stack. Where and how it comes from is totally up to the platform. Some platforms might not be saving it like above. But then you can use task_stack_page() to find the stack.
And
While entering the interrupt handler, PC is stored on kernel stack. While returning from Interrupt this PC is loaded back from the kernel stack.

Stack of hardware interrupt top-half in Linux kernel?

I know Linux kernel take thread kernel stack as ISR stack before 2.6.32, after 2.6.32, kernel uses separated stack, if wrong, please correct me.
Would you tell me when the ISR stack is setup/crated, or destroy if there is. Or tell me the source file name and line number? Thanks in advance.
Updated at Oct 17 2014:
There are several kinds of stack in Linux. Below are 3 major(not all) that I know.
User space process stack, each user space task has its own stack,
this is created by mmap() when task is created.
Kernel stack for user space task, one for each user space task, this is
created within do_fork()->copy_process()->dup_task_struct()->alloc_thread_info() and used for system_call.
Stack for hardware interruption(top half), one for each CPU(after 2.6),
defined in arch/x86/kernel/irq_32.c: DEFINE_PER_CPU(struct irq_stack *, hardirq_stack); do_IRQ() -> handle_irq() ->
execute_on_irq_stack() switch the interrupt stack
Please let me know if these are correct or not.
For Interrupt handler there is IRQ stack. 2 kinds of stack comes into picture for interrupt handler:
Hardware IRQ Stack.
Software IRQ Stack.
In contrast to the regular kernel stack that is allocated per process, the two additional stacks are allocated per CPU. Whenever a hardware interrupt occurs (or a softIRQ is processed), the kernel needs to switch to the appropriate stack. Historically, interrupt handlers did not receive their own stacks. Instead, interrupt handlers would share the stack of the running process, they interrupted. The kernel stack is two pages in size; typically, that is 8KB on 32-bit architectures and 16KB on 64-bit architectures. Because in this setup interrupt handlers share the stack, they must be exceptionally frugal with what data they allocate there. Of course, the kernel stack is limited to begin with, so all kernel code should be cautious.
Pointers to the additional stacks are provided in the following array:
arch/x86/kernel/irq_32.c
static union irq_ctx *hardirq_ctx[NR_CPUS] __read_mostly;
static union irq_ctx *softirq_ctx[NR_CPUS] __read_mostly;

Why keep a kernel stack for each process in linux?

What's the point in keeping a different kernel stack for each process in linux?
Why not keep just one stack for the kernel to work with?
What's the point in keeping a different kernel stack for each process in linux?
It simplifies pre-emption of processes in the kernel space.
Why not keep just one stack for the kernel to work with?
It would be a night mare to implement pre-emption without seperates stacks.
Separate kernel stacks are not really mandated. Each architecture is free to do whatever it wants. If there was no per-emption during a system call, then a single kernel stack might make sense.
However, *nix has processes and each process can make a system call. However, Linux allows one task to be pre-empted during a write(), etc and another task to schedule. The kernel stack is a snapshot of the context of kernel work that is being performed for each process.
Also, the per-process kernel stacks come with little overhead. A thread_info or some mechanism to get the process information from assembler is needed. This is at least a page allocation. By placing the kernel mode stack in the same location, a simple mask can get the thread_info from assembler. So, we already need the per-process variable and allocation. Why not use it as a stack to store kernel context and allow preemption during system calls?
The efficiency of preemption can be demonstrated by mentioned write above. If the write() is to disk or network, it will take time to complete. A 5k to 8k buffer written to disk or network will take many CPU cycles to complete (if synchronous) and the user process will block until it is finished. This transfer in the driver can be done with DMA. Here, a hardware element will complete the task of transferring the buffer to the device. In the mean time, a lower priority process can have the CPU and be allowed to make system calls when the kernel keeps different stacks per process. These stacks are near zero cost as the kernel already needs to have book keeping information for process state and the two are both keep in an 4k or 8k page.
Why not keep just one stack for the kernel to work with?
In this case only one process/thread would be able to enter the kernel at a time.
Basically, each thread has its own stack, and crossing the user-space to kernel boundary does not change this fact. Kernel also has its own kernel threads (not belonging to any user-space process) and they all have their own stacks.

Does Linux have both a user-mode and a system-mode stack for each process it runs?

This is my assignment question. Now I understand that the difference between the user mode and kernel mode (I think that is system mode).
But my question is: How the process works in Linux? Does the system have both user mode and system mode stacks for each process it runs?
I believe that this question has already been answered here:
What is the difference between kernel stack and user stack?
kernel stack and user space stack
That is, a userspace process has only one stack, a pointer to which is defined in the second element of the task_struct in include/linux/sched.h (about line 1045 in 3.12).
There is a possibility of some confusion with the per-thread kernel stack, as noted in the above posts. In a sense, a process can have one or more stacks, userspace and kernel space, depending on the number of threads it has at any point in time. The connection between the per-thread kernel stack, the thread and the process task_struct is described in this lecture by Junfeng Yang.

Linux Stack Sizes

I'm looking for a good description of stacks within the linux kernel, but I'm finding it surprisingly difficult to find anything useful.
I know that stacks are limited to 4k for most systems, and 8k for others. I'm assuming that each kernel thread / bottom half has its own stack. I've also heard that if an interrupt goes off, it uses the current thread's stack, but I can't find any documentation on any of this. What I'm looking for is how the stacks are allocated, if there's any good debugging routines for them (I'm suspecting a stack overflow for a particular problem, and I'd like to know if its possible to compile the kernel to police stack sizes, etc).
The reason that documentation is scarce is that it's an area that's quite architecture-dependent. The code is really the best documentation - for example, the THREAD_SIZE macro defines the (architecture-dependent) per-thread kernel stack size.
The stacks are allocated in alloc_thread_stack_node(). The stack pointer in the struct task_struct is updated in dup_task_struct(), which is called as part of cloning a thread.
The kernel does check for kernel stack overflows, by placing a canary value STACK_END_MAGIC at the end of the stack. In the page fault handler, if a fault in kernel space occurs this canary is checked - see for example the x86 fault handler which prints the message Thread overran stack, or stack corrupted after the Oops message if the stack canary has been clobbered.
Of course this won't trigger on all stack overruns, only the ones that clobber the stack canary. However, you should always be able to tell from the Oops output if you've suffered a stack overrun - that's the case if the stack pointer is below task->stack.
You can determine the process stack size with the ulimit command. I get 8192 KiB on my system:
$ ulimit -s
8192
For processes, you can control the stack size of processes via ulimit command (-s option). For threads, the default stack size varies a lot, but you can control it via a call to pthread_attr_setstacksize() (assuming you are using pthreads).
As for the interrupt using the userland stack, I somewhat doubt it, as accessing userland memory is a kind of a hassle from the kernel, especially from an interrupt routine. But I don't know for sure.

Resources