As I understand it, before 2.6.32 the Linux kernel used the interrupted thread's kernel stack as the ISR stack; after 2.6.32, the kernel uses a separate stack. Please correct me if that's wrong.
Could you tell me when the ISR stack is set up/created, and destroyed if it ever is? Or point me to the source file name and line number? Thanks in advance.
Update (Oct 17, 2014):
There are several kinds of stacks in Linux. Below are three major ones (not all) that I know of:
User-space process stack: each user-space task has its own stack, created by mmap() when the task is created.
Kernel stack for a user-space task: one for each user-space task, created within do_fork()->copy_process()->dup_task_struct()->alloc_thread_info() and used for system calls (see the sketch after this list).
Stack for hardware interrupts (top halves): one per CPU (since 2.6), defined in arch/x86/kernel/irq_32.c as DEFINE_PER_CPU(struct irq_stack *, hardirq_stack); do_IRQ() -> handle_irq() -> execute_on_irq_stack() switches to the interrupt stack.
Please let me know whether these are correct.
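For the second item, here is a hedged sketch of the allocation path, condensed from 2.6-era kernel/fork.c and arch/x86 headers (names and details varied across versions, so treat it as illustrative rather than exact):

    /* dup_task_struct() allocates one THREAD_SIZE block that holds both the
     * thread_info (at the bottom) and the kernel stack (growing down from
     * the top of the same block) */
    struct thread_info *ti;

    ti = alloc_thread_info(tsk);    /* ~ __get_free_pages(GFP_KERNEL, THREAD_ORDER) */
    if (!ti)
            goto free_tsk;          /* hypothetical error label for this sketch */

    tsk->stack = ti;                /* task_struct->stack points at the block */
    setup_thread_stack(tsk, orig);  /* copy parent's thread_info, set ti->task = tsk */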
For interrupt handlers there is an IRQ stack. Two kinds of stacks come into the picture for interrupt handlers:
Hardware IRQ Stack.
Software IRQ Stack.
In contrast to the regular kernel stack, which is allocated per process, the two additional stacks are allocated per CPU. Whenever a hardware interrupt occurs (or a softIRQ is processed), the kernel needs to switch to the appropriate stack. Historically, interrupt handlers did not receive their own stacks; instead, they shared the kernel stack of the process they interrupted. The kernel stack is two pages in size: typically 8KB on 32-bit architectures and 16KB on 64-bit architectures. Because interrupt handlers share this stack, they must be exceptionally frugal with the data they allocate there. Of course, the kernel stack is limited to begin with, so all kernel code should be cautious.
Pointers to the additional stacks are provided in the following arrays:
arch/x86/kernel/irq_32.c
static union irq_ctx *hardirq_ctx[NR_CPUS] __read_mostly;
static union irq_ctx *softirq_ctx[NR_CPUS] __read_mostly;
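For reference, the irq_ctx union in kernels of that era paired a thread_info with the stack area, mirroring the layout of a regular kernel stack:

    union irq_ctx {
            struct thread_info      tinfo;  /* at the base, as on a task's kernel stack */
            u32                     stack[THREAD_SIZE/sizeof(u32)];
    };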
Related
The following image shows where the sections of a process are laid out in the process's virtual address space (in Linux):
You can see that there is only one stack section (since this process has only one thread, I assume).
But what if this process has another thread; where will the stack for this second thread be located? Will it be located immediately below the first stack?
Stack space for a new thread is created by the parent thread with mmap(MAP_ANONYMOUS|MAP_STACK). So they're in the "memory map segment", as your diagram labels it. It can end up anywhere that a large malloc() could go. (glibc malloc(3) uses mmap(MAP_ANONYMOUS) for large allocations.)
(MAP_STACK is currently a no-op, and exists in case some future architecture needs special handling).
You pass a pointer to the new thread's stack space to the clone(2) system call which actually creates the thread. (Try using strace -f on a multi-threaded process sometime). See also this blog post about creating a thread using raw Linux syscalls.
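To make the mmap()-plus-clone() flow concrete, here is a minimal sketch. It is not what glibc literally does; error handling is kept short and most of the CLONE_* flags a real thread would pass are omitted:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define STACK_SIZE (1024 * 1024)

    static int child(void *arg)
    {
            printf("child running, pid %d\n", getpid());
            return 0;
    }

    int main(void)
    {
            /* allocate the new thread's stack, much like pthread_create() would */
            char *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
            if (stack == MAP_FAILED) {
                    perror("mmap");
                    exit(1);
            }

            /* pass a pointer to the *top* of the region: x86 stacks grow down.
             * CLONE_VM shares the address space, as threads do */
            int pid = clone(child, stack + STACK_SIZE, CLONE_VM | SIGCHLD, NULL);
            if (pid == -1) {
                    perror("clone");
                    exit(1);
            }

            waitpid(pid, NULL, 0);
            return 0;
    }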
See this answer on a related question for some more details about mmapping stacks. e.g. MAP_GROWSDOWN doesn't prevent another mmap() from picking the address right below the thread stack, so you can't depend on it to dynamically grow a small stack the way you can for the main thread's stack (where the kernel reserves the address space even though it's not mapped yet).
So even though mmap(MAP_GROWSDOWN) was designed for allocating stacks, it's so bad that Ulrich Drepper proposed removing it in 2.6.29.
Also, note that your memory-map diagram is for a 32-bit kernel. A 64-bit kernel doesn't have to reserve any user virtual-address space for mapping kernel memory, so a 32-bit process running on an amd64 kernel can use the full 4GB of virtual address space. (Except for the low 64k by default (sysctl vm.mmap_min_addr = 65536), so NULL-pointer dereference does actually fault. And the top page is also reserved: those values are used for error codes, not valid pointers.)
Related:
See Relation between stack limit and threads for more about stack-size for pthreads. getrlimit(RLIMIT_STACK) is the main thread's stack size. Linux pthreads uses RLIMIT_STACK as the stack size for new threads, too.
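You can check both claims yourself with a small program (a minimal sketch using the glibc-specific pthread_getattr_np(); compile with -pthread):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/resource.h>

    static void *report(void *arg)
    {
            pthread_attr_t attr;
            void *addr;
            size_t size;

            /* query the attributes of the running thread (glibc extension) */
            pthread_getattr_np(pthread_self(), &attr);
            pthread_attr_getstack(&attr, &addr, &size);
            printf("thread stack at %p, size %zu bytes\n", addr, size);
            pthread_attr_destroy(&attr);
            return NULL;
    }

    int main(void)
    {
            struct rlimit rl;
            pthread_t t;

            getrlimit(RLIMIT_STACK, &rl);
            printf("RLIMIT_STACK soft limit: %ld bytes\n", (long)rl.rlim_cur);

            pthread_create(&t, NULL, report, NULL);  /* default attributes */
            pthread_join(t, NULL);
            return 0;
    }

On a typical setup the new thread's stack size matches the RLIMIT_STACK soft limit, and its address falls in the mmap region rather than adjacent to the main stack.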
I know that there are two types of stacks in Linux: a user stack for each user thread, and a kernel stack (one per process/thread) for kernel-mode execution. Interrupts, or more precisely the interrupt service routines, are the bridges between these two modes, kernel (ring 0) and user (ring 3). The interrupt vector table lets the processor load the right instruction address into the PC register, but how is the stack pointer register changed when the processor switches into kernel mode? Do the service routines indicate where the kernel stack is just before their first instruction? Or does the processor use two stack pointer registers (I really doubt it)?
How does the "return from interrupt" know where to return? Is the PCB saved on the kernel stack or elsewhere?
Please don't hesitate to correct anything I've said that isn't true.
Thanks a lot for your help.
The kernel-mode stack in the Linux kernel is stored in task_struct->stack. Where and how it comes from is entirely up to the platform; some platforms might not store it exactly like this, but you can always use task_stack_page() to find the stack.
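As a minimal sketch (a hypothetical kernel-module helper; task_stack_page() is the real accessor in recent kernels):

    #include <linux/sched.h>
    #include <linux/sched/task_stack.h>     /* task_stack_page() */

    /* print the base of a task's kernel stack area */
    static void show_kstack_base(struct task_struct *tsk)
    {
            void *base = task_stack_page(tsk);      /* equivalent to tsk->stack */

            pr_info("kernel stack of %s (pid %d): %p\n",
                    tsk->comm, tsk->pid, base);
    }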
And
When entering the interrupt handler, the PC is stored on the kernel stack. When returning from the interrupt, this PC is loaded back from the kernel stack.
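To illustrate on x86 (the struct name below is invented for illustration; the kernel's equivalent is part of struct pt_regs): when an interrupt arrives from user mode, the CPU itself pushes a small frame onto the kernel stack before any handler code runs, and iret later pops it.

    /* illustrative only: the hardware-pushed part of the interrupt frame on
     * x86-64 when the interrupt arrives from user mode (a privilege change) */
    struct hw_irq_frame {
            unsigned long ip;       /* saved program counter: where to resume */
            unsigned long cs;       /* code segment of the interrupted context */
            unsigned long flags;    /* saved RFLAGS */
            unsigned long sp;       /* the interrupted user stack pointer */
            unsigned long ss;       /* its stack segment */
    };
    /* iret pops exactly this frame, which is how "return from interrupt"
     * knows where to resume without the handler doing anything special */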
In a multitasking system, when a piece of hardware raises an interrupt to a particular CPU, that CPU can be doing either of the following (unless it is already servicing an ISR):
A user-mode process is executing on the CPU.
A kernel-mode process is executing on the CPU.
I would like to know which stack is used by the interrupt handler in the above two situations, and why.
All interrupts are handled by the kernel, via the interrupt handler written for that particular interrupt. For interrupt handlers there is an IRQ stack; whether interrupt handlers get their own stacks is a configuration option. The size of the kernel stack might not always be enough for the kernel's work plus the space required by IRQ processing routines. Hence two stacks come into the picture:
Hardware IRQ Stack.
Software IRQ Stack.
In contrast to the regular kernel stack, which is allocated per process, the two additional stacks are allocated per CPU. Whenever a hardware interrupt occurs (or a softIRQ is processed), the kernel needs to switch to the appropriate stack.
Historically, interrupt handlers did not receive their own stacks; instead, they shared the kernel stack of the process they interrupted. The kernel stack is two pages in size: typically 8KB on 32-bit architectures and 16KB on 64-bit architectures. Because interrupt handlers share this stack, they must be exceptionally frugal with the data they allocate there. Of course, the kernel stack is limited to begin with, so all kernel code should be cautious.
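For the hardware-IRQ case, the switch itself looked roughly like this in 2.6-era arch/x86/kernel/irq_32.c (condensed; the bookkeeping that copies thread_info fields between stacks is omitted):

    static inline int execute_on_irq_stack(int irq, struct irq_desc *desc)
    {
            union irq_ctx *curctx, *irqctx;
            int arg1 = irq;
            struct irq_desc *arg2 = desc;
            u32 *isp;

            curctx = (union irq_ctx *)current_thread_info();
            irqctx = __get_cpu_var(hardirq_ctx);

            /* nested interrupt: we are already on the IRQ stack, keep using it */
            if (unlikely(curctx == irqctx))
                    return 0;

            /* start at the top of the IRQ stack; x86 stacks grow downward */
            isp = (u32 *)((char *)irqctx + sizeof(*irqctx));

            /* swap ESP for the IRQ stack, call the handler, then restore ESP */
            asm volatile("xchgl %%ebx,%%esp \n"
                         "call  *%%edi      \n"
                         "movl  %%ebx,%%esp \n"
                         : "=a" (arg1), "=d" (arg2), "=b" (isp)
                         :  "0" (arg1),  "1" (arg2),  "2" (isp),
                            "D" (desc->handle_irq)
                         : "memory", "cc", "ecx");
            return 1;
    }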
Interrupts are handled only by the kernel, so it is some kernel stack that is used (in both cases).
Interrupts do not (directly) affect user processes.
Processes may get signals, but these are not interrupts. See signal(7)...
Historically, interrupt handlers did not receive their own stacks.
Instead, they would share the stack of the process that they interrupted.
Note that a process is always running. When nothing else is schedulable, the idle task runs.
The kernel stack is two pages in size:
8KB on 32-bit architectures.
16KB on 64-bit architectures.
Because they share the stack, interrupt handlers must be exceptionally frugal with the data they allocate there.
Early in the 2.6 kernel development series, an option was added to reduce the kernel stack size from two pages to one, providing only a 4KB stack on 32-bit systems, and interrupt handlers were given their own stack: one stack per processor, one page in size. This stack is referred to as the interrupt stack.
Although the total size of the interrupt stack is half that of the original shared stack, the average stack space available is greater, because interrupt handlers get the full page of memory to themselves. (The smaller process stacks were attractive because, previously, every process on the system needed two pages of contiguous, nonswappable kernel memory.)
Your interrupt handler should not care what stack setup is in use or what the size of the kernel stack is. Always use an absolute minimum amount of stack space.
https://notes.shichao.io/lkd/ch7/#stacks-of-an-interrupt-handler
What's the point in keeping a different kernel stack for each process in Linux?
Why not keep just one stack for the kernel to work with?
What's the point in keeping a different kernel stack for each process in Linux?
It simplifies preemption of processes in kernel space.
Why not keep just one stack for the kernel to work with?
It would be a nightmare to implement preemption without separate stacks.
Separate kernel stacks are not really mandated; each architecture is free to do whatever it wants. If there were no preemption during a system call, then a single kernel stack might make sense.
However, *nix has processes, and each process can make a system call. Linux allows one task to be preempted during a write(), etc., while another task is scheduled. The kernel stack is a snapshot of the context of the kernel work being performed for each process.
Also, the per-process kernel stacks come with little overhead. A thread_info, or some mechanism to get the process information from assembler, is needed; that is at least a page allocation. By placing the kernel-mode stack in the same location, a simple mask applied to the stack pointer can recover the thread_info from assembler. So we already need the per-process variable and allocation; why not use it as a stack to store kernel context and allow preemption during system calls?
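The mask trick described above looked roughly like this on 2.6-era x86-32 (a sketch; modern x86 kernels keep current in a per-CPU variable instead):

    /* thread_info sits at the base of the THREAD_SIZE-aligned kernel stack,
     * so masking the stack pointer recovers it with no memory lookup */
    static inline struct thread_info *current_thread_info(void)
    {
            unsigned long sp;

            asm ("movl %%esp, %0" : "=r" (sp));
            return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
    }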
The efficiency of preemption can be demonstrated by the write() mentioned above. If the write() is to disk or network, it will take time to complete. A 5KB to 8KB buffer written to disk or network takes many CPU cycles (if synchronous), and the user process blocks until it is finished. The transfer in the driver can be done with DMA, so a hardware element completes the job of moving the buffer to the device. In the meantime, a lower-priority process can have the CPU and be allowed to make system calls, because the kernel keeps a separate stack per process. These stacks are near zero cost, as the kernel already needs bookkeeping information for process state, and the two are kept together in a 4KB or 8KB page.
Why not keep just one stack for the kernel to work with?
In this case, only one process/thread would be able to enter the kernel at a time.
Basically, each thread has its own stack, and crossing the user-space/kernel boundary does not change this fact. The kernel also has its own kernel threads (not belonging to any user-space process), and they all have their own stacks.
Does there exist a kernel stack and a user-space stack for each user-space process? If both stacks exist, there should be two stack pointers for each user-space process, right?
In Linux, each task (userspace or kernel thread) has a kernel stack of either 8KB or 4KB, depending on kernel configuration. There are indeed separate stack pointers; however, only one is present in the CPU at any given time. If userspace code is running, the kernel stack pointer to be used on exceptions or interrupts is specified by the task-state segment; if kernel code is running, the user stack pointer is saved in the context structure located on the kernel stack.
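A hedged sketch of that task-state-segment mechanism, modeled on 2.6-era x86 context-switch code (the helper name here is invented for illustration):

    /* at every task switch, store the incoming task's kernel-stack top in the
     * TSS; the CPU loads this as the kernel stack pointer on the next
     * user->kernel transition (interrupt, exception, or syscall) */
    static void update_kernel_sp(struct tss_struct *tss, struct task_struct *next_p)
    {
            /* stacks grow down, so the top is base + THREAD_SIZE */
            tss->x86_tss.sp0 = (unsigned long)task_stack_page(next_p) + THREAD_SIZE;
    }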
Each user-space thread (not just each process) has its own user-space stack and kernel-space stack. The kernel also has one stack per CPU, and the ISR has its own separate stack too.