Kernel entry points on ARM (Linux)

I was reading through the ARM kernel source code to improve my understanding and came across something interesting.
Inside arch/arm/kernel/entry-armv.S there is a macro named vector_stub that generates a small chunk of assembly followed by a jump table for the various ARM modes. For instance, the invocation vector_stub irq, IRQ_MODE, 4 expands to a body with the label vector_irq; the same occurs for vector_dabt, vector_pabt, vector_und, and vector_fiq.
Inside each of these vector_* jump tables, there is exactly 1 DWORD with the address of a label with a _usr suffix.
I'd like to confirm that my understanding is accurate, please see below.
Does this mean that labels with the _usr suffix are executed only if the interrupt arises while the kernel thread executing on that CPU is in userspace context? For instance, irq_usr is executed if the interrupt occurs while the kernel thread is in userspace context, dabt_usr is executed if the interrupt occurs while the kernel thread is in userspace context, and so on.
If [1] is true, then which kernel threads are responsible for handling, say, IRQs with a different suffix, such as irq_svc? I am assuming that this is the handler for an interrupt request that happens in SVC mode. If so, which kernel thread handles this? The kernel thread currently in SVC mode, on whichever CPU receives the interrupt?
If [2] is true, then at what point does the kernel thread finish processing the second interrupt and return to where it had left off (also in SVC mode)? Is it ret_from_intr?

Inside each of these vector_* jump tables, there is exactly 1 DWORD with the address of a label with a _usr suffix.
This is correct. The table is indexed by the mode that was interrupted (taken from the SPSR). For instance, irq has only three distinct entries: irq_usr, irq_svc, and irq_invalid. IRQs should be disabled during data aborts, FIQ, and the other modes. Linux will always transfer to SVC mode after this brief 'vector stub' code. It is accomplished with:
#
# Prepare for SVC32 mode. IRQs remain disabled.
#
mrs r0, cpsr
eor r0, r0, #(\mode ^ SVC_MODE | PSR_ISETSTATE)
msr spsr_cxsf, r0
### ... other unrelated code
movs pc, lr # branch to handler in SVC mode
This is why irq_invalid is used for all the other modes: exceptions should never happen while this vector stub code is executing.
Does this mean that labels with the _usr suffix are executed only if the interrupt arises while the kernel thread executing on that CPU is in userspace context? For instance, irq_usr is executed if the interrupt occurs while the kernel thread is in userspace context, dabt_usr is executed if the interrupt occurs while the kernel thread is in userspace context, and so on.
Yes. The SPSR holds the mode that was interrupted, and the table is indexed by those mode bits.
If 1 is true, then which kernel threads are responsible for handling, say, IRQs with a different suffix, such as irq_svc? I am assuming that this is the handler for an interrupt request that happens in SVC mode. If so, which kernel thread handles this? The kernel thread currently in SVC mode, on whichever CPU receives the interrupt?
I think you have some misunderstanding here. There is no separate 'kernel thread' for user-space processes. irq_usr is responsible for storing the user-mode registers, as a reschedule might take place. The context is different for irq_svc: a kernel stack was already in use, and it is the same one the IRQ code will use. What happens when a user task calls read()? It uses a system call, and code executes in a kernel context. Each process has both a user stack and an svc/kernel stack (and thread_info). A kernel thread is simply a process without any user-space stack.
If 2 is true, then at what point does the kernel thread finish processing the second interrupt and return to where it had left off (also in SVC mode)? Is it ret_from_intr?
Generally, Linux returns to the kernel thread that was interrupted so it can finish its work. However, there is a configuration option for pre-empting SVC threads/contexts. If the interrupt resulted in a reschedule event, a process/context switch may occur if CONFIG_PREEMPT is active. See svc_preempt for this code.
See also:
Linux kernel arm exception stack init
Arm specific irq initialization

Related

Core execution flow in the point of thread context switch and CPU mode switch

If
•the CPU has modes (privilege levels) (added because, according to here, not all processors have privilege levels),
•the CPU is multi-core (CISC / x86-64 instruction set),
•scheduling is round-robin scheduling,
•the thread is a kernel-managed thread,
•the OS is Windows, if necessary,
I want to know simplified core execution flow in the point of thread context switch and CPU mode switch per time slice.
My understanding is as follows. Please correct me if I'm wrong.
In the case of a kernel-managed, user-mode thread, not involving interrupts or anything else that requires kernel mode:
1. A thread context switch occurs.
2. The core executing the thread switches to kernel mode, because a context switch can only occur in kernel mode according to here, here and here, unless the thread is a user-managed thread.
3. The core executing the thread switches back to user mode.
4. The core executes a sequence of instructions located in user space.
5. The time slice expires.
6. Repeat from 1.
Closest related diagram I could find is below.
Even a little clue to answer will be sincerely appreciated.
You said it yourself:
context switch can only occur in kernel mode
So the CPU must enter kernel mode before there can be a context switch. That can happen in either one of two ways in most operating systems:
The user-mode code makes a system call, or
An interrupt occurs.
If the thread enters kernel mode by making a system call, then there could be a context switch if the syscall causes the thread to no longer be runnable (e.g., a sleep() call), or there could be a context switch if the syscall causes some higher-priority thread to become runnable. (e.g., the syscall releases a mutex that the higher priority thread was awaiting.)
If the thread enters kernel mode because of an interrupt, then there could be a context switch because the interrupt handler made some higher-priority thread runnable (e.g., if the other thread was awaiting data from the disk), or there could be a context switch because it was a timer interrupt, and the current thread's time slice has expired.
The mechanism of context switching may be different on different hardware platforms. Here's how it could happen on some hypothetical CPU:
The current thread (threadA) enters scheduler code, which chooses some other thread (threadB) as the next to run on the current CPU.
It calls some switchContext(threadB) function.
The switchContext function copies values from the stack pointer register, and from other live registers into the current thread (threadA)'s saved context area.*
It then sets the "current thread" pointer to point to threadB's saved context area, and it restores threadB's context by copying all the same things in reverse.**
Finally, the switchContext function returns... IN threadB,... at exactly the place where threadB last called it.
Eventually, threadB returns from the interrupt or system call to application code running in user-mode.
* The author of switchContext may have to be careful, may have to do some tricky things, in order to save the entire context without trashing it. E.g., it had better not use any register that needs saving before it has actually saved it somewhere.
** The trickiest part is when restoring the stack pointer register. As soon as that happens, "the" stack suddenly is threadB's stack instead of threadA's stack.

Signal vs Exceptions vs Hardware Interrupts vs Traps

I read this answer and I thought that I got a clear idea. But then this answer is confusing me again.
Can somebody please give me a clear picture of the differences between Signal, exception, hardware interrupts and traps?
Moreover, I would like to know which among these block CPU preemption of the kernel code?
Examples would be helpful.
•Interrupts are generated by hardware for events external to the processor core.
These are asynchronous in nature which means the processor is unaware of when the interrupt will be generated. These are also called hardware interrupts. Example: interrupt generated by keyboard to type a character on screen, or the timer interrupt.
•Exceptions: Exceptions occur when the processor detects an error condition while executing an instruction and are classified as faults, traps, or aborts depending on the way they are reported and whether the instruction that caused the exception can be restarted without loss of program or task continuity. (Those technical terms are used on x86 at least, maybe other architectures or in general.) Example: Divide by zero, or a page fault.
•Traps: are basically an instruction which tells the kernel to switch to kernel mode from user mode. Example: during a system call, a TRAP instruction would force kernel to execute the system call code inside kernel (kernel mode) on behalf of the process.
Trap is a kind of exception.
The x86 int 0x80 "software interrupt" instruction is a trap, not like external interrupts. x86 uses a single table of handlers for both interrupts and exceptions; other ISAs may also do that.
Some people use this term more generally, as a synonym for "exception". e.g. you might say "MIPS add will trap on signed overflow, so compilers always use addu."
•Signals: signals are generated by kernel or by a process (kill system call). They are eventually managed by the OS kernel, which delivers them to the target thread/process. E.g. a divide by zero instruction would result in kernel delivering a SIGFPE signal (arithmetic exception) to the process that ran it. (For example, the x86 #DE fault is handled by the kernel, generating a software SIGFPE for the current process.)
Related:
When an interrupt occurs, what happens to instructions in the pipeline? - #Krazy Glew's answer also defines some terminology using Intel definitions.
Why are segfaults called faults (and not aborts) if they are not recoverable? - segfaults are a special case of page faults, which are normally recoverable. And even an invalid page fault from user-space doesn't crash the kernel, so it's recoverable in that sense.

What is "process context" exactly, and how does it relates to "interrupt context"?

What does the following phrase mean: "the kernel executes in the process context"?
Does it mean that if CPU is executing some process and then some interrupt occurs (system call, key press, etc.), the CPU will keep the page table for the currently running process loaded and then it will execute the interrupt handler which resides in the process's kernel space?
If this is what it means, then it seems like the interrupt handler is executed in the process context, so what does interrupt context means?
A process's context is its current state: what is in its registers. We need to save the context of the currently running process so that it can be resumed after the interrupt is handled. On x86, that includes:
esp
ss
eip
cs
and more.
We need to save the instruction pointer (EIP) and the CS (Code Segment) so that after the interrupt is handled we can continue running from where we were stopped.
The interrupt handler code resides in kernel memory. Once an interrupt occurs, we immediately switch from user mode to kernel mode. The state of the currently running process is saved, part of it on the user stack and the other part on the kernel stack (depending on the architecture). Assuming x86, the interrupt handler is then run by loading the appropriate ss, cs, esp and eip from the TSS and the interrupt descriptor table.

Must IRET be used when returning from an interrupt?

IRET can restore the registers from the stack, including EFLAGS, ESP, EIP and so on, but we could also restore the registers ourselves. For example, movl can be used to restore the %esp register, and jmp can jump to the address for EIP that is stored on the stack.
The Linux kernel returns from all interrupts with IRET, which is a heavyweight instruction.
Some kernel operations (like context switches) happen frequently.
Isn't IRET a waste?
Besides all the heavy stuff IRET can and often should do in addition to a mere blend of POPF+RETF, there's one more thing that it does. It has a special function related to non-maskable interrupts (NMIs).
Concurrent NMIs are delivered to the CPU one by one. IRET signals to the NMI circuitry that another NMI can now be delivered. No other instruction can do this signalling.
If NMIs could preempt execution of other NMI ISRs, they would be able to cause a stack overflow, which rarely is a good thing. Unless we're talking about this wonderful website. :)
So, all in all, IRET is not a waste.
Probably because doing all that manually would need even more CPU clocks.
From Wikipedia:
The actual code that is invoked when an interrupt occurs is called the Interrupt Service Routine (ISR). When an exception occurs, a program invokes an interrupt, or the hardware raises an interrupt, the processor uses one of several methods (to be discussed) to transfer control to the ISR, whilst allowing the ISR to safely return control to whatever it interrupted after execution is complete. At minimum, FLAGS and CS:IP are saved and the ISR's CS:IP loaded; however, some mechanisms cause a full task switch to occur before the ISR begins (and another task switch when it ends).
So IRET isn't a waste; it is the minimum (and the fastest) way to return from an ISR. Also, all other CPU registers used in the ISR must be preserved at the beginning and restored before IRET executes!

Internals of a Linux system call

What happens (in detail) when a thread makes a system call by raising interrupt 80? What work does Linux do to the thread's stack and other state? What changes are done to the processor to put it into kernel mode? After running the interrupt handler, how is control restored back to the calling process?
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded and how does it then obtain control again?
A crash course in kernel mode in one stack overflow answer
Good questions! (Interview questions?)
What happens (in detail) when a thread makes a system call by raising interrupt 80?
The int $0x80 operation is vaguely like a function call. The CPU "takes a trap" and restarts at a known address in kernel mode, typically with a different MMU mode as well. The kernel will save many of the registers, though it doesn't have to save the registers that a program would not expect an ordinary function call to save.
What work does Linux do to the thread's stack and other state?
Typically an OS will save registers that the ABI promises not to change during procedure calls. The stack will stay the same; the kernel will run on a per-thread kernel stack rather than the per-thread user stack. Naturally some state will change, otherwise there would be no reason to do the system call.
What changes are done to the processor to put it into kernel mode?
This is usually entirely automatic. The CPU has, generically, a software-interrupt instruction that is a bit like a functional-call operation. It will cause the switch to kernel mode under controlled conditions. Typically, the CPU will change some sort of PSW protection bit, save the old PSW and PC, start at a well-known trap vector address, and may also switch to a different memory management protection and mapping arrangement.
After running the interrupt handler, how is control restored back to the calling process?
There will be some sort of "return from interrupt" or "return from trap" instruction, typically, that will act a bit like a complicated function-return instruction. Some RISC processors did very little automatically and required specific code to do the return and some CISC processors like x86 have (never-really-used) instructions that would execute dozens of operations documented in pages of architecture-manual pseudo-code for capability adjustments.
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded, and how does it then obtain control again?
The kernel itself is threaded much like a threaded user program is. It just switches stacks (threads) and works on someone else's process for a while.
To answer the last part of the question - what does the kernel do if the system call needs to sleep -
After a system call, the kernel is still logically running in the context of the same task that made the system call; it's just in kernel mode rather than user mode. It is NOT a separate thread, and most system calls do not invoke logic from another task/thread. What happens is that the system call calls wait_event, wait_event_timeout, or some other wait function, which adds the task to a list of tasks waiting for something, puts the task to sleep (changing its state), and calls schedule() to relinquish the current CPU.
After this the task cannot be run again until it gets woken up, typically by another task (kernel task, etc) or interrupt handler calling a wake* function which will wake up the task(s) sleeping waiting for that particular event, which means the scheduler will soon schedule them again.
It's worth noting that userspace tasks (i.e. threads) are only one type of task, and there are a few others internal to the kernel which can do work as well: kernel threads and bottom-half handlers / tasklets / task queues etc. Work which doesn't belong to any particular userspace process (for example network handling, e.g. responding to pings) gets done in these. These tasks are allowed to go to sleep, unlike interrupt handlers (which should not invoke the scheduler).
http://tldp.org/LDP/khg/HyperNews/get/syscall/syscall86.html
This should help anyone seeking answers to what happens when the syscall instruction is executed, transferring control to the kernel (user mode to kernel mode). This is based on the x86_64 architecture.
https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html
