Signal vs Exceptions vs Hardware Interrupts vs Traps - Linux

I read one answer and thought I had a clear picture, but then another answer confused me again.
Can somebody please give me a clear picture of the differences between Signal, exception, hardware interrupts and traps?
Moreover, I would like to know which of these block CPU preemption of kernel code.
Examples would be helpful.

•Interrupts are generated by hardware for events external to the processor core.
These are asynchronous: the processor cannot know in advance when an interrupt will arrive. They are also called hardware interrupts. Examples: the interrupt generated by the keyboard when a key is pressed, or the timer interrupt.
•Exceptions: Exceptions occur when the processor detects an error condition while executing an instruction. They are classified as faults, traps, or aborts depending on how they are reported and on whether the instruction that caused the exception can be restarted without loss of program or task continuity. (Those are the technical terms used on x86 at least; other architectures use similar classifications.) Examples: divide by zero, or a page fault.
•Traps: a trap is basically an instruction that tells the CPU to switch from user mode to kernel mode. Example: during a system call, a trap instruction forces the CPU into kernel mode so the kernel executes the system-call code on behalf of the process.
Trap is a kind of exception.
The x86 int 0x80 "software interrupt" instruction is a trap, unlike external interrupts. x86 uses a single table of handlers for both interrupts and exceptions; other ISAs may do the same.
Some people use this term more generally, as a synonym for "exception". e.g. you might say "MIPS add will trap on signed overflow, so compilers always use addu."
•Signals: signals are generated by the kernel or by a process (via the kill system call). They are eventually managed by the OS kernel, which delivers them to the target thread/process. E.g. a divide-by-zero instruction results in the kernel delivering a SIGFPE signal (arithmetic exception) to the process that ran it. (Concretely, the x86 #DE fault is handled by the kernel, which generates a software SIGFPE for the current process; see the sketch below.)
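To make that last example concrete, here is a minimal user-space sketch (the handler name is made up) showing the kernel converting the divide-by-zero fault into a SIGFPE delivery:

#include <signal.h>
#include <unistd.h>

static void on_fpe(int sig)
{
    (void)sig;
    write(2, "got SIGFPE\n", 11);
    _exit(1);    /* returning would re-run the faulting instruction forever */
}

int main(void)
{
    signal(SIGFPE, on_fpe);
    volatile int zero = 0;      /* volatile keeps the compiler from folding it */
    return 1 / zero;            /* x86 #DE fault -> kernel -> SIGFPE */
}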
Related:
When an interrupt occurs, what happens to instructions in the pipeline? - Krazy Glew's answer also defines some terminology using Intel definitions.
Why are segfaults called faults (and not aborts) if they are not recoverable? - segfaults are a special case of page faults, which are normally recoverable. And even an invalid page fault from user-space doesn't crash the kernel, so it's recoverable in that sense.

Related

Sending user-mode interrupts on x86

On Linux x86, can I send interrupts (e.g., triggered by a timer, or some other mechanism) which will be handled by code running in user mode?
Assuming the answer is yes (and it is almost certainly yes, see e.g. timer_create), does delivering this interrupt occur solely in user mode, or is there some kernel transition involved (e.g., the interrupt is initially handled by the kernel, which then sends a signal to the user process)?
All kernel timer interfaces work by delivering signals to user-space processes after handling the timer interrupt inside the kernel (or otherwise noticing that, or waiting until, the deadline has been reached).
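Here is a minimal sketch of that path using a POSIX timer with signal notification (names are illustrative; link with -lrt on older glibc). The timer interrupt itself is handled in the kernel; the process only ever sees the resulting signal:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t fired;

static void on_alarm(int sig) { (void)sig; fired = 1; }

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGRTMIN, &sa, NULL);

    /* ask the kernel to deliver SIGRTMIN when the timer expires */
    struct sigevent ev;
    memset(&ev, 0, sizeof ev);
    ev.sigev_notify = SIGEV_SIGNAL;
    ev.sigev_signo  = SIGRTMIN;

    timer_t t;
    timer_create(CLOCK_MONOTONIC, &ev, &t);

    struct itimerspec its;
    memset(&its, 0, sizeof its);
    its.it_value.tv_sec = 1;           /* one-shot, fires after 1 second */
    timer_settime(t, 0, &its, NULL);

    while (!fired)
        pause();                       /* sleep until the kernel delivers it */
    printf("signal delivered by the kernel after the timer expired\n");
    return 0;
}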
There are many big obstacles to having an interrupt handler run in ring 3, or at a user-space virtual address that is mapped only by one specific process. (Even if you pin that memory so it can't be paged out, it is still mapped only while CR3 points at that process's page tables. x86 uses virtual addresses in the IDT (interrupt descriptor table), and the page must be mapped when the interrupt fires; otherwise you presumably get a page fault, which you really don't want happening totally asynchronously. This is not a problem for normal interrupt handlers: the kernel keeps its own code mapped at the same virtual addresses in every process's page tables.)
A kernel API that allowed registering a user-space function pointer as a ring 0 interrupt handler would be handing the keys to the kingdom to that userspace process, literally running with kernel privileges, so that's pretty much unreasonable.
It is technically possible for x86 to have an interrupt handler that runs in ring 3, but if the interrupt fired while the CPU was in ring 0, iret would fault instead of returning to the interrupted kernel code.
An interrupt handler would have to be written specially to return with iret and to preserve all registers, e.g. with GCC's interrupt-handler function attributes (spelled __attribute__((interrupt)) for the x86 target; https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html documents the generic attributes, with the x86 one among the target-specific pages). And any other process on the same core would be at the mercy of this process; any bug (like destroying some architectural state, or dirtying SSE/AVX regs) could affect other processes. (That is, if you could even figure out how to get code in one process to run while CR3 might be set for another process...)
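As a sketch of what "written specially" means (purely illustrative; my_handler is hypothetical, and nothing in Linux will install it in the IDT for you), GCC's x86 interrupt attribute generates a handler that preserves registers and returns with iret:

/* Compile with: gcc -mgeneral-regs-only -c handler.c
 * -mgeneral-regs-only keeps the compiler away from SSE/AVX/x87 state. */
struct interrupt_frame;                 /* layout is pushed by the CPU */

__attribute__((interrupt))
void my_handler(struct interrupt_frame *frame)
{
    (void)frame;
    /* GCC saves/restores every register this body clobbers and
     * returns with iret instead of ret; the body must not fault. */
}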
Avoiding deadlocks would also be a big issue; in the kernel there are a lot of limits on what you can do in an interrupt handler proper (the "top half") because it can run asynchronously in between any other instruction (unless you disable interrupts on that core).
I don't think it's really plausible for Linux to let you do this; even if you somehow solve all the (very hard) problems and even get the handler to run in ring 3, the kernel still has to trust it not to step on the architectural state of any other process.
There is precedent for things like X servers getting privileges to run in/out instructions (via iopl) and/or access /dev/mem (which would in theory let them steal info from other processes). But this would be even worse, giving easy access to snapshots of register state from other processes.
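For reference, the iopl precedent looks like this from user space (a hedged sketch: it needs root / CAP_SYS_RAWIO, and port 0x70, the CMOS index register, is just an arbitrary harmless example):

#include <stdio.h>
#include <sys/io.h>

int main(void)
{
    if (iopl(3) != 0) {                 /* raise I/O privilege level; root only */
        perror("iopl");
        return 1;
    }
    unsigned char v = inb(0x70);        /* direct port read from ring 3 */
    printf("port 0x70 = 0x%02x\n", v);
    return 0;
}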

Under what circumstances does control pass from userspace to the Linux kernel space?

I'm trying to understand which events can cause a transition from userspace to the Linux kernel. If it's relevant, the scope of this question can be limited to the x86/x86_64 architecture.
Here are some sources of transitions that I'm aware of:
System calls (which include accessing devices) cause a context switch from userspace to kernel space.
Interrupts will cause a context switch. As far as I know, this also includes scheduler preemptions, since a scheduler usually relies on a timer interrupt to do its work.
Signals. It seems like at least some signals are implemented using interrupts, but I don't know whether some are implemented differently, so I'm listing them separately.
I'm asking two things here:
Am I missing any userspace->kernel path?
What are the various code paths that are involved in these context switches?
One you are missing: exceptions
(which can be further broken down into faults, traps and aborts).
For example: a page fault, a breakpoint, division by zero, or a floating-point exception. Technically, one can view exceptions as interrupts, but not really in the way you defined an interrupt in your question.
You can find a list of x86 exceptions at this osdev webpage.
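As a small illustration that a fault can be serviced and remain completely invisible to the program (a sketch): the first write to a fresh anonymous mapping normally takes a page fault, which the kernel handles before resuming the same instruction.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;
    p[0] = 'x';            /* first touch: page fault, serviced by the kernel */
    printf("%c\n", p[0]);
    munmap(p, 4096);
    return 0;
}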
With regard to your second question:
What are the various code paths that are involved in these context switches?
That really depends on the architecture and OS; you will need to be more specific. For x86, when an interrupt occurs you go through the IDT entry, and for SYSENTER you go to the address specified in the MSR. What happens after that is completely up to the OS.
No one wrote a complete answer so I will try to incorporate the comments and partial answers into an answer. Feel free to comment or edit the answer to improve it.
For the purposes of this question and answer, userspace-to-kernel transitions mean a change in processor state that allows access to kernel code and memory. In short, I will refer to these transitions as context switches.
When discussing events that can trigger userspace-to-kernel transitions, it is important to separate the OS constructs that we are used to (signals, system calls, scheduling), which require context switches, from the way these constructs are implemented using context switches.
In x86, there are two central ways for context switches to occur: interrupts and SYSENTER. Interrupts are a processor feature, which causes a context switch when certain events happen:
Hardware devices may request an interrupt. For example, a timer/clock can cause an interrupt when a certain amount of time has elapsed, and a keyboard can interrupt when keys are pressed. These are called hardware interrupts.
Userspace can initiate an interrupt. For example, the old way to perform a system call in Linux on x86 was to execute INT 0x80 with arguments passed through the registers. Debugging breakpoints are also implemented using interrupts, with the debugger replacing an instruction with INT 3. This type of interrupt is called a software interrupt (see the sketch after the link below).
The CPU itself generates interrupts in certain situations, like when memory is accessed without permission, when a user divides by zero, or when one core must notify another core that it needs to do something. This type of interrupt is called an exception, and you can read more about them in esm's answer above.
For a broader discussion of interrupts see here: http://wiki.osdev.org/Interrupt
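As a small demonstration of a software interrupt from user space (a sketch; the handler name is made up): the one-byte breakpoint instruction int3 makes the CPU vector through the IDT, and the kernel turns it into SIGTRAP for the current process.

#include <signal.h>
#include <unistd.h>

static void on_trap(int sig)
{
    (void)sig;
    write(1, "SIGTRAP from int3\n", 18);
    _exit(0);
}

int main(void)
{
    signal(SIGTRAP, on_trap);
    __asm__ volatile ("int3");   /* breakpoint: IDT vector 3 -> SIGTRAP */
    return 1;
}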
SYSENTER is an instruction that provides the modern path to cause a context switch for the particular case of performing a system call.
The code that handles the context switching due to interrupts or SYSENTER in Linux can be found in arch/x86/kernel/entry_{32|64}.S.
There are many situations in which a higher-level Linux construct might cause a context switch. Here are a few examples:
If a system call reaches an int 0x80 or sysenter instruction, a context switch occurs. Some system-call routines can use information mapped into userspace (the vDSO mechanism) to get what the system call was meant to get; in that case no context switch occurs (see the example after this list).
Many times scheduling doesn't require an interrupt: a thread performs a system call, and the return from the syscall is delayed until the thread is scheduled again. For processes in a code section where no syscalls are performed, Linux relies on timer interrupts to regain control.
A virtual-memory access to a location that was paged out will cause a page fault, and therefore a context switch.
Signals are usually delivered when a process is already "switched out" (see the comments by caf on the question), but sometimes an inter-processor interrupt is used to deliver a signal between two running processes.
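Here is the no-context-switch case from the first example above (a sketch, assuming a reasonably recent glibc and kernel): on x86-64, clock_gettime is normally satisfied by the vDSO entirely in user mode, and running the program under strace shows no system call for it.

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);    /* vDSO fast path, no trap */
    printf("%ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}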

Linux Process Context and SVC call in ARM

As per some Linux books:
kernel code that services system calls issued by user applications runs on behalf of the corresponding application process and is said to be executing in process context. Interrupt handlers run in interrupt context.
Now SVC and IRQ are two exceptions.
So when Linux is handling an SVC it is in process context, and while it is handling an IRQ it is in interrupt context. Is that how it is mapped?
Just one edit to this:
It is also mentioned in books that tasklets/softirqs run in interrupt context while workqueues run in process context. So does that mean a tasklet would run with CPSR.mode = IRQ?
If I understand your confusion in the right way:
Since Linux is a capable, preemptive, complex operating system, it handles concepts such as interrupts and software traps with much finer granularity than bare-metal code would.
For example, when a supervisor call (svc) happens, the hardware switches to SVC mode. Linux does as little as preparing some data structures for further handling, then quits SVC mode so the core can continue serving in user mode; this makes it possible to take many more exceptions instead of blocking them.
It is the same for IRQ mode: Linux handles the bare minimum in IRQ mode. It records which IRQ happened, which handler should be invoked, etc., then exits IRQ mode immediately to allow more to happen on that core. Later on, some other internal kernel thread may process that interrupt further. Since the hardware, while relatively simple, runs really fast, handling of interrupts proceeds in parallel with many processes.
The downside of this advanced approach is that it gives no guarantees on response time, and its overhead becomes visible on slower hardware like MCUs.
So ARM's exception modes provide two things for Linux: message type and priority, backed by hardware support.
Message type is what an exception mode is about: whether it was an SVC, IRQ, FIQ, data abort, undefined instruction, etc. So when the hardware enters an exception mode, Linux implicitly knows what it is handling.
Priority is about providing capable and responsive hardware; for example, the system should be able to acknowledge an interrupt while handling some less important supervisor call.
Hardware support makes handling the above two easier and faster. For example, some registers are banked per mode, and there is an extra system mode to make handling reentrant IRQs easier.

Page fault in Interrupt context

Can a page fault occur in an interrupt handler/atomic context ?
It can, but it would be a disaster. :-)
(This is an oldish question. The existing answers contain correct facts, but are quite thin. I will attempt to answer it in a more substantial way.)
The answer to this question depends upon whether the code is in the kernel (supervisor mode), or in user mode. The reason is that the rules for memory access in these regions are usually different. Here is a brief sequence of events to illustrate the problem (assuming kernel memory could be paged out):
While a user program is executing, an interrupt occurs (e.g. key press / disk event).
CPU transitions to supervisor mode and begins executing the handler in the kernel.
The interrupt handler begins to save the CPU state (so that the user process can be correctly resumed later), but in doing so it touches some of its storage which had previously been paged out.
This triggers a page fault exception.
In order to process the page fault exception, the kernel must now save the CPU state of the code that experienced the page miss.
It may actually be able to do this if it has a preallocated pool of memory that will never be paged out, but such a pool would inevitably be limited in size.
So you see, the safest (and simplest) solution is for the kernel to ensure that memory owned by the kernel is not pageable at all. For this reason, page faults should not really occur within the kernel. They can occur, but as adobriyan notes, that usually indicates a much bigger error than a simple need to page in some memory. (I believe this is the case in Linux. Check your specific OS to be sure whether kernel memory is non-pageable; OS architectures do differ.)
So in summary, kernel memory is usually not pageable, and since interrupts are usually handled within the kernel, page faults should not in general occur while servicing interrupts. Higher-priority interrupts can still interrupt lower ones; it is just that all their resources are kept in physical memory.
The question about atomic contexts is less clear. If by that you mean atomic operations supported by the hardware, then no interrupt occurs within a partially completed operation. If you are instead referring to something like a critical section, remember that critical sections only emulate atomicity. From the perspective of the hardware there is nothing special about such code except for the entry and exit code, which may use true hardware atomic operations; the code in between is normal code, and subject to being interrupted.
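To illustrate those rules from the kernel side (a fragment, not a loadable module; my_irq_handler and the device details are hypothetical), code running in interrupt context must neither sleep nor fault:

#include <linux/interrupt.h>
#include <linux/slab.h>

static irqreturn_t my_irq_handler(int irq, void *dev)
{
        /* GFP_KERNEL may sleep waiting for memory; GFP_ATOMIC never sleeps
         * and therefore may fail, which the handler must tolerate. */
        void *buf = kmalloc(64, GFP_ATOMIC);

        if (!buf)
                return IRQ_NONE;

        /* ... read device registers into buf; never touch user memory
         * here, since a page fault in this context would be fatal ... */

        kfree(buf);
        return IRQ_HANDLED;
}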
I hope this provides a useful response to this question, as I also wondered about this issue for a while.
Yes.
The code for the handler or critical region could span the boundary between two pages. If the second page is not resident, a page fault is necessary to bring it in.
Not sure why nobody has used the term "double fault":
http://en.wikipedia.org/wiki/Double_fault
But that is the term used in the Intel manual:
http://software.intel.com/en-us/articles/introduction-to-pc-architecture/
or here:
ftp://download.intel.com/design/processor/manuals/253668.pdf (look at section 6-38).
There is something called a triple fault too, which, as the name indicates, can happen while the CPU is trying to service the double-fault error.
I think the answer is YES.
I just checked the page-fault handler code in kernel 4.15 for the x86_64 platform (arch/x86/mm/fault.c).
Take the following as a hint: no_context is the classic "kernel oops" path.
static noinline void
no_context(struct pt_regs *regs, unsigned long error_code,
           unsigned long address, int signal, int si_code)
{
        /* Are we prepared to handle this kernel fault? */
        if (fixup_exception(regs, X86_TRAP_PF)) {
                /*
                 * Any interrupt that takes a fault gets the fixup. This makes
                 * the below recursive fault logic only apply to a faults from
                 * task context.
                 */
                if (in_interrupt())
                        return;
                /* ... rest of no_context elided ... */

Internals of a Linux system call

What happens (in detail) when a thread makes a system call by raising interrupt 80? What work does Linux do to the thread's stack and other state? What changes are done to the processor to put it into kernel mode? After running the interrupt handler, how is control restored back to the calling process?
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded and how does it then obtain control again?
A crash course in kernel mode, in one Stack Overflow answer.
Good questions! (Interview questions?)
What happens (in detail) when a thread makes a system call by raising interrupt 80?
The int $0x80 instruction is vaguely like a function call. The CPU "takes a trap" and restarts at a known address in kernel mode, typically with a different MMU mode as well. The kernel will save many of the registers, though it doesn't have to save the registers that a program would not expect an ordinary function call to preserve.
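A concrete sketch of taking that trap by hand (32-bit x86 Linux only; build with gcc -m32; syscall number 4 is write on i386):

int main(void)
{
    static const char msg[] = "hello from int 0x80\n";
    long ret;

    __asm__ volatile ("int $0x80"            /* trap into the kernel   */
                      : "=a"(ret)            /* eax: return value      */
                      : "a"(4),              /* eax: __NR_write (i386) */
                        "b"(1),              /* ebx: fd = stdout       */
                        "c"(msg),            /* ecx: buffer            */
                        "d"(sizeof msg - 1)  /* edx: length            */
                      : "memory");
    return ret < 0;
}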
What work does Linux do to the thread's stack and other state?
Typically an OS will save registers that the ABI promises not to change during procedure calls. The stack will stay the same; the kernel will run on a per-thread kernel stack rather than the per-thread user stack. Naturally some state will change, otherwise there would be no reason to do the system call.
What changes are done to the processor to put it into kernel mode?
This is usually entirely automatic. The CPU has, generically, a software-interrupt instruction that is a bit like a functional-call operation. It will cause the switch to kernel mode under controlled conditions. Typically, the CPU will change some sort of PSW protection bit, save the old PSW and PC, start at a well-known trap vector address, and may also switch to a different memory management protection and mapping arrangement.
After running the interrupt handler, how is control restored back to the calling process?
There will typically be some sort of "return from interrupt" or "return from trap" instruction that acts a bit like a complicated function-return instruction. Some RISC processors did very little automatically and required specific code to do the return, while some CISC processors like x86 have (never-really-used) instructions for capability adjustments that would execute dozens of operations documented in pages of architecture-manual pseudo-code.
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded, and how does it then obtain control again?
The kernel itself is threaded much like a threaded user program is. It just switches stacks (threads) and works on someone else's process for a while.
To answer the last part of the question - what does the kernel do if the system call needs to sleep -
After a system call, the kernel is still logically running in the context of the same task that made the system call; it's just in kernel mode rather than user mode. It is NOT a separate thread, and most system calls do not invoke logic from another task/thread. What happens is that the system call calls wait_event, wait_event_timeout, or some other wait function, which adds the task to a list of tasks waiting for something, puts the task to sleep by changing its state, and calls schedule() to relinquish the current CPU.
After this, the task cannot run again until it gets woken up, typically by another task (a kernel task, etc.) or by an interrupt handler calling a wake_up* function, which wakes up the task(s) sleeping on that particular event; the scheduler will then soon schedule them again. A sketch of this pattern:
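(Kernel-side sketch of the sleep/wake pattern described above; my_wq, my_cond, and both function names are hypothetical.)

#include <linux/errno.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(my_wq);
static int my_cond;

/* syscall path: block until data is ready, giving up the CPU */
static long my_wait_for_data(void)
{
        /* changes task state and calls schedule() until my_cond is true */
        if (wait_event_interruptible(my_wq, my_cond))
                return -ERESTARTSYS;    /* woken early by a signal */
        my_cond = 0;
        return 0;
}

/* interrupt handler / another task: mark data ready, wake sleepers */
static void my_data_ready(void)
{
        my_cond = 1;
        wake_up_interruptible(&my_wq);
}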
It's worth noting that userspace tasks (i.e. threads) are only one type of task; there are a few others internal to the kernel which can do work as well. These are kernel threads and bottom-half handlers / tasklets / task queues, etc. Work which doesn't belong to any particular userspace process (for example, network handling such as responding to pings) gets done in these. These tasks are allowed to go to sleep, unlike interrupt handlers (which should not invoke the scheduler).
http://tldp.org/LDP/khg/HyperNews/get/syscall/syscall86.html
This should help people looking for what happens when the syscall instruction is executed, transferring control to the kernel (user mode to kernel mode). It is based on the x86_64 architecture.
https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html
