Sending user-mode interrupts on x86 - linux

On Linux x86, can I send interrupts (e.g., triggered by a timer or some other mechanism) that will be handled by code running in user mode?
Assuming the answer is yes (and it almost certainly is, see e.g. timer_create), does delivering this interrupt happen entirely in user mode, or is there a kernel transition involved (e.g., the interrupt is initially handled by the kernel, which then sends a signal to the user process)?

All kernel timer interfaces work by delivering signals to user-space processes after the kernel has handled the timer interrupt (or has otherwise noticed or waited until the deadline was reached).
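For example (my sketch, not from the original answer; error handling omitted), a POSIX timer created with timer_create() is armed inside the kernel; when it expires, the kernel services its own timer interrupt and queues SIGRTMIN for the process, and only then does the user-mode handler run:

    /* Build with: gcc timer.c -o timer (add -lrt on older glibc). */
    #include <signal.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    static void on_timer(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)si; (void)ctx;
        /* Runs in user mode, but delivery went through the kernel. */
        write(STDOUT_FILENO, "tick\n", 5);   /* async-signal-safe */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_flags = SA_SIGINFO;
        sa.sa_sigaction = on_timer;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGRTMIN, &sa, NULL);

        struct sigevent sev;
        memset(&sev, 0, sizeof sev);
        sev.sigev_notify = SIGEV_SIGNAL;     /* notify with a signal ...   */
        sev.sigev_signo = SIGRTMIN;          /* ... namely SIGRTMIN        */

        timer_t timer;
        timer_create(CLOCK_MONOTONIC, &sev, &timer);

        struct itimerspec its;
        memset(&its, 0, sizeof its);
        its.it_value.tv_sec = 1;             /* first expiry after 1 s     */
        its.it_interval.tv_sec = 1;          /* then once per second       */
        timer_settime(timer, 0, &its, NULL);

        for (;;)
            pause();                         /* wait for deliveries        */
    }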
There are many big obstacles to having an interrupt handler run in ring 3, or at a user-space virtual address that is only mapped by one specific process. Even if you pin that memory so it can't be paged out, it is still only mapped at all while CR3 points to that process's page tables. x86 uses virtual addresses in the IDT (interrupt descriptor table), and the page must be mapped when the interrupt fires (otherwise you presumably get a page fault, which you really don't want to happen completely asynchronously). This is not a problem for normal kernel interrupt handlers: the kernel keeps its code mapped at the same virtual address in every process's page tables.
A kernel API that allowed registering a user-space function pointer as a ring 0 interrupt handler would be handing that userspace process the keys to the kingdom, literally letting it run with kernel privileges, so that is pretty much out of the question.
It is technically possible for x86 to have an interrupt handler that runs in ring 3, but if the interrupt fired while in ring 0, iret would fault instead of returning back to the kernel code that got interrupted.
An interrupt handler has to be written specially: it must return with iret and preserve all registers, e.g. using GCC's x86 __attribute__((interrupt)) (https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html). And any other process on the same core would be at the mercy of this process; any bug in it (like clobbering architectural state, or dirtying the SSE/AVX registers) could affect other processes. (That is, if you could even figure out how to get code from one process to run while CR3 might be set for another process...)
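For reference, this is roughly what such a handler looks like with GCC's x86 interrupt attribute (my sketch; the function and file names are made up, and this only illustrates the calling convention, it does not make user-space interrupt handling possible):

    /* Build with: gcc -mgeneral-regs-only -c handler.c
     * The compiler saves/restores the general-purpose registers it uses
     * and returns with iret instead of ret. */
    struct interrupt_frame;                 /* layout is target-defined */

    __attribute__((interrupt))
    void my_handler(struct interrupt_frame *frame)
    {
        (void)frame;
        /* Minimal work only: no SSE/AVX, no blocking, no printf. */
    }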
Avoiding deadlocks would also be a big issue: inside the kernel there are a lot of limits on what you can do in an interrupt handler proper (the "top half"), because it can run asynchronously between any two instructions (unless you disable interrupts on that core).
I don't think it's really plausible for Linux to let you do this; even if you somehow solve all the (very hard) problems and even get the handler to run in ring 3, the kernel still has to trust it not to step on the architectural state of any other process.
There is precedent for things like X servers getting privileges to run in/out instructions (via iopl) and/or access /dev/mem (which would in theory let it steal info from other processes). But this would be even worse, and give you easy access to snapshots of register state from other processes.


Difference between Kernel, Kernel-Thread and User-Thread

I'm not sure if I totally understand the differences mentioned above, so I'd like to explain them in my own words, and you can interrupt me wherever I go wrong:
"A kernel is the initial piece of code which creates kernel threads. Kernel threads are processes managed by the kernel. User threads are part of a process. If you have a single-threaded process, then the whole process itself would be a user thread. User threads make system calls, and these system calls are served by a specific kernel thread which belongs to the calling user thread. So for every user thread which makes a system call, a kernel thread is created, and after the kernel thread has done its job, it gives control back to the user thread and the kernel thread is destroyed."
Would this be OK?
Thank you!
Many greetings from Germany!
I don't think that's a very good mental model for kernel vs user. I think it's useful to look at the implementation of these abstractions in order to fully understand them:
What is a Kernel?
A kernel is basically just a piece of memory. It was privileged enough to be loaded before anything else, thereby allowing it to set the CPU's interrupt vectors.
Interrupts control everything, including I/O, timers, and virtual memory. That means that the kernel gets to decide how all that is handled.
A library is also just a piece of memory, and you can very well look at the kernel as the "system call library", among other things. But because the kernel represents the hardware, that piece of memory is shared among everyone.
Kernel Mode vs User Mode
Kernel mode is the CPU's "natural" mode, with no restrictions (on x86 CPUs, "ring 0"). User mode (on x86 CPUs, "ring 3") is when the CPU is instructed to trigger an interrupt whenever certain instructions are used or certain memory locations are accessed. This lets the kernel have the CPU execute specific kernel code when the user tries to access kernel memory, or memory representing I/O ports, or hardware memory such as the GPU's frame buffer.
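As a small illustration of that trapping (my example, not part of the original answer): executing a privileged instruction such as hlt from ring 3 makes the CPU raise a general-protection fault, which the kernel's fault handler turns into a SIGSEGV for the process:

    /* hlt is only legal in ring 0; in ring 3 the CPU traps into the
     * kernel, which kills the process with SIGSEGV. */
    #include <stdio.h>

    int main(void)
    {
        puts("about to execute hlt in user mode...");
        __asm__ volatile("hlt");        /* traps into the kernel */
        puts("never reached");
        return 0;
    }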
Processes and Threads
A process is also just a piece of memory, consisting of its own heap and the memory used by libraries, among which is the kernel.
A thread (= a unit of scheduling) is just a stack with an ID that the kernel knows of and tracks. That's the call stack that the CPU uses when the thread is running. User threads have 2 stacks: one for user mode and one for kernel mode - but they still have the same ID.
Because the kernel controls timers, it sets up a timer to go off e.g. every 1 ms. When the timer triggers ("timer interrupt"), the CPU runs the callback that the kernel set up for that interrupt, where the kernel can see that the current thread has been running for a while and decide to unschedule it and schedule another thread instead.
Virtual Memory Context
By "virtual memory context" I mean all the memory that can be accessed by the CPU. This includes all the memory of the process - including the user-mode heap and memory of libraries, user-mode call stacks of all process threads, kernel-mode stack of all threads in the system, the kernel's heap memory, I/O ports, and hardware memory.
When an interrupt or a system call occurs, the virtual memory context doesn't change; only a CPU flag is flipped (i.e. from ring 3 to ring 0) and the CPU is now back in its "natural" kernel mode where it can freely access kernel memory, I/O ports and hardware memory.
When a new process is created, what actually happens is that a new thread is created, and assigned a new virtual memory context. Therefore, every process starts as single-threaded. That thread can later ask the kernel via a system call to create more threads (= stacks) which share its virtual memory context (= process), or ask the kernel to create more threads, each with a new virtual memory context (= new processes).
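To make that concrete (my sketch, build with gcc demo.c -pthread): pthread_create() asks the kernel for a new thread that shares the caller's virtual memory context, while fork() asks for a new thread with its own (copy-on-write) context, i.e. a new process:

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int shared = 0;

    static void *thread_fn(void *arg)
    {
        shared = 1;                    /* same address space: visible to main */
        return arg;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, thread_fn, NULL);
        pthread_join(t, NULL);
        printf("after thread: shared = %d\n", shared);   /* prints 1 */

        shared = 0;
        pid_t pid = fork();            /* new virtual memory context (process) */
        if (pid == 0) {
            shared = 1;                /* modifies only the child's copy */
            _exit(0);
        }
        waitpid(pid, NULL, 0);
        printf("after fork:   shared = %d\n", shared);   /* still prints 0 */
        return 0;
    }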
Kernel Threads
Like any other library, the kernel can have its own background threads for optimization purposes. When such a need arises (which can happen in the memory context of any process when servicing a system call), the kernel will create new threads and give them a special memory context, which is a context that only contains the kernel's memory, with no access to memory of any process.
You're mixing up a few somewhat different concepts.
To follow from what you wrote, there is a Kernel, which is a piece of code that handles all internal operations of the Operating System. It does create kernel threads, but the Kernel threads are nothing special. They are just threads which run in "Kernel-Mode" and are not associated with any "User-Mode" process.
Now we come to a concept that is missing from your explanation and is the key to understanding this better: Kernel-Mode (sometimes called system mode), which together with User-Mode makes up the CPU modes available to the OS.
Kernel-Mode is a kind of trusted execution mode, which allows the code to access any memory and execute any instruction. It handles I/O and system interrupts.
User-Mode is a limited mode, which does not allow the executing code to access any memory address except those associated with the User-Mode process.
Also, User-Mode code cannot access I/O or many OS-related functions (such as handle or process creation) directly. For these operations, User-Mode code has to call into Kernel-Mode through a system call (as you have correctly mentioned).
A system call is made with a special CPU instruction which switches the CPU to Kernel-Mode and starts executing special OS code that dispatches the different system calls. So the work is NOT scheduled onto a Kernel-Mode thread; instead, the OS (kernel/trusted) code executes in the context of the same User-Mode thread. The only thing that changes is that the CPU mode switches to Kernel-Mode.
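For example (my illustration, Linux-specific), a call made through the generic syscall(2) wrapper runs kernel code on the calling thread itself; no kernel thread is created for it:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        /* The kernel code servicing this runs on this same thread,
         * just in kernel mode; it is a mode switch, not a thread switch. */
        long tid = syscall(SYS_gettid);
        printf("gettid() returned %ld\n", tid);
        return 0;
    }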
As for completing jobs on a Kernel thread: although in some cases some operations (e.g. I/O) might be handed off to a separate Kernel thread to complete, Kernel threads are not created and destroyed as part of making a system call.
Backed by 10+ years of driver development experience. See also:
http://www.linfo.org/kernel_mode.html
https://learn.microsoft.com/en-us/windows-hardware/drivers/gettingstarted/user-mode-and-kernel-mode

Under what circumstances does control pass from userspace to the Linux kernel space?

I'm trying to understand which events can cause a transition from userspace to the linux kernel. If it's relevant, the scope of this question can be limited to the x86/x86_64 architecture.
Here are some sources of transitions that I'm aware of:
System calls (which include accessing devices) cause a context switch from userspace to kernel space.
Interrupts will cause a context switch. As far as I know, this also includes scheduler preemptions, since a scheduler usually relies on a timer interrupt to do its work.
Signals. It seems like at least some signals are implemented using interrupts but I don't know if some are implemented differently so I'm listing them separately.
I'm asking two things here:
Am I missing any userspace->kernel path?
What are the various code paths that are involved in these context switches?
One you are missing: Exceptions
(which can be further broken down into faults, traps and aborts)
For example a page fault, breakpoint, division by zero or floating-point exception. Technically, one can view exceptions as interrupts but not really the way you have defined an interrupt in your question.
You can find a list of x86 exceptions at this osdev webpage.
With regard to your second question:
What are the various code paths that are involved in these context
switches?
That really depends on the architecture and the OS; you will need to be more specific. On x86, when an interrupt occurs you go through the IDT entry, and for SYSENTER you go to the address specified in the MSR. What happens after that is completely up to the OS.
No one wrote a complete answer so I will try to incorporate the comments and partial answers into an answer. Feel free to comment or edit the answer to improve it.
For the purposes of this question and answer, userspace to kernel transitions mean a change in processor state that allows access to kernel code and memory. For brevity I will refer to these transitions as context switches.
When discussing events that can trigger userspace-to-kernel transitions, it is important to separate the OS constructs we are used to (signals, system calls, scheduling), which require context switches, from the way these constructs are implemented using context switches.
In x86, there are two central ways for context switches to occur: interrupts and SYSENTER. Interrupts are a processor feature, which causes a context switch when certain events happen:
Hardware devices may request an interrupt; for example, a timer/clock can cause an interrupt when a certain amount of time has elapsed, and a keyboard can interrupt when keys are pressed. This type of interrupt is called a hardware interrupt.
Userspace can initiate an interrupt. For example, the old way to perform a system call in Linux on x86 was to execute INT 0x80 with arguments passed through the registers. Debugging breakpoints are also implemented using interrupts, with the debugger replacing an instruction with INT 0x3. This type of an interrupt is called a software interrupt.
The CPU itself generates interrupts in certain situations, like when memory is accessed without permission or when a program divides by zero. This type of interrupt is called an exception, and you can read more about them in #esm 's answer. (One core notifying another core that it needs to do something also uses an interrupt, an inter-processor interrupt, but that behaves more like a hardware interrupt than an exception.)
For a broader discussion of interrupts see here: http://wiki.osdev.org/Interrupt
SYSENTER is an instruction that provides the modern path for causing a context switch in the particular case of performing a system call (on x86-64 the analogous instruction is SYSCALL).
The code that handles the context switching due to interrupts or SYSENTER in Linux can be found in arch/x86/kernel/entry_{32|64}.S.
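To illustrate the two user-space entry points mentioned above (my sketch, x86-64, error handling omitted): the modern path executes the syscall instruction with the call number in rax and the arguments in rdi/rsi/rdx, while the legacy 32-bit path instead executes int $0x80 with the number in eax and the arguments in ebx/ecx/edx:

    #include <sys/syscall.h>      /* SYS_write */

    /* write(1, buf, len) via the raw x86-64 syscall instruction. */
    static long raw_write(const char *buf, unsigned long len)
    {
        long ret;
        __asm__ volatile(
            "syscall"                      /* enter the kernel               */
            : "=a"(ret)                    /* return value comes back in rax */
            : "a"(SYS_write),              /* syscall number in rax          */
              "D"(1L),                     /* rdi = fd (stdout)              */
              "S"(buf),                    /* rsi = buffer                   */
              "d"(len)                     /* rdx = length                   */
            : "rcx", "r11", "memory");     /* clobbered by the syscall ABI   */
        return ret;
    }

    int main(void)
    {
        raw_write("hello from a raw syscall\n", 25);
        return 0;
    }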
There are many situations in which a higher-level Linux construct might cause a context switch. Here are a few examples:
If a system call gets to the int 0x80 or sysenter instruction, a context switch occurs. Some system call routines can use information mapped into userspace (e.g. through the vDSO) to get the information the system call was meant to get; in that case, no context switch occurs.
Many times scheduling doesn't require an interrupt: a thread will perform a system call, and the return from the syscall is delayed until it is scheduled again. For processes that are in a section where no syscalls are performed, Linux relies on timer interrupts to regain control.
A virtual memory access to a location that has been paged out causes a page fault, and therefore a context switch (a sketch follows this list).
Signals are usually delivered when a process is already "switched out" (see comments by #caf on the question), but sometimes an inter-processor interrupt is used to deliver the signal between two running processes.
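As a small demonstration of page faults as kernel entries (my sketch; it shows demand paging of fresh anonymous memory rather than swap-in, but the entry mechanism is the same): each first touch of an mmap'ed page traps into the kernel, and getrusage() lets you watch the fault counter grow:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    static long minor_faults(void)
    {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_minflt;
    }

    int main(void)
    {
        const size_t len = 64 * 4096;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        long before = minor_faults();
        memset(p, 0xab, len);              /* first touch: one fault per page */
        long after = minor_faults();

        printf("minor page faults taken: %ld\n", after - before);
        munmap(p, len);
        return 0;
    }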

Linux Process Context and SVC call in ARM

As per some Linux books
kernel code that services system calls issued by user applications runs on behalf of the corresponding application process and is said to be executing in process context. Interrupt Handlers run in interrupt context.
Now svc and irq are two exceptions.
So when Linux is handling an svc it is in process context, and while it is handling an irq it is in interrupt context. Is that how it is mapped?
Just one edit to this
It is also mentioned in books that tasklets / softirqs run in interrupt context while workqueues run in process context. So does that mean a tasklet would run with CPSR.mode = IRQ?
If I understand your confusion correctly:
Since Linux is a capable, preemptive, complex operating system, it handles concepts such as interrupts and software traps in a much finer-grained way than bare-metal software does.
For example, when a supervisor call (svc) happens, the hardware switches to SVC mode; Linux does little more there than prepare some data structures for handling the call further, then leaves SVC mode so the core can continue serving in user mode. This keeps the core free to take many more exceptions instead of blocking them.
It is the same for IRQ mode: Linux does the bare minimum in IRQ mode. It records which IRQ happened, which handler should be invoked and so on, then exits IRQ mode immediately to allow more to happen on that core. Later, some other internal kernel thread may process that interrupt further. Since the hardware, while relatively simple, runs very fast, handling of an interrupt proceeds in parallel with many processes.
The downside of this advanced approach is that it gives no guarantees on response time, and its overhead becomes visible on slower hardware such as MCUs.
So ARM's exception modes provide Linux with two things, message type and priority, backed by hardware support.
Message type is what the exception mode conveys: whether it was an SVC, IRQ, FIQ, DATA ABORT, UNDEFINED INSTRUCTION, etc. So when the hardware enters an exception mode, Linux implicitly knows what it is handling.
Priority is about providing capable and responsive hardware; for example, the system should be able to acknowledge an interrupt while handling some less important supervisor call.
Hardware support makes handling the above two easier and faster. For example, some registers are banked, and there is an extra system mode to make handling reentrant IRQs easier.
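To tie this back to the question (my sketch, 32-bit ARM Linux EABI only): a system call from user space is exactly an svc executed in user mode, with the call number in r7 and the arguments in r0-r5; the core switches to SVC mode, the kernel does its bookkeeping, and execution returns to user mode:

    #include <stdio.h>
    #include <sys/syscall.h>       /* SYS_getpid */

    static long sys_getpid_raw(void)
    {
        register long r7 __asm__("r7") = SYS_getpid;  /* syscall number */
        register long r0 __asm__("r0");               /* return value   */
        __asm__ volatile("svc #0"                     /* enter SVC mode */
                         : "=r"(r0)
                         : "r"(r7)
                         : "memory");
        return r0;
    }

    int main(void)
    {
        printf("pid via raw svc: %ld\n", sys_getpid_raw());
        return 0;
    }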

Linux Interrupts Concurency

Are interrupts executed on all processors, or only on one?
For instance, when I type, do all processors handle the interrupt? Or only one of them, while the rest carry on with other tasks?
Here's a high-level view of the low-level processing. I'm describing a simple, typical architecture; real architectures can be more complex or differ in ways that don't matter at this level of detail.
When an interrupt occurs, the processor looks if interrupts are masked. If they are, nothing happens until they are unmasked. When interrupts become unmasked, if there are any pending interrupts, the processor picks one.
Then the processor executes the interrupt by branching to a particular address in memory. The code at that address is called the interrupt handler. When the processor branches there, it masks interrupts (so the interrupt handler has exclusive control) and saves the contents of some registers in some place (typically other registers).
The interrupt handler does what it must do, typically by communicating with the peripheral that triggered the interrupt to send or receive data. If the interrupt was raised by the timer, the handler might trigger the OS scheduler, to switch to a different thread. When the handler finishes executing, it executes a special return-from-interrupt instruction that restores the saved registers and unmasks interrupts.
The interrupt handler must run quickly, because it's preventing any other interrupt from running. In the Linux kernel, interrupt processing is divided into two parts (a sketch follows this list):
The “top half” is the interrupt handler. It does the minimum necessary, typically communicating with the hardware and setting a flag somewhere in kernel memory.
The “bottom half” does any other necessary processing, for example copying data into process memory, updating kernel data structures, etc. It can take its time and even block waiting for some other part of the system since it runs with interrupts enabled.
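A hedged sketch of how this split can look inside a Linux driver (the names, the IRQ number and the device pointer are hypothetical; a real driver gets them from its probe code). The top half only acknowledges the hardware and defers the rest to a work item that later runs in process context:

    #include <linux/interrupt.h>
    #include <linux/workqueue.h>

    static struct work_struct mydev_work;

    /* Bottom half: runs later in process context and may sleep. */
    static void mydev_bottom_half(struct work_struct *work)
    {
        /* copy data into buffers, update driver state, wake up readers... */
    }

    /* Top half: runs in interrupt context, must be quick. */
    static irqreturn_t mydev_top_half(int irq, void *dev_id)
    {
        /* acknowledge the device, grab the urgent data, set a flag... */
        schedule_work(&mydev_work);        /* defer the heavy lifting */
        return IRQ_HANDLED;
    }

    /* In the driver's probe/init path (irq number is device-specific): */
    static int mydev_setup_irq(int irq, void *dev)
    {
        INIT_WORK(&mydev_work, mydev_bottom_half);
        return request_irq(irq, mydev_top_half, IRQF_SHARED, "mydev", dev);
    }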

Internals of a Linux system call

What happens (in detail) when a thread makes a system call by raising interrupt 80? What work does Linux do to the thread's stack and other state? What changes are done to the processor to put it into kernel mode? After running the interrupt handler, how is control restored back to the calling process?
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded and how does it then obtain control again?
A crash course in kernel mode in one stack overflow answer
Good questions! (Interview questions?)
What happens (in detail) when a thread makes a system call by raising interrupt 80?
The int $0x80 operation is vaguely like a function call. The CPU "takes a trap" and restarts at a known address in kernel mode, typically with a different MMU mode as well. The kernel will save many of the registers, though it doesn't have to save the registers that a program would not expect an ordinary function call to save.
What work does Linux do to the thread's stack and other state?
Typically an OS will save registers that the ABI promises not to change during procedure calls. The stack will stay the same; the kernel will run on a per-thread kernel stack rather than the per-thread user stack. Naturally some state will change, otherwise there would be no reason to do the system call.
What changes are done to the processor to put it into kernel mode?
This is usually entirely automatic. The CPU has, generically, a software-interrupt instruction that is a bit like a functional-call operation. It will cause the switch to kernel mode under controlled conditions. Typically, the CPU will change some sort of PSW protection bit, save the old PSW and PC, start at a well-known trap vector address, and may also switch to a different memory management protection and mapping arrangement.
After running the interrupt handler, how is control restored back to the calling process?
There will be some sort of "return from interrupt" or "return from trap" instruction, typically, that will act a bit like a complicated function-return instruction. Some RISC processors did very little automatically and required specific code to do the return and some CISC processors like x86 have (never-really-used) instructions that would execute dozens of operations documented in pages of architecture-manual pseudo-code for capability adjustments.
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded, and how does it then obtain control again?
The kernel itself is threaded much like a threaded user program is. It just switches stacks (threads) and works on someone else's process for a while.
To answer the last part of the question - what does the kernel do if the system call needs to sleep -
After a system call, the kernel is still logically running in the context of the same task that made the system call; it is just in kernel mode rather than user mode. It is NOT a separate thread, and most system calls do not invoke logic from another task/thread. What happens is that the system call calls wait_event, wait_event_timeout, or some other wait function, which adds the task to a list of tasks waiting for something, then puts the task to sleep by changing its state and calling schedule() to relinquish the current CPU.
After this the task cannot run again until it gets woken up, typically by another task (kernel task, etc.) or by an interrupt handler calling a wake* function, which wakes up the task(s) sleeping on that particular event; the scheduler will then soon schedule them again.
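In kernel code the sleep/wake pattern looks roughly like this (a sketch with hypothetical names, not a complete driver):

    #include <linux/errno.h>
    #include <linux/sched.h>
    #include <linux/wait.h>

    static DECLARE_WAIT_QUEUE_HEAD(mydev_waitq);
    static int mydev_data_ready;

    /* Called from a system call, i.e. in the context of the calling task. */
    static int mydev_read_blocking(void)
    {
        /* Sleeps (via schedule()) until the condition becomes true;
         * the CPU runs other tasks in the meantime. */
        if (wait_event_interruptible(mydev_waitq, mydev_data_ready))
            return -ERESTARTSYS;           /* interrupted by a signal */
        mydev_data_ready = 0;
        return 0;
    }

    /* Called from an interrupt handler (or another task) when data arrives. */
    static void mydev_data_arrived(void)
    {
        mydev_data_ready = 1;
        wake_up_interruptible(&mydev_waitq);   /* make the sleeper runnable */
    }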
It's worth noting that userspace tasks (i.e. threads) are only one type of task; there are a few others internal to the kernel which can do work as well - these are kernel threads and bottom-half handlers / tasklets / task queues etc. Work which doesn't belong to any particular userspace process (for example network handling, e.g. responding to pings) gets done in these. These tasks are allowed to go to sleep, unlike interrupt handlers (which should not invoke the scheduler).
http://tldp.org/LDP/khg/HyperNews/get/syscall/syscall86.html
This should help anyone looking for an answer to what happens when the syscall instruction is executed and control is transferred to the kernel (user mode to kernel mode). It is based on the x86_64 architecture.
https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html
