Must IRET be used when returning from an interrupt? - linux

IRET restores registers from the stack, including EFLAGS, ESP, EIP and so on, but we could also restore those registers ourselves. For example, movl can be used to reload %esp, and jmp can jump to the saved EIP stored on the stack.
The Linux kernel returns from all interrupts with IRET, which is a heavyweight instruction.
Some kernel operations (like context switches) happen frequently.
Isn't IRET a waste?

Besides all the heavy stuff IRET can and often should do in addition to a mere blend of POPF+RETF, there's one more thing that it does. It has a special function related to non-maskable interrupts (NMIs).
Concurrent NMIs are delivered to the CPU one by one. IRET signals to the NMI circuitry that another NMI can now be delivered. No other instruction can do this signalling.
If NMIs could preempt execution of other NMI ISRs, they would be able to cause a stack overflow, which rarely is a good thing. Unless we're talking about this wonderful website. :)
So, all in all, IRET is not a waste.

Probably because doing all of that manually would take even more CPU cycles.

From Wikipedia:
The actual code that is invoked when an interrupt occurs is called the
Interrupt Service Routine (ISR). When an exception occurs, a program
invokes an interrupt, or the hardware raises an interrupt, the
processor uses one of several methods (to be discussed) to transfer
control to the ISR, whilst allowing the ISR to safely return control
to whatever it interrupted after execution is complete. At minimum,
FLAGS and CS:IP are saved and the ISR's CS:IP loaded; however, some
mechanisms cause a full task switch to occur before the ISR begins
(and another task switch when it ends).
So IRET isn't a waste; it is the minimum (and the fastest way) to return from an ISR. Also, all other CPU registers used in the ISR must be preserved at the beginning and restored before IRET executes!
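As a hedged illustration of that save/restore discipline (a bare-metal sketch, not Linux kernel code; the timer_isr name and tick counter are made up): GCC's x86 "interrupt" attribute generates exactly this pattern, saving every register the handler clobbers on entry, restoring them on exit, and returning with IRET instead of RET.

    /* Compile with gcc -mgeneral-regs-only so the handler doesn't touch
       SSE/x87 state that would also need saving. */
    struct interrupt_frame;                 /* frame the CPU pushed: IP, CS, FLAGS, ... */

    volatile unsigned long ticks;           /* illustrative state updated by the ISR */

    __attribute__((interrupt))
    void timer_isr(struct interrupt_frame *frame)
    {
        (void)frame;                        /* saved return context, unused here */
        ticks++;                            /* acknowledge/handle the device here */
    }                                       /* compiler restores registers and emits IRET */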

Related

Is assembly instruction "inc rax" atomic?

I know that modern CPUs have instruction pipelining, so the execution of every machine instruction is separated into several steps, for example the classic five-stage RISC pipeline. My question is whether the assembly instruction inc rax is atomic when it is executed by different threads. Is it possible that thread A is in the Instruction Execution (EX) stage, incrementing the current value of register rax by 1, while thread B is in the Instruction Decode (ID) stage, reading from register rax a value that thread A has not yet incremented? In that case there would be a data race between threads A and B; is this correct?
TL;DR: For a multithreaded program on x86-64, inc rax cannot cause or suffer any data race issues.
At the machine level, there are two senses of "atomic" that people usually use.
One is atomicity with respect to concurrent access by multiple cores. In this sense, the question doesn't really even make sense, because cores do not share registers; each has its own independent set. So a register-only instruction like inc rax cannot affect, or be affected by, anything that another core may be doing. There is certainly no data race issue to worry about.
Atomicity concerns in this sense only arise when two or more cores are accessing a shared resource - primarily memory.
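To make the contrast concrete, here is a minimal sketch (the counter and worker names are mine): a register-only increment has nothing to race on, but incrementing a shared memory location from two threads does need an atomic read-modify-write, which on x86-64 compiles to a lock-prefixed instruction.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_long counter;                 /* shared between cores: this is where races live */

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            atomic_fetch_add(&counter, 1);      /* emits "lock add" on x86-64 */
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* Always 2000000; with a plain non-atomic ++ some increments could be lost. */
        printf("%ld\n", (long)counter);
        return 0;
    }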
The other is atomicity on a single core with respect to interrupts - if a hardware interrupt or exception occurs while an instruction is executing on the same core, what happens, and what machine state is observed by the interrupt handler? Here we do have to think about registers, because the interrupt handler can observe the same registers that the main code was using.
The answer is that x86 has precise interrupts, where interrupts appear to occur "between instructions". When calling the interrupt handler, the CPU pushes CS:RIP onto the stack, and the architectural state of the machine (registers, memory, etc) is as if:
the instruction pointed to by CS:RIP, and all subsequent instructions, have not begun to execute at all; the architectural state reflects none of their effects.
all instructions previous to CS:RIP have completely finished, and the architectural state reflects all of their effects.
On an old-fashioned in-order scalar CPU, this is easily accomplished by having the CPU check for interrupts as a step in between the completion of one instruction and the execution of the next. On a pipelined CPU, it takes more work; if there are several instructions in flight, the CPU may wait for some of them to retire, and abort the others.
For more details, see When an interrupt occurs, what happens to instructions in the pipeline?
There are a few exceptions to this rule: e.g. the AVX-512 scatter/gather instructions may be partially completed when an interrupt occurs, so that some of the loads/stores have been done and others have not. But it sets the registers in such a way that when returning to execute the instruction again, only the remaining loads/stores will be done.
From the point of view of an application on a multitasking operating system, threads can run simultaneously on several cores, or run sequentially on a single core (or some combination). In the first case, there is no problem with inc rax as the registers are not shared between cores. In the second case, each thread still has its own register set as part of its context. Your thread may be interrupted by a hardware interrupt at any time (e.g. timer tick), and the OS may then decide to schedule in a different thread. To do so, it saves your thread's context, including the register contents at the time of the interrupt - and since we have precise interrupts, these contents reflect instructions in an all-or-nothing fashion. So inc rax is atomic for that purpose; when another thread gets control, the saved context of your thread has either all the effects of inc rax or none of them. (And it usually doesn't even matter, because the only machine state affected by inc rax is registers, and other threads don't normally try to observe the saved context of threads which are scheduled out, even if the OS provides a way to do that.)

Under what circumstances does control pass from userspace to the Linux kernel space?

I'm trying to understand which events can cause a transition from userspace to the Linux kernel. If it's relevant, the scope of this question can be limited to the x86/x86_64 architecture.
Here are some sources of transitions that I'm aware of:
System calls (which include accessing devices) cause a context switch from userspace to kernel space.
Interrupts will cause a context switch. As far as I know, this also includes scheduler preemptions, since a scheduler usually relies on a timer interrupt to do its work.
Signals. It seems like at least some signals are implemented using interrupts but I don't know if some are implemented differently so I'm listing them separately.
I'm asking two things here:
Am I missing any userspace->kernel path?
What are the various code paths that are involved in these context switches?
One you are missing: Exceptions
(which can be further broken down into faults, traps and aborts)
For example a page fault, breakpoint, division by zero or floating-point exception. Technically, one can view exceptions as interrupts but not really the way you have defined an interrupt in your question.
You can find a list of x86 exceptions at this osdev webpage.
With regard to your second question:
What are the various code paths that are involved in these context
switches?
That really depends on the architecture and the OS; you will need to be more specific. For x86, when an interrupt occurs you go through the corresponding IDT entry, and for SYSENTER you land at the address specified in the MSR. What happens after that is completely up to the OS.
No one wrote a complete answer so I will try to incorporate the comments and partial answers into an answer. Feel free to comment or edit the answer to improve it.
For the purposes of this question and answer, userspace to kernel transitions mean a change in processor state that allows access to kernel code and memory. In short I will refer to these transistions as context switches.
When discussing events that can trigger userspace to kernel transitions, it is important to separate the OS constructs we are used to (signals, system calls, scheduling) that require context switches from the way those constructs are implemented using context switches.
In x86, there are two central ways for context switches to occur: interrupts and SYSENTER. Interrupts are a processor feature, which causes a context switch when certain events happen:
Hardware devices may request an interrupt; for example, a timer/clock can cause an interrupt when a certain amount of time has elapsed, and a keyboard can interrupt when keys are pressed. This type is called a hardware interrupt.
Userspace can initiate an interrupt. For example, the old way to perform a system call in Linux on x86 was to execute INT 0x80 with arguments passed through the registers (a sketch of this path appears below). Debugging breakpoints are also implemented using interrupts, with the debugger replacing an instruction with INT 0x3. This type of interrupt is called a software interrupt.
The CPU itself generates interrupts in certain situations, like when memory is accessed without permissions, when a user divides by zero, or when one core must notify another core that it needs to do something. This type of interrupt is called an exception, and you can read more about them in #esm 's answer.
For a broader discussion of interrupts see here: http://wiki.osdev.org/Interrupt
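As a hedged sketch of the legacy INT 0x80 path mentioned above (the write_int80 wrapper name is made up; 4 is the i386 number of the write system call, so this is for a 32-bit build): the arguments are loaded into registers and the software interrupt transfers control to the kernel through its IDT entry.

    #include <stddef.h>

    static long write_int80(int fd, const char *buf, size_t len)
    {
        long ret;
        __asm__ volatile ("int $0x80"
                          : "=a"(ret)                            /* result comes back in EAX */
                          : "a"(4), "b"(fd), "c"(buf), "d"(len)  /* EAX=__NR_write, args in EBX/ECX/EDX */
                          : "memory");
        return ret;
    }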
SYSENTER is an instruction that provides a faster, more modern path into the kernel for the particular case of performing a system call (64-bit Linux uses the analogous SYSCALL instruction).
The code that handles the context switching due to interrupts or SYSENTER in Linux can be found in arch/x86/kernel/entry_{32|64}.S.
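For comparison, a sketch of the 64-bit fast path as seen from userspace (the write_syscall wrapper is illustrative): the syscall instruction enters the kernel directly at the MSR-configured entry point rather than through the IDT; on x86-64, write is system call number 1, arguments go in RDI/RSI/RDX, and the CPU clobbers RCX and R11.

    #include <stddef.h>

    static long write_syscall(int fd, const char *buf, size_t len)
    {
        long ret;
        __asm__ volatile ("syscall"
                          : "=a"(ret)                       /* result comes back in RAX */
                          : "a"(1), "D"(fd), "S"(buf), "d"(len)
                          : "rcx", "r11", "memory");        /* SYSCALL overwrites RCX and R11 */
        return ret;
    }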
There are many situations in which a higher-level Linux construct might cause a context switch. Here are a few examples:
If a system call reaches an int 0x80 or sysenter instruction, a context switch occurs. Some system call routines can obtain the information the caller asked for entirely from userspace (for example gettimeofday via the vDSO); in that case, no context switch occurs.
Many times scheduling doesn't require an interrupt: a thread will perform a system call, and the return from the syscall is delayed until it is scheduled again. For processes that are in a section where syscalls aren't performed, Linux relies on timer interrupts to regain control.
Accessing a virtual memory location that has been paged out will cause a page fault, and therefore a context switch.
Signals are usually delivered when a process is already "switched out" (see comments by #caf on the question), but sometimes an inter-processor interrupt is used to deliver the signal between two running processes.

Does the flag register need to be saved when an interrupt occurs, or a process scheduling happens?

I know all the general registers are pushed on the stack when an interrupt happens, but I can't see any code where the flags register is saved. An instruction like setl, which depends on the flags register, could easily produce a wrong result after returning from an interrupt if the flags register were corrupted.
Yes, the (e/r)flags register needs to be saved across context switches like that.
All interrupts (hardware and software, including exceptions) save it automatically on the stack and the iret instruction at the end of the ISR restores it.
System calls use the same or similar mechanism and preserve the register.
Scheduling is triggered by interrupts or system calls. So, everything's covered.
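A small sketch of why that matters (the is_less helper is made up): the setl below consumes the flags produced by the cmp, and a hardware interrupt may arrive between the two instructions; because the CPU pushes (R)FLAGS when taking the interrupt and IRET restores it, the comparison result is still correct.

    static int is_less(long a, long b)
    {
        unsigned char r;
        __asm__ ("cmp %2, %1\n\t"   /* sets SF/OF/ZF; an IRQ could be taken right here... */
                 "setl %0"          /* ...yet IRET restored RFLAGS, so this still reads a < b */
                 : "=q"(r)
                 : "r"(a), "r"(b)
                 : "cc");
        return r;
    }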

How processor deals with instruction upon interrupt

What happens if the CPU receives an interrupt in the middle of a long instruction? Will the CPU execute the whole instruction or only part of it?
From a programmer's point of view, either a specific instruction has retired, with all of its side effects committed to registers/memory, or it hasn't (and it is as if the instruction was never executed at all). The whole point of instruction retirement is to guarantee a coherent view of the program state at the point of external events, such as interrupts.
That's notably why instructions retire in order: external observers can still look at the architectural state of the CPU as if it were executing instructions sequentially.
There are exceptions to this, notably the REP-string class of instructions.
I believe this is what you asked about, but if it is not, then let me ask you: how would you observe that an instruction was "partially" executed from anywhere ?
As far as I know, it depends on the processor and the instruction. More specifically, it depends on whether and when the processor samples for pending interrupts. If the processor only looks for pending interrupts after completing the current instruction, then clearly nothing will interrupt an instruction in the middle. However, if a long instruction is executing, it might be beneficial (latency-wise) to sample for interrupts several times during the instruction's execution. There is a downside to this: you would then have to undo any changes made to registers and flags by the instruction's partial execution, because after the interrupt completes you would have to go back and reissue that instruction.
(Source: Wikipedia, http://en.wikipedia.org/wiki/Interrupt)
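Regarding the REP-string exception mentioned above, a hedged sketch (the copy_rep_movsb wrapper is illustrative): rep movsb keeps its progress in architectural registers, with RCX holding the bytes remaining and RSI/RDI the current pointers, so the CPU can take an interrupt part-way through the copy and simply re-execute the same instruction afterwards, resuming where it left off.

    #include <stddef.h>

    static void copy_rep_movsb(void *dst, const void *src, size_t len)
    {
        /* RCX/RSI/RDI are updated as the copy proceeds, which is what makes the
           instruction safely interruptible and restartable. */
        __asm__ volatile ("rep movsb"
                          : "+D"(dst), "+S"(src), "+c"(len)
                          :
                          : "memory");
    }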

Internals of a Linux system call

What happens (in detail) when a thread makes a system call by raising interrupt 80? What work does Linux do to the thread's stack and other state? What changes are done to the processor to put it into kernel mode? After running the interrupt handler, how is control restored back to the calling process?
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded and how does it then obtain control again?
A crash course in kernel mode in one Stack Overflow answer
Good questions! (Interview questions?)
What happens (in detail) when a
thread makes a system call by raising
interrupt 80?
The int $80 operation is vaguely like a function call. The CPU "takes a trap" and restarts at a known address in kernel mode, typically with a different MMU mode as well. The kernel will save many of the registers, though it doesn't have to save the registers that a program would not expect an ordinary function call to save.
What work does Linux do to the
thread's stack and other state?
Typically an OS will save registers that the ABI promises not to change during procedure calls. The stack will stay the same; the kernel will run on a per-thread kernel stack rather than the per-thread user stack. Naturally some state will change, otherwise there would be no reason to do the system call.
What changes are done to the
processor to put it into kernel mode?
This is usually entirely automatic. The CPU has, generically, a software-interrupt instruction that is a bit like a functional-call operation. It will cause the switch to kernel mode under controlled conditions. Typically, the CPU will change some sort of PSW protection bit, save the old PSW and PC, start at a well-known trap vector address, and may also switch to a different memory management protection and mapping arrangement.
After running the interrupt handler,
how is control restored back to the
calling process?
There will typically be some sort of "return from interrupt" or "return from trap" instruction that acts a bit like a complicated function-return instruction. Some RISC processors did very little automatically and required specific code to do the return, while some CISC processors like x86 have (never-really-used) instructions that would execute dozens of operations, documented in pages of architecture-manual pseudo-code, for capability adjustments.
What if the system call can't be
completed quickly: e.g. a read from
disk. How does the interrupt handler
relinquish control so that the
processor can do other stuff while
data is being loaded and how does it
then obtain control again?
The kernel itself is threaded much like a threaded user program is. It just switches stacks (threads) and works on someone else's process for a while.
To answer the last part of the question - what does the kernel do if the system call needs to sleep -
After a system call, the kernel is still logically running in the context of the same task that made the system call - it's just in kernel mode rather than user mode - it is NOT a separate thread and most system calls do not invoke logic from another task/thread. What happens is that the system call calls wait_event, or wait_event_timeout or some other wait function, which adds the task to a list of tasks waiting for something, then puts the task to sleep, which changes its state, and calls schedule() to relinquish the current CPU.
After this the task cannot be run again until it gets woken up, typically by another task (kernel task, etc) or interrupt handler calling a wake* function which will wake up the task(s) sleeping waiting for that particular event, which means the scheduler will soon schedule them again.
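A hedged sketch of that sleep/wake pattern using the real wait_event/wake_up kernel APIs (the data_ready flag, the my_read / my_irq_handler names, and the surrounding driver are hypothetical):

    #include <linux/wait.h>
    #include <linux/sched.h>
    #include <linux/interrupt.h>
    #include <linux/errno.h>
    #include <linux/types.h>

    static DECLARE_WAIT_QUEUE_HEAD(my_wq);
    static int data_ready;

    /* Runs in process context from a read() system call: sleeps (via schedule())
       until the condition becomes true. */
    static ssize_t my_read(char *buf, size_t len)
    {
        if (wait_event_interruptible(my_wq, data_ready))
            return -ERESTARTSYS;            /* woken by a signal instead of data */
        data_ready = 0;
        /* ... copy the data to buf ... */
        return len;
    }

    /* Interrupt handler: must not sleep, so it only records the condition and
       wakes the waiting task; the scheduler will run that task again soon. */
    static irqreturn_t my_irq_handler(int irq, void *dev)
    {
        data_ready = 1;
        wake_up_interruptible(&my_wq);
        return IRQ_HANDLED;
    }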
It's worth noting that userspace tasks (i.e. threads) are only one type of task; there are a few others internal to the kernel which can do work as well - kernel threads and bottom half handlers / tasklets / task queues etc. Work which doesn't belong to any particular userspace process (for example network handling, such as responding to pings) gets done in these. These tasks are allowed to go to sleep, unlike interrupt handlers (which should not invoke the scheduler).
http://tldp.org/LDP/khg/HyperNews/get/syscall/syscall86.html
This should help people looking for an answer to what happens when the syscall instruction is executed and transfers control from user mode to kernel mode. It is based on the x86_64 architecture.
https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html
