How a processor deals with an instruction upon an interrupt - multithreading

What will happen if the CPU receives an interrupt in the middle of a long instruction? Will the CPU execute the whole instruction or only part of it?

From a programmer's point of view, either a specific instruction is retired, with all its side effects committed to registers/memory, or it isn't (and it's as if the instruction wasn't executed at all). The whole point of instruction retirement is to guarantee a coherent view of the program state at the point of external events, such as interrupts.
That's notably why instructions retire in order: so that external observers can still look at the architectural state of the CPU as if it were executing instructions sequentially.
There are exceptions to this, notably the REP-string class of instructions.
I believe this is what you asked about, but if it is not, then let me ask you: how would you observe that an instruction was "partially" executed from anywhere?
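As an aside on the REP-string exception mentioned above: here is a minimal sketch (GNU C inline asm, x86-64 GCC/Clang assumed; the function name is invented) of a copy done with REP MOVSB. If an interrupt arrives mid-copy, the CPU leaves RCX/RSI/RDI recording the progress so far and RIP still pointing at the rep movsb, so after the handler returns the instruction simply resumes - partial progress is architecturally visible only through those registers.

    #include <stdio.h>

    /* Copy len bytes with REP MOVSB; interruptible and resumable per byte. */
    static void copy_rep_movsb(void *dst, const void *src, unsigned long len)
    {
        asm volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(len) /* RDI, RSI, RCX updated in place */
                     :
                     : "memory");
    }

    int main(void)
    {
        char src[16] = "hello";
        char dst[16] = {0};
        copy_rep_movsb(dst, src, sizeof src);
        printf("%s\n", dst);   /* prints "hello" */
        return 0;
    }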

As far as I know, it depends on the processor and the instruction. More specifically, it depends on whether and when the processor samples for pending interrupts. If the processor only looks for pending interrupts after completing the current instruction, then clearly nothing will interrupt an instruction in the middle. However, if a long instruction is executing, it might be beneficial (latency-wise) to sample for interrupts several times during the instruction's execution. There is a downside to this: in that case, the processor has to restore or discard any changes made to registers and flags by the instruction's partial execution, because after the interrupt completes it has to go back and reissue that instruction.
(Source: Wikipedia, http://en.wikipedia.org/wiki/Interrupt)

Related

Does UMWAIT make the process do REP NOP or context switch immediately?

Does calling UMWAIT make the process do REP NOP (= keep using its hardware thread, not evicted, but use less power by not issuing uops to the processor back-end) until its scheduled time is over?
Or does it cause the process to be evicted right away through a context switch?
Yes, umwait (the user-mode version of mwait, with a limit on how deep a sleep it can request) is basically like pause (encoded as rep nop, which is how it executes on ancient CPUs that don't recognize it as a pause instruction).
It doesn't make a yield() system call or otherwise trap to the OS. The same goes for mwait in kernel mode: it sleeps the CPU core rather than trapping. Kernels use it to put the CPU into a C-state until the next interrupt. (I think it was originally designed for actually waiting for memory writes from another core, but now one of its primary purposes is an API that includes a sleep level, unlike hlt, so it's how CPUs expose deep sleep levels. The waiting-for-memory use case is still supported, too.)
If it just trapped so the OS could context-switch, it wouldn't need to exist; int or syscall instructions already exist. Or in a kernel, a simple call to schedule() would potentially context-switch.
UMWAIT will put the core into the C0.2/C0.1 state to save power. ... if the other SMT thread is active, most of the back-end/front-end will stay active, as in C0.0, and if the other SMT thread is not active, then it will probably go into a C1-like state.
Yeah, if the other logical core is still active, the physical core should keep running. (And maybe switch back to "single-threaded" mode, allowing the other logical core to use the full ROB and store buffer, and similarly un-partitioning any other statically-partitioned resources. Check perf stat -e cpu_clk_unhalted.one_thread_active against the case where the other thread is fully idle.)
I don't know the details of what sleep levels real microarchitectures actually have in practice, or how the on-paper levels of sleep map to them. It might be a shallower sleep if regular C1 doesn't have a low enough wake-up latency, since some OSes would definitely want to stop user-space from doing anything with too high a wake-up latency, to meet the realtime guarantees the OS wants to provide.
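For anyone who wants to experiment: a hedged sketch using the WAITPKG intrinsics (_umonitor/_umwait are real GCC/Clang intrinsics; compile with -mwaitpkg on a Tremont/Alder Lake or newer CPU - the flag-waiting pattern itself is just an assumed usage):

    #include <immintrin.h>   /* _umonitor / _umwait */
    #include <x86intrin.h>   /* __rdtsc */
    #include <stdatomic.h>

    /* Wait for *flag to become nonzero without spinning hot and without a
     * system call. ctrl=1 requests the shallower C0.1 state; 0 would allow
     * C0.2. The wait also ends on any interrupt, or when the TSC deadline
     * (capped by the OS via IA32_UMWAIT_CONTROL) is reached. */
    static void wait_for_flag(_Atomic int *flag)
    {
        while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
            _umonitor((void *)flag);          /* arm the address monitor */
            if (atomic_load_explicit(flag, memory_order_acquire) != 0)
                break;                        /* recheck to avoid a lost wakeup */
            _umwait(1, __rdtsc() + 100000);   /* woken by write, deadline, or interrupt */
        }
    }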

Is assembly instruction "inc rax" atomic?

I know that modern CPUs have instruction pipelining, so that the execution of every machine instruction is separated into several steps, for example the five-stage RISC pipeline. My question is whether the assembly instruction inc rax is atomic when it is executed by different threads. Is it possible that thread A is in the Execute (EX) stage, calculating the result by incrementing the current value in register rax by 1, while thread B is in the Instruction Decode (ID) stage, reading from register rax a value that has not been incremented by thread A yet? In that case, is there a data race between threads A and B?
TL;DR: For a multithreaded program on x86-64, inc rax cannot cause or suffer any data race issues.
At the machine level, there are two senses of "atomic" that people usually use.
One is atomicity with respect to concurrent access by multiple cores. In this sense, the question doesn't really even make sense, because cores do not share registers; each has its own independent set. So a register-only instruction like inc rax cannot affect, or be affected by, anything that another core may be doing. There is certainly no data race issue to worry about.
Atomicity concerns in this sense only arise when two or more cores are accessing a shared resource - primarily memory.
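To make the contrast concrete, a minimal sketch in GNU C inline asm (function names invented): a plain memory-destination inc is a load+add+store that another core can interleave with, while the LOCK prefix turns it into a single atomic read-modify-write. inc rax itself needs neither, since each core's registers are private.

    #include <stdint.h>

    static inline void inc_plain(uint64_t *p)
    {
        asm volatile("incq %0" : "+m"(*p));      /* racy if another core writes *p */
    }

    static inline void inc_locked(uint64_t *p)
    {
        asm volatile("lock incq %0" : "+m"(*p)); /* atomic RMW, like C11 atomic_fetch_add */
    }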
The other is atomicity on a single core with respect to interrupts - if a hardware interrupt or exception occurs while an instruction is executing on the same core, what happens, and what machine state is observed by the interrupt handler? Here we do have to think about registers, because the interrupt handler can observe the same registers that the main code was using.
The answer is that x86 has precise interrupts, where interrupts appear to occur "between instructions". When calling the interrupt handler, the CPU pushes CS:RIP onto the stack, and the architectural state of the machine (registers, memory, etc) is as if:
the instruction pointed to by CS:RIP, and all subsequent instructions, have not begun to execute at all; the architectural state reflects none of their effects.
all instructions previous to CS:RIP have completely finished, and the architectural state reflects all of their effects.
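For x86-64, the hardware-pushed frame implied by this description looks like the following (layout per the Intel SDM for interrupts without an error code; the struct name is invented). The saved RIP is precisely the boundary between "fully done" and "not started":

    /* What the interrupt handler finds at its stack pointer on x86-64: */
    struct interrupt_frame {
        unsigned long rip;    /* first instruction that has not executed at all */
        unsigned long cs;
        unsigned long rflags;
        unsigned long rsp;    /* interrupted thread's stack pointer */
        unsigned long ss;
    };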
On an old-fashioned in-order scalar CPU, this is easily accomplished by having the CPU check for interrupts as a step in between the completion of one instruction and the execution of the next. On a pipelined CPU, it takes more work; if there are several instructions in flight, the CPU may wait for some of them to retire, and abort the others.
For more details, see When an interrupt occurs, what happens to instructions in the pipeline?
There are a few exceptions to this rule: e.g. the AVX-512 scatter/gather instructions may be partially completed when an interrupt occurs, so that some of the loads/stores have been done and others have not. But they update the mask register in such a way that when the CPU returns to execute the instruction again, only the remaining loads/stores will be done.
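For concreteness, a sketch of such a gather via intrinsics (AVX-512F assumed, compile with -mavx512f; the progress-tracking is a property of the underlying VPGATHERDQ instruction, which clears each mask bit as the corresponding element completes - the intrinsic itself just takes the mask as input):

    #include <immintrin.h>

    /* Gather 8 x 64-bit elements: base[idx[0]], ..., base[idx[7]]. */
    __m512i gather8(const long long *base, __m256i idx)
    {
        __mmask8 pending = 0xFF;   /* all 8 lanes still to do */
        return _mm512_mask_i32gather_epi64(_mm512_setzero_si512(),
                                           pending, idx, base, 8);
    }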
From the point of view of an application on a multitasking operating system, threads can run simultaneously on several cores, or run sequentially on a single core (or some combination). In the first case, there is no problem with inc rax as the registers are not shared between cores. In the second case, each thread still has its own register set as part of its context. Your thread may be interrupted by a hardware interrupt at any time (e.g. timer tick), and the OS may then decide to schedule in a different thread. To do so, it saves your thread's context, including the register contents at the time of the interrupt - and since we have precise interrupts, these contents reflect instructions in an all-or-nothing fashion. So inc rax is atomic for that purpose; when another thread gets control, the saved context of your thread has either all the effects of inc rax or none of them. (And it usually doesn't even matter, because the only machine state affected by inc rax is registers, and other threads don't normally try to observe the saved context of threads which are scheduled out, even if the OS provides a way to do that.)

Under what circumstances does control pass from userspace to the Linux kernel space?

I'm trying to understand which events can cause a transition from userspace to the Linux kernel. If it's relevant, the scope of this question can be limited to the x86/x86_64 architecture.
Here are some sources of transitions that I'm aware of:
System calls (which include accessing devices) cause a context switch from userspace to kernel space.
Interrupts will cause a context switch. As far as I know, this also includes scheduler preemptions, since a scheduler usually relies on a timer interrupt to do its work.
Signals. It seems like at least some signals are implemented using interrupts but I don't know if some are implemented differently so I'm listing them separately.
I'm asking two things here:
Am I missing any userspace->kernel path?
What are the various code paths that are involved in these context switches?
One you are missing: Exceptions
(which can be further broken down into faults, traps and aborts)
For example: a page fault, a breakpoint, division by zero or a floating-point exception. Technically, one can view exceptions as interrupts, but not really in the way you have defined an interrupt in your question.
You can find a list of x86 exceptions at this osdev webpage.
With regard to your second question:
What are the various code paths that are involved in these context switches?
That really depends on the architecture and the OS; you will need to be more specific. For x86, when an interrupt occurs you go through the IDT entry, and for SYSENTER you get to the address specified in the MSR. What happens after that is completely up to the OS.
No one wrote a complete answer so I will try to incorporate the comments and partial answers into an answer. Feel free to comment or edit the answer to improve it.
For the purposes of this question and answer, userspace-to-kernel transitions mean a change in processor state that allows access to kernel code and memory. In short, I will refer to these transitions as context switches.
When discussing events that can trigger userspace to kernel transitions, it is important to separate the OS constructs that we are used to (signals, system calls, scheduling) that require context switches and the way these constructs are implemented, using context switches.
In x86, there are two central ways for context switches to occur: interrupts and SYSENTER. Interrupts are a processor feature, which causes a context switch when certain events happen:
Hardware devices may request an interrupt; for example, a timer/clock can cause an interrupt when a certain amount of time has elapsed, and a keyboard can interrupt when keys are pressed. This type is called a hardware interrupt.
Userspace can initiate an interrupt. For example, the old way to perform a system call in Linux on x86 was to execute INT 0x80 with arguments passed through the registers (a minimal example appears below). Debugging breakpoints are also implemented using interrupts, with the debugger replacing an instruction with the one-byte INT 3 (opcode 0xCC). This type of interrupt is called a software interrupt.
The CPU itself generates interrupts in certain situations, for example when memory is accessed without permission or when a program divides by zero; this type of interrupt is called an exception, and you can read more about them in @esm's answer. Separately, one core can notify another core that it needs to do something via an inter-processor interrupt (IPI).
For a broader discussion of interrupts see here: http://wiki.osdev.org/Interrupt
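As a concrete illustration of the software-interrupt path mentioned above, here is a minimal sketch of the legacy INT 0x80 system call (the function name is invented; __NR_write = 4 on 32-bit x86, so this must be built as 32-bit code with gcc -m32 - 64-bit programs use the syscall instruction instead):

    #include <stddef.h>

    /* write(fd, buf, len) via the old int $0x80 gate: syscall number in EAX,
     * arguments in EBX/ECX/EDX, return value back in EAX. */
    static long sys_write_int80(int fd, const void *buf, size_t len)
    {
        long ret;
        asm volatile("int $0x80"
                     : "=a"(ret)
                     : "a"(4), "b"(fd), "c"(buf), "d"(len)
                     : "memory");
        return ret;
    }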
SYSENTER is an instruction that provides the modern path to cause a context switch for the particular case of performing a system call (on x86-64, the SYSCALL instruction plays this role).
The code that handles the context switching due to interrupts or SYSENTER in Linux can be found in arch/x86/kernel/entry_{32|64}.S.
There are many situations in which a higher-level Linux construct might cause a context switch. Here are a few examples:
If a system call goes through the int 0x80 or sysenter instruction, a context switch occurs. Some system call routines can use userspace information to get the information the system call was meant to get (the vDSO implementation of gettimeofday is an example); in this case, no context switch will occur.
Many times scheduling doesn't require an interrupt: a thread will perform a system call, and the return from the syscall is delayed until it is scheduled again. For processes that are in a section where syscalls aren't performed, Linux relies on timer interrupts to regain control.
Virtual memory access to a memory location that was paged out will cause a page fault, and therefore a context switch (see the sketch after this list).
Signals are usually delivered when a process is already "switched out" (see comments by @caf on the question), but sometimes an inter-processor interrupt is used to deliver the signal between two running processes.
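As promised above, a small demo of the page-fault path: the first store into a fresh anonymous mapping traps into the kernel, which allocates and maps a physical page and then resumes the faulting instruction, all invisibly to the program:

    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        p[0] = 'x';            /* first touch: page fault -> kernel -> back here */
        printf("%c\n", p[0]);
        return 0;
    }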

Page fault in Interrupt context

Can a page fault occur in an interrupt handler / atomic context?
It can, but it would be a disaster. :-)
(This is an oldish question. The existing answers contain correct facts, but are quite thin. I will attempt to answer it in a more substantial way.)
The answer to this question depends upon whether the code is in the kernel (supervisor mode), or in user mode. The reason is that the rules for memory access in these regions are usually different. Here is a brief sequence of events to illustrate the problem (assuming kernel memory could be paged out):
While a user program is executing, an interrupt occurs (e.g. key press / disk event).
CPU transitions to supervisor mode and begins executing the handler in the kernel.
The interrupt handler begins to save the CPU state (so that the user process can be correctly resumed later), but in doing so it touches some of its storage which had previously been paged out.
This triggers a page fault exception.
In order to process the page fault exception, the kernel must now save the CPU state of the code that experienced the page miss.
It may actually be able to do this if it has a preallocated pool of memory that will never be paged out, but such a pool would inevitably be limited in size.
So you see, the safest (and simplest) solution is for the kernel to ensure that memory owned by the kernel is not pageable at all. For this reason, page faults should not really occur within the kernel. They can occur, but as @adobriyan notes, that usually indicates a much bigger error than a simple need to page in some memory. (I believe this is the case in Linux. Check your specific OS to be sure whether kernel memory is non-pageable. OS architectures do differ.)
So in summary, kernel memory is usually not pageable, and since interrupts are usually handled within the kernel, page faults should not in general occur while servicing interrupts. Higher-priority interrupts can still interrupt lower ones. It is just that all their resources are kept in physical memory.
The question about atomic contexts is less clear. If by that you mean atomic operations supported by the hardware, then no interrupt occurs within a partial completion of the operation. If you are instead referring to something like a critical section, then remember that critical sections only emulate atomicity. From the perspective of the hardware there is nothing special about such code except for the entry and exit code, which may use true hardware atomic operations. The code in between is normal code, and subject to being interrupted.
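A sketch of that distinction in C11 (names invented): only the entry and exit of a critical section use true hardware atomics; the body is ordinary code that can be interrupted between any two instructions - the lock merely stops other threads from entering meanwhile.

    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;

    void critical_section(void)
    {
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            ;                                   /* atomic entry: spin until acquired */
        /* ... body: plain, interruptible code ... */
        atomic_flag_clear_explicit(&lock, memory_order_release);  /* atomic exit */
    }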
I hope this provides a useful response to this question, as I also wondered about this issue for a while.
Yes.
The code for the handler or critical region could span the boundary between two pages. If the second page is not available, then a page fault is necessary to bring it in.
Not sure why nobody has used the term "Double Fault":
http://en.wikipedia.org/wiki/Double_fault
But that is the term used in the Intel manual:
http://software.intel.com/en-us/articles/introduction-to-pc-architecture/
or here:
ftp://download.intel.com/design/processor/manuals/253668.pdf (look at section 6-38).
There is something called a triple fault too, which, as the name indicates, can also happen when the CPU is trying to service the double-fault error.
I think the answer is YES.
I just checked the page-fault handler code in kernel 4.15 for the x86_64 platform.
Take the following as a hint: no_context is the classic 'kernel oops' path.
static noinline void
no_context(struct pt_regs *regs, unsigned long error_code,
           unsigned long address, int signal, int si_code)
{
    /* Are we prepared to handle this kernel fault? */
    if (fixup_exception(regs, X86_TRAP_PF)) {
        /*
         * Any interrupt that takes a fault gets the fixup. This makes
         * the below recursive fault logic only apply to faults from
         * task context.
         */
        if (in_interrupt())
            return;
        /* ... */

Internals of a Linux system call

What happens (in detail) when a thread makes a system call by raising interrupt 80? What work does Linux do to the thread's stack and other state? What changes are done to the processor to put it into kernel mode? After running the interrupt handler, how is control restored back to the calling process?
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded and how does it then obtain control again?
A crash course in kernel mode in one stack overflow answer
Good questions! (Interview questions?)
What happens (in detail) when a thread makes a system call by raising interrupt 80?
The int $0x80 operation is vaguely like a function call. The CPU "takes a trap" and restarts at a known address in kernel mode, typically with a different MMU mode as well. The kernel will save many of the registers, though it doesn't have to save the registers that a program would not expect an ordinary function call to save.
What work does Linux do to the thread's stack and other state?
Typically an OS will save registers that the ABI promises not to change during procedure calls. The stack will stay the same; the kernel will run on a per-thread kernel stack rather than the per-thread user stack. Naturally some state will change, otherwise there would be no reason to do the system call.
What changes are done to the processor to put it into kernel mode?
This is usually entirely automatic. The CPU has, generically, a software-interrupt instruction that is a bit like a functional-call operation. It will cause the switch to kernel mode under controlled conditions. Typically, the CPU will change some sort of PSW protection bit, save the old PSW and PC, start at a well-known trap vector address, and may also switch to a different memory management protection and mapping arrangement.
After running the interrupt handler, how is control restored back to the calling process?
There will be some sort of "return from interrupt" or "return from trap" instruction, typically, that will act a bit like a complicated function-return instruction. Some RISC processors did very little automatically and required specific code to do the return and some CISC processors like x86 have (never-really-used) instructions that would execute dozens of operations documented in pages of architecture-manual pseudo-code for capability adjustments.
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded, and how does it then obtain control again?
The kernel itself is threaded much like a threaded user program is. It just switches stacks (threads) and works on someone else's process for a while.
To answer the last part of the question - what does the kernel do if the system call needs to sleep -
After a system call, the kernel is still logically running in the context of the same task that made the system call - it's just in kernel mode rather than user mode - it is NOT a separate thread and most system calls do not invoke logic from another task/thread. What happens is that the system call calls wait_event, or wait_event_timeout or some other wait function, which adds the task to a list of tasks waiting for something, then puts the task to sleep, which changes its state, and calls schedule() to relinquish the current CPU.
After this the task cannot be run again until it gets woken up, typically by another task (kernel task, etc) or interrupt handler calling a wake* function which will wake up the task(s) sleeping waiting for that particular event, which means the scheduler will soon schedule them again.
It's worth noting that userspace tasks (i.e. threads) are only one type of task, and there are a few others internal to the kernel which can do work as well - kernel threads and bottom-half handlers / tasklets / task queues, etc. Work which doesn't belong to any particular userspace process (for example network handling, e.g. responding to pings) gets done in these. These tasks are allowed to go to sleep, unlike interrupt handlers (which should not invoke the scheduler).
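A hedged sketch of that sleep/wake dance using the kernel's wait-queue API from include/linux/wait.h (the queue, flag, and both functions are invented for illustration):

    #include <linux/wait.h>
    #include <linux/sched.h>
    #include <linux/errno.h>

    static DECLARE_WAIT_QUEUE_HEAD(my_wq);
    static int data_ready;

    /* Task (syscall) context: allowed to sleep. */
    long my_read_like_syscall(void)
    {
        /* Adds current to my_wq, marks it sleeping, calls schedule(). */
        if (wait_event_interruptible(my_wq, data_ready))
            return -ERESTARTSYS;        /* woken by a signal instead */
        data_ready = 0;
        return 0;
    }

    /* Interrupt handler: must not sleep, only wakes the sleeper. */
    void my_irq_handler(void)
    {
        data_ready = 1;
        wake_up_interruptible(&my_wq);  /* scheduler will run the task again soon */
    }

The key point: sleeping happens only in task context; the interrupt handler just sets the condition and wakes the queue.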
http://tldp.org/LDP/khg/HyperNews/get/syscall/syscall86.html
This should help those who seek answers to what happens when the syscall instruction is executed, transferring control to the kernel (user mode to kernel mode). It is based on the x86_64 architecture.
https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html

Resources