Is CPU affinity enforced across system calls? - linux

So if I set a process's CPU affinity using:
sched_setaffinity()
and then perform some other system call using that process, is that system call ALSO guaranteed to execute on the same CPU enforced by sched_setaffinity?
Essentially, I'm trying to enforce that a process, and the system calls it makes, are executed on the same core. Obviously I can use sched_setaffinity() to enforce userspace code will execute on only one CPU, but does that same system call enforce kernel-space code in that process context will execute on the same core as well?
Thanks!

Syscalls are really just your process code switching from user to kernel mode. The task that is being run does not change at all, it just temporarily enters kernel mode to execute the syscall and then returns back to user mode.
A task can be preempted by the scheduler and moved to a different CPU, and this can happen in the middle of normal user mode code or even in the middle of a syscall.
By setting the task affinity to a single CPU using sched_setaffinity(), you remove this possibility, since even if the task gets preempted, the scheduler has no choice but to keep it running on the same CPU (it may of course change the currently running task, but when your task resumes it will still be on the same CPU).
So to answer your question:
does that same system call enforce kernel-space code in that process context will execute on the same core as well?
Yes, it does.
Now, to address #Barmar's comment: in the case of syscalls that can "sleep", this does not mean that the task could change CPU if the affinity does not allow it.
What happens when a syscall sleeps, is simply that the syscall code tells the scheduler: "hey, I'm waiting for something, just run another task while I wait and wake me up later". When the syscall resumes, it checks if the requested resource is available (it could even tell the kernel exactly when it wants to be waken up), and if not it either waits again or returns to user code saying "sorry, I got nothing, try again". The resource could of course be made available by some interrupt that causes an interrupt handler to run on a different CPU, but that's a different story, and it doesn't really matter. To put it simply: interrupt code does not run in process context, at all. For what the task executing the syscall is concerned, the resource is just magically there when execution resumes.

Related

What happens if when a system call is being executed, the hardware timer interrupt fires?

Say we have one CPU with only one core, and there are many threads that are running on this core.
Let's say that Thread A has issued a system call, now the interrupt handler for the system call will be executed.
Now while the the system call is being executed, say that the hardware timer interrupt (the one responsible for the scheduling of threads) fires. What will happen in this case, will the CPU stop running the system call and go to execute the scheduler code, or does the CPU must wait for the system call to be fully executed before switching to another thread?
In Linux the answer is actually dependent on a kernel build time configuration option called CONFIG_PREEMPT. There are actually three options:
If CONFIG_PREEMPT is not set, the interrupt handler will mark a flag indicating that the scheduler needs to run. The flag will be checked on return to user space upon system call termination.
If CONFIG_PREEMPT_VOLUNTARY is set, the same will occur except the flag will be checked and the scheduler run (and task possibly switched if needed) at specific static code points in the system call code
If CONFIG_PREEMPT_FULL is set, the scheduler will run in most cases on the return code path from the interrupt handler to the system call code, unless a preempt critical section is in force.
Unless the system call blocks interrupts, the interrupt handler will get invoked.

does kernel's panic() function completely freezes every other process?

I would like to be confirmed that kernel's panic() function and the others like kernel_halt() and machine_halt(), once triggered, guarantee complete freezing of the machine.
So, are all the kernel and user processes frozen? Is panic() interruptible by the scheduler? The interrupt handlers could still be executed?
Use case: in case of serious error, I need to be sure that the hardware watchdog resets the machine. To this end, I need to make sure that no other thread/process is keeping the watchdog alive. I need to trigger a complete halt of the system. Currently, inside my kernel module, I simply call panic() to freeze everything.
Also, the user-space halt command is guaranteed to freeze the system?
Thanks.
edit: According to: http://linux.die.net/man/2/reboot, I think the best way is to use reboot(LINUX_REBOOT_CMD_HALT): "Control is given to the ROM monitor, if there is one"
Thank you for the comments above. After some research, I am ready to give myself a more complete answer, below:
At least for the x86 architecture, the reboot(LINUX_REBOOT_CMD_HALT) is the way to go. This, in turn, calls the syscall reboot() (see: http://lxr.linux.no/linux+v3.6.6/kernel/sys.c#L433). Then, for the LINUX_REBOOT_CMD_HALT flag (see: http://lxr.linux.no/linux+v3.6.6/kernel/sys.c#L480), the syscall calls kernel_halt() (defined here: http://lxr.linux.no/linux+v3.6.6/kernel/sys.c#L394). That function calls syscore_shutdown() to execute all the registered system core shutdown callbacks, displays the "System halted" message, then it dumps the kernel, AND, finally, it calls machine_halt(), that is a wrapper for native_machine_halt() (see: http://lxr.linux.no/linux+v3.6.6/arch/x86/kernel/reboot.c#L680). It is this function that stops the other CPUs (through machine_shutdown()), then calls stop_this_cpu() to disable the last remaining working processor. The first thing that this function does is to disable interrupts on the current processor, that is the scheduler is no more able to take control.
I am not sure why the syscall reboot() still calls do_exit(0), after calling kernel_halt(). I interpret it like that: now, with all processors marked as disabled, the syscall reboot() calls do_exit(0) and ends itself. Even if the scheduler is awoken, there are no more enabled processors on which it could schedule some task, nor interrupt: the system is halted. I am not sure about this explanation, as the stop_this_cpu() seems to not return (it enters an infinite loop). Maybe is just a safeguard, for the case when the stop_this_cpu() fails (and returns): in this case, do_exit() will end cleanly the current task, then the panic() function is called.
As for the panic() code (defined here: http://lxr.linux.no/linux+v3.6.6/kernel/panic.c#L69), the function first disables the local interrupts, then it disables all the other processors, except the current one by calling smp_send_stop(). Finally, as the sole task executing on the current processor (which is the only processor still alive), with all local interrupts disabled (that is, the preemptible scheduler -- a timer interrupt, after all -- has no chance...), then the panic() function loops some time or it calls emergency_restart(), that is supposed to restart the processor.
If you have better insight, please contribute.

Does the linux scheduler needs to be context switched?

I have a general question about the linux scheduler and some other similar kernel system calls.
Is the linux scheduler considered a "process" and every call to the scheduler requires a context switch like its just another process?
Say we have a clock tick which interrupts the current running user mode process, and we now have to call the scheduler. Does the call to the scheduler itself provokes a context switch? Does the scheduler has its own set of registers and U-area and whatnot which it has to restore at every call?
And the said question applies to many other system calls. Do kernel processes behave like regular processes in regard to context switching, the only difference is that they have more permissions and access to the cpu?
I ask this because context switch overhead is expensive. And it sounds odd that calling the scheduler itself provokes a context switch to restore the scheduler state, and after that the scheduler calls another process to run and again another context switch.
That's a very good question, and the answer to it would be "yes" except for the fact that the hardware is aware of the concept of an OS and task scheduler.
In the hardware, you'll find registers that are restricted to "supervisor" mode. Without going into too much detail about the internal CPU architecture, there's a copy of the basic program execution registers for "user mode" and "supervisor mode," the latter of which can only be accessed by the OS itself (via a flag in a control register that the kernel sets which says whether or not the kernel or a user mode application is currently running).
So the "context switch" you speak of is the process of swapping/resetting the user mode registers (instruction register, stack pointer register, etc.) etc. but the system registers don't need to be swapped out because they're stored apart from the user ones.
For instance, the user mode stack in x86 is USP - A7, whereas the supervisor mode stack is SSP - A7. So the kernel itself (which contains the task scheduler) would use the supervisor mode stack and other supervisor mode registers to run itself, setting the supervisor mode flag to 1 when it's running, then perform a context switch on the user mode hardware to swap between apps and setting the supervisor mode flag to 0.
But prior to the idea of OSes and task scheduling, if you wanted to do a multitasking system then you'd have had to use the basic concept that you outlined in your question: use a hardware interrupt to call the task scheduler every x cycles, then swap out the app for the task scheduler, then swap in the new app. But in most cases the timer interrupt would be your actual task scheduler itself and it would have been heavily optimized to make it less of a context switch and more of a simple interrupt handler routine.
Actually you can check the code for the schedule() function in kernel/sched.c. It is admirably well-written and should answer most of your question.
But bottom-line is that the Linux scheduler is invoked by calling schedule(), which does the job using the context of its caller. Thus there is no dedicated "scheduler" process. This would make things more difficult actually - if the scheduler was a process, it would also have to schedule itself!
When schedule() is invoked explicitly, it just switches the contexts of the caller thread A with the one of the selected runnable thread B such as it will return into B (by restoring register values and stack pointers, the return address of schedule() will become the one of B instead of A).
Here is an attempt at a simple description of what goes on during the dispatcher call:
The program that currently has context is running on the processor. Registers, program counter, flags, stack base, etc are all appropriate for this program; with the possible exception of an operating-system-native "reserved register" or some such, nothing about the program knows anything about the dispatcher.
The timed interrupt for dispatcher function is triggered. The only thing that happens at this point (in the vanilla architecture case) is that the program counter jumps immediately to whatever the PC address in the BIOS interrupt is listed as. This begins execution of the dispatcher's "dispatch" subroutine; everything else is left untouched, so the dispatcher sees the registers, stack, etc of the program that was previously executing.
The dispatcher (like all programs) has a set of instructions that operate on the current register set. These instructions are written in such a way that they know that the previously executing application has left all of its state behind. The first few instructions in the dispatcher will store this state in memory somewhere.
The dispatcher determines what the next program to have the cpu should be, takes all of its previously stored state and fills registers with it.
The dispatcher jumps to the appropriate PC counter as listed in the task that now has its full context established on the cpu.
To (over)simplify in summary; the dispatcher doesn't need registers, all it does is write the current cpu state to a predetermined memory location, load another processes' cpu state from a predetermined memory location, and jumps to where that process left off.

Process state when calling syscall?

What process state has when it calls a syscall?
I mean, don't asume it's an I/O syscall like read or write...
It's the process itselft that executes kernel code, or the process is suspendes and there's like a "kernel thread" that execute the syscall handler (and knows wich process called (current))?
I'm not sure if changes from executing to ready, or executing to blocked.
It's the process itself that switches to kernel mode and executes the system call - although it switches to a kernel stack to do so. A process executing inside the kernel has state Running, and can be pre-empted and end up in state Runnable.
It depends what the syscall does.
Suppose there is a hypothetical syscall which calculates PI to a lot of digits, and places the result in a buffer the application specifies, then the process will probably just be in the "R" running state. Switching to kernel mode doesn't stop it running in the context of the task that made the call.
Of course many system calls wait for things - consider sleep() for example, which releases the CPU rather than spinning. This puts the process to sleep, having registered a kernel timer to wake it up.
Quite a lot of syscalls never sleep, the likes of getpid() which just retrieve information which is always in ram. And many which do sometimes sleep don't necessarily do so, for example, if you call read() on data already in a kernel buffer.

Internals of a Linux system call

What happens (in detail) when a thread makes a system call by raising interrupt 80? What work does Linux do to the thread's stack and other state? What changes are done to the processor to put it into kernel mode? After running the interrupt handler, how is control restored back to the calling process?
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded and how does it then obtain control again?
A crash course in kernel mode in one stack overflow answer
Good questions! (Interview questions?)
What happens (in detail) when a
thread makes a system call by raising
interrupt 80?
The int $80 operation is vaguely like a function call. The CPU "takes a trap" and restarts at a known address in kernel mode, typically with a different MMU mode as well. The kernel will save many of the registers, though it doesn't have to save the registers that a program would not expect an ordinary function call to save.
What work does Linux do to the
thread's stack and other state?
Typically an OS will save registers that the ABI promises not to change during procedure calls. The stack will stay the same; the kernel will run on a per-thread kernel stack rather than the per-thread user stack. Naturally some state will change, otherwise there would be no reason to do the system call.
What changes are done to the
processor to put it into kernel mode?
This is usually entirely automatic. The CPU has, generically, a software-interrupt instruction that is a bit like a functional-call operation. It will cause the switch to kernel mode under controlled conditions. Typically, the CPU will change some sort of PSW protection bit, save the old PSW and PC, start at a well-known trap vector address, and may also switch to a different memory management protection and mapping arrangement.
After running the interrupt handler,
how is control restored back to the
calling process?
There will be some sort of "return from interrupt" or "return from trap" instruction, typically, that will act a bit like a complicated function-return instruction. Some RISC processors did very little automatically and required specific code to do the return and some CISC processors like x86 have (never-really-used) instructions that would execute dozens of operations documented in pages of architecture-manual pseudo-code for capability adjustments.
What if the system call can't be
completed quickly: e.g. a read from
disk. How does the interrupt handler
relinquish control so that the
processor can do other stuff while
data is being loaded and how does it
then obtain control again?
The kernel itself is threaded much like a threaded user program is. It just switches stacks (threads) and works on someone else's process for a while.
To answer the last part of the question - what does the kernel do if the system call needs to sleep -
After a system call, the kernel is still logically running in the context of the same task that made the system call - it's just in kernel mode rather than user mode - it is NOT a separate thread and most system calls do not invoke logic from another task/thread. What happens is that the system call calls wait_event, or wait_event_timeout or some other wait function, which adds the task to a list of tasks waiting for something, then puts the task to sleep, which changes its state, and calls schedule() to relinquish the current CPU.
After this the task cannot be run again until it gets woken up, typically by another task (kernel task, etc) or interrupt handler calling a wake* function which will wake up the task(s) sleeping waiting for that particular event, which means the scheduler will soon schedule them again.
It's worth noting that userspace tasks (i.e. threads) are only one type of task and there are a few others internal to the kernel which can do work as well - these are kernel threads and bottom half handlers / tasklets / task queues etc. Work which doesn't belong to any particular userspace process (for example network handling e.g. responding to pings) gets done in these. These tasks are allowed to go to sleep, unlike interrupts (which should not invoke the scheduler)
http://tldp.org/LDP/khg/HyperNews/get/syscall/syscall86.html
This should help people who seek for answers to what happens when the syscall instruction is executed which transfers the control to the kernel (user mode to kernel mode). This is based upon x86_64 architecture.
https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html

Resources