Does the Linux scheduler need to be context switched? - linux

I have a general question about the Linux scheduler and some other similar kernel system calls.
Is the Linux scheduler considered a "process", and does every call to the scheduler require a context switch as if it were just another process?
Say we have a clock tick which interrupts the currently running user-mode process, and we now have to call the scheduler. Does the call to the scheduler itself provoke a context switch? Does the scheduler have its own set of registers, U-area and whatnot which it has to restore at every call?
The same question applies to many other system calls. Do kernel processes behave like regular processes with regard to context switching, the only difference being that they have more permissions and access to the CPU?
I ask this because context switch overhead is expensive. It sounds odd that calling the scheduler would itself provoke a context switch to restore the scheduler's state, and that the scheduler would then pick another process to run, causing yet another context switch.

That's a very good question, and the answer to it would be "yes" except for the fact that the hardware is aware of the concept of an OS and task scheduler.
In the hardware, you'll find registers that are restricted to "supervisor" mode. Without going into too much detail about the internal CPU architecture, some registers exist in separate copies for "user mode" and "supervisor mode" (on many architectures this is only the stack pointer, on some a few more), and the supervisor copies can only be accessed by the OS itself. A flag in a control register, which only privileged code can change, says whether the kernel or a user mode application is currently running.
So the "context switch" you speak of is the process of swapping/resetting the user mode registers (program counter, stack pointer, etc.), but the system registers don't need to be swapped out because they're stored apart from the user ones.
For instance, on the Motorola 68000 family, register A7 is the user stack pointer (USP) in user mode and the supervisor stack pointer (SSP) in supervisor mode. So the kernel itself (which contains the task scheduler) uses the supervisor mode stack and other supervisor mode registers to run itself, with the supervisor mode flag set to 1 while it's running; it then performs a context switch on the user mode registers to swap between apps and sets the supervisor mode flag back to 0 when it returns to user code.
But prior to the idea of OSes and task scheduling, if you wanted to do a multitasking system then you'd have had to use the basic concept that you outlined in your question: use a hardware interrupt to call the task scheduler every x cycles, then swap out the app for the task scheduler, then swap in the new app. But in most cases the timer interrupt would be your actual task scheduler itself and it would have been heavily optimized to make it less of a context switch and more of a simple interrupt handler routine.

Actually you can check the code for the schedule() function in kernel/sched.c (kernel/sched/core.c in newer kernel trees). It is admirably well-written and should answer most of your question.
But the bottom line is that the Linux scheduler is invoked by calling schedule(), which does the job using the context of its caller. Thus there is no dedicated "scheduler" process. That would actually make things more difficult: if the scheduler were a process, it would also have to schedule itself!
When schedule() is invoked explicitly, it just switches the context of the calling thread A with that of the selected runnable thread B, so that it returns into B (by restoring register values and stack pointers, the return address of schedule() becomes that of B instead of A).
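The "called in A, returns in B" behaviour can be imitated in user space. The following is only a toy model, not kernel code: each task is a Python thread that may run only while it holds its own semaphore, and schedule(), exit_task() and run() are hypothetical helpers invented for this sketch. The point it illustrates is that schedule() is an ordinary function executed by whichever task calls it, and that no scheduler thread exists.

```python
import threading
from collections import deque

run_queue = deque()   # runnable tasks, in round-robin order
trace = []            # records which task ran each step


class Task:
    def __init__(self, name, steps):
        self.name = name
        self.steps = steps
        self.baton = threading.Semaphore(0)   # released when this task may run
        self.thread = threading.Thread(target=self._body)

    def _body(self):
        self.baton.acquire()                  # wait until first switched in
        for i in range(self.steps):
            trace.append(f"{self.name}{i}")
            schedule(self)                    # voluntarily give up the "CPU"
        exit_task(self)


def schedule(current):
    """Runs in the context of the calling task; there is no scheduler thread."""
    run_queue.append(current)                 # the caller stays runnable
    nxt = run_queue.popleft()
    if nxt is current:
        return                                # nothing else to run
    nxt.baton.release()                       # let the chosen task continue
    current.baton.acquire()                   # the caller stops here until re-selected


def exit_task(current):
    """A finished task hands the CPU on without re-queueing itself."""
    if run_queue:
        run_queue.popleft().baton.release()


def run(tasks):
    trace.clear()
    run_queue.clear()
    run_queue.extend(tasks[1:])
    for t in tasks:
        t.thread.start()
    tasks[0].baton.release()                  # boot: switch into the first task
    for t in tasks:
        t.thread.join()
    return list(trace)
```

Each call to schedule() made by one task appears to return only when some other task has called schedule() in turn and selected it, which is the same shape as the kernel's behaviour described above.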

Here is an attempt at a simple description of what goes on during the dispatcher call:
The program that currently has context is running on the processor. Registers, program counter, flags, stack base, etc are all appropriate for this program; with the possible exception of an operating-system-native "reserved register" or some such, nothing about the program knows anything about the dispatcher.
The timed interrupt for the dispatcher function is triggered. The only thing that happens at this point (in the vanilla architecture case) is that the program counter jumps immediately to whatever address is listed for that interrupt in the interrupt vector table. This begins execution of the dispatcher's "dispatch" subroutine; everything else is left untouched, so the dispatcher sees the registers, stack, etc of the program that was previously executing.
The dispatcher (like all programs) has a set of instructions that operate on the current register set. These instructions are written in such a way that they know that the previously executing application has left all of its state behind. The first few instructions in the dispatcher will store this state in memory somewhere.
The dispatcher determines what the next program to have the cpu should be, takes all of its previously stored state and fills registers with it.
The dispatcher jumps to the appropriate program counter value as listed in the task that now has its full context established on the CPU.
To (over)simplify in summary: the dispatcher doesn't need registers of its own; all it does is write the current CPU state to a predetermined memory location, load another process's CPU state from a predetermined memory location, and jump to where that process left off.
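As a rough illustration of that summary, here is a toy model in Python. The CPU class, the saved_state table and dispatch() are all made up for this sketch; a real dispatcher does this in assembly on real registers.

```python
from dataclasses import dataclass, field


@dataclass
class CPU:
    """A pretend processor with a program counter, stack pointer and 4 registers."""
    pc: int = 0
    sp: int = 0
    regs: list = field(default_factory=lambda: [0] * 4)


# task id -> (pc, sp, regs): stands in for the "predetermined memory location"
saved_state = {}


def dispatch(cpu, current, nxt):
    """Save the state of `current`, then load the state of `nxt` onto the CPU."""
    saved_state[current] = (cpu.pc, cpu.sp, list(cpu.regs))   # write state out
    pc, sp, regs = saved_state[nxt]                           # read the other task's state
    cpu.pc, cpu.sp, cpu.regs = pc, sp, list(regs)             # "jump" to where it left off


# Task 1 is running; task 2 was saved at some earlier point.
cpu = CPU(pc=100, sp=0x8000, regs=[1, 2, 3, 4])
saved_state[2] = (200, 0x9000, [9, 9, 9, 9])
dispatch(cpu, current=1, nxt=2)
```

After the call the pretend CPU holds task 2's state, and task 1's state sits in saved_state[1] until a later dispatch loads it again.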

Related

Core execution flow in terms of thread context switch and CPU mode switch

If
CPU has mode (privilege level) (Added because not all processors have privilege levels according to here)
CPU is multi-core (CISC / x86-64 instruction set)
scheduling is round robin scheduling
thread is kernel managed thread
OS is Windows if necessary
I want to know the simplified execution flow of a core, in terms of thread context switches and CPU mode switches, per time slice.
My understanding is as follows. Please correct me if I'm wrong.
In the case of a kernel-managed user-mode thread, with no interrupt or anything else that requires kernel mode:
1. The thread context switch occurs.
2. The core executing the thread switches to kernel mode, because a context switch can only occur in kernel mode according to here, here and here, unless the thread is a user-managed thread.
3. The core executing the thread switches to user mode.
4. The core executes a sequence of instructions located in user space.
5. The time slice expires.
6. Repeat from 1.
Closest related diagram I could find is below.
Even a little clue to answer will be sincerely appreciated.
You said it yourself:
context switch can only occur in kernel mode
So the CPU must enter kernel mode before there can be a context switch. That can happen in either one of two ways in most operating systems:
The user-mode code makes a system call, or
An interrupt occurs.
If the thread enters kernel mode by making a system call, then there could be a context switch if the syscall causes the thread to no longer be runnable (e.g., a sleep() call), or there could be a context switch if the syscall causes some higher-priority thread to become runnable. (e.g., the syscall releases a mutex that the higher priority thread was awaiting.)
If the thread enters kernel mode because of an interrupt, then there could be a context switch because the interrupt handler made some higher-priority thread runnable (e.g., if the other thread was awaiting data from the disk), or there could be a context switch because it was a timer interrupt, and the current thread's time slice has expired.
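The conditions in the two paragraphs above can be condensed into one decision function. This is only a sketch under stated assumptions: the Thread class and pick_after_kernel_entry() are hypothetical, a larger number is assumed to mean higher priority, and threads of equal priority are assumed to share the CPU round-robin. Real schedulers use considerably more elaborate policies.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    priority: int            # assumption: larger number = higher priority
    runnable: bool = True


def pick_after_kernel_entry(current, ready, slice_expired):
    """Decide which thread runs when the kernel returns to user mode.

    current:       the thread that entered the kernel (syscall or interrupt)
    ready:         other threads that are runnable right now
    slice_expired: True if a timer interrupt found the time slice used up
    """
    best = max(ready, key=lambda t: t.priority, default=None)
    if not current.runnable:
        return best                   # e.g. the syscall put the thread to sleep
    if best is not None and best.priority > current.priority:
        return best                   # a higher-priority thread became runnable
    if slice_expired and best is not None and best.priority == current.priority:
        return best                   # round-robin among equals
    return current                    # no context switch needed
```

A return value of None stands for "nothing to run", which a real kernel would handle by running its idle task.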
The mechanism of context switching may be different on different hardware platforms. Here's how it could happen on some hypothetical CPU:
The current thread (threadA) enters scheduler code which chooses some other thread (threadB) as the next to run on the current CPU.
It calls some switchContext(threadB) function.
The switchContext function copies values from the stack pointer register, and from other live registers into the current thread (threadA)'s saved context area.*
It then sets the "current thread" pointer to point to threadB's saved context area, and it restores threadB's context by copying all the same things in reverse.**
Finally, the switchContext function returns... IN threadB,... at exactly the place where threadB last called it.
Eventually, threadB returns from the interrupt or system call to application code running in user-mode.
* The author of switchContext may have to be careful, may have to do some tricky things, in order to save the entire context without trashing it. E.g., it had better not use any register that needs saving before it has actually saved it somewhere.
** The trickiest part is when restoring the stack pointer register. As soon as that happens, "the" stack suddenly is threadB's stack instead of threadA's stack.

Is CPU affinity enforced across system calls?

So if I set a process's CPU affinity using:
sched_setaffinity()
and then perform some other system call using that process, is that system call ALSO guaranteed to execute on the same CPU enforced by sched_setaffinity?
Essentially, I'm trying to enforce that a process, and the system calls it makes, are executed on the same core. Obviously I can use sched_setaffinity() to enforce userspace code will execute on only one CPU, but does that same system call enforce kernel-space code in that process context will execute on the same core as well?
Thanks!
Syscalls are really just your process code switching from user to kernel mode. The task that is being run does not change at all, it just temporarily enters kernel mode to execute the syscall and then returns back to user mode.
A task can be preempted by the scheduler and moved to a different CPU, and this can happen in the middle of normal user mode code or even in the middle of a syscall.
By setting the task affinity to a single CPU using sched_setaffinity(), you remove this possibility, since even if the task gets preempted, the scheduler has no choice but to keep it running on the same CPU (it may of course change the currently running task, but when your task resumes it will still be on the same CPU).
So to answer your question:
does that same system call enforce kernel-space code in that process context will execute on the same core as well?
Yes, it does.
Now, to address @Barmar's comment: in the case of syscalls that can "sleep", this does not mean that the task could change CPU if the affinity does not allow it.
What happens when a syscall sleeps is simply that the syscall code tells the scheduler: "hey, I'm waiting for something, just run another task while I wait and wake me up later". When the syscall resumes, it checks if the requested resource is available (it could even tell the kernel exactly when it wants to be woken up), and if not it either waits again or returns to user code saying "sorry, I got nothing, try again". The resource could of course be made available by some interrupt that causes an interrupt handler to run on a different CPU, but that's a different story, and it doesn't really matter. To put it simply: interrupt code does not run in process context, at all. As far as the task executing the syscall is concerned, the resource is just magically there when execution resumes.
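For experimenting with this, a minimal sketch using Python's os.sched_setaffinity() and os.sched_getaffinity(), which wrap the same Linux system calls, might look as follows. pin_to_one_cpu() is a hypothetical helper; the functions are Linux-only, so the sketch returns None elsewhere, and the choice of the lowest-numbered allowed CPU is arbitrary.

```python
import os


def pin_to_one_cpu():
    """Pin the caller to a single CPU and return that CPU's number.

    Returns None on platforms where os.sched_setaffinity is unavailable.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)    # pid 0 means "the caller"
    cpu = min(allowed)                   # arbitrary choice among the allowed CPUs
    os.sched_setaffinity(0, {cpu})
    return cpu


def write_something():
    """Any system call made after pinning executes its kernel side on the same CPU."""
    with open(os.devnull, "wb") as f:
        return f.write(b"x")
```

After pin_to_one_cpu() returns, os.sched_getaffinity(0) should report a set containing only the chosen CPU, and calls such as write_something() run both their user-mode and kernel-mode parts there.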

Context switch between kernel threads vs user threads

Copy pasted from this link:
Thread switching does not require Kernel mode privileges.
User level threads are fast to create and manage.
Kernel threads are generally slower to create and manage than the user threads.
Transfer of control from one thread to another within the same process requires a mode switch to the Kernel.
I never came across these points while reading standard operating systems reference books. Though these points sound logical, I wanted to know how they reflect in Linux. To be precise :
Can someone give detailed steps involved in context switching between user threads and kernel threads, so that I can find the step difference between the two.
Can someone explain the difference with actual context switch example or code. May be system calls involved (in case of context switching between kernel threads) and thread library calls involved (in case of context switching between user threads).
Can someone link me to Linux source code line (say on github) handling context switch.
I also doubt why context switch between kernel threads requires changing to kernel mode. Aren't we already in kernel mode for first thread?
Can someone give detailed steps involved in context switching between user threads and kernel threads, so that I can find the step difference between the two.
Let's imagine a thread needs to read data from a file, but the file isn't cached in memory and disk drives are slow so the thread has to wait; and for simplicity let's also assume that the kernel is monolithic.
For kernel threading:
thread calls a "read()" function in a library or something; which must cause at least a switch to kernel code (because it's going to involve device drivers).
the kernel adds the IO request to the disk driver's "queue of possibly many pending requests"; realizes the thread will need to wait until the request completes, sets the thread to "blocked waiting for IO" and switches to a different thread (that may belong to a completely different process, depending on global thread priorities). The kernel returns to the user-space of whatever thread it switched to.
later; the disk hardware causes an IRQ which causes a switch back to the IRQ handler in kernel code. The disk driver finishes up the work it had to do for the (currently blocked) thread and unblocks that thread. At this point the kernel might decide to switch to the "now unblocked" thread; and the kernel returns to the user-space of the "now unblocked" thread.
For user threading:
thread calls a "read()" function in a library or something; which must cause at least a switch to kernel code (because it's going to involve device drivers).
the kernel adds the IO request to the disk driver's "queue of possibly many pending requests"; realizes the thread will need to wait until the request completes but can't take care of that because some fool decided to make everything worse by doing thread switching in user space, so the kernel returns to user-space with "IO request has been queued" status.
after the pointless extra overhead of switching back to user-space; the user-space scheduler does the thread switch that the kernel could have done. At this point the user-space scheduler will either tell kernel it has nothing to do and you'll have more pointless extra overhead switching back to kernel; or user-space scheduler will do a thread switch to another thread in the same process (which may be the wrong thread because a thread in a different process is higher priority).
later; the disk hardware causes an IRQ which causes a switch back to the IRQ handler in kernel code. The disk driver finishes up the work it had to do for the (currently blocked) thread; but the kernel isn't able to do the thread switch to unblock the thread because some fool decided to make everything worse by doing thread switching in user space. Now we've got a problem - how does kernel inform the user-space scheduler that the IO has finished? To solve this (without any "user-space scheduler running zero threads constantly polls kernel" insanity) you have to have some kind of "kernel puts notification of IO completion on some kind of queue and (if the process was idle) wakes the process up" which (on its own) will be more expensive than just doing the thread switch in the kernel. Of course if the process wasn't idle then code in user-space is going to have to poll its notification queue to find out if/when the "notification of IO completion" arrives, and that's going to increase latency and overhead. In any case, after lots of stupid pointless and avoidable overhead; the user-space scheduler can do the thread switch.
Can someone explain the difference with actual context switch example or code. May be system calls involved (in case of context switching between kernel threads) and thread library calls involved (in case of context switching between user threads).
The actual low-level context switch code typically begins with something like:
push whichever registers are "callee-saved" (preserved across calls) according to the calling conventions onto the stack
save the current stack top in some kind of "thread info structure" belonging to the old thread
load a new stack top from some kind of "thread info structure" belonging to the new thread
pop whichever registers are "callee-saved" according to the calling conventions
return
However:
usually (for modern CPUs) there's a relatively large amount of "SIMD register state" (e.g. for 80x86 with support for AVX-512 I think it's over 2 KiB of stuff). CPU manufacturers often have mechanisms to avoid saving parts of that state if it wasn't changed, and to (optionally) postpone the loading of (pieces of) that state until it's actually used (and avoid it completely if it's not actually used). All of that requires the kernel.
if it's a task switch and not just used for thread switches you might need some kind of "if virtual address space needs to change { change virtual address space }" on top of that
normally you want to keep track of statistics, like how much CPU time a thread has used. This requires some kind of "thread_info.time_used += now() - time_at_last_thread_switch;"; which gets difficult/ugly when "process switching" is separated from "thread switching".
normally there's other state (e.g. pointer to thread local storage, special registers for performance monitoring and/or debugging, ...) that may need to be saved/loaded during thread switches. Often this state is not directly accessible in user code.
normally you also want to set a timer to expire when the thread has used too much time; either because you're doing some kind of "time multiplexing" (e.g. round-robin scheduler) or because it's a cooperative scheduler where you need to have some kind of "terminate this task after 5 seconds of not responding in case it goes into an infinite loop forever" safe-guard.
this is just the low level task/thread switching in isolation. There is almost always higher level code to select a task to switch to, handle "thread used too much CPU time", etc.
Can someone link me to Linux source code line (say on github) handling context switch
Someone probably can't. It's not one line; it's many lines of assembly for each different architecture, plus extra higher-level code (for timers, support routines, the "select a task to switch to" code, for exception handlers to support "lazy SIMD state load", ...); which probably all adds up to something like 10 thousand lines of code spread across 50 files.
I also doubt why context switch between kernel threads requires changing to kernel mode. Aren't we already in kernel mode for first thread?
Yes; often you're already in kernel code when you find out that a thread switch is needed.
Rarely/sometimes (mostly only due to communication between threads belonging to the same process - e.g. 2 or more threads in the same process trying to acquire the same mutex/semaphore at the same time; or threads sending data to each other and waiting for data from each other to arrive) kernel isn't involved; and in some cases (which are almost always massive design failures - e.g. extreme lock contention problems, failure to use "worker thread pools" to limit the number of threads needed, etc) it's possible for this to be the dominant cause of thread switches, and therefore possible that doing thread switches in user space can be beneficial (e.g. as a work-around for the massive design failures).
Don't limit yourself to Linux or even UNIX; they are neither the first nor the last word on systems or programming models. The synchronous execution model dates back to the early days of computing, and is not particularly well suited to larger scale concurrent and reactive programming.
Golang, for example, employs a great many lightweight user threads -- goroutines -- and multiplexes them on a smaller set of heavyweight kernel threads to produce a more compelling concurrency paradigm. Some other programming systems take similar approaches.
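To make the idea of user-level threads concrete, here is a minimal sketch that uses Python generators as cooperative user threads, all multiplexed onto the single kernel thread that runs the interpreter. worker() and run_user_threads() are hypothetical names for this illustration, and the round-robin policy is an assumption. This is not how the Go runtime is implemented; it only shows that switching between such threads needs no system call.

```python
from collections import deque


def worker(name, n):
    """A user-level thread: each `yield` voluntarily hands control back."""
    for i in range(n):
        yield f"{name}:{i}"


def run_user_threads(threads):
    """Round-robin user-space scheduler; the switch is an ordinary function return."""
    ready = deque(threads)
    log = []
    while ready:
        t = ready.popleft()
        try:
            log.append(next(t))    # resume the thread until its next yield
            ready.append(t)        # still alive: back of the queue
        except StopIteration:
            pass                   # thread finished, drop it
    return log
```

For example, run_user_threads([worker("a", 2), worker("b", 1)]) interleaves the two workers without the kernel ever being aware that more than one thread of control exists.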

Internals of a Linux system call

What happens (in detail) when a thread makes a system call by raising interrupt 80? What work does Linux do to the thread's stack and other state? What changes are done to the processor to put it into kernel mode? After running the interrupt handler, how is control restored back to the calling process?
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded and how does it then obtain control again?
A crash course in kernel mode in one stack overflow answer
Good questions! (Interview questions?)
What happens (in detail) when a thread makes a system call by raising interrupt 80?
The int $0x80 operation is vaguely like a function call. The CPU "takes a trap" and restarts at a known address in kernel mode, typically with a different MMU mode as well. The kernel will save many of the registers, though it doesn't have to save the registers that a program would not expect an ordinary function call to save.
What work does Linux do to the thread's stack and other state?
Typically an OS will save registers that the ABI promises not to change during procedure calls. The user stack is left as it was; the kernel runs on a per-thread kernel stack rather than the per-thread user stack. Naturally some state will change, otherwise there would be no reason to do the system call.
What changes are done to the processor to put it into kernel mode?
This is usually entirely automatic. The CPU has, generically, a software-interrupt instruction that is a bit like a function-call operation. It will cause the switch to kernel mode under controlled conditions. Typically, the CPU will change some sort of PSW protection bit, save the old PSW and PC, start at a well-known trap vector address, and may also switch to a different memory management protection and mapping arrangement.
After running the interrupt handler, how is control restored back to the calling process?
There will be some sort of "return from interrupt" or "return from trap" instruction, typically, that will act a bit like a complicated function-return instruction. Some RISC processors did very little automatically and required specific code to do the return and some CISC processors like x86 have (never-really-used) instructions that would execute dozens of operations documented in pages of architecture-manual pseudo-code for capability adjustments.
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded and how does it then obtain control again?
The kernel itself is threaded much like a threaded user program is. It just switches stacks (threads) and works on someone else's process for a while.
To answer the last part of the question - what does the kernel do if the system call needs to sleep -
After a system call, the kernel is still logically running in the context of the same task that made the system call - it's just in kernel mode rather than user mode - it is NOT a separate thread and most system calls do not invoke logic from another task/thread. What happens is that the system call calls wait_event, or wait_event_timeout or some other wait function, which adds the task to a list of tasks waiting for something, then puts the task to sleep, which changes its state, and calls schedule() to relinquish the current CPU.
After this the task cannot be run again until it gets woken up, typically by another task (kernel task, etc) or interrupt handler calling a wake* function which will wake up the task(s) sleeping waiting for that particular event, which means the scheduler will soon schedule them again.
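The sleep and wake-up bookkeeping described above can be modelled in a few lines. This is a toy sketch only: Task, WaitQueue, wait_on() and wake_up() are hypothetical stand-ins loosely modelled on the kernel's wait_event/wake_up family, and the call to schedule() that would follow wait_on() is left out.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    state: str = "RUNNING"          # "RUNNING" (runnable) or "SLEEPING"


@dataclass
class WaitQueue:
    waiters: deque = field(default_factory=deque)


run_queue = deque()                 # tasks the scheduler is allowed to pick


def wait_on(task, wq):
    """Roughly what a blocking system call does just before calling schedule()."""
    wq.waiters.append(task)         # remember who is waiting for this event
    task.state = "SLEEPING"
    if task in run_queue:
        run_queue.remove(task)      # the scheduler will no longer select it


def wake_up(wq):
    """Called by another task or an interrupt handler once the event has happened."""
    while wq.waiters:
        task = wq.waiters.popleft()
        task.state = "RUNNING"
        run_queue.append(task)      # eligible to be scheduled again
```

In this model a disk read would call wait_on(reader, disk_wq), and the disk interrupt handler would later call wake_up(disk_wq), after which the scheduler can pick the reader again.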
It's worth noting that userspace tasks (i.e. threads) are only one type of task, and there are a few other execution contexts internal to the kernel which can do work as well: kernel threads, task queues / workqueues, and bottom-half handlers such as softirqs and tasklets. Work which doesn't belong to any particular userspace process (for example network handling, e.g. responding to pings) gets done in these. Kernel threads and workqueue items are allowed to go to sleep; softirqs and tasklets, like interrupt handlers, are not (they must not invoke the scheduler).
http://tldp.org/LDP/khg/HyperNews/get/syscall/syscall86.html
This should help people looking for answers about what happens when the syscall instruction is executed and transfers control to the kernel (user mode to kernel mode). It is based on the x86_64 architecture.
https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html

what is a reentrant kernel

What is a reentrant kernel?
Much simpler answer:
Kernel Re-Entrance
If the kernel is not re-entrant, a process can only be suspended while it is in user mode. Although it could be suspended in kernel mode, that would still block kernel mode execution on all other processes. The reason for this is that all kernel threads share the same memory. If execution would jump between them arbitrarily, corruption might occur.
A re-entrant kernel enables processes (or, to be more precise, their corresponding kernel threads) to give away the CPU while in kernel mode. They do not hinder other processes from also entering kernel mode. A typical use case is IO wait. The process wants to read a file. It calls a kernel function for this. Inside the kernel function, the disk controller is asked for the data. Getting the data will take some time and the function is blocked during that time.
With a re-entrant kernel, the scheduler will assign the CPU to another process (kernel thread) until an interrupt from the disk controller indicates that the data is available and our thread can be resumed. This process can still access IO (which needs kernel functions), like user input. The system stays responsive and CPU time waste due to IO wait is reduced.
This is pretty much standard for today's desktop operating systems.
Kernel pre-emption
Kernel pre-emption does not help the overall throughput of the system. Instead, it aims for better responsiveness.
The idea here is that normally kernel functions are only interrupted by hardware causes: either external interrupts, or IO wait cases, where the function voluntarily gives away control to the scheduler. A pre-emptive kernel instead also interrupts and suspends kernel functions just like it would interrupt processes in user mode. The system is more responsive, as processes, e.g. those handling mouse input, are woken up even while heavy work is done inside the kernel.
Pre-emption on kernel level makes things harder for the kernel developer: a kernel function can be suspended not only voluntarily or by interrupt handlers (which are a somewhat controlled environment), but also by any other process, via the scheduler. Care has to be taken to e.g. avoid deadlocks: a thread locks resource A and, while still needing resource B, is interrupted by another thread which locks resource B but then needs resource A.
Take my explanation of pre-emption with a grain of salt. I'm happy for any corrections.
All Unix kernels are reentrant. This means that several processes may be executing in Kernel Mode at the same time. Of course, on uniprocessor systems, only one process can progress, but many can be blocked in Kernel Mode when waiting for the CPU or the completion of some I/O operation. For instance, after issuing a read to a disk on behalf of a process, the kernel lets the disk controller handle it and resumes executing other processes. An interrupt notifies the kernel when the device has satisfied the read, so the former process can resume the execution.
One way to provide reentrancy is to write functions so that they modify only local variables and do not alter global data structures. Such functions are called reentrant functions. But a reentrant kernel is not limited only to such reentrant functions (although that is how some real-time kernels are implemented). Instead, the kernel can include nonreentrant functions and use locking mechanisms to ensure that only one process can execute a nonreentrant function at a time.
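The second approach, wrapping a nonreentrant function in a lock, can be sketched in user space as follows. This is an analogy using Python threads rather than kernel code, and the names counter, bump_nonreentrant(), bump_locked() and run_workers() are invented for the example.

```python
import threading

counter = 0                       # global data shared by every caller
counter_lock = threading.Lock()   # lets only one caller into the critical part


def bump_nonreentrant():
    """Read-modify-write on global data: unsafe if two callers interleave here."""
    global counter
    value = counter
    counter = value + 1


def bump_locked():
    """The same function made safe by serializing its callers."""
    with counter_lock:
        bump_nonreentrant()


def run_workers(n_threads=4, n_calls=1000):
    """Call bump_locked() from several threads and return the final count."""
    global counter
    counter = 0

    def body():
        for _ in range(n_calls):
            bump_locked()

    threads = [threading.Thread(target=body) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

Because every update happens while holding the lock, no increment can be lost, so the final count equals the number of threads multiplied by the number of calls per thread.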
If a hardware interrupt occurs, a reentrant kernel is able to suspend the current running process even if that process is in Kernel Mode. This capability is very important, because it improves the throughput of the device controllers that issue interrupts. Once a device has issued an interrupt, it waits until the CPU acknowledges it. If the kernel is able to answer quickly, the device controller will be able to perform other tasks while the CPU handles the interrupt.
Now let's look at kernel reentrancy and its impact on the organization of the kernel. A kernel control path denotes the sequence of instructions executed by the kernel to handle a system call, an exception, or an interrupt.
In the simplest case, the CPU executes a kernel control path sequentially from the first instruction to the last. When one of the following events occurs, however, the CPU interleaves the kernel control paths:
A process executing in User Mode invokes a system call, and the corresponding kernel control path verifies that the request cannot be satisfied immediately; it then invokes the scheduler to select a new process to run. As a result, a process switch occurs. The first kernel control path is left unfinished, and the CPU resumes the execution of some other kernel control path. In this case, the two control paths are executed on behalf of two different processes.
The CPU detects an exception (for example, access to a page not present in RAM) while running a kernel control path. The first control path is suspended, and the CPU starts the execution of a suitable procedure. In our example, this type of procedure can allocate a new page for the process and read its contents from disk. When the procedure terminates, the first control path can be resumed. In this case, the two control paths are executed on behalf of the same process.
A hardware interrupt occurs while the CPU is running a kernel control path with the interrupts enabled. The first kernel control path is left unfinished, and the CPU starts processing another kernel control path to handle the interrupt. The first kernel control path resumes when the interrupt handler terminates. In this case, the two kernel control paths run in the execution context of the same process, and the total system CPU time is accounted to it. However, the interrupt handler doesn't necessarily operate on behalf of the process.
An interrupt occurs while the CPU is running with kernel preemption enabled, and a higher priority process is runnable. In this case, the first kernel control path is left unfinished, and the CPU resumes executing another kernel control path on behalf of the higher priority process. This occurs only if the kernel has been compiled with kernel preemption support.
This information is available at http://jno.glas.net/data/prog_books/lin_kern_2.6/0596005652/understandlk-CHP-1-SECT-6.html
More at http://linux.omnipotent.net/article.php?article_id=12496&page=-1
The kernel is the core part of an operating system that interfaces directly with the hardware and schedules processes to run.
Processes call kernel functions to perform tasks such as accessing hardware or starting new processes. For certain periods of time, therefore, a process will be executing kernel code. A kernel is called reentrant if more than one process can be executing kernel code at the same time. "At the same time" can mean either that two processes are actually executing kernel code concurrently (on a multiprocessor system) or that one process has been interrupted while it is executing kernel code (because it is waiting for hardware to respond, for instance) and that another process that has been scheduled to run has also called into the kernel.
A reentrant kernel provides better performance because there is no contention for the kernel. A kernel that is not reentrant needs to use a lock to make sure that no two processes are executing kernel code at the same time.
A reentrant function is one that can be used by more than one task concurrently without fear of data corruption. Conversely, a non-reentrant function is one that cannot be shared by more than one task unless mutual exclusion to the function is ensured either by using a semaphore or by disabling interrupts during critical sections of code. A reentrant function can be interrupted at any time and resumed at a later time without loss of data. Reentrant functions either use local variables or protect their data when global variables are used.
A reentrant function:
Does not hold static data over successive calls
Does not return a pointer to static data; all data is provided by the caller of the function
Uses local data or ensures protection of global data by making a local copy of it
Must not call any non-reentrant functions
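A small illustration of the first two points, as a sketch with made-up function names: the non-reentrant version keeps its result in static (module-level) storage and returns a reference to it, so a second call silently overwrites what the first caller received.

```python
_static_result = []     # static data that survives across calls


def upper_nonreentrant(words):
    """Stores its result in static storage and returns a reference to it."""
    _static_result.clear()
    _static_result.extend(w.upper() for w in words)
    return _static_result


def upper_reentrant(words):
    """Uses only local data; every call gets its own independent result."""
    return [w.upper() for w in words]


first = upper_nonreentrant(["a"])
second = upper_nonreentrant(["b"])   # overwrites the data `first` refers to

safe_first = upper_reentrant(["a"])
safe_second = upper_reentrant(["b"])
```

Here first and second are the same list object, so the result of the first call is lost, whereas safe_first and safe_second remain independent.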

Resources