Context switching kernel processes in Linux

Consider the process keventd. It spends its entire lifetime in kernel mode.
Now, as far as I know, Linux checks whether a context switch is due while a process is switching from kernel mode to user mode, and as far as I know, keventd will never switch from kernel mode to user mode. So how will the Linux kernel know when to switch it out?

If the kernel were to do as you say, and only check whether a process is due to be switched out on an explicit kernel-mode/user-mode transition, then the following loop would lock up a core of your computer:
    while (1);
Obviously, this does not happen on normal desktop operating systems. The reason is preemption: after a process has run for its time slice, the kernel receives a timer interrupt, steps in, and forcibly switches contexts as necessary.
Preemption could in principle work for kernel processes too. However, I'm not sure that's what keventd does - it's more likely that it voluntarily relinquishes its time slice on a regular basis (see sched_yield, a userspace call for the same effect), especially since the kernel can be configured to be non-preemptible. That is a kernel process' prerogative.
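For a concrete picture, a keventd-style kernel thread usually follows a loop like this minimal sketch. Here kthread_should_stop, set_current_state and schedule are real kernel APIs; the two work helpers are made up for illustration:

    #include <linux/kthread.h>
    #include <linux/sched.h>

    bool work_is_pending(void);        /* hypothetical helper */
    void process_pending_work(void);   /* hypothetical helper */

    /* Sketch of a keventd-style kernel thread: it lives entirely in
     * kernel mode, but sleeps voluntarily whenever there is no work. */
    static int my_worker(void *unused)
    {
        while (!kthread_should_stop()) {
            process_pending_work();

            set_current_state(TASK_INTERRUPTIBLE); /* about to sleep */
            if (!work_is_pending())
                schedule();           /* voluntarily give up the CPU */
            __set_current_state(TASK_RUNNING);
        }
        return 0;
    }

The call to schedule() is the voluntary relinquishing described above: the thread parks itself until something wakes it, so it never needs to be forcibly preempted.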

Related

How does the operating system preempt a process and regain control?

When a process is running on a CPU, the operating system is not running in the background, since a single-core CPU can execute only one instruction at a time. So how does the operating system preempt a process? Is it done by the hardware?
I couldn't find an answer anywhere.
To understand how the OS regains control of a process, you must understand the concept of interrupts. An interrupt is a signal sent to the CPU signifying that the current process must be stopped (i.e. interrupted) so that another process can begin. In some sense this is accomplished at the hardware level, as there are dedicated registers in the CPU in which interrupt bits are placed.
When an interrupt occurs, the contents of the CPU's registers are stored, the current stack pointer is saved, and the program counter is then pointed to the next instruction set forth by the scheduler, which decides which process to begin next - usually the interrupting one. Barring deadlock, in which no progress can be made on any process, the scheduler will make its way back to the original process, and that process's execution context will be reloaded into the machine (since it was saved earlier). This concept of saving the state of the machine, executing a new process, and returning to the original process is known as a context switch.
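To make the save/restore step concrete, here is a heavily simplified, architecture-neutral sketch. Every name in it is illustrative, since real kernels do this part in assembly (e.g. __switch_to on x86):

    /* Illustrative only - a real context switch is done in assembly;
     * every name here is made up. */
    struct cpu_context {
        unsigned long pc;        /* program counter */
        unsigned long sp;        /* stack pointer */
        unsigned long regs[16];  /* general-purpose registers */
    };

    void save_registers(struct cpu_context *ctx);     /* hypothetical */
    void restore_registers(struct cpu_context *ctx);  /* hypothetical */

    void context_switch(struct cpu_context *prev, struct cpu_context *next)
    {
        save_registers(prev);     /* preserve the interrupted task's state */
        restore_registers(next);  /* load the chosen task's state */
        /* execution now resumes wherever `next` last stopped */
    }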

Linux process scheduling in kernel mode

Here is some description quoted from Wikipedia:
The Linux kernel provides preemptive scheduling under certain conditions. Until kernel version 2.4, only processes were preemptive, i.e. in addition to time quantum expiration, an execution of current process in user mode would be interrupted if higher dynamic priority processes entered TASK_RUNNING state. Towards Linux 2.6, an ability to interrupt a task executing kernel code was added, although with that not all sections of the kernel code can be preempted.
Then it also says this:
Preemption improves latency, increases responsiveness, and makes Linux more suitable for desktop and real-time applications. Older versions of the kernel had a so-called big kernel lock for synchronization across the entire kernel. This was finally removed by Arnd Bergmann in 2011.
So does the above statement still hold for the current Linux kernel, i.e. is kernel preemption conditional? For example, if a process traps into kernel mode by making a system call, is it exempt from preemptive scheduling until it returns?
Where can I find some up-to-date introductory articles/books about Linux scheduling in both user mode and kernel mode?
Of course kernel preemption is conditional. You would not want the kernel to switch tasks while holding an exclusive lock or while writing to time-sensitive hardware registers in a device driver.
However, the Linux kernel does its best to minimize these conditions in order to make preemption happen as quickly as it can.
Note that this in-kernel preemption is only compiled into the kernel when the compile option CONFIG_PREEMPT is enabled. There is also CONFIG_PREEMPT_VOLUNTARY, which only switches tasks at points where the kernel explicitly checks for it.
Kernel preemption comes at a cost. Rapidly switching tasks means doing a lot of mostly wasted housekeeping work instead of actual work, which slows down the whole system and results in less work being done. That is why these compile options exist. A Linux kernel built for a database or web server should not use preemption at all. A kernel built for HPC is sometimes modified to only switch tasks once a second, or even less often.
That all changes for real-time tasks. These tasks rely on reacting quickly and within a reliable timeframe. The default Linux kernel is pretty good at this, but there is a patch set called the "-rt patches" that makes it really good. The patch set does all sorts of things like prioritize interrupt handlers and change kernel locks so that locks can be dropped and restarted later.
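Coming back to the locking point above: kernel code marks such non-preemptible regions explicitly. Here is a minimal sketch using the real preempt_disable()/preempt_enable() primitives; the device-register helper is made up:

    #include <linux/preempt.h>

    void write_reg(int reg, int val);   /* hypothetical MMIO helper */

    /* Sketch: mark a time-sensitive region as non-preemptible. */
    void program_device(void)
    {
        preempt_disable();       /* no task switch can happen in here */
        write_reg(0, 1);         /* hypothetical device register writes */
        write_reg(1, 42);
        preempt_enable();        /* preemption is allowed again */
    }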
CPU scheduling decisions may take place when a process:
1. Switches from running to waiting state (e.g. I/O request)
2. Switches from running to ready state (e.g. Interrupt)
3. Switches from waiting to ready (e.g. I/O completion)
4. Terminates
Scheduling under 1 and 4 is non-preemptive; all other scheduling is preemptive and has to deal with the possibility that operations (system calls) may be incomplete.
Yes, Linux provides preemptive scheduling under certain conditions, unlike some Unix variants where the kernel schedules until completion without preemption. In Linux 2.6, the kernel was made preemptive: a running task can be preempted as long as it is not holding a lock and it is safe to reschedule.
Older versions of the kernel had a so-called big kernel lock for synchronization
across the entire kernel.
This refers to a single global lock serializing execution of kernel code: while one thread held it, no other thread could run kernel code.

What does it mean to say "the Linux kernel is preemptive"?

I read that the Linux kernel is preemptive, which is different from most Unix kernels. So, what does it really mean for a kernel to be preemptive?
Some analogies or examples would be better than pure theoretical explanation.
ADD 1 -- 11:00 AM 12/7/2018
Preemptive is just one paradigm of multi-tasking. There are others like Cooperative Multi-tasking. A better understanding can be achieved by comparing them.
Prior to Linux kernel version 2.5.4, the Linux kernel was not preemptive, which means a process running in kernel mode could not be moved off the processor until it left the processor itself or started waiting for some input/output operation to complete.
Generally, a process in user mode can enter kernel mode using system calls. Previously, when the kernel was non-preemptive, a lower-priority process could effectively priority-invert a higher-priority process by repeatedly making system calls and remaining in kernel mode, denying the higher-priority process access to the processor. Even if the lower-priority process's timeslice expired, it would continue running until it completed its work in the kernel or voluntarily relinquished control. If the higher-priority process waiting to run was a text editor in which the user was typing, or an MP3 player ready to refill its audio buffer, the result was poor interactive performance. A non-preemptive kernel was thus a major drawback at the time.
Imagine the simple view of preemptive multi-tasking. We have two user tasks, both of which are running all the time without using any I/O or performing kernel calls. Those two tasks don't have to do anything special to be able to run on a multi-tasking operating system. The kernel, typically based on a timer interrupt, simply decides that it's time for one task to pause to let another one run. The task in question is completely unaware that anything happened.
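You can watch this happen from userspace. In the sketch below (plain POSIX threads; build with gcc -pthread), neither thread ever yields, yet both counters advance, because the kernel's timer interrupt preempts them:

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile unsigned long a, b;

    /* Two busy loops that never block, never yield, never syscall. */
    static void *spin_a(void *arg) { (void)arg; for (;;) a++; }
    static void *spin_b(void *arg) { (void)arg; for (;;) b++; }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, spin_a, NULL);
        pthread_create(&t2, NULL, spin_b, NULL);
        sleep(1);
        /* Both counters advance even on a single core: the kernel
         * preempted the busy loops on timer ticks, without cooperation. */
        printf("a=%lu b=%lu\n", a, b);
        return 0;
    }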
However, most tasks make occasional requests of the kernel via syscalls. When this happens, the same user context exists, but the CPU is running kernel code on behalf of that task.
Older Linux kernels would never allow preemption of a task while it was busy running kernel code. (Note that I/O operations always voluntarily re-schedule. I'm talking about a case where the kernel code has some CPU-intensive operation like sorting a list.)
If the system allows that task to be preempted while it is running kernel code, then we have what is called a "preemptive kernel." Such a system is immune to unpredictable delays that can be encountered during syscalls, so it might be better suited for embedded or real-time tasks.
For example, if on a particular CPU there are two tasks available, and one takes a syscall that takes 5ms to complete, and the other is an MP3 player application that needs to feed the audio pipe every 2ms, you might hear stuttering audio.
The argument against preemption is that all kernel code that might be called in task context must be able to survive preemption-- there's a lot of poor device driver code, for example, that might be better off if it's always able to complete an operation before allowing some other task to run on that processor. (With multi-processor systems the rule rather than the exception these days, all kernel code must be re-entrant, so that argument isn't as relevant today.) Additionally, if the same goal could be met by improving the syscalls with bad latency, perhaps preemption is unnecessary.
A compromise is CONFIG_PREEMPT_VOLUNTARY, which allows a task-switch at certain points inside the kernel, but not everywhere. If there are only a small number of places where kernel code might get bogged down, this is a cheap way of reducing latency while keeping the complexity manageable.
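A voluntary preemption point looks roughly like this in kernel code. Here cond_resched() is the real primitive; the list walk and its helpers are made up:

    #include <linux/list.h>
    #include <linux/sched.h>

    struct my_item {                          /* hypothetical element type */
        struct list_head node;
        int payload;
    };

    void process_item(struct my_item *item);  /* hypothetical work */

    /* Sketch: a CPU-heavy kernel loop with explicit preemption points. */
    void walk_long_list(struct list_head *head)
    {
        struct my_item *item;

        list_for_each_entry(item, head, node) {
            process_item(item);
            cond_resched();  /* a task switch may occur right here */
        }
    }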
Traditional Unix kernels had a single lock, which was held by a thread while kernel code was running. Therefore no other kernel code could interrupt that thread.
This made designing the kernel easier, since you knew that while one thread was using kernel resources, no other thread was. Therefore different threads could not mess up each other's work.
In single-processor systems this doesn't cause too many problems.
However, in multiprocessor systems, you could have a situation where several threads on different processors or cores all wanted to run kernel code at the same time. This means that, depending on the type of workload, you could have lots of processors, all of which spend most of their time waiting for each other.
In Linux 2.6, the kernel resources were divided up into much smaller units, protected by individual locks, and the kernel code was reviewed to make sure that locks were only held while the corresponding resources were in use. So now different processors only have to wait for each other if they want access to the same resource (for example, a specific hardware resource).
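To see what that looks like in code, here is a minimal sketch with per-resource spinlocks. DEFINE_SPINLOCK, spin_lock and spin_unlock are real kernel primitives; the packet queues are made up:

    #include <linux/spinlock.h>

    struct packet;                       /* hypothetical */
    void rx_enqueue(struct packet *p);   /* hypothetical */

    /* One lock per resource instead of one big kernel lock. */
    static DEFINE_SPINLOCK(rx_lock);
    static DEFINE_SPINLOCK(tx_lock);     /* held only by tx paths */

    void handle_rx(struct packet *p)
    {
        spin_lock(&rx_lock);     /* contends only with other rx users */
        rx_enqueue(p);
        spin_unlock(&rx_lock);
        /* a CPU in here never blocks a CPU working on the tx queue */
    }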
Preemption allows the kernel to give the IMPRESSION of parallelism: you've got only one processor (say, a decade ago), but you feel like all your processes are running simultaneously. That's because the kernel preempts the execution of one process (i.e., takes the CPU away from it) to give it to the next one (perhaps according to their priority).
EDIT Non-preemptive kernels wait for processes to hand back control (i.e., during syscalls), so if your process computes a lot of data and doesn't call any kind of yield function, the other processes won't be able to execute their own work. Such systems are said to be cooperative because they require the cooperation of the processes to ensure fair sharing of execution time.
EDIT 2 The main goal of preemption is to improve the reactivity of the system among multiple tasks, which is good for end-users; on the other hand, servers want to achieve the highest throughput, so they don't need it (from the Linux kernel configuration):
Preemptible kernel (low-latency desktop)
Voluntary kernel preemption (desktop)
No forced preemption (server)
The Linux kernel is monolithic and gives a small slice of computing time to each running process in turn. It means that the processes (e.g. the programs) do not truly run concurrently, but each is regularly given a timespan in which to execute its logic. The main problem is that some logic can take long to terminate, preventing the kernel from granting time to the next process. This results in system "lags".
A preemptive kernel has the ability to switch contexts. It means that it can stop a "hanging" process even if it is not finished, and give the computing time to the next process as expected. The "hanging" process will continue to execute when its turn comes, without any problem.
Practically, it means that the kernel can handle tasks in real time, which is particularly interesting for audio recording and editing.
The Ubuntu Studio distribution packages a preemptive kernel as well as a bunch of quality free software devoted to audio and video editing.
It means that the operating system scheduler is free to suspend the execution of a running process and give the CPU to another process whenever it wants; the normal way to do this is to give each process waiting for the CPU a "quantum" of CPU time to run. After the quantum has expired, the scheduler takes back control (and the running process cannot prevent this) and gives another quantum to another process.
This method is often compared with cooperative multitasking, in which processes keep the CPU for as long as they need it, without being interrupted; to let other applications run, they have to explicitly call some kind of "yield" function. Naturally, to avoid giving the feeling that the system is stuck, well-behaved applications will yield the CPU often. Still, if there's a bug in an application (e.g. an infinite loop without yield calls), the whole system will hang, since the CPU is completely kept by the faulty program.
Almost all recent desktop OSes use preemptive multitasking, which, even if it's more expensive in terms of resources, is in general more stable (it's harder for a single faulty app to hang the whole system, since the OS is always in control). On the other hand, when resources are tight and applications are expected to be well-behaved, cooperative multitasking is used. Windows 3 was a cooperative multitasking OS; a more recent example is RockBox, an open-source PMP firmware replacement.
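To make the contrast vivid, here is a toy cooperative scheduler in plain C using the POSIX ucontext API. Each task must explicitly hand over the CPU, exactly as described above; the tasks themselves are made up:

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t ctx_main, ctx_a, ctx_b;

    static void task_a(void)
    {
        for (int i = 0; i < 3; i++) {
            printf("A runs\n");
            swapcontext(&ctx_a, &ctx_b);   /* explicit yield to B */
        }
    }

    static void task_b(void)
    {
        for (int i = 0; i < 3; i++) {
            printf("B runs\n");
            swapcontext(&ctx_b, &ctx_a);   /* explicit yield to A */
        }
    }

    int main(void)
    {
        static char stack_a[64 * 1024], stack_b[64 * 1024];

        getcontext(&ctx_a);
        ctx_a.uc_stack.ss_sp = stack_a;
        ctx_a.uc_stack.ss_size = sizeof stack_a;
        ctx_a.uc_link = &ctx_main;          /* return here when A ends */
        makecontext(&ctx_a, task_a, 0);

        getcontext(&ctx_b);
        ctx_b.uc_stack.ss_sp = stack_b;
        ctx_b.uc_stack.ss_size = sizeof stack_b;
        ctx_b.uc_link = &ctx_main;
        makecontext(&ctx_b, task_b, 0);

        swapcontext(&ctx_main, &ctx_a);     /* start task A */
        return 0;
    }

If task_a ever forgot its swapcontext call, task_b would never print anything - exactly the failure mode of a buggy app in a cooperative system.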
I think everyone did a good job of explaining this, but I'm just going to add a little more info in the context of Linux IRQs, interrupts, and the kernel scheduler.
The process scheduler is the component of the OS responsible for deciding whether the currently running job/process should continue to run and, if not, which process should run next.
A preemptive scheduler is one that allows a running process to be interrupted: the running process changes state, and another process gets to run.
On the other hand, a non-preemptive scheduler cannot take the CPU away from a process (this is also called cooperative scheduling).
FYI, the word "cooperative" can be confusing, because it does not clearly indicate what the scheduler actually does.
For example, older Windows versions like 3.1 had cooperative schedulers.
I think it became preemptive in 2.6. Preemptive means that when a new process is ready to run, the CPU will be allocated to it; it doesn't need the running process to cooperate and give up the CPU.
"The Linux kernel is preemptive" means that the kernel supports preemption.
For example, suppose there are two processes, P1 (higher priority) and P2 (lower priority), which are making read system calls and are thus running in kernel mode. Suppose P2 is currently running in kernel mode when P1 becomes ready to run.
If kernel preemption is available, then preemption can happen at the kernel level: P2 can be preempted and put to sleep, and P1 can continue to run.
If kernel preemption is not available, then since P2 is in kernel mode, the system simply waits until P2 completes (or voluntarily yields) before P1 can be scheduled.
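As an aside, you can ask for higher-priority treatment from userspace. This sketch uses the real sched_setscheduler() call to give the current process a real-time priority, so it preempts lower-priority tasks as soon as it becomes runnable (priority 50 is an arbitrary choice, and the call requires root or CAP_SYS_NICE):

    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_param sp = { .sched_priority = 50 };

        /* Switch this process to the SCHED_FIFO real-time class. */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
            perror("sched_setscheduler (needs root/CAP_SYS_NICE)");
            return 1;
        }
        printf("now SCHED_FIFO: we preempt lower-priority tasks when runnable\n");
        return 0;
    }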

Internals of a Linux system call

What happens (in detail) when a thread makes a system call by raising interrupt 0x80? What work does Linux do to the thread's stack and other state? What changes are done to the processor to put it into kernel mode? After running the interrupt handler, how is control restored back to the calling process?
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded and how does it then obtain control again?
A crash course in kernel mode in one Stack Overflow answer
Good questions! (Interview questions?)
What happens (in detail) when a thread makes a system call by raising interrupt 0x80?
The int $0x80 operation is vaguely like a function call. The CPU "takes a trap" and restarts at a known address in kernel mode, typically with a different MMU mode as well. The kernel will save many of the registers, though it doesn't have to save the registers that a program would not expect an ordinary function call to save.
What work does Linux do to the thread's stack and other state?
Typically an OS will save registers that the ABI promises not to change during procedure calls. The stack will stay the same; the kernel will run on a per-thread kernel stack rather than the per-thread user stack. Naturally some state will change, otherwise there would be no reason to do the system call.
What changes are done to the processor to put it into kernel mode?
This is usually entirely automatic. The CPU has, generically, a software-interrupt instruction that is a bit like a function-call operation. It will cause the switch to kernel mode under controlled conditions. Typically, the CPU will change some sort of PSW protection bit, save the old PSW and PC, start at a well-known trap vector address, and may also switch to a different memory management protection and mapping arrangement.
After running the interrupt handler, how is control restored back to the calling process?
There will be some sort of "return from interrupt" or "return from trap" instruction, typically, that will act a bit like a complicated function-return instruction. Some RISC processors did very little automatically and required specific code to do the return and some CISC processors like x86 have (never-really-used) instructions that would execute dozens of operations documented in pages of architecture-manual pseudo-code for capability adjustments.
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded and how does it then obtain control again?
The kernel itself is threaded much like a threaded user program is. It just switches stacks (threads) and works on someone else's process for a while.
To answer the last part of the question - what does the kernel do if the system call needs to sleep -
After a system call, the kernel is still logically running in the context of the same task that made the system call - it's just in kernel mode rather than user mode - it is NOT a separate thread and most system calls do not invoke logic from another task/thread. What happens is that the system call calls wait_event, or wait_event_timeout or some other wait function, which adds the task to a list of tasks waiting for something, then puts the task to sleep, which changes its state, and calls schedule() to relinquish the current CPU.
After this the task cannot be run again until it gets woken up, typically by another task (kernel task, etc) or interrupt handler calling a wake* function which will wake up the task(s) sleeping waiting for that particular event, which means the scheduler will soon schedule them again.
It's worth noting that userspace tasks (i.e. threads) are only one type of task and there are a few others internal to the kernel which can do work as well - these are kernel threads and bottom half handlers / tasklets / task queues etc. Work which doesn't belong to any particular userspace process (for example network handling e.g. responding to pings) gets done in these. These tasks are allowed to go to sleep, unlike interrupts (which should not invoke the scheduler)
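Sketched in kernel-style C (the wait-queue primitives are real; the driver details are made up):

    #include <linux/wait.h>
    #include <linux/sched.h>

    static DECLARE_WAIT_QUEUE_HEAD(my_wq);  /* hypothetical device queue */
    static int data_ready;

    /* System-call path: runs in the calling task's context, kernel mode. */
    long my_read_syscall(void)
    {
        /* Adds the task to my_wq, marks it sleeping, and calls schedule()
         * internally; returns nonzero if interrupted by a signal. */
        if (wait_event_interruptible(my_wq, data_ready))
            return -ERESTARTSYS;
        return 0;  /* data_ready became true and we were woken */
    }

    /* Interrupt handler (or another task): wakes the sleeper. */
    void my_irq_handler(void)
    {
        data_ready = 1;
        wake_up_interruptible(&my_wq);  /* scheduler will run the task soon */
    }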
http://tldp.org/LDP/khg/HyperNews/get/syscall/syscall86.html
This should help people who are looking for an answer to what happens when the syscall instruction is executed, transferring control to the kernel (user mode to kernel mode). This is based on the x86_64 architecture.
https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html
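To see the user-mode side of the trap, here is a minimal 32-bit x86 sketch that invokes write(2) directly via int $0x80. Syscall number 4 and the register convention are the i386 ABI; build with gcc -m32:

    /* Minimal raw system call on 32-bit x86: write(1, msg, len). */
    int main(void)
    {
        static const char msg[] = "hello from int $0x80\n";
        long ret;

        /* i386 convention: eax = syscall number (4 = write),
         * ebx, ecx, edx = first three arguments. */
        __asm__ volatile ("int $0x80"
                          : "=a" (ret)
                          : "0" (4L), "b" (1L), "c" (msg), "d" (sizeof msg - 1)
                          : "memory");

        return ret < 0;
    }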

User mode vs supervisor mode

I have a few questions about user mode and supervisor mode on Unix-like machines.
What is the difference between user-mode and supervisor-mode? I know that the user processes cannot access all memory and hardware and execute all instructions. Is there more to this?
What are the advantages of having different modes?
What are the steps involved when one switches from the user-mode to the supervisor mode?
When a system call is made by a user program, the mode has to change from user mode to supervisor mode. I have read elsewhere that this is achieved on x86 machines by using int 0x80. So how is a mode switch different from interrupt handling?
How is it different from a context-switch?
How are supervisor modes implemented in different architectures?
Any answers or pointers will be appreciated!
The CPU will not physically allow access to the areas which are determined as "privileged". Because this is enforced in hardware, it gives your operating system the capability to protect itself. Without this mechanism there would be no "security" in an operating system, as the most obscure piece of code could simply access kernel memory and read all the passwords for instance.
Switching from user mode to supervisor mode is expensive because it involves a context change, and, for security purposes on some architectures, cache or TLB state must be flushed (otherwise you might be able to access something that you weren't meant to).
As for a context switch, this inherently involves a switch to kernel mode to perform a task. When the scheduler's timer interrupt fires, the CPU switches into kernel mode, selects the next task to execute, and then switches back to user mode to resume that task.
Two concepts exist:
software user/kernel modes, which are switched from each other when performing a system call or a return from a system call,
hardware user/supervisor modes, which are switched from each other on interrupts.
Very little code is executed in HW supervisor mode, mainly low-level interrupt routines and the very beginning of startup. Even most SW kernel-mode code is executed in HW user mode.

Resources