Process Management in Linux Kernel

I have been studying the subsystems of the Linux kernel. There, it is written that the Linux kernel is responsible for context switching (letting another process use the CPU). Here are the steps the kernel goes through to do a context switch:
1. The CPU (the actual hardware) interrupts the current process based on an internal timer, switches into kernel mode, and hands control back to the kernel.
2. The kernel records the current state of the CPU and memory, which will be essential to resuming the process that was just interrupted.
3. The kernel performs any tasks that might have come up during the preceding time slice (such as collecting data from input and output, or I/O, operations).
4. The kernel is now ready to let another process run. The kernel analyzes the list of processes that are ready to run and chooses one.
5. The kernel prepares the memory for this new process, and then prepares the CPU.
6. The kernel tells the CPU how long the time slice for the new process will last.
7. The kernel switches the CPU into user mode and hands control of the CPU to the process.
My problem is that I can't understand the 3rd step above. Can someone please shed some light on that sentence? Thanks!

The CPU (the actual hardware) interrupts the current process based on an internal timer, switches into kernel mode, and hands control back to the kernel.
Most task switches are caused by tasks blocking (because they have to wait for a mutex, disk IO, user IO, end-user action, etc).
It's better (more accurate) to say that "something" (IRQ, system call) causes a switch to the kernel's code before the kernel decides it wants to do a task switch, and that this "something" is not part of the task switch itself.
The kernel records the current state of the CPU and memory, which will be essential to resuming the process that was just interrupted.
Sort of. Because "something" (IRQ, system call) causes a switch to the kernel's code before the kernel decides it wants to do a task switch, all task switches only ever happen between the kernel's code (for one task) and the kernel's code (for another task). Because task switches only ever switch from kernel's code to kernel's code, the task switch itself doesn't need to care about user-space memory (which isn't relevant to kernel code) or kernel memory (which is global/shared by all CPUs and all virtual address spaces). Moreover, because some registers are "callee preserved" (by C calling conventions) and some are "constant as far as the kernel is concerned" (e.g. segment registers), the task-switch code doesn't need to care about various parts of the CPU state either.
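If it helps to see the shape of this in code, here is a small user-space analogy (a sketch assuming glibc's ucontext API; it is not kernel code): swapcontext() saves the current stack pointer and callee-saved registers and loads another context's, which is roughly what the kernel's task-switch code does with its per-task kernel stacks.

    /* User-space analogy, not kernel code: swapcontext() saves the callee-saved
     * registers and stack pointer of the current context and loads another's,
     * conceptually what the kernel does with its per-task kernel stacks. */
    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, task_ctx;
    static char task_stack[64 * 1024];           /* the "other task's" stack */

    static void task(void)
    {
        puts("running on the other stack");
        swapcontext(&task_ctx, &main_ctx);       /* "switch" back */
    }

    int main(void)
    {
        getcontext(&task_ctx);                   /* initialize the new context */
        task_ctx.uc_stack.ss_sp = task_stack;
        task_ctx.uc_stack.ss_size = sizeof task_stack;
        task_ctx.uc_link = &main_ctx;
        makecontext(&task_ctx, task, 0);

        puts("before switch");
        swapcontext(&main_ctx, &task_ctx);       /* save mine, load the other's */
        puts("after switch");
        return 0;
    }

The analogy only covers the "save a little CPU state for one stack, load it for another" part; choosing the next task, address spaces and time slices all happen around it.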
The kernel performs any tasks that might have come up during the preceding time slice (such as collecting data from input and output, or I/O, operations).
Also not part of the task switch (more "things that happen before or after kernel decides to do a task switch").
The kernel is now ready to let another process run. The kernel analyzes the list of processes that are ready to run and chooses one.
Sort of; but it's not as simple as a list, and sometimes (e.g. when a high-priority real-time thread unblocks and preempts a less important task) the kernel knows which task it needs to switch to without doing anything extra.
The kernel prepares the memory for this new process, and then prepares the CPU.
For memory, the kernel mostly just loads a new "reference to the new task's virtual address space" (e.g. a single mov cr3, ... instruction on 80x86). For CPU state, it's the reverse of "2. The kernel records the current state of the CPU ..." above (loading what was previously saved, where some CPU state isn't loaded and isn't saved).
The kernel tells the CPU how long the time slice for the new process will last.
Yes.
The kernel switches the CPU into user mode and hands control of the CPU to the process.
Not really. It's better (more accurate) to say that after the kernel has finished doing a task switch, the kernel code for the new task does whatever it wants (and may eventually return to user-space), and that "things that happen after the task switch finished" are not part of the task switch.

Difference between Kernel, Kernel-Thread and User-Thread

I'm not sure if I totally understand the above-mentioned differences, so I'd like to explain them in my own words, and you can interrupt me wherever I go wrong:
"A kernel is the initial piece of code which creates kernel-threads. Kernel-threads are processes managed by the kernel. User-threads are part of a process. If you have a single-threaded process, then the whole process itself would be a user-thread. User-threads make system calls, and these system calls are served by a specific kernel-thread which belongs to the calling user-thread. So for every user-thread which makes a system call, a kernel-thread is created, and after the kernel-thread has done its job, it gives control back to the user-thread, and then the kernel-thread is destroyed."
Would this be ok?
Thank you!
Many greetings from Germany!
I don't think that's a very good mental model for kernel vs user. I think it's useful to look at the implementation of these abstractions in order to fully understand them:
What is a Kernel?
A kernel is basically just a piece of memory. It was privileged enough to be loaded before anything else, thereby allowing it to set the CPU's interrupt vectors.
Interrupts control everything, including I/O, timers, and virtual memory. That means that the kernel gets to decide how all that is handled.
A library is also just a piece of memory, and you can very well look at the kernel as the "system call library", among other things. But because the kernel represents the hardware, that piece of memory is shared among everyone.
Kernel Mode vs User Mode
Kernel mode is the CPU's "natural" mode, with no restrictions (on x86 CPUs - "ring 0"). User mode (on x86 CPUs - "ring 3") is when the CPU is instructed to trigger an interrupt whenever certain instructions are used or whenever certain memory locations are accessed. This allows the kernel to have the CPU execute specific kernel code when the user tries to access kernel memory or memory representing I/O ports or hardware memory such as the GPU's frame buffer.
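As a hedged illustration of that restriction (a hypothetical test program, x86 Linux only): executing a privileged instruction such as hlt from ring 3 makes the CPU raise a general protection fault, which the kernel's fault handler delivers to the process as a signal (typically SIGSEGV on Linux).

    /* Hypothetical demo, x86 Linux only: hlt is a ring-0-only instruction, so
     * executing it in user mode traps into the kernel, which sends a signal. */
    #include <signal.h>
    #include <unistd.h>

    static void on_fault(int sig)
    {
        (void)sig;
        static const char msg[] = "privileged instruction refused in user mode\n";
        write(1, msg, sizeof msg - 1);
        _exit(0);
    }

    int main(void)
    {
        signal(SIGSEGV, on_fault);         /* how Linux usually reports the #GP */
        signal(SIGILL, on_fault);          /* in case it arrives as SIGILL */
        __asm__ volatile ("hlt");          /* privileged: traps to the kernel */
        return 1;                          /* never reached */
    }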
Processes and Threads
A process is also just a piece of memory, consisting of its own heap and the memory used by libraries, among which is the kernel.
A thread (= a unit of scheduling) is just a stack with an ID that the kernel knows of and tracks. That's the call stack that the CPU uses when the thread is running. User threads have 2 stacks: one for user mode and one for kernel mode - but they still have the same ID.
Because the kernel controls timers, it sets up a timer to go off e.g. every 1 ms. When the timer triggers ("timer interrupt"), the CPU runs the callback that the kernel set up for that interrupt, where the kernel can see that the current thread has been running for a while and decide to unschedule it and schedule another thread instead.
Virtual Memory Context
By "virtual memory context" I mean all the memory that can be accessed by the CPU. This includes all the memory of the process - including the user-mode heap and memory of libraries, user-mode call stacks of all process threads, kernel-mode stack of all threads in the system, the kernel's heap memory, I/O ports, and hardware memory.
When an interrupt or a system call occurs, the virtual memory context doesn't change; only a CPU flag is flipped (i.e. from ring 3 to ring 0), and the CPU is now back in its "natural" kernel mode where it can freely access kernel memory, I/O ports and hardware memory.
When a new process is created, what actually happens is that a new thread is created, and assigned a new virtual memory context. Therefore, every process starts as single-threaded. That thread can later ask the kernel via a system call to create more threads (= stacks) which share its virtual memory context (= process), or ask the kernel to create more threads, each with a new virtual memory context (= new processes).
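A small user-space sketch of that last point (an assumed example, not from the answer): fork() gives the new thread its own virtual memory context, while pthread_create() creates a thread that shares the caller's context, so only the second write is visible to the original thread. Build with -pthread.

    /* Assumed demo: fork() = new thread + new virtual memory context,
     * pthread_create() = new thread sharing the current context. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int value = 0;

    static void *thread_fn(void *arg)
    {
        (void)arg;
        value = 2;                    /* same address space: main() sees this */
        return NULL;
    }

    int main(void)
    {
        pid_t pid = fork();           /* child gets its own memory context */
        if (pid == 0) {
            value = 1;                /* copy-on-write: only the child's copy */
            exit(0);
        }
        waitpid(pid, NULL, 0);
        printf("after fork:   value = %d\n", value);   /* still 0 */

        pthread_t t;
        pthread_create(&t, NULL, thread_fn, NULL);     /* same memory context */
        pthread_join(t, NULL);
        printf("after thread: value = %d\n", value);   /* now 2 */
        return 0;
    }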
Kernel Threads
Like any other library, the kernel can have its own background threads for optimization purposes. When such a need arises (which can happen in the memory context of any process when servicing a system call), the kernel will create new threads and give them a special memory context, which is a context that only contains the kernel's memory, with no access to memory of any process.
You're mixing up a few somewhat different concepts.
To follow from what you wrote, there is a Kernel, which is a piece of code that handles all internal operations of the Operating System. It does create kernel threads, but the Kernel threads are nothing special. They are just threads which run in "Kernel-Mode" and are not associated with any "User-Mode" process.
Now we have a concept which is lacking from your explanation and is the key to understanding it better: Kernel-Mode (sometimes called system mode), which along with User-Mode makes up the CPU modes available to the OS.
Kernel-Mode is a kind of trusted execution mode, which allows the code to access any memory and execute any instruction. It handles I/O and system interrupts.
User-Mode is a limited mode, which does not allow the executing code to access any memory address except those associated with the User-Mode process.
Also, User-Mode code cannot access I/O or many OS-related functions (such as handle or process creation). For these operations, User-Mode code should call into Kernel-Mode by a system call (as you have correctly mentioned).
A system call is a special CPU instruction which switches the CPU mode to Kernel-Mode and starts executing special code provided by the OS which dispatches the different system calls. So the work is NOT scheduled for a Kernel-Mode thread; instead, the OS (kernel/trusted) code is executed in the context of the same User-Mode thread. The only thing that happens is that the CPU mode changes to Kernel-Mode.
As for completing jobs in a kernel thread: although in some cases some operations (e.g. I/O) might be scheduled for a separate kernel thread to complete, kernel threads are not created and destroyed in the course of a system call.
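To make that concrete, a minimal hedged example (assumed program): the syscall() wrapper issues the CPU's system-call instruction from the current thread, the kernel code runs in that same thread's context in kernel mode, and the result comes straight back; no kernel thread is created or destroyed for the call.

    /* Assumed example: the system call runs in this very thread's context. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        long tid = syscall(SYS_gettid);   /* kernel code, same thread */
        printf("gettid() via raw syscall: %ld (pid %d)\n", tid, (int)getpid());
        return 0;
    }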
Backed by:
10+ years of driver development experience
Also:
http://www.linfo.org/kernel_mode.html
https://learn.microsoft.com/en-us/windows-hardware/drivers/gettingstarted/user-mode-and-kernel-mode

How does a kernel return from the thread

I am doing some hardcore study on computers etc. so I can get started on my own mini Hello World OS.
I was looking at how kernels work and I was wondering how the kernel makes the current thread return to the kernel (so it can switch to another) even though the kernel isn't running and the thread has no instruction to do so.
Does it use some kind of CPU interrupt that goes back to the kernel after a few nanoseconds?
Does it use some kind of CPU interrupt that goes back to the kernel after a few nanoseconds?
It is during timer interrupts and (blocking) system calls that the kernel decides whether to keep executing the currently active thread(s) or switch to another thread. The timer interrupt handler updates resource usage, such as consumed system and user time, for the currently running process, and calls the scheduler_tick() function, which decides whether a process/thread needs to be preempted.
See "Preemption and Context Switching" on page 62 of Linux Kernel Development book.
The kernel, however, must know when to call schedule(). If it called schedule() only when code explicitly did so, user-space programs could run indefinitely. Instead, the kernel provides the need_resched flag to signify whether a reschedule should be performed (see Table 4.1). This flag is set by scheduler_tick() when a process should be preempted, and by try_to_wake_up() when a process that has a higher priority than the currently running process is awakened. The kernel checks the flag, sees that it is set, and calls schedule() to switch to a new process. The flag is a message to the kernel that the scheduler should be invoked as soon as possible because another process deserves to run.
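A toy, self-contained sketch of that flow (hypothetical names, heavily simplified; the real scheduler_tick(), need_resched flag and schedule() are far more involved):

    /* Hypothetical toy model: the timer tick sets a need_resched flag when the
     * time slice runs out; the interrupt-return path checks it and calls the
     * scheduler. Simulated entirely in user space. */
    #include <stdbool.h>
    #include <stdio.h>

    struct task { int time_slice; bool need_resched; };

    static void scheduler_tick(struct task *cur)
    {
        if (--cur->time_slice <= 0)
            cur->need_resched = true;      /* "reschedule as soon as possible" */
    }

    static void schedule(void)
    {
        puts("schedule(): switching to another task");
    }

    static void return_from_interrupt(struct task *cur)
    {
        if (cur->need_resched) {           /* checked on the way back out */
            cur->need_resched = false;
            schedule();
        }
    }

    int main(void)
    {
        struct task t = { .time_slice = 3, .need_resched = false };
        for (int tick = 0; tick < 5; tick++) {   /* simulated timer interrupts */
            scheduler_tick(&t);
            return_from_interrupt(&t);
        }
        return 0;
    }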
Does it use some kind of CPU interrupt
Yes! Modern preemptive kernels are absolutely dependent upon interrupts from hardware to deliver good I/O performance. Keyboard, mouse, disk, NIC, USB, etc. drivers are all entered from interrupts and can make threads that are waiting on them ready/running when required (e.g., when data is available).
Threads can also change state as a result of making an OS call that changes the caller's own state or that of another thread.
The interrupt from the hardware timer is one of many interrupt sources and is only special in that many system operations have timeouts that are signaled by this interrupt. Other than that, the timer interrupt just causes a reschedule which, in most cases, changes nothing re. the ready/running state of threads. If the machine is grossly CPU-overloaded to the point where there are more ready threads than there are cores, there is a side-effect of the timer interrupt that causes CPU time to be shared amongst the ready threads.
Do not fixate on the timer interrupt—the other driver interrupts are absolutely essential. It is not impossible to build a functional preemptive multithreaded kernel with no timer interrupt at all.

Thread context switch Vs. process context switch

Could any one tell me what is exactly done in both situations? What is the main cost each of them?
The main distinction between a thread switch and a process switch is that during a thread switch, the virtual memory space remains the same, while it does not during a process switch.
Both types involve handing control over to the operating system kernel to perform the context switch. The process of switching in and out of the OS kernel along with the cost of switching out the registers is the largest fixed cost of performing a context switch.
A more fuzzy cost is that a context switch messes with the processor's caching mechanisms. Basically, when you context switch, all of the memory addresses that the processor "remembers" in its cache effectively become useless. The one big distinction here is that when you change virtual memory spaces, the processor's Translation Lookaside Buffer (TLB) or equivalent gets flushed, making memory accesses much more expensive for a while. This does not happen during a thread switch.
Process context switching involves switching the memory address space. This includes memory addresses, mappings, page tables, and kernel resources—a relatively expensive operation. On some architectures, it even means flushing various processor caches that aren't sharable across address spaces. For example, x86 has to flush the TLB and some ARM processors have to flush the entirety of the L1 cache!
Thread switching is context switching from one thread to another in the same process (switching from thread to thread across processes is just process switching). Switching processor state (such as the program counter and register contents) is generally very efficient.
First of all, the operating system brings the outgoing thread into kernel mode if it is not already there, because a thread switch can only be performed between threads that run in kernel mode. Then the scheduler is invoked to decide which thread to switch to. After the decision is made, the kernel saves the part of the thread context that is located in the CPU (the CPU registers) into a dedicated place in memory (frequently on the top of the kernel stack of the outgoing thread). Then the kernel switches from the kernel stack of the outgoing thread to the kernel stack of the incoming thread. After that, the kernel loads the previously stored context of the incoming thread from memory into the CPU registers. And finally it returns control back to user mode, but in the user mode of the new thread.
In the case when the OS has determined that the incoming thread runs in another process, the kernel performs one additional step: it sets the new active virtual address space.
The main cost in both scenarios is related to cache pollution. In most cases, the working set used by the outgoing thread will differ significantly from the working set used by the incoming thread. As a result, the incoming thread will start its life with an avalanche of cache misses, flushing old and useless data from the caches and loading new data from memory. The same is true for the TLB (Translation Lookaside Buffer, which is on the CPU). In the case of a reset of the virtual address space (the threads run in different processes), the penalty is even worse, because the reset of the virtual address space leads to the flushing of the entire TLB, even if the new thread actually needs to load only a few new entries. As a result, the new thread will start its time quantum with lots of TLB misses and frequent page walking. The direct cost of a thread switch is also not negligible (from ~250 up to ~1500-2000 cycles) and depends on the CPU complexity, the states of both threads, and the sets of registers which they actually use.
P.S.: Good post about context switch overhead: http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
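If you want a rough feel for the numbers on your own machine, here is a hedged micro-benchmark sketch (an assumed program in the spirit of that post): two processes ping-pong one byte over a pair of pipes, so each round trip forces two context switches plus two pipe operations; pinning both processes to one CPU (e.g. with taskset) makes the figure more meaningful.

    /* Assumed sketch: crude upper bound on context-switch cost via pipe ping-pong. */
    #include <stdio.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        enum { N = 100000 };
        int a[2], b[2];
        char c = 'x';
        if (pipe(a) || pipe(b)) { perror("pipe"); return 1; }

        pid_t pid = fork();
        if (pid == 0) {                          /* child: echo every byte back */
            for (int i = 0; i < N; i++) {
                read(a[0], &c, 1);
                write(b[1], &c, 1);
            }
            _exit(0);
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++) {            /* parent: ping, wait for pong */
            write(a[1], &c, 1);
            read(b[0], &c, 1);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        waitpid(pid, NULL, 0);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("~%.0f ns per round trip (2 switches + 2 pipe ops)\n", ns / N);
        return 0;
    }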
process switching: a transition between two memory-resident processes in a multiprogramming environment;
context switching: a change of context from an executing program to an interrupt service routine (ISR).
In a thread context switch, the virtual memory space remains the same, while it does not in a process context switch. Also, a process context switch is costlier than a thread context switch.
I think the main difference is when calling switch_mm(), which handles the memory descriptors of the old and new task. In the case of threads, the virtual memory address space is unchanged (threads share the virtual memory), so very little has to be done, and it is therefore less costly.
Though a thread context switch needs to change the execution context (registers, stack pointers, program counters), it doesn't need to change the address space as a process context switch does. There's an additional cost when you switch address spaces: more memory accesses (paging, segmentation, etc.) and you have to flush the TLB when entering or exiting a new process.
In short, a thread context switch does not assign a brand new address space and PID; it keeps the same ones, since it stays within the same process. A process context switch, on the other hand, switches to a different process, with its own address space and PID.
There is a lot more to it. Books have been written on it.
As for cost, a process context switch is far more expensive than a thread switch, as you also have to switch the address space (page tables, TLB) in addition to the registers and stack pointers.
Assuming that the CPU the OS runs on has some high-latency devices attached, it makes sense to run another thread of the same process's address space while the high-latency device responds.
But if the high-latency device is responding faster than the time needed to set up the page tables and the virtual-to-physical translations for a NEW process, then it is questionable whether a switch is essential at all.
Also, a hot cache (the data needed for running the process/thread is reachable in less time) is the better choice.

Internals of a Linux system call

What happens (in detail) when a thread makes a system call by raising interrupt 80? What work does Linux do to the thread's stack and other state? What changes are done to the processor to put it into kernel mode? After running the interrupt handler, how is control restored back to the calling process?
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded and how does it then obtain control again?
A crash course in kernel mode in one stack overflow answer
Good questions! (Interview questions?)
What happens (in detail) when a thread makes a system call by raising interrupt 80?
The int $0x80 operation is vaguely like a function call. The CPU "takes a trap" and restarts at a known address in kernel mode, typically with a different MMU mode as well. The kernel will save many of the registers, though it doesn't have to save the registers that a program would not expect an ordinary function call to save.
What work does Linux do to the thread's stack and other state?
Typically an OS will save registers that the ABI promises not to change during procedure calls. The stack will stay the same; the kernel will run on a per-thread kernel stack rather than the per-thread user stack. Naturally some state will change, otherwise there would be no reason to do the system call.
What changes are done to the processor to put it into kernel mode?
This is usually entirely automatic. The CPU has, generically, a software-interrupt instruction that is a bit like a functional-call operation. It will cause the switch to kernel mode under controlled conditions. Typically, the CPU will change some sort of PSW protection bit, save the old PSW and PC, start at a well-known trap vector address, and may also switch to a different memory management protection and mapping arrangement.
After running the interrupt handler, how is control restored back to the calling process?
There will typically be some sort of "return from interrupt" or "return from trap" instruction that will act a bit like a complicated function-return instruction. Some RISC processors did very little automatically and required specific code to do the return, while some CISC processors like x86 have (never-really-used) instructions that would execute dozens of operations, documented in pages of architecture-manual pseudo-code, for capability adjustments.
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded and how does it then obtain control again?
The kernel itself is threaded much like a threaded user program is. It just switches stacks (threads) and works on someone else's process for a while.
To answer the last part of the question - what does the kernel do if the system call needs to sleep -
After a system call, the kernel is still logically running in the context of the same task that made the system call - it's just in kernel mode rather than user mode - it is NOT a separate thread and most system calls do not invoke logic from another task/thread. What happens is that the system call calls wait_event, or wait_event_timeout or some other wait function, which adds the task to a list of tasks waiting for something, then puts the task to sleep, which changes its state, and calls schedule() to relinquish the current CPU.
After this the task cannot be run again until it gets woken up, typically by another task (kernel task, etc) or interrupt handler calling a wake* function which will wake up the task(s) sleeping waiting for that particular event, which means the scheduler will soon schedule them again.
It's worth noting that userspace tasks (i.e. threads) are only one type of task and there are a few others internal to the kernel which can do work as well - these are kernel threads and bottom half handlers / tasklets / task queues etc. Work which doesn't belong to any particular userspace process (for example network handling, e.g. responding to pings) gets done in these. These tasks are allowed to go to sleep, unlike interrupts (which should not invoke the scheduler).
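For readers who want to see the sleep/wake pattern in code, here is a user-space analogy (not kernel code; the kernel's wait_event()/wake_up() on a wait queue have the same shape, with a condition variable standing in for the wait queue and a thread standing in for the interrupt handler). Build with -pthread.

    /* User-space analogy of "wait_event(wq, data_ready)" and "wake_up(&wq)". */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  wq   = PTHREAD_COND_INITIALIZER;  /* "wait queue" */
    static int data_ready = 0;

    static void *device_irq(void *arg)     /* stands in for the interrupt handler */
    {
        (void)arg;
        sleep(1);                          /* the "slow disk" finishing its work */
        pthread_mutex_lock(&lock);
        data_ready = 1;
        pthread_cond_signal(&wq);          /* wake_up(&wq) */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t irq;
        pthread_create(&irq, NULL, device_irq, NULL);

        pthread_mutex_lock(&lock);
        while (!data_ready)                /* wait_event(wq, data_ready) */
            pthread_cond_wait(&wq, &lock); /* sleep; other tasks run meanwhile */
        pthread_mutex_unlock(&lock);

        puts("woken up: data is ready");
        pthread_join(irq, NULL);
        return 0;
    }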
http://tldp.org/LDP/khg/HyperNews/get/syscall/syscall86.html
This should help people looking for an answer to what happens when the syscall instruction is executed and transfers control to the kernel (user mode to kernel mode). It is based on the x86_64 architecture.
https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html
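As a hedged, x86_64-Linux-only illustration of what those links describe (an assumed example, not taken from them): the program below issues the syscall instruction directly for write(2), with the system-call number in rax and the arguments in rdi/rsi/rdx; the kernel entry code saves state, runs the handler, and hands the result back in rax.

    /* Assumed example, x86_64 Linux only: write(2) via the raw syscall instruction. */
    static long raw_write(int fd, const void *buf, unsigned long len)
    {
        long ret = 1;                         /* assumed: __NR_write == 1 on x86_64 */
        __asm__ volatile ("syscall"           /* enter the kernel */
                          : "+a"(ret)         /* rax: syscall number in, result out */
                          : "D"((long)fd),    /* rdi: fd  */
                            "S"(buf),         /* rsi: buf */
                            "d"(len)          /* rdx: len */
                          : "rcx", "r11", "memory");  /* syscall clobbers rcx, r11 */
        return ret;
    }

    int main(void)
    {
        static const char msg[] = "write(2) via a raw syscall\n";
        raw_write(1, msg, sizeof msg - 1);
        return 0;
    }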

what is a reentrant kernel

What is a reentrant kernel?
Much simpler answer:
Kernel Re-Entrance
If the kernel is not re-entrant, a process can only be suspended while it is in user mode. Although it could be suspended in kernel mode, that would still block kernel mode execution on all other processes. The reason for this is that all kernel threads share the same memory. If execution would jump between them arbitrarily, corruption might occur.
A re-entrant kernel enables processes (or, to be more precise, their corresponding kernel threads) to give away the CPU while in kernel mode. They do not hinder other processes from also entering kernel mode. A typical use case is IO wait. The process wants to read a file. It calls a kernel function for this. Inside the kernel function, the disk controller is asked for the data. Getting the data will take some time and the function is blocked during that time.
With a re-entrant kernel, the scheduler will assign the CPU to another process (kernel thread) until an interrupt from the disk controller indicates that the data is available and our thread can be resumed. This process can still access IO (which needs kernel functions), like user input. The system stays responsive and CPU time waste due to IO wait is reduced.
This is pretty much standard for today's desktop operating systems.
Kernel pre-emption
Kernel pre-emption does not help the overall throughput of the system. Instead, it aims for better responsiveness.
The idea here is that normally kernel functions are only interrupted by hardware causes: either external interrupts, or IO wait cases, where the kernel voluntarily gives control to the scheduler. A pre-emptive kernel instead also interrupts and suspends kernel functions just like it would interrupt processes in user mode. The system is more responsive, as processes, e.g. those handling mouse input, are woken up even while heavy work is done inside the kernel.
Pre-emption at the kernel level makes things harder for the kernel developer: a kernel function can be suspended not only voluntarily or by interrupt handlers (which are a somewhat controlled environment), but also by any other process via the scheduler. Care has to be taken to e.g. avoid deadlocks: a thread that locks resource A but needs resource B is interrupted by another thread which locks resource B but then needs resource A.
Take my explanation of pre-emption with a grain of salt. I'm happy for any corrections.
All Unix kernels are reentrant. This means that several processes may be executing in Kernel Mode at the same time. Of course, on uniprocessor systems, only one process can progress, but many can be blocked in Kernel Mode when waiting for the CPU or the completion of some I/O operation. For instance, after issuing a read to a disk on behalf of a process, the kernel lets the disk controller handle it and resumes executing other processes. An interrupt notifies the kernel when the device has satisfied the read, so the former process can resume the execution.
One way to provide reentrancy is to write functions so that they modify only local variables and do not alter global data structures. Such functions are called reentrant functions . But a reentrant kernel is not limited only to such reentrant functions (although that is how some real-time kernels are implemented). Instead, the kernel can include nonreentrant functions and use locking mechanisms to ensure that only one process can execute a nonreentrant function at a time.
If a hardware interrupt occurs, a reentrant kernel is able to suspend the current running process even if that process is in Kernel Mode. This capability is very important, because it improves the throughput of the device controllers that issue interrupts. Once a device has issued an interrupt, it waits until the CPU acknowledges it. If the kernel is able to answer quickly, the device controller will be able to perform other tasks while the CPU handles the interrupt.
Now let's look at kernel reentrancy and its impact on the organization of the kernel. A kernel control path denotes the sequence of instructions executed by the kernel to handle a system call, an exception, or an interrupt.
In the simplest case, the CPU executes a kernel control path sequentially from the first instruction to the last. When one of the following events occurs, however, the CPU interleaves the kernel control paths :
A process executing in User Mode invokes a system call, and the corresponding kernel control path verifies that the request cannot be satisfied immediately; it then invokes the scheduler to select a new process to run. As a result, a process switch occurs. The first kernel control path is left unfinished, and the CPU resumes the execution of some other kernel control path. In this case, the two control paths are executed on behalf of two different processes.
The CPU detects an exception-for example, access to a page not present in RAM-while running a kernel control path. The first control path is suspended, and the CPU starts the execution of a suitable procedure. In our example, this type of procedure can allocate a new page for the process and read its contents from disk. When the procedure terminates, the first control path can be resumed. In this case, the two control paths are executed on behalf of the same process.
A hardware interrupt occurs while the CPU is running a kernel control path with the interrupts enabled. The first kernel control path is left unfinished, and the CPU starts processing another kernel control path to handle the interrupt. The first kernel control path resumes when the interrupt handler terminates. In this case, the two kernel control paths run in the execution context of the same process, and the total system CPU time is accounted to it. However, the interrupt handler doesn't necessarily operate on behalf of the process.
An interrupt occurs while the CPU is running with kernel preemption enabled, and a higher priority process is runnable. In this case, the first kernel control path is left unfinished, and the CPU resumes executing another kernel control path on behalf of the higher priority process. This occurs only if the kernel has been compiled with kernel preemption support.
This information is available at http://jno.glas.net/data/prog_books/lin_kern_2.6/0596005652/understandlk-CHP-1-SECT-6.html
More at http://linux.omnipotent.net/article.php?article_id=12496&page=-1
The kernel is the core part of an operating system that interfaces directly with the hardware and schedules processes to run.
Processes call kernel functions to perform tasks such as accessing hardware or starting new processes. For certain periods of time, therefore, a process will be executing kernel code. A kernel is called reentrant if more than one process can be executing kernel code at the same time. "At the same time" can mean either that two processes are actually executing kernel code concurrently (on a multiprocessor system) or that one process has been interrupted while it is executing kernel code (because it is waiting for hardware to respond, for instance) and that another process that has been scheduled to run has also called into the kernel.
A reentrant kernel provides better performance because there is no contention for the kernel. A kernel that is not reentrant needs to use a lock to make sure that no two processes are executing kernel code at the same time.
A reentrant function is one that can be used by more than one task concurrently without fear of data corruption. Conversely, a non-reentrant function is one that cannot be shared by more than one task unless mutual exclusion to the function is ensured either by using a semaphore or by disabling interrupts during critical sections of code. A reentrant function can be interrupted at any time and resumed at a later time without loss of data. Reentrant functions either use local variables or protect their data when global variables are used.
A reentrant function:
Does not hold static data over successive calls
Does not return a pointer to static data; all data is provided by the caller of the function
Uses local data or ensures protection of global data by making a local copy of it
Must not call any non-reentrant functions
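A small illustration of those rules (an assumed example): the first function below keeps its result in static storage and returns a pointer to it, so two interleaved callers (or a signal handler interrupting one call) clobber each other; the second only touches caller-provided storage and is therefore reentrant.

    /* Assumed example: non-reentrant (static data) vs. reentrant (caller storage). */
    #include <stdio.h>

    static char *to_hex_nonreentrant(unsigned v)
    {
        static char buf[16];                    /* static data: not reentrant */
        snprintf(buf, sizeof buf, "0x%x", v);
        return buf;                             /* pointer to static data */
    }

    static char *to_hex_reentrant(unsigned v, char *buf, size_t len)
    {
        snprintf(buf, len, "0x%x", v);          /* caller provides all storage */
        return buf;
    }

    int main(void)
    {
        char a[16], b[16];
        printf("%s %s\n", to_hex_reentrant(10, a, sizeof a),
                          to_hex_reentrant(255, b, sizeof b));   /* both survive */

        char *p = to_hex_nonreentrant(10);
        to_hex_nonreentrant(255);               /* second call overwrites the first */
        printf("%s (clobbered by the second call)\n", p);        /* prints 0xff */
        return 0;
    }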
