Linux System Call Flow Sequence

I have a question about the inner workings of Linux.
Let's say a multi-threaded process is executing on the CPU. In that case one of its threads is actually running on the CPU, and, at a broader level, the pages belonging to the process are loaded in RAM for execution.
Let's say the thread makes a system call. I am a bit unclear on what happens after this. The interrupt will generate a call. One of my questions is: who will answer this call?
Let's say the system has an m:n mapping of user-level threads to kernel-level threads. I am assuming that the corresponding kernel-level thread will answer this call.
So the kernel will look up the interrupt vector table and get the routine that needs to be executed. My next question is: which stack will be used during the execution of the interrupt? Will it be the kernel thread's stack or the user-level thread's stack? (I am assuming it will be the kernel thread's stack.)
Coming back to the flow of the program, let's say the operation is opening a file using fopen. The next question I have is: how does the jump from the ISR to the system call take place? Or is our ISR mapped to a system call?
Also, at a broader level, when the kernel thread is executing, I am assuming that the "OS region" of RAM will be used to house the pages that are executing the system call.
Looking at it from a different angle (hope you're still with me), I am assuming that the corresponding kernel thread is handled by the CPU scheduler, so a context switch would have happened from the user-level thread to the corresponding kernel-level thread when the fopen system call was being answered.
I have made a lot of assumptions, and it would be fantastic if anyone could clear up these doubts or at least point me in the right direction.

Note: I work predominantly with ARM machines, so some of this might be ARM specific. Also, I'm going to simplify as much as I can. Feel free to correct anything that is wrong or oversimplified.
Let's say the thread makes a system call. I am a bit unclear on what happens after this. The interrupt will generate a call. One of my questions is: who will answer this call?
Usually, the processor will start executing at some predetermined location in kernel mode. The kernel will save the current process state and look at the userspace registers to determine which system call was requested and dispatch that to the correct system call handler.
So the kernel will look up the interrupt vector table and get the routine that needs to be executed. My next question is: which stack will be used during the execution of the interrupt? Will it be the kernel thread's stack or the user-level thread's stack? (I am assuming it will be the kernel thread's stack.)
I'm pretty sure it will switch to a kernel stack. There would be some pretty severe security problems with information leaks if they used the userspace stack.
Coming back to the flow of the program, let's say the operation is opening a file using fopen. The next question I have is: how does the jump from the ISR to the system call take place? Or is our ISR mapped to a system call?
fopen() is actually a libc function and not a system call itself. It may (and in most cases will) call the open() syscall in its implementation though.
So, the process (roughly) is:
Userspace calls fopen()
fopen() invokes the open() system call
This triggers some sort of exception or interrupt. In response, the processor switches into a more privileged mode and starts executing at some preset location in the kernel.
Kernel determines what kind of interrupt and exception it is and handles it appropriately. In our case, it will be a system call.
Kernel determines which system call is being requested by reading the userspace registers and extracts any arguments and passes it to the appropriate handler.
Handler runs.
Kernel puts any return code into userspace registers.
Kernel transfers execution back to where the exception occurred.
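To make the first two steps concrete, here is a minimal sketch (glibc on Linux assumed; the file path is just an example) that opens the same file once through the fopen() wrapper and once by issuing the openat system call directly through syscall(2). Running the first variant under strace shows the library call boiling down to the same syscall.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void) {
        /* Path 1: the libc wrapper. fopen() buffers I/O and returns a FILE*,
           but underneath it asks the kernel to open the file for it. */
        FILE *f = fopen("/etc/hostname", "r");
        if (f) fclose(f);

        /* Path 2: skip the wrapper and issue the raw syscall ourselves. */
        long fd = syscall(SYS_openat, AT_FDCWD, "/etc/hostname", O_RDONLY);
        if (fd >= 0) close((int)fd);
        return 0;
    }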
Also, at a broader level, when the kernel thread is executing, I am assuming that the "OS region" of RAM will be used to house the pages that are executing the system call.
Pages don't execute anything :) Usually, in Linux (on 32-bit x86 with the default split), any address mapped above 0xC0000000 belongs to the kernel.
Looking at it from a different angle (hope you're still with me), I am assuming that the corresponding kernel thread is handled by the CPU scheduler, so a context switch would have happened from the user-level thread to the corresponding kernel-level thread when the fopen system call was being answered.
With a preemptive kernel, threads effectively aren't treated specially. To my understanding, a new thread isn't created for the purpose of servicing a system call; the call just runs in the same thread from which it was requested, except in kernel mode.
That means a thread that is in kernel mode servicing a system call can be scheduled out just the same as any other thread. This is where you hear about "process context" when developing for the kernel: it means the kernel is executing in kernel mode on behalf of a user-mode thread.
It's a little difficult to explain this so I hope I got it right.
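To illustrate the point above about a thread in kernel mode being just another schedulable task, here is a rough userspace sketch (Linux assumed, compile with -pthread; the one-second sleep is a crude way to let the worker block). One thread sleeps inside the read() syscall on an empty pipe, and the main thread then reads /proc/self/task/<tid>/stat, where the blocked thread simply shows up as a sleeping task.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    static int pipefd[2];
    static pid_t blocked_tid;

    static void *blocker(void *arg) {
        (void)arg;
        blocked_tid = syscall(SYS_gettid);
        char c;
        read(pipefd[0], &c, 1);           /* sleeps in kernel mode until data arrives */
        return NULL;
    }

    int main(void) {
        pipe(pipefd);
        pthread_t t;
        pthread_create(&t, NULL, blocker, NULL);
        sleep(1);                          /* crude: give the thread time to block */

        char path[64], line[256];
        snprintf(path, sizeof path, "/proc/self/task/%d/stat", (int)blocked_tid);
        FILE *f = fopen(path, "r");
        if (f && fgets(line, sizeof line, f))
            printf("%s", line);            /* third field is the state, e.g. "S" */
        if (f) fclose(f);

        write(pipefd[1], "x", 1);          /* wake the blocked thread */
        pthread_join(t, NULL);
        return 0;
    }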

Related

Is user code, in the end, executed in kernel mode?

From what I learned from Operating System Concepts and online searching:
all user threads are finally mapped to kernel threads so that they can be scheduled onto physical CPUs
kernel threads can only be executed in kernel mode
The two statements above lead to the conclusion:
all user code is executed in kernel mode
Is this right?
I have read the whole book and searched many articles, but the question still stands.
Wikipedia says this about LWPs:
Kernel threads
Kernel threads are handled entirely by the kernel. They need not be associated with a process; a kernel can create them whenever it needs to perform a particular task. Kernel threads cannot execute in user mode. LWPs (in systems where they are a separate layer) bind to kernel threads and provide a user-level context. This includes a link to the shared resources of the process to which the LWP belongs. When a LWP is suspended, it needs to store its user-level registers until it resumes, and the underlying kernel thread must also store its own kernel-level registers.
Also, what does it mean when it talks about user-level registers and kernel-level registers?
After digging and digging, I have the following tentative conclusions, but I am not sure. I hope the question can be further answered and clarified:
"Kernel thread", depending on the context of the discussion, has two meanings:
when talking about user/kernel threading, a kernel thread means a kernel task that executes entirely in kernel mode and runs only kernel code, like ksoftirqd, which handles the bottom halves of interrupts
when talking about threading models, namely how user code is mapped onto schedulable entities in the kernel, a kernel thread means a task that is schedulable by the kernel
Further, about threading models and lightweight processes in Linux:
In the old days the operating system did not know about threads; it only knew about processes (tasks), and threads were implemented entirely in user space by thread libraries. This has an inherent problem: if one user thread blocks, for example on I/O, all the user threads block, because there is only one schedulable task in the kernel for the whole process. From the perspective of the kernel, the whole process is blocked. To solve this problem, the lightweight process (LWP), also called a virtual processor (VP), was invented.
An LWP is an intermediate data structure between a user thread and a kernel thread (in the second sense above). An LWP binds a user thread to a kernel thread (task), where previously the whole user process was bound to one. Simply put: before, a user process occupied one kernel thread (task); now, with LWPs, each user thread can occupy an individual kernel thread (task) without sharing it with other user threads. (I think) this is why it is called a lightweight process. The advantage of this model is obvious: if one user thread blocks, the other user threads can continue executing on other kernel threads (tasks).
A kernel thread (task) actually knows nothing about the user process. It is just a task, a schedulable entity created, managed and destroyed entirely by the kernel itself. But an LWP belongs to a specific process and knows the other LWPs that belong to the same one. An LWP is like a bridge between the user process and the kernel thread (task).
When a kernel thread (task) that is bound to an LWP is scheduled by the kernel, the user-level registers (pointed to by the LWP) are loaded into the CPU, and the kernel thread's (task's) own registers are loaded as well. From the CPU's point of view, an LWP is just a kernel thread (task); it does not care whether it is executing kernel code or user code.
User/kernel mode and user/kernel threads are independent concepts. In Linux, a user thread created with pthreads is essentially a kernel thread (task), and that thread can execute in both user mode and kernel mode, depending on whether it is executing user code or kernel code.
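A small sketch of that last point (Linux, compile with -pthread): every pthread shows up to the kernel as its own task with its own thread ID while sharing the process ID, which is the 1:1 mapping NPTL uses.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    static void *worker(void *arg) {
        (void)arg;
        /* gettid() returns the kernel's task ID for this thread */
        printf("worker: pid=%d tid=%ld\n", getpid(), (long)syscall(SYS_gettid));
        return NULL;
    }

    int main(void) {
        printf("main:   pid=%d tid=%ld\n", getpid(), (long)syscall(SYS_gettid));
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        pthread_join(t, NULL);
        return 0;
    }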
All user threads are finally mapped to kernel threads.
That is not a useful way to think about threads. In most operating systems, a program can ask the OS to create a new thread and the program can provide a pointer to a function for the new thread to call. There's no "mapping" that happens there.* The new thread runs in exactly the same way as the program's original (a.k.a., "main") thread. It runs application code in user mode except, occasionally, when it makes a system call, and then for the duration of the system call it runs kernel code in kernel mode.
Many programming languages come with an OS-independent library that provides some kind of a Thread object. The thread object is not the same thing as the actual thread. It's more of a handle that the application uses to control the OS thread. If you like, you can say that those thread objects are "mapped" to OS threads, but that's still somewhat abusing the notion of what a "mapping" is.
kernel threads can only be executed in kernel mode
If you aren't writing OS code, it's best to avoid saying "kernel thread" altogether. In the Linux OS in particular, "kernel thread" means something, and it has nothing whatever to do with application code. Linux kernel threads are threads that are created by the OS for the OS, and they never run "user" (i.e., application) code.
It's possible for an application program to create and schedule its own threads, completely unknown to the OS. Some people call those "user threads." Some used to call them "green threads." Back in the old days, before any OS had thread support, we just called them "threads." Doing threads that way is a lot of work, for little reward. (Can't schedule them preemptively.) Outside of the realm of tiny, embedded, real-time systems, almost nobody bothers to do it anymore.
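For what it's worth, here is a minimal sketch of such an OS-invisible "user thread" using the (now obsolescent) ucontext API on Linux; the stack size and the function names are arbitrary choices for illustration. The switch is nothing more than a user-mode register and stack swap, so the kernel never learns that a second thread exists.

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, green_ctx;
    static char green_stack[64 * 1024];

    static void green_thread(void) {
        puts("green thread: running purely in user space");
        /* returning resumes main_ctx via uc_link; no kernel involvement */
    }

    int main(void) {
        getcontext(&green_ctx);
        green_ctx.uc_stack.ss_sp = green_stack;
        green_ctx.uc_stack.ss_size = sizeof green_stack;
        green_ctx.uc_link = &main_ctx;            /* where to go when it returns */
        makecontext(&green_ctx, green_thread, 0);

        puts("main: switching to green thread");
        swapcontext(&main_ctx, &green_ctx);       /* pure user-mode context switch */
        puts("main: back in the original context");
        return 0;
    }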
* But wait! Things will get more complicated in the near future when Java's Project Loom hits the mainstream. Threads traditionally are expensive. In particular, each thread must have its own contiguous call stack (usually a chunk of at least a few megabytes) allocated to it. The goal of Project Loom is to make threads as cheap as any other object.
The way they intend to make threads "cheap" is to "virtualize" them and to break up their call stacks into linked lists of reclaimable heap objects. Under Project Loom, a limited number of real OS threads, scheduled by the OS scheduler, will in turn schedule and execute the code of a multitude of "virtual" application threads, and so there really will be something going on that feels a bit like "mapping".
I won't be at all surprised if the same idea spreads to other languages.
There are two different meanings of kernel threads. When threading people talk about "kernel threads" they mean "threads the kernel knows about" i.e. "threads that are controlled by the kernel". When kernel people talk about "kernel threads" they mean "threads that run in kernel mode".
"Threads the kernel knows about" are contrasted to "user threads" which are hidden from the kernel and controlled by the program itself.
No, not all threads controlled by the kernel run in kernel mode. The kernel controls the scheduling of threads that run in kernel mode, and also threads that run in user mode.
The quote about LWPs is talking about systems where the scheduler thinks that all threads are kernel-mode threads. To run a user-mode thread (which they call an LWP because it's not really a thread because all threads are kernel-mode threads) the thread has to call a function like RunLWP(pointer_to_lwp);.
I don't know which system is like this. Linux is not like this; Windows is not like this. This is a weird, overly complicated design which is why it's not normally used.
The "registers" are where the CPU remembers what it is currently doing. The most important one is the "instruction pointer" register (some CPUs call it something different) which remembers which instruction is next. If you remember all of the register values, and then come back later and set them to the same values, the CPU will carry on like nothing happened. That's why threading works - the thread can't tell that it's been interrupted, because all of the registers have the same values as if it wasn't interrupted. Here's a list of registers on x86-class CPUs. You don't need to know them for this question - it just might be interesting.
When an interrupt happens, depending on the CPU type, the CPU will save the instruction pointer and maybe one or two other registers. The interrupt handler has to save the rest (or be careful not to change them). Here about halfway down you can see how an x86-class CPU switches from user-space to an interrupt handler when an interrupt occurs.
So this RunLWP function would save the current registers (from the kernel) and set them according to the last time the LWP stopped running. Then the LWP runs. Then, when some interrupt happens, the interrupt handler would save the current registers (from user space) and restore the saved kernel registers, so the kernel code after RunLWP runs. Probably. Again, I don't know any actual system like this, but it seems like the logical way to do things. The reason it should return back to the kernel code instead of the user code is so that the kernel code can decide whether it wants to keep running the LWP or not.
I don't know why they would say the interrupt handler would save both the kernel-space and user-space registers. Current CPUs generally only have one set of registers which software has to swap out when it wants to make the CPU change what it is doing. RunLWP would have to save the kernel registers and load the user ones, then the interrupt handler would have to save the user registers and load the interrupt handler ones. It could be that the CPUs which these systems were designed for did have two sets of registers.

Is there a separate kernel level thread for handling system calls by user processes?

I understand that user level threads are implemented in user space and kernel level threads in kernel space. I have also read that user level threads are mapped onto kernel level threads to actually run the user level threads.
What exactly is meant by "implemented"? Does this mean the thread control blocks are defined in user and kernel space respectively?
What happens when a system call is made? Which kernel thread (or user thread, I'm not sure) does this system call run on? And does each kernel-level thread have its own stack?
I have an understanding that threads are just parts of a process. When we deal with kernel threads, what is the corresponding process here? And what are kernel processes? Can you give examples?
I have referred to other answers as well, but haven't found a satisfying one.
It depends on the implementation of the OS.
But usually, like in Linux, the system call is executed on the thread that called it. And each thread has a user stack and a kernel stack.
See How does a system call work and How is the system call in Linux implemented? for more details. And I hope this link can clear up your question about "kernel threads".

Under what circumstances does control pass from userspace to the Linux kernel space?

I'm trying to understand which events can cause a transition from userspace to the linux kernel. If it's relevant, the scope of this question can be limited to the x86/x86_64 architecture.
Here are some sources of transitions that I'm aware of:
System calls (which include accessing devices) cause a context switch from userspace to kernel space.
Interrupts will cause a context switch. As far as I know, this also includes scheduler preemptions, since a scheduler usually relies on a timer interrupt to do its work.
Signals. It seems like at least some signals are implemented using interrupts but I don't know if some are implemented differently so I'm listing them separately.
I'm asking two things here:
Am I missing any userspace->kernel path?
What are the various code paths that are involved in these context switches?
One you are missing: Exceptions
(which can be further broken down in faults, traps and aborts)
For example a page fault, breakpoint, division by zero or floating-point exception. Technically, one can view exceptions as interrupts but not really the way you have defined an interrupt in your question.
You can find a list of x86 exceptions at this osdev webpage.
With regard to your second question:
What are the various code paths that are involved in these context switches?
That really depends on the architecture and the OS; you will need to be more specific. For x86, when an interrupt occurs you go to the IDT entry, and for SYSENTER you go to the address specified in the MSR. What happens after that is completely up to the OS.
No one wrote a complete answer so I will try to incorporate the comments and partial answers into an answer. Feel free to comment or edit the answer to improve it.
For the purposes of this question and answer, userspace to kernel transitions mean a change in processor state that allows access to kernel code and memory. In short, I will refer to these transitions as context switches.
When discussing events that can trigger userspace to kernel transitions, it is important to separate the OS constructs that we are used to (signals, system calls, scheduling) that require context switches and the way these constructs are implemented, using context switches.
In x86, there are two central ways for context switches to occur: interrupts and SYSENTER. Interrupts are a processor feature, which causes a context switch when certain events happen:
Hardware devices may request an interrupt, for example, a timer/clock can cause an interrupt when a certain amount of time has elapsed. A keyboard can interrupt when keys are pressed. It's also called a hardware interrupt.
Userspace can initiate an interrupt. For example, the old way to perform a system call in Linux on x86 was to execute INT 0x80 with arguments passed through the registers. Debugging breakpoints are also implemented using interrupts, with the debugger replacing an instruction with INT 0x3. This type of interrupt is called a software interrupt.
The CPU itself generates interrupts in certain situations, like when memory is accessed without permissions, when a user divides by zero, or when one core must notify another core that it needs to do something. This type of interrupt is called an exception, and you can read more about them in #esm 's answer.
For a broader discussion of interrupts see here: http://wiki.osdev.org/Interrupt
SYSENTER is an instruction that provides the modern path to cause a context switch for the particular case of performing a system call.
The code that handles the context switching due to interrupts or SYSENTER in Linux can be found in arch/x86/kernel/entry_{32|64}.S.
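As a concrete illustration of the legacy software-interrupt path (a sketch only, 32-bit x86 assumed, where __NR_write is 4; on x86_64 the native path is the syscall instruction instead):

    /* write(fd, buf, len) issued through int $0x80: the instruction traps the
       CPU into kernel mode at the kernel's entry point, which reads the syscall
       number from EAX and the arguments from EBX/ECX/EDX. */
    long sys_write_int80(int fd, const void *buf, unsigned long len) {
        long ret;
        asm volatile ("int $0x80"
                      : "=a"(ret)                            /* result comes back in EAX */
                      : "a"(4), "b"(fd), "c"(buf), "d"(len)  /* 4 == __NR_write on i386 */
                      : "memory");
        return ret;
    }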
There are many situations in which a higher-level Linux construct might cause a context switch. Here are a few examples:
If a system call reaches the int 0x80 or sysenter instruction, a context switch occurs. Some system call routines can get the requested information from userspace alone (for example, gettimeofday served from the vDSO); in that case, no context switch occurs.
Many times scheduling doesn't require an interrupt: a thread will perform a system call, and the return from the syscall is delayed until it is scheduled again. For processes that are in a section where syscalls aren't performed, Linux relies on timer interrupts to regain control.
An access to a virtual memory location that has been paged out will cause a page fault, and therefore a context switch.
Signals are usually delivered when a process is already "switched out" (see comments by #caf on the question), but sometimes an inter-processor interrupt is used to deliver the signal between two running processes.

Where would the CPU context interrupted by ptrace be, userspace stack or kernel stack?

On Linux x86_64, when I use ptrace to stop a process, will all of the threads' CPU contexts be saved, or just the process's CPU context?
Is the context stored on the process's userspace stack or its kernel stack? Or somewhere else? Or are there multiple copies?
For other situations (not ptrace), where is the interrupted CPU context (including exceptions and syscalls) saved: the kernel stack, the userspace stack, or somewhere else?
Is ptrace an interrupt?
Update
It seems that where ptrace's context (pt_regs_x86_t) is saved is determined by the programmer. But does the kernel also store a copy of the interrupted context?
Yes, the kernel will always store context for any thread that is not currently executing. That context is largely the same whether the thread is being ptrace'd or not. The difference is in how/whether the thread can be scheduled anew -- if it's being ptraced, the tracing process will decide when it can be resumed.
The thread's user-space context is stored on the kernel stack (but it's important to note that there is a separate kernel stack area for each thread). And that will be the same whether the thread entered the kernel by executing a system call, or was suspended due to an interrupt -- and ultimately, those are the only two ways a thread can be suspended.
As you discovered, when a process is ptrace'd, the tracing program is given access to the traced threads' registers in the state they had when the threads last executed. That is accomplished by simply copying the saved registers from the traced thread's kernel stack.
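A minimal sketch of that (x86_64 Linux assumed; error handling omitted): the parent traces a child, waits for it to stop at exec, and reads the registers the kernel saved for it with PTRACE_GETREGS.

    #include <stdio.h>
    #include <sys/ptrace.h>
    #include <sys/user.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t child = fork();
        if (child == 0) {
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);   /* let the parent trace us */
            execlp("ls", "ls", (char *)NULL);        /* stops with SIGTRAP at exec */
            _exit(1);
        }
        waitpid(child, NULL, 0);                     /* child is now stopped */

        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, NULL, &regs);  /* copy of the saved context */
        printf("child rip = %llx, rsp = %llx\n",
               (unsigned long long)regs.rip, (unsigned long long)regs.rsp);

        ptrace(PTRACE_CONT, child, NULL, NULL);      /* let it run to completion */
        waitpid(child, NULL, 0);
        return 0;
    }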
Finally, it's worth noting that if you look at linux kernel code, you won't find any concrete representation of a process. A process is just a group of related threads that share various parts of their state: process ID, address space, file descriptors, etc.

Internals of a Linux system call

What happens (in detail) when a thread makes a system call by raising interrupt 80? What work does Linux do to the thread's stack and other state? What changes are done to the processor to put it into kernel mode? After running the interrupt handler, how is control restored back to the calling process?
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded and how does it then obtain control again?
A crash course in kernel mode in one stack overflow answer
Good questions! (Interview questions?)
What happens (in detail) when a thread makes a system call by raising interrupt 80?
The int $80 operation is vaguely like a function call. The CPU "takes a trap" and restarts at a known address in kernel mode, typically with a different MMU mode as well. The kernel will save many of the registers, though it doesn't have to save the registers that a program would not expect an ordinary function call to save.
What work does Linux do to the thread's stack and other state?
Typically an OS will save registers that the ABI promises not to change during procedure calls. The stack will stay the same; the kernel will run on a per-thread kernel stack rather than the per-thread user stack. Naturally some state will change, otherwise there would be no reason to do the system call.
What changes are done to the processor to put it into kernel mode?
This is usually entirely automatic. The CPU has, generically, a software-interrupt instruction that is a bit like a functional-call operation. It will cause the switch to kernel mode under controlled conditions. Typically, the CPU will change some sort of PSW protection bit, save the old PSW and PC, start at a well-known trap vector address, and may also switch to a different memory management protection and mapping arrangement.
After running the interrupt handler, how is control restored back to the calling process?
There will be some sort of "return from interrupt" or "return from trap" instruction, typically, that will act a bit like a complicated function-return instruction. Some RISC processors did very little automatically and required specific code to do the return and some CISC processors like x86 have (never-really-used) instructions that would execute dozens of operations documented in pages of architecture-manual pseudo-code for capability adjustments.
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded, and how does it then obtain control again?
The kernel itself is threaded much like a threaded user program is. It just switches stacks (threads) and works on someone else's process for a while.
To answer the last part of the question, what the kernel does if the system call needs to sleep:
After a system call, the kernel is still logically running in the context of the same task that made the system call; it's just in kernel mode rather than user mode. It is NOT a separate thread, and most system calls do not invoke logic from another task/thread. What happens is that the system call calls wait_event, wait_event_timeout or some other wait function, which adds the task to a list of tasks waiting for something, then puts the task to sleep, which changes its state, and calls schedule() to relinquish the current CPU.
After this the task cannot be run again until it gets woken up, typically by another task (kernel task, etc) or interrupt handler calling a wake* function which will wake up the task(s) sleeping waiting for that particular event, which means the scheduler will soon schedule them again.
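Schematically, the kernel-side pattern described above looks roughly like this (not a complete, buildable module; my_queue, my_condition and the function names are made up for illustration):

    #include <linux/wait.h>

    static DECLARE_WAIT_QUEUE_HEAD(my_queue);
    static int my_condition;

    /* Called from the syscall path: the current task is put on my_queue,
       marked as sleeping, and schedule() is called underneath, so the CPU
       moves on to some other task. */
    static long my_blocking_op(void)
    {
        wait_event_interruptible(my_queue, my_condition != 0);
        return 0;
    }

    /* Called later from an interrupt handler or another task: make the
       condition true and wake up whoever is sleeping on the queue. */
    static void my_event_arrived(void)
    {
        my_condition = 1;
        wake_up_interruptible(&my_queue);
    }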
It's worth noting that userspace tasks (i.e. threads) are only one type of task, and there are a few others internal to the kernel which can do work as well: kernel threads and bottom-half handlers / tasklets / task queues, etc. Work which doesn't belong to any particular userspace process (for example network handling, such as responding to pings) gets done in these. These tasks are allowed to go to sleep, unlike interrupt handlers (which should not invoke the scheduler).
http://tldp.org/LDP/khg/HyperNews/get/syscall/syscall86.html
This should help people looking for an answer to what happens when the syscall instruction is executed, which transfers control to the kernel (user mode to kernel mode). It is based on the x86_64 architecture.
https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html
