context_switch and system call virtualization - Linux

When the kernel is executing a context_switch(), i.e. when it is in that function, is it possible that some other task makes a system call? As far as I understand, as the processor state is being swapped, no other code can actually execute until the context_switch completes. Is this understanding correct?
Background: Basically, I want to swap system call tables on context switches. I have two versions of the syscall table, one of which contains addresses of modified syscall code. When a "process of interest" is scheduled in, I will update the sys_call_table pointer to point to the modified table, and then swap it back when the process of interest is scheduled out.
I'm new to kernel dev, so feedback on my approach is also welcome.
PS: I know about ptrace() and it doesn't suit my needs. Too many context switches involved in virtualizing syscalls UML style.

I suspect you may be thinking about this wrong. Even if the kernel doesn't allow syscalls from other tasks while context_switch() is executing, that guarantee only extends to the currently executing processor. On an SMP system, the other processors can still be executing syscalls while one core is context switching.
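For what it's worth, a common way to sidestep that SMP race entirely is to hook individual syscall-table entries and branch on current inside the hook, rather than swapping whole tables at context-switch time. A minimal sketch, in the style of older kernels where table entries take their arguments directly; my_target_pid and the choice of open() are purely illustrative:

    /* Hedged sketch: per-task syscall interception without table swapping.
     * `current` always refers to the task executing on *this* CPU, so the
     * check is correct even while other cores run syscalls concurrently. */
    #include <linux/linkage.h>
    #include <linux/sched.h>
    #include <linux/types.h>

    static pid_t my_target_pid;   /* illustrative: the "process of interest" */
    static asmlinkage long (*orig_sys_open)(const char __user *, int, umode_t);

    static asmlinkage long hooked_sys_open(const char __user *path,
                                           int flags, umode_t mode)
    {
        if (current->pid == my_target_pid) {
            /* modified syscall behaviour for the process of interest */
        }
        return orig_sys_open(path, flags, mode);  /* everyone else: passthrough */
    }

This way nothing needs to happen inside context_switch() at all, and the question of whether another task can make a syscall mid-switch becomes moot.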

Related

Under what circumstances does control pass from userspace to the Linux kernel space?

I'm trying to understand which events can cause a transition from userspace to the Linux kernel. If it's relevant, the scope of this question can be limited to the x86/x86_64 architecture.
Here are some sources of transitions that I'm aware of:
System calls (which include accessing devices) cause a context switch from userspace to kernel space.
Interrupts will cause a context switch. As far as I know, this also includes scheduler preemptions, since a scheduler usually relies on a timer interrupt to do its work.
Signals. It seems like at least some signals are implemented using interrupts but I don't know if some are implemented differently so I'm listing them separately.
I'm asking two things here:
Am I missing any userspace->kernel path?
What are the various code paths that are involved in these context switches?
One you are missing: Exceptions
(which can be further broken down into faults, traps and aborts)
For example a page fault, breakpoint, division by zero or floating-point exception. Technically, one can view exceptions as interrupts but not really the way you have defined an interrupt in your question.
You can find a list of x86 exceptions at this osdev webpage.
With regard to your second question:
What are the various code paths that are involved in these context switches?
That really depends on the architecture and OS; you will need to be more specific. For x86, when an interrupt occurs you go to the IDT entry, and for SYSENTER you go to the address specified in the MSR. What happens after that is completely up to the OS.
No one wrote a complete answer so I will try to incorporate the comments and partial answers into an answer. Feel free to comment or edit the answer to improve it.
For the purposes of this question and answer, userspace to kernel transitions mean a change in processor state that allows access to kernel code and memory. In short, I will refer to these transitions as context switches.
When discussing events that can trigger userspace to kernel transitions, it is important to separate the OS constructs that we are used to (signals, system calls, scheduling), which require context switches, from the way these constructs are implemented, using context switches.
In x86, there are two central ways for context switches to occur: interrupts and SYSENTER. Interrupts are a processor feature, which causes a context switch when certain events happen:
Hardware devices may request an interrupt; for example, a timer/clock can cause an interrupt when a certain amount of time has elapsed, and a keyboard can interrupt when keys are pressed. This type is called a hardware interrupt.
Userspace can initiate an interrupt. For example, the old way to perform a system call in Linux on x86 was to execute INT 0x80 with arguments passed through the registers. Debugging breakpoints are also implemented using interrupts, with the debugger replacing an instruction with INT 0x3. This type of interrupt is called a software interrupt.
The CPU itself generates interrupts in certain situations, like when memory is accessed without permissions, when a user divides by zero, or when one core must notify another core that it needs to do something. This type of interrupt is called an exception, and you can read more about them in #esm's answer.
For a broader discussion of interrupts see here: http://wiki.osdev.org/Interrupt
SYSENTER is an instruction that provides the modern path to cause a context switch for the particular case of performing a system call.
The code that handles the context switching due to interrupts or SYSENTER in Linux can be found in arch/x86/kernel/entry_{32|64}.S.
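As a concrete illustration of the legacy software-interrupt path (my own sketch, not from the original answer), here is how a 32-bit x86 Linux program can invoke write(2) through INT 0x80 from C: the syscall number goes in eax and the arguments in ebx/ecx/edx.

    /* Hedged sketch: write(2) via the legacy INT 0x80 entry path.
     * 32-bit x86 Linux only; compile with gcc -m32. */
    long sys_write_int80(int fd, const void *buf, unsigned long len)
    {
        long ret;
        __asm__ volatile("int $0x80"
                         : "=a"(ret)      /* return value comes back in eax */
                         : "a"(4),        /* __NR_write is 4 on i386 */
                           "b"(fd), "c"(buf), "d"(len)
                         : "memory");
        return ret;
    }

Calling sys_write_int80(1, "hi\n", 3) behaves like write(1, "hi\n", 3), just through the slower legacy entry point.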
There are many situations in which a higher-level Linux construct might cause a context switch. Here are a few examples:
If a system call reaches the int 0x80 or sysenter instruction, a context switch occurs. Some system call routines can use userspace information to get the data the system call was meant to fetch; in this case, no context switch will occur (a small demonstration follows this list).
Many times scheduling doesn't require an interrupt: a thread will perform a system call, and the return from the syscall is delayed until it is scheduled again. For processes in a section of code where no syscalls are performed, Linux relies on timer interrupts to regain control.
Virtual memory access to a memory location that was paged out will cause a page fault, and therefore a context switch.
Signals are usually delivered when a process is already "switched out" (see comments by #caf on the question), but sometimes an inter-processor interrupt is used to deliver the signal between two running processes.
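To make the "no context switch" case above concrete, here is a small user-space sketch (my own example): on most modern x86-64 Linux systems, clock_gettime() is answered entirely in userspace by the vDSO from data the kernel exports, so no user-to-kernel transition occurs.

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec ts;
        /* Usually served by the vDSO on x86-64: no kernel entry needed. */
        clock_gettime(CLOCK_MONOTONIC, &ts);
        printf("%ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
        return 0;
    }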

How system calls are handled in Linux on an ARM machine

I have some doubts regarding system calls in Linux on an ARM processor.
On ARM, system calls are handled in SWI mode. My doubt is: do we perform all of the required work in SWI mode, or is only part of the work done in SWI mode before moving to some process context? As I understand it, some system calls can take significant time, and performing that work in SWI mode is not a good idea.
Also, how do we return to the calling user process? I mean, in the case of a non-blocking system call, how do we notify the user that the required task has been completed?
I think you're missing two concepts.
CPU privilege modes and use of swi are both implementation details of system calls
Non-blocking system calls don't work that way
Sure, under Linux, we use swi instructions and maintain privilege separation to implement system calls, but this doesn't reflect ARM systems in general. When you talk about Linux specifically, I think it makes more sense to refer to concepts like kernel vs user mode.
The Linux kernel has been preemptible for a long time now. If your system call is taking too long and exceeds the time quantum allocated to that process/thread, the scheduler will just kick in and do a context switch. Likewise, if your system call just consists of waiting for an event (like for I/O), it'll just get switched out until there's data available.
Taking this into account you don't usually have to worry about whether your system call takes too long. However, if you're spending a significant amount of time in a system call that's doing something that isn't waiting for some event, chances are that you're doing something in the kernel that should be done in user mode.
When the function handling the system call returns a value, it usually goes back to some sort of glue logic which restores the user context and allows the original user mode program to keep running.
Non-blocking system calls are something almost completely different. The system call handling function usually will check if it can return data at that very instant without waiting. If it can, it'll return whatever is available. It can also tell the user "I don't have any data at the moment but check back later" or "that's all, there's no more data to be read". The idea is they return basically instantly and don't block.
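As an illustration of that "check back later" behaviour (my own sketch; the device path is just an example), this is what a non-blocking read looks like from userspace:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* O_NONBLOCK: read() returns immediately instead of sleeping */
        int fd = open("/dev/ttyS0", O_RDONLY | O_NONBLOCK);
        if (fd < 0)
            return 1;

        char buf[64];
        ssize_t n = read(fd, buf, sizeof buf);
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            printf("no data at the moment, check back later\n");
        else if (n == 0)
            printf("that's all, there's no more data to be read\n");
        else
            printf("got %zd bytes\n", n);

        close(fd);
        return 0;
    }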
Finally, on your last question, I suspect you're missing the point of a system call.
You should never have to know when a task is 'completed' by a system call. If a system call doesn't return an error, you, as the process, have to assume it succeeded. The rest is in the implementation details of the kernel. In the case of non-blocking system calls, they will tell you what to expect.
If you can provide an example for the last question, I may be able to explain in more detail.

Linux System Call Flow Sequence

I had a question regarding the deep workings of Linux.
Let's say a multi-threaded process is being executed on the CPU, so in such a case we have one of its threads executing on the CPU. At a broader level, the corresponding pages belonging to the process will be loaded into RAM for execution.
Let's say the thread makes a system call. I am a bit unclear on the workings after this. The interrupt will generate a call. One of my questions is: who will answer this call?
Let's say that the system has an m:n user-level thread to kernel-level thread mapping; I am assuming that the corresponding kernel-level thread will answer this call.
So the kernel will look up the interrupt vector table and get the routine which needs to be executed. My next question is: which stack will be used in the execution of the interrupt? Will it be the kernel thread's stack or the user-level thread's stack? (I am assuming that it will be the kernel thread's stack.)
Coming back to the flow of the program, let's say the operation is opening a file using fopen. The subsequent question I have is: how will the jump from the ISR to the system call take place? Or is our ISR mapped to a system call?
Also, at a broader level, when the kernel thread is executing, I am assuming that the "OS region" of RAM will be used to house the pages which are executing the system call.
Again, looking at it from a different angle (hope you're still with me), finally I am assuming that the corresponding kernel thread is being handled by the CPU scheduler, wherein a context switch would have happened from the user-level thread to the corresponding kernel-level thread when the fopen system call was being answered.
I have made a lot of assumptions and it would be absolutely fantastic if anyone could clear the doubts or at least guide me in the right direction.
Note: I work predominately with ARM machines so some of these things might be ARM specific. Also, I'm going to try and simplify it as much as I can. Feel free to correct anything that might be wrong or oversimplified.
Let's say the thread makes a system call. I am a bit unclear on the workings after this. The interrupt will generate a call. One of my questions is: who will answer this call?
Usually, the processor will start executing at some predetermined location in kernel mode. The kernel will save the current process state and look at the userspace registers to determine which system call was requested and dispatch that to the correct system call handler.
So the kernel will look up the interrupt vector table and get the routine which needs to be executed. My next question is: which stack will be used in the execution of the interrupt? Will it be the kernel thread's stack or the user-level thread's stack? (I am assuming that it will be the kernel thread's stack.)
I'm pretty sure it will switch to a kernel stack. There would be some pretty severe security problems with information leaks if they used the userspace stack.
Coming back to the flow of the program, let's say the operation is opening a file using fopen. The subsequent question I have is: how will the jump from the ISR to the system call take place? Or is our ISR mapped to a system call?
fopen() is actually a libc function and not a system call itself. It may (and in most cases will) call the open() syscall in its implementation though.
So, the process (roughly) is:
Userspace calls fopen()
fopen performs a system call to open()
This triggers some sort of exception or interrupt. In response, the processor switches into a more privileged mode and starts executing at some preset location in the kernel.
Kernel determines what kind of interrupt or exception it is and handles it appropriately. In our case, it will be a system call.
Kernel determines which system call is being requested by reading the userspace registers, extracts any arguments, and passes them to the appropriate handler.
Handler runs.
Kernel puts any return code into userspace registers.
Kernel transfers execution back to where the exception occurred.
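If you want to watch those steps happen without the libc wrapper in the way, you can issue the syscall directly with syscall(2). A minimal sketch (x86-64, where SYS_open is still defined; newer architectures only provide SYS_openat):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        /* Same kernel entry path as open(), minus the libc wrapper. */
        long fd = syscall(SYS_open, "data.txt", O_RDONLY);
        if (fd < 0)
            perror("open");
        else
            close((int)fd);
        return 0;
    }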
Also, at a broader level, when the kernel thread is executing, I am assuming that the "OS region" of RAM will be used to house the pages which are executing the system call.
Pages don't execute anything :) Usually, on 32-bit Linux with the default split, any address mapped above 0xC0000000 belongs to the kernel.
Again, looking at it from a different angle (hope you're still with me), finally I am assuming that the corresponding kernel thread is being handled by the CPU scheduler, wherein a context switch would have happened from the user-level thread to the corresponding kernel-level thread when the fopen system call was being answered.
With a preemptive kernel, threads effectively aren't discriminated against. To my understanding, a new thread isn't created for the purpose of servicing a system call - the call just runs in the same thread from which it was requested, except in kernel mode.
That means a thread that is in kernel mode servicing a system call can be scheduled out just the same as any other thread. Hence, this is where you hear about 'userspace context' when developing for the kernel. It means it's executing in kernel mode on a usermode thread.
It's a little difficult to explain this so I hope I got it right.

Internals of a Linux system call

What happens (in detail) when a thread makes a system call by raising interrupt 0x80? What work does Linux do to the thread's stack and other state? What changes are done to the processor to put it into kernel mode? After running the interrupt handler, how is control restored back to the calling process?
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded and how does it then obtain control again?
A crash course in kernel mode in one Stack Overflow answer
Good questions! (Interview questions?)
What happens (in detail) when a thread makes a system call by raising interrupt 0x80?
The int $0x80 operation is vaguely like a function call. The CPU "takes a trap" and restarts at a known address in kernel mode, typically with a different MMU mode as well. The kernel will save many of the registers, though it doesn't have to save the registers that a program would not expect an ordinary function call to save.
What work does Linux do to the thread's stack and other state?
Typically an OS will save registers that the ABI promises not to change during procedure calls. The stack will stay the same; the kernel will run on a per-thread kernel stack rather than the per-thread user stack. Naturally some state will change, otherwise there would be no reason to do the system call.
What changes are done to the processor to put it into kernel mode?
This is usually entirely automatic. The CPU has, generically, a software-interrupt instruction that is a bit like a function-call operation. It will cause the switch to kernel mode under controlled conditions. Typically, the CPU will change some sort of PSW protection bit, save the old PSW and PC, start at a well-known trap vector address, and may also switch to a different memory management protection and mapping arrangement.
After running the interrupt handler, how is control restored back to the calling process?
There will typically be some sort of "return from interrupt" or "return from trap" instruction that acts a bit like a complicated function-return instruction. Some RISC processors did very little automatically and required specific code to do the return, while some CISC processors like x86 have (never-really-used) instructions that would execute dozens of operations, documented in pages of architecture-manual pseudo-code, for capability adjustments.
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded, and how does it then obtain control again?
The kernel itself is threaded much like a threaded user program is. It just switches stacks (threads) and works on someone else's process for a while.
To answer the last part of the question - what does the kernel do if the system call needs to sleep -
After a system call, the kernel is still logically running in the context of the same task that made the system call - it's just in kernel mode rather than user mode - it is NOT a separate thread and most system calls do not invoke logic from another task/thread. What happens is that the system call calls wait_event, or wait_event_timeout or some other wait function, which adds the task to a list of tasks waiting for something, then puts the task to sleep, which changes its state, and calls schedule() to relinquish the current CPU.
After this the task cannot be run again until it gets woken up, typically by another task (kernel task, etc) or interrupt handler calling a wake* function which will wake up the task(s) sleeping waiting for that particular event, which means the scheduler will soon schedule them again.
It's worth noting that userspace tasks (i.e. threads) are only one type of task, and there are a few others internal to the kernel which can do work as well - these are kernel threads and bottom-half handlers / tasklets / task queues etc. Work which doesn't belong to any particular userspace process (for example network handling, e.g. responding to pings) gets done in these. Kernel threads are allowed to go to sleep; interrupts and tasklets/softirqs are not (they must not invoke the scheduler).
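A hedged kernel-side sketch of the sleep/wake pattern described above (my_wq, my_cond and both function names are illustrative, not real kernel symbols):

    #include <linux/errno.h>
    #include <linux/sched.h>
    #include <linux/wait.h>

    static DECLARE_WAIT_QUEUE_HEAD(my_wq);
    static int my_cond;   /* the "something happened" flag */

    /* Syscall path: sleep until the event arrives. */
    static long my_syscall_body(void)
    {
        /* Adds the task to the wait list, marks it sleeping, calls schedule() */
        if (wait_event_interruptible(my_wq, my_cond != 0))
            return -ERESTARTSYS;   /* woken by a signal instead */
        return 0;
    }

    /* Another task or interrupt handler: wake the sleeper. */
    static void my_event_arrived(void)
    {
        my_cond = 1;
        wake_up_interruptible(&my_wq);   /* scheduler will soon run the sleeper */
    }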
http://tldp.org/LDP/khg/HyperNews/get/syscall/syscall86.html
This should help people who seek answers to what happens when the syscall instruction is executed, transferring control to the kernel (user mode to kernel mode). This is based upon the x86_64 architecture.
https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html

Is it possible to create threads without system calls in Linux x86 GAS assembly?

Whilst learning the "assembler language" (in Linux on an x86 architecture using the GNU as assembler), one of the aha moments was the possibility of using system calls. These system calls come in very handy and are sometimes even necessary as your program runs in user-space.
However, system calls are rather expensive in terms of performance, as they require an interrupt (and of course a system call), which means that a context switch must be made from your currently active program in user-space to the system running in kernel-space.
The point I want to make is this: I'm currently implementing a compiler (for a university project) and one of the extra features I wanted to add is the support for multi-threaded code in order to enhance the performance of the compiled program. Because some of the multi-threaded code will be automatically generated by the compiler itself, this will almost guarantee that there will be really tiny bits of multi-threaded code in it as well. In order to gain a performance win, I must be sure that using threads will make this happen.
My fear however is that, in order to use threading, I must make system calls and the necessary interrupts. The tiny little (auto-generated) threads will therefore be highly affected by the time it takes to make these system calls, which could even lead to a performance loss...
My question is therefore twofold (with an extra bonus question underneath it):
Is it possible to write assembler code which can run multiple threads simultaneously on multiple cores at once, without the need for system calls?
Will I get a performance gain if I have really tiny threads (tiny as in the total execution time of the thread), performance loss, or isn't it worth the effort at all?
My guess is that multithreaded assembler code is not possible without system calls. Even if this is the case, do you have a suggestion (or even better: some real code) for implementing threads as efficiently as possible?
The short answer is that you can't. When you write assembly code it runs sequentially (or with branches) on one and only one logical (i.e. hardware) thread. If you want some of the code to execute on another logical thread (whether on the same core, on a different core on the same CPU or even on a different CPU), you need to have the OS set up the other thread's instruction pointer (CS:EIP) to point to the code you want to run. This implies using system calls to get the OS to do what you want.
User threads won't give you the threading support that you want, because they all run on the same hardware thread.
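To see why the OS must be involved, here is a minimal sketch (my own, using the glibc clone(3) wrapper) of creating a second kernel-scheduled thread directly; this is roughly what pthread_create() boils down to:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    static int worker(void *arg)
    {
        printf("running on a second kernel-scheduled thread\n");
        return 0;
    }

    int main(void)
    {
        const size_t stack_size = 64 * 1024;
        char *stack = malloc(stack_size);
        if (!stack)
            return 1;

        /* clone() takes the *top* of the child stack (x86 stacks grow down);
         * CLONE_VM shares the address space, as threads do. */
        pid_t tid = clone(worker, stack + stack_size,
                          CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD, NULL);
        if (tid < 0)
            return 1;

        waitpid(tid, NULL, 0);   /* reap the child "thread" */
        free(stack);
        return 0;
    }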
Edit: Incorporating Ira Baxter's answer with Parlanse. If you ensure that your program has a thread running in each logical thread to begin with, then you can build your own scheduler without relying on the OS. Either way, you need a scheduler to handle hopping from one thread to another. Between calls to the scheduler, there are no special assembly instructions to handle multi-threading. The scheduler itself can't rely on any special assembly, but rather on conventions between parts of the scheduler in each thread.
Either way, whether or not you use the OS, you still have to rely on some scheduler to handle cross-thread execution.
"Doctor, doctor, it hurts when I do this". Doctor: "Don't do that".
The short answer is you can do multithreaded programming without calling expensive OS task management primitives. Simply ignore the OS for thread scheduling operations. This means you have to write your own thread scheduler, and simply never pass control back to the OS. (And you have to be cleverer somehow about your thread overhead than the pretty smart OS guys.)
We chose this approach precisely because Windows process/thread/fiber calls were all too expensive to support computation grains of a few hundred instructions.
Our PARLANSE programming language is a parallel programming language: see http://www.semdesigns.com/Products/Parlanse/index.html
PARLANSE runs under Windows, offers parallel "grains" as the abstract parallelism construct, and schedules such grains by a combination of a highly tuned hand-written scheduler and scheduling code generated by the PARLANSE compiler that takes into account the context of the grain to minimize scheduling overhead. For instance, the compiler ensures that the registers of a grain contain no information at the point where scheduling (e.g., "wait") might be required, and thus the scheduler code only has to save the PC and SP. In fact, quite often the scheduler code doesn't get control at all; a forked grain simply stores the forking PC and SP, switches to a compiler-preallocated stack and jumps to the grain code. Completion of the grain will restart the forker.
Normally there's an interlock to synchronize grains, implemented by the compiler using native LOCK DEC instructions that implement what amounts to counting semaphores. Applications can fork logically millions of grains; the scheduler limits parent grains from generating more work if the work queues are long enough so more work won't be helpful. The scheduler implements work-stealing to allow work-starved CPUs to grab ready grains from neighboring CPU work queues. This has been implemented to handle up to 32 CPUs; but we're a bit worried that the x86 vendors may actually swamp us with more than that in the next few years!
PARLANSE is a mature language; we've been using it since 1997, and have implemented a several-million-line parallel application in it.
Implement user-mode threading.
Historically, threading models are generalised as N:M, which is to say N user-mode threads running on M kernel-mode threads. Modern usage is 1:1, but it wasn't always like that and it doesn't have to be like that.
You are free to maintain in a single kernel thread an arbitrary number of user-mode threads. It's just that it's your responsibility to switch between them sufficiently often that it all looks concurrent. Your threads are of course co-operative rather than pre-emptive; you basically scatter yield() calls throughout your own code to ensure regular switching occurs.
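A minimal sketch of such a cooperative arrangement using the portable ucontext(3) API (my own illustration; note that glibc's swapcontext saves the signal mask with a syscall, which is exactly why real fiber libraries hand-roll their context switches in assembly):

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, fiber_ctx;

    static void fiber_fn(void)
    {
        for (int i = 0; i < 3; i++) {
            printf("fiber: step %d\n", i);
            swapcontext(&fiber_ctx, &main_ctx);   /* our yield() */
        }
    }

    int main(void)
    {
        static char fiber_stack[64 * 1024];

        getcontext(&fiber_ctx);
        fiber_ctx.uc_stack.ss_sp = fiber_stack;
        fiber_ctx.uc_stack.ss_size = sizeof fiber_stack;
        fiber_ctx.uc_link = &main_ctx;            /* where to go when fiber ends */
        makecontext(&fiber_ctx, fiber_fn, 0);

        for (int i = 0; i < 3; i++) {
            printf("main: switching to fiber\n");
            swapcontext(&main_ctx, &fiber_ctx);   /* cooperative "schedule" */
        }
        return 0;
    }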
If you want to gain performance, you'll have to leverage kernel threads. Only the kernel can help you get code running simultaneously on more than one CPU core. Unless your program is I/O bound (or performing other blocking operations), performing user-mode cooperative multithreading (also known as fibers) is not going to gain you any performance. You'll just be performing extra context switches, but the one CPU that your real thread is running on will still be running at 100% either way.
System calls have gotten faster. Modern CPUs have support for the sysenter instruction, which is significantly faster than the old int instruction. See also this article for how Linux does system calls in the fastest way possible.
Make sure that the automatically-generated multithreading has the threads run for long enough that you gain performance. Don't try to parallelize short pieces of code, you'll just waste time spawning and joining threads. Also be wary of memory effects (although these are harder to measure and predict) -- if multiple threads are accessing independent data sets, they will run much faster than if they were accessing the same data repeatedly due to the cache coherency problem.
Quite a bit late now, but I was interested in this kind of topic myself.
In fact, there's nothing all that special about threads that specifically requires the kernel to intervene EXCEPT for parallelization/performance.
Obligatory BLUF:
Q1: No. At least initial system calls are necessary to create multiple kernel threads across the various CPU cores/hyper-threads.
Q2: It depends. If you create/destroy threads that perform tiny operations then you're wasting resources (the thread creation process would greatly exceed the time used by the thread before it exits). If you create N threads (where N is ~# of cores/hyper-threads on the system) and re-task them then the answer COULD be yes depending on your implementation.
Q3: You COULD optimize operation if you KNEW ahead of time a precise method of ordering operations. Specifically, you could create what amounts to a ROP-chain (or a forward call chain, but this may actually end up being more complex to implement). This ROP-chain (as executed by a thread) would continuously execute 'ret' instructions (to its own stack) where that stack is continuously prepended (or appended in the case where it rolls over to the beginning). In such a (weird!) model the scheduler keeps a pointer to each thread's 'ROP-chain end' and writes new values to it whereby the code circles through memory executing function code that ultimately results in a ret instruction. Again, this is a weird model, but is intriguing nonetheless.
Onto my 2-cents worth of content.
I recently created what effectively operate as threads in pure assembly by managing various stack regions (created via mmap) and maintaining a dedicated area to store the control/individualization information for the "threads". It is possible, although I didn't design it this way, to create a single large block of memory via mmap that I subdivide into each thread's 'private' area. Thus only a single syscall would be required (although guard pages between them would be smart; these would require additional syscalls).
This implementation uses only the base kernel thread created when the process spawns and there is only a single usermode thread throughout the entire execution of the program. The program updates its own state and schedules itself via an internal control structure. I/O and such are handled via blocking options when possible (to reduce complexity), but this isn't strictly required. Of course I made use of mutexes and semaphores.
To implement this system (entirely in userspace and also via non-root access if desired) the following were required:
A notion of what threads boil down to:
A stack for stack operations (kind of self-explanatory and obvious)
A set of instructions to execute (also obvious)
A small block of memory to hold individual register contents
What a scheduler boils down to:
A manager for a series of threads (note that processes never actually execute, just their thread(s) do) in a scheduler-specified ordered list (usually priority).
A thread context switcher:
A MACRO injected into various parts of code (I usually put these at the end of heavy-duty functions) that equates roughly to 'thread yield', which saves the thread's state and loads another thread's state.
So, it is indeed possible (entirely in assembly, and without system calls other than the initial mmap and mprotect) to create usermode thread-like constructs in a non-root process.
I only added this answer because you specifically mention x86 assembly and this answer was entirely derived via a self-contained program written entirely in x86 assembly that achieves the goals (minus multi-core capabilities) of minimizing system calls and also minimizes system-side thread overhead.
System calls are not that slow now, with syscall or sysenter instead of int. Still, there will only be an overhead when you create or destroy the threads. Once they are running, there are no system calls. User mode threads will not really help you, since they only run on one core.
First you should learn how to use threads in C (pthreads, POSIX threads). On GNU/Linux you will probably want to use POSIX threads or GLib threads.
Then you can simply call the C from assembly code.
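For example, a minimal pthreads program (compile with gcc -pthread), which you could equally drive from assembly by calling pthread_create through the normal C calling convention:

    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        printf("hello from thread %ld\n", (long)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        if (pthread_create(&t, NULL, worker, (void *)1L) != 0)
            return 1;
        pthread_join(t, NULL);   /* wait for the thread to finish */
        return 0;
    }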
Here are some pointers:
Posix threads: link text
A tutorial where you will learn how to call C functions from assembly: link text
Butenhof's book on POSIX threads link text
