Jumping to and fro between kernel and user code in Linux

I am doing some kernel hacking on Linux running on x86-64 for a research project. From a kernel routine I need to jump to a user-mode code page and immediately return to kernel code. In other words, I need to do a trampoline through user code while executing in the kernel.
I am wondering whether this is possible at all. If it is, can somebody give some idea of how it can be achieved?

It is unlikely to be possible "easily".
Without knowing your application, and without suggesting that you rethink your kernel<->app interface, a possible hack for you could work like this: have the application register a piece of trampoline code with your kernel component by just passing the address of that code. The trampoline code would execute your "real" user mode function, then issue another syscall or exception to return to the kernel.
While this is not exactly a user-mode subroutine, it gets reasonably close: when your application calls whatever kernel function needs to do the callback, the kernel function can just save the real return address, change it to the registered trampoline address, and return to user mode. The trampoline will call the function; the syscall/exception following it will kick you back into the kernel, and you can continue whatever you were doing there.
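To make that concrete, here is a minimal sketch of what the user-space side might look like. The syscall numbers NR_REGISTER_TRAMPOLINE and NR_TRAMPOLINE_DONE are hypothetical placeholders for whatever interface your kernel patch actually exposes:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>

/* Hypothetical syscall numbers exported by the research kernel patch. */
#define NR_REGISTER_TRAMPOLINE  451
#define NR_TRAMPOLINE_DONE      452

static void real_user_callback(void)
{
    puts("running in user mode on behalf of the kernel");
}

/* The kernel rewrites the saved return address so that, instead of
 * returning to libc, the interrupted syscall "returns" here. */
static void trampoline(void)
{
    real_user_callback();
    syscall(NR_TRAMPOLINE_DONE);   /* re-enter the kernel; does not return */
}

int main(void)
{
    /* Hand the trampoline's address to the kernel component. */
    syscall(NR_REGISTER_TRAMPOLINE, (unsigned long)trampoline);

    /* ... later, some call into your kernel code triggers the callback ... */
    return 0;
}
```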
You probably don't need to worry about security anyway, but if you do, you'd have to make sure the "return from trampoline" syscall is only accepted from processes that still have an open trampoline call in progress.
You can also take a look at how signals work; they are about having the kernel interrupt an application and having the application invoke a signal handler; a signal-like implementation would even work without your application having an active syscall going on (but it will also have all the limitations of a signal handler).
In fact, maybe you can just use a signal? Again, take a look at how signals work in the kernel, and just signal your user code. Install the appropriate signal handler in your application, and have the signal handler invoke the "return from userspace trampoline" syscall.
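A sketch of that signal-based variant, again assuming the hypothetical "trampoline done" syscall number provided by your kernel component:

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <unistd.h>

#define NR_TRAMPOLINE_DONE 452   /* hypothetical, as above */

/* Does the user-mode work, then re-enters the kernel component. */
static void on_kernel_callback(int signo)
{
    (void)signo;
    /* ... user-mode work here (async-signal-safe calls only) ... */
    syscall(NR_TRAMPOLINE_DONE);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_kernel_callback;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);   /* SIGUSR1 chosen arbitrarily */

    for (;;)
        pause();   /* in a real application, do useful work instead */
}
```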
Either way, it sounds a bit... hackish. Without ever having done any kernel work myself, I would assume that interfacing with your application through a device node, socket or similar mechanism would probably be a much better way... or just have your syscalls return a "to do" result to the application, telling it to invoke some user-space code and report back with another syscall.

Related

Can I safely access potentially unallocated memory addresses?

I'm trying to create a memcpy-like function that will fail gracefully (i.e. return an error instead of segfaulting) when given an address in memory that is part of an unallocated page. I think the right approach is to install a SIGSEGV signal handler and do something in the handler to make the memcpy function stop copying.
But I'm not sure what happens in the case my program is multithreaded:
Is it possible for the signal handler to execute in another thread?
What happens if a segfault isn't related to any memcpy operation?
How does one handle two threads executing memcpy concurrently?
Am I missing something else? Am I looking for something that's impossible to implement?
Trust me, you do not want to go down that road. It's a can of worms for many reasons. Correct signal handling is already hard in single-threaded environments, let alone in multithreaded code.
First of all, returning from a signal handler that was triggered by an exception condition is undefined behavior - it happens to work on Linux, but it is still undefined behavior and will give you problems sooner or later.
From man 2 sigaction:
The behaviour of a process is undefined after it returns normally from
a signal-catching function for a SIGBUS, SIGFPE, SIGILL or SIGSEGV
signal that was not generated by kill(), sigqueue() or raise().
(Note: this does not appear in the Linux man page, but it is in SUSv2.)
This is also specified in POSIX. While it works in Linux, it's not good practice.
Below the specific answers to your questions:
Is it possible for the signal handler to execute in another thread?
Yes, it is. A signal is delivered to any thread that is not blocking it (but is delivered only to one, of course), although in Linux and many other UNIX variants, exception-related signals (SIGILL, SIGFPE, SIGBUS and SIGSEGV) are usually delivered to the thread that caused the exception. This is not required though, so for maximum portability you shouldn't rely on it.
You can use pthread_sigmask(2) to block signals in every thread but one; that way you make sure that every signal is always delivered to the same thread. This makes it easy to have a single thread dedicated to signal handling, which in turn allows you to handle signals synchronously: the thread can sit in sigwait(2) (note that multithreaded code should use sigwait(2) rather than sigsuspend(2)) until a signal is delivered and then handle it. This is a very common pattern.
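For illustration, a minimal sketch of that pattern (shown here with SIGINT and SIGTERM; it is the block-everywhere/dedicated-thread structure that matters, not the particular signals):

```c
#include <pthread.h>
#include <signal.h>
#include <stdio.h>

/* All signals of interest are blocked in every thread; only this
 * thread ever sees them, via sigwait(), and handles them synchronously. */
static void *signal_thread(void *arg)
{
    sigset_t *set = arg;
    int sig;

    for (;;) {
        if (sigwait(set, &sig) == 0)
            printf("got signal %d, handling it synchronously\n", sig);
    }
    return NULL;
}

int main(void)
{
    sigset_t set;
    pthread_t tid;

    sigemptyset(&set);
    sigaddset(&set, SIGINT);
    sigaddset(&set, SIGTERM);

    /* Block before creating any threads so they all inherit the mask. */
    pthread_sigmask(SIG_BLOCK, &set, NULL);

    pthread_create(&tid, NULL, signal_thread, &set);
    /* ... create worker threads and do the real work here ... */
    pthread_join(tid, NULL);
    return 0;
}
```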
What happens if a segfault isn't related to any memcpy operation?
Good question. The signal is delivered, and there is no (trivial) way to portably differentiate a genuine segfault from a segfault in memcpy(3).
If you have one thread taking care of every signal, as mentioned above, you could use sigwaitinfo(2) and then examine the si_addr field of siginfo_t once sigwaitinfo(2) returns. The si_addr field is the memory location that caused the fault, so you could compare it to the memory addresses passed to memcpy(3).
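A rough sketch of that check; address_belongs_to_a_memcpy() is a hypothetical helper that would consult whatever bookkeeping your wrapper keeps about in-progress copies:

```c
#include <signal.h>
#include <stdio.h>

/* Hypothetical helper: consults the wrapper's record of buffers
 * currently being copied and says whether addr falls inside one. */
extern int address_belongs_to_a_memcpy(void *addr);

/* Runs in the dedicated signal thread, with SIGSEGV blocked everywhere
 * and present in *set. */
static void wait_for_segv(const sigset_t *set)
{
    siginfo_t info;
    int sig = sigwaitinfo(set, &info);

    if (sig == SIGSEGV) {
        if (address_belongs_to_a_memcpy(info.si_addr))
            printf("fault at %p inside a tracked copy\n", info.si_addr);
        else
            printf("genuine segfault at %p\n", info.si_addr);
    }
}
```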
But some platforms, most notably Mac OS, do not implement sigwaitinfo(2) or its cousin sigtimedwait(2).
So there's no way to do it portably.
How does one handle two threads executing memcpy concurrently?
I don't really understand this question - what's so special about a multithreaded memcpy(3)? It is the caller's responsibility to make sure the regions of memory being read from and written to are not accessed concurrently; memcpy(3) isn't (and never was) thread-safe if you pass it overlapping buffers.
Am I missing something else? Am I looking for something that's impossible to implement?
If you're concerned with portability, I would say it's pretty much impossible. Even if you just focus on Linux, it will be hard. If this were easy to do, someone would probably have done it already.
I think you're better off building your own allocator and forcing user code to rely on it. Then you can store state and manage allocated memory, and easily tell whether the buffers passed to you are valid.
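A very rough sketch of that idea (single-threaded, no free() tracking, purely to show the shape of the bookkeeping):

```c
#include <stdlib.h>
#include <string.h>

struct alloc_rec { void *ptr; size_t size; struct alloc_rec *next; };
static struct alloc_rec *live;   /* list of live allocations */

void *tracked_malloc(size_t size)
{
    void *p = malloc(size);
    if (p) {
        struct alloc_rec *r = malloc(sizeof *r);
        if (r) { r->ptr = p; r->size = size; r->next = live; live = r; }
    }
    return p;
}

/* Nonzero iff [addr, addr + len) lies entirely inside one tracked allocation. */
static int range_is_valid(const void *addr, size_t len)
{
    const char *a = addr;
    for (struct alloc_rec *r = live; r; r = r->next) {
        const char *base = r->ptr;
        if (a >= base && a + len <= base + r->size)
            return 1;
    }
    return 0;
}

/* Refuses the copy instead of faulting when either buffer is unknown. */
int safe_memcpy(void *dst, const void *src, size_t n)
{
    if (!range_is_valid(dst, n) || !range_is_valid(src, n))
        return -1;
    memcpy(dst, src, n);
    return 0;
}
```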

How system calls are handled in Linux on ARM machine

I have a doubt regarding system calls in Linux on ARM processors.
On ARM, system calls are handled in SWI mode. My doubt is: do we perform all of the required work in SWI mode, or is only part of the work done in SWI mode before moving to some process context? As per my understanding, some system calls can take a significant amount of time, and performing that work in SWI mode is not a good idea.
Also, how do we return to the calling user process? I mean, in the case of a non-blocking system call, how do we notify the user that the required task has been completed by the system call?
I think you're missing two concepts.
CPU privilege modes and the use of swi are both implementation details of system calls
Non-blocking system calls don't work that way
Sure, under Linux, we use swi instructions and maintain privilege separation to implement system calls, but this doesn't reflect ARM systems in general. When you talk about Linux specifically, I think it makes more sense to refer to concepts like kernel vs user mode.
The Linux kernel has been preemptive for a long time now. If your system call takes too long and exceeds the time quantum allocated to that process/thread, the scheduler will just kick in and do a context switch. Likewise, if your system call just consists of waiting for an event (like I/O), it'll just get switched out until there's data available.
Taking this into account, you don't usually have to worry about whether your system call takes too long. However, if you're spending a significant amount of time in a system call doing something other than waiting for some event, chances are that you're doing something in the kernel that should be done in user mode.
When the function handling the system call returns a value, it usually goes back to some sort of glue logic which restores the user context and allows the original user mode program to keep running.
Non-blocking system calls are something almost completely different. The system call handling function usually will check if it can return data at that very instant without waiting. If it can, it'll return whatever is available. It can also tell the user "I don't have any data at the moment but check back later" or "that's all, there's no more data to be read". The idea is they return basically instantly and don't block.
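For example, this is roughly what non-blocking behaviour looks like from user space with an O_NONBLOCK descriptor (the device path is just an example):

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[256];
    int fd = open("/dev/ttyS0", O_RDONLY | O_NONBLOCK);  /* example device */
    if (fd < 0)
        return 1;

    ssize_t n = read(fd, buf, sizeof buf);
    if (n > 0)
        printf("got %zd bytes right away\n", n);
    else if (n == 0)
        printf("no more data to be read\n");
    else if (errno == EAGAIN || errno == EWOULDBLOCK)
        printf("nothing available right now, check back later\n");

    close(fd);
    return 0;
}
```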
Finally, on your last question, I suspect you're missing the point of a system call.
You should never have to know when a task is 'completed' by a system call. If a system call doesn't return an error, you, as the process, have to assume it succeeded. The rest is in the implementation details of the kernel. In the case of non-blocking system calls, they will tell you what to expect.
If you can provide an example for the last question, I may be able to explain in more detail.

Network i/o in parallel for a FUSE file system

My motivation
I'd love to write a distributed file system using FUSE. I'm still designing the code before I jump in. It'll possibly be written in C or Go; the question is, how do I deal with network I/O in parallel?
My problem
More specifically, I want my file system to write locally and have a thread handle the network overhead asynchronously. It doesn't matter if the network side is slightly delayed in my case; I simply want to avoid slow writes to files because the code has to contact some slow server somewhere.
My understanding
There are two conflicting ideas in my head. One is that the FUSE kernel module uses the ABI of my program to hijack the process and call the specific FUSE functions I implemented (sync or async, whatever); the other is that the program is running and blocking to receive events from the kernel module (which I don't think is the case, but I could be wrong).
Whatever it is, does it mean I can simply start a thread and do the network work there? I'm a bit lost on how that works. Thanks.
You don't need to do any hijacking. The FUSE kernel module registers as a filesystem provider (of type fusefs). It then services read/write/open/etc. calls by dispatching them to the user-mode process. When that process returns, the kernel module takes the return value and returns from the corresponding system call.
If you want the server (i.e. the user-mode process) to be asynchronous and multi-threaded, all you have to do is dispatch the operation (assuming it's a write - you can't parallelize input this way) to another thread in that process and return immediately to FUSE. That way, your user-mode process can, at its leisure, write out to the remote server.
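A sketch of that write path using the libfuse high-level API; write_locally() and enqueue_remote_write() are hypothetical helpers standing in for your local store and your background network worker:

```c
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helpers implemented elsewhere: the local store and a
 * queue drained by a background thread that talks to the remote server. */
extern void write_locally(const char *path, const char *buf, size_t size, off_t off);
extern void enqueue_remote_write(const char *path, char *buf, size_t size, off_t off);

static int myfs_write(const char *path, const char *buf, size_t size,
                      off_t offset, struct fuse_file_info *fi)
{
    (void)fi;

    /* 1. Satisfy the write locally so the caller is not held up. */
    write_locally(path, buf, size, offset);

    /* 2. Hand a copy to the background worker and return to FUSE
     *    immediately; the slow server is contacted later. */
    char *copy = malloc(size);
    if (copy) {
        memcpy(copy, buf, size);
        enqueue_remote_write(path, copy, size, offset);
    }
    return (int)size;   /* report the whole write as done */
}

static struct fuse_operations myfs_ops = {
    .write = myfs_write,
    /* .getattr, .open, .read, ... omitted */
};

int main(int argc, char *argv[])
{
    return fuse_main(argc, argv, &myfs_ops, NULL);
}
```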
You could similarly try to parallelize read, but the issue here is that you won't be able to return to FUSE (and thus release the reading process) until you have at least the beginning of the data read.

Linux System Call Flow Sequence

I had a question regarding the deep working of Linux.
Let's say a multi-threaded process is being executed on the CPU; in that case one of its threads is actually running. In the broader picture, the corresponding pages belonging to the process will have been loaded into RAM for execution.
Let's say the thread makes a system call. I am a bit unclear on the workings after this. The interrupt will generate a call. One of my questions is: who will answer this call?
Let's say the system has an m:n user-level-thread to kernel-level-thread mapping; I am assuming that the corresponding kernel-level thread will answer this call.
So the kernel will look up the interrupt vector table and get the routine which needs to be executed. My next question is: which stack will be used in the execution of the interrupt? Will it be the kernel thread's stack or the user-level thread's stack? (I am assuming that it will be the kernel thread's stack.)
Coming back to the flow of the program, let's say the operation is opening a file using fopen. The subsequent question I have is: how will the jump from the ISR to the system call take place? Or is our ISR mapped to a system call?
Also, in the broader picture, when the kernel thread is being executed I am assuming that the "OS region" of RAM will be used to house the pages which are executing the system call.
Again, looking at it from a different angle (hope you're still with me), finally I am assuming that the corresponding kernel thread is being handled by the CPU scheduler, wherein a context switch would have happened from the user-level thread to the corresponding kernel-level thread when the fopen system call was being answered.
I have made a lot of assumptions and it would be absolutely fantastic if anyone could clear the doubts or at least guide me in the right direction.
Note: I work predominantly with ARM machines, so some of these things might be ARM-specific. Also, I'm going to try to simplify it as much as I can. Feel free to correct anything that might be wrong or oversimplified.
Let's say the thread makes a system call. I am a bit unclear on the workings after this. The interrupt will generate a call. One of my questions is: who will answer this call?
Usually, the processor will start executing at some predetermined location in kernel mode. The kernel will save the current process state and look at the userspace registers to determine which system call was requested and dispatch that to the correct system call handler.
So the kernel will look up the interrupt vector table and get the routine which needs to be executed. My next question is: which stack will be used in the execution of the interrupt? Will it be the kernel thread's stack or the user-level thread's stack? (I am assuming that it will be the kernel thread's stack.)
I'm pretty sure it will switch to a kernel stack. There would be some pretty severe security problems with information leaks if they used the userspace stack.
Coming back to the flow of the program, let's say the operation is opening a file using fopen. The subsequent question I have is: how will the jump from the ISR to the system call take place? Or is our ISR mapped to a system call?
fopen() is actually a libc function and not a system call itself. It may (and in most cases will) call the open() syscall in its implementation though.
So, the process (roughly) is as follows; a sketch of how steps 2 and 3 look on ARM appears after the list:
Userspace calls fopen()
fopen performs a system call to open()
This triggers some sort of exception or interrupt. In response, the processor switches into a more privileged mode and starts executing at some preset location in the kernel.
Kernel determines what kind of interrupt or exception it is and handles it appropriately. In our case, it will be a system call.
Kernel determines which system call is being requested by reading the userspace registers, extracts any arguments, and passes them to the appropriate handler.
Handler runs.
Kernel puts any return code into userspace registers.
Kernel transfers execution back to where the exception occurred.
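As a concrete illustration of steps 2 and 3, this is roughly what a raw syscall stub looks like on ARM EABI Linux: the syscall number goes in r7, the arguments in r0-r2, and svc #0 raises the exception that lands in the kernel (this sketch only builds on ARM):

```c
#include <stddef.h>
#include <sys/syscall.h>

/* write(fd, buf, len) issued directly, the way libc's wrapper does it. */
long raw_write(int fd, const void *buf, size_t len)
{
    register long r7 asm("r7") = SYS_write;   /* syscall number */
    register long r0 asm("r0") = fd;          /* arguments in r0-r2 */
    register long r1 asm("r1") = (long)buf;
    register long r2 asm("r2") = len;

    asm volatile("svc #0"                     /* trap into the kernel */
                 : "+r"(r0)                   /* result comes back in r0 */
                 : "r"(r7), "r"(r1), "r"(r2)
                 : "memory");

    return r0;   /* >= 0 on success, -errno on failure */
}
```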
Also, in the broader picture, when the kernel thread is being executed I am assuming that the "OS region" of RAM will be used to house the pages which are executing the system call.
Pages don't execute anything :) Usually, on 32-bit Linux with the default split, any address mapped above 0xC0000000 belongs to the kernel.
Again, looking at it from a different angle (hope you're still with me), finally I am assuming that the corresponding kernel thread is being handled by the CPU scheduler, wherein a context switch would have happened from the user-level thread to the corresponding kernel-level thread when the fopen system call was being answered.
With a preemptive kernel, threads effectively aren't discriminated against. To my understanding, a new thread isn't created for the purpose of servicing a system call - it just runs in the same thread from which the system call was requested, except in kernel mode.
That means a thread that is in kernel mode servicing a system call can be scheduled out just the same as any other thread. Hence, this is where you hear about 'userspace context' when developing for the kernel. It means it's executing in kernel mode on a usermode thread.
It's a little difficult to explain this so I hope I got it right.

Does the kernel's panic() function completely freeze every other process?

I would like confirmation that the kernel's panic() function and the others like kernel_halt() and machine_halt(), once triggered, guarantee complete freezing of the machine.
So, are all the kernel and user processes frozen? Is panic() interruptible by the scheduler? Could interrupt handlers still be executed?
Use case: in case of serious error, I need to be sure that the hardware watchdog resets the machine. To this end, I need to make sure that no other thread/process is keeping the watchdog alive. I need to trigger a complete halt of the system. Currently, inside my kernel module, I simply call panic() to freeze everything.
Also, is the user-space halt command guaranteed to freeze the system?
Thanks.
edit: According to http://linux.die.net/man/2/reboot, I think the best way is to use reboot(LINUX_REBOOT_CMD_HALT): "Control is given to the ROM monitor, if there is one".
Thank you for the comments above. After some research, I am ready to give myself a more complete answer, below:
At least for the x86 architecture, reboot(LINUX_REBOOT_CMD_HALT) is the way to go. This, in turn, invokes the reboot() syscall (see: http://lxr.linux.no/linux+v3.6.6/kernel/sys.c#L433). For the LINUX_REBOOT_CMD_HALT flag (see: http://lxr.linux.no/linux+v3.6.6/kernel/sys.c#L480), the syscall calls kernel_halt() (defined here: http://lxr.linux.no/linux+v3.6.6/kernel/sys.c#L394). That function calls syscore_shutdown() to execute all the registered system core shutdown callbacks, displays the "System halted" message, dumps the kernel messages, and finally calls machine_halt(), which is a wrapper for native_machine_halt() (see: http://lxr.linux.no/linux+v3.6.6/arch/x86/kernel/reboot.c#L680). It is this function that stops the other CPUs (through machine_shutdown()) and then calls stop_this_cpu() to disable the last remaining working processor. The first thing stop_this_cpu() does is disable interrupts on the current processor, so the scheduler is no longer able to take control.
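From user space, that call is just the glibc reboot() wrapper (it needs CAP_SYS_BOOT, i.e. it is typically run as root); RB_HALT_SYSTEM is glibc's name for LINUX_REBOOT_CMD_HALT:

```c
#include <unistd.h>
#include <sys/reboot.h>

int main(void)
{
    sync();                    /* flush pending writes first */
    reboot(RB_HALT_SYSTEM);    /* halts the machine; only returns on error */
    return 1;
}
```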
I am not sure why the reboot() syscall still calls do_exit(0) after calling kernel_halt(). I interpret it like this: now, with all processors marked as disabled, the reboot() syscall calls do_exit(0) and ends itself. Even if the scheduler were woken, there are no enabled processors left on which it could schedule any task, nor any interrupts: the system is halted. I am not sure about this explanation, as stop_this_cpu() does not appear to return (it enters an infinite loop). Maybe it's just a safeguard for the case where stop_this_cpu() fails (and returns): in that case, do_exit() will cleanly end the current task, and then the panic() function is called.
As for the panic() code (defined here: http://lxr.linux.no/linux+v3.6.6/kernel/panic.c#L69), the function first disables local interrupts, then disables all the other processors except the current one by calling smp_send_stop(). Finally, as the sole task executing on the only processor still alive, with all local interrupts disabled (that is, the preemptive scheduler -- driven by a timer interrupt, after all -- has no chance to run), the panic() function either loops for some time or calls emergency_restart(), which is supposed to restart the processor.
If you have better insight, please contribute.
