How are system calls handled in Linux on an ARM machine?

I have a doubt regarding system calls in Linux on ARM processors.
On ARM, system calls are handled in SWI mode. My doubt is: do we perform all of the required work in SWI mode, or is only part of the work done in SWI mode before we move to some process context? As I understand it, some system calls can take significant time, and performing that work in SWI mode is not a good idea.
Also, how do we return to the calling user process? In the case of a non-blocking system call, how do we notify the user that the required task has been completed?

I think you're missing two concepts:
1. CPU privilege modes and the use of swi are both implementation details of system calls.
2. Non-blocking system calls don't work the way you describe.
Sure, under Linux, we use swi instructions and maintain privilege separation to implement system calls, but this doesn't reflect ARM systems in general. When you talk about Linux specifically, I think it makes more sense to refer to concepts like kernel vs user mode.
The Linux kernel has been preemptible for a long time now. If your system call takes too long and exceeds the time slice allocated to that process/thread, the scheduler will simply kick in and perform a context switch. Likewise, if your system call just consists of waiting for an event (such as I/O), the task will be switched out until data is available.
Taking this into account, you don't usually have to worry about whether your system call takes too long. However, if you're spending a significant amount of time in a system call doing something that isn't waiting for an event, chances are you're doing something in the kernel that should be done in user mode.
When the function handling the system call returns a value, it usually goes back to some sort of glue logic which restores the user context and allows the original user mode program to keep running.
Non-blocking system calls are something almost completely different. The system call handling function usually will check if it can return data at that very instant without waiting. If it can, it'll return whatever is available. It can also tell the user "I don't have any data at the moment but check back later" or "that's all, there's no more data to be read". The idea is they return basically instantly and don't block.
Finally, on your last question, I suspect you're missing the point of a system call.
You should never need to know when a task is 'completed' by a system call. If a system call doesn't return an error, you, as the process, have to assume it succeeded. The rest is an implementation detail of the kernel. In the case of non-blocking system calls, their return values tell you what to expect.
If you can provide an example for the last question, I may be able to explain in more detail.

Why can a signal handler not run during system calls?

In Linux a process that is waiting for IO can be in either of the states TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE.
The latter is the case, for example, when the process waits for a read from a regular file to complete. If this read takes very long, or forever (for whatever reason), the process cannot receive any signals during that time.
The former is the case if, for example, one waits for a read from a socket (which may take an unbounded amount of time). If such a read call gets interrupted, however, the programmer/user must handle the errno == EINTR condition correctly, which is easy to forget.
My question is: Wouldn't it be possible to allow all system calls to be interrupted by signals and, moreover, in such a way that after the signal handler ran, the original system call simply continues its business? Wouldn't this solve both problems? If it is not possible, why?
The two halves of the question are related but I may have digested too much spiked eggnog to determine if there is so much a one-to-one as opposed to a one-to-several relationship here.
On the kernel side, I think a TASK_KILLABLE state was added a couple of years back. The consequence is that while a process is blocked in a system call (for whatever reason), if the kernel encounters a signal that is going to kill the process anyway, it lets nature take its course. That reduces the chances of a process being permanently stuck in an uninterruptible state. What progress has been made in converting individual system calls to avail themselves of that option, I do not know.
On the user side, SA_RESTART and its convenience function cousin siginterrupt go some distance to alleviating the application programmer pain. But the fact that these are specified per signal and aren't honored by all system calls is a pretty good hint that there are reasons why a blanket interruption scheme like you propose can't be implemented, or at least easily implemented. Just thinking about the possible interactions of the matrices of signals, individual system calls, calls that have historically expected and backwardly supported behavior (and possibly multiple historical semantics based on their system bloodline!), is a bit dizzying. But maybe that is the eggnog.

How does the Linux kernel realize reentrancy?

All Unix kernels are reentrant: several processes may be executing in kernel mode at the same time. How can I realize this effect in code? How should I handle the situation where many processes invoke system calls and are pending in kernel mode?
[Edit - the term "reentrant" gets used in a couple of different senses. This answer uses the basic "multiple contexts can be executing the same code at the same time." This usually applies to a single routine, but can be extended to apply to a set of cooperating routines, generally routines which share data. An extreme case of this is when applied to a complete program - a web server, or an operating system. A web-server might be considered non-reentrant if it could only deal with one client at a time. (Ugh!) An operating system kernel might be called non-reentrant if only one process/thread/processor could be executing kernel code at a time.
Kernels like that were common during the transition to multi-processor systems. Many went through a slow evolution from written-for-uniprocessors, to one-single-lock-protects-everything (i.e. non-reentrant), through various stages of finer and finer grained locking. IIRC, Linux finally got rid of the "big kernel lock" at approx. version 2.6.37 - but it was mostly gone long before that, just protecting remnants not yet converted to a multiprocessing implementation.
The rest of this answer is written in terms of individual routines, rather than complete programs.]
If you are in user space, you don't need to do anything. You call whatever system calls you want, and the right thing happens.
So I'm going to presume you are asking about code in the kernel.
Conceptually, it's fairly simple. It's also pretty much identical to what happens in a multi-threaded program in user space, when multiple threads call the same subroutine. (Let's assume it's a C program - other languages may have differently named mechanisms.)
When the system call implementation is using automatic (stack) variables, it has its own copy - no problem with re-entrancy. When it needs to use global data, it generally needs to use some kind of locking - the specific locking required depends on the specific data it's using, and what it's doing with that data.
This is all pretty generic, so perhaps an example might help.
Let's say the system call wants to modify some attribute of a process. The process is represented by a struct task_struct, which is a member of various linked lists. Those linked lists are protected by the tasklist_lock. Your system call takes the tasklist_lock, finds the right process, possibly takes a per-process lock controlling the field it cares about, modifies the field, and drops both locks.
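The same pattern can be sketched in user space with pthreads standing in for kernel locks. The struct and lock names below only mimic the kernel's; this is an analogy, not kernel code:

```c
#include <pthread.h>
#include <stddef.h>

struct task { int pid; int nice; struct task *next; };

static struct task *task_list;                 /* global, shared data    */
static pthread_mutex_t tasklist_lock = PTHREAD_MUTEX_INITIALIZER;

/* Reentrant "system call": locals live on each caller's stack, and the
 * shared list is only touched while holding tasklist_lock. */
int set_task_nice(int pid, int nice)
{
    int rc = -1;                               /* automatic: per-caller  */
    pthread_mutex_lock(&tasklist_lock);
    for (struct task *t = task_list; t != NULL; t = t->next) {
        if (t->pid == pid) {
            t->nice = nice;
            rc = 0;
            break;
        }
    }
    pthread_mutex_unlock(&tasklist_lock);
    return rc;
}
```

Any number of threads may run set_task_nice concurrently; the automatic variables never conflict, and the lock serializes only the access to the shared list.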
One more detail: the case of processes executing different system calls, which don't share data with each other. With a reasonable implementation, there are no conflicts at all. One process can get itself into the kernel to handle its system call without affecting the other processes. I don't remember looking specifically at the Linux implementation, but I imagine it's "reasonable": something like a trap into an exception handler, which looks in a table to find the subroutine to handle the specific system call requested. The table is effectively const, so no locks are required.
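That dispatch step can be sketched as a const table of function pointers indexed by system call number. The handler names and numbering here are invented for illustration; real kernels add argument validation and architecture-specific entry code:

```c
#include <stddef.h>

#define ENOSYS_ERR (-38)   /* stand-in for the kernel's -ENOSYS */

typedef long (*syscall_fn)(long, long, long);

/* Toy handlers standing in for real system call implementations. */
static long sys_hello(long a, long b, long c) { (void)b; (void)c; return a + 1; }
static long sys_add(long a, long b, long c)   { (void)c; return a + b; }

/* Effectively const: never written after boot, so no lock is needed
 * even when many CPUs dispatch system calls concurrently. */
static const syscall_fn sys_call_table[] = { sys_hello, sys_add };

long do_syscall(size_t nr, long a, long b, long c)
{
    if (nr >= sizeof sys_call_table / sizeof sys_call_table[0])
        return ENOSYS_ERR;                      /* unknown syscall number */
    return sys_call_table[nr](a, b, c);
}
```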

context_switch and system call virtualization

When the kernel is executing a context_switch(), i.e. when it is in that function, is it possible that some other task makes a system call? As far as I understand, as the processor state is being swapped, no other code can actually execute until the context_switch completes. Is this understanding correct?
Background: basically, I want to swap system call tables on context switches. I have two versions of the syscall vector, one of them containing addresses of modified syscall code. When a "process of interest" is scheduled in, I will update the sys_call_table pointer to point to the new one, and then swap it back when the process of interest is swapped out.
I'm new to kernel dev, so feedback on my approach is also welcome.
PS: I know about ptrace() and it doesn't suit my needs. Too many context switches involved in virtualizing syscalls UML style.
I suspect you may be thinking about it wrong. Even if the kernel doesn't allow syscalls from other tasks while context_switch is executing, that is only likely to extend as far as the currently executing processor. The other processors can still be executing syscalls while one core is context switching.

Network i/o in parallel for a FUSE file system

My motivation
I'd love to write a distributed file system using FUSE. I'm still designing the code before I jump in. It will possibly be written in C or Go; the question is, how do I deal with network I/O in parallel?
My problem
More specifically, I want my file system to write locally, and have a thread do the network overhead asynchronously. It doesn't matter if it's slightly delayed in my case, I simply want to avoid slow writes to files because the code has to contact some slow server somewhere.
My understanding
There are two ideas conflicting in my head. One is that the FUSE kernel module uses the ABI of my program to hijack the process and call the specific FUSE functions I implemented (sync or async, whatever); the other is that the program is running and blocking to receive events from the kernel module (which I don't think is the case, but I could be wrong).
Whatever it is, does it means I can simply start a thread and do network stuff? I'm a bit lost on how that works. Thanks.
You don't need to do any hijacking. The FUSE kernel module registers as a filesystem provider (of type fusefs). It then services read/write/open/etc. calls by dispatching them to the user-mode process. When that process replies, the kernel module takes the returned value and returns from the corresponding system call.
If you want the server (i.e. the user-mode process) to be asynchronous and multi-threaded, all you have to do is dispatch the operation (assuming it's a write - you can't parallelize input this way) to another thread in that process and return immediately to FUSE. That way, your user-mode process can, at its leisure, write out to the remote server.
You could similarly try to parallelize read, but the issue here is that you won't be able to return to FUSE (and thus release the reading process) until you have at least the beginning of the data read.

Are GPIO APIs in linux deterministic in time taken?

I need to call gpio_get_value, gpio_set_value, gpio_direction_input/output in my driver, and there is a timing requirement that requests the function calls to be returned in less than 5us time.
Can gpiolib meet this requirement or is it not deterministic? If not, what could be the solution? directly accessing GPIO registers?
Thanks a lot.
Calling one of those functions involves executing a Linux kernel function, so there are at least two context switches: one to run the kernel code and one to return to userland.
Time can also be lost if interrupts or signals come in and get serviced first.
Anyway, if you need deterministic deadlines, you need to switch to a real-time patched kernel (see the Real-Time Linux wiki homepage).
I don't know if 5us is feasible; it depends on the system load and on the active drivers (e.g. is there a filesystem doing reads and writes?), but you can test it.