Network I/O in parallel for a FUSE file system - Linux

My motivation
I'd love to write a distributed file system using FUSE. I'm still designing the code before I jump in. It will possibly be written in C or Go; the question is, how do I deal with network I/O in parallel?
My problem
More specifically, I want my file system to write locally, and have a thread do the network overhead asynchronously. It doesn't matter if it's slightly delayed in my case, I simply want to avoid slow writes to files because the code has to contact some slow server somewhere.
My understanding
There are two ideas conflicting in my head. One is that the FUSE kernel module uses the ABI of my program to hijack the process and call the specific FUSE functions I implemented (sync or async, whatever); the other is that the program is running and blocking to receive events from the kernel module (which I don't think is the case, but I could be wrong).
Whatever it is, does it mean I can simply start a thread and do network stuff? I'm a bit lost on how that works. Thanks.

You don't need to do any hijacking. The FUSE kernel module registers as a filesystem provider (of type fuse). It then services read/write/open/etc. calls by dispatching them to the user-mode process. When that process returns, the kernel module picks up the return value and returns from the corresponding system call.
If you want the server (i.e. the user-mode process) to be asynchronous and multi-threaded, all you have to do is dispatch the operation (assuming it's a write - you can't parallelize input this way) to another thread in that process and return immediately to FUSE. That way, your user-mode process can, at its leisure, write out to the remote server.
You could similarly try to parallelize read, but the issue here is that you won't be able to return to FUSE (and thus release the reading process) until you have at least the beginning of the data read.
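For the write side, here is a minimal sketch of that deferral pattern, assuming the libfuse 2 high-level API. All names (myfs_write, network_worker, send_to_remote) are illustrative, the queue is a simple linked list, and the actual network transfer and the local write itself are left as placeholders:

    /* Sketch: defer the network copy of a write to a worker thread.
     * Assumes the libfuse 2 high-level API; names and the queue layout
     * are illustrative, and the real network I/O is a placeholder. */
    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    struct pending_write {
        char  *path;
        char  *data;
        size_t size;
        off_t  offset;
        struct pending_write *next;
    };

    static struct pending_write *queue_head;
    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;

    /* Worker thread: drains the queue and talks to the slow server. */
    static void *network_worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&queue_lock);
            while (!queue_head)
                pthread_cond_wait(&queue_cond, &queue_lock);
            struct pending_write *w = queue_head;
            queue_head = w->next;
            pthread_mutex_unlock(&queue_lock);

            /* send_to_remote(w) would do the actual network I/O here. */
            free(w->path);
            free(w->data);
            free(w);
        }
        return NULL;
    }

    /* FUSE write callback: do the local write (not shown), copy the
     * buffer, enqueue it for the worker, and return immediately. */
    static int myfs_write(const char *path, const char *buf, size_t size,
                          off_t offset, struct fuse_file_info *fi)
    {
        (void)fi;
        struct pending_write *w = malloc(sizeof(*w));
        w->path   = strdup(path);
        w->data   = malloc(size);
        memcpy(w->data, buf, size);
        w->size   = size;
        w->offset = offset;

        pthread_mutex_lock(&queue_lock);
        w->next    = queue_head;          /* LIFO for brevity; use a FIFO in practice */
        queue_head = w;
        pthread_cond_signal(&queue_cond);
        pthread_mutex_unlock(&queue_lock);

        return (int)size;                 /* report success to FUSE right away */
    }

    static struct fuse_operations myfs_ops = {
        .write = myfs_write,
        /* .getattr, .open, .read, ... omitted */
    };

    int main(int argc, char *argv[])
    {
        pthread_t tid;
        pthread_create(&tid, NULL, network_worker, NULL);
        return fuse_main(argc, argv, &myfs_ops, NULL);
    }

Note that returning size immediately means a later network failure can no longer be reported to the writing process, so the worker needs its own retry or durability policy.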

Related

How system calls are handled in Linux on an ARM machine

I have a question regarding system calls in Linux on an ARM processor.
On ARM, system calls are handled in SWI mode. My question is: do we perform all of the required work in SWI mode, or is only part of it done in SWI mode before we move to some process context? As I understand it, some system calls can take significant time, and performing that work in SWI mode is not a good idea.
Also, how do we return to the calling user process? I mean, in the case of a non-blocking system call, how do we notify the user that the required task has been completed by the system call?
I think you're missing two concepts.
CPU privilege modes and the use of swi are both implementation details of system calls
Non-blocking system calls don't work that way
Sure, under Linux, we use swi instructions and maintain privilege separation to implement system calls, but this doesn't reflect ARM systems in general. When you talk about Linux specifically, I think it makes more sense to refer to concepts like kernel vs user mode.
The Linux kernel has been preemptive for a long time now. If your system call is taking too long and exceeds the time quantum allocated to that process/thread, the scheduler will just kick in and do a context switch. Likewise, if your system call just consists of waiting for an event (like I/O), it'll just get switched out until there's data available.
Taking this into account you don't usually have to worry about whether your system call takes too long. However, if you're spending a significant amount of time in a system call that's doing something that isn't waiting for some event, chances are that you're doing something in the kernel that should be done in user mode.
When the function handling the system call returns a value, it usually goes back to some sort of glue logic which restores the user context and allows the original user mode program to keep running.
Non-blocking system calls are something almost completely different. The system call handling function usually will check if it can return data at that very instant without waiting. If it can, it'll return whatever is available. It can also tell the user "I don't have any data at the moment but check back later" or "that's all, there's no more data to be read". The idea is they return basically instantly and don't block.
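From user space, that "check back later" behaviour looks like this. A small, illustrative C example using a non-blocking read on an empty pipe (not part of the original answer, just a demonstration of the return codes involved):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        if (pipe(fds) == -1) { perror("pipe"); return 1; }

        /* Make the read end non-blocking. */
        fcntl(fds[0], F_SETFL, O_NONBLOCK);

        char buf[16];
        ssize_t n = read(fds[0], buf, sizeof(buf));   /* returns instantly */
        if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
            printf("no data at the moment, check back later\n");
        else if (n == 0)
            printf("that's all, no more data to be read\n");
        else
            printf("read %zd bytes\n", n);

        return 0;
    }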
Finally, on your last question, I suspect you're missing the point of a system call.
You should never have to know when a task is 'completed' by a system call. If a system call doesn't return an error, you, as the process, have to assume it succeeded. The rest is an implementation detail of the kernel. In the case of non-blocking system calls, they will tell you what to expect.
If you can provide an example for the last question, I may be able to explain in more detail.

Linux Kernel Procfs multiple read/writes

How does the Linux kernel handle multiple reads/writes to procfs? For instance, if two processes write to procfs at once, is one process queued (i.e. a kernel trap actually blocks one of the processes), or is there a kernel thread running for each core?
The concern is: if you have a buffer used within a function (static or in global scope), do you have to protect it, or will the code be run sequentially?
It depends on each and every procfs file implementation. No one can give you a definite answer, because each driver can implement its own procfs folder and files (you didn't specify any specific files; quick browsing in http://lxr.free-electrons.com/source/fs/proc/ shows that some files do use locks).
Either way, you can't use the global buffer unprotected, because a context switch can always occur; if not in the kernel, then it can catch your reader thread right after it finishes the read syscall and before it has started to process the read data.
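If you do keep a shared static buffer in your own procfs handler, the usual pattern is to serialize access to it yourself. A rough sketch, assuming a pre-5.6 kernel where proc_create() still takes a struct file_operations (newer kernels use struct proc_ops); all names here are illustrative:

    #include <linux/kernel.h>
    #include <linux/module.h>
    #include <linux/mutex.h>
    #include <linux/proc_fs.h>
    #include <linux/uaccess.h>

    static char shared_buf[128];
    static DEFINE_MUTEX(buf_lock);

    static ssize_t demo_write(struct file *file, const char __user *ubuf,
                              size_t count, loff_t *ppos)
    {
        size_t len = min_t(size_t, count, sizeof(shared_buf) - 1);

        mutex_lock(&buf_lock);              /* serialize concurrent writers */
        if (copy_from_user(shared_buf, ubuf, len)) {
            mutex_unlock(&buf_lock);
            return -EFAULT;
        }
        shared_buf[len] = '\0';
        mutex_unlock(&buf_lock);

        return len;
    }

    static const struct file_operations demo_fops = {
        .owner = THIS_MODULE,
        .write = demo_write,
    };

    static int __init demo_init(void)
    {
        proc_create("demo_entry", 0222, NULL, &demo_fops);
        return 0;
    }

    static void __exit demo_exit(void)
    {
        remove_proc_entry("demo_entry", NULL);
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");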

How do programs communicate with each other?

How do processes communicate with each other? Using everything I've learnt about programming so far, I'm unable to explain how sockets, file systems, and other things to do with sending messages between programs work.
By the way, I use a Linux-based OS, if you're going to add anything OS-specific. Thanks in advance; the question's been bugging me for ages. I'm also guessing the kernel has something to do with it.
In case of most IPC (InterProcess Communication) mechanisms, the general answer to your question is this: process A calls the kernel passing a pointer to a buffer with data to be transferred to process B, process B calls the kernel (or is already blocked on a call to the kernel) passing a pointer to a buffer to be filled with data from process A.
This general description is true for sockets, pipes, System V message queues, ordinary files etc. As you can see the cost of communication is high since it involves at least one context switch.
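As a concrete illustration of that description, here is a minimal pipe example in C: the writer hands the kernel its buffer via write(), and the reader blocks in the kernel until its own buffer is filled.

    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        if (pipe(fds) == -1) { perror("pipe"); return 1; }

        if (fork() == 0) {                    /* process B: reader */
            close(fds[1]);
            char buf[64];
            ssize_t n = read(fds[0], buf, sizeof(buf) - 1);   /* blocks in the kernel */
            buf[n > 0 ? n : 0] = '\0';
            printf("B received: %s\n", buf);
            return 0;
        }

        /* process A: writer */
        close(fds[0]);
        const char *msg = "hello from A";
        write(fds[1], msg, strlen(msg));      /* kernel copies A's buffer */
        close(fds[1]);
        wait(NULL);
        return 0;
    }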
Signals constitute an asynchronous IPC mechanism in which one process can send a simple notification to another process triggering a handler registered by the second process (alternatively doing nothing, stopping or killing that process if no handler is registered, depending on the signal).
For transferring large amounts of data one can use System V shared memory, in which case two processes can access the same portion of main memory. Note that even in this case one needs to employ a synchronization mechanism, like System V semaphores, which results in context switches as well.
This is why when processes need to communicate often, it is better to make them threads in a single process.
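To make the shared-memory route above concrete, here is a minimal sketch combining a System V shared memory segment with a System V semaphore for synchronization (error handling omitted for brevity):

    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>
    #include <sys/shm.h>
    #include <sys/wait.h>
    #include <unistd.h>

    union semun { int val; };

    int main(void)
    {
        int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
        int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
        union semun arg = { .val = 0 };
        semctl(semid, 0, SETVAL, arg);        /* start at 0: reader will wait */

        if (fork() == 0) {                    /* child: reader */
            struct sembuf down = { 0, -1, 0 };
            semop(semid, &down, 1);           /* block until the parent posts */
            char *mem = shmat(shmid, NULL, 0);
            printf("child read: %s\n", mem);
            shmdt(mem);
            return 0;
        }

        char *mem = shmat(shmid, NULL, 0);    /* parent: writer */
        strcpy(mem, "hello via shared memory");
        shmdt(mem);

        struct sembuf up = { 0, 1, 0 };
        semop(semid, &up, 1);                 /* let the child proceed */

        wait(NULL);
        shmctl(shmid, IPC_RMID, NULL);
        semctl(semid, 0, IPC_RMID);
        return 0;
    }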

do_mmap_pgoff for other processes

In a linux kernel syscall, I want to map a region of memory in a similar manner as calling mmap from user mode. If I wanted to map the region for the current process, I could simply use do_mmap_pgoff. Instead, however, I want to map the region in a different process while running in kernel mode. do_mmap_pgoff assumes/knows it is mapping for the current process and does not allow for anything else.
What I am planning on doing is replicating do_mmap_pgoff to take extra arguments specifying the task_struct and mm_struct of whatever process I want to map. However, this is very undesirable as I must manually traverse through many functions in the kernel source and essentially make duplicates of those functions so that they no longer assume they are doing work on behalf of current.
Is there a better way to map memory in a process other than current while operating in kernel mode?
It's no surprise that those functions in the kernel source assume that they change the mapping of the current process, and that this hasn't changed in the 20 years Linux has existed. There's a reason why processes don't change the memory mappings of other processes.
It's very "un-UNIXy".
If you elaborate on what you are trying to accomplish then perhaps people can suggest a more UNIX-y way for it.
Anyway, to focus on the question at hand, if you wouldn't like to perform hefty modifications to mm/* code, then I suggest you implement a workaround:
Find a context in which you can make your kernel code run in the context of the target process. For example, in a modular way - a /sys or /proc file. Or, in a non-modular way: modify a system call that is being called frequently, or another code path - for example the signal handling code.
Implement an "RPC", the source process can queue a request on the change of mapping in a Then, it can sleep until the target process enters that context and picks up on the request, waking up the source process when it is done modifying its own mapping. This is effectively an emulation of a "remote" call to do_mmap_pgoff(), and it can be implemented using mechanisms exposed in linux/wait.h.

issuing a disk read from bottom-half of device driver

In a Xen setup, I/O accesses from guest VMs go through a privileged domain called dom0, which is just a modified Linux kernel that makes calls to and from the Xen hypervisor. For block I/O, Xen uses a split driver model whose front-end is in the guest VM and whose back-end is in dom0. The back-end just creates a 'bio' structure and invokes submit_bio(), as in traditional Linux block driver code.
My goal here is to check whether there is any problem in the data written to disk (lost data, silently corrupted writes, misdirected writes, etc.). So I need to read back the data that was written to disk and compare it with a cached copy of the data (a common disk function called 'read after write'). My question is: is it not possible to invoke __bread() from my back-end driver level? The kernel crashes when __bread is invoked. Could anyone explain the reason for this? Also, if this isn't possible, what other ways are there to read a specific block of data from disk in the driver's bottom half?
Can I intercept and clone the bio structure of the writes, change the operation to read in my new bio, and invoke submit_bio() again? I did that, but the sector number in the bio structure returned by the completion callback of submit_bio() is some random value and not the one I sent.
Thanks.
If this were my task, I'd first try writing a new scheduling algorithm. Start by copying the cfq, deadline, noop, or as (anticipatory) scheduler code and work on it from there, so that it self-submits read commands after accepting write requests. noop would probably be the easiest one to modify to read immediately after write and propagate errors upwards, but I can't imagine the performance would be very good. If you use one of the other schedulers as a base, it would probably be much more difficult to signal an error immediately after the write -- perhaps a few seconds would elapse before reads were scheduled again -- so it would really only be useful as a diagnostic after the fact, and not something that could benefit applications directly.
