do_mmap_pgoff for other processes - linux

In a linux kernel syscall, I want to map a region of memory in a similar manner as calling mmap from user mode. If I wanted to map the region for the current process, I could simply use do_mmap_pgoff. Instead, however, I want to map the region in a different process while running in kernel mode. do_mmap_pgoff assumes/knows it is mapping for the current process and does not allow for anything else.
What I am planning on doing is replicating do_mmap_pgoff to take extra arguments specifying the task_struct and mm_struct of whatever process I want to map. However, this is very undesirable as I must manually traverse through many functions in the kernel source and essentially make duplicates of those functions so that they no longer assume they are doing work on behalf of current.
Is there a better way to map memory in a process other than current while operating in kernel mode?

It's no surprise that those functions in kernel source assume that they change the mapping of the current process, and that it hasn't changed in the 20 years Linux exists. There's a reason why processes don't change memory mappings of other processes.
It's very "un-UNIXy".
If you elaborate on what you are trying to accomplish then perhaps people can suggest a more UNIX-y way for it.
Anyway, to focus on the question at hand, if you wouldn't like to perform hefty modifications to mm/* code, then I suggest you implement a workaround:
Find a context in which you can make your kernel code run in the context of the target process. For example, in a modular way - a /sys or /proc file. Or, in a non-modular way: modify a system call that is being called frequently, or another code path - for example the signal handling code.
Implement an "RPC", the source process can queue a request on the change of mapping in a Then, it can sleep until the target process enters that context and picks up on the request, waking up the source process when it is done modifying its own mapping. This is effectively an emulation of a "remote" call to do_mmap_pgoff(), and it can be implemented using mechanisms exposed in linux/wait.h.

Related

Time waste of execv() and fork()

I am currently learning about fork() and execv() and I had a question regarding the efficiency of the combination.
I was shown the following standard code:
pid = fork();
if(pid < 0){
//handle fork error
}
else if (pid == 0){
execv("son_prog", argv_son);
//do father code
I know that fork() clones the entire process (copying the entire heap, etc) and that execv() replaces the current address space with that of the new program. With this in mind, doesn't it make it very inefficient to use this combination? We are copying the entire address space of a process and then immediately overwrite it.
So my question:
What is the advantage that is achieved by using this combo (instead of some other solution) that makes people still use this, even though we have waste?
What is the advantage that is achieved by using this combo (instead of some other solution) that makes people still use this even though we have waste?
You have to create a new process somehow. There are very few ways for a userspace program to accomplish that. POSIX used to have vfork() alognside fork(), and some systems may have their own mechanisms, such as Linux-specific clone(), but since 2008, POSIX specifies only fork() and the posix_spawn() family. The fork + exec route is more traditional, is well understood, and has few drawbacks (see below). The posix_spawn family is designed as a special purpose substitute for use in contexts that present difficulties for fork(); you can find details in the "Rationale" section of its specification.
This excerpt from the Linux man page for vfork() may be illuminating:
Under Linux, fork(2) is implemented using copy-on-write pages, so the only penalty incurred by fork(2) is the time and memory required to duplicate the parent’s page tables, and to create a unique task structure for the child. However, in the bad old days a fork(2) would require making a complete copy of the caller’s data space, often needlessly, since usually immediately afterwards an exec(3) is done. Thus, for greater efficiency, BSD introduced the vfork() system call, which did not fully copy the address space of the parent process, but borrowed the parent’s memory and thread of control until a call to execve(2) or an exit occurred. The parent process was suspended while the child was using its resources. The use of vfork() was tricky: for example, not modifying data in the parent process depended on knowing which variables are held in a register.
(Emphasis added)
Thus, your concern about waste is not well-founded for modern systems (not limited to Linux), but it was indeed an issue historically, and there were indeed mechanisms designed to avoid it. These days, most of those mechanisms are obsolete.
Another answer states:
However, in the bad old days a fork(2) would require making a complete copy of the caller’s data space, often needlessly, since usually immediately afterwards an exec(3) is done.
Obviously, one person's bad old days are a lot younger than others remember.
The original UNIX systems did not have the memory for running multiple processes and they did not have an MMU for keeping several processes in physical memory ready-to-run at the same logical address space: they swapped out processes to disk that it wasn't currently running.
The fork system call was almost entirely the same as swapping out the current process to disk, except for the return value and for not replacing the remaining in-memory copy by swapping in another process. Since you had to swap out the parent process anyway in order to run the child, fork+exec was not incurring any overhead.
It's true that there was a period of time when fork+exec was awkward: when there were MMUs that provided a mapping between logical and physical address space but page faults did not retain enough information that copy-on-write and a number of other virtual-memory/demand-paging schemes were feasible.
This situation was painful enough, not just for UNIX, that page fault handling of the hardware was adapted to become "replayable" pretty fast.
Not any longer. There's something called COW (Copy On Write), only when one of the two processes (Parent/Child) tries to write to a shared data, it is copied.
In the past:
The fork() system call copied the address space of the calling process (the parent) to create a new process (the child).
The copying of the parent's address space into the child was the most expensive part of the fork() operation.
Now:
A call to fork() is frequently followed almost immediately by a call to exec() in the child process, which replaces the child's memory with a new program. This is what the the shell typically does, for example. In this case, the time spent copying the parent's address space is largely wasted, because the child process will use very little of its memory before calling exec().
For this reason, later versions of Unix took advantage of virtual memory hardware to allow the parent and child to share the memory mapped into their respective address spaces until one of the processes actually modifies it. This technique is known as copy-on-write. To do this, on fork() the kernel would copy the address space mappings from the parent to the child instead of the contents of the mapped pages, and at the same time mark the now-shared pages read-only. When one of the two processes tries to write to one of these shared pages, the process takes a page fault. At this point, the Unix kernel realizes that the page was really a "virtual" or "copy-on-write" copy, and so it makes a new, private, writable copy of the page for the faulting process. In this way, the contents of individual pages aren't actually copied until they are actually written to. This optimization makes a fork() followed by an exec() in the child much cheaper: the child will probably only need to copy one page (the current page of its stack) before it calls exec().
It turns out all those COW page faults are not at all cheap when the process has a few gigabytes of writable RAM. They're all gonna fault once even if the child has long since called exec(). Because the child of fork() is no longer allowed to allocate memory even for the single threaded case (you can thank Apple for that one), arranging to call vfork()/exec() instead is hardly more difficult now.
The real advantage to the vfork()/exec() model is you can set the child up with an arbitrary current directory, arbitrary environment variables, and arbitrary fs handles (not just stdin/stdout/stderr), an arbitrary signal mask, and some arbitrary shared memory (using the shared memory syscalls) without having a twenty-argument CreateProcess() API that gets a few more arguments every few years.
It turned out the "oops I leaked handles being opened by another thread" gaffe from the early days of threading was fixable in userspace w/o process-wide locking thanks to /proc. The same would not be in the giant CreateProcess() model without a new OS version, and convincing everybody to call the new API.
So there you have it. An accident of design ended up far better than the directly designed solution.
It's not that expensive (relatively to spawning a process directly), especially with copy-on-write forks like you find in Linux , and it's kind of elegant for:
when you really just want to fork off a clone of the current process (I find this to be very useful for testing)
for when you need to do something just before loading the new executable
(redirect filedescriptors, play with signal masks/dispositions, uids, etc.)
POSIX now has posix_spawn that effectively allows you to combine fork/and-exec (possibly more efficiently than fork+exec; if it is more efficient, it'll usually be implemented through some cheaper but less robust fork (clone/vfork) followed by exec), but the way it achieves #2 is through a ton of relatively messy options, which can never be as complete and powerful and clean as just allowing you to run arbitrary code just before the new process image is loaded.
A process created by exec() et al, will inherit its file handles from the parent process (including stdin, stdout, stderr). If the parent changes these after calling fork() but before calling exec() then it can control the child's standard streams.

Process-specific data in kernel

Say I have some process calling file device operation like read. Before this read the process also called a syscall(defined by me), providing me with some information relevant to the read(and possibly other future reads done by this process). What is the best way of achieving this sort of information flow in the kernel? Is there any good way to store process-specific information other than making some pid-indexed list?
I'd like the syscall information stored in kernel to be inherited by children of that process too. Would it be possible to achieve that without (somehow) traversing the process child-parent tree(and that wouldn't give me the inheritance I want because after forking I don't want changes in parent to affect the child)?
Just like we have init_task variable which gives the starting address of the runqueue and which can be accessible anywhere in the user as well as kernel space, you can add a variable which will be set to the appropriate value by your system call and then accessed by your read(appropriate) methods.

How does the Linux kernel realize reentrancy?

All Unix kernels are reentrant: several processes may be executing in kernel
mode at the same time. How can I realize this effect in code? How should I handle the situation whereby many processes invoke system calls, pending in kernel mode?
[Edit - the term "reentrant" gets used in a couple of different senses. This answer uses the basic "multiple contexts can be executing the same code at the same time." This usually applies to a single routine, but can be extended to apply to a set of cooperating routines, generally routines which share data. An extreme case of this is when applied to a complete program - a web server, or an operating system. A web-server might be considered non-reentrant if it could only deal with one client at a time. (Ugh!) An operating system kernel might be called non-reentrant if only one process/thread/processor could be executing kernel code at a time.
Operating systems like that occurred during the transition to multi-processor systems. Many went through a slow transition from written-for-uniprocessors to one-single-lock-protects-everything (i.e. non-reentrant) through various stages of finer and finer grained locking. IIRC, linux finally got rid of the "big kernel lock" at approx. version 2.6.37 - but it was mostly gone long before that, just protecting remnants not yet converted to a multiprocessing implementation.
The rest of this answer is written in terms of individual routines, rather than complete programs.]
If you are in user space, you don't need to do anything. You call whatever system calls you want, and the right thing happens.
So I'm going to presume you are asking about code in the kernel.
Conceptually, it's fairly simple. It's also pretty much identical to what happens in a multi-threaded program in user space, when multiple threads call the same subroutine. (Let's assume it's a C program - other languages may have differently named mechanisms.)
When the system call implementation is using automatic (stack) variables, it has its own copy - no problem with re-entrancy. When it needs to use global data, it generally needs to use some kind of locking - the specific locking required depends on the specific data it's using, and what it's doing with that data.
This is all pretty generic, so perhaps an example might help.
Let's say the system call want to modify some attribute of a process. The process is represented by a struct task_struct which is a member of various linked lists. Those linked lists are protected by the tasklist_lock. Your system call gets the tasklist_lock, finds the right process, possibly gets a per-process lock controlling the field it cares about, modifies the field, and drops both locks.
One more detail, which is the case of processes executing different system calls, which don't share data with each other. With a reasonable implementation, there are no conflicts at all. One process can get itself into the kernel to handle its system call without affecting the other processes. I don't remember looking specifically at the linux implementation, but I imagine it's "reasonable". Something like a trap into an exception handler, which looks in a table to find the subroutine to handle the specific system call requested. The table is effectively const, so no locks required.

Network i/o in parallel for a FUSE file system

My motivation
I'd love to write a distributed file system using FUSE. I'm still designing the code before I jump in. It'll be possibly written in C or Go, the question is, how do I deal with network i/o in parallel?
My problem
More specifically, I want my file system to write locally, and have a thread do the network overhead asynchronously. It doesn't matter if it's slightly delayed in my case, I simply want to avoid slow writes to files because the code has to contact some slow server somewhere.
My understanding
There's two ideas conflicting in my head. One is that the FUSE kernel module uses the ABI of my program to hijack the process and call the specific FUSE function names I implemented (sync or async, w/e), the other is that.. the program is running, and blocking to receive events from the kernel module (which I don't think is the case, but I could be wrong).
Whatever it is, does it means I can simply start a thread and do network stuff? I'm a bit lost on how that works. Thanks.
You don't need to do any hijacking. The FUSE kernel module registers as a filesystem provider (of type fusefs). It then services read/write/open/etc calls, by dispatching them to the user-mode process. When that process returns, the kernel module gets the return value, and returns from the corresponding system call.
If you want to have the server (i.e. user mode process) by asynchronous and multi-threaded, all you have to do is dispatch the operation (assuming it's write - you can't parallelize input this way) to another thread in that process, and return immediately to FUSE. That way, your user mode process can, at its leisure, write out to the remote server.
You could similarly try to parallelize read, but the issue here is that you won't be able to return to FUSE (and thus release the reading process) until you have at least the beginning of the data read.

Faster forking of large processes on Linux?

What's the fastest, best way on modern Linux of achieving the same effect as a fork-execve combo from a large process ?
My problem is that the process forking is ~500MByte big, and a simple benchmarking test achieves only about 50 forks/s from the process (c.f ~1600 forks/s from a minimally sized process) which is too slow for the intended application.
Some googling turns up vfork as having being invented as the solution to this problem... but also warnings about not to use it. Modern Linux seems to have acquired related clone and posix_spawn calls; are these likely to help ? What's the modern replacement for vfork ?
I'm using 64bit Debian Lenny on an i7 (the project could move to Squeeze if posix_spawn would help).
On Linux, you can use posix_spawn(2) with the POSIX_SPAWN_USEVFORK flag to avoid the overhead of copying page tables when forking from a large process.
See Minimizing Memory Usage for Creating Application Subprocesses for a good summary of posix_spawn(2), its advantages and some examples.
To take advantage of vfork(2), make sure you #define _GNU_SOURCE before #include <spawn.h> and then simply posix_spawnattr_setflags(&attr, POSIX_SPAWN_USEVFORK)
I can confirm that this works on Debian Lenny, and provides a massive speed-up when forking from a large process.
benchmarking the various spawns over 1000 runs at 100M RSS
user system total real
fspawn (fork/exec): 0.100000 15.460000 40.570000 ( 41.366389)
pspawn (posix_spawn): 0.010000 0.010000 0.540000 ( 0.970577)
Outcome: I was going to go down the early-spawned helper subprocess route as suggested by other answers here, but then I came across this re using huge page support to improve fork performance.
Having tried it myself using libhugetlbfs to simply make all my app's mallocs allocate huge pages, I'm now getting around 2400 forks/s regardless of the process size (over the range I'm interested in anyway). Amazing.
Did you actually measure how much time forks take? Quoting the page you linked,
Linux never had this problem; because Linux used copy-on-write semantics internally, Linux only copies pages when they changed (actually, there are still some tables that have to be copied; in most circumstances their overhead is not significant)
So the number of forks doesn't really show how big the overhead will be. You should measure the time consumed by forks, and (which is a generic advice) consumed only by the forks you actually perform, not by benchmarking maximum performance.
But if you really figure out that forking a large process is a slow, you may spawn a small ancillary process, pipe master process to its input, and receive commands to exec from it. The small process will fork and exec these commands.
posix_spawn()
This function, as far as I understand, is implemented via fork/exec on desktop systems. However, in embedded systems (particularly, in those without MMU on board), processes are spawned via a syscall, interface to which is posix_spawn or a similar function. Quoting the informative section of POSIX standard describing posix_spawn:
Swapping is generally too slow for a realtime environment.
Dynamic address translation is not available everywhere that POSIX might be useful.
Processes are too useful to simply option out of POSIX whenever it must run without address translation or other MMU services.
Thus, POSIX needs process creation and file execution primitives that can be efficiently implemented without address translation or other MMU services.
I don't think that you will benefit from this function on desktop if your goal is to minimize time consumption.
If you know the number of subprocess ahead of time, it might be reasonable to pre-fork your application on startup then distribute the execv information via a pipe. Alternatively, if there is some sort of "lull" in your program it might be reasonable to fork ahead of time a subprocess or two for quick turnaround at a later time. Neither of these options would directly solve the problem but if either approach is suitable to your app, it might allow you to side-step the issue.
I've come across this blog post: http://blog.famzah.net/2009/11/20/a-much-faster-popen-and-system-implementation-for-linux/
pid = clone(fn, stack_aligned, CLONE_VM | SIGCHLD, arg);
Excerpt:
The system call clone() comes to the rescue. Using clone() we create a
child process which has the following features:
The child runs in the same memory space as the parent. This means that no memory structures are copied when the child process is
created. As a result of this, any change to any non-stack variable
made by the child is visible by the parent process. This is similar to
threads, and therefore completely different from fork(), and also very
dangerous – we don’t want the child to mess up the parent.
The child starts from an entry function which is being called right after the child was created. This is like threads, and unlike fork().
The child has a separate stack space which is similar to threads and fork(), but entirely different to vfork().
The most important: This thread-like child process can call exec().
In a nutshell, by calling clone in the following way, we create a
child process which is very similar to a thread but still can call
exec():
However I think it may still be subject to the setuid problem:
http://ewontfix.com/7/ "setuid and vfork"
Now we get to the worst of it. Threads and vfork allow you to get in a
situation where two processes are both sharing memory space and
running at the same time. Now, what happens if another thread in the
parent calls setuid (or any other privilege-affecting function)? You
end up with two processes with different privilege levels running in a
shared address space. And this is A Bad Thing.
Consider for example a multi-threaded server daemon, running initially
as root, that’s using posix_spawn, implemented naively with vfork, to
run an external command. It doesn’t care if this command runs as root
or with low privileges, since it’s a fixed command line with fixed
environment and can’t do anything harmful. (As a stupid example, let’s
say it’s running date as an external command because the programmer
couldn’t figure out how to use strftime.)
Since it doesn’t care, it calls setuid in another thread without any
synchronization against running the external program, with the intent
to drop down to a normal user and execute user-provided code (perhaps
a script or dlopen-obtained module) as that user. Unfortunately, it
just gave that user permission to mmap new code over top of the
running posix_spawn code, or to change the strings posix_spawn is
passing to exec in the child. Whoops.

Resources