Time waste of execv() and fork() - linux

I am currently learning about fork() and execv() and I had a question regarding the efficiency of the combination.
I was shown the following standard code:
pid = fork();
if(pid < 0){
//handle fork error
}
else if (pid == 0){
execv("son_prog", argv_son);
//do father code
I know that fork() clones the entire process (copying the entire heap, etc) and that execv() replaces the current address space with that of the new program. With this in mind, doesn't it make it very inefficient to use this combination? We are copying the entire address space of a process and then immediately overwrite it.
So my question:
What is the advantage that is achieved by using this combo (instead of some other solution) that makes people still use this, even though we have waste?

What is the advantage that is achieved by using this combo (instead of some other solution) that makes people still use this even though we have waste?
You have to create a new process somehow. There are very few ways for a userspace program to accomplish that. POSIX used to have vfork() alognside fork(), and some systems may have their own mechanisms, such as Linux-specific clone(), but since 2008, POSIX specifies only fork() and the posix_spawn() family. The fork + exec route is more traditional, is well understood, and has few drawbacks (see below). The posix_spawn family is designed as a special purpose substitute for use in contexts that present difficulties for fork(); you can find details in the "Rationale" section of its specification.
This excerpt from the Linux man page for vfork() may be illuminating:
Under Linux, fork(2) is implemented using copy-on-write pages, so the only penalty incurred by fork(2) is the time and memory required to duplicate the parent’s page tables, and to create a unique task structure for the child. However, in the bad old days a fork(2) would require making a complete copy of the caller’s data space, often needlessly, since usually immediately afterwards an exec(3) is done. Thus, for greater efficiency, BSD introduced the vfork() system call, which did not fully copy the address space of the parent process, but borrowed the parent’s memory and thread of control until a call to execve(2) or an exit occurred. The parent process was suspended while the child was using its resources. The use of vfork() was tricky: for example, not modifying data in the parent process depended on knowing which variables are held in a register.
(Emphasis added)
Thus, your concern about waste is not well-founded for modern systems (not limited to Linux), but it was indeed an issue historically, and there were indeed mechanisms designed to avoid it. These days, most of those mechanisms are obsolete.

Another answer states:
However, in the bad old days a fork(2) would require making a complete copy of the caller’s data space, often needlessly, since usually immediately afterwards an exec(3) is done.
Obviously, one person's bad old days are a lot younger than others remember.
The original UNIX systems did not have the memory for running multiple processes and they did not have an MMU for keeping several processes in physical memory ready-to-run at the same logical address space: they swapped out processes to disk that it wasn't currently running.
The fork system call was almost entirely the same as swapping out the current process to disk, except for the return value and for not replacing the remaining in-memory copy by swapping in another process. Since you had to swap out the parent process anyway in order to run the child, fork+exec was not incurring any overhead.
It's true that there was a period of time when fork+exec was awkward: when there were MMUs that provided a mapping between logical and physical address space but page faults did not retain enough information that copy-on-write and a number of other virtual-memory/demand-paging schemes were feasible.
This situation was painful enough, not just for UNIX, that page fault handling of the hardware was adapted to become "replayable" pretty fast.

Not any longer. There's something called COW (Copy On Write), only when one of the two processes (Parent/Child) tries to write to a shared data, it is copied.
In the past:
The fork() system call copied the address space of the calling process (the parent) to create a new process (the child).
The copying of the parent's address space into the child was the most expensive part of the fork() operation.
Now:
A call to fork() is frequently followed almost immediately by a call to exec() in the child process, which replaces the child's memory with a new program. This is what the the shell typically does, for example. In this case, the time spent copying the parent's address space is largely wasted, because the child process will use very little of its memory before calling exec().
For this reason, later versions of Unix took advantage of virtual memory hardware to allow the parent and child to share the memory mapped into their respective address spaces until one of the processes actually modifies it. This technique is known as copy-on-write. To do this, on fork() the kernel would copy the address space mappings from the parent to the child instead of the contents of the mapped pages, and at the same time mark the now-shared pages read-only. When one of the two processes tries to write to one of these shared pages, the process takes a page fault. At this point, the Unix kernel realizes that the page was really a "virtual" or "copy-on-write" copy, and so it makes a new, private, writable copy of the page for the faulting process. In this way, the contents of individual pages aren't actually copied until they are actually written to. This optimization makes a fork() followed by an exec() in the child much cheaper: the child will probably only need to copy one page (the current page of its stack) before it calls exec().

It turns out all those COW page faults are not at all cheap when the process has a few gigabytes of writable RAM. They're all gonna fault once even if the child has long since called exec(). Because the child of fork() is no longer allowed to allocate memory even for the single threaded case (you can thank Apple for that one), arranging to call vfork()/exec() instead is hardly more difficult now.
The real advantage to the vfork()/exec() model is you can set the child up with an arbitrary current directory, arbitrary environment variables, and arbitrary fs handles (not just stdin/stdout/stderr), an arbitrary signal mask, and some arbitrary shared memory (using the shared memory syscalls) without having a twenty-argument CreateProcess() API that gets a few more arguments every few years.
It turned out the "oops I leaked handles being opened by another thread" gaffe from the early days of threading was fixable in userspace w/o process-wide locking thanks to /proc. The same would not be in the giant CreateProcess() model without a new OS version, and convincing everybody to call the new API.
So there you have it. An accident of design ended up far better than the directly designed solution.

It's not that expensive (relatively to spawning a process directly), especially with copy-on-write forks like you find in Linux , and it's kind of elegant for:
when you really just want to fork off a clone of the current process (I find this to be very useful for testing)
for when you need to do something just before loading the new executable
(redirect filedescriptors, play with signal masks/dispositions, uids, etc.)
POSIX now has posix_spawn that effectively allows you to combine fork/and-exec (possibly more efficiently than fork+exec; if it is more efficient, it'll usually be implemented through some cheaper but less robust fork (clone/vfork) followed by exec), but the way it achieves #2 is through a ton of relatively messy options, which can never be as complete and powerful and clean as just allowing you to run arbitrary code just before the new process image is loaded.

A process created by exec() et al, will inherit its file handles from the parent process (including stdin, stdout, stderr). If the parent changes these after calling fork() but before calling exec() then it can control the child's standard streams.

Related

Is linux fork insecure

I was reading this article
It says that the fork create a copy of itself and fork man also says so
. The entire virtual address space of the parent is replicated in the child
Does this mean child process can read all my process memory state ?
Can child process dump the entire parent memory state and it can be analysed to extract parent variable and its value. ?
But the article also says that two process cannot ready each other data.
So i am confused ?
Yes, the child process can read a pristine copy of all of the parent process state (but when writing, only its own address space is affected) just after a fork(2). However, most of the time, the child would eventually use execve(2) to start a new program, and that would "clear" and replace the copy of the original parent's address space (by a fresh address space). Notice that execve and mmap(2) (see also shared memory in shm_overview(7)...) are the common ways to change the address space in virtual memory of some process (and how the kernel handles page faults).
The kernel uses (and sets up the MMU for) lazy copy on write machinery to make the child's address space a copy of the parent's one, so fork is quite efficient in practice.
Read also proc(5), then type the follow commands:
cat /proc/self/maps
cat /proc/$$/maps
sudo cat /proc/1/maps
and understand what is happening
Read also the wikipage on fork, and the Advanced Linux Programming book.
There is no insecurity, because if the child is changing some data (e.g. a variable, a heap or stack location, ...) it does not affect the parent process.
If the program doing the fork is keeping some password in some virtual memory location, the child process would be able to read that location as long as it is executing the same program. Once the child did a successful execve (which is the common situation, and what any shell is doing) the previous address space is gone and replaced by a new one, described in the ELF executable of that exec-ed program.
There is no "lie" or "insecurity" in that Unix model. But contrarily to several other operating systems, Unix & POSIX have two separate system calls for creating a new process (fork) and executing a new program (execve). Other systems might have some single spawn operation mixing the two abilities. posix_spawn is often implemented by a mixture of fork & execve (and so are system(3) & popen(3), also using waitpid(2) & /bin/sh....).
The advantage of that Unix approach (having separated fork & execve) is that after the fork and before the execve in the child you can do a lot of useful things (e.g. closing useless file descriptors, ...). Operating Systems not separating the two features may need to have a quite complex spawning primitive.
There are rare occasions where a fork is not followed by some execve. Some MPI implementations might do that, and you might also do that. But then you know that you are able to read all the parent's address space thru your own copy - so what you felt was an insecurity is becoming a useful feature. In the old days you had the obsolete vfork which blocked the parents. There is not need to use it today; actually, fork is often implemented thru clone(2) which you should not use directly in practice (see futex(7)...) but only thru POSIX pthreads. But thinking of fork as a magical cloner of your process might help.
When coding (even in C) don't forget to test against failure of fork and of execve. See perror(3)
PS. the fork syscall is as difficult to understand as the multiverse idea. Both are "forking" the time!
When you call fork(), the new process will get access to the copy of the parent process memory (i.e. variables, file descriptors etc).
This is in contrast with threads, where all threads share the same memory space, i.e. variable modified in one thread will get a new value in all other threads.
So if, after forking, parent process modifies memory, the child process will not see that change - because the memory has been copied, the child process' memory would not get altered.

Fork creates a new process that is exactly the same as its parent

From the assumption made in the title of my question "Fork create a new process that is exactly the same as its parent". I am wondering how a fork is really made by the operating system.
Considering a heavy process (huge RAM footprint) that fork itself to accomplish a small task (list the files into a directory). From the assumption, I expect that the child process will be as big as the first one. However my common sense tells me that it cannot be the case.
How does it work in the real world?
As others mentioned in the comments, a technique called Copy-On-Write mitigates the heavy cost of copying the entire memory space of the parent process. Copy-on-write means that memory pages are shared read-only between parent and child until either of them decides to write -- at which point the page is copied and each process gets its own private copy. This technique easily prevents a huge amount of copying that in a lot of cases would be a waste of time because the child will exec() or do something simple and exit.
Here's what happens in detail:
When you call fork(2), the only immediate cost you incur is the cost of allocating a new unique process descriptor and the cost of copying the parent's page tables. In Linux, fork(2) is implemented by the clone(2) syscall, which is a more general syscall that allows the caller to control which parts of the new process are shared with the parent. When called from fork(2), a set of flags are passed to indicate that nothing is to be shared (you can choose to share memory, file descriptors, etc - this is how threads are implemented: by calling clone(2) with CLONE_VM, which means "share the memory space").
Under the hood, each process's memory page has a bit flag that is the copy-on-write flag that indicates whether that page should be copied before being written to. fork(2) marks every writeable page in a process with that bit. Each page also maintains a reference count.
So, when a process forks, the kernel sets the copy-on-write bit on every non-private, writeable page of that process and increments the reference count by one. The child process has pointers to these same pages.
Then, every page is marked read-only so that an attempt to write to the page generates a page fault - this is needed to wake up the kernel so that it has a chance of seeing what happened and what needs to be done.
When either of the processes writes to a page that is still being shared, and thus is marked read-only, the kernel wakes up and attempts to figure out why there is a page fault. Assuming the parent / child process is writing to a legit location, the kernel eventually sees that the page fault was generated because the page is marked copy-on-write and there is more than one reference to that page.
The kernel then allocates memory, copies the page into the new location, and the write can proceed.
What is different across forks
You said that fork(2) creates a new process that is exactly the same as its parent. This is not quite true. There are several differences between the parent and the child:
The process ID is different
The parent process ID is different
The child's resource usage (CPU time, etc) are set to 0
File locks owned by the parent are not inherited
The set of pending signals on the child is cleared
Pending alarms are cleared on the child
About vfork
The vfork(2) syscall is very similar to fork(), but it does absolutely no copying - it doesn't even copy the parent's page tables. With the introduction of copy-on-write, it's not as widely used anymore, but historically it was used by processes that would call exec() after forking.
Naturally, attempting to write to memory in the child process after a vfork() results in chaos.
When a process forked, the child process use the same page table as its parent util parent or child write to its space!So

When is clone() and fork better than pthreads?

I am beginner in this area.
I have studied fork(), vfork(), clone() and pthreads.
I have noticed that pthread_create() will create a thread, which is less overhead than creating a new process with fork(). Additionally the thread will share file descriptors, memory, etc with parent process.
But when is fork() and clone() better than pthreads? Can you please explain it to me by giving real world example?
Thanks in Advance.
clone(2) is a Linux specific syscall mostly used to implement threads (in particular, it is used for pthread_create). With various arguments, clone can also have a fork(2)-like behavior. Very few people directly use clone, using the pthread library is more portable. You probably need to directly call clone(2) syscall only if you are implementing your own thread library - a competitor to Posix-threads - and this is very tricky (in particular because locking may require using futex(2) syscall in machine-tuned assembly-coded routines, see futex(7)). You don't want to directly use clone or futex because the pthreads are much simpler to use.
(The other pthread functions require some book-keeping to be done internally in libpthread.so after a clone during a pthread_create)
As Jonathon answered, processes have their own address space and file descriptor set. And a process can execute a new executable program with the execve syscall which basically initialize the address space, the stack and registers for starting a new program (but the file descriptors may be kept, unless using close-on-exec flag, e.g. thru O_CLOEXEC for open).
On Unix-like systems, all processes (except the very first process, usuallyinit, of pid 1) are created by fork (or variants like vfork; you could, but don't want to, use clone in such way as it behaves like fork).
(technically, on Linux, there are some few weird exceptions which you can ignore, notably kernel processes or threads and some rare kernel-initiated starting of processes like /sbin/hotplug ....)
The fork and execve syscalls are central to Unix process creation (with waitpid and related syscalls).
A multi-threaded process has several threads (usually created by pthread_create) all sharing the same address space and file descriptors. You use threads when you want to work in parallel on the same data within the same address space, but then you should care about synchronization and locking. Read a pthread tutorial for more.
I suggest you to read a good Unix programming book like Advanced Unix Programming and/or the (freely available) Advanced Linux Programming
The strength and weakness of fork (and company) is that they create a new process that's a clone of the existing process.
This is a weakness because, as you pointed out, creating a new process has a fair amount of overhead. It also means communication between the processes has to be done via some "approved" channel (pipes, sockets, files, shared-memory region, etc.)
This is a strength because it provides (much) greater isolation between the parent and the child. If, for example, a child process crashes, you can kill it and start another fairly easily. By contrast, if a child thread dies, killing it is problematic at best -- it's impossible to be certain what resources that thread held exclusively, so you can't clean up after it. Likewise, since all the threads in a process share a common address space, one thread that ran into a problem could overwrite data being used by all the other threads, so just killing that one thread wouldn't necessarily be enough to clean up the mess.
In other words, using threads is a little bit of a gamble. As long as your code is all clean, you can gain some efficiency by using multiple threads in a single process. Using multiple processes adds a bit of overhead, but can make your code quite a bit more robust, because it limits the damage a single problem can cause, and makes it much easy to shut down and replace a process if it does run into a major problem.
As far as concrete examples go, Apache might be a pretty good one. It will use multiple threads per process, but to limit the damage in case of problems (among other things), it limits the number of threads per process, and can/will spawn several separate processes running concurrently as well. On a decent server you might have, for example, 8 processes with 8 threads each. The large number of threads helps it service a large number of clients in a mostly I/O bound task, and breaking it up into processes means if a problem does arise, it doesn't suddenly become completely un-responsive, and can shut down and restart a process without losing a lot.
These are totally different things. fork() creates a new process. pthread_create() creates a new thread, which runs under the context of the same process.
Thread share the same virtual address space, memory (for good or for bad), set of open file descriptors, among other things.
Processes are (essentially) totally separate from each other and cannot modify each other.
You should read this question:
What is the difference between a process and a thread?
As for an example, if I am your shell (eg. bash), when you enter a command like ls, I am going to fork() a new process, and then exec() the ls executable. (And then I wait() on the child process, but that's getting out of scope.) This happens in an entire different address space, and if ls blows up, I don't care, because I am still executing in my own process.
On the other hand, say I am a math program, and I have been asked to multiply two 100x100 matrices. We know that matrix multiplication is an Embarrassingly Parallel problem. So, I have the matrices in memory. I spawn of N threads, who each operate on the same source matrices, putting their results in the appropriate location in the result matrix. Remember, these operate in the context of the same process, so I need to make sure they are not stamping on each other's data. If N is 8 and I have an eight-core CPU, I can effectively calculate each part of the matrix simultaneously.
Process creation mechanism on unix using fork() (and family) is very efficient.
Morever , most unix system doesnot support kernel level threads i.e thread is not entity recognized by kernel. Hence thread on such system cannot get benefit of CPU scheduling at kernel level. pthread library does that which is not kerenl rather some process itself.
Also on such system pthreads are implemented using vfork() and as light weight process only.
So using threading has no point except portability on such system.
As per my understanding Sun-solaris and windows has kernel level thread and linux family doesn't support kernel threads.
with processes pipes and unix doamin sockets are very efficient IPC without synchronization issues.
I hope it clears why and when thread should be used practically.

Linux system call for creating process and thread

I read in a paper that the underlying system call to create processes and threads is actually the same, and thus the cost of creating processes over threads is not that great.
First, I wanna know what is the system call that creates
processes/threads (possibly a sample code or a link?)
Second, is
the author correct to assume that creating processes instead of
threads is inexpensive?
EDIT:
Quoting article:
Replacing pthreads with processes is surprisingly inexpensive,
especially on Linux where both pthreads and processes are invoked
using the same underlying system call.
Processes are usually created with fork, threads (lightweight processes) are usually created with clone nowadays. However, anecdotically, there exist 1:N thread models, too, which don't do either.
Both fork and clone map to the same kernel function do_fork internally. This function can create a lightweight process that shares the address space with the old one, or a separate process (and many other options), depending on what flags you feed to it. The clone syscall is more or less a direct forwarding of that kernel function (and used by the higher level threading libraries) whereas fork wraps do_fork into the functionality of the 50 year old traditional Unix function.
The important difference is that fork guarantees that a complete, separate copy of the address space is made. This, as Basil points out correctly, is done with copy-on-write nowadays and therefore is not nearly as expensive as one would think.
When you create a thread, it just reuses the original address space and the same memory.
However, one should not assume that creating processes is generally "lightweight" on unix-like systems because of copy-on-write. It is somewhat less heavy than for example under Windows, but it's nowhere near free.
One reason is that although the actual pages are not copied, the new process still needs a copy of the page table. This can be several kilobytes to megabytes of memory for processes that use larger amounts of memory.
Another reason is that although copy-on-write is invisible and a clever optimization, it is not free, and it cannot do magic. When data is modified by either process, which inevitably happens, the affected pages fault.
Redis is a good example where you can see that fork is everything but lightweight (it uses fork to do background saves).
The underlying system call to create threads is clone(2) (it is Linux specific). BTW, the list of Linux system calls is on syscalls(2), and you could use the strace(1) command to understand the syscalls done by some process or command. Processes are usually created with fork(2) (or vfork(2), which is not much useful these days). However, you could (and some C standard libraries might do that) create them with some particular form of clone. I guess that the kernel is sharing some code to implement clone, fork etc... (since some functionalities, e.g. management of the virtual address space, are common).
Indeed, process creation (and also thread creation) is generally quite fast on most Unix systems (because they use copy-on-write machinery for the virtual memory), typically a small fraction of a millisecond. But you could have pathological cases (e.g. thrashing) which makes that much longer.
Since most C standard library implementations are free software on Linux, you could study the source code of the one on your system (often GNU glibc, but sometimes musl-libc or something else).

Faster forking of large processes on Linux?

What's the fastest, best way on modern Linux of achieving the same effect as a fork-execve combo from a large process ?
My problem is that the process forking is ~500MByte big, and a simple benchmarking test achieves only about 50 forks/s from the process (c.f ~1600 forks/s from a minimally sized process) which is too slow for the intended application.
Some googling turns up vfork as having being invented as the solution to this problem... but also warnings about not to use it. Modern Linux seems to have acquired related clone and posix_spawn calls; are these likely to help ? What's the modern replacement for vfork ?
I'm using 64bit Debian Lenny on an i7 (the project could move to Squeeze if posix_spawn would help).
On Linux, you can use posix_spawn(2) with the POSIX_SPAWN_USEVFORK flag to avoid the overhead of copying page tables when forking from a large process.
See Minimizing Memory Usage for Creating Application Subprocesses for a good summary of posix_spawn(2), its advantages and some examples.
To take advantage of vfork(2), make sure you #define _GNU_SOURCE before #include <spawn.h> and then simply posix_spawnattr_setflags(&attr, POSIX_SPAWN_USEVFORK)
I can confirm that this works on Debian Lenny, and provides a massive speed-up when forking from a large process.
benchmarking the various spawns over 1000 runs at 100M RSS
user system total real
fspawn (fork/exec): 0.100000 15.460000 40.570000 ( 41.366389)
pspawn (posix_spawn): 0.010000 0.010000 0.540000 ( 0.970577)
Outcome: I was going to go down the early-spawned helper subprocess route as suggested by other answers here, but then I came across this re using huge page support to improve fork performance.
Having tried it myself using libhugetlbfs to simply make all my app's mallocs allocate huge pages, I'm now getting around 2400 forks/s regardless of the process size (over the range I'm interested in anyway). Amazing.
Did you actually measure how much time forks take? Quoting the page you linked,
Linux never had this problem; because Linux used copy-on-write semantics internally, Linux only copies pages when they changed (actually, there are still some tables that have to be copied; in most circumstances their overhead is not significant)
So the number of forks doesn't really show how big the overhead will be. You should measure the time consumed by forks, and (which is a generic advice) consumed only by the forks you actually perform, not by benchmarking maximum performance.
But if you really figure out that forking a large process is a slow, you may spawn a small ancillary process, pipe master process to its input, and receive commands to exec from it. The small process will fork and exec these commands.
posix_spawn()
This function, as far as I understand, is implemented via fork/exec on desktop systems. However, in embedded systems (particularly, in those without MMU on board), processes are spawned via a syscall, interface to which is posix_spawn or a similar function. Quoting the informative section of POSIX standard describing posix_spawn:
Swapping is generally too slow for a realtime environment.
Dynamic address translation is not available everywhere that POSIX might be useful.
Processes are too useful to simply option out of POSIX whenever it must run without address translation or other MMU services.
Thus, POSIX needs process creation and file execution primitives that can be efficiently implemented without address translation or other MMU services.
I don't think that you will benefit from this function on desktop if your goal is to minimize time consumption.
If you know the number of subprocess ahead of time, it might be reasonable to pre-fork your application on startup then distribute the execv information via a pipe. Alternatively, if there is some sort of "lull" in your program it might be reasonable to fork ahead of time a subprocess or two for quick turnaround at a later time. Neither of these options would directly solve the problem but if either approach is suitable to your app, it might allow you to side-step the issue.
I've come across this blog post: http://blog.famzah.net/2009/11/20/a-much-faster-popen-and-system-implementation-for-linux/
pid = clone(fn, stack_aligned, CLONE_VM | SIGCHLD, arg);
Excerpt:
The system call clone() comes to the rescue. Using clone() we create a
child process which has the following features:
The child runs in the same memory space as the parent. This means that no memory structures are copied when the child process is
created. As a result of this, any change to any non-stack variable
made by the child is visible by the parent process. This is similar to
threads, and therefore completely different from fork(), and also very
dangerous – we don’t want the child to mess up the parent.
The child starts from an entry function which is being called right after the child was created. This is like threads, and unlike fork().
The child has a separate stack space which is similar to threads and fork(), but entirely different to vfork().
The most important: This thread-like child process can call exec().
In a nutshell, by calling clone in the following way, we create a
child process which is very similar to a thread but still can call
exec():
However I think it may still be subject to the setuid problem:
http://ewontfix.com/7/ "setuid and vfork"
Now we get to the worst of it. Threads and vfork allow you to get in a
situation where two processes are both sharing memory space and
running at the same time. Now, what happens if another thread in the
parent calls setuid (or any other privilege-affecting function)? You
end up with two processes with different privilege levels running in a
shared address space. And this is A Bad Thing.
Consider for example a multi-threaded server daemon, running initially
as root, that’s using posix_spawn, implemented naively with vfork, to
run an external command. It doesn’t care if this command runs as root
or with low privileges, since it’s a fixed command line with fixed
environment and can’t do anything harmful. (As a stupid example, let’s
say it’s running date as an external command because the programmer
couldn’t figure out how to use strftime.)
Since it doesn’t care, it calls setuid in another thread without any
synchronization against running the external program, with the intent
to drop down to a normal user and execute user-provided code (perhaps
a script or dlopen-obtained module) as that user. Unfortunately, it
just gave that user permission to mmap new code over top of the
running posix_spawn code, or to change the strings posix_spawn is
passing to exec in the child. Whoops.

Resources