What happens when a process is forked? - linux

I've read about fork and from what I understand, the process is cloned but which process? The script itself or the process that launched the script?
For example:
I'm running rTorrent on my machine and when a torrent completes, I have a script run against it. This script fetches data from the web so it takes a few seconds to complete. During this time, my rtorrent process is frozen. So I made the script fork using the following
my $pid = fork();
if ($pid == 0) { blah blah blah; exit 0; }
If I run this script from the CLI, it comes back to the shell within a second while it runs in the background, exactly as I intended. However, when I run it from rTorrent, it seems to be even slower than before. So what exactly was forked? Did the rtorrent process clone itself and my script ran in that, or did my script clone itself? I hope this makes sense.

The fork() function returns TWICE! Once in the parent process, and once in the child process. In general, both processes are IDENTICAL in every way, as if EACH one had just returned from fork(). The only difference is that in one, the return value from fork() is 0, and in the other it is non-zero (the PID of the child process).
So whatever process was running your Perl script (if it is an embedded Perl interpreter inside rTorrent then rTorrent would be the process) would be duplicated at exactly the point that the fork() happened.

I believe I found the problem by looking through rTorrent's source. For some processes, it will read all of the output sent to stdout before continuing. If this is happening to your process, rTorrent will block until you close the stdout process. Because you're forking, your child process shares the same stdout as the parent. Your parent process will exit, but the pipe remains open (because your child process is still running). If you did an strace of rTorrent, I'd bet that it'd be blocked on this read() call while executing your command.
Try closing/redirecting stdout in your perl script before the fork().

The entire process containing the interpreter forks. Fortunately memory is copy-on-write so it doesn't need to copy all the process memory in order to fork. However, things such as file descriptors remain open. This allows child processes to handle them, but may cause issues if they aren't closed appropriately. In general, fork() should not be used in an embedded interpreter except under extreme duress.

To answer the nominal question, since you commented that the accepted answer fails to do so, fork affects the process in which it is called. In your example of rTorrent spawning a Perl process which then calls fork, it is the Perl process which is duplicated, since it was the Perl process which called fork.
In the general case, there is no way for a process to fork any process other than itself. If it were possible to tell another arbitrary process to go fork itself, that would open up no end of security and performance issues.

My advice would be "don't do that".
If the Perl interpreter is embedded within the rtorrent process, you've almost certainly forked an entire rtorrent process, the effects of which are probably ill-defined at best. It's generally a bad idea to play with process-level stuff in an embedded interpreter regardless of language.
There's an excellent chance that some sort of lock is not being properly released, or that threads within the processes are proceeding in unintended and possibly competing ways.

When we create a process using fork the child process will have the copy of the address space.So the child also can use the address space.And it also can access the files which is opened by the parent.We can have the control over the child.To get the complete status of the child we can use wait.

Related

What happens to threads when exec() is called?

I am taking an OS class and trying to wrap my head around this question, any help would be appreciated:
Consider a multi-threaded process with 10 threads. Thread 3 invokes an execlp() system call. Describe what will happen to each of the threads.
My understanding of exec() is that is replaces the current process with a new one, and it's main difference from fork() is that fork() creates a clone and you end up with duplicates.
So if exec() replaces the current process, would it kill the threads of the old process and replace them with the new one? Any help will be appreciated.
exec()...replaces the current process with a new one.
Actually, it's still the same process after calling exec, and that's important because the parent process may still need to communicate with it, signal it, etc. What exec does is, it guts the process—wipes out all of it's virtual memory, resets all of its signal handlers, unlocks locks, closes some open files, etc. (See here for more)—and then it loads a new program into the existing process and starts executing it.
would it kill the threads of the old process...?
man 2 execve says, "All threads other than the calling thread are destroyed during
an execve(). Mutexes, condition variables, and other pthreads
objects are not preserved."

Linux fork, execve - no wait zombies

In Linux & C, will not waiting (waitpid) for a fork-execve launched process create zombies?
What is the correct way to launch a new program (many times) without waiting and without resource leaks?
It would also be launched from a 2nd worker thread.
Can the first program terminate first cleanly if launched programs have not completed?
Additional: In my case I have several threads that can fork-execve processes at ANY TIME and THE SAME TIME -
1) Some I need to wait for completion and want to report any errors codes with waitpid
2) Some I do not want to block the thread and but would like to report errors
3) Some I don't want to wait and don't care about the outcome and could run after the program terminates
For #2, should I have to create an additional thread to do waitpid ?
For #3, should I do a fork-fork-execve and would ending the 1st fork cause the 2nd process to get cleaned up (no zombie) separately via init ?
Additional: I've read briefly (not sure I understand all) about using nohup, double fork, setgpid(0,0), signal(SIGCHLD, SIG_IGN).
Doesn't global signal(SIGCHLD, SIG_IGN) have too many side effects like getting inherited (or maybe not) and preventing monitoring other processes you do want to wait for ?
Wouldn't relying on init to cleanup resources leak while the program continues to run (weeks in my case)?
In Linux & C, will not waiting (waitpid) for a fork-execve launched process create zombies?
Yes, they become zombies after death.
What is the correct way to launch a new program (many times) without waiting and without resource leaks? It would also be launched from a 2nd worker thread.
Set SIGCHLd to SIG_IGN.
Can the first program terminate first cleanly if launched programs have not completed?
Yes, orphaned processes will be adopted by init.
I ended up keeping an array of just the fork-exec'd pids I did not wait for (other fork-exec'd pids do get waited on) and periodically scanned the list using
waitpid( pids[xx], &status, WNOHANG ) != 0
which gives me a chance report outcome and avoid zombies.
I avoided using global things like signal handlers that might affect other code elsewhere.
It seemed a bit messy.
I suppose that fork-fork-exec would be an alternative to asynchronously monitor the other program's completion by the first fork, but then the first fork needs cleanup.
In Windows, you just keep a handle to the process open if you want to check status without worry of pid reuse, or close the handle if you don't care what the other process does.
(In Linux, there seems no way for multiple threads or processes to monitor the status of the same process safely, only the parent process-thread can, but not my issue here.)

Does the Linux scheduler prefer to run child process after fork()?

Does the Linux scheduler prefer to run the child process after fork() to the father process?
Usually, the forked process will execute exec of some kind so, it is better to let child process to run before father process(to prevent copy on write).
I assume that the child will execute exec as first operation after it will be created.
Is my assumption (that the scheduler will prefer child process) correct. If not, why? If yes, is there more reasons to run child first?
To quote The Linux Programming Interface (pg. 525) for a general answer:
After a fork(), it is indeterminate which process - the parent or the child - next has access to the CPU. (On a multiprocessor system, they may both simultaneously get access to a CPU.)
The book goes on about the differences in kernel versions and also mentions CFS / Linux 2.6.32:
[...] since Linux 2.6.32, it is once more the parent that is, by default, run first after a fork(). This default can be changed by assigning a nonzero value to the Linux-specific /proc/sys/kernel/sched_child_runs_first file.
This behaviour is still present with CFS although there are some concerns for the future of this feature. Looking at the CFS implementation, it seems to schedule the parent before the child.
The way to go for you would be to set /proc/sys/kernel/sched_child_runs_first to a non-zero value.
Edit: This answer analyzes the default behaviour and compares it to sched_child_runs_first.
For the case where the child calls exec at the first opportunity you can use vfork instead of fork. vfork suspends the parent until the child calls _exit or exec*. However, once it calls exec the child will be suspended if code has to be loaded from disk. In this case, the parent has a good chance to continue before the child does.

Terminate all child process in LInux

I am developing a sandbox on linux. And now i am confused terminating all process in the sandbox.
My sandbox works as follows:
At first only one process run in the sandbox.
Then it can create several child process.
And child process will create their subprocess also.
And parent process may exit at some time before its children exited.
At last sandbox will terminate all the process.
I used to do this by using killall or pkill -u with a unique user attached to the sandbox.But it seems doesn't work on the program which uses fork() fastly.
Then I search for the source code of pkill and realized that pkill is lose of atomicity.
So how could i achieve my goal ?
You could use process groups setpgid(2) and sessions setsid(2), but I don't qualify what you do as a sandbox (in particular because if one of the processes is setuid or change its process group or session itself, you'll lose it; read execve(2) carefully and several times!). Notice that kill(2) with a negative pid kills an entire process group.
Read a good book like Advanced Linux Programming. Consider also using chroot(2).
And explain what and why you really want to do. sandboxing is harder that what you think. See also capabilities(7), credentials(7) and SElinux.

How to use fork() in unix? Why not something of the form fork(pointerToFunctionToRun)?

I am having some trouble understanding how to use Unix's fork(). I am used to, when in need of parallelization, spawining threads in my application. It's always something of the form
CreateNewThread(MyFunctionToRun());
void myFunctionToRun() { ... }
Now, when learning about Unix's fork(), I was given examples of the form:
fork();
printf("%d\n", 123);
in which the code after the fork is "split up". I can't understand how fork() can be useful. Why doesn't fork() have a similar syntax to the above CreateNewThread(), where you pass it the address of a function you want to run?
To accomplish something similar to CreateNewThread(), I'd have to be creative and do something like
//pseudo code
id = fork();
if (id == 0) { //im the child
FunctionToRun();
} else { //im the parent
wait();
}
Maybe the problem is that I am so used to spawning threads the .NET way that I can't think clearly about this. What am I missing here? What are the advantages of fork() over CreateNewThread()?
PS: I know fork() will spawn a new process, while CreateNewThread() will spawn a new thread.
Thanks
fork() says "copy the current process state into a new process and start it running from right here." Because the code is then running in two processes, it in fact returns twice: once in the parent process (where it returns the child process's process identifier) and once in the child (where it returns zero).
There are a lot of restrictions on what it is safe to call in the child process after fork() (see below). The expectation is that the fork() call was part one of spawning a new process running a new executable with its own state. Part two of this process is a call to execve() or one of its variants, which specifies the path to an executable to be loaded into the currently running process, the arguments to be provided to that process, and the environment variables to surround that process. (There is nothing to stop you from re-executing the currently running executable and providing a flag that will make it pick up where the parent left off, if that's what you really want.)
The UNIX fork()-exec() dance is roughly the equivalent of the Windows CreateProcess(). A newer function is even more like it: posix_spawn().
As a practical example of using fork(), consider a shell, such as bash. fork() is used all the time by a command shell. When you tell the shell to run a program (such as echo "hello world"), it forks itself and then execs that program. A pipeline is a collection of forked processes with stdout and stdin rigged up appropriately by the parent in between fork() and exec().
If you want to create a new thread, you should use the Posix threads library. You create a new Posix thread (pthread) using pthread_create(). Your CreateNewThread() example would look like this:
#include <pthread.h>
/* Pthread functions are expected to accept and return void *. */
void *MyFunctionToRun(void *dummy __unused);
pthread_t thread;
int error = pthread_create(&thread,
NULL/*use default thread attributes*/,
MyFunctionToRun,
(void *)NULL/*argument*/);
Before threads were available, fork() was the closest thing UNIX provided to multithreading. Now that threads are available, usage of fork() is almost entirely limited to spawning a new process to execute a different executable.
below: The restrictions are because fork() predates multithreading, so only the thread that calls fork() continues to execute in the child process. Per POSIX:
A process shall be created with a single thread. If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called. [THR] [Option Start] Fork handlers may be established by means of the pthread_atfork() function in order to maintain application invariants across fork() calls. [Option End]
When the application calls fork() from a signal handler and any of the fork handlers registered by pthread_atfork() calls a function that is not asynch-signal-safe, the behavior is undefined.
Because any library function you call could have spawned a thread on your behalf, the paranoid assumption is that you are always limited to executing async-signal-safe operations in the child process between calling fork() and exec().
History aside, there are some fundamental differences with respect to ownership of resource and life time between processes and threads.
When you fork, the new process occupies a completely separate memory space. That's a very important distinction from creating a new thread. In multi-threaded applications you have to consider how you access and manipulate shared resources. Processed that have been forked have to explicitly share resources using inter-process means such as shared memory, pipes, remote procedure calls, semaphores, etc.
Another difference is that fork()'ed children can outlive their parent, where as all threads die when the process terminates.
In a client-server architecture where very, very long uptime is expected, using fork() rather than creating threads could be a valid strategy to combat memory leaks. Rather than worrying about cleaning up memory leaks in your threads, you just fork off a new child process to process each client request, then kill the child when it's done. The only source of memory leaks would then be the parent process that dispatches events.
An analogy: You can think of spawning threads as opening tabs inside a single browser window, while forking is like opening separate browser windows.
It would be more valid to ask why CreateNewThread doesn't just return a thread id like fork() does... after all fork() set a precedent. Your opinion's just coloured by you having seen one before the other. Take a step back and consider that fork() duplicates the process and continues execution... what better place than at the next instruction? Why complicate things by adding a function call into the bargain (and then one what only takes void*)?
Your comment to Mike says "I can't understand is in which contexts you'd want to use it.". Basically, you use it when you want to:
run another process using the exec family of functions
do some parallel processing independently (in terms of memory usage, signal handling, resources, security, robustness), for example:
each process may have intrusive limits of the number of file descriptors they can manage, or on a 32-bit system - the amount of memory: a second process can share the work while getting its own resources
web browsers tend to fork distinct processes because they can do some initialisation then call operating system functions to permanently reduce their privileges (e.g. change to a less-trusted user id, change the "root" directory under which they can access files, or make some memory pages read-only); most OSes don't allow the same extent of fine-grained permission-setting on a per-thread basis; another benefit is if a child process seg-faults or similar the parent process can handle that and continue, whereas similar faults in multi-threaded code raise questions about whether memory has been corrupted - or locks have been held - by the crashing thread such that remaining threads are compromised
BTW / using UNIX/Linux doesn't mean you have to give up threads for fork()ing processes... you can use pthread_create() and related functions if you're more comfortable with the threading paradigm.
Letting the difference between spawning a process and a thread set aside for a second: Basically, fork() is a more fundamental primitive. While SpawnNewThread has to do some background work to get the program counter in the right spot, fork does no such work, it just copies (or virtually copies) your program memory and continues the counter.
Fork has been with us for a very, very, long time. Fork was thought of before the idea of 'start a thread running a particular function' was a glimmer in anyone's eye.
People don't use fork because it's 'better,' we use it because it is the one and only unprivileged user-mode process creation function that works across all variations of Linux. If you want to create a process, you have to call fork. And, for some purposes, a process is what you need, not a thread.
You might consider researching the early papers on the subject.
It is worth noting that multi-processing not exactly the same as multi-threading. The new process created by fork share very little context with the old one, which is quite different from the case for threads.
So, lets look at the unixy thread system: pthread_create has semantics similar to CreateNewThread.
Or, to turn it around, lets look at the windows (or java or other system that makes its living with threads) way of spawning a process identical to the one you're currently running (which is what fork does on unix)...well, we could except that there isn't one: that just not part of the all-threads-all-the-time model. (Which is not a bad thing, mind you, just different).
You fork whenever you want to more than one thing at the same time. It’s called multitasking, and is really useful.
Here for example is a telnetish like program:
#!/usr/bin/perl
use strict;
use IO::Socket;
my ($host, $port, $kidpid, $handle, $line);
unless (#ARGV == 2) { die "usage: $0 host port" }
($host, $port) = #ARGV;
# create a tcp connection to the specified host and port
$handle = IO::Socket::INET->new(Proto => "tcp",
PeerAddr => $host,
PeerPort => $port)
or die "can't connect to port $port on $host: $!";
$handle->autoflush(1); # so output gets there right away
print STDERR "[Connected to $host:$port]\n";
# split the program into two processes, identical twins
die "can't fork: $!" unless defined($kidpid = fork());
if ($kidpid) {
# parent copies the socket to standard output
while (defined ($line = <$handle>)) {
print STDOUT $line;
}
kill("TERM" => $kidpid); # send SIGTERM to child
}
else {
# child copies standard input to the socket
while (defined ($line = <STDIN>)) {
print $handle $line;
}
}
exit;
See how easy that is?
Fork()'s most popular use is as a way to clone a server for each new client that connect()s (because the new process inherits all file descriptors in whatever state they exist).
But I've also used it to initiate a new (locally running) service on-demand from a client.
That scheme is best done with two calls to fork() - one stays in the parent session until the server is up and running and able to connect, the other (I fork it off from the child) becomes the server and departs the parent's session so it can no longer be reached by (say) SIGQUIT.

Resources