Does the clone() system call ultimately rely on fork functionality? - linux

For a class I'm taking I've been doing some work directly with the clone() system call in Linux. I got curious about how it actually worked and started doing some digging. What is confusing me is that it seems to rely on some of the same underpinnings as fork() functionality (they call the same do_fork() function, albeit with different arguments). On one hand, this makes sense to me, as a thread is really just a light-weight process, but I was always under the impression that there were some significant differences between the way a thread was created and the way a process was created. I did some digging into the implementation of do_fork() and subsequently copy_process() (which do_fork() calls) but I haven't been able to convince myself I understand what's going on.
So, to the gurus out there, am I missing something or is this actually how it works? Are there flags that basically tell the OS just how much to copy as well as what instruction to begin execution of the new task at (I'm thinking the answer has to be yes, but I'm just not sure how they translate)?
Below is the code I'm looking at, perhaps you could explain how the arguments that are passed in control whether a light-weight or heavy-weight process is created.
asmlinkage int sys_fork(struct pt_regs *regs){
#ifdef CONFIG_MMU
    return do_fork(SIGCHLD, regs->ARM_sp, regs, 0, NULL, NULL);
#else
    /* can not support in nommu mode */
    return(-EINVAL);
#endif
}
asmlinkage int sys_clone(unsigned long clone_flags, unsigned long newsp,
                         int __user *parent_tidptr, int tls_val,
                         int __user *child_tidptr, struct pt_regs *regs)
{
    if (!newsp)
        newsp = regs->ARM_sp;
    return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
}
Thanks!

Actually, at the conceptual level, the Linux kernel doesn't know anything about processes or threads, it only knows about "tasks".
A Linux task can be a process, a thread or something in between. (Incidentally, this means that the strange children that vfork() creates fit perfectly well into the Linux "task" paradigm).
Now, tasks can share several things, see all the CLONE_* flags in the manpage for clone(2). (Not all these flags can be described as sharing, some specify more complex behaviours).
Or new tasks can choose to have their own copies of the respective resources. And since 2.6.16, they can do so after having been started, see unshare(2).
For instance, the only difference between a vfork() and a fork() call, is that vfork() has CLONE_VM and CLONE_VFORK set. CLONE_VM makes it share its parent's memory (the same way threads share memory), while CLONE_VFORK makes the parent block until the child releases its memory mappings (by calling execve() or _exit()).
Note that Linux is not the only OS to generalize processes and threads in this manner. Plan 9 has rfork().

Nothing in the clone manpage suggests that it's "lightweight".
The critical difference is that fork creates a new address space, while clone optionally shares the address space between the parent and child, as well as file handles and so forth.
This shared address space enables lightweight IPC later on, but the process itself is not slimmer.

I understand that the difference between clone, fork and vfork is in the flags, because in the end all three call do_fork() in the kernel:
fork()-->C_lib-->sys_fork()-->do_fork()
vfork()-->C_lib-->sys_vfork()-->do_fork()
clone()-->C_lib-->sys_clone()-->do_fork()
The difference between fork and vfork is that vfork guarantees that the child will execute first and the parent will block until the child calls exit or exec. vfork passes the extra flags CLONE_VFORK and CLONE_VM; CLONE_VM asks the kernel not to duplicate the page table. The reason is simple: the child will either exit or exec. If the child exits, nothing needs to be done; if the child execs, the page table will be replaced anyway. I hope the fork and vfork flags are clear now at kernel level.
Now let's look at the clone flags.
The main use of clone is to implement threads, where the memory space is shared apart from the stack. Along with the same parameters as fork and vfork, clone also takes a function pointer as a parameter, which is called as soon as the child process is created.

Related

Time waste of execv() and fork()

I am currently learning about fork() and execv() and I had a question regarding the efficiency of the combination.
I was shown the following standard code:
pid = fork();
if (pid < 0) {
    // handle fork error
}
else if (pid == 0) {
    execv("son_prog", argv_son);
    _exit(127); // reached only if execv() fails
}
// do father code
I know that fork() clones the entire process (copying the entire heap, etc) and that execv() replaces the current address space with that of the new program. With this in mind, doesn't it make it very inefficient to use this combination? We are copying the entire address space of a process and then immediately overwrite it.
So my question:
What is the advantage that is achieved by using this combo (instead of some other solution) that makes people still use this, even though we have waste?
What is the advantage that is achieved by using this combo (instead of some other solution) that makes people still use this even though we have waste?
You have to create a new process somehow. There are very few ways for a userspace program to accomplish that. POSIX used to have vfork() alongside fork(), and some systems may have their own mechanisms, such as Linux-specific clone(), but since 2008, POSIX specifies only fork() and the posix_spawn() family. The fork + exec route is more traditional, is well understood, and has few drawbacks (see below). The posix_spawn family is designed as a special purpose substitute for use in contexts that present difficulties for fork(); you can find details in the "Rationale" section of its specification.
This excerpt from the Linux man page for vfork() may be illuminating:
Under Linux, fork(2) is implemented using copy-on-write pages, so the only penalty incurred by fork(2) is the time and memory required to duplicate the parent’s page tables, and to create a unique task structure for the child. However, in the bad old days a fork(2) would require making a complete copy of the caller’s data space, often needlessly, since usually immediately afterwards an exec(3) is done. Thus, for greater efficiency, BSD introduced the vfork() system call, which did not fully copy the address space of the parent process, but borrowed the parent’s memory and thread of control until a call to execve(2) or an exit occurred. The parent process was suspended while the child was using its resources. The use of vfork() was tricky: for example, not modifying data in the parent process depended on knowing which variables are held in a register.
(Emphasis added)
Thus, your concern about waste is not well-founded for modern systems (not limited to Linux), but it was indeed an issue historically, and there were indeed mechanisms designed to avoid it. These days, most of those mechanisms are obsolete.
Another answer states:
However, in the bad old days a fork(2) would require making a complete copy of the caller’s data space, often needlessly, since usually immediately afterwards an exec(3) is done.
Obviously, one person's bad old days are a lot younger than others remember.
The original UNIX systems did not have the memory for running multiple processes and they did not have an MMU for keeping several processes in physical memory ready-to-run at the same logical address space: they swapped out to disk the processes that weren't currently running.
The fork system call was almost entirely the same as swapping out the current process to disk, except for the return value and for not replacing the remaining in-memory copy by swapping in another process. Since you had to swap out the parent process anyway in order to run the child, fork+exec was not incurring any overhead.
It's true that there was a period of time when fork+exec was awkward: when there were MMUs that provided a mapping between logical and physical address space but page faults did not retain enough information that copy-on-write and a number of other virtual-memory/demand-paging schemes were feasible.
This situation was painful enough, not just for UNIX, that page fault handling of the hardware was adapted to become "replayable" pretty fast.
Not any longer. There's something called COW (Copy On Write): a page is copied only when one of the two processes (parent or child) tries to write to shared data.
In the past:
The fork() system call copied the address space of the calling process (the parent) to create a new process (the child).
The copying of the parent's address space into the child was the most expensive part of the fork() operation.
Now:
A call to fork() is frequently followed almost immediately by a call to exec() in the child process, which replaces the child's memory with a new program. This is what the shell typically does, for example. In this case, the time spent copying the parent's address space is largely wasted, because the child process will use very little of its memory before calling exec().
For this reason, later versions of Unix took advantage of virtual memory hardware to allow the parent and child to share the memory mapped into their respective address spaces until one of the processes actually modifies it. This technique is known as copy-on-write. To do this, on fork() the kernel would copy the address space mappings from the parent to the child instead of the contents of the mapped pages, and at the same time mark the now-shared pages read-only. When one of the two processes tries to write to one of these shared pages, the process takes a page fault. At this point, the Unix kernel realizes that the page was really a "virtual" or "copy-on-write" copy, and so it makes a new, private, writable copy of the page for the faulting process. In this way, the contents of individual pages aren't actually copied until they are actually written to. This optimization makes a fork() followed by an exec() in the child much cheaper: the child will probably only need to copy one page (the current page of its stack) before it calls exec().
It turns out all those COW page faults are not at all cheap when the process has a few gigabytes of writable RAM. They're all gonna fault once even if the child has long since called exec(). Because the child of fork() is no longer allowed to allocate memory even for the single threaded case (you can thank Apple for that one), arranging to call vfork()/exec() instead is hardly more difficult now.
The real advantage to the vfork()/exec() model is you can set the child up with an arbitrary current directory, arbitrary environment variables, and arbitrary fs handles (not just stdin/stdout/stderr), an arbitrary signal mask, and some arbitrary shared memory (using the shared memory syscalls) without having a twenty-argument CreateProcess() API that gets a few more arguments every few years.
It turned out the "oops I leaked handles being opened by another thread" gaffe from the early days of threading was fixable in userspace w/o process-wide locking thanks to /proc. The same would not be possible in the giant CreateProcess() model without a new OS version, and convincing everybody to call the new API.
So there you have it. An accident of design ended up far better than the directly designed solution.
It's not that expensive (relative to spawning a process directly), especially with copy-on-write forks like you find in Linux, and it's kind of elegant:
1. when you really just want to fork off a clone of the current process (I find this to be very useful for testing)
2. when you need to do something just before loading the new executable (redirect file descriptors, play with signal masks/dispositions, uids, etc.)
POSIX now has posix_spawn that effectively allows you to combine fork/and-exec (possibly more efficiently than fork+exec; if it is more efficient, it'll usually be implemented through some cheaper but less robust fork (clone/vfork) followed by exec), but the way it achieves #2 is through a ton of relatively messy options, which can never be as complete and powerful and clean as just allowing you to run arbitrary code just before the new process image is loaded.
A program started by exec() et al. inherits its file handles (including stdin, stdout, stderr) from the process that called exec(). If the code running in the child changes these after fork() but before calling exec(), it controls the new program's standard streams.

How does the kernel separate threads from processes

Suppose I have a browser process like Firefox, that has pid = 123. Firefox has 5 opened tabs each running in a separate thread, so in total it has 5 threads.
So I want to know in depth how the kernel separates the process into threads to execute, in struct task_struct or in thread_info.
struct task_struct is the task descriptor in the task list.
Where does struct task_struct contain a reference or a link to these five threads?
Does the struct thread_struct of a process like Firefox contain references to all 5 threads,
OR
is each thread treated like a process inside the Linux kernel?
Unlike Windows, Linux does not have an implementation of "threads" in the kernel. The kernel gives us what are sometimes called "lightweight processes", which are a generalization of the concepts of "processes" and "threads", and can be used to implement either.
It may be confusing when you read kernel code and see things like thread_struct on the one hand, and pid (process ID) on the other. In reality, both are one and the same. Don't be confused by the terminology.
Each lightweight process has a completely different thread_info and task_struct (with embedded thread_struct). You seem to think that the task_struct of one lightweight process should have pointers to the task_structs of other (userspace) "threads" in the same (userspace) "process". This is not the case. Inside the kernel, each "thread" is a separate process, and the scheduler deals with each one separately.
Linux has a system call called clone which is used to create new lightweight processes. When you call clone, you must provide various flags which indicate what will be shared between the new process and the existing process. They can share their address space, or they can each have a different address space. They can share their open files, or they can each have their own list of open files. They can share their signal handlers, or they can each have their own signal handlers. They can be in the same "thread group", or they can be in different thread groups. And so on...
Although "threads" and "processes" are the same thing in Linux, you can implement what we normally think of as "processes" by using clone to create processes which do not share their address space, open files, signal handlers, etc.
You can also implement what we normally think of as "threads" by using clone to create processes which DO share their address space, open files, signal handlers, etc.
If you look at the definition of task_struct, you will find that it has pointers to other structs such as mm_struct (address space), files_struct (open files), sighand_struct (signal handlers), and so on. When you clone a new "process", all of these structs will be copied. When you clone a new "thread", these structs will be shared between the new and old task_structs -- they will both point to the same mm_struct, the same files_struct, and so on. Either way, you are just providing different flags to clone to tell it what to copy, and what to share.
I just mentioned "thread groups" above, so you might wonder about that. In short, each "thread" in a "process" has its own PID, but they all share the same TGID (thread group ID). The TGIDs are all equal to the PID of the first program thread. Userspace "PIDs", like those shown in ps, or in /proc, are actually "TGIDs" in the kernel. Naturally, clone has a flag to determine whether a new lightweight process will have a new TGID (thus putting it in a new "thread group") or not.
UNIX processes also have "parents" and "children". There are pointers in a Linux task_struct which implement the parent-child relationships. And, as you might have guessed, clone has a flag to determine what the parent of a new lightweight process will be. It can either be the process which called clone, OR the parent of the process which called clone. Can you figure out which is used when creating a "process", and which is used when creating a "thread"?
Look at the manpage for clone; it will be very educational. Also try strace on a program which uses pthreads to see clone in use.
(A lot of this was written from memory; others should feel free to edit in corrections as necessary)

when is the system call set_tid_address used?

I have been trying to understand system calls, and want to understand how set_tid_address works. Basically, from what I have read, it returns the PID of the program or process being executed.
I have tested this with ls; however, with some commands like uptime, top, etc. I don't see set_tid_address being used. Why is that?
The clone() syscall can take a CLONE_CHILD_CLEARTID flag, which causes the value at child_tidptr (another clone() argument) to be cleared, and a wake-up to be signalled on the associated futex, when the child thread exits. This is used to implement pthread_join() (the parent thread waits on the futex).
set_tid_address() allows pthread_join() to work on the initial thread. More information in the following LKML threads:
[patch] threading fix, tid-2.5.47-A3
[patch] user-vm-unlock-2.5.31-A2
As to why some programs call set_tid_address() and others don't, the answer is easy. Programs linked (directly or indirectly) to libpthread call set_tid_address. ls is linked to librt, which is linked to libpthread, so it runs the initialization for NPTL.
According to the Linux Programmer's Manual, set_tid_address is used to:
set pointer to thread ID
When it is finished, it returns the caller's thread ID, which is the same as the PID in a single-threaded process. Unfortunately the manual is vague as to when you would actually want to use this system call.
In any case, why do you think that these commands are using set_tid_address?

If I have a process, and I clone it, is the PID the same?

Just a quick question, if I clone a process, the PID of the cloned process is the same, yes ? fork() creates a child process where the PID differs, but everything else is the same. Vfork() creates a child process with the same PID. Exec works to change a process currently in execution to something else.
Am I correct in all of these statements ?
Not quite. If you clone a process via fork/exec, or vfork/exec, you will get a new process ID. fork() gives you a new process with a new process ID, and exec() replaces the program running in that process with a new one while keeping the process ID.
From here:
The vfork() function differs from fork() only in that the child process can share code and data with the calling process (parent process). This speeds cloning activity significantly at a risk to the integrity of the parent process if vfork() is misused.
Neither fork() nor vfork() keep the same PID although clone() can in one scenario (*a). They are all different ways to achieve roughly the same end, the creation of a distinct child.
clone() is like fork() but there are many things shared by the two processes and this is often used to enable threading.
vfork() is a variant of clone in which the parent is halted until the child process exits or executes another program. It's more efficient in those cases since it doesn't involve copying page tables and such. Basically, everything is shared between the two processes for as long as it takes the child to load another program.
Contrast that last option with the normal copy-on-write where memory itself is shared (until one of the processes writes to it) but the page tables that reference that memory are copied. In other words, vfork() is even more efficient than copy-on-write, at least for the fork-followed-by-immediate-exec use case.
But, in most cases, the child has a different process ID to the parent.
*a Things become tricky when you clone() with CLONE_THREAD. At that stage, the processes still have different identifiers but what constitutes the PID begins to blur. At the deepest level, the Linux scheduler doesn't care about processes, it schedules threads.
A thread has a thread ID (TID) and a thread group ID (TGID). The TGID is what you get from getpid().
When a thread is cloned without CLONE_THREAD, it's given a new TID and it also has its TGID set to that value (i.e., a brand new PID).
With CLONE_THREAD, it's given a new TID but the TGID (hence the reported process ID) remains the same as the parent so they really have the same PID. However, they can distinguish themselves by getting the TID from gettid().
There's quite a bit of trickery going on there with regard to parent process IDs and delivery of signals (both to the threads within a group and the SIGCHLD to the parent), all which can be examined from the clone() man page.
It deserves some explanation. And it's as simple as rain.
Consider this. A program has to do some things at the same time. Say your program prints "hello world!" each second until somebody enters "hello, Mike"; then it prints that string each second, waiting for John to change it in the future.
How do you write this the standard way? In your program, which basically prints "hello," you must create another branch that waits for user input.
You create two processes, one outputting those strings and another one waiting for user input. And the only way to create a new process in UNIX was calling the system call fork(), like this:
ret = fork();
if (ret > 0) {
    /* parent, continue waiting */
} else if (ret == 0) {
    /* child */
}
This scheme posed numerous problems. The user enters "Mike", but you have no simple way to pass that string to the parent process so that it can print it, because each process has its own view of memory that isn't shared with the child.
When processes are created by fork(), each one receives a copy of the memory existing at that moment, and if that memory is later modified, the mapping that was identical for those memory segments changes at once (this is called the copy-on-write mechanism).
Other things you might want to share between the child and the parent, for example the table of open file descriptors, shared-memory descriptors, input/output state, etc., are likewise not shared after fork().
So the fork() call itself had to be extended to cover shared memory, signals, etc. But how? This was the idea behind clone(). That call takes flags indicating exactly what you want to share with the child: for example, the memory, the signal handlers, etc. And if you call it with flags=0, it behaves like fork(), apart from the arguments the two take. And when POSIX pthreads are created, the flags reflect the attributes you have indicated in pthread_attr.
From the kernel's point of view, there's no difference between the processes created this way, and no special semantics to differentiate the "processes". The kernel does not even know what a "thread" is; it creates a new process, but simply records it as belonging to the thread group of the parent that called clone(), taking care of what that process may do. So you have different processes (that share the same PID) combined in a thread group, each assigned a different "TID" (that starts from the PID of the parent).
It bears repeating that clone() does exactly that. You may pass it whatever you need (as a matter of fact, the old vfork() call will do). Are you going to share memory? Handlers? You may tune everything; just be sure you don't clash with the pthreads library, which is written around this very call.
One important thing: the kernel version is quite outrageous; it expects just 2 out of 4 parameters to be passed, the user stack and the options.
Since a PID is a unique identifier for a process, there's no way to have two distinct processes with the same PID.
Threads (which have the same visible 'pid') are implemented with the clone() call. When the flag CLONE_THREAD is supplied, the new process (a 'thread') shares the Thread Group Identifier (TGID) with its creator process. getpid() actually returns the TGID.
See the clone manpage for more details.
In summary the real PID, as seen by the kernel is always different. The visible PID is the same for threads.

The difference between fork(), vfork(), exec() and clone()

I was looking to find the difference between these four on Google and I expected there to be a huge amount of information on this, but there really wasn't any solid comparison between the four calls.
I set about trying to compile a kind of basic at-a-glance look at the differences between these system calls and here's what I got. Is all this information correct/am I missing anything important ?
Fork : The fork call basically makes a duplicate of the current process, identical in almost every way (not everything is copied over, for example, resource limits in some implementations but the idea is to create as close a copy as possible).
The new process (child) gets a different process ID (PID) and has the PID of the old process (parent) as its parent PID (PPID). Because the two processes are now running exactly the same code, they can tell which is which by the return code of fork - the child gets 0, the parent gets the PID of the child. This is all, of course, assuming the fork call works - if not, no child is created and the parent gets an error code.
Vfork: The basic difference between vfork() and fork() is that when a new process is created with vfork(), the parent process is temporarily suspended, and the child process might borrow the parent's address space. This strange state of affairs continues until the child process either exits, or calls execve(), at which point the parent process continues.
This means that the child process of a vfork() must be careful to avoid unexpectedly modifying variables of the parent process. In particular, the child process must not return from the function containing the vfork() call, and it must not call exit() (if it needs to exit, it should use _exit(); actually, this is also true for the child of a normal fork()).
Exec: The exec call is a way to basically replace the entire current process with a new program. It loads the program into the current process space and runs it from the entry point. exec() replaces the current process with the executable pointed to by the function. Control never returns to the original program unless there is an exec() error.
Clone: clone(), like fork(), creates a new process. Unlike fork(), these calls allow the child process to share parts of its execution context with the calling process, such as the memory space, the table of file descriptors, and the table of signal handlers.
When the child process is created with clone(), it executes the function application fn(arg) (This differs from fork(), where execution continues in the child from the point of the original fork() call.) The fn argument is a pointer to a function that is called by the child process at the beginning of its execution. The arg argument is passed to the fn function.
When the fn(arg) function application returns, the child process terminates. The integer returned by fn is the exit code for the child process. The child process may also terminate explicitly by calling exit(2) or after receiving a fatal signal.
Information gotten from:
Differences between fork and exec
http://www.allinterview.com/showanswers/59616.html
http://www.unixguide.net/unix/programming/1.1.2.shtml
http://linux.about.com/library/cmd/blcmdl2_clone.htm
Thanks for taking the time to read this ! :)
vfork() is an obsolete optimization. Before good memory management, fork() made a full copy of the parent's memory, so it was pretty expensive. Since in many cases a fork() was followed by exec(), which discards the current memory map and creates a new one, it was a needless expense. Nowadays, fork() doesn't copy the memory; it's simply set as "copy on write", so fork()+exec() is just as efficient as vfork()+exec().
clone() is the syscall used by fork(). With some parameters, it creates a new process; with others, it creates a thread. The difference between them is just which data structures (memory space, processor state, stack, PID, open files, etc.) are shared or not.
execve() replaces the current executable image with another one loaded from an executable file.
fork() creates a child process.
vfork() is a historical optimized version of fork(), meant to be used when execve() is called directly after fork(). It turned out to work well in non-MMU systems (where fork() cannot work in an efficient manner) and when fork()ing processes with a huge memory footprint to run some small program (think Java's Runtime.exec()). POSIX has standardized the posix_spawn() to replace these latter two more modern uses of vfork().
posix_spawn() does the equivalent of a fork()/execve(), and also allows some fd juggling in between. It's supposed to replace fork()/execve(), mainly for non-MMU platforms.
pthread_create() creates a new thread.
clone() is a Linux-specific call, which can be used to implement anything from fork() to pthread_create(). It gives a lot of control. Inspired by rfork().
rfork() is a Plan-9 specific call. It's supposed to be a generic call, allowing several degrees of sharing, between full processes and threads.
fork() - creates a new child process, which is a complete copy of the parent process. Child and parent processes use different virtual address spaces, which are initially populated by the same memory pages. Then, as both processes execute, the virtual address spaces begin to differ more and more, because the operating system lazily copies the memory pages that are written by either of the two processes and assigns independent copies of the modified pages to each process. This technique is called Copy-On-Write (COW).
vfork() - creates a new child process, which is a "quick" copy of the parent process. In contrast to the system call fork(), child and parent processes share the same virtual address space. NOTE! Using the same virtual address space, both the parent and child use the same stack, the stack pointer and the instruction pointer, as in the case of the classic fork()! To prevent unwanted interference between parent and child, which use the same stack, execution of the parent process is frozen until the child calls either exec() (creating a new virtual address space and switching to a different stack) or _exit() (terminating the process). vfork() is an optimization of fork() for the "fork-and-exec" model. It can be performed 4-5 times faster than fork(), because unlike fork() (even with COW kept in mind), the implementation of the vfork() system call does not include the creation of a new address space (the allocation and setting up of new page directories).
clone() - creates a new child process. Various parameters of this system call specify which parts of the parent process must be copied into the child process and which parts will be shared between them. As a result, this system call can be used to create all kinds of execution entities, from threads to completely independent processes. In fact, the clone() system call is the base used for the implementation of pthread_create() and the whole family of fork() system calls.
exec() - resets all the memory of the process, loads and parses the specified executable binary, sets up a new stack and passes control to the entry point of the loaded executable. On success this system call never returns control to the caller; it serves to load a new program into an already existing process. Together with the fork() system call it forms the classical UNIX process management model called "fork-and-exec".
The fork(),vfork() and clone() all call the do_fork() to do the real work, but with different parameters.
asmlinkage int sys_fork(struct pt_regs regs)
{
    return do_fork(SIGCHLD, regs.esp, &regs, 0);
}
asmlinkage int sys_clone(struct pt_regs regs)
{
    unsigned long clone_flags;
    unsigned long newsp;
    clone_flags = regs.ebx;
    newsp = regs.ecx;
    if (!newsp)
        newsp = regs.esp;
    return do_fork(clone_flags, newsp, &regs, 0);
}
asmlinkage int sys_vfork(struct pt_regs regs)
{
    return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0);
}
#define CLONE_VFORK 0x00004000 /* set if the parent wants the child to wake it up on mm_release */
#define CLONE_VM    0x00000100 /* set if VM shared between processes */
SIGCHLD means the child should send this signal to its parent when it exits.
For fork, the child and parent have independent VM page tables, but for efficiency fork does not really copy any pages; it just marks all the writable pages read-only. When a process then wants to write to such a page, a page fault happens and the kernel allocates a new page, copied from the old one, with write permission. That's called "copy on write".
For vfork, the virtual memory is shared exactly between child and parent. Because of that, parent and child can't be awake concurrently, since they would influence each other. So the parent sleeps at the end of do_fork() and wakes when the child calls exit() or execve(), since the child then owns a new page table. Here is the code (in do_fork()) where the parent sleeps.
if ((clone_flags & CLONE_VFORK) && (retval > 0))
    down(&sem);
return retval;
Here is the code (in mm_release(), called by exit() and execve()) which wakes the parent.
up(tsk->p_opptr->vfork_sem);
sys_clone() is more flexible, since you can pass any clone_flags to it. pthread_create() calls this system call with many clone_flags:
int clone_flags = (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGNAL | CLONE_SETTLS | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID | CLONE_SYSVSEM);
Summary: fork(), vfork() and clone() create child processes with different amounts of resource sharing with the parent process. We can also say that vfork() and clone() can create threads (actually they are processes, since they have an independent task_struct), because they share the VM page table with the parent process.
With fork(), either the child or the parent may run first, depending on how the scheduler picks.
But with vfork(), the child always runs first; the parent runs only after the child has terminated or called exec.

Resources