Which resources are shared and which are created new when (1) a new process and (2) a new thread is created in Linux?
I searched for it, but nowhere is it mentioned which resources are created new and which are shared.
When you call fork() and create a child, all descriptors open in the parent before the call to fork() are shared between the parent and the child. For instance, suppose the parent has a listening socket, calls accept(), and then calls fork(). The connected socket is then shared between the parent and the child. Normally, the child then reads and writes the connected socket and the parent closes its copy of the connected socket.
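The accept-then-fork pattern can be sketched roughly as follows. To keep it self-contained, a socketpair() stands in for the listening socket and accept() (that substitution is my assumption, not part of the answer above): sv[1] plays the role of the connected socket and sv[0] plays the remote client. The helper name serve_in_child is hypothetical, and the request is assumed to fit in the child's 128-byte buffer.

```c
#include <ctype.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: the child services the "connected" socket it
 * inherited across fork() and replies with the request in upper case.
 * Returns the number of reply bytes the parent read, or -1 on error. */
ssize_t serve_in_child(const char *request, char *reply, size_t len)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: sv[1] was open before fork(), so it is usable here. */
        char buf[128];
        close(sv[0]);
        ssize_t n = read(sv[1], buf, sizeof buf);
        for (ssize_t i = 0; i < n; i++)
            buf[i] = (char)toupper((unsigned char)buf[i]);
        if (n > 0 && write(sv[1], buf, (size_t)n) != n)
            _exit(1);
        _exit(0);
    }

    /* Parent: close its copy of the connected socket, as described above.
     * It then acts as the client here, purely for demonstration. */
    close(sv[1]);
    ssize_t n = -1;
    if (write(sv[0], request, strlen(request) + 1) != -1)
        n = read(sv[0], reply, len);
    close(sv[0]);
    waitpid(pid, NULL, 0);
    return n;
}
```

In a real server the parent would go back to accept() after closing its copy, instead of talking to the child itself.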
In the traditional UNIX model, when a parent process needs something performed by another entity, it forks a child process and lets the child perform the processing. While this paradigm has served well for many years there are issues as well:
fork is expensive. Memory is copied from the parent to the child, all descriptors are duplicated in the child, and so on. Current implementations reduce the cost with copy-on-write, which avoids copying a page until the parent or the child writes to it.
While passing information from parent to child before the fork is easy, passing it back from the child to the parent takes more work: some form of IPC (Inter-Process Communication) is required.
So Linux introduced clone(). clone() allows the child process to share parts of its execution context with the calling process, such as the memory space, the table of file descriptors, and the table of signal handlers.
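As a minimal sketch of what sharing the memory space means, the glibc clone() wrapper can be called with CLONE_VM so that a write made by the child is visible to the caller. The helper name clone_sharing_memory, the 64 KiB stack size, and the use of malloc() for the child stack are my assumptions; the sketch also assumes a stack that grows downwards, as on x86 and ARM.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)   /* arbitrary size chosen for the sketch */

static volatile int shared_value = 0;

/* Runs in the child; with CLONE_VM it writes to the caller's memory. */
static int child_fn(void *arg)
{
    shared_value = *(int *)arg;
    return 0;                          /* becomes the child's exit status */
}

/* Hypothetical helper: returns what the parent sees in shared_value after
 * the child has run, or -1 on error. */
int clone_sharing_memory(int new_value)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    shared_value = 0;
    /* Pass the top of the stack, since it grows downwards. SIGCHLD is the
     * termination signal, so an ordinary waitpid() works. */
    pid_t pid = clone(child_fn, stack + CHILD_STACK_SIZE,
                      CLONE_VM | SIGCHLD, &new_value);
    if (pid == -1) {
        free(stack);
        return -1;
    }
    waitpid(pid, NULL, 0);
    free(stack);
    return shared_value;
}
```

Dropping CLONE_VM from the flags should make the function return 0 instead, because the child would then write to its own copy of the memory, as after fork().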
Then come threads. They are also known as lightweight processes. Thread creation can be 10-100 times faster than process creation. All threads within a process share the same global memory. This makes sharing information between threads easy, but along with this comes the requirement to synchronize access.
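A small sketch of both halves of that statement: several threads update one global counter, and a mutex synchronizes the access. The names (worker, run_counter_threads) and the thread and iteration counts are arbitrary choices for illustration.

```c
#include <pthread.h>

#define NUM_THREADS 4
#define INCREMENTS  100000

static long counter = 0;   /* global memory, visible to every thread */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&counter_lock);    /* synchronize access */
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Hypothetical helper: returns the final counter value, or -1 if not all
 * threads could be created. */
long run_counter_threads(void)
{
    pthread_t threads[NUM_THREADS];
    int created = 0;

    counter = 0;
    while (created < NUM_THREADS &&
           pthread_create(&threads[created], NULL, worker, NULL) == 0)
        created++;
    for (int i = 0; i < created; i++)
        pthread_join(threads[i], NULL);
    return created == NUM_THREADS ? counter : -1;
}
```

Without the lock the increments can interleave and the total is likely to come out lower than NUM_THREADS * INCREMENTS. Build with -pthread.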
To sum up, all threads share the following:
Process information
Most Data
Open files (eg descriptors)
Signal Handlers
Current working dir
user and group ids
But each thread has its own:
ThreadID
set of registers
stack for local variables and return addresses
errno
signal mask
priority
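The per-thread errno in the list above can be checked with a short sketch: a second thread compares the address of its own errno with the address of the main thread's errno. The struct and function names are made up for the example.

```c
#include <errno.h>
#include <pthread.h>

struct errno_probe {
    int *main_errno;   /* address of errno in the creating thread */
    int  differs;      /* set to 1 if the new thread has its own errno */
};

static void *probe_errno(void *arg)
{
    struct errno_probe *p = arg;
    /* errno is an lvalue local to the calling thread, so its address here
     * refers to this thread's own copy. */
    p->differs = (&errno != p->main_errno);
    return NULL;
}

/* Hypothetical helper: returns 1 if each thread has its own errno,
 * 0 if not, and -1 if the thread could not be created. */
int errno_is_per_thread(void)
{
    struct errno_probe p = { &errno, 0 };
    pthread_t t;

    if (pthread_create(&t, NULL, probe_errno, &p) != 0)
        return -1;
    pthread_join(t, NULL);
    return p.differs;
}
```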
I learned that in Linux, fork() is used to create a new process. It allocates new memory for the child process, then copies data from the parent process to the child process, such as file descriptors. Then exec() can be used to load the child's own data and overwrite the process space. But I wonder: is it necessary to copy the parent's data to the child process? How about loading the real data of the child process into its process space directly?
Read more about fork (system call), address space, virtual memory (the kernel is using the MMU), copy-on-write, processes ...
Read also Advanced Linux Programming. It has several chapters explaining these difficult concepts.
Parent and child processes have different address spaces, but after the fork the parent & child address spaces are nearly identical (thanks to virtual memory & copy-on-write techniques). The only difference is the result of the fork(2) syscall (which is [almost] the only way to create a process).
execve(2) entirely replaces the address space (and execution context) of its invoking process and is used to start a new executable program (often an ELF binary executable).
You don't need to copy data from parent to child process yourself. The kernel does that for you.
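A rough sketch of the usual fork-then-exec sequence follows. It uses execlp(), a library wrapper around execve(2) that searches PATH; the helper name run_program is hypothetical, and the exit code 127 for a failed exec is only the shell convention, picked here as an assumption.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs `name` (looked up in PATH) with no arguments
 * in a child process and returns its exit status, or -1 on error. */
int run_program(const char *name)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: replace this copy of the parent with the new program. */
        execlp(name, name, (char *)NULL);
        _exit(127);   /* only reached if the exec failed */
    }

    /* Parent: its own address space is untouched by the child's exec. */
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, run_program("true") should return 0 on a system where the true utility is in PATH.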
You may want to do some inter-process communication (IPC) between parent and child, usually through pipes (read pipe(7) & pipe(2) & poll(2)...), which have to be set up before the fork. You might want to use shared memory (avoid it if you are a newbie, since it is tricky to use correctly), but then you must take care of synchronization. See shm_overview(7) & sem_overview(7) for more.
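A minimal sketch of a pipe set up before the fork, carrying a string from the child back to the parent. The helper name read_from_child and the idea of sending the terminating NUL along with the text are illustrative choices.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: the child writes `msg` (including its NUL) into a
 * pipe created before fork(); the parent reads it into `buf`.
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_from_child(const char *msg, char *buf, size_t len)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);                       /* child only writes */
        ssize_t w = write(fds[1], msg, strlen(msg) + 1);
        _exit(w < 0);
    }

    close(fds[1]);                           /* parent only reads */
    ssize_t total = 0, n;
    while ((size_t)total < len &&
           (n = read(fds[0], buf + total, len - (size_t)total)) > 0)
        total += n;                          /* read until EOF or buffer full */
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return total;
}
```

Closing the unused end in each process matters: the parent only sees end-of-file once every copy of the write end has been closed.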
Use also strace(1) and study the source code of some free software shell (like sash or bash)
Here is my question:
If a process (the parent) creates a new process (the child) with fork(), which of these data structures are not shared between the parent and the child?
-process ID
-heap
-code
-stack
Relation for Process ID
Upon successful completion, fork() returns a value of 0 to the child
process and returns the process ID of the child process to the parent
process. Otherwise, a value of -1 is returned to the parent process, no
child process is created, and the global variable errno is set to
indicate the error.
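The quoted return values can be observed with a short sketch. The helper name check_fork_return and the exit code 42 are arbitrary: the child exits with 42 only when it saw 0, and the parent checks that the value it got back is a PID it can wait for.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fork() returned 0 in the child and the
 * child's PID in the parent, 0 if not, and -1 if fork() itself failed. */
int check_fork_return(void)
{
    pid_t ret = fork();
    if (ret == -1)
        return -1;        /* no child was created; errno says why */

    if (ret == 0)
        _exit(42);        /* child: fork() returned 0 here */

    /* Parent: ret is the child's process ID, so waitpid() accepts it. */
    int status;
    pid_t waited = waitpid(ret, &status, 0);
    return waited == ret && WIFEXITED(status) && WEXITSTATUS(status) == 42;
}
```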
Relation of heap or memory space
The child gets an exact copy of the parent's address space, which in many cases is likely to be laid out in the same format as the parent address space. I have to point out that each one will have its own virtual address space for its memory, such that each could have the same data at the same address, yet in different address spaces. Also, Linux uses copy-on-write when creating child processes. This means that the parent and child will share the same physical pages until one of them does a write, at which point the affected page is physically copied so that the writer gets its own private copy. This eliminates unneeded copies when exec'ing a new process. Since you're just going to overwrite the memory with a new executable, why bother copying it?
Relation for code
There is no object-oriented inheritance in C.
Fork'ing in C is basically the process being stopped while it is running, and an entire copy of it being made in (effectively) a different memory space, then both processes being told to continue. They will both continue from where the parent was paused. The only way you can tell which process you are in is to check the return value of the fork() call.
In such a situation the child doesn't really inherit everything from the parent process, it's more like it gets a complete copy of everything the parent had.
Stack
The child process gets a separate instance of each global variable declared in the parent process.
The point of separate processes is to separate memory. So you can't share variables between the parent and the child process once the fork has occurred.
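A sketch of that separation: the child changes a global variable and reports the new value through its exit status, while the parent's copy keeps the old value. The names and the values 1 and 99 are made up for the example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int global_value = 1;

/* Hypothetical helper: returns the parent's view of global_value after the
 * child has modified its own copy; *child_saw receives the child's value.
 * Returns -1 on error. */
int child_write_is_private(int *child_saw)
{
    global_value = 1;

    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        global_value = 99;        /* changes only the child's copy */
        _exit(global_value);      /* report it through the exit status */
    }

    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    *child_saw = WEXITSTATUS(status);
    return global_value;          /* the parent's copy is unchanged */
}
```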
How does a child process modify or read data in parent process after vfork()?
Are the variables declared in parent process directly accessible to the child?
I have a process which creates some data structures. I then need to fork a child process
which needs to read/write these data structures. The child will be an exec'ed process different from the parent.
One process cannot directly modify another's memory. What you would typically do is create a pipe or other mechanism that can cross process boundaries. The open descriptors will be inherited by the child process if you use fork(). It can then send messages to the parent instructing it to modify the data structures as required.
The form of the messages can be the difficult part of this design. You can:
Design a protocol that carries values and instructions on what to do with them.
Use an existing marshaling tool such as Google Protocol Buffers.
Use Remote Procedure Calls with one of the existing RPC mechanisms (e.g. SUN or ONC-RPC).
You can also use a manually set-up shared memory scheme that would allow both processes to access common memory. The parent process would allocate storage for its data structures in that shared memory. The child process would map that also into its space and access those structures. You would need to use some sort of sync mechanism depending on how you use the data.
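A minimal sketch of such a shared memory scheme, assuming the child is a plain fork() child and is not exec'ed: an anonymous MAP_SHARED mapping created before the fork is visible to both processes. For an exec'ed child a named object would be needed instead (shm_open(), for instance), which this sketch does not cover. The helper name is hypothetical, and waitpid() is used as the crudest possible sync mechanism.

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: the parent stores `start` in shared memory, the
 * child increments it, and the parent returns what it then reads
 * (or -1 on error). */
int increment_in_child(int start)
{
    int *shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    *shared = start;

    pid_t pid = fork();
    if (pid == -1) {
        munmap(shared, sizeof *shared);
        return -1;
    }
    if (pid == 0) {
        (*shared)++;              /* lands in memory the parent can see */
        _exit(0);
    }

    waitpid(pid, NULL, 0);        /* wait for the child before reading */
    int result = *shared;
    munmap(shared, sizeof *shared);
    return result;
}
```

If both processes touch the data concurrently, waiting for the child is not enough and something like a process-shared semaphore or mutex would be required.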
Just a quick question: if I clone a process, the PID of the cloned process is the same, yes? fork() creates a child process where the PID differs, but everything else is the same. vfork() creates a child process with the same PID. exec works to change a process currently in execution to something else.
Am I correct in all of these statements?
Not quite. If you clone a process via fork/exec, or vfork/exec, you will get a new process id. fork() will give you the new process with a new process id, and exec() replaces the program running in that process with a new one, while keeping the process id.
From here:
The vfork() function differs from
fork() only in that the child process
can share code and data with the
calling process (parent process). This
speeds cloning activity significantly
at a risk to the integrity of the
parent process if vfork() is misused.
Neither fork() nor vfork() keep the same PID although clone() can in one scenario (*a). They are all different ways to achieve roughly the same end, the creation of a distinct child.
clone() is like fork() but there are many things shared by the two processes and this is often used to enable threading.
vfork() is a variant of clone in which the parent is halted until the child process exits or executes another program. It's more efficient in those cases since it doesn't involve copying page tables and such. Basically, everything is shared between the two processes for as long as it takes the child to load another program.
Contrast that last option with the normal copy-on-write where memory itself is shared (until one of the processes writes to it) but the page tables that reference that memory are copied. In other words, vfork() is even more efficient than copy-on-write, at least for the fork-followed-by-immediate-exec use case.
But, in most cases, the child has a different process ID to the parent.
*a Things become tricky when you clone() with CLONE_THREAD. At that stage, the processes still have different identifiers but what constitutes the PID begins to blur. At the deepest level, the Linux scheduler doesn't care about processes, it schedules threads.
A thread has a thread ID (TID) and a thread group ID (TGID). The TGID is what you get from getpid().
When a thread is cloned without CLONE_THREAD, it's given a new TID and it also has its TGID set to that value (i.e., a brand new PID).
With CLONE_THREAD, it's given a new TID but the TGID (hence the reported process ID) remains the same as the parent so they really have the same PID. However, they can distinguish themselves by getting the TID from gettid().
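The TGID/TID distinction can be observed from user space with a small sketch. It assumes pthread_create() creates the thread with CLONE_THREAD (as described above) and calls the gettid system call through syscall(), since older glibc versions have no gettid() wrapper. The struct and function names are made up.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct ids {
    pid_t pid;   /* what getpid() reports: the TGID */
    pid_t tid;   /* what gettid reports: the per-thread ID */
};

static void *record_ids(void *arg)
{
    struct ids *out = arg;
    out->pid = getpid();
    out->tid = (pid_t)syscall(SYS_gettid);
    return NULL;
}

/* Hypothetical helper: fills in the IDs seen by the calling thread and by
 * a newly created thread. Returns 0 on success, -1 on error. */
int collect_ids(struct ids *caller, struct ids *created)
{
    pthread_t t;

    record_ids(caller);
    if (pthread_create(&t, NULL, record_ids, created) != 0)
        return -1;
    return pthread_join(t, NULL) == 0 ? 0 : -1;
}
```

After a successful call, caller->pid and created->pid should be equal while caller->tid and created->tid differ.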
There's quite a bit of trickery going on there with regard to parent process IDs and delivery of signals (both to the threads within a group and the SIGCHLD to the parent), all of which can be examined from the clone() man page.
It deserves some explanation. And it's simple as rain.
Consider this. A program has to do some things at the same time. Say, your program is printing "hello world!", each second, until somebody enters "hello, Mike", then, each second, it prints that string, waiting for John to change that in the future.
How do you write this the standard way? In your program, that basically prints "hello," you must create another branch that is waiting for user input.
You create two processes, one outputting those strings, and another one waiting for the user input. And the only way to create a new process in UNIX was calling the system call fork(), like this:
pid_t ret = fork();
if (ret > 0) { /* parent, continue waiting */ }
else if (ret == 0) { /* child */ }
else { perror("fork"); /* fork() failed, no child was created */ }
This scheme posed numerous problems. The user enters "Mike" but you have no simple way to pass that string to the parent process so that it'd be able to print that, because each process has its own view of memory that isn't shared with the child.
When processes are created by fork(), each one receives a copy of the memory as it existed at that moment. The pages are shared at first, and if one process later changes a page, the mapping that was identical for that page is changed for that process only (this is called the copy-on-write mechanism).
Other things you might want to share between the child and the parent are, for example, open file descriptors, shared memory descriptors, input/output state, etc. The child does inherit copies of the descriptors that were open at the time of fork(), but anything either process opens afterwards is not shared.
So the fork() call itself had to be extended to allow sharing memory, signal handlers, etc. But how? This was the idea behind clone(). That call takes flags indicating exactly what you want to share with the child: for example, the memory, the signal handlers, etc. If you call it with no sharing flags, it behaves like fork(), apart from the arguments the two calls take. And when POSIX pthreads are created, the flags are set by the pthreads library on your behalf, and the attributes you indicated in pthread_attr are applied by the library.
From the kernel's point of view, there's no difference between processes created this way, and no special semantics to differentiate the "processes". The kernel does not even know what a "thread" is; it creates a new process, but simply places it in the same thread group as the parent that called it, taking care of what that process may do. So you have different processes (that share the same PID) combined in a thread group, each assigned a different "TID" (the group leader's TID is equal to the PID).
Note that clone() does exactly that. You may pass it whatever you need (as a matter of fact, the old vfork() call will do as well). Are you going to share memory? Handlers? You may tune everything; just be sure you don't clash with the pthreads library, which is written around this very call.
An important thing: the kernel version is quite outrageous. It expects just 2 out of 4 parameters to be passed, the user stack and the options.
Since a PID is a unique identifier for a process, there's no way to have two distinct processes with the same PID.
Threads (which have the same visible 'pid') are implemented with the clone() call. When the flag CLONE_THREAD is supplied, the new process (a 'thread') shares the Thread Group Identifier (TGID) with its creator process. getpid() actually returns the TGID.
See the clone manpage for more details.
In summary, the real PID, as seen by the kernel, is always different. The visible PID is the same for threads.