Not able to understand fork() description - linux

I was studying virtual memory management from Galvin, and I am unable to understand this statement:
In addition to separating logical memory from physical memory, virtual memory allows files and memory to be shared by two or more processes through page sharing. This leads to the following benefits:
Virtual memory can allow pages to be shared during process creation with the fork() system call, thus speeding up process creation.
How can pages be shared with fork()? Please clarify.

I believe the text is referring to the copy-on-write optimisation done for fork().
Basically a fork() clones a process, duplicating its entire memory.
This can take a long time, especially for processes which use a lot of memory. Moreover, it's very common for a fork() to be immediately followed by an exec(), rendering the previous copy pointless.
Instead of doing all of that work for every fork() modern Unixes create the new process, but don't copy all of the memory. They just point the virtual memory pages for both the original process and the new one to the same physical pages.
This greatly reduces the cost of a fork(), in terms of reduced copies and reduced memory usage.
The downside is that whenever either the fork()ed process or the original process writes to a page, the write raises an exception (because the physical pages are marked read-only) and the page is copied after all.
Fortunately it turns out this doesn't happen all that often.
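A minimal sketch (my own illustration, not from the answer above) of the behaviour being described: after fork() the child's write to a single byte faults and copies only that one page, while the parent keeps its original contents.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    size_t size = 64 * 1024 * 1024;            /* 64 MiB buffer */
    char *buf = malloc(size);
    memset(buf, 'A', size);                    /* touch every page in the parent */

    pid_t pid = fork();                        /* cheap: only page tables are copied */
    if (pid == 0) {
        buf[0] = 'B';                          /* copy-on-write fault copies just this page */
        printf("child sees:  %c\n", buf[0]);   /* B */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent sees: %c\n", buf[0]);       /* still A: the child wrote to its own copy */
    free(buf);
    return 0;
}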

Pages may be shared after fork() if the two processes, whether at the same or at different virtual page addresses, map to the same physical memory frame; that is, they have the same frame number in their page table entries.

Related

What is the difference between calling mmap() on a disk file before or after a fork?

I've been working on understanding how mmap() works with disk-backed files, and am mostly getting it, but I still have this question.
In a situation with a master process that forks a bunch of worker child processes, and a file-backed read-only mmapped db, does it matter if the mmaps happen in the master process before the forks, or in the child processes?
My understanding is that if the mmap happens in the master process before the fork, then all of the mapped pages are marked in the page table so that they fault when first read, triggering the kernel to load the page from disk (or from the page cache); after the fork, one child's reading of a page means the page is resident in the mapping, ready for the other children to read without causing a major page fault.
But if the mmap happens in the child processes after the fork, do the other worker children get the benefit of sharing those loaded pages--are they all in effect using the same underlying mmap? Or does each worker child have to trigger a page fault and load each page themselves?
It makes no difference to page fault activity. The page map of a file is global to the OS: it records whether a particular page is in RAM or not. The PTE of every process that has the file mapped points to this common data structure. There will only be a page fault for the first process that tries to access a page that isn't in RAM. That fault triggers the page to be read in, and other processes that try to access the same page will be able to use that RAM.
One difference between the two scenarios is whether the virtual addresses assigned to the mapped block are the same. If you call mmap before forking, the address will be copied in all the children. If you call mmap after forking, they could get different addresses. Using the same address in all processes allows you to pass pointers into the mapped block between the processes if you want. You can have pointers between objects within the block. If they're not all at the same address, you need to use offsets, and the processes will all have to add the offsets to the base address.
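To illustrate the first scenario, here is a sketch (mine, with a made-up file name db.bin): the master maps a read-only file once and then forks workers, so every child sees the mapping at the same address and shares the page-cache pages.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fd = open("db.bin", O_RDONLY);         /* hypothetical read-only db file */
    struct stat st;
    fstat(fd, &st);

    /* PROT_READ + MAP_SHARED: pages come straight from the page cache */
    void *db = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (db == MAP_FAILED)
        return 1;

    for (int i = 0; i < 4; i++) {
        if (fork() == 0) {
            /* every worker sees the mapping at the same address db */
            printf("worker %d: first byte = %d\n", i, ((unsigned char *)db)[0]);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;                                       /* reap all workers */
    munmap(db, st.st_size);
    close(fd);
    return 0;
}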

Time waste of execv() and fork()

I am currently learning about fork() and execv() and I had a question regarding the efficiency of the combination.
I was shown the following standard code:
pid = fork();
if (pid < 0) {
    // handle fork error
}
else if (pid == 0) {
    execv("son_prog", argv_son);
}
else {
    // do father code
}
I know that fork() clones the entire process (copying the entire heap, etc.) and that execv() replaces the current address space with that of the new program. With this in mind, doesn't it make it very inefficient to use this combination? We are copying the entire address space of a process and then immediately overwriting it.
So my question:
What is the advantage that is achieved by using this combo (instead of some other solution) that makes people still use this, even though we have waste?
What is the advantage that is achieved by using this combo (instead of some other solution) that makes people still use this even though we have waste?
You have to create a new process somehow. There are very few ways for a userspace program to accomplish that. POSIX used to have vfork() alongside fork(), and some systems may have their own mechanisms, such as Linux-specific clone(), but since 2008, POSIX specifies only fork() and the posix_spawn() family. The fork + exec route is more traditional, is well understood, and has few drawbacks (see below). The posix_spawn family is designed as a special-purpose substitute for use in contexts that present difficulties for fork(); you can find details in the "Rationale" section of its specification.
This excerpt from the Linux man page for vfork() may be illuminating:
Under Linux, fork(2) is implemented using copy-on-write pages, so the only penalty incurred by fork(2) is the time and memory required to duplicate the parent’s page tables, and to create a unique task structure for the child. However, in the bad old days a fork(2) would require making a complete copy of the caller’s data space, often needlessly, since usually immediately afterwards an exec(3) is done. Thus, for greater efficiency, BSD introduced the vfork() system call, which did not fully copy the address space of the parent process, but borrowed the parent’s memory and thread of control until a call to execve(2) or an exit occurred. The parent process was suspended while the child was using its resources. The use of vfork() was tricky: for example, not modifying data in the parent process depended on knowing which variables are held in a register.
(Emphasis added)
Thus, your concern about waste is not well-founded for modern systems (not limited to Linux), but it was indeed an issue historically, and there were indeed mechanisms designed to avoid it. These days, most of those mechanisms are obsolete.
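For completeness, a sketch of the posix_spawn() route mentioned above; the command /bin/ls and its arguments are just placeholders.

#include <spawn.h>
#include <stdio.h>
#include <sys/wait.h>

extern char **environ;

int main(void) {
    pid_t pid;
    char *argv[] = { "ls", "-l", NULL };

    /* no file actions or spawn attributes: NULL, NULL */
    int err = posix_spawn(&pid, "/bin/ls", NULL, NULL, argv, environ);
    if (err != 0) {
        fprintf(stderr, "posix_spawn failed: %d\n", err);
        return 1;
    }
    waitpid(pid, NULL, 0);                     /* wait for it like any forked child */
    return 0;
}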
Another answer states:
However, in the bad old days a fork(2) would require making a complete copy of the caller’s data space, often needlessly, since usually immediately afterwards an exec(3) is done.
Obviously, one person's bad old days are a lot younger than others remember.
The original UNIX systems did not have the memory for running multiple processes, and they did not have an MMU for keeping several processes in physical memory, ready to run, at the same logical address space: they swapped out to disk the processes that were not currently running.
The fork system call was almost entirely the same as swapping out the current process to disk, except for the return value and for not replacing the remaining in-memory copy by swapping in another process. Since you had to swap out the parent process anyway in order to run the child, fork+exec was not incurring any overhead.
It's true that there was a period of time when fork+exec was awkward: when there were MMUs that provided a mapping between logical and physical address space, but page faults did not retain enough information for copy-on-write and a number of other virtual-memory/demand-paging schemes to be feasible.
This situation was painful enough, not just for UNIX, that page fault handling of the hardware was adapted to become "replayable" pretty fast.
Not any longer. There's something called COW (copy-on-write): a page is copied only when one of the two processes (parent or child) tries to write to shared data.
In the past:
The fork() system call copied the address space of the calling process (the parent) to create a new process (the child).
The copying of the parent's address space into the child was the most expensive part of the fork() operation.
Now:
A call to fork() is frequently followed almost immediately by a call to exec() in the child process, which replaces the child's memory with a new program. This is what the shell typically does, for example. In this case, the time spent copying the parent's address space is largely wasted, because the child process will use very little of its memory before calling exec().
For this reason, later versions of Unix took advantage of virtual memory hardware to allow the parent and child to share the memory mapped into their respective address spaces until one of the processes actually modifies it. This technique is known as copy-on-write. To do this, on fork() the kernel would copy the address space mappings from the parent to the child instead of the contents of the mapped pages, and at the same time mark the now-shared pages read-only. When one of the two processes tries to write to one of these shared pages, the process takes a page fault. At this point, the Unix kernel realizes that the page was really a "virtual" or "copy-on-write" copy, and so it makes a new, private, writable copy of the page for the faulting process. In this way, the contents of individual pages aren't actually copied until they are actually written to. This optimization makes a fork() followed by an exec() in the child much cheaper: the child will probably only need to copy one page (the current page of its stack) before it calls exec().
It turns out all those COW page faults are not at all cheap when the process has a few gigabytes of writable RAM. They're all gonna fault once even if the child has long since called exec(). Because the child of fork() is no longer allowed to allocate memory even for the single threaded case (you can thank Apple for that one), arranging to call vfork()/exec() instead is hardly more difficult now.
The real advantage to the vfork()/exec() model is you can set the child up with an arbitrary current directory, arbitrary environment variables, and arbitrary fs handles (not just stdin/stdout/stderr), an arbitrary signal mask, and some arbitrary shared memory (using the shared memory syscalls) without having a twenty-argument CreateProcess() API that gets a few more arguments every few years.
It turned out the "oops I leaked handles being opened by another thread" gaffe from the early days of threading was fixable in userspace w/o process-wide locking thanks to /proc. The same would not be possible in the giant CreateProcess() model without a new OS version and convincing everybody to call the new API.
So there you have it. An accident of design ended up far better than the directly designed solution.
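A sketch of the vfork()/exec() pattern referred to above (the program being run, /bin/true, is arbitrary); the child borrows the parent's address space, so it must do nothing but exec or _exit.

#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = vfork();                       /* parent is suspended until the child execs or exits */
    if (pid == 0) {
        execl("/bin/true", "true", (char *)NULL);
        _exit(127);                            /* reached only if exec fails; never plain return here */
    }
    waitpid(pid, NULL, 0);
    return 0;
}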
It's not that expensive (relative to spawning a process directly), especially with copy-on-write forks like you find in Linux, and it's kind of elegant for:
when you really just want to fork off a clone of the current process (I find this to be very useful for testing)
for when you need to do something just before loading the new executable
(redirect filedescriptors, play with signal masks/dispositions, uids, etc.)
POSIX now has posix_spawn that effectively allows you to combine fork and exec (possibly more efficiently than fork+exec; if it is more efficient, it will usually be implemented through some cheaper but less robust fork (clone/vfork) followed by exec), but the way it achieves #2 is through a ton of relatively messy options, which can never be as complete, powerful, and clean as just allowing you to run arbitrary code just before the new process image is loaded.
A process created by exec() et al. will inherit its file handles from the parent process (including stdin, stdout, stderr). If the parent changes these after calling fork() but before calling exec(), then it can control the child's standard streams.
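A sketch of that last point (the output file name and the command are made up): the parent forks, and the child redirects its stdout with dup2() before calling exec, so the new program inherits the redirected stream.

#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {
        /* child: point stdout at a file, then replace the process image */
        int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        dup2(fd, STDOUT_FILENO);
        close(fd);
        execlp("echo", "echo", "hello from the child", (char *)NULL);
        _exit(127);                            /* only reached if exec fails */
    }
    waitpid(pid, NULL, 0);
    return 0;
}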

Is it feasible to implement Linux concurrency primitives that give better isolation than threads but comparable performance?

Consider a following application: a web search server that upon start creates a large in-memory index of web pages based on data read from disk. Once initialized, in-memory index can not be modified and multiple threads are started to serve user queries. Assume the server is compiled to native code and uses OS threads.
Now, the threading model gives no isolation between threads. A buggy thread, or any non-thread-safe code, can corrupt the index or corrupt memory that was allocated by and logically belongs to some other thread. Such problems are difficult to detect and debug.
Theoretically, Linux allows enforcing better isolation. Once the index is initialized, the memory it occupies can be marked read-only. Threads can be replaced with processes that share the index (shared memory) but otherwise have separate heaps and cannot corrupt each other. Illegal operations are automatically detected by the hardware and the operating system. No mutexes or other synchronization primitives are needed. Memory-related data races are completely eliminated.
Is such a model feasible in practice? Are you aware of any real-life applications that do such things? Or are there some fundamental difficulties that make such a model impractical? Do you think such an approach would introduce a performance overhead compared to traditional threads? Theoretically, the memory that is used is the same, but are there some implementation-related issues that would make things slower?
The obvious solution is to not use threads at all. Use separate processes. Since the processes have the code and read-only structures in common, making the read-only data shared is trivial: format it as needed for in-memory use within a file, and map the file into memory.
Using this scheme, only the variable per-process data would be independent. The code would be shared and statically initialized data would be shared until written. If a process croaks, there is zero impact on other processes. No concurrency issues at all.
You can use mprotect() to make your index read-only. On a 64-bit system you can map the local memory for each thread at a random address (see this Wikipedia article on address space randomization) which makes the odds of memory corruption from one thread touching another astronomically small (and of course any corruption that misses mapped memory altogether will cause a segfault). Obviously you'll need to have different heaps for each thread.
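A sketch of the mprotect() idea just described (size and contents are placeholders): build the index in a page-aligned region, then drop write permission so that any stray write faults immediately.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t size = 4096 * 1024;                 /* example size; mprotect needs page-aligned regions */
    /* an anonymous mapping is already page-aligned */
    char *index = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (index == MAP_FAILED)
        return 1;

    memcpy(index, "index data...", 13);        /* initialization phase */

    mprotect(index, size, PROT_READ);          /* freeze: read-only from now on */

    printf("%.13s\n", index);                  /* reads are fine */
    /* index[0] = 'X'; */                      /* any write would now raise SIGSEGV */

    munmap(index, size);
    return 0;
}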
I think you might find memcached interesting. Also, you can create a shared memory and open it as read-only and then create your threads. This should not cause much performance degradation.

Process vs thread: can two processes share the same shared memory? Can two threads?

After thinking about the whole concept of shared memory, a question came up:
Can two processes share the same shared memory segment? Can two threads share the same shared memory?
After thinking about it a little more, I'm almost positive that two processes can share the same shared memory segment, where the first is the father and the second is the son created with a fork(), but what about two threads?
Thanks
can two processes share the same shared memory segment?
Yes and no. Typically, with modern operating systems, when one process is forked from another, they share the same memory space with copy-on-write set on all pages. Any update made to a read-write memory page causes a copy to be made for that page, so there will then be two copies and the page will no longer be shared between the parent and child process. This means that only read-only pages, or pages that have not been written to, will be shared.
If a process has not been forked from another, then they typically do not share any memory. One exception is that if you are running two instances of the same program, they may share code and maybe even static data segments, but no other pages will be shared. Another is how some operating systems allow applications to share the code pages of dynamic libraries that are loaded by multiple applications.
There are also specific memory-map calls to share the same memory segment. The call designates whether the map is read-only or read-write. How to do this is very OS dependent.
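On POSIX systems, one such call is shm_open() combined with mmap(); a minimal sketch (the segment name /example_seg is made up, and older glibc may need -lrt when linking):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* create (or open) a named segment visible to any process that knows the name */
    int fd = shm_open("/example_seg", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, 4096);                       /* give the segment a size */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    if (fork() == 0) {
        strcpy(p, "written by the child");     /* immediately visible to the parent */
        _exit(0);
    }
    wait(NULL);
    printf("parent reads: %s\n", p);

    munmap(p, 4096);
    close(fd);
    shm_unlink("/example_seg");                /* remove the name when done */
    return 0;
}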
can two threads share the same shared memory?
Certainly. Typically all of the memory inside of a multi-threaded process is "shared" by all of the threads except for some relatively small stack spaces which are per-thread. That is usually the definition of threads in that they all are running within the same memory space.
Threads also have the added complexity of having cached memory segments in high speed memory tied to the processor/core. This cached memory is not shared and updates to memory pages are flushed into central storage depending on synchronization operations.
In general, a major point of processes is to prevent memory being shared! Inter-process comms via a shared memory segment is certainly possible on the most common OS, but the mechanisms are not there by default. Failing to set up, and manage, the shared area correctly will likely result in a segFault/AV if you're lucky and UB if not.
Threads belonging to the same process, however, do not have such hardware memory-management protection and can pretty much share whatever they like, the obvious downside being that they can corrupt pretty much whatever they like. I've never actually found this to be a huge problem, especially with modern OO languages that tend to 'structure' pointers as object instances (Java, C#, Delphi).
Yes, two processes can both attach to a shared memory segment. A shared memory segment wouldn't be much use if that were not true, as that is the basic idea behind a shared memory segment - that's why it's one of several forms of IPC (inter-Process communication).
Two threads in the same process could also both attach to a shared memory segment, but given that they already share the entire address space of the process they are part of, there likely isn't much point (although someone will probably see that as a challenge to come up with a more-or-less valid use case for doing so).
In general terms, each process occupies a memory space isolated from all others in order to avoid unwanted interactions (including those which would represent security issues). However, there is usually a means for processes to share portions of memory. Sometimes this is done to reduce RAM footprint ("installed files" in VAX/VMS is/was one such example). It can also be a very efficient way for co-operating processes to communicate. How that sharing is implemented/structured/managed (e.g. parent/child) depends on the features provided by the specific operating system and design choices implemented in the application code.
Within a process, each thread has access to exactly the same memory space as all other threads of the same process. The only thing a thread has unique to itself is "execution context", part of which is its stack (although nothing prevents one thread from accessing or manipulating the stack "belonging to" another thread of the same process).
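To illustrate the thread side of this, a small sketch (my own) in which two threads update the same global counter; nothing is copied, the same memory is simply visible to both, which is also why the mutex is needed. Compile with -pthread.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                       /* lives in the data segment shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);             /* sharing is free, coordination is not */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld\n", counter);        /* 200000: both threads touched the same memory */
    return 0;
}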

Memory addressing in assembly / multitasking

I understand how programs in machine code can load values from memory in to registers, perform jumps, or store values in registers to memory, but I don't understand how this works for multiple processes. A process is allocated memory on the fly, so must it use relative addressing? Is this done automatically (meaning there are assembly instructions that perform relative jumps, etc.), or does the program have to "manually" add the correct offset to every memory position it addresses.
I have another question regarding multitasking that is somewhat related. How does the OS, which isn't running, stop a thread and move on to the next. Is this done with timed interrupts? If so, then how can the values in registers be preserved for a thread. Are they saved to memory before control is given to a different thread? Or, rather than timed interrupts, does the thread simply choose a good time to give up control. In the case of timed interrupts, what happens if a thread is given processor time and it doesn't need it. Does it have to waste it, can it call the interrupt manually, or does it alert the OS that it doesn't need much time?
Edit: Or are executables edited before being run to compensate for the correct offsets?
That's not how it works. All modern operating systems virtualize the available memory, giving every process the illusion that it has 2 gigabytes of memory (or more) and doesn't have to share it with anybody. The key component in a machine that does this is the MMU, nowadays built into the processor itself. Another core feature of this virtualization is that it isolates processes: one misbehaving process cannot bring another one down with it.
Yes, a clock tick interrupt is used to interrupt the currently running code. The processor state is simply saved on the stack. The operating system scheduler then checks whether any other thread is ready to run and has a high enough priority to get first in line. Some extra code ensures that everybody gets a fair share. Then it is just a matter of setting up the MMU for the other thread and resuming its execution. If no thread is ready to run, then the CPU is halted with the HALT instruction, to be woken again by the next clock interrupt.
This is a ten-thousand-foot view; it is well covered in any book about operating system design.
A process is allocated memory on the fly, so must it use relative addressing?
No, it can use relative or absolute addressing depending on what it is trying to address.
At least historically, the various different addressing modes were more about local versus remote memory. Relative addressing was for memory addresses close to the current address, while absolute addressing was more expensive but could address anything. With modern virtual memory systems, these distinctions may no longer be necessary.
A process is allocated memory on the fly, so must it use relative addressing? Is this done automatically (meaning there are assembly instructions that perform relative jumps, etc.), or does the program have to "manually" add the correct offset to every memory position it addresses.
I'm not sure about this one. This is normally taken care of by the compiler. Again, modern virtual memory systems make this complexity unnecessary.
Are they saved to memory before control is given to a different thread?
Yes. Typically all of the state (registers, etc.) is stored in a process control block (PCB), a new context is loaded, the registers and other context are loaded from the new PCB, and execution begins in the new context. The PCB can be stored on the stack or in kernel memory, or the kernel can utilize processor-specific operations to optimize this process.
Or, rather than timed interrupts, does the thread simply choose a good time to give up control.
The thread can yield control -- put itself back at the end of the run queue. It can also wait for some IO or sleep. Thread libraries then put the thread in wait queues and switch to another context. When the IO is ready or the sleep expires, the thread is put back into the run queue. The same happens with mutex locks. It waits for the lock in a wait queue. Once the lock is available, the thread is put back into the run queue.
In the case of timed interrupts, what happens if a thread is given processor time and it doesn't need it. Does it have to waste it, can it call the interrupt manually, or does it alert the OS that it doesn't need much time?
Either the thread can run (perform CPU instructions) or it is waiting -- either on IO or a sleep. It can ask to yield but typically it is doing so by [again] sleeping or waiting on IO.
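As a sketch of the voluntary case (my own example, not from the answer above): a thread can give up the rest of its slice with sched_yield(), or leave the run queue entirely by sleeping or blocking on IO.

#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    for (int i = 0; i < 3; i++) {
        printf("did a little work (%d), giving up the CPU voluntarily\n", i);
        sched_yield();                         /* go to the back of the run queue */
        sleep(1);                              /* or block outright: leave the run queue until the timer expires */
    }
    return 0;
}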
I probably walked into this question quite late, but then, it may be of use to some other programmers. First - the theory.
The modern operating system virtualizes memory, and to do so it maintains, within its system memory area, a series of page pointers. Each page is of a fixed size (usually 4K), and when any program seeks some memory, it is allocated memory addresses that are virtualized through the memory page pointers. This approximates the behaviour of "segment" registers in prior generations of processors.
Now when the scheduler decides to get another process running, it may or may not keep the previous process in memory. If it keeps it in memory, then all the scheduler does is save the entire register snapshot (now including the YMM registers - this was a complex issue earlier, as there was no single instruction that saved the entire context; read up on XSAVE), which has a fixed format (documented in the Intel SW manual). This is stored in the memory space of the scheduler itself, along with information on the memory pages that were being used.
If, however, the scheduler needs to "dump" to the hard disk the current process context that is about to go to sleep - this situation usually arises when the process that is waking up needs an extraordinary amount of memory - then the scheduler writes the memory pages out to disk blocks (the pagefile, a reserved area of the disk, and also the source of the "old grandmother wisdom" that the pagefile must be equal to the size of real memory) and preserves the memory page pointer addresses as offsets in the pagefile. When the process wakes up, the scheduler reads the offset addresses from the pagefile, allocates real memory, populates the memory page pointers, and then loads the contents from the disk blocks.
Now, to answer your specific questions:
1. Do you need to use only relative addressing, or can you use absolute?
Ans. You may use either - whatever you perceive as absolute is also relative, since the memory page pointers relativize that address invisibly. There is no truly absolute memory address anywhere (including the IO device memories) except in the kernel of the operating system itself. To test this, you may disassemble any .EXE program and see that the entry point is always CALL 0010, which implies that each thread gets a different "0010" at which to start its execution.
2. How do threads get their time slices, and what if a thread surrenders its unused slice?
Ans. Threads usually get a slice - modern systems have 20 ms as the usual standard, though this is sometimes changed in special-purpose builds for servers that do not have many hardware interrupts to deal with - in order of their position on the process queue. A thread usually surrenders its slice by calling the function sleep(), which is a formal (and very nice) way to surrender the remaining part of its time slice. Most libraries implementing asynchronous reads, or interrupt actions, call sleep() internally, but in many instances top-level programs also call sleep() - e.g. to create a time gap. An invocation of sleep will certainly change the process context - the CPU is not simply left spinning on NOPs.
The other method is to wait for an IO to complete, and this is handled differently. The program, on requesting an IO operation, will cede its time slice, and the process scheduler flags this thread as being in the "WAITING FOR AN IO" state; this thread will not be given a time slice until its intended IO is completed or timed out. This feature helps programmers, as they do not have to explicitly write a sleep_until_IO() kind of interface.
Trust this sets you going further in your explorations.
