How is the code segment shared between processes in Linux?

I have read about the copy-on-write principle which occurs when a new process is being forked in Linux.
I have also read about the fact that if multiple instances of one program are running at the same time, only one instance of the program code can be found in the memory.
I was wondering whether this is a direct consequence of the copy-on-write principle or not, and if it is not, what is the process which ensures that no unnecessary copies of the program's code reside in the memory?

I was wondering whether this is a direct consequence of the
copy-on-write principle or not
No, it's not. FWIW, you could have shared code segments without COW, and you could have COW without shared code segments; the two mechanisms are independent.
If shared program code were to be achieved as a consequence of COW, then only related processes could benefit from that.
For example, if process A forks twice and creates processes B and C, and then B and C call one of the seven exec functions on the same binary, you could say that the code segment is shared because of COW: since the code segment is mapped read-only and is never written during execution, it must automatically be shared, right?
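In code, that scenario looks roughly like this (a minimal sketch; /bin/some_binary is a hypothetical placeholder and error handling is omitted):

    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        /* Process A forks twice, creating B and C... */
        for (int i = 0; i < 2; i++) {
            if (fork() == 0) {
                /* ...and each child execs the same binary. */
                execl("/bin/some_binary", "some_binary", (char *)NULL);
                _exit(1); /* only reached if exec fails */
            }
        }
        while (wait(NULL) > 0)
            ;
        return 0;
    }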
What if you start the same executable from another shell? (Or some other unrelated process forks and executes the same program? It doesn't have to be a shell...)
If code segment sharing were a consequence of COW, in this scenario we wouldn't benefit from sharing the code segment, because the processes are unrelated (so there are no COW-shared pages with the other instances to begin with).
Instead, the code segment is shared via memory-mapped files. When a new executable is loaded into memory, mmap(2) is called to map the binary file's contents into memory.
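Conceptually, loading the text is little more than this (a simplified sketch of the idea, not the kernel's actual code path):

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map a binary's contents read-only and executable. Every process
       that maps the same file this way ends up referencing the same
       physical page-cache pages. Error handling is trimmed. */
    void *map_text(const char *path, size_t *len) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return NULL;
        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return NULL; }
        *len = st.st_size;
        void *text = mmap(NULL, *len, PROT_READ | PROT_EXEC,
                          MAP_PRIVATE, fd, 0);
        close(fd); /* the mapping keeps the file's pages referenced */
        return text == MAP_FAILED ? NULL : text;
    }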
and if it is not, what is the process which ensures that no
unnecessary copies of the program's code reside in the memory?
The exact implementation details depend on the operating system, but it's not that complicated. Conceptually, mmap(2) maps files into memory, so you just need to keep some state on the underlying file representation to keep track of which (if any) memory mappings are active for that file. Such information is usually kept in the file's inode.
Linux, for example, associates files with memory address spaces through the i_mapping field of struct inode. So, when mmap(2) is called on a binary for the first time, physical memory pages are allocated to hold the file's contents and the i_mapping field of that file's inode is set; later invocations consult the i_mapping field and see that there is already an address space associated with this inode, and because the mapping is read-only, no new physical pages are allocated, so everything ends up being shared. Note that the virtual addresses may differ from process to process even though they refer to the same physical pages (which means the kernel will at least allocate and update each process's page tables, but that's about it).
The inode structure is defined in fs.h - I can only guess that other UNIX variants do this in a similar way.
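For reference, the field looks like this in current kernels (include/linux/fs.h, heavily abridged):

    struct inode {
        /* ... many fields omitted ... */
        struct address_space *i_mapping; /* page-cache mapping for this file */
        /* ... */
    };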
Of course, this all works as long as the same binary file is used. If you copy the binary file and execute both copies separately, for obvious reasons, the code segment will not be shared.

The sharing of program code (sometimes called program text) relies on another mechanism: memory mapped files.
The key to understanding this is that the program's code does not need to be modified by the linker in order to resolve references to external symbols. Therefore, the operating system only ever deals in read-only copies of the program text, and it is inherently sharable among processes.
When run-time linking your program, the dynamic linker calls mmap() to create virtual address space for your program's code (and for any shared libraries it uses). At this stage, the file isn't backed by real pages of memory. Instead, as the program starts to execute, accesses within the file's virtual address range cause page faults, and the operating system either allocates a page and fills it from disc or, if the page is already in memory, maps to that existing page.
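You can watch this in action: /proc/<pid>/maps lists every mapping of a process, including the read-only, executable (r-xp) text of the binary and of each shared library. Here is a trivial sketch that dumps a process's own mappings; run two instances and compare the file-backed text entries:

    #include <stdio.h>

    int main(void) {
        /* Each line shows an address range, permissions, file offset,
           device, inode, and the backing file, if any. */
        FILE *f = fopen("/proc/self/maps", "r");
        if (!f) return 1;
        int c;
        while ((c = getc(f)) != EOF)
            putchar(c);
        fclose(f);
        return 0;
    }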
A good place to learn more is Modern Operating Systems by Andrew Tanenbaum.

Related

What structure is traversed to deallocate pages when a process terminates? (Page table or something else?)

I am trying to understand the nature of the operations carried out regarding the deallocation of physical memory when a process terminates.
Assume that the page table for the process is the multi-level tree structure implemented on Linux.
My current understanding is that the OS would need to deallocate each physical page frame that is mapped to whatever subset of the virtual addresses for which a page table entry (PTE) exists. This could happen by traversing the multi-level tree PT structure and, for the PTEs that have their valid bit set, adding the physical frame descriptor corresponding to each PTE to the free list (which is used in the buddy allocation process).
My question is: is the traversal of the page table actually done for this? An alternative, faster way would be to maintain a linked list of the page frame descriptors allotted to each process and then traverse that linearly during process termination. Is this more generic and faster method followed instead?
I'm not sure that pages get physically deallocated when a process ends.
My understanding is that the MMU is managed by the kernel.
But each process has its own virtual address space, which the kernel changes:
for explicit syscalls changing it, e.g. mmap(2)
at program start through execve(2) (which can be thought of as several virtual mmap-s, as described by the segments of the ELF executable file)
at process termination, as if each segment of the address space was virtually munmap-ed
And when a process terminates, it is its virtual address space (but not any physical RAM pages) which gets destroyed or deallocated!
So the page table (whatever definition you give to it) is probably managed inside the kernel by a few primitives like adding a segment to virtual address space and removing a segment from it. The virtual space is lazily managed, since the kernel uses copy on write techniques to make fork fast.
Don't forget that some pages (e.g. the code segment of shared libraries) are shared between processes, and that all tasks of a multi-threaded process share the same virtual address space.
BTW, the Linux kernel is free software, so you should study its source code (from http://kernel.org/). Look also at http://kernelnewbies.org; memory management happens inside the mm/ subtree of the kernel source.
There are lots of resources. Look into linux-kernel-slides, slide #245, for a start; there are many books and resources about the Linux kernel. Look for vm_area_struct, pgtable, etc.
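For orientation, this is the structure describing each mapped segment of a process (include/linux/mm_types.h, heavily abridged):

    /* One of these per mapped region ("segment") of a process. */
    struct vm_area_struct {
        unsigned long vm_start; /* first address of the region */
        unsigned long vm_end;   /* first address past the region */
        /* ... flags, backing file, links to the process's other VMAs ... */
    };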

Does binary stay in memory after program exits?

I know that when a program first starts, it incurs a lot of page faults, since its code is not yet in memory and thus must be loaded from disk.
What happens when a program exits? Does the binary stay in memory? Would subsequent invocations of the program find that the code is already in memory and thus not have page faults (assuming nothing runs in between and pages stuff out to disk)?
It seems like the answer is no from running some experiments on my Linux machine. I ran some program over and over again, and observed the same number of page faults every time. It's a relatively quiet machine so I doubt stuff is getting paged out in between invocations. So, why is that? Why doesn't executable get to stay in memory?
There are two things to consider here:
1) The content of the executable file is likely kept in the OS cache (disk cache). While that data is still in the OS cache, every read for that data will hit the cache, and the OS will honor the request without needing to re-read the file from disk.
2) When a process exits, the OS unmaps every memory page mapped to a file and frees any memory (in general, it releases every resource allocated by the process, including other resources such as sockets, and so on). Strictly speaking, the physical memory may be zeroed, but that is not strictly required (still, the security level of the OS may require zeroing a page that is no longer used; Windows NT, 2000, XP, etc. probably do that; see Does Windows clear memory pages?). Another invocation of the same executable will create a brand new process which maps the same file into memory, but the first access to those pages will still trigger page faults because, in the end, it is a new process with a different memory mapping. So yes, the page faults occur, but they are a lot cheaper for the second instance of the same executable than for the first.
Of course, this is only about the read-only parts of the executable (the segments/modules containing the code and read-only data).
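One way to see how much cheaper: major faults (ru_majflt) required disk I/O, while minor faults (ru_minflt) were satisfied from pages already in RAM, such as the page cache. A sketch using getrusage(2):

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void) {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        /* On a second back-to-back run of the same binary, ru_majflt is
           typically near zero: text pages are faulted in from the page
           cache rather than from disk. */
        printf("minor: %ld, major: %ld\n", ru.ru_minflt, ru.ru_majflt);
        return 0;
    }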
One may consider another scenario: forking. In this case, every page is marked as copy-on-write. When the first write occurs on a memory page, a hardware exception is triggered and intercepted by the OS memory manager. The OS determines whether the page in question is allowed to be written (e.g. if it is the stack, heap, or any writable page in general) and, if so, allocates memory and copies the original content before allowing the process to modify the page, in order to preserve the original data in the other process. And yes, there is still another case: shared memory, where the exact same physical memory is mapped into two or more processes. In this case, the copy-on-write flag is, of course, not set on the memory pages.
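A minimal sketch of that fork-then-write sequence:

    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static char data[4096] = "original";

    int main(void) {
        if (fork() == 0) {
            /* First write to the COW page: the kernel copies it for the
               child, leaving the parent's page untouched. */
            strcpy(data, "child's copy");
            printf("child : %s\n", data);
            _exit(0);
        }
        wait(NULL);
        printf("parent: %s\n", data); /* still "original" */
        return 0;
    }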
Hope this clarifies what is going on with the memory pages.
What I suspect is that these blobs of information are not promptly erased from RAM unless actually running code makes a new request for more RAM. On a subsequent execution, the OS can then reuse the OS-dependent bits that are still sitting in RAM. I think this is true for OS-initiated resources (probably not for all resources, but for some).
Actually, most of your questions are highly implementation-dependent, but for the most commonly used OSes:
What happens when a program exits? Does the binary stay in memory?
Yes, but the memory blocks are marked as unused (and thus could be allocated to other processes).
Would subsequent invocations of the program find that the code is
already in memory and thus not have page faults (assuming nothing runs
in between and pages stuff out to disk)?
No, those blocks are considered empty. Some/all blocks might have been overwritten already.
Why doesn't executable get to stay in memory?
Why would it stay? When a process is finished, all of its allocated resources are freed.
One of the reasons is that one generally wants to clear everything out on a subsequent invocation, in case there was a problem with the previous one.
Plus, the writable data must be cleared out.
That said, some systems do have mechanisms for keeping executables and static data in memory (possibly not Linux). For example, the VMS operating system allows the system manager to install executables and shared libraries so that they remain in memory (paging allowed). The same mechanism can be used to create writable shared memory, allowing interprocess communication and allowing modifications to the memory to remain in memory (possibly paged out).

Running two processes in Unix/Linux

When the kernel creates two processes whose code section is same, does the kernel actually copy the code to the virtual address space of both processes? In other words, if I create two processes of the same program, in memory, do we have two copies of the program or just one copy?
Obviously, it may depend on the implementation, but I'm asking about traditional Unix OSes.
Does the kernel actually copy the code to the virtual address space of both processes?
The text segment will be mapped (rather than copied) into the virtual address space of each process, but it will refer to the same physical pages (so the kernel keeps only one copy of the text in memory).
The data and bss segments will also be mapped into the virtual address space of each process, but these are created per process. At process initiation, the data from the data and bss segments of the executable will be mapped/copied into the process's virtual memory; if it is not copied ab initio, then as soon as a process starts writing to the data, it will be given its own private copy.
Clearly, shared memory and mmap'd memory are handled after the process starts. Shared memory is always shared between processes; that's its raison d'ĂȘtre. What happens with mmap depends on the flags used, but it is often shared too.
Modern operating systems will use Copy-on-Write to avoid duplicating pages until they are actually updated. Note that on many systems (including Linux) this can lead to overcommit, where the OS doesn't actually have enough RAM to cope with all the copying required should every process decide to modify un-duplicated pages.

Linux Shared Library & Memory space

While I was studying shared libraries, I read this statement:
Although the code of a shared library is shared among multiple
processes, its variables are not. Each process that uses the library
has its own copies of the global and static variables that are defined
within the library.
I just have a few doubts:
Is the code part of each process in a separate address space?
Is the shared-library code part in some global (unique) address space?
I am just a starter so please help me understand.
Thanks!
Shared libraries are loaded into a process by memory-mapping the file into some portion of the process's address-space. When multiple processes load the same library, the OS simply lets them share the same physical RAM.
Portions of the library that can be modified, such as static globals, are generally loaded in copy-on-write mode: when a write is attempted, a page fault occurs, the kernel responds by copying the affected page to another physical page of RAM (for that process only) and redirecting the mapping to the new page, and then finally the write operation completes (see the sketch at the end of this answer).
To answer your specific points:
All processes have their own address space. The sharing of physical memory between processes is invisible to each process (unless they do so deliberately via a shared memory API).
All data and code live in physical RAM, which is a kind of address space. Most of the addresses you are likely to see, however, are virtual memory addresses belonging to the address space of one process or another, even if that "process" is the kernel.
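Going back to the copy-on-write point above, a quick way to convince yourself: build a tiny library with one global and load it from two processes; each sees its own counter. (The file name, the bump() helper, and the build command here are just an illustrative sketch.)

    /* libcounter.c - build with: gcc -shared -fPIC -o libcounter.so libcounter.c */
    int counter = 0; /* one logical global, but per-process storage */

    int bump(void) {
        /* The first write faults, and the kernel gives this process its
           own copy of the page holding 'counter'; increments here are
           invisible to other processes using the same library. */
        return ++counter;
    }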

Is fork() copy-on-write a stable exposed behavior that can be used to implement read-only shared memory?

The man page for fork() states that it does not copy data pages; it maps them into the child process and sets a copy-on-write flag. Is that behavior:
consistent between flavors of Linux?
considered an implementation detail and therefore likely to change?
I'm wondering if I can use fork() as a means to get a shared read-only memory block on the cheap. If the memory is physically copied, it would be rather expensive - there's a lot of forking going on, and the data area is big enough - but I'm hoping not...
Linux running on machines without an MMU (memory management unit) will copy all process memory on fork().
However, those systems are usually very small and embedded and you probably don't have to worry about them.
Many services, such as Apache with its fork model, use the initialize-then-fork() method to share initialized data structures.
You should be aware that if you are using languages like Perl and Python that use reference-counted variables, or C++ shared_ptr's, this model will not work. It will not work because as the reference counts are adjusted up and down, the memory becomes unshared and gets copied.
This causes huge amounts of memory usage in Perl daemons like SpamAssassin that attempt to use an initialize and fork model.
Yes, you can certainly rely on it on Linux kernels built for machines with an MMU; that covers almost everything.
However, the page size isn't the same everywhere.
It is possible to explicitly make a shared memory area for forked processes by using mmap() to create an anonymous map, one which is not backed by a physical file. On fork, this area will always remain shared (provided the child doesn't unmap it, or map something else in at the same address). You can mprotect it to be read-only if you want, as in the sketch below.
Memory allocated with (for example) malloc can easily end up sharing a page with something that isn't readonly, which means it gets copied anyway when another structure is modified. This includes internal structures used by the malloc implementation. So you might want to mmap a specific area for this purpose and allocate from that.
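A sketch of that approach:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        size_t len = 1 << 20;
        /* MAP_SHARED | MAP_ANONYMOUS: not backed by a file, and genuinely
           shared across fork() rather than copy-on-write. */
        char *shared = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (shared == MAP_FAILED) return 1;

        strcpy(shared, "filled before fork");
        mprotect(shared, len, PROT_READ); /* lock it down for the readers */

        if (fork() == 0) {
            printf("child sees: %s\n", shared); /* no copy was made */
            _exit(0);
        }
        wait(NULL);
        return 0;
    }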
Can you rely on the fact that all Linux flavors do it this way? No. But you can rely on the fact that any that don't will use an even faster method.
Therefore, you should use the feature, rely on it, and revisit your decision if you ever hit a performance problem.
The success of this approach depends on how well you stick to your self-imposed "read-only" limitation. Both parent and child have to obey this stricture, else the memory gets copied.
This may not be the catastrophe you're envisioning, however. The kernel can copy as little as a single page (typically 4 KB) to implement CoW semantics. A typical Linux server will use something more complex, some sort of slab allocator, so the copied region could be much larger.
The main point is that this is decoupled from your program's conception of its memory use. If you malloc() 1 GB of RAM, fork off a child, and the child changes just the first byte of that memory block, the entire 1 GB block isn't copied. Perhaps as little as one page is copied, up to the slab size containing that first byte.
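In a sketch (watch the child's resident set size via /proc/<pid>/status or similar to confirm; this assumes the machine has RAM to spare):

    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        size_t len = 1UL << 30; /* 1 GB */
        char *big = malloc(len);
        if (!big) return 1;
        memset(big, 0xAA, len); /* touch every page in the parent */

        if (fork() == 0) {
            big[0] = 0x55; /* dirties a single page; roughly 4 KB is
                              copied, not the whole 1 GB region */
            _exit(0);
        }
        wait(NULL);
        free(big);
        return 0;
    }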
Yes
All the Linux distros use the same kernel, albeit with slightly different versions and releases of it.
It's unlikely that another underlying fork(2) implementation will be faster any time soon, so it's a safe bet that copy-on-write will continue to be the mechanism. Perhaps it won't be forever, but for years, definitely.
Certainly some major software systems (for example, Phusion Passenger) use fork(2) in the same way that you want to, so you would not be the only one taking advantage of CoW.
