Can a process start executing with 0 Pages in main memory? - linux

I am reading a book on operating systems (Galvin). While explaining demand paging, it says:
In the extreme case, we can start executing a process with no pages in
memory. When the operating system sets the instruction pointer to the first
instruction of the process, which is on a non-memory-resident page, the process
immediately faults for the page.
My question is: how can the OS set the instruction pointer for a process that does not have even a single page in memory? The address in the instruction pointer cannot be a disk or secondary-storage address; it has to be a main-memory address, but 0 pages means nothing is in memory.

That's what virtual memory is. It means that there's an ephemeral mapping between logical addresses, which are known and constant, and physical addresses, which are transient. The normal level of processing then works purely in logical addresses, without necessarily having any knowledge of what's going on physically.
So the OS would, for example, say that binary A is logically available at address N. It then marks in the page tables that the pages covering N to N+(size of binary) are currently not present. Once the PC has been set to N (or whatever the entry point is), the MMU raises a page fault as soon as the CPU tries to fetch the instruction at the PC. At that point the paging mechanism catches the fault and does the usual things: it loads the page from disk, marks it present, and restarts the instruction.
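The sequence described above can be sketched as a toy model. This is not how a real kernel or MMU is implemented; the class name, the fields, and the entry point 0x400000 are all hypothetical and chosen only for illustration.

```python
PAGE_SIZE = 4096

class ToyProcess:
    """Hypothetical model: every page of the image starts as 'not present'."""

    def __init__(self, entry_point, image_size):
        first = entry_point // PAGE_SIZE
        last = (entry_point + image_size - 1) // PAGE_SIZE
        # The pages are mapped logically, but none of them is resident yet.
        self.present = {page: False for page in range(first, last + 1)}
        self.pc = entry_point
        self.faults = 0

    def fetch(self):
        """Fetch the instruction at pc, faulting the page in if needed."""
        page = self.pc // PAGE_SIZE
        if page not in self.present:
            raise MemoryError("address is not mapped at all")
        if not self.present[page]:
            self.faults += 1           # the MMU raises a page fault
            self.present[page] = True  # the OS loads the page, then retries
        return page

proc = ToyProcess(entry_point=0x400000, image_size=3 * PAGE_SIZE)
proc.fetch()        # the very first fetch faults
proc.fetch()        # the second fetch finds the page resident
print(proc.faults)  # prints 1
```

The point is that the PC holds a logical address from the start; whether a physical page sits behind it is only decided when the fetch happens.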

Related

What about the many addresses unpresented in the /proc/$pid/maps file?

Brief Version:
What is the status of the addresses that do not appear in the maps file? Do they belong to unallocated virtual pages, to pages allocated from an anonymous file, or to something else?
Detailed Version:
I'm learning about VM. In my book (CS:APP), I learned that all virtual pages can be partitioned into three sets: unallocated, allocated but not cached, and allocated and cached. I have some questions: what are allocated and unallocated pages, and when are pages allocated? Also, do the stack and heap belong to allocated pages or unallocated pages, or are they only allocated when used?
Trying to answer these questions, I read the /proc/$pid/maps file, thinking I could get everything I wanted from it. As I understand it, the file contains all memory mappings. But it has no information about whether a page is cached (I know that maybe this cannot be seen from user mode...). And are the pages that do not appear in it unallocated?
Honestly, I don't really know about the maps file. What I do know is that the information on every page is stored in page structures at all times. I'm going to take x86-64 as an example.
For x86-64 on Linux you have the Page Global Directory (PGD), the Page Upper Directory (PUD), the Page Middle Directory (PMD) and the page table (PT, whose entries are called PTEs). The physical base address of the PGD is stored in the CR3 register. The PGD contains addresses of PUDs, PUDs contain addresses of PMDs, PMDs contain addresses of page tables, and page tables contain addresses of physical pages.
A virtual address is 64 bits wide, but with 4-level paging only the 48 least significant bits are used for translation. Those 48 bits are split into 5 parts. The 12 least significant bits are the offset in the physical page. The next chunk of 9 bits is the index into the page table, the next 9 bits are the index into the PMD, then comes the index into the PUD, and the top 9 bits are the index into the PGD. For example, let's say you have virtual address 0x0000000000000123. The MMU in the CPU translates it by looking at entry 0 of the PGD, entry 0 of the PUD, entry 0 of the PMD, entry 0 of the page table, and finally offset 0x123 in the actual physical page in RAM.
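The split described above is plain bit arithmetic. A minimal sketch, assuming 4 KiB pages and 4-level paging (the function name and the variable names are made up for illustration):

```python
def split_vaddr(vaddr):
    """Split an x86-64 virtual address into four 9-bit indices and a 12-bit offset."""
    offset = vaddr & 0xFFF        # bits 0-11: offset inside the physical page
    pt = (vaddr >> 12) & 0x1FF    # bits 12-20: index into the last-level table
    pmd = (vaddr >> 21) & 0x1FF   # bits 21-29: index into the PMD
    pud = (vaddr >> 30) & 0x1FF   # bits 30-38: index into the PUD
    pgd = (vaddr >> 39) & 0x1FF   # bits 39-47: index into the PGD
    return pgd, pud, pmd, pt, offset

print(split_vaddr(0x0000000000000123))  # prints (0, 0, 0, 0, 291)
```

291 is 0x123 in decimal, so the example address from the text lands at entry 0 of every table and offset 0x123 in the page.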
At boot, the kernel makes checks to determine how much memory is available. It then builds its kernel structures accordingly.
When the kernel boots, it marks all pages as unallocated in its own structures (except for those the kernel itself needs). The page structure is important for this. The kernel has a struct page for every physical page in the system (https://linux-kernel-labs.github.io/refs/heads/master/labs/memory_mapping.html and https://elixir.bootlin.com/linux/v4.6/source/include/linux/mm_types.h). This structure tells the kernel what the page is currently being used for. As the kernel source puts it:
Each physical page in the system has a struct page associated with
it to keep track of whatever it is we are using the page for at the
moment. Note that we have no way to track which tasks are using
a page, though if it is a pagecache page, rmap structures can tell us
who is mapping it.
At first the pages are mostly unallocated. When you start a new process by launching an executable, pages are allocated for your process. On Linux, executables are ELF files. ELF is a standard format that separates the program into loadable segments. Each segment has a virtual address at which it is to be loaded in the virtual address space.
Let's say you have an ELF file with one segment which should be loaded at virtual address 0x400000. When you launch that ELF executable, the Linux kernel looks at the size of the segment and reserves the matching range of virtual addresses. The kernel then decides, using its own allocation algorithms, which physical pages in RAM will back that range, and sets up the page tables so that the process's virtual addresses land on those physical pages. With demand paging, this backing typically happens lazily, when a page is first touched.
What's important to understand is that each CPU core runs only one process at a time. Each core has its own CR3 register, which points to the page tables of the process currently running on that core. When a process switch occurs on a core, CR3 is reloaded so that it points to a different set of page tables in RAM. The same virtual address can therefore point anywhere in RAM depending on how the page tables are set up.
The kernel holds a task_struct for every process running in the system. The task_struct has an mm field pointing to the process's mm_struct, and the mm_struct contains a field named pgd, which is a pointer to the PGD of the process. Each process has its very own PGD. If you dereference the pointer to the PGD you get the first entry of the PGD, which holds the address of a PUD. Starting from this single pointer, the kernel can reach every table belonging to the process and modify them at will.
While a process is running, it can ask for more memory. This is called dynamic memory allocation. The kernel has no way to know in advance how much memory the process is going to ask for, since the request is dynamic (made while code is executing). When the process asks for more memory, the kernel decides which pages to give to the process according to its allocation algorithm. It then marks those pages as allocated to that process. The mm field of the task_struct points to an mm_struct (https://manybutfinite.com/post/how-the-kernel-manages-your-memory/). It is the memory descriptor for that process, so that the kernel knows what memory the process is using. The process itself doesn't need that information: it is expected to ask the operating system for memory properly and not to jump somewhere in RAM where it doesn't belong.
You ask about the heap and the stack. The stack for a process is set up when the process starts. On Linux the main thread's stack is not a fixed-size allocation: it grows on demand up to a limit (see ulimit -s). If you go past that limit, the access causes a page fault that the kernel cannot satisfy, and it kills your process with SIGSEGV. Each CPU core has a special register called RSP. This is the stack pointer. It points to the top of the stack (the stack grows downward toward low memory). When the kernel sets up a stack for a process you launch, it sets this register to point at the top of it. The stack pointer contains a virtual address, so it is translated using the page tables just like any other address.
The heap is different. It doesn't have a special register like the stack, and it is managed by the process's allocator in user space, which asks the kernel for more memory (via brk or mmap) only when the process requests it during execution. For static memory, the kernel knows in advance how much a process requires: it is all written in the ELF executable. The sizes of static data are fixed at compile and link time, so the kernel knows everything about the size of static memory when it loads the program. The only time it needs to give new memory to a process is when the process actually asks for it. In C++ you use the keyword new to ask for heap memory dynamically. If you don't use it, the locations of your variables are known in advance: globals and statics live in the data segments described by the ELF file, and local variables live on the stack.
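Coming back to the maps file from the question: each line of /proc/$pid/maps describes one mapped range of virtual addresses, so any address that falls between two listed ranges is simply not mapped, and touching it normally ends in a segmentation fault. A minimal sketch that lists those holes is below. The sample text is made up for illustration (it only follows the general layout of a maps line), and the function names are hypothetical.

```python
# Illustrative sample only; real files have many more lines.
SAMPLE_MAPS = """\
00400000-00401000 r-xp 00000000 08:01 1234 /tmp/demo
00600000-00601000 rw-p 00000000 08:01 1234 /tmp/demo
7ffd0000-7fff1000 rw-p 00000000 00:00 0 [stack]
"""

def parse_ranges(text):
    """Return the sorted (start, end) address ranges listed in maps-style text."""
    ranges = []
    for line in text.splitlines():
        if not line.strip():
            continue
        span = line.split()[0]  # first field looks like "start-end" in hex
        start, end = (int(part, 16) for part in span.split("-"))
        ranges.append((start, end))
    return sorted(ranges)

def gaps(ranges):
    """Return the unmapped holes between consecutive mapped ranges."""
    holes = []
    for (_, prev_end), (next_start, _) in zip(ranges, ranges[1:]):
        if next_start > prev_end:
            holes.append((prev_end, next_start))
    return holes

for start, end in gaps(parse_ranges(SAMPLE_MAPS)):
    print(f"unmapped: {start:#x}-{end:#x}")
```

On a Linux machine the same functions can be fed the contents of /proc/self/maps instead of the sample string. Note that the file says nothing about which mapped pages are currently resident in RAM.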

In case of a process context switch, is the virtual address space (VAS) of the new process loaded into the CPU context (CPU's registers)?

I have read:
Process switching is context switching from one process to a different process. It involves switching out all of the process abstractions and resources in favor of those belonging to a new process. Most notably and expensively, this means switching the memory address space. This includes memory addresses, mappings, page tables, and kernel resources—a relatively expensive operation.
Also:
A context is the contents of a CPU's registers and program counter at any point in time.
Context switching can be described in slightly more detail as the kernel (i.e., the core of the operating system) performing the following activities with regard to processes (including threads) on the CPU: (1) suspending the progression of one process and storing the CPU's state (i.e., the context) for that process somewhere in memory, (2) retrieving the context of the next process from memory and restoring it in the CPU's registers and (3) returning to the location indicated by the program counter (i.e., returning to the line of code at which the process was interrupted) in order to resume the process.
As the VAS is separate for each process and can be of size up to 4GB, is the whole VAS of a process loaded into the CPU context in case of context switch of the process?
Also as each process has separate page table, is the page table also brought into the CPU context in case of a context switch ?
If no, then why is a process context switch slower than a thread context switch (threads share the same VAS)?
As the VAS is separate for each process and can be up to 4GB in size, is the whole VAS of a process loaded into the CPU context when the process is context switched?
Also, as each process has a separate page table, is the page table also brought into the CPU context on a context switch?
These questions are related. You swap out one virtual address space for another by changing which set of page tables performs the virtual-to-physical translation. That's how the address space swap is accomplished.
Let's consider a very simple example.
Say we have two processes PA and PB. Both processes are executing their program image at virtual address 0x1000.
Not visible to the processes are a set of page tables, which map the virtual address space to physical pages of RAM:
Pagetable TA maps virtual address 0x1000 to physical address 0x88000
Pagetable TB maps virtual 0x1000 to physical 0x99000.
Let's say the theoretical CPU has a register called PP (pagetable pointer)
After the processes have been initialized, "swapping the virtual address space" between the two is simple. To load the address space for PA, you simply put the address of TA in PP, and now that process "sees" the memory at 0x88000. Likewise, you put the address of TB in PP for PB, so it will "see" the memory at 0x99000.
When switching between threads (of the same process), the virtual address space does not need to be changed (because all threads of a given process share the same virtual address space).
Of course there are other things which need to be swapped in as well (like the CPU registers), but for this discussion, we're only concerned with virtual memory.
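The PA/PB example can be written out as a toy model. The class and method names are hypothetical, and a one-level dictionary stands in for the real page tables:

```python
PAGE_SIZE = 0x1000

# One toy page table per process: virtual page address -> physical page address.
TA = {0x1000: 0x88000}
TB = {0x1000: 0x99000}

class ToyCPU:
    def __init__(self):
        self.pp = None  # the hypothetical "pagetable pointer" register

    def switch_address_space(self, table):
        # Swapping address spaces is just loading a different table into PP.
        self.pp = table

    def translate(self, vaddr):
        page = vaddr & ~(PAGE_SIZE - 1)
        offset = vaddr & (PAGE_SIZE - 1)
        return self.pp[page] + offset

cpu = ToyCPU()
cpu.switch_address_space(TA)
print(hex(cpu.translate(0x1004)))  # prints 0x88004
cpu.switch_address_space(TB)
print(hex(cpu.translate(0x1004)))  # prints 0x99004
```

The same virtual address resolves to a different physical address purely because PP changed; nothing belonging to either process was copied.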
On x86 CPUs, the CR3 register is the pointer to the base of the page table hierarchy. It is this register which the OS changes to change address spaces when swapping processes.
Of course, it's more complicated than that. Because the possible virtual address space is so large (4 GiB on x86-32, and 16 exabytes on x86-64), the pagetables themselves would take up a ridiculous amount of space (one entry for every 4 KiB page). To alleviate this, additional levels of indirection are added to the pagetables, which is why I referred to them as a hierarchy. On x86-64, there are 4 levels.
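To put a number on "ridiculous", here is a rough calculation of how big a single flat (one-level) table would be. It assumes 4 KiB pages, 4-byte entries on x86-32, and 8-byte entries with 48 usable address bits on x86-64; the function name is made up for illustration.

```python
KIB, MIB, GIB = 2**10, 2**20, 2**30
PAGE_SIZE = 4 * KIB

def flat_table_bytes(address_bits, entry_bytes):
    """Size of a one-level table with one entry per page of the address space."""
    entries = 2**address_bits // PAGE_SIZE
    return entries * entry_bytes

print(flat_table_bytes(32, 4) // MIB, "MiB per process on x86-32")  # prints 4 MiB ...
print(flat_table_bytes(48, 8) // GIB, "GiB per process on x86-64")  # prints 512 GiB ...
```

With a hierarchy, tables for unused parts of the address space are never allocated, which is what keeps the real cost small.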
Now imagine if the CPU had to "walk" these paging structures for every virtual-to-physical translation. A single read from virtual memory would require a total of 5 memory accesses! This would be terribly slow.
Enter the Translation lookaside buffer, or TLB. The TLB caches these translations, so a given virtual-to-physical translation only requires the pagetables to be walked once. After that, the TLB remembers the translation, and is much faster. (Of course the TLB can get full, but cache eviction is another story).
So say PA is running, and all of a sudden the kernel swaps in the address space for PB. Now all of those cached virtual-to-physical translations are no longer valid for the new virtual address space! That means we need to flush the TLB, or clear all of its entries out. And because of that, the CPU has to do the slow pagetable walking again, until the TLB cache "heats up" again.
This is why it's considered "expensive" to swap virtual address spaces. Not because it's hard to write to CR3, but because we trash the TLB every time we do.
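The cost can be illustrated with a toy model that counts page-table walks. This is a sketch under simplifying assumptions: the TLB is an unbounded dictionary, every address-space switch flushes it completely, and all names are hypothetical.

```python
class ToyMMU:
    def __init__(self):
        self.table = None
        self.tlb = {}    # cached translations: virtual page -> physical page
        self.walks = 0   # number of slow page-table walks performed

    def switch_address_space(self, table):
        self.table = table
        self.tlb.clear()  # old translations are invalid for the new address space

    def translate(self, vpage):
        if vpage not in self.tlb:
            self.walks += 1                      # TLB miss: walk the tables
            self.tlb[vpage] = self.table[vpage]
        return self.tlb[vpage]

table_a = {1: 0x88}
table_b = {1: 0x99}

mmu = ToyMMU()
mmu.switch_address_space(table_a)
for _ in range(3):
    mmu.translate(1)              # one walk, then two TLB hits
mmu.switch_address_space(table_b)
mmu.translate(1)                  # the flush forces a fresh walk
print(mmu.walks)                  # prints 2
```

A switch between threads of one process would skip the flush, so the cached translations would keep working.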

Virtual memory sections and memory mapping area

A process has virtual memory which is mapped into RAM at run time, as given in the previous post.
Which part of the process virtual memory layout does mmap() use?
I have the following doubts:
If a memory mapping lies in unallocated memory inside the process's virtual memory, and virtual memory keeps one process from touching another process's memory, then how can memory mapping be used for interprocess communication (IPC)?
In an OS like Linux, does each individual process have its own heap, stack and memory-mapping sections, or do all processes share one common section for heap, stack and mmap?
Example:
If processes P1, P2 and P3 are running on Linux, will they all share a common table as given in the picture, or will each individual task have a separate table for each section?
On a 32-bit system, 2^32 bytes = 4 gigabytes of virtual memory is possible, with 1 gigabyte reserved for the kernel and 3 gigabytes for userspace applications. Can each individual process have up to 3 gigabytes of virtual memory, or must the sum over all userspace applications stay within 3 gigabytes (i.e. virtual memory size of (P1+P2+P3) <= 3 gigabytes)?
--
Learner
Using memory mapping for IPC works by mapping the same range of physical memory into two or more virtual address ranges in different processes. This works for communication because both processes are using the exact same memory cells (although they might "see" them differently, at different addresses). You change a value in one mapping, and it is instantly visible in the other mapping in a different process because it is the very same memory.
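A minimal sketch of this idea on a Unix-like system: the parent creates an anonymous shared mapping, forks, and reads back what the child wrote. It relies on Python's mmap.mmap(-1, n) creating a shared anonymous mapping by default on Unix; the function name is made up for illustration.

```python
import mmap
import os

def share_through_mapping(message):
    """Child writes into a shared anonymous mapping; parent reads it back."""
    region = mmap.mmap(-1, mmap.PAGESIZE)  # anonymous, shared by default on Unix
    pid = os.fork()
    if pid == 0:
        # Child process: write into the shared pages, then exit immediately.
        try:
            region[:len(message)] = message
        finally:
            os._exit(0)
    os.waitpid(pid, 0)  # parent: wait until the child has finished writing
    data = region[:len(message)]
    region.close()
    return data

print(share_through_mapping(b"hello"))  # prints b'hello'
```

Parent and child have separate address spaces after the fork, yet the parent sees the child's write because both mappings are backed by the same physical pages.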
Every process has its own independent stack and heap. The OS does not care about that at all, it only cares about pages. The heap and the stack are things that are implemented by the application (via the runtime). When you call a function like malloc, the allocator in the runtime either returns a block that it had already reserved earlier, or one that it has recycled (you called free earlier), or it asks the OS to reserve some more memory (sbrk or mmap). When you first access this memory, the OS sees a page fault, verifies that you are allowed to access this location (because you've reserved it) and then provides a valid page.
Every process can use (as in "reserve") the whole available address space (3GiB in your example). This does not interfere with any other process. Note that due to fragmentation and alignment, and because your executable and the stack take away a little bit, you will in practice not be able to allocate the full 3 GiB, but you can get close to it.
All processes together can use as much virtual memory as is available on the system (physical RAM plus swap space), but they can only use as much as there is physical memory available at the same time (minus a little bit for this and that, like unpageable kernel memory and such).

Where does virtual memory exist in linux?

A program is stored on flash/disk. For execution, the program is loaded into virtual memory and mapped to RAM by the virtual memory manager. During its execution the process is in RAM. Then where does virtual memory exist (where does it hold .text, .data, the stack and the heap)?
The virtual memory is a view of the RAM, plus maybe some swap space, provided by a virtual memory manager. Modern OSs have virtual memory managers and provide virtual memory to processes so that the executing program can behave as if it had a contiguous address space whose size is not limited by the actual RAM. The pages or blocks making up the virtual memory can be mapped anywhere in the RAM, so contiguous virtual pages need not be stored in contiguous RAM areas. They can also be swapped out to page space or swap space, waiting there until needed, whereupon they're read back by the OS and mapped to some RAM page.
When you say
During its execution the process is in RAM.
This is not entirely correct. Some or all memory pages that belong to the process may be swapped out, as explained.
One more word concerning the answers and comments that say that "virtual" means it doesn't exist. This makes no sense. On the contrary, according to Webster:
being such in essence or effect ...
Hence virtual memory is something (therefore, it exists!) that behaves as if it were memory.
Virtual memory is just like an illusion of RAM. It uses paging to make more memory available to the processes in the operating system than there is physical RAM.
Virtual memory means memory you can access with "normal" memory access methods, although it isn't clear where the data is actually stored.
It may be
actually in RAM
in a swap area
in another file (memory mapped file)
and access to it will be handled appropriately.
It is a layer of, well, virtualization so that you as a programmer don't have to worry about where the data is actually put.
The original purpose was mainly to be able to provide more memory to processes than we actually have, by extending it with swap space, but there are more uses:
The OS is free to use the RAM for whatever it deems necessary, e.g. caching. Under some circumstances, it may be more effective to use RAM for cache than for holding parts of a program which haven't been used for a long time.
Provide additional memory to a program when it requests it: if you call malloc(), the program's library may request the OS to provide a part of memory which can be attached seamlessly to the address space.
Avoid stack overflow: if the stack grows larger and larger, the respective memory section may be extended transparently as well, so that the program won't have to worry about it.
A system can even do "overcommitment" of memory: if a process requests a large amount of memory, the OS may say "yes, ok", i.e. provide the memory to the program. In the first place that means "allow the program to access a certain address space area", but this address space is not immediately backed by memory. Only when the program accesses this memory is the mapping done, and if this cannot be fulfilled, the program is killed by the out-of-memory (OOM) killer (at least, under Linux).
All this works by page-wise (1 page = 4 KiB) assignment of physical memory to a program, viewed via the program's address space, in the amount and at the frequency it is needed.
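Page-wise assignment means every request is rounded up to whole pages. A small sketch, assuming the 4 KiB page size mentioned above (the function name is made up for illustration):

```python
PAGE_SIZE = 4096  # 4 KiB, as in the text; other page sizes exist

def pages_needed(n_bytes, page_size=PAGE_SIZE):
    """Number of whole pages required to hold n_bytes (ceiling division)."""
    return -(-n_bytes // page_size)

print(pages_needed(1))     # prints 1
print(pages_needed(4096))  # prints 1
print(pages_needed(4097))  # prints 2
```

So even a one-byte mapping occupies a full page of address space, and a full page of RAM once it is touched.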

Why is the ELF execution entry point virtual address of the form 0x80xxxxx and not zero 0x0?

When executed, program will start running from virtual address 0x80482c0. This address doesn't point to our main() procedure, but to a procedure named _start which is created by the linker.
My Google research so far just led me to some (vague) historical speculations like this:
There is folklore that 0x08048000 once was STACK_TOP (that is, the stack grew downwards from near 0x08048000 towards 0) on a port of *NIX to i386 that was promulgated by a group from Santa Cruz, California. This was when 128MB of RAM was expensive, and 4GB of RAM was unthinkable.
Can anyone confirm/deny this?
As Mads pointed out, in order to catch most accesses through null pointers, Unix-like systems tend to leave the page at address zero "unmapped". Thus, such accesses immediately trigger a CPU exception, in other words a segfault. This is much better than letting the application go rogue. The exception vector table, however, can be at any address, at least on x86 processors (there is a special register for that, loaded with the lidt opcode).
The starting point address is part of a set of conventions which describe how memory is laid out. The linker, when it produces an executable binary, must know these conventions, so they are not likely to change. Basically, for Linux, the memory layout conventions are inherited from the very first versions of Linux, in the early 90's. A process must have access to several areas:
The code must be in a range which includes the starting point.
There must be a stack.
There must be a heap, with a limit which is increased with the brk() and sbrk() system calls.
There must be some room for mmap() system calls, including shared library loading.
Nowadays, the heap, where malloc() goes, is backed by mmap() calls which obtain chunks of memory at whatever address the kernel sees fit. But in older times, Linux was like previous Unix-like systems, and its heap required a big area in one uninterrupted chunk, which could grow towards increasing addresses. So whatever was the convention, it had to stuff code and stack towards low addresses, and give every chunk of the address space after a given point to the heap.
But there is also the stack, which is usually quite small but could grow quite dramatically on some occasions. The stack grows down, and when the stack is full, we really want the process to crash predictably rather than overwrite some data. So there had to be a wide area for the stack, with an unmapped page at the low end of that area. And lo! There is an unmapped page at address zero, to catch null pointer dereferences. Hence it was defined that the stack would get the first 128 MB of address space, except for the first page. This means that the code had to go after those 128 MB, at an address similar to 0x080xxxxx.
As Michael points out, "losing" 128 MB of address space was no big deal because the address space was very large with regards to what could be actually used. At that time, the Linux kernel was limiting the address space for a single process to 1 GB, over a maximum of 4 GB allowed by the hardware, and that was not considered to be a big issue.
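The entry address discussed here is stored in the ELF header, in the e_entry field. The sketch below reads that field from raw header bytes. The function name is hypothetical, and the header used in the demonstration is hand-built for illustration rather than taken from a real binary.

```python
import struct

def elf_entry_point(header):
    """Return e_entry from the first bytes of an ELF file."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class, ei_data = header[4], header[5]   # 1 = 32-bit, 1 = little-endian
    endian = "<" if ei_data == 1 else ">"
    # e_type (2 bytes), e_machine (2) and e_version (4) follow the 16-byte
    # e_ident, so e_entry starts at offset 24 in both the 32- and 64-bit layout.
    fmt = endian + ("I" if ei_class == 1 else "Q")
    return struct.unpack_from(fmt, header, 24)[0]

# Hand-built 32-bit little-endian header fragment with the address from the question.
fake_header = (
    b"\x7fELF" + bytes([1, 1, 1]) + bytes(9)      # e_ident (16 bytes)
    + struct.pack("<HHII", 2, 3, 1, 0x080482C0)   # e_type, e_machine, e_version, e_entry
)
print(hex(elf_entry_point(fake_header)))  # prints 0x80482c0
```

For a real binary, passing the first 64 bytes of the file should give the same value that readelf -h reports as the entry point address.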
Why not start at address 0x0? There are at least two reasons for this:
Because address zero is famously known as the NULL pointer, and is used by programming languages to sanity-check pointers. You can't use an address value for that if you're going to execute code there.
The actual content at address 0 is often (but not always) the exception vector table, and is hence not accessible in non-privileged modes. Consult the documentation of your specific architecture.
As for the entrypoint _start vs main:
If you link against the C runtime (the C standard libraries), the library wraps the function named main, so it can initialize the environment before main is called. On Linux, this covers the argc and argv parameters to the application, the environment variables, and probably some synchronization primitives and locks. It also makes sure that the value returned from main is passed on as the status code, by calling the _exit function, which terminates the process.

Resources