Kernel code responsible for virtual memory - Linux

I know that even a single process can have a virtual address space larger than the system's physical memory, so I just want to know which kernel code is responsible for creating virtual memory larger than physical memory.
Second, can I change the code to make it a little larger? Is there any performance benefit if I change the code to expand virtual memory?

All the memory management (and address space management) code is involved.
From the application's point of view, you should understand virtual memory better (the kernel controls the MMU and handles page faults), notably the mmap(2), mprotect(2), madvise(2), and execve(2) syscalls. Applications change their address space using these syscalls. You can use the proc(5) filesystem to query it. For instance, cat /proc/self/maps shows the address space of the process executing that cat.
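Here is a minimal sketch (Linux assumed; the 1 MiB size is arbitrary) that grows its own address space with an anonymous mmap(2) and then dumps /proc/self/maps so you can see the new region appear:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1 << 20;  /* 1 MiB, arbitrary */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    printf("new mapping at %p\n", p);

    /* dump the whole address space; the region above will be listed */
    FILE *f = fopen("/proc/self/maps", "r");
    if (f) {
        int c;
        while ((c = fgetc(f)) != EOF)
            putchar(c);
        fclose(f);
    }

    munmap(p, len);
    return 0;
}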
Read also Advanced Linux Programming. Learn more about VDSO & ASLR.
Within the kernel, the relevant source code is mostly in its mm/ subdirectory (but nearly every filesystem has mmap-specific code, and page faults are also related to scheduling, etc.).

Related

What structure is traversed to deallocate pages, when a process terminates? (Page Table or something else?)

I am trying to understand the nature of the operations carried out to deallocate physical memory when a process terminates.
Assume that the page table for the process is a multi-level tree structure, as implemented on Linux.
My current understanding is that the OS would need to deallocate each physical page frame mapped to whatever subset of the virtual addresses for which a page table entry (PTE) exists. This could happen via a traversal of the multi-level tree PT structure; for each PTE with its valid bit set, the physical frame descriptor corresponding to the PTE is added to the free list (which is used by the buddy allocation process).
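Conceptually, something like the sketch below. The structures here are hypothetical two-level stand-ins, not real Linux types (the actual kernel uses up to five levels, pgd/p4d/pud/pmd/pte, and helpers such as unmap_vmas and free_pgtables):

#include <stdbool.h>
#include <stddef.h>

#define ENTRIES 512  /* hypothetical number of entries per level */

struct pte { bool valid; unsigned long frame; };
struct pt  { struct pte entries[ENTRIES]; };
struct pgd { struct pt *tables[ENTRIES]; };

/* stub: would return the frame to the buddy allocator's free list */
static void free_frame(unsigned long frame) { (void)frame; }

/* walk every present leaf entry and release its physical frame */
static void teardown_address_space(struct pgd *pgd)
{
    for (size_t i = 0; i < ENTRIES; i++) {
        struct pt *pt = pgd->tables[i];
        if (!pt)
            continue;
        for (size_t j = 0; j < ENTRIES; j++)
            if (pt->entries[j].valid)
                free_frame(pt->entries[j].frame);
        /* the page holding the table itself is freed too (omitted) */
    }
}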
My question is: is a traversal of the page table actually done for this? An alternative, faster way would be to maintain, for each process, a linked list of the page frame descriptors allotted to it, and then traverse that list linearly at process termination. Is this more generic and faster method followed instead?
I'm not sure that pages get physically deallocated at process termination.
My understanding is that the MMU is managed by the kernel.
But each process has its own virtual address space, which the kernel changes:
for explicit syscalls changing it, e.g. mmap(2)
at program start through execve(2) (which can be thought of as several virtual mmap-s, as described by the segments of the ELF program executable file)
at process termination, as if each segment of the address space were virtually munmap-ed
And when a process terminates, it is its virtual address space (but not any physical RAM pages) which gets destroyed or deallocated!
So the page table (whatever definition you give to it) is probably managed inside the kernel by a few primitives like adding a segment to the virtual address space and removing a segment from it. The virtual space is lazily managed, since the kernel uses copy-on-write techniques to make fork fast.
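You can see those copy-on-write semantics from user space with a small demo (POSIX assumed): after fork(), parent and child initially share the same pages, and the child's write leaves the parent's copy untouched:

#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    static char buf[4096] = "original";

    pid_t pid = fork();
    if (pid == 0) {                   /* child: the write triggers a CoW copy */
        strcpy(buf, "child's copy");
        printf("child  sees: %s\n", buf);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent sees: %s\n", buf); /* still "original" */
    return 0;
}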
Don't forget that some pages (e.g. the code segments of shared libraries) are shared between processes, and that all tasks of a multi-threaded process share the same virtual address space.
BTW, the Linux kernel is free software, so you should study its source code (from http://kernel.org/). Look also at http://kernelnewbies.org; memory management happens inside the mm/ subtree of the kernel source.
There are lots of resources. Look into linux-kernel-slides, slides#245, for a start, and there are many books and resources about the Linux kernel. Look for vm_area_struct, pgtable, etc.

Why does the kernel need virtual addressing?

In Linux each process has its own virtual address space (e.g. 4 GB on a 32-bit system, of which 3 GB is reserved for the process and 1 GB for the kernel). This virtual addressing mechanism helps isolate the address space of each process. This is understandable in the case of processes, since there are many of them. But since we have only one kernel, why do we need virtual addressing for the kernel?
The reason the kernel is "virtual" is not to deal with paging as such; it is because the processor can only run in one mode at a time. So once you turn on paged memory mapping (bit 31 in CR0 on x86), the processor expects ALL memory accesses to go through the page-mapping mechanism. So, since we do want to access the kernel even after we have enabled paging (virtual memory), it needs to exist somewhere in the virtual space.
The "reserving" of memory is more about "easy way to determine if an address is kernel or user-space" than anything else. It would be perfectly possible to put a little bit of kernel at address 12345-34121, another bit of kernel at 101900-102400 and some other bit of kernel at 40000000-40001000. But it would make life difficult for every aspect of the kernel and userspace - there would be gaps/holes to deal with [there already are such holes/gapes, but having more wouldn't exactly help things]. By setting a fixed limit for "userspace is from here to here, kernel is from end of userspace to X", it makes life much easier in that respect. We can just say kernel = 0; if (address > max_userspace) kernel=1; in some code.
Of course, the kernel only takes up as much PHYSICAL memory as it will actually use, so the common thinking that "it's a waste to take up a whole gigabyte for the kernel" is wrong: the kernel itself is only a few megabytes (a dozen or so for a very "big" kernel). Loaded modules can easily add several more megabytes, and graphics drivers from ATI and nVidia easily another few megabytes just for the kernel module itself. The kernel also uses some memory to store "kernel data", such as tasks, queues, semaphores, files and other "stuff" the kernel has to deal with; a few megabytes are used for this as well.
Virtual memory management is the feature of Linux that enables multi-tasking without any limitation on the number of tasks or on the amount of memory used by each task. The Linux memory manager subsystem (along with the MMU hardware) provides VMM support, where memory and memory-mapped devices are accessed through virtual addresses. Within Linux everything, both kernel and user components, works with virtual addresses except when dealing with real hardware. That's where the memory manager comes in: it performs the virtual-to-physical address translation and points to the physical memory or device location.
A process is an abstract entity, defined by the kernel, to which system resources are allocated in order to execute a program. In Linux process management the kernel is an integrated part of a process's memory map. A process has two main regions, like two faces of one coin:
User Space view - contains the user program sections (Code, Data, Stack, Heap, etc.) used by the process
Kernel Space view - contains the kernel data structures that maintain information (PID, state, FDs, resource usage, etc.) about the process
Every process in a Linux system has a unique and separate User Space Region. This feature of the Linux VMM isolates each process's program sections from one another. But all processes in the system share the common Kernel Space Region. When a process needs service from the kernel it must execute the kernel code in this region; in other words, the kernel performs on behalf of the user process's request.

For ARM Linux, can threads in user space access virtual addresses of kernel space?

Virtual memory is split into two parts. Traditionally, 0-3 GB is for user space and 3-4 GB for kernel space.
My question:
Can a thread in user space access memory in kernel space?
According to the ARM datasheet, access control is handled by the Domain Access Control Register. But in the kernel source code, the domain value in the page table entries for user-space virtual memory is the same as in kernel-space page table entries.
In fact, your application might access page 0xFFFF0000, as it contains the swi handler and a couple of other userspace helpers. So no, the 3/1 split is nothing magical; it's just very easy for the kernel to manage.
Usually the kernel will set up all memory above 3 GB to be accessible only by the kernel domain itself. If a driver needs to share memory between user and kernel space, it will usually provide an mmap interface, which then creates an aliased mapping, so you have two virtual addresses for the same physical address. This only works reliably on VIPT-cache systems or with a LOT of careful explicit cache flushing. If you don't want this, you CAN hack the kernel to make a chunk of memory above the 3 GB split accessible to userspace, but then all userspace applications will share this memory. I've done this once for a special application on an ARMv5 system.
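From the user-space side, consuming such a driver mmap interface usually looks like the sketch below; /dev/mydev is a hypothetical device node, and the driver must implement the .mmap file operation for this to work:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/mydev", O_RDWR);  /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    /* map 4 KiB of driver memory; kernel and user space now share it */
    volatile unsigned char *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                          MAP_SHARED, fd, 0);
    if (shared == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    shared[0] = 0x42;  /* this store is visible to the driver */

    munmap((void *)shared, 4096);
    close(fd);
    return 0;
}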
Userspace code getting Kernel memory? The only kernel that ever allowed that was DOS and its archaic friends.
But back to the question, look at this example C code:
char c = 42;
*(char *)(unsigned long)c = 42;
We take one byte (a char) and assign it the numeric value 42. We then force that non-pointer value to be used as an address (the cast is how you hold the compiler at gunpoint), which will try to access the 42nd byte of virtual memory, which is almost definitely not your memory and, for the sake of this example, kernel memory. Guess what happens when you run this:
Segmentation fault
Linux has memory protection like any modern operating system. If you try to access the memory of another process, your process will be terminated before it can do anything (I'm less sure what happens under a debugger, though). Even if that memory was that of another userland process, you would still be terminated. I'm almost sure that even root programs can't access other programs' memory or kernel memory this way. The only way to access kernel memory is to be part of the kernel, or indirectly through the kernel's cooperation.

Is heap allocated on memory pages?

In a Linux x86-64 environment, is the entire process allocated in virtual memory pages? By the entire process I mean the text, data, bss, heap and stack.
Also, when libc calls brk, does the kernel return memory that is managed via pages by the virtual memory manager?
Lastly, can a process get memory on the heap that is not managed by the virtual memory manager? In other words, can a process get access to physical memory?
In a Linux x86-64 environment, is the entire process allocated in virtual memory pages?
Yes, all processes have a virtual address space, i.e. have their own page table and virtual memory to physical memory mapping pattern.
Also, when libc calls brk, does the kernel return memory that is managed via pages by the virtual memory manager?
Yes, in fact, if you aren't hacking the OS kernel, virtual memory is transparent to you.
Can a process get memory on the heap that is not managed by the virtual memory manager? In other words, can a process get access to physical memory?
No, to my knowledge you can't manage physical memory unless you run your program without support from an OS. Because a process has its own virtual address space, all memory-management actions operate on virtual memory.
A process has one or more tasks (scheduled by the kernel), which for a multi-threaded process are the process's threads (and for a non-threaded process, the single task running the process), and it has an address space (and some other resources, e.g. opened file descriptors).
Of course, the address space is in virtual memory. The kernel is allowed to swap pages out (e.g. to the swap zone of your disk). It tries hard to avoid doing that (swapping pages to disk is very slow, because disk access time is measured in dozens of milliseconds, while RAM access time is in tenths of a microsecond).
text & bss etc are virtual memory segments, which are memory mappings. You can think of a process space as a memory map. The mmap(2) system call is the way to modify it. When an executable is started with execve system call, the kernel establish a few mappings (e.g for text, data, bss, stack, ...). The sbrk(2) system call also change it. Most malloc implementations use mmap (at least for big enough zones) and sometimes sbrk.
You can prevent a memory range from being swapped out by locking it into RAM with the mlock(2) syscall, which usually requires root privilege. It is rarely useful in practice (unless you code real-time applications). There is also the msync syscall (to flush memory to disk); you can of course map a portion of a file into virtual memory (using mmap), change the protection with mprotect(2), remove a mapping with munmap(2), extend a mapping with mremap (a Linux-specific syscall), and you could even catch the SIGSEGV signal and handle it (often in a machine-specific way). The madvise(2) syscall lets you tune paging with hints.
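Here is a sketch (Linux assumed) of that last trick: protect a page with mprotect(2), catch the resulting SIGSEGV, and restore access inside the handler so the faulting write succeeds when retried. (Calling mprotect from a signal handler is not guaranteed async-signal-safe; this is a demo, not production code.)

#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static long pagesz;

static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    /* round the faulting address down to its page and unprotect it */
    void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(pagesz - 1));
    mprotect(page, pagesz, PROT_READ | PROT_WRITE);
}

int main(void)
{
    pagesz = sysconf(_SC_PAGESIZE);

    struct sigaction sa;
    sa.sa_sigaction = on_segv;   /* si_addr is valid because of SA_SIGINFO */
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    char *p = mmap(NULL, pagesz, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    p[0] = 1;  /* faults once; the handler re-enables access and it is retried */
    printf("write succeeded after handling SIGSEGV\n");
    return 0;
}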
You can examine the memory map of a process with pid 1234 by reading the /proc/1234/maps file (or also /proc/1234/smaps). (From inside an application, you can use /proc/self/ instead of /proc/1234/.) I suggest running in a terminal:
cat /proc/self/maps
which will show you the memory map of the process running that cat command. You can also use the pmap utility.
Most recent Linux kernels provide Address Space Layout Randomization (so two similar processes running the same program on the same input have different mmap-ed & malloc-ed addresses). You can disable it through /proc/sys/kernel/randomize_va_space.
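A quick way to observe ASLR (assuming a PIE executable, the default on most modern distributions): print a few addresses and run the binary twice; they should differ between runs while randomize_va_space is enabled:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int local;
    void *heap = malloc(16);
    printf("code : %p\n", (void *)main);    /* text segment */
    printf("stack: %p\n", (void *)&local);  /* stack */
    printf("heap : %p\n", heap);            /* heap */
    free(heap);
    return 0;
}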
Except in very rare circumstances (uClinux), processes only see virtual memory, which is mapped to physical memory by the kernel.
The kernel can be asked to make specific mappings that give a predictable physical address for a given virtual address; you need the appropriate capability to do that however, as this breaks down the process separation.
On execve, the current mappings are replaced by the loadable segments from the ELF file specified; these are mapped so that referenced pages are loaded from the ELF file (some initial readahead is also performed). The brk system call mainly extends the non-executable mapping with the highest addresses (excluding the stack mapping) by a few pages, allowing the process to access more virtual addresses without being sent a SIGSEGV.
The heap is generally managed by the process internally, but the virtual address space assigned to heap objects must be known to the virtual memory manager beforehand in order to create a mapping. malloc will generally look into its internal tables for a region that is already mapped and usable, and if none can be found, use either brk() or mmap() to create more mappings.
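You can watch the program break move with a sketch like the following; whether a given malloc call uses brk or mmap is implementation-defined (glibc, for instance, switches to mmap above a threshold of roughly 128 KiB by default):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    printf("break initially:          %p\n", sbrk(0));
    void *small = malloc(64);       /* typically served from the brk heap */
    printf("break after small malloc: %p\n", sbrk(0));
    void *big = malloc(1 << 20);    /* typically served via mmap instead */
    printf("break after big malloc:   %p\n", sbrk(0));
    printf("small=%p big=%p\n", small, big);
    free(small);
    free(big);
    return 0;
}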

Dynamic memory management under Linux

I know that under Windows there are API functions like GlobalAlloc(), which allocate memory and return a handle; this handle can then be locked to obtain a pointer, and unlocked again. While unlocked, the system can move this piece of memory around when it runs low on space, optimizing memory usage.
My question is: is there something similar under Linux, and if not, how does Linux optimize its memory usage?
Those Windows functions come from a time when all programs ran in the same address space, in real mode. Linux, and modern versions of Windows, run programs in separate address spaces, so the OS can move them around in RAM by remapping, in the page tables, which physical address a particular virtual address resolves to. There is no need to burden the programmer with such low-level details.
Even on Windows, it's no longer necessary to use such functions except when interacting with a small number of old APIs. I believe Raymond Chen's blog and book have some discussion of the topic if you are interested in more detail. E.g. here's part 4 of a series on the history of GlobalLock.
I'm not sure what the Linux equivalent is, but AT&T UNIX had "scatter/gather" memory management functions in the memory manager of the core OS. In a virtual-memory operating environment there are no absolute addresses, so applications don't have an equivalent function. The executable object loader (which loads an executable file into memory, where it becomes a process) uses memory addressing from the memory manager; this is all tracked in virtual memory blocks maintained in its page table (which contains the physical memory addresses). The bottom line is that your application's physical memory layout is likely never linear or directly accessible.