does linux load program-pages on demand? - linux

I wrote a program daemon that starts some sort of self-healing procedure when the hard drive controller of a computer crashes. This program already works fine, but I have concerns that the program (about 18KB compiled file size) may not be fully loaded into RAM by the operating system and that - when I'm really unlucky - some program pages have to be loaded from the disk exactly when the program has to come active and disk accesses are no longer possible.
After all, most of the time the program stays in an endless loop checking if everything is okay and 95% of the program code isn't used. So, I think, the Kernel may optimize the RAM usage by removing unused program pages from RAM.
So, my question: does Linux load and keep all program code pages into memory, making it unnecessary to access the hard disk again to run the program code itself, once the program has started?
Technical details: Linux Kernel 2.6.36+, about 1 GB of RAM, Debian 5, no swap space active
I already learned that I can prevent swapping by calling mlockall(MCL_CURRENT | MCL_FUTURE);, but wondering if I really need to update my machines.

No, the program code pages are memory mapped into the address space of the process, not so differently than any other mmap(), so that if you don't access these pages in a long time, they can eventually be removed from RAM. To avoid it, just use the mlockall() call.

From mlockall manual
mlockall() locks all pages mapped into the address space of the calling process.
This includes the pages of the code, data and stack segment, as well as shared
libraries, user space kernel data, shared memory, and memory-mapped files. All
mapped pages are guaranteed to be resident in RAM when the call returns success‐
fully; the pages are guaranteed to stay in RAM until later unlocked.
So, if locked, pages will be here. However, modifying mounted hard disk partition is always great risk, regardless of any kind of locks.

Related

When and how is mmap'ed memory swapped in and out?

In my understanding, mmap'ing a file that fits into RAM will be like having the file in memory.
Say that we have 16G of RAM, and we first mmap a 10G file that we use for a while. This should be fairly efficient in terms of access. If we then mmap a second 10G file, will that cause the first one be swapped out? Or parts of it? If so, when will this happen? At the mmap call, or on accessing the memory area of the newly loaded file?
And if we want to access the memory of the pointer for the first file again, will that make it load the swap the file in again? So, say we alternate reading between memory corresponding to the first file and the second file, will that lead to disastrous performance?
Lastly, if any of this is true, would it be better to mmap several smaller files?
As has been discussed, your file will be accessed in pages; on x86_64 (and IA32) architectures, a page is typically 4096 bytes. So, very little if any of the file will be loaded at mmap time. The first time you access some page in either file, then the kernel will generate a page fault and load some of your file. The kernel may prefetch pages, so more than one page may be loaded. Whether it does this depends on your access pattern.
In general, your performance should be good if your working set fits in memory. That is, if you're only regularly accesning 3G of file across the two files, so long as you have 3G of RAM available to your process, things should generally be fine.
On a 64-bit system there's no reason to split the files, and you'll be fine if the parts you need tend to fit in RAM.
Note that if you mmap an existing file, swap space will not be required to read that file. When an object is backed by a file on the filesystem, the kernel can read from that file rather than swap space. However, if you specify MMAP_PRIVATE in your call to mmap, swap space may be required to hold changed pages until you call msync.
Your question does not have a definitive answer, as swapping in/out is handled by your kernel, and each kernel will have a different implementation (and linux itself offers different profiles depending on your usage, RT, desktop, server…)
Generally speaking, though, whatever you load in memory is done using pages, so your mmap'ed file in memory is loaded (and offloaded) by pages between all the levels of memory (the caches, RAM and swap).
Then if you load two 10GB data into memory, you'll have parts of both between the RAM and your Swap, and the kernel will try to keep in RAM the pages you're likely to use now and guess what you'll load next.
What it means is that if you do truly random access to a few bytes of data in both files alternatively, you should expect awful performance, if you access contiguous chunks sequentially from both files alternatively, you should expect decent performance.
You can read some more details about kernel paging into virtual memory theory:
https://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html
https://en.wikipedia.org/wiki/Paging

Does binary stay in memory after program exits?

I know when a program first starts, it has massive page faults in the beginning since the code is not in memory, and thus need to load code from disk.
What happens when a program exits? Does the binary stay in memory? Would subsequent invocations of the program find that the code is already in memory and thus not have page faults (assuming nothing runs in between and pages stuff out to disk)?
It seems like the answer is no from running some experiments on my Linux machine. I ran some program over and over again, and observed the same number of page faults every time. It's a relatively quiet machine so I doubt stuff is getting paged out in between invocations. So, why is that? Why doesn't executable get to stay in memory?
There are two things to consider here:
1) The content of the executable file is likely kept in the OS cache (disk cache). While that data is still in the OS cache, every read for that data will hit the cache and the OS will honor the request without needing to re-read the file from disk
2) When a process exits, the OS unmaps every memory page mapped to a file, frees any memory (in general, releases every resource allocated by the process, including other resources, such as sockets, and so on). Strictly speaking, the physical memory may be zeroed, but not quite required (still, the security level of the OS may require to zero a page that is not used anymore - probably Windows NT, 2K, XP, etc, do that - see this Does Windows clear memory pages?). Another invocation of the same executable will create a brand new process which will map the same file in the memory, but the first access to those pages will still trigger page faults because, in the end, it is a new process, a different memory mapping. So yes, the page faults occur, but they are a lot cheaper for the second instance of the same executable compared to the first.
Of course, this is only about the read-only parts of the executable (the segments/modules containing the code and read-only data).
One may consider another scenario: forking. In this case, every page is marked as copy-on-write. When the first write occurs on each memory page, a hardware exception is triggered and intercepted by the OS memory manager. The OS determines if the page in question is allowed to be written (eg: if it is the stack, heap or any writable page in general) and if so, it allocates memory and copies the original content before allowing the process to modify the page - in order to preserve the original data in the other process. And yes, there is still another case - shared memory, where the exact physical memory is mapped to two or more processes. In this case, the copy-on-write flag is, of course, not set on the memory pages.
Hope this clarifies what is going on with the memory pages.
What I highly suspect is that parts, information blobs are not promptly erased from RAM unless there's a new request for more RAM from actually running code. For that part what probably happens is OS reusing OS dependent bits from RAM, on a next execution e.g. I think this is true for OS initiated resources (and probably not for all resources but some).
Actually most of your questions are highly implementation-dependant. But for most used OS:
What happens when a program exits? Does the binary stay in memory?
Yes, but the memory blocks are marked as unused (and thus could be allocated to other processes).
Would subsequent invocations of the program find that the code is
already in memory and thus not have page faults (assuming nothing runs
in between and pages stuff out to disk)?
No, those blocks are considered empty. Some/all blocks might have been overwritten already.
Why doesn't executable get to stay in memory?
Why would it stay? When a process is finished, all of its allocated resources are freed.
One of the reasons is that one generally wants to clear everything out on a subsequent invocation in case their was a problem in the previous.
Plus, the writeable data must be moved out.
That said, some systems do have mechanisms for keeping executable and static data in memory (possibly not linux). For example, the VMS operating system allows the system manager to install executables and shared libraries so that they remain in memory (paging allowed). The same system can be used to create create writeable shared memory allowing interprocess communication and for modifications to the memory to remain in memory (possibly paged out).

Allocating "temporary" memory (in Linux)

I'm trying to find any system functionality that would allow a process to allocate "temporary" memory - i.e. memory that is considered discardable by the process, and can be take away by the system when memory is needed, but allowing the process to benefit from available memory when possible. In other words, the process tells the system it's OK to sacrifice the block of memory when the process is not using it. Freeing the block is also preferable to swapping it out (it's more expensive, or as expensive, to swap it out rather then re-constitute its contents).
Systems (e.g. Linux), have those things in the kernel, like F/S memory cache. I am looking for something like this, but available to the user space.
I understand there are ways to do this from the program, but it's really more of a kernel job to deal with this. To some extent, I'm asking the kernel:
if you need to reduce my, or another process residency, take these temporary pages off first
if you are taking these temporary pages off, don't swap them out, just unmap them
Specifically, I'm interested on a solution that would work on Linux, but would be interested to learn if any exist for any other O/S.
UPDATE
An example on how I expect this to work:
map a page (over swap). No difference to what's available right now.
tell the kernel that the page is "temporary" (for the lack of a better name), meaning that if this page goes away, I don't want it paged in.
tell the kernel that I need the temporary page "back". If the page was unmapped since I marked it "temporary", I am told that happened. If it hasn't, then it starts behaving as a regular page.
Here are the problems to have that done over existing MM:
To make pages not being paged in, I have to allocate them over nothing. But then, they can get paged out at any time, without notice. Testing with mincore() doesn't guarantee that the page will still be there by the time mincore() finishes. Using mlock() requires elevated privileges.
So, the closest I can get to this is by using mlock(), and anonymous pages. Following the expectations I outlined earlier, it would be:
map an anonymous, locked page. (MAP_ANON|MAP_LOCKED|MAP_NORESERVE). Stamp the page with magic.
for making page "temporary", unlock the page
when needing the page, lock it again. If the magic is there, it's my data, otherwise it's been lost, and I need to reconstitute it.
However, I don't really need for pages to be locked in RAM when I'm using them. Also, MAP_NORESERVE is problematic if memory is overcommitted.
This is what the VmWare ESXi server aka the Virtual Machine Monitor (VMM) layer implements. This is used in the Virtual Machines and is a way to reclaim memory from the virtual machine guests. Virtual machines that have more memory allocated than they actually are using/require are made to release/free it to the VMM so that it can assign it back to the Virtual Machines guests that are in need of it.
This technique of Memory Reclamation is mentioned in this paper: http://www.vmware.com/files/pdf/mem_mgmt_perf_vsphere5.pdf
On similar lines, something similar you can implement in your kernel.
I'm not sure to understand exactly your needs. Remember that processes run in virtual memory (their address space is virtual), that the kernel is dealing with virtual to physical address translation (using the MMU) and with paging. So page fault can happen at any time. The kernel will choose to page-in or page-out at arbitrary moments - and will choose which page to swap (only the kernel care about RAM, and it can page-out any physical RAM page at will). Perhaps you want the kernel to tell you when a page is genuinely discarded. How would the kernel take away temporary memory from your process without your process being notified ? The kernel could take away and later give back some RAM.... (so you want to know when the given back memory is fresh)
You might use mmap(2) with MAP_NORESERVE first, then again (on the same memory range) with MAP_FIXED|MAP_PRIVATE. See also mincore(2) and mlock(2)
You can also later use madvise(2) with MADV_WONTNEED or MADV_WILLNEED etc..
Perhaps you want to mmap some device like /dev/null, /dev/full, /dev/zero or (more likely) write your own kernel module providing a similar device.
GNU Hurd has an external pager mechanism... You cannot yet get exactly that on Linux. (Perhaps consider mmap on some FUSE mounted file).
I don't understand what you want to happen when the kernel is paging out your memory, and what you want to happen when the kernel is paging in again such a page because your process is accessing it. Do you want to get a zero-ed page, or a SIGSEGV ?

Preventing minor page faults in real time process on linux

I need to make the process to run in real time as much as possible.
All the communication is done via shared memory - memory mapped files - no system calls at all - it uses busy waiting on shared memory.
The process runs under real time priority and all memory is locked with mlockall(MCL_CURRENT|MCL_FUTURE) which succeeds and process has enough ulimits
to have all the memory locked.
When I run it on it perf stat -p PID I still get counts of minor page faults.
I tested this with both process affinity and without.
Question:
Is it possible to eliminate them at all - even minor page faults?
I solved this problem by switching from memory mapped files to POSIX shared memory shm_open + memory locking.
If I understand the question correct, avoiding the minor page faults completely isn't possible. In most modern OS's including Linux, the OS doesn't load all the text and data segments into memory when the programs start. It allocates internal data structures and pages are essentially faulted in when the text and data are needed. This causing a page fault physical memory is made available to the process, swapping page from backing store. Therefore, minor page fault could be avoided without accessing backing store which might not possible.

Accessing any memory locations under Linux 2.6.x

I'm using Slackware 12.2 on an x86 machine. I'm trying to debug/figure out things by dumping specific parts of memory. Unfortunately my knowledge on the Linux kernel is quite limited to what I need for programming/pentesting.
So here's my question: Is there a way to access any point in memory? I tried doing this with a char pointer so that It would only be a byte long. However the program crashed and spat out something in that nature of: "can't access memory location". Now I was pointing at the 0x00000000 location which where the system stores it's interrupt vectors (unless that changed), which shouldn't matter really.
Now my understanding is the kernel will allocate memory (data, stack, heap, etc) to a program and that program will not be able to go anywhere else. So I was thinking of using NASM to tell the CPU to go directly fetch what I need but I'm unsure if that would work (and I would need to figure out how to translate MASM to NASM).
Alright, well there's my long winded monologue. Essentially my question is: "Is there a way to achieve this?".
Anyway...
If your program is running in user-mode, then memory outside of your process memory won't be accessible, by hook or by crook. Using asm will not help, nor will any other method. This is simply impossible, and is a core security/stability feature of any OS that runs in protected mode (i.e. all of them, for the past 20+ years). Here's a brief overview of Linux kernel memory management.
The only way you can explore the entire memory space of your computer is by using a kernel debugger, which will allow you to access any physical address. However, even that won't let you look at the memory of every process at the same time, since some processes will have been swapped out of main memory. Furthermore, even in kernel mode, physical addresses are not necessarily the same as the addresses visible to the process.
Take a look at /dev/mem or /dev/kmem (man mem)
If you have root access you should be able to see your memory there. This is a mechanism used by kernel debuggers.
Note the warning: Examining and patching is likely to lead to unexpected results when read-only or write-only bits are present.
From the man page:
mem is a character device file that is an image of
the main memory of the computer. It may be used, for
example, to examine (and even patch) the system.
Byte addresses in mem are interpreted as physical
memory addresses. References to nonexistent locations
cause errors to be returned.
...
The file kmem is the same as mem, except that the
kernel virtual memory rather than physical memory is
accessed.

Resources