I want to know what files are cached in the page cache, and I want to free the cached pages of a specific file programmatically. I can write a kernel module or even modify the kernel code if needed. Can anyone give me some clues?
Firstly, the kernel does not maintain a master list of all files in the page cache, because it has no need for such information. Instead, given an inode you can look up the associated page cache pages, and vice-versa.
For each page cache struct page, page_mapping() will return the struct address_space that it belongs to. The host member of struct address_space identifies the owning struct inode, and from there you can get the inode number and device.
mincore() returns a vector that indicates whether pages of the calling process's virtual memory are resident in core (RAM), and so will not cause a disk access (page fault) if referenced. The kernel returns residency information about the pages starting at the address addr, and continuing for length bytes.
To test whether a file currently mapped into your process is in cache, call mincore with its mapped address.
To test whether an arbitrary file is in cache, open and map it, then follow the above.
There is a proposed fincore() system call which would not require mapping the file first, but (at this point in time) it's not yet generally available.
(And then madvise(MADV_DONTNEED)/fadvise(FADV_DONTNEED) can drop parts of a mapping/file from cache.)
You can free the contents of a file from the page cache under Linux by using
posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);

(passing offset 0 and len 0 covers the whole file). As of Linux 2.6 this immediately gets rid of the parts of the page cache that are caching the given file (or the given range of it); the call blocks until the operation is complete, but that behaviour is not guaranteed by POSIX.
Note that it has no effect on pages that have been modified (dirty pages); in that case you want to call fdatasync() or the like first, so that the pages are clean before you drop them.
EDIT: Sorry, I didn't fully read your question. I don't know how to tell which files are currently in the page cache. Sorry.
I'm a beginner in Linux and virtual memory, still struggling to understand the relationship between virtual memory and executable object files.
Let's say we have an executable object file a.out stored on the hard disk, and let's say that originally a.out has a .data section with a global variable whose value is 2018.
When the loader runs, it allocates a contiguous chunk of virtual pages, marks them as invalid (i.e., not cached), and points their page table entries to the appropriate locations in a.out. The loader never actually copies any data from disk into memory; the data is paged in automatically and on demand by the virtual memory system the first time each page is referenced.
My question is: suppose the program changes the value of the global variable from 2018 to 2019 at run time. It seems that the virtual page containing the global variable will eventually be paged out to disk, which would mean the .data section now holds 2019. So do we change the executable object file, which is not supposed to be changed? Otherwise, would we get a different value each time we finish and run the program again?
In general (not specifically for Linux)...
When an executable file is started, the OS (kernel) creates a virtual address space and an (initially empty) process, and examines the executable file's header. The executable file's header describes "sections" (e.g. .text, .rodata, .data, .bss, etc) where each section has different attributes - if the contents of the section should be put in the virtual address space or not (e.g. is a symbol table or something that isn't used at run-time), if the contents are part of the file or not (e.g. .bss), and if the area should be executable, read-only or read/write.
Typically, (used parts of) the executable file are cached by the virtual file system; and pieces of the file that are already in the VFS cache can be mapped (as "copy on write") into the new process' virtual address space. For parts that aren't already in the VFS cache, those pieces of the file can be mapped as "need fetching" into the new process' virtual address space.
Then the process is started (given CPU time).
If the process reads data from a page that hasn't been loaded yet; the OS (kernel) pauses the process, fetches the page from the file on disk into the VFS cache, then also maps the page as "copy on write" into the process; then allows the process to continue (allows the process to retry the read from the page that wasn't loaded, which will work now that the page is loaded).
If the process writes to a page that is still "copy on write"; the OS (kernel) pauses the process, allocates a new page and copies the original page's data into it, then replaces the original page with the process' own copy; then allows the process to continue (allows the process to retry the write, which will work now that the process has its own copy).
If the process writes to data from a page that hasn't been loaded yet; the OS (kernel) combines both of the previous things (fetches original page from disk into VFS cache, creates a copy, maps the process' copy into the process' virtual address space).
If the OS starts to run out of free RAM, then:
- pages of file data that are in the VFS cache but aren't shared as "copy on write" with any process can be freed in the VFS without doing anything else. Next time the file is used those pages will be fetched from the file on disk into the VFS cache.
- pages of file data that are in the VFS cache and are also shared as "copy on write" with any process can be freed in the VFS, and the copies in any/all processes marked as "not fetched yet". Next time the file is used (including when a process accesses the "not fetched yet" pages) those pages will be fetched from the file on disk into the VFS cache and then mapped as "copy on write" in the processes.
- pages of data that have been modified (either because they were originally "copy on write" but got copied, or because they weren't part of the executable file at all - e.g. .bss section, the executable's heap space, etc) can be saved to swap space and then freed. When the process accesses the pages again they will be fetched from swap space.
Note: If the executable file is stored on unreliable media (e.g. a potentially scratched CD), a "smarter than average" OS may load the entire executable file into VFS cache and/or swap space initially. There's no sane way to handle a "read error from memory-mapped file" while the process is using the file, other than making the process crash (e.g. SIGSEGV) and making it look like the executable was buggy when it was not; pre-loading improves reliability because you're depending on more reliable swap and not on a less reliable scratched CD.

Also, if the OS guards against file corruption or malware (e.g. has a CRC or digital signature built into executable files), then the OS may (should) load everything into memory (VFS cache) to check the CRC or digital signature before allowing the executable to be executed. And (for secure systems, in case the file on disk is modified while the executable is running) when freeing RAM it may store unmodified pages in "more trusted" swap space (the same as it would if the page was modified) to avoid fetching the data from the original "less trusted" file - partly because you don't want to do the whole digital-signature check every time a page is loaded from the file.
My question is: suppose the program changes the value of the global variable from 2018 to 2019 at run time. It seems that the virtual page containing the global variable will eventually be paged out to disk, which would mean the .data section now holds 2019. So do we change the executable object file, which is not supposed to be changed?
The page containing 2018 will begin as "not fetched", then (when it's accessed) be loaded into the VFS cache and mapped into the process as "copy on write". At either of these points the OS may free the memory and fetch the (unchanged) data from the executable file on disk if it's needed again.
When the process modifies the global variable (changes it to contain 2019) the OS creates a copy of it for the process. After this point, if the OS wants to free the memory the OS needs to save the page's data in swap space, and load the page's data back from swap space if it's accessed again. The executable file is not modified and (for that page, for that process) the executable file isn't used again.
We were given a project in which we implement memory checkpointing. The basic version just walks over pages and dumps the data found to a file (also checking info about each page: private, locked, etc.); the incremental version only looks at data that has changed since the previous checkpoint and dumps that to a file. My understanding is that we are pretty much building a smaller-scale version of memory save states (I could be wrong, but that's what I'm getting from this). We are currently using a VMA-based approach to walk the given range (as long as it doesn't go below or above the user-space range, i.e. no kernel addresses) in order to report the data found in the pages we encounter. I know that struct vm_area_struct is used to access VMAs (via functions including find_vma()). My issue is that I'm not sure how to check the individual pages within the given range of addresses (supplied by the user) using this vm_area_struct. I only know about struct page (that's pretty much it), and I'm still learning the kernel in detail, so I'm bound to miss things. Is there something I'm missing about vm_area_struct when accessing pages?
Second question is, what do we use to iterate through each individual page within the found vma (from given start and end address)?
VMAs contain the virtual addresses of their first and (one after their) last bytes:
struct vm_area_struct {
        /* The first cache line has the info for VMA tree walking. */

        unsigned long vm_start;    /* Our start address within vm_mm. */
        unsigned long vm_end;      /* The first byte after our end address
                                      within vm_mm. */
        ...
};
This means that in order to get the page's data you first need to figure out in what context your code is running.
If it's within the process context, then a simple copy_from_user approach might be enough to get the actual data, plus a page walk (through the entirety of your PGD/PUD/PMD/PTE) to get the PFN and then turn it into a struct page. (Take care not to use the seductive virt_to_page(addr), as that only works on kernel addresses.)
In terms of iteration, you need only iterate in PAGE_SIZE steps over the virtual addresses you get from the VMAs.
Note that this assumes the pages are actually mapped. If not (!pte_present(pte)), you might need to remap the page yourself to access the data.
If your check is running in some other context (such as a kthread or an interrupt), you must bring the page back from swap before accessing it, which is a whole different case. For the easy way in, I'd look here: https://www.kernel.org/doc/gorman/html/understand/understand014.html to understand how to handle swap lookup and retrieval.
In CSAPP 2nd, Chapter 9, section 8 (in page 807)
Anonymous file: An area can also be mapped to an anonymous file, created by the kernel, that contains all binary zeros. The first time the CPU touches a virtual page in such an area, the kernel finds an appropriate victim page in physical memory, swaps out the victim page if it is dirty, overwrites the victim page with binary zeros, and updates the page table to mark the page as resident. Notice that no data is actually transferred between disk and memory. For this reason, pages in areas that are mapped to anonymous files are sometimes called demand-zero pages.
When the victim page is dirty, I think it should be written back to disk. So why does it say "Notice that no data is actually transferred between disk and memory"?
Unfortunately, this is bad terminology on the part of Unix. Part of the problem is the historical lack of a hard file system (corrected in some Unix variants). In an idealized model of paging, user-created files can serve as page files. The static data (including code) can be paged directly from the executable file. The read/write data is paged from the page file. In that sense, the mapping is anonymous because there really is not a file but rather a portion of a page file.
In most Unix variants, there is no page FILE but rather a swap partition. This is due to the poor design of the original Unix file system, which has lived on for decades. The traditional Unix file system does not have the concept of a contiguous file. This makes it impossible to do logical I/O to a page file. Therefore, traditional Unix uses a swap partition instead.
Even if you map a named file, on many Unix variants that mapping only matters for the first READ. In the case of an anonymous mapping, the first read creates a demand-zero page. When the page is written back to disk, it goes to the swap partition in both cases. From the Unix perspective, calling this an "anonymous" mapping kind of makes sense, but from the conceptual point of view (where one expects a memory-to-file mapping to be two-way) it makes no sense at all.
Is there a way that I can create a copy-on-write mapping via MAP_PRIVATE, write some data (ie, dirtying some pages), and then discard my changes, without using munmap and re-mmaping? The goal is to maintain the same virtual address for the given mapping (something not guaranteed to happen if I unmap & then mmap the same file again), but to discard all of my COW changes at once.
My understanding is that attempting to re-map the space via hinting the address and using the MAP_FIXED flag may have this effect; however I'm not sure if my interpretation of the MAP_FIXED docs is correct, or if this behaviour is guaranteed.
To quote from the mmap(2) docs:
If the memory region specified by addr and len overlaps pages of any existing
mapping(s), then the overlapped part of the existing mapping(s) will be
discarded.
Does "discarded" in this case mean that any COW pages will be thrown away, and new reads from the corresponding pages will fault and reflect changes on disk?
If you perform a mmap operation which overlaps with existing mappings, the Linux kernel will turf the overlapping part of the existing mappings as if an unmap had been done on them first. So for instance if you map a frame buffer where a shared library used to be, that memory now has nothing to do with the shared library; it points to the frame buffer.
The underlying page object from the removed mapping lives on independently of the mapping: pages are reference counted objects. When two maps share a view of the same page, it's simply due to the same page being "installed" in the different views. When a page is made dirty, and then unmapped, this does not create a dependency whereby the dirty page must be written out prior to the new mapping; the virtual memory can already be re-assigned to a new mapping (such as a piece of a graphics frame buffer) before the original dirty page (for example, part of a file-backed shared mapping) is flushed out.
About throwing away a mapping: I don't think you can do this. That is to say, if you have a mapping which is supposed to flush dirty pages to an underlying file, you cannot write to that memory and then unmap it quickly (or mmap something over it) in hopes that the write is never done. In Linux's madvise API there is an MADV_REMOVE operation which seems relevant, but according to the manual page it only works on tmpfs and shmfs. I think the only way to block the write from happening would be to do the time-honored ritual known as "dive for the power switch".
There is a way to map a file object such that changes are not propagated: namely, MAP_PRIVATE (opposite to MAP_SHARED). MAP_PRIVATE is needed, for instance, by debuggers like gdb which need to be able to put a breakpoint into an executable or shared library, without throwing a trap instruction into every instance of that executable or library in every running process (and the copy on disk!).
If you have a MAP_PRIVATE with modified parts, and you unmap it (or those parts) or map something over top of them, I believe they will be discarded. Those pages should have been subject to copy-on-write, and so the process which made them dirty should hold the one and only reference. When they are unmapped, their refcount drops to zero and since they are private pages, they get turfed.
The virtual address remains the same even after the copy. Only the physical address changes (and the associated memory mapping page registers).
After the process has written to the page, it is too late to undo it. The copy occurs during the first write to the memory region.
I'm wondering if there's a way to write-protect every page in a Linux process' address space (from inside the process itself, by way of mprotect()). By "every page", I really mean every page of the process's address space that might be written to by an ordinary program running in user mode -- so, the program text, the constants, the globals, and the heap -- but I would be happy with just constants, globals, and heap. I don't want to write-protect the stack -- that seems like a bad idea.

One problem is that I don't know where to start write-protecting memory. Looking at /proc/pid/maps, which shows the sections of memory in use for a given pid, they always seem to start with the address 0x08048000, with the program text. (In Linux, as far as I can tell, the memory of a process is laid out with the program text at the bottom, then constants above that, then globals, then the heap, then an empty space of varying size depending on the size of the heap or stack, and then the stack growing down from the top of memory at virtual address 0xffffffff.) There's a way to tell where the top of the heap is (by calling sbrk(0), which simply returns a pointer to the current "break", i.e., the top of the heap), but not really a way to tell where the heap begins.

If I try to protect all pages from 0x08048000 up to the break, I eventually get an "mprotect: Cannot allocate memory" error. I don't know why mprotect would be allocating memory anyway -- and Google is not very helpful. Any ideas?

By the way, the reason I want to do this is that I want to create a list of all pages that are written to during a run of the program. The way I can think of to do this is to write-protect all pages, let any attempted write cause a write fault, then implement a write-fault handler that adds the page to the list and removes the write protection. I think I know how to implement the handler, if only I could figure out which pages to protect and how to do it.

Thanks!
You receive ENOMEM from mprotect() if you try to call it on pages that aren't mapped.
Your best bet is to open /proc/self/maps, and read it a line at a time with fgets() to find all the mappings in your process. For each writeable mapping (indicated in the second field) that isn't the stack (indicated in the last field), call mprotect() with the right base address and length (calculated from the start and end addresses in the first field).
Note that you'll need to have your fault handler already set up at this point, because the act of reading the maps file itself will likely cause writes within your address space.
Start simple. Write-protect a few pages and make sure your signal handler works for those pages. Then worry about expanding the scope of the protection. For example, you probably do not need to write-protect the code section: operating systems can implement write-xor-execute protection semantics on memory that prevent code sections from ever being written to:
http://en.wikipedia.org/wiki/Self-modifying_code#Operating_systems