We are given a project where we implement memory checkpointing. The basic version just walks over pages and dumps the data found to a file (also recording information about each page: private, locked, etc.); the incremental version only dumps data that has changed since the previous checkpoint. My understanding is that we are essentially building a small-scale version of memory save states (I could be wrong, but that's what I'm getting from this). We are currently using a VMA-based approach: we walk the given range (as long as it stays within user space, i.e. no kernel addresses and nothing below the user-space range) and report the data found in the pages we encounter. I know that struct vm_area_struct is used to access VMAs (via functions such as find_vma()). My issue is that I'm not sure how to check the individual pages within the user-supplied address range once I have the vm_area_struct. I only know about struct page, and I'm still learning the kernel in detail, so I'm bound to miss things. Is there something I'm missing about vm_area_struct when it comes to accessing pages?
Second question: what do we use to iterate through each individual page within the found VMA (given the start and end addresses)?
VMAs contain the virtual addresses of their first byte and of the byte one past their last byte:
struct vm_area_struct {
        /* The first cache line has the info for VMA tree walking. */

        unsigned long vm_start;   /* Our start address within vm_mm. */
        unsigned long vm_end;     /* The first byte after our end address
                                     within vm_mm. */
        ...
This means that, in order to get the page's data, you first need to figure out in what context your code is running.
If it's within process context, then a simple copy_from_user approach might be enough to get the actual data, plus a page-table walk (through the entire PGD/PUD/PMD/PTE hierarchy) to get the PFN, which you can then turn into a struct page. (Take care not to use the seductive virt_to_page(addr), as that only works on kernel addresses.)
In terms of iteration, you only need to iterate in PAGE_SIZE steps over the virtual addresses you get from the VMAs.
Note that this assumes the pages are actually mapped. If a page is not (!pte_present(pte)), you may need to map it in yourself before you can access the data.
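Putting the two answers together, the walk might look roughly like the sketch below. This is kernel-side pseudocode, not compilable standalone: the helper name addr_to_page is mine, it assumes process context with the appropriate mm locks held, and it uses the classic 4-level layout (newer kernels insert a p4d level between pgd and pud, and pte_offset_map can fail there).

```
/* Hypothetical helper: virtual address -> struct page, or NULL. */
static struct page *addr_to_page(struct mm_struct *mm, unsigned long addr)
{
        pgd_t *pgd;
        pud_t *pud;
        pmd_t *pmd;
        pte_t *pte;
        struct page *page = NULL;

        pgd = pgd_offset(mm, addr);
        if (pgd_none(*pgd) || pgd_bad(*pgd))
                return NULL;
        pud = pud_offset(pgd, addr);
        if (pud_none(*pud) || pud_bad(*pud))
                return NULL;
        pmd = pmd_offset(pud, addr);
        if (pmd_none(*pmd) || pmd_bad(*pmd))
                return NULL;
        pte = pte_offset_map(pmd, addr);
        if (!pte)
                return NULL;
        if (pte_present(*pte))               /* mapped and in RAM? */
                page = pfn_to_page(pte_pfn(*pte));
        pte_unmap(pte);
        return page;
}
```

The iteration is then just `for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE)`, calling a helper like this (clamped to the user-supplied start/end) for each step.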
If your check runs in some other context (such as a kthread or an interrupt), you must bring the page back from swap before accessing it, which is a whole different case. If you want the easy way, I'd look here: https://www.kernel.org/doc/gorman/html/understand/understand014.html to understand how to handle swap lookup / retrieval.
I have been reading through some ARM code in order to try to understand what exactly the cpu_domain field inside struct thread_info represents. In an attempt to understand how it is used, I looked through the places where the field is referenced. I am trying to understand the following:
Why is the field present in thread_info? I can see that when a context switch happens, the value is set / read, but why? What purpose does the field serve?
I had a look at the function modify_domain that seems to retrieve the domain value and set it in coprocessor CP15, c3. But where is this used? Any system call that takes in addresses verifies it against addr_limit, and page tables have the supervisor bit to check if reads/writes are allowed from userspace. So where do ARM domains come into the picture?
How is it possible for a page whose present bit is set to have a kpagecount equal to zero?
According to the Linux documentation:
/proc/kpagecount. This file contains a 64-bit count of the number
of times each page is mapped, indexed by PFN.
I wrote a toy application that prints out the pagemap entries for all VMAs of a program.
I know that kpagecount can be above 1 when a page is shared among different processes. Typical examples are copy-on-write pages after fork calls, or libraries used by multiple programs.
The case where it is zero, as I understand it, is when the last program using that particular page no longer needs it, so the kernel can reclaim the page.
Is that correct?
However, in the case of my toy app, I haven't freed anything yet. So it doesn't make sense for a heap page to be in RAM (present bit set) while its kpagecount is 0. Is this counter accurate? Or am I missing something else?
Cheers!
So, a week later, I've found the answer in the Understanding the Linux Kernel book.
The _count variable of the page descriptor structure holds a reference counter for the page:
if it is set to -1, the page frame is free and can be reclaimed by the kernel
if it is >= 0, it is assigned to one or more processes, or used by the kernel itself
the page_count() function returns the value of _count increased by one, which is the number of users sharing the page
Also, _count is increased when:
the page is inserted into the swap cache
the flag PG_private in page descriptor is set
I am working on a project that requires heavy modifications to the Linux kernel. In one of the modifications I have to change the way the page fault handler works. I would like to be able to intercept page faults from specific processes and satisfy them, possibly by copying data from another machine.
As a first step, I would like to write some experimentation code that can help me understand how Linux satisfies a page fault, and also how it tells the process that the fault cannot be served right now and that it needs to retry later.
So, I would like to modify handle_mm_fault in a way that helps me understand all the above. Something like this:
int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                    unsigned long address, unsigned int flags)
{
        /* some code */

        if (current->pid == my_target_pid) {
                /*
                 * 1. Choose randomly between 1 and 2 -> rand_num
                 * 2. If rand_num is 1, allocate a block of memory, fill it
                 *    with 'X's, map it into the process, and return.
                 * 3. If rand_num is 2, tell the process to come back later
                 *    and return.
                 */
        }

        /* rest of handle_mm_fault for all other processes here */
}
You can have a look at struct vm_operations_struct. Its function member fault is used to deal with page faults on the mapping.
What you describe sounds like demand paging driven by data aborts.
First of all, a data abort can happen because of an invalid page mapping in either kernel space or user space. handle_mm_fault is the subroutine that fixes up user-space page tables in Linux. As far as I can tell, your design has to cover the following:
You need a design in place to keep track of the right PID.
Have you considered how you decide which parts of the VMA should rely on demand paging? The whole process VMA, or just some parts? Linux can use other techniques to create memory mappings for user programs, such as mmap.
To avoid endless retries, you have to fix the mapping anyway, as the CPU will resume execution from the aborted position. If you can't serve the mapping from your designated area immediately, a temporary mapping should be created instead and paged out later.
I want to know which files are cached in the page cache, and I want to free the cache space of a specific file programmatically. It is possible for me to write a kernel module or even modify the kernel code if needed. Can anyone give me some clues?
Firstly, the kernel does not maintain a master list of all files in the page cache, because it has no need for such information. Instead, given an inode you can look up the associated page cache pages, and vice-versa.
For each page cache struct page, page_mapping() will return the struct address_space that it belongs to. The host member of struct address_space identifies the owning struct inode, and from there you can get the inode number and device.
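As a sketch of that lookup direction (kernel-side, not compilable standalone; the helper name is mine, and recent kernels express this through folios rather than bare pages):

```
/* Hypothetical helper: report which inode a page cache page belongs to. */
static void report_page_owner(struct page *page)
{
        struct address_space *mapping = page_mapping(page);
        struct inode *inode;

        if (!mapping)
                return;         /* anonymous or swap-backed page */

        inode = mapping->host;
        printk(KERN_INFO "page belongs to inode %lu on device %u:%u\n",
               inode->i_ino,
               MAJOR(inode->i_sb->s_dev), MINOR(inode->i_sb->s_dev));
}
```

Going the other way, the inode's i_mapping gives you the address_space whose radix tree / xarray holds that file's cached pages.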
mincore() returns a vector that indicates whether pages of the calling process's virtual memory are resident in core (RAM), and so will not cause a disk access (page fault) if referenced. The kernel returns residency information about the pages starting at the address addr, and continuing for length bytes.
To test whether a file currently mapped into your process is in cache, call mincore with its mapped address.
To test whether an arbitrary file is in cache, open and map it, then follow the above.
There is a proposed fincore() system call which would not require mapping the file first, but (at this point in time) it's not yet generally available.
(And then madvise(MADV_DONTNEED)/fadvise(FADV_DONTNEED) can drop parts of a mapping/file from cache.)
You can free the contents of a file from the page cache under Linux by using
posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
As of Linux 2.6, this will immediately get rid of the parts of the page cache that are caching the given file or part of the file; the call blocks until the operation is complete, but that behaviour is not guaranteed by POSIX.
Note that it won't have any effect on pages that have been modified; in that case you want to do an fdatasync or similar first.
EDIT: Sorry, I didn't fully read your question. I don't know how to tell which files are currently in the page cache. Sorry.
I'm wondering if there's a way to write-protect every page in a Linux process's address space (from inside the process itself, by way of mprotect()). By "every page", I really mean every page of the process's address space that might be written to by an ordinary program running in user mode -- so the program text, the constants, the globals, and the heap -- but I would be happy with just the constants, globals, and heap. I don't want to write-protect the stack -- that seems like a bad idea.
One problem is that I don't know where to start write-protecting memory. Looking at /proc/pid/maps, which shows the sections of memory in use for a given pid, they always seem to start at address 0x08048000, with the program text. (In Linux, as far as I can tell, the memory of a process is laid out with the program text at the bottom, then constants above that, then globals, then the heap, then an empty space of varying size depending on the size of the heap or stack, and then the stack growing down from the top of memory at virtual address 0xffffffff.) There's a way to tell where the top of the heap is (by calling sbrk(0), which simply returns a pointer to the current "break", i.e., the top of the heap), but not really a way to tell where the heap begins.
If I try to protect all pages from 0x08048000 up to the break, I eventually get an "mprotect: Cannot allocate memory" error. I don't know why mprotect would be allocating memory anyway -- and Google is not very helpful. Any ideas?
By the way, the reason I want to do this is that I want to build a list of all pages that are written to during a run of the program. The way I can think of to do this is to write-protect all pages, let any attempted write cause a write fault, then implement a write-fault handler that adds the page to the list and removes the write protection. I think I know how to implement the handler, if only I could figure out which pages to protect and how to do it.
Thanks!
You receive ENOMEM from mprotect() if you try to call it on pages that aren't mapped.
Your best bet is to open /proc/self/maps, and read it a line at a time with fgets() to find all the mappings in your process. For each writeable mapping (indicated in the second field) that isn't the stack (indicated in the last field), call mprotect() with the right base address and length (calculated from the start and end addresses in the first field).
Note that you'll need to have your fault handler already set up at this point, because the act of reading the maps file itself will likely cause writes within your address space.
Start simple. Write-protect a few pages and make sure your signal handler works for those pages. Then worry about expanding the scope of the protection. For example, you probably do not need to write-protect the code section: operating systems can implement write-or-execute protection semantics on memory that will prevent code sections from ever being written to:
http://en.wikipedia.org/wiki/Self-modifying_code#Operating_systems