How to artificially cause a page fault in the Linux kernel?

I am pretty new to the Linux kernel. I would like to make the kernel fault every time a specified page 'P' is accessed. One simple conceptual idea is to clear the present bit for page 'P' in its page table entry (PTE).
Can anyone provide more details on how to go about achieving this in x86? Also please point me to where in the source code one needs to make this modification, if possible.
Background
I have to invoke my custom page handler, which applies only to a set of pages in a user's application. This custom page handler must be enabled after some prologue is executed in a given application. For testing purposes, I need to induce faults after my prologue is executed.
Currently the kernel loads everything well before my prologue is executed, so I need to artificially cause faults to test my handler.

I have not played with the swapping code since I moved from Minix to Linux, but a swapping algorithm does two things. When there is a shortage of memory, it moves the page from memory to disk, and when a page is needed, it copies it back (probably after moving another page to disk).
I would use the full swap-out function that you are writing to clear the page-present flag. I would probably also add a character device through which a command can be sent to the test code to force the swap.
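If it helps to exercise a handler before touching the kernel, the effect of a cleared present bit can be approximated from user space. The following is only a sketch under that assumption, not the kernel-side change described above: it revokes all access to one page with mprotect(PROT_NONE), so the next access raises SIGSEGV, and the handler counts the fault and restores access. The names watched_page, setup_watched_page and arm_fault are made up for this example, and calling mprotect() from a signal handler is common practice on Linux but is not guaranteed async-signal-safe by POSIX.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile char *watched_page;        /* hypothetical page 'P' */
static size_t page_size;
static volatile sig_atomic_t fault_count;  /* faults seen on the page */

/* SIGSEGV handler: count the fault and make the page accessible again,
 * so that the faulting instruction succeeds when it is restarted. */
static void on_fault(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)watched_page;

    (void)sig;
    (void)ctx;
    if (addr < base || addr >= base + page_size)
        _exit(1);                          /* a real crash, not our page */
    fault_count++;
    mprotect((void *)watched_page, page_size, PROT_READ | PROT_WRITE);
}

/* Allocate one anonymous page and install the handler. Returns 0 on success. */
static int setup_watched_page(void)
{
    struct sigaction sa;
    void *m;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    m = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return -1;
    watched_page = m;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Revoke all access so that the next load or store to the page faults. */
static int arm_fault(void)
{
    return mprotect((void *)watched_page, page_size, PROT_NONE);
}
```

Calling arm_fault() again after each access makes the following access fault as well, which is roughly the "fault every time" behaviour asked for, at the cost of one mprotect() call per access.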

Related

page fault in copy_to_user, how kernel map a page for user space address?

I've learned that when a page fault occurs in copy_to_user function, the exception table will be used.
But I found that almost all fixups just set the return value and jump to the instruction after the one that triggered the page fault.
Where does the kernel do the mapping work for user space address?
I mean at least there is some place kernel will modify page table.
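Your question is not very clear. copy_to_user() is basically a function for copying data from kernel space to user space. It exists mainly for security reasons: user code must not get direct access to kernel data structures or kernel space, so there has to be a controlled mechanism for asking the kernel to hand over the data.
A new mapping is indeed added to the page tables. That is done in kernel space, where the page tables reside, by the normal page-fault path (which ends up in handle_mm_fault()). The exception-table fixup only comes into play when the fault cannot be resolved, for example because the user address is invalid.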

Kprobe/Jprobe in the middle of a function

I want to intercept the load_elf_binary function in fs/binfmt_elf.c, read a few custom section headers from the file passed to it as an argument, and set a few registers (eax, ebx, ecx, edx) before returning from the function.
I have read that Jprobes is a good way to access the arguments of the target function, but the problem is that once control returns from the Jprobes function, the register and stack values are restored, as per its specification. So I am looking for a way around this, and inserting a probe in the middle of the function (preferably towards the end) looks like a good idea. Please correct me if I am wrong and help with this.
So, let me see if I understand what you're doing properly.
You've modified the CPU (running in an emulator?) so that instruction 0xF1 does some sort of cryptographic thing. You want to arrange for load_elf_binary to invoke this instruction on return, with registers set properly for this instruction to do its magic. Somehow custom sections are involved.
This is going to be very difficult to do in the way you state. There are a few major problems:
I'm not sure what your threat model is, but if your magic CPU instruction just decrypts the mapped data directly you'll modify the pages in the linux page cache, and the decrypted code or data will be visible to other processes that mmap these pages.
Moreover, if the kernel frees the pages later, the encrypted data will be reloaded into memory, resulting in crashes at unpredictable times.
If some process makes those pages dirty, the decrypted data will be flushed back to disk, leaving a mix of decrypted and encrypted data on disk.
If you use a JProbe, your callback is invoked on entry to the function, which is way too early anyway.
All in all, this isn't going to work too well the way you state it.
A better approach might be to define your own binfmt (or replace the load_binary callback in elf_format). Your binfmt could then load the binary in whatever way it needs to. If you want to leverage the existing ELF loader, you could delegate to load_elf_binary, and on return do whatever you need to manipulate the loaded process, without any of this JProbe stuff.
In either case, do be sure to remap all of the pages you're encrypting/decrypting as MAP_PRIVATE and mark them dirty before changing their contents.

page swap in Linux kernel

I know that Linux kernel has page cache to save recently used pages and blocks.
I understand that this saves time, because Linux doesn't need to fetch those blocks from a lower level of storage. When a block is missing from the cache, Linux requests it from the lower level (using functions like submit_bio) and gets the page corresponding to that block.
I want to find the place in Linux kernel (3.10) where it checks for existence of the block in the page cache, and if it can't find this page, it brings the block from the block i/o layer.
I am searching for something like this in the code:
    if (block's page exists in the cache)
        return this page
    else
        bring in the page of the searched block and return it
Can anyone post a link to the place in the kernel where this decision is made?
The best place to start looking is going to be in mm.h: http://lxr.linux.no/linux+v3.10.10/include/linux/mm.h
Then take a look at the mm directory, which has files like page_io.c: http://lxr.linux.no/linux+v3.10.10/mm/page_io.c
Keep in mind that any architecture specific stuff will likely be defined in the arch directory for the system you are looking at. For example, here is the x86 page table management code: http://lxr.linux.no/linux+v3.10.10/arch/x86/mm/pgtable.c
Good luck! Remember, you are likely not going to find a section of code as clean as the example code you gave.
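To make the "look up, and read on a miss" shape from the question concrete, here is a toy user-space sketch. It is not kernel code: the cache array, read_block_from_disk() and get_page_for_block() are invented for illustration, and the "disk" is simulated. In the 3.10 tree the corresponding logic for file reads is, to my knowledge, in do_generic_file_read() in mm/filemap.c (a find_get_page() lookup followed by a readpage call on a miss), but treat that pointer as something to verify against the source.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE  16   /* toy block size, not a kernel constant */
#define CACHE_SLOTS 8    /* toy direct-mapped cache */

struct cached_page {
    bool valid;
    unsigned long block;
    char data[BLOCK_SIZE];
};

static struct cached_page cache[CACHE_SLOTS];
static unsigned long backing_reads;   /* how often we had to go to "disk" */

/* Stand-in for the block layer: fill the buffer with a letter derived
 * from the block number. */
static void read_block_from_disk(unsigned long block, char *buf)
{
    backing_reads++;
    memset(buf, (int)('A' + block % 26), BLOCK_SIZE);
}

/* Return the cached page for a block, reading it in first on a miss. */
static struct cached_page *get_page_for_block(unsigned long block)
{
    struct cached_page *slot = &cache[block % CACHE_SLOTS];

    if (slot->valid && slot->block == block)
        return slot;                           /* hit: no I/O needed */

    read_block_from_disk(block, slot->data);   /* miss: replaces old contents */
    slot->block = block;
    slot->valid = true;
    return slot;
}
```

The real page cache differs in almost every detail (radix tree lookup, locking, asynchronous I/O, readahead), so this only shows where the hit/miss decision sits relative to the I/O call.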

How to tell Linux that a mmap()'d page does not need to be written to swap if the backing physical page is needed?

Hopefully the title is clear. I have a chunk of memory obtained via mmap(). After some time, I have concluded that I no longer need the data within this range. I still wish to keep this range, however. That is, I do not want to call munmap(). I'm trying to be a good citizen and not make the system swap more than it needs.
Is there a way to tell the Linux kernel that if the given page is backed by a physical page and if the kernel decides it needs that physical page, do not bother writing that page to swap?
I imagine under the hood this magical function call would destroy any mapping between the given virtual page and physical page, if present, without writing to swap first.
Your question (as stated) makes no sense.
Let's assume that there was a way for you to tell the kernel to do what you want.
Let's further assume that it did need the extra RAM, so it took away your page, and didn't swap it out.
Now your program tries to read that page (since you didn't want to munmap the data, presumably you might try to access it). What is the kernel to do? The choices I see:
1. It can give you a new page filled with 0s.
2. It can give you SIGSEGV.
If you wanted choice 2, you could achieve the same result with munmap.
If you wanted choice 1, you could mmap over the existing mapping with MAP_FIXED | MAP_ANON (or munmap followed by a new mmap).
In either case, you can't depend on the old data being there when you need it.
The only way your question would make sense is if there were some additional mechanism for the kernel to let you know that it is taking away your page (e.g. by sending you a special signal). But the situation you described is likely too rare to warrant the additional complexity.
EDIT:
You might be looking for madvise(..., MADV_DONTNEED)
You could munmap the region, then mmap it again with MAP_NORESERVE
If you know at initial mapping time that swapping is not needed, use MAP_NORESERVE
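A minimal sketch of the madvise() suggestion, assuming a private anonymous mapping on Linux (the helper names map_scratch and discard_contents are made up for this example). After MADV_DONTNEED the range stays mapped, but the old contents are gone and the next read sees zero-filled pages, which corresponds to "choice 1" above.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Map len bytes of private anonymous memory without reserving swap space.
 * Returns NULL on failure. */
static unsigned char *map_scratch(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Tell the kernel the contents are no longer needed. The mapping itself
 * stays valid; the backing pages can be freed without being written to swap. */
static int discard_contents(unsigned char *p, size_t len)
{
    return madvise(p, len, MADV_DONTNEED);
}
```

Usage would be map_scratch() once, then discard_contents() whenever the data is finished with; the buffer can be written and read again afterwards without remapping. Note that this behaviour of MADV_DONTNEED is Linux-specific.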

Can I write-protect every page in the address space of a Linux process?

I'm wondering if there's a way to write-protect every page in a Linux
process' address space (from inside of the process itself, by way of
mprotect()). By "every page", I really mean every page of the
process's address space that might be written to by an ordinary
program running in user mode -- so, the program text, the constants,
the globals, and the heap -- but I would be happy with just constants,
globals, and heap. I don't want to write-protect the stack -- that
seems like a bad idea.
One problem is that I don't know where to start write-protecting
memory. Looking at /proc/pid/maps, which shows the sections of memory
in use for a given pid, they always seem to start with the address
0x08048000, with the program text. (In Linux, as far as I can tell,
the memory of a process is laid out with the program text at the
bottom, then constants above that, then globals, then the heap, then
an empty space of varying size depending on the size of the heap or
stack, and then the stack growing down from the top of memory at
virtual address 0xffffffff.) There's a way to tell where the top of
the heap is (by calling sbrk(0), which simply returns a pointer to the
current "break", i.e., the top of the heap), but not really a way to
tell where the heap begins.
If I try to protect all pages from 0x08048000 up to the break, I
eventually get an mprotect: Cannot allocate memory error. I don't know why mprotect would be
allocating memory anyway -- and Google is not very helpful. Any ideas?
By the way, the reason I want to do this is because I want to create a
list of all pages that are written to during a run of the program, and
the way that I can think of to do this is to write-protect all pages,
let any attempted writes cause a write fault, then implement a write
fault handler that will add the page to the list and then remove the write
protection. I think I know how to implement the handler, if only I could
figure out which pages to protect and how to do it.
Thanks!
You receive ENOMEM from mprotect() if you try to call it on pages that aren't mapped.
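This is easy to reproduce in isolation. The sketch below (the function name mprotect_over_hole is made up) maps two pages, unmaps the second, and then calls mprotect() over both, so the range contains a hole just like the gap between the program image and the heap.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map two pages, unmap the second, then try to mprotect() both.
 * Returns the errno value from mprotect(), 0 if it unexpectedly
 * succeeded, or -1 if the setup itself failed. */
static int mprotect_over_hole(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    int err = 0;
    char *p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return -1;
    if (munmap(p + page, page) != 0) {     /* punch a hole in the range */
        munmap(p, 2 * page);
        return -1;
    }
    if (mprotect(p, 2 * page, PROT_READ) != 0)
        err = errno;                       /* ENOMEM on Linux */
    munmap(p, page);
    return err;
}
```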
Your best bet is to open /proc/self/maps, and read it a line at a time with fgets() to find all the mappings in your process. For each writeable mapping (indicated in the second field) that isn't the stack (indicated in the last field), call mprotect() with the right base address and length (calculated from the start and end addresses in the first field).
Note that you'll need to have your fault handler already set up at this point, because the act of reading the maps file itself will likely cause writes within your address space.
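A rough sketch of the parsing step, assuming the usual /proc/<pid>/maps line layout (address range, permissions, offset, device, inode, optional path). The struct and function names are invented for this example. It only collects the candidate mappings into a caller-supplied array; the mprotect() calls would then be made in a second pass, after the fault handler is installed, using start and end - start as the base and length.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct mapping {
    unsigned long start, end;
    char perms[5];      /* e.g. "rw-p" */
    char path[256];     /* empty for anonymous mappings */
};

/* Parse one line of /proc/<pid>/maps. Returns true on success. */
static bool parse_maps_line(const char *line, struct mapping *m)
{
    m->path[0] = '\0';
    /* offset, device and inode are skipped; the path is optional */
    return sscanf(line, "%lx-%lx %4s %*s %*s %*s %255[^\n]",
                  &m->start, &m->end, m->perms, m->path) >= 3;
}

/* A mapping is a candidate if it is writable and is not the main stack. */
static bool should_protect(const struct mapping *m)
{
    return m->perms[1] == 'w' && strcmp(m->path, "[stack]") != 0;
}

/* Walk /proc/self/maps and copy up to max candidates into out[].
 * Returns the number collected, or -1 if the file cannot be opened. */
static int collect_writable(struct mapping *out, int max)
{
    char line[512];
    int n = 0;
    FILE *f = fopen("/proc/self/maps", "r");

    if (!f)
        return -1;
    while (n < max && fgets(line, sizeof line, f)) {
        struct mapping m;

        if (parse_maps_line(line, &m) && should_protect(&m))
            out[n++] = m;
    }
    fclose(f);
    return n;
}
```

Collecting first and protecting afterwards also avoids protecting the buffers that fopen() and fgets() are still writing to while the file is being read. Lines longer than the 512-byte buffer are not handled properly here.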
Start simple. Write-protect a few pages and make sure your signal handler works for these pages. Then worry about expanding the scope of the protection. For example, you probably do not need to write-protect the code section: operating systems can implement write-or-execute protection semantics on memory that will prevent code sections from ever being written to:
http://en.wikipedia.org/wiki/Self-modifying_code#Operating_systems
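As a starting point for the "few pages" experiment, here is a sketch that tracks writes to a small region allocated just for the test. All names (region, dirty_pages, start_tracking) are made up for this example, the region is mapped read-only from the start instead of protecting existing program memory, and the handler relies on mprotect() being usable from a signal handler, which works on Linux but is not promised by POSIX.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_DIRTY 64

static volatile unsigned char *region;            /* tracked test region */
static size_t region_len, page_sz;
static volatile uintptr_t dirty_pages[MAX_DIRTY]; /* pages written so far */
static volatile sig_atomic_t dirty_count;

/* Write-fault handler: record the page, then allow writes to it so the
 * faulting store succeeds when it is restarted. */
static void on_write_fault(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)region;
    uintptr_t page;

    (void)sig;
    (void)ctx;
    if (addr < base || addr >= base + region_len || dirty_count >= MAX_DIRTY)
        _exit(1);                                  /* not a fault we track */
    page = addr & ~((uintptr_t)page_sz - 1);
    dirty_pages[dirty_count++] = page;
    mprotect((void *)page, page_sz, PROT_READ | PROT_WRITE);
}

/* Map npages read-only pages and install the handler. Returns 0 on success. */
static int start_tracking(size_t npages)
{
    struct sigaction sa;
    void *m;

    page_sz = (size_t)sysconf(_SC_PAGESIZE);
    region_len = npages * page_sz;
    m = mmap(NULL, region_len, PROT_READ,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return -1;
    region = m;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

Each page faults only on its first write, so the overhead is one signal per dirtied page. Extending this to the whole process means applying the same mprotect(PROT_READ) to existing mappings and making sure nothing the handler itself writes to is protected.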

Resources