How to serve a page fault in the Linux kernel?

I am working on a project that requires heavy modifications to the Linux kernel. In one of the modifications I have to change the way the page fault handler works. I would like to be able to intercept page faults from specific processes and satisfy them, possibly by copying the data over from another machine.
As a first step, I would like to write some experimental code that helps me understand how Linux satisfies a page fault, and how it tells the process that the fault cannot be served right now and must be retried later.
So, I would like to modify handle_mm_fault in a way that helps me understand all the above. Something like this:
int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                    unsigned long address, unsigned int flags)
{
    /* some code */
    if (current->pid == my_target_pid)
    {
        /*
         * 1. Choose randomly between 1 and 2 -> rand_num
         * 2. If rand_num is 1, allocate a block of memory, fill it with
         *    'X's, hand it to the process, and return.
         * 3. If rand_num is 2, tell the process to come back later and
         *    return.
         */
    }
    /* rest of handle_mm_fault for all other processes here */
}

You can have a look at struct vm_operations_struct. Its function member fault is the hook used to deal with a page fault on a given VMA.
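For instance, here is a minimal, hedged sketch of what such a fault handler could look like in a character-device driver that backs its mmap() area with pages filled on demand. The xfault_* names are made up for illustration, and the vm_fault_t/struct vm_fault interface assumes a reasonably recent kernel (older handlers take (vma, vmf) and return int, and use kmap_atomic() instead of kmap_local_page()):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/string.h>

static vm_fault_t xfault_vm_fault(struct vm_fault *vmf)
{
    struct page *page;
    void *kaddr;

    page = alloc_page(GFP_KERNEL);      /* back the faulting address */
    if (!page)
        return VM_FAULT_OOM;

    kaddr = kmap_local_page(page);      /* fill it with dummy data */
    memset(kaddr, 'X', PAGE_SIZE);
    kunmap_local(kaddr);

    vmf->page = page;   /* core MM maps it and takes over the reference */
    return 0;
}

static const struct vm_operations_struct xfault_vm_ops = {
    .fault = xfault_vm_fault,
};

/* Hooked up from the driver's mmap() handler: */
static int xfault_mmap(struct file *file, struct vm_area_struct *vma)
{
    vma->vm_ops = &xfault_vm_ops;
    return 0;
}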

What you describe sounds like demand paging on a data abort.
First of all, a data abort can happen because of an invalid page mapping in either kernel space or user space. handle_mm_fault is the subroutine that fixes up the user-space page tables in Linux. As far as I can tell, your design has to cover the following:
You need a mechanism in place to keep track of the right PID.
Have you considered which parts of the VMA should rely on demand paging? The process's entire VMAs, or just some parts? Linux can use other techniques to create memory mappings for user programs, such as mmap.
To avoid endless retries, you have to fix the mapping eventually, because the CPU will resume execution from the aborted position. If you can't serve the mapping from your designated source immediately, a temporary mapping should be created instead and paged out later.
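To make the retry path of the pseudocode above concrete, here is a hedged sketch of a helper that could be called near the top of handle_mm_fault (using the 3.x-era signature quoted in the question; newer kernels pass different arguments). my_target_pid, the helper name, and the coin flip are purely illustrative, and get_random_u32() is spelled prandom_u32() on older kernels:

#include <linux/sched.h>
#include <linux/random.h>
#include <linux/mm.h>

/* Hypothetical: PID of the process whose faults we want to intercept. */
static pid_t my_target_pid;

/*
 * Returns VM_FAULT_RETRY to make the faulting process re-execute the
 * instruction later, or 0 to continue with normal fault handling.
 * NOTE: the arch fault handler expects mmap_sem/mmap_lock to have been
 * released when VM_FAULT_RETRY is returned, so a real implementation
 * must drop it before returning.
 */
static int maybe_intercept_fault(unsigned int flags)
{
    if (current->pid != my_target_pid)
        return 0;

    /* Case 2: randomly tell the process to come back later. */
    if ((get_random_u32() % 2) && (flags & FAULT_FLAG_ALLOW_RETRY))
        return VM_FAULT_RETRY;

    /*
     * Case 1: satisfy the fault ourselves. The simplest way to
     * prototype this is a VMA-level .fault handler (see the
     * vm_operations_struct sketch above) that hands back a page filled
     * with 'X'; doing it here would mean populating the page tables by
     * hand.
     */
    return 0;
}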

Related

2 Questions about memory checkpointing with the Linux kernel (custom implementation)

We were given a project where we implement memory checkpointing: basic checkpointing just walks over pages and dumps the data found to a file (also checking info about each page, such as private, locked, etc.), while incremental checkpointing only dumps data that has changed since the previous checkpoint. My understanding is that we are essentially building a smaller-scale version of memory save states (I could be wrong, but that's what I'm getting from this). We are currently using a VMA-based approach to walk the given range (as long as it doesn't fall outside the user-space range, i.e. no kernel addresses or addresses below user space) in order to report the data found in the pages we encounter. I know that struct vm_area_struct is used to access VMAs (with functions such as find_vma()). My issue is that I'm not sure how to examine the individual pages within the user-supplied range of addresses using this vm_area_struct. I only really know about struct page, but I'm still learning the kernel in detail, so I'm bound to miss things. Is there something I'm missing about vm_area_struct when accessing pages?
Second question: what do we use to iterate through each individual page within the found VMA (from the given start and end addresses)?
VMAs contain the virtual addresses of their first and (one after their) last bytes:
struct vm_area_struct {
    /* The first cache line has the info for VMA tree walking. */

    unsigned long vm_start;     /* Our start address within vm_mm. */
    unsigned long vm_end;       /* The first byte after our end address
                                   within vm_mm. */
    ...
This means that, in order to get the page's data, you first need to figure out in what context your code is running.
If it's within the process context, then a simple copy_from_user approach might be enough to get the actual data, together with a page-table walk (through the entire PGD/PUD/PMD/PTE hierarchy) to get the PFN and turn it into a struct page. (Take care not to use the seductive virt_to_page(addr), as it only works on kernel addresses.)
In terms of iteration, you only need to step through the virtual addresses you get from the VMAs in PAGE_SIZE increments.
Note that this assumes the pages are actually mapped. If not (i.e. !pte_present() on the PTE), you might need to map them back in yourself to access the data.
If your check is running in some other context (such as a kthread or an interrupt), you must bring the page back in from swap before accessing it, which is a whole different case. If you want the easy way, I'd look here: https://www.kernel.org/doc/gorman/html/understand/understand014.html to understand how swap lookup/retrieval is handled.
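As a rough, hedged illustration of the process-context case, here is a sketch (the name dump_vma_pages is made up) that steps through one VMA in PAGE_SIZE increments, looks up each present page's struct page via a manual page-table walk, and pulls the data into a kernel buffer with copy_from_user(). Huge pages, proper PTE locking (pte_offset_map_lock) and error handling are omitted, and the p4d level assumes a reasonably recent kernel:

#include <linux/mm.h>
#include <linux/pgtable.h>
#include <linux/uaccess.h>
#include <linux/slab.h>

static int dump_vma_pages(struct vm_area_struct *vma)
{
    struct mm_struct *mm = vma->vm_mm;
    unsigned long addr;
    void *buf = kmalloc(PAGE_SIZE, GFP_KERNEL);

    if (!buf)
        return -ENOMEM;

    for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
        pgd_t *pgd = pgd_offset(mm, addr);
        p4d_t *p4d;
        pud_t *pud;
        pmd_t *pmd;
        pte_t *ptep;
        struct page *page;

        if (pgd_none(*pgd) || pgd_bad(*pgd))
            continue;
        p4d = p4d_offset(pgd, addr);
        if (p4d_none(*p4d) || p4d_bad(*p4d))
            continue;
        pud = pud_offset(p4d, addr);
        if (pud_none(*pud) || pud_bad(*pud))
            continue;
        pmd = pmd_offset(pud, addr);
        if (pmd_none(*pmd) || pmd_bad(*pmd))
            continue;

        ptep = pte_offset_map(pmd, addr);
        if (!pte_present(*ptep)) {      /* swapped out or never faulted in */
            pte_unmap(ptep);
            continue;
        }
        page = pte_page(*ptep);         /* struct page for this mapping */
        pte_unmap(ptep);

        /* The actual data; the copy may fault the page back in. */
        if (copy_from_user(buf, (void __user *)addr, PAGE_SIZE))
            continue;
        /* ... inspect `page` flags and write `buf` to the checkpoint file ... */
    }

    kfree(buf);
    return 0;
}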

Access "current_task" pointer of another cpu in a SMP based linux system

I'm writing a kernel driver where I need to check which thread is running at a certain point in time on other cores. My driver runs one kernel thread per core, and from time to time I need to synchronize some of the threads to do a certain task. What I can observe from the debug logs is that sometimes one thread waits for some of the other threads for too long. I wrote a patch that records the __preempt_count of the other cores to check whether a softirq/hardirq or a preemption-disabled section is delaying my thread.
I also used ftrace with the irqsoff and preemptirqsoff tracers to check the maximum duration of interrupts-off and preemption-disabled sections.
So far I was able to spot the kerneloops thread, which was disabling interrupts for up to 20 ms, which I find too long.
A systemctl disable kerneloops got rid of that issue.
Now I seem to be dealing with some preemption-disabled windows. For further analysis of this driver I need a way to figure out which threads are executing at a particular point in time on the other cores. I'm mainly using ftrace with the IRQ entry/exit events, and I also use trace_printk to push some debug output into the ftrace buffer so that everything ends up in one log.
However, one thing I would like to do is access the current_task structure of the other cores (the current pointer) and print the comm field, which gives the name of the task (or the pid value).
But I'm having a hard time getting this done.
For __preempt_count I had no issue:
int *p = per_cpu_ptr(&__preempt_count,cpu);
pr_info("Preempt count of cpu%u is 0x%08x\n",cpu,*p);
So far I have had no issue declaring or accessing per-CPU variables, but for some reason the current_task pointer triggers a page fault when I try to access it:
char buf[10];
struct task_struct *task = per_cpu_ptr(current_task,cpu);
snprintf(buf,8,"%s",task->comm);
pr_info("Task name: %s",buf);
The above code always triggers a page fault (a NULL pointer dereference). I couldn't find the reason so far. I tried to print the pointer value of task and got the same page fault.
Could this be because the address is not accessible from other cores? In kernel space that should not be the case, as far as I know. I have also had no issues with per-CPU variables so far, and I've played with them a lot.
Bottom line: what would be the right approach to access the current_task of other cores and print the comm/pid fields?
Many Thanks,
Daniel
I finally figured out what was wrong.
The difference between __preempt_count and current_task is that the first is defined as an int variable, whereas the second is defined as a pointer to a structure.
Now, looking deeper into per-CPU variables: they are just variables that the compiler allocates in separate memory locations, like an array. When per_cpu_ptr is called for a variable Foo, the macro computes something like Foo[cpu], which means per_cpu_ptr needs the actual base address of the variable, i.e. &Foo, so that it can compute the per-CPU address relative to it.
When writing foo = per_cpu_ptr(&__preempt_count, cpu), this base address is already given: &__preempt_count.
When writing bar = per_cpu_ptr(current_task, cpu), the base address is not given, because the & is missing: current_task is a pointer, but it is not the base address of the per-CPU current_task slot.
In both cases the argument to per_cpu_ptr is a pointer, and that is where my understanding went wrong: it was not clear to me which pointer I actually have to pass. Now it is clear: I have to pass the base address of the variable (whether it is a plain variable or itself a pointer doesn't matter), so that the macro can compute the per-CPU address for that CPU.
Therefore the approaches that work are:
bar = per_cpu(current_task, cpu), which expands to *per_cpu_ptr(&current_task, cpu),
or directly
bar = *per_cpu_ptr(&current_task, cpu);
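Putting it together, something along these lines should work, assuming an x86 kernel where current_task is exported as a standalone DEFINE_PER_CPU(struct task_struct *, current_task) variable (this layout has changed in very recent kernels). The function name is illustrative, and note that the remote CPU can switch tasks at any moment, so comm/pid are only a snapshot:

#include <linux/sched.h>
#include <linux/percpu.h>
#include <linux/printk.h>

static void print_remote_current(int cpu)
{
    /* per_cpu() takes the variable itself and adds the & internally. */
    struct task_struct *task = per_cpu(current_task, cpu);

    pr_info("cpu%d: comm=%s pid=%d\n", cpu, task->comm, task->pid);
}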

How to artificially cause a page fault in Linux kernel?

I am pretty new to the Linux kernel. I would like to make the kernel fault every time a specified page 'P' is fetched. One simple conceptual idea is to clear the present bit for page 'P' in its Page Table Entry (PTE).
Can anyone provide more details on how to go about achieving this in x86? Also please point me to where in the source code one needs to make this modification, if possible.
Background
I have to invoke my custom page-fault handler, which applies only to a particular set of pages in a user's application. This custom handler must be enabled only after some prologue has executed in the given application. For testing purposes, I need to induce faults after my prologue has run.
Currently the kernel loads everything well before my prologue executes, so I need to cause faults artificially in order to test my handler.
I have not played with the swapping code since I moved from Minix to Linux, but a swapping algorithm does two things: when there is a shortage of memory, it moves pages from memory to disk, and when a page is needed again, it copies it back (probably after moving another page out to disk).
I would use the swap-out function that you are writing to clear the page-present flag. I would probably also use a character device to send a command to the test code to force the swap.
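For x86, a hedged sketch of the present-bit trick might look like the following. The helper name is made up; the walk assumes a kernel recent enough to have the p4d level; the TLB-flush and locking details vary across versions; and keep in mind that once the present bit is cleared, the generic fault code will treat the PTE as non-present (e.g. as a swap entry), so your custom handler must recognise it first:

#include <linux/mm.h>
#include <linux/pgtable.h>
#include <asm/tlbflush.h>

/*
 * Clear the present bit of the PTE that maps `addr` in `mm`, so that
 * the next access to that page takes a fault. For testing only; it
 * silently gives up on huge or unmapped pages.
 */
static void force_fault_on(struct mm_struct *mm, unsigned long addr)
{
    pgd_t *pgd;
    p4d_t *p4d;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *ptep;
    spinlock_t *ptl;

    pgd = pgd_offset(mm, addr);
    if (pgd_none(*pgd) || pgd_bad(*pgd))
        return;
    p4d = p4d_offset(pgd, addr);
    if (p4d_none(*p4d) || p4d_bad(*p4d))
        return;
    pud = pud_offset(p4d, addr);
    if (pud_none(*pud) || pud_bad(*pud))
        return;
    pmd = pmd_offset(pud, addr);
    if (pmd_none(*pmd) || pmd_bad(*pmd))
        return;

    ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
    if (pte_present(*ptep)) {
        /* Drop _PAGE_PRESENT so the next access traps ... */
        set_pte_at(mm, addr, ptep, pte_clear_flags(*ptep, _PAGE_PRESENT));
        /* ... and make sure no stale TLB entry hides the change. */
        flush_tlb_mm(mm);
    }
    pte_unmap_unlock(ptep, ptl);
}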

Repeated Minor Pagefaults at Same Address After Calling mlockall()

The Problem
In the course of attempting to reduce/eliminate the occurrence of minor pagefaults in an application, I discovered a confusing phenomenon; namely, I am repeatedly triggering minor pagefaults for writes to the same address, even though I thought I had taken sufficient steps to prevent pagefaults.
Background
As per the advice here, I called mlockall to lock all current and future pages into memory.
In my original use-case (which involved a rather large array) I also pre-faulted the data by writing to every element (or at least to every page) as per the advice here; though I realize the advice there is intended for users running a kernel with the RT patch, the general idea of forcing writes to thwart COW / demand paging should remain applicable.
I had thought that mlockall could be used to prevent minor page faults. While the man page only seems to guarantee that there will be no major faults, various other resources (e.g. above) state that it can be used to prevent minor page faults as well.
The kernel documentation seems to indicate this as well. For example, unevictable-lru.txt and pagemap.txt state that mlock()'ed pages are unevictable and therefore not suitable for reclamation.
In spite of this, I continued to trigger several minor pagefaults.
Example
I've created an extremely stripped down example to illustrate the problem:
#include <sys/mman.h> // mlockall
#include <stdlib.h>   // abort

int main(int, char **) {
    int x;
    if (mlockall(MCL_CURRENT | MCL_FUTURE)) abort();
    while (true) {
        asm volatile("" ::: "memory"); // So GCC won't optimize out the write
        x = 0x42;
    }
    return 0;
}
Here I repeatedly write to the same address. It is easy to see (e.g. via cat /proc/[pid]/stat | awk '{print $10}') that I continue to incur minor pagefaults long after the initialization is complete.
Running a modified version* of the pfaults.stp script included in systemtap-doc, I logged the time of each pagefault, address that triggered the fault, address of the instruction that triggered the fault, whether it was major/minor, and read/write. After the initial faults from startup and mlockall, all faults were identical: The attempt to write to x triggered a minor write fault.
The interval between successive pagefaults displays a striking pattern. For one particular run, the intervals were, in seconds:
2, 4, 4, 4.8, 8.16, 13.87, 23.588, 40.104, 60, 60, 60, 60, 60, 60, 60, 60, 60, ...
This appears to be (approximately) exponential back-off, with an absolute ceiling of 1 minute.
Running it on an isolated CPU has no impact; neither does running with a higher priority. However, running with a realtime priority eliminates the pagefaults.
The Questions
1. Is this behavior expected?
1a. What explains the timing?
2. Is it possible to prevent this?
Versions
I'm running Ubuntu 14.04, with kernel 3.13.0-24-generic and Systemtap version 2.3/0.156, Debian version 2.3-1ubuntu1 (trusty). Code compiled with gcc-4.8 with no extra flags, though optimization level doesn't seem to matter (provided the asm volatile directive is left in place; otherwise the write gets optimized out entirely)
I'm happy to include further details (e.g. exact stap script, original output, etc.) if they will prove relevant.
*Actually, the vm.pagefault probe was broken for my combination of kernel and systemtap because it referenced a variable that no longer existed in the kernel's handle_mm_fault function, but the fix was trivial.
@fche's mention of Transparent Huge Pages put me onto the right track.
A more careful read of the kernel documentation I linked to in the question shows that mlock does not prevent the kernel from migrating the page to a new page frame; indeed, there's an entire section devoted to migrating mlocked pages. Thus, simply calling mlock() does not guarantee that you will not experience any minor pagefaults.
Somewhat belatedly, I see that this answer quotes the same passage and partially answers my question.
One of the reasons the kernel might move pages around is memory compaction, whereby the kernel frees up a large contiguous block of pages so a "huge page" can be allocated. Transparent huge pages can be easily disabled; see e.g. this answer.
My particular test case was the result of some NUMA balancing changes introduced in the 3.13 kernel.
Quoting the LWN article linked therein:
The scheduler will periodically scan through each process's address
space, revoking all access permissions to the pages that are currently
resident in RAM. The next time the affected process tries to access
that memory, a page fault will result. The scheduler will trap that
fault and restore access to the page in question...
This behavior of the scheduler can be disabled by setting the NUMA policy of the process to explicitly use a certain node. This can be done using numactl at the command line (e.g. numactl --membind=0) or a call to the libnuma library.
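For completeness, a hedged libnuma sketch (link with -lnuma) that binds all of the process's allocations to node 0, mirroring numactl --membind=0, might look like this:

#include <numa.h>   // numa_available, numa_set_membind, ...
#include <stdlib.h> // abort

int main(void) {
    struct bitmask *nodes;

    if (numa_available() < 0) abort();  // no NUMA support on this system

    nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, 0);      // allow node 0 only
    numa_set_membind(nodes);            // all future allocations come from node 0
    numa_free_nodemask(nodes);

    /* ... rest of the application ... */
    return 0;
}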
EDIT: The sysctl documentation explicitly states regarding NUMA balancing:
If the target workload is already bound to NUMA nodes then this feature should be disabled.
This can be done with sysctl -w kernel.numa_balancing=0
There may still be other causes for page migration, but this sufficed for my purposes.
Just speculating here, but perhaps what you're seeing is some normal kernel page-utilization tracking (maybe even KSM, THP, or cgroups), whereby the kernel tries to ascertain how many pages are in active use. Probe the mark_page_accessed function, for example.

Can I write-protect every page in the address space of a Linux process?

I'm wondering if there's a way to write-protect every page in a Linux
process' address space (from inside of the process itself, by way of
mprotect()). By "every page", I really mean every page of the
process's address space that might be written to by an ordinary
program running in user mode -- so, the program text, the constants,
the globals, and the heap -- but I would be happy with just constants,
globals, and heap. I don't want to write-protect the stack -- that
seems like a bad idea.
One problem is that I don't know where to start write-protecting
memory. Looking at /proc/pid/maps, which shows the sections of memory
in use for a given pid, they always seem to start with the address
0x08048000, with the program text. (In Linux, as far as I can tell,
the memory of a process is laid out with the program text at the
bottom, then constants above that, then globals, then the heap, then
an empty space of varying size depending on the size of the heap or
stack, and then the stack growing down from the top of memory at
virtual address 0xffffffff.) There's a way to tell where the top of
the heap is (by calling sbrk(0), which simply returns a pointer to the
current "break", i.e., the top of the heap), but not really a way to
tell where the heap begins.
If I try to protect all pages from 0x08048000 up to the break, I
eventually get an mprotect: Cannot allocate memory error. I don't know why mprotect would be
allocating memory anyway -- and Google is not very helpful. Any ideas?
By the way, the reason I want to do this is because I want to create a
list of all pages that are written to during a run of the program, and
the way that I can think of to do this is to write-protect all pages,
let any attempted writes cause a write fault, then implement a write
fault handler that will add the page to the list and then remove the write
protection. I think I know how to implement the handler, if only I could
figure out which pages to protect and how to do it.
Thanks!
You receive ENOMEM from mprotect() if you try to call it on pages that aren't mapped.
Your best bet is to open /proc/self/maps, and read it a line at a time with fgets() to find all the mappings in your process. For each writeable mapping (indicated in the second field) that isn't the stack (indicated in the last field), call mprotect() with the right base address and length (calculated from the start and end addresses in the first field).
Note that you'll need to have your fault handler already set up at this point, because the act of reading the maps file itself will likely cause writes within your address space.
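A hedged user-space sketch of that loop is below. Error handling is minimal, the SIGSEGV handler is assumed to be installed before this runs (for exactly the reason given above), and the parsing relies on the usual /proc/self/maps line format (start-end perms ... pathname):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static void write_protect_mappings(void) {
    FILE *maps = fopen("/proc/self/maps", "r");
    char line[512];

    if (!maps) return;

    while (fgets(line, sizeof(line), maps)) {
        unsigned long start, end;
        char perms[8];

        if (sscanf(line, "%lx-%lx %7s", &start, &end, perms) != 3)
            continue;
        if (perms[1] != 'w')                // skip mappings that aren't writable
            continue;
        if (strstr(line, "[stack]"))        // leave the stack writable
            continue;

        mprotect((void *)start, end - start, PROT_READ);
    }
    fclose(maps);
}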
Start simple. Write-protect a few pages and make sure your signal handler works for those pages. Then worry about expanding the scope of the protection. For example, you probably do not need to write-protect the code section: operating systems can enforce write-xor-execute (W^X) semantics on memory, which prevents code sections from ever being written to:
http://en.wikipedia.org/wiki/Self-modifying_code#Operating_systems

Resources