While I was thinking about implementing networked paging (requesting the faulting page from a remote node), I got this question:
First, let's consider the following steps:
1) A user-space program tries to access memory at address X.
2) The MMU walks the page table to find the physical address of X.
3) While walking the page table, it notices that the page table entry is invalid.
4) The CPU traps, and the trap is caught by the Linux trap vector. (This is the ARM case, but I think x86 is the same, right?)
5) At this point, I can retrieve the proper data from the remote node, copy it into some physical address and map it in the page table.
6) Here goes the question: after this point, can the program that page-faulted at X safely read the data? If so, does that mean the MMU or CPU somehow remembers the faulting page table entry, returns to that entry, and resumes the page table walk?
If any of the steps are not right, please enlighten me.
The data abort handler simply sets the pc back to the value it had before data abort handling started, so the faulting instruction gets executed again. With the right data now in place, the data abort won't happen again.
The solution is tricky and non-portable.
You can get the values of the CPU registers at the time the segmentation fault occurred from a signal handler (link: http://man7.org/linux/man-pages/man2/sigaction.2.html). You need to analyse these to decide whether you can fix the situation. First, you need to check that the instruction pointer is valid. Then, you need to check that the faulting address lies in a valid range. Then, you need to map memory for the nonexistent pages with the mmap() system call. Then, you need to copy the required data to these pages. After the signal handler returns, the process will resume from where the segmentation fault occurred.
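The steps above can be sketched in user space as follows. This is a minimal illustration for Linux, not a complete implementation: fetch_page() is a hypothetical stand-in for retrieving the data from the remote node, the managed range is a single page, and the handler calls mmap() and memcpy(), which POSIX does not list as async-signal-safe but which work in practice for a synchronous fault like this one.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *region;                          /* reserved, initially inaccessible */
static size_t page_size;
static volatile sig_atomic_t faults_handled;

/* Hypothetical stand-in for fetching the page contents from a remote node. */
static void fetch_page(char *dst)
{
    static const char payload[] = "hello from remote";
    memcpy(dst, payload, sizeof payload);
}

static void on_segv(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    char *addr = (char *)info->si_addr;

    /* Only repair faults inside the range we manage; anything else is a real bug. */
    if (addr < region || addr >= region + page_size)
        _exit(1);

    /* Map a fresh page over the faulting address, then fill it. */
    char *page = (char *)((uintptr_t)addr & ~(uintptr_t)(page_size - 1));
    if (mmap(page, page_size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
        _exit(2);
    fetch_page(page);
    faults_handled++;
    /* Returning restarts the faulting instruction, which now succeeds. */
}

/* Returns the first byte of the lazily populated page, or -1 on setup failure. */
int read_through_fault(void)
{
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    region = mmap(NULL, page_size, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;

    /* This load faults once; the handler maps the page and the load is retried. */
    return *(volatile char *)region;
}

int faults_seen(void)
{
    return (int)faults_handled;
}
```

Calling read_through_fault() from main() should return the first byte of the payload after exactly one handled fault. Nothing "remembers" the failed page table walk; the same load is simply executed again.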
Related
I know that:
When installing a SIGSEGV signal handler with sigaction and a sa_sigaction (rather than sa_handler), the signal handler receives a siginfo_t*, of which the si_addr is the address at which the fault occurred.
Using the ucontext_t we can inspect the values of registers, for example the instruction pointer, albeit not in a platform-independent way (Linux signal handling. How to get address of interrupted instruction?).
My question: can we also know which register caused the fault? Given that we don't have memory-to-memory moves, this should be only one register (after all, there is also only a single si_addr). Of course I could inspect all registers and search for si_addr, but there may be more than one match.
I would be perfectly happy with solutions that are not platform-independent.
The load/store address might not be in any single register; it could be the result of an addressing mode like [rdi + rax*4 + 100] or something.
There is no easy solution to print what a full debugger would, other than running your program under a debugger to catch the fault in the first place, like a normal person. Or let it generate a coredump for you to analyze offline, if you need to debug crashes that happened on someone else's system.
The Linux kernel chooses to dump instruction bytes starting at the code address of the fault (or actually somewhat before it for context), and the contents of all registers. Disassembly to see the faulting instruction can be done after the fact, from the crashlog, along with seeing register contents, without needing to include a disassembler in the kernel itself. See What is "Code" in Linux Kernel crash messages? for an example of what Linux does, and of manually picking it apart instead of using decodecode.
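A user-space handler can take the same approach as the kernel: record the fault address, the instruction pointer and a few instruction bytes, and leave the disassembly for later. The sketch below assumes Linux with glibc; the register access is only filled in for x86-64 and AArch64, and the names fault_report, capture_fault and make_inaccessible_page are made up for this example.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <ucontext.h>
#include <unistd.h>

struct fault_report {
    void *fault_addr;          /* si_addr: the address that was accessed */
    uintptr_t pc;              /* instruction pointer at the time of the fault */
    unsigned char code[8];     /* first bytes of the faulting instruction */
};

static struct fault_report report;
static sigjmp_buf recover;

static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    ucontext_t *uc = ctx;

    report.fault_addr = info->si_addr;
#if defined(__x86_64__)
    report.pc = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
#elif defined(__aarch64__)
    report.pc = (uintptr_t)uc->uc_mcontext.pc;
#else
    (void)uc;
    report.pc = 0;             /* other architectures are not handled in this sketch */
#endif
    /* Save raw instruction bytes so they can be disassembled offline. */
    if (report.pc != 0)
        memcpy(report.code, (const void *)report.pc, sizeof report.code);

    siglongjmp(recover, 1);    /* skip the faulting access instead of retrying it */
}

/* Helper for the demo: one page on which any access will fault. */
void *make_inaccessible_page(void)
{
    void *p = mmap(NULL, (size_t)sysconf(_SC_PAGESIZE), PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Deliberately reads bad_addr and returns what the handler recorded. */
const struct fault_report *capture_fault(void *bad_addr)
{
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);

    if (sigsetjmp(recover, 1) == 0)
        (void)*(volatile char *)bad_addr;

    sigaction(SIGSEGV, &old, NULL);
    return &report;
}
```

The recorded code bytes can then be fed to a disassembler to see which addressing mode, and therefore which registers, produced the address in fault_addr.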
Currently I am reading the System Calls chapter of Understanding the Linux Kernel, and I cannot understand how the Linux kernel knows that an address argument passed via syscall() is invalid.
The book mentions that address checking is delayed until the address is used, and that a page fault is generated when the kernel uses such an address.
It further mentions that a page fault can happen in three cases in kernel mode:
• The kernel attempts to address a page belonging to the process
address space, but either the corresponding page frame does not exist,
or the kernel is trying to write a read-only page.
• Some kernel function includes a programming bug that causes the
exception to be raised when that program is executed; alternatively,
the exception might be caused by a transient hardware error.
• A system call service routine attempts to read or write into a
memory area whose address has been passed as a system call parameter,
but that address does not belong to the process address space.
These cases must be distinguished by the page fault handler, since the actions to be taken are quite different. The page fault handler can easily recognize the first case by determining whether the faulty linear address is included in one of the memory regions owned by the process.
But how does the kernel distinguish between the remaining two cases? It is explained in the textbook, but the explanation looks alien to me. Please help and explain.
The page fault handler __do_page_fault includes this piece of code:
    if (!(error_code & X86_PF_USER) &&
        !search_exception_tables(regs->ip)) {
            bad_area_nosemaphore(regs, error_code, address, NULL);
            return;
    }
The condition !(error_code & X86_PF_USER) is true when the page fault occurred while the CPU was executing in kernel mode rather than user mode. The condition !search_exception_tables(regs->ip) is true when the page fault was not caused by one of the instructions that use a linear address passed to the system call. Note that regs->ip holds the instruction pointer of the instruction that caused the page fault. When both of these conditions are true, it means that either there is a bug in some kernel function or there is some hardware error (the second case).
regs contains a snapshot of all architectural registers at the time of the page fault. On x86, this includes the CS segment register. The RPL field in that register can be used to determine whether the fault originated from user mode or kernel mode.
The search_exception_tables function performs a binary search on sorted arrays of instruction addresses that are built when the kernel is compiled. These are basically the instructions that access an address passed to the system call.
For the other two cases you listed, the faulting instruction has an entry in the exception tables, so the condition !search_exception_tables(regs->ip) would be false and this branch is not taken.
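The lookup and the resulting decision can be modelled in user space as follows. This is a simplified sketch, not the kernel's code: struct ex_entry, search_table and classify_fault are made-up names, and the entries store absolute addresses, whereas the real x86 tables store relative offsets.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified entry: address of an instruction that may fault, and where to resume. */
struct ex_entry {
    uintptr_t insn;
    uintptr_t fixup;
};

enum fault_action {
    FAULT_FIXUP,        /* kernel mode, known user-access instruction: run the fixup */
    FAULT_KERNEL_BUG,   /* kernel mode, no table entry: bug or hardware error */
    FAULT_USER_MODE     /* fault taken in user mode: handled by other paths */
};

/* Binary search over a table sorted by instruction address. */
const struct ex_entry *search_table(const struct ex_entry *tbl, size_t n,
                                    uintptr_t ip)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tbl[mid].insn == ip)
            return &tbl[mid];
        if (tbl[mid].insn < ip)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}

/* Mirrors the shape of the check quoted above. */
enum fault_action classify_fault(int user_mode, uintptr_t ip,
                                 const struct ex_entry *tbl, size_t n)
{
    if (!user_mode && !search_table(tbl, n, ip))
        return FAULT_KERNEL_BUG;
    if (!user_mode)
        return FAULT_FIXUP;
    return FAULT_USER_MODE;
}
```

With a table containing an entry for address 0x1010, a kernel-mode fault at 0x1010 is classified as FAULT_FIXUP, while a kernel-mode fault at an address with no entry is classified as FAULT_KERNEL_BUG.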
I've learned that when a page fault occurs in the copy_to_user function, the exception table is used.
But I found that almost all fixups just set the return value and jump to the instruction after the one that triggered the page fault.
Where does the kernel do the mapping work for the user-space address?
I mean, there must be at least some place where the kernel modifies the page table.
Your question is very unclear. copy_to_user is basically a function for copying data from kernel space to user space. It exists mainly for security reasons, as we don't want to give the user access to kernel data structures and kernel space, so we need a mechanism to request this data from the kernel.
A new mapping will indeed be added to the page tables. The mapping is done in kernel space, where the page tables reside.
I am pretty new to the Linux kernel. I would like to make the kernel fault every time a specified page 'P' is being fetched. One simple conceptual idea is to clear the bit indicating the presence of page 'P' in Page Table Entry (PTE).
Can anyone provide more details on how to go about achieving this in x86? Also please point me to where in the source code one needs to make this modification, if possible.
Background
I have to invoke my custom page handler, which is applicable only for handling a set of pages in a user's application. This custom page handler must be enabled after some prologue is executed in a given application. For testing purposes, I need to induce faults after my prologue is executed.
Currently the kernel loads everything well before my prologue is executed, so I need to artificially cause faults to test my handler.
I have not played with the swapping code since I moved from Minix to Linux, but a swapping algorithm does two things. When there is a shortage of memory, it moves the page from memory to disk, and when a page is needed, it copies it back (probably after moving another page to disk).
I would use the full swap out function that you are writing to clear the page present flag. I would probably also use a character device to send the command to the test code to force the swap.
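If changing the kernel only for testing is not desirable, one user-space alternative is to drop the page with madvise(MADV_DONTNEED), so that the next access goes through the normal page fault path again. This is a different technique from clearing the present bit by hand, and whether a custom handler sees the fault depends on where it is hooked in. The sketch assumes Linux and a private anonymous mapping; refault_after_drop is a made-up name, and the minor fault counter from getrusage() is used only to show that a fault really happened.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Number of minor page faults this process has taken so far. */
static long minor_faults(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt;
}

/*
 * Touches a page, drops it, touches it again, and returns how many minor
 * faults the second touch caused (or -1 on error).
 */
long refault_after_drop(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    void *m = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return -1;

    volatile char *p = m;
    p[0] = 1;                              /* first touch: page becomes resident */

    /* Discard the page; the mapping stays, but its page table entry is gone. */
    if (madvise(m, len, MADV_DONTNEED) != 0) {
        munmap(m, len);
        return -1;
    }

    long before = minor_faults();
    char v = p[0];                         /* faults again; anonymous memory reads as 0 */
    long after = minor_faults();

    munmap(m, len);
    if (before < 0 || after < 0 || v != 0)
        return -1;
    return after - before;
}
```

Calling refault_after_drop() after the prologue should return at least 1, since the second access can no longer be satisfied from an existing page table entry.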
Can someone explain what the reason for this abort would be?
I could not find an explanation in the manual.
Basically I am getting this error in the IFSC code - Synchronous external abort on translation table walk.
The IFSC is a bit field in the HSR register, which is used on an ARMv7-A with the virtualization extensions.
The IFSC is basically a virtualization version of the IFSR.
IFSC code - Synchronous external abort on translation table walk.
This means that the CPU had difficulty accessing the page tables. So your code may have jumped to some unmapped address. The first-level MMU entry may be invalid, or it may contain a second-level page table address that gives a bus error when accessed. Basically, it means that something in the page tables did not map well when your faulting instruction executed. You need to inspect the faulting code and then manually walk your page tables to find the actual source of the error.
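A manual walk of this kind can be sketched as below. The sketch assumes the ARMv7 short-descriptor format with TTBCR.N = 0; code running in Hyp mode or a stage 2 translation uses the long-descriptor format instead, which has a different layout and is not covered here. The simulated memory, sim_read and build_demo_tables are made up for illustration. On real hardware, the addresses passed to the read callback are the ones to check against valid RAM, since an external abort on a table walk means one of those fetches failed on the bus.

```c
#include <stdint.h>

enum walk_result { WALK_OK, WALK_L1_FAULT, WALK_L2_FAULT, WALK_UNSUPPORTED };

typedef uint32_t (*read_phys_fn)(uint32_t pa);

/* Simulated physical memory: L1 table at 0x0000 (16 KB), L2 table at 0x4000 (1 KB). */
static uint32_t sim_mem[0x4400 / 4];

uint32_t sim_read(uint32_t pa)
{
    return pa < sizeof sim_mem ? sim_mem[pa / 4] : 0;
}

void build_demo_tables(void)
{
    sim_mem[1] = 0x80000000u | 2u;               /* VA 0x001xxxxx: 1 MB section */
    sim_mem[2] = 0x00004000u | 1u;               /* VA 0x002xxxxx: points to L2 table */
    sim_mem[0x4000 / 4 + 1] = 0x12345000u | 2u;  /* VA 0x00201xxx: 4 KB small page */
}

enum walk_result walk(uint32_t ttbr0, uint32_t va, read_phys_fn rd,
                      uint32_t *pa_out)
{
    /* First level: table is 16 KB aligned, index is VA[31:20]. */
    uint32_t l1 = rd((ttbr0 & 0xFFFFC000u) + ((va >> 20) << 2));

    switch (l1 & 3u) {
    case 0:
        return WALK_L1_FAULT;
    case 1:
        break;                                   /* page table descriptor */
    case 2:
        if (l1 & (1u << 18))
            return WALK_UNSUPPORTED;             /* supersection: not handled */
        *pa_out = (l1 & 0xFFF00000u) | (va & 0x000FFFFFu);
        return WALK_OK;
    default:
        return WALK_UNSUPPORTED;                 /* reserved or PXN section */
    }

    /* Second level: table is 1 KB aligned, index is VA[19:12]. */
    uint32_t l2 = rd((l1 & 0xFFFFFC00u) + (((va >> 12) & 0xFFu) << 2));

    if ((l2 & 3u) == 0)
        return WALK_L2_FAULT;
    if ((l2 & 3u) == 1) {                        /* 64 KB large page */
        *pa_out = (l2 & 0xFFFF0000u) | (va & 0x0000FFFFu);
        return WALK_OK;
    }
    *pa_out = (l2 & 0xFFFFF000u) | (va & 0x00000FFFu);  /* 4 KB small page */
    return WALK_OK;
}
```

After build_demo_tables(), walking VA 0x00201234 with ttbr0 = 0 goes through the second-level table and yields PA 0x12345234, while VA 0x00300000 stops at an invalid first-level entry.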
If you recently changed the table base, some code may still execute using cached TLB entries, and then a fault like this will occur on the first actual walk.
You probably need to give more information on the context of where the IFSC is read for more help.