Synchronous external abort on translation table walk - linux

Can someone explain what could cause this abort?
I could not find an explanation in the manual.
The IFSC field reports: Synchronous external abort on translation table walk.
The IFSC is a field of the HSR (Hyp Syndrome Register), which is used on ARMv7-A processors with the Virtualization Extensions.

The IFSC is basically a virtualization version of the IFSR.
IFSC code - Synchronous external abort on translation table walk.
This means that the CPU got an external abort (typically a bus error) while reading the page tables during a translation table walk. So your code may have jumped to an address whose page table entries are bad. The table base or a first-level MMU entry may be corrupt, or a first-level entry may contain a second-level page table address that gives a bus error when accessed. In short, something in the page tables did not map correctly when your faulting instruction executed. You need to inspect the faulting code and then manually walk your page tables to find the actual source of the error.
If you recently changed the table base, some code may keep executing from cached TLB entries, and then, on the first actual walk, a fault like this will occur.
You will probably need to give more context on where the IFSC is read to get more help.

Related

page fault in copy_to_user, how kernel map a page for user space address?

I've learned that when a page fault occurs in the copy_to_user() function, the exception table will be used.
But I found that almost all fixups just set the return value and jump to the instruction after the one that triggered the page fault.
Where does the kernel do the mapping work for user space address?
I mean at least there is some place kernel will modify page table.
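Your question is not very clear. copy_to_user() is basically a function for copying data from kernel space to user space. It exists mainly for security reasons: we don't want to give user space access to kernel data structures and kernel memory, so we need a mechanism for asking the kernel to hand us this data.
A new mapping is indeed added to the page tables, and this is done in kernel space, where the page tables reside. If the user address belongs to a valid mapping whose page is simply not present yet, the kernel's normal page fault handler (handle_mm_fault()) maps the page and the copy continues; the exception-table fixup is used only when the fault cannot be resolved, for example because the address is not mapped at all.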

How to artificially cause a page fault in Linux kernel?

I am pretty new to the Linux kernel. I would like to make the kernel fault every time a specified page 'P' is being fetched. One simple conceptual idea is to clear the bit indicating the presence of page 'P' in Page Table Entry (PTE).
Can anyone provide more details on how to go about achieving this in x86? Also please point me to where in the source code one needs to make this modification, if possible.
Background
I have to invoke my custom page handler, which applies only to a set of pages in a user's application. This custom page handler must be enabled after some prologue has executed in a given application. For testing purposes, I need to induce faults after my prologue has executed.
Currently the kernel loads everything well before my prologue is executed, so I need to artificially cause faults to test my handler.
I have not played with the swapping code since I moved from Minix to Linux, but a swapping algorithm does two things: when memory is short, it moves a page from memory to disk, and when that page is needed again, it copies it back (probably after moving another page to disk).
I would use the full swap-out function that you are writing to clear the page-present flag. I would probably also use a character device to send the test code a command that forces the swap.

Relevant debug data for a Linux target

For an embedded ARM system deployed in the field, there is a need to retrieve relevant debug information when a user-space application crashes. This information will be stored in non-volatile memory so it can be retrieved later. All of it must be captured at runtime, and third-party applications cannot be used because of memory consumption concerns.
So far I have thought of the following:
Signal number and the corresponding PC / faulting memory address when the kernel delivers a signal;
Process ID;
What other information do you think is relevant for identifying the cause of the problem and debugging it quickly afterwards?
Thank you!
Usually, to understand an issue, you'll need every register (r0 to r15), the CPSR, and the top of the stack (to determine what happened before the crash). Please also note that when your program is interrupted by an invalid operation (a jump to an invalid address, ...), the processor enters an exception mode, whereas you need to dump the registers and stack in the context of your process.
To investigate using this data, you must also keep the ELF files from your build (with debug information, if possible), so that you can interpret the contents of your registers and stack.
In the end, the more information you keep, the easier debugging is, but it may be expensive to keep every memory section used by your program at the time of the failure (as a matter of fact, I've never done this).
In postmortem analysis, you will face some limits:
Dynamically linked libraries: if your crash occurs in dynamically loaded and linked code, you will also need the exact library binaries used on your target.
Memory corruption: memory corruption often leads to executing random data as code. On ARM with Linux, this will probably end in a segfault, since you can't reach another process's memory area and your data will probably be marked as "never execute". Nevertheless, by the time the crash happens, you may have already overwritten the data that would have allowed you to identify the source of the corruption. Postmortem analysis isn't always able to identify the cause of a failure.

page swap in Linux kernel

I know that the Linux kernel has a page cache to keep recently used pages and blocks.
I understand that it saves time, because Linux doesn't need to fetch those blocks from a lower (slower) level of the storage hierarchy. When a block is missing from the cache, Linux requests it from the lower level (using functions like submit_bio) and gets the page corresponding to the block.
I want to find the place in the Linux kernel (3.10) where it checks whether the block's page is in the page cache and, if it can't find the page, brings the block in from the block I/O layer.
I am looking for something like this in the code:
if( block's page exists in the cache )
return this page
else
bring the page of the searched block and return it
Can anyone post a link to the place in the kernel where this decision is made?
The best place to start looking is going to be in mm.h: http://lxr.linux.no/linux+v3.10.10/include/linux/mm.h
Then take a look at the mm directory, which has files like page_io.c: http://lxr.linux.no/linux+v3.10.10/mm/page_io.c
Keep in mind that any architecture-specific code will likely be in the arch directory for the system you are looking at. For example, here is the x86 page table management code: http://lxr.linux.no/linux+v3.10.10/arch/x86/mm/pgtable.c
Good luck! Remember, you are likely not going to find a section of code as clean as the example code you gave.

What happens after segmentation fault in linux kernel?

While I was thinking about implementing networked paging (requesting the faulting page from a remote node), I came up with this question:
First, let's consider the following steps:
1) A user-space program tries to access memory at address X.
2) The MMU walks the page table to find the physical address of X.
3) While walking the page table, it notices that the page table entry is invalid.
4) The CPU traps, and the trap is caught by the Linux trap vector. (This is the ARM case, but I think x86 is the same, right?)
5) At this point, I can retrieve the proper data from the remote node, copy it into some physical page and map it in the page table.
6) Here is the question: after this point, can the program that faulted at X safely read the data? If so, does the MMU or CPU somehow remember the faulting page table entry, return to that entry and resume the page table walk?
If any of the steps are not right, please enlighten me.
The data abort handler simply sets the PC back to the address of the faulting instruction, so once the right data is in place, the instruction executes again and the data abort does not happen again. The MMU does not remember where the walk stopped; the re-executed access starts a new translation.
The solution is tricky and non-portable.
You can get the values of the CPU registers at the time of the segmentation fault from a signal handler (link: http://man7.org/linux/man-pages/man2/sigaction.2.html). You need to analyse these to decide whether you can fix the situation. First, check that the instruction pointer is valid. Then, check that the faulting address lies in a valid range. Then, map memory for the non-existent pages with the mmap() system call, and copy the required data into these pages. After the signal handler returns, the process resumes at the instruction that caused the segmentation fault, which is executed again.

Resources