Kernel Oops page fault error codes for ARM - linux

What information does the error code after Oops give about a panic on ARM? For example:
Oops: 17 [#1] PREEMPT SMP
What does the 17 tell me in this case?
On x86 the page fault error code bits mean:
bit 0 == 0: no page found 1: protection fault
bit 1 == 0: read access 1: write access
bit 2 == 0: kernel-mode access 1: user-mode access
bit 3 == 1: use of reserved bit detected
bit 4 == 1: fault was an instruction fetch
But I have not been able to find the equivalent information for ARM.
Thanks
Shunty

The bit descriptions you listed above are page fault error codes, not Oops codes.
See the Linux oops-tracing documentation for more on analysing kernel crashes.
Below is how your Oops: 17 [#1] PREEMPT SMP line is produced in arch/arm/kernel/traps.c:
#define S_PREEMPT " PREEMPT"
...
#define S_SMP " SMP"
...
printk(KERN_EMERG "Internal error: %s: %x [#%d]" S_PREEMPT S_SMP S_ISA "\n", str, err, ++die_counter);
Page faults don't need to crash the kernel, and not all kernel crashes are page faults. So there is a high chance that Oops: 17 is not related to page faults at all. (As a bonus, my wild guess is that it is about scheduling; it just sounds familiar to me.)

Looks like you're asking about the ARM Fault Status Register (FSR) bits. I looked up the kernel code (arch/arm/mm/fault.c) and found that this is what is actually passed as a parameter to the Oops code:
static void
__do_kernel_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
                  struct pt_regs *regs)
{
        [...]
        pr_alert("Unable to handle kernel %s at virtual address %08lx\n",
                 (addr < PAGE_SIZE) ? "NULL pointer dereference" :
                 "paging request", addr);
        show_pte(mm, addr);
        die("Oops", regs, fsr);  /* fsr is the value printed after "Oops:" */
        [...]
}
I traced this value to the Fault Status Register (FSR) of the ARM (v4 and above?) MMU:
Source: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0438d/BABFFDFD.html
...
[3:0] FS[3:0]
Fault Status bits. This field indicates the type of exception generated. Any encoding not listed is reserved:
b00001: Alignment fault.
b00100: Instruction cache maintenance fault.
b01100: Synchronous external abort on translation table walk, 1st level.
b01110: Synchronous external abort on translation table walk, 2nd level.
b11100: Synchronous parity error on translation table walk, 1st level.
b11110: Synchronous parity error on translation table walk, 2nd level.
b00101: Translation fault, 1st level.
b00111: Translation fault, 2nd level.
b00011: Access flag fault, 1st level.
b00110: Access flag fault, 2nd level.
b01001: Domain fault, 1st level.
b01011: Domain fault, 2nd level.
b01101: Permission fault, 1st level.
b01111: Permission fault, 2nd level.
b00010: Debug event.
b01000: Synchronous external abort, non-translation.
b11001: Synchronous parity error on memory access.
b10110: Asynchronous external abort.
b11000: Asynchronous parity error on memory access.
...
Disclaimer: I don't know whether this info is still relevant; the doc states it's for the ARM Cortex A15 and the page is marked as Superseded.
See also this page:
ARM926EJ-S Fault address and fault status registers
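As a rough illustration of how the value after "Oops:" maps onto the table above, here is a minimal sketch in C. It assumes the ARMv7 short-descriptor FSR layout (FS[3:0] in bits [3:0], the domain in bits [7:4], FS[4] in bit 10, the write-not-read flag in bit 11); with LPAE the register uses a different layout, so treat the result as a hint only. The names decode_fsr, fsr_fields and fs_name are made up for this example and are not kernel APIs.

```c
#include <stddef.h>

/* Fields of an FSR value, assuming the ARMv7 short-descriptor format. */
struct fsr_fields {
    unsigned fs;      /* 5-bit fault status: FSR[10] followed by FSR[3:0] */
    unsigned domain;  /* FSR[7:4] */
    unsigned write;   /* FSR[11] (WnR): 1 = write access, 0 = read access */
    unsigned lpae;    /* FSR[9]: 1 = long-descriptor format, layout differs */
};

static struct fsr_fields decode_fsr(unsigned fsr)
{
    struct fsr_fields f;

    f.fs     = (fsr & 0xfu) | (((fsr >> 10) & 1u) << 4);
    f.domain = (fsr >> 4) & 0xfu;
    f.write  = (fsr >> 11) & 1u;
    f.lpae   = (fsr >> 9) & 1u;
    return f;
}

/* Names copied from the table quoted above; anything else is "reserved". */
static const char *fs_name(unsigned fs)
{
    switch (fs) {
    case 0x01: return "Alignment fault";
    case 0x02: return "Debug event";
    case 0x03: return "Access flag fault, 1st level";
    case 0x04: return "Instruction cache maintenance fault";
    case 0x05: return "Translation fault, 1st level";
    case 0x06: return "Access flag fault, 2nd level";
    case 0x07: return "Translation fault, 2nd level";
    case 0x08: return "Synchronous external abort, non-translation";
    case 0x09: return "Domain fault, 1st level";
    case 0x0b: return "Domain fault, 2nd level";
    case 0x0c: return "Synchronous external abort on translation table walk, 1st level";
    case 0x0d: return "Permission fault, 1st level";
    case 0x0e: return "Synchronous external abort on translation table walk, 2nd level";
    case 0x0f: return "Permission fault, 2nd level";
    case 0x16: return "Asynchronous external abort";
    case 0x18: return "Asynchronous parity error on memory access";
    case 0x19: return "Synchronous parity error on memory access";
    case 0x1c: return "Synchronous parity error on translation table walk, 1st level";
    case 0x1e: return "Synchronous parity error on translation table walk, 2nd level";
    default:   return "reserved";
    }
}
```

Under these assumptions, decode_fsr(0x17) yields fs = 0b00111 ("Translation fault, 2nd level"), domain 1 and a read access, which would fit a read from an unmapped address. Note that the code is printed with %x, so the 17 is hexadecimal.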

Arm EL0 Address Translation

I want to translate a virtual user level address (EL0) to a physical address.
Arm provides the AT instruction to translate a virtual address to a physical address. It must be executed at a privileged level (EL1), so I wrote a Linux kernel module to execute it. I pass a user-level address (EL0), virt_addr, to the kernel module through ioctl() and then execute AT.
Following the instruction manual, I execute the instruction this way:
// S1E0R: S1 means stage 1, E0 means EL0 and R means read.
asm volatile("AT S1E0R, %[_addr_]" :: [_addr_]"r"(virt_addr));
// Register PAR_EL1 stores the translation result.
asm volatile("MRS %0, PAR_EL1": "=r"(phys_addr));
However, the value read from PAR_EL1 into phys_addr is 0x80b (0b100000001011), which is not a valid translation result.
I then looked up PAR_EL1 in the manual.
According to the manual, bit[0] being 1 means that a fault occurred while executing the address translation instruction.
Bits[6:1] hold the Fault Status Code (FST). My FST is 0b000101, which the manual describes as: Translation fault, level 1.
I don't know what causes the fault.
My CPU is a Cortex-A53, and the operating system is Linux 4.14.59.
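The manual decoding described above can be written down as a small helper. This is only a sketch in plain C operating on a value already read from PAR_EL1; the names decode_par_el1, par_fault, fst_class and par_to_pa are invented for illustration. It assumes the AArch64 PAR_EL1 layout described in the question (F in bit 0, FST in bits [6:1]) and, for the successful case, that the output address sits in bits [47:12].

```c
#include <stdint.h>

/* Fields of PAR_EL1 that matter when the translation failed (F == 1). */
struct par_fault {
    unsigned failed;  /* PAR_EL1[0], F: 1 = the AT instruction faulted */
    unsigned fst;     /* PAR_EL1[6:1]: fault status code */
    unsigned level;   /* FST[1:0]: lookup level for the fault classes below */
};

static struct par_fault decode_par_el1(uint64_t par)
{
    struct par_fault p;

    p.failed = (unsigned)(par & 1u);
    p.fst    = (unsigned)((par >> 1) & 0x3fu);
    p.level  = p.fst & 0x3u;
    return p;
}

/* Only the four fault classes encoded as 0b00ccLL are named here. */
static const char *fst_class(unsigned fst)
{
    switch (fst >> 2) {
    case 0x0: return "Address size fault";
    case 0x1: return "Translation fault";
    case 0x2: return "Access flag fault";
    case 0x3: return "Permission fault";
    default:  return "other";
    }
}

/* On success (F == 0), combine PA[47:12] from PAR_EL1 with the page offset. */
static uint64_t par_to_pa(uint64_t par, uint64_t virt_addr)
{
    return (par & 0x0000fffffffff000ULL) | (virt_addr & 0xfffULL);
}
```

For the value in the question, decode_par_el1(0x80b) gives failed = 1, fst = 0b000101 and level = 1, and fst_class reports "Translation fault", matching the reading of the manual above.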

What happens when you execute an instruction that your CPU does not support?

What happens if a CPU attempts to execute a binary that has been compiled with some instructions that the CPU doesn't support? I'm specifically wondering about some of the new AVX instructions running on older processors.
I'm assuming this can be tested for, and a friendly message could in theory be displayed to a user. Presumably most low level libraries will check this on your behalf. Assuming you didn't make this check, what would you expect to happen? What signal would your process receive?
A new instruction can be designed to be "legacy compatible" or not.
To the former class belong instructions like tzcnt or xacquire, whose encodings are valid instructions on older architectures: tzcnt is encoded as rep bsf and xacquire is just repne.
The semantics differ, of course.
To the second class belong the majority of new instructions, AVX being one popular example.
When the CPU encounters an invalid or reserved encoding it generates the #UD (for UnDefined) exception, which is interrupt number 6.
The Linux kernel sets the IDT entry for #UD early in entry_64.S:
idtentry invalid_op do_invalid_op has_error_code=0
The entry points to do_invalid_op, which is generated by a macro in traps.c:
DO_ERROR(X86_TRAP_UD, SIGILL, "invalid opcode", invalid_op)
The macro DO_ERROR generates a function that calls do_error_trap in the same file.
do_error_trap uses fill_trap_info (also in the same file) to create a siginfo_t structure containing the Linux signal information:
case X86_TRAP_UD:
sicode = ILL_ILLOPN;
siaddr = uprobe_get_trap_addr(regs);
break;
From there the following calls happen:
do_trap in traps.c
force_sig_info in signal.c
specific_send_sig_info in signal.c
This ultimately culminates in calling the SIGILL signal handler of the offending process.
The following program is a very simple example that generates a #UD:
BITS 64
GLOBAL _start
SECTION .text
_start:
ud2
We can use strace to check the signal received when running that program:
--- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0x400080} ---
+++ killed by SIGILL +++
as expected.
As Cody Gray commented, libraries don't usually rely on SIGILL; instead they use a CPU dispatcher or check for the presence of an instruction explicitly.
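To illustrate the "friendly message" idea from the question, here is a hedged sketch of catching SIGILL in user space with POSIX sigaction and sigsetjmp/siglongjmp. The helper names (raises_sigill, simulated_illegal, harmless) are invented for this example, and raise(SIGILL) is only a portable stand-in for a real unsupported instruction such as ud2. Jumping out of the handler matters: simply returning after a real #UD would re-execute the same faulting instruction.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf sigill_escape;

/* Handler: leave the faulting code path instead of returning into it. */
static void on_sigill(int signo)
{
    (void)signo;
    siglongjmp(sigill_escape, 1);
}

/* Runs fn(); returns 1 if it raised SIGILL, 0 if it completed, -1 on setup error. */
static int raises_sigill(void (*fn)(void))
{
    struct sigaction sa, old;
    int caught = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigill;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGILL, &sa, &old) != 0)
        return -1;

    /* savemask = 1 so the signal mask is restored after the jump. */
    if (sigsetjmp(sigill_escape, 1) == 0)
        fn();
    else
        caught = 1;

    sigaction(SIGILL, &old, NULL);  /* restore the previous disposition */
    return caught;
}

/* Stand-in for code containing an unsupported instruction (assumption). */
static void simulated_illegal(void)
{
    raise(SIGILL);
}

static void harmless(void)
{
}
```

A caller could print a message such as "this CPU lacks a required instruction set extension" when raises_sigill returns 1. As noted above, real libraries usually avoid this pattern and test CPU features up front.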

Why shouldn't I use ioremap on system memory for ARMv6+?

I need to reserve a large buffer of physically contiguous RAM from the kernel and be able to guarantee that the buffer will always use a specific, hard-coded physical address. This buffer should remain reserved for the kernel's entire lifetime. I have written a chardev driver as an interface for accessing this buffer from userspace. My platform is an embedded system with an ARMv7 architecture running a 2.6 Linux kernel.
Chapter 15 of Linux Device Drivers, Third Edition has the following to say on the topic (page 443):
Reserving the top of RAM is accomplished by passing a mem= argument to the kernel at boot time. For example, if you have 256 MB, the argument mem=255M keeps the kernel from using the top megabyte. Your module could later use the following code to gain access to such memory:
dmabuf = ioremap (0xFF00000 /* 255M */, 0x100000 /* 1M */);
I've done that plus a couple of other things:
I'm using the memmap bootarg in addition to the mem one. The kernel boot parameters documentation suggests always using memmap whenever you use mem to avoid address collisions.
I used request_mem_region before calling ioremap and, of course, I check that it succeeds before moving ahead.
This is what the system looks like after I've done all that:
# cat /proc/cmdline
root=/dev/mtdblock2 console=ttyS0,115200 init=/sbin/preinit earlyprintk debug mem=255M memmap=1M$255M
# cat /proc/iomem
08000000-0fffffff : PCIe Outbound Window, Port 0
08000000-082fffff : PCI Bus 0001:01
08000000-081fffff : 0001:01:00.0
08200000-08207fff : 0001:01:00.0
18000300-18000307 : serial
18000400-18000407 : serial
1800c000-1800cfff : dmu_regs
18012000-18012fff : pcie0
18013000-18013fff : pcie1
18014000-18014fff : pcie2
19000000-19000fff : cru_regs
1e000000-1fffffff : norflash
40000000-47ffffff : PCIe Outbound Window, Port 1
40000000-403fffff : PCI Bus 0002:01
40000000-403fffff : 0002:01:00.0
40400000-409fffff : PCI Bus 0002:01
40400000-407fffff : 0002:01:00.0
40800000-40807fff : 0002:01:00.0
80000000-8fefffff : System RAM
80052000-8045dfff : Kernel text
80478000-80500143 : Kernel data
8ff00000-8fffffff : foo
Everything so far looks good, and my driver is working perfectly. I'm able to read and write directly to the specific physical address I've chosen.
However, during bootup, a big scary warning (™) was triggered:
BUG: Your driver calls ioremap() on system memory. This leads
to architecturally unpredictable behaviour on ARMv6+, and ioremap()
will fail in the next kernel release. Please fix your driver.
------------[ cut here ]------------
WARNING: at arch/arm/mm/ioremap.c:211 __arm_ioremap_pfn_caller+0x8c/0x144()
Modules linked in:
[] (unwind_backtrace+0x0/0xf8) from [] (warn_slowpath_common+0x4c/0x64)
[] (warn_slowpath_common+0x4c/0x64) from [] (warn_slowpath_null+0x1c/0x24)
[] (warn_slowpath_null+0x1c/0x24) from [] (__arm_ioremap_pfn_caller+0x8c/0x144)
[] (__arm_ioremap_pfn_caller+0x8c/0x144) from [] (__arm_ioremap_caller+0x50/0x58)
[] (__arm_ioremap_caller+0x50/0x58) from [] (foo_init+0x204/0x2b0)
[] (foo_init+0x204/0x2b0) from [] (do_one_initcall+0x30/0x19c)
[] (do_one_initcall+0x30/0x19c) from [] (kernel_init+0x154/0x218)
[] (kernel_init+0x154/0x218) from [] (kernel_thread_exit+0x0/0x8)
---[ end trace 1a4cab5dbc05c3e7 ]---
Triggered from: arch/arm/mm/ioremap.c
/*
 * Don't allow RAM to be mapped - this causes problems with ARMv6+
 */
if (pfn_valid(pfn)) {
        printk(KERN_WARNING "BUG: Your driver calls ioremap() on system memory. This leads\n"
               KERN_WARNING "to architecturally unpredictable behaviour on ARMv6+, and ioremap()\n"
               KERN_WARNING "will fail in the next kernel release. Please fix your driver.\n");
        WARN_ON(1);
}
What problems, exactly, could this cause? Can they be mitigated? What are my alternatives?
"So I've done exactly that, and it's working."
Please provide the kernel command line (e.g. /proc/cmdline) and the resulting memory map (i.e. /proc/iomem) to verify this.
What problems, exactly, could this cause?
The problem with using ioremap() on system memory is that you end up assigning conflicting attributes to the memory which causes "unpredictable" behavior.
See the article "ARM's multiply-mapped memory mess", which provides a history to the warning that you are triggering.
The ARM kernel maps RAM as normal memory with writeback caching; it's also marked non-shared on uniprocessor systems. The ioremap() system call, used to map I/O memory for CPU use, is different: that memory is mapped as device memory, uncached, and, maybe, shared. These different mappings give the expected behavior for both types of memory. Where things get tricky is when somebody calls ioremap() to create a new mapping for system RAM.
The problem with these multiple mappings is that they will have differing attributes. As of version 6 of the ARM architecture, the specified behavior in that situation is "unpredictable."
Note that "system memory" is the RAM that is managed by the kernel.
The fact that you trigger the warning indicates that your code is generating multiple mappings for a region of memory.
Can they be mitigated?
You have to ensure that the RAM you want to ioremap() is not "system memory", i.e. managed by the kernel.
See also this answer.
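As a first sanity check from user space, one could test whether the range to be remapped overlaps a "System RAM" entry in /proc/iomem. The sketch below does that on the text of such a listing; overlaps_system_ram is a made-up helper name, not a kernel or libc function. It is only a partial check: the warning is driven by pfn_valid(), which consults the kernel's memblock data rather than the resource tree shown in /proc/iomem, so a clean result here does not guarantee that the warning goes away.

```c
#include <stdio.h>
#include <string.h>

/*
 * Returns 1 if the inclusive range [start, end] overlaps any line named
 * "System RAM" in text formatted like /proc/iomem, 0 otherwise.
 */
static int overlaps_system_ram(const char *iomem_text,
                               unsigned long long start,
                               unsigned long long end)
{
    const char *line = iomem_text;

    while (line && *line) {
        unsigned long long lo, hi;
        char name[64];

        /* Each line looks like "80000000-8fefffff : System RAM". */
        if (sscanf(line, "%llx-%llx : %63[^\n]", &lo, &hi, name) == 3 &&
            strcmp(name, "System RAM") == 0 &&
            start <= hi && end >= lo)
            return 1;

        line = strchr(line, '\n');
        if (line)
            line++;  /* step past the newline to the next entry */
    }
    return 0;
}
```

With the listing from the question, the range 0x8ff00000-0x8fffffff does not overlap the System RAM entry (which ends at 0x8fefffff), yet the warning still fired. That is consistent with the mismatch being on the memblock side rather than in the resource tree.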
ADDENDUM
This warning that concerns you is the result of pfn_valid(pfn) returning TRUE rather than FALSE.
Based on the Linux cross-reference link that you provided for version 2.6.37,
pfn_valid() is simply returning the result of
memblock_is_memory(pfn << PAGE_SHIFT);
which in turn is simply returning the result of
memblock_search(&memblock.memory, addr) != -1;
I suggest that the kernel code be hacked so that the conflict is revealed.
Before the call to ioremap(), assign TRUE to the global variable memblock_debug.
The following patch should display the salient information about the memory conflict.
(The memblock list is ordered by base-address, so memblock_search() performs a binary search on this list, hence the use of mid as the index.)
 static int __init_memblock memblock_search(struct memblock_type *type, phys_addr_t addr)
 {
 	unsigned int left = 0, right = type->cnt;
 	do {
 		unsigned int mid = (right + left) / 2;
 		if (addr < type->regions[mid].base)
 			right = mid;
 		else if (addr >= (type->regions[mid].base +
 				  type->regions[mid].size))
 			left = mid + 1;
-		else
+		else {
+			if (memblock_debug)
+				pr_info("MATCH for 0x%llx: m=0x%x b=0x%llx s=0x%llx\n",
+					(unsigned long long)addr, mid,
+					(unsigned long long)type->regions[mid].base,
+					(unsigned long long)type->regions[mid].size);
 			return mid;
+		}
 	} while (left < right);
 	return -1;
 }
If you want to see all the memory blocks, then call memblock_dump_all() while the variable memblock_debug is TRUE.
[Interesting that this is essentially a programming question, yet we haven't seen any of your code.]
ADDENDUM 2
Since you're probably using ATAGs (instead of Device Tree), and you want to dedicate a memory region, fix up the ATAG_MEM to reflect this smaller size of physical memory.
Assuming you have made zero changes to your boot code, the ATAG_MEM is still specifying the full RAM, so perhaps this could be the source of the system memory conflict that causes the warning.
See this answer about ATAGs and this related answer.

Cannot Make Code Segment Execute-Only (Not Readable)

I'm trying to make the Code Segment Execute-Only (Not Readable).
But I FAILED after I tried everything the Manual told me to. Here is what I did to make the code segment unreadable.
>uname -a
Linux Emmet-VM 3.19.0-25-generic #26~14.04.1-Ubuntu SMP Fri Jul 24 21:18:00 UTC 2015 i686 i686 i686 GNU/Linux
>lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.3 LTS
Release: 14.04
Codename: trusty
First, I've found this in "Intel(R)64 and IA-32 Architectures Software Developer's Manual(Combined Volumes 1,2A,2B,2C,2D,3A,3B,3C and 3D)":
"Set read-enable bit to enable read" and "Segment Types". (Sorry, I'm still not allowed to embed pictures in my posts, so these are links instead.)
So I guess that if I change %CS to point to a segment descriptor whose read-enable bit is 0, the code segment should become unreadable.
Then I used the code below to insert a new segment into LDT entry 2, setting the code segment type to 8 (1000B), which means "Execute-Only" according to the "Segment Types" link above:
typedef struct user_desc UserDesc;
UserDesc *seg = (UserDesc*)malloc(sizeof(UserDesc));
seg->entry_number = 0x2;
seg->base_addr = 0x00000000;
seg->limit = 0xffffffff;
seg->seg_32bit = 0x1;
seg->contents = 0x02;
seg->read_exec_only = 0x1;
seg->limit_in_pages = 0x1;
seg->seg_not_present = 0x0;
seg->useable = 0x0;
int ret = modify_ldt(1, (void*)seg, sizeof(UserDesc));
After that, I change %CS to 0x17 (00010111B, meaning entry 2 in the LDT) with ljmp:
asm("ljmp $0x17, $reload_cs\n"
"reload_cs:");
But even with this, I can still read the bytes of the code segment:
void foo() { printf("foo\n"); }
void test() {
    char *a = (char *)foo;
    printf("0x%x\n", (unsigned int)a[0]); // This prints 0x55
}
If the code segment were unreadable, the code above should trigger a segmentation fault. But it prints 0x55 successfully.
So I wonder, did I make a mistake somewhere in my test?
Or is this just a mistake in Intel's manual?
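For reference, the selector value 0x17 used with ljmp in the question breaks down into a table index, a table indicator (GDT or LDT) and a requested privilege level. The following sketch shows that arithmetic; decode_selector and make_selector are illustrative names only.

```c
/* Fields of an x86 segment selector. */
struct selector {
    unsigned index;  /* bits [15:3]: descriptor index within the table */
    unsigned ldt;    /* bit 2 (TI): 1 = LDT, 0 = GDT */
    unsigned rpl;    /* bits [1:0]: requested privilege level */
};

static struct selector decode_selector(unsigned short sel)
{
    struct selector s;

    s.index = sel >> 3;
    s.ldt   = (sel >> 2) & 1u;
    s.rpl   = sel & 3u;
    return s;
}

static unsigned short make_selector(unsigned index, unsigned ldt, unsigned rpl)
{
    return (unsigned short)((index << 3) | ((ldt & 1u) << 2) | (rpl & 3u));
}
```

decode_selector(0x17) gives index 2, ldt = 1 and rpl = 3, so the far jump does select LDT entry 2 at user privilege, as the question intends.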
You are still accessing the code through DS when doing (unsigned int)a[0].
Write-only segments don't exist (and if they did, it would be a bad idea to make DS write-only).
If you did everything correctly, mov eax, [cs:...] (NASM syntax) will fail (but mov eax, [ds:...] won't).
After a quick glance at the Intel manual, execute-only pages should not exist (at least not directly), so using mprotect with PROT_EXEC may be of limited use (the code would still be readable).
Worth a shot, though.
There are three ways around this, though none of them can be implemented without the aid of the OS, so they are more theoretical than practical.
Protection keys
If the CPU supports them (See section 4.6.2 of the Intel manual 3), they introduce an asymmetry in how code and data are read.
Reading data is subject to the key protection.
Fetching however is not:
How a linear address’s protection key controls access to the address depends on the mode of a linear address:
A linear address’s protection key controls only data accesses to the address. It does not in any way affect instruction fetches from the address.
So it's possible to give the code pages a protection key that your application's PKRU register does not allow access to.
You would still be allowed to execute the code but not to read it.
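To make the PKRU part concrete: the register holds two bits per protection key, an access-disable bit at position 2*key and a write-disable bit at position 2*key + 1. The sketch below only models that bit arithmetic in plain C; the helper names are hypothetical, and nothing here touches the real register. On Linux the actual setup would go through the pkey_alloc and pkey_mprotect system calls on hardware that supports protection keys.

```c
#include <stdint.h>

#define PKRU_AD 0x1u  /* access disable: blocks data reads and writes */
#define PKRU_WD 0x2u  /* write disable: blocks data writes only */

/* Returns pkru with the given restriction bits set for protection key `key` (0..15). */
static uint32_t pkru_restrict(uint32_t pkru, unsigned key, uint32_t bits)
{
    return pkru | ((bits & 0x3u) << (2u * key));
}

/* 1 if a data read of a page tagged with `key` is permitted by this PKRU value. */
static int pkru_data_read_allowed(uint32_t pkru, unsigned key)
{
    return ((pkru >> (2u * key)) & PKRU_AD) == 0;
}

/* Instruction fetches are not subject to protection keys at all. */
static int pkru_fetch_allowed(uint32_t pkru, unsigned key)
{
    (void)pkru;
    (void)key;
    return 1;
}
```

For example, pkru_restrict(0, 1, PKRU_AD) produces 0x4, after which data reads through key 1 are denied while key 0 is unaffected and fetches remain allowed. That is the asymmetry the answer relies on.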
Desync the TLBs
If your application has never touched the code pages for reading, they will occupy some entries in the ITLB but not in the DTLB.
If the OS then maps them as supervisor-only without flushing the TLBs, data accesses to them are prevented (since no DTLB entries for those pages are present, a page walk is forced), but thanks to the ITLB the code can still be fetched.
This is more involved in practice, as code spans multiple pages and is actually read as data by the OS.
EPT
The Extended Page Tables are used during virtualization to translate guest physical addresses to host physical addresses.
Though they seem to be just another level of indirection, they have separate Read, Write and Execute control bits.
A paper has been written about preventing the leakage of the kernel code (to counteract dynamic Return Oriented Programming).

Is there a list of errors that will show up as `segfaults` when they are not really related to memory access?

In this question, I learned that attempting to run privileged instructions when not in ring 0 can cause what looks like a segfault in a user process, and I have two follow-up questions.
Is this true of all privileged instructions?
What other sorts of errors can cause a fake segfault but are not related to trying to read memory?
Read through the instruction set reference and see where #GP is listed for a non-memory issue. Incomplete list: CLI, CLTS, HLT, IN, INT (with an invalid vector), INVD, INVLPG, IRET (under circumstances), LDMXCSR(setting reserved bits), LGDT, LIDT, LLDT, LMSW, LTR, MONITOR (with ECX != 0), MOV (to CRx or DRx), MWAIT (with invalid ECX), OUT, RDMSR, RDPMC, SWAPGS, SYSEXIT, SYSRET, WBINVD, WRMSR, XGETBV (invalid ECX), XRSTOR, XSETBV
