Why is the compiler adding an extra 'sxtw' instruction (resulting further in a kernel panic)? - linux

Issue/Symptom:
At the end of a function return, the compiler adds an sxtw instruction as seen in the disassembly, so the returned pointer keeps only its low 32 bits instead of all 64, resulting in a kernel panic:
Unable to handle kernel paging request at virtual address xxxx
Build Environment:
Platform : ARMV7LE
gcc, linux-4.4.60
Architecture : arm64
gdb : aarch64-5.3-glibc-2.22/usr/bin/aarch64-linux-gdb
Details:
Here's the simplified project structure. It's been taken care of correctly in the corresponding makefile. Also note that file1.c and file2.c are part of same module.
../src/file1.c /* It has func1() defined as well as called */
../src/file2.c
../inc/files.h /* There's no func1() declared in the header */
Cause of the issue:
A call to the func1() was added from the file2.c w/o func1 declaration in files.h or file2.c. (Basically the inclusion of func1 was accidentally missed in the files.h.)
Code compiled with no errors, but a warning as expected -- Implicit declaration of function func1.
At run time though, right after returning from func1 inside file2, the system crashed as it tried de-referencing the returned address from func1.
Further analysis showed that at the end of a function return, the compiler added an sxtw instruction as seen in the disassembly, so the returned pointer kept only its low 32 bits instead of all 64, resulting in a kernel panic.
Unable to handle kernel paging request at virtual address xxxx
Note that x19 is of 64 bit while w0 is of 32 bit.
Note that x0 LS word matches with that of x19.
System crashed while de-referencing x19.
sxtw x19, w0 /* This was added by compiler as extra instruction */
ldp x1, x0, [x19,#304] /* System crashed here */
Registers:
[ 91.388130] pc : [<ffffff80016c9074>] lr : [<ffffff80016c906c>] pstate: 80000145
[ 91.462090] sp : ffffff80094333b0
[ 91.552708] x29: ffffff80094333d0 x28: ffffffc06995408a
[ 91.652701] x27: ffffffc06c400a00 x26: 0000000000000000
[ 91.716243] x25: 0000000000000000 x24: ffffffc069958000
[ 91.779784] x23: ffffffc076e00000 x22: ffffffc06c400a00
[ 91.843326] x21: 0000000000000031 x20: ffffffc073060000
[ 91.906867] x19: 0000000066bfc780 x18: ffffff8009436888
[ 91.970409] x17: 0000000000000000 x16: ffffff8008193074
[ 92.033952] x15: 00000000000a8c06 x14: 2c30323030387830
[ 92.097492] x13: 3d7367616c66202c x12: 3038653030303030
[ 92.161034] x11: 3038666666666666 x10: 78303d646e65202c
[ 92.224576] x9 : 3063303030303030 x8 : 3030303030303030
[ 92.288117] x7 : 0000000000000880 x6 : 0000000000000000
[ 92.351659] x5 : ffffffc07fd10ad8 x4 : 0000000000000001
[ 92.415202] x3 : 0000000000000007 x2 : cb88537fdc8ba63c
[ 92.478743] x1 : 0000000000000000 x0 : ffffffc066bfc780
After adding the declaration of func1 in the files.h, the extra instruction and hence the crash was not seen.
Can someone please explain why the compiler added sxtw in this case?

You should have received at least two warnings, one about the missing function declaration and another about the implicit conversion from int to a pointer type.
The reason is that implicitly declared functions have a return type of int. The caller therefore treats only the low 32 bits of the return register (w0) as the result, and converting that int to a 64-bit pointer sign-extends it, so the upper 32 bits of the real pointer are thrown away. This is the expected GNU C behavior, based on what C compilers for early 64-bit targets did. The sxtw instruction is required to implement this behavior. (Current C standards no longer have implicit function declarations, but GCC still has to support them for backwards compatibility with existing autoconf tests.)
Note that your platform is obviously Aarch64 (with 64-bit registers), not 32-bit ARMv7.
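The truncation described above can be reproduced in ordinary user-space C. The following is a minimal sketch, not code from the kernel module: pointer_through_int is a hypothetical helper name, and it models the "return as int, then widen to a pointer" sequence with explicit integer arithmetic.

```c
#include <stdint.h>

/* Hypothetical helper: models a 64-bit pointer value that passes through
 * the int return type of an implicitly declared function on AArch64. */
uint64_t pointer_through_int(uint64_t real_pointer)
{
    /* Returning "int": only the low 32 bits (w0) are treated as the result. */
    uint32_t low = (uint32_t)real_pointer;

    /* Converting that int to a 64-bit pointer sign-extends bit 31, which is
     * what sxtw does. Written as explicit arithmetic so it is fully defined. */
    int64_t widened = ((int64_t)low ^ 0x80000000LL) - 0x80000000LL;
    return (uint64_t)widened;
}
```

Feeding it the x0 value from the register dump, 0xffffffc066bfc780, yields 0x0000000066bfc780, which is the x19 value the kernel faulted on.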

Related

RIP register doesn't understand valid memory address [duplicate]

I want a simple C method to be able to run hex bytecode on a Linux 64 bit machine. Here's the C program that I have:
char code[] = "\x48\x31\xc0";
#include <stdio.h>
int main(int argc, char **argv)
{
int (*func) ();
func = (int (*)()) code;
(int)(*func)();
printf("%s\n","DONE");
}
The code that I am trying to run ("\x48\x31\xc0") I obtained by writing this simple assembly program (it's not supposed to really do anything)
.text
.globl _start
_start:
xorq %rax, %rax
and then compiling and objdump-ing it to obtain the bytecode.
However, when I run my C program I get a segmentation fault. Any ideas?
Machine code has to be in an executable page. Your char code[] is in the read+write data section, without exec permission, so the code cannot be executed from there.
Here is a simple example of allocating an executable page with mmap:
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main ()
{
char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi]
0xC3 // ret
};
int (*sum) (int, int) = NULL;
// allocate executable buffer
sum = mmap (0, sizeof(code), PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// copy code to buffer
memcpy (sum, code, sizeof(code));
// doesn't actually flush cache on x86, but ensure memcpy isn't
// optimized away as a dead store.
__builtin___clear_cache ((char*)sum, (char*)sum + sizeof(code)); // GNU C; the range covers the copied code bytes
// run code
int a = 2;
int b = 3;
int c = sum (a, b);
printf ("%d + %d = %d\n", a, b, c);
}
See another answer on this question for details about __builtin___clear_cache.
Until recent Linux kernel versions (sometime before 5.4), you could simply compile with gcc -z execstack - that would make all pages executable, including read-only data (.rodata), and read-write data (.data) where char code[] = "..." goes.
Now -z execstack only applies to the actual stack, so it currently works only for non-const local arrays. i.e. move char code[] = ... into main.
See Linux default behavior against `.data` section for the kernel change, and Unexpected exec permission from mmap when assembly files included in the project for the old behaviour: enabling Linux's READ_IMPLIES_EXEC process for that program. (In Linux 5.4, that Q&A shows you'd only get READ_IMPLIES_EXEC for a missing PT_GNU_STACK, like a really old binary; modern GCC -z execstack would set PT_GNU_STACK = RWX metadata in the executable, which Linux 5.4 would handle as making only the stack itself executable. At some point before that, PT_GNU_STACK = RWX did result in READ_IMPLIES_EXEC.)
The other option is to make system calls at runtime to copy into an executable page, or change permissions on the page it's in. That's still more complicated than using a local array to get GCC to copy code into executable stack memory.
(I don't know if there's an easy way to enable READ_IMPLIES_EXEC under modern kernels. Having no GNU-stack attribute at all in an ELF binary does that for 32-bit code, but not 64-bit.)
Yet another option is __attribute__((section(".text"))) const char code[] = ...;
Working example: https://godbolt.org/z/draGeh.
If you need the array to be writeable, e.g. for shellcode that inserts some zeros into strings, you could maybe link with ld -N. But probably best to use -z execstack and a local array.
Two problems in the question:
exec permission on the page, because you used an array that will go in the noexec read+write .data section.
your machine code doesn't end with a ret instruction so even if it did run, execution would fall into whatever was next in memory instead of returning.
And BTW, the REX prefix is totally redundant. "\x31\xc0" xor eax,eax has exactly the same effect as xor rax,rax.
You need the page containing the machine code to have execute permission. x86-64 page tables have a separate bit for execute separate from read permission, unlike legacy 386 page tables.
The easiest way to get static arrays to be in read+exec memory was to compile with gcc -z execstack. (Used to make the stack and other sections executable, now only the stack).
Until recently (2018 or 2019), the standard toolchain (binutils ld) would put section .rodata into the same ELF segment as .text, so they'd both have read+exec permission. Thus using const char code[] = "..."; was sufficient for executing manually-specified bytes as data, without execstack.
But on my Arch Linux system with GNU ld (GNU Binutils) 2.31.1, that's no longer the case. readelf -a shows that the .rodata section went into an ELF segment with .eh_frame_hdr and .eh_frame, and it only has Read permission. .text goes in a segment with Read + Exec, and .data goes in a segment with Read + Write (along with the .got and .got.plt). (What's the difference of section and segment in ELF file format)
I assume this change is to make ROP and Spectre attacks harder by not having read-only data in executable pages where sequences of useful bytes could be used as "gadgets" that end with the bytes for a ret or jmp reg instruction.
// TODO: use char code[] = {...} inside main, with -z execstack, for current Linux
// Broken on recent Linux, used to work without execstack.
#include <stdio.h>
// can be non-const if you use gcc -z execstack. static is also optional
static const char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi] // retval = a+b;
0xC3 // ret
};
static const char ret0_code[] = "\x31\xc0\xc3"; // xor eax,eax ; ret
// the compiler will append a 0 byte to terminate the C string,
// but that's fine. It's after the ret.
int main () {
// void* cast is easier to type than a cast to function pointer,
// and in C can be assigned to any other pointer type. (not C++)
int (*sum) (int, int) = (void*)code;
int (*ret0)(void) = (void*)ret0_code;
// run code
int c = sum (2, 3);
return ret0();
}
On older Linux systems: gcc -O3 shellcode.c && ./a.out (Works because of const on global/static arrays)
On Linux before 5.5 (or so) gcc -O3 -z execstack shellcode.c && ./a.out (works because of -zexecstack regardless of where your machine code is stored). Fun fact: gcc allows -zexecstack with no space, but clang only accepts clang -z execstack.
These also work on Windows, where read-only data goes in .rdata instead of .rodata.
The compiler-generated main looks like this (from objdump -drwC -Mintel). You can run it inside gdb and set breakpoints on code and ret0_code
(I actually used gcc -no-pie -O3 -zexecstack shellcode.c, hence the addresses near 401000.)
0000000000401020 <main>:
401020: 48 83 ec 08 sub rsp,0x8 # stack aligned by 16 before a call
401024: be 03 00 00 00 mov esi,0x3
401029: bf 02 00 00 00 mov edi,0x2 # 2 args
40102e: e8 d5 0f 00 00 call 402008 <code> # note the target address in the next page
401033: 48 83 c4 08 add rsp,0x8
401037: e9 c8 0f 00 00 jmp 402004 <ret0_code> # optimized tailcall
Or use system calls to modify page permissions
Instead of compiling with gcc -zexecstack, you can instead use mmap(PROT_EXEC) to allocate new executable pages, or mprotect(PROT_EXEC) to change existing pages to executable. (Including pages holding static data.) You also typically want at least PROT_READ and sometimes PROT_WRITE, of course.
Using mprotect on a static array means you're still executing the code from a known location, maybe making it easier to set a breakpoint on it.
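As an illustration of combining the two calls, here is a minimal sketch (not taken from any of the answers here) that allocates a writable mapping with mmap, copies the bytes in, and only then switches the mapping to read+exec with mprotect, so the page is never writable and executable at the same time. make_executable_copy is a hypothetical helper name.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: copy len bytes of machine code into a fresh mapping
 * and return it with read+exec permission, or NULL on failure. */
void *make_executable_copy(const void *code, size_t len)
{
    /* Step 1: writable, but not executable yet. */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    memcpy(buf, code, len);

    /* Tell the compiler these stores will be executed as code; on ISAs
     * without coherent I-cache this also emits the needed cache sync. */
    __builtin___clear_cache(buf, buf + len);

    /* Step 2: drop write, add exec. mprotect acts on whole pages. */
    if (mprotect(buf, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, len);
        return NULL;
    }
    return buf;
}
```

A caller would cast the result to the right function-pointer type, for example int (*sum)(int, int) = (int (*)(int, int))make_executable_copy(code, sizeof code); and check it for NULL before calling it.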
On Windows you can use VirtualAlloc or VirtualProtect.
Telling the compiler that data is executed as code
Normally compilers like GCC assume that data and code are separate. This is like type-based strict aliasing, but even using char* doesn't make it well-defined to store into a buffer and then call that buffer as a function pointer.
In GNU C, you also need to use __builtin___clear_cache(buf, buf + len) after writing machine code bytes to a buffer, because the optimizer doesn't treat dereferencing a function pointer as reading bytes from that address. Dead-store elimination can remove the stores of machine code bytes into a buffer, if the compiler proves that the store isn't read as data by anything. https://codegolf.stackexchange.com/questions/160100/the-repetitive-byte-counter/160236#160236 and https://godbolt.org/g/pGXn3B has an example where gcc really does do this optimization, because gcc "knows about" malloc.
(And on non-x86 architectures where I-cache isn't coherent with D-cache, it actually will do any necessary cache syncing. On x86 it's purely a compile-time optimization blocker and doesn't expand to any instructions itself.)
Re: the weird name with three underscores: It's the usual __builtin_name pattern, but name is __clear_cache.
My edit on @AntoineMathys's answer added this.
In practice GCC/clang don't "know about" mmap(MAP_ANONYMOUS) the way they know about malloc. So in practice the optimizer will assume that the memcpy into the buffer might be read as data by the non-inline function call through the function pointer, even without __builtin___clear_cache(). (Unless you declared the function type as __attribute__((const)).)
On x86, where I-cache is coherent with data caches, having the stores happen in asm before the call is sufficient for correctness. On other ISAs, __builtin___clear_cache() will actually emit special instructions as well as ensuring the right compile-time ordering.
It's good practice to include it when copying code into a buffer because it doesn't cost performance, and stops hypothetical future compilers from breaking your code. (e.g. if they do understand that mmap(MAP_ANONYMOUS) gives newly-allocated anonymous memory that nothing else has a pointer to, just like malloc.)
With current GCC, I was able to provoke GCC into really doing an optimization we don't want by using __attribute__((const)) to tell the optimizer sum() is a pure function (that only reads its args, not global memory). GCC then knows sum() can't read the result of the memcpy as data.
With another memcpy into the same buffer after the call, GCC does dead-store elimination into just the 2nd store after the call. This results in no store before the first call so it executes the 00 00 add [rax], al bytes, segfaulting.
// demo of a problem on x86 when not using __builtin___clear_cache
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main ()
{
char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi]
0xC3 // ret
};
__attribute__((const)) int (*sum) (int, int) = NULL;
// copy code to executable buffer
sum = mmap (0,sizeof(code),PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANON,-1,0);
memcpy (sum, code, sizeof(code));
//__builtin___clear_cache(sum, sum + sizeof(code));
int c = sum (2, 3);
//printf ("%d + %d = %d\n", a, b, c);
memcpy(sum, (char[]){0x31, 0xc0, 0xc3, 0}, 4); // xor-zero eax, ret, padding for a dword store
//__builtin___clear_cache(sum, sum + 4);
return sum(2,3);
}
Compiled on the Godbolt compiler explorer with GCC9.2 -O3
main:
push rbx
xor r9d, r9d
mov r8d, -1
mov ecx, 34
mov edx, 7
mov esi, 4
xor edi, edi
sub rsp, 16
call mmap
mov esi, 3
mov edi, 2
mov rbx, rax
call rax # call before store
mov DWORD PTR [rbx], 12828721 # 0xC3C031 = xor-zero eax, ret
add rsp, 16
pop rbx
ret # no 2nd call, CSEd away because const and same args
Passing different args would have gotten another call reg, but even with __builtin___clear_cache the two sum(2,3) calls can CSE. __attribute__((const)) doesn't respect changes to the machine code of a function. Don't do it. It's safe if you're going to JIT the function once and then call many times, though.
Uncommenting the first __clear_cache results in
mov DWORD PTR [rax], -1019804531 # lea; ret
call rax
mov DWORD PTR [rbx], 12828721 # xor-zero; ret
... still CSE and use the RAX return value
The first store is there because of __clear_cache and the sum(2,3) call. (Removing the first sum(2,3) call does let dead-store elimination happen across the __clear_cache.)
The second store is there because the side-effect on the buffer returned by mmap is assumed to be important, and that's the final value main leaves.
Godbolt's ./a.out option to run the program still seems to always fail (exit status of 255); maybe it sandboxes JITing? It works on my desktop with __clear_cache and crashes without.
mprotect on a page holding existing C variables.
You can also give a single existing page read+write+exec permission. This is an alternative to compiling with -z execstack
You don't need __clear_cache on a page holding read-only C variables because there's no store to optimize away. You would still need it for initializing a local buffer (on the stack). Otherwise GCC will optimize away the initializer for this private buffer that a non-inline function call definitely doesn't have a pointer to. (Escape analysis). It doesn't consider the possibility that the buffer might hold the machine code for the function unless you tell it that via __builtin___clear_cache.
#include <stdio.h>
#include <sys/mman.h>
#include <stdint.h>
// can be non-const if you want, we're using mprotect
static const char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi] // retval = a+b;
0xC3 // ret
};
static const char ret0_code[] = "\x31\xc0\xc3";
int main () {
// void* cast is easier to type than a cast to function pointer,
// and in C can be assigned to any other pointer type. (not C++)
int (*sum) (int, int) = (void*)code;
int (*ret0)(void) = (void*)ret0_code;
// hard-coding x86's 4k page size for simplicity.
// also assume that `code` doesn't span a page boundary and that ret0_code is in the same page.
uintptr_t page = (uintptr_t)code & ~4095ULL; // round down to the start of the 4k page
mprotect((void*)page, 4096, PROT_READ|PROT_EXEC|PROT_WRITE); // +write in case the page holds any writeable C vars that would crash later code.
// run code
int c = sum (2, 3);
return ret0();
}
I used PROT_READ|PROT_EXEC|PROT_WRITE in this example so it works regardless of where your variable is. If it was a local on the stack and you left out PROT_WRITE, call would fail after making the stack read only when it tried to push a return address.
Also, PROT_WRITE lets you test shellcode that self-modifies, e.g. to edit zeros into its own machine code, or other bytes it was avoiding.
$ gcc -O3 shellcode.c # without -z execstack
$ ./a.out
$ echo $?
0
$ strace ./a.out
...
mprotect(0x55605aa3f000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC) = 0
exit_group(0) = ?
+++ exited with 0 +++
If I comment out the mprotect, it does segfault with recent versions of GNU Binutils ld which no longer put read-only constant data into the same ELF segment as the .text section.
If I did something like ret0_code[2] = 0xc3;, I would need __builtin___clear_cache(ret0_code+2, ret0_code+3) after that to make sure the store wasn't optimized away, but if I don't modify the static arrays then it's not needed after mprotect. It is needed after mmap+memcpy or manual stores, because we want to execute bytes that have been written in C (with memcpy).
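The mprotect example hard-codes a 4k page size and assumes the array sits inside one page. A sketch that drops both assumptions could look like the following; page_floor, page_span and make_range_rwx are hypothetical helper names, and the page size comes from sysconf(_SC_PAGESIZE).

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round addr down to a multiple of page (page must be a power of two). */
uintptr_t page_floor(uintptr_t addr, uintptr_t page)
{
    return addr & ~(page - 1);
}

/* Bytes from the start of the first page to the end of the last page
 * touched by the range [addr, addr + len). */
size_t page_span(uintptr_t addr, size_t len, uintptr_t page)
{
    uintptr_t first = page_floor(addr, page);
    uintptr_t last  = page_floor(addr + len + page - 1, page); /* round up */
    return (size_t)(last - first);
}

/* Hypothetical helper: give every page holding [p, p + len) RWX permission.
 * Returns 0 on success, -1 on failure (as mprotect does). */
int make_range_rwx(const void *p, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    uintptr_t first = page_floor((uintptr_t)p, (uintptr_t)page);
    return mprotect((void *)first,
                    page_span((uintptr_t)p, len, (uintptr_t)page),
                    PROT_READ | PROT_WRITE | PROT_EXEC);
}
```

With helpers like these, the call in main would become make_range_rwx(code, sizeof code), plus a second call for ret0_code if it might live in a different page.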
You need to include the assembly in-line via a special compiler directive so that it'll properly end up in a code segment. See this guide, for example: http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
Your machine code may be all right, but your CPU objects.
Modern CPUs manage memory in segments. In normal operation, the operating system loads a new program into a program-text segment and sets up a stack in a data segment. The operating system tells the CPU never to run code in a data segment. Your code is in code[], in a data segment. Thus the segfault.
This will take some effort.
Your code variable is stored in the .data section of your executable:
$ readelf -p .data exploit
String dump of section '.data':
[ 10] H1À
H1À is the value of your variable.
The .data section is not executable:
$ readelf -S exploit
There are 30 section headers, starting at offset 0x1150:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[...]
[24] .data PROGBITS 0000000000601010 00001010
0000000000000014 0000000000000000 WA 0 0 8
All 64-bit processors I'm familiar with support non-executable pages natively in the pagetables. Most newer 32-bit processors (the ones that support PAE) provide enough extra space in their pagetables for the operating system to emulate hardware non-executable pages. You'll need to run either an ancient OS or an ancient processor to get a .data section marked executable.
Because these are just flags in the executable, you ought to be able to set the X flag through some other mechanism, but I don't know how to do so. And your OS might not even let you have pages that are both writable and executable.
You may need to set the page executable before you may call it.
On MS-Windows, see the VirtualProtect function.
URL: http://msdn.microsoft.com/en-us/library/windows/desktop/aa366898%28v=vs.85%29.aspx
Sorry, I couldn't follow above examples which are complicated.
So, I created an elegant solution for executing hex code from C.
Basically, you could use asm and .word keywords to place your instructions in hex format.
See below example:
asm volatile(".rept 1024\n"
CNOP
".endr\n");
where CNOP is defined as below:
#define CNOP ".word 0x00010001 \n"
The c.nop instruction was not supported by my assembler, so I defined CNOP as the hex equivalent of c.nop, in syntax the assembler does accept, and used it inside asm.
.rept <NUM> ... .endr repeats the enclosed instruction NUM times.
This solution is working and verified.
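For readers not on RISC-V, the same idea can be sketched portably. This is an assumption-laden illustration, not the poster's code: RAW_NOP and run_raw_nops are hypothetical names, and each branch emits the raw encoding of that ISA's ordinary NOP rather than c.nop.

```c
/* Hypothetical macro: the raw encoding of a NOP for the target ISA. */
#if defined(__x86_64__) || defined(__i386__)
#define RAW_NOP ".byte 0x90\n"         /* x86 nop */
#elif defined(__aarch64__)
#define RAW_NOP ".inst 0xd503201f\n"   /* AArch64 nop */
#elif defined(__riscv)
#define RAW_NOP ".word 0x00000013\n"   /* RISC-V addi x0, x0, 0 (nop) */
#else
#define RAW_NOP ""                     /* unknown target: emit nothing */
#endif

/* Executes eight hand-encoded NOPs, then returns its argument unchanged. */
int run_raw_nops(int value)
{
    /* __asm__ is the spelling that also works under strict -std= modes. */
    __asm__ volatile(".rept 8\n"
                     RAW_NOP
                     ".endr\n");
    return value;
}
```

Because the bytes are emitted inside a function body, they land in the .text section and are executable without any mmap or mprotect call.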

How come the Linux kernel interferes with the execution of the RISC-V custom0 instruction on Zedboard?

dummy_rocc is a simple built-in RoCC accelerator example in the RISC-V tools, where several custom0 instructions are defined. After setting up dummy_rocc (on the Spike ISA simulator or on Rocket-FPGA; the procedure differs), we use dummy_rocc_test, a user-program test case, to verify the correctness of the dummy_rocc accelerator. There are two ways to run dummy_rocc_test: on pk (the proxy kernel) or on Linux.
When I set up dummy_rocc on the Spike ISA simulator, dummy_rocc_test worked on both pk and Linux.
Now I have replaced Spike with Rocket-FPGA on a Zedboard. While the execution on pk succeeds:
root@zynq:~# ./fesvr-zynq pk /nfs/copy_to_rootfs/work/dummy_rocc_test
begin
after asm code
load x into accumulator 2 (funct=0)
read it back into z (funct=1) to verify it
accumulate 456 into it (funct=3)
verify it
do it all again, but initialize acc2 via memory this time (funct=2)
do it all again, but initialize acc2 via memory this time (funct=2)
do it all again, but initialize acc2 via memory this time (funct=2)
success!
the execution on Linux fails:
./fesvr-zynq +disk=/nfs/root.bin bbl /nfs/fpga-zynq/zedboard/fpga-images-zedboard/riscv/vmlinux
..................................Booting RISC-V Linux.........................................
/ # ./work/dummy_rocc_test
begin
after asm code
[ 0.400000] dummy_rocc_test[23]: unhandled signal 4 code 0x30001 at 0x0000000000800500 in ]
[ 0.400000] CPU: 0 PID: 23 Comm: dummy_rocc_test Not tainted 3.14.33-g043bb5d #1
[ 0.400000] task: ffffffff8fa3f500 ti: ffffffff8fb76000 task.ti: ffffffff8fb76000
[ 0.400000] sepc: 0000000000800500 ra : 00000000008004fc sp : 0000003fff943c70
[ 0.400000] gp : 0000000000882198 tp : 0000000000884700 t0 : 0000000000000000
[ 0.400000] t1 : 000000000080adc8 t2 : 8101010101010100 s0 : 0000003fff943ca0
[ 0.400000] s1 : 0000000000800d5c a0 : 000000000000000f a1 : 0000002000002000
[ 0.400000] a2 : 000000000000000f a3 : 000000000085cee8 a4 : 0000000000000001
[ 0.400000] a5 : 000000000000007b a6 : 0000000000000008 a7 : 0000000000000040
[ 0.400000] s2 : 0000000000000000 s3 : 00000000008a2668 s4 : 00000000008d8d98
[ 0.400000] s5 : 00000000008d7770 s6 : 0000000000000008 s7 : 00000000008d6000
[ 0.400000] s8 : 00000000008d8d60 s9 : 0000000000000000 s10: 00000000008a32b8
[ 0.400000] s11: ffffffffffffffff t3 : 000000000000000b t4 : 000000006ffffdff
[ 0.400000] t5 : 000000000000000a t6 : 000000006ffffeff
[ 0.400000] sstatus: 8000000000003008 sbadaddr: 0000000000800500 scause: 0000000000000002
Illegal instruction
A screenshot shows that the "signal 4" is caused by a custom0 instruction.
[Screenshot: readelf output of dummy_rocc_test]
So my question is: why does the Linux kernel interfere with the execution of the RISC-V custom0 instruction on the Zedboard?
The source code of dummy_rocc_test is provided as reference:
// The following is a RISC-V program to test the functionality of the
// dummy RoCC accelerator.
// Compile with riscv64-unknown-elf-gcc dummy_rocc_test.c
// Run with spike --extension=dummy_rocc pk a.out
#include <assert.h>
#include <stdio.h>
#include <stdint.h>
int main() {
printf("begin\n");
uint64_t x = 123, y = 456, z = 0;
// load x into accumulator 2 (funct=0)
// asm code
asm volatile ("addi a1, a1, 2");
/// printf again
printf("after asm code\n");
asm volatile ("custom0 x0, %0, 2, 0" : : "r"(x));
printf("load x into accumulator 2 (funct=0)\n");
// read it back into z (funct=1) to verify it
asm volatile ("custom0 %0, x0, 2, 1" : "=r"(z));
printf("read it back into z (funct=1) to verify it\n");
assert(z == x);
// accumulate 456 into it (funct=3)
asm volatile ("custom0 x0, %0, 2, 3" : : "r"(y));
printf("accumulate 456 into it (funct=3)\n");
// verify it
asm volatile ("custom0 %0, x0, 2, 1" : "=r"(z));
printf("verify it\n");
assert(z == x+y);
// do it all again, but initialize acc2 via memory this time (funct=2)
asm volatile ("custom0 x0, %0, 2, 2" : : "r"(&x));
printf("do it all again, but initialize acc2 via memory this time (funct=2)\n");
asm volatile ("custom0 x0, %0, 2, 3" : : "r"(y));
printf("do it all again, but initialize acc2 via memory this time (funct=2)\n");
asm volatile ("custom0 %0, x0, 2, 1" : "=r"(z));
printf("do it all again, but initialize acc2 via memory this time (funct=2)\n");
assert(z == x+y);
printf("success!\n");
}
"Illegal instruction" means your processor threw an illegal instruction exception.
Since custom0 is not something Linux knows how to emulate in software (it's customizable, after all), the kernel delivers signal 4 (SIGILL) to the process and prints the register dump that you saw.
The question I have for you is "Did you implement the custom0 instruction in the processor? Is it enabled? Did the program execute your custom0 instruction properly and return the correct answer when you used the proxy-kernel?"
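To see the exception from inside the process rather than in the kernel log, a SIGILL handler can record the faulting address, which for a real illegal instruction should correspond to the sepc value in the dump. This is only a diagnostic sketch; install_sigill_probe, sigill_seen and sigill_addr are hypothetical names.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for sigaction/siginfo_t under strict -std= modes */
#endif
#include <signal.h>
#include <stddef.h>
#include <string.h>

volatile sig_atomic_t sigill_seen = 0;  /* set to 1 by the handler */
void *volatile sigill_addr = NULL;      /* address reported with the signal */

static void on_sigill(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    sigill_addr = info->si_addr;  /* faulting instruction for a real SIGILL */
    sigill_seen = 1;
    /* For a real illegal instruction, returning here would re-execute it
     * and fault again; a real handler would report and _exit() instead. */
}

/* Install the handler; returns 0 on success, -1 on failure. */
int install_sigill_probe(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigill;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGILL, &sa, NULL);
}
```

Calling install_sigill_probe() at the top of main in dummy_rocc_test would let the program print sigill_addr itself, which helps confirm that the custom0 instruction is the one being rejected.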

How to get right MIPS libc toolchain for embedded device

I've run into a problem (repeatedly) with various companies' embedded Linux products where the GPL source code they provide does not match what is actually running on the system. It's "close", but not quite right, especially with respect to the standard C library they use.
Isn't that a violation of the GPL?
Often this mismatch results in a programmer (like me) cross compiling only to have the device reply cryptically "file not found" or something similar when the program is run.
I'm not alone with this kind of problem. Many people have threads directly and indirectly related to it, e.g.:
Compile parameters for MIPS based codesourcery toolchain?
And I've run into the problem on Sony devices, D-link, and many others. It's very common.
Making a new library is not a good solution, since most systems are ROMFS only, and LD_LIBRARY_PATH is sometimes broken -- so that installing a new library on the device wastes very limited memory and often won't work.
If I knew what the right source code version of the library was, I could go around the manufacturer's carelessness and compile it from the original developer's tree; but how can I find out which version I need when all I have is the binary of the library itself?
For example: I ran readelf -a libc.so.0 on a DSL modem's libc (see below); but I don't see anything here that could tell me exactly which libc it was...
How can I find the name of the source code, or an identifier from the library's binary so I can create a cross compiler using that library? eg: Can anyone tell me what source code this library came from, and how they know?
ELF Header:
Magic: 7f 45 4c 46 01 02 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, big endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: DYN (Shared object file)
Machine: MIPS R3000
Version: 0x1
Entry point address: 0x5a60
Start of program headers: 52 (bytes into file)
Start of section headers: 0 (bytes into file)
Flags: 0x1007, noreorder, pic, cpic, o32, mips1
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 4
Size of section headers: 0 (bytes)
Number of section headers: 0
Section header string table index: 0
There are no sections in this file.
There are no sections to group in this file.
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
REGINFO 0x0000b4 0x000000b4 0x000000b4 0x00018 0x00018 R 0x4
LOAD 0x000000 0x00000000 0x00000000 0x2c9ee 0x2c9ee R E 0x1000
LOAD 0x02c9f0 0x0006c9f0 0x0006c9f0 0x009a0 0x040b8 RW 0x1000
DYNAMIC 0x0000cc 0x000000cc 0x000000cc 0x0579a 0x0579a RWE 0x4
Dynamic section at offset 0xcc contains 19 entries:
Tag Type Name/Value
0x0000000e (SONAME) Library soname: [libc.so.0]
0x00000004 (HASH) 0x18c
0x00000005 (STRTAB) 0x3e9c
0x00000006 (SYMTAB) 0x144c
0x0000000a (STRSZ) 6602 (bytes)
0x0000000b (SYMENT) 16 (bytes)
0x00000015 (DEBUG) 0x0
0x00000003 (PLTGOT) 0x6ce20
0x00000011 (REL) 0x5868
0x00000012 (RELSZ) 504 (bytes)
0x00000013 (RELENT) 8 (bytes)
0x70000001 (MIPS_RLD_VERSION) 1
0x70000005 (MIPS_FLAGS) NOTPOT
0x70000006 (MIPS_BASE_ADDRESS) 0x0
0x7000000a (MIPS_LOCAL_GOTNO) 11
0x70000011 (MIPS_SYMTABNO) 677
0x70000012 (MIPS_UNREFEXTNO) 17
0x70000013 (MIPS_GOTSYM) 0x154
0x00000000 (NULL) 0x0
There are no relocations in this file.
The decoding of unwind sections for machine type MIPS R3000 is not currently supported.
Histogram for bucket list length (total of 521 buckets):
Length Number % of total Coverage
0 144 ( 27.6%)
1 181 ( 34.7%) 27.1%
2 130 ( 25.0%) 66.0%
3 47 ( 9.0%) 87.1%
4 12 ( 2.3%) 94.3%
5 5 ( 1.0%) 98.1%
6 1 ( 0.2%) 99.0%
7 1 ( 0.2%) 100.0%
No version information found in this file.
Primary GOT:
Canonical gp value: 00074e10
Reserved entries:
Address Access Initial Purpose
0006ce20 -32752(gp) 00000000 Lazy resolver
0006ce24 -32748(gp) 80000000 Module pointer (GNU extension)
Local entries:
Address Access Initial
0006ce28 -32744(gp) 00070000
0006ce2c -32740(gp) 00030000
0006ce30 -32736(gp) 00000000
0006ce34 -32732(gp) 00010000
0006ce38 -32728(gp) 0006d810
0006ce3c -32724(gp) 0006d814
0006ce40 -32720(gp) 00020000
0006ce44 -32716(gp) 00000000
0006ce48 -32712(gp) 00000000
Global entries:
Address Access Initial Sym.Val. Type Ndx Name
0006ce4c -32708(gp) 000186c0 000186c0 FUNC bad section index[ 6] __fputc_unlocked
0006ce50 -32704(gp) 000211a4 000211a4 FUNC bad section index[ 6] sigprocmask
0006ce54 -32700(gp) 0001e2b4 0001e2b4 FUNC bad section index[ 6] free
0006ce58 -32696(gp) 00026940 00026940 FUNC bad section index[ 6] raise
...
truncated listing
....
Note:
The rest of this post is a blog showing how I came to ask the question above and to put useful information about the subject in one place.
Don't bother reading it unless you want to know I actually did research the question... in gory detail... and how NOT to answer my question.
The proper (theoretical) way to get a libc program running on (for example) a D-link modem would simply be to get the TRUE source code for the product from the manufacturer, and compile against those libraries.... (It's GPL !? right, so the law is on our side, right?)
For example: I just bought a D-Link DSL-520B modem and a 526B modem -- but found out after the fact that the manufacturer "forgot" to supply linux source code for the 520B but does have it for the 526B. I checked all of the DSL-5xxB devices online for source code & toolchains, finding to my delight that ALL of them (including 526B) -- contain the SAME pre-compiled libc.so.0 with MD5sum of 6ed709113ce615e9f170aafa0eac04a6 . So in theory, all supported modems in the DSL-5xxB family seemed to use the same libc library... and I hoped I might be able to use that library.
But after I figured out how to get the DSL modem itself to send me a copy of the installed /lib/libc.so.0 library -- I found to my disgust that they ALL use a library with MD5 sum of b8d492decc8207e724a0822641205078 . In NEITHER of the modems I bought (supported or not) was found the same library as contained in the source code toolchain.
To verify the toolchain from D-link was defective, I didn't compile a program (the toolchain wouldn't run on my PC anyway as it was the wrong binary format) -- but I found the toolchain had some pre-compiled mips binaries in it already; so I simply downloaded one to the modem and chmod +x -- and (surprise) I got the message "file not found." when I tried to run it ... It won't run.
So, I knew the toolchains were no good immediately, but not exactly why.
I decided to get a newer version of MIPS GCC (binary version) that should have fewer bugs and more features, and which is supported on most PC platforms. This is the way to GO!
See: Kernel.org pre-compiled binaries
I upgraded to gcc 4.9.0 after selecting the older "mips" version from the above site to get the right FTP page; and set my shell's PATH variable to the /bin directory of the cross compiler once installed.
Then I copied all the header files and libraries from the D-link source code package into the new cross compiler just to verify that it could compile D-link libc binaries. And it did on the first try, compiling "hello world!" with no warnings or errors into a mips 32 big endian binary.
( START EDIT: ) @ChrisStratton points out in the comments (after this post) that my test of the toolchain is inadequate, and that using a newer GCC with an older library -- even though it links properly -- is flawed as a test. I wish there were a way to give him points for his comments -- I've become convinced that he's right; although that makes what D-link did an even worse problem -- for there's no way to know from the binaries on the modem which GCC they actually used. The GCC used for the kernel isn't necessarily the same one used in user space.
In order to test the new compiler's compatibility with the modems and also make tools so I could get a copy of the actual libraries found on the modem: ( END EDIT ) I wrote a program that doesn't use the C library at all (but in two parts): It ran just fine... and the code is attached to show how it can be done.
The first listing is an assembly language program to bypass linking the standard C libraries on MIPS; and the second listing is a program meant to create an octal number dump of a binary file/stream using only the linux kernel. eg: It enables copying/pasting or scripting of binary data over telnet, netcat, etc... via ash/bash or busybox :) like a poor man's uucp.
// substart.S MIPS assembly language bypass of libc startup code
// it just calls main, and then jumps to the exit function
.text
.globl __start
__start: .ent __start
.frame $29, 32, $31
.set noreorder
.cpload $25
.set reorder
.cprestore 16
jal main
move $4, $2 // pass main's return value ($v0) to exit as its status ($a0)
j exit
.end __start
// end substart.S
...and...
// octdump.c
// To compile w/o libc :
// mips-linux-gcc substart.S octdump.c -nostdlib -o octdump
// To compile with working libc (eg: x86 system) :
// gcc octdump.c -o octdump_x86
#include <syscall.h>
#include <errno.h>
#include <sys/types.h>
int* __errno_location(void) { return &errno; }
#ifdef _syscall1
// define three unix functions (exit,read,write) in terms of unix syscall macros.
_syscall1( void, exit, int, status );
_syscall3( ssize_t, read, int, fd, void*, buf, size_t, count );
_syscall3( ssize_t, write, int, fd, const void*, buf, size_t, count );
#endif
#include <unistd.h>
// Write one byte as an escape: two backslashes, '0', then up to three
// octal digits with leading zeros dropped.
void oct( unsigned char c ) {
    unsigned int n = c;
    int m=6;
    static unsigned char oval[6]={'\\','\\','0','0','0','0'};
    // Each shift drops one leading zero digit and shortens the output by one.
    if (n < 64) { m-=1; n <<= 3; }
    if (n < 64) { m-=1; n <<= 3; }
    if (n < 64) { m-=1; n <<= 3; }
    oval[5]='0'+(n&7);
    oval[4]='0'+((n>>3)&7);
    oval[3]='0'+((n>>6)&7);
    write( STDOUT_FILENO, oval, m );
}
int main(void) {
    char buffer[255];
    int count=1;
    int i;
    // Emit one "echo -ne $'...'" line per 17 input bytes, then a final "#".
    while (count>0) {
        count=read( STDIN_FILENO, buffer, 17 );
        if (count>0) write( STDOUT_FILENO, "echo -ne $'",11 );
        for (i=0; i<count; ++i) oct( buffer[i] );
        if (count>0) write( STDOUT_FILENO, "'\n", 2 );
    }
    write( STDOUT_FILENO,"#\n",2);
    return 0;
}
Once mips' octdump was saved (chmod +x) as /var/octdump on the modem, it ran without errors.
(use your imagination about how I got it on there... Dlink's TFTP, & friends are broken.)
I was able to use octdump to copy all the dynamic libraries off the DSL modem and examine them, using an automated script to avoid copy/pasting by hand.
#!/usr/bin/env python3
# octget.py
# A program to download a file from an embedded linux device via telnet
import socket
import sys
import time

if len(sys.argv) != 4:
    sys.exit("Usage: octget.py IP_OF_MODEM passwd path_to_file_to_get")

def show(sock, size=1024):
    # Send the device's reply to stderr so that stdout carries only the dump.
    sys.stderr.write(sock.recv(size).decode("latin-1"))

o = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
o.connect((sys.argv[1], 23))  # The IP address of the DSL modem.
time.sleep(1)
show(o)
o.sendall(b"admin\r\n")
time.sleep(0.1)
show(o)
o.sendall(sys.argv[2].encode() + b"\r\n")
time.sleep(0.1)
o.sendall(b"sh\r\n")
time.sleep(0.1)
show(o)
o.sendall(b"cd /var\r\n")
time.sleep(0.1)
show(o)
# The binary was saved as /var/octdump above.
o.sendall(b"./octdump < " + sys.argv[3].encode() + b"\r\n")
show(o, 21)  # the first 21 bytes go to stderr, not into the dump
get = b"y"
# octdump ends its output with a "#" line; stop once it shows up.
while get and b"#" not in get:
    get = o.recv(4096).replace(b"\r", b"")
    sys.stdout.buffer.write(get)
    time.sleep(0.5)
o.close()
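On the PC side, the captured output still has to be turned back into bytes. One option is to run the captured echo lines in bash; another is to decode them directly. Below is a minimal Python 3 sketch of the second option, assuming the capture contains exactly the lines octdump.c writes. The names oct_escape, decode_line and decode_dump are made up for this illustration and are not part of the original tools.

```python
import re

# One octdump output line looks like:  echo -ne $'\\0150\\0151'
LINE_RE = re.compile(r"^echo -ne \$'(.*)'$")


def oct_escape(byte):
    """Mirror oct() in octdump.c: two backslashes, a '0', then the byte in
    octal with leading zeros dropped (so byte 0 is just the bare prefix)."""
    digits = format(byte, "o") if byte else ""
    return "\\\\0" + digits


def decode_line(line):
    """Turn one captured 'echo -ne' line back into raw bytes."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return b""  # prompts, the trailing '#' marker, blank lines
    # Every byte starts with the same 3-character prefix, so split on it.
    fields = match.group(1).split("\\\\0")[1:]
    return bytes(int(field, 8) if field else 0 for field in fields)


def decode_dump(text):
    """Decode a whole capture (hypothetical helper name)."""
    return b"".join(decode_line(line) for line in text.splitlines())


if __name__ == "__main__":
    sample = b"hi\x00\xff\n"
    line = "echo -ne $'" + "".join(oct_escape(b) for b in sample) + "'"
    assert decode_dump(line + "\n#\n") == sample
    print(line)
```

Because every byte is written as an escape, the next character after the digits is always a backslash, so the variable-length digits are never ambiguous.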
The DSL520B modem had the following libraries...
libcrypt.so.0 libpsi.so libutil.so.0 ld-uClibc.so.0 libc.so.0 libdl.so.0 libpsixml.so
... and I thought I might cross compile using these libraries since (at least in theory) -- GCC could link against them; and my problem might be solved.
I made very sure to erase all the incompatible .so libraries from gcc-4.9.0/mips-linux/mips-linux/lib, but kept the generic crt*.o files; then I copied the modem's libraries into the cross compiler directory.
But even though the kernel version of the source code and the kernel version of the modem matched -- GCC found undefined symbols in the crt files.... So, either the generic crt files or the modem libraries themselves are somehow defective... and I don't know why. Without knowing how to get the full version of the uClibc (?) library, I'm not sure how I can get the CORRECT source code to recompile the libraries and the crt's from scratch.

virtual to physical address conversion in linux kernel

The following is used to translate virtual address to physical address in linux kernel. But what does it mean?
I have very limited knowledge of assembly
#define __pv_stub(from,to,instr,type) \
    __asm__("# __pv_stub\n" \
    "1: " instr " %0, %1, %2\n" \
    " .pushsection .pv_table,\"a\"\n" \
    " .long 1b\n" \
    " .popsection\n" \
    : "=r" (to) \
    : "r" (from), "I" (type))
It's not really "assembly" as there is no instruction in this macro per se.
It's just a macro which inserts instr (an instruction passed to the macro), which has one input operand from, one immediate (constant) input operand type, and an output operand to.
There is also the part between pushsection and popsection which records in a specific binary section pv_table the address of this instruction. That allows the kernel to find these places in its code if it wishes to.
The last part is the asm constraints and operands. It lists what the compiler will replace %0, %1 and %2 with. %0 is the first listed ("=r" (to)): it means that %0 will be any general purpose register, and that it is an output operand whose value is stored in the macro argument to. The other two are similar except they're input operands: from is a register so gets "r", but type is an immediate so gets "I" (on ARM, a constant that can be encoded as a data-processing immediate).
See http://gcc.gnu.org/onlinedocs/gcc-4.8.1/gcc/Extended-Asm.html#Extended-Asm for details
So consider this code from the kernel (http://lxr.linux.no/linux+v3.9.4/arch/arm/include/asm/memory.h#L172)
static inline unsigned long __virt_to_phys(unsigned long x)
{
    unsigned long t;
    __pv_stub(x, t, "add", __PV_BITS_31_24);
    return t;
}
__pv_stub will be equivalent to t = x + __PV_BITS_31_24 (instr == add, from == x, to == t, type == __PV_BITS_31_24)
So you might wonder why anybody would do such a complicated thing instead of just writing t = x + __PV_BITS_31_24 in the code.
The reason is the pv_table I mentioned above. The address of every one of these instructions is recorded in a specific ELF section. Under some circumstances, the kernel patches the constant in these instructions at runtime, so it needs to be able to find all of them easily; hence the table.
The ARM port does exactly that here: http://lxr.linux.no/linux+v3.9.4/arch/arm/kernel/head.S#L541
It's used only if the kernel is built with CONFIG_ARM_PATCH_PHYS_VIRT enabled:
CONFIG_ARM_PATCH_PHYS_VIRT:
Patch phys-to-virt and virt-to-phys translation functions at
boot and module load time according to the position of the
kernel in system memory.
This can only be used with non-XIP MMU kernels where the base
of physical memory is at a 16MB boundary, or theoretically 64K
for the MSM machine class.
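To make the patching idea concrete, here is a toy Python model (not kernel code) of what the stub plus the table achieve. It assumes the runtime fixup replaces the placeholder constant with the top 8 bits of the difference between the physical and virtual base addresses, which is what the 16MB alignment requirement in the help text suggests. The names Insn, run and fixup_pv_table, and the example addresses, are hypothetical.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF
PV_BITS_31_24 = 0x81000000  # placeholder immediate the stub is compiled with


@dataclass
class Insn:
    op: str    # "add" (virt -> phys) or "sub" (phys -> virt)
    imm: int   # immediate operand; only bits 31..24 may be set


def run(insn, value):
    """Execute one stub instruction on a 32-bit value."""
    if insn.op == "add":
        return (value + insn.imm) & MASK32
    return (value - insn.imm) & MASK32


def fixup_pv_table(code, pv_table, phys_offset, page_offset):
    """Patch every instruction recorded in pv_table with the real offset."""
    delta = (phys_offset - page_offset) & MASK32
    if delta & 0x00FFFFFF:
        raise ValueError("offset must be a multiple of 16MB")
    for index in pv_table:
        code[index].imm = delta  # only bits 31..24 are non-zero here


# Two stubs in the "kernel text"; pv_table records where they are.
code = [Insn("add", PV_BITS_31_24), Insn("sub", PV_BITS_31_24)]
pv_table = [0, 1]

# Assumed example layout: RAM at 0x80000000, kernel mapped at 0xC0000000.
fixup_pv_table(code, pv_table, phys_offset=0x80000000, page_offset=0xC0000000)

phys = run(code[0], 0xC0008000)
virt = run(code[1], phys)
print(hex(phys), hex(virt))
```

With this assumed layout the virtual address 0xC0008000 maps to the physical address 0x80008000 and back. The point of the table is that the same compiled code works wherever RAM starts, as long as the offset fits in the patched immediate.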

Why can't I handle NMI?

I want to handle NMIs and do something when an NMI occurs. First I wrote a naive NMI handler:
static irqreturn_t nmi_handler(int irq, void* dev_id) {
    printk("-#_#- I'm TT, I am handling NMI.\n");
    return IRQ_HANDLED;
}
Then I wrote a module to register my NMI handler and used the APIC to trigger an NMI 5 times:
static void __init ipi_init(void) {
    printk("-#_#- I'm coming again, hahaha!\n");
    int result = request_irq(NMI_VECTOR,
        nmi_handler, IRQF_DISABLED, "NMI Watchdog", NULL);
    printk("--- the result of request_irq is: %d\n", result);
    int i;
    for (i = 0; i < 5; ++i) {
        apic->send_IPI_allbutself(NMI_VECTOR);
        ssleep(1);
    }
}
Now I type "insmod xxx.ko" to install this module, and afterwards I check /var/log/syslog:
kernel: [ 1166.231005] -#_#- I'm coming again, hahaha!
kernel: [ 1166.231028] --- the result of request_irq is: 0
kernel: [ 1166.231050] Uhhuh. NMI received for unknown reason 00 on CPU 1.
kernel: [ 1166.231055] Do you have a strange power saving mode enabled?
kernel: [ 1166.231058] Dazed and confused, but trying to continue
kernel: [ 1167.196293] Uhhuh. NMI received for unknown reason 00 on CPU 1.
kernel: [ 1167.196293] Do you have a strange power saving mode enabled?
kernel: [ 1167.196293] Dazed and confused, but trying to continue
kernel: [ 1168.201288] Uhhuh. NMI received for unknown reason 00 on CPU 1.
kernel: [ 1168.201288] Do you have a strange power saving mode enabled?
kernel: [ 1168.201288] Dazed and confused, but trying to continue
kernel: [ 1169.235553] Uhhuh. NMI received for unknown reason 00 on CPU 1.
kernel: [ 1169.235553] Do you have a strange power saving mode enabled?
kernel: [ 1169.235553] Dazed and confused, but trying to continue
kernel: [ 1170.236343] Uhhuh. NMI received for unknown reason 00 on CPU 1.
kernel: [ 1170.236343] Do you have a strange power saving mode enabled?
kernel: [ 1170.236343] Dazed and confused, but trying to continue
It shows that I registered nmi_handler successfully (result=0) and that the NMI was triggered 5 times, but I can't find the string that nmi_handler should have printed.
I work on Ubuntu 10.04 LTS, Intel Pentium 4 Dual-core.
Does it mean my NMI handler didn't execute?
How do I handle NMI in Linux?
Nobody?
My partner gave me 3 more days, so I read the source code and ULK3, now I can answer question 1:
Does it mean my NMI handler didn't execute?
In fact, the IRQ number and the INT vector number are different! request_irq() ends up in __setup_irq(), the same function that setup_irq() wraps:
/**
 * setup_irq - setup an interrupt
 * @irq: Interrupt line to setup
 * @act: irqaction for the interrupt
 *
 * Used to statically setup interrupts in the early boot process.
 */
int setup_irq(unsigned int irq, struct irqaction *act)
{
    struct irq_desc *desc = irq_to_desc(irq);
    return __setup_irq(irq, desc, act);
}
Look at this: "@irq: Interrupt line to setup". The argument irq is an interrupt line number, not an interrupt vector number. According to the ULK3 PDF, p. 203, the timer interrupt has IRQ 0, but its INT number is 32! So I triggered INT 2 (NMI), but my handler actually handles INT 34! I wanted to find more evidence in the source code (e.g. how is an IRQ converted to an INT? I modified my handler and init to request irq=2, and Linux allotted INT=50), but found nothing except linux-xxx/arch/x86/include/asm/irq_vectors.h:
/*
* IDT vectors usable for external interrupt sources start
* at 0x20:
*/
#define FIRST_EXTERNAL_VECTOR 0x20
Give me a while... let me read more code to answer question 2.
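The arithmetic behind the IRQ 0 / INT 32 example can be written out as below. This is only a sketch of the legacy layout described in ULK3, where external IRQ n sits at vector FIRST_EXTERNAL_VECTOR + n. As the INT=50 observation above shows, a real kernel may assign a different vector, so legacy_irq_to_vector is a hypothetical helper, not a kernel function.

```python
FIRST_EXTERNAL_VECTOR = 0x20  # from arch/x86/include/asm/irq_vectors.h
NMI_VECTOR = 2                # CPU vector used for NMI on x86


def legacy_irq_to_vector(irq):
    """Vector used for external IRQ line `irq` in the legacy layout."""
    return FIRST_EXTERNAL_VECTOR + irq


# The timer is IRQ 0 but arrives on vector 32.
print(legacy_irq_to_vector(0))

# request_irq(NMI_VECTOR, ...) therefore asks for IRQ line 2, whose legacy
# vector is 34, not the NMI vector 2 that the IPI actually raises.
print(legacy_irq_to_vector(NMI_VECTOR), "!=", NMI_VECTOR)
```

Either way, the number passed to request_irq() is never interpreted as a CPU vector, so a handler registered this way cannot be reached by an NMI.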

Resources