Initial state of program registers and stack on Linux ARM

I'm currently playing with ARM assembly on Linux as a learning exercise. I'm using 'bare' assembly, i.e. no libcrt or libgcc. Can anybody point me to information about what state the stack pointer and the other registers will be in at the start of the program, before the first instruction is executed? Obviously pc/r15 points at _start, and the rest appear to be initialised to 0, with two exceptions: sp/r13 points to an address far outside my program, and r1 points to a slightly higher address.
So, on to some solid questions:
What is the value in r1?
Is the value in sp a legitimate stack allocated by the kernel?
If not, what is the preferred method of allocating a stack; using brk or allocate a static .bss section?
Any pointers would be appreciated.

Since this is Linux, you can look at how it is implemented by the kernel.
The registers seem to be set by the call to start_thread at the end of load_elf_binary (if you are using a modern Linux system, it will almost always be using the ELF format). For ARM, the registers seem to be set as follows:
r0 = first word in the stack
r1 = second word in the stack
r2 = third word in the stack
sp = address of the stack
pc = binary entry point
cpsr = endianness, Thumb mode, and address limit set as needed
Clearly you have a valid stack. I think the values of r0-r2 are junk, and you should instead read everything from the stack (you will see why I think this later). Now, let's look at what is on the stack. What you will read from the stack is filled by create_elf_tables.
One interesting thing to notice here is that this function is architecture-independent, so the same things (mostly) will be put on the stack on every ELF-based Linux architecture. The following is on the stack, in the order you would read it:
The number of parameters (this is argc in main()).
One pointer to a C string for each parameter, followed by a zero (this is the contents of argv in main(); argv would point to the first of these pointers).
One pointer to a C string for each environment variable, followed by a zero (this is the contents of the rarely-seen envp third parameter of main(); envp would point to the first of these pointers).
The "auxiliary vector", which is a sequence of pairs (a type followed by a value), terminated by a pair with a zero (AT_NULL) in the first element. This auxiliary vector has some interesting and useful information, which you can see (if you are using glibc) by running any dynamically-linked program with the LD_SHOW_AUXV environment variable set to 1 (for instance LD_SHOW_AUXV=1 /bin/true). This is also where things can vary a bit depending on the architecture.
Since this structure is the same for every architecture, you can look for instance at the drawing on page 54 of the SYSV 386 ABI to get a better idea of how things fit together (note, however, that the auxiliary vector type constants on that document are different from what Linux uses, so you should look at the Linux headers for them).
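If you just want to peek at the auxiliary vector from C rather than assembly, glibc (2.16 and later) exposes it through getauxval(3); here is a minimal sketch:

#include <stdio.h>
#include <sys/auxv.h>

int main(void)
{
    /* Two entries the kernel put in the auxiliary vector on our
       initial stack: the page size and the program's entry point. */
    printf("AT_PAGESZ: %lu\n", getauxval(AT_PAGESZ));
    printf("AT_ENTRY:  %#lx\n", getauxval(AT_ENTRY));
    return 0;
}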
Now you can see why the contents of r0-r2 are garbage. The first word in the stack is argc, the second is a pointer to the program name (argv[0]), and the third probably was zero for you because you called the program with no arguments (it would be argv[1]). I guess they are set up this way for the older a.out binary format, which as you can see at create_aout_tables puts argc, argv, and envp in the stack (so they would end up in r0-r2 in the order expected for a call to main()).
Finally, why was r0 zero for you instead of one (argc should be one if you called the program with no arguments)? I am guessing something deep in the syscall machinery overwrote it with the return value of the system call (which would be zero since the exec succeeded). You can see in kernel_execve (which does not use the syscall machinery, since it is what the kernel calls when it wants to exec from kernel mode) that it deliberately overwrites r0 with the return value of do_execve.
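To make the layout concrete, here is a minimal C sketch of walking that initial stack, assuming a 32-bit target and an entry stub that passes the original sp in as a pointer (the function name is just for illustration):

#include <elf.h>   /* Elf32_auxv_t, AT_NULL */

void walk_initial_stack(long *sp)
{
    long argc   = sp[0];
    char **argv = (char **)&sp[1];
    char **envp = argv + argc + 1;    /* skip argv's NULL terminator */

    char **p = envp;
    while (*p)                        /* find the NULL that ends envp */
        p++;

    Elf32_auxv_t *auxv = (Elf32_auxv_t *)(p + 1);
    for (; auxv->a_type != AT_NULL; auxv++) {
        /* each entry is one (a_type, a_un.a_val) pair */
    }
}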

Here's what I use to get a Linux/ARM program started with my compiler:
/** The initial entry point.
*/
asm(
" .text\n"
" .globl _start\n"
" .align 2\n"
"_start:\n"
" sub lr, lr, lr\n" // Clear the link register.
" ldr r0, [sp]\n" // Get argc...
" add r1, sp, #4\n" // ... and argv ...
" add r2, r1, r0, LSL #2\n" // ... and compute environ.
" bl _estart\n" // Let's go!
" b .\n" // Never gets here.
" .size _start, .-_start\n"
);
As you can see, I just get the argc, argv, and environ stuff from the stack at [sp].
A little clarification: The stack pointer points to a valid area in the process' memory. r0, r1, r2, and r3 hold the first four parameters of the function being called. I populate the first three with argc, argv, and environ, respectively.
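For reference, a sketch of the C-side entry this branches to (the _estart name comes from the stub above; the signature is mine, not a standard interface):

/* Receives what _start loaded into r0-r2. */
void _estart(int argc, char **argv, char **envp)
{
    /* ... call main(argc, argv, envp) and pass its result to the exit syscall ... */
}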

Here's the uClibc crt. It seems to suggest that all registers are undefined except r0 (which contains a function pointer to be registered with atexit()) and sp, which holds a valid stack address.
So, the value you see in r1 is probably not something you can rely on.
Some data are placed on the stack for you.
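So the only initial-register contract a portable crt relies on is r0 plus a valid sp. As a sketch of what a crt typically does with that r0 value (hypothetical names; this assumes a stub has already fetched argc/argv/envp from the stack, as in the answer above, and the real uClibc code differs in detail):

extern int  atexit(void (*fn)(void));
extern int  main(int argc, char **argv, char **envp);
extern void exit(int status);

/* Called from _start; rtld_fini arrived in r0. */
void start_c(void (*rtld_fini)(void), int argc, char **argv, char **envp)
{
    if (rtld_fini)
        atexit(rtld_fini);   /* run the dynamic linker's destructors at exit */
    exit(main(argc, argv, envp));
}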

I've never used ARM Linux, but I suggest you either look at the source for the libcrt to see what it does, or use gdb to step into an existing executable. You shouldn't need the source code; just step through the assembly.
Everything you need to find out should happen within the very first code executed by any binary executable.
Hope this helps.
Tony

Related

$gp, .cpload and position independence on MIPS

I'm reading about PIC implementation on MIPS on Linux here. It says:
The global pointer which is stored in the $gp register (aka $28) is a callee saved register.
The Wikipedia article about MIPS says the same.
However, according to them, when a .cpload directive is being used in function prologue, it clobbers the previous value of $gp without saving it first. When a .cprestore is used, it saves the current $gp to the stack frame, as opposed to the value of $gp that was there on function entrance. Same goes for the effect .cprestore has on jal/jalr: it restores $gp once the callee returns - assuming the callee might've clobbered it.
And finally, there's nothing in the function epilogue about $gp.
All in all, doesn't sound like a callee-saved register to me. Sounds like a caller-saved register. What am I misunderstanding here?
Linux programs on MIPS can be compiled as PIC or not. If compiled as PIC, they must use "abicalls", and their behaviour is a little different from the no-abicalls convention.
From the "section Position-Independent Function Prologue" of the "SYSTEM V APPLICATION BINARY INTERFACE - MIPS Processor Supplement 3rd Edition" we can cite:
After calculating the gp, a function allocates the local stack space and saves the gp on the stack, so it can be restored after subsequent function calls. In other words, the gp is a caller saved register.
The code in the following figure illustrates a position-independent function prologue. _gp_disp represents the offset between the beginning of the function and the global offset table.
name:
        la      gp, _gp_disp
        addu    gp, gp, t9
        addiu   sp, sp, -64
        sw      gp, 32(sp)
So in summary, if you're using -mabicalls, then gp is calculated at the beginning of every function that needs global symbols (with some exceptions), and additionally any code (abicalls or not) that calls abicalls code will ensure that the called function's address is stored in t9.

Forcing MSVC to generate FIST instructions with the /QIfist option

I'm using the /QIfist compiler switch regularly, which causes the compiler to generate FISTP instructions to round floating point values to integers, instead of calling the _ftol helper function.
How can I make it use FIST(P) DWORD, instead of QWORD?
FIST QWORD requires the CPU to store the result on stack, then read stack into register and finally store to destination memory, while FIST DWORD just stores directly into destination memory.
"FIST QWORD requires the CPU to store the result on stack, then read stack into register and finally store to destination memory, while FIST DWORD just stores directly into destination memory."
I don't understand what you are trying to say here.
The FIST and FISTP instructions differ from each other in exactly two ways:
FISTP pops the top value off of the floating point stack, while FIST does not. This is the obvious difference, and is reflected in the opcode naming: FISTP has that P suffix, which means "pop", just like FADDP, etc.
FISTP has an additional encoding that works with 64-bit (QWORD) operands. That means you can use FISTP to convert a floating point value to a 64-bit integer. FIST, on the other hand, maxes out at 32-bit (DWORD) operands.
(I don't think there's a technical reason for this. I certainly can't imagine it is related to the popping behavior. I assume that when the Intel engineers added support for 64-bit operands some time later, they figured there was no reason for a non-popping version. They were probably running out of opcode encodings.)
There are lots of online references for the x86 instruction set. For example, this site is the top hit for most Google searches. Or you can look in Intel's manuals (FIST/FISTP are on p. 365).
Where the two instructions read the value from, and where they store it to, is exactly the same. Both read the value from the top of the floating point stack, and both store the result to memory.
There would be absolutely no advantage to the compiler using FIST instead of FISTP. Remember that you always have to pop all values off of the floating point stack when exiting from a function, so if FIST is used, you'd have to follow it by an additional FSTP instruction. That might not be any slower, but it would needlessly inflate the code.
Besides, there's another reason that the compiler prefers FISTP: the support for 64-bit operands. It allows the code generator to be identical, regardless of what size integer you're rounding to.
The only time you might prefer to use FIST is if you're hand-writing assembly code and want to re-use the floating point value on the stack after rounding it. The compiler doesn't need to do that.
So anyway, all of that to say that the answer to your question is no. The compiler can't be made to generate FIST instructions automatically. If you're still insistent, you can write inline assembly that uses whatever instructions you want:
int RoundToNearestEven(float value)
{
    int result;
    __asm
    {
        fld  DWORD PTR value    // push the float onto the x87 stack
        fist DWORD PTR result   // store it as a rounded 32-bit integer
        // do something with the value on the floating point stack...
        //
        // ... but be sure to pop it off before returning
        fstp st(0)
    }
    return result;
}
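For what it's worth, here is how the function above behaves on ties, assuming a 32-bit MSVC build (the 64-bit compiler doesn't support __asm blocks) and the FPU left in its default round-to-nearest-even mode:

#include <stdio.h>

int main(void)
{
    printf("%d\n", RoundToNearestEven(2.5f));   /* prints 2, not 3 */
    printf("%d\n", RoundToNearestEven(3.5f));   /* prints 4 */
    return 0;
}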

Injecting shellcode

I have a .cmd file on a webserver with a variable user="...", vulnerable to buffer overflows. I can execute the .cmd file via ssh or via the web.
Now I have this shellcode:
#include <stdio.h>

char sc[] =
    "..."   /* shellcode bytes omitted */
    ;

int main(void)
{
    void (*s)(void);
    printf("size: %zu\n", sizeof(sc));
    s = (void (*)(void))sc;   /* the array must be in executable memory for this to work */
    s();
    return 0;
}
My problem is, I don't know how this all plays together. I know what the assembler and the C code do, but how do I inject the code into the running .cmd file?
cat "shellcode " | nc host cmd
You generally have to insert enough data so that your write ends up in part of the program's memory that gets executed. How to do that exactly depends entirely on the structure of the program with the overflow.
But, imagine if the program were specified by this ASM listing:
[SECTION .text]
global _start
_start:
        ;; ...
jmpagain:
        jmp     next
uname   db      "username"
next:
        mov     eax, uname
        ;; ...
        jmp     jmpagain
the string "username" is in memory immediately adjacent to an address that the instruction pointer visits. If the program writes some data to that area of memory without checking it's bounds, it will overwrite the code, and anytime the instruction pointer revisits the next function, the machine is going to execute whatever data overflowed in that part of memory. Supposing the write you are exploiting starts at the beginning of the string, and stops writing on some condition that provides enough room for you to inject your shellcode, you would prepend a byte string of the same length as "username" to your shellcode in the input. Then the beginning of your shellcode would be at the address of the next label.
But this is just a simple example demonstration of the basic principle. Actually getting your data to an area of memory that the instruction pointer visits is likely going to be a lot trickier. If you have access to the command file in question, you can debug it and dump the memory and trace how the area of memory is written to to see how you need to overflow the buffer to get to IP reachable memory.
It's important to reiterate that your shellcode not only needs to make it to memory that the instruction pointer passes over, it needs to be correctly aligned in that memory to execute in the way you expect it to. If the instruction pointer lands somewhere in the middle of your code rather than at the beginning of the first instruction, it obviously isn't going to do what you expect it to.
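If you want to see the basic mechanism in isolation first, here is a deliberately broken C sketch (undefined behaviour; the exact outcome depends on your compiler, flags, and layout): writing past a buffer clobbers whatever happens to live next to it, which is exactly what an exploit arranges to be something useful.

#include <stdio.h>
#include <string.h>

/* Two adjacent buffers; writing past the first clobbers the second. */
struct frame {
    char buffer[8];
    char adjacent[8];
};

int main(void)
{
    struct frame f;
    strcpy(f.adjacent, "intact");
    memcpy(f.buffer, "0123456789abc", 14);  /* overflows into f.adjacent */
    printf("%s\n", f.adjacent);             /* likely prints "89abc", not "intact" */
    return 0;
}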

Good references for the syscalls

I need some reference but a good one, possibly with some nice examples. I need it because I am starting to write code in assembly using the NASM assembler. I have this reference:
http://bluemaster.iu.hio.no/edu/dark/lin-asm/syscalls.html
which is quite nice and useful, but it has a lot of limitations because it doesn't explain the fields in the other registers. For example, if I am using the write syscall, I know I should put 4 in the EAX register, and ECX is probably a pointer to the string, but what about EBX and EDX? I would like that explained too: that EBX is the file descriptor (0 for stdin, 1 for stdout, etc.) and EDX is the length of the string, and so on. I hope you understand what I want; I couldn't find any such materials, so that's why I am writing here.
Thanks in advance.
The standard programming language in Linux is C. Because of that, the best descriptions of the system calls will show them as C functions to be called. Given their description as a C function and a knowledge of how to map them to the actual system call in assembly, you will be able to use any system call you want easily.
First, you need a reference for all the system calls as they would appear to a C programmer. The best one I know of is the Linux man-pages project, in particular the system calls section.
Let's take the write system call as an example, since it is the one in your question. As you can see, the first parameter is a signed integer, which is usually a file descriptor returned by the open syscall. These file descriptors could also have been inherited from your parent process, as usually happens for the first three file descriptors (0=stdin, 1=stdout, 2=stderr). The second parameter is a pointer to a buffer, and the third parameter is the buffer's size (as an unsigned integer). Finally, the function returns a signed integer, which is the number of bytes written, or a negative number for an error.
Now, how to map this to the actual system call? There are many ways to do a system call on 32-bit x86 (which is probably what you are using, based on your register names); be careful that it is completely different on 64-bit x86 (be sure you are assembling in 32-bit mode and linking a 32-bit executable; see this question for an example of how things can go wrong otherwise). The oldest, simplest and slowest of them in the 32-bit x86 is the int $0x80 method.
For the int $0x80 method, you put the system call number in %eax, and the parameters in %ebx, %ecx, %edx, %esi, %edi, and %ebp, in that order. Then you call int $0x80, and the return value from the system call is on %eax. Note that this return value is different from what the reference says; the reference shows how the C library will return it, but the system call returns -errno on error (for instance -EINVAL). The C library will move this to errno and return -1 in that case. See syscalls(2) and intro(2) for more detail.
So, in the write example, you would put the write system call number in %eax, the first parameter (file descriptor number) in %ebx, the second parameter (pointer to the string) in %ecx, and the third parameter (length of the string) in %edx. The system call will return in %eax either the number of bytes written, or the error number negated (if the return value is between -1 and -4095, it is a negated error number).
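As a concrete illustration, here is that same write call made from C with GCC inline assembly (a sketch assuming a 32-bit build, e.g. gcc -m32; the a/b/c/d constraints pin the operands to %eax/%ebx/%ecx/%edx):

int main(void)
{
    static const char msg[] = "hello\n";
    long ret;

    asm volatile ("int $0x80"
                  : "=a" (ret)              /* return value comes back in %eax */
                  : "a" (4),                /* __NR_write */
                    "b" (1),                /* fd 1 = stdout */
                    "c" (msg),              /* pointer to the buffer */
                    "d" (sizeof(msg) - 1)   /* number of bytes to write */
                  : "memory");

    return ret < 0 ? 1 : 0;
}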
Finally, how do you find the system call numbers? They can be found at /usr/include/linux/unistd.h. On my system, this just includes /usr/include/asm/unistd.h, which finally includes /usr/include/asm/unistd_32.h, so the numbers are there (for write, you can see __NR_write is 4). The same goes for the error numbers, which come from /usr/include/linux/errno.h (on my system, after chasing the inclusion chain I find the first ones at /usr/include/asm-generic/errno-base.h and the rest at /usr/include/asm-generic/errno.h). For the system calls which use other constants or structures, their documentation tells which headers you should look at to find the corresponding definitions.
Now, as I said, int $0x80 is the oldest and slowest method. Newer processors have special system call instructions which are faster. To use them, the kernel makes available a virtual dynamic shared object (the vDSO; it is like a shared library, but in memory only) with a function you can call to do a system call using the best method available for your hardware. It also makes available special functions to get the current time without even having to do a system call, and a few other things. Of course, it is a bit harder to use if you are not using a dynamic linker.
There is also another older method, the vsyscall, which is similar to the vDSO but uses a single page at a fixed address. This method is deprecated, will result in warnings on the system log if you are using recent kernels, can be disabled on boot on even more recent kernels, and might be removed in the future. Do not use it.
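Incidentally, you can see where the kernel mapped the vDSO for your process, because it advertises the address through the auxiliary vector. A quick glibc-based sketch:

#include <stdio.h>
#include <sys/auxv.h>

int main(void)
{
    /* AT_SYSINFO_EHDR points at the ELF header of the vDSO. */
    unsigned long vdso = getauxval(AT_SYSINFO_EHDR);
    printf("vDSO mapped at %#lx\n", vdso);   /* 0 if not present */
    return 0;
}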
If you download that web page (like it suggests in the second paragraph) and download the kernel sources, you can click the links in the "Source" column, and go directly to the source file that implements the system calls. You can read their C signatures to see what each parameter is used for.
If you're just looking for a quick reference, each of those system calls has a C library interface with the same name minus the sys_. So, for example, you could check out man 2 lseek to get information about the parameters for sys_lseek:
off_t lseek(int fd, off_t offset, int whence);
where, as you can see, the parameters match the ones from your HTML table:
%ebx          %ecx    %edx
unsigned int  off_t   unsigned int
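To tie the table back to real code, here is the same call made through the C interface (a sketch; the file name is arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0)
        return 1;

    /* For the raw syscall: fd -> %ebx, 0 -> %ecx, SEEK_END -> %edx. */
    off_t size = lseek(fd, 0, SEEK_END);
    printf("file is %ld bytes long\n", (long)size);

    close(fd);
    return 0;
}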

Why do the 32-bit and 64-bit Compiled Versions of this Program Populate Memory in this Way?

I am trying to better understand how the stack and heap work. I have run into a snag when comparing the 32-bit and 64-bit compiled versions of the same program. In both cases I used a guest Fedora 15 VM (both 32 and 64), gcc for compiling, gdb for debugging, and the same host hardware. The program in question is very simple and immediately below:
C program
void function(int a, int b, int c, int d) {
    int value;
    char buffer[10];
    value = 1234;
    buffer[0] = 'A';
}

int main() {
    function(1, 2, 3, 4);
}
In the interest of space, I omitted the assembly dump of the program; however if anyone thinks it might help them answer my questions, I'd be happy to include it.
32-bit Compiled Program:
Parameters 4 (0xbffff3e4), 3 (0xbffff3e0), 2 (0xbffff3dc), and 1 (0xbffff3d8) are pushed onto the stack first. Next, the address of the instruction following the call to function(), i.e. the return address (0x080483d1), is pushed. Finally, the base pointer of the previous stack frame (0xbffff3e8) is pushed.
(gdb) x/16xw $esp
0xbffff3c0: 0x00000000 0x410759c3 0x4105d237 0x00000000
0xbffff3d0: 0xbffff3e8 0x080483d1 0x00000001 0x00000002//pointers
0xbffff3e0: 0x00000003 0x00000004 0x00000000 0x4105d413//followed by params
0xbffff3f0: 0x00000001 0xbffff484 0xbffff48c 0x41040fc4
64-bit Compiled Program:
However, here the values 4, 3, 2, and 1 are nowhere to be seen. All I can see, no matter how far down the stack I look, is the return address (0x4004ae) and the previous stack frame's base pointer (0x7fffffffe210).
(gdb) x/16xg $rsp
0x7fffffffe200: 0x00007fffffffe210 0x00000000004004ae //pointers
0x7fffffffe210: 0x0000000000000000 0x00000036d042139d
0x7fffffffe220: 0x0000000000000000 0x00007fffffffe2f8
0x7fffffffe230: 0x0000000100000000 0x0000000000400491
0x7fffffffe240: 0x0000000000000000 0x7ade47f577d82f75
0x7fffffffe250: 0x0000000000400390 0x00007fffffffe2f0
0x7fffffffe260: 0x0000000000000000 0x0000000000000000
0x7fffffffe270: 0x8521b80ab3982f75 0x7ab3e77151682f75
64-bit Compiled Program with print statement:
Now, after adding a simple print statement:
printf("%d, %c\n", value, buffer[0]);
in function(), I can see the wayward parameters (see below, 0x7fffffffe1e0-0x7fffffffe1ec). I can also see the Base Pointer from the previous stack frame, 0x7fffffffe210 (in 0x7fffffffe200) and the return address 0x400520 (in 0x7fffffffe208). I believe it changed due to the new print statement. Why are 4, 3, 2, and 1 not visible without a print statement in this case? Is the 64-bit implementation of the gcc compiler smart enough to not 'waste' memory for parameters and local variables which are never used?
(gdb) x/16xg $rsp
0x7fffffffe1e0: 0x0000000300000004 0x0000000100000002 //parameters
0x7fffffffe1f0: 0x0000000000000000 0x00000000004003e0
0x7fffffffe200: 0x00007fffffffe210 0x0000000000400520 //pointers
0x7fffffffe210: 0x0000000000000000 0x00000036d042139d
0x7fffffffe220: 0x0000000000000000 0x00007fffffffe2f8
0x7fffffffe230: 0x0000000100000000 0x0000000000400503
0x7fffffffe240: 0x0000000000000000 0xd3c0c92559feaed9
0x7fffffffe250: 0x00000000004003e0 0x00007fffffffe2f0
Finally, why does the 32-bit OS place the parameters 4, 3, 2, and 1 higher in the stack than the previously mentioned pointers, while the 64-bit OS places them lower? I was under the impression that passed parameters were always placed on the stack first (and hence end up at higher addresses, since the stack grows toward lower addresses), followed by the saved base pointer and the return address (so the base pointer can be restored to its previous value and the calling function returned to). This is the behavior I observe in the 32-bit compiled code, but not in the 64-bit version. What am I misunderstanding? I appreciate any insight into this matter and apologize if my questions are unclear. Please let me know any way I can be more concise (or if I am factually incorrect at any point).
Thank you in advance.
The 64-bit ABI used by Linux differs considerably from the 32-bit ABI: in the 64-bit world, arguments are often passed in registers, rather than on the stack.
Before adding the printf(), you're not finding the arguments on the stack because the first (up to) six integer or pointer arguments are passed in registers, in the order %rdi, %rsi, %rdx, %rcx, %r8, %r9. So for function(1, 2, 3, 4), the values 1, 2, 3, and 4 arrive in %edi, %esi, %edx, and %ecx and never need to touch the stack.
After adding the printf(), they probably get saved on the stack in the process of register contents being shuffled around for the printf() call. Take a look at the assembly; it's probably obvious once you know what the ABI looks like.
