gcc x64 stack manipulation - linux

I try to understand the way gcc x64 organize the stack, a small program generate this asm
(gdb) disassemble *main
Dump of assembler code for function main:
0x0000000000400534 <main+0>: push rbp
0x0000000000400535 <main+1>: mov rbp,rsp
0x0000000000400538 <main+4>: sub rsp,0x30
0x000000000040053c <main+8>: mov DWORD PTR [rbp-0x14],edi
0x000000000040053f <main+11>: mov QWORD PTR [rbp-0x20],rsi
0x0000000000400543 <main+15>: mov DWORD PTR [rsp],0x7
0x000000000040054a <main+22>: mov r9d,0x6
0x0000000000400550 <main+28>: mov r8d,0x5
0x0000000000400556 <main+34>: mov ecx,0x4
0x000000000040055b <main+39>: mov edx,0x3
0x0000000000400560 <main+44>: mov esi,0x2
0x0000000000400565 <main+49>: mov edi,0x1
0x000000000040056a <main+54>: call 0x4004c7 <addAll>
0x000000000040056f <main+59>: mov DWORD PTR [rbp-0x4],eax
0x0000000000400572 <main+62>: mov esi,DWORD PTR [rbp-0x4]
0x0000000000400575 <main+65>: mov edi,0x400688
0x000000000040057a <main+70>: mov eax,0x0
0x000000000040057f <main+75>: call 0x400398 <printf#plt>
0x0000000000400584 <main+80>: mov eax,0x0
0x0000000000400589 <main+85>: leave
0x000000000040058a <main+86>: ret
Why it reserve up to 0x30 bytes just to save edi and rsi
I don't see any where restore values of edi and rsi as required by ABI
edi and rsi save at position that has delta 0x20 - 0x14 = 0xC, not a continuous region, does it make sense?
follow is source code
int mix(int a,int b,int c,int d,int e,int f, int g){
return a | b | c | d | e | f |g;
}
int addAll(int a,int b,int c,int d,int e,int f, int g){
return a+b+c+d+e+f+g+mix(a,b,c,d,e,f,g);
}
int main(int argc,char **argv){
int total;
total = addAll(1,2,3,4,5,6,7);
printf("result is %d\n",total);
return 0;
}
Edit
It's seem that stack has stored esi,rdi, 7th parameter call to addAll and total , it should take 4x8 = 32 (0x20) bytes, it round up to 0x30 for some reasons.

I dont know your original code, but locals are also stored on the stack, and when you have some local variables that space is also "allocated". Also for alignment reason it can be, that he "rounded" to the next multiple of 16. I would guess you have a local for passing the result from your addAll to the printf, and that is stored at rbp-04.
I just had look in your linked ABI - where does it say that the callee has to restore rdi and rsi? It says already on page 15, footnote:
Note that in contrast to the Intel386 ABI, %rdi, and %rsi belong to the called function, not
the caller.
Afaik they are used for passing the first arguments to the callee.
0xC are 12. This comes also from alignment, as you can see, he just needs to store edi not rdi, for alignment purpose I assume that he aligns it on a 4 byte border, while si is rsi, which is 64 bit and aligned on 8 byte border.

2: The ABI explicitly says that rdi/rsi need NOT be saved by the called function; see page 15 and footnote 5.
1 and 3: unsure; probably stack alignment issues.

Related

Top of a stack frame is not on the lowest address when examining $rsp [duplicate]

I have tried to understand this basing on a square function in c++ at godbolt.org . Clearly, return, parameters and local variables use “rbp - alignment” for this function.
Could someone please explain how this is possible?
What then would rbp + alignment do in this case?
int square(int num){
int n = 5;// just to test how locals are treated with frame pointer
return num * num;
}
Compiler (x86-64 gcc 11.1)
Generated Assembly:
square(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi. ;\\Both param and local var use rbp-*
mov DWORD PTR[rbp-4], 5. ;//
mov eax, DWORD PTR [rbp-20]
imul eax, eax
pop rbp
ret
This is one of those cases where it’s handy to distinguish between parameters and arguments. In short: arguments are the values given by the caller, while parameters are the variables holding them.
When square is called, the caller places the argument in the rdi register, in accordance with the standard x86-64 calling convention. square then allocates a local variable, the parameter, and places the argument in the parameter. This allows the parameter to be used like any other variable: be read, written into, having its address taken, and so on. Since in this case it’s the callee that allocated the memory for the parameter, it necessarily has to reside below the frame pointer.
With an ABI where arguments are passed on the stack, the callee would be able to reuse the stack slot containing the argument as the parameter. This is exactly what happens on x86-32 (pass -m32 to see yourself):
square(int): # #square(int)
push ebp
mov ebp, esp
push eax
mov eax, dword ptr [ebp + 8]
mov dword ptr [ebp - 4], 5
mov eax, dword ptr [ebp + 8]
imul eax, dword ptr [ebp + 8]
add esp, 4
pop ebp
ret
Of course, if you enabled optimisations, the compiler would not bother with allocating a parameter on the stack in the callee; it would just use the value in the register directly:
square(int): # #square(int)
mov eax, edi
imul eax, edi
ret
GCC allows "leaf" functions, those that don't call other functions, to not bother creating a stack frame. The free stack is fair game to do so as these fns wish.

how to properly display a value from the stack using a system call [duplicate]

I am trying to use the write syscall in order to reproduce the putchar function behavior which prints a single character. My code is as follows,
asm_putchar:
push rbp
mov rbp, rsp
mov r8, rdi
call:
mov rax, 1
mov rdi, 1
mov rsi, r8
mov rdx, 1
syscall
return:
mov rsp, rbp
pop rbp
ret
From man 2 write, you can see the signature of write is,
ssize_t write(int fd, const void *buf, size_t count);
It takes a pointer (const void *buf) to a buffer in memory. You can't pass it a char by value, so you have to store it to memory and pass a pointer.
(Don't print one char at a time unless you only have one to print, that's really inefficient. Construct a buffer in memory and print that. e.g. this x86-64 Linux NASM function: How do I print an integer in Assembly Level Programming without printf from the c library?)
A NASM version of GCC: putchar(char) in inline assembly:
; x86-64 System V calling convention: input = byte in DIL
; clobbers: RDI, RSI, RDX, RCX, R11 (last 2 by syscall itself)
; returns: RAX = write return value: 1 for success, -1..-4095 for error
writechar:
mov byte [rsp-4], dil ; store the char from RDI
mov edi, 1 ; EDI = fd=1 = stdout
lea rsi, [rsp-4] ; RSI = buf
mov edx, edi ; RDX = len = 1
syscall ; rax = write(1, buf, 1)
ret
If you do pass an invalid pointer in RSI, such as '2' (integer 50), the system call will return -EFAULT (-14) in RAX. (The kernel returns error codes on bad pointers to system calls, instead of delivering a SIGSEGV like it would if you deref in user-space).
See also What are the return values of system calls in Assembly?
Instead of writing code to check return values, in toy programs / experiments you should just run them under strace ./a.out, especially if you're writing your own _start without libc there won't be any other system calls during startup that you don't make yourself, so it's very easy to read the output. How should strace be used?

Compact shellcode to print a 0-terminated string pointed-to by a register, given puts or printf at known absolute addresses?

Background: I am a beginner trying to understand how to golf assembly, in particular to solve an online challenge.
EDIT: clarification: I want to print the value at the memory address of RDX. So “SUPER SECRET!”
Create some shellcode that can output the value of register RDX in <= 11 bytes. Null bytes are not allowed.
The program is compiled with the c standard library, so I have access to the puts / printf statement. It’s running on x86 amd64.
$rax : 0x0000000000010000 → 0x0000000ac343db31
$rdx : 0x0000555555559480 → "SUPER SECRET!"
gef➤ info address puts
Symbol "puts" is at 0x7ffff7e3c5a0 in a file compiled without debugging.
gef➤ info address printf
Symbol "printf" is at 0x7ffff7e19e10 in a file compiled without debugging.
Here is my attempt (intel syntax)
xor ebx, ebx ; zero the ebx register
inc ebx ; set the ebx register to 1 (STDOUT
xchg ecx, edx ; set the ECX register to RDX
mov edx, 0xff ; set the length to 255
mov eax, 0x4 ; set the syscall to print
int 0x80 ; interrupt
hexdump of my code
My attempt is 17 bytes and includes null bytes, which aren't allowed. What other ways can I lower the byte count? Is there a way to call puts / printf while still saving bytes?
FULL DETAILS:
I am not quite sure what is useful information and what isn't.
File details:
ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=5810a6deb6546900ba259a5fef69e1415501b0e6, not stripped
Source code:
void main() {
char* flag = get_flag(); // I don't get access to the function details
char* shellcode = (char*) mmap((void*) 0x1337,12, 0, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
mprotect(shellcode, 12, PROT_READ | PROT_WRITE | PROT_EXEC);
fgets(shellcode, 12, stdin);
((void (*)(char*))shellcode)(flag);
}
Disassembly of main:
gef➤ disass main
Dump of assembler code for function main:
0x00005555555551de <+0>: push rbp
0x00005555555551df <+1>: mov rbp,rsp
=> 0x00005555555551e2 <+4>: sub rsp,0x10
0x00005555555551e6 <+8>: mov eax,0x0
0x00005555555551eb <+13>: call 0x555555555185 <get_flag>
0x00005555555551f0 <+18>: mov QWORD PTR [rbp-0x8],rax
0x00005555555551f4 <+22>: mov r9d,0x0
0x00005555555551fa <+28>: mov r8d,0xffffffff
0x0000555555555200 <+34>: mov ecx,0x22
0x0000555555555205 <+39>: mov edx,0x0
0x000055555555520a <+44>: mov esi,0xc
0x000055555555520f <+49>: mov edi,0x1337
0x0000555555555214 <+54>: call 0x555555555030 <mmap#plt>
0x0000555555555219 <+59>: mov QWORD PTR [rbp-0x10],rax
0x000055555555521d <+63>: mov rax,QWORD PTR [rbp-0x10]
0x0000555555555221 <+67>: mov edx,0x7
0x0000555555555226 <+72>: mov esi,0xc
0x000055555555522b <+77>: mov rdi,rax
0x000055555555522e <+80>: call 0x555555555060 <mprotect#plt>
0x0000555555555233 <+85>: mov rdx,QWORD PTR [rip+0x2e26] # 0x555555558060 <stdin##GLIBC_2.2.5>
0x000055555555523a <+92>: mov rax,QWORD PTR [rbp-0x10]
0x000055555555523e <+96>: mov esi,0xc
0x0000555555555243 <+101>: mov rdi,rax
0x0000555555555246 <+104>: call 0x555555555040 <fgets#plt>
0x000055555555524b <+109>: mov rax,QWORD PTR [rbp-0x10]
0x000055555555524f <+113>: mov rdx,QWORD PTR [rbp-0x8]
0x0000555555555253 <+117>: mov rdi,rdx
0x0000555555555256 <+120>: call rax
0x0000555555555258 <+122>: nop
0x0000555555555259 <+123>: leave
0x000055555555525a <+124>: ret
Register state right before shellcode is executed:
$rax : 0x0000000000010000 → "EXPLOIT\n"
$rbx : 0x0000555555555260 → <__libc_csu_init+0> push r15
$rcx : 0x000055555555a4e8 → 0x0000000000000000
$rdx : 0x0000555555559480 → "SUPER SECRET!"
$rsp : 0x00007fffffffd940 → 0x0000000000010000 → "EXPLOIT\n"
$rbp : 0x00007fffffffd950 → 0x0000000000000000
$rsi : 0x4f4c5058
$rdi : 0x00007ffff7fa34d0 → 0x0000000000000000
$rip : 0x0000555555555253 → <main+117> mov rdi, rdx
$r8 : 0x0000000000010000 → "EXPLOIT\n"
$r9 : 0x7c
$r10 : 0x000055555555448f → "mprotect"
$r11 : 0x246
$r12 : 0x00005555555550a0 → <_start+0> xor ebp, ebp
$r13 : 0x00007fffffffda40 → 0x0000000000000001
$r14 : 0x0
$r15 : 0x0
(This register state is a snapshot at the assembly line below)
●→ 0x555555555253 <main+117> mov rdi, rdx
0x555555555256 <main+120> call rax
Since I already spilled the beans and "spoiled" the answer to the online challenge in comments, I might as well write it up. 2 key tricks:
Create 0x7ffff7e3c5a0 (&puts) in a register with lea reg, [reg + disp32], using the known value of RDI which is within the +-2^31 range of a disp32. (Or use RBP as a starting point, but not RSP: that would need a SIB byte in the addressing mode).
This is a generalization of the code-golf trick of lea edi, [rax+1] trick to create small constants from other small constants (especially 0) in 3 bytes, with code that runs less slowly than push imm8 / pop reg.
The disp32 is large enough to not have any zero bytes; you have a couple registers to choose from in case one had been too close.
Copy a 64-bit register in 2 bytes with push reg / pop reg, instead of 3-byte mov rdi, rdx (REX + opcode + modrm). No savings if either push needs a REX prefix (for R8..R15), and actually costs bytes if both are "non-legacy" registers.
See other answers on Tips for golfing in x86/x64 machine code on codegolf.SE for more.
bits 64
lea rsi, [rdi - 0x166f30]
;; add rbp, imm32 ; alternative, but that would mess up a call-preserved register so we might crash on return.
push rdx
pop rdi ; copy RDX to first arg, x86-64 SysV calling convention
jmp rsi ; tailcall puts
This is exactly 11 bytes, and I don't see a way for it to be smaller. add r64, imm32 is also 7 bytes, same as LEA. (Or 6 bytes if the register is RAX, but even the xchg rax, rdi short form would cost 2 bytes to get it there, and the RAX value is still the fgets return value, which is the small mmap buffer address.)
The puts function pointer doesn't fit in 32 bits, so we need a REX prefix on any instruction that puts it into a register. Otherwise we could just mov reg, imm32 (5 bytes) with the absolute address, not deriving it from another register.
$ nasm -fbin -o exploit.bin -l /dev/stdout exploit.asm
1 bits 64
2 00000000 488DB7D090E9FF lea rsi, [rdi - 0x166f30]
3 ;; add rbp, imm32 ; we can avoid messing up any call-preserved registers
4 00000007 52 push rdx
5 00000008 5F pop rdi ; copy to first arg
6 00000009 FFE6 jmp rsi ; tailcall
$ ll exploit.bin
-rw-r--r-- 1 peter peter 11 Apr 24 04:09 exploit.bin
$ ./a.out < exploit.bin # would work if the addresses in my build matched yours
My build of your incomplete .c uses different addresses on my machine, but it does reach this code (at address 0x10000, mmap_min_addr which mmap picks after the amusing choice of 0x1337 as a hint address, which isn't even page aligned but doesn't result in EIVAL on current Linux.)
Since we only tailcall puts with correct stack alignment and don't modify any call-preserved registers, this should successfully return to main.
Note that 0 bytes (ASCII NUL, not NULL) would actually work in shellcode for this test program, if not for the requirement that forbids it.
The input is read using fgets (apparently to simulate a gets() overflow).
fgets actually can read a 0 aka '\0'; the only critical character is 0xa aka '\n' newline. See Is it possible to read null characters correctly using fgets or gets_s?
Often buffer overflows exploit a strcpy or something else that stops on a 0 byte, but fgets only stops on EOF or newline. (Or the buffer size, a feature gets is missing, hence its deprecation and removal from even the ISO C standard library! It's literally impossible to use safely unless you control the input data). So yes, it's totally normal to forbid zero bytes.
BTW, your int 0x80 attempt is not viable: What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? - you can't use the 32-bit ABI to pass 64-bit pointers to write, and the string you want to output is not in the low 32 bits of virtual address space.
Of course, with the 64-bit syscall ABI, you're fine if you can hardcode the length.
push rdx
pop rsi
shr eax, 16 ; fun 3-byte way to turn 0x10000` into `1`, __NR_write 64-bit, instead of just push 1 / pop
mov edi, eax ; STDOUT_FD = __NR_write
lea edx, [rax + 13 - 1] ; 3 bytes. RDX = 13 = string length
; or mov dl, 0xff ; 2 bytes leaving garbage in rest of RDX
syscall
But this is 12 bytes, as well as hard-coding the length of the string (which was supposed to be part of the secret?).
mov dl, 0xff could make sure the length was at least 255, and actually much more in this case, if you don't mind getting reams of garbage after the string you want, until write hits an unmapped page and returns early. That would save a byte, making this 11.
(Fun fact, Linux write does not return an error when it's successfully written some bytes; instead it returns how many it did write. If you try again with buf + write_len, you would get a -EFAULT return value for passing a bad pointer to write.)

Writing a putchar in Assembly for x86_64 with 64 bit Linux?

I am trying to use the write syscall in order to reproduce the putchar function behavior which prints a single character. My code is as follows,
asm_putchar:
push rbp
mov rbp, rsp
mov r8, rdi
call:
mov rax, 1
mov rdi, 1
mov rsi, r8
mov rdx, 1
syscall
return:
mov rsp, rbp
pop rbp
ret
From man 2 write, you can see the signature of write is,
ssize_t write(int fd, const void *buf, size_t count);
It takes a pointer (const void *buf) to a buffer in memory. You can't pass it a char by value, so you have to store it to memory and pass a pointer.
(Don't print one char at a time unless you only have one to print, that's really inefficient. Construct a buffer in memory and print that. e.g. this x86-64 Linux NASM function: How do I print an integer in Assembly Level Programming without printf from the c library?)
A NASM version of GCC: putchar(char) in inline assembly:
; x86-64 System V calling convention: input = byte in DIL
; clobbers: RDI, RSI, RDX, RCX, R11 (last 2 by syscall itself)
; returns: RAX = write return value: 1 for success, -1..-4095 for error
writechar:
mov byte [rsp-4], dil ; store the char from RDI
mov edi, 1 ; EDI = fd=1 = stdout
lea rsi, [rsp-4] ; RSI = buf
mov edx, edi ; RDX = len = 1
syscall ; rax = write(1, buf, 1)
ret
If you do pass an invalid pointer in RSI, such as '2' (integer 50), the system call will return -EFAULT (-14) in RAX. (The kernel returns error codes on bad pointers to system calls, instead of delivering a SIGSEGV like it would if you deref in user-space).
See also What are the return values of system calls in Assembly?
Instead of writing code to check return values, in toy programs / experiments you should just run them under strace ./a.out, especially if you're writing your own _start without libc there won't be any other system calls during startup that you don't make yourself, so it's very easy to read the output. How should strace be used?

Why so much stack space used for each recursion?

I have a simple recursive function RCompare() that calls a more complex function Compare() which returns before the recursive call. Each recursion level uses 248 bytes of stack space which seems like way more than it should. Here is the recursive function:
void CMList::RCompare(MP n1) // RECURSIVE and Looping compare function
{
auto MP ne=n1->mf;
while(StkAvl() && Compare(n1=ne->mb))
RCompare(n1); // Recursive call !
}
StkAvl() is a simple stack space check function that compares the address of an auto variable to the value of an address near the end of the stack stored in a static variable.
It seems to me that the only things added to the stack in each recursion are two pointer variables (MP is a pointer to a structure) and the stuff that one function call stores, a few saved registers, base pointer, return address, etc., all 32-bit (4 byte) values. There's no way that is 248 bytes is it?
I don't no how to actually look at the stack in a meaningful way in Visual Studio 2008.
Thanks
Added disassembly:
CMList::RCompare:
0043E000 push ebp
0043E001 mov ebp,esp
0043E003 sub esp,0E4h
0043E009 push ebx
0043E00A push esi
0043E00B push edi
0043E00C push ecx
0043E00D lea edi,[ebp-0E4h]
0043E013 mov ecx,39h
0043E018 mov eax,0CCCCCCCCh
0043E01D rep stos dword ptr es:[edi]
0043E01F pop ecx
0043E020 mov dword ptr [ebp-8],edx
0043E023 mov dword ptr [ebp-14h],ecx
0043E026 mov eax,dword ptr [n1]
0043E029 mov ecx,dword ptr [eax+20h]
0043E02C mov dword ptr [ne],ecx
0043E02F mov ecx,dword ptr [this]
0043E032 call CMList::StkAvl (41D46Fh)
0043E037 test eax,eax
0043E039 je CMList::RCompare+63h (43E063h)
0043E03B mov eax,dword ptr [ne]
0043E03E mov ecx,dword ptr [eax+1Ch]
0043E041 mov dword ptr [n1],ecx
0043E044 mov edx,dword ptr [n1]
0043E047 mov ecx,dword ptr [this]
0043E04A call CMList::Compare (41DA05h)
0043E04F movzx edx,al
0043E052 test edx,edx
0043E054 je CMList::RCompare+63h (43E063h)
0043E056 mov edx,dword ptr [n1]
0043E059 mov ecx,dword ptr [this]
0043E05C call CMList::RCompare (41EC9Dh)
0043E061 jmp CMList::RCompare+2Fh (43E02Fh)
0043E063 pop edi
0043E064 pop esi
0043E065 pop ebx
0043E066 add esp,0E4h
0043E06C cmp ebp,esp
0043E06E call #ILT+5295(__RTC_CheckEsp) (41E4B4h)
0043E073 mov esp,ebp
0043E075 pop ebp
0043E076 ret
Why 0E4h?
More Info:
class mch // match node structure
{
public:
T_FSZ c1,c2; // file indexes
T_MSZ sz; // match size
enum ntyp typ; // type of node
mch *mb,*mf; // pointers to next and previous match nodes
};
typedef mch * MP; // for use in casting (MP) x
Should be a plain old pointer right? The same pointers are in the structure itself and they are just normal 4 byte pointers.
Edit: Added:
#pragma check_stack(off)
void CMList::RCompare(MP n1) // RECURSIVE and Looping compare function
{
auto MP ne=n1->mf;
while(StkAvl() && Compare(n1=ne->mb))
RCompare(n1); // Recursive call !
} // end RCompare()
#pragma check_stack()
But it didn't change anything. :(
Now what?
Note that on debug mode the compiler binds many bytes from the stack, on every function,
to catch buffer overflow bugs.
0043E003 sub esp, 0E4h ; < -- bound 228 bytes
...
0043E00D lea edi,[ebp-0E4h]
0043E013 mov ecx, 39h
0043E018 mov eax, 0CCCCCCCCh ; <-- sentinel
0043E01D rep stos dword ptr es:[edi] ; <-- write sentinels
Edit: the OP Harvey found the pragma that turns on/off stack probes.
check_stack
Instructs the compiler to turn off
stack probes if off (or –) is
specified,
or to turn on stack probes
if on (or +) is specified.
#pragma check_stack([ {on | off}] )
#pragma check_stack{+ | –}
Update: well, probes is another story, as it appears.
Try this: /GZ (Enable Stack Frame Run-Time Error Checking)
That also depends on the compiler and the architecture you're running - e.g. it could be aligning to 256 bytes for faster execution, so that each level uses the 8 bytes of the variable + 248 padding.
I guess some space has to be allocated for exception handling. Did you look at the disassembly?
In Visual Studio you can look at the register "esp", the stack pointer, in a watch (or register) windows. Set a breakpoint in your function between one call and the next to see who much stack you consume.
On a pain function on debug mode in Visual Studio 2008 it is 16 byes per function call.

Resources