Linux perf_events annotation frame pointer confusion

I ran sudo perf record -F 99 find / followed by sudo perf report and selected "Annotate fdopendir" and here are the first seven instructions:
push %rbp
push %rbx
mov %edi,%esi
mov %edi,%ebx
mov $0x1,%edi
sub $0xa8,%rsp
mov %rsp,%rbp
The first instruction appears to be saving the caller's base frame pointer. I believe instructions 2 through 5 are irrelevant to this question but are here for completeness. Instructions 6 and 7 are confusing to me. Shouldn't the copy of rsp into rbp occur before subtracting 0xa8 from rsp?

The x86-64 System V ABI doesn't require making a traditional / legacy stack-frame. This looks close to a traditional stack frame setup, but it's definitely not because there's no mov %rsp, %rbp right after the first push %rbp.
We're seeing compiler-generated code that simply uses RBP as a temporary register, and is using it to hold a pointer to a local on the stack. It's just a coincidence that this happens to involve the instruction mov %rsp, %rbp sometime after push %rbp. This is not making a stack frame.
In x86-64 System V, RBX and RBP are the only 2 "low 8" registers that are call-preserved, and thus usable without REX prefixes in some cases (e.g. for the push/pop, and when used in addressing modes), saving code-size. GCC prefers to use them before saving/restoring any of R12..R15 (see "What registers are preserved through a linux x86-64 function call"). For pointers, copying them with mov always requires a REX prefix for 64-bit operand-size, so there are fewer savings than for 32-bit integers, but gcc still goes for RBX then RBP, in that order, when it needs to save/restore call-preserved regs in a function.
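To make the code-size point concrete, here is a minimal Python sketch of how push of a 64-bit register is encoded: one opcode byte (0x50 plus the low 3 bits of the register number), with a 0x41 REX prefix needed only for R8..R15. The helper name encode_push and the register table are illustrative, not part of any real assembler API.

```python
# Register numbers as used in x86-64 instruction encoding.
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
        "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]

def encode_push(reg: str) -> bytes:
    """Hypothetical helper: machine code for `push %reg` (64-bit register)."""
    n = REGS.index(reg)
    opcode = bytes([0x50 + (n & 7)])      # 50+rd: push r64
    prefix = b"\x41" if n >= 8 else b""   # REX.B selects r8..r15
    return prefix + opcode

for reg in ("rbx", "rbp", "r12"):
    print(reg, encode_push(reg).hex())
```

The rbx and rbp results match the 53 and 55 bytes in the fdopendir listing, while r12 costs one extra byte for the prefix.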
Disassembly of /lib/libc.so.6 (glibc) on my system (Arch Linux) shows similar but different code-gen for fdopendir. You stopped the disassembly too soon, before it makes a function call. That sheds some light on why it wanted a call-preserved temporary register: it wanted the var in a reg across the call.
00000000000c1260 <fdopendir>:
c1260: 55 push %rbp
c1261: 89 fe mov %edi,%esi
c1263: 53 push %rbx
c1264: 89 fb mov %edi,%ebx
c1266: bf 01 00 00 00 mov $0x1,%edi
c126b: 48 81 ec a8 00 00 00 sub $0xa8,%rsp
c1272: 64 48 8b 04 25 28 00 00 00 mov %fs:0x28,%rax # stack-check cookie
c127b: 48 89 84 24 98 00 00 00 mov %rax,0x98(%rsp)
c1283: 31 c0 xor %eax,%eax
c1285: 48 89 e5 mov %rsp,%rbp # save a pointer
c1288: 48 89 ea mov %rbp,%rdx # and pass it as a function arg
c128b: e8 90 7d 02 00 callq e9020 <__fxstat>
c1290: 85 c0 test %eax,%eax
c1292: 78 6a js c12fe <fdopendir+0x9e>
c1294: 8b 44 24 18 mov 0x18(%rsp),%eax
c1298: 25 00 f0 00 00 and $0xf000,%eax
c129d: 3d 00 40 00 00 cmp $0x4000,%eax
c12a2: 75 4c jne c12f0 <fdopendir+0x90>
....
c12c1: 48 89 e9 mov %rbp,%rcx # pass the pointer as the 4th arg
c12c4: 89 c2 mov %eax,%edx
c12c6: 31 f6 xor %esi,%esi
c12c8: 89 df mov %ebx,%edi
c12ca: e8 d1 f7 ff ff callq c0aa0 <__alloc_dir>
c12cf: 48 8b 8c 24 98 00 00 00 mov 0x98(%rsp),%rcx
c12d7: 64 48 33 0c 25 28 00 00 00 xor %fs:0x28,%rcx # check the stack cookie
c12e0: 75 38 jne c131a <fdopendir+0xba>
c12e2: 48 81 c4 a8 00 00 00 add $0xa8,%rsp
c12e9: 5b pop %rbx
c12ea: 5d pop %rbp
c12eb: c3 retq
This is pretty silly code-gen; gcc could have simply used mov %rsp, %rcx the 2nd time it needed it. I'd call this a missed-optimization. It never needed that pointer in a call-preserved register because it always knew where it was relative to RSP.
(Even if it hadn't been exactly at RSP+0, lea something(%rsp), %rdx and lea something(%rsp), %rcx would have been totally fine the two times it was needed, with probably less total cost than saving/restoring RBP + the required mov instructions.)
Or it could have used mov 0x18(%rbp),%eax instead of 0x18(%rsp) to save a byte of code-size in that addressing mode. Avoiding direct references to RSP between function calls reduces the number of stack-sync uops Intel CPUs need to insert.

Related

Can I use scasb on a program's own code? [duplicate]

I wrote:
mov 60, %rax
GNU as accepted it, although I should have written
mov $60, %rax
Is there any difference between two such calls?
Yes; the first loads the value stored in memory at address 60 and stores the result in rax, the second stores the immediate value 60 into rax.
Just try it...
mov 60,%rax
mov $60,%rax
mov 0x60,%rax
0000000000000000 <.text>:
0: 48 8b 04 25 3c 00 00 mov 0x3c,%rax
7: 00
8: 48 c7 c0 3c 00 00 00 mov $0x3c,%rax
f: 48 8b 04 25 60 00 00 mov 0x60,%rax
16: 00
Ewww! Historically the dollar sign meant hex ($60 = 0x60) in some assemblers, but gas has a history of departing from established assembly syntax. Historically, x86 assemblers also allowed 60h to indicate hex, but I got an error when I tried that.
So with and without the dollar sign you get a different instruction.
0x8B is the opcode for a register/memory-to-register mov, and 0xC7 is the opcode for an immediate-to-register/memory mov. So, as davmac answered, mov 60,%rax is a move from a memory location to a register, and mov $60,%rax is a move of an immediate to a register.
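The difference between the two encodings can be checked mechanically. The following is a toy Python sketch that recognizes only the two byte patterns from the objdump output above (48 8b 04 25 followed by a 32-bit address, and 48 c7 c0 followed by a 32-bit immediate); describe_mov is a made-up name, and this is not a general x86 decoder.

```python
import struct

def describe_mov(code: bytes) -> str:
    """Toy decoder for the two encodings in the listing; nothing else."""
    if code[:4] == bytes.fromhex("488b0425"):
        # REX.W + 8B, ModRM=04, SIB=25: load from an absolute 32-bit address
        (addr,) = struct.unpack_from("<i", code, 4)
        return f"load 8 bytes from memory at {addr:#x} into rax"
    if code[:3] == bytes.fromhex("48c7c0"):
        # REX.W + C7 /0, ModRM=c0: sign-extended 32-bit immediate into rax
        (imm,) = struct.unpack_from("<i", code, 3)
        return f"put constant {imm:#x} into rax"
    raise ValueError("not one of the two encodings shown above")

print(describe_mov(bytes.fromhex("488b04253c000000")))  # mov 60,%rax
print(describe_mov(bytes.fromhex("48c7c03c000000")))    # mov $60,%rax
```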

Decoding Shellcode from MSFvenom(xor x64)?

Encoding the shellcode three times using the x64 xor Encoder
I'm writing my own exploit, and I'm wondering whether I need to decode the shellcode when adding it to my program, or whether the decoder stub is already inside the shellcode. If I need to decode it, how can I do that when no key is given?
No. You don't have to decrypt the shellcode. I ran the same command and got something which looked like this
0: 48 31 c9 xor rcx, rcx
3: 48 81 e9 b6 ff ff ff sub rcx, 0xffffffffffffffb6
a: 48 8d 05 ef ff ff ff lea rax, [rip+0xffffffffffffffef] # 0x0
11: 48 bb af cc c5 c0 90 movabs rbx, 0x29153c90c0c5ccaf
18: 3c 15 29
1b: 48 31 58 27 xor QWORD PTR [rax+0x27], rbx
1f: 48 2d f8 ff ff ff sub rax, 0xfffffffffffffff8
25: e2 f4 loop 0x1b
This was the starting part of the shellcode, followed by the xor'd 2nd-iteration payload. On decrypting, I saw that it had a similar stub attached. So you don't have to decrypt. Just point execution to the start of the buffer.

Why is there a tight polling loop on register rax in libc nanosleep?

The disassembly of nanosleep in libc-2.7.so on 64-bit Linux looks like this:
Disassembly of section .text:
00000000000bd460 <__nanosleep>:
cmpl $0x0,__libc_multiple_threads
jne 10
00000000000bd469 <__nanosleep_nocancel>:
mov $0x23,%eax
syscall
10: cmp $0xfffffffffffff001,%rax
jae 40
retq
sub $0x8,%rsp
callq __libc_enable_asynccancel
mov %rax,(%rsp)
mov $0x23,%eax
syscall
mov (%rsp),%rdi
mov %rax,%rdx
callq __libc_disable_asynccancel
mov %rdx,%rax
add $0x8,%rsp
40: cmp $0xfffffffffffff001,%rax
jae 40
retq
mov _DYNAMIC+0x2e0,%rcx
neg %eax
mov %eax,%fs:(%rcx)
or $0xffffffffffffffff,%rax
retq
Near the bottom of this assembly code, there is this polling loop:
40: cmp $0xfffffffffffff001,%rax
jae 40
How would the value of rax change while this loop is executing? Wouldn't it either loop forever or not at all? What is this loop meant to accomplish?
I suspect this is related to the syscall instruction since the return value of syscall is put into register rax, but I'm not sure how this is related exactly. The way the code is written makes it look like syscall doesn't block and the value in rax changes spontaneously but that doesn't seem right.
I'm interested to know what's going on here.
I don't see these spin loops.
Here's what I get from objdump -d /lib/x86_64-linux-gnu/libc.so.6, with what you show as loops highlighted with ** and the address they jump to with ->.
00000000000c0f10 <__nanosleep>:
c0f10: 83 3d 5d 31 30 00 00 cmpl $0x0,0x30315d(%rip) # 3c4074 <argp_program_version_hook+0x1cc>
c0f17: 75 10 jne c0f29 <__nanosleep+0x19>
c0f19: b8 23 00 00 00 mov $0x23,%eax
c0f1e: 0f 05 syscall
c0f20: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax
** c0f26: 73 31 jae c0f59 <__nanosleep+0x49>
c0f28: c3 retq
c0f29: 48 83 ec 08 sub $0x8,%rsp
c0f2d: e8 3e 72 04 00 callq 108170 <pthread_setcanceltype+0x80>
c0f32: 48 89 04 24 mov %rax,(%rsp)
c0f36: b8 23 00 00 00 mov $0x23,%eax
c0f3b: 0f 05 syscall
c0f3d: 48 8b 3c 24 mov (%rsp),%rdi
c0f41: 48 89 c2 mov %rax,%rdx
c0f44: e8 87 72 04 00 callq 1081d0 <pthread_setcanceltype+0xe0>
c0f49: 48 89 d0 mov %rdx,%rax
c0f4c: 48 83 c4 08 add $0x8,%rsp
c0f50: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax
** c0f56: 73 01 jae c0f59 <__nanosleep+0x49>
c0f58: c3 retq
-> c0f59: 48 8b 0d 08 cf 2f 00 mov 0x2fcf08(%rip),%rcx # 3bde68 <_IO_file_jumps+0x7c8>
c0f60: f7 d8 neg %eax
c0f62: 64 89 01 mov %eax,%fs:(%rcx)
c0f65: 48 83 c8 ff or $0xffffffffffffffff,%rax
c0f69: c3 retq
c0f6a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
The rest of the code is similar. Maybe it's an issue with the disassembly?
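For what the cmp/jae pair actually does in the second listing: the jae is an unsigned compare against 0xfffffffffffff001, so it is taken only when rax holds a value from -4095 to -1, which is how the Linux kernel reports a failed system call. The target block then negates eax, stores it through %fs (presumably the thread-local errno), and returns -1. A minimal Python model of that tail, with finish_syscall as an illustrative name only:

```python
MASK64 = (1 << 64) - 1

def finish_syscall(rax: int):
    """Model of the code after `syscall`: returns (return_value, errno_or_None)."""
    rax &= MASK64                      # view rax as an unsigned 64-bit value
    # cmp $0xfffffffffffff001,%rax ; jae  -> unsigned "above or equal"
    if rax >= 0xFFFFFFFFFFFFF001:
        errno = (-rax) & 0xFFFFFFFF    # neg %eax gives the positive error number
        return -1, errno               # or $-1,%rax after storing errno
    return rax, None                   # success path: plain retq

print(finish_syscall(0))    # call succeeded
print(finish_syscall(-4))   # kernel returned -4 (EINTR on Linux)
```

The branch is evaluated once per call and never jumps backwards, so nothing here polls.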

Smallest Stack Frame Size

I'm currently doing the Capture-the-Flag event by Stripe (you should check it out if you haven't seen it yet). The event requires you to look at disassembled executables a lot, and my knowledge of asm is rusty.
I keep seeing the constant 0x18 show up as some sort of minimum stack size. For instance, in a function that allocates a char[1024] array and calls the function strcpy(), the assembly looks like this:
8048484: 55 push %ebp
8048485: 89 e5 mov %esp,%ebp
8048487: 81 ec 18 04 00 00 sub $0x418,%esp
804848d: 8b 45 08 mov 0x8(%ebp),%eax
8048490: 89 44 24 04 mov %eax,0x4(%esp)
8048494: 8d 85 f8 fb ff ff lea -0x408(%ebp),%eax
804849a: 89 04 24 mov %eax,(%esp)
804849d: e8 e6 fe ff ff call 8048388 <strcpy@plt>
80484a2: c9 leave
80484a3: c3 ret
Why is the extra space needed?

operand generation of CALL instruction on x86-64 AMD

Following is the output of objdump of a sample program,
080483b4 <display>:
80483b4: 55 push %ebp
80483b5: 89 e5 mov %esp,%ebp
80483b7: 83 ec 18 sub $0x18,%esp
80483ba: 8b 45 0c mov 0xc(%ebp),%eax
80483bd: 89 44 24 04 mov %eax,0x4(%esp)
80483c1: 8d 45 fe lea 0xfffffffe(%ebp),%eax
80483c4: 89 04 24 mov %eax,(%esp)
80483c7: e8 ec fe ff ff call 80482b8 <strcpy@plt>
80483cc: 8b 45 08 mov 0x8(%ebp),%eax
80483cf: 89 44 24 04 mov %eax,0x4(%esp)
80483d3: c7 04 24 f0 84 04 08 movl $0x80484f0,(%esp)
80483da: e8 e9 fe ff ff call 80482c8 <printf@plt>
80483df: c9 leave
80483e0: c3 ret
080483e1 <main>:
80483e1: 8d 4c 24 04 lea 0x4(%esp),%ecx
80483e5: 83 e4 f0 and $0xfffffff0,%esp
80483e8: ff 71 fc pushl 0xfffffffc(%ecx)
80483eb: 55 push %ebp
80483ec: 89 e5 mov %esp,%ebp
80483ee: 51 push %ecx
80483ef: 83 ec 24 sub $0x24,%esp
80483f2: c7 44 24 04 f3 84 04 movl $0x80484f3,0x4(%esp)
80483f9: 08
80483fa: c7 04 24 0a 00 00 00 movl $0xa,(%esp)
8048401: e8 ae ff ff ff call 80483b4 <display>
8048406: b8 00 00 00 00 mov $0x0,%eax
804840b: 83 c4 24 add $0x24,%esp
804840e: 59 pop %ecx
804840f: 5d pop %ebp
8048410: 8d 61 fc lea 0xfffffffc(%ecx),%esp
What I need to understand is this: in main, at address 8048401, we see call 80483b4, but the machine code is e8 ae ff ff ff. I see that the CALL instruction is E8, but how does the address of the function, 80483b4, get encoded as FFFFFFAE? I searched Google a lot but did not find anything. Can anyone please explain?
E8 is the opcode for "Call Relative", meaning the destination address is computed by adding the operand to the address of the next instruction. The operand is 0xFFFFFFAE, which is negative 0x52. 0x8048406 - 0x52 is 0x80483b4.
Most disassemblers helpfully calculate the actual target address rather than just give you the relative address in the operand.
Complete info for x86 ISA at: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-2a-manual.html
Interesting question. I've had a look at Intel's documentation and the E8 opcode is CALL rel16/32. 0xffffffae is actually a 32-bit two's complement signed integer equal to -82 decimal; it is a relative address from the byte immediately after the opcode and its operands.
If you do the math you can see it checks out:
0x8048406 - 82 = 0x80483b4
This puts the instruction pointer at the beginning of the display function.
Near calls are typically IP-relative, meaning the "address" is actually an offset from the instruction pointer. In this case, EIP points to the next instruction (so its value is 8048406). Add ffffffae (that is, -00000052 in two's complement) to it, and you get 80483b4.
Note that all this math is 32-bit. You're not doing any 64-bit operations here (or your registers would have Rs instead of Es in their names, and the addresses would be much longer).
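The same arithmetic can be reproduced in a few lines of Python. This is a small sketch that handles only the 5-byte E8 rel32 form in 32-bit code; call_target is an illustrative helper name, not something from a disassembler library.

```python
import struct

def call_target(insn_addr: int, insn_bytes: bytes) -> int:
    """Target of a 5-byte `E8 rel32` near call in 32-bit code."""
    if len(insn_bytes) != 5 or insn_bytes[0] != 0xE8:
        raise ValueError("expected a 5-byte E8 rel32 call")
    (rel,) = struct.unpack("<i", insn_bytes[1:])   # signed, little-endian
    next_insn = insn_addr + len(insn_bytes)        # EIP after fetching the call
    return (next_insn + rel) & 0xFFFFFFFF          # wrap to 32 bits

# The call in main: bytes e8 ae ff ff ff at address 0x8048401
print(hex(call_target(0x8048401, bytes.fromhex("e8aeffffff"))))  # 0x80483b4
```

Feeding it the two calls inside display (at 80483c7 and 80483da) gives 80482b8 and 80482c8, the same targets objdump printed.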

Resources