Trouble understanding an x86-64 function preamble - linux

I am experiencing a crash, and while investigating I found myself totally blocked by the following code:
0000000000000a00 <_IO_vfprintf>:
a00: 55 push %rbp
a01: 48 89 e5 mov %rsp,%rbp
a04: 41 57 push %r15
a06: 41 56 push %r14
a08: 41 55 push %r13
a0a: 41 54 push %r12
a0c: 53 push %rbx
a0d: 48 81 ec 48 06 00 00 sub $0x648,%rsp
a14: 48 89 95 98 f9 ff ff mov %rdx,0xfffffffffffff998(%rbp)
This is generated by running objdump --disassemble /usr/lib64/libc.a on a 64-bit Linux x86 system, and then searching through the output. This is AT&T syntax, so destinations are on the right.
Specifically, I don't understand the last instruction. It seems to be writing the value of the rdx register into memory somewhere on the stack (far, far away), before the function has touched that register. To me, this doesn't make any sense.
I tried reading up on the calling conventions, and my best theory now is that rdx is used for a parameter, so the code is basically "returning" the parameter value directly. This is not the end of the function, so it's not really returning, of course.

Yes, it's a parameter. The ABI used by Linux assigns up to 6 "INTEGER" (<= 64-bit integer, or pointer) type parameters to registers, in the obvious and easy-to-remember order %rdi, %rsi, %rdx, %rcx, %r8, %r9.
The stack frame is 1648 bytes (sub $0x648,%rsp reserves 1608 bytes, and five 64-bit registers, 40 bytes in total, were pushed after %rbp was set up), and 0xfffffffffffff998 is -1640 when read as a signed displacement.
So the code is storing the 3rd parameter near the bottom of the stack frame.
(Note: the Windows 64-bit ABI is different to the Linux one.)
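The arithmetic in this answer can be checked with a small C sketch. The helper names signed_disp and bytes_below_rbp are made up for illustration, and the cast assumes the usual two's complement reinterpretation that x86-64 compilers perform:

```c
#include <stdint.h>

/* Read the 64-bit displacement that objdump prints as a signed offset.
   Assumes two's complement conversion, as on x86-64 compilers. */
int64_t signed_disp(uint64_t raw)
{
    return (int64_t)raw;
}

/* Distance from %rbp down to %rsp after the prologue: every push made
   after "mov %rsp,%rbp" takes 8 bytes, then "sub $imm,%rsp" takes imm. */
uint64_t bytes_below_rbp(unsigned pushes_after_mov, uint64_t sub_imm)
{
    return (uint64_t)pushes_after_mov * 8u + sub_imm;
}
```

With the values from the listing, signed_disp(0xfffffffffffff998) gives -1640 and bytes_below_rbp(5, 0x648) gives 1648, so the store lands 8 bytes above the lowest address of the frame.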

Related


ELF label address

I have the following code in .s file:
pushq $afterjmp
nop
afterjmp:
movl %eax, %edx
Its object file has the following:
20: 68 00 00 00 00 pushq $0x0
25: 90 nop
0000000000000026 <afterjmp>:
26: 89 c2 mov %eax,%edx
After linking, it becomes:
400572: 68 78 05 40 00 pushq $0x400578
400577: 90 nop
400578: 89 c2 mov %eax,%edx
How does the argument 0x0 to pushq at byte 20 of the object file get converted to 0x400578 in the final executable?
Which section of the object file contains this information?
You answered your own question: After linking....
Here is a good article:
Linkers and Loaders
In particular, note the section about "symbol relocation":
Relocation. Compilers and assemblers generate the object code for each
input module with a starting address of zero. Relocation is the
process of assigning load addresses to different parts of the program
by merging all sections of the same type into one section. The code
and data section also are adjusted so they point to the correct
runtime addresses.
There's no way to know the final address of "afterjmp" when a single object file is assembled. Only when all the object files are linked into an executable image can the addresses (relative to offset "0") be computed. That's one of the jobs of the linker: to keep track of "symbol references" (like "afterjmp"), and compute the machine address ("symbol relocation").
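As a rough illustration of what the linker does to those four zero bytes, here is a simplified C model. It is only a sketch: the function names are hypothetical, and a real linker is driven by relocation entries that the assembler records in the object file (on x86-64 ELF these normally live in a section such as .rela.text, which objdump -r or readelf -r will list). The model assumes the 32-bit field simply receives the symbol's final absolute address.

```c
#include <stddef.h>
#include <stdint.h>

/* Final address of a symbol: where its section was placed in the
   executable plus the symbol's offset inside that section. */
uint32_t symbol_address(uint32_t section_base, uint32_t symbol_off)
{
    return section_base + symbol_off;
}

/* Overwrite a 4-byte placeholder with a little-endian 32-bit value,
   which is how the immediate of "pushq $imm32" is stored. */
void patch_abs32(uint8_t *code, size_t field_off, uint32_t value)
{
    for (int i = 0; i < 4; i++)
        code[field_off + i] = (uint8_t)(value >> (8 * i));
}
```

In the listings above the push sits at offset 0x20 in the object file and at 0x400572 after linking, so the code was placed at 0x400572 - 0x20 = 0x400552. The label is at offset 0x26, and 0x400552 + 0x26 = 0x400578, which is exactly the value (bytes 78 05 40 00) that appears after the 0x68 opcode in the linked output.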

What registers are preserved through a linux x86-64 function call

I believe I understand how the Linux x86-64 ABI uses registers and the stack to pass parameters to a function (cf. previous ABI discussion). What I'm confused about is which registers, if any, are expected to be preserved across a function call. That is, which registers are guaranteed not to get clobbered?
Here's the complete table of registers and their use from the documentation [PDF Link]:
r12, r13, r14, r15, rbx, rsp, rbp are the callee-saved registers - they have a "Yes" in the "Preserved across function calls" column.
Experimental approach: disassemble GCC code
Mostly for fun, but also as a quick verification that you understood the ABI right.
Let's try to clobber all registers with inline assembly to force GCC to save and restore them:
main.c
#include <inttypes.h>
uint64_t inc(uint64_t i) {
__asm__ __volatile__(
""
: "+m" (i)
:
: "rax",
"rbx",
"rcx",
"rdx",
"rsi",
"rdi",
"rbp",
"rsp",
"r8",
"r9",
"r10",
"r11",
"r12",
"r13",
"r14",
"r15",
"ymm0",
"ymm1",
"ymm2",
"ymm3",
"ymm4",
"ymm5",
"ymm6",
"ymm7",
"ymm8",
"ymm9",
"ymm10",
"ymm11",
"ymm12",
"ymm13",
"ymm14",
"ymm15"
);
return i + 1;
}
int main(int argc, char **argv) {
(void)argv;
return inc(argc);
}
GitHub upstream.
Compile and disassemble:
gcc -std=gnu99 -O3 -ggdb3 -Wall -Wextra -pedantic -o main.out main.c
objdump -d main.out
Disassembly contains:
00000000000011a0 <inc>:
11a0: 55 push %rbp
11a1: 48 89 e5 mov %rsp,%rbp
11a4: 41 57 push %r15
11a6: 41 56 push %r14
11a8: 41 55 push %r13
11aa: 41 54 push %r12
11ac: 53 push %rbx
11ad: 48 83 ec 08 sub $0x8,%rsp
11b1: 48 89 7d d0 mov %rdi,-0x30(%rbp)
11b5: 48 8b 45 d0 mov -0x30(%rbp),%rax
11b9: 48 8d 65 d8 lea -0x28(%rbp),%rsp
11bd: 5b pop %rbx
11be: 41 5c pop %r12
11c0: 48 83 c0 01 add $0x1,%rax
11c4: 41 5d pop %r13
11c6: 41 5e pop %r14
11c8: 41 5f pop %r15
11ca: 5d pop %rbp
11cb: c3 retq
11cc: 0f 1f 40 00 nopl 0x0(%rax)
and so we clearly see that the following are pushed and popped:
rbx
r12
r13
r14
r15
rbp
The only missing one from the spec is rsp, but we expect the stack to be restored of course. Careful reading of the assembly confirms that it is maintained in this case:
sub $0x8, %rsp: allocates 8 bytes on the stack so that mov %rdi, -0x30(%rbp) can save %rdi there, which is done for the inline assembly +m constraint
lea -0x28(%rbp), %rsp restores %rsp back to its value before the sub, i.e. 5 pushes (0x28 = 40 bytes) below where mov %rsp, %rbp left it
there are 6 pushes and 6 corresponding pops
no other instructions touch %rsp
Tested in Ubuntu 18.10, GCC 8.2.0.
The ABI specifies what a piece of standard-conforming software is allowed to expect. It is written primarily for authors of compilers, linkers and other language processing software. These authors want their compiler to produce code that will work properly with code that is compiled by the same (or a different) compiler. They all have to agree to a set of rules: how are formal arguments to functions passed from caller to callee, how are function return values passed back from callee to caller, which registers are preserved/scratch/undefined across the call boundary, and so on.
For example, one rule states that the generated assembly code for a function must save the value of a preserved register before changing the value, and that the code must restore the saved value before returning to its caller. For a scratch register, the generated code is not required to save and restore the register value; it can do so if it wants, but standard-conforming software is not allowed to depend upon this behavior (if it does it is not standard-conforming software).
If you are writing assembly code, you are responsible for playing by these same rules (you are playing the role of the compiler). That is, if your code changes a callee-preserved register, you are responsible for inserting instructions that save and restore the original register value. If your assembly code calls an external function, your code must pass arguments in the standard-conforming way, and it can depend upon the fact that, when the callee returns, preserved register values are in fact preserved.
The rules define how standards-conforming software can get along. However, it is perfectly legal to write (or generate) code that does not play by these rules! Compilers do this all the time, because they know that the rules don't need to be followed under certain circumstances.
For example, consider a C function named foo that is declared as follows, and never has its address taken:
static int foo(int x);
At compile time, the compiler is 100% certain that this function can only be called by other code in the file(s) it is currently compiling. Function foo cannot be called by anything else, ever, given the definition of what it means to be static. Because the compiler knows all of the callers of foo at compile time, the compiler is free to use whatever calling sequence it wants (up to and including not making a call at all, that is, inlining the code for foo into the callers of foo).
As an author of assembly code, you can do this too. That is, you can implement a "private agreement" between two or more routines, as long as that agreement doesn't interfere with or violate the expectations of standards-conforming software.
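A minimal sketch of the static case, for anyone who wants to look at it in a disassembler. The body of foo and the caller call_foo are invented for illustration; only the shape (internal linkage, address never taken) comes from the answer above:

```c
/* foo has internal linkage and its address is never taken, so every
   caller is visible to the compiler in this translation unit. */
static int foo(int x)
{
    return x * 2;
}

/* The only externally visible entry point. This one must follow the
   standard calling convention, because unknown code may call it. */
int call_foo(int v)
{
    return foo(v) + 1;
}
```

Compiled with optimization (for example gcc -O2) and inspected with objdump -d, call_foo will most likely contain no call instruction at all, since the compiler is free to inline foo. The exact output depends on the compiler and its version.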

Does cmpxchg call provided by linux ever crash?

I am using cmpxchg() provided by the Linux kernel (SLES11-SP2).
It's panicking.
The exact point where it crashes is the instruction at 2005:
if (cmpxchg(var, old, new) == old)
2002: 48 89 d8 mov %rbx,%rax
2005: f0 4d 0f b1 34 24 lock cmpxchg %r14,(%r12)
200b: 48 39 c3 cmp %rax,%rbx
200e: 74 27 je 2037 <atomicPatchFnPtr+0x77>
Any clue about how I can go about debugging this? Is this happening due to a race condition in locking a variable?
Or do I need to report this as a kernel bug?
The lock cmpxchg instruction can cause an access violation if the address it is passed (in %r12 here) is invalid. That is probably the variable var in the line of code above. It suggests that var is pointing to some invalid memory. It isn't a race in the cmpxchg function, but it might still be a race condition in the calling function.
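To make the point concrete, here is a userspace analogue using C11 atomics. This is not the kernel's cmpxchg() macro, and try_swap is a hypothetical wrapper; it only shows that the compare-and-swap dereferences its first argument, so the pointer has to be valid at the moment of the call:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Replace *var with new_val only if it currently holds old_val.
   Returns true when the swap happened. On x86-64 this typically
   compiles down to a lock cmpxchg on the memory at var. */
bool try_swap(_Atomic uintptr_t *var, uintptr_t old_val, uintptr_t new_val)
{
    if (var == NULL)   /* catches NULL only, not dangling pointers */
        return false;
    return atomic_compare_exchange_strong(var, &old_val, new_val);
}
```

The NULL check is only a partial guard. If var points at memory that has already been freed or unmapped by another thread, the instruction still faults, which is the kind of race in the calling code that the answer describes. Checking the value of %r12 in the panic output against valid kernel addresses would be a reasonable first step.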

Why does the VC++ compiler MOV+PUSH args instead of just PUSH them? x86

In this disassembly from VC++ a function call is being made. The compiler MOVs the local pointers to a register before pushing them:
memcpy( nodeNewLocation, pNode, sizeCurrentNode );
0041A5DA 8B 45 F8 mov eax,dword ptr [ebp-8]
0041A5DD 50 push eax
0041A5DE 8B 4D 0C mov ecx,dword ptr [ebp+0Ch]
0041A5E1 51 push ecx
0041A5E2 8B 55 D4 mov edx,dword ptr [ebp-2Ch]
0041A5E5 52 push edx
0041A5E6 E8 67 92 FF FF call 00413852
0041A5EB 83 C4 0C add esp,0Ch
Why not just push them directly? I.e.
push dword ptr [ebp-8]
Also, if you are going to do a separate push, why not do it manually? In other words, instead of doing "push eax" above, do
mov [esp], eax
and so on. The advantage of this is that after doing the 3 movs you can do a single subtract to set the new stack pointer, instead of implicitly subtracting three times with the pushes.
UPDATE---Release version
This is the same code compiled for release:
; 741 : memcpy( nodeNewLocation, pNode, sizeCurrentNode );
00087 8b 45 f8 mov eax, DWORD PTR _sizeCurrentNode$[ebp]
0008a 8b 7b 04 mov edi, DWORD PTR [ebx+4]
0008d 50 push eax
0008e 56 push esi
0008f 57 push edi
00090 e8 00 00 00 00 call _memcpy
00095 83 c4 0c add esp, 12 ; 0000000cH
Definitely more efficient than the debug version, but it is still doing a MOV/PUSH combo.
This is an optimization. It is explicitly mentioned in the Intel processor manuals, volume 4, section 12.3.3.6:
In Intel Atom microarchitecture, using PUSH/POP instructions to manage stack space
and address adjustment between function calls/returns will be more optimal than
using ENTER/LEAVE alternatives. This is because PUSH/POP will not need MSROM
flows and stack pointer address update is done at AGU.
When a callee function need to return to the caller, the callee could issue POP instruction
to restore data and restore the stack pointer from the EBP.
Assembly/Compiler Coding Rule 19. (MH impact, M generality) For Intel
Atom processors, favor register form of PUSH/POP and avoid using LEAVE; Use LEA
to adjust ESP instead of ADD/SUB.
The rest of the manual isn't that clear about the reason, but it does mention a possible 3 cycle AGU stall on implicit ESP adjustments.
I suspect it only does it in debug builds, or possibly in some situations where it's warranted by pipelining or other considerations (e.g. it could put a parameter into esi and use it after the call). I've looked into some binaries, and MSVC definitely does use such pushes:
push ebx ; mthd
push dword ptr [ebp+place+4]
push dword ptr [ebp+place] ; pos
push [ebp+filedes] ; fh
call __lseeki64_nolock
(code from the CRT)
As for the second question, instructions addressing esp are longer than pushes: "push eax" is one byte while "mov [esp-8], eax" is four bytes. In fact, this approach (mov instead of push) is used by GCC by default since a couple versions ago (option -maccumulate-outgoing-args) and it has led to notable increases in code size. Supposedly it makes code faster but I'm unconvinced.
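The size argument can be spelled out by writing the encodings down as byte arrays. The mov load is taken from the debug listing in the question; the other encodings are the standard 32-bit forms as I understand them, and the array and function names are just for illustration:

```c
#include <stddef.h>

/* 32-bit x86 machine code for the instructions being compared. */
const unsigned char push_reg[]      = { 0x50 };                   /* push eax */
const unsigned char push_mem[]      = { 0xFF, 0x75, 0xF8 };       /* push dword ptr [ebp-8] */
const unsigned char mov_load[]      = { 0x8B, 0x45, 0xF8 };       /* mov eax, dword ptr [ebp-8] */
const unsigned char mov_store_esp[] = { 0x89, 0x44, 0x24, 0xF8 }; /* mov [esp-8], eax */

/* Code size of each way of getting one argument onto the stack. */
size_t mov_push_bytes(void)  { return sizeof mov_load + sizeof push_reg; }
size_t push_mem_bytes(void)  { return sizeof push_mem; }
size_t mov_store_bytes(void) { return sizeof mov_store_esp; }
```

So the mov+push pair costs 4 bytes per argument against 3 for a direct push from memory, and a store relative to esp costs 4 bytes on top of the load that fills the register. The esp-relative form needs an extra SIB byte, which is where the additional size comes from.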
I actually figured out the reason for it. It has to do with the way instructions are pipelined on the Pentium MMX. There are two pipelines, U and V, which allows MMX processors to process 2 instructions at a time IF they are pairable. PUSHs are not pairable with one another, but they are pairable with MOVs. So, if you write:
mov eax, [indirect]
mov esi, [indirect]
push eax
push esi
then what happens is that instructions #1 and #3 get paired and #2 and #4 get paired, so these four instructions effectively run in the same number of cycles as a single mov/push, and a single mov/push is faster than two push [indirect]s. This exact case is described in detail in Section 4.3, p. 41, Examples 4.11a and 4.11b, of the microarchitecture optimization guide by Agner Fog, which is widely available on the internet.

Resources