Related
I believe I understand how the linux x86-64 ABI uses registers and stack to pass parameters to a function (cf. previous ABI discussion). What I'm confused about is if/what registers are expected to be preserved across a function call. That is, what registers are guarenteed not to get clobbered?
Here's the complete table of registers and their use from the documentation [PDF Link]:
r12, r13, r14, r15, rbx, rsp, rbp are the callee-saved registers - they have a "Yes" in the "Preserved across function calls" column.
Experimental approach: disassemble GCC code
Mostly for fun, but also as a quick verification that you understood the ABI right.
Let's try to clobber all registers with inline assembly to force GCC to save and restore them:
main.c
#include <inttypes.h>
uint64_t inc(uint64_t i) {
__asm__ __volatile__(
""
: "+m" (i)
:
: "rax",
"rbx",
"rcx",
"rdx",
"rsi",
"rdi",
"rbp",
"rsp",
"r8",
"r9",
"r10",
"r11",
"r12",
"r13",
"r14",
"r15",
"ymm0",
"ymm1",
"ymm2",
"ymm3",
"ymm4",
"ymm5",
"ymm6",
"ymm7",
"ymm8",
"ymm9",
"ymm10",
"ymm11",
"ymm12",
"ymm13",
"ymm14",
"ymm15"
);
return i + 1;
}
int main(int argc, char **argv) {
(void)argv;
return inc(argc);
}
GitHub upstream.
Compile and disassemble:
gcc -std=gnu99 -O3 -ggdb3 -Wall -Wextra -pedantic -o main.out main.c
objdump -d main.out
Disassembly contains:
00000000000011a0 <inc>:
11a0: 55 push %rbp
11a1: 48 89 e5 mov %rsp,%rbp
11a4: 41 57 push %r15
11a6: 41 56 push %r14
11a8: 41 55 push %r13
11aa: 41 54 push %r12
11ac: 53 push %rbx
11ad: 48 83 ec 08 sub $0x8,%rsp
11b1: 48 89 7d d0 mov %rdi,-0x30(%rbp)
11b5: 48 8b 45 d0 mov -0x30(%rbp),%rax
11b9: 48 8d 65 d8 lea -0x28(%rbp),%rsp
11bd: 5b pop %rbx
11be: 41 5c pop %r12
11c0: 48 83 c0 01 add $0x1,%rax
11c4: 41 5d pop %r13
11c6: 41 5e pop %r14
11c8: 41 5f pop %r15
11ca: 5d pop %rbp
11cb: c3 retq
11cc: 0f 1f 40 00 nopl 0x0(%rax)
and so we clearly see that the following are pushed and popped:
rbx
r12
r13
r14
r15
rbp
The only missing one from the spec is rsp, but we expect the stack to be restored of course. Careful reading of the assembly confirms that it is maintained in this case:
sub $0x8, %rsp: allocates 8 bytes on stack to save %rdi at %rdi, -0x30(%rbp), which is done for the inline assembly +m constraint
lea -0x28(%rbp), %rsp restores %rsp back to before the sub, i.e. 5 pops after mov %rsp, %rbp
there are 6 pushes and 6 corresponding pops
no other instructions touch %rsp
Tested in Ubuntu 18.10, GCC 8.2.0.
The ABI specifies what a piece of standard-conforming software is allowed to expect. It is written primarily for authors of compilers, linkers and other language processing software. These authors want their compiler to produce code that will work properly with code that is compiled by the same (or a different) compiler. They all have to agree to a set of rules: how are formal arguments to functions passed from caller to callee, how are function return values passed back from callee to caller, which registers are preserved/scratch/undefined across the call boundary, and so on.
For example, one rule states that the generated assembly code for a function must save the value of a preserved register before changing the value, and that the code must restore the saved value before returning to its caller. For a scratch register, the generated code is not required to save and restore the register value; it can do so if it wants, but standard-conforming software is not allowed to depend upon this behavior (if it does it is not standard-conforming software).
If you are writing assembly code, you are responsible for playing by these same rules (you are playing the role of the compiler). That is, if your code changes a callee-preserved register, you are responsible for inserting instructions that save and restore the original register value. If your assembly code calls an external function, your code must pass arguments in the standard-conforming way, and it can depend upon the fact that, when the callee returns, preserved register values are in fact preserved.
The rules define how standards-conforming software can get along. However, it is perfectly legal to write (or generate) code that does not play by these rules! Compilers do this all the time, because they know that the rules don't need to be followed under certain circumstances.
For example, consider a C function named foo that is declared as follows, and never has its address taken:
static foo(int x);
At compile-time, the compiler is 100% certain that this function can only be called by other code in the file(s) it is currently compiling. Function foo cannot be called by anything else, ever, given the definition of what it means to be static. Because the compiler knows all of the callers of foo at compile time, the compiler is free to use whatever calling sequence it wants (up to and including not making a call at all, that is, inlining the code for foo into the callers of foo.
As an author of assembly code, you can do this too. That is, you can implement a "private agreement" between two or more routines, as long as that agreement doesn't interfere with or violate the expectations of standards-conforming software.
I have this program:
static int aux() {
return 1;
}
int _start(){
int a = aux();
return a;
}
When I compile it using GCC with flags -nostdlib -m32 -fpie and generate an ELF binary, I get the following assembly code:
00001000 <aux>:
1000: f3 0f 1e fb endbr32
1004: 55 push %ebp
1005: 89 e5 mov %esp,%ebp
1007: e8 2d 00 00 00 call 1039 <__x86.get_pc_thunk.ax>
100c: 05 e8 2f 00 00 add $0x2fe8,%eax
1011: b8 01 00 00 00 mov $0x1,%eax
1016: 5d pop %ebp
1017: c3 ret
00001018 <_start>:
1018: f3 0f 1e fb endbr32
101c: 55 push %ebp
101d: 89 e5 mov %esp,%ebp
101f: 83 ec 10 sub $0x10,%esp
1022: e8 12 00 00 00 call 1039 <__x86.get_pc_thunk.ax>
1027: 05 cd 2f 00 00 add $0x2fcd,%eax
102c: e8 cf ff ff ff call 1000 <aux>
1031: 89 45 fc mov %eax,-0x4(%ebp)
1034: 8b 45 fc mov -0x4(%ebp),%eax
1037: c9 leave
1038: c3 ret
00001039 <__x86.get_pc_thunk.ax>:
1039: 8b 04 24 mov (%esp),%eax
103c: c3 ret
I know that the get_pc_thunk function is used to implement position-independent code in x86, but in this case I can't understand why it is being used. My questions are:
The function is returning the address of the next instruction in the eax register and, in both usages, an add instruction is being used to make eax point to the GOT. Normally, (at least when accessing global variables), this eax register would be immediately used to access a global variable in the table. In this case, however, the eax is being completely ignored. What is going on?
I also don't understand why the get_pc_thunk is even present in the code, since both call instructions are using relative addresses. Since the addresses are relative, shouldn't they already be position-independent out of the box?
Thanks!
You haven't enabled optimisation, so GCC emits function prologues without regard to if they are useful in the function in question.
To see the result of get_pc_thunk used access a global variable.
To remove the useless calls to get_pc_thunk enable optimisation for example by adding -O2 to the GCC command line.
If, however, I move the aux() function to another compilation unit, the get_pc_thunk function remains being called, even with -O2, and, again, its return value is being ignored.
IIRC, EBX=GOT point is assumed/required by the PLT itself, and the call has to go through the PLT because it's not known when compiling this compilation unit that an aux definition will be statically linked with it. (https://godbolt.org/z/Yere9o shows that effect for main with just a prototype for aux(), not a definition it can inline.)
With a "hidden" ELF visibility attribute, we can get that to go away because the compiler knows it doesn't need to indirect through the PLT because a call rel32 will be known at static link time without needing runtime relocation: https://godbolt.org/z/73dGKq
__attribute__((visibility("hidden"))) int aux(void);
int _start(){
int a = aux();
return a;
}
gcc10.1 -O2 -m32 -fpie
_start:
jmp aux
IMO it makes sense to have the call in object files generated for compilation units that are calling external functions, but I don't understand why the linker (or the 'flow') is not removing them in the final binary.
#felipeek: Good question. The linker doesn't know when it can relax a call foo#plt to call foo because that also disables symbol interposition. Even if there is a definition of foo in this ELF shared object, a definition in one loaded earlier could override it / take precedence. I think this "problem" is due to the fact that PIE executables evolved out of a kind of hack: put an entry point in a shared object and the dynamic linker will be willing to run it. i.e. at an ELF level, PIE executables are the same as .so, and -fpie and -fPIC look the same to the linker.
The linker can go the other way, though: if making a normal non-PIE executable (ELF type = EXEC), it can turn call foo into call foo#plt, but that PLT itself doesn't have to be PIE/PIC so it doesn't need EBX=GOT.
Are we saying that all calls to other compilation units will invoke a totally unnecessary call in the final binary when PIE is required?
No, only ones in 32-bit PIE code where you fail to tell the compiler that it's an "internal" symbol using ELF "hidden" visibility. You can even have 2 names for the same symbol, one with hidden visibility, so you can make a function that libraries can resolve by name, but that you can still call from within the executable using simple call rel32 instead of clunky indirect calls via the PLT.
This is one of the downsides of PIE. Even in 64-bit code, without the attribute you get jmp aux#PLT. (Or with -fno-plt, an indirect call using RIP-relative addressing for the GOT entry.)
32-bit PIE really sucks a lot for performance, like on average 15% (measured a while ago on CPUs at the time, could possibly be somewhat different.) Much smaller effect on x86-64 where RIP-relative addressing is available, like a couple %. 32-bit absolute addresses no longer allowed in x86-64 Linux? has some links to more details.
I want to do a statistic of memory bytes access on programs running on Linux (X86_64 architecture). I use perf tool to dump the file like this:
: ffffffff81484700 <load2+0x484700>:
2.86 : ffffffff8148473b: 41 8b 57 04 mov 0x4(%r15),%edx
5.71 : ffffffff81484800: 65 8b 3c 25 1c b0 00 mov %gs:0xb01c,%edi
22.86 : ffffffff814848a0: 42 8b b4 39 80 00 00 mov 0x80(%rcx,%r15,1),%esi
25.71 : ffffffff814848d8: 42 8b b4 39 80 00 00 mov 0x80(%rcx,%r15,1),%esi
2.86 : ffffffff81484947: 80 bb b0 00 00 00 00 cmpb $0x0,0xb0(%rbx)
2.86 : ffffffff81484954: 83 bb 88 03 00 00 01 cmpl $0x1,0x388(%rbx)
5.71 : ffffffff81484978: 80 79 40 00 cmpb $0x0,0x40(%rcx)
2.86 : ffffffff8148497e: 48 8b 7c 24 08 mov 0x8(%rsp),%rdi
5.71 : ffffffff8148499b: 8b 71 34 mov 0x34(%rcx),%esi
5.71 : ffffffff814849a4: 0f af 34 24 imul (%rsp),%esi
My current method is to analyze file and get all memory access instructions, such as move, cmp, etc. Then calculate every access bytes of every instruction, such as mov 0x4(%r15),%edx will add 4 bytes.
I want to know whether there is possible way to calculate through machine code , such as by analyzing "41 8b 57 04", I can also add 4 bytes. Because I am not familiar with X86_64 machine code, could anyone give any clues? Or is there any better way to do statistics? Thanks in advance!
See https://stackoverflow.com/a/20319753/120163 for information about decoding Intel instructions; in fact, you really need to refer to Intel reference manuals: http://download.intel.com/design/intarch/manuals/24319101.pdf If you only want to do this manually for a few instructions, you can just look up the data in these manuals.
If you want to automate the computation of instruction total-memory-access, you will need a function that maps instructions to the amount of data accessed. Since the instruction set is complex, the corresponding function will be complex and take you a long time to write from scratch.
My SO answer https://stackoverflow.com/a/23843450/120163 provides C code that maps x86-32 instructions to their length, given a buffer that contains a block of binary code. Such code is necessary if one is to start at some point in the object code buffer and simply enumerate the instructions that are being used. (This code has been used in production; it is pretty solid). This routine was built basically by reading the Intel reference manual very carefully. For OP, this would have to be extended to x86-64, which shouldn't be very hard, mostly you have account for the extended-register prefix opcode byte and some differences from x86-32.
To solve OP's problem, one would also modify this routine to also return the number of byte reads by each individual instruction. This latter data also has to be extracted by careful inspection from the Intel reference manuals.
OP also has to worry about where he gets the object code from; if he doesn't run this routine in the address space of the object code itself, he will need to somehow get this object code from the .exe file. For that, he needs to build or run the equivalent of the Windows loader, and I'll bet that
has a bunch of dark corners. Check out the format of object code files.
I believe I understand how the linux x86-64 ABI uses registers and stack to pass parameters to a function (cf. previous ABI discussion). What I'm confused about is if/what registers are expected to be preserved across a function call. That is, what registers are guarenteed not to get clobbered?
Here's the complete table of registers and their use from the documentation [PDF Link]:
r12, r13, r14, r15, rbx, rsp, rbp are the callee-saved registers - they have a "Yes" in the "Preserved across function calls" column.
Experimental approach: disassemble GCC code
Mostly for fun, but also as a quick verification that you understood the ABI right.
Let's try to clobber all registers with inline assembly to force GCC to save and restore them:
main.c
#include <inttypes.h>
uint64_t inc(uint64_t i) {
__asm__ __volatile__(
""
: "+m" (i)
:
: "rax",
"rbx",
"rcx",
"rdx",
"rsi",
"rdi",
"rbp",
"rsp",
"r8",
"r9",
"r10",
"r11",
"r12",
"r13",
"r14",
"r15",
"ymm0",
"ymm1",
"ymm2",
"ymm3",
"ymm4",
"ymm5",
"ymm6",
"ymm7",
"ymm8",
"ymm9",
"ymm10",
"ymm11",
"ymm12",
"ymm13",
"ymm14",
"ymm15"
);
return i + 1;
}
int main(int argc, char **argv) {
(void)argv;
return inc(argc);
}
GitHub upstream.
Compile and disassemble:
gcc -std=gnu99 -O3 -ggdb3 -Wall -Wextra -pedantic -o main.out main.c
objdump -d main.out
Disassembly contains:
00000000000011a0 <inc>:
11a0: 55 push %rbp
11a1: 48 89 e5 mov %rsp,%rbp
11a4: 41 57 push %r15
11a6: 41 56 push %r14
11a8: 41 55 push %r13
11aa: 41 54 push %r12
11ac: 53 push %rbx
11ad: 48 83 ec 08 sub $0x8,%rsp
11b1: 48 89 7d d0 mov %rdi,-0x30(%rbp)
11b5: 48 8b 45 d0 mov -0x30(%rbp),%rax
11b9: 48 8d 65 d8 lea -0x28(%rbp),%rsp
11bd: 5b pop %rbx
11be: 41 5c pop %r12
11c0: 48 83 c0 01 add $0x1,%rax
11c4: 41 5d pop %r13
11c6: 41 5e pop %r14
11c8: 41 5f pop %r15
11ca: 5d pop %rbp
11cb: c3 retq
11cc: 0f 1f 40 00 nopl 0x0(%rax)
and so we clearly see that the following are pushed and popped:
rbx
r12
r13
r14
r15
rbp
The only missing one from the spec is rsp, but we expect the stack to be restored of course. Careful reading of the assembly confirms that it is maintained in this case:
sub $0x8, %rsp: allocates 8 bytes on stack to save %rdi at %rdi, -0x30(%rbp), which is done for the inline assembly +m constraint
lea -0x28(%rbp), %rsp restores %rsp back to before the sub, i.e. 5 pops after mov %rsp, %rbp
there are 6 pushes and 6 corresponding pops
no other instructions touch %rsp
Tested in Ubuntu 18.10, GCC 8.2.0.
The ABI specifies what a piece of standard-conforming software is allowed to expect. It is written primarily for authors of compilers, linkers and other language processing software. These authors want their compiler to produce code that will work properly with code that is compiled by the same (or a different) compiler. They all have to agree to a set of rules: how are formal arguments to functions passed from caller to callee, how are function return values passed back from callee to caller, which registers are preserved/scratch/undefined across the call boundary, and so on.
For example, one rule states that the generated assembly code for a function must save the value of a preserved register before changing the value, and that the code must restore the saved value before returning to its caller. For a scratch register, the generated code is not required to save and restore the register value; it can do so if it wants, but standard-conforming software is not allowed to depend upon this behavior (if it does it is not standard-conforming software).
If you are writing assembly code, you are responsible for playing by these same rules (you are playing the role of the compiler). That is, if your code changes a callee-preserved register, you are responsible for inserting instructions that save and restore the original register value. If your assembly code calls an external function, your code must pass arguments in the standard-conforming way, and it can depend upon the fact that, when the callee returns, preserved register values are in fact preserved.
The rules define how standards-conforming software can get along. However, it is perfectly legal to write (or generate) code that does not play by these rules! Compilers do this all the time, because they know that the rules don't need to be followed under certain circumstances.
For example, consider a C function named foo that is declared as follows, and never has its address taken:
static foo(int x);
At compile-time, the compiler is 100% certain that this function can only be called by other code in the file(s) it is currently compiling. Function foo cannot be called by anything else, ever, given the definition of what it means to be static. Because the compiler knows all of the callers of foo at compile time, the compiler is free to use whatever calling sequence it wants (up to and including not making a call at all, that is, inlining the code for foo into the callers of foo.
As an author of assembly code, you can do this too. That is, you can implement a "private agreement" between two or more routines, as long as that agreement doesn't interfere with or violate the expectations of standards-conforming software.
I am experiencing a crash, and while investigating I found myself totally blocked by the following code:
0000000000000a00 <_IO_vfprintf>:
a00: 55 push %rbp
a01: 48 89 e5 mov %rsp,%rbp
a04: 41 57 push %r15
a06: 41 56 push %r14
a08: 41 55 push %r13
a0a: 41 54 push %r12
a0c: 53 push %rbx
a0d: 48 81 ec 48 06 00 00 sub $0x648,%rsp
a14: 48 89 95 98 f9 ff ff mov %rdx,0xfffffffffffff998(%rbp)
This is generated by running objdump --disassemble /usr/lib64/libc.a on a 64-bit Linux x86 system, and then searching through the output. This is AT&T syntax, so destinations are on the right.
Specifically, I don't understand the last instruction. It seems to be writing the value of the rdx register into memory somewhere on the stack (far, far away), before the function has touched that register. To me, this doesn't make any sense.
I tried reading up on the calling conventions, and my best theory now is that rdx is used for a parameter, so the code is basically "returning" the parameter value directly. This is not the end of the function, so it's not really returning, of course.
Yes, it's a parameter. The ABI used by Linux assigns up to 6 "INTEGER" (<= 64-bit integer, or pointer) type parameters to registers, in the obvious and easy-to-remember order %rdi, %rsi, %rdx, %rcx, %r8, %r9.
The stack frame is 1648 bytes (sub $0x648,%rsp claims 1608 bytes, plus 5 64-bit registers have been pushed before that), and 0xfffffffffffff998 is -1640.
So the code is storing the 3rd parameter near the bottom of the stack frame.
(Note: the Windows 64-bit ABI is different to the Linux one.)